RFR (XS) CR 8014233: java.lang.Thread should have @Contended on TLR fields
Aleksey Shipilev
aleksey.shipilev at oracle.com
Tue Jun 18 06:56:30 UTC 2013
Hi David,
It depends on the scenario we are assessing. For the sake of argument,
let's say every thread had requested TLR.current() at least once.
Before the merge:
Thread maps for ThreadLocal =~ 32 bytes x #threads
TLR instances + padding =~ (128 + 8?) bytes x #threads
After the merge:
TLR fields in Thread + padding =~ (2x128 + 16) x #threads
So there is an additional footprint cost per Thread, but it seems
negligible compared to what a native thread already allocates for its
native structures (e.g. the stack). Note that @Contended pads more
generously because it anticipates hardware prefetchers also being turned
on (the VM can get better at this, though).
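
The estimate above can be checked with a few lines of arithmetic. This is
only a sketch: the class and constant names are made up for illustration,
the byte counts are the rough figures quoted above (the 8-byte TLR payload
is the one marked "8?"), and the thread count in main() is arbitrary.

```java
final class TlrFootprintEstimate {
    // Rough per-thread figures taken from the estimate in this message.
    static final long THREAD_LOCAL_MAP_BYTES = 32;   // Thread map for ThreadLocal
    static final long PADDING_BYTES = 128;           // one padding block
    static final long OLD_TLR_PAYLOAD_BYTES = 8;     // uncertain ("8?") in the estimate
    static final long MERGED_TLR_FIELDS_BYTES = 16;  // seed + probe + secondary seed

    // Before the merge: ThreadLocal map entry plus a padded TLR instance.
    static long beforeMerge(long threads) {
        return (THREAD_LOCAL_MAP_BYTES + PADDING_BYTES + OLD_TLR_PAYLOAD_BYTES) * threads;
    }

    // After the merge: TLR fields inside Thread, padded on both sides.
    static long afterMerge(long threads) {
        return (2 * PADDING_BYTES + MERGED_TLR_FIELDS_BYTES) * threads;
    }

    public static void main(String[] args) {
        long threads = 1000; // hypothetical thread count
        System.out.println("before: " + beforeMerge(threads) + " bytes");
        System.out.println("after:  " + afterMerge(threads) + " bytes");
        System.out.println("delta:  " + (afterMerge(threads) - beforeMerge(threads)) + " bytes");
    }
}
```

Per thread this works out to roughly 168 bytes before and 272 bytes after,
i.e. about 104 extra bytes per Thread under these assumptions.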
Gory details:
**** -XX:-EnableContended: ****
Running 64-bit HotSpot VM.
Using compressed references with 3-bit shift.
Objects are 8 bytes aligned.
java.lang.Thread
offset  size  type                     description
     0    12                           (assumed to be the object header + first field alignment)
    12     4  int                      Thread.priority
    16     8  long                     Thread.eetop
    24     8  long                     Thread.stackSize
    32     8  long                     Thread.nativeParkEventPointer
    40     8  long                     Thread.tid
    48     8  long                     Thread.threadLocalRandomSeed
    56     4  int                      Thread.threadStatus
    60     4  int                      Thread.threadLocalRandomProbe
    64     4  int                      Thread.threadLocalRandomSecondarySeed
    68     1  boolean                  Thread.single_step
    69     1  boolean                  Thread.daemon
    70     1  boolean                  Thread.stillborn
    71     1                           (alignment/padding gap)
    72     4  char[]                   Thread.name
    76     4  Thread                   Thread.threadQ
    80     4  Runnable                 Thread.target
    84     4  ThreadGroup              Thread.group
    88     4  ClassLoader              Thread.contextClassLoader
    92     4  AccessControlContext     Thread.inheritedAccessControlContext
    96     4  ThreadLocalMap           Thread.threadLocals
   100     4  ThreadLocalMap           Thread.inheritableThreadLocals
   104     4  Object                   Thread.parkBlocker
   108     4  Interruptible            Thread.blocker
   112     4  Object                   Thread.blockerLock
   116     4  UncaughtExceptionHandler Thread.uncaughtExceptionHandler
   120                                 (object boundary, size estimate)
VM reports 120 bytes per instance
**** -XX:+EnableContended: ****
Running 64-bit HotSpot VM.
Using compressed references with 3-bit shift.
Objects are 8 bytes aligned.
java.lang.Thread
offset  size  type                     description
     0    12                           (assumed to be the object header + first field alignment)
    12     4  int                      Thread.priority
    16     8  long                     Thread.eetop
    24     8  long                     Thread.stackSize
    32     8  long                     Thread.nativeParkEventPointer
    40     8  long                     Thread.tid
    48     4  int                      Thread.threadStatus
    52     1  boolean                  Thread.single_step
    53     1  boolean                  Thread.daemon
    54     1  boolean                  Thread.stillborn
    55     1                           (alignment/padding gap)
    56     4  char[]                   Thread.name
    60     4  Thread                   Thread.threadQ
    64     4  Runnable                 Thread.target
    68     4  ThreadGroup              Thread.group
    72     4  ClassLoader              Thread.contextClassLoader
    76     4  AccessControlContext     Thread.inheritedAccessControlContext
    80     4  ThreadLocalMap           Thread.threadLocals
    84     4  ThreadLocalMap           Thread.inheritableThreadLocals
    88     4  Object                   Thread.parkBlocker
    92     4  Interruptible            Thread.blocker
    96     4  Object                   Thread.blockerLock
   100     4  UncaughtExceptionHandler Thread.uncaughtExceptionHandler
   104   128                           (alignment/padding gap)
   232     8  long                     Thread.threadLocalRandomSeed
   240     4  int                      Thread.threadLocalRandomProbe
   244     4  int                      Thread.threadLocalRandomSecondarySeed
   248                                 (object boundary, size estimate)
VM reports 376 bytes per instance
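
The dump ends at offset 248 while the VM reports 376 bytes. The sketch
below reconciles the two numbers under the assumption (not shown in the
dump itself) that the remaining 128 bytes are trailing padding after the
contended fields. The class and constant names are illustrative only.

```java
final class ContendedLayoutCheck {
    static final int PLAIN_FIELDS_END = 104;        // where the 128-byte gap starts in the dump
    static final int PADDING = 128;                 // size of the gap reported in the dump
    static final int CONTENDED_FIELDS = 8 + 4 + 4;  // seed (long) + probe (int) + secondary seed (int)
    static final int UNPADDED_SIZE = 120;           // VM-reported size with -XX:-EnableContended

    static int contendedStart() {
        return PLAIN_FIELDS_END + PADDING;
    }

    static int contendedEnd() {
        return contendedStart() + CONTENDED_FIELDS;
    }

    // ASSUMPTION: the VM adds a second 128-byte padding block after the fields.
    static int sizeWithTrailingPadding() {
        return contendedEnd() + PADDING;
    }

    public static void main(String[] args) {
        System.out.println("contended fields start at " + contendedStart());
        System.out.println("contended fields end at   " + contendedEnd());
        System.out.println("size with trailing pad    " + sizeWithTrailingPadding());
        System.out.println("growth vs. unpadded       " + (sizeWithTrailingPadding() - UNPADDED_SIZE));
    }
}
```

Under that assumption the numbers line up: 104 + 128 = 232, 232 + 16 = 248,
248 + 128 = 376, and the growth over the unpadded 120 bytes is 256, i.e.
two padding blocks.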
-Aleksey.
On 06/18/2013 06:03 AM, David Holmes wrote:
> Hi Aleksey,
>
> What is the overall change in memory use for this set of changes, i.e.
> what did we use before the TLR merge and what do we use now?
>
> Thanks,
> David
>
> On 17/06/2013 7:00 PM, Aleksey Shipilev wrote:
>> Hi,
>>
>> This is the respin of the RFE filed a month ago:
>>
>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2013-May/016754.html
>>
>> The webrev is here:
>> http://cr.openjdk.java.net/~shade/8014233/webrev.02/
>>
>> Testing:
>> - JPRT build passes
>> - Linux x86_64/release passes jdk/java/lang jtreg
>> - vm.quick.testlist, vm.quick-gc.testlist on selected platforms
>> - microbenchmarks, see below
>>
>> The rationale follows.
>>
>> After we merged the ThreadLocalRandom state into Thread, we are now
>> missing the padding that prevents false sharing on those heavily updated
>> fields. While Thread is already large enough to separate the TLR
>> states of two distinct threads, we can still get false sharing with
>> other Thread fields.
>>
>> There is the benchmark showcasing this:
>> http://cr.openjdk.java.net/~shade/8014233/threadbench.zip
>>
>> There are two test cases: the first calls nextInt() on its own TLR and
>> then reads the current thread's ID; the second reads *another* thread's
>> ID instead, thus inducing false sharing against that other thread's TLR
>> state.
>>
>> On my 2x2 i5 laptop, running Linux x86_64:
>> same: 355 +- 1 ops/usec
>> other: 100 +- 5 ops/usec
>>
>> Note the decrease in throughput because of the false sharing.
>>
>> With the patch:
>> same: 359 +- 1 ops/usec
>> other: 356 +- 1 ops/usec
>>
>> Note the performance is back. We want to avoid these spurious decreases
>> in performance, caused either by an unlucky memory layout or by user code
>> (un)intentionally ruining the cache line locality for the updater thread.
>>
>> Thanks,
>> -Aleksey.
>>
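
For readers without the benchmark archive at hand, the effect can be
illustrated with a generic false-sharing sketch. This is NOT the linked
benchmark and does not touch Thread or TLR at all: it is an assumed,
simplified stand-in in which two threads update array slots that are either
adjacent (most likely on the same cache line) or 128 bytes apart. All names
are hypothetical, and the timings are only indicative since there is no
warmup or proper harness.

```java
import java.util.concurrent.atomic.AtomicLongArray;

final class FalseSharingSketch {
    // Two threads each increment their own slot; the slots are `stride`
    // longs apart. Returns { slot0 value, slot1 value, elapsed nanos }.
    static long[] run(int stride, long iterations) {
        AtomicLongArray slots = new AtomicLongArray(stride + 1);
        Thread t0 = new Thread(() -> {
            for (long i = 0; i < iterations; i++) slots.incrementAndGet(0);
        });
        Thread t1 = new Thread(() -> {
            for (long i = 0; i < iterations; i++) slots.incrementAndGet(stride);
        });
        long start = System.nanoTime();
        t0.start();
        t1.start();
        try {
            t0.join();
            t1.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        long elapsed = System.nanoTime() - start;
        return new long[] { slots.get(0), slots.get(stride), elapsed };
    }

    public static void main(String[] args) {
        long iterations = 20_000_000L; // arbitrary
        long[] adjacent = run(1, iterations);   // 8 bytes apart
        long[] separated = run(16, iterations); // 16 longs = 128 bytes apart
        System.out.println("adjacent slots:  " + adjacent[2] / 1_000_000 + " ms");
        System.out.println("separated slots: " + separated[2] / 1_000_000 + " ms");
    }
}
```

The 128-byte separation mirrors the padding size seen in the layout dump;
on hardware with different cache line or prefetch behavior another stride
may be needed to see a difference.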