RFR (XS) CR 8014233: java.lang.Thread should be @Contended
Laurent Bourgès
bourges.laurent at gmail.com
Thu May 9 14:59:43 UTC 2013
Hi all,
A stupid question:
should every ThreadLocal subclass be marked @Contended, to be sure that false
sharing never happens between a ThreadLocal instance and any other object on
the heap?
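
To make the question concrete, here is a rough sketch of per-field contention
groups (the Contended annotation below is a locally declared stand-in for the
JDK-internal one -- sun.misc.Contended in JDK 8 -- so the snippet compiles on
its own; the field names mirror the Thread fields discussed below):

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Local stand-in for the JDK-internal annotation (sun.misc.Contended in
// JDK 8); the real one only takes effect with -XX:-RestrictContended.
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.FIELD, ElementType.TYPE})
@interface Contended {
    String value() default "";  // contention group name
}

// Sketch of per-field isolation: only the TLR state shares one padded
// group, instead of padding the whole Thread instance.
class ThreadSketch {
    @Contended("ThreadLocal") long threadLocalRandomSeed;
    @Contended("ThreadLocal") int threadLocalRandomProbe;
    @Contended("ThreadLocal") int threadLocalRandomSecondarySeed;

    Object parkBlocker;  // other fields stay unannotated and unpadded
}
```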
Laurent
2013/5/9 Peter Levart <peter.levart at gmail.com>
> Hi Aleksey,
>
> Wouldn't it be even better if just the threadLocalRandom* fields were
> annotated with @Contended("ThreadLocal")?
> Some fields within the Thread object are accessed from non-local threads.
> I don't know how frequently, but isolating just threadLocalRandom* fields
> from all possible false-sharing scenarios would seem even better, no?
>
> Regards, Peter
>
>
> On 05/08/2013 07:29 PM, Aleksey Shipilev wrote:
>
>> Hi,
>>
>> This is from our backlog after JDK-8005926. After the ThreadLocalRandom
>> state was merged into Thread, we now have to deal with the false sharing
>> induced by the heavily-updated fields in Thread. TLR was padded before, and
>> it makes sense to have Thread bear the @Contended annotation to
>> isolate its fields in the same manner.
>>
>> The webrev is here:
>> http://cr.openjdk.java.net/~shade/8014233/webrev.00/
>>
>> Testing:
>> - microbenchmarks (see below)
>> - JPRT cycle against jdk8-tl
>>
>> The extended rationale for the change follows.
>>
>> If we look at the current Thread layout, we can see that the TLR state is
>> buried within the Thread instance. The TLR state fields are by far the
>> most heavily updated fields in Thread now:
>>
>> Running 64-bit HotSpot VM.
>>> Using compressed references with 3-bit shift.
>>> Objects are 8 bytes aligned.
>>>
>>> java.lang.Thread
>>> offset size type description
>>> 0 12 (assumed to be the object
>>> header + first field alignment)
>>> 12 4 int Thread.priority
>>> 16 8 long Thread.eetop
>>> 24 8 long Thread.stackSize
>>> 32 8 long Thread.nativeParkEventPointer
>>> 40 8 long Thread.tid
>>> 48 8 long Thread.threadLocalRandomSeed
>>> 56 4 int Thread.threadStatus
>>> 60 4 int Thread.threadLocalRandomProbe
>>> 64 4 int Thread.threadLocalRandomSecondarySeed
>>> 68 1 boolean Thread.single_step
>>> 69 1 boolean Thread.daemon
>>> 70 1 boolean Thread.stillborn
>>> 71 1 (alignment/padding gap)
>>> 72 4 char[] Thread.name
>>> 76 4 Thread Thread.threadQ
>>> 80 4 Runnable Thread.target
>>> 84 4 ThreadGroup Thread.group
>>> 88 4 ClassLoader Thread.contextClassLoader
>>> 92 4 AccessControlContext Thread.inheritedAccessControlContext
>>> 96 4 ThreadLocalMap Thread.threadLocals
>>> 100 4 ThreadLocalMap Thread.inheritableThreadLocals
>>> 104 4 Object Thread.parkBlocker
>>> 108 4 Interruptible Thread.blocker
>>> 112 4 Object Thread.blockerLock
>>> 116 4 UncaughtExceptionHandler Thread.uncaughtExceptionHandler
>>> 120 (object boundary, size estimate)
>>> VM reports 120 bytes per instance
>>>
>>
>> Assuming current x86 hardware with 64-byte cache lines and the current
>> class layout, we can see that the trailing fields in Thread provide
>> enough insulation from false sharing with an adjacent object. Also,
>> the Thread instance itself is large enough that the TLRs belonging to
>> two different threads will not collide.
>>
>> However, the leading fields are not insulated enough: a few words that
>> belong to another object can occupy the same cache line. This is where
>> things can get worse in two ways: a) a TLR update can make field accesses
>> in the adjacent object considerably slower; and, much worse, b) an update
>> in an adjacent field can disturb the TLR state, which is critical for
>> j.u.concurrent performance, which relies heavily on a fast TLR.
>>
>> To illustrate both points, there is a simple benchmark driven by JMH
>> (http://openjdk.java.net/projects/code-tools/jmh/):
>> http://cr.openjdk.java.net/~shade/8014233/threadbench.zip
>>
>> On my 2x2 i5-2520M Linux x86_64 laptop, running the latest jdk8-tl with
>> Thread with/without @Contended, that microbenchmark yields the following
>> results [20x1 sec warmups, 20x1 sec measurements, 10 forks]:
>>
>> Accessing ThreadLocalRandom.current().nextInt():
>> baseline: 932 +- 4 ops/usec
>> @Contended: 927 +- 10 ops/usec
>>
>> Accessing TLR.current.nextInt() *and* Thread.getUEHandler():
>> baseline: 454 +- 2 ops/usec
>> @Contended: 490 +- 3 ops/usec
>>
>> One might note that $uncaughtExceptionHandler is the trailing field in
>> Thread, so it can naturally be false-shared with an adjacent thread's TLR.
>> We chose it as the illustration; in real scenarios, with a multitude of
>> objects on the heap, any other object can become the contender.
>>
>> So that is a ~10% performance hit from false sharing, even on a very
>> small machine. Translating it back: having a heavily-updated field in an
>> object adjacent to a Thread can bring these overheads to the TLR, and
>> thereby jeopardize j.u.c performance.
>>
>> Of course, as soon as the status quo in field layout changes, we might
>> start to lose spectacularly. I would recommend we deal with this now, so
>> that fewer surprises come in the future.
>>
>> The caveat is that we waste some space per Thread instance.
>> After the patch, the layout is:
>>
>> java.lang.Thread
>>> offset size type description
>>> 0 12 (assumed to be the object header
>>> + first field alignment)
>>> 12 128 (alignment/padding gap)
>>> 140 4 int Thread.priority
>>> 144 8 long Thread.eetop
>>> 152 8 long Thread.stackSize
>>> 160 8 long Thread.nativeParkEventPointer
>>> 168 8 long Thread.tid
>>> 176 8 long Thread.threadLocalRandomSeed
>>> 184 4 int Thread.threadStatus
>>> 188 4 int Thread.threadLocalRandomProbe
>>> 192 4 int Thread.threadLocalRandomSecondarySeed
>>> 196 1 boolean Thread.single_step
>>> 197 1 boolean Thread.daemon
>>> 198 1 boolean Thread.stillborn
>>> 199 1 (alignment/padding gap)
>>> 200 4 char[] Thread.name
>>> 204 4 Thread Thread.threadQ
>>> 208 4 Runnable Thread.target
>>> 212 4 ThreadGroup Thread.group
>>> 216 4 ClassLoader Thread.contextClassLoader
>>> 220 4 AccessControlContext Thread.inheritedAccessControlContext
>>> 224 4 ThreadLocalMap Thread.threadLocals
>>> 228 4 ThreadLocalMap Thread.inheritableThreadLocals
>>> 232 4 Object Thread.parkBlocker
>>> 236 4 Interruptible Thread.blocker
>>> 240 4 Object Thread.blockerLock
>>> 244 4 UncaughtExceptionHandler Thread.uncaughtExceptionHandler
>>> 248 (object boundary, size estimate)
>>> VM reports 376 bytes per instance
>>>
>> ...and we have an additional 256 bytes per Thread (twice the
>> -XX:ContendedPaddingWidth, actually). That seems irrelevant compared to
>> the native memory consumed for each thread, especially the stack areas.
>>
>> Thanks,
>> Aleksey.
>>
>
>
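For reference, the pre-merge padding trick Aleksey alludes to ("TLR was
padded before") amounts to surrounding the hot state with dummy words. A
minimal sketch (hypothetical class and field names; assuming 64-byte cache
lines, 8-byte longs, and HotSpot's usual practice of laying out same-width
fields contiguously):

```java
// Classic manual padding against false sharing: seven 8-byte longs on
// each side of the hot field make it very unlikely that another object's
// field lands on the same 64-byte cache line as `seed`.
class PaddedSeed {
    long p0, p1, p2, p3, p4, p5, p6;   // leading padding
    volatile long seed;                // heavily updated state
    long q0, q1, q2, q3, q4, q5, q6;   // trailing padding

    long next() {
        // xorshift step, just so something actually updates the seed
        long s = seed;
        s ^= s << 13;
        s ^= s >>> 7;
        s ^= s << 17;
        seed = s;
        return s;
    }
}
```

The @Contended annotation lets the VM do this padding itself, which is why
the patched layout above shows the 128-byte alignment gaps instead of dummy
fields.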
More information about the core-libs-dev
mailing list