RFR (XS) CR 8014233: java.lang.Thread should be @Contended
Laurent Bourgès
bourges.laurent at gmail.com
Thu May 9 14:59:43 UTC 2013
Hi all,
A stupid question:
should every ThreadLocal subclass be marked @Contended, to be sure that false
sharing never happens between a ThreadLocal instance and any other object on
the heap?
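
To make the question concrete, here is a rough sketch of per-field contention
groups (the Contended annotation below is a locally declared stand-in for the
JDK-internal one -- sun.misc.Contended in JDK 8 -- so the snippet compiles on
its own; the field names mirror the Thread fields discussed below):

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Local stand-in for the JDK-internal annotation (sun.misc.Contended in
// JDK 8); the real one only takes effect with -XX:-RestrictContended.
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.FIELD, ElementType.TYPE})
@interface Contended {
    String value() default "";  // contention group name
}

// Sketch of per-field isolation: only the TLR state shares one padded
// group, instead of padding the whole Thread instance.
class ThreadSketch {
    @Contended("ThreadLocal") long threadLocalRandomSeed;
    @Contended("ThreadLocal") int threadLocalRandomProbe;
    @Contended("ThreadLocal") int threadLocalRandomSecondarySeed;

    Object parkBlocker;  // other fields stay unannotated and unpadded
}
```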
Laurent
2013/5/9 Peter Levart <peter.levart at gmail.com>
> Hi Aleksey,
>
> Wouldn't it be even better if just the threadLocalRandom* fields were
> annotated with @Contended("ThreadLocal")?
> Some fields within the Thread object are accessed from non-local threads.
> I don't know how frequently, but isolating just threadLocalRandom* fields
> from all possible false-sharing scenarios would seem even better, no?
>
> Regards, Peter
>
>
> On 05/08/2013 07:29 PM, Aleksey Shipilev wrote:
>
>> Hi,
>>
>> This is from our backlog after JDK-8005926. After the ThreadLocalRandom
>> state was merged into Thread, we now have to deal with the false sharing
>> induced by the heavily-updated fields in Thread. TLR was padded before, and
>> it makes sense to have Thread bear the @Contended annotation to
>> isolate its fields in the same manner.
>>
>> The webrev is here:
>> http://cr.openjdk.java.net/~shade/8014233/webrev.00/
>>
>> Testing:
>> - microbenchmarks (see below)
>> - JPRT cycle against jdk8-tl
>>
>> The extended rationale for the change follows.
>>
>> If we look at the current Thread layout, we can see that the TLR state is
>> buried within the Thread instance. The TLR state fields are by far the
>> most heavily updated fields in Thread now:
>>
>> Running 64-bit HotSpot VM.
>>> Using compressed references with 3-bit shift.
>>> Objects are 8 bytes aligned.
>>>
>>> java.lang.Thread
>>> offset size type description
>>> 0 12 (assumed to be the object
>>> header + first field alignment)
>>> 12 4 int Thread.priority
>>> 16 8 long Thread.eetop
>>> 24 8 long Thread.stackSize
>>> 32 8 long Thread.nativeParkEventPointer
>>> 40 8 long Thread.tid
>>> 48 8 long Thread.threadLocalRandomSeed
>>> 56 4 int Thread.threadStatus
>>> 60 4 int Thread.threadLocalRandomProbe
>>> 64 4 int Thread.threadLocalRandomSecondarySeed
>>> 68 1 boolean Thread.single_step
>>> 69 1 boolean Thread.daemon
>>> 70 1 boolean Thread.stillborn
>>> 71 1 (alignment/padding gap)
>>> 72 4 char[] Thread.name
>>> 76 4 Thread Thread.threadQ
>>> 80 4 Runnable Thread.target
>>> 84 4 ThreadGroup Thread.group
>>> 88 4 ClassLoader Thread.contextClassLoader
>>> 92 4 AccessControlContext Thread.inheritedAccessControlContext
>>> 96 4 ThreadLocalMap Thread.threadLocals
>>> 100 4 ThreadLocalMap Thread.inheritableThreadLocals
>>> 104 4 Object Thread.parkBlocker
>>> 108 4 Interruptible Thread.blocker
>>> 112 4 Object Thread.blockerLock
>>> 116 4 UncaughtExceptionHandler Thread.uncaughtExceptionHandler
>>> 120 (object boundary, size estimate)
>>> VM reports 120 bytes per instance
>>>
>>
>> Assuming current x86 hardware with 64-byte cache lines and the current
>> class layout, we can see that the trailing fields in Thread provide
>> enough insulation from false sharing with an adjacent object. Also,
>> the Thread instance itself is large enough that the TLRs belonging to
>> two different threads will not collide.
>>
>> However, the leading fields are not insulated enough: a few words that
>> belong to another object can occupy the same cache line. This is where
>> things can get worse in two ways: a) a TLR update can make field accesses
>> in the adjacent object considerably slower; and, much worse, b) an update
>> in an adjacent field can disturb the TLR state, which is critical for
>> j.u.concurrent performance, which relies heavily on a fast TLR.
>>
>> To illustrate both points, there is a simple benchmark driven by JMH
>> (http://openjdk.java.net/projects/code-tools/jmh/):
>> http://cr.openjdk.java.net/~shade/8014233/threadbench.zip
>>
>> On my 2x2 i5-2520M Linux x86_64 laptop, running the latest jdk8-tl with
>> Thread with/without @Contended, that microbenchmark yields the following
>> results [20x1 sec warmups, 20x1 sec measurements, 10 forks]:
>>
>> Accessing ThreadLocalRandom.current().nextInt():
>> baseline: 932 +- 4 ops/usec
>> @Contended: 927 +- 10 ops/usec
>>
>> Accessing TLR.current.nextInt() *and* Thread.getUEHandler():
>> baseline: 454 +- 2 ops/usec
>> @Contended: 490 +- 3 ops/usec
>>
>> One might note that $uncaughtExceptionHandler is the trailing field in
>> Thread, so it can naturally be false-shared with an adjacent thread's TLR.
>> We chose it as the illustration; in real scenarios, with a multitude of
>> objects on the heap, any other object can become the contender.
>>
>> So that is a ~10% performance hit from false sharing, even on a very
>> small machine. Translating it back: having a heavily-updated field in an
>> object adjacent to a Thread can bring these overheads to the TLR, and
>> thereby jeopardize j.u.c performance.
>>
>> Of course, as soon as the status quo in field layout changes, we might
>> start to lose spectacularly. I would recommend we deal with this now, so
>> that fewer surprises come in the future.
>>
>> The caveat is that we waste some space per Thread instance.
>> After the patch, the layout is:
>>
>> java.lang.Thread
>>> offset size type description
>>> 0 12 (assumed to be the object header
>>> + first field alignment)
>>> 12 128 (alignment/padding gap)
>>> 140 4 int Thread.priority
>>> 144 8 long Thread.eetop
>>> 152 8 long Thread.stackSize
>>> 160 8 long Thread.nativeParkEventPointer
>>> 168 8 long Thread.tid
>>> 176 8 long Thread.threadLocalRandomSeed
>>> 184 4 int Thread.threadStatus
>>> 188 4 int Thread.threadLocalRandomProbe
>>> 192 4 int Thread.threadLocalRandomSecondarySeed
>>> 196 1 boolean Thread.single_step
>>> 197 1 boolean Thread.daemon
>>> 198 1 boolean Thread.stillborn
>>> 199 1 (alignment/padding gap)
>>> 200 4 char[] Thread.name
>>> 204 4 Thread Thread.threadQ
>>> 208 4 Runnable Thread.target
>>> 212 4 ThreadGroup Thread.group
>>> 216 4 ClassLoader Thread.contextClassLoader
>>> 220 4 AccessControlContext Thread.inheritedAccessControlContext
>>> 224 4 ThreadLocalMap Thread.threadLocals
>>> 228 4 ThreadLocalMap Thread.inheritableThreadLocals
>>> 232 4 Object Thread.parkBlocker
>>> 236 4 Interruptible Thread.blocker
>>> 240 4 Object Thread.blockerLock
>>> 244 4 UncaughtExceptionHandler Thread.uncaughtExceptionHandler
>>> 248 (object boundary, size estimate)
>>> VM reports 376 bytes per instance
>>>
>> ...and we have an additional 256 bytes per Thread (twice the
>> -XX:ContendedPaddingWidth, actually). That seems irrelevant compared to
>> the native memory consumed for each thread, especially the stack areas.
>>
>> Thanks,
>> Aleksey.
>>
>
>
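For reference, the pre-merge padding trick Aleksey alludes to ("TLR was
padded before") amounts to surrounding the hot state with dummy words. A
minimal sketch (hypothetical class and field names; assuming 64-byte cache
lines, 8-byte longs, and HotSpot's usual practice of laying out same-width
fields contiguously):

```java
// Classic manual padding against false sharing: seven 8-byte longs on
// each side of the hot field make it very unlikely that another object's
// field lands on the same 64-byte cache line as `seed`.
class PaddedSeed {
    long p0, p1, p2, p3, p4, p5, p6;   // leading padding
    volatile long seed;                // heavily updated state
    long q0, q1, q2, q3, q4, q5, q6;   // trailing padding

    long next() {
        // xorshift step, just so something actually updates the seed
        long s = seed;
        s ^= s << 13;
        s ^= s >>> 7;
        s ^= s << 17;
        seed = s;
        return s;
    }
}
```

The @Contended annotation lets the VM do this padding itself, which is why
the patched layout above shows the 128-byte alignment gaps instead of dummy
fields.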
More information about the core-libs-dev
mailing list