RFR (XS) CR 8014233: java.lang.Thread should be @Contended

Peter Levart peter.levart at gmail.com
Thu May 9 14:19:55 UTC 2013


Hi Aleksey,

Wouldn't it be even better if just the threadLocalRandom* fields were 
annotated with @Contended("ThreadLocal")?
Some fields within the Thread object are accessed from non-local 
threads. I don't know how frequently, but isolating just 
threadLocalRandom* fields from all possible false-sharing scenarios 
would seem even better, no?
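[A quick sketch of this suggestion. Since the real annotation is JDK-internal (sun.misc.Contended in JDK 8), a local stand-in annotation is declared here so the example compiles anywhere; the field names mirror the Thread layout discussed below, but ThreadSketch itself is hypothetical.]

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Stand-in for the JDK-internal @Contended annotation, declared locally
// so this sketch is self-contained.
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.FIELD, ElementType.TYPE})
@interface Contended {
    String value() default "";  // the contention group name
}

// Field-level isolation: only the TLR state is padded, grouped together
// under one contention group, instead of padding the whole class.
class ThreadSketch {
    @Contended("ThreadLocal") long threadLocalRandomSeed;
    @Contended("ThreadLocal") int threadLocalRandomProbe;
    @Contended("ThreadLocal") int threadLocalRandomSecondarySeed;

    // Other fields stay unannotated and are laid out normally.
    Thread.UncaughtExceptionHandler uncaughtExceptionHandler;
}

public class ContendedSketch {
    public static void main(String[] args) throws Exception {
        Contended c = ThreadSketch.class
                .getDeclaredField("threadLocalRandomSeed")
                .getAnnotation(Contended.class);
        System.out.println(c.value()); // prints "ThreadLocal"
    }
}
```

Fields sharing the same group value are padded as a unit, so the three TLR fields would stay adjacent to each other while being isolated from everything else.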

Regards, Peter

On 05/08/2013 07:29 PM, Aleksey Shipilev wrote:
> Hi,
>
> This is from our backlog after JDK-8005926. After the ThreadLocalRandom
> state was merged into Thread, we now have to deal with the false sharing
> induced by heavily-updated fields in Thread. TLR was padded before, and
> it makes sense to have Thread bear the @Contended annotation to
> isolate its fields in the same manner.
>
> The webrev is here:
>     http://cr.openjdk.java.net/~shade/8014233/webrev.00/
>
> Testing:
>   - microbenchmarks (see below)
>   - JPRT cycle against jdk8-tl
>
> The extended rationale for the change follows.
>
> If we look at the current Thread layout, we can see the TLR state is
> buried within the Thread instance. The TLR state fields are by far the
> most heavily updated fields in Thread now:
>
>> Running 64-bit HotSpot VM.
>> Using compressed references with 3-bit shift.
>> Objects are 8 bytes aligned.
>>
>> java.lang.Thread
>>    offset  size                     type description
>>         0    12                          (assumed to be the object header + first field alignment)
>>        12     4                      int Thread.priority
>>        16     8                     long Thread.eetop
>>        24     8                     long Thread.stackSize
>>        32     8                     long Thread.nativeParkEventPointer
>>        40     8                     long Thread.tid
>>        48     8                     long Thread.threadLocalRandomSeed
>>        56     4                      int Thread.threadStatus
>>        60     4                      int Thread.threadLocalRandomProbe
>>        64     4                      int Thread.threadLocalRandomSecondarySeed
>>        68     1                  boolean Thread.single_step
>>        69     1                  boolean Thread.daemon
>>        70     1                  boolean Thread.stillborn
>>        71     1                          (alignment/padding gap)
>>        72     4                   char[] Thread.name
>>        76     4                   Thread Thread.threadQ
>>        80     4                 Runnable Thread.target
>>        84     4              ThreadGroup Thread.group
>>        88     4              ClassLoader Thread.contextClassLoader
>>        92     4     AccessControlContext Thread.inheritedAccessControlContext
>>        96     4           ThreadLocalMap Thread.threadLocals
>>       100     4           ThreadLocalMap Thread.inheritableThreadLocals
>>       104     4                   Object Thread.parkBlocker
>>       108     4            Interruptible Thread.blocker
>>       112     4                   Object Thread.blockerLock
>>       116     4 UncaughtExceptionHandler Thread.uncaughtExceptionHandler
>>       120                                (object boundary, size estimate)
>>   VM reports 120 bytes per instance
>
> Assuming current x86 hardware with 64-byte cache lines and the current
> class layout, we can see the trailing fields in Thread provide enough
> insulation from false sharing with an adjacent object. Also, the Thread
> instance itself is large enough that the TLR states belonging to two
> different threads will not collide.
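[To make the cache-line reasoning concrete, one can map the field offsets from the dump above onto 64-byte lines. A back-of-the-envelope sketch; it assumes the Thread instance starts at a cache-line boundary, which the JVM does not guarantee, so treat the line indices as relative only.]

```java
public class CacheLineMap {
    static final int CACHE_LINE = 64; // x86 cache line size, per the text

    // Which cache line (relative to the object base) a field offset
    // falls on, assuming the base itself is line-aligned.
    static int line(int offset) {
        return offset / CACHE_LINE;
    }

    public static void main(String[] args) {
        // threadLocalRandomSeed @ 48 shares line 0 with priority @ 12,
        // and with the 12-byte header -- i.e. with the first 64 bytes:
        System.out.println(line(48) == line(12));   // true
        // threadLocalRandomProbe @ 60 also starts on line 0:
        System.out.println(line(60));               // 0
        // The header is only 12 bytes, so the tail of whatever object
        // precedes this Thread in memory can land on line 0 as well --
        // the leading-fields hazard described above.
    }
}
```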
>
> However, the leading fields are not insulated enough: a few words that
> belong to another object can occupy the same cache line. This is where
> things can get worse in two ways: a) the TLR update can make field
> accesses in the adjacent object considerably slower; and, much worse,
> b) an update to an adjacent field can disturb the TLR state, which is
> critical for j.u.concurrent performance relying heavily on a fast TLR.
>
> To illustrate both points, there is a simple benchmark driven by JMH
> (http://openjdk.java.net/projects/code-tools/jmh/):
>    http://cr.openjdk.java.net/~shade/8014233/threadbench.zip
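[For readers without the JMH harness at hand, a much-simplified plain-Java stand-in for the two measured workloads. It is single-threaded, so it cannot reproduce the cross-thread false sharing itself, and its timings carry none of JMH's warmup/forking rigor; it only shows the shape of the two operations being compared.]

```java
import java.util.concurrent.ThreadLocalRandom;

public class TlrStandIn {
    // Side-effect sink so the JIT cannot eliminate the loops.
    static volatile long sink;

    // Workload 1: ThreadLocalRandom.current().nextInt() only.
    static int tlrOnly(int iters) {
        long acc = 0;
        for (int i = 0; i < iters; i++) {
            acc += ThreadLocalRandom.current().nextInt();
        }
        sink = acc;
        return iters;
    }

    // Workload 2: the TLR update plus a read of a Thread field that may
    // share a cache line with another thread's TLR state.
    static int tlrAndUeh(int iters) {
        long acc = 0;
        Thread t = Thread.currentThread();
        for (int i = 0; i < iters; i++) {
            acc += ThreadLocalRandom.current().nextInt();
            acc += System.identityHashCode(t.getUncaughtExceptionHandler());
        }
        sink = acc;
        return iters;
    }

    public static void main(String[] args) {
        int iters = 10_000_000;
        long t0 = System.nanoTime();
        tlrOnly(iters);
        long t1 = System.nanoTime();
        tlrAndUeh(iters);
        long t2 = System.nanoTime();
        System.out.printf("tlrOnly : %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("tlr+UEH : %d ms%n", (t2 - t1) / 1_000_000);
    }
}
```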
>
> On my 2x2 i5-2520M Linux x86_64 laptop, running the latest jdk8-tl with
> Thread with/without @Contended, the microbenchmark yields the following
> results [20x1 sec warmup, 20x1 sec measurements, 10 forks]:
>
> Accessing ThreadLocalRandom.current().nextInt():
>    baseline:    932 +-  4 ops/usec
>    @Contended:  927 +- 10 ops/usec
>
> Accessing TLR.current().nextInt() *and* Thread.getUEHandler():
>    baseline:    454 +-  2 ops/usec
>    @Contended:  490 +-  3 ops/usec
>
> One might note that $uncaughtExceptionHandler is the trailing field in
> Thread, so it can naturally be false-shared with an adjacent thread's
> TLR. We chose it for the illustration; in real scenarios, with a
> multitude of objects on the heap, any other object could be the
> contender.
>
> So that is a ~10% performance hit from false sharing even on a very
> small machine. Translating it back: a heavily-updated field in the
> object adjacent to a Thread can inflict these overheads on TLR, and
> thereby jeopardize j.u.c performance.
>
> Of course, as soon as the status quo in field layout changes, we might
> start to lose spectacularly. I would recommend we deal with this now,
> so fewer surprises come in the future.
>
> The caveat is that we waste some space per Thread instance. After the
> patch, the layout is:
>
>> java.lang.Thread
>>   offset  size                     type description
>>        0    12                          (assumed to be the object header + first field alignment)
>>       12   128                          (alignment/padding gap)
>>      140     4                      int Thread.priority
>>      144     8                     long Thread.eetop
>>      152     8                     long Thread.stackSize
>>      160     8                     long Thread.nativeParkEventPointer
>>      168     8                     long Thread.tid
>>      176     8                     long Thread.threadLocalRandomSeed
>>      184     4                      int Thread.threadStatus
>>      188     4                      int Thread.threadLocalRandomProbe
>>      192     4                      int Thread.threadLocalRandomSecondarySeed
>>      196     1                  boolean Thread.single_step
>>      197     1                  boolean Thread.daemon
>>      198     1                  boolean Thread.stillborn
>>      199     1                          (alignment/padding gap)
>>      200     4                   char[] Thread.name
>>      204     4                   Thread Thread.threadQ
>>      208     4                 Runnable Thread.target
>>      212     4              ThreadGroup Thread.group
>>      216     4              ClassLoader Thread.contextClassLoader
>>      220     4     AccessControlContext Thread.inheritedAccessControlContext
>>      224     4           ThreadLocalMap Thread.threadLocals
>>      228     4           ThreadLocalMap Thread.inheritableThreadLocals
>>      232     4                   Object Thread.parkBlocker
>>      236     4            Interruptible Thread.blocker
>>      240     4                   Object Thread.blockerLock
>>      244     4 UncaughtExceptionHandler Thread.uncaughtExceptionHandler
>>      248                                (object boundary, size estimate)
>> VM reports 376 bytes per instance
> ...and we have an additional 256 bytes per Thread (twice the
> -XX:ContendedPaddingWidth, actually). That seems irrelevant compared to
> the native memory consumed by each thread, especially the stack areas.
>
> Thanks,
> Aleksey.




More information about the core-libs-dev mailing list