RFR (XS) CR 8014233: java.lang.Thread should be @Contended

Laurent Bourgès bourges.laurent at gmail.com
Fri May 10 06:31:25 UTC 2013


Peter,

you're absolutely right: I was thinking about thread-local values (object
instances) and not ThreadLocal keys!
I find the name ThreadLocal confusing, as the instance is a key, not the
value itself!

Several times I have wondered whether false sharing can happen between my
thread-local values (i.e. different per-thread context instances) and any
other object (including other threads' contexts).

Is the GC (old generation) able to place objects in a thread-dedicated
area? It would then avoid any false sharing between the object graphs
dedicated to each thread, i.e. thread isolation.

I think TLABs do this for the allocation of short-lived objects, but for
the old generation (long-lived objects) it is not the case: maybe G1 could
provide different partitioning and perhaps take thread affinity into
account?
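The key/value distinction discussed above can be sketched as follows (a minimal illustration; the class name `TLValues` is mine): the ThreadLocal object is one shared key, while each thread's value is a separate heap object with no special placement guarantee relative to other objects.

```java
public class TLValues {
    // One shared ThreadLocal key; each thread lazily creates its own value.
    // Those value objects live on the ordinary shared heap, so nothing stops
    // two threads' values (or unrelated objects) from sharing a cache line.
    static final ThreadLocal<int[]> COUNTER =
            ThreadLocal.withInitial(() -> new int[1]);

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 1000; i++) {
                COUNTER.get()[0]++;           // touches this thread's value only
            }
            System.out.println(Thread.currentThread().getName()
                    + ": " + COUNTER.get()[0]);
        };
        Thread a = new Thread(task, "a");
        Thread b = new Thread(task, "b");
        a.start(); b.start();
        a.join(); b.join();
    }
}
```

Each thread sees an independent count of 1000, confirming the values are per-thread even though the key is shared.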

Laurent

2013/5/9 Peter Levart <peter.levart at gmail.com>

>
> On 05/09/2013 04:59 PM, Laurent Bourgès wrote:
>
> Hi all,
>
> A stupid question:
> should any ThreadLocal subclass be marked @Contended, to be sure that false
> sharing never happens between a ThreadLocal instance and any other object on
> the heap?
>
>
> Hi Laurent,
>
> A ThreadLocal object is just a key (into a ThreadLocalMap). It's usually not
> subclassed to add any state, but to override the initialValue method.
> ThreadLocal contains a single final field 'threadLocalHashCode', which is
> read at every call to ThreadLocal.get() (usually by multiple threads). This
> can contend with a frequent write of a field in some other object placed
> in its proximity, yes, but I don't think we should put @Contended on
> every class that has frequently read fields. @Contended should be reserved
> for classes with fields that are frequently written, if I understand the
> concept correctly.
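The usual subclassing pattern described here can be sketched like this (names are illustrative):

```java
public class InitialValueDemo {
    // Subclassing adds no state; it only overrides initialValue() so each
    // thread's first get() lazily creates that thread's own value.
    static final ThreadLocal<StringBuilder> BUF = new ThreadLocal<StringBuilder>() {
        @Override
        protected StringBuilder initialValue() {
            return new StringBuilder("init");
        }
    };

    public static void main(String[] args) {
        System.out.println(BUF.get());  // first get() in this thread -> "init"
    }
}
```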
>
> Regards, Peter
>
>
> Laurent
>
>  2013/5/9 Peter Levart <peter.levart at gmail.com>
>
>> Hi Aleksey,
>>
>> Wouldn't it be even better if just threadLocalRandom* fields were
>> annotated with @Contended("ThreadLocal") ?
>> Some fields within the Thread object are accessed from non-local threads.
>> I don't know how frequently, but isolating just threadLocalRandom* fields
>> from all possible false-sharing scenarios would seem even better, no?
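A sketch of that field-level form (hypothetical, not compilable as-is: the annotation lives in a JDK-internal package — sun.misc.Contended in JDK 8, jdk.internal.vm.annotation.Contended later — and takes effect only for JDK classes or with -XX:-RestrictContended):

```java
// Illustration of grouped @Contended only; not valid outside the JDK.
class ThreadSketch {
    @sun.misc.Contended("tlr") long threadLocalRandomSeed;
    @sun.misc.Contended("tlr") int  threadLocalRandomProbe;
    @sun.misc.Contended("tlr") int  threadLocalRandomSecondarySeed;
    // Fields in the same named group stay packed together, but the group as
    // a whole is padded away from all fields outside it.
}
```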
>>
>> Regards, Peter
>>
>>
>> On 05/08/2013 07:29 PM, Aleksey Shipilev wrote:
>>
>>> Hi,
>>>
>>> This is from our backlog after JDK-8005926. After ThreadLocalRandom
>>> state was merged into Thread, we now have to deal with the false sharing
>>> induced by heavily-updated fields in Thread. TLR was padded before, and
>>> it should make sense to make Thread bear @Contended annotation to
>>> isolate its fields in the same manner.
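Mechanically, @Contended does what hand-rolled padding (as the standalone TLR used before the merge) did by hand. A rough sketch of the manual version, assuming 64-byte cache lines (class and field names are mine; note the JVM is free to reorder fields, so this is only approximate):

```java
public class PaddedCounter {
    // Leading and trailing dummy longs (7 x 8 = 56 bytes each side) so the hot
    // field is unlikely to share a 64-byte line with a neighbouring object.
    long p01, p02, p03, p04, p05, p06, p07;
    volatile long value;   // the heavily-written field being isolated
    long q01, q02, q03, q04, q05, q06, q07;

    public static void main(String[] args) {
        PaddedCounter c = new PaddedCounter();
        for (int i = 0; i < 10; i++) {
            c.value++;     // writes land on the padded field's cache line
        }
        System.out.println(c.value);
    }
}
```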
>>>
>>> The webrev is here:
>>>     http://cr.openjdk.java.net/~shade/8014233/webrev.00/
>>>
>>> Testing:
>>>   - microbenchmarks (see below)
>>>   - JPRT cycle against jdk8-tl
>>>
>>> The extended rationale for the change follows.
>>>
>>> If we look at the current Thread layout, we can see the TLR state is
>>> buried within the Thread instance. The TLR state fields are by far the
>>> most heavily updated fields in Thread now:
>>>
>>>  Running 64-bit HotSpot VM.
>>>> Using compressed references with 3-bit shift.
>>>> Objects are 8 bytes aligned.
>>>>
>>>> java.lang.Thread
>>>>    offset  size                     type description
>>>>         0    12                          (assumed to be the object
>>>> header + first field alignment)
>>>>        12     4                      int Thread.priority
>>>>        16     8                     long Thread.eetop
>>>>        24     8                     long Thread.stackSize
>>>>        32     8                     long Thread.nativeParkEventPointer
>>>>        40     8                     long Thread.tid
>>>>        48     8                     long Thread.threadLocalRandomSeed
>>>>        56     4                      int Thread.threadStatus
>>>>        60     4                      int Thread.threadLocalRandomProbe
>>>>        64     4                      int
>>>> Thread.threadLocalRandomSecondarySeed
>>>>        68     1                  boolean Thread.single_step
>>>>        69     1                  boolean Thread.daemon
>>>>        70     1                  boolean Thread.stillborn
>>>>        71     1                          (alignment/padding gap)
>>>>        72     4                   char[] Thread.name
>>>>        76     4                   Thread Thread.threadQ
>>>>        80     4                 Runnable Thread.target
>>>>        84     4              ThreadGroup Thread.group
>>>>        88     4              ClassLoader Thread.contextClassLoader
>>>>        92     4     AccessControlContext
>>>> Thread.inheritedAccessControlContext
>>>>        96     4           ThreadLocalMap Thread.threadLocals
>>>>       100     4           ThreadLocalMap Thread.inheritableThreadLocals
>>>>       104     4                   Object Thread.parkBlocker
>>>>       108     4            Interruptible Thread.blocker
>>>>       112     4                   Object Thread.blockerLock
>>>>       116     4 UncaughtExceptionHandler Thread.uncaughtExceptionHandler
>>>>       120                                (object boundary, size
>>>> estimate)
>>>>   VM reports 120 bytes per instance
>>>>
>>>
>>> Assuming current x86 hardware with 64-byte cache line sizes and current
>>> class layout, we can see the trailing fields in Thread are providing
>>> enough insulation from the false sharing with an adjacent object. Also,
>>> the Thread itself is large enough so that two TLRs belonging to
>>> different threads will not collide.
>>>
>>> However, the leading fields do not provide enough insulation: a few words
>>> that belong to another object can occupy the same cache line. This is where
>>> things can get worse in two ways: a) the TLR update can make field
>>> accesses in the adjacent object considerably slower; and, much worse, b) an
>>> update of a field in the adjacent object can disturb the TLR state, which
>>> is critical for j.u.concurrent performance that relies heavily on a fast TLR.
>>>
>>> To illustrate both points, there is a simple benchmark driven by JMH
>>> (http://openjdk.java.net/projects/code-tools/jmh/):
>>>    http://cr.openjdk.java.net/~shade/8014233/threadbench.zip
>>>
>>> On my 2x2 i5-2520M Linux x86_64 laptop, running latest jdk8-tl and
>>> Thread with/without @Contended that microbenchmark yields the following
>>> results [20x1 sec warmup, 20x1 sec measurements, 10 forks]:
>>>
>>> Accessing ThreadLocalRandom.current().nextInt():
>>>    baseline:    932 +-  4 ops/usec
>>>    @Contended:  927 +- 10 ops/usec
>>>
>>> Accessing TLR.current.nextInt() *and* Thread.getUEHandler():
>>>    baseline:    454 +-  2 ops/usec
>>>    @Contended:  490 +-  3 ops/usec
>>>
>>> One might note that $uncaughtExceptionHandler is the trailing field in
>>> the Thread, so it can naturally be false-shared with an adjacent
>>> thread's TLR. We chose this field as the illustration; in real scenarios,
>>> with a multitude of objects on the heap, any other object can become the
>>> contender.
>>>
>>> So that is a ~10% performance hit from false sharing, even on a very
>>> small machine. Translating it back: a heavily-updated field in an object
>>> adjacent to Thread can bring these overheads to the TLR, and then
>>> jeopardize j.u.c performance.
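The j.u.c fast path being protected here is just the following access pattern (the class name `TlrDemo` is mine):

```java
import java.util.concurrent.ThreadLocalRandom;

public class TlrDemo {
    public static void main(String[] args) {
        // current() returns the calling thread's generator; after JDK-8005926
        // its seed/probe state lives directly in fields of the Thread instance.
        int r = ThreadLocalRandom.current().nextInt(100);
        System.out.println(r >= 0 && r < 100);
    }
}
```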
>>>
>>> Of course, as soon as the status quo in field layout changes, we might
>>> start to lose spectacularly. I would recommend we deal with this now, so
>>> fewer surprises come in the future.
>>>
>>> The caveat is that we waste some space per Thread instance.
>>> After the patch, the layout is:
>>>
>>>  java.lang.Thread
>>>>   offset  size                     type description
>>>>        0    12                          (assumed to be the object
>>>> header + first field alignment)
>>>>       12   128                          (alignment/padding gap)
>>>>      140     4                      int Thread.priority
>>>>      144     8                     long Thread.eetop
>>>>      152     8                     long Thread.stackSize
>>>>      160     8                     long Thread.nativeParkEventPointer
>>>>      168     8                     long Thread.tid
>>>>      176     8                     long Thread.threadLocalRandomSeed
>>>>      184     4                      int Thread.threadStatus
>>>>      188     4                      int Thread.threadLocalRandomProbe
>>>>      192     4                      int
>>>> Thread.threadLocalRandomSecondarySeed
>>>>      196     1                  boolean Thread.single_step
>>>>      197     1                  boolean Thread.daemon
>>>>      198     1                  boolean Thread.stillborn
>>>>      199     1                          (alignment/padding gap)
>>>>      200     4                   char[] Thread.name
>>>>      204     4                   Thread Thread.threadQ
>>>>      208     4                 Runnable Thread.target
>>>>      212     4              ThreadGroup Thread.group
>>>>      216     4              ClassLoader Thread.contextClassLoader
>>>>      220     4     AccessControlContext
>>>> Thread.inheritedAccessControlContext
>>>>      224     4           ThreadLocalMap Thread.threadLocals
>>>>      228     4           ThreadLocalMap Thread.inheritableThreadLocals
>>>>      232     4                   Object Thread.parkBlocker
>>>>      236     4            Interruptible Thread.blocker
>>>>      240     4                   Object Thread.blockerLock
>>>>      244     4 UncaughtExceptionHandler Thread.uncaughtExceptionHandler
>>>>      248                                (object boundary, size estimate)
>>>> VM reports 376 bytes per instance
>>>>
>>> ...and we have an additional 256 bytes per Thread (twice the
>>> -XX:ContendedPaddingWidth, actually). That seems irrelevant compared to
>>> the space consumed in native memory for each thread, especially the
>>> stack areas.
>>>
>>> Thanks,
>>> Aleksey.
>>>
>>
>>
>
>



More information about the core-libs-dev mailing list