RFR (XS) CR 8014233: java.lang.Thread should be @Contended

Fri May 10 12:22:43 UTC 2013

On 05/10/13 02:31, Laurent Bourgès wrote:
> Peter,
>
> you're absolutely right: I was thinking about thread local values (object
> instances) and not ThreadLocal keys!
> I think the ThreadLocal name is confusing, as it does not refer to the values!
>
> I have wondered several times whether false sharing can happen between my
> thread local values (i.e. different Thread context classes) and any other
> object (including other Thread contexts).

As Peter implied, this would in general be overkill. Every use
of @Contended should be an empirically guided time/space tradeoff.
There are specific classes used as ThreadLocals that may warrant this.
For example, java.util.concurrent.Exchanger has one.
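
To make the key/value distinction concrete, here is a minimal sketch. The
Context class and all other names in it are made up for illustration and
are not taken from the JDK. It shows one shared ThreadLocal key handing
out a separate value object to each thread; it is those value objects,
not the key, that would be candidates for padding if measurements
justified it.

```java
import java.util.concurrent.atomic.AtomicReference;

public class ThreadLocalKeyDemo {
    // Hypothetical per-thread context class (an assumption for this sketch).
    static final class Context {
        long counter; // would be written frequently by the owning thread only
    }

    // A single shared key; each thread lazily receives its own Context value.
    static final ThreadLocal<Context> CONTEXT = ThreadLocal.withInitial(Context::new);

    // Returns true if two threads using the same key observed different value objects.
    static boolean valuesDifferAcrossThreads() {
        AtomicReference<Context> a = new AtomicReference<>();
        AtomicReference<Context> b = new AtomicReference<>();
        Thread t1 = new Thread(() -> a.set(CONTEXT.get()));
        Thread t2 = new Thread(() -> b.set(CONTEXT.get()));
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return a.get() != null && b.get() != null && a.get() != b.get();
    }

    public static void main(String[] args) {
        System.out.println("same key, distinct values: " + valuesDifferAcrossThreads());
    }
}
```

Whether such a Context class deserves padding still has to be decided by
measurement, not by default.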

>
> Is the GC (old gen) able to place objects in a thread-dedicated area? That
> would avoid any false sharing between the object graphs dedicated to each
> thread (i.e. thread isolation).

No, it isn't. Some collectors use heuristics that tend to
keep per-thread objects together, but there are no guarantees.

-Doug

>
> I think that the TLAB does so for allocation / short-lived objects, but for
> the old generation (long-lived objects) this is not the case: maybe G1 can
> provide a different partitioning and maybe take thread affinity into
> account?
>
> Laurent
>
> 2013/5/9 Peter Levart <peter.levart at gmail.com>
>
>>
>> On 05/09/2013 04:59 PM, Laurent Bourgès wrote:
>>
>> Hi all,
>>
>> A stupid question:
>> should any ThreadLocal subclass be marked @Contended to make sure that
>> false sharing never happens between a ThreadLocal instance and any other
>> object on the heap?
>>
>>
>> Hi Laurent,
>>
>> A ThreadLocal object is just a key (into a ThreadLocalMap). It's usually
>> not subclassed to add any state, but to override the initialValue method.
>> ThreadLocal contains a single final field 'threadLocalHashCode', which is
>> read at every call to ThreadLocal.get() (usually by multiple threads). This
>> can contend with a frequent write of a field in some other object placed
>> in its proximity, yes, but I don't think we should put @Contended on
>> every class that has frequently read fields. @Contended should be reserved
>> for classes with fields that are frequently written, if I understand the
>> concept correctly.
>>
>> Regards, Peter
>>
>>
>> Laurent
>>
>>   2013/5/9 Peter Levart <peter.levart at gmail.com>
>>
>>> Hi Aleksey,
>>>
>>> Wouldn't it be even better if just threadLocalRandom* fields were
>>> annotated with @Contended("ThreadLocal") ?
>>> Some fields within the Thread object are accessed from non-local threads.
>>> I don't know how frequently, but isolating just threadLocalRandom* fields
>>> from all possible false-sharing scenarios would seem even better, no?
>>>
>>> Regards, Peter
>>>
>>>
>>> On 05/08/2013 07:29 PM, Aleksey Shipilev wrote:
>>>
>>>> Hi,
>>>>
>>>> This is from our backlog after JDK-8005926. After the ThreadLocalRandom
>>>> state was merged into Thread, we now have to deal with the false sharing
>>>> induced by heavily-updated fields in Thread. TLR was padded before, and
>>>> it makes sense to have Thread bear the @Contended annotation to
>>>> isolate its fields in the same manner.
>>>>
>>>> The webrev is here:
>>>>      http://cr.openjdk.java.net/~shade/8014233/webrev.00/
>>>>
>>>> Testing:
>>>>    - microbenchmarks (see below)
>>>>    - JPRT cycle against jdk8-tl
>>>>
>>>> The extended rationale for the change follows.
>>>>
>>>> If we look at the current Thread layout, we can see the TLR state is
>>>> buried within the Thread instance. The TLR fields are by far the most
>>>> frequently updated fields in Thread now:
>>>>
>>>>> Running 64-bit HotSpot VM.
>>>>> Using compressed references with 3-bit shift.
>>>>> Objects are 8 bytes aligned.
>>>>>
>>>>> java.lang.Thread
>>>>>     offset  size                     type description
>>>>>          0    12                          (assumed to be the object header + first field alignment)
>>>>>         12     4                      int Thread.priority
>>>>>         16     8                     long Thread.eetop
>>>>>         24     8                     long Thread.stackSize
>>>>>         32     8                     long Thread.nativeParkEventPointer
>>>>>         40     8                     long Thread.tid
>>>>>         48     8                     long Thread.threadLocalRandomSeed
>>>>>         56     4                      int Thread.threadStatus
>>>>>         60     4                      int Thread.threadLocalRandomProbe
>>>>>         64     4                      int Thread.threadLocalRandomSecondarySeed
>>>>>         68     1                  boolean Thread.single_step
>>>>>         69     1                  boolean Thread.daemon
>>>>>         70     1                  boolean Thread.stillborn
>>>>>         71     1                          (alignment/padding gap)
>>>>>         72     4                   char[] Thread.name
>>>>>         76     4                   Thread Thread.threadQ
>>>>>         80     4                 Runnable Thread.target
>>>>>         84     4              ThreadGroup Thread.group
>>>>>         88     4              ClassLoader Thread.contextClassLoader
>>>>>         92     4     AccessControlContext Thread.inheritedAccessControlContext
>>>>>         96     4           ThreadLocalMap Thread.threadLocals
>>>>>        100     4           ThreadLocalMap Thread.inheritableThreadLocals
>>>>>        104     4                   Object Thread.parkBlocker
>>>>>        108     4            Interruptible Thread.blocker
>>>>>        112     4                   Object Thread.blockerLock
>>>>>        116     4 UncaughtExceptionHandler Thread.uncaughtExceptionHandler
>>>>>        120                                (object boundary, size estimate)
>>>>>    VM reports 120 bytes per instance
>>>>>
>>>>
>>>> Assuming current x86 hardware with 64-byte cache lines and the current
>>>> class layout, we can see that the trailing fields in Thread provide
>>>> enough insulation against false sharing with an adjacent object. Also,
>>>> the Thread itself is large enough that two TLRs belonging to
>>>> different threads will not collide.
>>>>
>>>> However, the leading fields are not enough: there are a few words which
>>>> can occupy the same cache line but belong to another object. This is
>>>> where things can get worse in two ways: a) the TLR update can make the
>>>> field access in the adjacent object considerably slower; and, much
>>>> worse, b) an update of the adjacent field can disturb the TLR state,
>>>> which is critical for j.u.concurrent performance relying heavily on
>>>> fast TLR.
>>>>
>>>> To illustrate both points, there is a simple benchmark driven by JMH
>>>> (http://openjdk.java.net/projects/code-tools/jmh/):
>>>>     http://cr.openjdk.java.net/~shade/8014233/threadbench.zip
>>>>
>>>> On my 2x2 i5-2520M Linux x86_64 laptop, running the latest jdk8-tl with
>>>> and without @Contended on Thread, that microbenchmark yields the
>>>> following results [20x1 sec warmup, 20x1 sec measurements, 10 forks]:
>>>>
>>>> Accessing ThreadLocalRandom.current().nextInt():
>>>>     baseline:    932 +-  4 ops/usec
>>>>     @Contended:  927 +- 10 ops/usec
>>>>
>>>> Accessing TLR.current.nextInt() *and* Thread.getUEHandler():
>>>>     baseline:    454 +-  2 ops/usec
>>>>     @Contended:  490 +-  3 ops/usec
>>>>
>>>> One might note that $uncaughtExceptionHandler is the trailing field in
>>>> Thread, so it can naturally be false-shared with the adjacent
>>>> thread's TLR. We chose this as the illustration; in real applications,
>>>> with a multitude of objects on the heap, some other object can be the
>>>> contender.
>>>>
>>>> So that is a ~10% performance hit from false sharing, even on a very
>>>> small machine. Translating it back: having a heavily-updated field in an
>>>> object adjacent to Thread can bring these overheads to TLR, and then
>>>> jeopardize j.u.c performance.
>>>>
>>>> Of course, as soon as the status quo of the field layout changes, we
>>>> might start to lose spectacularly. I would recommend we deal with this
>>>> now, so that fewer surprises come in the future.
>>>>
>>>> The caveat is that we are wasting some space per Thread instance.
>>>> After the patch, the layout is:
>>>>
>>>>> java.lang.Thread
>>>>>    offset  size                     type description
>>>>>         0    12                          (assumed to be the object header + first field alignment)
>>>>>        12   128                          (alignment/padding gap)
>>>>>       140     4                      int Thread.priority
>>>>>       144     8                     long Thread.eetop
>>>>>       152     8                     long Thread.stackSize
>>>>>       160     8                     long Thread.nativeParkEventPointer
>>>>>       168     8                     long Thread.tid
>>>>>       176     8                     long Thread.threadLocalRandomSeed
>>>>>       184     4                      int Thread.threadStatus
>>>>>       188     4                      int Thread.threadLocalRandomProbe
>>>>>       192     4                      int Thread.threadLocalRandomSecondarySeed
>>>>>       196     1                  boolean Thread.single_step
>>>>>       197     1                  boolean Thread.daemon
>>>>>       198     1                  boolean Thread.stillborn
>>>>>       199     1                          (alignment/padding gap)
>>>>>       200     4                   char[] Thread.name
>>>>>       204     4                   Thread Thread.threadQ
>>>>>       208     4                 Runnable Thread.target
>>>>>       212     4              ThreadGroup Thread.group
>>>>>       216     4              ClassLoader Thread.contextClassLoader
>>>>>       220     4     AccessControlContext Thread.inheritedAccessControlContext
>>>>>       224     4           ThreadLocalMap Thread.threadLocals
>>>>>       228     4           ThreadLocalMap Thread.inheritableThreadLocals
>>>>>       232     4                   Object Thread.parkBlocker
>>>>>       236     4            Interruptible Thread.blocker
>>>>>       240     4                   Object Thread.blockerLock
>>>>>       244     4 UncaughtExceptionHandler Thread.uncaughtExceptionHandler
>>>>>       248                                (object boundary, size estimate)
>>>>> VM reports 376 bytes per instance
>>>>>
>>>> ...and we have an additional 256 bytes per Thread (twice the
>>>> -XX:ContendedPaddingWidth, actually). This seems negligible compared to
>>>> the space consumed in native memory for each thread, especially the
>>>> stack areas.
>>>>
>>>> Thanks,
>>>> Aleksey.
>>>>
>>>
>>>
>>
>>
>
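
As a cross-check of the numbers in the two quoted layout dumps, here is a
small arithmetic sketch. The offsets and sizes are copied from the dumps;
everything else is an assumption made for illustration: 64-byte cache
lines, 8-byte object alignment, and a hypothetical placement of two Thread
objects back to back on the heap (which no collector guarantees). It only
reasons about addresses and does not measure anything.

```java
public class ThreadLayoutCheck {
    static final int CACHE_LINE = 64; // bytes, as assumed in the quoted mail for x86
    static final int PADDING = 128;   // padding width implied by the quoted layouts

    // Offsets and sizes copied from the "before" and "after" layout dumps.
    static final int SEED_BEFORE = 48, UEH_BEFORE = 116, SIZE_BEFORE = 120;
    static final int SEED_AFTER = 176, UEH_AFTER = 244, SIZE_AFTER = 376;

    // True if two byte addresses fall into the same cache line.
    static boolean sameCacheLine(long a, long b) {
        return a / CACHE_LINE == b / CACHE_LINE;
    }

    // Hypothetical scenario: a Thread at address 'base', followed immediately by
    // another Thread. Checks whether the first object's uncaughtExceptionHandler
    // field and the second object's threadLocalRandomSeed share a cache line.
    static boolean trailingFieldSharesLineWithNextSeed(long base, int uehOffset,
                                                       int size, int seedOffset) {
        long ueh = base + uehOffset;
        long nextSeed = base + size + seedOffset;
        return sameCacheLine(ueh, nextSeed);
    }

    // Counts how many of the eight possible 8-byte-aligned positions within a
    // cache line lead to sharing in the scenario above.
    static int countSharingAlignments(int uehOffset, int size, int seedOffset) {
        int count = 0;
        for (long base = 0; base < CACHE_LINE; base += 8) {
            if (trailingFieldSharesLineWithNextSeed(base, uehOffset, size, seedOffset)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Size and offset arithmetic from the dumps.
        System.out.println("376 == 120 + 2 * 128: " + (SIZE_BEFORE + 2 * PADDING == SIZE_AFTER));
        System.out.println("176 == 48 + 128:      " + (SEED_BEFORE + PADDING == SEED_AFTER));
        // Alignments (out of 8) in which the two fields share a line.
        System.out.println("before: " + countSharingAlignments(UEH_BEFORE, SIZE_BEFORE, SEED_BEFORE));
        System.out.println("after:  " + countSharingAlignments(UEH_AFTER, SIZE_AFTER, SEED_AFTER));
    }
}
```

Under these assumptions, the unpadded layout lets the two fields share a
line for one of the eight alignments (base 16), while in the padded layout
they are 308 bytes apart and cannot share a 64-byte line at all.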