RFR (XS) CR 8014233: java.lang.Thread should be @Contended
Doug Lea
dl at cs.oswego.edu
Fri May 10 12:22:43 UTC 2013
On 05/10/13 02:31, Laurent Bourgès wrote:
> Peter,
>
> you're absolutely right: I was thinking about thread-local values (object
> instances) and not ThreadLocal keys!
> I think the ThreadLocal name is confusing, as it does not correspond to values!
>
> Several times I have wondered whether false sharing can happen between my
> thread-local values (i.e. different Thread context classes) and any other
> object (including other Thread contexts).
As Peter implied, this would in general be overkill. Every use
of @Contended should be an empirically guided time/space tradeoff.
There are specific classes used as ThreadLocals that may warrant this.
For example, java.util.concurrent.Exchanger has one.
>
> Is the GC (old gen) able to place objects in a thread-dedicated area? That
> would avoid any false sharing between object graphs dedicated to each
> thread (i.e. thread isolation).
No it doesn't. Some collectors use some heuristics that tend to
keep per-thread objects together, but there are no guarantees.
-Doug
>
> I think that TLAB does so for allocation / short-lived objects, but for the
> old generation (long-lived objects) this is not the case: maybe G1 can
> provide different partitioning and maybe take thread affinity into
> account?
>
> Laurent
>
> 2013/5/9 Peter Levart <peter.levart at gmail.com>
>
>>
>> On 05/09/2013 04:59 PM, Laurent Bourgès wrote:
>>
>> Hi all,
>>
>> A stupid question:
>> should any ThreadLocal subclass be marked @Contended to be sure that false
>> sharing never happens between a ThreadLocal instance and any other object on
>> the heap?
>>
>>
>> Hi Laurent,
>>
>> A ThreadLocal object is just a key (into a ThreadLocalMap). It's usually not
>> subclassed to add any state but to override the initialValue method.
>> ThreadLocal contains a single final field 'threadLocalHashCode', which is
>> read at every call to ThreadLocal.get() (usually by multiple threads). This
>> can contend with a frequent write of a field in some other object placed
>> in its proximity, yes, but I don't think we should put @Contended on
>> every class that has frequently read fields. @Contended should be reserved
>> for classes with fields that are frequently written, if I understand the
>> concept correctly.
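
To make the key/value distinction above concrete, here is a minimal sketch (not part of the original thread; the names ThreadLocalKeyDemo and Context are hypothetical). It subclasses ThreadLocal only to override initialValue(), adds no state to the key, and checks that two threads reading through the same key get different value objects:

```java
class ThreadLocalKeyDemo {
    // Hypothetical per-thread context class, used only for illustration.
    static final class Context {
        final String ownerThreadName = Thread.currentThread().getName();
    }

    // Subclassed only to override initialValue(); the subclass adds no fields,
    // so the shared key object itself carries no frequently written state.
    static final ThreadLocal<Context> CONTEXT = new ThreadLocal<Context>() {
        @Override
        protected Context initialValue() {
            return new Context();
        }
    };

    // Returns true if two threads obtain distinct values through the same key.
    static boolean valuesAreDistinctPerThread() {
        final Context[] seen = new Context[2];
        Thread a = new Thread(() -> seen[0] = CONTEXT.get());
        Thread b = new Thread(() -> seen[1] = CONTEXT.get());
        a.start();
        b.start();
        try {
            // join() makes the writes to 'seen' visible to this thread.
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0] != null && seen[1] != null && seen[0] != seen[1];
    }

    public static void main(String[] args) {
        System.out.println("distinct per-thread values: " + valuesAreDistinctPerThread());
    }
}
```

The key (CONTEXT) is shared and only read; any write traffic goes to the per-thread Context instances, which is where padding would matter if it mattered at all.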
>>
>> Regards, Peter
>>
>>
>> Laurent
>>
>> 2013/5/9 Peter Levart <peter.levart at gmail.com>
>>
>>> Hi Aleksey,
>>>
>>> Wouldn't it be even better if just threadLocalRandom* fields were
>>> annotated with @Contended("ThreadLocal") ?
>>> Some fields within the Thread object are accessed from non-local threads.
>>> I don't know how frequently, but isolating just threadLocalRandom* fields
>>> from all possible false-sharing scenarios would seem even better, no?
>>>
>>> Regards, Peter
>>>
>>>
>>> On 05/08/2013 07:29 PM, Aleksey Shipilev wrote:
>>>
>>>> Hi,
>>>>
>>>> This is from our backlog after JDK-8005926. After ThreadLocalRandom
>>>> state was merged into Thread, we now have to deal with the false sharing
>>>> induced by heavily-updated fields in Thread. TLR was padded before, and
>>>> it makes sense to annotate Thread with @Contended to isolate its fields
>>>> in the same manner.
>>>>
>>>> The webrev is here:
>>>> http://cr.openjdk.java.net/~shade/8014233/webrev.00/
>>>>
>>>> Testing:
>>>> - microbenchmarks (see below)
>>>> - JPRT cycle against jdk8-tl
>>>>
>>>> The extended rationale for the change follows.
>>>>
>>>> If we look at the current Thread layout, we can see the TLR state is
>>>> buried within the Thread instance. The TLR state fields are by far the
>>>> most frequently updated fields in Thread now:
>>>>
>>>> Running 64-bit HotSpot VM.
>>>>> Using compressed references with 3-bit shift.
>>>>> Objects are 8 bytes aligned.
>>>>>
>>>>> java.lang.Thread
>>>>> offset  size  type                      description
>>>>>      0    12                            (assumed to be the object header + first field alignment)
>>>>>     12     4  int                       Thread.priority
>>>>>     16     8  long                      Thread.eetop
>>>>>     24     8  long                      Thread.stackSize
>>>>>     32     8  long                      Thread.nativeParkEventPointer
>>>>>     40     8  long                      Thread.tid
>>>>>     48     8  long                      Thread.threadLocalRandomSeed
>>>>>     56     4  int                       Thread.threadStatus
>>>>>     60     4  int                       Thread.threadLocalRandomProbe
>>>>>     64     4  int                       Thread.threadLocalRandomSecondarySeed
>>>>>     68     1  boolean                   Thread.single_step
>>>>>     69     1  boolean                   Thread.daemon
>>>>>     70     1  boolean                   Thread.stillborn
>>>>>     71     1                            (alignment/padding gap)
>>>>>     72     4  char[]                    Thread.name
>>>>>     76     4  Thread                    Thread.threadQ
>>>>>     80     4  Runnable                  Thread.target
>>>>>     84     4  ThreadGroup               Thread.group
>>>>>     88     4  ClassLoader               Thread.contextClassLoader
>>>>>     92     4  AccessControlContext      Thread.inheritedAccessControlContext
>>>>>     96     4  ThreadLocalMap            Thread.threadLocals
>>>>>    100     4  ThreadLocalMap            Thread.inheritableThreadLocals
>>>>>    104     4  Object                    Thread.parkBlocker
>>>>>    108     4  Interruptible             Thread.blocker
>>>>>    112     4  Object                    Thread.blockerLock
>>>>>    116     4  UncaughtExceptionHandler  Thread.uncaughtExceptionHandler
>>>>>    120                                  (object boundary, size estimate)
>>>>> VM reports 120 bytes per instance
>>>>>
>>>>
>>>> Assuming current x86 hardware with 64-byte cache lines and the current
>>>> class layout, we can see the trailing fields in Thread provide
>>>> enough insulation from false sharing with an adjacent object. Also,
>>>> the Thread itself is large enough that two TLRs belonging to
>>>> different threads will not collide.
>>>>
>>>> However, the leading fields are not enough: we have a few words which can
>>>> occupy the same cache line but belong to another object. This is where
>>>> things can get worse in two ways: a) the TLR update can make field
>>>> access in the adjacent object considerably slower; and, much worse, b) an
>>>> update to the adjacent field can disturb the TLR state, which is
>>>> critical for j.u.concurrent performance, which relies heavily on fast TLR.
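
The leading-edge argument above is cache-line arithmetic, which can be sketched as follows (an illustration only, not from the original mail). It assumes the 64-byte line size and 8-byte object alignment quoted in the mail and uses offset 48 for Thread.threadLocalRandomSeed from the layout dump; the class and method names are hypothetical. For each possible position of the object start within a cache line, it checks whether the line holding the first TLR field also holds bytes that lie before the object, i.e. bytes of a preceding object:

```java
class LeadingLineCheck {
    static final int LINE_SIZE = 64;        // cache line size assumed in the mail
    static final int OBJECT_ALIGNMENT = 8;  // "Objects are 8 bytes aligned"
    static final int TLR_SEED_OFFSET = 48;  // Thread.threadLocalRandomSeed, from the layout dump

    // baseInLine: how many bytes into a cache line the object starts.
    // Returns true if the cache line containing the field begins before the
    // object does, so that the line is shared with whatever precedes the object.
    static boolean sharesLineWithPrecedingObject(int baseInLine, int fieldOffset, int lineSize) {
        int fieldAddress = baseInLine + fieldOffset;
        int lineStart = (fieldAddress / lineSize) * lineSize;
        return lineStart < baseInLine;
    }

    public static void main(String[] args) {
        for (int base = 0; base < LINE_SIZE; base += OBJECT_ALIGNMENT) {
            System.out.printf("object starts %2d bytes into a line -> shared leading line: %b%n",
                    base, sharesLineWithPrecedingObject(base, TLR_SEED_OFFSET, LINE_SIZE));
        }
    }
}
```

This only models the address arithmetic; whether a given neighbour is actually written often enough to hurt is the empirical question the benchmark addresses.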
>>>>
>>>> To illustrate both points, there is a simple benchmark driven by JMH
>>>> (http://openjdk.java.net/projects/code-tools/jmh/):
>>>> http://cr.openjdk.java.net/~shade/8014233/threadbench.zip
>>>>
>>>> On my 2x2 i5-2520M Linux x86_64 laptop, running the latest jdk8-tl with
>>>> and without @Contended on Thread, that microbenchmark yields the following
>>>> results [20x1 sec warmup, 20x1 sec measurements, 10 forks]:
>>>>
>>>> Accessing ThreadLocalRandom.current().nextInt():
>>>> baseline: 932 +- 4 ops/usec
>>>> @Contended: 927 +- 10 ops/usec
>>>>
>>>> Accessing TLR.current().nextInt() *and* Thread.getUncaughtExceptionHandler():
>>>> baseline: 454 +- 2 ops/usec
>>>> @Contended: 490 +- 3 ops/usec
>>>>
>>>> One might note that $uncaughtExceptionHandler is the trailing field in
>>>> Thread, so it can naturally be false-shared with the adjacent
>>>> thread's TLR. We chose this as the illustration; in real applications,
>>>> with a multitude of objects on the heap, we can get another contender.
>>>>
>>>> So that is a ~10% performance hit from false sharing even on a very small
>>>> machine. Translating it back: having a heavily-updated field in an object
>>>> adjacent to Thread can bring these overheads to TLR, and then jeopardize
>>>> j.u.c performance.
>>>>
>>>> Of course, as soon as the status quo of the field layout changes, we might
>>>> start to lose spectacularly. I would recommend we deal with this now, so
>>>> fewer surprises come in the future.
>>>>
>>>> The caveat is that we are wasting some space per Thread instance.
>>>> After the patch, the layout is:
>>>>
>>>> java.lang.Thread
>>>>> offset  size  type                      description
>>>>>      0    12                            (assumed to be the object header + first field alignment)
>>>>>     12   128                            (alignment/padding gap)
>>>>>    140     4  int                       Thread.priority
>>>>>    144     8  long                      Thread.eetop
>>>>>    152     8  long                      Thread.stackSize
>>>>>    160     8  long                      Thread.nativeParkEventPointer
>>>>>    168     8  long                      Thread.tid
>>>>>    176     8  long                      Thread.threadLocalRandomSeed
>>>>>    184     4  int                       Thread.threadStatus
>>>>>    188     4  int                       Thread.threadLocalRandomProbe
>>>>>    192     4  int                       Thread.threadLocalRandomSecondarySeed
>>>>>    196     1  boolean                   Thread.single_step
>>>>>    197     1  boolean                   Thread.daemon
>>>>>    198     1  boolean                   Thread.stillborn
>>>>>    199     1                            (alignment/padding gap)
>>>>>    200     4  char[]                    Thread.name
>>>>>    204     4  Thread                    Thread.threadQ
>>>>>    208     4  Runnable                  Thread.target
>>>>>    212     4  ThreadGroup               Thread.group
>>>>>    216     4  ClassLoader               Thread.contextClassLoader
>>>>>    220     4  AccessControlContext      Thread.inheritedAccessControlContext
>>>>>    224     4  ThreadLocalMap            Thread.threadLocals
>>>>>    228     4  ThreadLocalMap            Thread.inheritableThreadLocals
>>>>>    232     4  Object                    Thread.parkBlocker
>>>>>    236     4  Interruptible             Thread.blocker
>>>>>    240     4  Object                    Thread.blockerLock
>>>>>    244     4  UncaughtExceptionHandler  Thread.uncaughtExceptionHandler
>>>>>    248                                  (object boundary, size estimate)
>>>>> VM reports 376 bytes per instance
>>>>>
>>>> ...and we have an additional 256 bytes per Thread (twice the
>>>> -XX:ContendedPaddingWidth, actually). This seems irrelevant compared to the
>>>> space wasted in native memory for each thread, especially stack areas.
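
The footprint figures above can be checked with a few lines of arithmetic. This is a sketch under the numbers quoted in the mail (120 bytes unpadded, 128 bytes of padding on each side); the class and method names are hypothetical, and the 1000-thread figure is just an example input. The trailing 128 bytes presumably account for the gap between the 248-byte object boundary in the dump and the 376 bytes the VM reports:

```java
class ContendedFootprint {
    // Instance size when paddingWidth bytes are added both before and after the fields.
    static long paddedInstanceSize(long unpaddedSize, long paddingWidth) {
        return unpaddedSize + 2 * paddingWidth;
    }

    // Extra heap bytes spent on padding for a given number of Thread instances.
    static long extraHeapBytes(long threadCount, long paddingWidth) {
        return threadCount * 2 * paddingWidth;
    }

    public static void main(String[] args) {
        long unpadded = 120;  // "VM reports 120 bytes per instance" before the patch
        long width = 128;     // padding width implied by "twice the -XX:ContendedPaddingWidth" = 256
        System.out.println("padded instance size: " + paddedInstanceSize(unpadded, width) + " bytes");
        System.out.println("extra heap for 1000 threads: " + extraHeapBytes(1000, width) + " bytes");
    }
}
```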
>>>>
>>>> Thanks,
>>>> Aleksey.
>>>>
>>>
>>>
>>
>>
>