RFR (XS) CR 8014233: java.lang.Thread should be @Contended
Peter Levart
peter.levart at gmail.com
Thu May 9 17:56:24 UTC 2013
On 05/09/2013 04:59 PM, Laurent Bourgès wrote:
> Hi all,
>
> A stupid question:
> should any ThreadLocal subclass be marked @Contended to make sure that
> false sharing never happens between a ThreadLocal instance and any
> other object on the heap?
>
Hi Laurent,
A ThreadLocal object is just a key (into a ThreadLocalMap). It's usually
not subclassed to add any state, but to override the initialValue method.
ThreadLocal contains a single final field, 'threadLocalHashCode', which
is read on every call to ThreadLocal.get() (usually by multiple
threads). This can contend with frequent writes to a field of some
other object placed in its proximity, yes, but I don't think we
should put @Contended on every class that has frequently read fields.
@Contended should be reserved for classes with fields that are
frequently written, if I understand the concept correctly.
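
(For illustration, a minimal sketch of the pattern described above: a
ThreadLocal subclassed only to override initialValue(), holding no state
of its own. The class and field names here are invented for the example.)

    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class FormatterHolder {
        // The ThreadLocal instance is only a key; the per-thread value
        // lives in each calling thread's ThreadLocalMap, not here.
        private static final ThreadLocal<SimpleDateFormat> FORMATTER =
            new ThreadLocal<SimpleDateFormat>() {
                @Override
                protected SimpleDateFormat initialValue() {
                    return new SimpleDateFormat("yyyy-MM-dd");
                }
            };

        public static String format(Date date) {
            // get() reads the final threadLocalHashCode of this key to
            // probe the current thread's map.
            return FORMATTER.get().format(date);
        }
    }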
Regards, Peter
> Laurent
>
> 2013/5/9 Peter Levart <peter.levart at gmail.com>:
>
> Hi Aleksey,
>
>     Wouldn't it be even better if just the threadLocalRandom* fields
>     were annotated with @Contended("ThreadLocal")?
>     Some fields within the Thread object are accessed from non-local
>     threads. I don't know how frequently, but isolating just the
>     threadLocalRandom* fields from all possible false-sharing
>     scenarios would seem even better, no?
>
> Regards, Peter
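
(A rough sketch of what that field-level alternative could look like,
using the JDK 8 location of the annotation, sun.misc.Contended. The class
below is an invented stand-in for Thread, not the actual webrev; fields
sharing a contention group are padded away from everything else but stay
adjacent to each other.)

    import sun.misc.Contended;

    // Note: user classes need -XX:-RestrictContended for the annotation
    // to take effect; boot class path classes like Thread do not.
    public class TlrStateSketch {
        @Contended("ThreadLocal")
        long threadLocalRandomSeed;

        @Contended("ThreadLocal")
        int threadLocalRandomProbe;

        @Contended("ThreadLocal")
        int threadLocalRandomSecondarySeed;
    }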
>
>
> On 05/08/2013 07:29 PM, Aleksey Shipilev wrote:
>
> Hi,
>
>     This is from our backlog after JDK-8005926. After ThreadLocalRandom
>     state was merged into Thread, we now have to deal with the false
>     sharing induced by heavily-updated fields in Thread. TLR was padded
>     before, and it makes sense to have Thread bear the @Contended
>     annotation to isolate its fields in the same manner.
>
> The webrev is here:
> http://cr.openjdk.java.net/~shade/8014233/webrev.00/
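
(For readers without the webrev handy, the class-level form amounts to
annotating the whole class, sketched on an invented stand-in class below;
in JDK 8 the annotation lives in sun.misc and, for classes outside the
boot class path, takes effect only with -XX:-RestrictContended.)

    import sun.misc.Contended;

    // Sketch only: class-level @Contended pads the instance's entire
    // field block from neighbouring heap objects, at the cost of extra
    // footprint per instance.
    @Contended
    public class PaddedHolder {
        long frequentlyWrittenState;   // stands in for the TLR fields

        void touch() {
            frequentlyWrittenState++;
        }
    }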
>
> Testing:
> - microbenchmarks (see below)
> - JPRT cycle against jdk8-tl
>
> The extended rationale for the change follows.
>
>     If we look at the current Thread layout, we can see that the TLR
>     state is buried within the Thread instance. The TLR state fields
>     are by far the most frequently updated fields in Thread now:
>
> Running 64-bit HotSpot VM.
> Using compressed references with 3-bit shift.
> Objects are 8 bytes aligned.
>
>     java.lang.Thread
>      offset  size                      type  description
>           0    12                            (assumed to be the object header + first field alignment)
>          12     4                       int  Thread.priority
>          16     8                      long  Thread.eetop
>          24     8                      long  Thread.stackSize
>          32     8                      long  Thread.nativeParkEventPointer
>          40     8                      long  Thread.tid
>          48     8                      long  Thread.threadLocalRandomSeed
>          56     4                       int  Thread.threadStatus
>          60     4                       int  Thread.threadLocalRandomProbe
>          64     4                       int  Thread.threadLocalRandomSecondarySeed
>          68     1                   boolean  Thread.single_step
>          69     1                   boolean  Thread.daemon
>          70     1                   boolean  Thread.stillborn
>          71     1                            (alignment/padding gap)
>          72     4                    char[]  Thread.name
>          76     4                    Thread  Thread.threadQ
>          80     4                  Runnable  Thread.target
>          84     4               ThreadGroup  Thread.group
>          88     4               ClassLoader  Thread.contextClassLoader
>          92     4      AccessControlContext  Thread.inheritedAccessControlContext
>          96     4            ThreadLocalMap  Thread.threadLocals
>         100     4            ThreadLocalMap  Thread.inheritableThreadLocals
>         104     4                    Object  Thread.parkBlocker
>         108     4             Interruptible  Thread.blocker
>         112     4                    Object  Thread.blockerLock
>         116     4  UncaughtExceptionHandler  Thread.uncaughtExceptionHandler
>         120                                  (object boundary, size estimate)
>     VM reports 120 bytes per instance
>
>
>     Assuming current x86 hardware with 64-byte cache lines and the
>     current class layout, we can see that the trailing fields in Thread
>     provide enough insulation from false sharing with an adjacent
>     object. Also, the Thread instance itself is large enough that two
>     TLRs belonging to different threads will not collide.
>
>     However, the leading fields are not insulated enough: a few words
>     that belong to another object can occupy the same cache line. This
>     is where things can get worse in two ways: a) the TLR update can
>     make field accesses in the adjacent object considerably slower;
>     and, much worse, b) an update of the adjacent field can disturb the
>     TLR state, which is critical for j.u.concurrent code that relies
>     heavily on fast TLR.
>
>     To illustrate both points, there is a simple benchmark driven by
>     JMH (http://openjdk.java.net/projects/code-tools/jmh/):
>     http://cr.openjdk.java.net/~shade/8014233/threadbench.zip
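
(The linked zip contains the actual benchmark; as a rough approximation
of its shape, written against the current JMH annotation API rather than
the 2013 one, with invented class and method names:)

    import java.util.concurrent.ThreadLocalRandom;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Thread)
    public class ThreadFalseSharingBench {

        @Benchmark
        public int tlrOnly() {
            // Updates threadLocalRandomSeed/Probe in the current Thread.
            return ThreadLocalRandom.current().nextInt();
        }

        @Benchmark
        public int tlrAndUncaughtHandler() {
            // Additionally touches uncaughtExceptionHandler, a field near
            // the end of the Thread layout, mimicking an access that can
            // false-share with an adjacent thread's TLR state.
            int r = ThreadLocalRandom.current().nextInt();
            return r + System.identityHashCode(
                    Thread.currentThread().getUncaughtExceptionHandler());
        }
    }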
>
>     On my 2x2 i5-2520M Linux x86_64 laptop, running the latest jdk8-tl
>     and Thread with/without @Contended, that microbenchmark yields the
>     following results [20x1 sec warmup, 20x1 sec measurements, 10 forks]:
>
> Accessing ThreadLocalRandom.current().nextInt():
> baseline: 932 +- 4 ops/usec
> @Contended: 927 +- 10 ops/usec
>
> Accessing TLR.current.nextInt() *and* Thread.getUEHandler():
> baseline: 454 +- 2 ops/usec
> @Contended: 490 +- 3 ops/usec
>
>     One might note that $uncaughtExceptionHandler is the trailing field
>     in Thread, so it can naturally be false-shared with an adjacent
>     thread's TLR. We chose it as the illustration; in real scenarios,
>     with a multitude of objects on the heap, another object can take
>     the role of the contender.
>
>     So that is a ~10% performance hit from false sharing, even on a
>     very small machine. Translating it back: a heavily-updated field in
>     an object adjacent to a Thread can bring the same overhead to TLR,
>     and thus jeopardize j.u.c performance.
>
>     Of course, as soon as the status quo in field layout changes, we
>     might start to lose spectacularly. I would recommend we deal with
>     this now, so there are fewer surprises in the future.
>
>     The caveat is that we waste some space per Thread instance. After
>     the patch, the layout is:
>
>     java.lang.Thread
>      offset  size                      type  description
>           0    12                            (assumed to be the object header + first field alignment)
>          12   128                            (alignment/padding gap)
>         140     4                       int  Thread.priority
>         144     8                      long  Thread.eetop
>         152     8                      long  Thread.stackSize
>         160     8                      long  Thread.nativeParkEventPointer
>         168     8                      long  Thread.tid
>         176     8                      long  Thread.threadLocalRandomSeed
>         184     4                       int  Thread.threadStatus
>         188     4                       int  Thread.threadLocalRandomProbe
>         192     4                       int  Thread.threadLocalRandomSecondarySeed
>         196     1                   boolean  Thread.single_step
>         197     1                   boolean  Thread.daemon
>         198     1                   boolean  Thread.stillborn
>         199     1                            (alignment/padding gap)
>         200     4                    char[]  Thread.name
>         204     4                    Thread  Thread.threadQ
>         208     4                  Runnable  Thread.target
>         212     4               ThreadGroup  Thread.group
>         216     4               ClassLoader  Thread.contextClassLoader
>         220     4      AccessControlContext  Thread.inheritedAccessControlContext
>         224     4            ThreadLocalMap  Thread.threadLocals
>         228     4            ThreadLocalMap  Thread.inheritableThreadLocals
>         232     4                    Object  Thread.parkBlocker
>         236     4             Interruptible  Thread.blocker
>         240     4                    Object  Thread.blockerLock
>         244     4  UncaughtExceptionHandler  Thread.uncaughtExceptionHandler
>         248                                  (object boundary, size estimate)
>     VM reports 376 bytes per instance
>
>     ...and we have an additional 256 bytes per Thread (twice the
>     -XX:ContendedPaddingWidth, actually). That seems irrelevant
>     compared to the space wasted in native memory for each thread,
>     especially the stack areas.
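
(Side note on that arithmetic: the default -XX:ContendedPaddingWidth is
128 bytes, and a class-level @Contended pads both before and after the
field block, which matches the 376 - 120 = 256 byte growth above. Since
java.lang.Thread is on the boot class path, the annotation is honoured by
default; application classes experimenting with @Contended, like the
sketches earlier in this thread, additionally need the restriction
lifted, e.g. as below, where MyBenchmark is a placeholder class name.)

    java -XX:-RestrictContended -XX:ContendedPaddingWidth=128 MyBenchmark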
>
> Thanks,
> Aleksey.
>
>
>