RFR (XS) CR 8014233: java.lang.Thread should be @Contended

Thu May 9 17:56:24 UTC 2013

On 05/09/2013 04:59 PM, Laurent Bourgès wrote:
> Hi all,
>
> A stupid question:
> should any ThreadLocal subclass be marked @Contended to be sure that 
> false sharing never happens between a ThreadLocal instance and any 
> other object on the heap?
>

Hi Laurent,

A ThreadLocal object is just a key (into a ThreadLocalMap). It is usually 
subclassed not to add any state but to override the initialValue method. 
ThreadLocal contains a single final field, 'threadLocalHashCode', which 
is read at every call to ThreadLocal.get() (usually by multiple 
threads). That read can contend with a frequent write to a field in some 
other object placed in its proximity, yes, but I don't think we 
should put @Contended on every class that has frequently read fields. 
@Contended should be reserved for classes with fields that are 
frequently written, if I understand the concept correctly.
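
To illustrate what I mean by "just a key", here is a minimal sketch. The
class and method names (ThreadLocalKeyDemo, appendAndRead,
appendOnNewThread) are made up for this example; only ThreadLocal and
initialValue() are real API. The subclass adds no fields of its own, and
each thread gets its own value, stored in that thread's map:

```java
public class ThreadLocalKeyDemo {
    // The usual kind of subclass: no extra state, only initialValue() is overridden.
    static final ThreadLocal<StringBuilder> BUFFER = new ThreadLocal<StringBuilder>() {
        @Override
        protected StringBuilder initialValue() {
            return new StringBuilder();
        }
    };

    // Appends to the calling thread's own buffer and returns its contents.
    static String appendAndRead(String s) {
        return BUFFER.get().append(s).toString();
    }

    // Runs appendAndRead on a fresh thread and returns what that thread saw.
    static String appendOnNewThread(String s) {
        String[] result = new String[1];
        Thread t = new Thread(() -> result[0] = appendAndRead(s));
        t.start();
        try {
            t.join(); // join() makes the worker's write to result[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(appendAndRead("main"));        // prints "main"
        System.out.println(appendOnNewThread("worker"));  // prints "worker"
        System.out.println(appendAndRead("+again"));      // prints "main+again"
    }
}
```

The shared BUFFER object is only read by the threads; all the writes go
to the per-thread StringBuilder values.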

Regards, Peter

> Laurent
>
> 2013/5/9 Peter Levart <peter.levart at gmail.com>
>
>     Hi Aleksey,
>
>     Wouldn't it be even better if just threadLocalRandom* fields were
>     annotated with @Contended("ThreadLocal") ?
>     Some fields within the Thread object are accessed from non-local
>     threads. I don't know how frequently, but isolating just
>     threadLocalRandom* fields from all possible false-sharing
>     scenarios would seem even better, no?
>
>     Regards, Peter
>
>
>     On 05/08/2013 07:29 PM, Aleksey Shipilev wrote:
>
>         Hi,
>
>         This is from our backlog after JDK-8005926. After
>         ThreadLocalRandom state was merged into Thread, we now have to
>         deal with the false sharing induced by heavily-updated fields
>         in Thread. TLR was padded before, and it should make sense to
>         make Thread bear the @Contended annotation to isolate its
>         fields in the same manner.
>
>         The webrev is here:
>         http://cr.openjdk.java.net/~shade/8014233/webrev.00/
>
>         Testing:
>           - microbenchmarks (see below)
>           - JPRT cycle against jdk8-tl
>
>         The extended rationale for the change follows.
>
>         If we look at the current Thread layout, we can see the TLR
>         state is buried within the Thread instance. The TLR state
>         fields are by far the most frequently updated fields in Thread
>         now:
>
>             Running 64-bit HotSpot VM.
>             Using compressed references with 3-bit shift.
>             Objects are 8 bytes aligned.
>
>             java.lang.Thread
>                offset  size                     type description
>                     0    12                          (assumed to be the object header + first field alignment)
>                    12     4                      int Thread.priority
>                    16     8                     long Thread.eetop
>                    24     8                     long Thread.stackSize
>                    32     8                     long Thread.nativeParkEventPointer
>                    40     8                     long Thread.tid
>                    48     8                     long Thread.threadLocalRandomSeed
>                    56     4                      int Thread.threadStatus
>                    60     4                      int Thread.threadLocalRandomProbe
>                    64     4                      int Thread.threadLocalRandomSecondarySeed
>                    68     1                  boolean Thread.single_step
>                    69     1                  boolean Thread.daemon
>                    70     1                  boolean Thread.stillborn
>                    71     1  (alignment/padding gap)
>                    72     4                   char[] Thread.name
>                    76     4                   Thread Thread.threadQ
>                    80     4                 Runnable Thread.target
>                    84     4              ThreadGroup Thread.group
>                    88     4              ClassLoader Thread.contextClassLoader
>                    92     4     AccessControlContext Thread.inheritedAccessControlContext
>                    96     4           ThreadLocalMap Thread.threadLocals
>                   100     4           ThreadLocalMap Thread.inheritableThreadLocals
>                   104     4                   Object Thread.parkBlocker
>                   108     4            Interruptible Thread.blocker
>                   112     4                   Object Thread.blockerLock
>                   116     4 UncaughtExceptionHandler Thread.uncaughtExceptionHandler
>                   120                                (object boundary, size estimate)
>               VM reports 120 bytes per instance
>
>
>         Assuming current x86 hardware with 64-byte cache lines and the
>         current class layout, we can see the trailing fields in Thread
>         provide enough insulation from false sharing with an adjacent
>         object. Also, the Thread itself is large enough that two TLRs
>         belonging to different threads will not collide.
>
>         However, the leading fields are not enough: we have a few
>         words which can occupy the same cache line, but belong to
>         another object. This is where things can get worse in two
>         ways: a) the TLR update can make the field access in the
>         adjacent object considerably slower; and, much worse, b) the
>         update in the adjacent field can disturb the TLR state, which
>         is critical for j.u.concurrent performance, which relies
>         heavily on fast TLR.
>
>         To illustrate both points, there is a simple benchmark driven
>         by JMH (http://openjdk.java.net/projects/code-tools/jmh/):
>         http://cr.openjdk.java.net/~shade/8014233/threadbench.zip
>
>         On my 2x2 i5-2520M Linux x86_64 laptop, running the latest
>         jdk8-tl and Thread with/without @Contended, that
>         microbenchmark yields the following results [20x1 sec warmup,
>         20x1 sec measurements, 10 forks]:
>
>         Accessing ThreadLocalRandom.current().nextInt():
>            baseline:    932 +-  4 ops/usec
>            @Contended:  927 +- 10 ops/usec
>
>         Accessing TLR.current().nextInt() *and* Thread.getUEHandler():
>            baseline:    454 +-  2 ops/usec
>            @Contended:  490 +-  3 ops/usec
>
>         One might note that $uncaughtExceptionHandler is the trailing
>         field in Thread, so it can naturally be false-shared with the
>         adjacent thread's TLR. We chose this as the illustration; in
>         real examples, with a multitude of objects on the heap, we can
>         get another contender.
>
>         So that is a ~10% performance hit from false sharing even on a
>         very small machine. Translating it back: having a
>         heavily-updated field in the object adjacent to Thread can
>         bring these overheads to TLR, and then jeopardize j.u.c
>         performance.
>
>         Of course, as soon as the status quo about field layout is
>         changed, we might start to lose spectacularly. I would
>         recommend we deal with this now, so fewer surprises come in
>         the future.
>
>         The caveat is that we are wasting some space per Thread
>         instance. After the patch, the layout is:
>
>             java.lang.Thread
>               offset  size                     type description
>                    0    12                          (assumed to be the object header + first field alignment)
>                   12   128  (alignment/padding gap)
>                  140     4                      int Thread.priority
>                  144     8                     long Thread.eetop
>                  152     8                     long Thread.stackSize
>                  160     8                     long Thread.nativeParkEventPointer
>                  168     8                     long Thread.tid
>                  176     8                     long Thread.threadLocalRandomSeed
>                  184     4                      int Thread.threadStatus
>                  188     4                      int Thread.threadLocalRandomProbe
>                  192     4                      int Thread.threadLocalRandomSecondarySeed
>                  196     1                  boolean Thread.single_step
>                  197     1                  boolean Thread.daemon
>                  198     1                  boolean Thread.stillborn
>                  199     1  (alignment/padding gap)
>                  200     4                   char[] Thread.name
>                  204     4                   Thread Thread.threadQ
>                  208     4                 Runnable Thread.target
>                  212     4              ThreadGroup Thread.group
>                  216     4              ClassLoader Thread.contextClassLoader
>                  220     4     AccessControlContext Thread.inheritedAccessControlContext
>                  224     4           ThreadLocalMap Thread.threadLocals
>                  228     4           ThreadLocalMap Thread.inheritableThreadLocals
>                  232     4                   Object Thread.parkBlocker
>                  236     4            Interruptible Thread.blocker
>                  240     4                   Object Thread.blockerLock
>                  244     4 UncaughtExceptionHandler Thread.uncaughtExceptionHandler
>                  248                                (object boundary, size estimate)
>             VM reports 376 bytes per instance
>
>         ...and we have an additional 256 bytes per Thread (twice the
>         -XX:ContendedPaddingWidth, actually). This seems irrelevant
>         compared to the space wasted in native memory for each thread,
>         especially stack areas.
>
>         Thanks,
>         Aleksey.
>
>
>
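
For reference, the size figures in the quoted layouts can be reproduced
with simple arithmetic. This is only a sketch: it assumes the padding
width is the 128-byte gap shown at offset 12 in the patched layout and
that the same amount is added after the last field. The class and method
names and the thread count of 1000 are made up for illustration.

```java
public class ContendedPaddingMath {
    static final int BASELINE_SIZE = 120;  // "VM reports 120 bytes per instance"
    static final int PADDING_WIDTH = 128;  // gap at offset 12 in the patched layout

    // Class-level padding is assumed to be added once before and once after the fields.
    static int paddedSize(int baselineSize, int paddingWidth) {
        return baselineSize + 2 * paddingWidth;
    }

    // Extra heap bytes spent on padding for a given number of Thread instances.
    static long extraBytes(int threads) {
        return (long) threads * 2 * PADDING_WIDTH;
    }

    public static void main(String[] args) {
        System.out.println(paddedSize(BASELINE_SIZE, PADDING_WIDTH)); // prints 376
        System.out.println(extraBytes(1000));                         // prints 256000
    }
}
```

So even a hypothetical 1000 threads would cost about 256 KB of extra
heap in total.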