RFR (XS) CR 8014233: java.lang.Thread should have @Contended on TLR fields

Tue Jun 18 06:56:30 UTC 2013

Hi David,

It depends on the scenario we are assessing. For the sake of argument,
let's say every thread had requested TLR.current() at least once.

Before the merge:
 Thread maps for ThreadLocal =~ 32 bytes x #threads
 TLR instances + padding =~ (128 + 8?) bytes x #threads

After the merge:
 TLR fields in Thread + padding =~ (2x128 + 16) x #threads

So there is an additional footprint cost per Thread, but it seems
negligible compared to what a native thread already allocates for its
native structures (e.g. the stack). Note that @Contended pads more
generously, anticipating that the hardware prefetchers are also turned on
(the VM can get better at this, though).
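
To make the arithmetic above concrete, here is a minimal sketch that just
multiplies out the per-thread estimates. The constants are the rough figures
quoted in this message (including the guessed "8?"), and the thread count is
a hypothetical input, not a measurement.

```java
class TlrFootprint {
    // Rough per-thread estimates from the message, in bytes.
    static final int THREAD_LOCAL_MAP = 32;        // ThreadLocal map share per thread
    static final int TLR_INSTANCE = 128 + 8;       // padded TLR instance; the 8 is a guess
    static final int TLR_IN_THREAD = 2 * 128 + 16; // padding on both sides + 16 bytes of fields

    // Estimated footprint before the TLR state was merged into Thread.
    static long before(int threads) {
        return (long) (THREAD_LOCAL_MAP + TLR_INSTANCE) * threads;
    }

    // Estimated footprint after the merge, with @Contended padding.
    static long after(int threads) {
        return (long) TLR_IN_THREAD * threads;
    }

    public static void main(String[] args) {
        int threads = 1000; // hypothetical thread count
        System.out.println("before: " + before(threads) + " bytes");
        System.out.println("after:  " + after(threads) + " bytes");
        System.out.println("delta:  " + (after(threads) - before(threads)) + " bytes");
    }
}
```

Under these estimates the extra cost is about 104 bytes per thread, so
roughly 100 KB for a thousand threads.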

Gory details:

**** -XX:-EnableContended: ****

Running 64-bit HotSpot VM.
Using compressed references with 3-bit shift.
Objects are 8 bytes aligned.

java.lang.Thread
 offset  size                     type description
      0    12                          (assumed to be the object header + first field alignment)
     12     4                      int Thread.priority
     16     8                     long Thread.eetop
     24     8                     long Thread.stackSize
     32     8                     long Thread.nativeParkEventPointer
     40     8                     long Thread.tid
     48     8                     long Thread.threadLocalRandomSeed
     56     4                      int Thread.threadStatus
     60     4                      int Thread.threadLocalRandomProbe
     64     4                      int Thread.threadLocalRandomSecondarySeed
     68     1                  boolean Thread.single_step
     69     1                  boolean Thread.daemon
     70     1                  boolean Thread.stillborn
     71     1                          (alignment/padding gap)
     72     4                   char[] Thread.name
     76     4                   Thread Thread.threadQ
     80     4                 Runnable Thread.target
     84     4              ThreadGroup Thread.group
     88     4              ClassLoader Thread.contextClassLoader
     92     4     AccessControlContext Thread.inheritedAccessControlContext
     96     4           ThreadLocalMap Thread.threadLocals
    100     4           ThreadLocalMap Thread.inheritableThreadLocals
    104     4                   Object Thread.parkBlocker
    108     4            Interruptible Thread.blocker
    112     4                   Object Thread.blockerLock
    116     4 UncaughtExceptionHandler Thread.uncaughtExceptionHandler
    120                                (object boundary, size estimate)
VM reports 120 bytes per instance

**** -XX:+EnableContended: ****

Running 64-bit HotSpot VM.
Using compressed references with 3-bit shift.
Objects are 8 bytes aligned.

java.lang.Thread
 offset  size                     type description
      0    12                          (assumed to be the object header + first field alignment)
     12     4                      int Thread.priority
     16     8                     long Thread.eetop
     24     8                     long Thread.stackSize
     32     8                     long Thread.nativeParkEventPointer
     40     8                     long Thread.tid
     48     4                      int Thread.threadStatus
     52     1                  boolean Thread.single_step
     53     1                  boolean Thread.daemon
     54     1                  boolean Thread.stillborn
     55     1                          (alignment/padding gap)
     56     4                   char[] Thread.name
     60     4                   Thread Thread.threadQ
     64     4                 Runnable Thread.target
     68     4              ThreadGroup Thread.group
     72     4              ClassLoader Thread.contextClassLoader
     76     4     AccessControlContext Thread.inheritedAccessControlContext
     80     4           ThreadLocalMap Thread.threadLocals
     84     4           ThreadLocalMap Thread.inheritableThreadLocals
     88     4                   Object Thread.parkBlocker
     92     4            Interruptible Thread.blocker
     96     4                   Object Thread.blockerLock
    100     4 UncaughtExceptionHandler Thread.uncaughtExceptionHandler
    104   128                          (alignment/padding gap)
    232     8                     long Thread.threadLocalRandomSeed
    240     4                      int Thread.threadLocalRandomProbe
    244     4                      int Thread.threadLocalRandomSecondarySeed
    248                                (object boundary, size estimate)
VM reports 376 bytes per instance
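
The dump above ends the last field at offset 248, while the VM reports 376
bytes. One reading that reconciles the two, assuming the VM also pads
*after* the contended group (that trailing pad is not listed in the dump, so
this is an inference), is sketched below. The class and method names are
made up for illustration.

```java
class ThreadLayoutCheck {
    static final int REGULAR_FIELDS_END = 104; // end of uncaughtExceptionHandler in the dump
    static final int PADDING = 128;            // gap listed at offset 104
    static final int TLR_FIELDS = 8 + 4 + 4;   // seed + probe + secondary seed

    // Offset just past the last TLR field, as listed in the dump.
    static int lastFieldEnd() {
        return REGULAR_FIELDS_END + PADDING + TLR_FIELDS;
    }

    // Instance size if an equal pad follows the contended group (assumption).
    static int sizeWithTrailingPad() {
        return lastFieldEnd() + PADDING;
    }

    public static void main(String[] args) {
        System.out.println("last field ends at: " + lastFieldEnd());        // 248
        System.out.println("with trailing pad:  " + sizeWithTrailingPad()); // 376
        System.out.println("growth vs. 120:     " + (sizeWithTrailingPad() - 120)); // 256
    }
}
```

If that reading is right, Thread grows by 256 bytes, i.e. two 128-byte pads
around fields that were already part of the object.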

-Aleksey.

On 06/18/2013 06:03 AM, David Holmes wrote:
> Hi Aleksey,
> 
> What is the overall change in memory use for this set of changes ie what
> did we use pre TLR merging and what do we use now?
> 
> Thanks,
> David
> 
> On 17/06/2013 7:00 PM, Aleksey Shipilev wrote:
>> Hi,
>>
>> This is the respin of the RFE filed a month ago:
>>    http://mail.openjdk.java.net/pipermail/core-libs-dev/2013-May/016754.html
>>
>> The webrev is here:
>>    http://cr.openjdk.java.net/~shade/8014233/webrev.02/
>>
>> Testing:
>>    - JPRT build passes
>>    - Linux x86_64/release passes jdk/java/lang jtreg
>>    - vm.quick.testlist, vm.quick-gc.testlist on selected platforms
>>    - microbenchmarks, see below
>>
>> The rationale follows.
>>
>> After we merged the ThreadLocalRandom state into Thread, we are now
>> missing the padding that prevents false sharing on those heavily-updated
>> fields. While Thread is already large enough to separate the TLR
>> states of two distinct threads, we can still get false sharing with
>> other Thread fields.
>>
>> There is the benchmark showcasing this:
>>    http://cr.openjdk.java.net/~shade/8014233/threadbench.zip
>>
>> There are two test cases: the first calls nextInt() on its own TLR and
>> then reads the current thread's ID; the second reads *another* thread's
>> ID instead, thus inducing false sharing against that thread's TLR
>> state.
>>
>> On my 2x2 i5 laptop, running Linux x86_64:
>>    same:    355 +- 1 ops/usec
>>    other:   100 +- 5 ops/usec
>>
>> Note the decrease in throughput because of the false sharing.
>>
>> With the patch:
>>    same:    359 +- 1 ops/usec
>>    other:   356 +- 1 ops/usec
>>
>> Note the performance is back. We want to avoid these spurious drops
>> in performance, caused either by an unlucky memory layout or by user code
>> (un)intentionally ruining the cache line locality for the updater thread.
>>
>> Thanks,
>> -Aleksey.
>>
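
For readers without the benchmark archive, the two test cases described in
the quoted message can be approximated as below. This is a rough sketch and
not the code from threadbench.zip: it uses plain threads and System.nanoTime
instead of a benchmark harness, and the class name, method name and
iteration count are all made up. On a JDK that already carries the patch,
the two timings would be expected to be close.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

class TlrSharingSketch {
    // Runs two workers. Each draws random ints from its own TLR and reads a
    // thread ID: its own ("same") or the other worker's ("other").
    static long measureNanos(boolean readOther, int iterations) {
        Thread[] workers = new Thread[2];
        AtomicLong sink = new AtomicLong(); // keeps the loop result alive
        for (int i = 0; i < workers.length; i++) {
            final int self = i;
            workers[i] = new Thread(() -> {
                Thread idSource = readOther ? workers[1 - self] : Thread.currentThread();
                long sum = 0;
                for (int n = 0; n < iterations; n++) {
                    // nextInt() updates this thread's seed; getId() reads a
                    // field of idSource that may share a cache line with its seed.
                    sum += ThreadLocalRandom.current().nextInt() + idSource.getId();
                }
                sink.addAndGet(sum);
            });
        }
        long start = System.nanoTime();
        try {
            for (Thread t : workers) t.start();
            for (Thread t : workers) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int iterations = 50_000_000; // hypothetical; tune for your machine
        measureNanos(false, iterations); // warm-up run, result discarded
        System.out.println("same:  " + measureNanos(false, iterations) / 1_000_000 + " ms");
        System.out.println("other: " + measureNanos(true, iterations) / 1_000_000 + " ms");
    }
}
```

A harness-free loop like this is sensitive to JIT warm-up and may be
optimized in ways that hide the effect, so treat its output as indicative
only.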