RFR: 8291555: Implement alternative fast-locking scheme [v62]

Wed Apr 26 16:34:53 UTC 2023

On Thu, 20 Apr 2023 11:15:47 GMT, Roman Kennke <rkennke at openjdk.org> wrote:

>> This change adds a fast-locking scheme as an alternative to the current stack-locking implementation. It retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation when trying to access the mark-word. And because of the very racy nature, this turns out to be very complex and would involve a variant of the inflation protocol to ensure that the object header is stable. (The current implementation of setting/fetching the i-hash provides a glimpse into the complexity).
>> 
>> What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock.
>> 
>> This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations (and we already do). The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal 
 protocols.
>> 
>> The lock-stack is fixed size, currently with 8 elements. According to my experiments with various workloads, this covers the vast majority of workloads (in-fact, most workloads seem to never exceed 5 active locks per thread at a time). We check for overflow in the fast-paths and when the lock-stack is full, we take the slow-path, which would inflate the lock to a monitor. That case should be very rare.
>> 
>> In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking.
>> 
>> One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread.
>> 
>> As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it.
>> 
>> This change enables to simplify (and speed-up!) a lot of code:
>> 
>> - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header.
>> - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR
>> 
>> Also, and I might be mistaken here, this new lightweight locking would make synchronized work better with Loom: Because the lock-records are no longer scattered across the stack, but instead are densely packed into the lock-stack, it should be easy for a vthread to save its lock-stack upon unmounting and restore it when re-mounting. However, I am not sure about this, and this PR does not attempt to implement that support.
>> 
>> Testing:
>>  - [x] tier1 x86_64 x aarch64 x +UseFastLocking
>>  - [x] tier2 x86_64 x aarch64 x +UseFastLocking
>>  - [x] tier3 x86_64 x aarch64 x +UseFastLocking
>>  - [x] tier4 x86_64 x aarch64 x +UseFastLocking
>>  - [x] tier1 x86_64 x aarch64 x -UseFastLocking
>>  - [x] tier2 x86_64 x aarch64 x -UseFastLocking
>>  - [x] tier3 x86_64 x aarch64 x -UseFastLocking
>>  - [x] tier4 x86_64 x aarch64 x -UseFastLocking
>>  - [x] Several real-world applications have been tested with this change in tandem with Lilliput without any problems, yet
>> 
>> ### Performance
>> 
>> #### Simple Microbenchmark
>> 
>> The microbenchmark exercises only the locking primitives for monitorenter and monitorexit, without contention. The benchmark can be found (here)[https://github.com/rkennke/fastlockbench]. Numbers are in ns/ops.
>> 
>> |  | x86_64 | aarch64 |
>> | -- | -- | -- |
>> | -UseFastLocking | 20.651 | 20.764 |
>> | +UseFastLocking | 18.896 | 18.908 |
>> 
>> 
>> #### Renaissance
>> 
>>   | x86_64 |   |   |   | aarch64 |   |  
>> -- | -- | -- | -- | -- | -- | -- | --
>>   | stack-locking | fast-locking |   |   | stack-locking | fast-locking |  
>> AkkaUct | 841.884 | 836.948 | 0.59% |   | 1475.774 | 1465.647 | 0.69%
>> Reactors | 11041.427 | 11181.451 | -1.25% |   | 11381.751 | 11521.318 | -1.21%
>> Als | 1367.183 | 1359.358 | 0.58% |   | 1678.103 | 1688.067 | -0.59%
>> ChiSquare | 577.021 | 577.398 | -0.07% |   | 986.619 | 988.063 | -0.15%
>> GaussMix | 817.459 | 819.073 | -0.20% |   | 1154.293 | 1155.522 | -0.11%
>> LogRegression | 598.343 | 603.371 | -0.83% |   | 638.052 | 644.306 | -0.97%
>> MovieLens | 8248.116 | 8314.576 | -0.80% |   | 7569.219 | 7646.828 | -1.01%%
>> NaiveBayes | 587.607 | 581.608 | 1.03% |   | 541.583 | 550.059 | -1.54%
>> PageRank | 3260.553 | 3263.472 | -0.09% |   | 4376.405 | 4381.101 | -0.11%
>> FjKmeans | 979.978 | 976.122 | 0.40% |   | 774.312 | 771.235 | 0.40%
>> FutureGenetic | 2187.369 | 2183.271 | 0.19% |   | 2685.722 | 2689.056 | -0.12%
>> ParMnemonics | 2434.551 | 2468.763 | -1.39% |   | 4278.225 | 4263.863 | 0.34%
>> Scrabble | 111.882 | 111.768 | 0.10% |   | 151.796 | 153.959 | -1.40%
>> RxScrabble | 210.252 | 211.38 | -0.53% |   | 310.116 | 315.594 | -1.74%
>> Dotty | 750.415 | 752.658 | -0.30% |   | 1033.636 | 1036.168 | -0.24%
>> ScalaDoku | 3072.05 | 3051.2 | 0.68% |   | 3711.506 | 3690.04 | 0.58%
>> ScalaKmeans | 211.427 | 209.957 | 0.70% |   | 264.38 | 265.788 | -0.53%
>> ScalaStmBench7 | 1017.795 | 1018.869 | -0.11% |   | 1088.182 | 1092.266 | -0.37%
>> Philosophers | 6450.124 | 6565.705 | -1.76% |   | 12017.964 | 11902.559 | 0.97%
>> FinagleChirper | 3953.623 | 3972.647 | -0.48% |   | 4750.751 | 4769.274 | -0.39%
>> FinagleHttp | 3970.526 | 4005.341 | -0.87% |   | 5294.125 | 5296.224 | -0.04%
>
> Roman Kennke has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Remove unnecessary comments

src/hotspot/cpu/aarch64/aarch64.ad line 3875:

> 3873:       __ b(cont);
> 3874:     } else {
> 3875:       assert(LockingMode == LM_LIGHTWEIGHT, "");

Perhaps should be: s/""/"must be"/
I'm not fond of empty assert mesgs.

src/hotspot/cpu/aarch64/aarch64.ad line 3956:

> 3954:       __ b(cont);
> 3955:     } else {
> 3956:       assert(LockingMode == LM_LIGHTWEIGHT, "");

Perhaps should be: s/""/"must be"/
I'm not fond of empty assert mesgs.

src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp line 1813:

> 1811:       __ br(Assembler::NE, slow_path_lock);
> 1812:     } else {
> 1813:       assert(LockingMode == LM_LIGHTWEIGHT, "");

Perhaps should be: s/""/"must be"/
I'm not fond of empty assert mesgs.

src/hotspot/cpu/arm/c2_MacroAssembler_arm.cpp line 100:

> 98:     fast_lock_2(Roop /* obj */, Rbox /* t1 */, Rscratch /* t2 */, Rscratch2 /* t3 */,
> 99:                   1 /* savemask (save t1) */,
> 100:                   done);

Why not line up the '1' below the 'R' in Roop and join with the 'done);' line?

src/hotspot/cpu/arm/c2_MacroAssembler_arm.cpp line 152:

> 150:     fast_unlock_2(Roop /* obj */, Rbox /* t1 */, Rscratch /* t2 */, Rscratch2 /* t3 */,
> 151:                     1 /* savemask (save t1) */,
> 152:                     done);

Why not line up the '1' below the 'R' in Roop and join with the 'done);' line?

src/hotspot/cpu/arm/interp_masm_arm.cpp line 900:

> 898:       b(done);
> 899: 
> 900:     } else if (LockingMode == LM_LEGACY) {

Why so many blank lines in this new block?

src/hotspot/cpu/arm/interp_masm_arm.cpp line 1025:

> 1023:       fast_unlock_2(Robj /* obj */, Rlock /* t1 */, Rmark /* t2 */, Rtemp /* t3 */,
> 1024:                       1 /* savemask (save t1) */,
> 1025:                       slow_case);

Why not line up the '1' below the 'R' in Roop and join with the 'done);' line?

src/hotspot/cpu/arm/macroAssembler_arm.cpp line 1803:

> 1801: #ifdef ASSERT
> 1802:   // Poison scratch regs
> 1803:   POISON_REGS((~savemask), t1, t2, t3, 0x10000001);

Should this poison value be: 0x20000002

src/hotspot/cpu/arm/macroAssembler_arm.cpp line 1811:

> 1809: // Attempt to fast-unlock an object
> 1810: // Registers:
> 1811: //  - obj: the object to be locked

nit typo: s/locked/unlocked/

src/hotspot/cpu/arm/macroAssembler_arm.hpp line 1023:

> 1021:   // Attempt to fast-unlock an object
> 1022:   // Registers:
> 1023:   //  - obj: the object to be locked

nit typo: s/locked/unlocked/

src/hotspot/cpu/arm/sharedRuntime_arm.cpp line 649:

> 647: 
> 648:   __ flush();
> 649:   return AdapterHandlerLibrary::new_entry(fingerprint, i2c_entry, c2i_entry, c2i_unverified_entry);

This change seems out of place... what's the story here?

src/hotspot/cpu/arm/sharedRuntime_arm.cpp line 1246:

> 1244:     if (LockingMode == LM_LIGHTWEIGHT) {
> 1245:       log_trace(fastlock2)("SharedRuntime unlock fast");
> 1246:       __ fast_unlock_2(sync_obj, R2, tmp, Rtemp, 7, slow_unlock);

No comments on the params like in other places...

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 9694:

> 9692: 
> 9693:   // Now we attempt to take the fast-lock.
> 9694:   // Clear lowest two header bits (locked state).

Perhaps:

  // Clear lock_mask bits (locked state).

so that you don't tie this comment to the implementation size of the lock_mask.

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 9697:

> 9695:   andptr(hdr, ~(int32_t)markWord::lock_mask_in_place);
> 9696:   movptr(tmp, hdr);
> 9697:   // Set lowest bit (unlocked state).

Perhaps:

 // Set unlocked_value bit.

so that you don't tied this comment to the implementation of the unlocked_value
being the lowest bit. I'm less worried about 'bit' versus 'bits' for this one.

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 9721:

> 9719:   assert_different_registers(obj, hdr, tmp);
> 9720: 
> 9721:   // Mark-word must be 00 now, try to swing it back to 01 (unlocked)

Perhaps:

  // Mark-word must be lock_mask now, try to swing it back to unlocked_value.

so that you don't tie this comment to the implementation values of
lock_mask and unlocked_value.

src/hotspot/cpu/x86/sharedRuntime_x86_32.cpp line 1717:

> 1715:       __ jcc(Assembler::notEqual, slow_path_lock);
> 1716:     } else {
> 1717:       assert(LockingMode == LM_LIGHTWEIGHT, "");

Perhaps should be: s/""/"must be"/
I'm not fond of empty assert mesgs.

src/hotspot/cpu/x86/sharedRuntime_x86_32.cpp line 1876:

> 1874:       __ dec_held_monitor_count();
> 1875:     } else {
> 1876:       assert(LockingMode == LM_LIGHTWEIGHT, "");

Perhaps should be: s/""/"must be"/
I'm not fond of empty assert mesgs.

src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp line 2187:

> 2185:       __ jcc(Assembler::notEqual, slow_path_lock);
> 2186:     } else {
> 2187:       assert(LockingMode == LM_LIGHTWEIGHT, "");

Perhaps should be: s/""/"must be"/
I'm not fond of empty assert mesgs.

src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp line 2331:

> 2329:       __ dec_held_monitor_count();
> 2330:     } else {
> 2331:       assert(LockingMode == LM_LIGHTWEIGHT, "");

Perhaps should be: s/""/"must be"/
I'm not fond of empty assert mesgs.

src/hotspot/share/interpreter/interpreterRuntime.cpp line 759:

> 757: // also keep the BasicObjectLock, but we don't really need it anyway, we only need
> 758: // the object. See also InterpreterMacroAssembler::lock_object().
> 759: // As soon as traditional stack-locking goes away we could remove the other monitorenter() entry

Perhaps:
s/traditional/legacy/
for terminology consistency...

src/hotspot/share/logging/logTag.hpp line 80:

> 78:   LOG_TAG(exceptions) \
> 79:   LOG_TAG(exit) \
> 80:   LOG_TAG(fastlock2) \

So why 'fastlock2'? Where's 'fastlock1'? Or 'fastlock'?

src/hotspot/share/oops/markWord.hpp line 175:

> 173:   }
> 174:   bool has_locker() const {
> 175:     assert(LockingMode == LM_LEGACY, "should only be called with traditional stack locking");

Perhaps:
s/traditional/legacy/
for terminology consistency...

src/hotspot/share/runtime/arguments.cpp line 1994:

> 1992:   if (UseHeavyMonitors) {
> 1993:     FLAG_SET_CMDLINE(LockingMode, LM_MONITOR);
> 1994:   }

HotSpot option processing has a general rule of last setting wins.
With L1992-1994 here, I think there might be a problem with a cmd
line that specifies:

    -XX:+UseHeavyMonitors -XX:LockingMode=1

I think that the resulting value of `LockingMode` will be `LM_MONITOR`
instead of `LM_LEGACY`. Granted mixing uses of `UseHeavyMonitors`
with `LockingMode` is just asking for trouble...

src/hotspot/share/runtime/globals.hpp line 1981:

> 1979:           "Select locking mode: "                                           \
> 1980:           "0: monitors only, "                                              \
> 1981:           "1: monitors & traditional stack-locking (default), "             \

Perhaps:
s/traditional/legacy/
to be consistent with terminolgy...

src/hotspot/share/runtime/javaThread.hpp line 1162:

> 1160: 
> 1161: 
> 1162:   static OopStorage* thread_oop_storage();

nit: delete extra blank line on L1161

src/hotspot/share/runtime/lockStack.cpp line 41:

> 39: LockStack::LockStack(JavaThread* jt) :
> 40:   _top(lock_stack_base_offset), _base()
> 41: {

nit: '{' on L41 should be at the end of L40 (after a space).

src/hotspot/share/runtime/lockStack.cpp line 63:

> 61: #ifndef PRODUCT
> 62: void LockStack::verify(const char* msg) const {
> 63:   assert(LockingMode == LM_LIGHTWEIGHT, "never use lock-stack when fast-locking is disabled");

Perhaps:
s/fast-locking/light weight locking/

src/hotspot/share/runtime/lockStack.cpp line 64:

> 62: void LockStack::verify(const char* msg) const {
> 63:   assert(LockingMode == LM_LIGHTWEIGHT, "never use lock-stack when fast-locking is disabled");
> 64:   assert((_top <=  end_offset()), "lockstack overflow: _top %d end_offset %d", _top, end_offset());

nit: extra space after `<=`

src/hotspot/share/runtime/lockStack.cpp line 65:

> 63:   assert(LockingMode == LM_LIGHTWEIGHT, "never use lock-stack when fast-locking is disabled");
> 64:   assert((_top <=  end_offset()), "lockstack overflow: _top %d end_offset %d", _top, end_offset());
> 65:   assert((_top >= start_offset()), "lockstack underflow: _topt %d end_offset %d", _top, start_offset());

nit typo: s/_topt/_top/

src/hotspot/share/runtime/lockStack.inline.hpp line 74:

> 72:   _base[to_index(_top)] = nullptr;
> 73: #endif
> 74:   assert(!contains(o), "entries must be unique");

Perhaps:

  assert(!contains(o), "entries must be unique: " PTR_FORMAT, p2i(o));

src/hotspot/share/runtime/lockStack.inline.hpp line 81:

> 79: inline void LockStack::remove(oop o) {
> 80:   verify("pre-remove");
> 81:   assert(contains(o), "entry must be present");

Perhaps:

  assert(contains(o), "entry must be present: " PTR_FORMAT, p2i(o));

src/hotspot/share/runtime/synchronizer.cpp line 506:

> 504:   if (!useHeavyMonitors()) {
> 505:     if (LockingMode == LM_LIGHTWEIGHT) {
> 506:       // Fast-locking does not use the 'lock' argument..

nit: extra period at the end.

src/hotspot/share/runtime/synchronizer.cpp line 1049:

> 1047: 
> 1048:   if (mark.has_monitor()) {
> 1049:     // inflated monitor so header points to ObjectMonitor (tagged pointer).

nit: s/inflated/Inflated/

src/hotspot/share/runtime/synchronizer.cpp line 1077:

> 1075: 
> 1076:   if (mark.has_monitor()) {
> 1077:     // inflated monitor so header points to ObjectMonitor (tagged pointer).

nit: s/inflated/Inflated/

src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/Threads.java line 213:

> 211:     // refer to Threads::owning_thread_from_monitor_owner
> 212:     public JavaThread owningThreadFromMonitor(Address o) {
> 213:         assert(VM.getVM().getCommandLineFlag("LockingMode").getInt() != 2);

Please put a comment after that literal '2':

        assert(VM.getVM().getCommandLineFlag("LockingMode").getInt() != 2 /* LM_LIGHTWEIGHT */);

src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/Threads.java line 231:

> 229: 
> 230:     public JavaThread owningThreadFromMonitor(ObjectMonitor monitor) {
> 231:         if (VM.getVM().getCommandLineFlag("LockingMode").getInt() == 2) {

Please put a comment after that literal '2':

        if (VM.getVM().getCommandLineFlag("LockingMode").getInt() == 2 /* LM_LIGHTWEIGHT */) {

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175779923
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175782883
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175838145
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175841805
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175841903
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175842779
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175843663
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175846203
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175845306
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175846631
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175847483
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1175847916
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176939876
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176943266
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176946313
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176950471
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176951363
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176952831
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176953184
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176960883
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176963809
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1176965777
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177061329
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177064110
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177066319
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177067974
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177069462
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177071139
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177071982
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177085020
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1177084574
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1178061523
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1178072667
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1178076085
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1178100632
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1178101434