RFR: 8291555: Implement alternative fast-locking scheme [v5]

Fri Jan 27 23:31:21 UTC 2023

On Thu, 26 Jan 2023 12:34:18 GMT, Roman Kennke <rkennke at openjdk.org> wrote:

>> This change adds a fast-locking scheme as an alternative to the current stack-locking implementation. It retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation when trying to access the mark-word. And because of the very racy nature, this turns out to be very complex and would involve a variant of the inflation protocol to ensure that the object header is stable. (The current implementation of setting/fetching the i-hash provides a glimpse into the complexity).
>> 
>> What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock.
>> 
>> This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations (and we already do). The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal 
 protocols.
>> 
>> The lock-stack is grown when needed. This means that we need to check for potential overflow before attempting locking. When that is the case, locking fast-paths would call into the runtime to grow the stack and handle the locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check on method entry to avoid (possibly lots) of such checks at locking sites.
>> 
>> In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking.
>> 
>> One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread.
>> 
>> As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it.
>> 
>> This change enables to simplify (and speed-up!) a lot of code:
>> 
>> - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header.
>> - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR
>> 
>> 
>> Testing:
>>  - [x] tier1 x86_64 x aarch64 x +UseFastLocking
>>  - [x] tier2 x86_64 x aarch64 x +UseFastLocking
>>  - [x] tier3 x86_64 x aarch64 x +UseFastLocking
>>  - [x] tier4 x86_64 x aarch64 x +UseFastLocking
>>  - [x] tier1 x86_64 x aarch64 x -UseFastLocking
>>  - [x] tier2 x86_64 x aarch64 x -UseFastLocking
>>  - [x] tier3 x86_64 x aarch64 x -UseFastLocking
>>  - [x] tier4 x86_64 x aarch64 x -UseFastLocking
>>  - [x] Several real-world applications have been tested with this change in tandem with Lilliput without any problems, yet
>> 
>> ### Performance
>> 
>> #### Renaissance
>> 
>> 
>> 
>>   | x86_64 |   |   |   | aarch64 |   |  
>> -- | -- | -- | -- | -- | -- | -- | --
>>   | stack-locking | fast-locking |   |   | stack-locking | fast-locking |  
>> AkkaUct | 841.884 | 836.948 | 0.59% |   | 1475.774 | 1465.647 | 0.69%
>> Reactors | 11444.511 | 11606.66 | -1.40% |   | 11382.594 | 11638.036 | -2.19%
>> Als | 1367.183 | 1359.358 | 0.58% |   | 1678.103 | 1688.067 | -0.59%
>> ChiSquare | 577.021 | 577.398 | -0.07% |   | 986.619 | 988.063 | -0.15%
>> GaussMix | 817.459 | 819.073 | -0.20% |   | 1154.293 | 1155.522 | -0.11%
>> LogRegression | 598.343 | 603.371 | -0.83% |   | 638.052 | 644.306 | -0.97%
>> MovieLens | 8248.116 | 8314.576 | -0.80% |   | 9898.1 | 10097.867 | -1.98%
>> NaiveBayes | 587.607 | 581.608 | 1.03% |   | 541.583 | 550.059 | -1.54%
>> PageRank | 3260.553 | 3263.472 | -0.09% |   | 4376.405 | 4381.101 | -0.11%
>> FjKmeans | 979.978 | 976.122 | 0.40% |   | 774.312 | 771.235 | 0.40%
>> FutureGenetic | 2187.369 | 2183.271 | 0.19% |   | 2685.722 | 2689.056 | -0.12%
>> ParMnemonics | 2527.228 | 2564.667 | -1.46% |   | 4278.225 | 4263.863 | 0.34%
>> Scrabble | 111.882 | 111.768 | 0.10% |   | 151.796 | 153.959 | -1.40%
>> RxScrabble | 210.252 | 211.38 | -0.53% |   | 310.116 | 315.594 | -1.74%
>> Dotty | 750.415 | 752.658 | -0.30% |   | 1033.636 | 1036.168 | -0.24%
>> ScalaDoku | 3072.05 | 3051.2 | 0.68% |   | 3711.506 | 3690.04 | 0.58%
>> ScalaKmeans | 211.427 | 209.957 | 0.70% |   | 264.38 | 265.788 | -0.53%
>> ScalaStmBench7 | 1017.795 | 1018.869 | -0.11% |   | 1088.182 | 1092.266 | -0.37%
>> Philosophers | 6450.124 | 6565.705 | -1.76% |   | 12017.964 | 11902.559 | 0.97%
>> FinagleChirper | 3953.623 | 3972.647 | -0.48% |   | 4750.751 | 4769.274 | -0.39%
>> FinagleHttp | 3970.526 | 4005.341 | -0.87% |   | 5294.125 | 5296.224 | -0.04%
>
> Roman Kennke has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 86 commits:
> 
>  - Merge branch 'master' into JDK-8291556-v2
>  - Revert UseFastLocking to default to off
>  - Change log message inflate(locked) -> inflate(has_locker)
>  - Properly set ZF on anon-check path; avoid some conditional branches
>  - Use testb when testing for anon owner in fast-path
>  - Merge branch 'master' into JDK-8291555-v2
>  - In x86_32 C2 fast_lock(), CAS thread directly, instead of CASing stack-pointer and then storing thread
>  - x86 part of optimization to check for anon owner
>  - Optimize the check-for-anon-owner fast-path
>  - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2
>  - ... and 76 more: https://git.openjdk.org/jdk/compare/da80e7a4...784b361c

This is a partial review of this PR. I've only reviewed the following files so far:

src/hotspot/share/oops/markWord.hpp
src/hotspot/share/oops/oop.cpp
src/hotspot/share/runtime/javaThread.hpp
src/hotspot/share/runtime/lockStack.cpp
src/hotspot/share/runtime/lockStack.cpp
src/hotspot/share/runtime/objectMonitor.cpp
src/hotspot/share/runtime/synchronizer.cpp
src/hotspot/share/runtime/thread.cpp
src/hotspot/share/runtime/threads.cpp
src/hotspot/share/runtime/threads.hpp

I've also created a patch with editorial changes. Please see:
https://github.com/openjdk/jdk/pull/12271

src/hotspot/share/runtime/synchronizer.cpp line 1301:

> 1299:     // We could always eliminate polling by parking the thread on some auxiliary list.
> 1300:     // NOTE: We need to check UseFastLocking here, because with fast-locking, the header
> 1301:     // may legitimately be zero: cleared lock-bits and all upper header bits zero.

The `markWord::INFLATING()` value was picked to be zero because the header could never
be zero except for the inflating case. With fast-locking, in the locked case with no hashcode,
the header can be all zeros. Hmmm... gotta mull on that one...

fast-locking does not use the `markWord::INFLATING()`, but that protocol exists with the
stack-locking implementation to prevent races. Here's the gory comment from the
"Case stack-locked" portion of `ObjectSynchronizer::inflate()`:

      // We've successfully installed INFLATING (0) into the mark-word.
      // This is the only case where 0 will appear in a mark-word.
      // Only the singular thread that successfully swings the mark-word
      // to 0 can perform (or more precisely, complete) inflation.
      //
      // Why do we CAS a 0 into the mark-word instead of just CASing the
      // mark-word from the stack-locked value directly to the new inflated state?
      // Consider what happens when a thread unlocks a stack-locked object.
      // It attempts to use CAS to swing the displaced header value from the
      // on-stack BasicLock back into the object header.  Recall also that the
      // header value (hash code, etc) can reside in (a) the object header, or
      // (b) a displaced header associated with the stack-lock, or (c) a displaced
      // header in an ObjectMonitor.  The inflate() routine must copy the header
      // value from the BasicLock on the owner's stack to the ObjectMonitor, all
      // the while preserving the hashCode stability invariants.  If the owner
      // decides to release the lock while the value is 0, the unlock will fail
      // and control will eventually pass from slow_exit() to inflate.  The owner
      // will then spin, waiting for the 0 value to disappear.   Put another way,
      // the 0 causes the owner to stall if the owner happens to try to
      // drop the lock (restoring the header from the BasicLock to the object)
      // while inflation is in-progress.  This protocol avoids races that might
      // would otherwise permit hashCode values to change or "flicker" for an object.
      // Critically, while object->mark is 0 mark.displaced_mark_helper() is stable.
      // 0 serves as a "BUSY" inflate-in-progress indicator.

So how does fast-locking avoid hashCode value flickering for an object
when there are races between exiting an monitor and inflation? With
fast-locking the owner of the monitor does not stall due to an INFLATING
value being present in the header so the owner thread will race with the
inflating thread. However, unlike with stack-locking, the owner thread in
fast-locking is not restoring a saved header value from the BasicLock to
the object's header; it is simply changing the `locked_value` to the
`unlocked_value` in the lower two bits. If there's a hashCode in the header,
then that hashCode remains untouched.

The other callers of `read_stable_mark()`:

`ObjectSynchronizer::FastHashCode()` - Does not need to stall in `read_stable_mark()`
due to a racing inflation because the hashCode value will either be found in the header
or in the ObjectMonitor and it will be the same value in both locations.

`ObjectSynchronizer::current_thread_holds_lock()` - Does not need to stall in
`read_stable_mark()` due to a racing inflation because the current thread
either owns the lock or it does not and that state cannot change while the
current thread is executing this function.

`ObjectSynchronizer::get_lock_owner()` - I think this function does need to stall
in `read_stable_mark()` due to a racing inflation. If the calling thread in this
case gets a header value where `mark.is_fast_locked() == true`, then it will
search the ThreadsList for the thread that has the object on its lock stack.
If thread that's inflating the lock is the owner of the lock, then it will remove
the object from its lock stack after it has changed the owner in the ObjectMonitor
to itself. If that happens before the `get_lock_owner()` calling thread reaches
the owning thread in its ThreadsList search, then the `get_lock_owner()` calling
thread won't find the owner of the lock.

If the `get_lock_owner()` calling thread happens to "know" that the lock is
supposed to be owned by _someone_, then the inconsistency could be detected.
It might even be possible to write a test to detect this. I'll mull on that a bit...

`ObjectSynchronizer::inflate()` - Only calls read_stable_mark() when stack-locking
is in use and the `INFLATING()` value is seen.

Otherwise, `object->mark_acquire()` is used for the read of the header at the top
of the `inflate()` loop and all the code that updates the object's mark use
`cas_set_mark()` to only change the object's mark when the current value matches
the expected current value. Otherwise, we loop around and try again.

src/hotspot/share/runtime/synchronizer.cpp line 1336:

> 1334:         // Success! Return inflated monitor.
> 1335:         if (own) {
> 1336:           assert(current->is_Java_thread(), "must be: checked in is_lock_owned()");

`is_lock_owned()` currently does this:

static bool is_lock_owned(Thread* thread, oop obj) {
  assert(UseFastLocking, "only call this with fast-locking enabled");
  return thread->is_Java_thread() ? reinterpret_cast<JavaThread*>(thread)->lock_stack().contains(obj) : false;
}

so I would not say "checked in is_locked_owned()" since `is_locked_owned()` does
not enforce that the caller is a JavaThread.

-------------

Changes requested by dcubed (Reviewer).

PR: https://git.openjdk.org/jdk/pull/10907