Object to ObjectMonitor mapping using a HashTable to avoid displaced headers

Kennke, Roman rkennke at amazon.de
Wed Feb 14 13:17:04 UTC 2024


Hi Axel,


Thanks for this loooooong write-up! It is very helpful for understanding what you are doing in the OMWorld project. I have a few comments:

> ## High to Medium Level Overview of the Implementation
> 

This is very useful, and makes a lot of sense. Thank you!

> ### Inflation-Deflation Protocol
> 

Again very useful information.

> #### Spinning waiting on Deflation Thread progress
> 
> There is an unfortunate property here where correctness requires progress of
> the deflation thread before any locking thread can make progress. The window
> where the deflation thread could suffer an unfortunate preemption is small, but
> it does exist. Currently, if a locking thread fails to `inflate_and_enter`
> (due to deflation) it will attempt to fast lock again, which means that at
> least one locking thread can always make progress once the deflation thread
> has transitioned the mark word away from `0b10` Monitor. So the window is
> between the deflation thread making an ObjectMonitor `is_being_async_deflated()`
> and transitioning the mark word away from `0b10` Monitor.

How bad is that? Have you ever observed this to be an actual problem? (It seems kind of the reverse of what is currently done with the INFLATING protocol and doesn’t seem very much worse than that. Correct me if I’m wrong here.)
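Also, just to check my understanding of the retry loop, a minimal sketch: `inflate_and_enter` is the operation you name above, while `try_fast_lock` is a hypothetical stand-in for the fast-lock CAS, so this is not the actual patch:

```
// Sketch only; oop/JavaThread are the HotSpot types.
void enter(oop obj, JavaThread* current) {
  for (;;) {
    if (try_fast_lock(obj, current)) {
      return;  // fast locked: mark word went from 0b01 Unlocked to 0b00 Locked
    }
    if (inflate_and_enter(obj, current)) {
      return;  // entered the ObjectMonitor mapped in the hash table
    }
    // inflate_and_enter failed because the monitor was async-deflated.
    // Once the deflation thread has moved the mark word away from
    // 0b10 Monitor, the fast-lock CAS above can succeed again, so at
    // least one thread makes progress; until then we are in the window.
  }
}
```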

> There are schemes that can avoid this property, but they require additional
> mark word state: either another bit (which we want to avoid at all cost) or
> the fourth, unused state of the locking bits (`0b11` currently). The second
> would be a rather intrusive change (it has been prototyped) because the JVM
> makes a lot of assumptions based on single-bit states in the lock bits.
> 

Yes, and that combination of bit-states is not unused either. GCs use 0b11 to indicate a ‘forwarded’ state. This is not a problem for most GCs, because there it is only a transitional state during GC at a safepoint, but it may be a problem with Shenandoah, where this state occurs outside of safepoints (albeit only in from-space). In fact it might still work, because the ‘forwarded’ state would only be observed in from-space objects, which the locking code never sees (the barriers would ensure that only to-space objects are ever exposed to the runtime). It might also work if locking’s 0b11 were only a transitional state that can never be observed at safepoints (if it could be, that would break the other GCs, too); then we could never accidentally leak 0b11 into from-space. But it is all quite a hairy mess. We might want to avoid it, at least for the first shot, unless it solves a severe issue. Again, it looks kind of like the reverse of the current INFLATING protocol, and doesn’t seem worse than that.
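For reference, the four states of the two lock bits as HotSpot’s markWord.hpp defines them:

```
// The two low bits of the mark word (values from markWord.hpp):
static const uintptr_t locked_value   = 0;  // 0b00: stack-/fast-locked
static const uintptr_t unlocked_value = 1;  // 0b01: unlocked
static const uintptr_t monitor_value  = 2;  // 0b10: inflated ObjectMonitor
static const uintptr_t marked_value   = 3;  // 0b11: GC mark/'forwarded' state
```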

> ### ObjectMonitor Thread-Local Caches
> 

This is great!

> ## The Work Going Forward
> 
> ### Primary
> 
> Should be solved before integration.
> 
> #### OMCache
> 
> First is finalizing the size, behavior and layout of the Cache used by C2
> locking.
> 
> In the patch sent out last week the OMCache capacity is set to eight, and the
> actual size used can then be tuned from 0 to 8 using the `int OMCacheSize` flag.
> It is also possible to turn off the cache completely with the `bool OMUseC2Cache`
> flag, to evaluate the effect no cache would have on a specific workload.
> 
> Different sizes have been tried when running different benchmark suites
> (DaCapo, SPEC*, Renaissance, etc). A size between 3 and 5 seems to be required
> by at least some benchmarks, and those which had a lot of misses with 3-5 also
> had a lot of misses with an even larger cache size.
> 

My gut feeling would have been no more than 4, perhaps even less, maybe just 1 (which would make the code much simpler). The rationale is: how many OMs can be *hot* at the same time on a given thread? If a single thread needs to deal with several nested monitors, then surely only the innermost 1-2 monitors would be hot?
I am not sure that observed cache misses are the most relevant metric here. What’s relevant is how deep the cache has to be to make a performance difference. Or maybe I am over-optimistic about how complex locking schemes can be ;-)

> Unrolling the cache scan. This was the initial approach, but just using a loop
> (which the patch that was sent out does) performed better in general. But there
> are many factors at work here; it may be worthwhile to introduce a flag so that
> loop-unrolling the cache scan can be evaluated. If I recall correctly, the
> initial unrolled cache lookup was from when both fast_lock and fast_unlock used
> the same cache, so it got unrolled in both the lock and unlock code. Now that
> only fast_lock uses this cache the result may be different.

Yeah. And it seems only feasible if the cache is small enough (<= ~4).

> The OMWorldCache is currently laid out as a struct of arrays; it should
> probably be laid out as an array of structs (especially if it turns out that a
> larger cache is preferable).
> 

Yeah. An array-of-structs layout would lead to a linear access pattern, which is often preferable. Also, we may want to make sure that the cache starts at a CPU cache-line boundary.
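A sketch of what I mean, with hypothetical names (the actual layout is of course up for measurement):

```
struct ObjectMonitor;      // stand-ins for the HotSpot types
class oopDesc;
typedef oopDesc* oop;

// One entry per cached mapping; array-of-structs keeps each probe on
// a single cache line and the scan linear.
struct OMCacheEntry {
  oop            _oop;
  ObjectMonitor* _monitor;
};

struct alignas(64) OMCache {           // start at a cache-line boundary
  static const int CAPACITY = 8;       // OMCacheSize selects how much is used
  OMCacheEntry _entries[CAPACITY];

  ObjectMonitor* lookup(oop obj, int size) const {
    for (int i = 0; i < size; i++) {   // linear, predictable access pattern
      if (_entries[i]._oop == obj) {
        return _entries[i]._monitor;
      }
    }
    return nullptr;                    // miss: fall back to the hash table
  }
};
```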

> Enabling `bool OMCacheHitRate` will emit hit-rate counters in C2; they are
> logged at `monitorinflation=info` and can be used to evaluate specific
> workloads.

Great!

> #### OMWorld Hash Table Resizing
> 
> In the patch sent out last week the hash table resizing uses a very naive
> approach, which simply delegates to the deflation thread to grow the table if
> some other thread encounters a grow hint. Right now it exists simply to grow
> the table for those workloads which require a larger number of monitors.
> 
> The first issue is the sizing: it is currently configured to start at a size of
> at least 1K, based on the processor count and AvgMonitorsPerThreadEstimate (all
> of which are wild guesses), grows up to a size of 2M, and uses the default
> grow_hint of 4. Currently the table only grows; there is a flag
> `bool OMShrinkCHT` which can be used to enable a very WIP heuristic which also
> shrinks the table. However, it is unclear if this is desirable, and how it
> would be designed in such a way that it avoids unnecessary oscillations. It
> would be very workload dependent.

Right. I would think that we do want to be able to shrink the table. I don’t think that we want to stick with a large table forever only because we had an OM spike at some point (e.g. during start-up). But I have no data yet to support that.
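To avoid oscillation, maybe something hysteresis-like would work; an idea sketch (all names and thresholds here are hypothetical):

```
#include <cstddef>

// Idea: only shrink after the table has stayed lightly loaded for several
// consecutive deflation cycles, so a single spike cannot cause thrashing.
struct ShrinkHeuristic {
  int _calm_cycles = 0;

  bool should_shrink(size_t num_monitors, size_t table_size) {
    if (num_monitors * 8 < table_size) {  // well below 1/8 load factor
      _calm_cycles++;
    } else {
      _calm_cycles = 0;                   // any spike resets the count
    }
    return _calm_cycles >= 10;            // calm for 10 cycles in a row
  }
};
```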

> The deflation thread is currently resizing while blocked in VM. That may not
> be appropriate.

Is there a way to do it concurrently?
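For reference, the delegation as I understand it from the description above, in a rough sketch (hypothetical names, not the actual patch):

```
#include <atomic>

void grow_table();  // hypothetical wrapper around the hash-table grow operation

std::atomic<bool> _table_needs_grow{false};

// Called by a locking thread when an insert hits the grow hint
// (a bucket chain longer than grow_hint).
void on_grow_hint() {
  _table_needs_grow.store(true, std::memory_order_relaxed);
}

// Called by the deflation thread, which performs the actual grow
// (currently while blocked in VM, as noted below).
void deflation_thread_tick() {
  if (_table_needs_grow.exchange(false, std::memory_order_relaxed)) {
    grow_table();
  }
}
```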

> It would also be good to evaluate how good the default FastHashCode is. 

Uhhh, this reminds me: we have the reverse situation to what we had before, right? Before, I-hashing an object could force monitor inflation; now, monitor inflation forces I-hashing. Is that correct? I don’t think that this is a problem; in my experience we have far fewer OM-locked objects than I-hashed objects. But it’s good to keep it in mind.

Are there any notable interactions in the inflation protocol when we have to manipulate both the lock-bits *and* the hash-bits? I don’t think there would be, but maybe you can confirm?

FWIW, looking forward, Lilliput is going to use a different hash-code:
https://github.com/openjdk/lilliput/blob/master/src/hotspot/share/utilities/fastHash.hpp

And looking even further forward, we are going to use a different I-hashing scheme where we don’t store the I-hash in the header at all, but grow the object on demand (if necessary).
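As an illustration of the kind of cheap bit-mixing function meant here (this is the well-known murmur3 64-bit finalizer, not the actual code in fastHash.hpp):

```
#include <cstdint>

// A few xor-shift/multiply rounds give good avalanche behavior at
// very low cost, which is what matters for hash-table distribution.
static uint64_t mix64(uint64_t x) {
  x ^= x >> 33;
  x *= 0xff51afd7ed558ccdULL;
  x ^= x >> 33;
  x *= 0xc4ceb9fe1a85ec53ULL;
  x ^= x >> 33;
  return x;
}
```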

> #### Spinning
> 
> There are currently two places where a locking thread spins, waiting on some
> other thread to make progress.
> 
> The first is a fixed pre-spin when a fast-locked Object is encountered. This
> will attempt to fast lock the object for a fixed number of iterations; the
> first `int OMSpins` (tunable flag) iterations will be separated by a
> `SpinPause()` and the last `int OMYields` (tunable flag) iterations will be
> separated by an `os::naked_yield()`. The idea is to avoid inflating if the
> critical sections are on the same order of time as the cost of inflating; in
> such a case the thread would do better to just wait and fast lock instead of
> inflating.
> 

This sounds useful! I assume this is only done in the runtime, and the fast paths call into the runtime on the first CAS failure?
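If I read it right, the pre-spin amounts to something like this sketch (`try_fast_lock` is a hypothetical stand-in; `SpinPause()` and `os::naked_yield()` are the existing HotSpot primitives):

```
// Sketch only: spin with cheap pauses first, then yield, before
// giving up and inflating.
bool spin_on_fast_lock(oop obj, JavaThread* current) {
  for (int i = 0; i < OMSpins; i++) {    // flag-tunable spin phase
    if (try_fast_lock(obj, current)) return true;
    SpinPause();
  }
  for (int i = 0; i < OMYields; i++) {   // flag-tunable yield phase
    if (try_fast_lock(obj, current)) return true;
    os::naked_yield();
  }
  return false;  // caller proceeds to inflate
}
```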

> The current numbers for the flags were chosen by looking at programs which
> exhibited this behavior.
> 
> The `monitorinflation=info` logging will print a per thread inflation cause
> which can be used to evaluate the behavior of different programs. The following
> causes are counted.
> ```
>   Mon: Number of inflations from an `0b01` Unlocked state.
>   Rec: Number of unbalanced or interleaved recursions.
>  CRec: Number of recursive enters which inflated (not due to the cause above).
>  Cont: Number of contended inflations experienced (during exit, wait, notify).
>  Wait: Number of inflations caused by wait.
> Stack: Number of inflations caused by a full lock stack.
> ```
> 
> A workload dominated by `Mon:` benefits from this fixed pre-spin. The `Mon:`
> cause means that the Object was unlocked during the inflation.

Nice!


> The second spinning that occurs is the one where, for correctness, a locking
> thread must wait for the deflation thread to make progress. It currently does
> two iterations, each followed by an `os::naked_yield()`, and for the rest of
> the iterations it uses `os::naked_short_nanosleep()`. There are multiple
> issues here. This can become a problem on an over-provisioned machine, which
> may make the deflation thread's progress flaky. In the implementation details
> above I have discussed a potential solution for how we can make sure that at
> least one thread is guaranteed progress. However, on an over-provisioned
> machine, even if one thread is guaranteed to get a lock, such a thread might
> just as well be scheduled out and never get to run (while holding the lock).
> 
> Regardless of the progress issue, the issue with time to safepoint needs to be
> addressed. A locking thread that has been blocked by deflation for two
> iterations should also transition to blocked in VM for the sleep duration and
> participate in the safepoint protocol.

Uhhh, ok. Wait a second, aren’t threads that want to inflate a monitor already at a safepoint?
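If they are not, I would imagine the fix looks roughly like this sketch (using the existing `ThreadBlockInVM` and `os::naked_short_nanosleep()` primitives; not the actual patch):

```
// Sketch: do the sleep in the blocked-in-VM state, so the spinning
// thread participates in the safepoint protocol instead of stalling
// time-to-safepoint.
void deflation_backoff_sleep(JavaThread* current, jlong sleep_ns) {
  ThreadBlockInVM tbivm(current);       // safepoint-safe while blocked
  os::naked_short_nanosleep(sleep_ns);  // same sleep as today
}
```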

> #### Performance
> 
> The fact that a performance evaluation is required for integration is a
> no-brainer; however, the exact requirements are harder to pin down. The nature
> of the different LockingModes in HotSpot is such that they all have different
> workloads for which they are best suited. This could be seen last year when
> the default was temporarily switched to LM_LIGHTWEIGHT and there was a lot of
> movement in the promoted performance runs: a lot of regressions, but there were
> also workloads which showed improvement with LM_LIGHTWEIGHT over LM_LEGACY.
> 
> Most of the performance work with OMWorld has been trying to achieve parity with
> LM_LEGACY, however there are still known workloads where LM_LEGACY is better,
> and also known workloads where OMWorld is better. Then there are all the unknown
> workloads for which OMWorld has not been tested.
> 
> All I am trying to say is that this is probably one of the issues which will be
> the hardest to figure out.
> 
> At least now with OMWorld as part of the Lilliput project a more holistic
> evaluation can be attempted, where maybe the whole can be greater than the sum
> of its parts. (This includes smaller headers, OMWorld and stable klass pointers,
> GC algorithms without forwarding pointers in the object header, loom support for
> synchronized without pointers from the java heap into thread stacks).

Indeed! I did some initial performance runs with OMWorld/Lilliput and found that most workloads show parity. I’ve seen some workloads benefit a lot (>10% improvement), but I strongly suspect this is because of stable class-pointers; I have not yet dived deeper. I figured that for now it would be more useful to make an apples-to-apples comparison. For this, could you, once the recursive-LW stuff is all pushed, update/merge the latest jdk into your jdk-based branch?

> ### Secondary
> 
> Should be understood before integration.
> 
> #### Deflation Heuristics
> 
> The current heuristics that drive the deflation thread do not seem like a good
> fit for how OMWorld works. In OMWorld the `AvgMonitorsPerThreadEstimate` is
> disabled unless it is configured via the command line. Instead the
> `_in_use_list_ceiling` is grown based on the number of actual inflations. There
> is pre-existing weirdness in these heuristics.
> 
> Currently, for most observed workloads, deflation is driven by the
> `GuaranteedAsyncDeflationInterval` flag.

I’ve lately been thinking that concurrent deflation shares the outpaced-by-the-program property that concurrent GCs suffer from. In theory, a sufficiently aggressive workload could keep generating more OMs than deflation could free up. We might actually have observed this already when running workloads without LW recursive locking.


> #### C2 Fast Lock / Fast Unlock
> 
> The fast path in the C2 code is where the recursive lightweight work initially
> came from. Many alternative implementations have been evaluated. The only one
> which has been somewhat promising is one which tries to make LM_LIGHTWEIGHT
> more like LM_LEGACY by making recursive fast lock enters always take a fast
> lock exit, by setting a value in the BasicLock that signals to the exit paths
> that a recursive fast lock enter was performed and it can simply exit. This
> also extended the first fast lock enter to save the previous lock stack top
> along with a signal (differentiating it from an ObjectMonitor), which enabled
> the fast lock exit paths to restore the top value without loading anything from
> the lock stack. Effectively the exit could decode exactly what the enter did by
> reading the BasicLock from the stack.
> 

Nice! I’ve been hoping to get rid of BasicLock, but if it’s useful then we may just keep it.
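Just to check my reading of the scheme, a sketch with hypothetical tag values (BasicLock and markWord are the real HotSpot types; the tags and lock-stack helpers are made up for illustration):

```
const uintptr_t RECURSIVE_TAG = 0x1;  // hypothetical encodings; real
const uintptr_t TOP_TAG       = 0x2;  // ObjectMonitor* have these bits clear

void fast_lock(oop obj, BasicLock* box, JavaThread* current) {
  if (lock_stack_contains(current, obj)) {
    // recursive enter: tell the exit path it can simply return
    box->set_displaced_header(markWord(RECURSIVE_TAG));
  } else {
    uintptr_t prev_top = lock_stack_top(current);
    push_lock_stack(current, obj);
    // first enter: save the previous lock stack top, tagged so the exit
    // path can tell it apart from an ObjectMonitor*
    box->set_displaced_header(markWord(prev_top | TOP_TAG));
  }
}

void fast_unlock(oop obj, BasicLock* box, JavaThread* current) {
  uintptr_t v = box->displaced_header().value();
  if (v == RECURSIVE_TAG) {
    return;                                     // balanced recursive exit
  }
  if (v & TOP_TAG) {
    set_lock_stack_top(current, v & ~TOP_TAG);  // restore top, no stack load
    return;
  }
  // otherwise v is an ObjectMonitor*: take the inflated exit path
}
```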

> And lastly, most platforms in HotSpot only emitted JITed inflated enter/exit
> code for C2, even though both C1 and the native wrapper share the same property
> of only performing balanced locking.

Right. At least C1 should perhaps also do OM-locking.

> #### C2 Compilation Failures
> 
> During performance testing two issues came up. For some benchmarks there is
> randomness which determines whether the hot code of the program gets C1- or
> C2-compiled.
> 
> And some trivial micro-benchmarks using synchronized would cause both C1 and C2
> compilation to fail, so they ran completely in the interpreter.
> 

Ugh. Any idea why that is so?

> 
> ## Miscellaneous
> 
> This will be some concluding words, as well as some discussion on why
> LM_PLACEHOLDER.
> 
> ### Why LM_PLACEHOLDER
> 
> First LM_PLACEHOLDER is exactly that, a placeholder for the final locking mode,
> be that called LM_LIGHTWEIGHT, or something else.
> 
> This was mainly because many discussions and changes regarding LM_LIGHTWEIGHT
> stalled OMWorld development, and partially a defensive measure to not have
> changes to OMWorld stall any other projects with regards to LM_LIGHTWEIGHT.
> There is synchronized support for loom on the way, which should fit in
> seamlessly with both LM_LIGHTWEIGHT and LM_PLACEHOLDER, but there are incentives
> to use a lock stack based locking mode with these loom changes rather than a
> LM_LEGACY based one.
> 
> One unintended benefit in the short term is that it will be a little easier to
> compare and identify the different benefits and drawbacks between LM_LIGHTWEIGHT
> and LM_PLACEHOLDER in Lilliput. (Obviously a maintenance burden in the long run,
> but both seem rather stable in the Lilliput repo).
> 
> The ambition is to harmonize the LockingModes and be able to finally reduce them
> to one or two, which would be a maintenance win for the whole of HotSpot.
> 

Indeed! Thanks for the insights.

> 
> ### Final Words
> 
> If you read this whole thing in one go, my hat is off to you.
> 
> I hope that this can serve as a reference point which can be used to get more
> contributors engaged in the project, and that it can be an actionable stepping
> stone for bringing the Lilliput project closer to integration.

Again, thank you for that! It’s been very helpful for understanding the project and getting an idea of its scope.

I would suggest breaking up your efforts into tasks that are absolutely essential and tasks/ideas that can be done after the first big thing has landed and stabilized. This would follow pretty much the structure of this document: focus on the primary stuff first, and do most of the rest later. It’s just so hard to handle so many moving parts at the same time.

I’ll do more studying and testing and evaluations and get back to you when more questions arise.

Thanks and best regards!
Roman



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879



