Object to ObjectMonitor mapping using a HashTable to avoid displaced headers

Axel Boldt-Christmas axel.boldt-christmas at oracle.com
Sat Feb 10 10:48:30 UTC 2024


Hello Lilliput,

# OMWorld status

I had planned to send this out earlier this week, but got sidetracked by other
issues, and then this email grew a bit larger than I originally imagined it
would be.

The email is split into three sections.

The first section gives a high to medium level overview of the implementation of
OMWorld.

The second section gives my view of what the work going forward will require
with respect to OMWorld and Lilliput.

The third and last section will be a miscellaneous section for some things that
did not fit in the other two, mainly Why LM_PLACEHOLDER (because I know it has
been asked).

This mail turned out to be rather lengthy. Unless you want or feel the need to
check the implementation details first, I recommend starting with the Primary
and Secondary issues in the second section `The Work Going Forward` and then
referring back to the first section as needed.

## High to Medium Level Overview of the Implementation

The main goal with OMWorld is to provide a mechanism for associating an
ObjectMonitor with an Object and discovering such an association in a way that
all threads and contexts agree upon which ObjectMonitor to use without
disturbing the non-lock-bits of the mark word.

The association between Objects and ObjectMonitors in OMWorld is done by entries
in a hash table, instead of via a pointer in the Objects header.

A lot of work has also been done to reduce the reasons for having to inflate
ObjectMonitors. Some of this work, like not inflating for hash codes when lock
stack based fast locking is used, has already been pushed to mainline, and some
is in open PRs, like adding recursive locking support for lock stack based fast
locking. The details of those changes will not be explained in this mail.

### Inflation-Deflation Protocol

The first fundamental change is the inflation-deflation protocol. In OMWorld an
ObjectMonitor is created (inflated) if and only if the Object that it will be
associated with is already fast locked by the inflating thread (one exception
with `relock_object`, see note below) or the inflating thread is in the
process of performing a monitorenter.

This change is made to solve the loss of atomicity when a thread publishes a
newly created ObjectMonitor. For reference, LM_LIGHTWEIGHT uses a CAS to publish
the ObjectMonitor directly in the mark word, and LM_MONITOR/LM_LEGACY use a
spin loop on the mark word with a sentinel inflating mark value. Both of these
approaches allow inflating from a context which is neither locked nor
performing a monitorenter.

In OMWorld, the fact that inflation is limited to an Object that is already
locked, or is about to be entered, makes it possible to constrain the created
and published ObjectMonitors to be created in a locked state. Any newly created
ObjectMonitor is created and published to the rest of the system with an
'anonymous owner'. Such an ObjectMonitor will be observed as locked by any
observer. Publishing the ObjectMonitor here entails successfully inserting it
into the hash table.

The behavior of each inflating thread then depends on the observed value of the
mark word and the successful transition of the lock bits (successful CAS).

Here is a breakdown of the possible observed lock bits and successful
transitions:
 * `0b10` Monitor is observed.
   * If the locking thread has the object on its lock stack, claim ownership of
     the ObjectMonitor and remove the object from its lock stack.
   * Lock with ObjectMonitor::enter
 * `0b00 -> 0b10` CAS from Fast Locked to Monitor
   * If the locking thread has the object on its lock stack, claim ownership of
     the ObjectMonitor and remove the object from its lock stack.
   * Lock with ObjectMonitor::enter
 * `0b01 -> 0b10` CAS from Unlocked to Monitor
   * The locking thread claims the ownership of the ObjectMonitor
   * ObjectMonitor is successfully locked.
 * On a failed CAS, retry until one of the three above occurs.

We must also allow for a thread which is entered on a Fast Locked Object to
inflate and associate an ObjectMonitor with that specific Object. In this case
the object being on the locking thread's lock stack is a prerequisite and the
only valid observations and transitions for the locking thread are:
 * `0b10` Monitor is observed.
   * Claim ownership of the ObjectMonitor and remove the object from its lock
     stack.
 * `0b00 -> 0b10` CAS from Fast Locked to Monitor
   * Claim ownership of the ObjectMonitor and remove the object from its lock
     stack.

Simplified pseudo code of the inflation:
```C++
void inflate_and_enter(oop object, JavaThread* locking_thread) {
  ObjectMonitor* monitor = get_or_insert_monitor(object);
  LockStack& lock_stack = locking_thread->lock_stack();
  for (;;) {
    const markWord mark = object->mark_acquire();
    if (mark.has_monitor()) {
      if (lock_stack.contains(object)) {
        monitor->set_owner_from_anonymous(locking_thread);
        lock_stack.remove(object);
      }
      break;
    }

    if (mark.is_fast_locked()) {
      if (mark != object->cas_set_mark(mark.set_has_monitor(), mark)) {
        continue; // Retry
      }
      if (lock_stack.contains(object)) {
        monitor->set_owner_from_anonymous(locking_thread);
        lock_stack.remove(object);
      }
      break;
    }

    if (mark != object->cas_set_mark(mark.set_has_monitor(), mark)) {
      continue; // Retry
    }
    monitor->set_owner_from_anonymous(locking_thread);
    return; // Successfully locked.
  }
  monitor->enter(locking_thread);
}

ObjectMonitor* inflate_fast_locked_object(oop object, JavaThread* locking_thread) {
  ObjectMonitor* monitor = get_or_insert_monitor(object);
  LockStack& lock_stack = locking_thread->lock_stack();
  precond(lock_stack.contains(object));

  markWord mark = object->mark_acquire();
  while (mark.is_fast_locked()) {
    mark = object->cas_set_mark(mark.set_has_monitor(), mark);
  }
  monitor->set_owner_from_anonymous(locking_thread);
  lock_stack.remove(object);

  return monitor;
}
```

In the absence of deflation this mechanism allows concurrent threads which
observe an object with the lock bits `0b10` (Monitor) to also retrieve the
associated ObjectMonitor from the hash table. It also allows concurrent
inflating and entering/entered threads to agree on which ObjectMonitor to use,
as only one specific instance of an ObjectMonitor will successfully be
inserted in the hash table.

As soon as deflation and memory reclamation get into the picture things become
more complicated.

The memory reclamation and use-after-free issue is handled the same way as it is
already done, via a two-phase protocol: unlink from the system, handshake, purge
from the system. (One Deflation Cycle.)

As for making sure that concurrent threads agree on which ObjectMonitor is
associated with which Object, more care has to be taken. And one must be wary of
ABA problems with the lock bits.

We achieve this by changing the inflation-deflation protocol as follows:
 * Deflation Thread
   * Will always perform these operations in this order:
     * Make an ObjectMonitor `is_being_async_deflated()`, irreversible action
     * Transition the mark word away from `0b10` Monitor
     * Disassociate the ObjectMonitor from the object, by removing the hash
       table entry
 * Locking thread
   * May only transition the mark word to `0b10` Monitor if an associated
     ObjectMonitor is observed where `is_being_async_deflated()` is not true and
     stable across the transition.
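
The deflation-side ordering above could be sketched in the same pseudo-code
style as the rest of this mail (the helper names here are illustrative and do
not match the actual implementation):
```C++
// Illustrative pseudo code of the deflation thread's ordering.
bool deflate_monitor(ObjectMonitor* monitor, oop object) {
  // 1. Irreversible action: after this, no locking thread may transition
  //    the mark word to 0b10 Monitor based on this monitor.
  if (!monitor->try_make_being_async_deflated()) {
    return false; // In use, or held stable by a contention mark; retry later.
  }
  // 2. Only now transition the mark word away from 0b10 Monitor, leaving
  //    the non-lock bits undisturbed.
  object->set_mark(object->mark_acquire().set_unlocked());
  // 3. And only after that, disassociate the ObjectMonitor by removing
  //    the hash table entry.
  remove_monitor_entry(object, monitor);
  return true;
}
```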

The agreement between threads is achieved because only the deflation thread
disassociates an ObjectMonitor, and it only does so after the monitor is
marked as `is_being_async_deflated()`. Locking threads cannot insert a new
association while another one exists. So the locking threads will effectively
have to wait until the deflation thread has finished. Sans any ABA problems with
the lock bits, the scenario after having waited for deflation to finish is the
same scenario as before deflation existed.

The ABA problem with the lock bits is solved by locking out the deflation thread
by forcing `is_being_async_deflated()` to be stable, given that only the
deflation thread may transition the mark word away from `0b10` Monitor. So when
a locking thread retrieves an associated ObjectMonitor, it will either be
`is_being_async_deflated()`, in which case the thread waits until a new monitor
can be associated, or it will not be `is_being_async_deflated()`, in which case
the locking thread can safely transition the mark word to `0b10` Monitor without
ABA problems, as the deflation thread cannot have transitioned the mark word
away from `0b10` Monitor: it must first have set
`is_being_async_deflated()`, which is being forced stable during the transition.
So when a locking thread transitions the mark word to `0b10` Monitor it knows
that the ObjectMonitor it is about to enter on will still be the associated
ObjectMonitor, because it cannot have been deflated, and thus not been
disassociated from the Object.

The `inflate_and_enter` function is now changed to return false if it fails
because of deflation.

> Note: In the patch that was sent out last week the ABA issue exists, in some
>       later iteration an invalid optimization was added where a locking thread
>       transitions the mark word away from `0b10` Monitor. A good side effect
>       of doing this writeup is challenging assumptions when having to explain
>       why the code works.
>       It is extremely racy and requires inserting sleeps in the code to
>       reliably be provoked.

Extended pseudo code of `inflate_and_enter` to handle the deflation-inflation
interaction:
```C++
bool inflate_and_enter(oop object, JavaThread* locking_thread) {
  ObjectMonitor* monitor = get_or_insert_monitor(object);
  LockStack& lock_stack = locking_thread->lock_stack();

  // Holds is_being_async_deflated() stable throughout this function.
  // (Named contention_mark to avoid shadowing the markWord below.)
  ObjectMonitorContentionMark contention_mark(monitor);

  if (monitor->is_being_async_deflated()) {
    // The deflation thread must remove the monitor from the hash table.
    return false; // Retry
  }
  for (;;) {
    const markWord mark = object->mark_acquire();
    if (mark.has_monitor()) {
      if (lock_stack.contains(object)) {
        monitor->set_owner_from_anonymous(locking_thread);
        lock_stack.remove(object);
      }
      break;
    }

    if (mark.is_fast_locked()) {
      if (mark != object->cas_set_mark(mark.set_has_monitor(), mark)) {
        continue; // Retry
      }
      if (lock_stack.contains(object)) {
        monitor->set_owner_from_anonymous(locking_thread);
        lock_stack.remove(object);
      }
      break;
    }

    if (mark != object->cas_set_mark(mark.set_has_monitor(), mark)) {
      continue; // Retry
    }
    monitor->set_owner_from_anonymous(locking_thread);
    return true; // Successfully locked.
  }
  // Should always return true as is_being_async_deflated() is kept stable.
  return monitor->enter(locking_thread);
}
```

As for `inflate_fast_locked_object`, it follows the same principle, but the
stability of `is_being_async_deflated()` is ensured because the Object is locked
throughout the call and cannot be deflated. It must only be extended to wait
for the deflation thread to disassociate any deflated ObjectMonitor that
may exist.

Extended pseudo code for inflate_fast_locked_object.
```C++
ObjectMonitor* inflate_fast_locked_object(oop object, JavaThread* locking_thread) {
  ObjectMonitor* monitor = get_or_insert_monitor(object);
  LockStack& lock_stack = locking_thread->lock_stack();
  precond(lock_stack.contains(object));

  while (monitor->is_being_async_deflated()) {
    monitor = get_or_insert_monitor(object);
  }

  markWord mark = object->mark_acquire();
  while (mark.is_fast_locked()) {
    mark = object->cas_set_mark(mark.set_has_monitor(), mark);
  }
  monitor->set_owner_from_anonymous(locking_thread);
  lock_stack.remove(object);

  return monitor;
}
```

#### relock_object Exception

`relock_object` will use both of these functions from a thread other than the
locking_thread; however, the locking_thread must be suspended throughout, and
for the sake of correctness it can be treated as if the locking_thread were
performing the operations itself.

#### Spinning waiting on Deflation Thread progress

There is an unfortunate property here: correctness requires progress of
the deflation thread before any locking thread can make progress. The window
where the deflation thread could suffer an unfortunate preemption is small, but
it does exist. Currently, if a locking thread fails `inflate_and_enter`
(due to deflation) it will attempt to fast lock again, which means that at least
one locking thread can always make progress once the deflation thread has
transitioned the mark word away from `0b10` Monitor. So the window is between
the deflation thread making an ObjectMonitor `is_being_async_deflated()` and
transitioning the mark word away from `0b10` Monitor.

There are schemes that can avoid this property, but they require additional mark
word state: either another bit (which we want to avoid at all costs) or the
fourth, currently unused state of the lock bits (`0b11`). The second would be a
rather intrusive change (it has been prototyped) because the JVM makes a lot of
assumptions based on single-bit states in the lock bits.

The idea would be to have the deflation thread first transition the mark word to
the new state, which has the exact same meaning as `0b10` Monitor, and then
attempt deflation, switching back to `0b10` Monitor if it fails. Then allow any
thread to transition the mark word from this new state to `0b01` Unlocked (the
deflation thread does not disassociate the ObjectMonitor until it has observed
a transition away from this new state). Because exactly one transition to and
from this new state would occur per deflation cycle (cycles are separated by a
handshake), the ABA problem would be avoided: no thread would be stuck trying
to transition the mark word away from this new state based on information from
a previous cycle.
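
As a hedged sketch only, the scheme above might look something like the
following pseudo code (the state accessor and all helper names are invented
for illustration; this is not code from the prototype):
```C++
// Hypothetical pseudo code; the fourth state 0b11 is written here as
// set_deflation_pending(), a name made up for this sketch.
bool deflate_monitor_with_extra_state(ObjectMonitor* monitor, oop object) {
  markWord mark = object->mark_acquire();
  // 1. 0b10 Monitor -> 0b11, which means the same as Monitor to observers.
  if (mark != object->cas_set_mark(mark.set_deflation_pending(), mark)) {
    return false;
  }
  if (!try_deflate(monitor)) {
    // 2a. Deflation failed: switch back to 0b10 Monitor.
    object->set_mark(mark);
    return false;
  }
  // 2b. Any thread may transition 0b11 -> 0b01 Unlocked; wait until one
  //     (possibly this thread) has done so.
  wait_until_mark_transitioned_away(object);
  // 3. Exactly one 0b11 episode per deflation cycle, so no ABA problem.
  remove_monitor_entry(object, monitor);
  return true;
}
```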

### ObjectMonitor Thread-Local Caches

It is not feasible to perform a lookup in our ConcurrentHashTable from the
emitted C2 code. Instead, to facilitate fast retrieval of an associated
ObjectMonitor from the C2 code without calling into the runtime, Thread-Local
Caches are used.

First there is a cache which lives in the JavaThread* native object and is
used when locking. This cache currently consists of `(oop, ObjectMonitor*)`
pairs. Insertion evicts the least-recently used pair when the cache is full (of
non-stale entries). An entry is inserted whenever a locking thread successfully
enters on an associated ObjectMonitor in the runtime. C2 fast_lock will linearly
scan this cache from most- to least-recently used. On a cache hit it will
attempt to enter on the ObjectMonitor. On a cache miss it will call into the
runtime instead. The entries are cleared out on each safepoint so as not to
artificially extend the lifetime of Objects (we do not want the GC to treat
these objects as thread roots). The entries are also cleared out once per
deflation cycle. The only reason a cache entry grows stale is that an
ObjectMonitor has been deflated. Any deflated monitor must have DEFLATER_MARKER
as its owner, which will cause any locking attempt from C2 to call into the
runtime.
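
To make the lookup and eviction behavior concrete, here is a small
self-contained sketch of such a cache. The names (`OMCache`, `lookup`,
`insert`, `clear`) and the layout are illustrative only; the actual cache in
the patch is laid out differently and sized by `OMCacheSize`:
```C++
#include <cassert>
#include <cstddef>

// Illustrative stand-ins for the HotSpot types; not the real OMWorld code.
typedef void* oop;
struct ObjectMonitor {};

// Fixed-capacity (oop, ObjectMonitor*) cache with a linear most- to
// least-recently-used scan, as described above.
struct OMCache {
  static const size_t CAPACITY = 8;
  oop keys[CAPACITY];
  ObjectMonitor* values[CAPACITY];

  OMCache() { clear(); }

  // Scan from most- to least-recently used; nullptr signals a miss.
  ObjectMonitor* lookup(oop obj) {
    for (size_t i = 0; i < CAPACITY && keys[i] != nullptr; i++) {
      if (keys[i] == obj) {
        return values[i];
      }
    }
    return nullptr;
  }

  // Insert at the front; shifting evicts the least-recently used entry
  // (or an existing entry for the same oop, avoiding duplicates).
  void insert(oop obj, ObjectMonitor* monitor) {
    size_t end = CAPACITY - 1;
    for (size_t i = 0; i < CAPACITY; i++) {
      if (keys[i] == obj || keys[i] == nullptr) { end = i; break; }
    }
    for (size_t i = end; i > 0; i--) {
      keys[i] = keys[i - 1];
      values[i] = values[i - 1];
    }
    keys[0] = obj;
    values[0] = monitor;
  }

  // Cleared at every safepoint so entries never keep Objects alive.
  void clear() {
    for (size_t i = 0; i < CAPACITY; i++) {
      keys[i] = nullptr;
      values[i] = nullptr;
    }
  }
};
```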

Second, there is a cache which lives on the thread stack and is used when
unlocking. The BasicLock slot is used to cache the ObjectMonitor when it is
entered upon. Because of the symmetry between lock and unlock, fast_unlock
can then simply read the ObjectMonitor out of this cache and will always get a
cache hit if it also locked on the inflated monitor. Each monitorexit will
miss at most once per deflation cycle. This cache does not require any cleanup
or maintenance, as nothing can make it stale (no deflation can occur while the
ObjectMonitor is entered). However, care needs to be taken (just as is the case
with LM_LEGACY) that the BasicLock on the stack is initialized to a known value
(nullptr), as the fast_unlock path will treat the value as an `ObjectMonitor*`.
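
In pseudo code, the lock/unlock symmetry looks roughly like this (the BasicLock
accessor names and runtime helpers are made up for this sketch):
```C++
// Pseudo code sketch; accessor and helper names are illustrative.
void runtime_enter(oop object, BasicLock* lock, JavaThread* locking_thread) {
  ObjectMonitor* monitor = inflate_and_enter_monitor(object, locking_thread);
  // Cache the monitor in the BasicLock slot for the matching monitorexit.
  lock->set_object_monitor_cache(monitor);
}

void fast_unlock(oop object, BasicLock* lock, JavaThread* locking_thread) {
  ObjectMonitor* monitor = lock->object_monitor_cache();
  if (monitor != nullptr) {
    // Always a hit if the matching enter went through an inflated monitor;
    // no deflation can have occurred while the monitor was entered.
    monitor->exit(locking_thread);
  } else {
    // Slot was initialized to nullptr; fall back to the runtime.
    runtime_exit(object, locking_thread);
  }
}
```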

## The Work Going Forward

This section will describe some of the work still required for OMWorld. I will
split this into three categories: Primary, Secondary and Tertiary issues. The
Primary issues are those I believe should be worked on next and require a
solution prior to any integration. The Secondary issues are those that should at
least be understood prior to any integration. And the Tertiary issues are those
which would be worth investigating if time permits.

### Primary

Should be solved before integration.

#### OMCache

First is finalizing the size, behavior and layout of the Cache used by C2
locking.

In the patch sent out last week the OMCache capacity is set to eight, and the
actual size used can then be tuned from 0 to 8 using the `int OMCacheSize` flag.
It is also possible to turn off the cache completely with the `bool OMUseC2Cache`
flag, to evaluate the effect no cache would have on a specific workload.

Different sizes have been tried. When running different benchmark suites
(DaCapo, SPEC*, Renaissance, etc.), sizes between 3-5 seem to be required by at
least some benchmarks, and those which had a lot of misses with 3-5 also had a
lot of misses with even larger cache sizes.

Unrolling the cache scan: this was the initial approach, but just using a loop
(which the patch that was sent out does) performed better in general. There
are many factors at work here, though, so it may be worthwhile to introduce a
flag so that unrolling the cache scan can be evaluated. If I recall correctly,
the initial unrolled cache lookup was from when both fast_lock and fast_unlock
used the same cache, so it got unrolled in both the lock and unlock code. Now
that only fast_lock uses this cache the result may be different.

The OMWorldCache is currently laid out as a struct of arrays; it should
probably be laid out as an array of structs (especially if it turns out that a
larger cache is preferable).

Enabling `bool OMCacheHitRate` will emit hit rate counters in C2; they are
logged to `monitorinflation=info`, which can be used to evaluate specific
workloads.

This issue is completely performance driven.

#### OMWorld Hash Table Resizing

In the patch sent out last week the hash table resizing uses a very naive
approach which simply delegates growing the table to the deflation thread if
some other thread encounters a grow hint. Right now it exists simply to grow the
table for those workloads which require a larger number of monitors.

The first issue is the sizing. The table is currently configured to start at a
size of at least 1K, based on processor count and AvgMonitorsPerThreadEstimate
(all of which are wild guesses), and grows up to a size of 2M, using the default
grow_hint of 4. Currently the table only grows; there is a flag
`bool OMShrinkCHT` which can be used to enable a very WIP heuristic which also
shrinks the table. However, it is unclear if this is desirable, and how it would
be designed in such a way that it avoids unnecessary oscillations. It would be
very workload dependent.

The deflation thread is currently resizing while blocked in VM. That may not
be appropriate.

It would also be good to evaluate how good the default FastHashCode is. There
were times when the grow_hint was triggered at low load factors (0.1-0.2). Are
these statistical anomalies or something inherent to our hash codes?
This last point is more of a Secondary issue, but it is probably required for
understanding and creating a resizing policy.

#### Spinning

There are currently two places where a locking thread spins, waiting on some
other thread to make progress.

The first is a fixed pre-spin if a fast locked Object is encountered. This will
attempt to fast lock the object for a fixed number of iterations: the first
`int OMSpins` (tunable flag) iterations will be separated by a `SpinPause()`
and the last `int OMYields` (tunable flag) iterations will be separated by an
`os::naked_yield()`. The idea is to avoid inflating if the critical sections
are on the same order of time as the cost of inflating; in such a case the
thread would do better to just wait and fast lock instead of inflating.

The current numbers for the flags were chosen by looking at programs which
exhibited this behavior.
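
The fixed pre-spin can be sketched in pseudo code (the flag names and
`SpinPause`/`os::naked_yield` are from the patch; `try_fast_lock` is an
illustrative helper):
```C++
// Pseudo code sketch of the fixed pre-spin before inflating.
bool spin_before_inflate(oop object, JavaThread* locking_thread) {
  for (int i = 0; i < OMSpins + OMYields; i++) {
    if (try_fast_lock(object, locking_thread)) {
      return true; // Short critical section; inflation avoided.
    }
    if (i < OMSpins) {
      SpinPause();
    } else {
      os::naked_yield();
    }
  }
  return false; // Give up waiting and inflate instead.
}
```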

The `monitorinflation=info` logging will print a per thread inflation cause
which can be used to evaluate the behavior of different programs. The following
causes are counted.
```
  Mon: Number of inflations from an `0b01` Unlocked state.
  Rec: Number of unbalanced or interleaved recursions.
 CRec: Number of recursive enter which inflated (not due to the cause above).
 Cont: Number of contended inflations experienced (during exit, wait, notify).
 Wait: Number of inflations cause by wait.
Stack: Number of inflations caused because of a full lock stack.
```

A workload dominated by `Mon:` benefits from this fixed pre-spin. The `Mon:`
cause means that the Object was unlocked during the inflation.

The second spin occurs where, for correctness, a locking thread must wait for
the deflation thread to make progress. It currently does two iterations each
followed by an `os::naked_yield()`, and for the rest of the iterations it uses
`os::naked_short_nanosleep()`. There are multiple issues here. This can become
a problem on an over-provisioned machine, which may make the deflation thread's
progress flaky. In the implementation details above I have discussed a
potential solution for making sure that at least one thread is guaranteed
progress. However, on an over-provisioned machine, even if one thread is
guaranteed to get a lock, such a thread might just as well be scheduled out and
never get to run (while holding the lock).

Regardless of the progress issue, the issue with time to safepoint needs to be
addressed. A locking thread that has been blocked by deflation for two
iterations should also transition to blocked in VM for the sleep duration and
participate in the safepoint protocol.
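
One way the current backoff plus the suggested time-to-safepoint fix could look
in pseudo code (`ThreadBlockInVM` is the existing HotSpot RAII thread-state
transition; the other helper names here are illustrative):
```C++
// Pseudo code sketch; only ThreadBlockInVM is a real HotSpot name here.
void wait_for_deflation_progress(oop object, JavaThread* locking_thread) {
  for (int i = 0; !deflation_has_progressed(object); i++) {
    if (i < 2) {
      os::naked_yield();
    } else {
      // Transition to blocked-in-VM for the sleep so the thread
      // participates in the safepoint protocol.
      ThreadBlockInVM tbivm(locking_thread);
      os::naked_short_nanosleep(sleep_ns(i));
    }
  }
}
```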

#### Cleanups

The patch contains a number of features/questions which are not mentioned above;
usually they have some TODO comment attached to them. Some will be mentioned in
the Secondary or Tertiary issues. These need to be cleaned out and/or finished
depending on the fallout of said issues.

LM_PLACEHOLDER also belongs under this section, but it will be discussed in
the miscellaneous section `Why LM_PLACEHOLDER`.

The runtime entries for locking and unlocking are also not very clean. This
stems from the lack of registers on x86_32, which leads to the BasicLock
register losing the BasicLock*. In LM_LIGHTWEIGHT this was unused, so a second
entry to the runtime was created where the BasicLock* is not provided. As this
is now used for the unlock cache, it becomes important again (for performance,
not correctness) to provide the BasicLock. The current patch only patches up and
forwards this on 64-bit platforms; however, it would probably be better to
simply reload the BasicLock on x86_32 and have one uniform entry point.

#### Performance

The fact that a performance evaluation is required for integration is a
no-brainer; however, the exact requirements are harder to pin down. The nature
of the different LockingModes in HotSpot is such that they all have different
workloads for which they are best suited. This could be seen last year when
the default was temporarily switched to LM_LIGHTWEIGHT and there was a lot of
movement in the promoted performance runs. A lot of regressions, but there were
also workloads which showed improvement with LM_LIGHTWEIGHT over LM_LEGACY.

Most of the performance work with OMWorld has been trying to achieve parity with
LM_LEGACY, however there are still known workloads where LM_LEGACY is better,
and also known workloads where OMWorld is better. Then there are all the unknown
workloads for which OMWorld has not been tested.

All I am trying to say is that this is probably one of the issues which will be
the hardest to figure out.

At least now with OMWorld as part of the Lilliput project a more holistic
evaluation can be attempted, where maybe the whole can be greater than the sum
of its parts. (This includes smaller headers, OMWorld and stable klass pointers,
GC algorithms without forwarding pointers in the object header, loom support for
synchronized without pointers from the java heap into thread stacks).

### Secondary

Should be understood before integration.

#### Deflation Heuristics

The current heuristics that drive the deflation thread do not seem like a good
fit for how OMWorld works. In OMWorld the `AvgMonitorsPerThreadEstimate` is
disabled unless it is configured via the command line. Instead the
`_in_use_list_ceiling` is grown based on the number of actual inflations. There
is pre-existing weirdness in these heuristics.

Currently, for most observed workloads, deflation is driven by the
`GuaranteedAsyncDeflationInterval` flag.

#### C2 Fast Lock / Fast Unlock

The fast path in the C2 code is where the recursive lightweight work initially
came from. Many alternative implementations have been evaluated. The only one
which has been somewhat promising is one which tries to make LM_LIGHTWEIGHT more
like LM_LEGACY by making recursive fast lock enters always take a fast lock
exit, by setting a value in the BasicLock that signals to the exit paths that a
recursive fast lock enter was performed and it can simply exit. This also
extended the first fast lock enter to save the previous lock stack top along
with a signal (differentiating it from an ObjectMonitor), which enabled the fast
lock exit paths to restore the top value without loading anything from the lock
stack. Effectively the exit could decode exactly what the enter did by reading
the BasicLock from the stack.

However this change is rather intrusive as it turns the BasicLock from an
optimization property to a correctness property. This prototype was only tested
on x86_64.

There are also things to explore with regards to how inflated locking and
unlocking are performed in the emitted code. This patch and the recursive
lightweight PRs try to avoid changing things in this area. In this patch there
is currently the flag `bool OMRecursiveFastPath`, which enables checking for
inflated recursion with a plain load instead of failing a CAS. On some hardware,
just a few code paths which contain recursive inflated locking are enough to
make `OMRecursiveFastPath` the more performant option.

And lastly, most platforms in HotSpot only emit JITed inflated enter/exit code
for C2, even though both C1 and the native wrapper share the same property of
only performing balanced locking. PPC is one exception, which uses the C2 code
for all but the interpreter (which must correctly handle unbalanced locking,
but could also handle inflated enter/exit, though it would not be much better
than calling into the runtime). This issue is exacerbated by the next issue.

#### C2 Compilation Failures

During performance testing two issues came up. For some benchmarks there is
randomness which determines if the hot code of the program gets C1 or C2
compiled.

And some trivial micro-benchmarks using synchronized would cause both C1 and C2
to not compile and run completely in the interpreter.

The performance difference caused by this can be multiple orders of magnitude
larger than any changes to the locking code.

### Tertiary

Distant future work / if time permits.

#### Alternative Hash Table

The ConcurrentHashTable implementation in HotSpot is optimized for many
concurrent readers. There may be workloads where insertions and deletions
dominate. There is currently a boolean parameter try_read on the
`get_or_insert_monitor` method which is always called with true. try_read means
first do a read, then an insert, instead of just an `insert_get`. Initially this
was used from contexts where the inflation was probably not caused by
contention; however, doing this was many times slower than just
opportunistically always doing a read first. Maybe there usually was contention,
and it was the new/delete that was the culprit.

Regardless, little thought has gone into whether the current implementation is
the correct one for this purpose. From a maintenance perspective, any
alternative solution would have to show significant gains over the current
implementation.

#### Effects on 4 Byte Lilliput

The hash table requires a hash for the Object, and that will be the identity
hash. Given that 4 Byte Lilliput will not store this hash inside the header
there might be some interesting interactions here when ObjectMonitors are
inflated.

#### Deflation from Locking Threads

In the patch there is a prototype for allowing a locking thread which owns an
inflated lock to deflate the ObjectMonitor and transition back to fast locking.
Two flags exists:
```
bool OMDeflateAfterWait  // If the wait caused inflation, try to deflate after wait
bool OMDeflateBeforeExit // Attempt to deflate before the final exit
```

Some workloads have been seen to benefit from this, especially the first. But it
is not something which benefits all workloads (for obvious reasons; real
contention is a thing).

For the first issue with wait, there have been tentative ideas flying around
about separating the wait/notify mechanism from the inflated ObjectMonitor and
using something more lightweight. However this is highly theoretical at this
point.

For the second issue, with ObjectMonitors becoming deflated before the final
exit, I have toyed with the idea of introducing some more probabilistic
mechanism for deciding if deflation should be attempted. The main problem is
that we need per-ObjectMonitor state, but when we deflate we lose this state.
The main idea is to let the deflation thread use the state of `_in_use_list` to
reason about whether deflation is occurring too frequently for some specific
object. Care would have to be taken, as this could easily turn into an O(n^2)
time or O(n) space solution (n = #monitors).

A brief note on how deflation from a locking thread works.
The main idea is that the deflation thread is blocked out by virtue of the
ObjectMonitor being owned. The locking thread then sets the owner to anonymous
and acts exactly as if it were the deflation thread, except that it does not use
the DEFLATER_MARKER and it transitions the mark word to `0b00` Fast Locked. It
is only done before the final exit, or after a wait which itself caused
inflation (and is the most recent entry on the lock stack), so as not to create
an invalid lock stack order. (The wait could potentially manage without being
the most recent entry by keeping a local copy of the lock stack from before
inflating, and then restoring it after deflating.)
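
In pseudo code, deflation from an owning locking thread might look roughly like
this (all helper names are illustrative, not the prototype's actual code):
```C++
// Pseudo code sketch; assumes the locking thread owns the monitor, which
// blocks out the deflation thread for the duration.
bool deflate_back_to_fast_locked(oop object, ObjectMonitor* monitor,
                                 JavaThread* locking_thread) {
  monitor->set_owner_anonymous();          // Act like the deflation thread,
  if (!try_deflate(monitor)) {             // but without DEFLATER_MARKER.
    monitor->set_owner_from_anonymous(locking_thread);
    return false;
  }
  // Transition 0b10 Monitor -> 0b00 Fast Locked instead of Unlocked.
  object->set_mark(object->mark_acquire().set_fast_locked());
  locking_thread->lock_stack().push(object);
  remove_monitor_entry(object, monitor);
  return true;
}
```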

## Miscellaneous

This will be some concluding words as well as some discussion on Why
LM_PLACEHOLDER.

### Why LM_PLACEHOLDER

First LM_PLACEHOLDER is exactly that, a placeholder for the final locking mode,
be that called LM_LIGHTWEIGHT, or something else.

This was mainly because many discussions and changes regarding LM_LIGHTWEIGHT
stalled OMWorld development, and partially a defensive measure to not have
changes to OMWorld stall any other projects with regards to LM_LIGHTWEIGHT.
There is synchronized support for loom on the way, which should fit in
seamlessly with both LM_LIGHTWEIGHT and LM_PLACEHOLDER, but there are incentives
to use a lock stack based locking mode with these loom changes rather than a
LM_LEGACY based one.

One unintended benefit in the short term is that it will be a little easier to
compare and identify the different benefits and drawbacks between LM_LIGHTWEIGHT
and LM_PLACEHOLDER in Lilliput. (Obviously a maintenance burden in the long run,
but both seem rather stable in the Lilliput repo).

The ambition is to harmonize the LockingModes and be able to finally reduce them
to one or two, which would be a maintenance win for the whole of HotSpot.


### Final Words

If you read this whole thing in one go, my hat is off to you.

I hope that this can serve as a reference point which can be used to get more
contributors engaged in the project, and that it can be an actionable stepping
stone for bringing the Lilliput project closer to integration.

Sincerely
Axel

