[master] RFR: ObjectMonitor Storage

Thomas Stuefe stuefe at openjdk.java.net
Tue Mar 29 05:48:02 UTC 2022


Hi,

This prepares the way for an idea Roman had last year: to store OM references in a compressed form in the object header, instead of as a 64-bit pointer. Similar to narrow Klass pointers, we need an index or an offset into a memory region.
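As a minimal sketch of that idea (hypothetical names and a plain byte offset; this is not the actual patch code), the narrow value could be an offset into the pre-reserved OM region, in the same spirit as narrow Klass pointers:

```c++
// Minimal sketch: store a narrow offset into a pre-reserved, contiguous OM
// region instead of a raw 64-bit ObjectMonitor*. Names are hypothetical.
#include <cassert>
#include <cstdint>

struct ObjectMonitor;                      // opaque placeholder
static char* g_om_region_base = nullptr;   // start of the pre-reserved OM region

static inline uint32_t encode_om(const ObjectMonitor* om) {
  const uintptr_t offset = (uintptr_t)om - (uintptr_t)g_om_region_base;
  assert(offset <= UINT32_MAX && "offset must fit the narrow field");
  return (uint32_t)offset;                 // small enough to fit in the header
}

static inline ObjectMonitor* decode_om(uint32_t narrow_om) {
  return (ObjectMonitor*)(g_om_region_base + narrow_om);
}
```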

With this patch, OMs now live in an array. That new OM array is pre-reserved, gets committed on demand, and has a freelist to manage released OMs. In a way, it is very similar to the old TSM solution before [JDK-8253064](https://bugs.openjdk.java.net/browse/JDK-8253064). Thanks a lot to @dcubed-ojdk for patiently explaining the details of the old solution to me ([1], [2], [3]).

### Performance and memory use

I found that the renaissance "philosophers" benchmark is a good tool for measuring ObjectMonitor memory usage and performance. With default VM options, the benchmark does a lot of synchronization and creates millions of OMs. Running with a lower value for `-XX:MonitorUsedDeflationThreshold` diminishes the effect, and according to Dan [3] such a workload would also be a typical case for reducing that threshold.

This is not a typical use case: typically we have only a few thousand OMs, and OM storage management does not matter that much. But we don't want huge performance drops in these outlier scenarios.

The first version of my patch was very naive: it used a single global allocator and synchronized every access. The performance loss was brutal, about 15% compared to malloc.

I improved the patch to reduce contention:
- OMs are now preallocated in bulk on a per-thread basis (by default 64, adjustable via `-XX:PreallocatedObjectMonitors`).
- OMs are released in bulk by preparing the OM free chain off-lock and only taking the lock when appending that chain to the central free list.

Again, all somewhat similar to the old TSM solution.
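As a rough illustration of the two points above (hypothetical names, simplified locking, and plain `new` standing in for the pre-reserved OM array; the real patch differs in the details):

```c++
// Sketch of per-thread bulk preallocation and bulk release. The thread-local
// cache refills in bulk and hands back a whole pre-linked chain under a single
// lock acquisition, so the central lock is touched rarely.
#include <cstddef>
#include <mutex>
#include <vector>

struct ObjectMonitor { ObjectMonitor* next_free = nullptr; };

struct OMCentralFreeList {
  std::mutex     lock;
  ObjectMonitor* head = nullptr;

  // Append a whole, already-linked chain with one lock acquisition.
  void release_chain(ObjectMonitor* first, ObjectMonitor* last) {
    std::lock_guard<std::mutex> g(lock);
    last->next_free = head;
    head = first;
  }
};

struct OMThreadCache {
  std::vector<ObjectMonitor*> cache;

  // Refill in bulk (e.g. PreallocatedObjectMonitors == 64), so most
  // allocations never touch the central lock.
  void refill(size_t n) {
    for (size_t i = 0; i < n; i++) cache.push_back(new ObjectMonitor());
  }

  ObjectMonitor* allocate(size_t prealloc) {
    if (cache.empty()) refill(prealloc);
    ObjectMonitor* om = cache.back();
    cache.pop_back();
    return om;
  }

  // Build the free chain off-lock, then hand it over in one step.
  void release_all(OMCentralFreeList& central) {
    if (cache.empty()) return;
    for (size_t i = 0; i + 1 < cache.size(); i++) {
      cache[i]->next_free = cache[i + 1];
    }
    central.release_chain(cache.front(), cache.back());
    cache.clear();
  }
};
```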

The improved version now has performance and memory use similar to or even better than the stock VM, see below.

#### Measurements

Running the renaissance philosophers benchmark with both the stock Lilliput VM and the patched Lilliput VM.

Options: `-XX:+UnlockDiagnosticVMOptions -Xmx2g -Xms2g -XX:NativeMemoryTracking=summary -XX:+PrintNMTStatistics -XX:+DumpVitalsAtExit`

We compare two allocators, one of which (the libc's) is outside our control. Therefore I measure RSS, not committed memory, because I have no idea how much memory the libc commits in order to fulfill its mallocs. It may actually be a lot: to prevent contention, at least glibc uses thread-local arenas, which can get huge but are often mostly unused.

For the same reason I do not use `AlwaysPreTouch`: we would pretouch the committed mmap'ed memory in the patched version, but neither os::malloc() nor whatever overhead the libc produces would be touched. `AlwaysPreTouch` would hence bias against stock.

The performance and RSS numbers wobbled a bit, but the following run is representative of the average (smaller numbers are better):

##### Default run (`-XX:MonitorUsedDeflationThreshold=90`, `-XX:PreallocatedObjectMonitors=64`)

1) Stock

Benchmark result: 	4333.61537
Highest rss: 		3.7g

2) New ObjectMonitorStorage

Benchmark result: 	4122.14034 (+4.9%)
Highest rss: 		3.7g


##### Run with `-XX:MonitorUsedDeflationThreshold=50`

1) Stock

Benchmark result: 	4202.40876
Highest rss: 		2.1g

2) New ObjectMonitorStorage

Benchmark result: 	4142.66361 (+1.4%)
Highest rss: 		1.9g (-9%)


##### Run with `-XX:PreallocatedObjectMonitors=1024`

2) New ObjectMonitorStorage

Benchmark result: 	4322.59806 (-2.8%)
Highest rss: 		3.5g (+66%)

#### Analysis

The patched version uses a bit less RSS and is 1..5% faster than the unpatched version.

Enlarging the number of thread-local preallocated OMs to 1024 was not a good idea: the reduced contention comes at a high memory price and even causes an actual performance loss. There is an optimal point; maybe even the default of 64 is too high. I ran out of time, but this is certainly a knob worth tuning.

I suspect that if we go down this road, we will need to spend more time optimizing the memory use and performance of OM storage, even though my results are encouraging. This is an area where a lot of tweaking has happened upstream over time.

### Patch details:

The patch introduces the general-purpose class `AddressStableArray`: a templatized array that is pre-reserved, address-stable and contiguous, gets committed on demand, and keeps a freelist of released items. The code is well tested (see the new gtests) and can serve as a building block for similar uses; e.g. Roman played with the idea of placing `Thread` objects into such an array, too.
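For illustration, here is a conceptual sketch of what such an array provides (this is not the actual `AddressStableArray` API; it assumes POSIX mmap/mprotect and 4K pages, and real code would query the page size):

```c++
// Conceptual sketch: a pre-reserved, contiguous, address-stable backing range,
// committed on demand, with a freelist of released slots.
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

template <class T>
class AddressStableArraySketch {
  static_assert(sizeof(T) >= sizeof(T*), "slot must be able to hold a freelist link");

  char*  _base;             // reserved address range; element addresses never move
  size_t _capacity;         // maximum number of elements
  size_t _committed_bytes;  // how much of the range is backed by committed memory
  size_t _top;              // high-water mark of handed-out slots
  T*     _freelist;         // released slots, linked through their own storage

public:
  explicit AddressStableArraySketch(size_t capacity)
    : _capacity(capacity), _committed_bytes(0), _top(0), _freelist(nullptr) {
    // Reserve (but do not commit) the whole range up front.
    _base = (char*)mmap(nullptr, capacity * sizeof(T), PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    assert(_base != (char*)MAP_FAILED && "reservation failed");
  }

  // Hands out raw slots; construction (placement new) is up to the caller.
  T* allocate() {
    if (_freelist != nullptr) {      // reuse a released slot first
      T* t = _freelist;
      _freelist = *(T**)t;
      return t;
    }
    assert(_top < _capacity && "array exhausted");
    const size_t needed = (_top + 1) * sizeof(T);
    if (needed > _committed_bytes) { // commit further whole pages on demand
      const size_t page = 4096;      // assumption; real code queries the OS
      const size_t grow = ((needed - _committed_bytes) + page - 1) / page * page;
      mprotect(_base + _committed_bytes, grow, PROT_READ | PROT_WRITE);
      _committed_bytes += grow;
    }
    return (T*)(_base + _top++ * sizeof(T));
  }

  void release(T* t) {               // push the slot onto the freelist
    *(T**)t = _freelist;
    _freelist = t;
  }
};
```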

### What this patch does not do:

This patch does not change the way OMs are stored in the mark word. I had a quick glance, but it's not trivial to find all the places in generated code that read OMs from the mark word (e.g. `C2_MacroAssembler::rtm_inflated_locking`). I ran out of time and leave this for another day.

### Tests:

- GHAs
- SAP nightlies (scheduled)
- manual tests on Linux x64, x86, aarch64, and Windows x64

[1] https://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2022-January/053683.html
[2] https://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2022-February/053903.html
[3] https://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2022-March/054187.html

-------------

Commit messages:
 - sketch encode/decode
 - AddressStable... rework initialization
 - AddressStableHeap rename to array with freelist
 - trim some fat (unused functions)
 - Move ReservedSpace outside of OM store
 - fix indentation
 - commit step size calculated automatically
 - 2nd attempt to satisfy mac aarch compiler
 - We can now uncommit the address stable heap
 - Try fix darwinintel
 - ... and 12 more: https://git.openjdk.java.net/lilliput/compare/75aa69e4...ab839fb1

Changes: https://git.openjdk.java.net/lilliput/pull/39/files
 Webrev: https://webrevs.openjdk.java.net/?repo=lilliput&pr=39&range=00
  Stats: 1570 lines in 20 files changed: 1555 ins; 0 del; 15 mod
  Patch: https://git.openjdk.java.net/lilliput/pull/39.diff
  Fetch: git fetch https://git.openjdk.java.net/lilliput pull/39/head:pull/39

PR: https://git.openjdk.java.net/lilliput/pull/39

