[master] RFR: ObjectMonitor Storage
Erik Österlund
eosterlund at openjdk.java.net
Mon Apr 11 15:46:20 UTC 2022
On Thu, 17 Feb 2022 08:53:41 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:
> Hi,
>
> This prepares the way for an idea Roman had last year: to store OM references in a compressed form in the header, instead of as a 64-bit pointer. Similar to narrow Klass pointers, we need an index or an offset into a memory region.
>
> With this patch, OMs live in an array now. That new OM array is pre-reserved, gets committed on demand, and has a freelist to manage released OMs. In a way, it is very similar to the old TSM solution before [JDK-8253064](https://bugs.openjdk.java.net/browse/JDK-8253064). Thanks a lot to @dcubed-ojdk for patiently explaining the details of the old solution to me ([1], [2], [3]).
>
> ### Performance and memory use
>
> I found that the renaissance "philosophers" benchmark is a good tool for measuring ObjectMonitor memory usage and performance. With default VM options, the benchmark does a lot of synchronization and creates millions of OMs. Running with a lower value for `-XX:MonitorUsedDeflationThreshold` diminishes the effect, and according to Dan [3] it would also be a typical case for threshold reduction.
>
> This is not a typical use case. Typically we have only a few thousand OMs, and OM storage management does not matter that much. But we don't want huge performance drops in these outlier scenarios.
>
> The first version of my patch was very naive: a single global allocator with every access synchronized. The performance loss was brutal, about 15% compared to malloc.
>
> I improved the patch to reduce contention:
> - OMs are now preallocated in bulk on a per-thread basis (by default 64, adjustable via `-XX:PreallocatedObjectMonitors`).
> - OMs are released in bulk by preparing the OM freelist off-lock and only taking the lock when appending the list to the central free list.
>
> Again, all somewhat similar to the old TSM solution.
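As an illustration of the two contention-reducing ideas, here is a minimal model (the class names, the slab-based backing, and all details are mine, not the patch's; a real allocator would also serve bulk requests from the central freelist first):

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Illustrative monitor with an embedded freelist link.
struct Monitor {
  Monitor* next_free = nullptr;
};

class MonitorPool {
  std::mutex _lock;               // central lock, taken once per batch
  Monitor* _free_list = nullptr;  // central freelist of released monitors
  std::vector<Monitor*> _slabs;   // backing storage (simplified; the patch
                                  // commits into a pre-reserved array)
public:
  ~MonitorPool() { for (Monitor* s : _slabs) delete[] s; }

  // Idea 1: hand monitors to a thread in bulk (e.g. 64 at a time), so
  // the central lock is taken once per batch instead of once per OM.
  Monitor* allocate_bulk(size_t n) {
    std::lock_guard<std::mutex> g(_lock);
    Monitor* slab = new Monitor[n];
    _slabs.push_back(slab);
    // Link the batch into a private chain the caller owns off-lock.
    for (size_t i = 0; i + 1 < n; i++) slab[i].next_free = &slab[i + 1];
    return slab;
  }

  // Idea 2: the caller links released monitors into a private chain
  // off-lock; appending that chain to the central freelist is then a
  // constant-time operation under the lock.
  void release_bulk(Monitor* head, Monitor* tail) {
    std::lock_guard<std::mutex> g(_lock);
    tail->next_free = _free_list;
    _free_list = head;
  }

  size_t count_free() {
    std::lock_guard<std::mutex> g(_lock);
    size_t n = 0;
    for (Monitor* m = _free_list; m != nullptr; m = m->next_free) n++;
    return n;
  }
};
```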
>
> The improved version now has performance and memory use similar to, or even better than, the stock VM; see below.
>
> #### Measurements
>
> Running the renaissance philosophers benchmark with both the stock Lilliput VM and the patched Lilliput VM.
>
> Options: `-XX:+UnlockDiagnosticVMOptions -Xmx2g -Xms2g -XX:NativeMemoryTracking=summary -XX:+PrintNMTStatistics -XX:+DumpVitalsAtExit`
>
> We compare two allocators, one of which (the libc's) is outside our control. So I measure RSS, not committed memory, because I have no idea how much memory the libc commits in order to fulfill its mallocs. It may actually be a lot: to prevent contention, glibc at least uses thread-local arenas, which can get huge but are often mostly unused.
>
> For the same reason I do not use `AlwaysPretouch`: we would pretouch committed mmapped memory in the patched version, but neither os::malloc() nor whatever overhead the libc produces would be touched. `AlwaysPretouch` would hence bias against stock.
>
> The performance numbers and RSS numbers wobbled a bit, but the following run is average (smaller numbers better):
>
> ##### Default run (`-XX:MonitorUsedDeflationThreshold=90`, `-XX:PreallocatedObjectMonitors=64`)
>
> 1) Stock
>
> Benchmark result: 4333.61537
> Highest rss: 3.7g
>
> 2) New ObjectMonitorStorage
>
> Benchmark result: 4122.14034 (+4.9%)
> Highest rss: 3.7g
>
>
> ##### Run with `-XX:MonitorUsedDeflationThreshold=50`
>
> 1) Stock
>
> Benchmark result: 4202.40876
> Highest rss: 2.1g
>
> 2) New ObjectMonitorStorage
>
> Benchmark result: 4142.66361 (+1.4%)
> Highest rss: 1.9g (-9%)
>
>
> ##### Run with `-XX:PreallocatedObjectMonitors=1024`
>
> 2) New ObjectMonitorStorage
>
> Benchmark result: 4322.59806 (-2.8%)
> Highest rss: 3.5g (+66%)
>
> #### Analysis
>
> The patched version uses a bit less RSS and is 1-5% faster than the unpatched version.
>
> Enlarging the number of thread-local preallocated OMs to 1024 was not so hot: the reduced contention comes at a high memory price and an actual performance loss too. There is an optimal point; maybe even the default of 64 is too high. I ran out of time, but this is certainly a knob to optimize.
>
> Even though my results are encouraging, I suspect that if we go down this road we'll need to spend more time optimizing the memory use and performance of OM storage; this is an area where a lot of tweaking happened upstream over time.
>
> ### Patch details:
>
> The patch introduces the general-purpose class `AddressStableArray`: a templated array that is pre-reserved, address stable, contiguous, committed on demand, and keeps a freelist of released items. The code is well tested (see the new gtests) and can serve as a building block for similar uses; e.g. Roman played with the idea of placing `Thread` into such an array too.
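For illustration, the core idea of such an array can be sketched like this (my own simplification using POSIX calls, not the patch's code; the real implementation would commit in larger chunks and be thread-safe):

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>
#include <unistd.h>

// Reserve the whole range up front (so element addresses never change),
// commit pages only as the high-water mark grows, and recycle released
// slots through an embedded freelist.
template <typename T>
class AddressStableArray {
  union Slot { T elem; Slot* next_free; };
  Slot*  _base;            // start of the reserved address range
  size_t _capacity;        // maximum element count, fixed at reserve time
  size_t _used;            // high-water mark of handed-out slots
  size_t _committed_bytes; // page-aligned committed prefix of the range
  Slot*  _free_list;       // released slots, reused before growing
public:
  explicit AddressStableArray(size_t capacity)
      : _capacity(capacity), _used(0), _committed_bytes(0),
        _free_list(nullptr) {
    // Reserve address space only; no physical memory is committed yet.
    _base = (Slot*)mmap(nullptr, capacity * sizeof(Slot), PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(_base != (Slot*)MAP_FAILED);
  }
  ~AddressStableArray() { munmap(_base, _capacity * sizeof(Slot)); }

  T* allocate() {
    if (_free_list != nullptr) {             // reuse a released slot first
      Slot* s = _free_list;
      _free_list = s->next_free;
      return &s->elem;
    }
    if (_used == _capacity) return nullptr;  // reservation exhausted
    size_t needed = (_used + 1) * sizeof(Slot);
    if (needed > _committed_bytes) {         // commit more pages on demand
      size_t page = (size_t)sysconf(_SC_PAGESIZE);
      size_t grow = ((needed - _committed_bytes) + page - 1) / page * page;
      mprotect((char*)_base + _committed_bytes, grow,
               PROT_READ | PROT_WRITE);
      _committed_bytes += grow;
    }
    return &_base[_used++].elem;
  }

  void release(T* p) {                       // push onto the freelist
    Slot* s = (Slot*)p;
    s->next_free = _free_list;
    _free_list = s;
  }

  // Stable addresses allow elements to be named by a small index: the
  // basis for storing a compressed index instead of a 64-bit pointer.
  size_t index_of(const T* p) const { return (const Slot*)p - _base; }
  T*     at(size_t idx)             { return &_base[idx].elem; }
};
```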
>
> ### What this patch does not do:
>
> This patch does not change the way OMs are stored in the markword. I had a quick glance, but it's not trivial to find all the places in generated code that read OMs from the mark word (e.g. `C2_MacroAssembler::rtm_inflated_locking`). I ran out of time and leave this for another day.
>
> ### Tests:
>
> - GHAs
> - SAP nightlies (scheduled)
> - manual tests on Linux x64, x86, aarch64, and Windows x64
>
> [1] https://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2022-January/053683.html
> [2] https://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2022-February/053903.html
> [3] https://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2022-March/054187.html
We worked very hard to get rid of TSM. I would cry a bit if this monster were brought back from the dead.
I also don't know if this compression is sound. I have seen SPECjbb2015 create millions of ObjectMonitors. So it doesn't sound too far-fetched to assume that a larger app, for example one with a TB-sized heap, would run out of address space for ObjectMonitors, even if only a reasonable percentage of objects need a monitor.
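For a sense of scale, a back-of-envelope calculation (the index widths and the per-monitor footprint here are illustrative assumptions of mine, not values from the patch or from HotSpot):

```cpp
#include <cstdint>

// A k-bit compressed reference can name at most 2^k monitors, and the
// whole backing range must be reserved up front for address stability.
constexpr uint64_t max_monitors(unsigned index_bits) {
  return uint64_t(1) << index_bits;
}
constexpr uint64_t reserved_bytes(unsigned index_bits,
                                  uint64_t monitor_size) {
  return max_monitors(index_bits) * monitor_size;
}

// If a compressed header left, say, 25 bits for the index, the hard cap
// would be ~33.5M monitors, which workloads with millions of live
// monitors approach uncomfortably.
static_assert(max_monitors(25) == 33554432ULL, "2^25 monitors");
// A full 32-bit index is roomier, but at an assumed ~128 bytes per
// monitor it requires 512 GiB of reserved address space.
static_assert(reserved_bytes(32, 128) == 549755813888ULL, "512 GiB");
```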
I would very much prefer a solution that instead removes the ObjectMonitor pointer from the markWord completely, in favour of publishing the mapping with a (possibly optimized) hash table instead. Then you only need the hashCode in the markWord, TSM can stay dead, and it will scale better to large apps. Lookups might be slower, but that sounds like the better price to pay, I think.
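The alternative can be pictured with a small model (all names and the locking scheme are illustrative; a production version would need a concurrent, GC-aware table, not a mutex around `std::unordered_map`):

```cpp
#include <cstddef>
#include <mutex>
#include <unordered_map>

// Model: the markWord keeps only the identity hashCode; the
// object->monitor association lives entirely in a side table.
struct ObjectMonitor {
  int recursions = 0;  // placeholder for real monitor state
};

class MonitorTable {
  std::mutex _lock;  // simplification; see lead-in note
  std::unordered_map<const void*, ObjectMonitor*> _map;  // keyed by object
public:
  // Inflation: create the monitor on first contended use; subsequent
  // lookups for the same object return the same monitor.
  ObjectMonitor* get_or_inflate(const void* obj) {
    std::lock_guard<std::mutex> g(_lock);
    ObjectMonitor*& m = _map[obj];
    if (m == nullptr) m = new ObjectMonitor();
    return m;
  }

  // Deflation: drop the mapping once the monitor becomes idle, so the
  // table only holds currently-inflated monitors.
  void deflate(const void* obj) {
    std::lock_guard<std::mutex> g(_lock);
    auto it = _map.find(obj);
    if (it != _map.end()) {
      delete it->second;
      _map.erase(it);
    }
  }

  size_t size() {
    std::lock_guard<std::mutex> g(_lock);
    return _map.size();
  }
};
```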
-------------
PR: https://git.openjdk.java.net/lilliput/pull/39