RFR: 8331572: Allow using OopMapCache outside of STW GC phases

Tue May 14 18:24:23 UTC 2024

On Tue, 14 May 2024 12:31:08 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:

> As the reproducer in the issue shows, we would also like to use the `OopMapCache` during the concurrent GC phases. Zhengyu mentions there is also a production problem for stack walking that would benefit from letting `OopMapCache` be used without looking at GC at all.
> 
> This PR unblocks `OopMapCache` uses for everything. Cleanups are nominally done by the service thread. But, since the majority of use cases would still come from GCs, we keep the proactive cleanups from the GC ops here as well. This requires synchronization between readers that might be copying entries out of the hashmap and the concurrent reclamation. Handily, `GlobalCounter` can be used for that purpose.
> 
> After this lands, I think we can go over `OopMapCache::compute_one_oop_map` uses and see if they would instead like to use the cached `lookup` to benefit from this cache too. I think those paths are for OSR and deopts, so their performance is unlikely to be critical. This PR already covers the concurrent GC paths well.
> 
> Additional testing:
>  - [x] Performance test reproducer from the bug improves significantly
>  - [x] Linux AArch64 server fastdebug, `hotspot_gc_shenandoah` (10x)
>  - [ ] Linux AArch64 server fastdebug, `all`
>  - [x] Linux x86_64 server fastdebug, `all`

Performance note: there is an intrinsic tradeoff here between the cost of acquiring the critical section vs the concurrency it unblocks for non-STW GCs and the cache improvements on non-GC paths. The critical section overhead is mostly due to the fence in https://github.com/openjdk/jdk/blob/5a4415a6bddb25cbd5b87ff8ad1a06179c2e452e/src/hotspot/share/utilities/globalCounter.inline.hpp#L43
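
To illustrate where that cost comes from, here is a minimal, self-contained sketch of an RCU-style counter in portable C++. It is not the HotSpot `GlobalCounter` code: `ToyGlobalCounter`, `Entry`, `run_demo` and the fixed reader slots are hypothetical names made up for this note. The only point is that every reader pays for a full fence on entry, while the reclaiming side waits for older readers before freeing.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical toy model of an RCU-style counter; NOT HotSpot's GlobalCounter.
constexpr int kMaxReaders = 4;
constexpr uint64_t kActive = 1;      // low bit: reader is inside a critical section
constexpr uint64_t kIncrement = 2;   // generations are always even

struct ToyGlobalCounter {
  std::atomic<uint64_t> global{kIncrement};
  std::atomic<uint64_t> reader[kMaxReaders];

  ToyGlobalCounter() {
    for (auto& r : reader) r.store(0, std::memory_order_relaxed);
  }

  void critical_section_begin(int id) {
    uint64_t v = global.load(std::memory_order_relaxed) | kActive;
    reader[id].store(v, std::memory_order_relaxed);
    // The per-reader cost discussed above: a full fence, so the store is
    // ordered before any later load of the protected data.
    std::atomic_thread_fence(std::memory_order_seq_cst);
  }

  void critical_section_end(int id) {
    reader[id].store(0, std::memory_order_release);
  }

  // Called by the reclaimer after unlinking an entry, before freeing it.
  void write_synchronize() {
    std::atomic_thread_fence(std::memory_order_seq_cst);
    uint64_t gen = global.fetch_add(kIncrement, std::memory_order_seq_cst) + kIncrement;
    for (auto& r : reader) {
      // Wait only for readers that entered before the new generation.
      for (;;) {
        uint64_t v = r.load(std::memory_order_acquire);
        if ((v & kActive) == 0 || v >= gen) break;
        std::this_thread::yield();
      }
    }
  }
};

struct Entry { int key; int value; };  // invariant in this demo: value == key * 2

// Readers copy entries out while one thread replaces and frees them.
// Returns the number of copies that violated the invariant (expected: 0).
int run_demo() {
  ToyGlobalCounter gc;
  std::atomic<Entry*> slot{new Entry{1, 2}};
  std::atomic<bool> stop{false};
  std::atomic<int> bad{0};

  std::vector<std::thread> readers;
  for (int id = 0; id < kMaxReaders; id++) {
    readers.emplace_back([&, id] {
      while (!stop.load(std::memory_order_relaxed)) {
        gc.critical_section_begin(id);
        Entry copy = *slot.load(std::memory_order_acquire);  // copy out
        gc.critical_section_end(id);
        if (copy.value != copy.key * 2) bad.fetch_add(1);
      }
    });
  }

  for (int k = 2; k <= 1000; k++) {
    Entry* old = slot.exchange(new Entry{k, k * 2}, std::memory_order_acq_rel);
    gc.write_synchronize();  // no reader can still hold 'old' after this
    delete old;
  }

  stop.store(true);
  for (auto& t : readers) t.join();
  delete slot.load();
  return bad.load();
}
```

Calling `run_demo()` exercises both sides at once. In this toy version the reader path is two plain stores plus the fence, so on a stressy reader workload the fence would dominate, which is the tradeoff described above.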

So, the original reproducer (very stressy, with lots of interpreter frames) improves dramatically (73 -> 6 ms) with Shenandoah GC, but a run with Serial GC reveals a slight regression in GC times (74 -> 79 ms). I have not been able to replicate this regression in larger benchmarks.

Anyhow, this very fine-grained regression nearly disappears (74.1 -> 74.3 ms on Serial) if we optimize the other part of this whole path a bit, done in this PR: https://github.com/openjdk/jdk/pull/19229/commits/455687addeba55dc998dbf9ab4b8ec58f0b69ee4. This also improves Shenandoah times further (6.1 -> 5.6 ms).

-------------

PR Comment: https://git.openjdk.org/jdk/pull/19229#issuecomment-2110429057