RFR: 8369048: GenShen: Defer ShenFreeSet::available() during rebuild

Kelvin Nilsen kdnilsen at openjdk.org
Tue Oct 7 15:24:24 UTC 2025


On Thu, 2 Oct 2025 17:58:48 GMT, Kelvin Nilsen <kdnilsen at openjdk.org> wrote:

> This code introduces a new rebuild-freeset lock to coordinate freeset rebuild activities with queries about the memory available for allocation in the mutator partition.
> 
> This addresses a problem that results if available memory is probed while we are rebuilding the freeset.
> 
> Rather than using the existing global heap lock to synchronize these activities, a new, more narrowly scoped lock is introduced.  This allows available memory to be probed even when other threads hold the global heap lock for reasons other than rebuilding the freeset, such as allocating memory.  The global heap lock is known to be heavily contended for certain workloads, and using this new lock avoids adding to that contention.
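Conceptually, the synchronized approach described above might look like the following minimal sketch.  The class and member names here are invented for illustration; this is not the actual HotSpot code.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>

// Sketch of the synchronized approach: a dedicated rebuild lock covers only
// the freeset rebuild and available() queries, so probing available() never
// contends with threads that hold the global heap lock merely to allocate.
class LockedFreeSetSketch {
  mutable std::mutex _rebuild_lock;  // stands in for the new rebuild-freeset lock
  std::size_t _available = 0;

public:
  std::size_t available() const {
    std::lock_guard<std::mutex> g(_rebuild_lock);  // waits out an in-flight rebuild
    return _available;
  }

  void rebuild(std::size_t recomputed) {
    std::lock_guard<std::mutex> g(_rebuild_lock);
    // ... repartition regions here; _available is transiently inconsistent,
    // but the lock keeps available() callers out until rebuild completes.
    _available = recomputed;
  }
};
```

The cost of this shape is that every available() probe takes the lock, even when no rebuild is in progress.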

I will mark this PR as a draft until I complete performance and correctness tests.

I have these results from running Extremem tests on commit 99d0175

<img width="2010" height="896" alt="image" src="https://github.com/user-attachments/assets/20596fd4-fb4f-485c-97a4-643adfe25935" />

I am going to try an experiment with a different approach.  I will remove the synchronization lock and instead will cause the implementation of freeset rebuild to not update available() until after it is done with its work.  I think this may address the same problem with less run-time overhead.
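The deferred-update alternative might be sketched as follows.  Again, the names are invented for illustration, not taken from the actual implementation; the key idea is that rebuild works on private state and publishes a new available() value only once it finishes.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Sketch of the lock-free alternative: available() reads a published
// snapshot; rebuild updates scratch state freely and publishes only at the
// end, so concurrent probes keep seeing the pre-rebuild value.
class DeferredFreeSetSketch {
  std::atomic<std::size_t> _published{0};  // the value available() returns
  std::size_t _scratch = 0;                // rebuild's work-in-progress tally

public:
  std::size_t available() const {
    return _published.load(std::memory_order_acquire);  // never blocks
  }

  void rebuild(std::size_t recomputed) {
    _scratch = recomputed;  // ... repartition regions off to the side ...
    // Publish only after the rebuild is complete.
    _published.store(_scratch, std::memory_order_release);
  }
};
```

This removes the locking overhead from the probe path, at the price of readers observing a stale (but never out-of-thin-air) value while a rebuild is in flight.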

On the same workload, here are the results of the experiment (rather than locking to prevent fetch of available during rebuild, we continue to return the value of available at the start of rebuild until rebuild finishes):

<img width="2009" height="454" alt="image" src="https://github.com/user-attachments/assets/33b38c57-8739-4814-8351-665c0aab0de2" />

General observations are that:

1. CPU utilization increased for both GenShen and Shen.
2. The number of completed GCs increased for GenShen but decreased for Shen.
3. Shen degenerated GCs increased.
4. GenShen p50 latency increased, while p95, p99, and p99.9 latencies decreased; the percentiles above p99.9 all increased for GenShen.
5. Shen latencies are worse at all percentiles.

Qualitatively, what would we expect?  If we return an old value of available() during freeset rebuild, we usually cause the triggering heuristics to believe there is less memory available than is actually available.  This may cause us to trigger GC more aggressively.  This is borne out for GenShen, but not for Shen.

With GenShen, the critical conflict occurs when old marking has completed and we rebuild the free set in order to recycle immediate old garbage and to set aside the old-collector reserves that will be required for the mixed-evacuation GC cycles that immediately follow.  While this is happening, the Shenandoah regulator thread is trying to decide whether it should interrupt old GC in order to perform an "urgent" young GC cycle.  Sometimes the regulator thread's inquiry as to how much memory is available sees a bogus (not just stale, but out-of-thin-air) value because the freeset is under construction at the time of the inquiry.  Preventing this bogus value is the point of this PR.
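To make the out-of-thin-air hazard concrete, here is a deterministic toy sketch (invented names, not the actual code) of why an unsynchronized probe can observe a value that was never a true total: if rebuild clears its tally and re-accumulates region by region, a probe landing mid-rebuild sees a partial sum.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy demonstration: the probe is simulated synchronously at a fixed point
// inside the rebuild loop, so the partial (bogus) observation is deterministic.
struct RacySketch {
  std::size_t _available = 0;

  std::size_t available() const { return _available; }  // unsynchronized read

  // Rebuild the tally from per-region free sizes; simulate a concurrent
  // probe just before region `probe_at` is accumulated.
  void rebuild(const std::vector<std::size_t>& region_free,
               std::size_t probe_at, std::size_t* probed) {
    _available = 0;  // from here until the loop ends, the tally is bogus
    for (std::size_t i = 0; i < region_free.size(); i++) {
      if (i == probe_at && probed != nullptr) {
        *probed = available();  // mid-rebuild probe sees a partial sum
      }
      _available += region_free[i];
    }
  }
};
```

With regions of 10, 20, and 30 bytes free, a probe before the second region is accumulated observes 10, a value that neither the pre-rebuild nor the post-rebuild (60) state ever held.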

This situation does not generally arise with traditional Shenandoah, which only queries available() during times when GC is idle.  (There are plans to change this, to allow the freeset to be rebuilt more asynchronously, so we are testing this coordination mechanism with both GenShen and Shen.)  A plausible explanation for the observed impact on Shen is that the absence of synchronization allows Shen to see more stale values of available(), even when it is not conflicting with a concurrent freeset rebuild.  Specifically, if the application is gnawing away at available memory and we probe available() every ms, the triggering heuristic may see the same value of available() for three consecutive probes.  Not recognizing that memory has been consumed, it delays triggering of the next GC cycle, resulting in fewer concurrent GCs with the "unsynchronized" solution.  Besides resulting in fewer GC cycles, the late triggers also allow us to get closer to total depletion of the allocatable memory pool, which explains the increase in Shenandoah degenerated cycles.

Presumably, GenShen is also vulnerable to this staleness.  But for GenShen, the benefit of eliminating out-of-thin-air available() values seems to outweigh the risk of occasional stale values that cause late triggers.

For further context, here are CI pipeline performance summaries for the initial synchronized solution:

   Control: openjdk-master-aarch64
Experiment: synchronize-available-with-rebuild-gh-aarch64

Genshen
-------------------------------------------------------------------------------------------------------
+45.80% specjbb2015/trigger_failure p=0.00542
  Control:    365.562   (+/-158.45  )        109
  Test:       533.000   (+/-200.37  )         10

+28.53% scimark.lu.large/concurrent_update_refs_young p=0.00020
  Control:      5.608ms (+/-  1.91ms)         34
  Test:         7.208ms (+/-107.48us)          2

+24.44% specjbb2015/concurrent_update_refs_degen_young p=0.00563
  Control:    804.287ms (+/-330.68ms)         41
  Test:         1.001s  (+/-101.83ms)          8

and for the "unsynchronized" solution:

   Control: openjdk-master-aarch64
Experiment: synchronize-available-with-rebuild-gh-aarch64

Genshen
-------------------------------------------------------------------------------------------------------
+51.82% hyperalloc_a2048_o4096/finish_mark_degen_young p=0.00771
  Control:     82.769ms (+/- 66.46ms)         66
  Test:       125.658ms (+/- 78.91ms)         43

The p-values for all of these measures are a bit high, reflecting the limited samples of relevant data.  The unsynchronized result is combined with previous measurements taken from the synchronized experiments.

One other somewhat subjective observation is that the synchronized solution experienced many more "timeout" failures on the CI pipeline than the unsynchronized solution.  These timeout failures correlate with stress workloads that exercise the JVM in abnormal/extreme ways.  Under these stresses, the unsynchronized mechanism seems to be a bit more robust.

I'm inclined to prefer the synchronized solution, so I will revert my most recent three commits.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/27612#issuecomment-3362378108
PR Comment: https://git.openjdk.org/jdk/pull/27612#issuecomment-3372885805
PR Comment: https://git.openjdk.org/jdk/pull/27612#issuecomment-3377284428
PR Comment: https://git.openjdk.org/jdk/pull/27612#issuecomment-3377318918
PR Comment: https://git.openjdk.org/jdk/pull/27612#issuecomment-3377335205
PR Comment: https://git.openjdk.org/jdk/pull/27612#issuecomment-3377355113


More information about the hotspot-gc-dev mailing list