RFR: 8341379: Shenandoah: Improve lock contention during cleanup

Wed Oct 2 16:47:37 UTC 2024

On Thu, 26 Sep 2024 21:08:11 GMT, Kelvin Nilsen <kdnilsen at openjdk.org> wrote:

> This change improves the efficiency of cleaning up (recycling) regions that have been trashed by GC effort.  The affected code runs at the end of FinalMark to reclaim immediate garbage.  It runs at the end of FinalUpdateRefs to reclaim the regions that comprised the collection set, from which all live objects have now been evacuated.
> 
> Efficiency improvements include:
> 1. Rather than invoking the os (while holding the Heap lock) to obtain the time twice for every region recycled, we invoke the os only once for each batch of 32 regions that are to be processed.
> 2. Rather than enforcing that the loop runs no longer than 30 us, we refrain from starting a second batch of regions if more than 8 us has passed since the preceding batch was processed.
> 
> Below, each trial runs for 1 hour, processing 28,000 transactions per second.
> 
> Without this change, latency for 4 un-named business services is represented by the following chart:
> ![image](https://github.com/user-attachments/assets/0e36025b-7b76-4e7a-ab07-303ea49479c3)
> 
> With this change, latency for the same services is much better:
> ![image](https://github.com/user-attachments/assets/aceaf185-6944-4c91-b98e-06ccd1bc2d64)
> 
> A comparison of the two is provided by the following:
> ![image](https://github.com/user-attachments/assets/7145f7b5-2a65-44b0-a94a-ddbc871f236b)

I've got a bit more information about the differences in behavior between no-batch trial 1 and trial 2:
1. Trial 2 has much worse p9999 latency than trial 1.
2. The difference is NOT safepoint behavior.  Trial 1 actually had more safepoints that lasted longer than 1 ms, with the longest lasting 5.658220 ms.  The longest safepoint in trial 2 was 3.420009 ms.
3. There is evidence to suggest that the difference stems from concurrent cleanup: trial 1 had 1 concurrent cleanup event longer than 1 ms (1.142 ms), with an average cleanup time of 85.1 us; trial 2 had 3 concurrent cleanup events longer than 1 ms (max 1.377 ms), with an average cleanup time of 85.8 us.
4. For comparison, the three runs with this fix had an average concurrent cleanup event time of 69.8 us.

Qualitative assessment:

This fix allows concurrent cleanup to complete in 18.3% less time on average.  Because the cleanup holds the shared heap lock for less time, it is less likely to contend with a mutator thread for that lock.
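
As a rough check of the 18.3% figure, here is the arithmetic under the assumption that the baseline is the mean of the two no-batch averages reported above (85.1 us and 85.8 us). The variable names are illustrative only.

```python
# Average concurrent cleanup times, in microseconds, as reported above.
no_batch_avgs_us = [85.1, 85.8]   # no-batch trial 1 and trial 2
with_fix_avg_us = 69.8            # average over the three runs with the fix

# Assumption: the baseline is the mean of the two no-batch averages.
baseline_us = sum(no_batch_avgs_us) / len(no_batch_avgs_us)
reduction_pct = 100.0 * (baseline_us - with_fix_avg_us) / baseline_us

print(f"baseline {baseline_us:.2f} us, reduction {reduction_pct:.1f}%")
```

Measured against either no-batch trial alone, the reduction comes out slightly lower (trial 1) or slightly higher (trial 2) than this.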

When a collision does occur, it is resolved more quickly: the mutator can proceed after no more than 8 us plus the time to process one batch of 32 regions, rather than having to wait up to 30 us.
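
To make the batching policy concrete, here is a minimal Python sketch of the idea. It is not the HotSpot code: `recycle_trash`, its parameters, and the injectable `now_ns` clock are all hypothetical names, and the sketch assumes the 8 us budget is measured from the moment the lock is acquired.

```python
import threading
import time
from typing import Callable, Sequence

BATCH_SIZE = 32        # regions recycled between clock reads
BUDGET_NS = 8_000      # 8 us: do not start another batch past this


def recycle_trash(regions: Sequence[int],
                  recycle: Callable[[int], None],
                  heap_lock: threading.Lock,
                  now_ns: Callable[[], int] = time.monotonic_ns) -> int:
    """Recycle all regions in batches; return how many times the lock was taken."""
    acquisitions = 0
    i = 0
    while i < len(regions):
        with heap_lock:
            acquisitions += 1
            start = now_ns()                  # one clock read per lock hold
            while i < len(regions):
                end = min(i + BATCH_SIZE, len(regions))
                for region in regions[i:end]:
                    recycle(region)           # no clock reads inside a batch
                i = end
                # One clock read per batch: stop if the budget is used up.
                if now_ns() - start > BUDGET_NS:
                    break
        # The lock is released here, so a waiting mutator can acquire it
        # before the next group of batches starts.
    return acquisitions
```

Under these assumptions the lock is held for at most the 8 us budget plus the cost of one batch, and the clock is read once per batch (plus once per lock acquisition) instead of twice per region.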

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21211#issuecomment-2389135297