RFR: 8341379: Shenandoah: Improve lock contention during cleanup [v3]
Kelvin Nilsen
kdnilsen at openjdk.org
Thu Oct 3 21:46:36 UTC 2024
On Thu, 3 Oct 2024 21:30:12 GMT, Kelvin Nilsen <kdnilsen at openjdk.org> wrote:
>> This change improves the efficiency of cleaning up (recycling) regions that have been trashed by GC effort. The affected code runs at the end of FinalMark to reclaim immediate garbage. It runs at the end of FinalUpdateRefs to reclaim the regions that comprised the collection set, from which all live objects have now been evacuated.
>>
>> Efficiency improvements include:
>> 1. Rather than invoking the os (while holding the Heap lock) to obtain the time twice for every region recycled, we invoke the os only once for each batch of 32 regions that are to be processed.
>> 2. Rather than enforcing that the loop runs no longer than 30 us, we refrain from starting a second batch of regions if more than 8 us has passed since the preceding batch was processed.
>>
>> Below, each trial runs for 1 hour, processing 28,000 transactions per second.
>>
>> Without this change, latency for 4 un-named business services is represented by the following chart:
>> 
>>
>> With this change, latency for the same services is much better:
>> 
>>
>> A comparison of the two is provided by the following:
>> 
>
> Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision:
>
> fix typo
I restructured the mechanism that decides whether to start processing another batch of regions. In the code as originally structured, we would start another batch whenever the time since previous yield is less than 8 us. In this new version of the code, we start another batch whenever the time since previous yield plus the predicted time to process another batch is less than 10 us. Prediction of how long a batch will take to process is based on most recent history. Note that this is highly dependent on the state of the application. If the application has lots of mutator threads that are urgently allocating, there will be contention for the global heap lock and there will be contention for CPU cores, and both of these will cause the batch processing time to increase.
As expected, this new version does a better job of constraining the amount of time between yields. With this new code, we will never start processing another batch unless we have some confidence that the new batch will complete on schedule. Previously, we might have completed processing a first batch at time 7.5 us. Then, we would have immediately started to process a second batch, even though a prediction of batch processing time would have allowed us to predict that this second batch would not finish until time 15 us.
The new version of the code will sometimes cause mutators to wait slightly longer for yield and the heap lock when the system is "lightly loaded". This is shown to affect p50 latencies. When the system is lightly loaded, the time to process a batch (on current hardware) is approximately 1.5 us. Suppose we finish processing a batch at time 8.25 us. In the original implementation, we would have immediately yielded. However, in this new version of the code, we'll take the next batch, because it is predicted to complete at time 9.75 us, which is less than the 10 us deadline. Making this choice allows the GC thread that is recycling trash regions to have higher throughput.
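For concreteness, the decision rule described above can be sketched roughly as follows. This is an illustrative sketch only, not the actual PR code; the names (`BATCH_SIZE`, `DEADLINE_NS`, `should_start_next_batch`, `predict_batch_ns`) are hypothetical, and the real implementation runs under the global heap lock with its own time source.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative constants: 32 regions are recycled per clock read, and the
// budget between yields is 10 us (values taken from the description above).
constexpr int      BATCH_SIZE  = 32;
constexpr uint64_t DEADLINE_NS = 10000;   // 10 us

// Predict the cost of the next batch from the most recent history; here,
// simply the measured duration of the previous batch. The measurement is
// cheap because the clock is read once per batch, not twice per region.
uint64_t predict_batch_ns(uint64_t last_batch_ns) {
  return last_batch_ns;
}

// Start another batch only if the elapsed time since the last yield, plus
// the predicted cost of one more batch, still fits within the deadline.
bool should_start_next_batch(uint64_t ns_since_last_yield,
                             uint64_t predicted_batch_ns) {
  return ns_since_last_yield + predicted_batch_ns <= DEADLINE_NS;
}
```

Under this rule, finishing a batch at 8.25 us with a predicted 1.5 us batch cost starts another batch (predicted finish 9.75 us), while finishing at 7.5 us with a predicted 7.5 us batch cost yields instead of overrunning to 15 us, matching the two scenarios discussed above.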
Here are the results of testing this new version of the code:

and this is how this new version compares to mainline without this PR:

Most of the p99.99 percentile latencies are considerably improved. There is a slight degradation of p50 latencies.
Which would be considered the preferred solution? I'll vote for the new code, but can be persuaded either way.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/21211#issuecomment-2392391683
More information about the hotspot-gc-dev
mailing list