RFR: Load balance remset scan

Fri Nov 18 22:57:55 UTC 2022

On Fri, 18 Nov 2022 18:49:08 GMT, Kelvin Nilsen <kdnilsen at openjdk.org> wrote:

> Prior to this change, the initial group of remembered set assignments was given to worker threads one entire region at a time.  We found that with large region sizes (e.g. 16 MiB and above), this resulted in too much imbalance in the work performed by individual threads.  A few threads assigned to scan 16 MiB regions with a high density of "interesting pointers" were still scanning after all other worker threads had finished their scanning efforts.
> 
> This change caps the maximum assignment size for worker threads at 4 MiB.  This results in better distribution of efforts between multiple concurrent threads.  With 13 worker threads and 16 MiB heap regions, we observe the following benefits on an Extremem workload (46_064 MiB heap size, 27_648 MiB new size):
> 
> Latency for customer preparation processing improved by 0.79% for p50, 2.26% for p95, 8.21% for p99, 28.17% for p99.9, 86.59% for p99.99, 86.77% for p99.999.  The p100 response improved only slightly, by 1.99%.
> 
> Average time for the concurrent remembered set marking scan improved by 1.92%.  The average time for concurrent update refs, which includes remembered set scanning, improved by 1.72%.

I'll add a comment to explain the rationale behind successively smaller chunks.  The general idea is that as we get closer to the end of the total effort, we want to be more careful to avoid giving any one worker thread a disproportionately large amount of work.  Early in the total effort, it's ok for one thread to get a larger assignment than the others.  In that case, the thread with the larger assignment will chew away on it while all the other threads repeatedly receive and finish assignments that can be completed more quickly.
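
To illustrate the rationale above, here is a minimal sketch in Python. It is not the code in this PR: the function name `chunk_sizes`, the `min_chunk` floor of 128 KiB, and the "remaining work divided by twice the worker count" sizing rule are all assumptions chosen for illustration. Only the 4 MiB cap comes from the PR description.

```python
MIB = 1024 * 1024

def chunk_sizes(total_bytes, workers, max_chunk=4 * MIB, min_chunk=128 * 1024):
    """Return a list of assignment sizes that shrink as remaining work shrinks.

    Hypothetical sizing rule (an assumption, not the PR's actual logic):
    each assignment targets remaining / (2 * workers), clamped to the
    range [min_chunk, max_chunk].
    """
    remaining = total_bytes
    sizes = []
    while remaining > 0:
        # Early on the target exceeds max_chunk, so assignments are capped.
        # Near the end the target falls, so assignments get smaller.
        target = remaining // (2 * workers)
        size = max(min_chunk, min(max_chunk, target))
        # Never hand out more than what is left.
        size = min(size, remaining)
        sizes.append(size)
        remaining -= size
    return sizes

if __name__ == "__main__":
    sizes = chunk_sizes(64 * MIB, workers=4)
    print("assignments:", len(sizes))
    print("first (KiB):", sizes[0] // 1024, "last (KiB):", sizes[-1] // 1024)
```

With this kind of schedule, a thread that draws an expensive assignment late in the scan is holding only a small one, so the other threads are not left idle for long.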

-------------

PR: https://git.openjdk.org/shenandoah/pull/173