RFR: 8339387: ZGC: Synchronize medium page allocation

Fri Sep 6 08:25:52 UTC 2024

On Fri, 6 Sep 2024 07:14:11 GMT, Stefan Johansson <sjohanss at openjdk.org> wrote:

> Please review this change to synchronize medium page allocations in ZGC.
> 
> **Summary**
> In ZGC, objects of a certain size class are allocated in medium-sized pages. For each age there is a single medium page shared by all mutators. When this page gets full, all threads that try to do a medium page allocation will try to allocate and install a new medium page, but only one will succeed. This can lead to a lot of unnecessary medium page allocations, which in turn can lead to unnecessary page cache flushing.
> 
> This change introduces synchronization to allow only a single thread to allocate the medium page in the common case.
> 
> **Testing**
> * Functional testing through mach5 tier1-7 using ZGC
> * Performance testing through aurora to verify no regressions occur
> * Manual testing to verify performance
> * Manual testing to verify we avoid page cache flushing

As mentioned in the summary, most benchmarks show no direct performance improvement from this change. But looking at memory usage in our logs, we can see improvements in how ZGC uses memory.

The statistics logging below, taken from the end of a benchmark run where medium objects are in use, shows some of the improvements. Even if they don't translate into a score improvement, they will improve the latency of some allocation operations.

Baseline:
[369.264s][info][gc,stats    ]                                      Last 10s              Last 10m 
[369.264s][info][gc,stats    ]                                     Avg / Max             Avg / Max
[369.264s][info][gc,stats    ] Memory: Allocation Rate             438 / 950             684 / 2846            684 / 2846            684 / 2846        MB/s
[369.264s][info][gc,stats    ] Memory: Defragment                    0 / 0                18 / 190              18 / 190              18 / 190         ops/s
[369.264s][info][gc,stats    ] Memory: Page Cache Flush              0 / 0                36 / 380              36 / 380              36 / 380         MB/s
[369.264s][info][gc,stats    ] Memory: Undo Page Allocation          0 / 1                 2 / 71                2 / 71                2 / 71          ops/s

With this change:
[369.104s][info][gc,stats    ] Memory: Allocation Rate             465 / 620             612 / 1086            612 / 1086            612 / 1086        MB/s
[369.104s][info][gc,stats    ] Memory: Defragment                    0 / 0                 0 / 0                 0 / 0                 0 / 0           ops/s
[369.104s][info][gc,stats    ] Memory: Page Cache Flush              0 / 0                 0 / 0                 0 / 0                 0 / 0           MB/s
[369.104s][info][gc,stats    ] Memory: Undo Page Allocation          0 / 0                 0 / 8                 0 / 8                 0 / 8           ops/s

Additional details about the different lines:
**Allocation Rate** - The maximum allocation rate is down, because it's no longer inflated by many unnecessary medium page allocations happening at once.
**Defragment** - ZGC tries to defragment the virtual address space by remapping memory used by small pages from high addresses to low addresses. This only happens when the page cache holds only medium and large pages, which might be the case after a set of medium page allocations that are later undone. In this run all such defragmentations were avoided.
**Page Cache Flush** - When there are no medium (or large) pages available in the cache, the cache needs to be flushed to allow the creation of a new page. By not doing the unnecessary allocations, ZGC is able to avoid flushing in this benchmark.
**Undo Page Allocation** - When a page is allocated but later found to not be needed, we undo the page allocation. This can happen for small pages as well, so we still see some undos, but the ones for medium pages are avoided.
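
To make the allocate-and-undo pattern concrete, here is a minimal, self-contained sketch of the race described in the summary and of one way to serialize it. It is not the code in the PR: `Page`, `SharedPageAllocator`, `Stats`, `run` and the page and allocation sizes are all hypothetical names and values chosen for illustration. With `synchronize` set to `false`, every thread that finds the shared page full allocates its own replacement and the losers discard theirs. With it set to `true`, a lock plus a re-check lets only one thread allocate.

```cpp
#include <atomic>
#include <cstddef>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a shared medium page (not the real ZPage).
class Page {
public:
  explicit Page(size_t capacity) : _capacity(capacity), _top(0) {}

  // Lock-free bump allocation; returns false once the page is full.
  bool alloc(size_t size) {
    size_t top = _top.load();
    while (top + size <= _capacity) {
      if (_top.compare_exchange_weak(top, top + size)) {
        return true;
      }
    }
    return false;
  }

private:
  const size_t _capacity;
  std::atomic<size_t> _top;
};

struct Stats {
  size_t created;  // pages allocated in total
  size_t undone;   // pages allocated but discarded after losing the install race
};

class SharedPageAllocator {
public:
  SharedPageAllocator(size_t page_capacity, bool synchronize)
    : _page_capacity(page_capacity), _synchronize(synchronize) {}

  void alloc(size_t size) {
    for (;;) {
      Page* const page = _shared.load();
      if (page != nullptr && page->alloc(size)) {
        return;
      }
      if (_synchronize) {
        std::lock_guard<std::mutex> guard(_alloc_lock);
        if (_shared.load() != page) {
          continue;  // another thread installed a new page while we waited
        }
        install_new_page(page);
      } else {
        install_new_page(page);  // every thread that saw a full page gets here
      }
    }
  }

  Stats stats() const { return Stats{_created.load(), _undone.load()}; }

private:
  void install_new_page(Page* expected) {
    std::unique_ptr<Page> fresh(new Page(_page_capacity));
    _created++;
    if (_shared.compare_exchange_strong(expected, fresh.get())) {
      // Installed pages stay alive until the allocator is destroyed.
      std::lock_guard<std::mutex> guard(_installed_lock);
      _installed.push_back(std::move(fresh));
    } else {
      _undone++;  // lost the race; fresh is freed when it goes out of scope
    }
  }

  const size_t _page_capacity;
  const bool _synchronize;
  std::atomic<Page*> _shared{nullptr};
  std::atomic<size_t> _created{0};
  std::atomic<size_t> _undone{0};
  std::mutex _alloc_lock;
  std::mutex _installed_lock;
  std::vector<std::unique_ptr<Page>> _installed;
};

// Runs `threads` workers that each perform `allocs_per_thread` allocations
// of 10 units against pages of 1000 units (100 allocations per page).
Stats run(bool synchronize, int threads, int allocs_per_thread) {
  SharedPageAllocator allocator(1000, synchronize);
  std::vector<std::thread> workers;
  for (int t = 0; t < threads; t++) {
    workers.emplace_back([&allocator, allocs_per_thread] {
      for (int i = 0; i < allocs_per_thread; i++) {
        allocator.alloc(10);
      }
    });
  }
  for (std::thread& worker : workers) {
    worker.join();
  }
  return allocator.stats();
}
```

Calling `run(true, 8, 1000)` should report zero undone pages, since the re-check under the lock means the install can no longer fail. `run(false, 8, 1000)` installs the same number of pages but may report a nonzero `undone` count that varies from run to run, which is the kind of wasted work the Undo Page Allocation line counts.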

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20883#issuecomment-2333513053