RFR: 8350441: ZGC: Overhaul Page Allocation
Joel Sikström
jsikstro at openjdk.org
Wed Apr 9 13:56:53 UTC 2025
> Note that any reference to pages from here on out refers to the concept of a heap region in ZGC, not pages in the operating system (OS), unless stated otherwise.
# Background
This PR addresses fragmentation by introducing a Mapped Cache that replaces the Page Cache in ZGC. The largest limitation of the Page Cache is that it is constrained by the abstraction of what a page is. The proposed Mapped Cache removes this limitation by decoupling memory from pages, allowing it to merge and split memory in ways that the Page Cache is not suited for. To facilitate the transition, much of the Page Allocator has been redesigned to work with the Mapped Cache.
In addition to fighting fragmentation, the new approach improves NUMA support and simplifies memory unmapping. Combined, these changes lay the foundation for even more improvements in ZGC, like replacing multi-mapped memory with anonymous memory.
# Why a Mapped Cache?
The main benefit of the Mapped Cache is that adjacent virtual memory ranges in the cache can be merged to create larger ranges, enabling larger allocation requests to succeed more easily. Most notably, it allows allocations to succeed more often without "harvesting" smaller, discontiguous ranges. Harvesting negatively impacts both fragmentation and latency, as it requires remapping memory into a new contiguous virtual address range. Fragmentation becomes especially problematic in long-running programs and in environments with limited address space, where finding large contiguous regions can be difficult and may lead to premature OutOfMemoryErrors (OOME).
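To illustrate the merging idea, here is a minimal sketch, not the actual ZGC code: an ordered map stands in for the cache's search tree (described below), and a freed range is coalesced with any cached neighbor it touches. All names are hypothetical.

```c++
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>

class MappedCacheSketch {
  std::map<uintptr_t, size_t> _ranges; // start address -> size of cached range

public:
  void insert(uintptr_t start, size_t size) {
    auto next = _ranges.lower_bound(start);
    // Merge with the preceding range if it ends exactly where we begin.
    if (next != _ranges.begin()) {
      auto prev = std::prev(next);
      if (prev->first + prev->second == start) {
        start = prev->first;
        size += prev->second;
        _ranges.erase(prev);
      }
    }
    // Merge with the following range if it begins exactly where we end.
    if (next != _ranges.end() && next->first == start + size) {
      size += next->second;
      _ranges.erase(next);
    }
    _ranges[start] = size;
  }
};
```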
The Mapped Cache uses a self-balancing binary search tree to store memory ranges. Since the ranges are unused when inside the cache, the tree can use this memory to store metadata about itself, referred to as intrusive storage. This approach eliminates the need for dynamic memory allocation (e.g., malloc), which could otherwise introduce a latency overhead.
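The intrusive storage idea can be sketched as follows, again with hypothetical names: since a cached range is unused by the application, the node describing it can be placement-constructed at the start of the range itself, so the tree needs no dynamic allocation for its metadata.

```c++
#include <cstddef>
#include <new>

// A tree node that lives inside the unused memory range it describes.
struct IntrusiveNode {
  IntrusiveNode* _left;
  IntrusiveNode* _right;
  IntrusiveNode* _parent;
  bool           _red;  // red-black color
  size_t         _size; // size of the cached range this node describes
};

// Place the node at the start of the cached range: the range is unused
// while in the cache, so no malloc/free is needed for the metadata.
inline IntrusiveNode* node_at(void* range_start, size_t range_size) {
  IntrusiveNode* node = ::new (range_start) IntrusiveNode{};
  node->_size = range_size;
  return node;
}
```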
# Fragmentation
Currently, ZGC has multiple strategies for dealing with fragmentation. In some edge cases, these strategies are not as efficient as we would like. By addressing fragmentation differently with the Mapped Cache, ZGC is in a better position to avoid edge cases, which are bad even if they occur only once. This is especially impactful for programs running with a large heap.
## Virtual Memory Shuffling
In addition to the Mapped Cache, we have made some adjustments to how ZGC deals with virtual memory. When memory is harvested, it must be remapped, which requires first claiming new contiguous virtual memory. We have added a feature in which the virtual memory of the harvested ranges can be reused, improving the likelihood of finding a contiguous range. Additionally, we have redesigned the defragmentation policy so that Large pages are always defragmented upon being freed. When freed, they are broken down and remapped into lower address space, in the hopes of "filling holes" and creating more contiguous ranges.
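On Linux, the remapping step can be pictured with mremap(2). The sketch below is illustrative only, with hypothetical names and the 2MB granule size mentioned below: a freed Large page is split into granules, each of which is moved to a lower destination address chosen by the caller.

```c++
#include <sys/mman.h>
#include <cstddef>

// Granule size, assumed here to be 2MB (ZGranuleSize).
static const size_t kGranuleSize = 2 * 1024 * 1024;

// Move 'size' bytes starting at 'from' down to the destination addresses
// in 'to[]' (one slot per granule), one granule at a time.
bool remap_downwards(char* from, size_t size, char* const to[]) {
  for (size_t i = 0; i < size / kGranuleSize; i++) {
    void* result = mremap(from + i * kGranuleSize, kGranuleSize, kGranuleSize,
                          MREMAP_MAYMOVE | MREMAP_FIXED, to[i]);
    if (result == MAP_FAILED) {
      return false; // destination occupied or out of resources
    }
  }
  return true;
}
```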
# NUMA and Partitions
In the current policy, ZGC interleaves memory across all NUMA nodes with a granularity of ZGranuleSize (2MB), which is the same size as a Small page. As a result, Small pages will end up on a single, preferably local, NUMA node, whilst larger allocations will (likely) end up on multiple NUMA nodes. In the new design, the policy is to prefer allocating *all* allocation sizes on the local NUMA node whenever possible. In effect, ZGC may be able to extract better performance from NUMA systems.
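The policy difference can be approximated with the Linux mbind(2) API (link with -lnuma). This is a hedged illustration, not the ZGC implementation: the kernel's MPOL_INTERLEAVE works at OS-page granularity rather than ZGC's granule granularity, but the contrast between the two placement policies is the same.

```c++
#include <numaif.h>   // mbind, MPOL_* (libnuma headers)
#include <cstddef>

// Old policy (approximated): interleave the mapping across all nodes.
void place_interleaved(void* addr, size_t len, unsigned long all_nodes_mask) {
  mbind(addr, len, MPOL_INTERLEAVE,
        &all_nodes_mask, 8 * sizeof(all_nodes_mask), 0);
}

// New policy (approximated): prefer the local node for the whole range;
// the kernel falls back to other nodes only under memory pressure.
void place_local(void* addr, size_t len, unsigned long local_node_mask) {
  mbind(addr, len, MPOL_PREFERRED,
        &local_node_mask, 8 * sizeof(local_node_mask), 0);
}
```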
To support local NUMA allocations, the Page Allocator, and in turn the Java heap, has been split up into what we refer to as Partitions. A partition keeps track of its own heap size and Mapped Cache, allowing it to handle only memory that is associated with its own share of the heap. The number of partitions currently equals the number of NUMA nodes. On non-NUMA systems, a single partition is used.
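A rough sketch of what a partition owns, with entirely hypothetical names and fields:

```c++
#include <cstddef>
#include <cstdint>

struct MappedCache; // holds this partition's unused, still-mapped ranges

// Each partition manages its own share of the heap, so memory returned
// to one partition's cache never implicitly migrates to another.
struct Partition {
  uint32_t     _numa_id;      // NUMA node this partition is associated with
  size_t       _max_capacity; // this partition's share of the heap
  size_t       _used;         // memory currently handed out as pages
  MappedCache* _cache;        // unused memory owned by this partition
};
```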
The introduction of partitions also establishes a foundation for more fine-grained control over the heap, paving the way for future enhancements, both further NUMA improvements and new features such as Thread-Local GC.
# Defragmentation (Unmapping Memory)
Up until now, ZGC has unmapped memory asynchronously in a separate thread. The benefit of this is that other threads do not need to take a latency hit when unmapping memory. The main reliance on asynchronous unmapping has been during harvesting, especially from a mutator thread, where unmapping synchronously could introduce unwanted latency.
With the introduction of the Mapped Cache, and by moving defragmentation away from mutator threads to the GC, asynchronous unmapping is no longer necessary to meet our latency goals. Instead, memory is now unmapped synchronously. The number of times memory is defragmented for page allocations has been reduced significantly: memory for Small pages never needs to be defragmented at all, and for Large pages, defragmentation has little effect on total latency, as they are costly to allocate anyway. For Medium pages, we have plans for future enhancements where memory is defragmented even less, or not at all.
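Conceptually, the change on the free path looks like the following sketch (hypothetical names; munmap stands in for the platform-specific unmap operation):

```c++
#include <sys/mman.h>
#include <cstddef>

// Before: the range was queued to a dedicated unmapper thread, which
// unmapped it asynchronously. Now the unmap happens directly on the
// calling thread; since callers are GC threads rather than mutators,
// the latency cost is acceptable.
void unmap_synchronously(void* addr, size_t size) {
  munmap(addr, size);
}
```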
For clarity: with the removal of asynchronous unmapping, we have removed the ZUnmapper thread and the ZUnmap JFR event.
# Multi-Mapped Memory
Asynchronous unmapping has so far been possible because ZGC is backed by shared memory (on Linux), which allows memory to be multi-mapped. This is an artifact from non-generational ZGC, which used multi-mapping in its core design (see [this](https://wiki.openjdk.org/display/zgc/Pointer+Metadata+using+Multi-Mapped+memory) resource for more info). A goal we have in ZGC is to move from shared memory to anonymous memory. Anonymous memory has multiple benefits, one of them being easier configuration of Transparent Huge Pages (OS pages). However, anonymous memory doesn't support multi-mapping, so the transition would have been blocked by the asynchronous unmapping feature. With the removal of asynchronous unmapping, we are now better prepared for transitioning to anonymous memory.
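The distinction can be seen in a small Linux sketch (illustrative only; memfd_create requires glibc 2.27+): shared memory has a file descriptor through which the same physical memory can be mapped at several virtual addresses, while anonymous memory has no descriptor to map a second time.

```c++
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

// Shared memory: the fd lets the same physical memory be mapped twice,
// which is what makes multi-mapping possible.
void* multi_map_shared(size_t size) {
  int fd = memfd_create("sketch", 0);
  if (fd == -1 || ftruncate(fd, size) == -1) {
    return nullptr;
  }
  void* first  = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  void* second = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  (void)second; // both mappings alias the same physical memory
  return first;
}

// Anonymous memory: no fd, so there can only ever be one mapping.
void* map_anonymous(size_t size) {
  return mmap(nullptr, size, PROT_READ | PROT_WRITE,
              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```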
# Additional Notes
This RFE comes with our own implementation of a red-black tree for the Mapped Cache. Another red-black tree was recently introduced by C. Norrbin in [JDK-8345314](https://bugs.openjdk.org/browse/JDK-8345314) (and enhanced in [JDK-8349211](https://bugs.openjdk.org/browse/JDK-8349211)). Our plan is to integrate with our own implementation initially, and then replace it with Norrbin's tree in a future RFE. The reason we have our own tree implementation is that Norrbin's tree was not finished while we were developing and testing this RFE.
Some new additions have been made to keep the current functionality in the Serviceability Agent (SA).
# Testing
* Oracle's tiers 1-8
* We have added a small set of new tests, both gtests and jtreg tests, to test new functionality
# Performance
* Improvements in tail latency in SPECjbb2015.
* Improvements when using small OS pages in combination with NUMA.
* Small increase in the time it takes to run a GC. This is because some work has been moved from mutator threads to GC threads. It should not affect the total run-time of a program, as the total amount of work remains the same, but mutator latency is improved.
* Other suitable benchmarks show no significant improvements or regressions.
-------------
Commit messages:
- Whitespace fix in zunittest.hpp
- Copyright years
- 8350441: ZGC: Overhaul Page Allocation
Changes: https://git.openjdk.org/jdk/pull/24547/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24547&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8350441
Stats: 12052 lines in 118 files changed: 7936 ins; 3218 del; 898 mod
Patch: https://git.openjdk.org/jdk/pull/24547.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/24547/head:pull/24547
PR: https://git.openjdk.org/jdk/pull/24547