RFR: 8357443: ZGC: Optimize old page iteration in remap remembered phase
Stefan Karlsson
stefank at openjdk.org
Wed May 21 09:55:02 UTC 2025
Before starting the relocation phase of a major collection, we remap all pointers into the young generation so that we can disambiguate when an oop has bad bits for both the young generation and the old generation. See the comment in remap_young_roots.
One part of this requires us to visit all old pages. To parallelize that part we have a class that distributes page table indices to the GC worker threads (see ZIndexDistributor).
While looking into a potential, minor performance regression on Windows, I noticed that the usage of constexpr in ZIndexDistributorClaimTree wasn't giving us the inlining we hoped for, which caused noticeably worse performance on Windows compared to the other platforms. I created a patch for this that gave us the expected inlining. See https://github.com/openjdk/jdk/compare/master...stefank:jdk:8357443_zgc_optimize_remap_remembered
While thinking about this a bit more I realized that we could use the "found old" optimization that we already use for the remset scanning. This finds the old pages without scanning the entire page table. This gives a significant enough boost that I propose that we do that instead.
This mainly lowers the Major Collection times when you run a GC without any significant amount of objects in the old generation, so it is most likely relevant mainly for micro benchmarks and small workloads.
Below is the average time (ms) of the Concurrent Remap Roots phase from only running `System.gc()` 50 times, before (Original) and after (Patch) this PR.
4 GB MaxHeapSize

                     Original    Patch
Default threads
  mac:               0.27812     0.0507
  win:               0.9485      0.10452
  linux-x64:         0.53858     0.092
  linux-x64 NUMA:    0.89974     0.15452
  linux-aarch64:     0.32574     0.15832
4 threads
  mac:               0.19112     0.04916
  win:               0.83346     0.08796
  linux-x64:         0.57692     0.09526
  linux-x64 NUMA:    1.23684     0.17008
  linux-aarch64:     0.334       0.21918
1 thread
  mac:               0.19678     0.0589
  win:               1.96496     0.09928
  linux-x64:         1.00788     0.1381
  linux-x64 NUMA:    2.77312     0.21134
  linux-aarch64:     0.63696     0.31286
The second set of data is from using the extreme end of the supported heap size. This mimics how we previously used to have a large page table even for smaller heap sizes (we don't do that anymore as of JDK 25). It shows a quite significant difference, but it will also likely be in the noise when running larger workloads.
16 TB MaxHeapSize

                     Original    Patch
Default threads
  mac:               11.4903     0.11098
  win:               54.3666     0.37164
  linux-x64:         18.0898     0.21094
  linux-x64 NUMA:    26.9786     0.46134
  linux-aarch64:     20.7151     0.32846
4 threads
  mac:               6.4035      0.10096
  win:               89.5496     0.32178
  linux-x64:         27.883      0.2053
  linux-x64 NUMA:    35.5636     0.30928
  linux-aarch64:     15.4857     0.32004
1 thread
  mac:               21.2717     0.1275
  win:               307.155     0.3361
  linux-x64:         62.5843     0.2309
  linux-x64 NUMA:    92.0048     0.3798
  linux-aarch64:     61.0375     0.42458
This change removes the last usage of ZIndexDistributor. I don't know if we want to remove it now, or leave it in case we need it for any of our upcoming features.
I've run this through tier1-7.
-------------
Commit messages:
- 8357443: ZGC: Optimize old page iteration in remap remembered phase
Changes: https://git.openjdk.org/jdk/pull/25345/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=25345&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8357443
Stats: 107 lines in 4 files changed: 49 ins; 14 del; 44 mod
Patch: https://git.openjdk.org/jdk/pull/25345.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/25345/head:pull/25345
PR: https://git.openjdk.org/jdk/pull/25345
More information about the hotspot-gc-dev mailing list