RFR: 8357443: ZGC: Optimize old page iteration in remap remembered phase [v2]

Erik Österlund eosterlund at openjdk.org
Wed May 28 16:50:55 UTC 2025


On Wed, 21 May 2025 12:45:09 GMT, Stefan Karlsson <stefank at openjdk.org> wrote:

>> Before starting the relocation phase of a major collection we remap all pointers into the young generation so that we can disambiguate when an oop has bad bits for both the young generation and the old generation. See comment in remap_young_roots.
>> 
>> One part of this requires us to visit all old pages. To parallelize that part we have a class that distributes page table indices to the GC worker threads (see ZIndexDistributor).
>> 
>> While looking into a potential, minor performance regression on Windows I noticed that the usage of constexpr in ZIndexDistributorClaimTree wasn't giving us the inlining we hoped for, which caused noticeably worse performance on Windows compared to the other platforms. I created a patch for this that gave us the expected inlining. See https://github.com/openjdk/jdk/compare/master...stefank:jdk:8357443_zgc_optimize_remap_remembered
>> 
>> While thinking about this a bit more I realized that we could use the "found old" optimization that we already use for the remset scanning. This finds the old pages without scanning the entire page table. This gives a significant enough boost that I propose that we do that instead. 
>> 
>> This mainly lowers the Major Collection times when you run a GC without any significant amount of objects in the old generation. So it is most likely important mainly for micro benchmarks and small workloads.
>> 
>> Below is the average time (ms) of the Concurrent Remap Roots phase when only running `System.gc()` 50 times, before and after this PR.
>> 
>> 
>> 4 GB MaxHeapSize
>>                     Original       Patch
>> Default threads
>> 
>> mac:                0.27812        0.0507
>> win:                0.9485         0.10452
>> linux-x64:          0.53858        0.092
>> linux-x64 NUMA:     0.89974        0.15452
>> linux-aarch64:      0.32574        0.15832
>> 
>> 4 threads
>> 
>> mac:                0.19112        0.04916
>> win:                0.83346        0.08796
>> linux-x64:          0.57692        0.09526
>> linux-x64 NUMA:     1.23684        0.17008
>> linux-aarch64:      0.334          0.21918
>> 
>> 1 thread
>> 
>> mac:                0.19678        0.0589
>> win:                1.96496        0.09928
>> linux-x64:          1.00788        0.1381
>> linux-x64 NUMA:     2.77312        0.21134
>> linux-aarch64:      0.63696        0.31286
>> 
>> 
>> The second set of data is from using the extreme end of the supported heap size. This mimics how we previously used to have a large page table even for smaller heap size ...
>
> Stefan Karlsson has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Apply suggestions from code review
>   
>   Co-authored-by: Axel Boldt-Christmas <xmas1915 at gmail.com>

Looks good.

-------------

Marked as reviewed by eosterlund (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/25345#pullrequestreview-2875785328

