RFR: 8357533: ZGC: ZIndexDistributorClaimTree functions not inlined on Windows
Stefan Karlsson
stefank at openjdk.org
Wed Jun 4 15:01:54 UTC 2025
On Thu, 22 May 2025 08:07:36 GMT, Stefan Karlsson <stefank at openjdk.org> wrote:
> While investigating a minor performance regression in a System.gc() microbenchmark I saw that the Concurrent Remap Roots phase took a significant portion of the GC cycle time.
>
> It turns out that even though we use constexpr we don't get the inlining of the the functions as we expected and this results in a noticeable performance loss.
>
> When running System.gc() with on old GC worker thread with a 16TB max heap size we spend around 300 ms just iterating over the page table and its indices via ZIndexDistributorClaimTree. If the inlining is fixed this drops down to ~70 ms, which is similar to what we see on Linux and MacOS.
>
> We already have a patch out to remove the last usage of the ZIndexDistributor mechanism in https://github.com/openjdk/jdk/pull/25345, but ZIndexDistributor has less restrictions and is easier to use so we might want to keep it around (with fixed performance) so that it can be used for prototyping and maybe future features.
>
> Noteworthy in the patch is the following function:
>
> static constexpr int level_multiplier(int level) {
> assert(level < ClaimLevels, "Must be");
> constexpr int array[ClaimLevels]{16, 16, 16};
> return array[level];
> }
>
>
> When the last statement is changed to:
>
> constexpr int result = array[level];
> return result;
>
>
> The MS compiler complains that the expression is not a constant. And if we get the correct inlining the warning goes away.
>
> To get the required inlining I've changed the `level` parameter to be a template parameter and restructured the code to use template specialization instead of if-statements. This required some finagling with template classes to get partial specialization to work.
>
> Some other changes in the patch:
> * Limited includes of zIndexDistributor.inline.hpp
> * Extracted the atomic inc into zfetch_then_inc for easier evaluation of the performance impact of our Atomic implementation vs relaxed std::atomic vs non-atomic updates.
> * Hid the logging under a compile-time define. The logging is useful when changing/debugging this code, but otherwise it is not.
Given that we have replaced the last actively used instance of this class, this is not urgent to get integrated into JDK 25. It also adds even more complexity to a already somewhat complex class. If we start to use C++ 17 we might be able to leverage if constexpr and thereby lower the extra complexity that this patch adds via template specializations. Let's postpone this PR and see if we get to C++ 17 soon.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/25385#issuecomment-2940374196
More information about the hotspot-gc-dev
mailing list