RFR: 8351500: G1: NUMA migrations cause crashes in region allocation [v3]

Thomas Stuefe stuefe at openjdk.org
Thu Mar 13 11:07:21 UTC 2025


> For details, please see the JBS issue.
> 
> _Please note that this bug only shows symptoms in JDK 21 and JDK 17! Due to code shuffling done as part of G1 region-local pinning work, the error does not show in JDKs 22 and later._
> 
> I originally planned to fix this just for JDK 21 and 17 (see https://github.com/openjdk/jdk21u-dev/pull/1460). However, I would rather have it fixed in the mainline, even though it is symptom-free. It is a lingering issue that may surface later if the code is ever changed. Plus, this prevents the fix from being accidentally overwritten in JDK 21 if we backport.
> 
> ---
> 
> The fix is simple: we fix (hah) the NUMA association for the full duration of a heap allocation in G1. That way, regardless of the OS scheduler moving the thread to a different NUMA node, we always use the same `G1AllocRegion` object, and changes in the control flow that rely on that won't break on NUMA.
> 
> This has the disadvantage of allocating memory from a node we are potentially moving away from. However, I argue that this is exceedingly rare, and if it happens, the OS will cope by eventually migrating the memory to the correct node.
> 
> ---
> 
> Testing:
> 
> Testing is difficult; see the remark in the JBS issue.
> 
> I tested a modified version of this patch on JDK 21, where the error does cause crashes. I tested with an additional patch mimicking tons of NUMA node migrations. As I wrote in JBS, I plan to contribute that "FakeNUMA" mode eventually, but lack the time to polish it up for now. I hope the fix is simple and uncontested enough to go in quickly, since I would like to fix JDK 21 soon via backporting this patch.
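
As an illustration of the fix described in the quoted text above (pinning the NUMA association for the whole duration of one allocation), here is a minimal, self-contained C++ sketch. It is not the HotSpot code: `AllocRegion`, `FakeNuma`, and `Allocator` are hypothetical stand-ins invented for this example, and the real G1 allocation paths are considerably more involved. The only point it tries to show is that the node index is sampled once and then passed explicitly to every step, so the fast path and the slow path always operate on the same per-node region even if the thread migrates in between.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a per-node allocation region (not G1's G1AllocRegion).
struct AllocRegion {
  std::size_t capacity;
  std::size_t used;
  explicit AllocRegion(std::size_t cap) : capacity(cap), used(0) {}

  bool try_allocate(std::size_t words) {
    if (used + words > capacity) return false;
    used += words;
    return true;
  }
  // Simulates retiring the full region and installing a fresh, empty one.
  void retire_and_refill() { used = 0; }
};

// Hypothetical stand-in for the OS scheduler: every query reports the next
// node, mimicking a thread that migrates between any two queries.
struct FakeNuma {
  unsigned num_nodes;
  unsigned next;
  explicit FakeNuma(unsigned nodes) : num_nodes(nodes), next(0) {}

  unsigned current_node() {
    unsigned n = next;
    next = (next + 1) % num_nodes;
    return n;
  }
};

struct Allocator {
  std::vector<AllocRegion> regions;  // one region per NUMA node
  FakeNuma numa;

  Allocator(unsigned nodes, std::size_t region_words)
    : regions(nodes, AllocRegion(region_words)), numa(nodes) {}

  // Fast path: allocate from the region of the node chosen by the caller.
  bool attempt_allocation(unsigned node_index, std::size_t words) {
    assert(node_index < regions.size());
    return regions[node_index].try_allocate(words);
  }

  // Slow path: refill the *same* node's region, then retry.
  bool attempt_allocation_slow(unsigned node_index, std::size_t words) {
    assert(node_index < regions.size());
    AllocRegion& region = regions[node_index];
    region.retire_and_refill();
    return region.try_allocate(words);
  }

  // Returns the node the allocation landed on, or -1 on failure.
  int allocate(std::size_t words) {
    // Sample the node exactly once; never re-query during this allocation.
    const unsigned node_index = numa.current_node();
    if (attempt_allocation(node_index, words)) return static_cast<int>(node_index);
    if (attempt_allocation_slow(node_index, words)) return static_cast<int>(node_index);
    return -1;
  }
};
```

If `attempt_allocation_slow` queried `numa.current_node()` itself instead of taking `node_index` as a parameter, it could refill one node's region while the fast path had just failed on another's. That is the kind of mismatch the sketch is meant to rule out.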

Thomas Stuefe has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision:

 - revert blank line change
 - node_index preceding sizes
 - Merge branch 'master' into JDK-8351630-Fix-NUMA-association-for-the-duration-of-a-single-G1-Heap-allocation
 - node_index parameter should precede output parameters
 - start

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/23984/files
  - new: https://git.openjdk.org/jdk/pull/23984/files/c8870820..578002e0

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=23984&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23984&range=01-02

  Stats: 18976 lines in 265 files changed: 7389 ins; 10345 del; 1242 mod
  Patch: https://git.openjdk.org/jdk/pull/23984.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23984/head:pull/23984

PR: https://git.openjdk.org/jdk/pull/23984

