RFR: 8351500: G1: NUMA migrations cause crashes in region allocation [v2]

Thomas Stuefe stuefe at openjdk.org
Wed Mar 12 13:33:55 UTC 2025


On Wed, 12 Mar 2025 13:27:17 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:

>> For details, please see JBS issue.
>> 
>> _Please note that this bug only shows symptoms in JDK 21 and JDK 17! Due to code shuffling done as part of G1 region-local pinning work, the error does not show in JDKs 22 and later._
>> 
>> I originally planned to fix this just for JDK 21 and 17 (see https://github.com/openjdk/jdk21u-dev/pull/1460). However, I would rather have it fixed in the mainline, even though it is symptom-free. It is a lingering issue that may surface later if the code is ever changed. Plus, this prevents the fix from being accidentally overwritten in JDK 21 if we backport.
>> 
>> ----
>> 
>> The fix is simple: we fix (hah) the NUMA association for the full duration of a heap allocation in G1. That way, regardless of the OS scheduler moving the thread to a different NUMA node, we always use the same `G1AllocRegion` object, and changes in the control flow that rely on that won't break on NUMA.
>> 
>> This has the disadvantage of allocating memory from a node we are potentially moving away from. However, I argue that this is exceedingly rare, and if it happens, the OS will cope by eventually migrating the memory to the correct node.
>> 
>> ---
>> 
>> Testing:
>> 
>> Testing is difficult. See remark in JBS issue. 
>> 
>> I tested a modified version of this patch on JDK 21, where the error does cause crashes. I tested with an additional patch mimicking tons of NUMA node migrations. As I wrote in JBS, I plan to contribute that "FakeNUMA" mode eventually, but lack the time to polish it up for now. I hope the fix is simple and uncontested enough to go in quickly, since I would like to fix JDK 21 soon via backporting this patch.
>
> Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision:
> 
>   node_index parameter should precede output parameters

@tschatzl Thanks for looking at this. I tried to find a minimally invasive way to do the struct solution, but its not so easy. I want to keep the patch somewhat minimal and focused on the problem to solve, also to be able to downport it to 21 easily. So I did a compromise, moving node_index to a unified position in G1Allocator and, where necessary, G1CollectedHeap, while not touching other uses of this parameter.

While working on this, I found that the ambiguity between size_t and uint is annoying when dealing with parameters that are a mixture of those; there is not much type safety when using these raw types.

We have three different parameter groupings: {size, node_index}, {desired, min, node_index, output size} and {desired, min, output size}. Having a structure using all three does not do much for clarity (some members make no sense in some cases etc).

What I would do if I were to rework this:
- add a type for node index, ideally something that cannot be converted automatically to a size_t, but something that can be handed around via value. E.g. a `struct NodeIndex { uint _v; };`. It could be homed in g1NUMA.hpp.
- then add a struct for the typical min+desired+actual parameter group, three size_t.

I played around with this but the patch got far too big for this simple crash fix. Would be something for a cleanup RFE.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23984#issuecomment-2717898769


More information about the hotspot-gc-dev mailing list