G1 GC verification failure in production run

Ashutosh Mehra asmehra at redhat.com
Wed Jun 26 18:04:05 UTC 2024


Sanne (cc-ed here) reported (using internal communication) he was getting
random crashes in his setup that utilizes the premain feature when
running with G1 gc.
I was able to dig further into it and found the difference in heap size in
production run compared to the training run can cause g1 gc barriers to not
work properly, resulting in dangling references.
An easy way to reproduce is to run the quarkus-getting-started [0] app
using 1-step workflow and pass  -XX:+UseG1GC -Xmx128m -XX:+VerifyBeforeGC
option to the production run.
It results in gc verification failure:

[4.703s][error][gc,verify] ----------
[4.703s][error][gc,verify] Missing rem set entry:
[4.703s][error][gc,verify] Field 0x0000000087f47148 of obj
0x0000000087f47120 in region
125:(O)[0x0000000087f00000,0x0000000088000000,0x0000000088000000]
[4.703s][error][gc,verify] java.util.concurrent.ConcurrentHashMap
[4.703s][error][gc,verify] {0x0000000087f47120} - klass:
'java/util/concurrent/ConcurrentHashMap'
[4.703s][error][gc,verify]  - ---- fields (total size 8 words):
[4.703s][error][gc,verify]  - transient 'keySet' 'Ljava/util/Set;' @12
 null (0x00000000)
[4.703s][error][gc,verify]  - transient 'values' 'Ljava/util/Collection;'
@16  null (0x00000000)
[4.703s][error][gc,verify]  - private volatile transient 'sizeCtl' 'I' @20
 24 (0x00000018)
[4.703s][error][gc,verify]  - private volatile transient 'baseCount' 'J'
@24  18 (0x0000000000000012)
[4.703s][error][gc,verify]  - private volatile transient 'transferIndex'
'I' @32  0 (0x00000000)
[4.703s][error][gc,verify]  - private volatile transient 'cellsBusy' 'I'
@36  0 (0x00000000)
[4.703s][error][gc,verify]  - volatile transient 'table'
'[Ljava/util/concurrent/ConcurrentHashMap$Node;' @40  a
'java/util/concurrent/ConcurrentHashMap$Node'[32] {0x0000000087ce4208}
(0x00f9c841)
[4.703s][error][gc,verify]  - private volatile transient 'nextTable'
'[Ljava/util/concurrent/ConcurrentHashMap$Node;' @44  null (0x00000000)
[4.703s][error][gc,verify]  - private volatile transient 'counterCells'
'[Ljava/util/concurrent/ConcurrentHashMap$CounterCell;' @48  null
(0x00000000)
[4.703s][error][gc,verify]  - private transient 'keySet'
'Ljava/util/concurrent/ConcurrentHashMap$KeySetView;' @52  null (0x00000000)
[4.703s][error][gc,verify]  - private transient 'values'
'Ljava/util/concurrent/ConcurrentHashMap$ValuesView;' @56  null (0x00000000)
[4.703s][error][gc,verify]  - private transient 'entrySet'
'Ljava/util/concurrent/ConcurrentHashMap$EntrySetView;' @60  null
(0x00000000)
[4.703s][error][gc,verify] points to obj 0x0000000087ce4208 in region
122:(E)[0x0000000087c00000,0x0000000087d00000,0x0000000087d00000] remset
Complete
[4.703s][error][gc,verify] [Ljava.util.concurrent.ConcurrentHashMap$Node;
[4.703s][error][gc,verify] {0x0000000087ce4208} - klass:
'java/util/concurrent/ConcurrentHashMap$Node'[]
[4.703s][error][gc,verify]  - length: 32
[4.703s][error][gc,verify]  -   0 : null
[4.703s][error][gc,verify]  -   1 : null
[4.703s][error][gc,verify]  -   2 : null
[4.703s][error][gc,verify]  -   3 : null
[4.703s][error][gc,verify]  -   4 : a
'java/util/concurrent/ConcurrentHashMap$Node'{0x0000000087ce3df0}
[4.703s][error][gc,verify]  -   5 : a
'java/util/concurrent/ConcurrentHashMap$Node'{0x0000000087ce3ec0}
[4.703s][error][gc,verify]  -   6 : a
'java/util/concurrent/ConcurrentHashMap$Node'{0x0000000087ce44e8}
[4.703s][error][gc,verify]  -   7 : a
'java/util/concurrent/ConcurrentHashMap$Node'{0x0000000087ce4480}
[4.703s][error][gc,verify]  -   8 : a
'java/util/concurrent/ConcurrentHashMap$Node'{0x0000000087ce41e8}
[4.703s][error][gc,verify]  -   9 : null
[4.703s][error][gc,verify]  -  10 : null
[4.703s][error][gc,verify]  -  11 : a
'java/util/concurrent/ConcurrentHashMap$Node'{0x0000000087ce43b0}
[4.703s][error][gc,verify]  -  12 : null
[4.703s][error][gc,verify]  -  13 : a
'java/util/concurrent/ConcurrentHashMap$Node'{0x0000000087ce3e58}
[4.703s][error][gc,verify]  -  14 : null
[4.703s][error][gc,verify]  -  15 : null
[4.703s][error][gc,verify]  -  16 : a
'java/util/concurrent/ConcurrentHashMap$Node'{0x0000000087ce4418}
[4.703s][error][gc,verify]  -  17 : a
'java/util/concurrent/ConcurrentHashMap$Node'{0x0000000087ce4138}
[4.703s][error][gc,verify]  -  18 : null
[4.703s][error][gc,verify]  -  19 : a
'java/util/concurrent/ConcurrentHashMap$Node'{0x0000000087ce4088}
[4.703s][error][gc,verify]  -  20 : null
[4.703s][error][gc,verify]  -  21 : null
[4.703s][error][gc,verify]  -  22 : null
[4.703s][error][gc,verify]  -  23 : null
[4.703s][error][gc,verify]  -  24 : a
'java/util/concurrent/ConcurrentHashMap$Node'{0x0000000087ce42b8}
[4.703s][error][gc,verify]  -  25 : null
[4.703s][error][gc,verify]  -  26 : null
[4.703s][error][gc,verify]  -  27 : a
'java/util/concurrent/ConcurrentHashMap$Node'{0x0000000087ce41a0}
[4.703s][error][gc,verify]  -  28 : null
[4.703s][error][gc,verify]  -  29 : null
[4.703s][error][gc,verify]  -  30 : null
[4.703s][error][gc,verify]  -  31 : a
'java/util/concurrent/ConcurrentHashMap$Node'{0x0000000087ce3f70}
[4.703s][error][gc,verify] Obj head CV = 255, field CV = 255.
[4.703s][error][gc,verify] ----------
[4.723s][error][gc,verify] Heap after failed verification (kind 0):
[4.723s][error][gc,verify]  garbage-first heap   total reserved 131072K,
committed 131072K, used 24363K [0x0000000080200000, 0x0000000088200000)
[4.723s][error][gc,verify]   region size 1024K, 21 young (21504K), 0
survivors (0K)
[4.723s][error][gc,verify]  Metaspace       used 3891K, committed 4096K,
reserved 1114112K
[4.723s][error][gc,verify]   class space    used 63K, committed 192K,
reserved 1048576K

It turns out g1 barriers rely on G1HeapRegion::LogOfHRGrainBytes [1] [2] to
determine if a reference and the pointee both belong to the same heap
region or not.
If they are in the same region, then the card for the reference is not
marked dirty.
The value of G1HeapRegion::LogOfHRGrainBytes is embedded in the compiled
code generated by C1 and C2.

Interestingly G1HeapRegion::LogOfHRGrainBytes [3] is determined
ergonomically based on the heap size.
For instance on my system when Xmx is not specified,
G1HeapRegion::LogOfHRGrainBytes is 23 (for heap region size of 8M).
For Xmx128m, the G1HeapRegion::LogOfHRGrainBytes is 20 (for heap region
size of 1M).

So if Xmx is not specified in the training run then the AOT code would have
embedded 23 as the value for G1HeapRegion::LogOfHRGrainBytes.
During production run if Xmx128m is used, then the g1 barrier in the AOT
code could incorrectly determine that the reference and the pointee are in
the same region when they are not,
and would then skip marking the card as dirty, resulting in gc verification
failures seen above.

As a workaround for now, we can manually specify G1HeapRegionSize to be the
same in both training and production run, but ideally we shouldn't need to
do that.
For a long term solution Andrew Dinn suggested loading the value for
G1HeapRegion::LogOfHRGrainBytes from a thread field.
This way we can avoid patching the code or the penalty of a runtime call to
fetch the value.
Any other ideas on how to fix it?

[0]
https://github.com/openjdk/leyden/tree/premain/test/hotspot/jtreg/premain/quarkus-getting-started
[1]
https://github.com/openjdk/leyden/blob/d415c65b664cf31167bfd843dcf1805530b89e01/src/hotspot/share/gc/g1/c2/g1BarrierSetC2.cpp#L457
[2]
https://github.com/openjdk/leyden/blob/d415c65b664cf31167bfd843dcf1805530b89e01/src/hotspot/share/gc/g1/c1/g1BarrierSetC1.cpp#L161
[3]
 https://github.com/openjdk/leyden/blob/d415c65b664cf31167bfd843dcf1805530b89e01/src/hotspot/share/gc/g1/g1HeapRegion.cpp#L83
<https://github.com/openjdk/leyden/blob/d415c65b664cf31167bfd843dcf1805530b89e01/src/hotspot/share/gc/g1/g1HeapRegion.cpp#L83>

Thanks,
- Ashutosh Mehra
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/leyden-dev/attachments/20240626/ec699171/attachment.htm>


More information about the leyden-dev mailing list