Need help to fix a potential G1 crash in jdk11

Volker Simonis volker.simonis at gmail.com
Wed Jun 3 15:02:52 UTC 2020


Hi,

I would appreciate some help/advice for debugging a potential G1 crash in
jdk 11. The crash usually occurs when running a proprietary jar file for
about 20-30 minutes and it happens in various parts of the VM (C1- or
C2-compiled code, interpreter, GC). Because the crash locations are so
different and because the customer which reported the issue claimed that it
doesn't happen with Parallel GC, I thought it might be a G1 issue. I
couldn't reproduce the crash with jdk 12 and 14 (but with jdk 11 and
11.0.7, OpenJDK and Oracle JDK). When looking at the G1 changes in jdk 12 I
couldn't find any apparent bug fix which potentially solves this problem
but it may have been solved by one of the many G1 changes which happened in
jdk 12.

I did run the reproducer with "-XX:+UnlockDiagnosticVMOptions
-XX:+VerifyBeforeGC -XX:+VerifyAfterGC -XX:+VerifyDuringGC
-XX:+CheckJNICalls -XX:+G1VerifyRSetsDuringFullGC
-XX:+G1VerifyHeapRegionCodeRoots" and I indeed got verification errors (see
[1] for a complete hs_err file). Sometimes it's just a few fields pointing
to dead objects:

[1035.782s][error][gc,verify         ] ----------
[1035.782s][error][gc,verify         ] Field 0x00000000fb509148 of live obj
0x00000000fb509130 in region [0x00000000fb500000, 0x00000000fb600000)
[1035.782s][error][gc,verify         ] class name
org.antlr.v4.runtime.atn.ATNConfig
[1035.782s][error][gc,verify         ] points to dead obj
0x00000000f9ba39b0 in region [0x00000000f9b00000, 0x00000000f9c00000)
[1035.782s][error][gc,verify         ] class name
org.antlr.v4.runtime.atn.SingletonPredictionContext
[1035.782s][error][gc,verify         ] ----------
[1035.783s][error][gc,verify         ] Field 0x00000000fb509168 of live obj
0x00000000fb509150 in region [0x00000000fb500000, 0x00000000fb600000)
[1035.783s][error][gc,verify         ] class name
org.antlr.v4.runtime.atn.ATNConfig
[1035.783s][error][gc,verify         ] points to dead obj
0x00000000f9ba39b0 in region [0x00000000f9b00000, 0x00000000f9c00000)
[1035.783s][error][gc,verify         ] class name
org.antlr.v4.runtime.atn.SingletonPredictionContext
[1035.783s][error][gc,verify         ] ----------
...
[1043.928s][error][gc,verify         ] Heap Regions: E=young(eden),
S=young(survivor), O=old, HS=humongous(starts), HC=humongous(continues),
CS=collection set, F=free, A=archive, TAMS=top-at-mark-start (previous,
next)
...
[1043.929s][error][gc,verify         ] |  79|0x00000000f9b00000,
0x00000000f9bfffe8, 0x00000000f9c00000| 99%| O|  |TAMS 0x00000000f9bfffe8,
0x00000000f9b00000| Updating
...
[1043.971s][error][gc,verify         ] | 105|0x00000000fb500000,
0x00000000fb54fc08, 0x00000000fb600000| 31%| S|CS|TAMS 0x00000000fb500000,
0x00000000fb500000| Complete

but I also got verification errors with more than 30000 fields of distinct
objects pointing to more than 1000 dead objects. How can that happen? Is
the verification always accurate or can this also be a problem with the
verification itself and I'm hunting the wrong problem?

Sometimes I also saw verification errors where fields point to objects in
regions with "Untracked remset":

[673.762s][error][gc,verify] ----------
[673.762s][error][gc,verify] Field 0x00000000fca49298 of live obj
0x00000000fca49280 in region [0x00000000fca0000
0, 0x00000000fcb00000)
[673.762s][error][gc,verify] class name org.antlr.v4.runtime.atn.ATNConfig
[673.762s][error][gc,verify] points to obj 0x00000000f9d5a9a0 in region
81:(F)[0x00000000f9d00000,0x00000000f9d00000,0x00000000f9e00000] remset
Untracked
[673.762s][error][gc,verify] ----------

But they are by far not that common like the pointers to dead objects. Once
I even saw a "Root location" pointing to a dead object:

[369.808s][error][gc,verify] Root location 0x00007f35bb33f1f8 points to
dead obj 0x00000000f87fa200
[369.808s][error][gc,verify] org.antlr.v4.runtime.atn.PredictionContextCache
[369.808s][error][gc,verify] {0x00000000f87fa200} - klass:
'org/antlr/v4/runtime/atn/PredictionContextCache'
[369.850s][error][gc,verify] ----------
[369.850s][error][gc,verify] Field 0x00000000fbc60900 of live obj
0x00000000fbc608f0 in region [0x00000000fbc00000, 0x00000000fbd00000)
[369.850s][error][gc,verify] class name
org.antlr.v4.runtime.atn.ParserATNSimulator
[369.850s][error][gc,verify] points to dead obj 0x00000000f87fa200 in
region [0x00000000f8700000, 0x00000000f8800000)
[369.850s][error][gc,verify] class name
org.antlr.v4.runtime.atn.PredictionContextCache
[369.850s][error][gc,verify] ----------

All these verification errors occur after the Remark phase in
G1ConcurrentMark::remark() at:

verify_during_pause(G1HeapVerifier::G1VerifyRemark,
VerifyOption_G1UsePrevMarking, "Remark after");

V  [libjvm.so+0x6ca186]  report_vm_error(char const*, int, char const*,
char const*, ...)+0x106
V  [libjvm.so+0x7d4a99]  G1HeapVerifier::verify(VerifyOption)+0x399
V  [libjvm.so+0xe128bb]  Universe::verify(VerifyOption, char const*)+0x16b
V  [libjvm.so+0x7d44ee]
 G1HeapVerifier::verify(G1HeapVerifier::G1VerifyType, VerifyOption, char
const*)+0x9e
V  [libjvm.so+0x7addcf]
 G1ConcurrentMark::verify_during_pause(G1HeapVerifier::G1VerifyType,
VerifyOption, char const*)+0x9f
V  [libjvm.so+0x7b172e]  G1ConcurrentMark::remark()+0x3be
V  [libjvm.so+0xe6a5e1]  VM_CGC_Operation::doit()+0x211
V  [libjvm.so+0xe69908]  VM_Operation::evaluate()+0xd8
V  [libjvm.so+0xe6713f]  VMThread::evaluate_operation(VM_Operation*) [clone
.constprop.54]+0xff
V  [libjvm.so+0xe6764e]  VMThread::loop()+0x3be
V  [libjvm.so+0xe67a7b]  VMThread::run()+0x7b

The GC log output looks as follows:
...
[1035.775s][info ][gc,verify,start   ] Verifying During GC (Remark after)
[1035.775s][debug][gc,verify         ] Threads
[1035.776s][debug][gc,verify         ] Heap
[1035.776s][debug][gc,verify         ] Roots
[1035.782s][debug][gc,verify         ] HeapRegionSets
[1035.782s][debug][gc,verify         ] HeapRegions
[1035.782s][error][gc,verify         ] ----------
...
A more complete GC log can be found here [2].

For the field 0x00000000fb509148 of live obj 0x00000000fb509130 which
points to the dead object 0x00000000f9ba39b0 I get the following
information if I inspect them with clhsdb:

hsdb> inspect 0x00000000fb509130
instance of Oop for org/antlr/v4/runtime/atn/ATNConfig @ 0x00000000fb509130
@ 0x00000000fb509130 (size = 32)
_mark: 13
_metadata._compressed_klass: InstanceKlass for
org/antlr/v4/runtime/atn/ATNConfig
state: Oop for org/antlr/v4/runtime/atn/BasicState @ 0x00000000f83ecfa8 Oop
for org/antlr/v4/runtime/atn/BasicState @ 0x00000000f83ecfa8
alt: 1
context: Oop for org/antlr/v4/runtime/atn/SingletonPredictionContext @
0x00000000f9ba39b0 Oop for
org/antlr/v4/runtime/atn/SingletonPredictionContext @ 0x00000000f9ba39b0
reachesIntoOuterContext: 8
semanticContext: Oop for org/antlr/v4/runtime/atn/SemanticContext$Predicate
@ 0x00000000f83d57c0 Oop for
org/antlr/v4/runtime/atn/SemanticContext$Predicate @ 0x00000000f83d57c0

hsdb> inspect 0x00000000f9ba39b0
instance of Oop for org/antlr/v4/runtime/atn/SingletonPredictionContext @
0x00000000f9ba39b0 @ 0x00000000f9ba39b0 (size = 32)
_mark: 41551306041
_metadata._compressed_klass: InstanceKlass for
org/antlr/v4/runtime/atn/SingletonPredictionContext
id: 100635259
cachedHashCode: 2005943142
parent: Oop for org/antlr/v4/runtime/atn/SingletonPredictionContext @
0x00000000f9ba01b0 Oop for
org/antlr/v4/runtime/atn/SingletonPredictionContext @ 0x00000000f9ba01b0
returnState: 18228

I could also reproduce the verification errors with a fast debug build of
11.0.7 which I did run with "-XX:+CheckCompressedOops -XX:+VerifyOops
-XX:+G1VerifyCTCleanup -XX:+G1VerifyBitmaps" in addition to the options
mentioned before, but unfortunaltey the run didn't trigger neither an
assertion nor a different verification error.

So to summarize, my basic questions are:
 - has somebody else encountered similar crashes?
 - is someone aware of specific changes in jdk12 which might solve this
problem?
 - are the verification errors I'm seeing accurate or is it possible to get
false positives when running with -XX:Verify{Before,During,After}GC ?

Thanks for your patience,
Volker

[1]
http://cr.openjdk.java.net/~simonis/webrevs/2020/jdk11-g1-crash/hs_err_pid28294.log
[2]
http://cr.openjdk.java.net/~simonis/webrevs/2020/jdk11-g1-crash/verify-error.log



More information about the hotspot-gc-dev mailing list