Need help to fix a potential G1 crash in jdk11
Volker Simonis
volker.simonis at gmail.com
Wed Jun 3 18:18:38 UTC 2020
Unfortunately, "-XX:-ClassUnloading" doesn't help :(
I already saw two new crashes. The first one has 6 distinct Root locations
pointing to one dead object:
[863.222s][info ][gc,verify,start ] Verifying During GC (Remark after)
[863.222s][debug][gc,verify ] Threads
[863.224s][debug][gc,verify ] Heap
[863.224s][debug][gc,verify ] Roots
[863.229s][error][gc,verify ] Root location 0x00007f11719174e7
points to dead obj 0x00000000f956dbd8
[863.229s][error][gc,verify ]
org.antlr.v4.runtime.atn.PredictionContextCache
[863.229s][error][gc,verify ] {0x00000000f956dbd8} - klass:
'org/antlr/v4/runtime/atn/PredictionContextCache'
...
[863.229s][error][gc,verify ] Root location 0x00007f1171921978
points to dead obj 0x00000000f956dbd8
[863.229s][error][gc,verify ]
org.antlr.v4.runtime.atn.PredictionContextCache
[863.229s][error][gc,verify ] {0x00000000f956dbd8} - klass:
'org/antlr/v4/runtime/atn/PredictionContextCache'
[863.231s][debug][gc,verify ] HeapRegionSets
[863.231s][debug][gc,verify ] HeapRegions
[863.349s][error][gc,verify ] Heap after failed verification (kind
0):
The second crash has only two Root locations pointing to the same dead
object but more than 40_000 fields in distinct objects pointing to more
than 3_500 dead objects:
[854.473s][info ][gc,verify,start ] Verifying During GC (Remark after)
[854.473s][debug][gc,verify ] Threads
[854.475s][debug][gc,verify ] Heap
[854.475s][debug][gc,verify ] Roots
[854.479s][error][gc,verify ] Root location 0x00007f6e60461d5f
points to dead obj 0x00000000fa874528
[854.479s][error][gc,verify ]
org.antlr.v4.runtime.atn.PredictionContextCache
[854.479s][error][gc,verify ] {0x00000000fa874528} - klass:
'org/antlr/v4/runtime/atn/PredictionContextCache'
[854.479s][error][gc,verify ] Root location 0x00007f6e60461d6d
points to dead obj 0x00000000fa874528
[854.479s][error][gc,verify ]
org.antlr.v4.runtime.atn.PredictionContextCache
[854.479s][error][gc,verify ] {0x00000000fa874528} - klass:
'org/antlr/v4/runtime/atn/PredictionContextCache'
[854.479s][error][gc,verify ] Root location 0x00007f6e60462138
points to dead obj 0x00000000fa874528
[854.479s][error][gc,verify ]
org.antlr.v4.runtime.atn.PredictionContextCache
[854.479s][error][gc,verify ] {0x00000000fa874528} - klass:
'org/antlr/v4/runtime/atn/PredictionContextCache'
[854.482s][debug][gc,verify ] HeapRegionSets
[854.482s][debug][gc,verify ] HeapRegions
[854.484s][error][gc,verify ] ----------
[854.484s][error][gc,verify ] Field 0x00000000fd363c70 of live obj
0x00000000fd363c58 in region [0x00000000fd300000, 0x00000000fd400000)
[854.484s][error][gc,verify ] class name
org.antlr.v4.runtime.atn.ATNConfig
[854.484s][error][gc,verify ] points to dead obj 0x00000000fa88a540
in region [0x00000000fa800000, 0x00000000fa900000)
[854.484s][error][gc,verify ] class name
org.antlr.v4.runtime.atn.ArrayPredictionContext
[854.484s][error][gc,verify ] ----------
...
more than 40_000 fields in distinct objects pointing to more than 3_500
dead objects.
So how can this happen. Is "-XX:+VerifyAfterGC" really reliable here?
Thank you and best regards,
Volker
On Wed, Jun 3, 2020 at 7:14 PM Volker Simonis <volker.simonis at gmail.com>
wrote:
> Hi Erik,
>
> thanks a lot for the quick response and the hint with ClassUnloading. I've
> just started several runs of the test program with "-XX:-ClassUnloading".
> I'll report back instantly once I have some results.
>
> Best regards,
> Volker
>
> On Wed, Jun 3, 2020 at 5:26 PM Erik Österlund <erik.osterlund at oracle.com>
> wrote:
>
>> Hi Volker,
>>
>> In JDK 12, I changed quite a bit how G1 performs class unloading, to a
>> new model.
>> Since the verification runs just after class unloading, I guess it could
>> be interesting
>> to check if the error happens with -XX:-ClassUnloading as well. If not,
>> then perhaps
>> some of my class unloading changes for G1 in JDK 12 fixed the problem.
>>
>> Just a gut feeling...
>>
>> Thanks,
>> /Erik
>>
>> On 2020-06-03 17:02, Volker Simonis wrote:
>> > Hi,
>> >
>> > I would appreciate some help/advice for debugging a potential G1 crash
>> in
>> > jdk 11. The crash usually occurs when running a proprietary jar file for
>> > about 20-30 minutes and it happens in various parts of the VM (C1- or
>> > C2-compiled code, interpreter, GC). Because the crash locations are so
>> > different and because the customer which reported the issue claimed
>> that it
>> > doesn't happen with Parallel GC, I thought it might be a G1 issue. I
>> > couldn't reproduce the crash with jdk 12 and 14 (but with jdk 11 and
>> > 11.0.7, OpenJDK and Oracle JDK). When looking at the G1 changes in jdk
>> 12 I
>> > couldn't find any apparent bug fix which potentially solves this problem
>> > but it may have been solved by one of the many G1 changes which
>> happened in
>> > jdk 12.
>> >
>> > I did run the reproducer with "-XX:+UnlockDiagnosticVMOptions
>> > -XX:+VerifyBeforeGC -XX:+VerifyAfterGC -XX:+VerifyDuringGC
>> > -XX:+CheckJNICalls -XX:+G1VerifyRSetsDuringFullGC
>> > -XX:+G1VerifyHeapRegionCodeRoots" and I indeed got verification errors
>> (see
>> > [1] for a complete hs_err file). Sometimes it's just a few fields
>> pointing
>> > to dead objects:
>> >
>> > [1035.782s][error][gc,verify ] ----------
>> > [1035.782s][error][gc,verify ] Field 0x00000000fb509148 of live
>> obj
>> > 0x00000000fb509130 in region [0x00000000fb500000, 0x00000000fb600000)
>> > [1035.782s][error][gc,verify ] class name
>> > org.antlr.v4.runtime.atn.ATNConfig
>> > [1035.782s][error][gc,verify ] points to dead obj
>> > 0x00000000f9ba39b0 in region [0x00000000f9b00000, 0x00000000f9c00000)
>> > [1035.782s][error][gc,verify ] class name
>> > org.antlr.v4.runtime.atn.SingletonPredictionContext
>> > [1035.782s][error][gc,verify ] ----------
>> > [1035.783s][error][gc,verify ] Field 0x00000000fb509168 of live
>> obj
>> > 0x00000000fb509150 in region [0x00000000fb500000, 0x00000000fb600000)
>> > [1035.783s][error][gc,verify ] class name
>> > org.antlr.v4.runtime.atn.ATNConfig
>> > [1035.783s][error][gc,verify ] points to dead obj
>> > 0x00000000f9ba39b0 in region [0x00000000f9b00000, 0x00000000f9c00000)
>> > [1035.783s][error][gc,verify ] class name
>> > org.antlr.v4.runtime.atn.SingletonPredictionContext
>> > [1035.783s][error][gc,verify ] ----------
>> > ...
>> > [1043.928s][error][gc,verify ] Heap Regions: E=young(eden),
>> > S=young(survivor), O=old, HS=humongous(starts), HC=humongous(continues),
>> > CS=collection set, F=free, A=archive, TAMS=top-at-mark-start (previous,
>> > next)
>> > ...
>> > [1043.929s][error][gc,verify ] | 79|0x00000000f9b00000,
>> > 0x00000000f9bfffe8, 0x00000000f9c00000| 99%| O| |TAMS
>> 0x00000000f9bfffe8,
>> > 0x00000000f9b00000| Updating
>> > ...
>> > [1043.971s][error][gc,verify ] | 105|0x00000000fb500000,
>> > 0x00000000fb54fc08, 0x00000000fb600000| 31%| S|CS|TAMS
>> 0x00000000fb500000,
>> > 0x00000000fb500000| Complete
>> >
>> > but I also got verification errors with more than 30000 fields of
>> distinct
>> > objects pointing to more than 1000 dead objects. How can that happen? Is
>> > the verification always accurate or can this also be a problem with the
>> > verification itself and I'm hunting the wrong problem?
>> >
>> > Sometimes I also saw verification errors where fields point to objects
>> in
>> > regions with "Untracked remset":
>> >
>> > [673.762s][error][gc,verify] ----------
>> > [673.762s][error][gc,verify] Field 0x00000000fca49298 of live obj
>> > 0x00000000fca49280 in region [0x00000000fca0000
>> > 0, 0x00000000fcb00000)
>> > [673.762s][error][gc,verify] class name
>> org.antlr.v4.runtime.atn.ATNConfig
>> > [673.762s][error][gc,verify] points to obj 0x00000000f9d5a9a0 in region
>> > 81:(F)[0x00000000f9d00000,0x00000000f9d00000,0x00000000f9e00000] remset
>> > Untracked
>> > [673.762s][error][gc,verify] ----------
>> >
>> > But they are by far not that common like the pointers to dead objects.
>> Once
>> > I even saw a "Root location" pointing to a dead object:
>> >
>> > [369.808s][error][gc,verify] Root location 0x00007f35bb33f1f8 points to
>> > dead obj 0x00000000f87fa200
>> > [369.808s][error][gc,verify]
>> org.antlr.v4.runtime.atn.PredictionContextCache
>> > [369.808s][error][gc,verify] {0x00000000f87fa200} - klass:
>> > 'org/antlr/v4/runtime/atn/PredictionContextCache'
>> > [369.850s][error][gc,verify] ----------
>> > [369.850s][error][gc,verify] Field 0x00000000fbc60900 of live obj
>> > 0x00000000fbc608f0 in region [0x00000000fbc00000, 0x00000000fbd00000)
>> > [369.850s][error][gc,verify] class name
>> > org.antlr.v4.runtime.atn.ParserATNSimulator
>> > [369.850s][error][gc,verify] points to dead obj 0x00000000f87fa200 in
>> > region [0x00000000f8700000, 0x00000000f8800000)
>> > [369.850s][error][gc,verify] class name
>> > org.antlr.v4.runtime.atn.PredictionContextCache
>> > [369.850s][error][gc,verify] ----------
>> >
>> > All these verification errors occur after the Remark phase in
>> > G1ConcurrentMark::remark() at:
>> >
>> > verify_during_pause(G1HeapVerifier::G1VerifyRemark,
>> > VerifyOption_G1UsePrevMarking, "Remark after");
>> >
>> > V [libjvm.so+0x6ca186] report_vm_error(char const*, int, char const*,
>> > char const*, ...)+0x106
>> > V [libjvm.so+0x7d4a99] G1HeapVerifier::verify(VerifyOption)+0x399
>> > V [libjvm.so+0xe128bb] Universe::verify(VerifyOption, char
>> const*)+0x16b
>> > V [libjvm.so+0x7d44ee]
>> > G1HeapVerifier::verify(G1HeapVerifier::G1VerifyType, VerifyOption,
>> char
>> > const*)+0x9e
>> > V [libjvm.so+0x7addcf]
>> > G1ConcurrentMark::verify_during_pause(G1HeapVerifier::G1VerifyType,
>> > VerifyOption, char const*)+0x9f
>> > V [libjvm.so+0x7b172e] G1ConcurrentMark::remark()+0x3be
>> > V [libjvm.so+0xe6a5e1] VM_CGC_Operation::doit()+0x211
>> > V [libjvm.so+0xe69908] VM_Operation::evaluate()+0xd8
>> > V [libjvm.so+0xe6713f] VMThread::evaluate_operation(VM_Operation*)
>> [clone
>> > .constprop.54]+0xff
>> > V [libjvm.so+0xe6764e] VMThread::loop()+0x3be
>> > V [libjvm.so+0xe67a7b] VMThread::run()+0x7b
>> >
>> > The GC log output looks as follows:
>> > ...
>> > [1035.775s][info ][gc,verify,start ] Verifying During GC (Remark
>> after)
>> > [1035.775s][debug][gc,verify ] Threads
>> > [1035.776s][debug][gc,verify ] Heap
>> > [1035.776s][debug][gc,verify ] Roots
>> > [1035.782s][debug][gc,verify ] HeapRegionSets
>> > [1035.782s][debug][gc,verify ] HeapRegions
>> > [1035.782s][error][gc,verify ] ----------
>> > ...
>> > A more complete GC log can be found here [2].
>> >
>> > For the field 0x00000000fb509148 of live obj 0x00000000fb509130 which
>> > points to the dead object 0x00000000f9ba39b0 I get the following
>> > information if I inspect them with clhsdb:
>> >
>> > hsdb> inspect 0x00000000fb509130
>> > instance of Oop for org/antlr/v4/runtime/atn/ATNConfig @
>> 0x00000000fb509130
>> > @ 0x00000000fb509130 (size = 32)
>> > _mark: 13
>> > _metadata._compressed_klass: InstanceKlass for
>> > org/antlr/v4/runtime/atn/ATNConfig
>> > state: Oop for org/antlr/v4/runtime/atn/BasicState @ 0x00000000f83ecfa8
>> Oop
>> > for org/antlr/v4/runtime/atn/BasicState @ 0x00000000f83ecfa8
>> > alt: 1
>> > context: Oop for org/antlr/v4/runtime/atn/SingletonPredictionContext @
>> > 0x00000000f9ba39b0 Oop for
>> > org/antlr/v4/runtime/atn/SingletonPredictionContext @ 0x00000000f9ba39b0
>> > reachesIntoOuterContext: 8
>> > semanticContext: Oop for
>> org/antlr/v4/runtime/atn/SemanticContext$Predicate
>> > @ 0x00000000f83d57c0 Oop for
>> > org/antlr/v4/runtime/atn/SemanticContext$Predicate @ 0x00000000f83d57c0
>> >
>> > hsdb> inspect 0x00000000f9ba39b0
>> > instance of Oop for org/antlr/v4/runtime/atn/SingletonPredictionContext
>> @
>> > 0x00000000f9ba39b0 @ 0x00000000f9ba39b0 (size = 32)
>> > _mark: 41551306041
>> > _metadata._compressed_klass: InstanceKlass for
>> > org/antlr/v4/runtime/atn/SingletonPredictionContext
>> > id: 100635259
>> > cachedHashCode: 2005943142
>> > parent: Oop for org/antlr/v4/runtime/atn/SingletonPredictionContext @
>> > 0x00000000f9ba01b0 Oop for
>> > org/antlr/v4/runtime/atn/SingletonPredictionContext @ 0x00000000f9ba01b0
>> > returnState: 18228
>> >
>> > I could also reproduce the verification errors with a fast debug build
>> of
>> > 11.0.7 which I did run with "-XX:+CheckCompressedOops -XX:+VerifyOops
>> > -XX:+G1VerifyCTCleanup -XX:+G1VerifyBitmaps" in addition to the options
>> > mentioned before, but unfortunaltey the run didn't trigger neither an
>> > assertion nor a different verification error.
>> >
>> > So to summarize, my basic questions are:
>> > - has somebody else encountered similar crashes?
>> > - is someone aware of specific changes in jdk12 which might solve this
>> > problem?
>> > - are the verification errors I'm seeing accurate or is it possible
>> to get
>> > false positives when running with -XX:Verify{Before,During,After}GC ?
>> >
>> > Thanks for your patience,
>> > Volker
>> >
>> > [1]
>> >
>> http://cr.openjdk.java.net/~simonis/webrevs/2020/jdk11-g1-crash/hs_err_pid28294.log
>> > [2]
>> >
>> http://cr.openjdk.java.net/~simonis/webrevs/2020/jdk11-g1-crash/verify-error.log
>>
>>
More information about the hotspot-gc-dev
mailing list