Need help to fix a potential G1 crash in jdk11

Volker Simonis volker.simonis at gmail.com
Mon Jun 15 10:19:04 UTC 2020


Hi Poonam,

thanks for your assistance. Unfortunately, -XX:+VerifyRememberedSets
doesn't provide any additional information. I still get the
verification errors at "Verifying During GC (Remark after)"

Best regards,
Volker

On Mon, Jun 8, 2020 at 4:57 PM Poonam Parhar <poonam.bajaj at oracle.com> wrote:
>
> Hi Volker,
>
> Did you try running with -XX:+VerifyRememberedSets? This might tell if
> the problem is related to remset updates.
>
> Thanks,
> Poonam
>
> On 6/5/20 1:55 AM, Erik Österlund wrote:
> > Hi Volker,
> >
> > On 2020-06-03 20:18, Volker Simonis wrote:
> >> Unfortunately, "-XX:-ClassUnloading" doesn't help :(
> >
> > I am actually happy that did not help. I suspect a bug in that code
> > would be harder to track down; it is rather complicated.
> >
> >> I already saw two new crashes. The first one has 6 distinct Root
> >> locations pointing to one dead object:
> >>
> >> [863.222s][info ][gc,verify,start   ] Verifying During GC (Remark after)
> >> [863.222s][debug][gc,verify         ] Threads
> >> [863.224s][debug][gc,verify         ] Heap
> >> [863.224s][debug][gc,verify         ] Roots
> >> [863.229s][error][gc,verify         ] Root location
> >> 0x00007f11719174e7 points to dead obj 0x00000000f956dbd8
> >> [863.229s][error][gc,verify         ]
> >> org.antlr.v4.runtime.atn.PredictionContextCache
> >> [863.229s][error][gc,verify         ] {0x00000000f956dbd8} - klass:
> >> 'org/antlr/v4/runtime/atn/PredictionContextCache'
> >> ...
> >> [863.229s][error][gc,verify         ] Root location
> >> 0x00007f1171921978 points to dead obj 0x00000000f956dbd8
> >> [863.229s][error][gc,verify         ]
> >> org.antlr.v4.runtime.atn.PredictionContextCache
> >> [863.229s][error][gc,verify         ] {0x00000000f956dbd8} - klass:
> >> 'org/antlr/v4/runtime/atn/PredictionContextCache'
> >> [863.231s][debug][gc,verify         ] HeapRegionSets
> >> [863.231s][debug][gc,verify         ] HeapRegions
> >> [863.349s][error][gc,verify         ] Heap after failed verification
> >> (kind 0):
> >>
> >> The second crash has only two Root locations pointing to the same
> >> dead object but more than 40_000 fields in distinct objects pointing
> >> to more than 3_500 dead objects:
> >>
> >> [854.473s][info ][gc,verify,start   ] Verifying During GC (Remark after)
> >> [854.473s][debug][gc,verify         ] Threads
> >> [854.475s][debug][gc,verify         ] Heap
> >> [854.475s][debug][gc,verify         ] Roots
> >> [854.479s][error][gc,verify         ] Root location
> >> 0x00007f6e60461d5f points to dead obj 0x00000000fa874528
> >> [854.479s][error][gc,verify         ]
> >> org.antlr.v4.runtime.atn.PredictionContextCache
> >> [854.479s][error][gc,verify         ] {0x00000000fa874528} - klass:
> >> 'org/antlr/v4/runtime/atn/PredictionContextCache'
> >> [854.479s][error][gc,verify         ] Root location
> >> 0x00007f6e60461d6d points to dead obj 0x00000000fa874528
> >> [854.479s][error][gc,verify         ]
> >> org.antlr.v4.runtime.atn.PredictionContextCache
> >> [854.479s][error][gc,verify         ] {0x00000000fa874528} - klass:
> >> 'org/antlr/v4/runtime/atn/PredictionContextCache'
> >> [854.479s][error][gc,verify         ] Root location
> >> 0x00007f6e60462138 points to dead obj 0x00000000fa874528
> >> [854.479s][error][gc,verify         ]
> >> org.antlr.v4.runtime.atn.PredictionContextCache
> >> [854.479s][error][gc,verify         ] {0x00000000fa874528} - klass:
> >> 'org/antlr/v4/runtime/atn/PredictionContextCache'
> >> [854.482s][debug][gc,verify         ] HeapRegionSets
> >> [854.482s][debug][gc,verify         ] HeapRegions
> >> [854.484s][error][gc,verify         ] ----------
> >> [854.484s][error][gc,verify         ] Field 0x00000000fd363c70 of
> >> live obj 0x00000000fd363c58 in region [0x00000000fd300000,
> >> 0x00000000fd400000)
> >> [854.484s][error][gc,verify         ] class name
> >> org.antlr.v4.runtime.atn.ATNConfig
> >> [854.484s][error][gc,verify         ] points to dead obj
> >> 0x00000000fa88a540 in region [0x00000000fa800000, 0x00000000fa900000)
> >> [854.484s][error][gc,verify         ] class name
> >> org.antlr.v4.runtime.atn.ArrayPredictionContext
> >> [854.484s][error][gc,verify         ] ----------
> >> ...
> >> more than 40_000 fields in distinct objects pointing to more than
> >> 3_500 dead objects.
> >>
> >> So how can this happen. Is "-XX:+VerifyAfterGC" really reliable here?
> >
> > Naturally, it's hard to tell for definite what the issue is with only
> > these printouts.
> > However, we can make some observations from the printouts:
> >
> > Based on the address values of the "Root location" of the printouts,
> > each dead object
> > reported is pointed at from at least one misaligned oop. The only
> > misaligned oops in
> > HotSpot are nmethod oops embedded into the instruction stream as
> > immediates.
> > So this smells like some kind of nmethod oop processing bug in G1 to me.
> >
> > The Abortable Mixed GCs (https://openjdk.java.net/jeps/344) that went
> > into 12 changed
> > quite a bit of the nmethod oop scanning code. Perhaps the reason why
> > this stopped
> > reproducing in 12 is related to that. The nmethod oop processing code
> > introduced with
> > AMGC actually had a word tearing problem for nmethod oops, which was
> > fixed later with
> > https://bugs.openjdk.java.net/browse/JDK-8235305
> >
> > Hope these pointers help.
> >
> > /Erik
> >
> >> Thank you and best regards,
> >> Volker
> >>
> >>
> >> On Wed, Jun 3, 2020 at 7:14 PM Volker Simonis
> >> <volker.simonis at gmail.com <mailto:volker.simonis at gmail.com>> wrote:
> >>
> >>     Hi Erik,
> >>
> >>     thanks a lot for the quick response and the hint with
> >>     ClassUnloading. I've just started several runs of the test program
> >>     with "-XX:-ClassUnloading". I'll report back instantly once I have
> >>     some results.
> >>
> >>     Best regards,
> >>     Volker
> >>
> >>     On Wed, Jun 3, 2020 at 5:26 PM Erik Österlund
> >>     <erik.osterlund at oracle.com <mailto:erik.osterlund at oracle.com>>
> >> wrote:
> >>
> >>         Hi Volker,
> >>
> >>         In JDK 12, I changed quite a bit how G1 performs class
> >>         unloading, to a
> >>         new model.
> >>         Since the verification runs just after class unloading, I
> >>         guess it could
> >>         be interesting
> >>         to check if the error happens with -XX:-ClassUnloading as
> >>         well. If not,
> >>         then perhaps
> >>         some of my class unloading changes for G1 in JDK 12 fixed the
> >>         problem.
> >>
> >>         Just a gut feeling...
> >>
> >>         Thanks,
> >>         /Erik
> >>
> >>         On 2020-06-03 17:02, Volker Simonis wrote:
> >>         > Hi,
> >>         >
> >>         > I would appreciate some help/advice for debugging a
> >>         potential G1 crash in
> >>         > jdk 11. The crash usually occurs when running a proprietary
> >>         jar file for
> >>         > about 20-30 minutes and it happens in various parts of the
> >>         VM (C1- or
> >>         > C2-compiled code, interpreter, GC). Because the crash
> >>         locations are so
> >>         > different and because the customer which reported the issue
> >>         claimed that it
> >>         > doesn't happen with Parallel GC, I thought it might be a G1
> >>         issue. I
> >>         > couldn't reproduce the crash with jdk 12 and 14 (but with
> >>         jdk 11 and
> >>         > 11.0.7, OpenJDK and Oracle JDK). When looking at the G1
> >>         changes in jdk 12 I
> >>         > couldn't find any apparent bug fix which potentially solves
> >>         this problem
> >>         > but it may have been solved by one of the many G1 changes
> >>         which happened in
> >>         > jdk 12.
> >>         >
> >>         > I did run the reproducer with "-XX:+UnlockDiagnosticVMOptions
> >>         > -XX:+VerifyBeforeGC -XX:+VerifyAfterGC -XX:+VerifyDuringGC
> >>         > -XX:+CheckJNICalls -XX:+G1VerifyRSetsDuringFullGC
> >>         > -XX:+G1VerifyHeapRegionCodeRoots" and I indeed got
> >>         verification errors (see
> >>         > [1] for a complete hs_err file). Sometimes it's just a few
> >>         fields pointing
> >>         > to dead objects:
> >>         >
> >>         > [1035.782s][error][gc,verify         ] ----------
> >>         > [1035.782s][error][gc,verify         ] Field
> >>         0x00000000fb509148 of live obj
> >>         > 0x00000000fb509130 in region [0x00000000fb500000,
> >>         0x00000000fb600000)
> >>         > [1035.782s][error][gc,verify         ] class name
> >>         > org.antlr.v4.runtime.atn.ATNConfig
> >>         > [1035.782s][error][gc,verify         ] points to dead obj
> >>         > 0x00000000f9ba39b0 in region [0x00000000f9b00000,
> >>         0x00000000f9c00000)
> >>         > [1035.782s][error][gc,verify         ] class name
> >>         > org.antlr.v4.runtime.atn.SingletonPredictionContext
> >>         > [1035.782s][error][gc,verify         ] ----------
> >>         > [1035.783s][error][gc,verify         ] Field
> >>         0x00000000fb509168 of live obj
> >>         > 0x00000000fb509150 in region [0x00000000fb500000,
> >>         0x00000000fb600000)
> >>         > [1035.783s][error][gc,verify         ] class name
> >>         > org.antlr.v4.runtime.atn.ATNConfig
> >>         > [1035.783s][error][gc,verify         ] points to dead obj
> >>         > 0x00000000f9ba39b0 in region [0x00000000f9b00000,
> >>         0x00000000f9c00000)
> >>         > [1035.783s][error][gc,verify         ] class name
> >>         > org.antlr.v4.runtime.atn.SingletonPredictionContext
> >>         > [1035.783s][error][gc,verify         ] ----------
> >>         > ...
> >>         > [1043.928s][error][gc,verify         ] Heap Regions:
> >>         E=young(eden),
> >>         > S=young(survivor), O=old, HS=humongous(starts),
> >>         HC=humongous(continues),
> >>         > CS=collection set, F=free, A=archive, TAMS=top-at-mark-start
> >>         (previous,
> >>         > next)
> >>         > ...
> >>         > [1043.929s][error][gc,verify         ] |
> >> 79|0x00000000f9b00000,
> >>         > 0x00000000f9bfffe8, 0x00000000f9c00000| 99%| O| |TAMS
> >>         0x00000000f9bfffe8,
> >>         > 0x00000000f9b00000| Updating
> >>         > ...
> >>         > [1043.971s][error][gc,verify         ] |
> >> 105|0x00000000fb500000,
> >>         > 0x00000000fb54fc08, 0x00000000fb600000| 31%| S|CS|TAMS
> >>         0x00000000fb500000,
> >>         > 0x00000000fb500000| Complete
> >>         >
> >>         > but I also got verification errors with more than 30000
> >>         fields of distinct
> >>         > objects pointing to more than 1000 dead objects. How can
> >>         that happen? Is
> >>         > the verification always accurate or can this also be a
> >>         problem with the
> >>         > verification itself and I'm hunting the wrong problem?
> >>         >
> >>         > Sometimes I also saw verification errors where fields point
> >>         to objects in
> >>         > regions with "Untracked remset":
> >>         >
> >>         > [673.762s][error][gc,verify] ----------
> >>         > [673.762s][error][gc,verify] Field 0x00000000fca49298 of
> >>         live obj
> >>         > 0x00000000fca49280 in region [0x00000000fca0000
> >>         > 0, 0x00000000fcb00000)
> >>         > [673.762s][error][gc,verify] class name
> >>         org.antlr.v4.runtime.atn.ATNConfig
> >>         > [673.762s][error][gc,verify] points to obj
> >>         0x00000000f9d5a9a0 in region
> >>         >
> >> 81:(F)[0x00000000f9d00000,0x00000000f9d00000,0x00000000f9e00000]
> >>         remset
> >>         > Untracked
> >>         > [673.762s][error][gc,verify] ----------
> >>         >
> >>         > But they are by far not that common like the pointers to
> >>         dead objects. Once
> >>         > I even saw a "Root location" pointing to a dead object:
> >>         >
> >>         > [369.808s][error][gc,verify] Root location
> >>         0x00007f35bb33f1f8 points to
> >>         > dead obj 0x00000000f87fa200
> >>         > [369.808s][error][gc,verify]
> >>         org.antlr.v4.runtime.atn.PredictionContextCache
> >>         > [369.808s][error][gc,verify] {0x00000000f87fa200} - klass:
> >>         > 'org/antlr/v4/runtime/atn/PredictionContextCache'
> >>         > [369.850s][error][gc,verify] ----------
> >>         > [369.850s][error][gc,verify] Field 0x00000000fbc60900 of
> >>         live obj
> >>         > 0x00000000fbc608f0 in region [0x00000000fbc00000,
> >>         0x00000000fbd00000)
> >>         > [369.850s][error][gc,verify] class name
> >>         > org.antlr.v4.runtime.atn.ParserATNSimulator
> >>         > [369.850s][error][gc,verify] points to dead obj
> >>         0x00000000f87fa200 in
> >>         > region [0x00000000f8700000, 0x00000000f8800000)
> >>         > [369.850s][error][gc,verify] class name
> >>         > org.antlr.v4.runtime.atn.PredictionContextCache
> >>         > [369.850s][error][gc,verify] ----------
> >>         >
> >>         > All these verification errors occur after the Remark phase in
> >>         > G1ConcurrentMark::remark() at:
> >>         >
> >>         > verify_during_pause(G1HeapVerifier::G1VerifyRemark,
> >>         > VerifyOption_G1UsePrevMarking, "Remark after");
> >>         >
> >>         > V  [libjvm.so+0x6ca186]  report_vm_error(char const*, int,
> >>         char const*,
> >>         > char const*, ...)+0x106
> >>         > V  [libjvm.so+0x7d4a99]
> >>         G1HeapVerifier::verify(VerifyOption)+0x399
> >>         > V  [libjvm.so+0xe128bb] Universe::verify(VerifyOption, char
> >>         const*)+0x16b
> >>         > V  [libjvm.so+0x7d44ee]
> >>         >  G1HeapVerifier::verify(G1HeapVerifier::G1VerifyType,
> >>         VerifyOption, char
> >>         > const*)+0x9e
> >>         > V  [libjvm.so+0x7addcf]
> >>         >
> >>  G1ConcurrentMark::verify_during_pause(G1HeapVerifier::G1VerifyType,
> >>         > VerifyOption, char const*)+0x9f
> >>         > V  [libjvm.so+0x7b172e] G1ConcurrentMark::remark()+0x3be
> >>         > V  [libjvm.so+0xe6a5e1] VM_CGC_Operation::doit()+0x211
> >>         > V  [libjvm.so+0xe69908] VM_Operation::evaluate()+0xd8
> >>         > V  [libjvm.so+0xe6713f]
> >>         VMThread::evaluate_operation(VM_Operation*) [clone
> >>         > .constprop.54]+0xff
> >>         > V  [libjvm.so+0xe6764e]  VMThread::loop()+0x3be
> >>         > V  [libjvm.so+0xe67a7b]  VMThread::run()+0x7b
> >>         >
> >>         > The GC log output looks as follows:
> >>         > ...
> >>         > [1035.775s][info ][gc,verify,start   ] Verifying During GC
> >>         (Remark after)
> >>         > [1035.775s][debug][gc,verify         ] Threads
> >>         > [1035.776s][debug][gc,verify         ] Heap
> >>         > [1035.776s][debug][gc,verify         ] Roots
> >>         > [1035.782s][debug][gc,verify         ] HeapRegionSets
> >>         > [1035.782s][debug][gc,verify         ] HeapRegions
> >>         > [1035.782s][error][gc,verify         ] ----------
> >>         > ...
> >>         > A more complete GC log can be found here [2].
> >>         >
> >>         > For the field 0x00000000fb509148 of live obj
> >>         0x00000000fb509130 which
> >>         > points to the dead object 0x00000000f9ba39b0 I get the
> >> following
> >>         > information if I inspect them with clhsdb:
> >>         >
> >>         > hsdb> inspect 0x00000000fb509130
> >>         > instance of Oop for org/antlr/v4/runtime/atn/ATNConfig @
> >>         0x00000000fb509130
> >>         > @ 0x00000000fb509130 (size = 32)
> >>         > _mark: 13
> >>         > _metadata._compressed_klass: InstanceKlass for
> >>         > org/antlr/v4/runtime/atn/ATNConfig
> >>         > state: Oop for org/antlr/v4/runtime/atn/BasicState @
> >>         0x00000000f83ecfa8 Oop
> >>         > for org/antlr/v4/runtime/atn/BasicState @ 0x00000000f83ecfa8
> >>         > alt: 1
> >>         > context: Oop for
> >>         org/antlr/v4/runtime/atn/SingletonPredictionContext @
> >>         > 0x00000000f9ba39b0 Oop for
> >>         > org/antlr/v4/runtime/atn/SingletonPredictionContext @
> >>         0x00000000f9ba39b0
> >>         > reachesIntoOuterContext: 8
> >>         > semanticContext: Oop for
> >>         org/antlr/v4/runtime/atn/SemanticContext$Predicate
> >>         > @ 0x00000000f83d57c0 Oop for
> >>         > org/antlr/v4/runtime/atn/SemanticContext$Predicate @
> >>         0x00000000f83d57c0
> >>         >
> >>         > hsdb> inspect 0x00000000f9ba39b0
> >>         > instance of Oop for
> >>         org/antlr/v4/runtime/atn/SingletonPredictionContext @
> >>         > 0x00000000f9ba39b0 @ 0x00000000f9ba39b0 (size = 32)
> >>         > _mark: 41551306041
> >>         > _metadata._compressed_klass: InstanceKlass for
> >>         > org/antlr/v4/runtime/atn/SingletonPredictionContext
> >>         > id: 100635259
> >>         > cachedHashCode: 2005943142
> >>         > parent: Oop for
> >>         org/antlr/v4/runtime/atn/SingletonPredictionContext @
> >>         > 0x00000000f9ba01b0 Oop for
> >>         > org/antlr/v4/runtime/atn/SingletonPredictionContext @
> >>         0x00000000f9ba01b0
> >>         > returnState: 18228
> >>         >
> >>         > I could also reproduce the verification errors with a fast
> >>         debug build of
> >>         > 11.0.7 which I did run with "-XX:+CheckCompressedOops
> >>         -XX:+VerifyOops
> >>         > -XX:+G1VerifyCTCleanup -XX:+G1VerifyBitmaps" in addition to
> >>         the options
> >>         > mentioned before, but unfortunaltey the run didn't trigger
> >>         neither an
> >>         > assertion nor a different verification error.
> >>         >
> >>         > So to summarize, my basic questions are:
> >>         >   - has somebody else encountered similar crashes?
> >>         >   - is someone aware of specific changes in jdk12 which
> >>         might solve this
> >>         > problem?
> >>         >   - are the verification errors I'm seeing accurate or is it
> >>         possible to get
> >>         > false positives when running with
> >>         -XX:Verify{Before,During,After}GC ?
> >>         >
> >>         > Thanks for your patience,
> >>         > Volker
> >>         >
> >>         > [1]
> >>         >
> >> http://cr.openjdk.java.net/~simonis/webrevs/2020/jdk11-g1-crash/hs_err_pid28294.log
> >>         > [2]
> >>         >
> >> http://cr.openjdk.java.net/~simonis/webrevs/2020/jdk11-g1-crash/verify-error.log
> >>
> >
>



More information about the hotspot-gc-dev mailing list