Tracking potential GC bugs

Wed Sep 20 08:12:07 UTC 2017

Hello all,

I am not sure if this is the best forum for soliciting advice on how to hunt potential GC bugs, but this was the best I could come up with.

Ideas about better forums are welcome.

This post is about bugs https://bugs.openjdk.java.net/browse/JDK-8172756 and https://bugs.openjdk.java.net/browse/JDK-8143310, both of which we're seeing when using G1 GC. We've seen this problem on update releases 92, 112, 131 and 141 of JDK 8.

Currently, we can usually reproduce the crash roughly once every three days by running the whole application and mimicking actual usage with scripts, with no hope in sight for any shorter or simpler reproduction.

As the crash was in oopDesc::size(), we tried back-porting JDK-8168914 (even though our crash was elsewhere), adding a memory fence to the reads and writes of the class, and then trying to identify whether the actual pointed-to class was invalid (with Metaspace::contains(obj->klass())).

These changes can be seen in this changeset: https://gist.github.com/jmiettinen/3ae14b2cfa509a0f17efb35e5503c17b
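
The idea behind the validity check can be sketched outside HotSpot as a plain address-range test. This is only a minimal illustration, not the code from the changeset: FakeMetaspace, FakeOop and klass_looks_valid are hypothetical stand-ins, and the real Metaspace::contains() works on HotSpot's own data structures.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the metaspace: one contiguous byte range.
struct FakeMetaspace {
    const char* base;
    std::size_t size;

    // Pure address-range check, similar in spirit to Metaspace::contains().
    bool contains(const void* p) const {
        auto addr = reinterpret_cast<std::uintptr_t>(p);
        auto lo = reinterpret_cast<std::uintptr_t>(base);
        return addr >= lo && addr - lo < size;
    }
};

// Hypothetical stand-in for an object header holding a class pointer.
struct FakeOop {
    const void* klass;
};

// Returns true if the class pointer is non-null and lies inside the
// metaspace range, i.e. it is plausible enough to dereference.
bool klass_looks_valid(const FakeMetaspace& ms, const FakeOop& obj) {
    return obj.klass != nullptr && ms.contains(obj.klass);
}
```

A check like this can only tell that a pointer falls outside the expected range. A garbled pointer that happens to land inside the range would still pass.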

If I've understood the JDK code correctly, the OOPs for which the size() call crashes come from situations where the GC goes through some set of objects (let's call them BadObjects), marking everything they refer to grey / copying it to survivor space.

So we'll end up with something like this:

class BadObject {
     char* ptr;
};

where bad_object.ptr points to some garbled value.

This raises at least the following hypotheses:

1. Some stage of garbage collection misses updating references in a BadObject. I don't know if G1 does that kind of pointer updating.

2. Some part of the software (native code, anything using Unsafe, miscompiled Java code) garbles the pointer.
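
To make the first hypothesis concrete, here is a toy model of what a missed reference update would look like. It is not G1 code and makes no claim about how G1 actually evacuates objects: ToyHeap, Ref and evacuate are hypothetical names, and a "reference" here is just a region number plus a slot index.

```cpp
#include <cstddef>
#include <vector>

// Toy reference: which region the object lives in and its slot there.
struct Ref {
    std::size_t region;
    std::size_t index;
};

struct ToyHeap {
    std::vector<std::vector<int>> regions;  // regions[r][i] is an object payload
    std::vector<bool> freed;                // true once a region was evacuated

    explicit ToyHeap(std::size_t n) : regions(n), freed(n, false) {}

    // A reference is valid if it points into a live region and a used slot.
    bool is_valid(const Ref& r) const {
        return r.region < regions.size() && !freed[r.region] &&
               r.index < regions[r.region].size();
    }

    // Copy every object from `from` to the end of `to`, fix up only the
    // references listed in `known_refs`, then release `from`.
    void evacuate(std::size_t from, std::size_t to,
                  const std::vector<Ref*>& known_refs) {
        const std::size_t base = regions[to].size();
        regions[to].insert(regions[to].end(),
                           regions[from].begin(), regions[from].end());
        for (Ref* r : known_refs) {
            if (r->region == from) {
                r->region = to;
                r->index += base;
            }
        }
        regions[from].clear();
        freed[from] = true;
    }
};
```

Any reference that is not in known_refs keeps pointing at the released region afterwards, which is the kind of stale value a BadObject would then hold.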

For the first hypothesis, we've so far tried turning on _hrm.verify_optional() and verify_region_sets_optional() in G1CollectedHeap::do_collection_pause_at_safepoint in production, but they have not caught any irregularities.

Could there be other causes? Are there any suggestions for next steps given how hard the reproduction is?

We're unable to move to JDK 9 and try to reproduce the crash there, as we're running JRuby, which is not working with JDK 9 at the moment.

The JVM parameters used are:

-Xms3000G -Xmx3000G -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m -XX:+UseCodeCacheFlushing -XX:MaxDirectMemorySize=20G -XX:AutoBoxCacheMax=8192 -XX:MetaspaceSize=512M -XX:+UseG1GC -XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=1 -XX:G1MaxNewSizePercent=80 -XX:G1MixedGCLiveThresholdPercent=90 -XX:G1HeapWastePercent=5 -XX:G1MixedGCCountTarget=4 -XX:MaxGCPauseMillis=3000 -verbose:gc -XX:-PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:G1ReservePercent=20 -XX:SurvivorRatio=1 -XX:+UseGCOverheadLimit -XX:SoftRefLRUPolicyMSPerMB=10 -Xloggc:/opt/apps/customer/shared/log/gc.log -XX:-HeapDumpOnOutOfMemoryError -Djruby.compile.invokedynamic=false -Djruby.ji.objectProxyCache=false