RFR: 8342848: Shenandoah: Marking bitmap may not be completely cleared in generational mode

Xiaolong Peng xpeng at openjdk.org
Wed Oct 23 16:11:22 UTC 2024


On Wed, 23 Oct 2024 06:07:00 GMT, Xiaolong Peng <xpeng at openjdk.org> wrote:

> While investigating the crash I saw in PR https://github.com/openjdk/shenandoah/pull/516, I happened to reproduce the same crash on GenShen tip as well.
> 
> The crash was reproduced multiple times on both AWS r7g-4xlarge and r7i-4xlarge instances by running the test below repeatedly:
> 
> 
> ```
> CONF=linux-aarch64-server-fastdebug make clean test TEST=gc/stress/gcold/TestGCOldWithShenandoah.java#generational JTREG="REPEAT_COUNT=1000"
> ```
> 
> Crash:
> 
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  Internal Error (/home/xlpeng/repos/jdk-xlpeng/src/hotspot/share/gc/shenandoah/shenandoahConcurrentGC.cpp:642), pid=24134, tid=24158
> #  assert(_generation->is_bitmap_clear()) failed: need clear marking bitmap
> #
> # JRE version: OpenJDK Runtime Environment (24.0) (fastdebug build 24-internal-adhoc.xlpeng.jdk-xlpeng)
> # Java VM: OpenJDK 64-Bit Server VM (fastdebug 24-internal-adhoc.xlpeng.jdk-xlpeng, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, shenandoah gc, linux-aarch64)
> # Problematic frame:
> # V  [libjvm.so+0x15eadc4]  ShenandoahConcurrentGC::op_init_mark()+0x358
> #
> # No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /local/home/xlpeng/repos/jdk-xlpeng/build/linux-aarch64-server-fastdebug/test-support/jtreg_test_hotspot_jtreg_gc_stress_gcold_TestGCOldWithShenandoah_java_generational/scratch/0/hs_err_pid24134.log
> #
> # If you would like to submit a bug report, please visit:
> #   https://bugreport.java.com/bugreport/crash.jsp
> #
> 
> 
> Logging and instrumentation suggest the cause is a single line of code: `bool needs_reset = _generation->contains(region) || !region->is_affiliated();`. Bitmap reset is a concurrent operation, so a mutator thread can change a region's affiliation from FREE to YOUNG while the reset is running. If the affiliation is still FREE when `_generation->contains(region)` is evaluated and the mutator updates it before `!region->is_affiliated()` is evaluated, both tests are false and the region's bitmap is not reset.
> 
> Logs from instrumentation:
> 
> [32.793s][info][gc          ] GC(19) Not reseting bitmap for YOUNG region (0x0000ffff8c1a6100)(affiliation before test: FREE)
> 
> ...
> 
> [32.807s][info][gc,task     ] GC(20) Using 8 of 8 workers for init marking
> [32.808s][info][gc          ] GC(20) Region (0x0000ffff8c1a6100) doesn't have clear bitmap, [1, 1, 1]
> 
> 
> The fix is simple, just need to swap the two tests,  `!region->is_affiliated()` ...
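
The interleaving described in the quoted text can be replayed deterministically with stand-in types. This is only a sketch: `MockRegion`, `MockYoungGeneration`, and the `mutator_pending` flag are hypothetical names invented for illustration, not the HotSpot classes, and it assumes that both tests read the region's affiliation independently.

```cpp
#include <cassert>

// Hypothetical stand-ins; NOT the real Shenandoah types.
enum class Affiliation { FREE, YOUNG, OLD };

struct MockRegion {
  Affiliation affiliation = Affiliation::FREE;
  // When set, a simulated mutator flips FREE -> YOUNG right after the next read.
  bool mutator_pending = false;

  Affiliation read() {
    Affiliation seen = affiliation;
    if (mutator_pending) {
      affiliation = Affiliation::YOUNG;
      mutator_pending = false;
    }
    return seen;
  }

  bool is_affiliated() { return read() != Affiliation::FREE; }
};

struct MockYoungGeneration {
  // Assumption: membership is decided by a fresh read of the affiliation.
  bool contains(MockRegion& r) { return r.read() == Affiliation::YOUNG; }
};

// Original order: contains() sees FREE, then is_affiliated() sees YOUNG.
bool needs_reset_original(MockYoungGeneration& gen, MockRegion& r) {
  return gen.contains(r) || !r.is_affiliated();
}

// Swapped order: is_affiliated() sees FREE first and short-circuits to true.
bool needs_reset_swapped(MockYoungGeneration& gen, MockRegion& r) {
  return !r.is_affiliated() || gen.contains(r);
}
```

With the simulated mutator armed, `needs_reset_original` returns false for a region that ends up YOUNG, while `needs_reset_swapped` returns true. If the flip happens before the first read instead, the swapped form still returns true through `contains`.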

The test failure should not be caused by this change; I spotted the same failure in another open PR:


```
java.lang.RuntimeException: expected testPhantom1 to be cleared
	at gc.shenandoah.TestReferenceRefersToShenandoah.fail(TestReferenceRefersToShenandoah.java:155)
	at gc.shenandoah.TestReferenceRefersToShenandoah.expectCleared(TestReferenceRefersToShenandoah.java:166)
	at gc.shenandoah.TestReferenceRefersToShenandoah.testConcurrentCollection(TestReferenceRefersToShenandoah.java:243)
	at gc.shenandoah.TestReferenceRefersToShenandoah.main(TestReferenceRefersToShenandoah.java:340)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:573)
	at com.sun.javatest.regtest.agent.MainWrapper$MainTask.run(MainWrapper.java:138)
	at java.base/java.lang.Thread.run(Thread.java:1576)

JavaTest Message: Test threw exception: java.lang.RuntimeException: expected testPhantom1 to be cleared
JavaTest Message: shutting down test
```

src/hotspot/share/gc/shenandoah/shenandoahGeneration.cpp line 80:

> 78:     ShenandoahMarkingContext* const ctx = heap->marking_context();
> 79:     while (region != nullptr) {
> 80:       bool needs_reset = !region->is_affiliated() || _generation->contains(region);

We should really read the affiliation only once for the whole test, but that requires a new method in order to keep the code clean and encapsulated.
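
A minimal sketch of what a single-read test could look like, using stand-in types. Everything here is an assumption made for illustration: `MockRegion`, `GenerationKind`, `contains_affiliation`, and `needs_bitmap_reset` are hypothetical names, not existing HotSpot methods, and the relaxed atomic load is just one plausible way to take the snapshot.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-ins; NOT the real Shenandoah types.
enum class Affiliation { FREE, YOUNG, OLD };
enum class GenerationKind { YOUNG, OLD };

struct MockRegion {
  std::atomic<Affiliation> affiliation{Affiliation::FREE};
};

// Hypothetical helper: decides membership from an affiliation value that was
// already read, instead of re-reading it from the region.
bool contains_affiliation(GenerationKind gen, Affiliation a) {
  switch (gen) {
    case GenerationKind::YOUNG: return a == Affiliation::YOUNG;
    case GenerationKind::OLD:   return a == Affiliation::OLD;
  }
  return false;
}

// Reads the affiliation exactly once, then evaluates both conditions on the
// same snapshot, so a concurrent FREE -> YOUNG update cannot fall between them.
bool needs_bitmap_reset(GenerationKind gen, const MockRegion& region) {
  const Affiliation snapshot =
      region.affiliation.load(std::memory_order_relaxed);
  return snapshot == Affiliation::FREE || contains_affiliation(gen, snapshot);
}
```

Because both conditions use one snapshot, the result no longer depends on the order of the two tests. The price is the extra value-based helper the comment above alludes to.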

-------------

PR Comment: https://git.openjdk.org/shenandoah/pull/523#issuecomment-2432746489
PR Review Comment: https://git.openjdk.org/shenandoah/pull/523#discussion_r1813106540

