Crash in openjdk8 while using ShenandoahGC

Tue Sep 5 15:25:53 UTC 2017

Hi Peter,

On 09/05/2017 04:40 PM, Peter Beaman wrote:
> hs_err file:
> https://drive.google.com/open?id=0B5jQhhBr4V-LZ0V5NTBtNWl6cDg

The suspicious part is this:

VM_Operation (0x00007f73746ecc10): RevokeBias, mode: safepoint, requested by thread 0x00007f732e69d800

This means we were trying to reach a safepoint for biased lock revocation, and failing there.
Now, Shenandoah thinks this is what happens:

Shenandoah Heap
 21979136K total, 21979136K committed, 21976057K used
 8192K regions, 2683 active, 2683 total
Status: evacuating cancelled  <----- oops

In our code, when evacuation is cancelled, some tricky things happen with application
threads, and that might explain why we see the safepoint stall ultimately crashing the VM. The
contributing factor seems to be that we entered the biased locking revocation at the same time, and
this blows up. We need to do more extensive stress tests with biased locking enabled; this is on us.

> gc log:
> https://drive.google.com/open?id=0B5jQhhBr4V-LWi1vWURFWHd2Vm8

...also I see there are lots of safepoints in between the GC-induced ones:

2017-09-01T03:29:02.071+0000: 3691.233: Total time for which application threads were stopped:
0.0257092 seconds, Stopping threads took: 0.0233301 seconds
2017-09-01T03:29:03.329+0000: 3692.491: Total time for which application threads were stopped:
0.0409040 seconds, Stopping threads took: 0.0384744 seconds
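
To put numbers on those two entries, here is a minimal sketch that pulls the "stopped" and "Stopping threads took" figures out of the log text and reports how much of each pause went into just getting the threads to the safepoint. The class and method names (SafepointPauses, parse, stoppingShare) are made up for illustration and are not part of any JDK tool; the regex only assumes the message format shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the JDK or Shenandoah.
class SafepointPauses {
    // Matches the "application threads were stopped" message quoted above.
    // \s+ tolerates the line wrapping introduced by the mail client.
    private static final Pattern STOPPED = Pattern.compile(
        "Total time for which application threads were stopped:\\s+([0-9.]+) seconds,"
        + "\\s+Stopping threads took:\\s+([0-9.]+) seconds");

    /** Returns {totalSeconds, stoppingSeconds} for every pause found in the log text. */
    static List<double[]> parse(String logText) {
        List<double[]> pauses = new ArrayList<>();
        Matcher m = STOPPED.matcher(logText);
        while (m.find()) {
            pauses.add(new double[] {
                Double.parseDouble(m.group(1)),
                Double.parseDouble(m.group(2))
            });
        }
        return pauses;
    }

    /** Fraction of one pause that was spent bringing threads to the safepoint. */
    static double stoppingShare(double[] pause) {
        return pause[1] / pause[0];
    }

    public static void main(String[] args) {
        // The two entries quoted above, wrapped the same way.
        String sample =
            "2017-09-01T03:29:02.071+0000: 3691.233: Total time for which application threads were stopped:\n"
            + "0.0257092 seconds, Stopping threads took: 0.0233301 seconds\n"
            + "2017-09-01T03:29:03.329+0000: 3692.491: Total time for which application threads were stopped:\n"
            + "0.0409040 seconds, Stopping threads took: 0.0384744 seconds\n";
        for (double[] p : parse(sample)) {
            System.out.printf("total=%.4fs stopping=%.4fs share=%.0f%%%n",
                p[0], p[1], 100.0 * stoppingShare(p));
        }
    }
}
```

For both entries the stopping share comes out above 90%, i.e. almost all of the pause is time-to-safepoint rather than the VM operation itself.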

Given these two observations, I am sure -XX:-UseBiasedLocking would improve latency and stability
for your app. Can you try it? I would also recommend running with -XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=10 to get a more detailed breakdown of the safepoints experienced,
including the VM operation names.
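
If it helps to confirm the options actually reached the JVM (launcher scripts sometimes drop them), a small check along these lines could be run at startup. This is only a sketch: JvmFlagCheck and hasFlag are hypothetical names, and it does nothing more than look at the argument list the JVM reports through the standard java.lang.management API.

```java
import java.lang.management.ManagementFactory;
import java.util.Arrays;
import java.util.List;

// Hypothetical startup check for illustration; the names are not from any library.
class JvmFlagCheck {
    // The three options suggested above, spelled exactly as they would be passed.
    static final List<String> WANTED = Arrays.asList(
        "-XX:-UseBiasedLocking",
        "-XX:+PrintSafepointStatistics",
        "-XX:PrintSafepointStatisticsCount=10");

    /** True if the given JVM argument list contains the flag verbatim. */
    static boolean hasFlag(List<String> jvmArgs, String flag) {
        return jvmArgs.contains(flag);
    }

    public static void main(String[] args) {
        // Arguments the running JVM was started with, as reported by the runtime MXBean.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        for (String flag : WANTED) {
            System.out.println(flag + " : " + (hasFlag(jvmArgs, flag) ? "present" : "MISSING"));
        }
    }
}
```

The comparison is a plain string match, so it would not notice a flag that is set to the same value in some other spelling.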

> You will see in the gc.log file that the application experienced very
> frequent multi-second "Pause Full" pauses. I suspect our application is
> simply overwhelming the ability of the garbage collector to complete its
> evacuations. 

Could be. The default "adaptive" heuristics would try to reserve some free space to absorb the
allocation spike. See our Wiki for details on how to request more free space. Or just give it more
heap. Shenandoah works better with more heap, in contrast to other collectors :)
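
The heap summary from the hs_err file quoted earlier seems consistent with this. As a back-of-the-envelope sketch (the class name HeapHeadroom is made up, and the only inputs are the figures from that summary), the arithmetic looks like:

```java
// Figures copied from the "Shenandoah Heap" summary quoted earlier; nothing else is assumed.
class HeapHeadroom {
    static final long TOTAL_K = 21_979_136L;  // "21979136K total"
    static final long USED_K = 21_976_057L;   // "21976057K used"
    static final long REGION_K = 8_192L;      // "8192K regions"
    static final long REGIONS = 2_683L;       // "2683 total"

    /** Free space at the time of the crash, in K. */
    static long freeK() {
        return TOTAL_K - USED_K;
    }

    /** Heap occupancy in percent. */
    static double usedPercent() {
        return 100.0 * USED_K / TOTAL_K;
    }

    public static void main(String[] args) {
        // Sanity check: region count times region size should equal the total.
        System.out.println("regions * region size = " + (REGIONS * REGION_K) + "K");
        System.out.println("free = " + freeK() + "K");
        System.out.printf("used = %.3f%%%n", usedPercent());
        System.out.println("free is less than one region: " + (freeK() < REGION_K));
    }
}
```

That works out to 3079K free, which is less than a single 8192K region, so there was effectively no headroom left when the VM went down.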

Thanks,
-Aleksey