RFR (S) 8181143: Introduce diagnostic flag to abort VM on too long VM operations

Mon Nov 19 06:13:53 UTC 2018

Hi Aleksey,

First, the synopsis is not accurate:

"Introduce diagnostic flag to abort VM on too long VM operations"

You're not just introducing one diagnostic flag, you're introducing the 
entire VM operation timeout mechanism, including two product flags and 
one diagnostic flag. So the CR needs to reflect that clearly, and you 
will need a CSR request to add the two product flags. And they will need 
to be documented.

Three flags just for this makes me cringe. (Yes it mirrors the safepoint 
timeout flags but if that were proposed today I'd have the same reaction.)

On 17/11/2018 2:30 am, Aleksey Shipilev wrote:
> RFE:
>    https://bugs.openjdk.java.net/browse/JDK-8181143
> 
> Webrev:
>    http://cr.openjdk.java.net/~shade/8181143/webrev.03/
> 
> SafepointTimeout is nice to discover long/stuck safepoint syncs. But it is as important to discover
> long/stuck VM operations. This patch complements the timeout machinery with tracking VM operation
> themselves. Among other things, this allows to terminate the VM when very long VM operation is
> blocking progress. High-availability users would enjoy fail-fast JVM -- in fact, the original
> prototype was done as request from Apache Ignite developers.
> 
> Example with -XX:+VMOperationTimeout -XX:VMOperationTimeoutDelay=100 -XX:+AbortVMOnVMOperationTimeout:
> 
> [3.117s][info][gc,start] GC(2) Pause Young (Normal) (G1 Evacuation Pause)
> [3.224s][warning][vmthread] VM Operation G1CollectForAllocation took longer than 100 ms
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  Internal Error (/home/sh/jdk-jdk/src/hotspot/share/runtime/vmThread.cpp:218), pid=2536, tid=2554
> #  fatal error: VM Operation G1CollectForAllocation took longer than 100 ms
> #

It's not safe to access vm_safepoint_description() outside the VMThread 
as the _cur_vm_operation could be deleted while you're trying to access 
its name.
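
To illustrate the hazard, here is a minimal C++ sketch. It is not taken 
from the webrev and is not HotSpot code: Op, begin_op, end_op and 
watcher_snapshot are hypothetical names. It assumes operation names are 
string literals with static storage, so the executing thread can publish 
the name pointer itself and the watcher never dereferences an operation 
object that may already have been deleted.

```cpp
#include <atomic>
#include <cassert>
#include <cstring>

// Hypothetical stand-in for a VM operation; not the HotSpot class.
struct Op {
  const char* name;  // assumed to point at a string literal (static storage)
};

// Written by the executing thread, read by the watcher thread.
static std::atomic<const char*> g_current_op_name{nullptr};

// Executing thread: publish the name before evaluating the operation.
void begin_op(const Op* op) {
  assert(op != nullptr);
  g_current_op_name.store(op->name, std::memory_order_release);
}

// Executing thread: clear the name once the operation has finished.
void end_op() {
  g_current_op_name.store(nullptr, std::memory_order_release);
}

// Watcher thread: reads only the published pointer, never the Op itself.
const char* watcher_snapshot() {
  const char* n = g_current_op_name.load(std::memory_order_acquire);
  return n != nullptr ? n : "(none)";
}

// Convenience check used when exercising the sketch.
bool watcher_sees(const char* expected) {
  return std::strcmp(watcher_snapshot(), expected) == 0;
}
```

The watcher may still report a name that is slightly stale, but under 
the stated assumption it cannot read freed memory.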

Initially I thought this might be useful for tracking down excessively 
long VM ops, but with a global timeout it can't do that. And a per-op 
timeout would be rather tricky to pass through from the command-line 
(but easy enough to use once you had it).

And as we don't have a general timer mechanism, this has to use polling, 
so you pay for a 10ms task wakeup regardless of how long the timeout is.
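
The polling cost can be modelled with a small C++ sketch. This is a 
simplified simulation under assumed names (PollResult, simulate_polling), 
not the patch's code: a periodic task wakes every poll_ms while an 
operation runs and checks whether the elapsed time exceeds timeout_ms.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a polling watchdog; not HotSpot code.
struct PollResult {
  int64_t wakeups;         // periodic wakeups while the operation ran
  int64_t detected_at_ms;  // elapsed time when the timeout was noticed, or -1
};

PollResult simulate_polling(int64_t op_duration_ms,
                            int64_t timeout_ms,
                            int64_t poll_ms) {
  assert(poll_ms > 0);
  PollResult r{0, -1};
  // The task fires at poll_ms, 2*poll_ms, ... for as long as the op runs.
  for (int64_t now = poll_ms; now <= op_duration_ms; now += poll_ms) {
    r.wakeups++;
    // The timeout can only be noticed at a wakeup, not at the exact deadline.
    if (r.detected_at_ms < 0 && now > timeout_ms) {
      r.detected_at_ms = now;
    }
  }
  return r;
}
```

In this model a 250 ms operation polled every 10 ms costs 25 wakeups 
whether the timeout is 100 ms or 10000 ms; only the detection point 
changes.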

Given the limitations of the global timeout I'm not sure I see a use for 
the non-aborting form. This could just reduce down to:

-XX:AbortVMOpsAfter=NN

otherwise I don't really think this carries its weight. Of course that's 
just my opinion. Interested to hear what others think.

Cheers,
David

> Testing: hotspot/tier1, ad-hoc tests, jdk-submit (pending)
> 
> Thanks,
> -Aleksey
>