RFR (S) 8181143: Introduce diagnostic flag to abort VM on too long VM operations

Tue Nov 20 07:07:44 UTC 2018

Hi,

I like the idea in general.

I don't know whether I need 10ms resolution though - if we limit the
goal to catching just hanging VMOps - which would still be pretty
useful - 1s would even be enough.

I think we can do with just two flags, since VMOperationTimeout and
VMOperationTimeoutDelay are redundant. Just go with the delay, make -1
the default that does nothing. Same could go for SafepointTimeout vs
SafepointTimeoutDelay.
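
To make the single-flag idea concrete, here is a rough sketch. It is
not HotSpot code: TimeoutConfig and timed_out are made-up names for
illustration, and I am assuming "any negative delay means disabled".

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a single -XX:VMOperationTimeoutDelay=<ms> flag.
// Assumption: a negative value (the default, -1) turns the check off, so no
// separate boolean VMOperationTimeout flag is needed.
struct TimeoutConfig {
  int64_t delay_ms = -1;

  bool enabled() const { return delay_ms >= 0; }
};

// Returns true if an operation that has been running for elapsed_ms
// exceeds the configured delay. Always false when the check is disabled.
bool timed_out(const TimeoutConfig& cfg, int64_t elapsed_ms) {
  assert(elapsed_ms >= 0);
  if (!cfg.enabled()) {
    return false;
  }
  return elapsed_ms > cfg.delay_ms;
}
```

With the default config nothing ever times out; with delay_ms = 100
an operation that has run for 107 ms is reported and one at exactly
100 ms is not.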

I also agree with others that it would make more sense were we to kill
the VMThread instead (e.g. just with pthread_kill(SIGILL) or similar)
- we do something like that in error handling (see ErrorLogTimeout).
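
Roughly what I mean, as a standalone POSIX sketch rather than actual
HotSpot code. All names here are invented, and I use a harmless
caller-chosen signal (e.g. SIGUSR1) with a flag-setting handler in
place of SIGILL and the real crash handler, so the example can run
without taking the process down.

```cpp
#include <cassert>
#include <pthread.h>
#include <signal.h>
#include <unistd.h>

// Set by the handler, which runs in the context of the signalled thread.
static volatile sig_atomic_t g_signalled = 0;

static void on_signal(int) { g_signalled = 1; }

// Stand-in for a thread stuck in a long operation: it only stops once the
// signal handler has run on it.
static void* stuck_worker(void*) {
  while (!g_signalled) {
    usleep(1000);
  }
  return nullptr;
}

// Stand-in for the watcher: starts the worker, waits a little (the
// "timeout"), then directs the signal at that specific thread.
// Returns true if the worker observed the signal and exited.
bool signal_stuck_thread(int sig) {
  g_signalled = 0;

  struct sigaction sa = {};
  sa.sa_handler = on_signal;
  sigemptyset(&sa.sa_mask);
  int rc = sigaction(sig, &sa, nullptr);
  assert(rc == 0);

  pthread_t worker;
  rc = pthread_create(&worker, nullptr, stuck_worker, nullptr);
  assert(rc == 0);

  usleep(10000);                 // pretend the timeout has expired
  rc = pthread_kill(worker, sig);  // deliver to the worker thread only
  assert(rc == 0);
  (void)rc;

  pthread_join(worker, nullptr);
  return g_signalled == 1;
}
```

The point of targeting the thread is that the handler runs on the
stuck thread itself, so whatever gets reported reflects that thread's
state and not the watcher's.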

just my 2 cents.

..Thomas

On Mon, Nov 19, 2018 at 7:16 AM David Holmes <david.holmes at oracle.com> wrote:
>
> Hi Aleksey,
>
> First the synopsis is not accurate:
>
> "Introduce diagnostic flag to abort VM on too long VM operations"
>
> You're not just introducing one diagnostic flag, you're introducing the
> entire VM operation timeout mechanism, including two product flags and
> one diagnostic. So the CR needs to reflect that clearly and you will
> need a CSR request to add the two product flags. And they will need to
> be documented.
>
> Three flags just for this makes me cringe. (Yes it mirrors the safepoint
> timeout flags but if that were proposed today I'd have the same reaction.)
>
> On 17/11/2018 2:30 am, Aleksey Shipilev wrote:
> > RFE:
> >    https://bugs.openjdk.java.net/browse/JDK-8181143
> >
> > Webrev:
> >    http://cr.openjdk.java.net/~shade/8181143/webrev.03/
> >
> > SafepointTimeout is nice to discover long/stuck safepoint syncs. But it is as important to discover
> > long/stuck VM operations. This patch complements the timeout machinery with tracking VM operations
> > themselves. Among other things, this allows terminating the VM when a very long VM operation is
> > blocking progress. High-availability users would enjoy a fail-fast JVM -- in fact, the original
> > prototype was done at the request of Apache Ignite developers.
> >
> > Example with -XX:+VMOperationTimeout -XX:VMOperationTimeoutDelay=100 -XX:+AbortVMOnVMOperationTimeout:
> >
> > [3.117s][info][gc,start] GC(2) Pause Young (Normal) (G1 Evacuation Pause)
> > [3.224s][warning][vmthread] VM Operation G1CollectForAllocation took longer than 100 ms
> > #
> > # A fatal error has been detected by the Java Runtime Environment:
> > #
> > #  Internal Error (/home/sh/jdk-jdk/src/hotspot/share/runtime/vmThread.cpp:218), pid=2536, tid=2554
> > #  fatal error: VM Operation G1CollectForAllocation took longer than 100 ms
> > #
>
> It's not safe to access vm_safepoint_description() outside the VMThread
> as the _cur_vm_operation could be deleted while you're trying to access
> its name.
>
> Initially I thought this might be useful for tracking down excessively
> long VM ops, but with a global timeout it can't do that. And a per-op
> timeout would be rather tricky to pass through from the command-line
> (but easy enough to use once you had it).
>
> And as we don't have a general timer mechanism this has to use polling
> so you pay for a 10ms task wakeup regardless of how long the timeout is.
>
> Given the limitations of the global timeout I'm not sure I see a use for
> the non-aborting form. This could just reduce down to:
>
> -XX:AbortVMOpsAfter:NN
>
> otherwise I don't really think this carries its weight. Of course that's
> just my opinion. Interested to hear others.
>
> Cheers,
> David
>
> > Testing: hotspot/tier1, ad-hoc tests, jdk-submit (pending)
> >
> > Thanks,
> > -Aleksey
> >