RFR: 8302073: Specifying OnError handler prevents WatcherThread to break a deadlock in report_and_die()

Thu Mar 9 06:30:18 UTC 2023

The message from this sender included one or more files
which could not be scanned for virus detection; do not
open these files unless you are certain of the sender's intent.

----------------------------------------------------------------------
On Wed, 8 Mar 2023 21:31:14 GMT, David Holmes <dholmes at openjdk.org> wrote:

>> Pinging @dholmes-ora for input and opinion
>
> I thought we were still discussing options in JBS so this PR seems premature to me. I agree with @tstuefe initial comment this seems way too complex and I'm not even sure I can figure out the control flow here.

@dholmes-ora My proposal would be to be pragmatic and continue the discussion here. This has been going on too long. Switching channels again would be more confusing. 

I would prefer JBS or ML discussion as well, but not everyone (eg. @jsolomon8080) has write access to JBS.

----

I thought this over some more.

A big part of this problem is a plain bug in VMError::check_timeout(): it should never block *step timeouts*. We added step timeouts back in 2016 to deal with these kinds of problems: if the JVM hangs in error reporting, give it a kick to get it going again. Do so repeatedly. This is related to the global timeout, but works also independently from it (and therefore from the question of when to honor the global timeout).

I opened https://bugs.openjdk.org/browse/JDK-8303861 to deal with this, see PR: https://github.com/openjdk/jdk/pull/12936

This may already be a big help for cases like these. In fact, it may already be enough, and we could maybe close this issue.

However, if we have a recursive malloc situation, we may hang repeatedly. JVM will kick itself alive every time (with my patch) but this is still annoying. The root problem here is that we should not use malloc during error handling. Cannot always be avoided, but at least we should minimize malloc use. 

Decoder, in particular, should not use malloc. Therefore I also opened https://bugs.openjdk.org/browse/JDK-8303862 to track that. I won't have time to work on that. I have the hope that maybe @chhagedorn can :-) ? Otherwise Azul may also chip in some bug fixing.

----

All these are unrelated to the question of whether OnError should be blocked or not. I realize now that if we decide to (continue to) protect OnError from timeouts, we must never act on the global timeout until all OnError steps ran. Since these run right before VM exit, the original implementation that just blocked the global timeout altogether was actually right.

So, there's that. Maybe JDK-8303861 is already enough for cases like this. At least JDK-8303861 does not introduce any backward compatibility issues.

-------------

PR: https://git.openjdk.org/jdk/pull/12925