RFR: 8309637: runtime/handshake/HandshakeTimeoutTest.java fails with "has not cleared handshake op" and SIGILL

Thu Jul 6 07:04:54 UTC 2023

On Wed, 5 Jul 2023 18:15:50 GMT, Patricio Chilano Mateo <pchilanomate at openjdk.org> wrote:

> Please review the following fix. The test is checking the correct behavior of flag HandshakeTimeout by spawning a child VM and verifying that it crashes due to a timeout during a handshake operation. The test sometimes times out though because the child VM deadlocks during error reporting. The issue is that the JavaThread doing the error reporting deadlocks trying to acquire a lock it already owns, and the Watcher thread never kicks in to shutdown the VM because it hasn't been created yet and will never be: the "main" JavaThread is blocked in Threads::create_vm(), somewhere before creating the Watcher thread, waiting to acquire a lock that the error reporting thread owns. There are more details about the specifics resources involved in the deadlocks in the bug comments.
> 
> I moved the start of the Watcher thread further up during the initialization steps so that it is there even before creating the VMThread. Ideally it should be one of the first things we create in Threads::create_vm() but unfortunately there are dependencies. I found the earliest we can create it is at the same place the AsyncLogWriter thread is created since the barrier set needs to have been created already. I didn't move the creation there but just after that call to init_globals(). To keep the current behavior where enrolled tasks are only processed after the VM has been fully created I added a first loop in WatcherThread::run() where we only check for error reporting hangs. Most of the other changes in the patch are factoring out code.
> 
> The test timeout can be reproduced by adding a delay in ThreadCritical() (one of the locks involved in the deadlock). I verified the fix by running the test with that extra delay and verified it doesn't time out anymore. I also run tiers1-3 in mach5. I'll run the upper tiers too.
> 
> Thanks,
> Patricio

Great to get to the bottom of this. The actual deadlocks are a nuisance that would ideally be avoided - but unfortunately error reporting is far too full of things that might not be reasonable to do during error reporting.

The code changes seem a bit more elaborate than necessary though - comment below.

Thanks.

src/hotspot/share/runtime/nonJavaThread.cpp line 276:

> 274:   while (true) {
> 275:     // Just check for error reporting hangs until either VM is fully
> 276:     // initialized or we are notified to stop running.

Couldn't you achieve the same affect more simply by just conditionalizing the call to `PeriodicTask::real_time_tick(time_waited);` in the loop below i.e.

if (_run_all_tasks) {
  PeriodicTask::real_time_tick(time_waited);
}

?
That avoids the need to refactor anything AFAICS.

-------------

PR Review: https://git.openjdk.org/jdk/pull/14777#pullrequestreview-1515883926
PR Review Comment: https://git.openjdk.org/jdk/pull/14777#discussion_r1254017976