[crac] RFR: Ensure all notifications finish even if only daemon threads remain

Thu May 4 15:42:55 UTC 2023

On Thu, 4 May 2023 13:34:47 GMT, Anton Kozlov <akozlov at openjdk.org> wrote:

> If as a result of beforeCheckpoint() no more non-daemon threads remain, it's possible that VM exits prematurely, before one of afterRestore() get a chance to create another non-daemon thread that will keep VM alive. Triggering checkpoint via jcmd (so checkpointRestore() method is executed on daemon attach-listener thread), increases probability to step in the problem.
> 
> The change ensures all notifications are done while there is at least one non-daemon thread. The notification methods are still called from the original thread.

The code is doing something (starting a thread) before the checkpoint, and after restore (join the thread). Rather than clobbering the 'general' C/R code with a fix for one issue, there is an interface perfectly suited to host this - a Resource.
There's a catch, though - if we register it to the JDKContext this would be run *after* user Resources, and we need to run this *before*. Right now we don't have any means to order things this way.
One solution could be reversing the JDKContext/GlobalContext relationship - JDKContext would be the parent, and would use a specific priority class for GlobalContext (something before `NORMAL`). And there would be one more priority class even before that, and this `KeepaliveResource` would be registered at it.

src/java.base/share/classes/jdk/crac/Core.java line 254:

> 252:                 // The notifications are done on the original thread.
> 253:                 CountDownLatch start = new CountDownLatch(1);
> 254:                 CountDownLatch finish = new CountDownLatch(1);

Rather than 2 CountDownLatches you could use either single `CyclicBarrier` (awaiting twice), or even better a `Phaser` that has API for non-interruptible wait.

test/jdk/jdk/crac/DaemonAfterRestore.java line 82:

> 80:             System.out.println("worker thread finish");
> 81:         });
> 82:         workerThread.setDaemon(false);

Unnecessary, thread created from main thread (non-daemon) is not a daemon either.

test/jdk/jdk/crac/DaemonAfterRestore.java line 94:

> 92:             @Override
> 93:             public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
> 94:                 finish.countDown();

It might be worth asserting that here we are running in a daemon thread.

test/jdk/jdk/crac/DaemonAfterRestore.java line 98:

> 96:             @Override
> 97:             public void afterRestore(Context<? extends Resource> context) throws Exception {
> 98:                 Thread.sleep(3000);

Could we avoid extending the testsuite run by 3 seconds? I know we're trying to assert that 'nothing happens' rather then replacing waiting for an event by a timed wait. If we can't check that the destroy VM thread had a chance to work, I suggest at least reducing this (say 100ms?).

-------------

PR Review: https://git.openjdk.org/crac/pull/62#pullrequestreview-1413342072
PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1185159712
PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1185171160
PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1185175830
PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1185196948