[crac] RFR: Ensure all notifications finish even if only daemon threads remain
Radim Vansa
duke at openjdk.org
Thu May 4 15:42:55 UTC 2023
On Thu, 4 May 2023 13:34:47 GMT, Anton Kozlov <akozlov at openjdk.org> wrote:
> If as a result of beforeCheckpoint() no more non-daemon threads remain, it's possible that VM exits prematurely, before one of afterRestore() get a chance to create another non-daemon thread that will keep VM alive. Triggering checkpoint via jcmd (so checkpointRestore() method is executed on daemon attach-listener thread), increases probability to step in the problem.
>
> The change ensures all notifications are done while there is at least one non-daemon thread. The notification methods are still called from the original thread.
The code is doing something (starting a thread) before the checkpoint, and after restore (join the thread). Rather than clobbering the 'general' C/R code with a fix for one issue, there is an interface perfectly suited to host this - a Resource.
There's a catch, though - if we register it to the JDKContext this would be run *after* user Resources, and we need to run this *before*. Right now we don't have any means to order things this way.
One solution could be reversing the JDKContext/GlobalContext relationship - JDKContext would be the parent, and would use a specific priority class for GlobalContext (something before `NORMAL`). And there would be one more priority class even before that, and this `KeepaliveResource` would be registered at it.
src/java.base/share/classes/jdk/crac/Core.java line 254:
> 252: // The notifications are done on the original thread.
> 253: CountDownLatch start = new CountDownLatch(1);
> 254: CountDownLatch finish = new CountDownLatch(1);
Rather than 2 CountDownLatches you could use either single `CyclicBarrier` (awaiting twice), or even better a `Phaser` that has API for non-interruptible wait.
test/jdk/jdk/crac/DaemonAfterRestore.java line 82:
> 80: System.out.println("worker thread finish");
> 81: });
> 82: workerThread.setDaemon(false);
Unnecessary, thread created from main thread (non-daemon) is not a daemon either.
test/jdk/jdk/crac/DaemonAfterRestore.java line 94:
> 92: @Override
> 93: public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
> 94: finish.countDown();
It might be worth asserting that here we are running in a daemon thread.
test/jdk/jdk/crac/DaemonAfterRestore.java line 98:
> 96: @Override
> 97: public void afterRestore(Context<? extends Resource> context) throws Exception {
> 98: Thread.sleep(3000);
Could we avoid extending the testsuite run by 3 seconds? I know we're trying to assert that 'nothing happens' rather then replacing waiting for an event by a timed wait. If we can't check that the destroy VM thread had a chance to work, I suggest at least reducing this (say 100ms?).
-------------
PR Review: https://git.openjdk.org/crac/pull/62#pullrequestreview-1413342072
PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1185159712
PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1185171160
PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1185175830
PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1185196948
More information about the crac-dev
mailing list