[crac] RFR: CRaC may exit before image dump is completed [v3]

Anton Kozlov akozlov at openjdk.org
Wed Mar 1 16:34:41 UTC 2023


On Wed, 1 Mar 2023 16:10:49 GMT, Roman Marchenko <rmarchenko at openjdk.org> wrote:

>> When running CRaC with docker, java may exit before CRIU has finished dumping, because CRIU kills the original java process and docker then exits immediately.
>> 
>> It could be reproduced with a simple Java test:
>> 
>> public class Test {
>>     public static void main(String args[]) throws Exception {
>>         jdk.crac.Core.checkpointRestore();
>>         System.out.println("finish");
>>     }
>> }
>> 
>> and run with docker:
>> `docker <docker_options> $JAVA_HOME/java -XX:CRaCCheckpointTo=./cr Test.java`
>> 
>> When the failure occurs, `cr/cppath` is absent after the command above finishes, and/or the restore fails:
>> `docker <docker_options> $JAVA_HOME/java -XX:CRaCRestoreFrom=./cr`
>> 
>> This change fixes the issue by forking the main process when PID=1 (PID 1 means it was run with docker), and waiting until the child processes have finished. This ensures that CRIU has finalized the dump, if any. At the same time, there is no PID conflict on restore, since the process being restored has a PID not equal to 1 when restoring with the command above.
>
> Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Returning status of the child only

src/java.base/share/native/launcher/main.c line 120:

> 118:         pid = wait(&st);
> 119:         if (pid == g_child_pid && WIFEXITED(st)) {
> 120:             status = WEXITSTATUS(st);

Sorry for nit-picking, but now if java was killed (`WIFEXITED == false`) we won't update status and will return `0`, which does not look correct. `restorewait` in this situation returns `1` [1], which, although better, also does not look perfect. Here I suggest we at least be consistent with restorewait.

Or we can fix restorewait as well, indicating that the process was killed by returning `128+signal`, as described in the bash manual [2]. How does that sound?

> When a command terminates on a fatal signal N, bash uses the value of 128+N as the exit status.
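
To make the proposal concrete, here is a minimal sketch of the `128+signal` mapping. It is not the code from the PR or from `restorewait`: the helper names `exit_code_from_wait_status` and `wait_for_child` are hypothetical, and it waits for a single child with `waitpid` rather than reaping every child as the launcher loop does.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: map a wait(2) status to a shell-style exit code. */
static int exit_code_from_wait_status(int st) {
    if (WIFEXITED(st)) {
        return WEXITSTATUS(st);     /* normal exit: pass the code through */
    }
    if (WIFSIGNALED(st)) {
        return 128 + WTERMSIG(st);  /* killed by signal N: report 128+N */
    }
    return 1;                       /* stopped/unknown: generic failure */
}

/* Hypothetical helper: wait for one child, retrying when interrupted. */
static int wait_for_child(pid_t child) {
    int st = 0;
    pid_t pid;

    assert(child > 0);
    do {
        pid = waitpid(child, &st, 0);
    } while (pid == -1 && errno == EINTR);

    if (pid != child) {
        return 1;                   /* waitpid failed: generic failure */
    }
    return exit_code_from_wait_status(st);
}
```

With this mapping a java process killed by SIGKILL (signal 9) would be reported as 137, which is what a shell user would see for the same command run outside docker.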

[1] https://github.com/openjdk/crac/blob/crac/src/java.base/unix/native/criuengine/criuengine.c#L306
[2] https://linux.die.net/man/1/bash

-------------

PR: https://git.openjdk.org/crac/pull/46

