Two ideas and a bug
Radim Vansa
rvansa at azul.com
Fri Sep 6 15:10:28 UTC 2024
Hi Charles,
comments inline...
On 06. 09. 24 15:56, Anton Kozlov wrote:
> On 9/5/24 7:22 AM, Charles Oliver Nutter wrote:
>> Baseline "hello world" startup of JRuby improves by 15-20x, which
>> says as much about our fat boot cycle as it does about CRaC's
>> outstanding restore performance.
>
> Thank you for sharing such an awesome result!
>
>> * Idea #1: CRaC checkpointing as a really slow JVM fork(2).
>>
>> JRuby has never been able to support forking the JVM because of
>> challenges restoring the new process to full functionality:
>> restarting GC and JIT threads, managing signals and file descriptors,
>> etc. CRaC is already doing that in order to restore from a checkpoint!
>>
>> What if I wanted the checkpoint process to keep executing, but start
>> up a child process by restoring the checkpoint I just acquired?
>> Presto, super-slow forking!
>
> Technically this is possible, and it looks reasonable. One of the
> primary use cases for CRaC is to be able to quickly scale Java
> instances, so forking in that way could be used to scale Java
> processes on a single machine.
>
> Right now, if you export the CRAC_CRIU_LEAVE_RUNNING=1 environment
> variable, the original process will be kept alive. Then you should be
> able to restore from the (temporary?) image.
The second restore would currently fail; CRIU will attempt to restore
with the same PIDs/TIDs as the running instance, and that will fail. I
think Anton experimented in the past with letting CRIU restore under
different PIDs, and from the Java POV this is mostly OK. But I think
this would require modifications in CRIU, including some ugly code that
would depend on the glibc version - the thread IDs are stored somewhere
at the beginning of the stack.
However, if you'd be OK with restoring it in a new PID namespace, PID
conflicts could be avoided.
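For illustration, the whole flow could look something like this (an
untested sketch, reusing the img1/Test.java names from Anton's example
further below as placeholders; whether the restore actually works from
inside a fresh PID namespace is exactly the open question above, and
unshare needs the appropriate privileges):

# checkpoint, but keep the original JVM running
CRAC_CRIU_LEAVE_RUNNING=1 $JAVA -XX:CRaCCheckpointTo=img1 Test.java

# later, "fork" a copy by restoring the image in a new PID namespace,
# so the restored PIDs/TIDs cannot clash with the still-running original
sudo unshare --pid --fork --mount-proc $JAVA -XX:CRaCRestoreFrom=img1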
>
>> * Idea #2: Incremental checkpointing
>>
>> I don't know if there's any technical limitation on acquiring a new
>> checkpoint after restoring from an old checkpoint, but there's one
>> practical limitation: you can't change the target directory for the
>> new checkpoint.
>>
>> I would like to be able to incrementally improve a checkpoint,
>> dumping the image to a new directory of my choosing each time. This
>> would allow a checkpoint/restore chain similar to re-forking servers,
>> which base later forks on the warmed-up children of previous forks. I
>> could provide a baseline JRuby image that users could customize to
>> their specific applications and load patterns.
There's one more aspect to this: with the default configuration we use
in CRaC, the files in the image directory are mmapped into memory rather
than loaded (loading is what vanilla CRIU does). This is a very
important performance optimization (you can disable it and see the
impact with CRAC_CRIU_OPTS=-no-mmap-page-image). However, this means
that the second checkpoint will observe this file mapping and record it
as-is - so you eventually end up depending on all the images, not only
the last one.
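(As a quick way to see or side-step this before any proper fix: if
CRAC_CRIU_OPTS is honored on restore as well - I haven't double-checked
that here - restoring without the mmapped page image should make a
subsequent checkpoint self-contained again, at the cost of a slower
restore. Something like:

CRAC_CRIU_OPTS=-no-mmap-page-image $JAVA -XX:CRaCRestoreFrom=img1 -XX:CRaCCheckpointTo=img2

But treat that as my reading of the behaviour described above, not as a
verified recipe.)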
Again, a solution exists: a background thread in the JVM could
concurrently copy the bits to a new chunk of memory and mremap that into
the original place. The memory being copied would have to be
(temporarily) mapped read-only, and we would have to handle the SIGSEGV
raised by anything trying to write into it. In addition, we'd need to be
a bit careful to let the kernel join the remapped VMAs back together,
but I am getting into obscure implementation details. Architecturally,
this breaks the border of responsibility: normally this would be CRIU's
job, but here it is something that runs inside the JVM (so part of the
JVM? a parasite thread injected by CRIU? handled externally through the
ptrace API?).
>>
>> It would seem a checkpointRestore(Path) should be doable, yes?
I fully second the suggestion to have this in the API; however, given
the problem above, there has to be a practical application for the
second checkpoint.
Anyway, thanks for these ideas. I believe it's important to keep the
big picture of all the use cases for CRaC in mind, rather than thinking
just about quick microservice startup somewhere in the cloud. JRuby can
definitely bring a different set of problems to the discussion; we just
need to crac(k) them :)
Radim
>
> CRaCCheckpointTo can be set on the restore, along with a few other
> options. But there is at least one bug: the second checkpoint chooses
> a wrong destination path. We'll investigate this under
> https://bugs.openjdk.org/browse/JDK-8339662.
>
> $JAVA -XX:CRaCCheckpointTo=img1 -DpreLoop Test.java
> init
> start
> stage 1: 1
> stage 1: 2
> stage 1: 3
> Sep 06, 2024 12:33:10 PM jdk.internal.crac.LoggerContainer info
> INFO: Starting checkpoint
> beforeCheckpoint
> Killed
>
> $JAVA -XX:CRaCRestoreFrom=img1 -XX:CRaCCheckpointTo=asdf
> afterRestore
> stage 1: 4
> stage 1: 5
> stage 1: 6
> stage 1: 7
> Sep 06, 2024 12:33:17 PM jdk.internal.crac.LoggerContainer info
> INFO: Starting checkpoint
> beforeCheckpoint
> Error (criu/image.c:577): Can't open dir ubuntu: No such file or
> directory
> Error (criu/crtools.c:237): Couldn't open image dir ubuntu
> ...
> stage 1: 8
> stage 1: 9
> stage 1: 10
>
>> * Possible bug: overwriting a compressed checkpoint with an
>> uncompressed checkpoint produces a non-bootable image.
>
> I can confirm this. It's very likely caused by us not cleaning the
> target directory, combined with the fact that we detect the type of
> the image by the presence of the compressed part. So we'll track this
> under https://bugs.openjdk.org/browse/JDK-8339663.
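Until that is fixed, a workaround sketch (untested; the names are just
placeholders) would be to clean the image directory yourself before
taking a checkpoint with different compression settings:

# make sure no stale (compressed) pieces from the previous image remain
rm -rf img1
$JAVA -XX:CRaCCheckpointTo=img1 Test.java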
>
>> You should be able to reproduce this with a build of JRuby
>> (https://github.com/jruby/jruby) from the "crac" branch.
>
> Thank you very much for all the feedback!
>
> -- Anton
>