Call for Discussion: New Project: CRaC

Michael Bien mbien42 at gmail.com
Tue Jul 20 15:31:20 UTC 2021


Hello,

Great to hear that research is being done in this area.

I did some experimenting myself some time ago by binding to the CRIU C API 
via Panama [1][2]. It quickly became clear that, although this worked 
surprisingly well, a proper implementation would probably require a 
lower-level approach. (I was mostly interested in CRIU's rootless mode [3] 
and in restoring warmed-up JVMs, which came with their own issues and 
kernel bugs.)

Checkpointing the JVM is probably much safer when all threads have 
stopped and more economical when the heap is compacted. The JVM itself 
is in a better position to do that than the Java application.

CRIU can't deal with situations where files have changed between checkpoint 
and restore. Restoring a Java program which is logging to a file will 
only work once; a second attempt would fail since the file changed due 
to the first restore. An API might be able to mitigate a lot of this, 
e.g. a logger could rotate the log to an empty file, or close the file on 
checkpoint and reopen it on restore. JFR should do this out of the box. I 
was wondering if the I/O stream implementation itself could help in some 
situations.
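
A minimal sketch of the close-on-checkpoint / reopen-on-restore idea for a 
logger. It assumes a hypothetical callback interface (CheckpointAware, with 
beforeCheckpoint and afterRestore) that some runtime would invoke. These 
names are illustrative only and are not taken from CRIU or from the CRaC 
proposal, and the checkpoint itself is only simulated:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Demo entry point, declared first so that `java <file>.java` picks it up.
class RestoreAwareLogDemo {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempDirectory("crac-log-demo").resolve("app.log");
        FileLogger log = new FileLogger(file);
        log.log("warm-up done");

        log.beforeCheckpoint(); // the logger holds no open file from here on
        // A checkpoint would be taken here. Simulate the file changing
        // before this instance is restored:
        Files.writeString(file, "line from an earlier restore\n",
                StandardOpenOption.APPEND);
        log.afterRestore();     // reopen in append mode

        log.log("restored");
        log.beforeCheckpoint(); // close again so everything is on disk
        Files.readAllLines(file).forEach(System.out::println);
    }
}

// Hypothetical interface: an assumption for illustration, not a real API.
interface CheckpointAware {
    void beforeCheckpoint() throws IOException;
    void afterRestore() throws IOException;
}

class FileLogger implements CheckpointAware {
    private final Path path;
    private BufferedWriter out; // null while closed for a checkpoint

    FileLogger(Path path) throws IOException {
        this.path = path;
        open();
    }

    private void open() throws IOException {
        out = Files.newBufferedWriter(path, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    synchronized void log(String message) throws IOException {
        if (out == null) {
            throw new IllegalStateException("logger is closed for checkpoint");
        }
        out.write(message);
        out.newLine();
        out.flush();
    }

    synchronized boolean isOpen() {
        return out != null;
    }

    @Override
    public synchronized void beforeCheckpoint() throws IOException {
        if (out != null) {
            out.close(); // flushes and releases the file
            out = null;
        }
    }

    @Override
    public synchronized void afterRestore() throws IOException {
        if (out == null) {
            open();
        }
    }
}
```

Because the file is reopened by path in append mode, the logger does not 
depend on the file being byte-for-byte identical to what it was at checkpoint 
time. Rotating to a fresh file in afterRestore would be a variation on the 
same hook.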

Non-file-related APIs might have to be made restore-aware too. For 
example, SecureRandom might require re-seeding, and keystores/SSL certs 
might need special attention.
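
As a rough illustration of the SecureRandom case: a sketch, assuming some 
restore hook exists (the afterRestore method and the RandomHolder class 
below are hypothetical names). It swaps in a fresh SecureRandom after a 
restore instead of reusing generator state that was captured in the 
snapshot, on the assumption that a newly constructed instance seeds itself 
from the platform's entropy source:

```java
import java.security.SecureRandom;

// Demo entry point, declared first so that `java <file>.java` picks it up.
class RestoreAwareRandomDemo {
    public static void main(String[] args) {
        RandomHolder holder = new RandomHolder();
        SecureRandom beforeCheckpoint = holder.get();

        // A checkpoint/restore cycle would happen here; only the
        // (hypothetical) restore callback is simulated.
        holder.afterRestore();

        System.out.println("generator replaced: "
                + (beforeCheckpoint != holder.get()));
        System.out.println("token length: " + holder.nextBytes(16).length);
    }
}

// Hypothetical holder class: callers ask it for random bytes rather than
// caching their own SecureRandom across a checkpoint.
class RandomHolder {
    private volatile SecureRandom random = new SecureRandom();

    SecureRandom get() {
        return random;
    }

    byte[] nextBytes(int length) {
        byte[] bytes = new byte[length];
        random.nextBytes(bytes);
        return bytes;
    }

    // Would be called by whatever mechanism signals "restored".
    void afterRestore() {
        // Replace the instance so that several JVMs restored from the same
        // image do not continue from identical internal generator state.
        random = new SecureRandom();
    }
}
```

Replacing the instance is used here rather than setSeed, since setSeed 
supplements the existing seed instead of discarding the snapshotted state. 
Whether that would be sufficient in practice is a separate question.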


In short, although it worked surprisingly well (restoring was also quite 
fast), implementing it at the Java application level would be fairly 
limited. Looking forward to hearing/seeing more from CRaC!

best regards,
michael

[1] https://github.com/mbien/JCRIU/
[2] https://mbien.dev/blog/entry/java-and-rootless-criu-using
[3] https://github.com/checkpoint-restore/criu/pull/1155

On 18.07.21 16:48, Anton Kozlov wrote:
> Hi,
>
> It's been a while since we presented Coordinated Restore at Checkpoint for
> the first time [0].  We are still committed to the idea and researching this
> topic.
>
> Java applications can avoid the long start-up and warm-up by saving the
> state of the Java runtime (snapshot, checkpoint).  The saved state is then
> used to start instances fast (restored).  But after the state was saved, the
> execution environment could change.  Also, if multiple instances are started
> from the saved state simultaneously, they should obtain some uniqueness, and
> their executions should diverge at some point.
>
> We believe that the practical way to solve these problems is to make Java
> applications aware of when the state is saved and restored.  Then an
> application will be able to handle environmental changes.  The application
> will also be able to obtain uniqueness from the environment.
>
> The CRaC project aims to research a Java API for coordination between
> application and runtime to save and restore the state.  Runtime should
> support multiple ways to save the state: virtual machine snapshot, container
> snapshot, CRIU project on Linux, etc.  We hope to come up with an API that is
> general enough for any underlying mechanism.  We also plan to explore safety
> checks in the API and runtime, which prevent saving the state if it may not
> be restored or work correctly after the restore.
>
> I propose myself as a Project Lead of the CRaC Project.  If you're
> interested or want to be the committer, please drop me a message.
>
> A fork of JDK [1] would be a starting point of this project.
>
> Thanks,
> Anton
>
> [0] https://mail.openjdk.java.net/pipermail/discuss/2020-September/005594.html
> [1] https://github.com/CRaC/jdk
>


