Call for Discussion: New Project: CRaC

Mon Jul 19 21:40:23 UTC 2021

Hi Volker,

thanks for your reply! It will be great to see you as a Committer.

Indeed, CRaC is not about a single checkpoint/restore implementation.
A set of deliverables would likely include an API and something in the JVM, but
I'm not sure about the total set -- it would be part of the research, depending
on what we learn.  It's correct that there is no specific order between
deliverables.  The current implementation based on CRIU is just an example
(that needs revisiting), although it was necessary to understand the
practical implications of checkpoint/restore and to surface the first problems.

Having more mechanisms early will be very useful.  A VM like Firecracker is
probably at the other end of the spectrum of mechanisms for checkpoint/restore.
The VM environment looks rather different compared to a process.  So it is
interesting to see what it demands from the API and safety checks.  Such work
should highlight what I inadvertently hard-coded for CRIU.
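
To make the idea of a mechanism-independent API a bit more concrete, here is a
minimal sketch.  All names in it (Resource, Coordinator, the method names) are
hypothetical and chosen for illustration only; it is not the jdk.crac API and
not a proposal for its final shape.  The only point is that the callbacks know
nothing about who takes the snapshot (CRIU, a VM snapshot, or something else).

```java
import java.util.ArrayList;
import java.util.List;

class CheckpointApiSketch {
    /** Hypothetical callback interface; illustrative names, not the jdk.crac API. */
    interface Resource {
        void beforeCheckpoint();
        void afterRestore();
    }

    /** Hypothetical registry that the runtime or an external trigger would drive. */
    static class Coordinator {
        private final List<Resource> resources = new ArrayList<>();

        void register(Resource r) {
            resources.add(r);
        }

        // Notify in reverse registration order, so resources registered later
        // (which may depend on earlier ones) are released first.
        void notifyBeforeCheckpoint() {
            for (int i = resources.size() - 1; i >= 0; i--) {
                resources.get(i).beforeCheckpoint();
            }
        }

        // Restore in registration order, mirroring the release order above.
        void notifyAfterRestore() {
            for (Resource r : resources) {
                r.afterRestore();
            }
        }
    }

    /** Runs one simulated checkpoint/restore cycle and returns what the resource did. */
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        Coordinator coordinator = new Coordinator();
        coordinator.register(new Resource() {
            public void beforeCheckpoint() { log.add("close socket"); }
            public void afterRestore() { log.add("reopen socket"); }
        });
        coordinator.notifyBeforeCheckpoint();
        // The snapshot itself would be taken here by whatever mechanism is in use.
        coordinator.notifyAfterRestore();
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The notification order used here (reverse on checkpoint, forward on restore) is
an assumption of the sketch, not something settled in this thread.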

The questions likely need elaboration and probably experimentation/testing.
I would love to have them discussed on the project's mailing list.
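
As one example of the kind of experiment meant here, the j.u.Random question
can be reproduced in a few lines.  This is only a sketch under assumptions: the
afterRestore() hook is hypothetical, two generators created with the same seed
stand in for two JVMs restored from one snapshot, and the OS entropy source is
assumed to have been reseeded already.  Reseeding like this deliberately gives
up the determinism that j.u.Random specifies, which is exactly the trade-off
in question.

```java
import java.security.SecureRandom;
import java.util.Random;

class CloneSafeRandomSketch {
    private Random random;

    CloneSafeRandomSketch(long seed) {
        random = new Random(seed);
    }

    long nextId() {
        return random.nextLong();
    }

    /** Hypothetical restore hook: swap in a generator with a fresh seed. */
    void afterRestore() {
        // Assumes the OS entropy source was itself reseeded after the resume.
        random = new Random(new SecureRandom().nextLong());
    }

    public static void main(String[] args) {
        // Identical state in both objects models two clones of one snapshot.
        CloneSafeRandomSketch a = new CloneSafeRandomSketch(42L);
        CloneSafeRandomSketch b = new CloneSafeRandomSketch(42L);
        System.out.println("same value before reseed: " + (a.nextId() == b.nextId()));

        a.afterRestore();
        b.afterRestore();
        System.out.println("same value after reseed: " + (a.nextId() == b.nextId()));
    }
}
```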

Yes, we need a sponsoring group.

Thanks,
Anton

On 7/19/21 7:46 PM, Volker Simonis wrote:
> Hi Anton,
> 
> I think this will be a useful project. I support the creation of a
> dedicated OpenJDK Project for researching checkpoint/restore
> technologies within the JVM and defining a standard API for
> checkpoint/restore events. I'm interested in becoming a Committer in
> this project once it is created.
> 
> I think we should make it clear that this project will investigate and
> implement different, orthogonal features which can be delivered
> independently. In accordance with your description I currently see
> three main work streams:
> 
> 1. Define a Java API which allows applications to coordinate with
> checkpoint/restore events. This API should be generic enough to work
> with any underlying mechanism no matter if it is initiated by the JVM
> itself or externally, just for the JVM process or for the whole OS.
> 
> 2. Investigate what it takes to make the JVM checkpoint/restore aware
> and safe. Again, this should be as much as possible independent of the
> underlying checkpointing mechanism.
> 
> 3. Investigate possibilities to implement checkpoint/restore
> functionality right within the JVM. Your proof of concept
> implementation [1] based on CRIU [2] is certainly a good starting
> point.
> 
> We in the Amazon Corretto team are currently experimenting with full
> OS checkpointing and restore as provided by the snapshotting
> functionality [3] in Firecracker [4]. Compared to CRIU and Docker
> Checkpoint&Restore, Firecracker Snapshotting is different in the sense
> that it saves not just a single process or container but a whole
> running OS. This has some advantages (e.g. you don't have to care
> about file handles because they will still be valid after resume) but
> also some drawbacks (e.g. the need to reseed /dev/random and to sync
> the system clock).
> 
> We were wondering if we could use the API which you've proposed in
> your initial post (i.e. jdk.crac [3]) to notify the JDK and
> applications of suspend and resume events. In the case of Firecracker,
> the source of these events would be either the kernel (through a
> System Generation ID kernel driver [5]) or SystemD [6]. There is
> ongoing work to push these mechanisms into the respective upstream
> projects, but SystemD "inhibitor" [7,8] events could already be used
> now to trigger the callbacks of the envisioned API.
> 
> There are several issues which we are currently investigating and
> which we'd like to discuss in this project:
> - Does it make sense to add timeouts (or a TimeoutException) to the
> proposed API?
> - How to deal with Pseudo Random Generators like j.u.Random? They are
> specified to be deterministic and applications might rely on this
> determinism. But we might also run into problems if several, cloned
> JVM instances are using the same random values (e.g. as UIDs).
> - How to make the JVM/JDK behave gracefully after "time-jumps"?
> - Is there anything special required to make the JVM "snap-safe" if
> checkpointing can be initiated from outside the JVM at any arbitrary
> time?
> 
> I hope we will find at least one "Sponsoring Group" [9] for this
> Project such that we can continue the discussion on a dedicated
> mailing list.
> 
> Thanks for proposing this group and your great work on this topic,
> Volker
> 
> [1] https://github.com/org-crac/jdk/compare/jdk-base..jdk-crac
> [2] https://github.com/checkpoint-restore/criu
> [3] https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/snapshot-support.md
> [4] https://firecracker-microvm.github.io/
> [5] https://lkml.org/lkml/2021/3/8/677
> [6] https://github.com/systemd/systemd/issues/19269
> [7] https://www.freedesktop.org/software/systemd/man/systemd-inhibit.html
> [8] https://github.com/systemd/systemd/blob/main/src/login/inhibit.c
> [9] https://openjdk.java.net/projects/#new-project
> 
> On Sun, Jul 18, 2021 at 4:49 PM Anton Kozlov <akozlov at azul.com> wrote:
>>
>> Hi,
>>
>> It's been a while since we presented Coordinated Restore at Checkpoint for the
>> first time [0].  We are still committed to the idea and researching this topic.
>>
>> Java applications can avoid the long start-up and warm-up by saving the state
>> of the Java runtime (snapshot, checkpoint).  The saved state is then used to
>> start instances fast (restored).  But after the state was saved, the execution
>> environment could change.  Also, if multiple instances are started from the
>> saved state simultaneously, they should obtain some uniqueness, and their
>> executions should diverge at some point.
>>
>> We believe that the practical way to solve these problems is to make Java
>> applications aware of when the state is saved and restored.  Then an
>> application will be able to handle environmental changes.  The application will
>> also be able to obtain uniqueness from the environment.
>>
>> The CRaC project aims to research a Java API for coordination between the
>> application and the runtime to save and restore the state.  The runtime should
>> support multiple ways to save the state: virtual machine snapshot, container
>> snapshot, the CRIU project on Linux, etc.  We hope to come up with an API that
>> is general enough for any underlying mechanism.  We also plan to explore safety
>> checks in the API and runtime that prevent saving the state if it may not be
>> restored or work correctly after the restore.
>>
>> I propose myself as the Project Lead of the CRaC Project.  If you're interested
>> or want to be a Committer, please drop me a message.
>>
>> A fork of JDK [1] would be a starting point of this project.
>>
>> Thanks,
>> Anton
>>
>> [0] https://mail.openjdk.java.net/pipermail/discuss/2020-September/005594.html
>> [1] https://github.com/CRaC/jdk
>>
>