Call for Discussion: New Project: CRaC

Mon Jul 19 16:46:27 UTC 2021

Hi Anton,

I think this will be a useful project. I support the creation of a
dedicated OpenJDK Project for researching checkpoint/restore
technologies within the JVM and defining a standard API for
checkpoint/restore events. I'm interested in becoming a Committer in
this project once it is created.

I think we should make it clear that this project will investigate and
implement different, orthogonal features which can be delivered
independently. In accordance with your description I currently see
three main work streams:

1. Define a Java API which allows applications to coordinate with
checkpoint/restore events. This API should be generic enough to work
with any underlying mechanism no matter if it is initiated by the JVM
itself or externally, just for the JVM process or for the whole OS.

2. Investigate what it takes to make the JVM checkpoint/restore aware
and safe. Again, this should be as much as possible independent of the
underlying checkpointing mechanism.

3. Investigate possibilities to implement checkpoint/restore
functionality right within the JVM. Your proof of concept
implementation [1] based on CRIU [2] is certainly a good starting
point.
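
To make the first work stream a bit more concrete, here is a minimal
sketch of what such a coordination API could look like. The names
CheckpointListener and CheckpointCoordinator are hypothetical and are
not taken from jdk.crac. The only point is that the trigger (the JVM
itself or something external) stays hidden behind the two notify
methods. Notifying in reverse order before a checkpoint is just one
possible design choice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback interface; illustrative only, not the jdk.crac API.
interface CheckpointListener {
    void beforeCheckpoint();
    void afterRestore();
}

// Hypothetical registry. Whoever detects the checkpoint/restore event
// (the JVM itself, a kernel driver, SystemD, ...) calls the notify methods.
final class CheckpointCoordinator {
    private final List<CheckpointListener> listeners = new ArrayList<>();

    synchronized void register(CheckpointListener listener) {
        listeners.add(listener);
    }

    // Assumption: listeners registered later may depend on earlier ones,
    // so they are notified first before a checkpoint.
    synchronized void notifyBeforeCheckpoint() {
        for (int i = listeners.size() - 1; i >= 0; i--) {
            listeners.get(i).beforeCheckpoint();
        }
    }

    // After a restore, listeners are notified in registration order.
    synchronized void notifyAfterRestore() {
        for (CheckpointListener listener : listeners) {
            listener.afterRestore();
        }
    }
}
```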

We in the Amazon Corretto team are currently experimenting with full
OS checkpointing and restore as provided by the snapshotting
functionality [3] in Firecracker [4]. Compared to CRIU and Docker
Checkpoint&Restore, Firecracker Snapshotting is different in the sense
that it saves not just a single process or container but a whole
running OS. This has some advantages (e.g. you don't have to care
about file handles because they will still be valid after resume) but
also some drawbacks (e.g. the need to reseed /dev/random and to sync
the system clock).
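
As an illustration of the clock drawback: once the clock of a resumed
guest is synced, wall-clock time as seen from Java may jump. The
following is a minimal sketch of detecting such a jump, assuming the
application polls from time to time. TimeJumpDetector and its
tolerance parameter are hypothetical names, and a callback-based API
would of course make polling unnecessary.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: reports when wall-clock time moved by more than
// the given tolerance between two consecutive polls.
final class TimeJumpDetector {
    private final Clock clock;
    private final Duration tolerance;
    private Instant lastSeen;

    TimeJumpDetector(Clock clock, Duration tolerance) {
        this.clock = clock;
        this.tolerance = tolerance;
        this.lastSeen = clock.instant();
    }

    // Returns the observed gap if it exceeds the tolerance (in either
    // direction), otherwise Duration.ZERO.
    synchronized Duration poll() {
        Instant now = clock.instant();
        Duration gap = Duration.between(lastSeen, now);
        lastSeen = now;
        return gap.abs().compareTo(tolerance) > 0 ? gap : Duration.ZERO;
    }
}
```

In production code the detector would be created with
Clock.systemUTC(); passing the Clock in makes it easy to test.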

We were wondering if we could use the API which you've proposed in
your initial post (i.e. jdk.crac [1]) to notify the JDK and
applications of suspend and resume events. In the case of Firecracker,
the source of these events would be either the kernel (through a
System Generation ID kernel driver [5]) or SystemD [6]. There is
ongoing work to push these mechanisms into the respective upstream
projects, but SystemD "inhibitor" [7,8] events could already be used
now to trigger the callbacks of the envisioned API.

There are several issues which we are currently investigating and
which we'd like to discuss in this project:
- Does it make sense to add timeouts (or a TimeoutException) to the
proposed API?
- How to deal with pseudorandom number generators like j.u.Random?
They are specified to be deterministic and applications might rely on
this determinism. But we might also run into problems if several
cloned JVM instances are using the same random values (e.g. as UIDs).
- How to make the JVM/JDK behave gracefully after "time-jumps"?
- Is there anything special required to make the JVM "snap-safe" if
checkpointing can be initiated from outside the JVM at any arbitrary
time?
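
To illustrate the j.u.Random question: one possible answer is to leave
j.u.Random itself untouched and let applications opt in to reseeding.
The sketch below assumes that some after-restore callback exists and
that the OS entropy source has already been reseeded by the time it
runs. ReseedableRandom and reseedAfterRestore() are hypothetical
names.

```java
import java.security.SecureRandom;
import java.util.Random;

// Hypothetical wrapper: deterministic for a given seed until the
// application explicitly asks for a fresh seed after a restore.
final class ReseedableRandom {
    private volatile Random delegate;

    ReseedableRandom(long seed) {
        this.delegate = new Random(seed);
    }

    long nextLong() {
        return delegate.nextLong();
    }

    // Intended to be called from an after-restore callback so that
    // clones of the same snapshot stop producing identical values.
    void reseedAfterRestore() {
        delegate = new Random(new SecureRandom().nextLong());
    }
}
```

Two instances created with the same seed return the same values until
each of them calls reseedAfterRestore(), which is the behavior cloned
JVMs would need for things like UIDs.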

I hope we will find at least one "Sponsoring Group" [9] for this
Project such that we can continue the discussion on a dedicated
mailing list.

Thanks for proposing this project and your great work on this topic,
Volker

[1] https://github.com/org-crac/jdk/compare/jdk-base..jdk-crac
[2] https://github.com/checkpoint-restore/criu
[3] https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/snapshot-support.md
[4] https://firecracker-microvm.github.io/
[5] https://lkml.org/lkml/2021/3/8/677
[6] https://github.com/systemd/systemd/issues/19269
[7] https://www.freedesktop.org/software/systemd/man/systemd-inhibit.html
[8] https://github.com/systemd/systemd/blob/main/src/login/inhibit.c
[9] https://openjdk.java.net/projects/#new-project

On Sun, Jul 18, 2021 at 4:49 PM Anton Kozlov <akozlov at azul.com> wrote:
>
> Hi,
>
> It's been a while since we presented Coordinated Restore at Checkpoint for the
> first time [0].  We are still committed to the idea and researching this topic.
>
> Java applications can avoid the long start-up and warm-up by saving the state
> of the Java runtime (snapshot, checkpoint).  The saved state is then used to
> start instances fast (restored).  But after the state was saved, the execution
> environment could change.  Also, if multiple instances are started from the
> saved state simultaneously, they should obtain some uniqueness, and their
> executions should diverge at some point.
>
> We believe that the practical way to solve these problems is to make Java
> applications aware of when the state is saved and restored.  Then an
> application will be able to handle environmental changes.  The application will
> also be able to obtain uniqueness from the environment.
>
> The CRaC project aims to research Java API for coordination between application
> and runtime to save and restore the state.  Runtime should support multiple
> ways to save the state: virtual machine snapshot, container snapshot, CRIU
> project on Linux, etc.  We hope to come with an API that is general enough for
> any underlying mechanism.  We also plan to explore safety checks in the API and
> runtime, which prevent saving the state if it may not be restored or work
> correctly after the restore.
>
> I propose myself as a Project Lead of the CRaC Project.  If you're interested
> or want to be the committer, please drop me a message.
>
> A fork of JDK [1] would be a starting point of this project.
>
> Thanks,
> Anton
>
> [0] https://mail.openjdk.java.net/pipermail/discuss/2020-September/005594.html
> [1] https://github.com/CRaC/jdk
>