Call for Discussion: New Project: CRaC

Thu Jul 22 19:17:29 UTC 2021

On 7/19/21 7:46 PM, Volker Simonis wrote:
> We were wondering if we could use the API which you've proposed in
> your initial post (i.e. jdk.crac [3]) to notify the JDK and
> applications of suspend and resume events. In the case of Firecracker,
> the source of these events would be either the kernel (through a
> System Generation ID kernel driver [5] or SystemD [6]). There is
> ongoing work to push these mechanisms into the respective upstream
> projects but SystemD "inhibitors" [7,8] events could be used already
> now to trigger the callbacks of the envisioned API.

I would prefer a pure file interface to something systemd-specific (a file
managed by systemd is OK).  In that case, the coordination could be
implemented in the JVM without any dependencies, or even in pure Java.  For
example, a thread waits for updates to the special file, triggers CRaC's
beforeCheckpoint callbacks, and then signals OK to the snapshotting side.
However, I don't see how prepare-for-snapshot is communicated via the kernel
file.  The LKML doc suggests only a workload-specific way [1, Example
snapshot-safe workflow, point 1].  The interface seems to provide only a
notification after the VM is resumed.
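
To make the idea of a file-driven coordinator concrete, here is a minimal
sketch in pure Java.  Everything about the protocol is an assumption made up
for illustration: the request file, the ack file, and the "prepare"/"ready"
tokens are hypothetical and are not part of the kernel interface in [1], which
as noted does not seem to offer a prepare-for-snapshot signal at all.  The
Runnable stands in for running the beforeCheckpoint callbacks.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical coordinator: polls a request file, runs the callbacks,
// then acknowledges through a second file.
final class SnapshotFileWatcher implements Runnable {
    private final Path requestFile;          // assumed: written by the snapshot manager
    private final Path ackFile;              // assumed: written by the JVM when ready
    private final Runnable beforeCheckpoint; // stand-in for CRaC's beforeCheckpoint callbacks

    SnapshotFileWatcher(Path requestFile, Path ackFile, Runnable beforeCheckpoint) {
        this.requestFile = requestFile;
        this.ackFile = ackFile;
        this.beforeCheckpoint = beforeCheckpoint;
    }

    // Handles at most one pending request; returns true if one was handled.
    boolean pollOnce() throws IOException {
        if (!Files.exists(requestFile)) {
            return false;
        }
        String request = Files.readString(requestFile).trim();
        if (!"prepare".equals(request)) {
            return false;
        }
        beforeCheckpoint.run();                 // synchronous: may take arbitrarily long
        Files.delete(requestFile);              // consume the request
        Files.writeString(ackFile, "ready\n");  // signal OK to the snapshotting side
        return true;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                pollOnce();
                Thread.sleep(100);              // polling interval, chosen arbitrarily
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }
}
```

Polling keeps the sketch dependency-free; a WatchService or a blocking read on
a special file would serve the same purpose.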

It is not clear how systemd would help: the inhibitors are locks, so how would
one know when the lock should be taken in order to run the beforeCheckpoint
callbacks?

From CRaC's side, it would be possible to split the single checkpointRestore
method [2] into two steps (one calls the beforeCheckpoint callbacks and the
other calls the afterRestore callbacks).  The steps can be exposed via the API
and e.g. jcmd.  Would that help to run the beforeCheckpoint callbacks before
the actual snapshot is taken, e.g. to clear secrets from the state?
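
A rough sketch of what such a two-step split could look like.  None of these
names exist in jdk.crac: SketchResource and TwoStepContext are hypothetical
stand-ins, and the notification order (reverse registration order before the
checkpoint, registration order after the restore) is just a choice made for
this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a CRaC resource; not the jdk.crac interface.
interface SketchResource {
    void beforeCheckpoint();
    void afterRestore();
}

// Hypothetical context exposing the two halves of checkpointRestore separately.
final class TwoStepContext {
    private final List<SketchResource> resources = new ArrayList<>();
    private boolean prepared;

    synchronized void register(SketchResource resource) {
        resources.add(resource);
    }

    // Step 1: run all beforeCheckpoint callbacks; the snapshot may be taken afterwards.
    synchronized void prepareForCheckpoint() {
        if (prepared) {
            throw new IllegalStateException("already prepared for checkpoint");
        }
        for (int i = resources.size() - 1; i >= 0; i--) {
            resources.get(i).beforeCheckpoint();
        }
        prepared = true;
    }

    // Step 2: run all afterRestore callbacks once the VM or process is resumed.
    synchronized void completeRestore() {
        if (!prepared) {
            throw new IllegalStateException("prepareForCheckpoint was not called");
        }
        for (SketchResource resource : resources) {
            resource.afterRestore();
        }
        prepared = false;
    }
}
```

An external snapshot manager (or a jcmd command) would call the first step,
take the snapshot, and call the second step after resume.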

> There are several issues which we are currently investigating and
> which we'd like to discuss in this project:
> - Does it make sense to add timeouts (or a TimeoutException) to the
> proposed API?

This seems reasonable in some cases.  In others, though, you may want
synchronous execution of callbacks that may take arbitrarily long.  E.g. you
want to shut down a web server that is connected to a database, and you want to
be sure all clients are served before shutting down the DB connection.

So a Resource with a timeout should probably be built on top of the synchronous
Resource notification: for example, the callback runs in a separate thread that
is waited on with the timeout and interrupted once the time is out.  The
callback is not guaranteed to stop executing, but the Context will know
immediately that the callback has failed to finish in time.
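
A minimal sketch of that layering, using only plain threads.  TimedCallback
and runWithTimeout are hypothetical names, not part of the proposed API; the
Runnable stands in for a synchronous beforeCheckpoint or afterRestore call.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper: adds a timeout on top of a synchronous callback.
final class TimedCallback {
    private TimedCallback() {}

    // Returns true if the callback finished in time, false otherwise.
    static boolean runWithTimeout(Runnable callback, long timeout, TimeUnit unit) {
        Thread worker = new Thread(callback, "resource-callback");
        worker.setDaemon(true);  // a stuck callback must not keep the JVM alive
        worker.start();
        try {
            // join(0) would wait forever, so wait at least one millisecond
            worker.join(Math.max(1L, unit.toMillis(timeout)));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            worker.interrupt();
            return false;
        }
        if (worker.isAlive()) {
            // Interruption is only a request; the callback may keep running.
            worker.interrupt();
            return false;
        }
        return true;
    }
}
```

The caller learns about the failure as soon as the timeout expires, even if
the callback ignores the interrupt and continues in the background.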

It seems right that time-restricted Resources specify their timeouts
themselves.

> - How to deal with Pseudo Random Generators like j.u.Random? They are
> specified to be deterministic and applications might rely on this
> determinism. But we might also run into problems if several, cloned
> JVM instances are using the same random values (e.g. as UIDs).

Apparently, only the user of a j.u.Random can tell the two cases apart.  At the
least, a Random that should provide distinct random values can be manually
re-seeded after the restore.  It is probably possible to distinguish two
classes of j.u.Random instances (those that must keep deterministic output
after restore and those that need not) and handle them automatically by looking
at whether they were constructed with a seed or not.  But this needs to be
checked thoroughly.
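
For illustration only, the seeded/unseeded distinction could look like the
sketch below.  RestoreAwareRandom and its afterRestore method are hypothetical;
the real j.u.Random does not record how it was constructed, and whether a fresh
SecureRandom is a suitable entropy source right after a restore is itself an
assumption that would need checking.

```java
import java.security.SecureRandom;
import java.util.Random;

// Hypothetical wrapper: remembers whether the user supplied a seed.
class RestoreAwareRandom extends Random {
    private static final long serialVersionUID = 1L;

    private final boolean explicitlySeeded;

    RestoreAwareRandom() {
        super();
        this.explicitlySeeded = false;
    }

    RestoreAwareRandom(long seed) {
        super(seed);
        this.explicitlySeeded = true;
    }

    // Called after restore; returns true if the generator was re-seeded.
    boolean afterRestore() {
        if (explicitlySeeded) {
            return false;  // the user asked for determinism, keep the sequence
        }
        // Assumption: a SecureRandom created after the restore yields fresh entropy.
        setSeed(new SecureRandom().nextLong());
        return true;
    }
}
```

A generator built with an explicit seed keeps producing the same sequence as a
plain Random with that seed; only the unseeded ones diverge between clones.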

> - How to make the JVM/JDK behave gracefully after "time-jumps".

I assume there should be no correctness problems, as a time-jump does not
substantially differ from time spent off-CPU due to OS scheduling.  Some
internal counters could overflow, but that looks like nothing more than a bug
that needs fixing.

However, I saw cases where CRIU restored the monotonic clock and that broke
timed waits, loading the CPU to 100% because of an improper time limit.  Once
we stopped restoring the clock altogether, the issue went away.  That brought
us back to the time jump, which was handled correctly.
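
As an illustration of why a forward time jump is harmless for a well-formed
timed wait, consider the sketch below.  DeadlineWait is a hypothetical helper,
not JDK code, and it does not reproduce the CRIU issue; it only shows a wait
that recomputes the elapsed time on each iteration and bounds every sleep.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper: a polling wait with a deadline based on System.nanoTime().
final class DeadlineWait {
    private static final long MAX_SLEEP_NANOS = TimeUnit.MILLISECONDS.toNanos(10);

    private DeadlineWait() {}

    // Returns true if the condition became true before the timeout elapsed.
    static boolean await(BooleanSupplier condition, long timeout, TimeUnit unit) {
        final long timeoutNanos = unit.toNanos(timeout);
        final long start = System.nanoTime();
        while (!condition.getAsBoolean()) {
            long elapsed = System.nanoTime() - start;
            if (elapsed >= timeoutNanos) {
                return false;  // also the outcome of a large forward jump
            }
            // Bound each sleep so one bad clock reading cannot cause a huge wait.
            long sleepNanos = Math.min(timeoutNanos - elapsed, MAX_SLEEP_NANOS);
            try {
                TimeUnit.NANOSECONDS.sleep(Math.max(sleepNanos, 1L));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return condition.getAsBoolean();
            }
        }
        return true;
    }
}
```

From the waiter's point of view, a jump forward simply looks like a timeout
that has already expired, the same as after a long time off-CPU.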

> - Is there anything special required to make the JVM "snap-safe" if
> checkpointing can be initiated from outside the JVM at any arbitrary
> time.

Currently in CRaC, after the beforeCheckpoint callbacks have run, the actual
checkpoint is done while the JVM is at a safepoint.  There we check that there
are no open files or sockets and then call CRIU against the Java process.  I'm
not aware of problems with snapshotting the process at an arbitrary moment, so
the safepoint matters only for the checks.

Thanks,
Anton

[1] https://lkml.org/lkml/2021/3/8/677
[2] https://github.com/CRaC/jdk/blob/jdk-crac/src/java.base/share/classes/jdk/crac/Core.java#L102