CFV: New Project: CRaC

Fri Aug 6 15:37:37 UTC 2021

On 8/6/21 12:49 PM, Andrew Dinn wrote:
> Well, I am afraid that the above assertion about the limited room for things to go wrong thanks to a 'simple' save and restore does not self-evidently qualify as commonly agreed fact. It might well be true for /some/ apps, especially if they are saved and then restored on exactly the same host from the same parent process with little or no intervening change to the underlying operating and process system environment (and ignoring any effect the pause might have on timings). However, even in those limited, albeit rather vaguely defined, circumstances I don't think we can presume it will be true for all apps.

My point was about the Java language (JLS); I'm not aware of anything that
needs to be fixed there.

It should be easy to construct an app that will be incompatible with CRaC.  But
handling such apps automatically in the platform is not something we are
aiming for.  Instead, we propose a new mechanism for apps to coordinate,
something that was not possible before.  Whether to coordinate is the app
programmer's choice, so even if the mechanism is present in the JDK, the
programmer still has additional work to do at the level of application
semantics.

> It's worse than that though. For the primary use case I think you are talking about -- running services in a container on cloud infrastructure -- I think it is quite possible that a deployment may fail even to meet those rather vaguely set out requirements.  (1) Might a saved app state not be resumed many times, possibly concurrently? (2) Might local OS resources conceivably change between restarts? (3) Might the parent process env conceivably change between restarts? (4) Indeed, might those effects not actually occur because of the very fact that the app is being restarted?

If points 1-4 are about an application that depends on those conditions to
work correctly, we propose that the application resolve these problems itself.
Even without CRaC it is possible to construct an application that cannot be
started multiple times concurrently, for example because it uses a fixed temp
file that all instances would share and therefore trash.  Applications have to
be coded in a particular way to support multiple concurrent instances, and the
same should hold for apps that are saved and restored.  The same reasoning
applies to every point.
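
The temp-file case can be sketched as follows.  This is only an illustration
under stated assumptions: the class name ScratchFiles and the "myapp-scratch"
file name are hypothetical, and the only library calls used are the standard
java.nio.file ones.  A fixed path is shared by every instance on the host,
while Files.createTempFile gives each caller its own file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, used only to illustrate the argument.
final class ScratchFiles {

    // Fragile: every instance on the same host resolves the same path, so
    // concurrent instances overwrite each other's data.
    static Path fixedScratchFile() {
        return Path.of(System.getProperty("java.io.tmpdir"), "myapp-scratch.tmp");
    }

    // Safer: the platform picks a fresh, unique file for each call.
    static Path perInstanceScratchFile() {
        try {
            Path file = Files.createTempFile("myapp-scratch-", ".tmp");
            file.toFile().deleteOnExit();
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("shared by all instances: " + fixedScratchFile());
        System.out.println("unique to this instance: " + perInstanceScratchFile());
    }
}
```

For a saved and restored app the same care applies at a different moment: a
path chosen before the save is the same in every restored copy, so the unique
file would have to be created (or re-created) after restore.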

Points 2, 3, and partly 1 relate to the JDK and JVM implementation, which is
discussed below.

> Unfortunately, the current operation of the JVM and JDK implicitly presumes a high degree of continuity in the process and OS env during execution. It has most definitely not been made 100% resilient to any sudden discontinuity from one point of execution to the next thanks to whatever change might happen in some intervening hiatus caused by a save and subsequent restore.

I completely agree with this.  But it concerns their implementation, which can
be adapted to new requirements, for example by dropping cached values on
checkpoint (this will be possible with CRaC).
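
A minimal sketch of dropping a cached value around a checkpoint is given
here.  The CheckpointListener interface and the CachedValue class are
hypothetical stand-ins written for this message; they are not the actual CRaC
API, and the assumption is only that the runtime calls some hook before the
image is saved and another after it is restored.

```java
import java.util.function.Supplier;

final class CachedValueDemo {
    public static void main(String[] args) {
        CachedValue<Integer> cpus =
            new CachedValue<>(() -> Runtime.getRuntime().availableProcessors());
        System.out.println("before checkpoint: " + cpus.get());
        cpus.beforeCheckpoint();  // the runtime would save the image here
        cpus.afterRestore();      // ...and resume here, perhaps on another host
        System.out.println("after restore: " + cpus.get());
    }
}

// Hypothetical callback interface, NOT the actual CRaC API.
interface CheckpointListener {
    void beforeCheckpoint() throws Exception;
    void afterRestore() throws Exception;
}

// Caches an environment-dependent value and forgets it before a checkpoint,
// so the first read after restore recomputes it in the new environment.
final class CachedValue<T> implements CheckpointListener {
    private final Supplier<T> loader;
    private T value;
    private boolean loaded;

    CachedValue(Supplier<T> loader) {
        this.loader = loader;
    }

    synchronized T get() {
        if (!loaded) {
            value = loader.get();
            loaded = true;
        }
        return value;
    }

    @Override
    public synchronized void beforeCheckpoint() {
        value = null;     // drop the stale value; it must not end up in the image
        loaded = false;
    }

    @Override
    public void afterRestore() {
        // Nothing to do: the value is reloaded lazily on the next get().
    }
}
```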

There is a trade-off between how useful the image is and how general it is.  At
one extreme is a very similar execution environment after restore, with state
that needs only slight fixing.  At the other is a completely different
environment (different OS and CPU) and a much more abstract state that
contains no JIT-compiled code and is thus less useful.  I see the benefit of
the former and of in-between points (same CPU with different flags).  It will
then be useful to have a set of flags to control how general the image is.
But that will be a deployment task and won't require changes to conservatively
written applications.  It's similar to C compiler flags that control the
target microarchitecture.  We cannot expect every setting to be good for every
target.

The current plan is to start with the most useful and optimized approach and
move toward the more general, using common sense about what should be done and
what is required.  Such decisions can always be revisited, and more steps can
be taken in the direction of abstraction.

> We could ignore that problem on the grounds that it is not a goal of the project to consider it and claim, in consequence, that anyone running into problems because of it has to look out for themselves. However, in doing so I believe we would run the risk of significantly lowering the value of whatever the project might come up with.
>
> The alternative is to plan for and spend effort looking through the JDK runtime and JVM code to identify where it needs to be made resilient to save and restore and provide means for any potential disruption that it might cause to be detected and either repaired or, at least, managed with a graceful failure. I believe very firmly that this latter course is the one we should follow.

This sounds very reasonable to me.  It is aligned with the existing checks for
open files and network connections.  Why can't, or shouldn't, this be done in
parallel?  Something runnable and testable, although imperfect, will greatly
help in finding the problems.

The API will be the way to repair the state or to fail gracefully (as it
already supports, by allowing an exception to be thrown).
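
Graceful failure through an exception could look roughly like the sketch
that follows.  All names in it (Coordinator, CheckpointParticipant,
CheckpointRefusedException, ConnectionHolder) are hypothetical and chosen for
illustration; this is not the actual CRaC API, only the general shape of "a
participant throws, the checkpoint is abandoned, the app keeps running".

```java
import java.util.ArrayList;
import java.util.List;

final class RefusalDemo {
    public static void main(String[] args) {
        Coordinator coordinator = new Coordinator();
        ConnectionHolder holder = new ConnectionHolder();
        coordinator.register(holder);
        try {
            coordinator.requestCheckpoint();
        } catch (CheckpointRefusedException e) {
            // The app is still running and can decide what to do next.
            System.out.println("refused: " + e.getSuppressed()[0].getMessage());
        }
        holder.close();
        coordinator.requestCheckpoint();  // accepted now
        System.out.println("checkpoint accepted");
    }
}

// Hypothetical stand-in for a checkpoint callback, NOT the actual CRaC API.
interface CheckpointParticipant {
    void beforeCheckpoint() throws Exception;
}

// Hypothetical exception that gathers every participant's objection.
final class CheckpointRefusedException extends RuntimeException {
    CheckpointRefusedException(String message) {
        super(message);
    }
}

final class Coordinator {
    private final List<CheckpointParticipant> participants = new ArrayList<>();

    void register(CheckpointParticipant participant) {
        participants.add(participant);
    }

    // Returns normally only if every participant accepted the checkpoint.
    void requestCheckpoint() {
        CheckpointRefusedException failure = null;
        for (CheckpointParticipant participant : participants) {
            try {
                participant.beforeCheckpoint();
            } catch (Exception e) {
                if (failure == null) {
                    failure = new CheckpointRefusedException("checkpoint refused");
                }
                failure.addSuppressed(e);
            }
        }
        if (failure != null) {
            throw failure;
        }
        // A real implementation would save the process image at this point.
    }
}

// Refuses the checkpoint while its (simulated) connection is still open.
final class ConnectionHolder implements CheckpointParticipant {
    private boolean open = true;

    void close() {
        open = false;
    }

    @Override
    public void beforeCheckpoint() {
        if (open) {
            throw new IllegalStateException("connection still open");
        }
    }
}
```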

Thanks,
Anton