CFV: New Project: CRaC
adinn at redhat.com
Fri Aug 6 09:49:51 UTC 2021
On 05/08/2021 21:06, Anton Kozlov wrote:
> The checkpoint/restore concept does not require that big changes to
> the Java lang. Two (or more) phases of the Java application
> life-cycle are implicit. The execution of a Java program is the same,
> but it may just pause for a while. It's much different from
> compile-time/run-time separation for static images.
My father had a good friend who frequently relied on the gambit "It's a
well-known fact that ..." to sway an argument. Well, I am afraid that
the above assertion about the limited room for things to go wrong thanks
to a 'simple' save and restore does not self-evidently qualify as a
commonly agreed fact. It might well be true for /some/ apps, especially
if they are saved and then restored on exactly the same host from the
same parent process with little or no intervening change to the
underlying operating system and process environment (and ignoring any
effect the pause might have on timings). However, even in those limited,
albeit rather vaguely defined, circumstances I don't think we can
presume it will be true for all apps.
It's worse than that though. For the primary use case I think you are
talking about -- running services in a container on cloud infrastructure
-- I think it is quite possible that a deployment may fail even to meet
those rather vaguely set out requirements. Might a saved app state not
be resumed many times, possibly concurrently? Might local OS resources
conceivably change between restarts? Might the parent process env
conceivably change between restarts? Indeed, might those effects not
actually occur because of the very fact that the app is being restarted?
Unfortunately, the current operation of the JVM and JDK implicitly
presumes a high degree of continuity in the process and OS env during
execution. It has most definitely not been made 100% resilient to any
sudden discontinuity from one point of execution to the next thanks to
whatever change might happen in some intervening hiatus caused by a save
and subsequent restore.
We could ignore that problem on the grounds that it is not a goal of the
project to consider it and claim, in consequence, that anyone running
into problems because of it has to look out for themselves. However, in
doing so I believe we would run the risk of significantly lowering the
value of whatever the project might come up with.
The alternative is to plan for, and spend effort on, looking through the
JDK runtime and JVM code to identify where it needs to be made resilient
to save and restore, and to provide means for any disruption that a save
and restore might cause to be detected and either repaired or, at least,
managed with a graceful failure. I believe very firmly that this latter
course is the one we should follow.
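
To make "detected and either repaired or, at least, managed with a
graceful failure" a little more concrete, here is a minimal sketch in
plain Java. Every name in it (RestoreAware, CachedEnvValue, restoreAll)
is hypothetical and invented for illustration; none of it is an existing
JDK or project API, and a real design may look quite different. It
assumes that a piece of runtime state caches a value read from the
environment, and that some restore hook gets a chance to re-read it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.function.Supplier;

// Hypothetical sketch: none of these types exist in the JDK.
class RestoreResilienceSketch {

    /** Hypothetical hook for state that may go stale across a save and restore. */
    interface RestoreAware {
        void afterRestore() throws RestoreException;
    }

    /** Signals a detected discontinuity that could not be repaired. */
    static class RestoreException extends Exception {
        RestoreException(String message) {
            super(message);
        }
    }

    /** Caches a value read from the environment and revalidates it on restore. */
    static class CachedEnvValue implements RestoreAware {
        private final String name;
        private final Supplier<String> source; // stands in for a real OS/env lookup
        private final boolean repairable;      // may the cache simply be refreshed?
        private String cached;

        CachedEnvValue(String name, Supplier<String> source, boolean repairable) {
            this.name = name;
            this.source = source;
            this.repairable = repairable;
            this.cached = source.get(); // captured before any save happens
        }

        String get() {
            return cached;
        }

        @Override
        public void afterRestore() throws RestoreException {
            String current = source.get();
            if (Objects.equals(current, cached)) {
                return; // no discontinuity detected
            }
            if (!repairable) {
                // Graceful failure: report the change, keep the old value untouched.
                throw new RestoreException(
                    name + " changed from " + cached + " to " + current);
            }
            cached = current; // repair: refresh the stale value
        }
    }

    /** Runs every hook and collects the failures instead of stopping at the first. */
    static List<String> restoreAll(List<? extends RestoreAware> resources) {
        List<String> failures = new ArrayList<>();
        for (RestoreAware resource : resources) {
            try {
                resource.afterRestore();
            } catch (RestoreException e) {
                failures.add(e.getMessage());
            }
        }
        return failures;
    }
}
```

In this sketch a value such as a cached host name would be marked
repairable and silently refreshed, whereas something that cannot safely
be swapped underneath running code would be reported, so that the caller
can decide whether to abort the restore. Deciding which JDK state falls
into which category is exactly the audit work described above.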