CFV: New Project: CRaC

Fri Aug 6 09:49:51 UTC 2021

On 05/08/2021 21:06, Anton Kozlov wrote:
> The checkpoint/restore concept does not require that big changes to
> the Java lang.  Two (or more) phases of the Java application
> life-cycle are implicit. The execution of a Java program is the same,
> but it may just pause for a while. It's much different from
> compile-time/run-time separation for static images.

My father had a good friend who frequently relied on the gambit "It's a 
well-known fact that ..." to sway an argument. Well, I am afraid that 
the above assertion about the limited room for things to go wrong thanks 
to a 'simple' save and restore does not self-evidently qualify as 
a commonly agreed fact. It might well be true for /some/ apps, especially
if they are saved and then restored on exactly the same host from the 
same parent process with little or no intervening change to the 
underlying operating system and process environment (and ignoring any
effect the pause might have on timings). However, even in those limited, 
albeit rather vaguely defined, circumstances I don't think we can 
presume it will be true for all apps.

It's worse than that though. For the primary use case I think you are 
talking about -- running services in a container on cloud infrastructure 
-- I think it is quite possible that a deployment may fail even to meet 
those rather vaguely set out requirements. Might a saved app state not
be resumed many times, possibly concurrently? Might local OS resources 
conceivably change between restarts? Might the parent process env 
conceivably change between restarts? Indeed, might those effects not 
actually occur because of the very fact that the app is being restarted?

Unfortunately, the current operation of the JVM and JDK implicitly 
presumes a high degree of continuity in the process and OS env during 
execution. It has most definitely not been made 100% resilient to any 
sudden discontinuity from one point of execution to the next thanks to 
whatever change might happen in some intervening hiatus caused by a save 
and subsequent restore.

We could ignore that problem on the grounds that it is not a goal of the 
project to consider it and claim, in consequence, that anyone running 
into problems because of it has to look out for themselves. However, in 
doing so I believe we would run the risk of significantly lowering the 
value of whatever the project might come up with.

The alternative is to plan for and spend effort looking through the JDK 
runtime and JVM code to identify where it needs to be made resilient to 
save and restore and provide means for any potential disruption that it 
might cause to be detected and either repaired or, at least, managed 
with a graceful failure. I believe very firmly that this latter course 
is the one we should follow.

regards,

Andrew Dinn
-----------