CRaC + maven-daemon: An experience report

Wed Jun 8 13:19:46 UTC 2022

> > Snapsafety
>
> I still struggle to understand what is it. Is it a property of the code
> (e.g.  if you use these classes, you are safe w.r.t. checkpoint and
> restore and don't need to coordinate explicitly)? Or is it a property of
> the state (the state can be safely checkpoint and restore -- what is
> safe in this case)?
>

This is exactly the point Ashu's making in the document.  As much as I
think we would all like snapsafety to be a static property of the
source code that static analysis could check, it's unfortunately
more complicated than that.

There's a temporal aspect to it: the point at which the checkpoint is
taken affects the safety of the operation, and it also determines
what would need to be fixed up (much of that is based on
application-specific invariants).
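
To make the timing point concrete, here's a rough sketch in plain Java.
The InstanceId class and its methods are hypothetical names made up for
this example; nothing in it is a CRaC API.  The same object needs no
fix-up if the checkpoint is taken before its first use, and does need
one if the checkpoint is taken afterwards, because every restored copy
would otherwise share the captured value.

```java
import java.util.UUID;

class TemporalSnapsafetyDemo {

    // Hypothetical class for illustration only; nothing here is a CRaC API.
    static final class InstanceId {
        private UUID id; // created lazily on first use

        synchronized UUID get() {
            if (id == null) {
                id = UUID.randomUUID();
            }
            return id;
        }

        // True once the object holds state that every restored copy would share.
        synchronized boolean needsFixupAfterRestore() {
            return id != null;
        }

        // The fix-up itself: drop the captured value so it is regenerated on next use.
        synchronized void resetAfterRestore() {
            id = null;
        }
    }

    public static void main(String[] args) {
        InstanceId instanceId = new InstanceId();

        // A checkpoint taken here captures no id, so nothing needs fixing up.
        System.out.println("before first use: fix-up needed = "
                + instanceId.needsFixupAfterRestore());

        instanceId.get();

        // A checkpoint taken here bakes the id into the image.
        System.out.println("after first use:  fix-up needed = "
                + instanceId.needsFixupAfterRestore());
    }
}
```

Whether a shared id is actually a problem is itself one of those
application-specific invariants: it matters when the image is used as a
base for many instances, and may not matter for a single migration.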

The execution model on restore [0] also affects snapsafety.  As
Ashu says, using the checkpoint to create an initialized base image
has a different concept of "safety" than migrating a computation from
one host to another.  Different pieces of state will need to be
modified in each case and different invariants will hold (or be
broken).

The .NET community took an interesting approach in their "Native AOT"
story for "trimming" applications [1] that may be reusable for
snapsafety - they added warnings for certain operations that are
incompatible with trimming (dead code elimination) and then require
library authors to annotate the methods that generate those warnings.
The annotations bubble up the call chain to the public APIs, and
library consumers can then determine whether to call such APIs or not.
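
As a toy model of that "bubbling up" (not how the .NET tooling is
implemented, just the idea), the set of methods that end up needing an
annotation is everything that directly or transitively calls a flagged
operation.  The call graph, the method names, and the
methodsNeedingAnnotation helper below are all invented for the sketch.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class WarningPropagationDemo {

    // Returns the flagged methods plus every method that directly or
    // transitively calls one of them, computed as a simple fixed point.
    static Set<String> methodsNeedingAnnotation(Map<String, List<String>> calls,
                                                Set<String> flagged) {
        Set<String> result = new TreeSet<>(flagged);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Map.Entry<String, List<String>> entry : calls.entrySet()) {
                boolean callsMarked = entry.getValue().stream().anyMatch(result::contains);
                if (callsMarked && result.add(entry.getKey())) {
                    changed = true;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical call graph: caller -> callees.
        Map<String, List<String>> calls = Map.of(
                "Api.load", List.of("Parser.parse"),
                "Parser.parse", List.of("Reflect.lookup"),
                "Api.version", List.<String>of());

        // Only Reflect.lookup performs the incompatible operation directly.
        Set<String> marked = methodsNeedingAnnotation(calls, Set.of("Reflect.lookup"));
        System.out.println(marked);
    }
}
```

In this model Api.load ends up marked and Api.version does not, which is
the information a library consumer would use to decide what to call.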

Building on this idea, if methods and classes are correctly annotated
(with what annotations?  TBD), it may be possible to do some analysis
when the checkpoint is created to determine whether the current state
is "snapsafe" or not.  This is not so much a property that can be
analyzed statically as one that must be checked when taking the
checkpoint, since it may require walking stacks (currently executing
methods), examining loaded classes, heap walks(?), etc.
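
A very rough sketch of the stack-walking part, using the JDK's
StackWalker API.  The @NotSnapsafe annotation is purely hypothetical (no
such annotation exists in CRaC or the JDK today), and the check only
looks at the current thread and matches methods by name, so overloads
are not told apart.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.List;
import java.util.stream.Collectors;

class SnapsafeCheckDemo {

    // Hypothetical marker; no such annotation exists in CRaC or the JDK.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface NotSnapsafe {
        String value();
    }

    // Walks the current thread's stack and reports frames whose method
    // carries the hypothetical @NotSnapsafe marker.
    static List<String> unsafeFramesOnCurrentThread() {
        StackWalker walker =
                StackWalker.getInstance(StackWalker.Option.RETAIN_CLASS_REFERENCE);
        return walker.walk(frames -> frames
                .filter(SnapsafeCheckDemo::isMarkedUnsafe)
                .map(f -> f.getClassName() + "." + f.getMethodName())
                .collect(Collectors.toList()));
    }

    private static boolean isMarkedUnsafe(StackWalker.StackFrame frame) {
        // Matches by method name only; overloads are not distinguished here.
        for (Method m : frame.getDeclaringClass().getDeclaredMethods()) {
            if (m.getName().equals(frame.getMethodName())
                    && m.isAnnotationPresent(NotSnapsafe.class)) {
                return true;
            }
        }
        return false;
    }

    // Stand-in for "a checkpoint is requested while this method is executing".
    @NotSnapsafe("holds an open socket while running")
    static List<String> checkpointFromInsideUnsafeMethod() {
        return unsafeFramesOnCurrentThread();
    }

    static List<String> checkpointFromSafeMethod() {
        return unsafeFramesOnCurrentThread();
    }

    public static void main(String[] args) {
        System.out.println("safe caller:   " + checkpointFromSafeMethod());
        System.out.println("unsafe caller: " + checkpointFromInsideUnsafeMethod());
    }
}
```

A real check would have to cover every thread, not just the one asking
for the checkpoint, and would need the loaded-class and heap parts as
well; this only shows that the same code gives a different answer
depending on what is executing when the checkpoint is requested.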

I don't have a fully fleshed-out idea here but wanted to float some
early thoughts.  The Leyden project may benefit from some of this
exploration as well, as it will have to tread similar ground.

--Dan

[0] https://danheidinga.github.io/phase-aware-source-code/
[1] https://docs.microsoft.com/en-us/dotnet/core/deploying/trimming/prepare-libraries-for-trimming