CRaC + maven-daemon: An experience report

Fri Jun 17 18:08:58 UTC 2022

On Fri, Jun 10, 2022 at 3:29 PM Anton Kozlov <akozlov at azul.com> wrote:
>
> On 6/8/22 16:19, Dan Heidinga wrote:
> >>> Snapsafety
> >>
> >> I still struggle to understand what it is. Is it a property of the code
> >> (e.g.  if you use these classes, you are safe w.r.t. checkpoint and
> >> restore and don't need to coordinate explicitly)? Or is it a property of
> >> the state (the state can be safely checkpointed and restored -- what is
> >> safe in this case)?
> >>
> >
> > This is exactly the point Ashu's making in the document.  As much as I
> > think we would all like snapsafety to be a static property of the
> > source code so we could analyze it easily with some static analysis,
> > it's unfortunately more complicated than a static property.
> >

I agree. And it also depends on the snapshotting mechanism. E.g. with
CRaC+CRIU there's a lot of code to properly close/reopen files and
handle file descriptors correctly. If we use CRaC with Firecracker
instead, which takes a complete OS snapshot, all of that becomes
unnecessary, because all the files will still be there in exactly the
same state after the restore.
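
To make the CRIU side of that concrete, here is a minimal sketch of the
kind of close/reopen code I mean. The CheckpointAware interface and the
ReopenableLog class are hypothetical stand-ins written for this mail;
they only mirror the general shape of a before/after callback pair and
are not the org.crac API.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for a checkpoint/restore callback pair.
// This is NOT the org.crac API; it only mirrors the general shape.
interface CheckpointAware {
    void beforeCheckpoint() throws Exception;
    void afterRestore() throws Exception;
}

// Appends lines to a file, but holds no open descriptor while checkpointed.
class ReopenableLog implements CheckpointAware {
    private final Path path;
    private RandomAccessFile file;

    ReopenableLog(Path path) throws IOException {
        this.path = path;
        this.file = new RandomAccessFile(path.toFile(), "rw");
    }

    void append(String line) throws IOException {
        file.seek(file.length()); // always write at the current end of file
        file.write((line + "\n").getBytes(StandardCharsets.UTF_8));
    }

    boolean isOpen() {
        return file != null;
    }

    @Override
    public void beforeCheckpoint() throws IOException {
        file.close(); // release the descriptor before the process is dumped
        file = null;
    }

    @Override
    public void afterRestore() throws IOException {
        file = new RandomAccessFile(path.toFile(), "rw"); // reopen by path
    }
}

class ReopenDemo {
    public static void main(String[] args) throws Exception {
        Path p = Files.createTempFile("reopen-demo", ".log");
        ReopenableLog log = new ReopenableLog(p);
        log.append("before checkpoint");
        log.beforeCheckpoint();
        log.afterRestore();
        log.append("after restore");
        System.out.print(new String(Files.readAllBytes(p), StandardCharsets.UTF_8));
    }
}
```

With a full OS snapshot neither callback would have to do anything,
which is exactly why the amount of required code depends on the
mechanism.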

>
> Hence my question :) Let's say that to be a property of the object
> state.  Then a class has a property if all objects of the class have the
> property in all possible states.  I don't see any other way for
> correctness and secureness to be defined for a class, other than
> providing Resource implementation on a per-class basis, and taking care
> of class functionality and internal invariants.  That is, no mark or a
> predicate on the code or the Resource implementation can imply real
> safety -- that will always remain a non-formal property that should be
> aligned with the surrounding context in the class.  For example, adding
> another field and its initialization can change the class from safe to
> unsafe.  Such change is hard to correlate with necessary changes in
> Resource implementation.
>
> > There's a temporal aspect to it - when the checkpoint is taken affects
> > the safety of the operation.  When the snapshot is taken determines
> > what would need to be fixed up (and much of that is based on
> > application specific invariants).
> >
> > The execution model on restore [0] also impacts the snapsafety.  As
> > Ashu says, using the checkpoint to create an initialized base image
> > has a different concept of "safety" than migrating a computation from
> > one host to another.  Different pieces of state will need to be
> > modified in each case and different invariants will hold (or be
> > broken).
>
> Indeed. Is the formal property worth pursuing then?  This was going to
> be a language aid for app developers to annotate safe parts of their
> programs, and for us to annotate parts of JDK.  While we can attempt to
> annotate JDK correctly and fully, we cannot control how the language
> feature will be used by users.  And for them, a better annotation
> mechanism or a programming model (like reactive programming) may exist.
> How about letting users decide how and when to annotate their programs,
> and concentrate on JDK needs and how JDK is used by applications first,
> as we understand these better?  For example, what's missing in the JDK
> so the app won't need changing at all?  And what parts of the app are
> absolutely necessary to change?  Is it possible for JDK to provide a set
> of utilities to ease those changes?
>
> > The .NET community took an interesting approach in their "Native AOT"
> > story for "trimming" applications [1] that may be reusable for
> > snapsafety - they added warnings for certain operations that are
> > incompatible with trimming (dead code elimination) and then require
> > library authors to annotate methods that do generate the warnings.
> > The annotations bubble up the call chain to the public apis and then
> > library consumers can determine whether to call such apis or not.
> >
> > Building on this idea, if methods and classes are correctly annotated
> > (with what annotations?  tbd) it may be possible to do some analysis
> > when the checkpoint is created to determine whether the current state
> > is "snapsafe" or not.  This is not so much a static property that can
> > be statically analyzed, but one that must be checked when taking the
> > checkpoint as it may require walking stacks (currently executing
> > methods), examining loaded classes, heap walks(?), etc.
>
> Now it's possible to create a runtime check for the object state safety,
> that is to create Resource's beforeCheckpoint. An unsafe object may
> always throw an Exception. Won't this be even more flexible? This
> relates a lot to Snapsafety of core library classes [1], I'll reply
> there.
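
As I read it, the runtime check described here could look roughly like
the sketch below: an object inspects its own state in its
before-checkpoint callback and vetoes the checkpoint by throwing. All
names (CheckpointGuard, UnsafeStateException, TokenCache) are made up
for illustration and are not part of the CRaC API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the "before checkpoint" half of a callback.
// This is NOT the org.crac API.
interface CheckpointGuard {
    void beforeCheckpoint() throws Exception;
}

// Hypothetical exception type used to veto a checkpoint.
class UnsafeStateException extends Exception {
    UnsafeStateException(String message) {
        super(message);
    }
}

// Holds short-lived tokens that must never end up in a snapshot image.
class TokenCache implements CheckpointGuard {
    private final Map<String, String> tokens = new HashMap<>();

    void put(String user, String token) {
        tokens.put(user, token);
    }

    void clear() {
        tokens.clear();
    }

    @Override
    public void beforeCheckpoint() throws UnsafeStateException {
        // The check looks at the current object state, not at the code:
        // the same class is safe when empty and unsafe when populated.
        if (!tokens.isEmpty()) {
            throw new UnsafeStateException(
                tokens.size() + " live token(s) would be captured");
        }
    }
}

class GuardDemo {
    public static void main(String[] args) throws Exception {
        TokenCache cache = new TokenCache();
        cache.put("alice", "secret");
        try {
            cache.beforeCheckpoint();
        } catch (UnsafeStateException e) {
            System.out.println("checkpoint vetoed: " + e.getMessage());
        }
        cache.clear();
        cache.beforeCheckpoint(); // passes once the state is safe
    }
}
```

The check is only as good as the invariants the class author encodes
in it, so adding a field without updating the check silently weakens
it again.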

In the context of my comment above on the different checkpoint
mechanisms (i.e. CRIU vs. Firecracker), I was already thinking about
annotating the CRaC callbacks such that they are only called if
necessary, based on the snapshotting mechanism. The question is whether
it will be possible at all to come up with a fixed, predefined set of
such abstract "snapsafety annotations", or whether there are just too
many different use cases and contexts.
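
Just to illustrate the idea, here is a rough sketch of what such
mechanism-dependent callbacks might look like. Everything in it is
hypothetical: the Mechanism enum, the @NeededFor annotation, the
Callback interface and the SelectiveContext class do not exist in CRaC,
and the two-way split into PROCESS and FULL_VM is an assumption that is
almost certainly too coarse for real use.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical: two broad classes of snapshot mechanism
// (process-level like CRIU, whole-VM like Firecracker).
enum Mechanism { PROCESS, FULL_VM }

// Hypothetical annotation: the mechanisms for which a callback is needed.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface NeededFor {
    Mechanism[] value();
}

// Hypothetical, simplified callback; NOT the org.crac Resource interface.
interface Callback {
    void beforeCheckpoint();
}

@NeededFor(Mechanism.PROCESS)
class FileHandleCallback implements Callback {
    public void beforeCheckpoint() { /* would close file descriptors */ }
}

@NeededFor({Mechanism.PROCESS, Mechanism.FULL_VM})
class ClockCallback implements Callback {
    public void beforeCheckpoint() { /* would record the wall-clock time */ }
}

class SelectiveContext {
    private final List<Callback> callbacks = new ArrayList<>();

    void register(Callback c) {
        callbacks.add(c);
    }

    // Runs only the callbacks relevant to the given mechanism and
    // returns the simple class names of those that were invoked.
    List<String> checkpoint(Mechanism mechanism) {
        List<String> invoked = new ArrayList<>();
        for (Callback c : callbacks) {
            NeededFor needed = c.getClass().getAnnotation(NeededFor.class);
            // Unannotated callbacks always run: the conservative default.
            if (needed == null
                    || Arrays.asList(needed.value()).contains(mechanism)) {
                c.beforeCheckpoint();
                invoked.add(c.getClass().getSimpleName());
            }
        }
        return invoked;
    }
}

class SelectiveDemo {
    public static void main(String[] args) {
        SelectiveContext ctx = new SelectiveContext();
        ctx.register(new FileHandleCallback());
        ctx.register(new ClockCallback());
        System.out.println(ctx.checkpoint(Mechanism.FULL_VM));
        System.out.println(ctx.checkpoint(Mechanism.PROCESS));
    }
}
```

Whether a small fixed enum like this can cover the real variety of
mechanisms and use cases is precisely the open question.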

>
> Thanks for bringing more context,
> -- Anton
>
> [1] https://mail.openjdk.java.net/pipermail/crac-dev/2022-May/000222.html
>