CRaC + maven-daemon: An experience report

Wed Jun 29 13:22:38 UTC 2022

On Fri, Jun 17, 2022 at 2:09 PM Volker Simonis <volker.simonis at gmail.com> wrote:
>
> On Fri, Jun 10, 2022 at 3:29 PM Anton Kozlov <akozlov at azul.com> wrote:
> >
> > On 6/8/22 16:19, Dan Heidinga wrote:
> > >>> Snapsafety
> > >>
> > >> I still struggle to understand what it is. Is it a property of the code
> > >> (e.g.  if you use these classes, you are safe w.r.t. checkpoint and
> > >> restore and don't need to coordinate explicitly)? Or is it a property of
> > >> the state (the state can be safely checkpointed and restored -- what is
> > >> safe in this case)?
> > >>
> > >
> > > This is exactly the point Ashu's making in the document.  As much as I
> > > think we would all like snapsafety to be a static property of the
> > > source code so we could analyze it easily with some static analysis,
> > > it's unfortunately more complicated than a static property.
> > >
>
> I agree. And it also depends on the snapshotting mechanism. E.g. with
> CRaC+CRIU there's a lot of code to properly close/reopen files and
> handle file descriptors correctly. If we are using CRaC with
> Firecracker instead which takes a complete OS snapshot, that all
> becomes unnecessary, because all the files will still be there with
> the exact same state after the restore.

With the OpenJ9 CRIU approach, we've found that by targeting
containers, we don't need to force all files to be closed/reopened as
CRIU handles it well.  Only files that are mounted into the container
need special handling and that's basically user-configuration.  Not
forcing files to be opened / closed sidesteps a whole host of
problems, but makes it much harder to run on bare metal and makes the
system more dependent on the checkpointing mechanism.  So tradeoffs.
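
A minimal sketch of the close/reopen pattern discussed above, for the
case where the checkpointing mechanism does require it. The
CheckpointResource interface is a hypothetical stand-in for CRaC's
Resource callbacks (the real callbacks also receive a Context argument),
and ReopeningLog and ReopeningLogDemo are invented names used only for
illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Hypothetical stand-in for CRaC's Resource callbacks (assumption: the
// real ones also take a Context argument and may throw checked exceptions).
interface CheckpointResource {
    void beforeCheckpoint();
    void afterRestore();
}

// A log file that gives up its file descriptor before a checkpoint and
// reopens the same path in append mode after a restore.
final class ReopeningLog implements CheckpointResource {
    private final Path path;
    private FileChannel channel;

    ReopeningLog(Path path) {
        this.path = path;
        open();
    }

    private void open() {
        try {
            channel = FileChannel.open(path, StandardOpenOption.CREATE,
                    StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void write(String line) {
        ByteBuffer buf = ByteBuffer.wrap((line + "\n").getBytes(StandardCharsets.UTF_8));
        try {
            while (buf.hasRemaining()) {
                channel.write(buf);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    boolean isOpen() {
        return channel.isOpen();
    }

    @Override
    public void beforeCheckpoint() {
        try {
            channel.close(); // no open descriptor remains in the snapshot
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void afterRestore() {
        open(); // reacquire the descriptor in the restored process
    }
}

final class ReopeningLogDemo {
    // Simulates one checkpoint/restore cycle and returns the file contents.
    static List<String> run() {
        try {
            Path path = Files.createTempFile("crac-sketch", ".log");
            ReopeningLog log = new ReopeningLog(path);
            log.write("before checkpoint");
            log.beforeCheckpoint(); // the snapshot would be taken here
            log.afterRestore();
            log.write("after restore");
            log.beforeCheckpoint(); // close again so the file can be read and deleted
            List<String> lines = Files.readAllLines(path);
            Files.delete(path);
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

With a mechanism that preserves open files across the snapshot, both
callbacks in this sketch could be empty, which is the tradeoff described
above.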

I feel a bit like a broken record saying this (and maybe that's just
due to repeating it internally for so long =), but I think the
programming model is critical here.  If we find a better way to
express relationships between dependencies and "phases", we will end
up with programs that are both more amenable to checkpoint/restore,
and also to being pre-initialized (a la Leyden).  This may make it
harder to retrofit existing programs but will provide a more stable and,
most importantly, predictable base to build new programs on.

>
> >
> > Hence my question :) Let's say that to be a property of the object
> > state.  Then a class has a property if all objects of the class have the
> > property in all possible states.  I don't see any other way for
> > correctness and secureness to be defined for a class, other than
> > providing Resource implementation on a per-class basis, and taking care
> > of class functionality and internal invariants.  That is, no mark or a
> > predicate on the code or the Resource implementation can imply real
> > safety -- that will always remain a non-formal property that should be
> > aligned with the surrounding context in the class.  For example, adding
> > another field and its initialization can change the class from safe to
> > unsafe.  Such change is hard to correlate with necessary changes in
> > Resource implementation.
> >
> > > There's a temporal aspect to it - when the checkpoint is taken affects
> > > the safety of the operation.  When the snapshot is taken determines
> > > what would need to be fixed up (and much of that is based on
> > > application specific invariants).
> > >
> > > The execution model on restore [0] also impacts the snapsafety.  As
> > > Ashu says, using the checkpoint to create an initialized base image
> > > has a different concept of "safety" than migrating a computation from
> > > one host to another.  Different pieces of state will need to be
> > > modified in each case and different invariants will hold (or be
> > > broken).
> >
> > Indeed. Is the formal property worth pursuing then?  This was going to
> > be a language aid for app developers to annotate safe parts of their
> > programs, and for us to annotate parts of JDK.  While we can attempt to
> > annotate JDK correctly and fully, we cannot control how the language
> > feature will be used by users.  And for them, a better annotation
> > mechanism or a programming model (like reactive programming) may exist.
> > How about letting users decide how and when to annotate their programs,
> > and concentrate on JDK needs and how JDK is used by applications first,
> > as we understand these better?  For example, what's missing in the JDK
> > so the app won't need changing at all?  And what parts of the app are
> > absolutely necessary to change?  Is it possible for JDK to provide a set
> > of utilities to ease those changes?
> >
> > > The .NET community took an interesting approach in their "Native AOT"
> > > story for "trimming" applications [1] that may be reusable for
> > > snapsafety - they added warnings for certain operations that are
> > > incompatible with trimming (dead code elimination) and then require
> > > library authors to annotate methods that do generate the warnings.
> > > The annotations bubble up the call chain to the public apis and then
> > > library consumers can determine whether to call such apis or not.
> > >
> > > Building on this idea, if methods and classes are correctly annotated
> > > (with what annotations?  tbd) it may be possible to do some analysis
> > > when the checkpoint is created to determine whether the current state
> > > is "snapsafe" or not.  This is not so much a static property that can
> > > be statically analyzed, but one that must be checked when taking the
> > > checkpoint as it may require walking stacks (currently executing
> > > methods), examining loaded classes, heap walks(?), etc.
> >
> > Now it's possible to create a runtime check for the object state safety,
> > that is to create Resource's beforeCheckpoint. An unsafe object may
> > always throw an Exception. Won't this be even more flexible? This
> > relates a lot to Snapsafety of core library classes [1], I'll reply
> > there.
>
> In the context of my above comment on the different checkpoint
> mechanisms (i.e. CRIU vs. Firecracker) I was already thinking about
> annotating the CRaC callbacks such that they will only be called if
> necessary, based on the snapshotting mechanism. The question is whether it
> will be possible at all to come up with a fixed, predefined set of such
> abstract "snapsafety annotations" or whether there are just too many
> different use cases and contexts?
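
The runtime check suggested in the quoted text, where an object in an
unsafe state throws from its beforeCheckpoint callback, might look
roughly like the sketch below. CheckpointResource, ConnectionPool and
CheckpointCoordinator are hypothetical names, not part of CRaC; the
coordinator only imitates a context that collects the failures its
resources report.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for CRaC's Resource callbacks (assumption: the
// real interface also passes a Context argument).
interface CheckpointResource {
    void beforeCheckpoint() throws Exception;
    void afterRestore() throws Exception;
}

// Safety here is a property of the current state, not of the class:
// the pool is only safe to checkpoint while no connection is in use.
final class ConnectionPool implements CheckpointResource {
    private int activeConnections;

    void acquire() {
        activeConnections++;
    }

    void release() {
        activeConnections--;
    }

    @Override
    public void beforeCheckpoint() {
        if (activeConnections > 0) {
            throw new IllegalStateException(
                    activeConnections + " connection(s) still in use");
        }
    }

    @Override
    public void afterRestore() {
        // nothing to fix up in this sketch
    }
}

// Hypothetical coordinator: asks every resource whether its state is safe.
final class CheckpointCoordinator {
    private final List<CheckpointResource> resources = new ArrayList<>();

    void register(CheckpointResource resource) {
        resources.add(resource);
    }

    // Returns the failures reported by resources; an empty list means
    // every resource judged its current state safe to checkpoint.
    List<Exception> tryCheckpoint() {
        List<Exception> failures = new ArrayList<>();
        for (CheckpointResource resource : resources) {
            try {
                resource.beforeCheckpoint();
            } catch (Exception e) {
                failures.add(e);
            }
        }
        return failures;
    }
}
```

The same pool object is rejected while a connection is held and accepted
once it is released, which illustrates the temporal aspect raised
earlier in the thread.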

The more points we can identify as needing fixups, the clearer a
picture we'll have of the landscape.  The CRaC callbacks provide one
set of use cases and contexts.  GraalVM's SubstrateVM Substitution
mechanism provides another view of the places that need to be fixed
up.  OpenJ9's J9InternalCheckpointHookAPI::register{PreCheckpoint/PostRestore}Hook
APIs are another data point.

It's still early days for identifying the places that need to be fixed
up and it will require trying to run (more) applications with CRaC to
find the long tail of required fixups.  If annotating the existing
CRaC callbacks helps to skip some fixups and test a broader set of
applications - I'm all for it!

What kind of annotations were you thinking?  Maybe start with
Firecracker-specific ones that we can generalize from?
@FirecrackerSkip?
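
As a rough sketch of how such an annotation could be consumed: nothing
below exists in CRaC today. The @FirecrackerSkip annotation, the
Mechanism enum, the CheckpointResource interface and
MechanismAwareContext are all hypothetical, and the sketch assumes the
runtime knows which snapshotting mechanism is in use.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.List;

// Hypothetical: the snapshotting mechanism the runtime is using.
enum Mechanism { CRIU, FIRECRACKER }

// Hypothetical marker: callbacks of this resource are unnecessary when
// the whole OS state is snapshotted.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface FirecrackerSkip {}

// Hypothetical stand-in for CRaC's Resource callbacks.
interface CheckpointResource {
    void beforeCheckpoint();
    void afterRestore();
}

// File handling is only needed when files do not survive the snapshot.
@FirecrackerSkip
final class FileHandleResource implements CheckpointResource {
    int calls;

    @Override
    public void beforeCheckpoint() { calls++; } // would close files here

    @Override
    public void afterRestore() { calls++; }     // would reopen files here
}

// Not annotated: assumed to need a fixup under every mechanism.
final class AlwaysFixedUpResource implements CheckpointResource {
    int calls;

    @Override
    public void beforeCheckpoint() { calls++; }

    @Override
    public void afterRestore() { calls++; }
}

// Hypothetical context that filters callbacks by mechanism.
final class MechanismAwareContext {
    private final Mechanism mechanism;
    private final List<CheckpointResource> resources = new ArrayList<>();

    MechanismAwareContext(Mechanism mechanism) {
        this.mechanism = mechanism;
    }

    void register(CheckpointResource resource) {
        resources.add(resource);
    }

    private boolean shouldRun(CheckpointResource resource) {
        boolean skippable = resource.getClass().isAnnotationPresent(FirecrackerSkip.class);
        return !(mechanism == Mechanism.FIRECRACKER && skippable);
    }

    void beforeCheckpoint() {
        for (CheckpointResource resource : resources) {
            if (shouldRun(resource)) resource.beforeCheckpoint();
        }
    }

    void afterRestore() {
        for (CheckpointResource resource : resources) {
            if (shouldRun(resource)) resource.afterRestore();
        }
    }
}
```

A mechanism-specific marker like this is easy to check but does not
generalize by itself; a more abstract set of annotations would need the
broader survey of fixup points mentioned above.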

--Dan

>
> >
> > Thanks for bringing more context,
> > -- Anton
> >
> > [1] https://mail.openjdk.java.net/pipermail/crac-dev/2022-May/000222.html
> >
>