Snapsafety of core library classes

Fri May 20 13:38:18 UTC 2022

(CCing a couple of the OpenJ9 developers involved in the CRIU-efforts
for their awareness as well)

On Thu, May 19, 2022 at 8:09 AM Volker Simonis <volker.simonis at gmail.com> wrote:
>
> Hi,
>
> I wonder if anybody has thought about how snapsafety for the core
> library classes should be implemented in CRaC? By "snapsafety" I mean
> correct and secure operation after restoring a JVM process which was
> previously checkpointed and possibly cloned.

This is currently being developed on an ad-hoc basis in CRaC.  Look
for classes that implement the jdk.crac.Resource interface and the
actions they take in the ::afterRestore / ::beforeCheckpoint methods
to see how each class has addressed its own "snapsafety".
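
As a rough illustration of that pattern (a sketch, not code from the CRaC repository): the SnapResource interface below is a hypothetical stand-in for jdk.crac.Resource so the example compiles on a stock JDK, and SessionKeyHolder is an invented class. The real interface is assumed to pass a Context argument to both callbacks, which is omitted here.

```java
import java.security.SecureRandom;
import java.util.Arrays;

// Hypothetical stand-in for jdk.crac.Resource (the Context parameter is omitted).
interface SnapResource {
    void beforeCheckpoint() throws Exception;
    void afterRestore() throws Exception;
}

// Invented example class: holds secret material that must not end up in a snapshot.
final class SessionKeyHolder implements SnapResource {
    private byte[] key = freshKey();

    private static byte[] freshKey() {
        byte[] k = new byte[16];
        new SecureRandom().nextBytes(k);
        return k;
    }

    // Returns a copy so callers cannot keep a reference to the internal array.
    byte[] key() {
        if (key == null) {
            throw new IllegalStateException("checkpoint in progress");
        }
        return key.clone();
    }

    @Override
    public void beforeCheckpoint() {
        Arrays.fill(key, (byte) 0); // scrub the secret before the image is written
        key = null;
    }

    @Override
    public void afterRestore() {
        key = freshKey(); // every restored clone generates its own key
    }
}
```

Copies handed out before the checkpoint are not scrubbed, which is one reason the per-class review discussed below is still needed.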

To your point, I think we're still exploring and determining the cases
that are snapsafe (or not).  We can look at the classes GraalVM has
patched with Substitutions as a starting set of classes that will need
adaptation to be snapsafe, though the full set will be larger.

> The first question is about deciding which classes can be considered
> snapsafe? Naively any class whose objects hold some state will be
> affected by snapshotting and cloning. For simple classes like String
> or Integer we know that their objects are constant and cloning them
> doesn't do any harm. Objects of other classes might however contain
> more sensitive state like caches, unique identifiers, certificates,
> encryption keys etc. which shouldn't be cloned or which become invalid
> after restore.

Agreed.  Though each class will need to be individually examined to
ensure that the changes to make it snapsafe don't break the invariants
of the class. Looking just at caches as an example, it may seem safe
to clean out the cache before a checkpoint but doing so may break
invariants about canonicalization of values as those looked up prior
to the checkpoint may be different (not ==) to those looked up after
restore.
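
The cache hazard can be shown with a small sketch. SymbolTable and its clearForCheckpoint method are hypothetical names invented for this example; they stand in for any canonicalizing cache whose callers compare results with ==.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical canonicalizing cache: equal names map to one shared instance.
final class SymbolTable {
    static final class Symbol {
        final String name;
        Symbol(String name) { this.name = name; }
    }

    private final Map<String, Symbol> cache = new HashMap<>();

    Symbol intern(String name) {
        return cache.computeIfAbsent(name, Symbol::new);
    }

    // What a naive beforeCheckpoint hook might do.
    void clearForCheckpoint() {
        cache.clear();
    }
}

class CanonicalizationDemo {
    public static void main(String[] args) {
        SymbolTable table = new SymbolTable();
        SymbolTable.Symbol held = table.intern("foo");   // reference kept by the application
        System.out.println(held == table.intern("foo")); // true: same canonical instance
        table.clearForCheckpoint();
        System.out.println(held == table.intern("foo")); // false: identity invariant broken
    }
}
```

Code that still holds the pre-checkpoint Symbol now disagrees with code that looks the name up again, even though nothing about the name changed.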

> By looking at the current CRaC repository [1] I can see that some
> classes (e.g. sun.security.provider.SecureRandom or
> sun.security.provider.NativePRNG.RandomIO) directly implement
> j.i.c.JDKResource in order to make them snapsafe. But all the classes
> which do so, are non-public. This means that snapsafety is currently a
> "hidden", implicit feature of some classes in the core library (i.e.
> if I create a new j.s.SecureRandom object, I can not know if it will
> be snapsafe or not).
>
> Do we want to make snapsafety an undocumented, implicit feature or do
> we want to explicitly call it out in the JavaDoc, e.g. by forcing
> classes which want to be snapsafe to implement javax.crac.Resource
> (similar to implementing Serializable)?

Bringing snapsafety into the language makes sense.  Implementing
Resource is probably overkill for most classes as their safety is an
emergent property of their fields' snapsafety.  Can we reverse this to
tag "snap-unsafe" classes and have javac warn / error when compiling a
class with snap-unsafe fields unless they implement Resource?
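
To make the idea concrete, here is a sketch of the rule such a check might enforce. Everything in it is hypothetical: no @SnapUnsafe annotation exists in the JDK or CRaC, SnapAware is a stand-in for Resource, and the check runs reflectively at runtime only because that keeps the example short, whereas the proposal above is for a javac diagnostic.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

// Hypothetical marker for types whose state is not safe to snapshot.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface SnapUnsafe {}

// Hypothetical stand-in for jdk.crac.Resource.
interface SnapAware {}

@SnapUnsafe
final class ConnectionHandle {}

// Has a snap-unsafe field and does nothing about it: should be flagged.
final class UncheckedClient { ConnectionHandle conn; int retries; }

// Has a snap-unsafe field but opts in to handling it: should pass.
final class CheckedClient implements SnapAware { ConnectionHandle conn; }

final class SnapSafetyLint {
    // Lists fields whose declared type is @SnapUnsafe, unless the owning
    // class takes responsibility by implementing SnapAware.
    static List<String> violations(Class<?> owner) {
        List<String> out = new ArrayList<>();
        if (SnapAware.class.isAssignableFrom(owner)) {
            return out;
        }
        for (Field f : owner.getDeclaredFields()) {
            if (f.getType().isAnnotationPresent(SnapUnsafe.class)) {
                out.add(owner.getSimpleName() + "." + f.getName());
            }
        }
        return out;
    }
}
```

This only inspects declared field types of instances; it says nothing about static state, subclasses, or generic containers, which a real design would have to address.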

Does the concept of snapsafety need to differentiate between the
static state of the class and its instances?

>
> I think both approaches have their pros and cons. If we make
> snapsafety an explicit feature, we tell users that the corresponding
> classes will behave correctly on snapshot and restore events. But what
> about all the other classes in the core libraries. Are they all
> snapsafe or snapunsafe by default?
>
> If we make snapsafety an implicit feature it would become an
> "implementation detail". This means we could have JDKs which are
> snapsafe while others are not. It also means we could make older JDK
> versions snapsafe which would not be possible with the explicit model
> because it is impossible to retrofit classes in older releases to
> implement new interfaces.

I'd prefer to make it explicit in the programming model to avoid the
"sins of serialization".  Brian wrote a document titled "Towards
Better Serialization" [A] in which he outlines the issues with
serialization, including:
* "Pretends to be a library feature, but isn't",
* "Pretends to be a statically typed feature, but isn't", and
* "Magic methods and fields".

We should be thinking about snapsafety in the context of serialization
(as that's effectively what a snapshot is) and any solution we propose
should be clear on how it avoids the sins of serialization.

>
> @Dan: I remember you've mentioned that you've experimented with CRIU
> in OpenJ9 as well. I'd be specifically interested in the core
> library changes you had to do in order to make the JDK snapsafe. I
> took a look at the OpenJ9 snapshot branch [2], but couldn't find any
> library changes there at all? Could you please share more details on
> this topic if possible?

The snapshot branch (now inactive) was our experiment to do
snapshot/restore directly in the JVM.  OpenJ9's since switched to
working on CRIU checkpoint/restore as it allows solving a smaller
problem (the libraries, basically) first.  The code for this is in the
master branch under feature flags.

The J9 approach to CRIU has a slightly different model than that used
in CRaC.  Its model is to treat lifecycle hooks (equivalent of
jdk.crac.Resource) in basically three layers:
* application level hooks,
* class library hooks, and
* JVM hooks.

The VM enters a single threaded mode to execute the class library
hooks and the JVM hooks.  This avoids some difficult interactions
between updating things and having both the updated and original
values being consumed at the same time at the cost of some (potential)
deadlock concerns.  Similarly, we use a single threaded mode on
restore as well.

The two major class library level hooks we've added so far address
environment variables [B] and security providers [C].

For the env vars, we only allow setting new env vars at restore.  This
honours the spirit of the JVM's existing approach of caching the env
vars on first access, and prevents code from seeing both old and new
values for the same variable.
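
One reading of that rule, sketched as a pure function: RestoreEnv and merge are hypothetical names for illustration and are not taken from CRIUSupport [B]. Variables already cached at checkpoint keep their value, and only previously unseen names are added.

```java
import java.util.HashMap;
import java.util.Map;

final class RestoreEnv {
    // Merges restore-time variables into the view cached at checkpoint.
    // Existing keys keep their checkpoint-time value; only new keys are added.
    static Map<String, String> merge(Map<String, String> cached,
                                     Map<String, String> atRestore) {
        Map<String, String> merged = new HashMap<>(cached);
        atRestore.forEach(merged::putIfAbsent);
        return merged;
    }
}
```

For example, merging a cached {HOME=/a} with a restore-time {HOME=/b, TOKEN=x} yields HOME=/a and TOKEN=x.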

For security providers, J9 installs a minimal provider prior to the
checkpoint to avoid caching sensitive state in the checkpoint.  At
restore, it removes the minimal provider and installs the full set of
real providers.

For the JVM hooks, J9 uses a heap walk to apply per-object fixups.
This is how j.u.Random is reseeded [D].
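
A heap walk needs VM support, so the sketch below swaps in a different technique: instances are registered explicitly through weak references and reseeded from a restore hook. ReseedRegistry is a hypothetical class, and this is not how the code in [D] works; it only illustrates the per-object fixup idea at the Java level.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical registry standing in for the VM's heap walk.
final class ReseedRegistry {
    private final List<WeakReference<Random>> tracked = new ArrayList<>();

    // Registers an instance without keeping it alive.
    synchronized Random track(Random r) {
        tracked.add(new WeakReference<>(r));
        return r;
    }

    // Intended to run after restore; returns how many live instances were reseeded.
    synchronized int reseedAll() {
        int count = 0;
        for (Iterator<WeakReference<Random>> it = tracked.iterator(); it.hasNext();) {
            Random r = it.next().get();
            if (r == null) {
                it.remove(); // instance was garbage collected
                continue;
            }
            // Illustrative seed only; not suitable for anything security sensitive.
            r.setSeed(System.nanoTime() ^ System.identityHashCode(r));
            count++;
        }
        return count;
    }
}
```

The heap walk has the advantage that no instance can be missed, whereas a registry only covers objects that were registered.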

We've also been looking at Timers to determine how to adapt them to
account for the time lapse between the checkpoint and restore [E].
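
One possible policy, sketched under assumptions: shift each pending deadline by the time spent checkpointed so that the remaining delay is preserved. DeadlineFixup is a hypothetical helper, not what [E] settles on, and it assumes both timestamps come from a clock that is comparable across the checkpoint.

```java
final class DeadlineFixup {
    // All values are nanoseconds on the same clock.
    // Returns the deadline moved forward by the time spent checkpointed,
    // so the delay that remained at checkpoint still remains at restore.
    static long shiftedDeadline(long deadline, long checkpointTime, long restoreTime) {
        long downtime = Math.max(0L, restoreTime - checkpointTime);
        return deadline + downtime;
    }
}
```

A timer due at 1500 with a checkpoint at 1000 and a restore at 9000 would be moved to 9500, keeping its 500 ns of remaining delay. Timers tied to wall-clock time would need the opposite policy, firing immediately if their time has already passed.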

The use cases we've been looking at are primarily containerized
applications.  The level of snapsafety needed when the container has
limited distribution is probably less than needed when the container
(or checkpoint) will be broadly distributed.  We need to look at
snapsafety as a layered approach that depends in part on the
deployment model.  The more widely a checkpoint image is shared, the
more care is needed to redact/fixup the info included in the image.

Sorry for the slightly rambling response, but there are lots of
different angles we can look at snapsafety through.

--Dan

[A] https://openjdk.java.net/projects/amber/design-notes/towards-better-serialization
[B] https://github.com/eclipse-openj9/openj9/blob/45e4b0bd91018ffd35c5e2d72dd27632a84af5d2/jcl/src/openj9.criu/share/classes/org/eclipse/openj9/criu/CRIUSupport.java#L489
[C] https://github.com/eclipse-openj9/openj9/blob/45e4b0bd91018ffd35c5e2d72dd27632a84af5d2/jcl/src/openj9.criu/share/classes/org/eclipse/openj9/criu/SecurityProviders.java#L28-L39
[D] https://github.com/eclipse-openj9/openj9/blob/cc586f03bd5359157f99fb342015f24f4e064755/runtime/vm/CRIUHelpers.cpp#L160-L169
[E] https://github.com/eclipse-openj9/openj9/issues/14211#issuecomment-1117937739

>
> What are your thoughts on this issue?
>
> Best regards,
> Volker
>
> [1]  https://github.com/openjdk/crac/compare/crac?expand=1#diff-b7061481
> [2] https://github.com/eclipse-openj9/openj9/compare/snapshot#diff-54ac925d
>