Snapsafety of core library classes

Fri Jun 10 13:51:05 UTC 2022

On 5/20/22 16:38, Dan Heidinga wrote:
> On Thu, May 19, 2022 at 8:09 AM Volker Simonis <volker.simonis at gmail.com> wrote:
>> I wonder if anybody has thought about how snapsafety for the core
>> library classes should be implemented in CRaC? By "snapsafety" I mean
>> correct and secure operation after restoring a JVM process which was
>> previously checkpointed and possibly cloned.

Apparently two major problems (likely overlapping) are mentioned
here.  Does the JVM state make sense, and is it safe, after it has been
restored as another instance in a possibly different environment?  And
the same question for cloned state.  Do we care only about these two
situations?  I'm trying to understand the set of conditions such that,
once something is proved secure and correct in each of them, it can be
considered secure and correct universally.

Still, safety and correctness are properties that are hard to
formalize. They mean different things for different classes. And a
single class may be correct or not depending on the context in which it
is used.

> This is currently being developed on an ad-hoc basis in CRaC.  Look
> for classes that implement the jdk.crac.Resource interface and the
> actions they take in the ::afterRestore / ::beforeCheckpoint methods
> to see how each class has addressed its own "snapsafety".
> 
> To your point, I think we're still exploring and determining the cases
> that are snapsafe (or not).  We can look at the classes GraalVM has
> patched with Substitutions as a starting set of classes that will need
> adaptation to be snapsafe.  That will help identify a starting set but
> the full set will be larger.

A good starting point, but I think Substitutions contain more than we
need, like providing Java equivalents for otherwise unanalyzable native
code.

>> The first question is about deciding which classes can be considered
>> snapsafe? Naively any class whose objects hold some state will be
>> affected by snapshotting and cloning. For simple classes like String
>> or Integer we know that their objects are constant and cloning them
>> doesn't do any harm. Objects of other classes might however contain
>> more sensitive state like caches, unique identifiers, certificates,
>> encryption keys etc. which shouldn't be cloned or which become invalid
>> after restore.

Interesting, these examples are related to things outside of the JVM.
Initially, I considered these to be the source of most of the
problems. Good to know about the j.l.Random example, which is
problematic if cloned. It will be interesting to find others that would
uncover more situations that we'll need to consider.
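
To make the j.l.Random concern concrete, here is a minimal sketch. It
does not use CRaC at all: cloning is only simulated by constructing two
generators from the same seed, which is my assumption about what two
restored copies of one checkpointed Random would look like. The class
and method names are made up for illustration.

```java
import java.util.Random;

class ClonedRandomDemo {
    // Models two clones restored from one checkpoint: both generators
    // start from identical internal state (here: the same seed).
    static boolean clonesProduceSameSequence(long seed, int count) {
        Random original = new Random(seed);
        Random clone = new Random(seed);
        for (int i = 0; i < count; i++) {
            if (original.nextInt() != clone.nextInt()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Both "instances" emit the same values, which is harmless for a
        // simulation but not if the values are used as unique identifiers.
        System.out.println(clonesProduceSameSequence(42L, 5)); // prints: true
    }
}
```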

> Agreed.  Though each class will need to be individually examined to
> ensure that the changes to make it snapsafe don't break the invariants
> of the class. 

This looks the same to me.

> Looking just at caches as an example, it may seem safe
> to clean out the cache before a checkpoint but doing so may break
> invariants about canonicalization of values as those looked up prior
> to the checkpoint may be different (not ==) to those looked up after
> restore.

Isn't this just breaking the sequential logic of the program? If an
operation cannot be triggered at an arbitrary moment without breaking
application invariants, I suppose this is just not a good implementation
of the Resource. This should probably be Rule 0: don't break
yourself :).
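
A small sketch of the canonicalization problem, in plain Java with no
CRaC involved. The cache, its key and the clearBeforeCheckpoint method
are hypothetical; the point is only that clearing such a cache at an
arbitrary moment breaks the == invariant regardless of who triggers it.

```java
import java.util.HashMap;
import java.util.Map;

class CanonicalCacheDemo {
    // Hypothetical canonicalizing cache: one instance per key.
    private static final Map<String, Object> CACHE = new HashMap<>();

    static Object canonical(String key) {
        return CACHE.computeIfAbsent(key, k -> new Object());
    }

    // What a naive beforeCheckpoint handler might do.
    static void clearBeforeCheckpoint() {
        CACHE.clear();
    }

    // Looks a key up, clears the cache, looks it up again, and reports
    // whether the two results are still the same object.
    static boolean identityPreservedAcrossClear() {
        Object before = canonical("k");
        clearBeforeCheckpoint();
        Object after = canonical("k");
        return before == after;
    }

    public static void main(String[] args) {
        System.out.println(identityPreservedAcrossClear()); // prints: false
    }
}
```

A handler that keeps the invariant would have to leave entries that may
still be referenced in place, or not clear the cache at all.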

>> By looking at the current CRaC repository [1] I can see that some
>> classes (e.g. sun.security.provider.SecureRandom or
>> sun.security.provider.NativePRNG.RandomIO) directly implement
>> j.i.c.JDKResource in order to make them snapsafe. But all the classes
>> which do so, are non-public. This means that snapsafety is currently a
>> "hidden", implicit feature of some classes in the core library (i.e.
>> if I create a new j.s.SecureRandom object, I can not know if it will
>> be snapsafe or not).
>>
>> Do we want to make snapsafety an undocumented, implicit feature or do
>> we want to explicitly call it out in the JavaDoc, e.g. by forcing
>> classes which want to be snapsafe to implement javax.crac.Resource
>> (similar to implementing Serializable)?

I think this should be text in the javadoc describing what happens on
checkpoint and restore.  Thus, we'll be able to specify the behavior in
terms of the original code.  By reading the text, users will be able to
decide whether their programs are correct and safe or whether they need
to do something app-specific.  And we'll be able to document
intentionally omitted handling, e.g. why j.l.Random is not reinitialized.

> Bringing snapsafety into the language makes sense.  Implementing
> Resource is probably overkill for most classes as their safety is an
> emergent property of the field's snap safety.  Can we reverse this to
> tag "snap-unsafe" classes and have javac warn / error when compiling a
> class with snap-unsafe fields unless they implement Resource?
>
> Does the concept of snapsafety need to differentiate between the
> static state of the class and its instances?
> 

After some thought, I think that a formal checkable property may do
harm. By introducing one, we'll create two different programming models,
regular Java and "snapsafe Java", splitting the language without need.

With marking safe classes and treating classes as unsafe by default (an
option from below), we'll make a lot of valid states non-checkpointable.
We'll likely enlarge the set of safe classes over time, making our new
programming model a moving target.

If classes are safe by default, what is the reason to mark anything?

With marking unsafe classes instead, and classes being safe by default,
wouldn't it be better to fix the unsafe ones?  Or to unconditionally
throw in beforeCheckpoint, which is already supported: only an existing
live object that is incompatible with checkpoint will prevent the
checkpoint, not the fact that the dangerous code was referenced in the
past.
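
A rough sketch of the "throw in beforeCheckpoint" idea. To keep it
runnable on a plain JDK, the Resource interface below is a simplified
local stand-in, not the real CRaC one (the real methods also take a
Context argument), and tryCheckpoint is a toy replacement for the actual
notification machinery. OpenHandle is a hypothetical class.

```java
import java.util.List;

class VetoCheckpointDemo {
    // Simplified stand-in for the CRaC Resource interface (assumption:
    // the Context parameter of the real methods is omitted here).
    interface Resource {
        void beforeCheckpoint() throws Exception;
        void afterRestore() throws Exception;
    }

    // Hypothetical object that cannot survive a checkpoint while open.
    static class OpenHandle implements Resource {
        private boolean open = true;

        void close() {
            open = false;
        }

        @Override
        public void beforeCheckpoint() throws Exception {
            if (open) {
                throw new IllegalStateException("handle still open");
            }
        }

        @Override
        public void afterRestore() {
            // Nothing to restore in this sketch.
        }
    }

    // Toy notification loop: the checkpoint proceeds only if no
    // registered resource throws from beforeCheckpoint.
    static boolean tryCheckpoint(List<Resource> resources) {
        for (Resource r : resources) {
            try {
                r.beforeCheckpoint();
            } catch (Exception e) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        OpenHandle handle = new OpenHandle();
        System.out.println(tryCheckpoint(List.of(handle))); // prints: false
        handle.close();
        System.out.println(tryCheckpoint(List.of(handle))); // prints: true
    }
}
```

The veto here depends on the object's state at checkpoint time, not on
whether the class was ever used before.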

>> I think both approaches have their pros and cons. If we make
>> snapsafety an explicit feature, we tell users that the corresponding
>> classes will behave correctly on snapshot and restore events. But what
>> about all the other classes in the core libraries. Are they all
>> snapsafe or snapunsafe by default?
>>
>> If we make snapsafety an implicit feature it would become an
>> "implementation detail". This means we could have JDKs which are
>> snapsafe while others are not. It also means we could make older JDK
>> versions snapsafe which would not be possible with the explicit model
>> because it is impossible to retrofit classes in older releases to
>> implement new interfaces.

I don't see the cons of this :) As for the claimed advantage of the
explicit model (with the implicit one, users don't know whether they are
safe on checkpoint): we cannot tell in advance that a class satisfies
all possible user needs and expectations anyway.  When the behavior is
unclear from the doc, that is a bug in the JDK that needs fixing by
providing the doc and, optionally, actions for checkpoint and restore.

> I'd prefer to make it explicit in the programming model to avoid the
> "sins of serialization".  Brian wrote a document titled "Towards
> Better Serialization" [A]

(Thanks for the great link!)

> where it outlines the issues with
> serialization, including:
> * "Pretends to be a library feature, but isn't",

"Serialization pretends to be a library feature — you opt in by
implementing the Serializable interface, and serialize with
ObjectOutputStream. In reality, though, serialization extracts object
state and recreates objects via privileged, extralinguistic mechanisms,
bypassing constructors and ignoring class and field accessibility."

What part of CRaC exhibits a similar problem? I think the major
problem for Serialization here is that encapsulation is broken. Our
notification API is implemented in pure Java, without reflection, etc.

> * "Pretends to be a statically typed feature, but isn't", and

"Serializability is a function of an object’s dynamic type, not its
static type; implements Serializable doesn’t actually mean that
instances are serializable, just that they are not overtly
serialization-hostile. So, despite the requirement to opt-in via the
static type system, doing so gives you little confidence that your
instances are actually serializable."

Not sure what is meant by dynamic type, but if we introduce a mark for
the code, we'll get the same low level of confidence that the safety
was implemented correctly.

> * "Magic methods and fields".

"There are a number of “magic” methods and fields (in the sense that
they are not specified by any base class or interface) that affect the
behavior of serialization. ... Because these do not exist in any public
type, they’re hard to discover, and one cannot easily navigate to their
specification. They are also easy to accidentally get wrong; if you
spell them wrong, or get the signature wrong, or make them static
members when they should be instance members, no one tells you."

I think in Resource we don't have this problem.

Thanks,
Anton