Proposal for documentation and snapsafety

Radim Vansa rvansa at azul.com
Mon Mar 20 10:53:05 UTC 2023


On 15. 03. 23 16:36, Anton Kozlov wrote:
> On 2/27/23 11:01, Radim Vansa wrote:
>> While all JDK code can eventually be fixed in a similar way to
>> SecureRandom, I think that it's clear that not everything can be
>> encapsulated. Good examples are the environment variables, but also
>> the number of processors and many others [2].
>
> It looks like the problem here is not technical correctness, but
> rather the user experience, right?  I.e. by providing good
> documentation (javadoc == Java SE API), we can provide a good level
> of specification of what to expect on checkpoint and on restore.  But
> adhering to that specification is complicated for users, as it is
> still a hard task to find uses of a method whose semantics have
> changed in CRaC.  Is this understanding of the problem correct?


Yes, though I would like to stress that these usages might occur in
legacy/third-party code the user has little knowledge about.


>
>> I propose to tag any method/constructor that returns data that could
>> be expected to stay constant in a non-C/R app but often changes after
>> restore, or an object that will need handling through a registered
>> Resource, with a @CracSensitive (3) annotation.  We will provide a
>> tool that will report places that could call these methods, unless
>> they are marked with a @CracSafe annotation.  This tool could work in
>> a static way (scanning a set of JARs, probably with a thin Maven
>> plugin as well?) and as a javaagent, scanning classes as these get
>> loaded.
>>
>> Naturally, not all code invoking non-snap-safe methods is user code;
>> many cases come from libraries.  An alternative way to allow-list
>> places calling @CracSensitive methods that cannot be changed directly
>> would be provided, though eventually we aim at encouraging libraries
>> to adopt @CracSafe internally.
>
> While technically this is possible, there are a few drawbacks IMHO.
> First, the tool and annotations are interdependent, although the
> dependency of the annotations on the tool is implicit.  But anyway,
> annotations do not make any sense without the tool checking them.  So
> either the tool and annotations should somehow be completely external
> to the JDK, or both of them should be in the JDK.


I agree; I assumed that the core of the tool would live in the JDK,
though for practical reasons there would be external integrations
(e.g. a Maven plugin) outside the JDK repo.

Based on our offline discussion I agree that listing the CRaC-sensitive
methods externally would work. I was considering piggy-backing on the
free-form contents of the @SuppressWarnings annotation (making sure
that "crac" or a similar string is accepted) to implement this rather
than standalone annotations, but that annotation has source-level
retention, so it can't be used if the tool is supposed to analyze
usages in dependencies, too.
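For illustration, here is a rough sketch of what standalone annotations
could look like (the names are taken from the proposal above; where
they would live is an open question). The important bit is the CLASS
retention, which - unlike the SOURCE retention of @SuppressWarnings -
keeps the marker in the compiled class files that a JAR-scanning tool
or a javaagent would actually see:

    import java.lang.annotation.ElementType;
    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.annotation.Target;

    // Sketch: marks methods whose results may become stale after restore.
    @Retention(RetentionPolicy.CLASS) // survives into the class file
    @Target({ElementType.METHOD, ElementType.CONSTRUCTOR})
    @interface CracSensitive {
    }

    // Sketch: marks reviewed call sites so the tool does not report them.
    @Retention(RetentionPolicy.CLASS)
    @Target({ElementType.METHOD, ElementType.CONSTRUCTOR, ElementType.TYPE})
    @interface CracSafe {
        String reason() default ""; // why this usage is considered snap-safe
    }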


>
> But I'm not sure the tool is the best approach.  That does not take
> advantage of being able to track exact calls of the annotated methods
> before the checkpoint and after restore.  For example, querying the
> number of processors is fine if it happens after the restore.  So the
> tool would somehow need to distinguish calls of annotated methods
> before the checkpoint (where previously returned results may become
> obsolete) from those after restore; otherwise there will be some
> number of false positives, and those false positives would require
> some way to silence them after consideration.  Also, even before the
> checkpoint, having a call in the code does not mean that it will
> actually be called, e.g. because of some specific configuration that
> disables detection of the number of processors.  So it seems that
> without pretty complex static dataflow analysis we'll have another
> source of false positives.


I never intended to perform complex analysis in the tooling; that would
turn into a can of worms. 'False positives' is a rather misleading term
here, as the tool is not supposed to provide definitive guidance, but
let's not spend time on nomenclature. Silencing the possible problems
is an integral part of the proposal; that's what @CracSafe would be
used for.
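
As a hypothetical example (assuming the annotations sketched earlier),
a reviewed call site might be silenced like this:

    // Runtime.availableProcessors() would presumably be @CracSensitive,
    // but here the value is re-read on every call rather than cached,
    // so the author marks the usage as reviewed and safe.
    @CracSafe(reason = "queried on every task submission, never cached")
    static int currentParallelism() {
        return Runtime.getRuntime().availableProcessors();
    }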

This line of thought led me to question whether the API (e.g. the Core
class) could expose a method to find the C/R generation (e.g. 0 =
before first checkpoint, 1 = after restore, 2 = after restore from a
second checkpoint once we support that in the implementation...) to be
able to write assertions based on that (e.g. that a method should not
be called before the checkpoint). It's very simple to implement a
resource tracking this on your own, as sketched below, but a standard
way might be better.
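
A minimal sketch of such a hand-rolled tracker, using the existing
org.crac Resource API (the CracGeneration helper itself is
hypothetical, not a proposed API):

    import org.crac.Context;
    import org.crac.Core;
    import org.crac.Resource;

    // Hypothetical helper: counts how many restores the process went
    // through.  0 = before the first checkpoint, 1 = after the first
    // restore, and so on.
    public class CracGeneration implements Resource {
        private static final CracGeneration INSTANCE = new CracGeneration();
        private volatile int generation = 0;

        static {
            // Keep a strong reference; the context does not prevent
            // registered resources from being garbage-collected.
            Core.getGlobalContext().register(INSTANCE);
        }

        @Override
        public void beforeCheckpoint(Context<? extends Resource> context) {
            // nothing to do before the checkpoint
        }

        @Override
        public void afterRestore(Context<? extends Resource> context) {
            generation++;
        }

        public static int generation() {
            return INSTANCE.generation;
        }
    }

An assertion such as 'assert CracGeneration.generation() > 0;' could
then guard code that must only run after a restore.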


>
> Have you considered taking advantage of actually running the program?
> E.g. recording stack traces for method calls and reporting them on
> the checkpoint, like in PR #43 [1].  Compared to the separate tool,
> the call recording reports only calls that have happened, and only
> before the checkpoint.  The stack trace provides some information
> about how the result will be used (although not complete info on how
> the result of the method is going to be used).  The implementation
> will probably be very simple, and by some convention we can agree on
> a way to exclude some stack traces from reporting, e.g. by having a
> specific stack trace element.
>
> Does the tool have an advantage over the recording of method calls
> and stack traces?


I consider being able to run the analysis ahead of time a big
advantage; this is probably a personal preference and depends on how
much the checkpoint-restore scenario would be tested in practice - I
expect a rather optimistic approach during adoption that will skip
many cases that could happen in practice. A runtime-checking system is
definitely possible, too, and the two solutions could complement each
other, leaving the choice up to the user. It would be useful to share
the 'declarative part' (marking problematic methods and verified
usages), though. Checking at runtime would give you more details and
fewer 'false positives', but also no guarantees for code paths that
are not executed.

Any runtime checker implementation should make sure, though, that
there is no impact in production. You're right that a javaagent
checking loaded code could be too much of a middle ground; in this
mode we could instrument the call sites to go through validating
proxies, as sketched below.
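
A rough sketch of what such a validating proxy could look like
(CracValidator and the rewriting scheme are hypothetical; a real
checker would rewrite the call sites with bytecode instrumentation
rather than by hand):

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;
    import java.util.function.Supplier;

    import org.crac.Context;
    import org.crac.Core;
    import org.crac.Resource;

    // Hypothetical: instrumentation would rewrite a call site such as
    //   Runtime.getRuntime().availableProcessors()
    // into
    //   CracValidator.call("Runtime.availableProcessors",
    //           () -> Runtime.getRuntime().availableProcessors());
    public final class CracValidator implements Resource {
        private static final CracValidator INSTANCE = new CracValidator();
        private static final List<Throwable> RECORDED =
                new CopyOnWriteArrayList<>();

        static {
            Core.getGlobalContext().register(INSTANCE);
        }

        public static <T> T call(String method, Supplier<T> delegate) {
            // Record the call site now, but report only at checkpoint
            // time, so that calls happening after restore are not flagged.
            RECORDED.add(new Exception("@CracSensitive call: " + method));
            return delegate.get();
        }

        @Override
        public void beforeCheckpoint(Context<? extends Resource> context) {
            RECORDED.forEach(Throwable::printStackTrace);
        }

        @Override
        public void afterRestore(Context<? extends Resource> context) {
            RECORDED.clear(); // start afresh for a possible next checkpoint
        }
    }

When the agent is not attached the call sites stay untouched, so
production runs would pay nothing for the checks.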


Radim



>
> [1] https://github.com/openjdk/crac/pull/43
>
> Thanks,
> Anton

