Portability of checkpoints?

Thu Oct 21 13:46:28 UTC 2021

On 10/20/21 17:00, Dan Heidinga wrote:
>> This is a good use case that we'd like to support.  But what java class level
>> changes would we need in the context of the different CPU?
>
> ForkJoinPool is the canonical example here - the common pool, used by
> parallel streams, is initially sized based on the number of available
> processors.  See code in [7] which is called by the static
> initializer.  This can change when snapshotting on one machine and
> restoring on another.
>
> All uses of `Runtime::availableProcessors` will need to be evaluated -
> of which there are several in j.u.c and in the nio ThreadPool - to see
> if they also need to be adapted.  There are sure to be other apis that
> need to be similarly investigated - Unsafe.pageSize() is one that
> comes to mind.  MxBeans may be another.

Oh, got it.  Completely agree.  I initially considered these as part of the
bigger problem, since nothing prevents users from implementing their own
ForkJoin-like pool.  So the assumptions around base methods such as
j.l.Runtime.availableProcessors need examination, as do their callers in the
JDK, and there may be user code that relies on these methods in the same way.

Interestingly, availableProcessors has been allowed to return different values
over the lifetime of the JVM for a long time already [1].  It seems
ForkJoinPool just needs to cooperate with CRaC; the API itself is fine here.
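
To make the "cooperation" idea concrete, here is a minimal sketch of a pool
that re-reads the processor count after a restore instead of keeping the value
seen at initialization.  RestoreListener and ResizablePool are hypothetical
names; the interface is only a stand-in for a checkpoint/restore callback and
is not the actual CRaC API, and this is not how ForkJoinPool is implemented.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.IntSupplier;

// Hypothetical stand-in for a checkpoint/restore callback; NOT the real CRaC API.
interface RestoreListener {
    void afterRestore();
}

// A pool whose size follows the processor count observed at restore time
// rather than the value captured when the pool was created.
final class ResizablePool implements RestoreListener {
    private final IntSupplier cpuCount;
    private final ThreadPoolExecutor executor;

    ResizablePool(IntSupplier cpuCount) {
        this.cpuCount = cpuCount;
        int n = Math.max(1, cpuCount.getAsInt());
        this.executor = new ThreadPoolExecutor(
                n, n, 60L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
    }

    // Real code would typically size from the running machine.
    static ResizablePool forThisMachine() {
        return new ResizablePool(Runtime.getRuntime()::availableProcessors);
    }

    @Override
    public void afterRestore() {
        int n = Math.max(1, cpuCount.getAsInt());
        if (n > executor.getMaximumPoolSize()) {
            // Growing: raise the maximum first so core never exceeds it.
            executor.setMaximumPoolSize(n);
            executor.setCorePoolSize(n);
        } else {
            // Shrinking or unchanged: lower core first, then the maximum.
            executor.setCorePoolSize(n);
            executor.setMaximumPoolSize(n);
        }
    }

    int size() {
        return executor.getCorePoolSize();
    }

    void shutdown() {
        executor.shutdown();
    }
}
```

The processor count is injected as an IntSupplier only so that a restore on a
different machine can be simulated in a test; whoever drives the restore would
be responsible for calling afterRestore().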

> As per John's note [1], I've hosted the content on crojn at [8] to
> avoid pasting a chunk of markdown to the list.

Great, thanks!  There is a set of things for which users will need to fix their
code.  An example is j.u.Locale.getDefault [2], for which we may need to update
the specification.  Another set of things can hopefully be fixed without
changes to the exposed API, at most requiring a CSR [3], like the target CPU
architecture requiring a set of new HotSpot flags.  We are lucky that
availableProcessors is already allowed to return different values, although I
assume not much code actually expects this.
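
As an illustration of the kind of user code that would need fixing, the sketch
below contrasts a default locale cached in a static field with one read at the
point of use.  LocaleSensitive is a hypothetical class.  Whether and how a
restored JVM would change the value returned by Locale.getDefault is exactly
the open specification question, so this only shows why a cached copy cannot
follow any such change.

```java
import java.text.NumberFormat;
import java.util.Locale;

final class LocaleSensitive {
    // Captured once at class initialization; a checkpoint taken after this
    // point would carry the captured value into every restored instance.
    private static final Locale CACHED = Locale.getDefault();

    // Formats with the locale that was the default when the class was loaded.
    static String formatCached(double value) {
        return NumberFormat.getNumberInstance(CACHED).format(value);
    }

    // Formats with whatever the default locale is at the time of the call.
    static String formatFresh(double value) {
        return NumberFormat.getNumberInstance(Locale.getDefault()).format(value);
    }
}
```

If the default locale differs after restore, formatCached keeps using the old
one, while formatFresh picks up the new one.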

We could probably try to translate this list into a list of API sites that need
to be thought through.

> Depending on the checkpoint/restore mechanism, this may also require
> that files (ie: logs) haven't changed between the checkpoint & the
> restore.  CRaC stops the checkpoint if there are open file handles but
> that's not a strict requirement of the underlying checkpoint mechanism
> (ie: CRIU is able to restore them).  There's a fine line between
> "things that cause restore failures" and "things that prevent
> portability".  I may be falling into the first category here.

To me it sounds like portability is about the implementation, while restore
failures are about semantics and the Java API.  Would it be correct to keep
thinking of it that way?

> No portability seems very similar to the use cases for Java daemons
> such as Nailgun [9].  Useful for some small set of cases but less
> applicable?

Great link.  A slightly crazy and not yet mature idea: wouldn't applications
running under Nailgun benefit from CRaC and be able to reinitialize?  Nailgun
could be another checkpoint/restore engine.

>> -1. No portability, single restore.  This can be implemented by sending Unix
>> signals SIGSTOP/CONT, is it useful for testing?..
>
> I don't see either of these two levels as particularly useful.  (Would
> love to hear contrary opinions though)

I'm tempted to think of level -1 as the simplest non-CRIU-based implementation,
a way to try another mechanism.  It may also show some time-based effects of
checkpoint/restore.  It is not something of real-world use.
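
One such time-based effect can be sketched as follows: a deadline computed
before the pause is found expired right after resume, although the application
did no work in between.  Lease is a hypothetical class and the clock is
injected so the pause can be simulated.  The sketch assumes the monotonic clock
keeps advancing while the process is stopped, which I would expect with
SIGSTOP/SIGCONT but which may differ for other checkpoint mechanisms.

```java
import java.util.function.LongSupplier;

// A time-limited lease measured against a monotonic nanosecond clock.
final class Lease {
    private final LongSupplier nanoClock;
    private final long expiresAtNanos;

    Lease(LongSupplier nanoClock, long durationNanos) {
        this.nanoClock = nanoClock;
        this.expiresAtNanos = nanoClock.getAsLong() + durationNanos;
    }

    // Real code would typically pass System::nanoTime as the clock.
    static Lease ofSeconds(long seconds) {
        return new Lease(System::nanoTime, seconds * 1_000_000_000L);
    }

    // Subtraction-based comparison, as recommended for nanoTime values,
    // so that numerical overflow of the clock does not break the check.
    boolean isValid() {
        return nanoClock.getAsLong() - expiresAtNanos < 0;
    }
}
```

Code that wants to survive a pause would need to renew or re-validate such
leases after resume rather than assume they are still good.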

Thanks,
Anton

[1] https://docs.oracle.com/javase/7/docs/api/java/lang/Runtime.html#availableProcessors()
[2] https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/Locale.html#getDefault()
[3] https://wiki.openjdk.java.net/display/csr/CSR+FAQs