Portability of checkpoints?
Dan Heidinga
heidinga at redhat.com
Wed Oct 20 14:00:10 UTC 2021
> This is a good use case that we'd like to support. But what Java class-level
> changes would we need to handle a different CPU?
ForkJoinPool is the canonical example here - the common pool, used by
parallel streams, is initially sized based on the number of available
processors. See the code at [7], which is called from the static
initializer. That processor count can differ when snapshotting on one
machine and restoring on another.
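
To make the concern concrete, here is a minimal sketch (not the actual
ForkJoinPool code) of a value derived from the processor count once and then
cached. The class `PoolSizing`, its `refresh` method, the "CPU count minus
one, at least 1" sizing rule, and the injected `IntSupplier` (standing in for
`Runtime.getRuntime().availableProcessors()`) are all assumptions made for
illustration.

```java
import java.util.function.IntSupplier;

/** Hypothetical holder that caches a pool size derived from the CPU count. */
final class PoolSizing {
    private final IntSupplier cpuCount; // stands in for Runtime::availableProcessors
    private int parallelism;            // computed once, as a static initializer would

    PoolSizing(IntSupplier cpuCount) {
        this.cpuCount = cpuCount;
        this.parallelism = compute();
    }

    // Assumed sizing rule for this sketch: one less than the CPU count, at least 1.
    private int compute() {
        return Math.max(1, cpuCount.getAsInt() - 1);
    }

    int parallelism() {
        return parallelism;
    }

    /** What a restore hook would have to call to pick up the new CPU count. */
    void refresh() {
        parallelism = compute();
    }

    public static void main(String[] args) {
        int[] cpus = {16};                                  // "checkpoint" machine
        PoolSizing sizing = new PoolSizing(() -> cpus[0]);
        cpus[0] = 2;                                        // "restore" machine is smaller
        System.out.println("stale: " + sizing.parallelism());
        sizing.refresh();
        System.out.println("refreshed: " + sizing.parallelism());
    }
}
```

Until `refresh()` runs, the holder keeps reporting the size computed for the
16-CPU machine; any cached value derived from the processor count would need
an equivalent step on restore.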
All uses of `Runtime::availableProcessors` (there are several in j.u.c
and in the nio ThreadPool) will need to be evaluated to see if they
also need to be adapted. There are sure to be other APIs that need to
be similarly investigated - Unsafe.pageSize() is one that comes to
mind. MXBeans may be another.
> > As an example, in OpenJ9 we added a command-line option to tell the JIT to
> > generate more conservative code ... Does HotSpot have similar options already
> > or do we need to pursue adding them as part of this project?
>
> There are no such options now, and it would be great to have them as a first
> step. AFAICS, some kind of framework for CPU flags exists for aarch64
> and x86 [3] and is used e.g. in [4]. But before implementation, it is worth
> asking for a review of a plan on the hotspot-dev mailing list [5].
Thanks for the links. Looks like I have some reading to do to figure
out what's already available and where it might need to evolve to.
> > The discussion in [1] covers some of the background on determining default
> > processor features and [2] is a list of differences between
> > creation/restore environments that will need to be addressed for
> > portability.
>
> This is valuable info. Could you copy such a list here [1]? The list [2] is
> very valuable in the context of CRaC and all preliminary discussions about the
> project.
As per John's note [1], I've hosted the content on cr.openjdk.java.net
at [8] to avoid pasting a chunk of markdown to the list.
> Leaving aside topics that require changes to the Java API, it looks like there
> are different levels of portability.
>
> 0. No portability. Restore works only on the same machine: same CPU, same
> operating system, and probably no OS restart since the checkpoint. But it may
> still be useful for similar Java programs like a scaling microservice, or
> javac [6].
Depending on the checkpoint/restore mechanism, this may also require
that files (e.g. logs) haven't changed between the checkpoint and the
restore. CRaC stops the checkpoint if there are open file handles, but
that's not a strict requirement of the underlying checkpoint mechanism
(i.e. CRIU is able to restore them). There's a fine line between
"things that cause restore failures" and "things that prevent
portability". I may be falling into the first category here.
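
As a rough illustration of the "files haven't changed" requirement, the
sketch below records a file's size just before a checkpoint and re-checks it
afterwards. `FileStamp` is a hypothetical helper, not part of CRaC or CRIU,
and comparing sizes is an assumption chosen for brevity: it misses
same-length rewrites, so a real check would likely also use the modification
time or a content hash.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: remembers a file's size at checkpoint, re-checks at restore. */
final class FileStamp {
    private final Path path;
    private final long size;

    private FileStamp(Path path, long size) {
        this.path = path;
        this.size = size;
    }

    /** Record the file's current size; intended to run just before the checkpoint. */
    static FileStamp capture(Path path) throws IOException {
        return new FileStamp(path, Files.size(path));
    }

    /** True if the file is missing or its size differs from the recorded one. */
    boolean changedSinceCapture() throws IOException {
        return !Files.exists(path) || Files.size(path) != size;
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("app", ".log");
        Files.write(log, "before checkpoint\n".getBytes(StandardCharsets.UTF_8));
        FileStamp stamp = FileStamp.capture(log);
        // Simulate another process appending to the log while we are "frozen".
        Files.write(log, "written elsewhere\n".getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.APPEND);
        System.out.println("changed: " + stamp.changedSinceCapture());
        Files.delete(log);
    }
}
```

Whether a detected change should fail the restore or merely be reported is
exactly the "restore failure" versus "portability" question raised above.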
No portability seems very similar to the use cases for Java daemons
such as Nailgun [9]. Useful for a small set of cases but less broadly
applicable?
> 1. Between machines with the same operating system distribution. The CPU
> feature set is a good example of this. Also, available memory resources can
> change between checkpoint and restore. We'll likely need to change the JVM to
> handle the difference. Here we have containers -- it's interesting that even
> when starting on the same physical machine (same CPU), a container instance
> used for the checkpoint and a container for the restore may have different
> hard memory limits.
>
> 2. Between different distributions of the same operating system e.g. GNU/Linux.
> For checkpoint/restore implemented on top of CRIU it will be a problem since it
> stores the complete process memory. It captures the internal layout and the
> state of the system libraries such as libc, which may change between
> distributions. This level corresponds to the portability of a regular Java
> build.
The "sweet spot" for checkpoint/restore may be in containers, as they
constrain the environment and so reduce the set of things to deal with.
Though, as you point out above, even that's not a perfect solution, as
limits (memory / cpu / etc) can still change for container
deployments.
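
One way to reason about changing limits is to record them at checkpoint time
and diff them at restore. The sketch below does that for CPU count and
maximum heap. `EnvLimits` and its methods are hypothetical names, and
`current()` only reports the JVM's own view: whether `Runtime` reflects new
container limits after a restore depends on the JVM, so treat it as a
placeholder for whatever source of truth is used.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical record of resource limits as seen at one point in time. */
final class EnvLimits {
    final int cpus;
    final long maxHeapBytes;

    EnvLimits(int cpus, long maxHeapBytes) {
        this.cpus = cpus;
        this.maxHeapBytes = maxHeapBytes;
    }

    /** The JVM's current view; an assumption, it may be stale after a restore. */
    static EnvLimits current() {
        Runtime rt = Runtime.getRuntime();
        return new EnvLimits(rt.availableProcessors(), rt.maxMemory());
    }

    /** Describes every limit that differs from the values recorded at checkpoint. */
    List<String> differencesFrom(EnvLimits atCheckpoint) {
        List<String> diffs = new ArrayList<>();
        if (cpus != atCheckpoint.cpus) {
            diffs.add("cpus: " + atCheckpoint.cpus + " -> " + cpus);
        }
        if (maxHeapBytes != atCheckpoint.maxHeapBytes) {
            diffs.add("maxHeapBytes: " + atCheckpoint.maxHeapBytes + " -> " + maxHeapBytes);
        }
        return diffs;
    }

    public static void main(String[] args) {
        EnvLimits atCheckpoint = new EnvLimits(8, 4L << 30); // recorded earlier
        EnvLimits atRestore = new EnvLimits(2, 1L << 30);    // smaller container
        atRestore.differencesFrom(atCheckpoint).forEach(System.out::println);
    }
}
```

An empty list means the restored environment matches on these two
dimensions; anything else is a candidate for the adaptation discussed above.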
>
> There could be more levels, like:
>
> 3. Portability between different operating systems, e.g. Linux and Windows.
> It is unlikely to be practical to implement, and we would be unable to
> transfer any JNI code.
>
> -1. No portability, single restore. This can be implemented by sending the Unix
> signals SIGSTOP/SIGCONT. Is it useful for testing?
I don't see either of these two levels as particularly useful. (Would
love to hear contrary opinions though)
--Dan
[7] https://github.com/openjdk/jdk/blob/895e2bd7c0bded5283eca8792fbfb287bb75016b/src/java.base/share/classes/java/util/concurrent/ForkJoinPool.java#L2564
[8] http://cr.openjdk.java.net/~heidinga/crac/snapshot_env_differences.md
[9] http://martiansoftware.com/nailgun/background.html
>
> Thanks,
> Anton
>
> [1] https://mail.openjdk.java.net/pipermail/hotspot-dev/2021-October/055049.html
> [2] https://github.com/eclipse-openj9/openj9/issues/12484
> [3] https://github.com/openjdk/crac/blob/master/src/hotspot/cpu/x86/vm_version_x86.hpp#L305
> [4] https://github.com/openjdk/crac/blob/master/src/hotspot/share/jvmci/vmStructs_jvmci.cpp#L760
> [5] https://mail.openjdk.java.net/pipermail/hotspot-dev/
> [6] https://mail.openjdk.java.net/pipermail/discuss/2021-February/005714.html
>