Portability of checkpoints?

Tue Oct 19 14:10:14 UTC 2021

On 10/15/21 22:52, Dan Heidinga wrote:
> How portable should CRaC checkpoints be?

The ability to run the image in a different environment makes CRaC useful.
So I think the answer is: as portable as is practical.

> The implication of this approach is that a checkpoint created on one
> machine may not be valid on another due to changes in the target
> architecture in addition to changes in the environment.  It would be good
> if we could surface a list of the things that will need to be changed in
> the jvm and in the class libraries to address this.

This is a good use case that we'd like to support.  But what Java class
library changes would we need to handle a different CPU?

> As an example, in OpenJ9 we added a commandline option to tell the jit to
> generate more conservative code ... Does Hotspot have similar options already
> or do we need to pursue adding them as part of this project?

There are no such options now, and it would be great to have them as a first
step toward portability.  AFAICS, some kind of framework for CPU flags is
present for aarch64 and x86 [3] and is used e.g. in [4].  But before
implementation, it is worth asking for a review of the plan on the hotspot-dev
mailing list [5].
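
To make the CPU-flags idea concrete, here is a minimal sketch of the rule I have
in mind: a restore is safe only if the target CPU offers every feature the JIT
was allowed to use before the checkpoint.  It is written in Java for
readability; a real check would live in HotSpot's C++ code.  The class, the
method names, and the feature list are hypothetical and are not taken from
HotSpot.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch, not HotSpot code. Feature names are illustrative only.
final class CpuFeatureCheck {
    enum Feature { SSE2, SSE4_2, AVX, AVX2, AVX512F }

    // A checkpoint can be restored only if the restore CPU offers every
    // feature that generated code may rely on.
    static boolean canRestore(Set<Feature> usedAtCheckpoint,
                              Set<Feature> availableAtRestore) {
        return availableAtRestore.containsAll(usedAtCheckpoint);
    }

    // Features that the checkpoint relies on but the restore CPU lacks.
    static Set<Feature> missing(Set<Feature> usedAtCheckpoint,
                                Set<Feature> availableAtRestore) {
        EnumSet<Feature> result = EnumSet.noneOf(Feature.class);
        result.addAll(usedAtCheckpoint);
        result.removeAll(availableAtRestore);
        return result;
    }

    public static void main(String[] args) {
        Set<Feature> used = EnumSet.of(Feature.SSE2, Feature.AVX2);
        Set<Feature> target = EnumSet.of(Feature.SSE2, Feature.SSE4_2, Feature.AVX);
        System.out.println("can restore: " + canRestore(used, target));
        System.out.println("missing:     " + missing(used, target));
    }
}
```

A "conservative code" option would then amount to shrinking the first set at
checkpoint time so that more restore targets pass the check.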

> The discussion in [1] covers some of the background on determining default
> processor features and [2] is a list of differences between
> creation/restore environments that will need to be addressed for
> portability.

This is valuable info [1]; could you copy such a list here?  The list in [2]
is very valuable in the context of CRaC and all the preliminary discussions
about the project.

Leaving aside topics that require changes to the Java API, it looks like there
are different levels of portability.

  0. No portability.  The image can be restored only on the same machine: same
CPU, same operating system, and, probably, no OS restart since the checkpoint.
This may still be useful for running similar Java programs, like a scaling
microservice or javac [6].

  1. Between machines with the same operating system distribution.  The CPU
feature set is a good example of what can differ here.  Also, available memory
resources can change between checkpoint and restore.  We'll likely need to
change the JVM to handle the difference.  Containers belong here too: it's
interesting that even when both run on the same physical machine (same CPU),
the container instance used for the checkpoint and the container used for the
restore may have different hard memory limits.
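
As an illustration of the memory-limit case, below is a rough sketch of the
decision the JVM could make on restore.  The policy, the names, and the three
outcomes are my assumptions for the sake of discussion, not existing CRaC or
HotSpot behaviour, and the sketch looks at heap sizes only, ignoring all
non-heap memory.

```java
// Hypothetical sketch of a restore-time policy; not existing JVM behaviour.
final class MemoryLimitCheck {
    enum Action { KEEP, SHRINK_HEAP, FAIL }

    // usedBytes:         live heap data at checkpoint
    // committedBytes:    heap memory committed at checkpoint
    // restoreLimitBytes: hard memory limit of the restore container
    static Action onRestore(long usedBytes, long committedBytes,
                            long restoreLimitBytes) {
        if (committedBytes <= restoreLimitBytes) {
            return Action.KEEP;        // the image fits as it is
        }
        if (usedBytes <= restoreLimitBytes) {
            return Action.SHRINK_HEAP; // fits only after uncommitting free heap
        }
        return Action.FAIL;            // live data alone exceeds the limit
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Checkpointed with 300 MB used and 512 MB committed,
        // restored into a container limited to 384 MB.
        System.out.println(onRestore(300 * mb, 512 * mb, 384 * mb));
    }
}
```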

  2. Between different distributions of the same operating system e.g. GNU/Linux.
For checkpoint/restore implemented on top of CRIU this will be a problem, since
CRIU stores the complete process memory.  It captures the internal layout and
the state of system libraries such as libc, which may change between
distributions.  This level corresponds to the portability of a regular Java
build.

There could be more levels, like:

  3. Portability between different operating systems, e.g. Linux and Windows.
This is unlikely to be practical to implement, and we would be unable to
transfer any JNI code.

  -1. No portability, single restore.  This can be implemented by sending the
Unix signals SIGSTOP/SIGCONT.  Is it useful for testing?
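
To summarize levels 0 to 3, here is a small sketch that maps a pair of
environments to the lowest level a restore between them would require.  The
Env fields and the mapping are assumptions made for illustration only; level -1
is left out because it is about the number of restores, not about the
environment.

```java
// Hypothetical sketch: the Env fields and the mapping to levels are
// assumptions made for illustration, not part of CRaC.
final class PortabilityLevels {
    static final class Env {
        final String machineId;     // identifies the physical or virtual machine
        final String osFamily;      // e.g. "Linux", "Windows"
        final String distribution;  // e.g. a specific GNU/Linux distribution

        Env(String machineId, String osFamily, String distribution) {
            this.machineId = machineId;
            this.osFamily = osFamily;
            this.distribution = distribution;
        }
    }

    // Lowest portability level (0..3) needed to restore on 'to' an image
    // that was checkpointed on 'from'.
    static int requiredLevel(Env from, Env to) {
        if (!from.osFamily.equals(to.osFamily)) {
            return 3; // different operating systems
        }
        if (!from.distribution.equals(to.distribution)) {
            return 2; // same OS, different distributions
        }
        if (!from.machineId.equals(to.machineId)) {
            return 1; // same distribution, different machines
        }
        return 0;     // same machine
    }

    public static void main(String[] args) {
        Env a = new Env("host-1", "Linux", "distro-A");
        Env b = new Env("host-2", "Linux", "distro-B");
        System.out.println("required level: " + requiredLevel(a, b));
    }
}
```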

Thanks,
Anton

[1] https://mail.openjdk.java.net/pipermail/hotspot-dev/2021-October/055049.html
[2] https://github.com/eclipse-openj9/openj9/issues/12484
[3] https://github.com/openjdk/crac/blob/master/src/hotspot/cpu/x86/vm_version_x86.hpp#L305
[4] https://github.com/openjdk/crac/blob/master/src/hotspot/share/jvmci/vmStructs_jvmci.cpp#L760
[5] https://mail.openjdk.java.net/pipermail/hotspot-dev/
[6] https://mail.openjdk.java.net/pipermail/discuss/2021-February/005714.html