Coordinated Restore at Checkpoint: A new project for start-up optimization?

Thu Sep 10 18:25:00 UTC 2020

Hello,

We would like to open a discussion about a new project focused on
"Coordinated Restore at Checkpoint".

A possible relevant project name might be Tubthumpting [9].

Over the years, we [at Azul] have tinkered with various ways to improve java
start-up time and warmup behavior for different use cases for such
improvements. One of the interesting focus areas has been the "starting of
a new instance" of an application that has already run instances using identical
code, a similar expected profile, and potentially a similar initialization
sequence in the past. This is a common scenario in modern application
deployments, when e.g. rolling out new code in continuous deployment
environment, and when e.g. elastically changing instance counts in e.g.
auto-scaling situations.

Checkpoint/Restore technologies have evolved in various forms over the past
few years, and are available in the multiple forms, including e.g. CRIU [1]
and Docker Checkpoint & Restore [2].  While Checkpoint/Restore capabilities
have been shown to work across a wide range of applications for e.g. live
process or application migration, there are various challenges present for
their generic application for new instance deployment. Many of these
challenges have to do with the need to deal with a checkpointed state that may
not be validly reproducible when restoring multiple instances from the same
checkpoint image.

This is where Coordinate Restore at Checkpoint (CRaC) comes in. At a high
level, CRaC aims to systemically address these challenges by facilitating
explicit and intentional coordination between checkpointed applications and
a checkpointing mechanism. Such coordination will allow applications to
proactively discard problematic state ahead of checkpointing and to
reestablish needed state upon restoration. [e.g. closing open file
descriptors ahead of a checkpoint, and recreating and binding them after a
restore].

Coordination is a powerful enabler in this space. Contrary to the approaches
attempting transparent, uncoordinated checkpoint/restore, CRaC's approach to
the date has focused on assisting with the detection of situations that would
prevent a successful checkpoint, and simply refusing to checkpoint if such
conditions are identified. This approach leaves it up to the application
frameworks and the applications themselves to remedy the situation during
development, and before attempting actual deployment (or simply accept
non-CRaC startup times since a restorable checkpoint state will not be
available).

In the Java arena, we aim to create a generic CRaC API that would allow
applications and/or application frameworks to coordinate with an arbitrary
checkpoint/restore mechanism, without being tied to a specific
implementation or to the operational means by which checkpointing and
restoration is achieved. Such an API would allow application frameworks
(e.g. Tomcat, Quarkus, MicroNaut, etc.) to perform the needed coordination
in a portable way, which would not require coding that is specific to a
checkpoint/restore mechanism.  E.g. the same Tomcat CRaC coordination code
would be able to properly coordinate with a generic Linux CRIU utility, with
Docker Checkpoint & Restore, or with future OpenJDK implementations that may
support checkpoint/restore functionality directly or via the use of
libraries or system services.

Our hope is to start a project that will focus on specifying a CRaC API, and
will provide at least one CRaC-supporting checkpoint/restore OpenJDK
implementation with the hope of eventual upstream inclusion in a future
OpenJDK version via associated JEPs. We would potentially want to include
the API in a future Java SE specification as well.

In reality, we expect that more than one checkpoint/restore mechanism may be
supported, as we have already identified at least two probable modes of
operation that would be useful for OpenJDK:

- We have prototyped [3] a JDK-driven, modified-CRIU [4] based
  checkpoint/restore implementation that leverages on-demand paging during
  startup to deliver very promising start times for e.g microservices
  running on Quarkus, Micronaut, and Tomcat, and reaching "full speed"
  condition in sub-50-msec times.[5]

- We anticipate external-to-the-JDK checkpoint/restore implementations such
  as Docker Checkpoint & Restore [2] and potential possible support within
  orchestration frameworks (such as future Kubernetes versions) will drive
  a need for non-Java-specific means of coordinating restoration from
  checkpointed conditions, and that in such environments JDKs will likely
  wish to provide external controls (such jcmd or other APIs) that would
  deal with coordination, but leave the actual checkpointing and restore
  work to external entities.

Below are short summaries of:
- CRaC API concepts
- What a prototype OpenJDK implementation looks like
- Preliminary uses of CRaC API in some application frameworks
- Some promising preliminary results

What do you think? Please chime in.

— Gil.

P.S. Anton Kozlov has done the vast majority of the technical work on this
so far, and will be joining the discussion here.

—————————————————
CRaC API, conceptually

The high-level concepts of a CRaC API as we see it thus far include:

- Application code (a "resource") can register its interest in coordinating
  with checkpoint/restore operations.

- When a checkpoint operation attempt is initiated, and before a checkpoint
  is actually taken, all registered "resources" will be notified that a
  checkpoint is being attempted via e.g. a beforeCheckpoint() call.

- A JDK may (and likely will) refuse to complete a checkpoint attempt if it
  encounters any application state that it does not know how to checkpoint
  or restore.  E.g. a JDK may (and likely will) refuse to complete a
  checkpoint attempt if any file descriptors that are not private to the
  JDK itself are open after all registered resources have been notified
  about the coming checkpoint attempt.

- When a restore operation occurs, all registered resources will be notified
  via e.g.  an afterRestore() callback.

- Upon being notified of a coming checkpoint, a resource is responsible for
  destroying any state that may prevent the capturing of a checkpoint (e.g.
  close any objects that it is responsible and that may keep open file
  descriptors), as well as for capturing whatever information it may need
  in order to continue successfully after a restore (e.g. the knowledge of
  what needs to be "opened" before a restore is complete).

- A resource may cause a checkpoint attempt to fail by throwing an exception
  when notified.

- Upon being notified that a restore has occurred, a resource is responsible
  for any required restoration or recreation of the state that it destroyed
  before the checkpoint occurred. [e.g. opening, binding, listening, and
  possibly selecting on server ports that were closed for the checkpoint].
  Note that although restoration is not functionally required in some cases,
  it may still be beneficial for faster functional startup upon restoration.
  E.g. outbound connections in a connection pool may not have to be
  reconnected, as normal connection failure handling will likely deal with
  their re-establishment in any case. However, initiating such reconnection
  upon restore will likely improve functional startup time.

- A resource may indicate that a restore attempt should fail by throwing an
  exception when notified.

—————————————————
Prototype JDK implementation

The prototype JDK implementation [3] implements Coordinated Checkpoint and
Restore using a modified version of CRIU. A snapshot image of the JDK process
created at an arbitrary point of time, the image is later used to start a copy
of the process that is identical to the original one.
Hotspot change highlights:
- Adds a Coordinated Checkpoint and Restore implementation for Linux
    - the checkpoint is performed in a JVM safepoint
    - currently depends on being able to reuse the checkpointed process pid.
      [not a problem in containers]
- Adds a jcmd command for initiating Checkpoint (does not yet pass error
  information on failure)
- Enforces no java user-visible file or socket resources are allowed at the
  checkpoint time. Exception message indicates the problematic resource
  information.
- Changes in PerfMemory (/tmp/hsperfdata<user>/<pid>) to work across multiple
  restores
- Performs GC on checkpoint and zeros unused heap memory to minimize
  image size

JDK change highlights:
- a jdk.crac API providing Checkpoint and Restore notifications
- uses of the jdk.crac API within the JDK:
    - support in sun.nio.ch.EPollSelectorImpl to handle epoll and pipe
    - jar file handling by the JDK
    - support in java.net.PlainSocketImpl and sun.nio.ch.FileDispatcherImpl
      to handle internal socket used for preclose

—————————————————
Preliminary uses of CRaC API in some application frameworks

AKA: What modifying common application frameworks to use a proposed CRaC API
successfully on a prototype OpenJDK implementation looks like.

The CRaC API was used to create modified versions of Quarkus [6], Micrnoaut
[7] and Tomcat [8] (used by Spring Boot in our examples). The amount of code
changes required has been surprisingly small.

All three frameworks successfully coordinate checkpoint and restore
operations with the prototype JDK without requiring any changes to the
example code that runs on the framework.  It is hoped that a large majority
of applications that run on such frameworks would not require any CRaC API
use, and CRaC awareness will only be needed at the framework and potentially
at the library levels in most cases.

—————————————————
Promising Preliminary Results

The current prototype has demonstrated <50msec startup times [5] for fully
warmed microservice examples running on modified Spring Boot, Quarkus, and
Micronaut[4].

The examples demonstrate fully-JIT'ed performance out of the box:  the
immediate throughout of these <50msec starts matches the throughput achieved
by a normal OpenJDK start only after the latter has fully warmed up, and
after it had executed >10,000 example operations at significantly slower
speeds.

—————————————————
References

[1]  https://github.com/checkpoint-restore/criu
[2]  https://github.com/docker/cli/blob/master/experimental/checkpoint-restore.md
[3]  https://github.com/org-crac/jdk/compare/jdk-base..jdk-crac
[4]  https://github.com/org-crac/criu/compare/v3.14..crac
[5]  https://github.com/org-crac/docs#results
[6]  https://github.com/org-crac/quarkus/compare/master..crac-master
[7]  https://github.com/org-crac/micronaut-core/compare/1.3.x..crac-1.3.x
[8]  https://github.com/org-crac/tomcat/compare/8.5.x..crac
[9] https://en.wikipedia.org/wiki/Tubthumping