CRaC + maven-daemon: An experience report

Wed Jun 8 11:06:27 UTC 2022

On 6/7/22 19:05, Ashutosh Mehra wrote:
> we picked up maven-daemon and tried to use CRaC to improve the start
> up time of the daemon processes.

Cool! Did that pay off?

> [Checkpoint] is done by the first daemon process started by a
> client. After performing the requested build, it takes the checkpoint
> and exits.

Taking the checkpoint after a few client invocations may give better
results, e.g. at daemon termination.

> There are two options depending on the execution model.  You can read
> more about the execution models with CRaC in the blog post on
> phase-aware source code.
> 1. In the first approach, the execution on restore continues from the
>    point where the checkpoint was taken.
> 2. In the second approach, the execution on restore starts from an
>    initialized image but as though the MavenDaemon’s main() was being
>    invoked anew.

CRaC does not build in these two modes. Apps may adopt these or similar
ones, but the modes relate to CRaC the way a design pattern relates to a
language: just as the singleton can be implemented in the Java language,
either of these modes can be implemented with CRaC.

Execution under CRaC always continues from the checkpoint. You may make
it exit as soon as it is restored and thereby provide option 2.  So
these options are points on a spectrum, rather than a complete set of
alternatives.
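
One possible reading of this, as a minimal sketch: both options share the
same resume point and differ only in what the code does right after it.
Checkpointer, daemonMain and run are hypothetical names for illustration;
Checkpointer stands in for the real checkpoint call
(org.crac.Core.checkpointRestore() in the CRaC API) so the control flow
can be run on any JDK.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: nothing here is taken from maven-daemon or from the CRaC JDK.
class RestoreModes {
    // Hypothetical stand-in for the checkpoint call. With real CRaC the call
    // returns in the restored process, right after the point of checkpoint.
    interface Checkpointer {
        void checkpointRestore() throws Exception;
    }

    // Hypothetical daemon entry point; it only records what it was asked to do.
    static void daemonMain(List<String> trace, String request) {
        trace.add("serve " + request);
    }

    static List<String> run(Checkpointer cp, boolean restartAfterRestore) throws Exception {
        List<String> trace = new ArrayList<>();
        daemonMain(trace, "first build"); // warm-up work before the checkpoint
        cp.checkpointRestore();           // execution resumes here in both options
        trace.add("restored");
        if (restartAfterRestore) {
            // Option 2: abandon the old flow and enter the main logic again.
            daemonMain(trace, "new build");
        } else {
            // Option 1: carry on with the interrupted flow.
            trace.add("continue");
        }
        return trace;
    }
}
```

Anything between these two extremes (e.g. re-running only part of the
startup) fits the same shape, which is why they look like a spectrum.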

In Approach 2:
> This also allows us to shift the time of checkpoint after the Server
> instance has been closed normally. Although this change is enough to
> use checkpoint-restore correctly, it does not produce the expected
> benefits in daemon startup time.

It would be interesting to look at the changes, to get a better
understanding of how the two approaches are implemented.

It looks like in Approach 2 you've handled components of the
application, rather than e.g. object fields as in Approach 1.  Making
the CRaC code "higher level", handling bigger parts of an application,
may simplify coding but also reduce the benefit.  Have you considered
making the code "lower level" than Approach 1, e.g. modifying the server
socket abstraction in the app, if any, to hide the details of the
handling w.r.t. checkpoint and restore?  I assume this may make that
abstraction more complex, but simplify using it.  If the abstraction
becomes general enough, then it can be considered for inclusion in the
CRaC JDK.
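
To make the "lower level" idea concrete, here is a rough sketch of such
an abstraction. CheckpointAwareServerSocket is a hypothetical name, not
something from maven-daemon or the CRaC JDK. Its two hook methods mirror
the shape of the callbacks on CRaC's Resource interface but are plain
methods here, so the sketch runs on any JDK. It assumes the socket has to
be closed before the checkpoint and can be re-bound to the same port
after restore.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical wrapper that hides checkpoint/restore handling from its users.
class CheckpointAwareServerSocket implements AutoCloseable {
    private ServerSocket socket;
    private int port;

    CheckpointAwareServerSocket(int port) throws IOException {
        this.port = port;
        open();
    }

    private void open() throws IOException {
        socket = new ServerSocket();
        socket.setReuseAddress(true);
        socket.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), port));
        port = socket.getLocalPort(); // remember the real port if 0 was requested
    }

    // To be called before the checkpoint: release the listening socket.
    void beforeCheckpoint() throws IOException {
        socket.close();
    }

    // To be called after restore: listen again on the remembered port.
    void afterRestore() throws IOException {
        open();
    }

    int port() {
        return port;
    }

    boolean isListening() {
        return socket.isBound() && !socket.isClosed();
    }

    // Callers accept connections through the wrapper's current socket.
    ServerSocket current() {
        return socket;
    }

    @Override
    public void close() throws IOException {
        socket.close();
    }
}
```

With real CRaC, the two hooks would be driven by a registered Resource
instead of being called by hand; the rest of the app would only see the
wrapper.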

> Snapsafety

I still struggle to understand what it is. Is it a property of the code
(e.g. if you use these classes, you are safe w.r.t. checkpoint and
restore and don't need to coordinate explicitly)? Or is it a property of
the state (the state can be safely checkpointed and restored -- what is
safe in this case)?

> Each checkpoint wants to use the same PID when restored.

Two problems here. The current implementation indeed does not allow
changing the PID, but it is possible to some extent.  The second
problem, therefore, is what the Java code expects.  The PID cannot
change during execution right now, but the javadoc does not explicitly
state that.  I think it's worth clarifying and seeking consensus in the
broader OpenJDK community.
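
As an illustration of what Java code may silently expect, here is a small
sketch using the standard ProcessHandle API. PidExpectation is a
hypothetical class. It contrasts a PID captured once with a PID queried
on demand; if the PID could change across restore, the captured value
would go stale. Whether the on-demand query would actually report a new
PID is up to the JDK implementation and is not something this sketch
shows.

```java
// Hypothetical example class; only ProcessHandle is a real JDK API here.
class PidExpectation {
    // Captured once at class initialization. Code like this relies on the
    // PID never changing for the lifetime of the Java process.
    static final long CACHED_PID = ProcessHandle.current().pid();

    // Queried on every call; returns whatever the runtime reports now.
    static long currentPid() {
        return ProcessHandle.current().pid();
    }

    // True as long as the captured value still matches the reported one.
    static boolean cacheStillValid() {
        return CACHED_PID == currentPid();
    }
}
```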