CRaC + maven-daemon: An experience report

Wed Jun 8 14:08:39 UTC 2022

> > we picked up maven-daemon and tried to use CRaC to improve the start
> > up time of the daemon processes.

> Cool! Did that pay off?

Yes, it definitely improved the startup time of the daemon process.
I haven't measured the actual improvement, but it is clearly noticeable:
startup on restore is almost instant.

> Checkpoint after a few client invocations may provide better results,
> e.g. at the daemon termination.
>

It is difficult to quantify "few" here without doing some performance
measurements.
So for the PoC I took the snapshot after the first request, which seems to
provide a considerable startup improvement.
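
To make the "after the first request" choice concrete, here is a minimal
sketch of a one-shot trigger. CheckpointTrigger and its methods are
hypothetical names for illustration, not code from the maven-daemon
changesets. The action passed in is where a real daemon would request the
checkpoint (with CRaC, presumably via Core.checkpointRestore()).

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: runs a checkpoint action once, after N completed requests. */
class CheckpointTrigger {
    private final int threshold;
    private final Runnable checkpointAction;
    private final AtomicInteger handled = new AtomicInteger();
    private final AtomicBoolean fired = new AtomicBoolean();

    CheckpointTrigger(int threshold, Runnable checkpointAction) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
        this.checkpointAction = checkpointAction;
    }

    /** Call after each completed request; returns true if this call ran the action. */
    boolean requestCompleted() {
        // compareAndSet guarantees the action runs at most once, even with concurrent requests.
        if (handled.incrementAndGet() >= threshold && fired.compareAndSet(false, true)) {
            checkpointAction.run();
            return true;
        }
        return false;
    }
}
```

With a threshold of 1 this matches what the PoC does. Trying "a few"
invocations would only mean changing the threshold, which should make the
measurements easy to repeat.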

> It will be interesting to look at the changes, to get a better
> understanding of the implementations of the two approaches.
>

Sorry, the links didn't work in the earlier document. We have fixed that
now. You should be able to check out the changesets from the links in the
doc.

> Have you considered making the code
> "lower level" than Approach 1, e.g. to modify server socket abstraction
> in the app, if any, to hide the details of the handling w.r.t.
> checkpoint and restore.
>

Unfortunately, maven-daemon does not use any abstraction over
ServerSocketChannel, so the Server class itself has to take care of the
channel on checkpoint and restore events.
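
Roughly, the shape of that handling is sketched below. This is an
illustration under assumptions, not the actual changeset: SketchServer is a
made-up class, and CheckpointAware is a local stand-in for the CRaC Resource
callbacks (which additionally take a Context argument), so that the snippet
compiles without the CRaC JDK.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;

/** Local stand-in for a CRaC-style callback pair (assumption, not the CRaC API). */
interface CheckpointAware {
    void beforeCheckpoint() throws IOException;
    void afterRestore() throws IOException;
}

/** Hypothetical server that owns its channel directly, with no abstraction in between. */
class SketchServer implements CheckpointAware, AutoCloseable {
    private ServerSocketChannel channel;

    SketchServer() throws IOException {
        open();
    }

    private void open() throws IOException {
        channel = ServerSocketChannel.open();
        // Port 0 asks the OS for an ephemeral port on the loopback interface.
        channel.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
    }

    boolean isListening() {
        return channel != null && channel.isOpen();
    }

    int port() throws IOException {
        return ((InetSocketAddress) channel.getLocalAddress()).getPort();
    }

    @Override
    public void beforeCheckpoint() throws IOException {
        // Close the listening socket so it is not open when the image is taken.
        channel.close();
    }

    @Override
    public void afterRestore() throws IOException {
        // Open a fresh socket; the port may differ, so clients need the new one.
        open();
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

If the app had its own wrapper around the channel, these two callbacks could
live in the wrapper and the Server class would not need to know about them.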

> The PID cannot change during
> execution right now, but the javadoc does not explicitly state that.  I
> think it's worth clarifying and seeking the consensus in the broader
> OpenJDK community.
>

It really depends on the state captured in the checkpoint image. If the
checkpoint only captures the Java program state, then the effect of
changing the PID is confined to the JVM/JDK and can be handled there.
But with CRIU, the whole process state is serialized, including native
libraries, OS resources and so on, and these may not cope well with a
changed PID.
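
As a small illustration of the Java-level half of this: a value derived from
the PID and cached before the checkpoint can be refreshed from a restore
callback. PidCache below is a hypothetical class, not part of the JDK or
CRaC. A native library that cached getpid() has no comparable hook, which is
the harder case.

```java
/** Hypothetical holder for a PID cached inside Java code. */
class PidCache {
    // Captured once at construction; would go stale if the PID changed on restore.
    private volatile long cachedPid = ProcessHandle.current().pid();

    long cached() {
        return cachedPid;
    }

    /** True if the cached value no longer matches the running process. */
    boolean isStale() {
        return cachedPid != ProcessHandle.current().pid();
    }

    /** Intended to be called from a restore callback to re-read the PID. */
    void refresh() {
        cachedPid = ProcessHandle.current().pid();
    }
}
```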

Regards,
Ashutosh Mehra

On Wed, Jun 8, 2022 at 7:06 AM Anton Kozlov <akozlov at azul.com> wrote:

> On 6/7/22 19:05, Ashutosh Mehra wrote:
> > we picked up maven-daemon and tried to use CRaC to improve the start
> > up time of the daemon processes.
>
> Cool! Did that pay off?
>
> > [Checkpoint] is done by the first daemon process started by a
> > client. After performing the requested build, it takes the checkpoint
> > and exits.
>
> Checkpoint after a few client invocations may provide better results,
> e.g. at the daemon termination.
>
> > There are two options depending on the execution model.  You can read
> > more about the execution models with CRaC in the blog post on
> > phase-aware source code.  1. In the first approach, the execution on
> > restore continues from the point where the checkpoint was taken.  2.
> > In the second approach, the execution on restore starts from an
> > initialized image but as though the MavenDaemon’s main() was being
> > invoked anew.
>
> CRaC does not embed these two modes. Apps may adopt these or similar
> ones, but the modes relate to CRaC the way a design pattern relates to
> a language: just as the singleton can be implemented in the Java
> language, either of these modes can be implemented with CRaC.
>
> Execution with CRaC always continues from the checkpoint. You may
> make it exit as soon as it is restored and thereby provide option 2.  So
> these options are points on a spectrum, rather than a complete set of
> alternatives.
>
> In Approach 2:
> > This also allows us to shift the time of checkpoint after the Server
> > instance has been closed normally. Although this change is enough to
> > use checkpoint-restore correctly, it does not produce the expected
> > benefits in daemon startup time.
>
> It will be interesting to look at the changes, to get a better
> understanding of the implementations of the two approaches.
>
> It looks like in Approach 2 you've handled components of the application,
> rather than e.g. object fields as in Approach 1.  Making the CRaC code
> "higher level", handling bigger parts of an application, may simplify
> coding but also reduce the benefit.  Have you considered making the code
> "lower level" than Approach 1, e.g. to modify server socket abstraction
> in the app, if any, to hide the details of the handling w.r.t.
> checkpoint and restore. I assume this may make that abstraction more
> complex, but simplify using that abstraction.  If the abstraction
> becomes general enough, then it can be considered to be included in CRaC
> JDK.
>
> > Snapsafety
>
> I still struggle to understand what it is. Is it a property of the code
> (e.g.  if you use these classes, you are safe w.r.t. checkpoint and
> restore and don't need to coordinate explicitly)? Or is it a property of
> the state (the state can be safely checkpointed and restored -- what is
> safe in this case)?
>
> > Each checkpoint wants to use the same PID when restored.
>
> Two problems here. The current implementation indeed does not allow
> changing the PID, but it is possible to some extent.  The second problem,
> therefore, is what the Java code expects.  The PID cannot change during
> execution right now, but the javadoc does not explicitly state that.  I
> think it's worth clarifying and seeking the consensus in the broader
> OpenJDK community.
>
>