Call for Discussion: New Project: CRaC

Wed Jul 21 08:56:15 UTC 2021

Hi Christine,

For me, it also makes sense to combine efforts.

1) In CRaC, the heap is already shrunk at the checkpoint [1].  However, it
required minor changes in each GC, and ZGC is not covered yet.  The target of
this optimization is solely image size.  As we wrote, we optimized image
loading size with page-in of the image data.

2) I'm not sure how much GC barriers cost for a small application.  Do you have
some data showing that the cost is that high?  This use-case is definitely
interesting, but this optimization seems rather complex to implement.

As Volker correctly noted, coordination is required regardless of whether the
checkpoint request comes externally or internally.

Actually, CRaC has an API for internal checkpoint requests [2], which is used
to implement the external request via jcmd.  In the Tomcat Catalina example,
the method is used as one of the starting modes [3]: it saves the Tomcat
instance right after it has been initialized and is ready to serve requests.
Strictly speaking, this is another kind of coordination, but one of the targets
of the project is to produce an API that is general enough for different
use-cases and mechanisms.
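
To illustrate the point about coordination, here is a minimal sketch in which
internal and external requests go through the same entry point, so resources
are notified the same way in both cases.  All names here (CheckpointAware,
Coordinator, Listener) are hypothetical stand-ins, not the jdk.crac API, and
saving the state is only simulated.  The notification order (reverse
registration order before the checkpoint, registration order after the
restore) is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback interface for this sketch; not the jdk.crac API.
interface CheckpointAware {
    void beforeCheckpoint();
    void afterRestore();
}

// Hypothetical coordinator: one entry point for every checkpoint request.
class Coordinator {
    private final List<CheckpointAware> resources = new ArrayList<>();
    final List<String> log = new ArrayList<>();

    void register(CheckpointAware resource) {
        resources.add(resource);
    }

    // Both an internal call and an external tool request would end up here.
    void checkpointRestore(String origin) {
        log.add("request:" + origin);
        // Assumption: notify in reverse registration order before saving.
        for (int i = resources.size() - 1; i >= 0; i--) {
            resources.get(i).beforeCheckpoint();
        }
        log.add("state saved (simulated)");
        // Assumption: notify in registration order after the restore.
        for (CheckpointAware resource : resources) {
            resource.afterRestore();
        }
    }
}

// Example resource: releases its socket before the checkpoint, reopens after.
class Listener implements CheckpointAware {
    boolean open = true;
    private final List<String> log;

    Listener(List<String> log) {
        this.log = log;
    }

    @Override
    public void beforeCheckpoint() {
        open = false;
        log.add("listener closed");
    }

    @Override
    public void afterRestore() {
        open = true;
        log.add("listener reopened");
    }
}

class CheckpointDemo {
    public static void main(String[] args) {
        Coordinator coordinator = new Coordinator();
        coordinator.register(new Listener(coordinator.log));
        // Internal request, e.g. issued by the application after initialization.
        coordinator.checkpointRestore("internal");
        // External request, e.g. a tool command routed to the same method.
        coordinator.checkpointRestore("external");
        coordinator.log.forEach(System.out::println);
    }
}
```

The application code that reacts to the checkpoint is the same in both cases;
only the origin of the request differs.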

So I don't see a big difference in what we are trying to do, if we continue
exposing this kind of API for internal checkpoint requests.

Thanks,
Anton

[1] https://github.com/CRaC/jdk/compare/jdk-base..jdk-crac#diff-c3cf48d74cc5c7b6326bd3602e87cd7ea7277a5b856c7aa0940ec307f51f5281
[2] https://crac.github.io/jdk/jdk-crac/api/java.base/jdk/crac/Core.html#checkpointRestore()
[3] https://github.com/CRaC/tomcat/compare/release-crac-jdbc...crac#diff-2d3e1f08ceeedd89b50366360b3541dddba9f2b7b1602a2071ca6359ace4a62eR529

On 7/21/21 10:56 AM, Volker Simonis wrote:
> Hi Christine,
> 
> thanks for joining the discussion :)
> Please find my further comments inline.
> 
> On Tue, Jul 20, 2021 at 7:06 PM Christine Flood <chf at redhat.com> wrote:
>>
>> We at Red Hat have been working on this problem as well and I think now is
>> a great time to sync our efforts.
>>
>> Our current project, Jigawatts 1.21  is based on allowing the user to
>> specify precise checkpoints either by adding a method call, or manipulating
>> bytecodes via Byteman.
>> This code is separate from OpenJDK and will be distributed in its own
>> Linux rpm.
>>
>> The next phase will require some changes to OpenJDK, specifically we are
>> looking to do some optimizations at checkpoint time to improve
>> startup/runtime.
>>
>> Here are two ideas.
>>
>> 1) Shrink the heap to just the live data size.  This both guarantees that
>> there are no secrets hidden in garbage objects and minimizes restore time.
>>
>> We can restore and immediately grow the heap.
>>
>> 2) Hot swap garbage collectors.  This allows us to give fast startup and
>> fast runtime by using the epsilon collector on restore, eliminating the
>> space for card table and time for gc barriers.  This will be particularly
>> useful for programs which wish to run fast small apps against an already
>> initialized data set.
>>
>> So, my question to you is: does it make sense to combine these into one
>> effort, or do we want to keep the projects separate for now?  The efforts
>> are focusing on two different areas.  Specifically, my understanding is that
>> CRAC wants to be able to checkpoint a JVM based on an external signal, so at
>> any point in the runtime, while Jigawatts is based more on user-controlled
>> and JVM-optimized checkpoints.
> 
> From my point of view it makes sense to combine the efforts. I think
> CRAC should explore different ideas and directions (see my previous
> mail). One of them will be how the JVM can implement and control
> checkpointing functionality. That's what your Jigawatts project is
> doing, but also what CRAC did in a POC based on CRIU.
> 
> The other direction CRAC should explore is how the JVM could react to
> externally triggered checkpointing events.
> 
> Finally, I think we need a mechanism exposed through a standard API
> which allows applications and frameworks to react on checkpointing
> events no matter if these events are triggered internally, by the JVM
> or externally. Such a mechanism is especially needed in situations
> where an application is not simply suspended and resumed but also
> cloned several times after it was resumed (or resumed several times
> from the same checkpointed state).
> 
>>
>>
>> Christine
>>
>>
>>
>>
>>
>> On Sun, Jul 18, 2021 at 10:50 AM Anton Kozlov <akozlov at azul.com> wrote:
>>
>>> Hi,
>>>
>>> It's been a while since we presented Coordinated Restore at Checkpoint for
>>> the first time [0].  We are still committed to the idea and researching this
>>> topic.
>>>
>>> Java applications can avoid the long start-up and warm-up by saving the
>>> state of the Java runtime (snapshot, checkpoint).  The saved state is then
>>> used to start instances fast (restored).  But after the state was saved, the
>>> execution environment could change.  Also, if multiple instances are started
>>> from the saved state simultaneously, they should obtain some uniqueness, and
>>> their executions should diverge at some point.
>>>
>>> We believe that the practical way to solve these problems is to make Java
>>> applications aware of when the state is saved and restored.  Then an
>>> application will be able to handle environmental changes.  The application
>>> will also be able to obtain uniqueness from the environment.
>>>
>>> The CRaC project aims to research Java API for coordination between
>>> application and runtime to save and restore the state.  Runtime should
>>> support multiple ways to save the state: virtual machine snapshot, container
>>> snapshot, CRIU project on Linux, etc.  We hope to come with an API that is
>>> general enough for any underlying mechanism.  We also plan to explore safety
>>> checks in the API and runtime, which prevent saving the state if it may not
>>> be restored or work correctly after the restore.
>>>
>>> I propose myself as a Project Lead of the CRaC Project.  If you're
>>> interested or want to be the committer, please drop me a message.
>>>
>>> A fork of JDK [1] would be a starting point of this project.
>>>
>>> Thanks,
>>> Anton
>>>
>>> [0]
>>> https://mail.openjdk.java.net/pipermail/discuss/2020-September/005594.html
>>> [1] https://github.com/CRaC/jdk
>>>
>>>