Call for Discussion: New Project: CRaC

Michael Bien mbien42 at gmail.com
Wed Jul 21 12:08:13 UTC 2021


As I remember from discussions on #criu, CRIU interestingly doesn't
actually care that much whether a file changed between checkpoint and
restore. It has only a rudimentary check comparing file size (!), mostly
for the scenario where a loaded lib changed, which in the best case
causes seg faults on restore.
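
To illustrate why a size-only comparison is weak, here is a minimal Java
sketch. It is not CRIU's code; the class and method names are made up for
this example. It only shows that an edit which keeps the file length the
same passes such a check, while an edit that changes the length does not.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration of a size-only "did the file change?" check.
class SizeOnlyCheck {

    // True when the current size equals the size recorded at checkpoint.
    static boolean looksUnchanged(Path file, long sizeAtCheckpoint) throws IOException {
        return Files.size(file) == sizeAtCheckpoint;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("size-check", ".txt");
        try {
            Files.write(file, "AAAA".getBytes(StandardCharsets.US_ASCII));
            long recorded = Files.size(file); // taken "at checkpoint"

            // Same length, different content: the check does not notice.
            Files.write(file, "BBBB".getBytes(StandardCharsets.US_ASCII));
            System.out.println(looksUnchanged(file, recorded)); // true

            // Different length: the check notices.
            Files.write(file, "BBBBB".getBytes(StandardCharsets.US_ASCII));
            System.out.println(looksUnchanged(file, recorded)); // false
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```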

Maybe CRIU could skip changed files instead of failing and notify the
JVM. The JVM would apply black magic and let IO streams throw an
IOException on the next read or write, giving the application a chance
to recover even without knowing about a CRaC API. There might be
situations where throwing an IOException isn't even needed.
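
A rough sketch of what the stream side of this could look like, purely
as an illustration: RestoreAwareOutputStream and markStale() are
hypothetical names, not part of CRIU, CRaC or the JDK, and the sketch
assumes that something in the runtime would call markStale() after a
restore in which the backing file was skipped.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical wrapper: fails writes once the runtime reports that the
// backing file was not restored.
class RestoreAwareOutputStream extends OutputStream {
    private final OutputStream delegate;
    private final AtomicBoolean stale = new AtomicBoolean(false);

    RestoreAwareOutputStream(OutputStream delegate) {
        this.delegate = delegate;
    }

    // Assumed hook: called by the runtime after a restore that skipped the file.
    void markStale() {
        stale.set(true);
    }

    private void failIfStale() throws IOException {
        if (stale.get()) {
            throw new IOException("backing file changed between checkpoint and restore");
        }
    }

    @Override
    public void write(int b) throws IOException {
        failIfStale();
        delegate.write(b);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        failIfStale();
        delegate.write(b, off, len);
    }

    @Override
    public void flush() throws IOException {
        failIfStale();
        delegate.flush();
    }

    @Override
    public void close() throws IOException {
        delegate.close(); // closing stays possible so the application can clean up
    }
}
```

An application that already handles IOException on write (for example by
reopening its log file) would recover here without any restore-specific
code, which is the point of the idea.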

Just a thought; I don't know if any of this is doable.

best regards,
michael

On 21.07.21 12:36, Anton Kozlov wrote:
> Hi Michael,
>
> Interesting links!
>
> The CRIU project did a terrific job in checkpointing and restoring an
> arbitrary process.
>
> But if we think about how to continue the execution of a saved Java
> runtime instance, multiple times simultaneously, the examples show
> what we should do better.  The internal state of the runtime, standard
> library, or application (like a crypto random seed) needs fixing after
> the restore.  External resources cannot always be captured: files for
> a process-based checkpoint, or network connections for VM-based
> snapshotting.
>
> JFR is a good point and more such changes will likely appear over
> time.  CRaC handles the perfdata temp file, which is used to implement
> jcmd and jps functionality.  Without special care on the JVM side, the
> missing perfdata file will likely prevent a second restore with CRIU
> (the first restored instance deletes the file as not needed) or a
> restore after reboot.
>
> The Logger example is invaluable to demonstrate why coordination is
> needed.  Without knowledge of the semantics, it's impossible to
> distinguish between e.g. a log file, the previous content of which is
> not important, and a config file, which should be re-read after
> restore.  So automatic handling of files does not seem possible in
> general.  Some convenience can be implemented (like automatic log
> rotation), but this needs to be done with awareness of the semantics
> and should allow error handling on the Java application side.  We
> require resources to be re-acquired at restore and allow such code to
> throw exceptions.
>
> Thanks,
> Anton
>
> On 7/20/21 6:31 PM, Michael Bien wrote:
>> Hello,
>>
>> great to hear that there is research done in this area.
>>
>> I did some experimenting myself by just binding to the CRIU C API via
>> Panama some time ago[1][2]. It quickly became clear that, although it
>> worked surprisingly well, it would probably require a lower-level
>> approach to implement properly. (I was mostly interested in CRIU's
>> rootless mode[3] and restoring warmed-up JVMs, which came with its own
>> issues and kernel bugs.)
>>
>> Checkpointing the JVM is probably much safer when all threads have
>> stopped and more economical when the heap is compacted; the JVM
>> itself is in a better position to do that than the Java application.
>>
>> CRIU can't deal with situations where files changed between
>> checkpoint and restore. Restoring a Java program that is logging to a
>> file will only work once; a second attempt would fail since the file
>> changed due to the first restore. An API might be able to mitigate a
>> lot of this, e.g. a logger could rotate the log to an empty file, or
>> close the file on checkpoint and reopen it on restore. JFR should do
>> this out of the box. I was wondering if the IO stream impl itself
>> could help in some situations.
>>
>> Non-file related APIs might have to be made restore-aware too. For 
>> example SecureRandom might require re-seeding, keystores/SSL certs 
>> might need special attention etc.
>>
>>
>> Although it worked surprisingly well (restoring was also quite fast),
>> implementing it at the Java application level would be fairly
>> limited. Looking forward to hearing/seeing more from CRaC!
>>
>> best regards,
>> michael
>>
>> [1] https://github.com/mbien/JCRIU/
>> [2] https://mbien.dev/blog/entry/java-and-rootless-criu-using
>> [3] https://github.com/checkpoint-restore/criu/pull/1155
>>
>> On 18.07.21 16:48, Anton Kozlov wrote:
>>> Hi,
>>>
>>> It's been a while since we presented Coordinated Restore at
>>> Checkpoint for the first time [0].  We are still committed to the
>>> idea and researching this topic.
>>>
>>> Java applications can avoid the long start-up and warm-up by saving
>>> the state of the Java runtime (snapshot, checkpoint).  The saved
>>> state is then used to start instances fast (restored).  But after
>>> the state was saved, the execution environment could change.  Also,
>>> if multiple instances are started from the saved state
>>> simultaneously, they should obtain some uniqueness, and their
>>> executions should diverge at some point.
>>>
>>> We believe that the practical way to solve these problems is to make
>>> Java applications aware of when the state is saved and restored.
>>> Then an application will be able to handle environmental changes.
>>> The application will also be able to obtain uniqueness from the
>>> environment.
>>>
>>> The CRaC project aims to research a Java API for coordination
>>> between the application and the runtime to save and restore the
>>> state.  The runtime should support multiple ways to save the state:
>>> virtual machine snapshot, container snapshot, the CRIU project on
>>> Linux, etc.  We hope to come up with an API that is general enough
>>> for any underlying mechanism.  We also plan to explore safety checks
>>> in the API and runtime, which prevent saving the state if it may not
>>> be restored or work correctly after the restore.
>>>
>>> I propose myself as Project Lead of the CRaC Project.  If you're
>>> interested or want to be a committer, please drop me a message.
>>>
>>> A fork of JDK [1] would be a starting point of this project.
>>>
>>> Thanks,
>>> Anton
>>>
>>> [0] https://mail.openjdk.java.net/pipermail/discuss/2020-September/005594.html
>>> [1] https://github.com/CRaC/jdk
>>>
>>


