Call for Discussion: New Project: CRaC
Anton Kozlov
akozlov at azul.com
Wed Jul 21 10:36:41 UTC 2021
Hi Michael,
Interesting links!
The CRIU project does a terrific job of checkpointing and restoring an
arbitrary process.
But if we think about how to continue the execution of a saved Java runtime
instance, possibly several times simultaneously, your examples show what we
should do better. The internal state of the runtime, the standard library, or
the application (like a crypto random seed) needs fixing after the restore.
External resources cannot always be captured: for example, files for a
process-based checkpoint, or network connections for VM-based snapshotting.
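A minimal sketch of the "crypto random seed" case, to show why restored
instances need a hook to diverge. The class RestoreAwareRandom and its
afterRestore() method are hypothetical names used only for illustration; they
are not part of the JDK or of the CRaC API. Whether a fresh SecureRandom is
enough to obtain new entropy depends on the security provider in use.

```java
import java.security.SecureRandom;

// Hypothetical holder, not part of the JDK or of the CRaC API.
final class RestoreAwareRandom {
    // Generator created before the checkpoint; its state is captured in the image.
    private volatile SecureRandom random = new SecureRandom();

    byte[] nextBytes(int n) {
        byte[] buf = new byte[n];
        random.nextBytes(buf);
        return buf;
    }

    // Hypothetical hook: assumed to be called once in every restored instance.
    // Replacing the generator discards the state shared through the image, so
    // that simultaneously restored instances do not continue the same sequence.
    void afterRestore() {
        random = new SecureRandom();
    }

    SecureRandom current() {
        return random;
    }
}
```

Without such a hook, every instance restored from the same image would start
from identical generator state.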
JFR is a good point, and more such changes will likely appear over time. CRaC
handles the perfdata temp file, which is used to implement jcmd and jps
functionality. Without special care on the JVM side, the missing perfdata file
will likely prevent a second restore with CRIU (the first restored instance
deletes the file as no longer needed) or a restore after a reboot.
The Logger example is invaluable for demonstrating why coordination is needed.
Without knowledge of the semantics, it's impossible to distinguish between e.g.
a log file, whose previous content is not important, and a config file, which
should be re-read after restore. So automatic handling of files does not seem
possible in general. Some convenience can be implemented (like automatic log
rotation), but this needs to be done with awareness of the semantics and
should allow error handling on the Java application side. We require resources
to be re-acquired at restore and allow such code to throw exceptions.
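A minimal sketch of the coordination described above, assuming a callback
interface with one method called before the checkpoint and one called after the
restore. The names CheckpointAware and RestorableLog are hypothetical and chosen
for illustration; this is not the CRaC API. The point is that the file is
released before the state is saved and re-acquired afterwards, and that the
re-acquisition is allowed to fail with an exception.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical callback interface, not the actual CRaC API.
interface CheckpointAware {
    void beforeCheckpoint() throws Exception;
    void afterRestore() throws Exception;
}

// A log sink that knows its own semantics: previous content does not matter,
// so it can simply close the file and reopen it in append mode later.
final class RestorableLog implements CheckpointAware {
    private final Path path;
    private Writer out;

    RestorableLog(Path path) throws IOException {
        this.path = path;
        open();
    }

    private void open() throws IOException {
        out = Files.newBufferedWriter(path, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    void log(String line) throws IOException {
        if (out == null) {
            throw new IllegalStateException("log is closed");
        }
        out.write(line);
        out.write(System.lineSeparator());
        out.flush();
    }

    boolean isOpen() {
        return out != null;
    }

    @Override
    public void beforeCheckpoint() throws IOException {
        // Release the external resource so that no open file is captured.
        if (out != null) {
            out.close();
            out = null;
        }
    }

    @Override
    public void afterRestore() throws IOException {
        // Re-acquire the resource; this may throw, e.g. if the directory no
        // longer exists in the environment where the instance is restored.
        open();
    }
}
```

A config file would need a different afterRestore(), one that re-reads the
content, which is why the decision is left to the code that owns the resource.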
Thanks,
Anton
On 7/20/21 6:31 PM, Michael Bien wrote:
> Hello,
>
> great to hear that there is research done in this area.
>
> I did some experimenting myself some time ago by binding to the CRIU C API via Panama [1][2]. It quickly became clear that, although it worked surprisingly well, a lower-level approach would probably be required to implement it properly. (I was mostly interested in CRIU's rootless mode [3] and restoring warmed-up JVMs, which came with its own issues and kernel bugs.)
>
> Checkpointing the JVM is probably much safer when all threads have stopped and more economical when the heap is compacted - the JVM itself is in a better position to do that than the java application.
>
> CRIU can't deal with files that changed between checkpoint and restore. Restoring a Java program that is logging to a file will only work once; a second attempt would fail because the first restore changed the file. An API might be able to mitigate a lot of this, e.g. a logger could rotate the log to an empty file, or close the file on checkpoint and reopen it on restore. JFR should do this out of the box. I was wondering if the IO stream implementation itself could help in some situations.
>
> Non-file related APIs might have to be made restore-aware too. For example SecureRandom might require re-seeding, keystores/SSL certs might need special attention etc.
>
>
> Although it worked surprisingly well (restoring was also quite fast), implementing it at the Java application level would be fairly limited. Looking forward to hearing/seeing more from CRaC!
>
> best regards,
> michael
>
> [1] https://github.com/mbien/JCRIU/
> [2] https://mbien.dev/blog/entry/java-and-rootless-criu-using
> [3] https://github.com/checkpoint-restore/criu/pull/1155
>
> On 18.07.21 16:48, Anton Kozlov wrote:
>> Hi,
>>
>> It's been a while since we presented Coordinated Restore at Checkpoint for the
>> first time [0]. We are still committed to the idea and researching this topic.
>>
>> Java applications can avoid the long start-up and warm-up by saving the state
>> of the Java runtime (snapshot, checkpoint). The saved state is then used to
>> start instances fast (restored). But after the state was saved, the execution
>> environment could change. Also, if multiple instances are started from the
>> saved state simultaneously, they should obtain some uniqueness, and their
>> executions should diverge at some point.
>>
>> We believe that the practical way to solve these problems is to make Java
>> applications aware of when the state is saved and restored. Then an
>> application will be able to handle environmental changes. The application will
>> also be able to obtain uniqueness from the environment.
>>
>> The CRaC project aims to research a Java API for coordination between the
>> application and the runtime to save and restore the state. The runtime should
>> support multiple ways to save the state: virtual machine snapshot, container
>> snapshot, the CRIU project on Linux, etc. We hope to come up with an API that
>> is general enough for any underlying mechanism. We also plan to explore
>> safety checks in the API and runtime that prevent saving the state if it
>> may not be restored or may not work correctly after the restore.
>>
>> I propose myself as the Project Lead of the CRaC Project. If you're interested
>> or want to be a committer, please drop me a message.
>>
>> A fork of the JDK [1] would be the starting point of this project.
>>
>> Thanks,
>> Anton
>>
>> [0] https://mail.openjdk.java.net/pipermail/discuss/2020-September/005594.html
>> [1] https://github.com/CRaC/jdk
>>
>