Call for Discussion: New Project: CRaC
Anton Kozlov
akozlov at azul.com
Wed Jul 21 17:09:21 UTC 2021
I think for CRIU such checks are required since it works with an unaware
process. Any change in the environment could be disastrous if the process has
made decisions in the past that are incompatible with the new file content.
File size and access modes are checked, and modification time too, IIRC. The
content of files could also be checked, but I suppose it is just too expensive.
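
To illustrate the kind of check described above, here is a minimal sketch in
Java. It is not CRIU's implementation: the FileFingerprint class and its
method names are made up for this example, and Files.isWritable is only a
crude stand-in for a real access-mode comparison. It records cheap metadata at
checkpoint time and compares it at restore time, without hashing the content.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

// Hypothetical helper (not part of CRIU or CRaC): remembers cheap file
// metadata at checkpoint time so it can be compared at restore time.
final class FileFingerprint {
    private final Path path;
    private final long size;
    private final FileTime modified;
    private final boolean writable; // crude proxy for "access mode"

    private FileFingerprint(Path path, long size, FileTime modified, boolean writable) {
        this.path = path;
        this.size = size;
        this.modified = modified;
        this.writable = writable;
    }

    // Take the fingerprint; call this at checkpoint time.
    static FileFingerprint capture(Path path) throws IOException {
        return new FileFingerprint(path, Files.size(path),
                Files.getLastModifiedTime(path), Files.isWritable(path));
    }

    // Compare against the current file; call this at restore time.
    boolean stillMatches() {
        try {
            return Files.size(path) == size
                    && Files.getLastModifiedTime(path).equals(modified)
                    && Files.isWritable(path) == writable;
        } catch (IOException e) {
            return false; // a missing or unreadable file counts as a mismatch
        }
    }
}

class FingerprintDemo {
    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("app", ".log");
        Files.writeString(log, "started\n");
        FileFingerprint atCheckpoint = FileFingerprint.capture(log);
        System.out.println("unchanged file matches: " + atCheckpoint.stillMatches());

        // Simulate a first restored instance appending to the log.
        Files.writeString(log, "started\nrestored once\n");
        System.out.println("appended file matches: " + atCheckpoint.stillMatches());
        Files.delete(log);
    }
}
```

A content hash could be added to the comparison, at the cost of reading every
file at both checkpoint and restore.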
For CRaC so far we require no open files at the checkpoint (so CRIU or any
other checkpoint mechanism is not bothered with them). On restore, you have to
reopen the necessary files, handling any problems along the way. But if you
have an open file, the checkpoint is aborted with an exception. You cannot have
an image that may or may not work, e.g. because some Java code is not ready to
handle a missing file on restore. So the checkpoint should succeed only if
your application state is consistent with any possible future execution
environment. That is, for the checkpoint to succeed, the application should not
assume anything about the environment at the time of the checkpoint.
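
A minimal sketch of that rule follows. It is not the CRaC API: CheckpointAware,
LogFile and Coordinator are hypothetical names, and holdsOpenFile() is a
stand-in for the runtime inspecting open file descriptors itself. A resource
closes its file before the checkpoint and reopens it after the restore, where
it is free to throw. The checkpoint is aborted if anything is still open.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback interface; the real API may look different.
interface CheckpointAware {
    void beforeCheckpoint() throws IOException;
    void afterRestore() throws IOException; // may fail, e.g. if the file is gone
    boolean holdsOpenFile();                // stand-in for a runtime-side fd check
}

// A log file that releases its descriptor around the checkpoint.
final class LogFile implements CheckpointAware {
    private final Path path;
    private BufferedWriter out;

    LogFile(Path path) throws IOException {
        this.path = path;
        open();
    }

    private void open() throws IOException {
        out = Files.newBufferedWriter(path,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    void log(String line) throws IOException {
        out.write(line);
        out.newLine();
        out.flush();
    }

    @Override public void beforeCheckpoint() throws IOException {
        out.close();
        out = null;
    }

    @Override public void afterRestore() throws IOException {
        open();
    }

    @Override public boolean holdsOpenFile() {
        return out != null;
    }
}

// Hypothetical coordinator: notifies resources, then refuses to checkpoint
// while any file is still open.
final class Coordinator {
    private final List<CheckpointAware> resources = new ArrayList<>();

    void register(CheckpointAware r) {
        resources.add(r);
    }

    void checkpoint() throws IOException {
        for (CheckpointAware r : resources) {
            r.beforeCheckpoint();
        }
        for (CheckpointAware r : resources) {
            if (r.holdsOpenFile()) {
                restore(); // let the still-running application continue
                throw new IllegalStateException("open file at checkpoint");
            }
        }
        // A real runtime would save the process image here.
    }

    void restore() throws IOException {
        for (CheckpointAware r : resources) {
            r.afterRestore();
        }
    }
}

class CoordinatorDemo {
    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("crac-demo", ".log");
        LogFile log = new LogFile(path);
        Coordinator coordinator = new Coordinator();
        coordinator.register(log);

        log.log("before checkpoint");
        coordinator.checkpoint(); // succeeds: the log closed its file
        coordinator.restore();    // the log reopens its file
        log.log("after restore");
        System.out.println(Files.readAllLines(path));
    }
}
```

The point of the sketch is that error handling for a missing or changed file
lives in afterRestore(), in application code, instead of being guessed by the
checkpoint mechanism.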
Thanks,
Anton
On 7/21/21 3:08 PM, Michael Bien wrote:
> I remember from discussions on #criu that CRIU, interestingly, doesn't actually care that much whether a file changed between checkpoint and restore. It has a rudimentary check comparing file size (!), mostly only for the scenario where a loaded lib changed, which causes seg faults on restore in the best case.
>
> Maybe CRIU could skip changed files instead of failing and notify the JVM. The JVM would apply black magic and let IO streams throw an IOException on the next read or write, giving the application a chance to recover even without knowing about a CRaC API. There might be situations where throwing an IOException isn't even needed.
>
> Just a thought; I don't know if any of this is doable.
>
> best regards,
> michael
>
> On 21.07.21 12:36, Anton Kozlov wrote:
>> Hi Michael,
>>
>> Interesting links!
>>
>> The CRIU project did a terrific job of checkpointing and restoring an
>> arbitrary process.
>>
>> But if we think about how to continue the execution of the saved Java runtime
>> instance, multiple times simultaneously, the examples show what we should do
>> better. The internal state of the runtime, standard library, or application
>> (like a crypto random seed) needs fixing after the restore. External resources
>> cannot always be captured. These are files for a process-based checkpoint or
>> network connections for VM-based snapshotting.
>>
>> JFR is a good point and more such changes will likely appear over time. CRaC
>> handles the perfdata temp file, which is used to implement jcmd and jps
>> functionality. Without special care on the JVM side, the missing perfdata will
>> likely prevent the second restore with CRIU (the first restored instance
>> deletes the file as not needed) or the restore after reboot.
>>
>> The Logger example is invaluable to demonstrate why coordination is needed.
>> Without knowledge of the semantics, it's impossible to distinguish between e.g.
>> a log file, the previous content of which is not important, and a config file,
>> which should be re-read after restore. So automatic handling of files does not
>> seem possible in general. Some convenience can be implemented (like automatic
>> log rotation), but this needs to be done with awareness of the semantics and
>> should allow error handling on the Java application side. We require resources
>> to be re-acquired at restore and allow such code to throw exceptions.
>>
>> Thanks,
>> Anton
>>
>> On 7/20/21 6:31 PM, Michael Bien wrote:
>>> Hello,
>>>
>>> great to hear that there is research done in this area.
>>>
>>> I did some experimenting myself by just binding to the CRIU C API via Panama some time ago [1][2]. It quickly became clear that, although it worked surprisingly well, it probably required a lower-level approach to implement it properly. (I was mostly interested in CRIU's rootless mode [3] and restoring warmed-up JVMs, which came with its own issues and kernel bugs.)
>>>
>>> Checkpointing the JVM is probably much safer when all threads have stopped and more economical when the heap is compacted - the JVM itself is in a better position to do that than the java application.
>>>
>>> CRIU can't deal with situations where files changed between checkpoint and restore. Restoring a Java program which is logging to a file will only work once; a second attempt would fail since the file changed due to the first restore. An API might be able to mitigate a lot of this, e.g. a logger could rotate the log to an empty file, or close the file on checkpoint and reopen it on restore. JFR should do this out of the box. I was wondering if the IO stream impl itself could help in some situations.
>>>
>>> Non-file related APIs might have to be made restore-aware too. For example SecureRandom might require re-seeding, keystores/SSL certs might need special attention etc.
>>>
>>>
>>> Although it worked surprisingly well (restoring was also quite fast), implementing it at the Java application level would be fairly limited. Looking forward to hearing/seeing more from CRaC!
>>>
>>> best regards,
>>> michael
>>>
>>> [1] https://github.com/mbien/JCRIU/
>>> [2] https://mbien.dev/blog/entry/java-and-rootless-criu-using
>>> [3] https://github.com/checkpoint-restore/criu/pull/1155
>>>
>>> On 18.07.21 16:48, Anton Kozlov wrote:
>>>> Hi,
>>>>
>>>> It's been a while since we presented Coordinated Restore at Checkpoint for the
>>>> first time [0]. We are still committed to the idea and researching this topic.
>>>>
>>>> Java applications can avoid the long start-up and warm-up by saving the state
>>>> of the Java runtime (snapshot, checkpoint). The saved state is then used to
>>>> start instances fast (restored). But after the state was saved, the execution
>>>> environment could change. Also, if multiple instances are started from the
>>>> saved state simultaneously, they should obtain some uniqueness, and their
>>>> executions should diverge at some point.
>>>>
>>>> We believe that the practical way to solve these problems is to make Java
>>>> applications aware of when the state is saved and restored. Then an
>>>> application will be able to handle environmental changes. The application will
>>>> also be able to obtain uniqueness from the environment.
>>>>
>>>> The CRaC project aims to research a Java API for coordination between the
>>>> application and the runtime to save and restore the state. The runtime should
>>>> support multiple ways to save the state: virtual machine snapshot, container
>>>> snapshot, the CRIU project on Linux, etc. We hope to come up with an API that
>>>> is general enough for any underlying mechanism. We also plan to explore safety
>>>> checks in the API and runtime, which prevent saving the state if it may not be
>>>> restored or work correctly after the restore.
>>>>
>>>> I propose myself as the Project Lead of the CRaC Project. If you're interested
>>>> or want to be a committer, please drop me a message.
>>>>
>>>> A fork of the JDK [1] would be a starting point for this project.
>>>>
>>>> Thanks,
>>>> Anton
>>>>
>>>> [0] https://mail.openjdk.java.net/pipermail/discuss/2020-September/005594.html
>>>> [1] https://github.com/CRaC/jdk
>>>>
>>>
>