CFV: New Project: CRaC

Thu Aug 5 19:31:55 UTC 2021

On 8/5/21 2:56 PM, Alan Bateman wrote:
> On 05/08/2021 10:27, Andrew Dinn wrote:
>>
>> Specifically, I believe the summary below fails to highlight the critical need for a significant amount of code in the JDK runtime and JVM to be involved in receiving and handling checkpoint and restore events and, in direct consequence, being party to the process that saves or constructs a restartable runtime state (obvious areas include network and file system i/o management, memory management, security management, providers of randomness -- i.e. the same areas where GraalVM has found it needs to enforce runtime initialization or runtime repair of build-time inited state). My concern is that without a close involvement of many existing JDK and JVM engineers in the project and a strong commitment from them to supporting it the project is very likely to fail. I don't believe we have met either of those two requirements yet.
> 
> I didn't get time to reply to the initial discussion but I share the concern that the changes are potentially invasive and will require auditing and work in many areas. There was discussion about this at FOSDEM and at least one OCW where concerns about security and other areas came up. For example, I remember at FOSDEM (I think after Christine Flood presented on CRIU) there was discussion about session keys and needing to have those invalidated in the checkpoint on disk and completely re-initialized at restore. There was also discussion about the implications of adjusting the clock and the impact of re-connecting or invalidating file descriptors. I don't doubt that all challenges are solvable with effort but it does require a lot of components and areas to cooperate. So I think your comment on trying to get a wider set of contributors (those working on core and security libraries for example) is important.

The current proof-of-concept implementation has a rather small footprint, and
for now it requires little from the rest of OpenJDK.  Although the PoC may be
rewritten completely, it can serve to estimate the amount of code needed.  Some
parts of the state (networking and files) are intentionally not saved by the
implementation, and for now it will stay this way: this is external state that
may change, so applications are required to handle it themselves.  For a Random
instance, handling it correctly requires knowing its semantics, so we cannot
handle such instances completely automatically either.
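
To make "applications handle it themselves" concrete, here is a minimal sketch.
The CheckpointAware interface and both classes are hypothetical names invented
for this illustration; they are not the Project's API.  The sketch assumes only
that the application is told before a checkpoint and after a restore.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.security.SecureRandom;

// Hypothetical callback interface (an assumption for this sketch only).
interface CheckpointAware {
    void beforeCheckpoint() throws Exception;
    void afterRestore() throws Exception;
}

// External state: an open file is closed before the checkpoint and
// reopened after restore, because the file may have changed meanwhile.
final class LogFile implements CheckpointAware {
    private final Path path;
    private RandomAccessFile file;

    LogFile(Path path) throws IOException {
        this.path = path;
        open();
    }

    private void open() throws IOException {
        file = new RandomAccessFile(path.toFile(), "rw");
        file.seek(file.length()); // continue appending at the current end
    }

    void append(String line) throws IOException {
        file.write((line + "\n").getBytes(StandardCharsets.UTF_8));
    }

    boolean isOpen() {
        return file != null;
    }

    @Override
    public void beforeCheckpoint() throws IOException {
        if (file != null) {
            file.close();
            file = null;
        }
    }

    @Override
    public void afterRestore() throws IOException {
        open();
    }
}

// Randomness: only the application knows the intended semantics. This
// generator must be unpredictable, so it is replaced after restore. A
// seeded Random driving a reproducible simulation would instead be kept.
final class TokenSource implements CheckpointAware {
    private SecureRandom random = new SecureRandom();

    String nextToken() {
        return Long.toHexString(random.nextLong());
    }

    @Override
    public void beforeCheckpoint() {
        // nothing to release
    }

    @Override
    public void afterRestore() {
        random = new SecureRandom(); // fresh state for every restored process
    }
}
```

The two classes react differently to the same events, which is the reason a
fully automatic treatment is hard: the right action depends on what the state
means to the application.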

For the security keys use case, it is important to understand who "owns" the
keys.  If it is the JDK standard library, then proper handling should be
implemented there.
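
As an illustration of handling by the owner, the sketch below shows a key
holder that wipes its key before a checkpoint and generates a new one after
restore.  The class and method names are hypothetical, and how the two methods
get invoked is deliberately left open.

```java
import java.security.SecureRandom;
import java.util.Arrays;

// Hypothetical owner of a session key (names are assumptions for this sketch).
final class SessionKeyOwner {
    private static final int KEY_BYTES = 32;
    private byte[] key;

    SessionKeyOwner() {
        key = freshKey();
    }

    private static byte[] freshKey() {
        byte[] k = new byte[KEY_BYTES];
        // A new generator per key, so no generator state is carried over.
        new SecureRandom().nextBytes(k);
        return k;
    }

    // Returns a copy so callers cannot keep a reference to the live key.
    byte[] currentKey() {
        if (key == null) {
            throw new IllegalStateException("key was wiped for checkpoint");
        }
        return key.clone();
    }

    // Overwrite the key material so it does not end up in the saved image.
    void wipeBeforeCheckpoint() {
        if (key != null) {
            Arrays.fill(key, (byte) 0);
            key = null;
        }
    }

    // Every restored process gets its own key.
    void regenerateAfterRestore() {
        key = freshKey();
    }
}
```

Only the owner can do this reliably, since only the owner knows where the key
material lives.  Copies handed out through currentKey() are outside its
control, which is one reason ownership matters.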

In my experience, fixing stale application state is rather easy and needs only
a small refactoring (a reinitializing function derived from the existing
initializing function).  So, as I understand it, the main concern about
auditing OpenJDK code is that some parts could simply be overlooked.  I think
proper testing should help here (while the code is still in the Project).
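
The refactoring meant here can be sketched as follows.  InstanceIdentity is a
hypothetical class made up for this example: the body of the constructor is
moved into initialize(), and reinitialize() reuses it.

```java
import java.time.Instant;
import java.util.UUID;

// Hypothetical class holding state that goes stale across a restore.
final class InstanceIdentity {
    private UUID instanceId;
    private Instant startedAt;

    InstanceIdentity() {
        initialize();
    }

    // The existing initialization logic, extracted from the constructor.
    private void initialize() {
        instanceId = UUID.randomUUID();
        startedAt = Instant.now();
    }

    // Reinitialization is the same code, called again after restore, so
    // processes restored from one image do not share an identity.
    void reinitialize() {
        initialize();
    }

    UUID instanceId() {
        return instanceId;
    }

    Instant startedAt() {
        return startedAt;
    }
}
```

A test can simulate a restore by calling reinitialize() and checking that
instanceId() has changed.  Such a test verifies a fix, but it cannot by itself
find state that nobody thought to reinitialize.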

The project is unlikely to be integrated all at once and in full.  So the plan
is to provide the best possible implementation for various areas of OpenJDK
through the efforts of people enthusiastic about the Project.  At important
milestones, the project will seek review of the corresponding parts of the
implementation from experts.  Without a base implementation, it will be hard
to discuss the details.

Thanks,
Anton