Workshop topic: Java on CRaC: coordinated instant start

Sat Jun 15 07:32:42 UTC 2019

We are likely talking about very different things here. I spoke with Christine at length at jFokus about her work with checkpointing, and shared some details of our work as well. I don’t believe this overlaps much with what she has been presenting on, and the field is pretty wide, with room for lots of work and ideas from lots of people. This is one of them.

I plan to talk about a new API and behavior in Java (e.g. to be added to a future OpenJDK version via a JEP and to be included in a future Java spec via a JSR) that would allow applications and application frameworks to coordinate with a checkpointing mechanism in the underlying platform. The coordination would focus on things like getting rid of “problematic” external state going into a checkpoint and recreating un-captured state coming out of a checkpoint. The way the checkpoint state itself is captured and managed is orthogonal to the subject at hand: it could be captured by the runtime itself, by CRIU or an equivalent on e.g. Windows, by a container system like Docker or k8s performing container checkpoints, etc. It is the application semantics and the APIs needed for coordination that this talk will focus on, as well as on what successful use of such coordination APIs in e.g. Tomcat can achieve.
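
As a rough illustration of the kind of coordination described above, here is a minimal sketch in plain Java. It is not the proposed API: the names CheckpointParticipant, Coordinator, FakeConnectionPool, beforeCheckpoint and afterRestore are all hypothetical, and the actual capture of state (by the runtime, CRIU, a container system, etc.) is stood in for by a Runnable. It only shows the shape of the idea: drop external state going into a checkpoint, recreate it coming out.

```java
import java.util.ArrayList;
import java.util.List;

public class CoordinationSketch {

    /** Hypothetical callback interface; an assumption, not an existing JDK API. */
    interface CheckpointParticipant {
        void beforeCheckpoint(); // release "problematic" external state
        void afterRestore();     // recreate state the checkpoint did not capture
    }

    /** Hypothetical registry that a checkpoint-aware runtime might drive. */
    static final class Coordinator {
        private final List<CheckpointParticipant> participants = new ArrayList<>();

        void register(CheckpointParticipant p) {
            participants.add(p);
        }

        /** captureState stands in for whatever mechanism takes the checkpoint. */
        void checkpointAndRestore(Runnable captureState) {
            // Notify in reverse registration order going in...
            for (int i = participants.size() - 1; i >= 0; i--) {
                participants.get(i).beforeCheckpoint();
            }
            captureState.run();
            // ...and in registration order coming back out.
            for (CheckpointParticipant p : participants) {
                p.afterRestore();
            }
        }
    }

    /** Stand-in for a connection pool; strings play the role of sockets. */
    static final class FakeConnectionPool implements CheckpointParticipant {
        private final List<String> connections = new ArrayList<>();
        private final int size;

        FakeConnectionPool(int size) {
            this.size = size;
            open();
        }

        private void open() {
            for (int i = 0; i < size; i++) {
                connections.add("conn-" + i);
            }
        }

        int openConnections() {
            return connections.size();
        }

        @Override
        public void beforeCheckpoint() {
            connections.clear(); // no live connections may enter the checkpoint
        }

        @Override
        public void afterRestore() {
            open(); // re-establish connections in the restored instance
        }
    }

    public static void main(String[] args) {
        Coordinator coordinator = new Coordinator();
        FakeConnectionPool pool = new FakeConnectionPool(3);
        coordinator.register(pool);

        coordinator.checkpointAndRestore(() ->
            System.out.println("open connections at checkpoint: " + pool.openConnections()));
        System.out.println("open connections after restore: " + pool.openConnections());
    }
}
```

In this sketch a framework such as a connection pool registers once and is then told when to let go of its connections and when to reopen them; the application code using the pool would not need to know a checkpoint happened. The notification ordering shown is one plausible choice, not something stated in this thread.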

We’ve been working on variant forms of partial and complete checkpointing for years now, including multiple different use modes, and have built up a taxonomy for some of them. Some modes are transparent (e.g. CRaM for Checkpoint / Resume at Main) while others are not. Some deal with checkpointing specific state (e.g. profiles, class data and metadata, compiled code and code cache) while others deal with wider state (e.g. all or nearly all process memory contents, and some may even include runtime-external state like file handles to files that reside within an immutable image). The specific CRaC use mode is an intentionally non-transparent (but rather coordinated) mode aimed at addressing a specific (and we think very common) use case of rolling out new code in e.g. DevOps workflows.

Sent from Gil's iPhone

> On Jun 14, 2019, at 1:27 AM, Andrew Haley <aph at redhat.com> wrote:
> 
>> On 6/14/19 4:13 AM, Gil Tene wrote:
>> Abstract:
>> We propose adding a new Checkpoint / Restore-at-Checkpoint (CRaC)
>> capability to OpenJDK, supported by a simple and robust API that
>> would ensure applications and (most importantly) application
>> frameworks are able to safely coordinate the checkpointing process
>> and the state restoration activities needed to achieve near-instant-start
>> of fully warmed application instances. Real code, examples, and
>> actual numbers will be discussed.
>> 
>> Possible questions for session participants to address:
>> 
>> Q: What resources need coordination via this API?
>> 
>> Q: What frameworks constitute a good critical set for
>>     common use cases (e.g. tomcat + JDBC connection pool, and…)
>> 
>> Q: What would it take for frameworks to start writing
>>     to such an API?
> 
> Christine Flood has been working on this for some time, and will be
> presenting at conferences. I think you should co-ordinate with her.
> 
> -- 
> Andrew Haley
> Java Platform Lead Engineer
> Red Hat UK Ltd. <https://www.redhat.com>
> https://keybase.io/andrewhaley
> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671