Call for Discussion: New Project: CRaC

Wed Jul 28 19:46:30 UTC 2021

On Wed, 28 Jul 2021 at 21:46, Anton Kozlov <akozlov at azul.com> wrote:

> Hi Ruslan,
>
> On 7/28/21 11:16 AM, Ruslan Synytsky wrote:
> > Tech question: what do you think about the need to adjust the heap size
> > after restoration from a checkpointed runtime? As I understand, in some
> > cases, the restored runtimes may need different heap size compared to the
> > initial runtime from which the state was saved. There is a JEP
> > https://openjdk.java.net/jeps/8204088 that might be relevant to this
> > discussion.
>
>
> Before saving the state, in GCs where it was easy to implement, we uncommit
> unused parts of the heap. In other cases (except ZGC), we re-commit parts of
> the heap, so the RSS is still low. The driver was to avoid saving garbage, but
> this also makes RSS size equal to the size of the live set of the heap.
> Coordination with checkpoint/restore mechanism includes JVM, so there is a
> trigger to give up resources that may be unneeded after restoring the state.
>
Hi Anton, very good, it sounds like a major use case for the memory
uncommit improvements that were introduced in different GCs.

>
> Resizing the heap, in general, seems to be a lot of effort. However, if the
> implementation for the enhancement would exist, it likely could be reused for
> what we need and do.
>
Thank you for the confirmation. While it's not a blocker, in my opinion the
ability to adjust Xmx after restoring the state, to match the requirements
of the new environment, will unlock even more value from CRaC and reduce
overhead at the orchestration level. It's much more flexible and cost
efficient to have one state template that can be restored with different
memory limits than to store multiple otherwise identical templates, one for
each possible heap size. Indeed, it's a complex topic, but the good news is
that Rodrigo Bruno (cc'd) has implemented a working prototype for changing
Xmx on the fly. It has not required too much effort, at least so far.
So, we have made some progress on this as well.
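
To make the distinction concrete, here is a minimal sketch (not part of CRaC
or of the prototype mentioned above) that prints the used, committed, and
maximum heap sizes of a running JVM through the standard
java.lang.management API. The class name HeapReport and its helper methods
are hypothetical, chosen only for illustration. Committed memory is what
uncommit reduces before a checkpoint; the maximum normally corresponds to
Xmx, which is the value we would like to change after restore.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper for illustration; only the java.lang.management calls are real API.
public class HeapReport {

    /** Takes one snapshot of heap usage: {used, committed, max} in bytes; max is -1 if undefined. */
    static long[] snapshot() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return new long[] { heap.getUsed(), heap.getCommitted(), heap.getMax() };
    }

    /** Formats a snapshot in MiB for logging. */
    static String describe(long[] s) {
        String max = s[2] < 0 ? "undefined" : (s[2] >> 20) + " MiB";
        return String.format("used=%d MiB, committed=%d MiB, max=%s",
                s[0] >> 20, s[1] >> 20, max);
    }

    public static void main(String[] args) {
        System.out.println(describe(snapshot()));
    }
}
```

Running this with different -Xmx values shows the maximum changing while
the committed size follows what the GC has actually reserved.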

Also, it will be useful to get a better understanding of what can be
improved outside of the JVM, at the container level. It may help us to
avoid fixing issues that should be resolved more generally or that can be
resolved easily at the underlying level. Any thoughts and ideas on this are
welcome as well.

Regards

> Thanks,
> Anton
>

-- 
Ruslan Synytsky
CEO @ Jelastic Multi-Cloud PaaS