Question regarding the design rationale for handling file descriptors/network connections in CRaC
ma zhen
mz1999 at gmail.com
Fri Apr 11 08:57:35 UTC 2025
Hi Radim,
Thanks a lot for the detailed explanation! That completely cleared up my
understanding of the design philosophy behind CRaC.
It makes perfect sense now that the goal isn't purely transparent
restoration, but rather preserving the valuable internal JVM/application
state while enabling robust adaptation to the new environment after restore
– sacrificing some transparency for resilience by consciously managing
external resources.
Great project, and I appreciate the insight. Hope to be able to contribute
down the line!
Cheers,
Ma Zhen
On Fri, Apr 11, 2025 at 15:17, Radim Vansa <rvansa at azul.com> wrote:
> Hi Ma Zhen,
>
> you have correctly observed that closing file descriptors is an
> architectural choice rather than a purely technical need. CRIU is really capable
> of restoring the process as-is, as its main motivation is migration of
> running containers. Containers already define the filesystem, and the
> runtime is in control of external connections - e.g. CRIU can checkpoint
> and later restore an open socket connection, and the container runtime
> restores the 'second half' of the socket so that the pause is transparent
> to the running process.
>
> If this is what you want, there's nothing preventing you from using CRIU
> on a Java process manually - at the risk of breaking the internal logic of
> the application. However, the point of CRaC is not such a transparent
> restore: we want to preserve the valuable state of the JVM and application but
> adapt it to the new environment. We want to make a conscious decision about
> any resource external to the process. Being forced to gracefully adapt to
> the restore is a feature.
>
> Yes, we have File Descriptor policies, but they are not a solution: they
> provide a workaround for proofs of concept, until some code that you can't
> easily fix gets updated to support CRaC properly. Ideas meet practicality,
> and you are responsible for realizing what should be done with each
> particular external resource.
>
> You're right that at the moment we don't handle JDK Platform Logging (or
> JUL) configured to write to a file, and since that is JDK code outside the
> user's control, it is a bug. We attempt to fix those one by one (PRs are
> welcome!).
>
> I hope I have provided some insight into these choices - and yes, I
> understand the pain as we still have many places to fix.
>
> Cheers,
>
> Radim
> On 10. 04. 25 11:30, ma zhen wrote:
>
>
> Hi CRaC developers,
>
> I'm currently exploring the integration of CRaC support into our company's
> middleware products. I'm also very interested in the underlying
> implementation details of CRaC and have been doing some research into its
> mechanics.
>
> As I understand it, CRaC leverages CRIU under the hood for checkpointing
> and restoring running processes. My research indicates that CRIU itself is
> capable of handling open file descriptors and established network
> connections during the checkpoint/restore cycle.
>
> However, the CRaC API requires developers to explicitly manage these
> resources, typically by closing them in the beforeCheckpoint() and
> re-establishing them in the afterRestore().
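
A minimal sketch of the close-and-reopen pattern described above, for
illustration only. To keep it compilable with a plain JDK it uses a
hypothetical stand-in interface, CheckpointAware, instead of the real
org.crac.Resource (whose callbacks additionally take a Context argument and
which is registered via Core.getGlobalContext().register(...)). The class
name ReopeningFileWriter and the temp-file path are assumptions as well.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical stand-in for org.crac.Resource so the sketch needs no CRaC
// dependency. The real callbacks also receive a Context parameter.
interface CheckpointAware {
    void beforeCheckpoint() throws Exception;
    void afterRestore() throws Exception;
}

// Hypothetical component that owns an open file descriptor.
final class ReopeningFileWriter implements CheckpointAware {
    private final Path path;
    private BufferedWriter writer;

    ReopeningFileWriter(Path path) throws IOException {
        this.path = path;
        this.writer = open();
    }

    private BufferedWriter open() throws IOException {
        // Append so that content written before the checkpoint is preserved.
        return Files.newBufferedWriter(path, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    synchronized void writeLine(String line) throws IOException {
        if (writer == null) {
            throw new IllegalStateException("writer is closed for checkpoint");
        }
        writer.write(line);
        writer.newLine();
        writer.flush();
    }

    synchronized boolean isOpen() {
        return writer != null;
    }

    @Override
    public synchronized void beforeCheckpoint() throws IOException {
        // Release the descriptor so no application-owned file stays open.
        if (writer != null) {
            writer.close();
            writer = null;
        }
    }

    @Override
    public synchronized void afterRestore() throws IOException {
        // Re-acquire the resource in the (possibly different) environment.
        writer = open();
    }
}

final class ReopeningFileWriterDemo {
    public static void main(String[] args) throws Exception {
        Path log = Files.createTempFile("crac-sketch", ".log");
        ReopeningFileWriter w = new ReopeningFileWriter(log);
        w.writeLine("before checkpoint");
        w.beforeCheckpoint(); // with CRaC, the runtime would invoke this
        w.afterRestore();     // and this after the restore
        w.writeLine("after restore");
        System.out.println(Files.readAllLines(log));
    }
}
```

In a real application the two callbacks would be invoked by the CRaC
runtime, not by hand as in the demo, and afterRestore() is the place where
a path or endpoint could be re-resolved for the new environment.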
>
> To understand the rationale behind this design choice, I looked into the
> initial CRaC prototype, specifically the first PR (
> https://github.com/openjdk/crac/pull/1). It appears that even in this
> early version, the implementation iterated through all process file
> descriptors during checkpoint. It ignored certain FDs (like those related
> to classpath files, /dev/random, /dev/urandom, and files marked
> M_PERSISTENT - though I'm unclear on the exact meaning of M_PERSISTENT in
> this context). If any other application-opened files remained, the
> checkpoint process would fail. This suggests the requirement for manual
> resource management was present from the outset.
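
The check described above can be approximated as follows. This is only a
sketch of the rule as summarized in this message, not the JVM's actual
code: the class FdCheckSketch and its methods are hypothetical, the open
paths are passed in as plain strings rather than read from the process,
and M_PERSISTENT is left out because its meaning is not established here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical model of "ignore known descriptors, fail on anything else".
final class FdCheckSketch {
    // Device files that the message says the prototype ignored.
    private static final Set<String> IGNORED_DEVICES =
            Set.of("/dev/random", "/dev/urandom");

    // Returns the open paths that would block a checkpoint under this rule.
    static List<String> blockingPaths(List<String> openPaths,
                                      List<String> classpathEntries) {
        List<String> blocking = new ArrayList<>();
        for (String path : openPaths) {
            boolean ignored = IGNORED_DEVICES.contains(path)
                    || classpathEntries.contains(path);
            if (!ignored) {
                blocking.add(path);
            }
        }
        return blocking;
    }

    // Fails if any application-opened file remains.
    static void checkBeforeCheckpoint(List<String> openPaths,
                                      List<String> classpathEntries) {
        List<String> blocking = blockingPaths(openPaths, classpathEntries);
        if (!blocking.isEmpty()) {
            throw new IllegalStateException(
                    "checkpoint refused, still open: " + blocking);
        }
    }
}
```

Under this rule a leftover log file such as a hypothetical /var/log/app.log
would make the checkpoint fail, while the classpath JAR and /dev/urandom
would pass.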
>
> As I'm not deeply familiar with JVM internals, I'm struggling to fully
> grasp the reasoning. Was this restriction primarily introduced to simplify
> the initial design and implementation of CRaC within the JVM?
>
> I also noticed that current versions of CRaC include File Descriptor
> Policies. These allow configuring an action: ignore for specific file
> descriptors, effectively delegating their handling to CRIU. This seems to
> demonstrate that letting CRIU manage certain open files is feasible
> within the CRaC framework.
>
> This leads me to wonder: if delegation to CRIU is possible and works (at
> least for some cases via policies), why isn't relying on CRIU for resource
> handling the default or more broadly encouraged approach? Why the strict
> requirement for manual closure and reopening in the general case?
>
> For instance, consider using System.getLogger() from the JDK Platform
> Logging API. As application developers, we don't typically manage the
> underlying file descriptor for the log file directly. To make this work
> with CRaC, we currently need to identify and configure a File Descriptor
> Policy for it, which can feel somewhat cumbersome. Wouldn't a smoother
> experience involve CRaC (perhaps optionally) defaulting to letting CRIU
> handle such internally managed resources, like those opened by standard JDK
> libraries?
>
> I would appreciate any insights or clarification you could offer on the
> design philosophy behind CRaC's approach to managing external resources
> like files and sockets, especially in contrast to CRIU's capabilities.
>
> Thanks for your time and any insights you can share.
>
> Best regards,
>
> mazhen
>
>