Question regarding the design rationale for handling file descriptors/network connections in CRaC
ma zhen
mz1999 at gmail.com
Thu Apr 10 09:30:04 UTC 2025
Hi CRaC developers,
I'm currently exploring the integration of CRaC support into our company's
middleware products. I'm also very interested in the underlying
implementation details of CRaC and have been doing some research into its
mechanics.
As I understand it, CRaC leverages CRIU under the hood for checkpointing
and restoring running processes. My research indicates that CRIU itself is
capable of handling open file descriptors and established network
connections during the checkpoint/restore cycle.
However, the CRaC API requires developers to manage these resources
explicitly, typically by closing them in beforeCheckpoint() and
re-establishing them in afterRestore().
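To make the pattern concrete, here is a minimal sketch of what I understand the close-and-reopen approach to look like for a log file. CheckpointAware and LogFileResource are hypothetical names of my own; CheckpointAware is only a stand-in so the sketch compiles without the CRaC library. In real code the class would implement org.crac.Resource (whose callbacks take a Context argument) and be registered with Core.getGlobalContext().register(...).

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical stand-in for org.crac.Resource, so the sketch has no external dependency.
interface CheckpointAware {
    void beforeCheckpoint() throws Exception;
    void afterRestore() throws Exception;
}

// Hypothetical resource that owns a single log file descriptor.
class LogFileResource implements CheckpointAware {
    private final Path path;
    private Writer writer;

    LogFileResource(Path path) throws IOException {
        this.path = path;
        open();
    }

    // Opens the file in append mode so content written before the checkpoint is kept.
    private void open() throws IOException {
        writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    boolean isOpen() {
        return writer != null;
    }

    void log(String line) throws IOException {
        if (writer == null) {
            throw new IllegalStateException("log file is closed");
        }
        writer.write(line);
        writer.write(System.lineSeparator());
        writer.flush();
    }

    // Release the descriptor so no application-opened file remains at checkpoint time.
    @Override
    public void beforeCheckpoint() throws IOException {
        writer.close();
        writer = null;
    }

    // Re-establish the descriptor once the process has been restored.
    @Override
    public void afterRestore() throws IOException {
        open();
    }
}
```

The point of the sketch is only that the application, not CRIU, ends up owning the close and reopen steps.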
To understand the rationale behind this design choice, I looked into the
initial CRaC prototype, specifically the first PR (
https://github.com/openjdk/crac/pull/1). It appears that even in this early
version, the implementation iterated through all process file descriptors
during checkpoint. It ignored certain FDs (like those related to classpath
files, /dev/random, /dev/urandom, and files marked M_PERSISTENT - though
I'm unclear on the exact meaning of M_PERSISTENT in this context). If any
other application-opened files remained, the checkpoint process would fail.
This suggests the requirement for manual resource management was present
from the outset.
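In case my reading of that check is off, here is a rough sketch of the behavior as I understand it, modeled in plain Java rather than taken from the JVM source. FdScanSketch and findBlockingFds are hypothetical names, the descriptor table is passed in as a map instead of being read from the process, and the M_PERSISTENT case is left out because I do not know what it covers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical model of the checkpoint-time descriptor scan; not the actual JVM code.
class FdScanSketch {
    // Device files that the scan tolerates, per my reading of the prototype.
    static final Set<String> IGNORED_DEVICES = Set.of("/dev/random", "/dev/urandom");

    // Returns the descriptors that would make the checkpoint fail.
    // An empty result means the checkpoint could proceed.
    static List<Integer> findBlockingFds(Map<Integer, String> openFds,
                                         Set<String> classpathEntries) {
        List<Integer> blocking = new ArrayList<>();
        // TreeMap gives a stable, ascending order of descriptor numbers.
        for (Map.Entry<Integer, String> e : new TreeMap<>(openFds).entrySet()) {
            String path = e.getValue();
            if (IGNORED_DEVICES.contains(path) || classpathEntries.contains(path)) {
                continue; // tolerated descriptor
            }
            blocking.add(e.getKey()); // any other application-opened file blocks the checkpoint
        }
        return blocking;
    }
}
```

With descriptors for /dev/urandom, a classpath jar, and an application log file, only the log file's descriptor would be reported as blocking.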
As I'm not deeply familiar with JVM internals, I'm struggling to fully
grasp the reasoning. Was this restriction primarily introduced to simplify
the initial design and implementation of CRaC within the JVM?
I also noticed that current versions of CRaC include File Descriptor
Policies. These allow "action: ignore" to be configured for specific file
descriptors, effectively delegating their handling to CRIU. This seems to
demonstrate that letting CRIU manage certain open files is feasible within
the CRaC framework.
This leads me to wonder: if delegation to CRIU is possible and works (at
least for some cases via policies), why isn't relying on CRIU for resource
handling the default or more broadly encouraged approach? Why the strict
requirement for manual closure and reopening in the general case?
For instance, consider using System.getLogger() from the JDK Platform
Logging API. As application developers, we don't typically manage the
underlying file descriptor for the log file directly. To make this work
with CRaC, we currently need to identify and configure a File Descriptor
Policy for it, which can feel somewhat cumbersome. Wouldn't a smoother
experience involve CRaC (perhaps optionally) defaulting to letting CRIU
handle such internally managed resources, like those opened by standard JDK
libraries?
I would appreciate any insights or clarification you could offer on the
design philosophy behind CRaC's approach to managing external resources
like files and sockets, especially in contrast to CRIU's capabilities.
Thanks for your time.
Best regards,
mazhen