Question regarding the design rationale for handling file descriptors/network connections in CRaC

Fri Apr 11 07:17:16 UTC 2025

Hi Ma Zhen,

you have correctly observed that closing file descriptors is an 
architectural choice rather than a purely technical need. CRIU really is 
capable of restoring the process as-is, since its main motivation is 
migration of running containers. Containers already define the 
filesystem, and the runtime is in control of external connections - e.g. 
CRIU can checkpoint and later restore an open socket connection, and the 
container runtime restores the 'second half' of the socket so that the 
pause is transparent to the running process.

If this is what you want, there's nothing preventing you from using CRIU 
on a Java process manually - at the risk of breaking the internal logic 
of the application. However, the point of CRaC is not such a transparent 
restore: we want to preserve the valuable state of the JVM and the 
application but adapt it to the new environment. We want to make a 
conscious decision about any resource external to the process. Being 
forced to gracefully adapt to the restore is a feature.
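
To illustrate what such a conscious decision can look like, here is a 
minimal sketch of a component that closes its log file before the 
checkpoint and reopens it after the restore. The names CheckpointAware 
and ReopenableLog are hypothetical and exist only for this example. 
CheckpointAware is a stand-in so that the snippet compiles without the 
CRaC library; in a real application the class would implement 
org.crac.Resource (whose callbacks also take a Context argument) and be 
registered with Core.getGlobalContext().register(...).

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical stand-in for org.crac.Resource, used only so that this
// sketch compiles without the CRaC library on the classpath.
interface CheckpointAware {
    void beforeCheckpoint() throws Exception;
    void afterRestore() throws Exception;
}

// Hypothetical log sink that owns a file descriptor.
class ReopenableLog implements CheckpointAware, AutoCloseable {
    private final Path path;
    private BufferedWriter writer; // null while the file is closed

    ReopenableLog(Path path) throws IOException {
        this.path = path;
        open();
    }

    // Opens the file in append mode so earlier output is preserved.
    private void open() throws IOException {
        writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    synchronized void write(String line) throws IOException {
        if (writer == null) {
            throw new IllegalStateException("log is closed");
        }
        writer.write(line);
        writer.newLine();
        writer.flush();
    }

    synchronized boolean isOpen() {
        return writer != null;
    }

    // Release the file descriptor so that none is open at checkpoint.
    @Override
    public synchronized void beforeCheckpoint() throws IOException {
        close();
    }

    // Reopen the file; a real application could pick a new path here
    // to adapt to the environment it was restored into.
    @Override
    public synchronized void afterRestore() throws IOException {
        open();
    }

    @Override
    public synchronized void close() throws IOException {
        if (writer != null) {
            writer.close();
            writer = null;
        }
    }
}
```

The interesting part is afterRestore(): this is where the application 
can decide whether the old path still makes sense in the new 
environment, rather than having the old descriptor silently restored.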

Yes, we have File Descriptor policies, but they are not a solution: they 
provide a workaround for proofs of concept, until some code that you 
can't easily fix gets updated to support CRaC properly. Ideas meet 
practicality, and you are responsible for realizing what should be done 
with each particular external resource.

You're right that at the moment we don't handle JDK Platform Logging 
(nor JUL) configured to write to a file, and since that is JDK code 
outside the user's control, it is a bug. We try to fix those one by one 
(PRs are welcome!).

I hope I have provided some insight into these choices - and yes, I 
understand the pain as we still have many places to fix.

Cheers,

Radim

On 10. 04. 25 11:30, ma zhen wrote:
>
> Hi CRaC developers,
>
> I'm currently exploring the integration of CRaC support into our 
> company's middleware products. I'm also very interested in the 
> underlying implementation details of CRaC and have been doing some 
> research into its mechanics.
>
> As I understand it, CRaC leverages CRIU under the hood for 
> checkpointing and restoring running processes. My research indicates 
> that CRIU itself is capable of handling open file descriptors and 
> established network connections during the checkpoint/restore cycle.
>
> However, the CRaC API requires developers to explicitly manage these 
> resources, typically by closing them in the beforeCheckpoint() and 
> re-establishing them in the afterRestore().
>
> To understand the rationale behind this design choice, I looked into 
> the initial CRaC prototype, specifically the first PR 
> (https://github.com/openjdk/crac/pull/1). It appears that even in this 
> early version, the implementation iterated through all process file 
> descriptors during checkpoint. It ignored certain FDs (like those 
> related to classpath files, /dev/random, /dev/urandom, and files 
> marked M_PERSISTENT - though I'm unclear on the exact meaning of 
> M_PERSISTENT in this context). If any other application-opened files 
> remained, the checkpoint process would fail. This suggests the 
> requirement for manual resource management was present from the outset.
>
> As I'm not deeply familiar with JVM internals, I'm struggling to fully 
> grasp the reasoning. Was this restriction primarily introduced to 
> simplify the initial design and implementation of CRaC within the JVM?
>
> I also noticed that current versions of CRaC include File Descriptor 
> Policies. These allow configuring an action: ignore for specific file 
> descriptors, effectively delegating their handling to CRIU. This seems 
> to demonstrate that letting CRIU manage certain open files is feasible 
> within the CRaC framework.
>
> This leads me to wonder: if delegation to CRIU is possible and works 
> (at least for some cases via policies), why isn't relying on CRIU for 
> resource handling the default or more broadly encouraged approach? Why 
> the strict requirement for manual closure and reopening in the general 
> case?
>
> For instance, consider using System.getLogger() from the JDK Platform 
> Logging API. As application developers, we don't typically manage the 
> underlying file descriptor for the log file directly. To make this 
> work with CRaC, we currently need to identify and configure a File 
> Descriptor Policy for it, which can feel somewhat cumbersome. Wouldn't 
> a smoother experience involve CRaC (perhaps optionally) defaulting to 
> letting CRIU handle such internally managed resources, like those 
> opened by standard JDK libraries?
>
> I would appreciate any insights or clarification you could offer on 
> the design philosophy behind CRaC's approach to managing external 
> resources like files and sockets, especially in contrast to CRIU's 
> capabilities.
>
> Thanks for your time and any insights you can share.
>
> Best regards,
>
> mazhen
>