CRaC: CheckpointException with file descriptors from JVM internals and native calls

Radim Vansa rvansa at azul.com
Tue Nov 18 18:26:34 UTC 2025


Hello ma zhen,

Apologies for the late response.

In general, both FD policies and CRaCAllowedOpenFilePrefixes are really 
workarounds for apps that don't adhere to CRaC requirements, rather 
than proper solutions. But let's talk about the problems individually:

1) When it comes to getAvailableProcessors(), I think that opening the 
cgroups info is an implementation detail, and the CRaC JVM should handle 
that transparently. There should be a hook (either in Java code or in 
native code, whichever is less intrusive) that makes the file access and 
C/R mutually exclusive. We will gladly accept a PR (with a test case, 
please).

2) Listing files is an interaction with the environment, and the 
application should stop that during C/R. Your observation about FD 
policies makes sense; in fact, in this case there is no resource that 
could be linked to the FD policies. We would have to explicitly 
synchronize with C/R, and that would be expensive for such a common 
function. From a practical point of view, I understand that you can't 
easily modify the third-party library, and I am glad that the option 
works for you. Note, though, that CRaCAllowedOpenFilePrefixes basically 
relies on the C/R engine to handle that FD correctly. And if you attempt 
to restore on a system that does not host this directory, the restore 
will fail.
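
As an illustration of what explicit synchronization with C/R could look 
like in application code, here is a minimal sketch. CheckpointGate and 
its method names are hypothetical and not part of CRaC. It assumes the 
application controls the code that lists directories, which is not the 
case for an unmodified third-party library. To stay self-contained it 
does not depend on the CRaC API.

```java
import java.io.File;

/**
 * Hypothetical helper (not part of CRaC): makes directory listing and
 * checkpoint mutually exclusive using a plain monitor.
 */
final class CheckpointGate {
    private int activeListings = 0;       // listings currently in progress
    private boolean checkpointing = false; // true between the two callbacks

    /** Lists a directory; blocks while a checkpoint is in progress. */
    public String[] list(File dir) throws InterruptedException {
        enter();
        try {
            return dir.list();
        } finally {
            exit();
        }
    }

    private synchronized void enter() throws InterruptedException {
        while (checkpointing) {
            wait();
        }
        activeListings++;
    }

    private synchronized void exit() {
        activeListings--;
        notifyAll();
    }

    /** Blocks new listings, then waits for in-flight listings to finish. */
    public synchronized void beforeCheckpoint() throws InterruptedException {
        checkpointing = true;
        try {
            while (activeListings > 0) {
                wait();
            }
        } catch (InterruptedException e) {
            // Give up on the checkpoint and let listings proceed again.
            checkpointing = false;
            notifyAll();
            throw e;
        }
    }

    /** Lets listings proceed again. */
    public synchronized void afterRestore() {
        checkpointing = false;
        notifyAll();
    }

    public synchronized boolean isCheckpointing() {
        return checkpointing;
    }
}
```

In a real application, an org.crac.Resource registered via 
Core.getGlobalContext().register(...) would forward its beforeCheckpoint 
and afterRestore callbacks to the two methods of the same name. The gate 
makes every listing pay for a monitor acquisition, which is the kind of 
cost mentioned above for such a common function.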

Technically, getAvailableProcessors() is also an interaction with the 
'environment', that is, with the machine the JVM is currently running 
on, but the world is not black and white, and my opinion is that this 
should be transparent.

Radim

On 11/14/25 09:01, ma zhen wrote:
>
> Hi everyone,
>
> Following up on my own question, I believe I've found a suitable 
> solution and wanted to share it for the archives.
>
> The issue was resolved using the VM option 
> `-XX:CRaCAllowedOpenFilePrefixes`. This option lets you specify a 
> comma-separated list of path prefixes that CRaC should ignore if they 
> are found open during a checkpoint.
>
> (Reference: https://docs.azul.com/crac/usage/vm-options)
>
> Crucially, this option also works for files opened by native code 
> (e.g., via JNI or internal JVM functions), which is what makes it a 
> perfect solution for my original problem. This is why it can handle 
> the file descriptors that were not manageable through standard CRaC 
> resource policies.
>
> This directly addresses the two scenarios I described:
>
> 1. For the cgroup file opened by `OperatingSystemMXBean`, I can now add
>    `/sys/fs/cgroup/` to the allowed prefixes.
>
> 2. For the directory descriptor held open by the native implementation of
>    `File.list`, adding the application's base path works perfectly.
>
> This provides a much more robust solution than retrying the 
> checkpoint. I hope this is helpful for anyone else running into 
> similar issues.
>
> Best regards,
> mazhen
>
> ma zhen <mz1999 at gmail.com> wrote on Wed, Nov 12, 2025 at 17:29:
>
>     Hi everyone,
>
>     I'm encountering a CheckpointException when creating a checkpoint image
>     with CRaC. The root cause is that the application holds file descriptors
>     for files or directories.
>
>     Our application is quite complex, and after some investigation, I've found
>     that these files/directories are being opened by third-party libraries.
>     The challenge is that they are not opened through regular file I/O APIs,
>     which makes it impossible to handle them using File Descriptor Policies.
>     I've identified two specific scenarios:
>
>     1. A third-party library periodically fetches system resource information,
>        which includes calling `OperatingSystemMXBean.getAvailableProcessors`.
>
>        When the JVM determines the number of available CPU cores, if it detects
>        that cgroups are available, it will read the resource limit file
>        `cpu.cfs_quota_us`, even if the process is not in a container.
>        The specific implementation logic can be found in cgroupV1Subsystem_linux.cpp:
>        (https://github.com/openjdk/crac/blob/crac/src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp)
>
>        If a checkpoint is triggered at this exact moment, an exception
>        similar to the following occurs:
>
>         Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenFileException: FD fd=57 type=regular path=/sys/fs/cgroup/cpu,cpuacct/user.slice/cpu.cfs_quota_us
>             at java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(Core.java:115)
>             at java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(Core.java:189)
>             at java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:315)
>             at java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(Core.java:328)
>
>     2. For some reason, a third-party library periodically calls `File.list`
>        to get the list of files in a specific directory.
>
>        On Linux, the `list` method eventually calls the JNI method
>        `Java_java_io_UnixFileSystem_list`, which holds a directory file
>        descriptor during its execution. This is defined in UnixFileSystem_md.c:
>        (https://github.com/openjdk/crac/blob/crac/src/java.base/unix/native/libjava/UnixFileSystem_md.c)
>
>        Similarly, if a checkpoint is triggered at this moment, an exception
>        like the one below is thrown:
>
>     jdk.internal.crac.mirror.CheckpointException
>         Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenFileException: FD fd=46 type=directory path=.../WEB-INF/classes/WEB-INF/services
>             at java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(Core.java:115)
>             at java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(Core.java:189)
>             at java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:315)
>             at java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(Core.java:328)
>
>
>     In both situations, if a checkpoint coincides with the execution of these
>     periodic tasks, the checkpoint is likely to fail.
>
>     My current workaround is to attempt the checkpoint multiple times, as it
>     will eventually succeed. While this allows me to bypass the issue, I would
>     like to know if there is a better solution.
>
>     Thank you.
>
>     Best regards,
>     mazhen
>