CRaC: CheckpointException with file descriptors from JVM internals and native calls
ma zhen
mz1999 at gmail.com
Wed Nov 12 09:29:27 UTC 2025
Hi everyone,
I'm encountering a CheckpointException when creating a checkpoint image
with CRaC. The root cause is that the application still holds open file
descriptors for files or directories when the checkpoint is taken.
Our application is quite complex, and after some investigation, I've found
that these files/directories are being opened by third-party libraries.
The challenge is that they are not opened through regular file I/O APIs,
which makes it impossible to handle them using File Descriptor Policies.
I've identified two specific scenarios:
1. A third-party library periodically fetches system resource information,
which includes calling `OperatingSystemMXBean.getAvailableProcessors`.
When the JVM determines the number of available CPU cores, if it detects
that cgroups are available, it will read the resource limit file
`cpu.cfs_quota_us`, even if the process is not in a container.
The specific implementation logic can be found in
cgroupV1Subsystem_linux.cpp:
https://github.com/openjdk/crac/blob/crac/src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp
If a checkpoint is triggered at this exact moment, an exception
similar to the following occurs:
    Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenFileException: FD fd=57 type=regular path=/sys/fs/cgroup/cpu,cpuacct/user.slice/cpu.cfs_quota_us
        at java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(Core.java:115)
        at java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(Core.java:189)
        at java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:315)
        at java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(Core.java:328)
2. For some reason, a third-party library periodically calls `File.list`
to get the list of files in a specific directory.
On Linux, the `list` method eventually calls the JNI method
`Java_java_io_UnixFileSystem_list` which holds a directory file
descriptor during its execution. This is defined in UnixFileSystem_md.c:
https://github.com/openjdk/crac/blob/crac/src/java.base/unix/native/libjava/UnixFileSystem_md.c
Similarly, if a checkpoint is triggered at this moment, an exception
like the one below is thrown:
jdk.internal.crac.mirror.CheckpointException
    Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenFileException: FD fd=46 type=directory path=.../WEB-INF/classes/WEB-INF/services
        at java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(Core.java:115)
        at java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(Core.java:189)
        at java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:315)
        at java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(Core.java:328)
In both situations, if a checkpoint coincides with the execution of these
periodic tasks, the checkpoint is likely to fail.
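
For illustration only, the two periodic calls described above could be
imitated with a sketch like the following. The class name `PeriodicProbes`,
the 10 ms period, and the probed directory are assumptions made for this
example; the real calls come from third-party libraries. Whether
`getAvailableProcessors` actually touches a cgroup file depends on the JVM
build and the host, so this is a way to exercise the same code paths, not a
guaranteed reproducer.

```java
import java.io.File;
import java.lang.management.ManagementFactory;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the third-party periodic tasks described above.
final class PeriodicProbes {

    // Scenario 1: ask the JVM for the processor count. Per the report above,
    // this may briefly open a cgroup file inside the JVM on Linux.
    static int probeProcessors() {
        return ManagementFactory.getOperatingSystemMXBean().getAvailableProcessors();
    }

    // Scenario 2: list a directory. Per the report above, the native
    // implementation holds a directory descriptor while the call runs.
    // Returns -1 if the path cannot be listed (File.list returned null).
    static int probeDirectory(File dir) {
        String[] names = dir.list();
        return names == null ? -1 : names.length;
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);
        File dir = new File("."); // assumed directory, for illustration only

        // Run both probes every 10 ms, similar to a library polling in the background.
        scheduler.scheduleAtFixedRate(PeriodicProbes::probeProcessors, 0, 10, TimeUnit.MILLISECONDS);
        scheduler.scheduleAtFixedRate(() -> probeDirectory(dir), 0, 10, TimeUnit.MILLISECONDS);

        // A checkpoint requested while these tasks run may land inside one of the calls.
        Thread.sleep(200);

        scheduler.shutdownNow();
        scheduler.awaitTermination(1, TimeUnit.SECONDS);
        System.out.println("processors=" + probeProcessors() + ", entries=" + probeDirectory(dir));
    }
}
```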
My current workaround is to retry the checkpoint until it succeeds. This
gets me past the issue, but I would like to know whether there is a better
solution.
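
The retry workaround can be sketched roughly as follows. This is a minimal
illustration, not the code from our application: `CheckpointRetry`,
`CheckpointAction`, the attempt limit, and the pause are all hypothetical
names and values. The checkpoint call is passed in as a lambda so that the
sketch compiles on a JDK without CRaC; on a CRaC JDK the action would wrap
`jdk.crac.Core.checkpointRestore()`.

```java
// Hypothetical retry helper; names and defaults are assumptions for this sketch.
final class CheckpointRetry {

    // Stand-in for the checkpoint call, e.g. () -> jdk.crac.Core.checkpointRestore()
    // on a CRaC-enabled JDK.
    interface CheckpointAction {
        void run() throws Exception;
    }

    // Runs the action until it succeeds or maxAttempts is reached.
    // Returns the 1-based number of the attempt that succeeded and rethrows
    // the last failure if every attempt failed.
    // Note: this sketch retries on any Exception; real code would likely retry
    // only on CheckpointException and let a restore failure propagate.
    static int attemptWithRetry(CheckpointAction action, int maxAttempts, long pauseMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                action.run();
                return attempt;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Give the periodic task time to finish and close its descriptor.
                    Thread.sleep(pauseMillis);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Simulated checkpoint that fails twice, as if a descriptor were open.
        int[] calls = {0};
        int attempts = attemptWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated open file descriptor");
            }
        }, 5, 10);
        System.out.println("succeeded on attempt " + attempts);
    }
}
```

The pause between attempts is there so that a retry does not immediately hit
the same in-flight call; a suitable value depends on how long the periodic
tasks take.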
Thank you.
Best regards,
mazhen