CRaC: CheckpointException with file descriptors from JVM internals and native calls

ma zhen mz1999 at gmail.com
Fri Nov 14 08:01:45 UTC 2025


Hi everyone,

Following up on my own question, I believe I've found a suitable solution
and wanted to share it for the archives.

The issue was resolved using the VM option
`-XX:CRaCAllowedOpenFilePrefixes`. This option takes a comma-separated
list of path prefixes; CRaC ignores open files whose paths start with one
of these prefixes when it finds them during a checkpoint.

(Reference: https://docs.azul.com/crac/usage/vm-options)

Crucially, this option also works for files opened by native code (e.g.,
via JNI or internal JVM functions), which is what makes it a good fit for
my original problem. This is why it can handle the file descriptors that
were not manageable through standard CRaC resource policies.

This directly addresses the two scenarios I described:

1. For the cgroup file opened by `OperatingSystemMXBean`, I can now add
   `/sys/fs/cgroup/` to the allowed prefixes.

2. For the directory descriptor held open by the native implementation of
   `File.list`, adding the application's base path works perfectly.
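
For the archives, here is a minimal sketch of how the option value for the
two cases above might be assembled and sanity-checked. The class and method
names are hypothetical, `/opt/app/` is a placeholder for the application's
base path, and the plain string-prefix check is only my assumption about how
the JVM matches paths. The comma check is there because the list is
comma-separated, so a prefix such as `/sys/fs/cgroup/cpu,cpuacct/` would
presumably be split in two; the shorter `/sys/fs/cgroup/` avoids that.

```java
import java.util.List;

// Hypothetical helper, not part of CRaC or the JDK.
final class AllowedPrefixes {

    // Builds the VM option string from a list of path prefixes.
    static String toVmOption(List<String> prefixes) {
        for (String prefix : prefixes) {
            // A comma inside a prefix would collide with the list separator.
            if (prefix.contains(",")) {
                throw new IllegalArgumentException("prefix contains a comma: " + prefix);
            }
        }
        return "-XX:CRaCAllowedOpenFilePrefixes=" + String.join(",", prefixes);
    }

    // Assumed matching rule: a path is covered if it starts with any prefix.
    static boolean isCovered(String path, List<String> prefixes) {
        for (String prefix : prefixes) {
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "/opt/app/" is a placeholder for the application's base path.
        List<String> prefixes = List.of("/sys/fs/cgroup/", "/opt/app/");
        System.out.println(toVmOption(prefixes));
        // Path taken from the CheckpointOpenFileException in my original message.
        System.out.println(isCovered(
                "/sys/fs/cgroup/cpu,cpuacct/user.slice/cpu.cfs_quota_us", prefixes));
    }
}
```

The printed option string can then be added to the `java` command line used
when starting the application that will be checkpointed.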

This provides a much more robust solution than retrying the checkpoint. I
hope this is helpful for anyone else running into similar issues.

Best regards,
mazhen

On Wed, Nov 12, 2025 at 17:29, ma zhen <mz1999 at gmail.com> wrote:

> Hi everyone,
>
> I'm encountering a CheckpointException when creating a checkpoint image
> with CRaC. The root cause is that the application holds file descriptors
> for files or directories.
>
> Our application is quite complex, and after some investigation, I've found
> that these files/directories are being opened by third-party libraries.
> The challenge is that they are not opened through regular file I/O APIs,
> which makes it impossible to handle them using File Descriptor Policies.
>
> I've identified two specific scenarios:
>
> 1. A third-party library periodically fetches system resource information,
>    which includes calling `OperatingSystemMXBean.getAvailableProcessors`.
>
>    When the JVM determines the number of available CPU cores, if it detects
>    that cgroups are available, it will read the resource limit file
>    `cpu.cfs_quota_us`, even if the process is not in a container.
>    The specific implementation logic can be found in
>    cgroupV1Subsystem_linux.cpp:
>    (https://github.com/openjdk/crac/blob/crac/src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp)
>
>    If a checkpoint is triggered at this exact moment, an exception
>    similar to the following occurs:
>
>     Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenFileException: FD fd=57 type=regular path=/sys/fs/cgroup/cpu,cpuacct/user.slice/cpu.cfs_quota_us
>         at java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(Core.java:115)
>         at java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(Core.java:189)
>         at java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:315)
>         at java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(Core.java:328)
>
> 2. For some reason, a third-party library periodically calls `File.list`
>    to get the list of files in a specific directory.
>
>    On Linux, the `list` method eventually calls the JNI method
>    `Java_java_io_UnixFileSystem_list` which holds a directory file
>    descriptor during its execution. This is defined in UnixFileSystem_md.c:
>    (https://github.com/openjdk/crac/blob/crac/src/java.base/unix/native/libjava/UnixFileSystem_md.c)
>
>    Similarly, if a checkpoint is triggered at this moment, an exception
>    like the one below is thrown:
>
>     jdk.internal.crac.mirror.CheckpointException
>     Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenFileException: FD fd=46 type=directory path=.../WEB-INF/classes/WEB-INF/services
>         at java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(Core.java:115)
>         at java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(Core.java:189)
>         at java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:315)
>         at java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(Core.java:328)
>
>
> In both situations, if a checkpoint coincides with the execution of these
> periodic tasks, the checkpoint is likely to fail.
>
> My current workaround is to attempt the checkpoint multiple times, as it
> will eventually succeed. While this allows me to bypass the issue, I would
> like to know if there is a better solution.
>
> Thank you.
>
> Best regards,
> mazhen
>


More information about the crac-dev mailing list