CRaC: CheckpointException with file descriptors from JVM internals and native calls

ma zhen mz1999 at gmail.com
Fri Nov 21 07:12:15 UTC 2025


Hi Radim,

Thank you for your detailed and candid feedback.

I fully agree with your assessment regarding both scenarios. You've clearly
articulated why FD policies and CRaCAllowedOpenFilePrefixes are
workarounds, and that a more transparent solution for JVM internals like
getAvailableProcessors() is indeed the proper way forward.

Regarding the getAvailableProcessors() issue and your suggestion for a PR,
my current thinking is to introduce a lightweight synchronization mechanism
in the native CRaC code. This would involve an RAII-style guard to mark the
critical section during cgroup file access, ensuring mutual exclusion with
checkpoint operations.
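
A minimal sketch of what such a guard could look like, assuming a counter
protected by a mutex and a condition variable. All names here
(`CheckpointGate`, `FdSectionGuard`, `read_quota_guarded`) are hypothetical
and not taken from the CRaC sources, and real HotSpot code would presumably
use its own locking primitives rather than the C++ standard library:

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>

// Hypothetical sketch: none of these names exist in the CRaC sources.
// Tracks short native sections that hold a file descriptor and keeps them
// mutually exclusive with a checkpoint.
class CheckpointGate {
 public:
  // Called before a native section opens a file.
  // Blocks while a checkpoint is in progress.
  void enter_fd_section() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return !checkpointing_; });
    ++active_sections_;
  }

  // Called after the file has been closed again.
  void leave_fd_section() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      --active_sections_;
    }
    cv_.notify_all();
  }

  // Blocks until no FD section is active, then marks the checkpoint as running.
  void begin_checkpoint() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return !checkpointing_ && active_sections_ == 0; });
    checkpointing_ = true;
  }

  // Non-blocking variant: fails if an FD section or a checkpoint is active.
  bool try_begin_checkpoint() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (checkpointing_ || active_sections_ > 0) return false;
    checkpointing_ = true;
    return true;
  }

  void end_checkpoint() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      checkpointing_ = false;
    }
    cv_.notify_all();
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  int active_sections_ = 0;
  bool checkpointing_ = false;
};

// RAII guard: the critical section lasts exactly as long as the guard object.
class FdSectionGuard {
 public:
  explicit FdSectionGuard(CheckpointGate& gate) : gate_(gate) {
    gate_.enter_fd_section();
  }
  ~FdSectionGuard() { gate_.leave_fd_section(); }
  FdSectionGuard(const FdSectionGuard&) = delete;
  FdSectionGuard& operator=(const FdSectionGuard&) = delete;

 private:
  CheckpointGate& gate_;
};

// Stand-in for the cgroup read: the descriptor is opened and closed while
// the guard is alive, so a checkpoint cannot observe it.
// Returns -1 if the file cannot be opened or parsed.
long read_quota_guarded(CheckpointGate& gate, const char* path) {
  FdSectionGuard guard(gate);
  long value = -1;
  if (std::FILE* f = std::fopen(path, "r")) {
    if (std::fscanf(f, "%ld", &value) != 1) value = -1;
    std::fclose(f);
  }
  return value;
}
```

In this sketch the checkpoint path would call `begin_checkpoint()` before
scanning open file descriptors and `end_checkpoint()` once the scan (or the
restore) has finished. Readers that enter back to back could delay the
checkpoint, so a real implementation may need to give the checkpoint
priority.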

I would be glad to attempt implementing this and contributing a PR with a
test case.

Best regards,
mazhen

Radim Vansa <rvansa at azul.com> wrote on Wed, Nov 19, 2025 at 05:13:

> Hello ma zhen,
>
> apologies for an untimely response.
>
> In general, both FD policies and CRaCAllowedOpenFilePrefixes are really
> workarounds for apps that don't adhere to CRaC requirements, rather than
> proper solutions. But let's talk about the problems individually:
>
> 1) When it comes to getAvailableProcessors() I think that opening the
> cgroups info is an implementation detail, and CRaC JVM should handle that
> transparently. There should be a hook (either in Java code or in native,
> whichever is less intrusive) that will make the file access and C/R
> mutually exclusive. We will gladly accept a PR (with a test case, please).
>
> 2) Listing files is an interaction with the environment, and the
> application should stop that during C/R. Your observation about FD policies
> makes sense; in fact, in this case there is no resource that could be linked
> into the FD policies. We would have to explicitly synchronize with C/R, and
> that would be expensive for such a common function. From a practical POV I
> understand that you can't easily modify the 3rd party library, and I am glad
> that it works for you. Note, though, that CRaCAllowedOpenFilePrefixes
> basically relies on the C/R engine to handle that FD correctly. And if you
> attempt to restore on a system that does not host this directory, the
> restore will fail.
>
> Technically, getAvailableProcessors() is also an interaction with the
> 'environment', with the machine it is currently running on, but the world
> is not black and white, and my opinion is that this should be transparent.
>
> Radim
> On 11/14/25 09:01, ma zhen wrote:
>
> Hi everyone,
>
> Following up on my own question, I believe I've found a suitable solution
> and wanted to share it for the archives.
>
> The issue was resolved using the VM option
> `-XX:CRaCAllowedOpenFilePrefixes`. This option lets you specify a
> comma-separated list of path prefixes that CRaC should ignore if they are
> found open during a checkpoint.
>
> (Reference: https://docs.azul.com/crac/usage/vm-options)
>
> Crucially, and what makes it a perfect solution for my original problem,
> is that this option works for files opened by native code (e.g., via JNI or
> internal JVM functions). This is why it can handle the file descriptors
> that were not manageable through standard CRaC resource policies.
>
> This directly addresses the two scenarios I described:
>
> 1. For the cgroup file opened by `OperatingSystemMXBean`, I can now add
>    `/sys/fs/cgroup/` to the allowed prefixes.
>
> 2. For the directory descriptor held open by the native implementation of
>    `File.list`, adding the application's base path works perfectly.
>
> This provides a much more robust solution than retrying the checkpoint. I
> hope this is helpful for anyone else running into similar issues.
>
> Best regards,
> mazhen
>
> ma zhen <mz1999 at gmail.com> wrote on Wed, Nov 12, 2025 at 17:29:
>
>> Hi everyone,
>>
>> I'm encountering a CheckpointException when creating a checkpoint image
>> with CRaC. The root cause is that the application holds file descriptors
>> for files or directories.
>>
>> Our application is quite complex, and after some investigation, I've found
>> that these files/directories are being opened by third-party libraries.
>> The challenge is that they are not opened through regular file I/O APIs,
>> which makes it impossible to handle them using File Descriptor Policies.
>>
>> I've identified two specific scenarios:
>>
>> 1. A third-party library periodically fetches system resource information,
>>    which includes calling `OperatingSystemMXBean.getAvailableProcessors`.
>>
>>    When the JVM determines the number of available CPU cores, if it detects
>>    that cgroups are available, it will read the resource limit file
>>    `cpu.cfs_quota_us`, even if the process is not in a container.
>>    The specific implementation logic can be found in
>>    cgroupV1Subsystem_linux.cpp:
>>    (https://github.com/openjdk/crac/blob/crac/src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp)
>>
>>    If a checkpoint is triggered at this exact moment, an exception
>>    similar to the following occurs:
>>
>>     Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenFileException: FD fd=57 type=regular path=/sys/fs/cgroup/cpu,cpuacct/user.slice/cpu.cfs_quota_us
>>         at java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(Core.java:115)
>>         at java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(Core.java:189)
>>         at java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:315)
>>         at java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(Core.java:328)
>>
>> 2. For some reason, a third-party library periodically calls `File.list`
>>    to get the list of files in a specific directory.
>>
>>    On Linux, the `list` method eventually calls the JNI method
>>    `Java_java_io_UnixFileSystem_list`, which holds a directory file
>>    descriptor during its execution. This is defined in
>>    UnixFileSystem_md.c:
>>    (https://github.com/openjdk/crac/blob/crac/src/java.base/unix/native/libjava/UnixFileSystem_md.c)
>>
>>    Similarly, if a checkpoint is triggered at this moment, an exception
>>    like the one below is thrown:
>>
>>     jdk.internal.crac.mirror.CheckpointException
>>     Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenFileException: FD fd=46 type=directory path=.../WEB-INF/classes/WEB-INF/services
>>         at java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(Core.java:115)
>>         at java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(Core.java:189)
>>         at java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:315)
>>         at java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(Core.java:328)
>>
>>
>> In both situations, if a checkpoint coincides with the execution of these
>> periodic tasks, the checkpoint is likely to fail.
>>
>> My current workaround is to attempt the checkpoint multiple times, as it
>> will eventually succeed. While this allows me to bypass the issue, I would
>> like to know if there is a more optimal solution.
>>
>> Thank you.
>>
>> Best regards,
>> mazhen
>>
>