<div dir="ltr"><div dir="ltr"><div>Hi everyone,</div><div><br></div><div>Following up on my own question, I believe I've found a suitable solution and wanted to share it for the archives.</div><div><br></div><div>The issue was resolved using the VM option `-XX:CRaCAllowedOpenFilePrefixes`. This option takes a comma-separated list of path prefixes; open files whose paths start with one of these prefixes are ignored by CRaC during a checkpoint.</div><div><br></div><div>(Reference: <a href="https://docs.azul.com/crac/usage/vm-options">https://docs.azul.com/crac/usage/vm-options</a>)</div><div><br></div><div>Crucially, this option works for files opened by native code (e.g., via JNI or internal JVM functions), which is what makes it a good fit for my original problem. This is why it can handle the file descriptors that were not manageable through standard CRaC resource policies.</div><div><br></div><div>This directly addresses the two scenarios I described:</div><div><br></div><div>1. For the cgroup file opened by `OperatingSystemMXBean`, I can now add</div><div> `/sys/fs/cgroup/` to the allowed prefixes.</div><div><br></div><div>2. For the directory descriptor held open by the native implementation of</div><div> `File.list`, adding the application's base path to the allowed prefixes works.</div><div><br></div><div>This is much more robust than retrying the checkpoint until it succeeds. 
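<div><br></div><div>For anyone who wants to try this, here is a minimal sketch of how the prefix list could be assembled and passed to the JVM. The application base path `/opt/myapp/`, the checkpoint directory `/tmp/cr`, and `app.jar` are hypothetical placeholders, and I am assuming the image directory is set with `-XX:CRaCCheckpointTo`. The script only prints the command so it can be reviewed before use:</div>
<pre>
```shell
#!/bin/sh
# Hypothetical values for illustration only; replace with your own.
APP_BASE="/opt/myapp/"     # assumed application base path (covers the File.list directory)
CHECKPOINT_DIR="/tmp/cr"   # assumed checkpoint image directory

# Comma-separated prefixes: the cgroup files plus everything under the app base path.
ALLOWED_PREFIXES="/sys/fs/cgroup/,${APP_BASE}"

# Assemble the JVM command line and print it for review (nothing is launched here).
JAVA_CMD="java -XX:CRaCCheckpointTo=${CHECKPOINT_DIR} -XX:CRaCAllowedOpenFilePrefixes=${ALLOWED_PREFIXES} -jar app.jar"
echo "${JAVA_CMD}"
```
</pre>
<div>Keeping the prefixes as narrow as possible seems sensible, since any open file under an allowed prefix will no longer cause the checkpoint to fail.</div><div><br></div>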
I hope this is helpful for anyone else running into similar issues.</div><div><br></div><div>Best regards,</div><div>mazhen</div></div></div><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">ma zhen &lt;<a href="mailto:mz1999@gmail.com">mz1999@gmail.com</a>&gt; wrote on Wed, Nov 12, 2025 at 17:29:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div>Hi everyone,</div><div><br></div><div>I'm encountering a CheckpointException when creating a checkpoint image</div><div>with CRaC. The root cause is that the application holds file descriptors</div><div>for files or directories.</div><div><br></div><div><div>Our application is quite complex, and after some investigation, I've found </div><div>that these files/directories are being opened by third-party libraries. </div><div>The challenge is that they are not opened through regular file I/O APIs, </div><div>which makes it impossible to handle them using File Descriptor Policies.</div></div><div><br></div><div>I've identified two specific scenarios:</div><div><br></div><div>1. 
A third-party library periodically fetches system resource information,</div><div> which includes calling `OperatingSystemMXBean.getAvailableProcessors`.</div><div><br></div><div> When the JVM determines the number of available CPU cores, if it detects</div><div> that cgroups are available, it will read the resource limit file</div><div> `cpu.cfs_quota_us`, even if the process is not in a container.</div><div> The specific implementation logic can be found in cgroupV1Subsystem_linux.cpp:</div><div> (<a href="https://github.com/openjdk/crac/blob/crac/src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp" target="_blank">https://github.com/openjdk/crac/blob/crac/src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp</a>)</div><div><br></div><div> If a checkpoint is triggered at this exact moment, an exception</div><div> similar to the following occurs:</div><div><br></div><div> Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenFileException: FD fd=57 type=regular path=/sys/fs/cgroup/cpu,cpuacct/user.slice/cpu.cfs_quota_us</div><div> at java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(Core.java:115)</div><div> at java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(Core.java:189)</div><div> at java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:315)</div><div> at java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(Core.java:328)</div><div><br></div><div>2. For some reason, a third-party library periodically calls `File.list`</div><div> to get the list of files in a specific directory.</div><div><br></div><div> On Linux, the `list` method eventually calls the JNI method</div><div> `Java_java_io_UnixFileSystem_list` which holds a directory file</div><div> descriptor during its execution. 
This is defined in UnixFileSystem_md.c:</div><div> (<a href="https://github.com/openjdk/crac/blob/crac/src/java.base/unix/native/libjava/UnixFileSystem_md.c" target="_blank">https://github.com/openjdk/crac/blob/crac/src/java.base/unix/native/libjava/UnixFileSystem_md.c</a>)</div><div><br></div><div> Similarly, if a checkpoint is triggered at this moment, an exception</div><div> like the one below is thrown:</div><div><br></div><div> jdk.internal.crac.mirror.CheckpointException</div><div> Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenFileException: FD fd=46 type=directory path=.../WEB-INF/classes/WEB-INF/services</div><div> at java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(Core.java:115)</div><div> at java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(Core.java:189)</div><div> at java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:315)</div><div> at java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(Core.java:328)</div><div><br></div><div><br></div><div>In both situations, if a checkpoint coincides with the execution of these</div><div>periodic tasks, the checkpoint is likely to fail.</div><div><br></div><div>My current workaround is to attempt the checkpoint multiple times, as it</div><div>will eventually succeed. While this allows me to bypass the issue, I would</div><div>like to know if there is a more optimal solution.</div><div><br></div><div>Thank you.</div><div><br></div><div><div>Best regards,</div><div>mazhen</div></div></div></div></div></div></div></div>
</blockquote></div>