<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p>That's great. Although similar hooks don't currently use RAII, I
think it is a reliable way to implement this.</p>
<p>Please make sure that your implementation uses RW (reader-writer)
locking, so that it does not force mutual exclusion and unintended
synchronization between threads while no checkpoint is happening.
Alternatively, it might be possible to mark the entry to this
section as critical and prevent the VM thread from executing the
C/R; I am not sure which alternative is more lightweight.</p>
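<p>A minimal sketch of the reader-writer idea, assuming nothing about
the actual CRaC sources: it is written in Java, with try-with-resources
standing in for a C++ RAII guard, and every name in it
(CheckpointGate, Guard, enterFileAccess, enterCheckpoint) is
hypothetical. File accesses take the lock in shared mode, so they do
not serialize against each other; only the checkpoint takes it in
exclusive mode.</p>
<pre>
```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch; none of these names exist in the CRaC code base.
final class CheckpointGate {
    private static final ReentrantReadWriteLock LOCK = new ReentrantReadWriteLock();

    // Guard released by try-with-resources, the closest Java analogue
    // of a C++ RAII guard. close() is narrowed to throw no checked exception.
    interface Guard extends AutoCloseable {
        @Override
        void close();
    }

    // Shared mode: any number of threads may be inside a file-access
    // section at the same time, so they never block each other.
    static Guard enterFileAccess() {
        LOCK.readLock().lock();
        return () -> LOCK.readLock().unlock();
    }

    // Exclusive mode: taken only by the thread performing the checkpoint.
    // It waits until all file-access sections have been left.
    static Guard enterCheckpoint() {
        LOCK.writeLock().lock();
        return () -> LOCK.writeLock().unlock();
    }

    // Number of shared holds currently taken (reentrant holds count too).
    static int readersInside() {
        return LOCK.getReadLockCount();
    }

    // True while some thread holds the exclusive lock.
    static boolean checkpointInProgress() {
        return LOCK.isWriteLocked();
    }
}

final class GateDemo {
    public static void main(String[] args) {
        // Two nested file-access sections coexist without exclusion.
        try (CheckpointGate.Guard outer = CheckpointGate.enterFileAccess();
             CheckpointGate.Guard inner = CheckpointGate.enterFileAccess()) {
            System.out.println("readers inside: " + CheckpointGate.readersInside());
        }
        // The checkpoint runs only after every file-access guard is closed.
        try (CheckpointGate.Guard checkpoint = CheckpointGate.enterCheckpoint()) {
            System.out.println("checkpoint in progress: "
                    + CheckpointGate.checkpointInProgress());
        }
    }
}
```
</pre>
<p>The uncontended shared acquisition is the cost paid on every file
access, which is the overhead to compare against the critical-section
alternative.</p>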
<p>Thanks in advance for the contribution!</p>
<p>Radim</p>
<p><br>
</p>
<div class="moz-cite-prefix">On 11/21/25 08:12, ma zhen wrote:<br>
</div>
<blockquote type="cite" cite="mid:CA+U33_P5TzLbpH18J62925vDVYqo7_LS=M1oj7DuMagUwdnjEg@mail.gmail.com">
<div>
<div dir="ltr">Hi Radim,<br>
<br>
Thank you for your detailed and candid feedback.<br>
<br>
I fully agree with your assessment regarding both scenarios.
You've clearly articulated why FD policies and
CRaCAllowedOpenFilePrefixes are workarounds, and that a more
transparent solution for JVM internals like
getAvailableProcessors() is indeed the proper way forward.<br>
<br>
Regarding the getAvailableProcessors() issue and your
suggestion for a PR, my current thinking is to introduce a
lightweight synchronization mechanism in the native CRaC code.
This would involve an RAII-style guard to mark the critical
section during cgroup file access, ensuring mutual exclusion
with checkpoint operations.<br>
<br>
I would be glad to attempt implementing this and contributing
a PR with a test case.
<br>
<br>
Best regards,<br>
mazhen</div>
<br>
<div class="gmail_quote gmail_quote_container">
<div dir="ltr" class="gmail_attr">Radim Vansa &lt;<a href="mailto:rvansa@azul.com" moz-do-not-send="true" class="moz-txt-link-freetext">rvansa@azul.com</a>&gt;
wrote on Wed, Nov 19, 2025 at 05:13:<br>
</div>
<blockquote class="gmail_quote">
<div>
<p>Hello ma zhen,</p>
<p>apologies for an untimely response.</p>
<p>In general, both FD policies and
CRaCAllowedOpenFilePrefixes are really a workaround for
apps that don't adhere to CRaC requirements, rather than
a proper solution. But let's talk about the problems
individually:</p>
<p>1) When it comes to getAvailableProcessors() I think
that opening the cgroups info is an implementation
detail, and CRaC JVM should handle that transparently.
There should be a hook (either in Java code or in
native, whichever is less intrusive) that will make the
file access and C/R mutually exclusive. We will gladly
accept a PR (with a test case, please).</p>
<p>2) Listing files is an interaction with the
environment, and the application should stop that during
C/R. Your observation about FD policies makes sense; in
fact, in this case there is no resource that could be
linked to the FD policies. We would have to explicitly
synchronize with C/R, and that would be expensive for such
a common function. From a practical POV I understand that
you can't easily modify the 3rd party library and I am
glad that it works for you. Note, though,
that CRaCAllowedOpenFilePrefixes basically relies on the C/R
engine to handle that FD correctly. And if you attempt
to restore on a system that does not host this
directory, the restore will fail.</p>
<p>Technically, getAvailableProcessors() is also an
interaction with the 'environment', with the machine it
is currently running on, but the world is not black and
white and my opinion is that this should be transparent.</p>
<p>Radim</p>
<div>On 11/14/25 09:01, ma zhen wrote:<br>
</div>
<blockquote type="cite">
<div>
<div dir="ltr">
<div dir="ltr">
<div>Hi everyone,</div>
<div><br>
</div>
<div>Following up on my own question, I believe
I've found a suitable solution and wanted to
share it for the archives.</div>
<div><br>
</div>
<div>The issue was resolved using the VM option
`-XX:CRaCAllowedOpenFilePrefixes`. This option
lets you specify a comma-separated list of path
prefixes that CRaC should ignore if they are
found open during a checkpoint.</div>
<div><br>
</div>
<div>(Reference: <a href="https://docs.azul.com/crac/usage/vm-options" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">
https://docs.azul.com/crac/usage/vm-options</a>)</div>
<div><br>
</div>
<div>Crucially, this option also works for files
opened by native code (e.g., via JNI or internal
JVM functions), which is what makes it a perfect
solution for my original problem. This is why it
can handle the file descriptors that were not
manageable through standard CRaC resource
policies.</div>
<div><br>
</div>
<div>This directly addresses the two scenarios I
described:</div>
<div><br>
</div>
<div>1. For the cgroup file opened by
`OperatingSystemMXBean`, I can now add</div>
<div> `/sys/fs/cgroup/` to the allowed prefixes.</div>
<div><br>
</div>
<div>2. For the directory descriptor held open by
the native implementation of</div>
<div> `File.list`, adding the application's base
path works perfectly.</div>
<div><br>
</div>
<div>This provides a much more robust solution
than retrying the checkpoint. I hope this is
helpful for anyone else running into similar
issues.</div>
<div><br>
</div>
<div>Best regards,</div>
<div>mazhen</div>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">ma zhen &lt;<a href="mailto:mz1999@gmail.com" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">mz1999@gmail.com</a>&gt;
wrote on Wed, Nov 12, 2025 at 17:29:<br>
</div>
<blockquote class="gmail_quote">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div>Hi everyone,</div>
<div><br>
</div>
<div>I'm encountering a
CheckpointException when creating a
checkpoint image</div>
<div>with CRaC. The root cause is that
the application holds file
descriptors</div>
<div>for files or directories.</div>
<div><br>
</div>
<div>
<div>Our application is quite
complex, and after some
investigation, I've found </div>
<div>that these files/directories
are being opened by third-party
libraries. </div>
<div>The challenge is that they are
not opened through regular file
I/O APIs, </div>
<div>which makes it impossible to
handle them using File Descriptor
Policies.</div>
</div>
<div><br>
</div>
<div>I've identified two specific
scenarios:</div>
<div><br>
</div>
<div>1. A third-party library
periodically fetches system resource
information,</div>
<div> which includes calling
`OperatingSystemMXBean.getAvailableProcessors`.</div>
<div><br>
</div>
<div> When the JVM determines the
number of available CPU cores, if it
detects</div>
<div> that cgroups are available, it
will read the resource limit file</div>
<div> `cpu.cfs_quota_us`, even if
the process is not in a container.</div>
<div> The specific implementation
logic can be found in
cgroupV1Subsystem_linux.cpp:</div>
<div> (<a href="https://github.com/openjdk/crac/blob/crac/src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/crac/blob/crac/src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp</a>)</div>
<div><br>
</div>
<div> If a checkpoint is triggered
at this exact moment, an exception</div>
<div> similar to the following
occurs:</div>
<div><br>
</div>
<div> Suppressed:
jdk.internal.crac.mirror.impl.CheckpointOpenFileException:
FD fd=57 type=regular
path=/sys/fs/cgroup/cpu,cpuacct/user.slice/cpu.cfs_quota_us</div>
<div> at
java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(<a moz-do-not-send="true">Core.java:115</a>)</div>
<div> at
java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(<a moz-do-not-send="true">Core.java:189</a>)</div>
<div> at
java.base/jdk.internal.crac.mirror.Core.checkpointRestore(<a moz-do-not-send="true">Core.java:315</a>)</div>
<div> at
java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(<a moz-do-not-send="true">Core.java:328</a>)</div>
<div><br>
</div>
<div>2. For some reason, a third-party
library periodically calls
`File.list`</div>
<div> to get the list of files in a
specific directory.</div>
<div><br>
</div>
<div> On Linux, the `list` method
eventually calls the JNI method</div>
<div>
`Java_java_io_UnixFileSystem_list`
which holds a directory file</div>
<div> descriptor during its
execution. This is defined in
UnixFileSystem_md.c:</div>
<div> (<a href="https://github.com/openjdk/crac/blob/crac/src/java.base/unix/native/libjava/UnixFileSystem_md.c" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/crac/blob/crac/src/java.base/unix/native/libjava/UnixFileSystem_md.c</a>)</div>
<div><br>
</div>
<div> Similarly, if a checkpoint is
triggered at this moment, an
exception</div>
<div> like the one below is thrown:</div>
<div><br>
</div>
<div>
jdk.internal.crac.mirror.CheckpointException</div>
<div> Suppressed:
jdk.internal.crac.mirror.impl.CheckpointOpenFileException:
FD fd=46 type=directory
path=.../WEB-INF/classes/WEB-INF/services</div>
<div> at
java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(<a moz-do-not-send="true">Core.java:115</a>)</div>
<div> at
java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(<a moz-do-not-send="true">Core.java:189</a>)</div>
<div> at
java.base/jdk.internal.crac.mirror.Core.checkpointRestore(<a moz-do-not-send="true">Core.java:315</a>)</div>
<div> at
java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(<a moz-do-not-send="true">Core.java:328</a>)</div>
<div><br>
</div>
<div><br>
</div>
<div>In both situations, if a
checkpoint coincides with the
execution of these</div>
<div>periodic tasks, the checkpoint is
likely to fail.</div>
<div><br>
</div>
<div>My current workaround is to
attempt the checkpoint multiple
times, as it</div>
<div>will eventually succeed. While
this allows me to bypass the issue,
I would</div>
<div>like to know if there is a better
solution.</div>
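<div><br>
</div>
<div>A minimal sketch of such a retry loop, assuming a
hypothetical CheckpointAttempt wrapper around the actual
checkpoint call. The names CheckpointAttempt, CheckpointRetry
and runWithRetry are illustrative and not part of any CRaC
API, and the demo uses a fake attempt so that it runs on a
JDK without CRaC:</div>
<pre>
```java
// Hypothetical sketch of the retry workaround. CheckpointAttempt stands in
// for the real checkpoint call, which is not available on a JDK without CRaC.
interface CheckpointAttempt {
    void run() throws Exception;
}

final class CheckpointRetry {
    // Runs the attempt until it succeeds or maxAttempts is reached.
    // Returns the number of attempts used; rethrows the last failure otherwise.
    static int runWithRetry(CheckpointAttempt attempt, int maxAttempts, long pauseMillis)
            throws Exception {
        if (1 > maxAttempts) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        int used = 0;
        while (true) {
            used++;
            try {
                attempt.run();
                return used;
            } catch (Exception e) {
                if (used >= maxAttempts) {
                    throw e;
                }
                // Give the periodic task time to close its descriptor.
                Thread.sleep(pauseMillis);
            }
        }
    }
}

final class RetryDemo {
    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fake attempt: fails twice, as if a descriptor were still open,
        // then succeeds on the third call.
        int used = CheckpointRetry.runWithRetry(() -> {
            calls[0]++;
            if (3 > calls[0]) {
                throw new IllegalStateException("descriptor still open");
            }
        }, 5, 10L);
        System.out.println("checkpoint succeeded after " + used + " attempt(s)");
    }
}
```
</pre>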
<div><br>
</div>
<div>Thank you.</div>
<div><br>
</div>
<div>
<div>Best regards,</div>
<div>mazhen</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</div>
</blockquote>
</body>
</html>