<div dir="ltr">Hi Radim,<br><br>Thank you for your detailed and candid feedback.<br><br>I fully agree with your assessment of both scenarios. You've clearly explained why FD policies and CRaCAllowedOpenFilePrefixes are workarounds, and why a transparent solution for JVM internals like getAvailableProcessors() is the proper way forward.<br><br>Regarding the getAvailableProcessors() issue and your suggestion for a PR, my current thinking is to introduce a lightweight synchronization mechanism in the native CRaC code. This would involve an RAII-style guard that marks the critical section during cgroup file access, ensuring mutual exclusion with checkpoint operations.<br><br>I would be glad to attempt implementing this and contributing a PR with a test case.<br><br>Best regards,<br>mazhen</div><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">Radim Vansa &lt;<a href="mailto:rvansa@azul.com">rvansa@azul.com</a>&gt; wrote on Wed, Nov 19, 2025 at 05:13:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

  
  <div>
    <p>Hello ma zhen,</p>
    <p>apologies for the late response.</p>
    <p>In general, both FD policies and CRaCAllowedOpenFilePrefixes are
      really workarounds for apps that don't adhere to CRaC
      requirements, rather than proper solutions. But let's talk about
      the problems individually:</p>
    <p>1) When it comes to getAvailableProcessors(), I think that opening
      the cgroups info is an implementation detail, and the CRaC JVM should
      handle that transparently. There should be a hook (either in Java
      code or in native code, whichever is less intrusive) that makes the
      file access and C/R mutually exclusive. We will gladly accept a PR
      (with a test case, please).</p>
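    <p>As a rough illustration of such a hook, the sketch below uses a
      read/write lock in Java: internal file reads take the shared side,
      and the checkpoint takes the exclusive side, so it waits until no
      read is in flight. All names here (CheckpointGate, enterFileAccess,
      enterCheckpoint, and so on) are hypothetical and not part of CRaC;
      an actual fix might live in native code and look quite different.</p>
<pre>
```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: none of these names exist in the CRaC code base.
final class CheckpointGate {
    private static final ReentrantReadWriteLock LOCK = new ReentrantReadWriteLock();

    // Guard that is released automatically at the end of a try-with-resources block.
    interface Guard extends AutoCloseable {
        @Override
        void close(); // narrowed so callers need not handle a checked exception
    }

    // Taken around a short internal file read; many readers may hold it at once.
    static Guard enterFileAccess() {
        LOCK.readLock().lock();
        return () -> LOCK.readLock().unlock();
    }

    // Taken by the checkpoint; blocks until no file access is in flight.
    static Guard enterCheckpoint() {
        LOCK.writeLock().lock();
        return () -> LOCK.writeLock().unlock();
    }

    // Probe only: true if the exclusive side could be taken right now.
    static boolean checkpointCouldStart() {
        if (LOCK.writeLock().tryLock()) {
            LOCK.writeLock().unlock();
            return true;
        }
        return false;
    }

    // Number of file-access guards currently held.
    static int readersInFlight() {
        return LOCK.getReadLockCount();
    }
}
```
</pre>
    <p>A caller would wrap the short read in
      try (var g = CheckpointGate.enterFileAccess()) { ... }, and the
      checkpoint path would do the same with enterCheckpoint(). The
      shared/exclusive split keeps concurrent readers from blocking each
      other, which matters if the guarded call is frequent.</p>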
    <p>2) Listing files is an interaction with the environment, and the
      application should stop that during C/R. Your observation about FD
      policies makes sense; in fact, in this case there is no resource
      that could be linked to the FD policies. We would have to
      explicitly synchronize with C/R, and that would be expensive for
      such a common function. From a practical point of view, I understand
      that you can't easily modify the 3rd party library, and I am glad that it
      works for you. Note, though, that CRaCAllowedOpenFilePrefixes
      basically relies on the C/R engine to handle that FD correctly. And if
      you attempt to restore on a system that does not have this
      directory, the restore will fail.</p>
    <p>Technically, getAvailableProcessors() is also an interaction
      with the 'environment', with the machine it is currently running on,
      but the world is not black and white, and my opinion is that this
      should be transparent.</p>
    <p>Radim</p>
    <div>On 11/14/25 09:01, ma zhen wrote:<br>
    </div>
    <blockquote type="cite">
      <div>
        <div dir="ltr">
          <div dir="ltr">
            <div>Hi everyone,</div>
            <div><br>
            </div>
            <div>Following up on my own question, I believe I've found a
              suitable solution and wanted to share it for the archives.</div>
            <div><br>
            </div>
            <div>The issue was resolved using the VM option
              `-XX:CRaCAllowedOpenFilePrefixes`. This option takes a
              comma-separated list of path prefixes; files under those
              prefixes are ignored by CRaC if they are found open during a checkpoint.</div>
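            <div><br>
            </div>
            <div>To illustrate what such a prefix list means, the
              following Java sketch splits a comma-separated value and
              tests an open file's path against each entry. It assumes
              plain string-prefix matching, which is an assumption for
              illustration and not the JVM's actual code; the class
              `AllowedPrefixes` and the `/opt/app/` path are made up for
              the example.</div>
<pre>
```java
import java.util.Arrays;

// Illustrative only: assumes plain string-prefix matching; the JVM's real check may differ.
final class AllowedPrefixes {
    private final String[] prefixes;

    // Parses a value such as "/sys/fs/cgroup/,/opt/app/" into its entries.
    AllowedPrefixes(String commaSeparated) {
        this.prefixes = Arrays.stream(commaSeparated.split(","))
                .map(String::trim)
                .filter(p -> !p.isEmpty()) // an empty entry would match every path
                .toArray(String[]::new);
    }

    // True if the open file's path starts with any configured prefix.
    boolean allows(String openPath) {
        for (String prefix : prefixes) {
            if (openPath.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```
</pre>
            <div>Under that assumption, a value of
              `/sys/fs/cgroup/,/opt/app/` would cover the cgroup file
              from the first scenario, while a path outside both
              prefixes would still fail the checkpoint.</div>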
            <div><br>
            </div>
            <div>(Reference: <a href="https://docs.azul.com/crac/usage/vm-options" target="_blank">https://docs.azul.com/crac/usage/vm-options</a>)</div>
            <div><br>
            </div>
            <div>Crucially, this option also works for files
              opened by native code (e.g., via JNI or internal JVM
              functions), which is what makes it a perfect fit for my
              original problem. This is why it can handle the file descriptors
              that were not manageable through standard CRaC resource
              policies.</div>
            <div><br>
            </div>
            <div>This directly addresses the two scenarios I described:</div>
            <div><br>
            </div>
            <div>1. For the cgroup file opened by
              `OperatingSystemMXBean`, I can now add</div>
            <div>   `/sys/fs/cgroup/` to the allowed prefixes.</div>
            <div><br>
            </div>
            <div>2. For the directory descriptor held open by the native
              implementation of</div>
            <div>   `File.list`, adding the application's base path
              works perfectly.</div>
            <div><br>
            </div>
            <div>This provides a much more robust solution than retrying
              the checkpoint. I hope this is helpful for anyone else
              running into similar issues.</div>
            <div><br>
            </div>
            <div>Best regards,</div>
            <div>mazhen</div>
          </div>
        </div>
        <br>
        <div class="gmail_quote">
          <div dir="ltr" class="gmail_attr">ma zhen &lt;<a href="mailto:mz1999@gmail.com" target="_blank">mz1999@gmail.com</a>&gt;
            wrote on Wed, Nov 12, 2025 at 17:29:<br>
          </div>
          <blockquote class="gmail_quote">
            <div dir="ltr">
              <div dir="ltr">
                <div dir="ltr">
                  <div dir="ltr">
                    <div dir="ltr">
                      <div dir="ltr">
                        <div>Hi everyone,</div>
                        <div><br>
                        </div>
                        <div>I'm encountering a CheckpointException when
                          creating a checkpoint image</div>
                        <div>with CRaC. The root cause is that the
                          application holds file descriptors</div>
                        <div>for files or directories.</div>
                        <div><br>
                        </div>
                        <div>
                          <div>Our application is quite complex, and
                            after some investigation, I've found </div>
                          <div>that these files/directories are being
                            opened by third-party libraries. </div>
                          <div>The challenge is that they are not opened
                            through regular file I/O APIs, </div>
                          <div>which makes it impossible to handle them
                            using File Descriptor Policies.</div>
                        </div>
                        <div><br>
                        </div>
                        <div>I've identified two specific scenarios:</div>
                        <div><br>
                        </div>
                        <div>1. A third-party library periodically
                          fetches system resource information,</div>
                        <div>   which includes calling
                          `OperatingSystemMXBean.getAvailableProcessors`.</div>
                        <div><br>
                        </div>
                        <div>   When the JVM determines the number of
                          available CPU cores, if it detects</div>
                        <div>   that cgroups are available, it will read
                          the resource limit file</div>
                        <div>   `cpu.cfs_quota_us`, even if the process
                          is not in a container.</div>
                        <div>   The specific implementation logic can be
                          found in cgroupV1Subsystem_linux.cpp:</div>
                        <div>   (<a href="https://github.com/openjdk/crac/blob/crac/src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp" target="_blank">https://github.com/openjdk/crac/blob/crac/src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp</a>)</div>
                        <div><br>
                        </div>
                        <div>   If a checkpoint is triggered at this
                          exact moment, an exception</div>
                        <div>   similar to the following occurs:</div>
                        <div><br>
                        </div>
                        <div>    Suppressed:
                          jdk.internal.crac.mirror.impl.CheckpointOpenFileException:
                          FD fd=57 type=regular
                          path=/sys/fs/cgroup/cpu,cpuacct/user.slice/cpu.cfs_quota_us</div>
                        <div>        at
java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(<a>Core.java:115</a>)</div>
                        <div>        at
java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(<a>Core.java:189</a>)</div>
                        <div>        at
                          java.base/jdk.internal.crac.mirror.Core.checkpointRestore(<a>Core.java:315</a>)</div>
                        <div>        at
java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(<a>Core.java:328</a>)</div>
                        <div><br>
                        </div>
                        <div>2. For some reason, a third-party library
                          periodically calls `File.list`</div>
                        <div>   to get the list of files in a specific
                          directory.</div>
                        <div><br>
                        </div>
                        <div>   On Linux, the `list` method eventually
                          calls the JNI method</div>
                        <div>   `Java_java_io_UnixFileSystem_list` which
                          holds a directory file</div>
                        <div>   descriptor during its execution. This is
                          defined in UnixFileSystem_md.c:</div>
                        <div>   (<a href="https://github.com/openjdk/crac/blob/crac/src/java.base/unix/native/libjava/UnixFileSystem_md.c" target="_blank">https://github.com/openjdk/crac/blob/crac/src/java.base/unix/native/libjava/UnixFileSystem_md.c</a>)</div>
                        <div><br>
                        </div>
                        <div>   Similarly, if a checkpoint is triggered
                          at this moment, an exception</div>
                        <div>   like the one below is thrown:</div>
                        <div><br>
                        </div>
                        <div>   
                          jdk.internal.crac.mirror.CheckpointException</div>
                        <div>    Suppressed:
                          jdk.internal.crac.mirror.impl.CheckpointOpenFileException:
                          FD fd=46 type=directory
                          path=.../WEB-INF/classes/WEB-INF/services</div>
                        <div>        at
java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(<a>Core.java:115</a>)</div>
                        <div>        at
java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(<a>Core.java:189</a>)</div>
                        <div>        at
                          java.base/jdk.internal.crac.mirror.Core.checkpointRestore(<a>Core.java:315</a>)</div>
                        <div>        at
java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(<a>Core.java:328</a>)</div>
                        <div><br>
                        </div>
                        <div><br>
                        </div>
                        <div>In both situations, if a checkpoint
                          coincides with the execution of these</div>
                        <div>periodic tasks, the checkpoint is likely to
                          fail.</div>
                        <div><br>
                        </div>
                        <div>My current workaround is to attempt the
                          checkpoint multiple times, as it</div>
                        <div>will eventually succeed. While this allows
                          me to bypass the issue, I would</div>
                        <div>like to know if there is a better
                          solution.</div>
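                        <div><br>
                        </div>
                        <div>For reference, the retry workaround has
                          roughly the following shape. This is a minimal
                          Java sketch, not the code we actually run:
                          `CheckpointAttempt` and `CheckpointRetry` are
                          hypothetical names, and the real checkpoint
                          call is hidden behind the interface so that
                          the sketch does not depend on CRaC.</div>
<pre>
```java
// Hypothetical stand-in for the real checkpoint call, so the sketch runs without CRaC.
interface CheckpointAttempt {
    // Returns true if the checkpoint went through, false if it failed (e.g. an FD was open).
    boolean tryCheckpoint();
}

final class CheckpointRetry {
    // Returns the 1-based number of the attempt that succeeded, or -1 if all failed.
    static int withRetry(CheckpointAttempt attempt, int maxAttempts, long pauseMillis) {
        int tries = 0;
        while (maxAttempts > tries) {
            tries++;
            if (attempt.tryCheckpoint()) {
                return tries;
            }
            if (tries == maxAttempts) {
                break; // no point in pausing after the final failure
            }
            try {
                Thread.sleep(pauseMillis); // let the periodic task finish its file access
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return -1;
            }
        }
        return -1;
    }
}
```
</pre>
                        <div>The caller would wrap the actual checkpoint
                          request in `tryCheckpoint` and return false
                          when it fails with an open-file error. The
                          pause only makes a collision with the periodic
                          tasks less likely; it does not rule one out.</div>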
                        <div><br>
                        </div>
                        <div>Thank you.</div>
                        <div><br>
                        </div>
                        <div>
                          <div>Best regards,</div>
                          <div>mazhen</div>
                        </div>
                      </div>
                    </div>
                  </div>
                </div>
              </div>
            </div>
          </blockquote>
        </div>
      </div>
    </blockquote>
  </div>

</blockquote></div>