From rvansa at openjdk.org Mon Nov 3 08:36:48 2025 From: rvansa at openjdk.org (Radim Vansa) Date: Mon, 3 Nov 2025 08:36:48 GMT Subject: [crac] Integrated: 8370888: [CRaC] Use better source of random for UUID generation in checkpoint path In-Reply-To: References: Message-ID: On Thu, 30 Oct 2025 11:38:54 GMT, Radim Vansa wrote: > When `-XX:CRaCCheckpointTo` contains the `%u` placeholder to generate random UUID, it should use a different source than `os::random()` which provides not-seeded, deterministic values. This pull request has now been integrated. Changeset: ba48c708 Author: Radim Vansa URL: https://git.openjdk.org/crac/commit/ba48c7080737a73d94a0f55d25d922880954edce Stats: 64 lines in 6 files changed: 53 ins; 2 del; 9 mod 8370888: [CRaC] Use better source of random for UUID generation in checkpoint path Reviewed-by: tpushkin ------------- PR: https://git.openjdk.org/crac/pull/271 From rvansa at openjdk.org Mon Nov 3 08:37:44 2025 From: rvansa at openjdk.org (Radim Vansa) Date: Mon, 3 Nov 2025 08:37:44 GMT Subject: [crac] RFR: 8370873: [CRaC] Launcher does not parse CRaC VM options from JDK_JAVA_OPTIONS In-Reply-To: References: Message-ID: On Thu, 30 Oct 2025 14:11:23 GMT, Radim Vansa wrote: > Java launcher parses command-line arguments as it needs to perform some operations before the normal parsing in `arguments.cpp`. However the options injected in JDK_JAVA_OPTIONS are ignored in this part (and so does parsing options from files...), leading to problems, e.g. checkpoint of PID 1. 
Test failure due to https://bugs.openjdk.org/browse/JDK-8368192 ------------- PR Comment: https://git.openjdk.org/crac/pull/272#issuecomment-3479410196 From rvansa at openjdk.org Mon Nov 3 08:37:45 2025 From: rvansa at openjdk.org (Radim Vansa) Date: Mon, 3 Nov 2025 08:37:45 GMT Subject: [crac] Integrated: 8370873: [CRaC] Launcher does not parse CRaC VM options from JDK_JAVA_OPTIONS In-Reply-To: References: Message-ID: <-OKjVqmpJg_aLmOdYDdoImlbKbecoKSBWotHZln_ET0=.0966ede6-ddde-4002-859c-0d19bb758ef3@github.com> On Thu, 30 Oct 2025 14:11:23 GMT, Radim Vansa wrote: > Java launcher parses command-line arguments as it needs to perform some operations before the normal parsing in `arguments.cpp`. However the options injected in JDK_JAVA_OPTIONS are ignored in this part (and so does parsing options from files...), leading to problems, e.g. checkpoint of PID 1. This pull request has now been integrated. Changeset: dc980363 Author: Radim Vansa URL: https://git.openjdk.org/crac/commit/dc980363417f5d195c7cbbc37ab74c1cf3ee1c36 Stats: 91 lines in 3 files changed: 90 ins; 1 del; 0 mod 8370873: [CRaC] Launcher does not parse CRaC VM options from JDK_JAVA_OPTIONS Reviewed-by: tpushkin ------------- PR: https://git.openjdk.org/crac/pull/272 From rvansa at openjdk.org Tue Nov 4 15:28:07 2025 From: rvansa at openjdk.org (Radim Vansa) Date: Tue, 4 Nov 2025 15:28:07 GMT Subject: [crac] RFR: 8371250: [CRaC] Remove old pid file in ImageScoreTest Message-ID: The test is executed twice (with `-XX:[+-]UsePerfData`), and when the pauseengine pidfile is not removed this could lead to races. 
------------- Commit messages: - 8371250: [CRaC] Remove old pid file in ImageScoreTest Changes: https://git.openjdk.org/crac/pull/273/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=273&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8371250 Stats: 6 lines in 1 file changed: 6 ins; 0 del; 0 mod Patch: https://git.openjdk.org/crac/pull/273.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/273/head:pull/273 PR: https://git.openjdk.org/crac/pull/273 From tpushkin at openjdk.org Tue Nov 4 15:30:56 2025 From: tpushkin at openjdk.org (Timofei Pushkin) Date: Tue, 4 Nov 2025 15:30:56 GMT Subject: [crac] RFR: 8371250: [CRaC] Remove old pid file in ImageScoreTest In-Reply-To: References: Message-ID: On Tue, 4 Nov 2025 15:20:44 GMT, Radim Vansa wrote: > The test is executed twice (with `-XX:[+-]UsePerfData`), and when the pauseengine pidfile is not removed this could lead to races. Marked as reviewed by tpushkin (Committer). ------------- PR Review: https://git.openjdk.org/crac/pull/273#pullrequestreview-3417101026 From rvansa at openjdk.org Wed Nov 5 07:33:30 2025 From: rvansa at openjdk.org (Radim Vansa) Date: Wed, 5 Nov 2025 07:33:30 GMT Subject: git: openjdk/crac: crac: 8371250: [CRaC] Remove old pid file in ImageScoreTest Message-ID: Changeset: 3468d77c Branch: crac Author: Radim Vansa Date: 2025-11-05 07:32:07 +0000 URL: https://git.openjdk.org/crac/commit/3468d77c35a7ff9239381bee07c31d8a5e88fc53 8371250: [CRaC] Remove old pid file in ImageScoreTest Reviewed-by: tpushkin ! test/jdk/jdk/crac/engine/ImageScoreTest.java From rvansa at openjdk.org Wed Nov 5 07:34:47 2025 From: rvansa at openjdk.org (Radim Vansa) Date: Wed, 5 Nov 2025 07:34:47 GMT Subject: [crac] Integrated: 8371250: [CRaC] Remove old pid file in ImageScoreTest In-Reply-To: References: Message-ID: On Tue, 4 Nov 2025 15:20:44 GMT, Radim Vansa wrote: > The test is executed twice (with `-XX:[+-]UsePerfData`), and when the pauseengine pidfile is not removed this could lead to races. 
This pull request has now been integrated. Changeset: 3468d77c Author: Radim Vansa URL: https://git.openjdk.org/crac/commit/3468d77c35a7ff9239381bee07c31d8a5e88fc53 Stats: 6 lines in 1 file changed: 6 ins; 0 del; 0 mod 8371250: [CRaC] Remove old pid file in ImageScoreTest Reviewed-by: tpushkin ------------- PR: https://git.openjdk.org/crac/pull/273 From rvansa at openjdk.org Wed Nov 5 12:28:34 2025 From: rvansa at openjdk.org (Radim Vansa) Date: Wed, 5 Nov 2025 12:28:34 GMT Subject: [crac] RFR: 8371337: [CRaC] Fastdebug build fails when calling OperatingSystemMxBean.getProcessCpuLoad after restore Message-ID: A JVM that executed OperatingSystemImpl.getProcessCpuLoad() before checkpoint can fail with an assertion failure after restore: java: /home/rvansa/work/zulu/src/jdk.management/linux/native/libmanagement_ext/UnixOperatingSystem.c:291: get_cpuload_internal: Assertion `pticks->usedKernel >= tmp.usedKernel' failed. This is an assertion failure, so it fails only in debug builds; release builds probably return a nonsensical value. We should remove the assertion and return a negative value (documented as the value for "unavailable") if this is detected.
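A minimal sketch of the proposed behaviour, written in Java for illustration only. The real logic lives in the C function `get_cpuload_internal`; the class `CpuLoadSketch`, the `Ticks` holder, and the exact load formula below are assumptions made for this example and are not copied from `UnixOperatingSystem.c`. The point is only the guard: if any counter has gone backwards (as can happen after restore), report a negative value instead of asserting.

```java
public class CpuLoadSketch {
    // Hypothetical snapshot of process tick counters (names mirror the C fields).
    static final class Ticks {
        final long used;        // user-mode ticks
        final long usedKernel;  // kernel-mode ticks
        final long total;       // total elapsed ticks
        Ticks(long used, long usedKernel, long total) {
            this.used = used;
            this.usedKernel = usedKernel;
            this.total = total;
        }
    }

    // Returns a load in [0, 1], or -1.0 ("unavailable") when the counters
    // moved backwards between the two snapshots.
    static double processCpuLoad(Ticks previous, Ticks current) {
        if (current.used < previous.used
                || current.usedKernel < previous.usedKernel
                || current.total < previous.total) {
            return -1.0; // restored process looks 'younger' than the last sample
        }
        long totalDelta = current.total - previous.total;
        if (totalDelta == 0) {
            return 0.0; // no time elapsed, nothing to report
        }
        long busyDelta = (current.used - previous.used)
                + (current.usedKernel - previous.usedKernel);
        double load = (double) busyDelta / totalDelta;
        // Sanitize to the documented range.
        return Math.max(0.0, Math.min(1.0, load));
    }

    public static void main(String[] args) {
        Ticks beforeCheckpoint = new Ticks(100, 50, 1000);
        Ticks laterSample = new Ticks(160, 70, 1200);
        Ticks afterRestore = new Ticks(5, 2, 40);
        System.out.println(processCpuLoad(beforeCheckpoint, laterSample));
        System.out.println(processCpuLoad(beforeCheckpoint, afterRestore));
    }
}
```

Returning a negative value keeps callers on the documented "unavailable" path rather than handing them a ratio computed from mismatched samples.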
------------- Commit messages: - 8371337: [CRaC] Fastdebug build fails when calling OperatingSystemMxBean.getProcessCpuLoad after restore Changes: https://git.openjdk.org/crac/pull/274/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=274&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8371337 Stats: 65 lines in 2 files changed: 62 ins; 3 del; 0 mod Patch: https://git.openjdk.org/crac/pull/274.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/274/head:pull/274 PR: https://git.openjdk.org/crac/pull/274 From rvansa at openjdk.org Fri Nov 7 10:05:13 2025 From: rvansa at openjdk.org (Radim Vansa) Date: Fri, 7 Nov 2025 10:05:13 GMT Subject: [crac] RFR: 8371462: [CRaC] Improve listening socket reopen through FD policies Message-ID: <6boKY59T6obbYs8k9MV1TcIOW7wj3gcW3-5bULTFc30=.dfcf5952-992a-44d7-832c-f3e632cc8069@github.com> The current tests for listening socket reopen have two defects: * there is no thread calling accept() during the checkpoint, while normally servers would be waiting for new connections * non-blocking server sockets where `Selector.select()` is used for the blocking is not exercised at all As a result the scenario with Selector.select() fails to even close such server socket, due to FDs not in the `EpollSelectorImpl` not being processed after the selection keys are cancelled. Second problem is the reopen: servers can use long-lived key with `SelectionKey.OP_ACCEPT` interest. When the server socket is closed the key is cancelled; to handle this transparently we need to make the key valid again and register it with the channel and selector. 
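To make the second problem concrete, here is a small sketch that uses only standard `java.nio` APIs (no CRaC classes). It shows a non-blocking server channel registered with a `Selector` for `OP_ACCEPT`, and that the long-lived key stops being valid once the channel is closed. The class and method names are hypothetical, and the sketch does not attempt the key revalidation described above, which depends on JDK internals.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

public class AcceptKeySketch {
    // Returns { key valid before close, key valid after close }.
    static boolean[] keyValidityAroundClose() {
        try (Selector selector = Selector.open()) {
            ServerSocketChannel server = ServerSocketChannel.open();
            // Bind to an ephemeral loopback port so the sketch needs no configuration.
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            // Channels must be non-blocking before they can be registered.
            server.configureBlocking(false);
            SelectionKey key = server.register(selector, SelectionKey.OP_ACCEPT);

            boolean validBefore = key.isValid();
            // Closing the channel cancels every key registered for it.
            server.close();
            boolean validAfter = key.isValid();
            return new boolean[] { validBefore, validAfter };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] validity = keyValidityAroundClose();
        System.out.println("valid before close: " + validity[0]);
        System.out.println("valid after close: " + validity[1]);
    }
}
```

A server loop that holds on to such a key across a checkpoint would see it invalid afterwards unless the key is made valid and registered again, which is the transparent handling this change aims for.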
------------- Commit messages: - 8371462: [CRaC] Improve listening socket reopen through FD policies Changes: https://git.openjdk.org/crac/pull/275/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=275&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8371462 Stats: 336 lines in 16 files changed: 262 ins; 23 del; 51 mod Patch: https://git.openjdk.org/crac/pull/275.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/275/head:pull/275 PR: https://git.openjdk.org/crac/pull/275 From rvansa at openjdk.org Fri Nov 7 16:15:52 2025 From: rvansa at openjdk.org (Radim Vansa) Date: Fri, 7 Nov 2025 16:15:52 GMT Subject: [crac] RFR: 8371462: [CRaC] Improve listening socket reopen through FD policies In-Reply-To: <6boKY59T6obbYs8k9MV1TcIOW7wj3gcW3-5bULTFc30=.dfcf5952-992a-44d7-832c-f3e632cc8069@github.com> References: <6boKY59T6obbYs8k9MV1TcIOW7wj3gcW3-5bULTFc30=.dfcf5952-992a-44d7-832c-f3e632cc8069@github.com> Message-ID: On Fri, 7 Nov 2025 10:00:10 GMT, Radim Vansa wrote: > The current tests for listening socket reopen have two defects: > * there is no thread calling accept() during the checkpoint, while normally servers would be waiting for new connections > * non-blocking server sockets where `Selector.select()` is used for the blocking is not exercised at all > > As a result the scenario with Selector.select() fails to even close such server socket, due to FDs not in the `EpollSelectorImpl` not being processed after the selection keys are cancelled. > > Second problem is the reopen: servers can use long-lived key with `SelectionKey.OP_ACCEPT` interest. When the server socket is closed the key is cancelled; to handle this transparently we need to make the key valid again and register it with the channel and selector. I'll investigate the failures on other platforms; I expect the fixes to not affect existing code (for Linux) too much. Please proceed with review. 
------------- PR Comment: https://git.openjdk.org/crac/pull/275#issuecomment-3503450733 From rvansa at openjdk.org Mon Nov 10 10:26:43 2025 From: rvansa at openjdk.org (Radim Vansa) Date: Mon, 10 Nov 2025 10:26:43 GMT Subject: [crac] RFR: 8371462: [CRaC] Improve listening socket reopen through FD policies [v2] In-Reply-To: <6boKY59T6obbYs8k9MV1TcIOW7wj3gcW3-5bULTFc30=.dfcf5952-992a-44d7-832c-f3e632cc8069@github.com> References: <6boKY59T6obbYs8k9MV1TcIOW7wj3gcW3-5bULTFc30=.dfcf5952-992a-44d7-832c-f3e632cc8069@github.com> Message-ID: > The current tests for listening socket reopen have two defects: > * there is no thread calling accept() during the checkpoint, while normally servers would be waiting for new connections > * non-blocking server sockets where `Selector.select()` is used for the blocking is not exercised at all > > As a result the scenario with Selector.select() fails to even close such server socket, due to FDs not in the `EpollSelectorImpl` not being processed after the selection keys are cancelled. > > Second problem is the reopen: servers can use long-lived key with `SelectionKey.OP_ACCEPT` interest. When the server socket is closed the key is cancelled; to handle this transparently we need to make the key valid again and register it with the channel and selector. 
Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Disable selector test on Linux ------------- Changes: - all: https://git.openjdk.org/crac/pull/275/files - new: https://git.openjdk.org/crac/pull/275/files/b0e142eb..de7ff6e0 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=275&range=01 - incr: https://webrevs.openjdk.org/?repo=crac&pr=275&range=00-01 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/crac/pull/275.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/275/head:pull/275 PR: https://git.openjdk.org/crac/pull/275 From rvansa at openjdk.org Mon Nov 10 10:26:44 2025 From: rvansa at openjdk.org (Radim Vansa) Date: Mon, 10 Nov 2025 10:26:44 GMT Subject: [crac] RFR: 8371462: [CRaC] Improve listening socket reopen through FD policies In-Reply-To: <6boKY59T6obbYs8k9MV1TcIOW7wj3gcW3-5bULTFc30=.dfcf5952-992a-44d7-832c-f3e632cc8069@github.com> References: <6boKY59T6obbYs8k9MV1TcIOW7wj3gcW3-5bULTFc30=.dfcf5952-992a-44d7-832c-f3e632cc8069@github.com> Message-ID: On Fri, 7 Nov 2025 10:00:10 GMT, Radim Vansa wrote: > The current tests for listening socket reopen have two defects: > * there is no thread calling accept() during the checkpoint, while normally servers would be waiting for new connections > * non-blocking server sockets where `Selector.select()` is used for the blocking is not exercised at all > > As a result the scenario with Selector.select() fails to even close such server socket, due to FDs not in the `EpollSelectorImpl` not being processed after the selection keys are cancelled. > > Second problem is the reopen: servers can use long-lived key with `SelectionKey.OP_ACCEPT` interest. When the server socket is closed the key is cancelled; to handle this transparently we need to make the key valid again and register it with the channel and selector. I realized that we don't support C/R in the selector implementation on other platforms either. 
Filed https://bugs.openjdk.org/browse/JDK-8371549 and deferred the resolution. ------------- PR Comment: https://git.openjdk.org/crac/pull/275#issuecomment-3510694499 From tpushkin at openjdk.org Wed Nov 12 08:29:44 2025 From: tpushkin at openjdk.org (Timofei Pushkin) Date: Wed, 12 Nov 2025 08:29:44 GMT Subject: [crac] RFR: 8371337: [CRaC] Fastdebug build fails when calling OperatingSystemMxBean.getProcessCpuLoad after restore In-Reply-To: References: Message-ID: On Wed, 5 Nov 2025 12:21:23 GMT, Radim Vansa wrote: > A JVM that executed OperatingSystemImpl.getProcessCpuLoad() before checkpoint can fail with an assertion failure after restore: > > java: /home/rvansa/work/zulu/src/jdk.management/linux/native/libmanagement_ext/UnixOperatingSystem.c:291: get_cpuload_internal: Assertion `pticks->usedKernel >= tmp.usedKernel' failed. > > This is an assertion failure, so it fails only in debug builds; release builds probably return a nonsensical value. We should remove the assertion and return a negative value (documented as the value for "unavailable") if this is detected. src/jdk.management/linux/native/libmanagement_ext/UnixOperatingSystem.c line 290: > 288: // After restoring with CRaC the new process can appear 'younger' > 289: // than last value in counters - we will return -1 (unavailable). > 290: if (!failed && pticks->usedKernel >= tmp.usedKernel) { I haven't read the precise definitions of `used`, `usedKernel`, `total`, is checking only `usedKernel` enough to guarantee the comparison is also valid for the other two values? Shouldn't we check both `used` and `usedKernel` (assuming `total` is exactly their sum, which may not be the case) in case only one of them has gone backwards?
------------- PR Review Comment: https://git.openjdk.org/crac/pull/274#discussion_r2517357741 From tpushkin at openjdk.org Wed Nov 12 09:00:29 2025 From: tpushkin at openjdk.org (Timofei Pushkin) Date: Wed, 12 Nov 2025 09:00:29 GMT Subject: [crac] RFR: 8371462: [CRaC] Improve listening socket reopen through FD policies [v2] In-Reply-To: References: <6boKY59T6obbYs8k9MV1TcIOW7wj3gcW3-5bULTFc30=.dfcf5952-992a-44d7-832c-f3e632cc8069@github.com> Message-ID: On Mon, 10 Nov 2025 10:26:43 GMT, Radim Vansa wrote: >> The current tests for listening socket reopen have two defects: >> * there is no thread calling accept() during the checkpoint, while normally servers would be waiting for new connections >> * non-blocking server sockets where `Selector.select()` is used for the blocking is not exercised at all >> >> As a result the scenario with Selector.select() fails to even close such server socket, due to FDs not in the `EpollSelectorImpl` not being processed after the selection keys are cancelled. >> >> Second problem is the reopen: servers can use long-lived key with `SelectionKey.OP_ACCEPT` interest. When the server socket is closed the key is cancelled; to handle this transparently we need to make the key valid again and register it with the channel and selector. > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Disable selector test on Linux A test fails on Windows - should it be marked Linux-only until JDK-8371549 is resolved? ------------- PR Review: https://git.openjdk.org/crac/pull/275#pullrequestreview-3452311563 From mz1999 at gmail.com Wed Nov 12 09:29:27 2025 From: mz1999 at gmail.com (ma zhen) Date: Wed, 12 Nov 2025 17:29:27 +0800 Subject: CRaC: CheckpointException with file descriptors from JVM internals and native calls Message-ID: Hi everyone, I'm encountering a CheckpointException when creating a checkpoint image with CRaC.
The root cause is that the application holds file descriptors for files or directories. Our application is quite complex, and after some investigation, I've found that these files/directories are being opened by third-party libraries. The challenge is that they are not opened through regular file I/O APIs, which makes it impossible to handle them using File Descriptor Policies. I've identified two specific scenarios: 1. A third-party library periodically fetches system resource information, which includes calling `OperatingSystemMXBean.getAvailableProcessors`. When the JVM determines the number of available CPU cores, if it detects that cgroups are available, it will read the resource limit file `cpu.cfs_quota_us`, even if the process is not in a container. The specific implementation logic can be found in cgroupV1Subsystem_linux.cpp: ( https://github.com/openjdk/crac/blob/crac/src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp ) If a checkpoint is triggered at this exact moment, an exception similar to the following occurs: Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenFileException: FD fd=57 type=regular path=/sys/fs/cgroup/cpu,cpuacct/user.slice/cpu.cfs_quota_us at java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(Core.java:115) at java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(Core.java:189) at java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:315) at java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(Core.java:328) 2. For some reason, a third-party library periodically calls `File.list` to get the list of files in a specific directory. On Linux, the `list` method eventually calls the JNI method `Java_java_io_UnixFileSystem_list` which holds a directory file descriptor during its execution. 
This is defined in UnixFileSystem_md.c: ( https://github.com/openjdk/crac/blob/crac/src/java.base/unix/native/libjava/UnixFileSystem_md.c ) Similarly, if a checkpoint is triggered at this moment, an exception like the one below is thrown: jdk.internal.crac.mirror.CheckpointException Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenFileException: FD fd=46 type=directory path=.../WEB-INF/classes/WEB-INF/services at java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(Core.java:115) at java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(Core.java:189) at java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:315) at java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(Core.java:328) In both situations, if a checkpoint coincides with the execution of these periodic tasks, the checkpoint is likely to fail. My current workaround is to attempt the checkpoint multiple times, as it will eventually succeed. While this allows me to bypass the issue, I would like to know if there is a more optimal solution. Thank you. Best regards, mazhen -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From duke at openjdk.org Wed Nov 12 11:10:41 2025 From: duke at openjdk.org (mazhen) Date: Wed, 12 Nov 2025 11:10:41 GMT Subject: [crac] RFR: 8371462: [CRaC] Improve listening socket reopen through FD policies [v2] In-Reply-To: References: <6boKY59T6obbYs8k9MV1TcIOW7wj3gcW3-5bULTFc30=.dfcf5952-992a-44d7-832c-f3e632cc8069@github.com> Message-ID: On Mon, 10 Nov 2025 10:26:43 GMT, Radim Vansa wrote: >> The current tests for listening socket reopen have two defects: >> * there is no thread calling accept() during the checkpoint, while normally servers would be waiting for new connections >> * non-blocking server sockets where `Selector.select()` is used for the blocking is not exercised at all >> >> As a result the scenario with Selector.select() fails to even close such server socket, due to FDs not in the `EpollSelectorImpl` not being processed after the selection keys are cancelled. >> >> Second problem is the reopen: servers can use long-lived key with `SelectionKey.OP_ACCEPT` interest. When the server socket is closed the key is cancelled; to handle this transparently we need to make the key valid again and register it with the channel and selector. > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Disable selector test on Linux I wanted to share a thought on the current implementation. The present design affects all SelectionKey cancel operations, not just CRaC scenarios. An alternative worth considering would be to relocate the reopen callback registration logic into `ServerSocketChannelImpl.closeBeforeCheckpoint()`. This method lives squarely within the CRaC lifecycle, making it a more natural home for such operations. 
Just before the channel closes, we could iterate through its associated SelectionKeys and register the necessary restoration callbacks:

    // In ServerSocketChannelImpl.closeBeforeCheckpoint()
    @Override
    protected void closeBeforeCheckpoint() throws IOException {
        // Register reopen callbacks for all associated keys before closing
        forEach(key -> {
            if (key instanceof SelectionKeyImpl ski && ski.isValid()) {
                Consumer enqueue = JDKSocketResourceBase.reopenQueue(this);
                if (enqueue != null) {
                    enqueue.accept(() -> {
                        // Reopen logic: revalidate key, re-register with selector, etc.
                    });
                }
            }
        });
        close(); // Then proceed with normal close
    }

The implementation requires changing the visibility of AbstractSelectableChannel.forEach to protected. ------------- PR Comment: https://git.openjdk.org/crac/pull/275#issuecomment-3521385404 From rvansa at openjdk.org Thu Nov 13 08:33:49 2025 From: rvansa at openjdk.org (Radim Vansa) Date: Thu, 13 Nov 2025 08:33:49 GMT Subject: [crac] RFR: 8371462: [CRaC] Improve listening socket reopen through FD policies [v2] In-Reply-To: References: <6boKY59T6obbYs8k9MV1TcIOW7wj3gcW3-5bULTFc30=.dfcf5952-992a-44d7-832c-f3e632cc8069@github.com> Message-ID: On Wed, 12 Nov 2025 08:58:00 GMT, Timofei Pushkin wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Disable selector test on Linux > > A test fails on Windows - should it be marked Linux-only until JDK-8371549 is resolved?
@TimPushkin I'll fix the Windows failure, already stumbled upon that in https://bugs.openjdk.org/browse/JDK-8371549 ------------- PR Comment: https://git.openjdk.org/crac/pull/275#issuecomment-3526421009 From rvansa at openjdk.org Thu Nov 13 09:06:00 2025 From: rvansa at openjdk.org (Radim Vansa) Date: Thu, 13 Nov 2025 09:06:00 GMT Subject: [crac] RFR: 8371462: [CRaC] Improve listening socket reopen through FD policies [v2] In-Reply-To: References: <6boKY59T6obbYs8k9MV1TcIOW7wj3gcW3-5bULTFc30=.dfcf5952-992a-44d7-832c-f3e632cc8069@github.com> Message-ID: On Wed, 12 Nov 2025 11:08:03 GMT, mazhen wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Disable selector test on Linux > > I wanted to share a thought on the current implementation. > > The present design affects all SelectionKey cancel operations, not just CRaC scenarios. > > An alternative worth considering would be to relocate the reopen callback registration logic into `ServerSocketChannelImpl.closeBeforeCheckpoint()`. This method lives squarely within the CRaC lifecycle, making it a more natural home for such operations. Just before the channel closes, we could iterate through its associated SelectionKeys and register the necessary restoration callbacks: > > > // In ServerSocketChannelImpl.closeBeforeCheckpoint() > @Override > protected void closeBeforeCheckpoint() throws IOException { > // Register reopen callbacks for all associated keys before closing > forEach(key -> { > if (key instanceof SelectionKeyImpl ski && ski.isValid()) { > Consumer enqueue = JDKSocketResourceBase.reopenQueue(this); > if (enqueue != null) { > enqueue.accept(() -> { > // Reopen logic: revalidate key, re-register with selector, et... @mz1999 As you noticed, we would need to somehow expose `AbstractSelectableChannel.forEach` - I would be very careful about changing a visibility of a method in a public class (in a public package). 
That could be a breaking change if an inheriting class already has such a method. We could make that package-private and access that through `SharedSecrets`, though. I also wanted to avoid an extra method if possible - the overhead in `cancel()` of checking a key in an empty CHM seemed acceptable for non-CRaC scenarios. I can try out what your approach looks like, though. ------------- PR Comment: https://git.openjdk.org/crac/pull/275#issuecomment-3526598500 From rvansa at openjdk.org Fri Nov 14 07:02:08 2025 From: rvansa at openjdk.org (Radim Vansa) Date: Fri, 14 Nov 2025 07:02:08 GMT Subject: [crac] RFR: 8371337: [CRaC] Fastdebug build fails when calling OperatingSystemMxBean.getProcessCpuLoad after restore In-Reply-To: References: Message-ID: On Wed, 12 Nov 2025 08:23:53 GMT, Timofei Pushkin wrote: >> A JVM that executed OperatingSystemImpl.getProcessCpuLoad() before checkpoint can fail with an assertion failure after restore: >> >> java: /home/rvansa/work/zulu/src/jdk.management/linux/native/libmanagement_ext/UnixOperatingSystem.c:291: get_cpuload_internal: Assertion `pticks->usedKernel >= tmp.usedKernel' failed. >> >> This is an assertion failure, so it fails only in debug builds; release builds probably return a nonsensical value. We should remove the assertion and return a negative value (documented as the value for "unavailable") if this is detected. > > src/jdk.management/linux/native/libmanagement_ext/UnixOperatingSystem.c line 290: > >> 288: // After restoring with CRaC the new process can appear 'younger' >> 289: // than last value in counters - we will return -1 (unavailable). >> 290: if (!failed && pticks->usedKernel >= tmp.usedKernel) { > > I haven't read the precise definitions of `used`, `usedKernel`, `total`, is checking only `usedKernel` enough to guarantee the comparison is also valid for the other two values?
Shouldn't we check both `used` and `usedKernel` (assuming `total` is exactly their sum, which may not be the case) in case only one of them has gone backwards? The whole check is quite indeterministic: if we consider this is called at arbitrary point, the value of `usedKernel` can be just right and the results would not be correct anyway. I think that it's important that at the end the result is sanitized to a safe range 0 - 1. If we consider a more realistic use case, e.g. calling this every 5 seconds, I would expect that before checkpoint the application spent significantly more time before checkpoint than it did until the first hit after restore, so this will detect potentially invalid result. My main motivation was to avoid the crash in debug builds. We could perfect this by adding a hook that will reset the values during C/R but I would opt for a minimalistic change. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/274#discussion_r2526033290 From mz1999 at gmail.com Fri Nov 14 08:01:45 2025 From: mz1999 at gmail.com (ma zhen) Date: Fri, 14 Nov 2025 16:01:45 +0800 Subject: CRaC: CheckpointException with file descriptors from JVM internals and native calls In-Reply-To: References: Message-ID: Hi everyone, Following up on my own question, I believe I've found a suitable solution and wanted to share it for the archives. The issue was resolved using the VM option `-XX:CRaCAllowedOpenFilePrefixes`. This option lets you specify a comma-separated list of path prefixes that CRaC should ignore if they are found open during a checkpoint. (Reference: https://docs.azul.com/crac/usage/vm-options) Crucially, and what makes it a perfect solution for my original problem, is that this option works for files opened by native code (e.g., via JNI or internal JVM functions). This is why it can handle the file descriptors that were not manageable through standard CRaC resource policies. This directly addresses the two scenarios I described: 1. 
For the cgroup file opened by `OperatingSystemMXBean`, I can now add `/sys/fs/cgroup/` to the allowed prefixes. 2. For the directory descriptor held open by the native implementation of `File.list`, adding the application's base path works perfectly. This provides a much more robust solution than retrying the checkpoint. I hope this is helpful for anyone else running into similar issues. Best regards, mazhen On Wed, Nov 12, 2025 at 17:29, ma zhen wrote: > Hi everyone, > > I'm encountering a CheckpointException when creating a checkpoint image > with CRaC. The root cause is that the application holds file descriptors > for files or directories. > > Our application is quite complex, and after some investigation, I've found > that these files/directories are being opened by third-party libraries. > The challenge is that they are not opened through regular file I/O APIs, > which makes it impossible to handle them using File Descriptor Policies. > > I've identified two specific scenarios: > > 1. A third-party library periodically fetches system resource information, > which includes calling `OperatingSystemMXBean.getAvailableProcessors`. > > When the JVM determines the number of available CPU cores, if it detects > that cgroups are available, it will read the resource limit file > `cpu.cfs_quota_us`, even if the process is not in a container.
> The specific implementation logic can be found in > cgroupV1Subsystem_linux.cpp: > ( > https://github.com/openjdk/crac/blob/crac/src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp > ) > > If a checkpoint is triggered at this exact moment, an exception > similar to the following occurs: > > Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenFileException: > FD fd=57 type=regular > path=/sys/fs/cgroup/cpu,cpuacct/user.slice/cpu.cfs_quota_us > at > java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(Core.java:115) > at > java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(Core.java:189) > at > java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:315) > at > java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(Core.java:328) > > 2. For some reason, a third-party library periodically calls `File.list` > to get the list of files in a specific directory. > > On Linux, the `list` method eventually calls the JNI method > `Java_java_io_UnixFileSystem_list` which holds a directory file > descriptor during its execution. This is defined in UnixFileSystem_md.c: > ( > https://github.com/openjdk/crac/blob/crac/src/java.base/unix/native/libjava/UnixFileSystem_md.c > ) > > Similarly, if a checkpoint is triggered at this moment, an exception > like the one below is thrown: > > jdk.internal.crac.mirror.CheckpointException > Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenFileException: > FD fd=46 type=directory path=.../WEB-INF/classes/WEB-INF/services > at > java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(Core.java:115) > at > java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(Core.java:189) > at > java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:315) > at > java.base/jdk.internal.crac.mirror.Core.checkpointRestoreInternal(Core.java:328) > > > In both situations, if a checkpoint coincides with the execution of these > periodic tasks, the checkpoint is likely to fail. 
>
> My current workaround is to attempt the checkpoint multiple times, as it
> will eventually succeed. While this allows me to bypass the issue, I would
> like to know if there is a more optimal solution.
>
> Thank you.
>
> Best regards,
> mazhen

From tpushkin at openjdk.org Fri Nov 14 08:23:41 2025
From: tpushkin at openjdk.org (Timofei Pushkin)
Date: Fri, 14 Nov 2025 08:23:41 GMT
Subject: [crac] RFR: 8371337: [CRaC] Fastdebug build fails when calling OperatingSystemMxBean.getProcessCpuLoad after restore
In-Reply-To:
References:
Message-ID:

On Fri, 14 Nov 2025 06:59:00 GMT, Radim Vansa wrote:

>> src/jdk.management/linux/native/libmanagement_ext/UnixOperatingSystem.c line 290:
>>
>>> 288: // After restoring with CRaC the new process can appear 'younger'
>>> 289: // than last value in counters - we will return -1 (unavailable).
>>> 290: if (!failed && pticks->usedKernel >= tmp.usedKernel) {
>>
>> I haven't read the precise definitions of `used`, `usedKernel`, `total`, is checking only `usedKernel` enough to guarantee the comparison is also valid for the other two values? Shouldn't we check both `used` and `usedKernel` (assuming `total` is exactly their sum, which may not be the case) in case only one of them has gone backwards?
>
> The whole check is quite indeterministic: if we consider this is called at arbitrary point, the value of `usedKernel` can be just right and the results would not be correct anyway. I think that it's important that at the end the result is sanitized to a safe range 0 - 1.
> If we consider a more realistic use case, e.g. calling this every 5 seconds, I would expect that before checkpoint the application spent significantly more time before checkpoint than it did until the first hit after restore, so this will detect potentially invalid result.
> My main motivation was to avoid the crash in debug builds.
> We could perfect this by adding a hook that will reset the values during C/R but I would opt for a minimalistic change.

To me it looks like we should either check all three `used*` variables (because they are not fully dependent and the checks are cheap) or not check any of them (because checking does not guarantee correctness anyway).

-------------

PR Review Comment: https://git.openjdk.org/crac/pull/274#discussion_r2526406827

From rvansa at openjdk.org Fri Nov 14 09:49:56 2025
From: rvansa at openjdk.org (Radim Vansa)
Date: Fri, 14 Nov 2025 09:49:56 GMT
Subject: [crac] RFR: 8371337: [CRaC] Fastdebug build fails when calling OperatingSystemMxBean.getProcessCpuLoad after restore
In-Reply-To:
References:
Message-ID:

On Fri, 14 Nov 2025 08:20:48 GMT, Timofei Pushkin wrote:

>> The whole check is quite indeterministic: if we consider this is called at arbitrary point, the value of `usedKernel` can be just right and the results would not be correct anyway. I think that it's important that at the end the result is sanitized to a safe range 0 - 1.
>> If we consider a more realistic use case, e.g. calling this every 5 seconds, I would expect that before checkpoint the application spent significantly more time before checkpoint than it did until the first hit after restore, so this will detect potentially invalid result.
>> My main motivation was to avoid the crash in debug builds. We could perfect this by adding a hook that will reset the values during C/R but I would opt for a minimalistic change.
>
> To me it looks like we should either check all three `used*` variables (because they are not fully dependent and the checks are cheap) or not check any of them (because checking does not guarantee correctness anyway)

Ok, it won't hurt.
-------------

PR Review Comment: https://git.openjdk.org/crac/pull/274#discussion_r2526771510

From rvansa at azul.com Tue Nov 18 18:26:34 2025
From: rvansa at azul.com (Radim Vansa)
Date: Tue, 18 Nov 2025 19:26:34 +0100
Subject: CRaC: CheckpointException with file descriptors from JVM internals and native calls
In-Reply-To:
References:
Message-ID: <19e9d8b4-ed83-4185-83e5-dce3f1f1a398@azul.com>

Hello ma zhen,

apologies for the late response.

In general, both FD policies and CRaCAllowedOpenFilePrefixes are really workarounds for apps that don't adhere to CRaC requirements, rather than proper solutions. But let's talk about the problems individually:

1) When it comes to getAvailableProcessors(), I think that opening the cgroups info is an implementation detail, and the CRaC JVM should handle that transparently. There should be a hook (either in Java code or in native code, whichever is less intrusive) that makes the file access and C/R mutually exclusive. We will gladly accept a PR (with a test case, please).

2) Listing files is an interaction with the environment, and the application should stop that during C/R. Your observation about FD policies makes sense; in fact, in this case there is no resource that could be linked to the FD policies. We would have to explicitly synchronize with C/R, and that would be expensive for such a common function. From a practical point of view, I understand that you can't easily modify the third-party library, and I am glad that it works for you. Note, though, that CRaCAllowedOpenFilePrefixes basically relies on the C/R engine to handle that FD correctly, and if you attempt to restore on a system that does not host this directory, the restore will fail.

Technically, getAvailableProcessors() is also an interaction with the 'environment', that is, with the machine it is currently running on, but the world is not black and white, and my opinion is that this should be transparent.
Radim

From mz1999 at gmail.com Fri Nov 21 07:12:15 2025
From: mz1999 at gmail.com (ma zhen)
Date: Fri, 21 Nov 2025 15:12:15 +0800
Subject: CRaC: CheckpointException with file descriptors from JVM internals and native calls
In-Reply-To: <19e9d8b4-ed83-4185-83e5-dce3f1f1a398@azul.com>
References: <19e9d8b4-ed83-4185-83e5-dce3f1f1a398@azul.com>
Message-ID:

Hi Radim,

Thank you for your detailed and candid feedback.

I fully agree with your assessment regarding both scenarios. You've clearly articulated why FD policies and CRaCAllowedOpenFilePrefixes are workarounds, and that a more transparent solution for JVM internals like getAvailableProcessors() is indeed the proper way forward.

Regarding the getAvailableProcessors() issue and your suggestion for a PR, my current thinking is to introduce a lightweight synchronization mechanism in the native CRaC code. This would involve an RAII-style guard to mark the critical section during cgroup file access, ensuring mutual exclusion with checkpoint operations.
I would be glad to attempt implementing this and contributing a PR with a test case.

Best regards,
mazhen

From rvansa at openjdk.org Fri Nov 21 07:37:31 2025
From: rvansa at openjdk.org (Radim Vansa)
Date: Fri, 21 Nov 2025 07:37:31 GMT
Subject: [crac] RFR: 8371337: [CRaC] Fastdebug build fails when calling OperatingSystemMxBean.getProcessCpuLoad after restore [v2]
In-Reply-To:
References:
Message-ID:

> A JVM that executed OperatingSystemImpl.getProcessCpuLoad() before checkpoint can fail with assertion failure after restore with:
>
> java: /home/rvansa/work/zulu/src/jdk.management/linux/native/libmanagement_ext/UnixOperatingSystem.c:291: get_cpuload_internal: Assertion `pticks->usedKernel >= tmp.usedKernel' failed.
>
> This is an assertion failure, therefore failing only in debug builds, and probably providing a nonsense value in release builds. We should remove the assertion and return a negative value (documented as the value for "unavailable") if this is detected.

Radim Vansa has updated the pull request incrementally with one additional commit since the last revision:

  Check all partial counters

-------------

Changes:
  - all: https://git.openjdk.org/crac/pull/274/files
  - new: https://git.openjdk.org/crac/pull/274/files/284b7130..23d66dda

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=crac&pr=274&range=01
 - incr: https://webrevs.openjdk.org/?repo=crac&pr=274&range=00-01

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/crac/pull/274.diff
  Fetch: git fetch https://git.openjdk.org/crac.git pull/274/head:pull/274

PR: https://git.openjdk.org/crac/pull/274

From rvansa at azul.com Fri Nov 21 16:37:13 2025
From: rvansa at azul.com (Radim Vansa)
Date: Fri, 21 Nov 2025 17:37:13 +0100
Subject: CRaC: CheckpointException with file descriptors from JVM internals and native calls
In-Reply-To:
References: <19e9d8b4-ed83-4185-83e5-dce3f1f1a398@azul.com>
Message-ID: <83ddc074-7ba8-4346-a469-be8baf83f573@azul.com>

That's great. While similar hooks currently don't use RAII, I think that's a reliable way to implement this. Please make sure that your implementation uses RW locking, so that it does not force mutual exclusion and unintended synchronization when no checkpoint is happening. Alternatively, it might be possible to mark the entry to this section as critical and prevent the VM thread from executing the C/R; I am not sure which alternative is more lightweight.

Thanks in advance for the contribution!

Radim
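The read-write locking suggested in this message could look roughly like the following C++ sketch. It is only an illustration under stated assumptions: every name in it (`checkpoint_lock`, `FileAccessGuard`, `CheckpointGuard`, `checkpoint_possible_now`) is hypothetical and not taken from the CRaC sources, and it uses `std::shared_mutex` from the standard library, whereas HotSpot code would presumably rely on the VM's own synchronization primitives. Code that briefly opens a file takes the lock in shared mode, so such readers never wait for each other; only a checkpoint takes it exclusively.

```cpp
#include <mutex>
#include <shared_mutex>

// All names below are hypothetical; this is not the CRaC implementation.

// One process-wide lock: file readers share it, a checkpoint owns it alone.
inline std::shared_mutex& checkpoint_lock() {
    static std::shared_mutex lock;
    return lock;
}

// RAII guard for code that briefly holds a file descriptor open
// (for example while reading a cgroup limit file).
// Guards in different threads can coexist; they block only while a
// CheckpointGuard is alive.
class FileAccessGuard {
public:
    FileAccessGuard() : held_(checkpoint_lock()) {}  // shared acquisition
private:
    std::shared_lock<std::shared_mutex> held_;       // released in destructor
};

// RAII guard for the checkpoint: waits until all FileAccessGuards are gone
// and keeps new ones out until it is destroyed.
class CheckpointGuard {
public:
    CheckpointGuard() : held_(checkpoint_lock()) {}  // exclusive acquisition
private:
    std::unique_lock<std::shared_mutex> held_;       // released in destructor
};

// Non-blocking probe: returns true when the lock could be taken exclusively
// right now. Must be called from a thread that holds neither guard.
inline bool checkpoint_possible_now() {
    std::unique_lock<std::shared_mutex> probe(checkpoint_lock(),
                                              std::try_to_lock);
    return probe.owns_lock();  // the probe releases the lock on return
}
```

In such a design, the reader would wrap the open/read/close of `cpu.cfs_quota_us` in a scope holding a `FileAccessGuard`, and the checkpoint path would hold a `CheckpointGuard` while open file descriptors are inspected. Note that `std::shared_mutex` does not allow a thread to acquire a lock it already holds, so these guards must not be nested within one thread.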
From rvansa at openjdk.org Fri Nov 21 17:36:12 2025
From: rvansa at openjdk.org (Radim Vansa)
Date: Fri, 21 Nov 2025 17:36:12 GMT
Subject: [crac] RFR: 8371337: [CRaC] Fastdebug build fails when calling OperatingSystemMxBean.getProcessCpuLoad after restore [v2]
In-Reply-To:
References:
Message-ID:

On Fri, 21 Nov 2025 07:37:31 GMT, Radim Vansa wrote:

>> A JVM that executed OperatingSystemImpl.getProcessCpuLoad() before checkpoint can fail with assertion failure after restore with:
>>
>> java: /home/rvansa/work/zulu/src/jdk.management/linux/native/libmanagement_ext/UnixOperatingSystem.c:291: get_cpuload_internal: Assertion `pticks->usedKernel >= tmp.usedKernel' failed.
>>
>> This is an assertion failure, therefore failing only in debug builds, and probably providing a nonsense value in release builds. We should remove the assertion and return a negative value (documented as the value for "unavailable") if this is detected.
>
> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision:
>
>   Check all partial counters

@TimPushkin Updated (I forgot to push the change earlier), please review.

-------------

PR Comment: https://git.openjdk.org/crac/pull/274#issuecomment-3564007205

From tpushkin at openjdk.org Sat Nov 22 12:21:51 2025
From: tpushkin at openjdk.org (Timofei Pushkin)
Date: Sat, 22 Nov 2025 12:21:51 GMT
Subject: [crac] RFR: 8371337: [CRaC] Fastdebug build fails when calling OperatingSystemMxBean.getProcessCpuLoad after restore [v2]
In-Reply-To:
References:
Message-ID:

On Fri, 21 Nov 2025 07:37:31 GMT, Radim Vansa wrote:

>> A JVM that executed OperatingSystemImpl.getProcessCpuLoad() before checkpoint can fail with assertion failure after restore with:
>>
>> java: /home/rvansa/work/zulu/src/jdk.management/linux/native/libmanagement_ext/UnixOperatingSystem.c:291: get_cpuload_internal: Assertion `pticks->usedKernel >= tmp.usedKernel' failed.
>>
>> This is an assertion failure, therefore failing only in debug builds, and probably providing a nonsense value in release builds. We should remove the assertion and return a negative value (documented as the value for "unavailable") if this is detected.
>
> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision:
>
>   Check all partial counters

Marked as reviewed by tpushkin (Committer).

-------------

PR Review: https://git.openjdk.org/crac/pull/274#pullrequestreview-3496728637

From rvansa at openjdk.org Mon Nov 24 08:16:44 2025
From: rvansa at openjdk.org (Radim Vansa)
Date: Mon, 24 Nov 2025 08:16:44 GMT
Subject: [crac] Integrated: 8371337: [CRaC] Fastdebug build fails when calling OperatingSystemMxBean.getProcessCpuLoad after restore
In-Reply-To:
References:
Message-ID:

On Wed, 5 Nov 2025 12:21:23 GMT, Radim Vansa wrote:

> A JVM that executed OperatingSystemImpl.getProcessCpuLoad() before checkpoint can fail with assertion failure after restore with:
>
> java: /home/rvansa/work/zulu/src/jdk.management/linux/native/libmanagement_ext/UnixOperatingSystem.c:291: get_cpuload_internal: Assertion `pticks->usedKernel >= tmp.usedKernel' failed.
>
> This is an assertion failure, therefore failing only in debug builds, and probably providing a nonsense value in release builds. We should remove the assertion and return a negative value (documented as the value for "unavailable") if this is detected.

This pull request has now been integrated.

Changeset: e288b416
Author:    Radim Vansa
URL:       https://git.openjdk.org/crac/commit/e288b416f0ffa0c1031188fd3a34e582dd00cc19
Stats:     65 lines in 2 files changed: 62 ins; 3 del; 0 mod

8371337: [CRaC] Fastdebug build fails when calling OperatingSystemMxBean.getProcessCpuLoad after restore

Reviewed-by: tpushkin

-------------

PR: https://git.openjdk.org/crac/pull/274
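The idea behind this change (return "unavailable" when any counter has gone backwards after a restore, and keep the result within 0 to 1) can be sketched in C++ as below. This is not the integrated patch: the `Ticks` struct and `cpu_load` function are hypothetical stand-ins, the field names only mirror the ones mentioned in the review, and the sketch assumes that `used` counts user-mode ticks, `usedKernel` kernel-mode ticks, and `total` all elapsed ticks. Returning 0.0 when no ticks elapsed is likewise a choice made for this sketch only.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for the tick counters discussed in the review.
struct Ticks {
    std::uint64_t used;        // assumed: ticks spent in user mode
    std::uint64_t usedKernel;  // assumed: ticks spent in kernel mode
    std::uint64_t total;       // assumed: all ticks elapsed
};

// Returns the CPU load between two samples as a value in [0, 1],
// or -1.0 ("unavailable") when the samples cannot be compared.
double cpu_load(const Ticks& previous, const Ticks& current) {
    // After a restore the process may look "younger" than the previous
    // sample. Check every counter, since they can go backwards independently.
    if (current.used < previous.used ||
        current.usedKernel < previous.usedKernel ||
        current.total < previous.total) {
        return -1.0;
    }

    const std::uint64_t total_delta = current.total - previous.total;
    if (total_delta == 0) {
        return 0.0;  // sketch-only choice: no time elapsed, avoid dividing by 0
    }

    const std::uint64_t busy_delta = (current.used - previous.used) +
                                     (current.usedKernel - previous.usedKernel);
    const double load = static_cast<double>(busy_delta) /
                        static_cast<double>(total_delta);

    // Sanitize the result to the safe range 0 to 1.
    return std::min(1.0, std::max(0.0, load));
}
```

For example, going from `{100, 50, 1000}` to `{150, 75, 1100}` gives 75 busy ticks out of 100, while a current sample whose `used` value is lower than the previous one yields -1.0 regardless of the other counters.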