[crac] RFR: [CRaC] Support checkpointing with --patch-module by treating patch JARs as persistent
mazhen
duke at openjdk.org
Mon Aug 4 18:42:15 UTC 2025
On Wed, 16 Jul 2025 06:31:34 GMT, Timofei Pushkin <tpushkin at openjdk.org> wrote:
>> #### Summary
>>
>> This change enhances CRaC's checkpointing capabilities to correctly handle applications launched with the `--patch-module` option. Previously, using a JAR file with `--patch-module` would cause checkpointing to fail due to unmanaged file descriptors. This patch resolves the issue by treating these JARs as persistent resources, similar to how classpath JARs are handled.
>>
>> #### Problem
>>
>> When a JAR file is specified via `--patch-module`, it is opened by two distinct mechanisms within the JVM:
>>
>> 1. **Native `ClassLoader`**: The C++ `ClassLoader` implementation opens the JAR file to build its internal class path structures. This results in a file descriptor that is not managed by CRaC's Java-level resource tracking, leading to it being flagged as a "BAD: opened by application" file during `check_fds`.
>> 2. **Java `ModulePatcher`**: The Java-level `jdk.internal.module.ModulePatcher` also opens the same JAR to scan for packages and resources. While this could be made CRaC-aware, the native-level file descriptor remains an issue.
>>
>> This dual-opening mechanism leads to `CheckpointOpenFileException` being thrown for both file descriptors during a checkpoint, preventing a successful image creation. The log output clearly illustrates this problem:
>>
>>
>> $ jcmd 3403046 JDK.checkpoint
>> 3403046:
>> ...
>> JVM: FD fd=4 type=regular path="/home/mazhen/works/patch-demo/legacy-patch.jar" BAD: opened by application
>> JVM: FD fd=5 type=regular path="/home/mazhen/works/patch-demo/legacy-patch.jar" OK: claimed by java code
>> ...
>> An exception during a checkpoint operation:
>> jdk.internal.crac.mirror.CheckpointException
>> Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenFileException: legacy-patch.jar
>> at java.base/jdk.internal.crac.JDKFileResource.lambda$beforeCheckpoint$0(JDKFileResource.java:90)
>> ...
>> Caused by: java.lang.Exception: This file descriptor was created by main at epoch:1752027938791 here
>> at java.base/jdk.internal.crac.JDKFdResource.<init>(JDKFdResource.java:40)
>> ...
>> at java.base/jdk.internal.module.ModulePatcher$JarResourceFinder.<init>(ModulePatcher.java:433)
>> ...
>> Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenFileException: FD fd=4 type=regular path=/home/mazhen/works/patch-demo/legacy-patch.jar
>> at java.base/jdk.internal.crac.mirror.Core.translateJVMExceptions(Core.java:115)
>> ...
>>
>>
>> #### Solution
>>
>> The fundamental...
>
> src/hotspot/share/classfile/classLoader.cpp line 763:
>
>> 761: GrowableArray<int> fds;
>> 762: assert(Thread::current()->is_VM_thread(), "should be called from VM op");
>> 763: if (_patch_mod_entries == NULL) {
>
> Suggestion:
>
> if (_patch_mod_entries == nullptr) {
Thanks! Good catch. Using nullptr is definitely better practice. I'll make the change.
> src/hotspot/share/classfile/classLoader.cpp line 766:
>
>> 764: return fds;
>> 765: }
>> 766: // we don't use mutexes here because it is called from VM op
>
> As I see it, all patch module entries are opened once at VM initialization (in `ClassLoader::setup_patch_mod_entries()`) and checkpoint cannot be called before then. If that is so, there should be no VM thread / mutex requirements to read it.
That's a great point. Your analysis of the `_patch_mod_entries` lifecycle makes perfect sense. Since it's effectively read-only after initialization, the VM thread assertion isn't necessary. I'll remove it and the related comment. Thanks for spotting this!
> test/jdk/jdk/crac/fileDescriptors/PatchModuleTest.java line 140:
>
>> 138: return (String) patcherClass.getMethod("getInfo").invoke(null);
>> 139: }
>> 140: }
>
> I haven't looked through the test but if you will be applying Radim's suggestions please also add a newline at the end of file
Good catch, thanks! I've pushed a commit with the added newline.
I've also left a comment on Radim's suggestion explaining why I'm sticking with the criu engine for this test to ensure a full physical restore is validated.
Appreciate the review!
-------------
PR Review Comment: https://git.openjdk.org/crac/pull/241#discussion_r2215129538
PR Review Comment: https://git.openjdk.org/crac/pull/241#discussion_r2215132619
PR Review Comment: https://git.openjdk.org/crac/pull/241#discussion_r2218096036
More information about the crac-dev
mailing list