From jkratochvil at openjdk.org Thu Jun 1 13:44:02 2023
From: jkratochvil at openjdk.org (Jan Kratochvil)
Date: Thu, 1 Jun 2023 13:44:02 GMT
Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v27]
In-Reply-To:
References:
Message-ID:

> Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash.
>
> 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored.
> 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC
>
> (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore a new OpenJDK option had to be implemented:
>
>> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine
>
> It is intended to specify the lowest common denominator of all CPUs in a farm. Instead of a possible crash of OpenJDK it will now refuse to run:
>
>> Error occurred during initialization of VM
>> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi
>
> (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. This currently has the problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered an unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such a feature for glibc upstream where it is surely easily implementable.
>
> If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such a case I believe one should switch to using the GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into the glibc GLIBC_TUNABLES format. Unfortunately there...

Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision:

  Reintroduce the "leftover" code which was not leftover.

-------------

Changes:
  - all: https://git.openjdk.org/crac/pull/41/files
  - new: https://git.openjdk.org/crac/pull/41/files/28a7b591..08c2c0f2

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=26
 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=25-26

  Stats: 9 lines in 1 file changed: 9 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/crac/pull/41.diff
  Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41

PR: https://git.openjdk.org/crac/pull/41

From jkratochvil at openjdk.org Thu Jun 1 13:47:49 2023
From: jkratochvil at openjdk.org (Jan Kratochvil)
Date: Thu, 1 Jun 2023 13:47:49 GMT
Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v24]
In-Reply-To:
References:
Message-ID:

On Wed, 31 May 2023 13:54:48 GMT, Jan Kratochvil wrote:

>> src/hotspot/os_cpu/linux_x86/os_linux_x86.cpp line 224:
>>
>>> 222:         stub = VM_Version::cpuinfo_cont_addr();
>>> 223:       }
>>> 224:     } else
>>
>> Why do we need this block?
>> Is not this duplicating logic from line 251?
>
> It was needed for some previous variant of the patch and it is a leftover. Removed.

Sorry for not self-reviewing this version of my patch. It is in fact needed:
https://github.com/openjdk/crac/blob/08c2c0f2b140b1011fec39a1a83037df8e59577d/src/hotspot/os_cpu/linux_x86/os_linux_x86.cpp#L217
During CRaC restore there is somehow no Java thread created yet that time. I do not understand this much so I left it there as working but maybe some Java thread should be initialized earlier?

-------------

PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1213180053

From jkratochvil at openjdk.org Thu Jun 1 14:30:06 2023
From: jkratochvil at openjdk.org (Jan Kratochvil)
Date: Thu, 1 Jun 2023 14:30:06 GMT
Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v28]
In-Reply-To:
References:
Message-ID:

Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision:

  Remove initialize_processor_count().
  - requested by Anton Kozlov
  - it was crashing for me for 4 CPU <-> 16 CPU moves

-------------

Changes:
  - all: https://git.openjdk.org/crac/pull/41/files
  - new: https://git.openjdk.org/crac/pull/41/files/08c2c0f2..2d67d94a

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=27
 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=26-27

  Stats: 16 lines in 2 files changed: 1 ins; 12 del; 3 mod
  Patch: https://git.openjdk.org/crac/pull/41.diff
  Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41

PR: https://git.openjdk.org/crac/pull/41

From jkratochvil at openjdk.org Thu Jun 1 14:35:02 2023
From: jkratochvil at openjdk.org (Jan Kratochvil)
Date: Thu, 1 Jun 2023 14:35:02 GMT
Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v29]
In-Reply-To:
References:
Message-ID:

Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision:

  Reintroduce initialize_processor_count() requiring -XX:+CRaCCPUCountInit.
  - requested by Anton Kozlov

-------------

Changes:
  - all: https://git.openjdk.org/crac/pull/41/files
  - new: https://git.openjdk.org/crac/pull/41/files/2d67d94a..279ddbdc

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=28
 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=27-28

  Stats: 22 lines in 3 files changed: 17 ins; 1 del; 4 mod
  Patch: https://git.openjdk.org/crac/pull/41.diff
  Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41

PR: https://git.openjdk.org/crac/pull/41

From jkratochvil at openjdk.org Thu Jun 1 14:50:51 2023
From: jkratochvil at openjdk.org (Jan Kratochvil)
Date: Thu, 1 Jun 2023 14:50:51 GMT
Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v24]
In-Reply-To: <8xx4BgFoTXFFUZJvB4Rgg6UrF9hJDt-sCUrE_Nz8cRc=.895586ba-1428-4672-9dd7-ebaffd8b6146@github.com>
References: <8xx4BgFoTXFFUZJvB4Rgg6UrF9hJDt-sCUrE_Nz8cRc=.895586ba-1428-4672-9dd7-ebaffd8b6146@github.com>
Message-ID:

On Wed, 31 May 2023 09:47:17 GMT, Anton Kozlov wrote:

>> Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   +-XX:CPUFeatures=ignore
>
> src/hotspot/os/linux/os_linux.cpp line 5962:
>
>> 5960:     initialize_processor_count();
>> 5961:     if (_cpu_to_node != NULL)
>> 5962:       rebuild_cpu_to_node_map();
>
> It seems this is the only place where the number of processors is updated, so it's not clear how safe the operation is. Please add a debug option for processor count update, I propose that to be disabled by default.

OK, moved under `-XX:+CRaCCPUCountInit`. It was crashing (asserting IIRC) for me when moving between 4 and 20 vCPUs but I was unable to reproduce it now.
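
For readers following the -XX:CPUFeatures thread, the numbers in the error message quoted in the PR description can be checked with plain bit arithmetic. The following is a minimal sketch, assuming the feature value is an ordinary bitmask with one bit per CPU feature (which the "missing features" list in that message suggests). The class `CpuFeatureMasks` and its methods are hypothetical names for illustration, not HotSpot code.

```java
/** Hypothetical helper illustrating the bitmask arithmetic; not HotSpot code. */
final class CpuFeatureMasks {
    /** Features recorded in the checkpoint image that the restoring CPU lacks. */
    static long missing(long imageFeatures, long machineFeatures) {
        return imageFeatures & ~machineFeatures;
    }

    /** Lowest common denominator: only the features present on every machine. */
    static long commonDenominator(long... machineFeatures) {
        long common = ~0L; // start with all bits set, then intersect
        for (long features : machineFeatures) {
            common &= features;
        }
        return common;
    }

    public static void main(String[] args) {
        long image = 0x7fff9dfcfbf7L;   // features stored in the checkpoint
        long machine = 0x421801fcfbd7L; // features of the restoring CPU
        System.out.println("missing = 0x" + Long.toHexString(missing(image, machine)));
        System.out.println("-XX:CPUFeatures=0x"
                + Long.toHexString(commonDenominator(image, machine)));
    }
}
```

With the two values from the message this yields 0x3de79c000020 as the missing set and 0x421801fcfbd7 as the common value, matching what the VM reports and suggests.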
-------------

PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1213275788

From jkratochvil at openjdk.org Fri Jun 2 11:57:44 2023
From: jkratochvil at openjdk.org (Jan Kratochvil)
Date: Fri, 2 Jun 2023 11:57:44 GMT
Subject: [crac] RFR: Handle open file descriptors with configurable policies [v3]
In-Reply-To:
References:
Message-ID:

On Fri, 12 May 2023 13:29:08 GMT, Radim Vansa wrote:

>> When the application does not close some file descriptors through Resources we can use `jdk.crac.fd-policy.checkpoint` and `jdk.crac.fd-policy.restore` to configure the behaviour.
>>
>> These properties can specify a list of File.pathSeparator-separated key=value pairs, where the key can be one of:
>>
>> * numeric file descriptor
>> * path using 'glob' pattern matching (see FileSystem.getPathMatcher() for details)
>> * keywords FIFO and SOCKET that match pipes and sockets
>>
>> The value should match one of possible values from OpenFDPolicies.BeforeCheckpoint and OpenFDPolicies.AfterRestore
>
> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision:
>
>   Effectively revert previous commit: Initialize logger in

CRaC now started complaining to me when I try to snapshot: https://github.com/CRaC/example-jetty

    Suppressed: jdk.crac.impl.CheckpointOpenFileException: FileDescriptor 12 left open: tcp6 local /[0:0:0:0:0:0:0:0]:8080 remote not bound

1. I haven't found how to specify a property. I have tried below the normal `-D` command line parameter, is that correct?
2. But specifying the command line parameter has no effect. It still does not work:

    $ (cd ../example-jetty/;. JAVA_HOME-crac-git-slowdebug;(sleep 4;sudo rm -f /tmp/coredump.*;sudo jcmd target/example-jetty-1.0-SNAPSHOT.jar JDK.checkpoint) & sudo rm -rf cr-host2/;sudo bash -c 'ulimit -c unlimited;. JAVA_HOME-crac-git-slowdebug;$JAVA_HOME/bin/java -XX:CRaCCheckpointTo=cr-host2 -Djdk.crac.collect-fd-stacktraces=true -Djdk.crac.fd-policy.checkpoint=CLOSE -jar target/example-jetty-1.0-SNAPSHOT.jar';echo restore;sudo $JAVA_HOME/bin/java -XX:CRaCRestoreFrom=cr-host2)
    2023-06-02 13:43:57.835:INFO::main: Logging initialized @2712ms to org.eclipse.jetty.util.log.StdErrLog
    2023-06-02 13:43:58.647:INFO:oejs.Server:main: jetty-9.4.30.v20200611; built: 2020-06-11T12:34:51.929Z; git: 271836e4c1f4612f12b7bb13ef5a92a927634b0d; jvm 17-internal+0-adhoc.azul.crac-git
    1519:
    2023-06-02 13:43:59.647:INFO:oejs.AbstractConnector:main: Started ServerConnector@55a1c291{HTTP/1.1, (http/1.1)}{0.0.0.0:8080}
    2023-06-02 13:43:59.735:INFO:oejs.Server:main: Started @4744ms
    Jun 02, 2023 1:44:00 PM jdk.internal.crac.LoggerContainer info
    INFO: /home/azul/azul/example-jetty/target/dependency/jetty-io-9.4.30.v20200611.jar is recorded as always available on restore
    Jun 02, 2023 1:44:00 PM jdk.internal.crac.LoggerContainer info
    INFO: /home/azul/azul/example-jetty/target/dependency/jetty-util-9.4.30.v20200611.jar is recorded as always available on restore
    Jun 02, 2023 1:44:00 PM jdk.internal.crac.LoggerContainer info
    INFO: /home/azul/azul/example-jetty/target/dependency/jetty-http-9.4.30.v20200611.jar is recorded as always available on restore
    Jun 02, 2023 1:44:00 PM jdk.internal.crac.LoggerContainer info
    INFO: /home/azul/azul/example-jetty/target/dependency/javax.servlet-api-3.1.0.jar is recorded as always available on restore
    Jun 02, 2023 1:44:00 PM jdk.internal.crac.LoggerContainer info
    INFO: /home/azul/azul/example-jetty/target/dependency/jetty-server-9.4.30.v20200611.jar is recorded as always available on restore
    Jun 02, 2023 1:44:00 PM jdk.internal.crac.LoggerContainer info
    INFO: /home/azul/azul/example-jetty/target/example-jetty-1.0-SNAPSHOT.jar is recorded as always available on restore
    An exception during a checkpoint operation:
    jdk.crac.CheckpointException
        at java.base/jdk.crac.Core.checkpointRestore1(Core.java:129)
        at java.base/jdk.crac.Core.checkpointRestore(Core.java:264)
        at java.base/jdk.crac.Core.checkpointRestoreInternal(Core.java:280)
        Suppressed: jdk.crac.impl.CheckpointOpenFileException: FileDescriptor 10 left open: tcp6 local /[0:0:0:0:0:0:0:0]:8080 remote not bound
            at java.base/java.io.FileDescriptor.beforeCheckpoint(FileDescriptor.java:391)
            at java.base/java.io.FileDescriptor$Resource.beforeCheckpoint(FileDescriptor.java:84)
            at java.base/jdk.crac.impl.PriorityContext$SubContext.invokeBeforeCheckpoint(PriorityContext.java:107)
            at java.base/jdk.crac.impl.OrderedContext.runBeforeCheckpoint(OrderedContext.java:70)
            at java.base/jdk.crac.impl.AbstractContextImpl.beforeCheckpoint(AbstractContextImpl.java:81)
            at java.base/jdk.crac.impl.AbstractContextImpl.invokeBeforeCheckpoint(AbstractContextImpl.java:41)
            at java.base/jdk.crac.impl.PriorityContext.runBeforeCheckpoint(PriorityContext.java:70)
            at java.base/jdk.crac.impl.AbstractContextImpl.beforeCheckpoint(AbstractContextImpl.java:81)
            at java.base/jdk.internal.crac.JDKContext.beforeCheckpoint(JDKContext.java:97)
            at java.base/jdk.crac.impl.AbstractContextImpl.invokeBeforeCheckpoint(AbstractContextImpl.java:41)
            at java.base/jdk.crac.impl.OrderedContext.runBeforeCheckpoint(OrderedContext.java:70)
            at java.base/jdk.crac.impl.AbstractContextImpl.beforeCheckpoint(AbstractContextImpl.java:81)
            at java.base/jdk.crac.Core.checkpointRestore1(Core.java:127)
            ... 2 more
        Caused by: java.lang.Exception: This file descriptor was created by main at epoch:1685706239263 here
            at java.base/java.io.FileDescriptor$Resource.<init>(FileDescriptor.java:75)
            at java.base/java.io.FileDescriptor.<init>(FileDescriptor.java:104)
            at java.base/sun.nio.ch.IOUtil.newFD(IOUtil.java:544)
            at java.base/sun.nio.ch.Net.serverSocket(Net.java:534)
            at java.base/sun.nio.ch.ServerSocketChannelImpl.<init>(ServerSocketChannelImpl.java:128)
            at java.base/sun.nio.ch.ServerSocketChannelImpl.<init>(ServerSocketChannelImpl.java:109)
            at java.base/sun.nio.ch.SelectorProviderImpl.openServerSocketChannel(SelectorProviderImpl.java:72)
            at java.base/java.nio.channels.ServerSocketChannel.open(ServerSocketChannel.java:145)
            at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:339)
            at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:310)
            at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
            at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:234)
            at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:72)
            at org.eclipse.jetty.server.Server.doStart(Server.java:386)
            at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:72)
            at com.example.ServerManager.<init>(App.java:27)
            at com.example.App.main(App.java:73)
    restore
    open cppath: No such file or directory

I am curious why it says `Suppressed:` there but the snapshot still crashes due to `An exception during a checkpoint operation:`.
3. The setting `-Djdk.crac.fd-policy.checkpoint=CLOSE` would be wrong anyway. I do not want to silently close some ongoing TCP client-server communication. But in my case a socket in LISTEN state could be safely automatically restored without any coordination from the Java application: `tcp6 local /[0:0:0:0:0:0:0:0]:8080 remote not bound`

-------------

PR Comment: https://git.openjdk.org/crac/pull/69#issuecomment-1573614068

From duke at openjdk.org Mon Jun 5 07:02:41 2023
From: duke at openjdk.org (Radim Vansa)
Date: Mon, 5 Jun 2023 07:02:41 GMT
Subject: [crac] RFR: Resize ForkJoinPool and some concurrent data structures
In-Reply-To:
References:
Message-ID:

On Thu, 23 Mar 2023 16:10:44 GMT, Radim Vansa wrote:

> Some parts of JDK expect that #cpus is constant; update those places after restore.

Given the plans to rebase on a more recent JDK I think this can be closed. When the rebase happens I will check again on the NCPU changes.

-------------

PR Comment: https://git.openjdk.org/crac/pull/54#issuecomment-1576154685

From duke at openjdk.org Mon Jun 5 07:02:41 2023
From: duke at openjdk.org (Radim Vansa)
Date: Mon, 5 Jun 2023 07:02:41 GMT
Subject: [crac] Withdrawn: Resize ForkJoinPool and some concurrent data structures
In-Reply-To:
References:
Message-ID:

On Thu, 23 Mar 2023 16:10:44 GMT, Radim Vansa wrote:

> Some parts of JDK expect that #cpus is constant; update those places after restore.

This pull request has been closed without being integrated.

-------------

PR: https://git.openjdk.org/crac/pull/54

From duke at openjdk.org Mon Jun 5 07:03:42 2023
From: duke at openjdk.org (Radim Vansa)
Date: Mon, 5 Jun 2023 07:03:42 GMT
Subject: [crac] RFR: Correct System.nanotime() value after restore [v2]
In-Reply-To:
References: <-i0uB8ZW7r54hoKQJ_wODUXNKVkOI5rH7SJTEhSHiDw=.75ebe53a-9081-40c6-911f-048b17e8850e@github.com>
Message-ID:

On Thu, 13 Apr 2023 15:45:50 GMT, Anton Kozlov wrote:

>> @AntonKozlov
>>
>>> Crac-criu does not use restore timens [1] since once a bug in kernel or criu caused timedwait to return immediately every time that is called after restore.
>>> I don't remember the bug exactly (already fixed), but I believe it was discussed on this mailing list
>>
>> https://github.com/CRaC/criu/commit/1cb2f4a518a4ae471a1df7a9b540203c1efaf1ba commit is dated July 14, 2020, but the earliest crac-dev archives are from Sept 2021. Is there some other mailing list this was discussed on? I am interested in understanding the problem that prompted not to use timens in criu.
>> Since you mention it was a bug in kernel or criu and it has been almost 3 years since your commit, maybe it is worth enabling the criu changes again to see if the timedwait problem still exists, unless you have already done that.
>>
>>> In general, we should not depend on very obscure linux abilities, as this reduces the chances we'd be able to run on something other than linux.
>>
>> I don't think timens can be put in the category of obscure linux ability. It has even made its way into container runtime spec: https://github.com/opencontainers/runtime-spec/pull/1151.
>
>> Since you mention it was a bug in kernel or criu and it has been almost 3 years since your commit, maybe it is worth enabling the criu changes again to see if the timedwait problem still exists, unless you have already done that.
>
> AFAIK the bug is fixed, but I see no point of relying on OS here. Is there one? Timens that is not changed by CRIU provides correct values for our nanoTime() [1].
>
>> The value returned represents nanoseconds since some fixed but arbitrary origin time (perhaps in the future, so values may be negative). The same origin is used by all invocations of this method in an instance of a Java virtual machine
>
> [1] https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/System.html#nanoTime()

@AntonKozlov Anything more needed for this PR?
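
The Javadoc contract quoted in this thread only requires that the origin stays fixed within one JVM instance. As an illustration of what keeping that property across a restore could look like without relying on the OS, here is a minimal sketch that adds a correction offset to a raw monotonic clock. It assumes the raw clock may jump arbitrarily when the image is restored on another machine and that the time spent checkpointed is known from elsewhere. The class `AdjustedNanoClock` and its method names are hypothetical and are not taken from the PR.

```java
import java.util.function.LongSupplier;

/** Hypothetical wrapper around a monotonic clock; not part of the JDK or CRaC. */
final class AdjustedNanoClock {
    private final LongSupplier raw;     // underlying clock, e.g. System::nanoTime
    private long offset;                // correction added to every raw reading
    private long lastBeforeCheckpoint;  // adjusted reading taken just before checkpoint

    AdjustedNanoClock(LongSupplier raw) {
        this.raw = raw;
    }

    /** Adjusted time: the raw reading plus the accumulated correction. */
    synchronized long nanoTime() {
        return raw.getAsLong() + offset;
    }

    /** Remember the last value callers could have observed before the checkpoint. */
    synchronized void beforeCheckpoint() {
        lastBeforeCheckpoint = raw.getAsLong() + offset;
    }

    /**
     * Recompute the offset so the adjusted clock continues from the pre-checkpoint
     * value plus the (assumed known, non-negative) time spent checkpointed.
     */
    synchronized void afterRestore(long elapsedNanos) {
        long expected = lastBeforeCheckpoint + Math.max(0L, elapsedNanos);
        offset = expected - raw.getAsLong();
    }

    public static void main(String[] args) {
        long[] fake = {1_000L}; // stand-in for the machine's monotonic clock
        AdjustedNanoClock clock = new AdjustedNanoClock(() -> fake[0]);
        clock.beforeCheckpoint();
        fake[0] = 5L;           // restored on a machine whose clock started later
        clock.afterRestore(200L);
        System.out.println(clock.nanoTime());
    }
}
```

The design choice here is that callers never see the raw value, so a jump in the underlying clock only changes the offset and elapsed-time differences taken across the restore stay non-negative.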
-------------

PR Comment: https://git.openjdk.org/crac/pull/53#issuecomment-1576156220

From duke at openjdk.org Mon Jun 5 13:33:50 2023
From: duke at openjdk.org (Radim Vansa)
Date: Mon, 5 Jun 2023 13:33:50 GMT
Subject: [crac] RFR: Linux file system watcher support [v4]
In-Reply-To: <3Jl4KmUfedZAAzBzRZQZu1hJpBpdwvzilrUpKX4KcZo=.3c77ae78-59ee-481c-a0ff-41e03c32270a@github.com>
References: <3Jl4KmUfedZAAzBzRZQZu1hJpBpdwvzilrUpKX4KcZo=.3c77ae78-59ee-481c-a0ff-41e03c32270a@github.com>
Message-ID:

On Thu, 25 May 2023 09:26:01 GMT, joeylee wrote:

>> inotify monitors changes on filesystem, this supports automatic restore for LinuxFileWatcher.
>>
>> FileWatcherAfterRestoreTest verifies watcher service works after restore.
>> FileWatcherTest verifies automatic closing inotify fd
>>
>> The watcher keys are still managed by user, so exception will be thrown if no watcher keys are leaked, as in FileWatcherWithOpenKeysTest
>
> joeylee has updated the pull request incrementally with one additional commit since the last revision:
>
>   update

LGTM. Sorry for taking a while, I missed your last update. Please type `/integrate` so that @AntonKozlov can `/sponsor` and get this merged if he's OK with these changes (I am not a committer myself yet).

-------------

Marked as reviewed by rvansa at github.com (no known OpenJDK username).

PR Review: https://git.openjdk.org/crac/pull/72#pullrequestreview-1462551707

From duke at openjdk.org Mon Jun 5 13:52:41 2023
From: duke at openjdk.org (Radim Vansa)
Date: Mon, 5 Jun 2023 13:52:41 GMT
Subject: [crac] RFR: Introduce per-Priority Context with different policies [v5]
In-Reply-To:
References:
Message-ID:

On Wed, 31 May 2023 08:57:08 GMT, Anton Kozlov wrote:

>> A follow-up work for #60:
>>
>> * Each priority now has a dedicated context, so contexts may provide different policies.
>> * the Global Context reverted from BlockingOrderedContext to OrderedContext, as that may have a huge impact on users. Probably we'll want to expose blocking/criticalUnordered context along the global one, or at some point expose an implementation. But this is also out of scope of the PR.
>> * hierarchy of the Context implementations is cleaned up a bit [2]
>>
>> The JDKContext is now just a holder of ClaimedFDs, I'll address this in a follow-up that depends on this Context follow-up.
>>
>> [1] https://github.com/openjdk/crac/pull/60#issuecomment-1545588281
>> [2] https://github.com/openjdk/crac/pull/60#discussion_r1185510445
>
> Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision:
>
>   Update

I don't agree with some of your conclusions and with handling some deficiencies in follow-ups, but I guess this is as far as we can get here. Therefore approved, let's move on. I hope that https://github.com/openjdk/crac/pull/64 can be addressed soon to fix the loss of stack traces and messages from CheckpointException.

test/jdk/jdk/crac/ContextOrderTest.java line 147:

> 145:                     thread.interrupt();
> 146:                     thread.join(TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime()));
> 147:                     if (thread.getState() == Thread.State.WAITING) {

I think you wanted to remove printing the stack traces here, too.

-------------

Marked as reviewed by rvansa at github.com (no known OpenJDK username).

PR Review: https://git.openjdk.org/crac/pull/74#pullrequestreview-1462590184
PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1218100148

From duke at openjdk.org Mon Jun 5 13:57:31 2023
From: duke at openjdk.org (Radim Vansa)
Date: Mon, 5 Jun 2023 13:57:31 GMT
Subject: [crac] RFR: Synchronize concurrent clean() in PhantomCleanableRef
In-Reply-To:
References:
Message-ID:

On Fri, 12 May 2023 13:50:34 GMT, Radim Vansa wrote:

> Fixes failures in RefQueueTest and JarFileFactoryCacheTest.

It seems that the general approach in #73 would be favored, though it's a more complicated one. Feel free to reopen.
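
The underlying race in this PR is two threads (the cleaner thread and the checkpointing thread) invoking clean() on the same reference. As a generic illustration only, and not the approach taken in either #70 or #73, here is a minimal sketch of a cleanup action guarded so that it runs at most once under concurrent calls. The class `OnceCleanable` is a hypothetical name.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical at-most-once cleanup wrapper; not the JDK's PhantomCleanableRef. */
final class OnceCleanable {
    private final AtomicBoolean cleaned = new AtomicBoolean();
    private final Runnable action;

    OnceCleanable(Runnable action) {
        this.action = action;
    }

    /** Runs the action if no other caller has claimed it; returns true if this call ran it. */
    boolean clean() {
        if (cleaned.compareAndSet(false, true)) {
            action.run();
            return true;
        }
        return false; // another thread won the race (and may still be running the action)
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger runs = new AtomicInteger();
        OnceCleanable cleanable = new OnceCleanable(runs::incrementAndGet);
        Thread cleaner = new Thread(cleanable::clean, "cleaner");
        Thread checkpoint = new Thread(cleanable::clean, "checkpoint");
        cleaner.start();
        checkpoint.start();
        cleaner.join();
        checkpoint.join();
        System.out.println("action ran " + runs.get() + " time(s)");
    }
}
```

Note that the caller that loses the race returns immediately without waiting for the action to finish; if the checkpoint depends on the cleanup having completed (for example a closed file descriptor), some additional waiting or hand-over is needed on top of this guard.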
-------------

PR Comment: https://git.openjdk.org/crac/pull/70#issuecomment-1576852060

From duke at openjdk.org Mon Jun 5 13:57:32 2023
From: duke at openjdk.org (Radim Vansa)
Date: Mon, 5 Jun 2023 13:57:32 GMT
Subject: [crac] Withdrawn: Synchronize concurrent clean() in PhantomCleanableRef
In-Reply-To:
References:
Message-ID:

On Fri, 12 May 2023 13:50:34 GMT, Radim Vansa wrote:

> Fixes failures in RefQueueTest and JarFileFactoryCacheTest.

This pull request has been closed without being integrated.

-------------

PR: https://git.openjdk.org/crac/pull/70

From duke at openjdk.org Tue Jun 6 08:01:41 2023
From: duke at openjdk.org (Radim Vansa)
Date: Tue, 6 Jun 2023 08:01:41 GMT
Subject: [crac] RFR: Prevent concurrent cleanup by cleaner thread and checkpoint notifications [v2]
In-Reply-To:
References:
Message-ID:

> We block the cleaner thread to prevent race conditions between this thread and checkpointing thread invoking clean().
> When the cleanup starts in cleaner thread the checkpoint will skip it, but without waiting for the cleanup to finish (which might be critical for the checkpoint, e.g. closing FDs).
> The limitation is that code performing C/R must not wait on any task completed by the cleaner.

Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:

 - Clean GC'ed references on checkpoint
 - Merge branch 'crac' into sync_cleaner
 - Prevent concurrent cleanup by cleaner thread and checkpoint notifications

-------------

Changes:
  - all: https://git.openjdk.org/crac/pull/73/files
  - new: https://git.openjdk.org/crac/pull/73/files/ef4ea426..43d7e41f

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=crac&pr=73&range=01
 - incr: https://webrevs.openjdk.org/?repo=crac&pr=73&range=00-01

  Stats: 742 lines in 24 files changed: 541 ins; 71 del; 130 mod
  Patch: https://git.openjdk.org/crac/pull/73.diff
  Fetch: git fetch https://git.openjdk.org/crac.git pull/73/head:pull/73

PR: https://git.openjdk.org/crac/pull/73

From duke at openjdk.org Tue Jun 6 08:15:51 2023
From: duke at openjdk.org (Radim Vansa)
Date: Tue, 6 Jun 2023 08:15:51 GMT
Subject: [crac] RFR: Active cleanup by CleanerImpl on checkpoint [v3]
In-Reply-To:
References:
Message-ID:

> Rather than registering PhantomCleanableRef as a resource that gets cleaned up in the C/R thread force cleanup in the CleanerImpl as this already keeps a list of all eligible references.

Radim Vansa has updated the pull request incrementally with one additional commit since the last revision:

  Don't block the cleaner thread during C/R at all.

-------------

Changes:
  - all: https://git.openjdk.org/crac/pull/73/files
  - new: https://git.openjdk.org/crac/pull/73/files/43d7e41f..8ec3aee5

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=crac&pr=73&range=02
 - incr: https://webrevs.openjdk.org/?repo=crac&pr=73&range=01-02

  Stats: 19 lines in 1 file changed: 2 ins; 9 del; 8 mod
  Patch: https://git.openjdk.org/crac/pull/73.diff
  Fetch: git fetch https://git.openjdk.org/crac.git pull/73/head:pull/73

PR: https://git.openjdk.org/crac/pull/73

From duke at openjdk.org Tue Jun 6 08:15:54 2023
From: duke at openjdk.org (Radim Vansa)
Date: Tue, 6 Jun 2023 08:15:54 GMT
Subject: [crac] RFR: Active cleanup by CleanerImpl on checkpoint [v2]
In-Reply-To:
References:
Message-ID:

On Tue, 6 Jun 2023 08:01:41 GMT, Radim Vansa wrote:

>> Rather than registering PhantomCleanableRef as a resource that gets cleaned up in the C/R thread force cleanup in the CleanerImpl as this already keeps a list of all eligible references.
>
> Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
>
>  - Clean GC'ed references on checkpoint
>  - Merge branch 'crac' into sync_cleaner
>  - Prevent concurrent cleanup by cleaner thread and checkpoint notifications

@AntonKozlov Updated.

-------------

PR Comment: https://git.openjdk.org/crac/pull/73#issuecomment-1578147252

From akozlov at openjdk.org Tue Jun 6 16:19:20 2023
From: akozlov at openjdk.org (Anton Kozlov)
Date: Tue, 6 Jun 2023 16:19:20 GMT
Subject: [crac] RFR: Active cleanup by CleanerImpl on checkpoint [v3]
In-Reply-To:
References:
Message-ID:

On Tue, 6 Jun 2023 08:15:51 GMT, Radim Vansa wrote:

>> Rather than registering PhantomCleanableRef as a resource that gets cleaned up in the C/R thread force cleanup in the CleanerImpl as this already keeps a list of all eligible references.
> > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Don't block the cleaner thread during C/R at all. LGTM, thank you! src/java.base/share/classes/jdk/internal/ref/CleanerImpl.java line 160: > 158: wait(); > 159: } > 160: cleanupComplete = false; AFIAU this synchronizes with beforeCheckpoint to verfiy cleanupComplete==true is noticied. An alternative would be to drop this sync, releasing cleaner sooner, and then just reset cleanupComplete in beforeCheckpoint(). But this is not a big deal. ------------- Marked as reviewed by akozlov (Lead). PR Review: https://git.openjdk.org/crac/pull/73#pullrequestreview-1465587337 PR Review Comment: https://git.openjdk.org/crac/pull/73#discussion_r1219940547 From akozlov at openjdk.org Tue Jun 6 16:41:56 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 6 Jun 2023 16:41:56 GMT Subject: [crac] RFR: Introduce per-Priority Context with different policies [v5] In-Reply-To: References: Message-ID: On Wed, 31 May 2023 08:57:08 GMT, Anton Kozlov wrote: >> A follow-up work for #60: >> >> * Each priority now has a dedicated context, so contextes may provide different policies. >> * the Global Context reverted from BlockingOrderedContext to OrderedContext, as that may have a huge impact on users. Probably we'll want to expose blocking/criticalUnorderd context along the global one, or at some point expose an implementation. But this is also out of scope of the PR. >> * hierachy of the Context implementations are cleaned up a bit [2] >> >> The JDKContext is now just a holder of ClaimedFDs, I'll address this in a follow-up that depends on this Context follow-up. >> >> [1] https://github.com/openjdk/crac/pull/60#issuecomment-1545588281 >> [2] https://github.com/openjdk/crac/pull/60#discussion_r1185510445 > > Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Update You get me right. 
Meanwhile I had a chance to use contexts with the blocking policies, and in the current form the auto-deadlock fires too often. I was assuming deadlock is fine, as long as it is a rare edge case in the JDK internals. But as the Global Context implementation, a blocking context with auto-deadlock possibility is not acceptable IMO. We are unlikely to want to offer such an implementation to users. So definitely, this is not the last change in the area :) ------------- PR Comment: https://git.openjdk.org/crac/pull/74#issuecomment-1579098908 From akozlov at openjdk.org Tue Jun 6 16:42:14 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 6 Jun 2023 16:42:14 GMT Subject: [crac] RFR: Introduce per-Priority Context with different policies [v5] In-Reply-To: References: Message-ID: On Mon, 5 Jun 2023 13:46:05 GMT, Radim Vansa wrote: >> Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Update > > test/jdk/jdk/crac/ContextOrderTest.java line 147: > >> 145: thread.interrupt(); >> 146: thread.join(TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime())); >> 147: if (thread.getState() == Thread.State.WAITING) { > > I think you wanted to remove printing the stack traces here, too. Let's keep it there. This is on the failure path, so this won't appear until the failure. And every time the failure happens, it's really necessary to know the stack trace of the thread that does not respond to the interruption. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1219959425 From akozlov at openjdk.org Tue Jun 6 16:44:34 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 6 Jun 2023 16:44:34 GMT Subject: [crac] Integrated: Introduce per-Priority Context with different policies In-Reply-To: References: Message-ID: On Wed, 17 May 2023 14:44:08 GMT, Anton Kozlov wrote: > A follow-up work for #60: > > * Each priority now has a dedicated context, so contexts may provide different policies.
> * the Global Context reverted from BlockingOrderedContext to OrderedContext, as that may have a huge impact on users. Probably we'll want to expose blocking/criticalUnordered context along the global one, or at some point expose an implementation. But this is also out of scope of the PR. > * hierarchy of the Context implementations is cleaned up a bit [2] > > The JDKContext is now just a holder of ClaimedFDs, I'll address this in a follow-up that depends on this Context follow-up. > > [1] https://github.com/openjdk/crac/pull/60#issuecomment-1545588281 > [2] https://github.com/openjdk/crac/pull/60#discussion_r1185510445 This pull request has now been integrated. Changeset: 6f403eaf Author: Anton Kozlov URL: https://git.openjdk.org/crac/commit/6f403eaf7655f9ce4d10da25082a836d0ad1574c Stats: 918 lines in 29 files changed: 283 ins; 551 del; 84 mod Introduce per-Priority Context with different policies ------------- PR: https://git.openjdk.org/crac/pull/74 From akozlov at openjdk.org Tue Jun 6 17:06:51 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 6 Jun 2023 17:06:51 GMT Subject: [crac] RFR: Use special class for exception aggregates [v4] In-Reply-To: References: <_AMDGDW7jBHD0sWCap1QKdw3fpk8l4yG8SquNCZdRRY=.efb87be3-45f8-4c99-b470-b7cba2669ed7@github.com> <_HutEHb0_FgiGAmImfmICJMIl2KJX8IlDVwpxTdAw28=.db1c4633-74a9-462f-a809-7ef3c4cc236f@github.com> <6DVU9HYy46e5dglqxqUXNfl17sKEM7XuP1H5kplhEC8=.caead7f0-602c-4fa3-a1e5-97b09a50beb4@github.com> Message-ID: On Tue, 30 May 2023 06:32:48 GMT, Radim Vansa wrote: >> What will be the point of CheckpointException then? A more specific exception would also be preferable instead, wouldn't it? > > `Core` would throw a generic exception when something failed in the native part (`criuengine` returned exit code 1...). A custom implementation of `Context` would throw it when it can't call its children for some reason (but it's not a failure in the Resource itself).
But that does not take away the need for the base CheckpointException to be able to aggregate exceptions from Resources, so CheckpointException and Combined are mostly overlapping. Moreover, we'll need to be able to re-aggregate exceptions of two types. In the current form, Combined and CheckpointException are both fine to be thrown from a user context, and there is no guidance on which one to use. Let's keep it simple and use CheckpointException for aggregations as it was supposed to be. The message constructor should be removed, and every constructor-with-the-message call should be replaced with an exception of a different type. Especially since we have examples for this [1]. [1] https://github.com/openjdk/crac/blob/6f403eaf7655f9ce4d10da25082a836d0ad1574c/src/java.base/share/classes/jdk/crac/Core.java#L164 ------------- PR Review Comment: https://git.openjdk.org/crac/pull/64#discussion_r1220015931 From duke at openjdk.org Wed Jun 7 06:20:24 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 7 Jun 2023 06:20:24 GMT Subject: [crac] RFR: Active cleanup by CleanerImpl on checkpoint [v4] In-Reply-To: References: Message-ID: <_2e90XQOh8RGjbo17OuygECbgyitoYuvr5gGlVUzCts=.73a2bee1-8345-4387-94c2-8f10f0c1a955@github.com> > Rather than registering PhantomCleanableRef as a resource that gets cleaned up in the C/R thread force cleanup in the CleanerImpl as this already keeps a list of all eligible references.
Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Cleaner thread does not need to wait for checkpoint thread ------------- Changes: - all: https://git.openjdk.org/crac/pull/73/files - new: https://git.openjdk.org/crac/pull/73/files/8ec3aee5..2b947915 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=73&range=03 - incr: https://webrevs.openjdk.org/?repo=crac&pr=73&range=02-03 Stats: 20 lines in 1 file changed: 1 ins; 8 del; 11 mod Patch: https://git.openjdk.org/crac/pull/73.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/73/head:pull/73 PR: https://git.openjdk.org/crac/pull/73 From duke at openjdk.org Wed Jun 7 06:20:25 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 7 Jun 2023 06:20:25 GMT Subject: [crac] RFR: Active cleanup by CleanerImpl on checkpoint [v3] In-Reply-To: References: Message-ID: On Tue, 6 Jun 2023 16:16:19 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Don't block the cleaner thread during C/R at all. > > src/java.base/share/classes/jdk/internal/ref/CleanerImpl.java line 160: > >> 158: wait(); >> 159: } >> 160: cleanupComplete = false; > > AFAIU this synchronizes with beforeCheckpoint to verify that cleanupComplete==true is noticed. An alternative would be to drop this sync, releasing the cleaner sooner, and then just reset cleanupComplete in beforeCheckpoint(). But this is not a big deal. That's a fair point; the waiting is not needed here. I moved the synchronization over from the previous version rather mindlessly.
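
For readers following the thread, the alternative suggested in the review (reset cleanupComplete on the checkpoint side so the cleaner never waits for an acknowledgement) can be sketched roughly as below. This is only an illustration under assumptions, not the CleanerImpl code: the class `CleanupHandshake`, both method names and the `cleaned` counter are hypothetical, and the real cleaner thread polls a ReferenceQueue rather than a plain flag.

```java
public class CleanupHandshake {
    // Flag names mirror the ones discussed in the review; everything else is hypothetical.
    private boolean forceCleanup;
    private boolean cleanupComplete;
    private int cleaned; // stands in for "references cleaned so far"

    // Checkpoint side (stands in for beforeCheckpoint()): request a cleanup and wait for it.
    public synchronized void requestCleanupAndWait() throws InterruptedException {
        cleanupComplete = false; // reset here, so the cleaner never blocks on an ack
        forceCleanup = true;
        notifyAll();
        while (!cleanupComplete) {
            wait();
        }
    }

    // Cleaner side: one iteration of the cleaner thread's loop.
    public synchronized void cleanerIteration() throws InterruptedException {
        while (!forceCleanup) {
            wait();
        }
        cleaned++; // placeholder for draining the list of eligible references
        forceCleanup = false;
        cleanupComplete = true;
        notifyAll(); // wake the checkpoint thread; the cleaner continues immediately
    }

    public synchronized int cleaned() {
        return cleaned;
    }

    public static void main(String[] args) throws Exception {
        CleanupHandshake handshake = new CleanupHandshake();
        Thread cleaner = new Thread(() -> {
            try {
                while (true) {
                    handshake.cleanerIteration();
                }
            } catch (InterruptedException e) {
                // interrupted: let the thread exit
            }
        });
        cleaner.setDaemon(true);
        cleaner.start();
        handshake.requestCleanupAndWait();
        handshake.requestCleanupAndWait();
        System.out.println("cleanups performed: " + handshake.cleaned());
        cleaner.interrupt();
    }
}
```

The only waiting left on the cleaner side is for the next request; the checkpoint thread alone blocks until the cleanup has run.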
------------- PR Review Comment: https://git.openjdk.org/crac/pull/73#discussion_r1220916493 From duke at openjdk.org Wed Jun 7 08:18:09 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 7 Jun 2023 08:18:09 GMT Subject: [crac] RFR: Active cleanup by CleanerImpl on checkpoint [v5] In-Reply-To: References: Message-ID: > Rather than registering PhantomCleanableRef as a resource that gets cleaned up in the C/R thread force cleanup in the CleanerImpl as this already keeps a list of all eligible references. Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits: - Merge branch 'crac' into sync_cleaner - Cleaner thread does not need to wait for checkpoint thread - Don't block the cleaner thread during C/R at all. - Clean GC'ed references on checkpoint - Merge branch 'crac' into sync_cleaner - Prevent concurrent cleanup by cleaner thread and checkpoint notifications ------------- Changes: https://git.openjdk.org/crac/pull/73/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=73&range=04 Stats: 143 lines in 9 files changed: 99 ins; 33 del; 11 mod Patch: https://git.openjdk.org/crac/pull/73.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/73/head:pull/73 PR: https://git.openjdk.org/crac/pull/73 From akozlov at azul.com Wed Jun 7 08:30:13 2023 From: akozlov at azul.com (Anton Kozlov) Date: Wed, 7 Jun 2023 11:30:13 +0300 Subject: Result: New CRaC Committer: Radim Vansa Message-ID: <9974ee4d-ab6f-865e-bedf-a0827f9642cb@azul.com> Voting for Radim Vansa [1] is now closed. Yes: 2 Veto: 0 Abstain: 0 According to the Bylaws definition of Lazy Consensus, this is sufficient to approve the nomination. 
Anton Kozlov [1] https://mail.openjdk.org/pipermail/crac-dev/2023-May/000891.html From duke at openjdk.org Wed Jun 7 08:32:25 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 7 Jun 2023 08:32:25 GMT Subject: [crac] RFR: Use special class for exception aggregates [v4] In-Reply-To: References: <_AMDGDW7jBHD0sWCap1QKdw3fpk8l4yG8SquNCZdRRY=.efb87be3-45f8-4c99-b470-b7cba2669ed7@github.com> <_HutEHb0_FgiGAmImfmICJMIl2KJX8IlDVwpxTdAw28=.db1c4633-74a9-462f-a809-7ef3c4cc236f@github.com> <6DVU9HYy46e5dglqxqUXNfl17sKEM7XuP1H5kplhEC8=.caead7f0-602c-4fa3-a1e5-97b09a50beb4@github.com> Message-ID: On Tue, 6 Jun 2023 17:03:43 GMT, Anton Kozlov wrote: >> `Core` would throw a generic exception when something failed in the native part (`criuengine` returned exit code 1...). A custom implementation of `Context` would throw it when it can't call its children for some reason (but it's not a failure in the Resource itself). > > But that does not take away the need for the base CheckpointException to be able to aggregate exceptions from Resources, so CheckpointException and Combined are mostly overlapping. Moreover, we'll need to be able to re-aggregate exceptions of two types. In the current form, Combined and CheckpointException are both fine to be thrown from a user context, and there is no guidance on which one to use. > > Let's keep it simple and use CheckpointException for aggregations as it was supposed to be. The message constructor should be removed, and every constructor-with-the-message call should be replaced with an exception of a different type. Especially since we have examples for this [1]. > > [1] https://github.com/openjdk/crac/blob/6f403eaf7655f9ce4d10da25082a836d0ad1574c/src/java.base/share/classes/jdk/crac/Core.java#L164 There's nothing that would prevent the message-carrying exception from having suppressed exceptions, it just means that the 'carrier' would not be discarded.
Anyway, if you insist on a single type, yes, we can effectively make the CheckpointException = Combined by removing the message, and keeping all info in the suppressed exceptions. Not the best way, IMO, but my main concern is not losing any information by accident. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/64#discussion_r1221147821 From duke at openjdk.org Wed Jun 7 10:53:32 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 7 Jun 2023 10:53:32 GMT Subject: [crac] RFR: Use special class for exception aggregates [v5] In-Reply-To: References: Message-ID: > Extracted non-essential changes from other PR. Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits: - Merge branch 'crac' into code_cleanup - Remove the .Combined, always use the main exception as aggregate - Merge branch 'crac' into code_cleanup - Handle .Combined in Core - Use Combiner exception class for aggregate-only exceptions - Minor code cleanup and improvements ------------- Changes: https://git.openjdk.org/crac/pull/64/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=64&range=04 Stats: 69 lines in 9 files changed: 9 ins; 39 del; 21 mod Patch: https://git.openjdk.org/crac/pull/64.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/64/head:pull/64 PR: https://git.openjdk.org/crac/pull/64 From akozlov at openjdk.org Wed Jun 7 10:59:00 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 7 Jun 2023 10:59:00 GMT Subject: [crac] RFR: Draft: Move more FD tracking to java layer Message-ID: The PR develops the idea of file descriptor tracking in Java started in #43. In general, that PR had two purposes. First, it provides CheckpointExceptions in terms that are clear for Java developers, improving the experience of developing for CRaC.
So if a FileDescriptor causes an exception, it's possible to look at the heap dump and find references to the offending FD, or to look at the stack trace from when the FD was created. And second, Java FD tracking is independent of the platform, so that was the first step to bring CRaC to non-Linux platforms, but that is a bit longer road. We can eliminate manual heap inspection, and this is proposed in this PR. A FileDescriptor does not exist on its own but it is owned by some higher-level Java object implementation. So an object can "claim" a FileDescriptor and define how and if to report the FD to the user. E.g. Socket can describe its port and address without deep inspection of the process internals. Turns out, Socket.toString() provides enough information (but the reporting can be extended later if required). Suppressed: jdk.crac.impl.CheckpointOpenSocketException: Socket[addr=localhost/127.0.0.1,port=39957,localport=41464] at java.base/java.net.SocketImpl$SocketResource.lambda$beforeCheckpoint$0(SocketImpl.java:123) at java.base/jdk.crac.Core.lambda$checkpointRestore1$0(Core.java:128) ... 7 more A FileDescriptor claims itself in case a bug in the JDK leaves the FD unclaimed by any higher-level object. In that case the FD provides only a very short description for debugging. With the stack trace to the FD (which is a very nice debugging aid!), that should be enough to find the containing object and implement claiming. I believe this overlaps with #69, which at first glance would benefit a lot from being able to define policies in the domain objects. I'll comment on this after a closer look at the other PR. ------------- Commit messages: - Remove misleading UnreachableSocket - Merge remote-tracking branch 'jdk/crac/crac' into newfd - Cleanup - Remove AcceptingSocketResource - Cleanup and extend FileDescriptor nativeDesc - Update - Workaround for the blocking context - Claim closing socket - Cleanup - Cleanup - ...
and 83 more: https://git.openjdk.org/crac/compare/6f403eaf...e698e18e Changes: https://git.openjdk.org/crac/pull/79/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=79&range=00 Stats: 742 lines in 22 files changed: 365 ins; 296 del; 81 mod Patch: https://git.openjdk.org/crac/pull/79.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/79/head:pull/79 PR: https://git.openjdk.org/crac/pull/79 From akozlov at openjdk.org Wed Jun 7 11:24:25 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 7 Jun 2023 11:24:25 GMT Subject: [crac] RFR: Use special class for exception aggregates [v5] In-Reply-To: References: Message-ID: On Wed, 7 Jun 2023 10:53:32 GMT, Radim Vansa wrote: >> Extracted non-essential changes from other PR. > > Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits: > > - Merge branch 'crac' into code_cleanup > - Remove the .Combined, always use the main exception as aggregate > - Merge branch 'crac' into code_cleanup > - Handle .Combined in Core > - Use Combiner exception class for aggregate-only exceptions > - Minor code cleanup and improvements Few minor comments src/java.base/share/classes/jdk/crac/Core.java line 147: > 145: if (messages.length == 0) { > 146: checkpointException.addSuppressed(new RuntimeException("Native checkpoint failed")); > 147: } codes.length == messages.length, and they are used to communicate native descriptors problems (handled in translateJVMExceptions). So a problem should be reported based on the retCode below, if that is not reported already (?) 
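
As an aside on the review comment above, the aggregate-only shape being discussed (an exception without its own message whose information lives entirely in the suppressed exceptions) can be sketched with plain `Throwable.addSuppressed`. This is a minimal illustration under assumptions: `AggregateException` and `aggregate` are hypothetical stand-ins, not the `jdk.crac.CheckpointException` API; only the fallback text "Native checkpoint failed" is taken from the quoted snippet.

```java
import java.io.IOException;
import java.util.List;

public class AggregateSketch {
    // Hypothetical stand-in for an aggregate-only exception: it has no message of its own.
    static class AggregateException extends Exception {
        AggregateException() {
            super();
        }
    }

    // Collects all failures as suppressed exceptions. If nothing more specific was
    // reported, attach a generic cause so the aggregate is never empty.
    public static AggregateException aggregate(List<? extends Throwable> failures) {
        AggregateException aggregate = new AggregateException();
        for (Throwable failure : failures) {
            aggregate.addSuppressed(failure);
        }
        if (aggregate.getSuppressed().length == 0) {
            aggregate.addSuppressed(new RuntimeException("Native checkpoint failed"));
        }
        return aggregate;
    }

    public static void main(String[] args) {
        AggregateException e = aggregate(List.of(
                new IllegalStateException("socket still open"),
                new IOException("file still open")));
        for (Throwable suppressed : e.getSuppressed()) {
            System.out.println(suppressed);
        }
    }
}
```

Because every failure is kept as a suppressed exception, re-aggregating at an outer context only has to copy `getSuppressed()` over, and no information is dropped along the way.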
src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java line 1: Empty file src/java.base/share/classes/jdk/crac/impl/PriorityContext.java line 1: Empty file ------------- PR Review: https://git.openjdk.org/crac/pull/64#pullrequestreview-1467353319 PR Review Comment: https://git.openjdk.org/crac/pull/64#discussion_r1221390006 PR Review Comment: https://git.openjdk.org/crac/pull/64#discussion_r1221395950 PR Review Comment: https://git.openjdk.org/crac/pull/64#discussion_r1221395710 From akozlov at openjdk.org Wed Jun 7 11:24:25 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 7 Jun 2023 11:24:25 GMT Subject: [crac] RFR: Use special class for exception aggregates [v5] In-Reply-To: References: <_AMDGDW7jBHD0sWCap1QKdw3fpk8l4yG8SquNCZdRRY=.efb87be3-45f8-4c99-b470-b7cba2669ed7@github.com> <_HutEHb0_FgiGAmImfmICJMIl2KJX8IlDVwpxTdAw28=.db1c4633-74a9-462f-a809-7ef3c4cc236f@github.com> <6DVU9HYy46e5dglqxqUXNfl17sKEM7XuP1H5kplhEC8=.caead7f0-602c-4fa3-a1e5-97b09a50beb4@github.com> Message-ID: On Wed, 7 Jun 2023 08:29:32 GMT, Radim Vansa wrote: >> But that does not take away the need for the base CheckpointException to be able to aggregate exceptions from Resources, so CheckpointException and Combined are mostly overlapping. Moreover, we'll need to be able to re-aggregate exceptions of two types. In the current form, Combined and CheckpointException are both fine to be thrown from a user context, and there is no guidance on which one to use. >> >> Let's keep it simple and use CheckpointException for aggregations as it was supposed to be. The message constructor should be removed, and every constructor-with-the-message call should be replaced with an exception of a different type. Especially since we have examples for this [1].
>> >> [1] https://github.com/openjdk/crac/blob/6f403eaf7655f9ce4d10da25082a836d0ad1574c/src/java.base/share/classes/jdk/crac/Core.java#L164 > > There's nothing that would prevent the message including exception from having suppressed exceptions, it just means that the 'carrier' would not be discarded. > > Anyway, if you insist on single type, yes, we can effectively make the CheckpointException = Combined by removing the message, and keeping all info in the suppressed exceptions. Not the best way, IMO, but my main concern is not losing any information by accident. Thank you! I think the PR is mostly good, see the comments. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/64#discussion_r1221412416 From duke at openjdk.org Wed Jun 7 11:35:31 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 7 Jun 2023 11:35:31 GMT Subject: [crac] RFR: Use special class for exception aggregates [v6] In-Reply-To: References: Message-ID: > Extracted non-essential changes from other PR. 
Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Dropped empty files (merge artifact) ------------- Changes: - all: https://git.openjdk.org/crac/pull/64/files - new: https://git.openjdk.org/crac/pull/64/files/82582932..128b64a3 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=64&range=05 - incr: https://webrevs.openjdk.org/?repo=crac&pr=64&range=04-05 Stats: 0 lines in 2 files changed: 0 ins; 0 del; 0 mod Patch: https://git.openjdk.org/crac/pull/64.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/64/head:pull/64 PR: https://git.openjdk.org/crac/pull/64 From duke at openjdk.org Wed Jun 7 11:35:32 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 7 Jun 2023 11:35:32 GMT Subject: [crac] RFR: Use special class for exception aggregates [v5] In-Reply-To: References: Message-ID: <4FoWjSi9jtmLTxrBOi2cdSH2c0KFrFRmV46IeCn57Nc=.d92289a4-40fb-48d0-ae56-48278518f348@github.com> On Wed, 7 Jun 2023 11:01:05 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits: >> >> - Merge branch 'crac' into code_cleanup >> - Remove the .Combined, always use the main exception as aggregate >> - Merge branch 'crac' into code_cleanup >> - Handle .Combined in Core >> - Use Combiner exception class for aggregate-only exceptions >> - Minor code cleanup and improvements > > src/java.base/share/classes/jdk/crac/Core.java line 147: > >> 145: if (messages.length == 0) { >> 146: checkpointException.addSuppressed(new RuntimeException("Native checkpoint failed")); >> 147: } > > codes.length == messages.length, and they are used to communicate native descriptors problems (handled in translateJVMExceptions). So a problem should be reported based on the retCode below, if that is not reported already (?) 
When the checkpoint does not fail in JVM code (checking open FDs) but the engine returns non-zero status, the retCode is `JVM_CHECKPOINT_ERROR` but the messages are empty. In this case the exception would look just like CheckpointException without any explanation nor stack trace, and the user would be rather clueless. We could add to this message a hint to check out `dump4.log`, but I am not entirely sure about this as it breaks encapsulation of CR engine (alternative implementations... though we might make CRIU use 'standardized' location) and it might not exist at all, e.g. when the CRIU binary is not present. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/64#discussion_r1221418669 From jkratochvil at openjdk.org Wed Jun 7 12:57:59 2023 From: jkratochvil at openjdk.org (Jan Kratochvil) Date: Wed, 7 Jun 2023 12:57:59 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v30] In-Reply-To: References: Message-ID: > Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. > > 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. > 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC > > (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: > >> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine > > It is intended to specify the lowest common denominator of all CPUs in a farm. 
Instead of a possible crash of OpenJDK it will now refuse to run: > >> Error occurred during initialization of VM >> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi > > (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. > > If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there is a precedent ... Jan Kratochvil has updated the pull request incrementally with two additional commits since the last revision: - Fix hotspot 'ht' vs. glibc 'htt'. - CPUFeatures refactorization. Start CPU Features checking without libc. 
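
The numbers in the error message quoted above can be checked with plain bit arithmetic: the "missing" mask is the set of bits present in the checkpoint image's features but absent on the restoring machine, and the lowest common denominator of a farm is the AND of all machines' masks. The sketch below only reproduces that arithmetic; the class and method names are hypothetical and it does not claim to mirror how the JVM implements the check.

```java
public class CpuFeatureCheck {
    // Bits set in the image's feature mask that this machine does not provide.
    public static long missing(long imageFeatures, long machineFeatures) {
        return imageFeatures & ~machineFeatures;
    }

    // Lowest common denominator of a farm: features present on every machine.
    public static long commonDenominator(long... masks) {
        long common = -1L; // all bits set; each mask can only clear bits
        for (long mask : masks) {
            common &= mask;
        }
        return common;
    }

    public static void main(String[] args) {
        long image = 0x7fff9dfcfbf7L;   // features recorded in the checkpoint (from the message)
        long machine = 0x421801fcfbd7L; // features of the restoring machine (from the message)
        System.out.printf("missing = 0x%x%n", missing(image, machine));
        System.out.printf("-XX:CPUFeatures=0x%x%n", commonDenominator(image, machine));
    }
}
```

With the two masks from the message, the missing bits come out as 0x3de79c000020 and the common mask as 0x421801fcfbd7, matching the values the VM prints.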
------------- Changes: - all: https://git.openjdk.org/crac/pull/41/files - new: https://git.openjdk.org/crac/pull/41/files/279ddbdc..d3cb3b55 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=29 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=28-29 Stats: 149 lines in 2 files changed: 69 ins; 45 del; 35 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41 From duke at openjdk.org Wed Jun 7 13:08:30 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 7 Jun 2023 13:08:30 GMT Subject: [crac] RFR: Draft: Move more FD tracking to java layer In-Reply-To: References: Message-ID: <5S1ePK22x8ZfSuDvQHrh1vJkbSlBj3-KNGmNEmGVMIo=.3521df6d-fe72-48f5-bd23-552f6ab87561@github.com> On Wed, 7 Jun 2023 10:51:41 GMT, Anton Kozlov wrote: > The PR develops the idea of file descriptors tracking in Java started in #43. In general, that PR had two purposes. First, it provides CheckpointExceptions in terms that are clear for Java developers, improving the experience of developing for CRac. So if a FileDescriptor causes an exception, it's possible to look at the heap dump and find references to the offending FD, or to look at the stack trace when FD was created. And second, Java FD tracking is independent of the platform, so that was the first step to bring CRaC to non-Linux platforms, but that is a bit longer road. > > We can eliminate manual heap inspection, and this is proposed in this PR. A FileDescriptor does not exist on its own but it is owned by some higher-level Java object implementation. So an object can "claim" a FileDescriptor and define how and if to report the FD to the user. E.g. Socket can describe the its port and address without deep inspection of the process internals. Turns out, Socket.toString() provides enough information (but the reporting can be extended later if required). 
> > > Suppressed: jdk.crac.impl.CheckpointOpenSocketException: Socket[addr=localhost/127.0.0.1,port=39957,localport=41464] > at java.base/java.net.SocketImpl$SocketResource.lambda$beforeCheckpoint$0(SocketImpl.java:123) > at java.base/jdk.crac.Core.lambda$checkpointRestore1$0(Core.java:128) > ... 7 more > > > A FileDescriptor claims itself in case a bug in the JDK leaves the FD unclaimed by any higher-level object. In that case the FD provides only a very short description for debugging. With the stack trace to the FD (which is a very nice debugging aid!), that should be enough to find the containing object and implement claiming. > > I believe this overlaps with #69, which at first glance would benefit a lot from being able to define policies in the domain objects. I'll comment on this after a closer look at the other PR. You've shown an example with a socket, but is there any case when this gets really helpful? You present a socket, but there's no extra info provided compared to what we already had. I guess that you intend to have the claiming part implemented later on for most of the 'owners', but I think that most of the time the FileDescriptor is created within the owner's constructor/init method, and therefore the owner would be obvious from the stack trace. One point would be not requiring a re-run with stack trace collection, but when you figure out that the FD was created by RandomAccessFile you're not really closer to the part of the code you need to fix - you wouldn't scan all your dependencies for any use of RAF. You mention investigating heap dumps - I don't think that's needed anymore when you have the stack trace. Could you present a real case (preferably in a test!) where knowing the ownership in terms of references (as from the heap dump) is more practical? One risk I perceive with this approach is when the FD is not owned directly, but through a chain of possible owners, e.g. FileDescriptor -> FileOutputStream -> FileWriter -> LibraryClass -> ApplicationClass.
You would have to propagate the FileDescriptor through the layers, breaking encapsulation and adding more and more code, or lose the information about ownership anyway. test/jdk/jdk/crac/fileDescriptors/OpenFileDetectionTest.java line 51: > 49: @Override > 50: public void exec() throws Exception { > 51: try (var file = new RandomAccessFile("/etc/passwd", "r")) { What would happen with the FileReader? It would be better to not remove test for particular case, but add one for RandomAccessFile. ------------- PR Review: https://git.openjdk.org/crac/pull/79#pullrequestreview-1467539190 PR Review Comment: https://git.openjdk.org/crac/pull/79#discussion_r1221514924 From akozlov at openjdk.org Wed Jun 7 15:44:25 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 7 Jun 2023 15:44:25 GMT Subject: [crac] RFR: Use special class for exception aggregates [v5] In-Reply-To: <4FoWjSi9jtmLTxrBOi2cdSH2c0KFrFRmV46IeCn57Nc=.d92289a4-40fb-48d0-ae56-48278518f348@github.com> References: <4FoWjSi9jtmLTxrBOi2cdSH2c0KFrFRmV46IeCn57Nc=.d92289a4-40fb-48d0-ae56-48278518f348@github.com> Message-ID: On Wed, 7 Jun 2023 11:27:30 GMT, Radim Vansa wrote: >> src/java.base/share/classes/jdk/crac/Core.java line 147: >> >>> 145: if (messages.length == 0) { >>> 146: checkpointException.addSuppressed(new RuntimeException("Native checkpoint failed")); >>> 147: } >> >> codes.length == messages.length, and they are used to communicate native descriptors problems (handled in translateJVMExceptions). So a problem should be reported based on the retCode below, if that is not reported already (?) > > When the checkpoint does not fail in JVM code (checking open FDs) but the engine returns non-zero status, the retCode is `JVM_CHECKPOINT_ERROR` but the messages are empty. In this case the exception would look just like CheckpointException without any explanation nor stack trace, and the user would be rather clueless. 
> We could add to this message a hint to check out `dump4.log`, but I am not entirely sure about this as it breaks encapsulation of CR engine (alternative implementations... though we might make CRIU use 'standardized' location) and it might not exist at all, e.g. when the CRIU binary is not present. Oh, I see. The problem is CHECKPOINT_ERROR serves too many purposes: native descriptor failure, native checkpoint failure, and even native restore failure. I'm fine with this code, but please move this if to translateJVMExceptions with a FIXME comment. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/64#discussion_r1221820652 From akozlov at openjdk.org Wed Jun 7 15:49:33 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 7 Jun 2023 15:49:33 GMT Subject: [crac] RFR: Use special class for exception aggregates [v5] In-Reply-To: References: <4FoWjSi9jtmLTxrBOi2cdSH2c0KFrFRmV46IeCn57Nc=.d92289a4-40fb-48d0-ae56-48278518f348@github.com> Message-ID: On Wed, 7 Jun 2023 15:42:00 GMT, Anton Kozlov wrote: >> When the checkpoint does not fail in JVM code (checking open FDs) but the engine returns non-zero status, the retCode is `JVM_CHECKPOINT_ERROR` but the messages are empty. In this case the exception would look just like CheckpointException without any explanation nor stack trace, and the user would be rather clueless. >> We could add to this message a hint to check out `dump4.log`, but I am not entirely sure about this as it breaks encapsulation of CR engine (alternative implementations... though we might make CRIU use 'standardized' location) and it might not exist at all, e.g. when the CRIU binary is not present. > > Oh, I see. The problem is CHECKPOINT_ERROR serves too many purposes: native descriptor failure, native checkpoint failure, and even native restore failure. I'm fine with this code, but please move this if to translateJVMExceptions with a FIXME comment. Regarding providing more details, in general, that should be handled by the engine. 
The criuengine (as the entity controlling the dump4.log name and the path) has a chance to print a few lines from the log, e.g. by calling `system("tail ...")`; although not very pretty, that will be better than nothing. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/64#discussion_r1221827050 From akozlov at openjdk.org Wed Jun 7 16:00:32 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 7 Jun 2023 16:00:32 GMT Subject: [crac] RFR: Active cleanup by CleanerImpl on checkpoint [v3] In-Reply-To: References: Message-ID: On Wed, 7 Jun 2023 06:14:36 GMT, Radim Vansa wrote: >> src/java.base/share/classes/jdk/internal/ref/CleanerImpl.java line 160: >> >>> 158: wait(); >>> 159: } >>> 160: cleanupComplete = false; >> >> AFAIU this synchronizes with beforeCheckpoint to verify that cleanupComplete==true is noticed. An alternative would be to drop this sync, releasing the cleaner sooner, and then just reset cleanupComplete in beforeCheckpoint(). But this is not a big deal. > > This has a point, the waiting is not needed in here. I've moved the synchronization from previous version rather mindlessly. In the current state, at initialization `forceCleanup == cleanupComplete == false`. But after synchronization has been started, `forceCleanup == !cleanupComplete`, always. It seems possible to drop cleanupComplete, right? ------------- PR Review Comment: https://git.openjdk.org/crac/pull/73#discussion_r1221843728 From duke at openjdk.org Thu Jun 8 06:35:45 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 8 Jun 2023 06:35:45 GMT Subject: [crac] RFR: Use special class for exception aggregates [v7] In-Reply-To: References: Message-ID: > Extracted non-essential changes from other PR.
Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Moved exception after native failure ------------- Changes: - all: https://git.openjdk.org/crac/pull/64/files - new: https://git.openjdk.org/crac/pull/64/files/128b64a3..50926f02 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=64&range=06 - incr: https://webrevs.openjdk.org/?repo=crac&pr=64&range=05-06 Stats: 10 lines in 1 file changed: 5 ins; 3 del; 2 mod Patch: https://git.openjdk.org/crac/pull/64.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/64/head:pull/64 PR: https://git.openjdk.org/crac/pull/64 From duke at openjdk.org Thu Jun 8 08:35:34 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 8 Jun 2023 08:35:34 GMT Subject: [crac] RFR: Active cleanup by CleanerImpl on checkpoint [v6] In-Reply-To: References: Message-ID: > Rather than registering PhantomCleanableRef as a resource that gets cleaned up in the C/R thread force cleanup in the CleanerImpl as this already keeps a list of all eligible references. 
Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Remove cleanupComplete, handle thread termination ------------- Changes: - all: https://git.openjdk.org/crac/pull/73/files - new: https://git.openjdk.org/crac/pull/73/files/4e4d8047..0b58e7e1 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=73&range=05 - incr: https://webrevs.openjdk.org/?repo=crac&pr=73&range=04-05 Stats: 15 lines in 1 file changed: 9 ins; 3 del; 3 mod Patch: https://git.openjdk.org/crac/pull/73.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/73/head:pull/73 PR: https://git.openjdk.org/crac/pull/73 From duke at openjdk.org Thu Jun 8 08:35:36 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 8 Jun 2023 08:35:36 GMT Subject: [crac] RFR: Active cleanup by CleanerImpl on checkpoint [v3] In-Reply-To: References: Message-ID: On Wed, 7 Jun 2023 15:57:49 GMT, Anton Kozlov wrote: >> This has a point, the waiting is not needed in here. I've moved the synchronization from previous version rather mindlessly. > > In the current state, at initialization `forceCleanup == cleanupComplete == false`. But after synchronization has been started, `forceCleanup == !cleanupComplete`, always. It seems possible to drop cleanupComplete, right? Right. I also realized that I didn't handle thread termination well; had the thread exited just after setting `forceCleanup`, the checkpoint thread might end up waiting for a notify that never comes. I am not sure if this can happen in practice, because one of the refs in the list is the cleaner itself, but let's be on the safe side. 
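The termination race mentioned in this message can be illustrated with a small stand-alone sketch. This is not the CleanerImpl code: the class `ForcedCleanupSketch`, the `terminated` flag and all method names are hypothetical, and only the `forceCleanup` flag name is taken from the discussion. The idea is that the waiting side re-checks a termination flag in its wait loop, so it cannot block forever on a notify that never comes.

```java
// Hypothetical sketch, not the actual CleanerImpl implementation.
final class ForcedCleanupSketch {
    private boolean forceCleanup; // true while a requested cleanup pass is pending
    private boolean terminated;   // true once the cleaner thread has exited

    /** Checkpoint side: request a pass; returns true only if the pass completed. */
    synchronized boolean requestAndAwait() {
        if (terminated) {
            return false; // nobody left to do the work, do not wait at all
        }
        forceCleanup = true;
        notifyAll();
        try {
            // Leave the loop on completion OR on cleaner termination.
            while (forceCleanup && !terminated) {
                wait();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !forceCleanup;
    }

    /** Cleaner side: block until a pass is requested. */
    synchronized boolean awaitRequest() {
        try {
            while (!forceCleanup) {
                wait();
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Cleaner side: report that the requested pass has finished. */
    synchronized void cleanupDone() {
        forceCleanup = false;
        notifyAll();
    }

    /** Cleaner side: must run on every exit path of the cleaner thread. */
    synchronized void cleanerExited() {
        terminated = true;
        notifyAll();
    }

    public static void main(String[] args) throws InterruptedException {
        ForcedCleanupSketch state = new ForcedCleanupSketch();
        Thread cleaner = new Thread(() -> {
            try {
                if (state.awaitRequest()) {
                    state.cleanupDone();
                }
            } finally {
                state.cleanerExited();
            }
        });
        cleaner.start();
        System.out.println("first pass completed: " + state.requestAndAwait());
        cleaner.join();
        // The cleaner is gone now; this call returns instead of hanging.
        System.out.println("second pass completed: " + state.requestAndAwait());
    }
}
```

With a single request flag plus a termination flag, no separate `cleanupComplete` field is needed in this sketch, which mirrors the simplification agreed on in the review.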
------------- PR Review Comment: https://git.openjdk.org/crac/pull/73#discussion_r1222641337 From akozlov at openjdk.org Thu Jun 8 09:26:21 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 8 Jun 2023 09:26:21 GMT Subject: [crac] RFR: Use special class for exception aggregates [v7] In-Reply-To: References: Message-ID: On Thu, 8 Jun 2023 06:35:45 GMT, Radim Vansa wrote: >> Extracted non-essential changes from other PR. > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Moved exception after native failure Could you please update the PR title to match the new content? src/java.base/share/classes/jdk/crac/CheckpointException.java line 36: > 34: > 35: /** > 36: * Creates a {@code CheckpointException}. I probably misclicked "Add comment", https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/Throwable.html#%3Cinit%3E(java.lang.String,java.lang.Throwable,boolean,boolean): > Subclasses of Throwable should document any conditions under which suppression is disabled and document conditions under which the stack trace is not writable. The javadoc should specify that the stack trace is not writable, here and for the other exceptions. ------------- PR Review: https://git.openjdk.org/crac/pull/64#pullrequestreview-1469403293 PR Review Comment: https://git.openjdk.org/crac/pull/64#discussion_r1222706815 From akozlov at openjdk.org Thu Jun 8 11:22:19 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 8 Jun 2023 11:22:19 GMT Subject: [crac] RFR: Active cleanup by CleanerImpl on checkpoint [v6] In-Reply-To: References: Message-ID: On Thu, 8 Jun 2023 08:35:34 GMT, Radim Vansa wrote: >> Rather than registering PhantomCleanableRef as a resource that gets cleaned up in the C/R thread force cleanup in the CleanerImpl as this already keeps a list of all eligible references.
> > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Remove cleanupComplete, handle thread termination LGTM, thank you! ------------- Marked as reviewed by akozlov (Lead). PR Review: https://git.openjdk.org/crac/pull/73#pullrequestreview-1469688133 From duke at openjdk.org Thu Jun 8 11:25:20 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 8 Jun 2023 11:25:20 GMT Subject: [crac] Integrated: Active cleanup by CleanerImpl on checkpoint In-Reply-To: References: Message-ID: On Tue, 16 May 2023 11:52:54 GMT, Radim Vansa wrote: > Rather than registering PhantomCleanableRef as a resource that gets cleaned up in the C/R thread force cleanup in the CleanerImpl as this already keeps a list of all eligible references. This pull request has now been integrated. Changeset: 241a3465 Author: Radim Vansa Committer: Anton Kozlov URL: https://git.openjdk.org/crac/commit/241a3465c1358e6f9bcabb03aaa211d6cd3cfaa9 Stats: 149 lines in 9 files changed: 105 ins; 33 del; 11 mod Active cleanup by CleanerImpl on checkpoint Reviewed-by: akozlov ------------- PR: https://git.openjdk.org/crac/pull/73 From duke at openjdk.org Thu Jun 8 11:50:31 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 8 Jun 2023 11:50:31 GMT Subject: [crac] RFR: Make CheckpointException/RestoreException aggregate-only [v8] In-Reply-To: References: Message-ID: > CheckpointException and RestoreException (both jdk.crac and javax.crac) cannot carry any message, cause and don't collect stack trace. The sole purpose of these is to aggregate other exceptions in suppressed exceptions. 
Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Update javadoc, make final ------------- Changes: - all: https://git.openjdk.org/crac/pull/64/files - new: https://git.openjdk.org/crac/pull/64/files/50926f02..78dcaca2 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=64&range=07 - incr: https://webrevs.openjdk.org/?repo=crac&pr=64&range=06-07 Stats: 24 lines in 4 files changed: 8 ins; 8 del; 8 mod Patch: https://git.openjdk.org/crac/pull/64.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/64/head:pull/64 PR: https://git.openjdk.org/crac/pull/64 From duke at openjdk.org Thu Jun 8 11:50:36 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 8 Jun 2023 11:50:36 GMT Subject: [crac] RFR: Make CheckpointException/RestoreException aggregate-only [v7] In-Reply-To: References: Message-ID: On Thu, 8 Jun 2023 09:23:27 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Moved exception after native failure > > src/java.base/share/classes/jdk/crac/CheckpointException.java line 36: > >> 34: >> 35: /** >> 36: * Creates a {@code CheckpointException}. > > I probably misclicked "Add comment", > > https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/Throwable.html#%3Cinit%3E(java.lang.String,java.lang.Throwable,boolean,boolean): >> Subclasses of Throwable should document any conditions under which suppression is disabled and document conditions under which the stack trace is not writable. > > The javadoc should specify the stack trace is not writeable, here and for other exceptions Updated. I've also made those classes final to prevent subclassing with a specific meaning - the actual information should be in the suppressed exceptions. 
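For readers unfamiliar with the four-argument `Throwable` constructor referenced in this thread, here is a minimal sketch of an aggregate-only exception. The class name `AggregateOnlyException` and the helper `collect` are hypothetical and are not the CRaC classes; the sketch only assumes the standard protected constructor `Exception(String, Throwable, boolean enableSuppression, boolean writableStackTrace)`.

```java
import java.util.List;

// Hypothetical illustration; the real CheckpointException/RestoreException may differ.
final class AggregateOnlyException extends Exception {
    AggregateOnlyException() {
        // No message, no cause, suppression enabled, stack trace NOT writable.
        super(null, null, true, false);
    }

    /** Wraps individual failures as suppressed exceptions of one aggregate. */
    static AggregateOnlyException collect(List<? extends Throwable> failures) {
        AggregateOnlyException aggregate = new AggregateOnlyException();
        for (Throwable failure : failures) {
            aggregate.addSuppressed(failure);
        }
        return aggregate;
    }

    public static void main(String[] args) {
        AggregateOnlyException e = collect(List.of(
                new IllegalStateException("first failure"),
                new IllegalStateException("second failure")));
        System.out.println("suppressed: " + e.getSuppressed().length);
        System.out.println("stack frames: " + e.getStackTrace().length);
    }
}
```

Because the stack trace is not writable, `getStackTrace()` returns an empty array, so all useful information lives in `getSuppressed()`. Making the class final, as done in this revision of the PR, keeps subclasses from attaching a more specific meaning to the aggregate.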
------------- PR Review Comment: https://git.openjdk.org/crac/pull/64#discussion_r1222925408 From akozlov at openjdk.org Thu Jun 8 13:33:23 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 8 Jun 2023 13:33:23 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v24] In-Reply-To: References: Message-ID: <5ad-v7omYA-n9t0xhVYf8O5GrPVq8uCtTYMlNlNp11U=.4dc16e2b-719f-4855-871f-261ff81cd745@github.com> On Thu, 1 Jun 2023 13:44:36 GMT, Jan Kratochvil wrote: >> It was needed for some previous variant of the patch and it is a leftover. Removed. Sorry for not self-reviewing this version of my patch. > > It is in fact needed: https://github.com/openjdk/crac/blob/08c2c0f2b140b1011fec39a1a83037df8e59577d/src/hotspot/os_cpu/linux_x86/os_linux_x86.cpp#L217 > During CRaC restore there is somehow no Java thread created yet that time. I do not understand this much so I left it there as working but maybe some Java thread should be initialized earlier? Sorry, I've missed the comment in the code. Looks OK. Since verify_cpu_compatibility is called from VMThread, which is not JavaThreads, the thread is not passed from the caller. https://github.com/openjdk/crac/blob/08c2c0f2b140b1011fec39a1a83037df8e59577d/src/hotspot/os/posix/signals_posix.cpp#L627 ------------- PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1222925825 From akozlov at openjdk.org Thu Jun 8 13:33:24 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 8 Jun 2023 13:33:24 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v14] In-Reply-To: References: Message-ID: <868NIuoT3iFtWIOtQjyAIaRYrjeJQX2YGsz_RbwCvTM=.b20b47c4-4710-4974-9059-6f9934bd4ba9@github.com> On Thu, 18 May 2023 13:57:09 GMT, Jan Kratochvil wrote: >> src/hotspot/share/runtime/stubCodeGenerator.cpp line 62: >> >>> 60: void StubCodeDesc::thaw() { >>> 61: assert(_frozen, "repeated thaw operation"); >>> 62: _frozen = false; >> >> Is it still necessary? 
I've tried to comment this line out, and checkpoint-restore succeeded for me. > > I get during restore: > > # Internal Error (../../src/hotspot/share/runtime/stubCodeGenerator.hpp:72), pid=12265, tid=12273 > # assert(!_frozen) failed: no modifications allowed > > Did you really use a `*debug` build (and not `release` build)? The crash above has been generated on `slowdebug`. Indeed, something was wrong with my setup. The complete stack trace:
Stack: [0x00007f7f6eb00000,0x00007f7f6ec00000], sp=0x00007f7f6ebfbc70, free space=1007k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x171395c] StubCodeMark::StubCodeMark(StubCodeGenerator*, char const*, char const*)+0x19c
V [libjvm.so+0x1923334] VM_Version_StubGenerator::generate_get_cpu_info()+0x354
V [libjvm.so+0x1921997] VM_Version::initialize_features()+0x157
V [libjvm.so+0x13cddb5] VM_Crac::verify_cpu_compatibility()+0x65
V [libjvm.so+0x13d5a3b] VM_Crac::doit()+0x4cb
V [libjvm.so+0x18eeb53] VM_Operation::evaluate()+0x213
V [libjvm.so+0x191307c] VMThread::evaluate_operation(VM_Operation*)+0x19c
V [libjvm.so+0x1913d1d] VMThread::inner_execute(VM_Operation*)+0x20d
V [libjvm.so+0x1913f55] VMThread::loop()+0xb5
V [libjvm.so+0x1914089] VMThread::run()+0xc9
V [libjvm.so+0x180e9f8] Thread::call_run()+0xf8
V [libjvm.so+0x13d04c1] thread_native_entry(Thread*)+0x101
The generate_get_cpu_info is called at https://github.com/openjdk/crac/blob/241a3465c1358e6f9bcabb03aaa211d6cd3cfaa9/src/hotspot/cpu/x86/vm_version_x86.cpp#L1898, and the stub is preserved. Can we just avoid code generation on repeated calls of initialize_features? And drop thaw() completely?
------------- PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1222916730 From akozlov at openjdk.org Thu Jun 8 13:33:23 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 8 Jun 2023 13:33:23 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v30] In-Reply-To: References: Message-ID: On Wed, 7 Jun 2023 12:57:59 GMT, Jan Kratochvil wrote: >> Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. >> >> 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. >> 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC >> >> (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: >> >>> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine >> >> It is intended to specify the lowest common denominator of all CPUs in a farm. 
Instead of a possible crash of OpenJDK it will now refuse to run: >> >>> Error occurred during initialization of VM >>> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi >> >> (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. >> >> If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfo... > > Jan Kratochvil has updated the pull request incrementally with two additional commits since the last revision: > > - Fix hotspot 'ht' vs. glibc 'htt'. > - CPUFeatures refactorization. > Start CPU Features checking without libc. 
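The numbers in the quoted error message can be reproduced with plain bit arithmetic. The sketch below is only an illustration, assuming the feature sets are simple bitmasks as the message suggests; the class and method names are hypothetical, and the real check lives in HotSpot C++ code and may do more than this.

```java
// Hypothetical illustration of the bitmask arithmetic behind the error message.
final class CpuFeatureMaskSketch {
    /** Features required by the checkpoint image but absent on this CPU. */
    static long missing(long imageFeatures, long cpuFeatures) {
        return imageFeatures & ~cpuFeatures;
    }

    /** Lowest common denominator of a farm: bits set on every machine. */
    static long commonDenominator(long... machines) {
        long common = -1L; // start with all bits set
        for (long features : machines) {
            common &= features;
        }
        return common;
    }

    public static void main(String[] args) {
        long image = 0x7fff9dfcfbf7L; // features recorded in the checkpoint
        long cpu = 0x421801fcfbd7L;   // features of the restoring machine
        System.out.println("missing = 0x" + Long.toHexString(missing(image, cpu)));
        System.out.println("-XX:CPUFeatures=0x"
                + Long.toHexString(commonDenominator(image, cpu)));
    }
}
```

For the two values from the message, `missing` yields 0x3de79c000020 and the common denominator is 0x421801fcfbd7, which is the value the error message asks to pass with `-XX:CRaCCheckpointTo`.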
I meet with the same error as reported by GHA https://github.com/jankratochvil/crac/actions/runs/5200182180/jobs/9378587527 Replacing vm_exit_during_initialization() with fatal() at least provides the hs_err and the stack trace. Stack: [0x00007f80d6900000,0x00007f80d6a00000], sp=0x00007f80d69fd760, free space=1013k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x191bad4] VM_Version::glibc_not_using(unsigned long, unsigned long)+0x414 V [libjvm.so+0x191d3a2] VM_Version::CPUFeatures_init()+0x1d2 V [libjvm.so+0x191d67f] VM_Version::get_processor_features()+0xef V [libjvm.so+0x1921f16] VM_Version::initialize_features()+0x546 V [libjvm.so+0x1922359] VM_Version::initialize()+0x9 V [libjvm.so+0x1916756] VM_Version_init()+0x26 V [libjvm.so+0xd30b4c] init_globals()+0x2c V [libjvm.so+0x18110e6] Threads::create_vm(JavaVMInitArgs*, bool*)+0x326 V [libjvm.so+0xeb86d9] JNI_CreateJavaVM+0x69 C [libjli.so+0x3da4] JavaMain+0x94 C [libjli.so+0x7719] ThreadJavaMain+0x9 anton at mercury:~/proj/crac$ git diff diff --git a/src/hotspot/cpu/x86/vm_version_x86.cpp b/src/hotspot/cpu/x86/vm_version_x86.cpp index a6a04458319..58249612750 100644 --- a/src/hotspot/cpu/x86/vm_version_x86.cpp +++ b/src/hotspot/cpu/x86/vm_version_x86.cpp @@ -1225,10 +1225,9 @@ void VM_Version::glibc_not_using(uint64_t excessive_CPU, uint64_t excessive_GLIB #ifdef ASSERT #define CHECK_KIND(kind) do { \ if (PASTE_TOKENS(disable_handled_, kind) != PASTE_TOKENS(excessive_handled_, kind)) { \ - jio_snprintf(errbuf, sizeof(errbuf), \ + fatal( \ "internal error: Unsupported disabling of " STR(kind) "_* 0x%" PRIx64 " != used 0x%" PRIx64, \ PASTE_TOKENS(disable_handled_, kind), PASTE_TOKENS(excessive_handled_, kind)); \ - vm_exit_during_initialization(errbuf); \ } \ } while (0) CHECK_KIND(CPU ); I'm not 100% sure fatal() is correct in that state, so I propose a vararg macro/function that expands to fatal(...), which can easily be replaced with something different. 
------------- PR Review: https://git.openjdk.org/crac/pull/41#pullrequestreview-1469711629 From akozlov at openjdk.org Thu Jun 8 16:19:21 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 8 Jun 2023 16:19:21 GMT Subject: [crac] RFR: Make CheckpointException/RestoreException aggregate-only [v8] In-Reply-To: References: Message-ID: On Thu, 8 Jun 2023 11:50:31 GMT, Radim Vansa wrote: >> CheckpointException and RestoreException (both jdk.crac and javax.crac) cannot carry any message, cause and don't collect stack trace. The sole purpose of these is to aggregate other exceptions in suppressed exceptions. > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Update javadoc, make final Thank you! LGTM ------------- Marked as reviewed by akozlov (Lead). PR Review: https://git.openjdk.org/crac/pull/64#pullrequestreview-1470338238 From akozlov at openjdk.org Thu Jun 8 16:51:18 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 8 Jun 2023 16:51:18 GMT Subject: [crac] RFR: Draft: Move more FD tracking to java layer In-Reply-To: <5S1ePK22x8ZfSuDvQHrh1vJkbSlBj3-KNGmNEmGVMIo=.3521df6d-fe72-48f5-bd23-552f6ab87561@github.com> References: <5S1ePK22x8ZfSuDvQHrh1vJkbSlBj3-KNGmNEmGVMIo=.3521df6d-fe72-48f5-bd23-552f6ab87561@github.com> Message-ID: On Wed, 7 Jun 2023 12:34:31 GMT, Radim Vansa wrote: >> The PR develops the idea of file descriptors tracking in Java started in #43. In general, that PR had two purposes. First, it provides CheckpointExceptions in terms that are clear for Java developers, improving the experience of developing for CRac. So if a FileDescriptor causes an exception, it's possible to look at the heap dump and find references to the offending FD, or to look at the stack trace when FD was created. And second, Java FD tracking is independent of the platform, so that was the first step to bring CRaC to non-Linux platforms, but that is a bit longer road. 
>> >> We can eliminate manual heap inspection, and this is proposed in this PR. A FileDescriptor does not exist on its own but it is owned by some higher-level Java object implementation. So an object can "claim" a FileDescriptor and define how and if to report the FD to the user. E.g. Socket can describe the its port and address without deep inspection of the process internals. Turns out, Socket.toString() provides enough information (but the reporting can be extended later if required). >> >> >> Suppressed: jdk.crac.impl.CheckpointOpenSocketException: Socket[addr=localhost/127.0.0.1,port=39957,localport=41464] >> at java.base/java.net.SocketImpl$SocketResource.lambda$beforeCheckpoint$0(SocketImpl.java:123) >> at java.base/jdk.crac.Core.lambda$checkpointRestore1$0(Core.java:128) >> ... 7 more >> >> >> A FileDescriptor is claiming itself in case there is a bug in JDK that no higher-level object is claiming the FD. FD provides just a very short description just for debugging. With stack trace to FD (which is a very nice debugging aid!), that should be enough to find the containing object and implement claiming. >> >> I believe this overlaps with #69, which at first glance would benefit a lot from being able to define policies in the domain objects. I'll comment on this after a closer look at the other PR. > > test/jdk/jdk/crac/fileDescriptors/OpenFileDetectionTest.java line 51: > >> 49: @Override >> 50: public void exec() throws Exception { >> 51: try (var file = new RandomAccessFile("/etc/passwd", "r")) { > > What would happen with the FileReader? It would be better to not remove test for particular case, but add one for RandomAccessFile. FileReader was never claiming own file descriptor, it was a mistake in the original version of the test. Now test matches RandomAccessFile, that does claiming. 
------------- PR Review Comment: https://git.openjdk.org/crac/pull/79#discussion_r1223312810 From akozlov at openjdk.org Thu Jun 8 17:30:23 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 8 Jun 2023 17:30:23 GMT Subject: [crac] RFR: Draft: Move more FD tracking to java layer In-Reply-To: References: Message-ID: On Wed, 7 Jun 2023 10:51:41 GMT, Anton Kozlov wrote: > The PR develops the idea of file descriptors tracking in Java started in #43. In general, that PR had two purposes. First, it provides CheckpointExceptions in terms that are clear for Java developers, improving the experience of developing for CRac. So if a FileDescriptor causes an exception, it's possible to look at the heap dump and find references to the offending FD, or to look at the stack trace when FD was created. And second, Java FD tracking is independent of the platform, so that was the first step to bring CRaC to non-Linux platforms, but that is a bit longer road. > > We can eliminate manual heap inspection, and this is proposed in this PR. A FileDescriptor does not exist on its own but it is owned by some higher-level Java object implementation. So an object can "claim" a FileDescriptor and define how and if to report the FD to the user. E.g. Socket can describe the its port and address without deep inspection of the process internals. Turns out, Socket.toString() provides enough information (but the reporting can be extended later if required). > > > Suppressed: jdk.crac.impl.CheckpointOpenSocketException: Socket[addr=localhost/127.0.0.1,port=39957,localport=41464] > at java.base/java.net.SocketImpl$SocketResource.lambda$beforeCheckpoint$0(SocketImpl.java:123) > at java.base/jdk.crac.Core.lambda$checkpointRestore1$0(Core.java:128) > ... 7 more > > > A FileDescriptor is claiming itself in case there is a bug in JDK that no higher-level object is claiming the FD. FD provides just a very short description just for debugging. 
With stack trace to FD (which is a very nice debugging aid!), that should be enough to find the containing object and implement claiming. > > I believe this overlaps with #69, which at first glance would benefit a lot from being able to define policies in the domain objects. I'll comment on this after a closer look at the other PR. The motivation for the PR is given in its description. You're right that the change is mostly a refactoring. To repeat, it continues the work toward non-Linux CRaC implementations with semantics shared between them, and it is also a foundation for optional checkpoint-restore resource policies implemented in the Socket, File, ... classes. A higher-level object like Socket or File usually has more information about the FD and its uses, so it can provide more information or implement a wider set of policies. The chain of ownership will become a real problem once every other layer has more information that we want to report. In your example, FileWriter will have to claim the FD if it has more information about it. But this is not immediately clear. Stack traces are a great addition to the debug workflow; they describe the state at the moment of creation. A heap dump, in contrast, describes the state at the time of the checkpoint exception. They cover different aspects, and one or the other may suit different conditions better. ------------- PR Comment: https://git.openjdk.org/crac/pull/79#issuecomment-1583064896 From akozlov at openjdk.org Thu Jun 8 17:44:14 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 8 Jun 2023 17:44:14 GMT Subject: [crac] RFR: Linux file system watcher support [v4] In-Reply-To: <3Jl4KmUfedZAAzBzRZQZu1hJpBpdwvzilrUpKX4KcZo=.3c77ae78-59ee-481c-a0ff-41e03c32270a@github.com> References: <3Jl4KmUfedZAAzBzRZQZu1hJpBpdwvzilrUpKX4KcZo=.3c77ae78-59ee-481c-a0ff-41e03c32270a@github.com> Message-ID: On Thu, 25 May 2023 09:26:01 GMT, joeylee wrote: >> inotify monitors changes on filesystem, this support automatic restore for LinuxFileWatcher.
>> >> FileWatcherAfterRestoreTest verifies watcher service works after restore. >> FileWatcherTest verifies automatic closing of the inotify fd >> >> The watcher keys are still managed by user, so exception will be thrown if no watcher keys are leaked, as in FileWatcherWithOpenKeysTest > > joeylee has updated the pull request incrementally with one additional commit since the last revision: > > update LGTM, thank you! ------------- Marked as reviewed by akozlov (Lead). PR Review: https://git.openjdk.org/crac/pull/72#pullrequestreview-1470473093 From akozlov at openjdk.org Thu Jun 8 18:29:06 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 8 Jun 2023 18:29:06 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v3] In-Reply-To: References: Message-ID: <6x2UTOOK7xjTnx4D8z9h-G1T1Dizs_RGlhoy7YVIGtk=.c23cdbfc-41b6-4fbd-9055-7f107a664317@github.com> On Fri, 12 May 2023 13:29:08 GMT, Radim Vansa wrote: >> When the application does not close some file descriptors through Resources we can use `jdk.crac.fd-policy.checkpoint` and `jdk.crac.fd-policy.restore` to configure the behaviour.
>> >> These properties can specify a list of File.pathSeparator-separated key=value pairs, where the key can be one of: >> >> * numeric file descriptor >> * path using 'glob' pattern matching (see FileSystem.getPathMatcher() for details) >> * keywords FIFO and SOCKET that match pipes and sockets >> >> The value should match one of possible values from OpenFDPolicies.BeforeCheckpoint and OpenFDPolicies.AfterRestore > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Effectively revert previous commit: Initialize logger in src/java.base/share/classes/java/io/FileDescriptor.java line 67: > 65: private String originalPath; > 66: private int originalFlags; > 67: private long originalOffset; Regarding the design, path and offset are features of a file descriptor referring to the filesystem, but they do not make sense for a socket FD. src/java.base/share/classes/java/io/FileDescriptor.java line 456: > 454: // this is used probably as the file moved on the filesystem but the contents > 455: // are the same. > 456: if (!reopen(resource.originalFd, path, resource.originalFlags, resource.originalOffset)) { Another option is a log file in the append mode, which may grow larger -- in that case we'd like to have that in append mode with position at the end. Probably that is handled by setting proper flags, but at least this would contradict the comment. src/java.base/share/classes/jdk/crac/impl/OpenFDPolicies.java line 23: > 21: public class OpenFDPolicies { > 22: public static final String CHECKPOINT_PROPERTY = "jdk.crac.fd-policy.checkpoint"; > 23: public static final String RESTORE_PROPERTY = "jdk.crac.fd-policy.restore"; Having separate policies for checkpoint and restore enables some weird configurations, e.g. checkpoint specifying CLOSE and restore specifying REOPEN. It would be better to have a combined, consistent checkpoint-restore policy that specifies both parts at once. src/java.base/share/classes/jdk/crac/impl/OpenFDPolicies.java line 228: > 226: return fifoPolicy; > 227: } else if (type.equals("socket")) { > 228: return socketPolicy; Obviously, only a single policy is possible for all sockets. In general sockets are much harder than files, and we can handle them automatically in only a few cases. I think some connection-less sockets _may_ be covered, and some listening sockets. If we continue this implementation, FileDescriptor (a rather simple object initially) will grow larger and larger, knowing about all possible uses in the JDK. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/69#discussion_r1223386909 PR Review Comment: https://git.openjdk.org/crac/pull/69#discussion_r1223382818 PR Review Comment: https://git.openjdk.org/crac/pull/69#discussion_r1223394991 PR Review Comment: https://git.openjdk.org/crac/pull/69#discussion_r1223400286 From akozlov at openjdk.org Thu Jun 8 18:47:13 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 8 Jun 2023 18:47:13 GMT Subject: [crac] RFR: Support passing extra options in CREngine In-Reply-To: References: Message-ID: On Wed, 10 May 2023 08:58:54 GMT, Radim Vansa wrote: > In addition to `-XX:CREngine=program` this adds support to `-XX:CREngine=program,key=value,anotherkey` that translates into invoking `program --key value --anotherkey`. > > This generic parameters support is utilized in `criuengine` that accepts `--verbosity` and `--log-file` options and relays them to `criu`.
src/hotspot/share/runtime/globals.hpp line 2100: > 2098: "as a comma-separated list of key[=value] pairs; " \ > 2099: "-XX:CREngine=program,key=value,anotherkey results in calling " \ > 2100: "'program --key value --anotherkey'") \ There is CRAC_CRIU_OPTS that is used to pass additional options to CRIU. I see that remains in place, and this turns out to be alternative? Not every CREngine may want to follow this convention of argument passing. Usually, `program,--key,value,--anotherkey` is specified to call `program --key value --anotherkey`. This probably needs to be RESTORE_SETTABLE. src/java.base/unix/native/criuengine/criuengine.c line 105: > 103: "-D", imagedir, > 104: "--shell-job", > 105: "-v4", "-o", "dump4.log", // -D without -W makes criu cd to image dir for logs The dump4.log is a great debugging aid for checkpoint and restore failures, please keep it. I think a part of the reason for this change is to provide some logging to the console. I think the reporting should be implemented separately https://github.com/openjdk/crac/pull/64#discussion_r1221827050 ------------- PR Review Comment: https://git.openjdk.org/crac/pull/63#discussion_r1223418528 PR Review Comment: https://git.openjdk.org/crac/pull/63#discussion_r1223421815 From duke at openjdk.org Fri Jun 9 08:53:09 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 9 Jun 2023 08:53:09 GMT Subject: [crac] RFR: Support passing extra options in CREngine In-Reply-To: References: Message-ID: On Thu, 8 Jun 2023 18:42:13 GMT, Anton Kozlov wrote: >> In addition to `-XX:CREngine=program` this adds support to `-XX:CREngine=program,key=value,anotherkey` that translates into invoking `program --key value --anotherkey`. >> >> This generic parameters support is utilized in `criuengine` that accepts `--verbosity` and `--log-file` options and relays them to `criu`. 
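The translation described in the PR text (`program,key=value,anotherkey` becoming `program --key value --anotherkey`) can be sketched as follows. This is only an illustration in Java: the actual parsing is done inside the JVM, and the class name, the method name and the handling of empty segments are assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the key[=value] to argv translation.
final class CrEngineOptionsSketch {
    /** Turns "program,key=value,flag" into [program, --key, value, --flag]. */
    static List<String> toArgv(String spec) {
        String[] parts = spec.split(",");
        List<String> argv = new ArrayList<>();
        argv.add(parts[0]); // the engine program itself
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i];
            if (part.isEmpty()) {
                continue; // assumption: empty segments are ignored
            }
            int eq = part.indexOf('=');
            if (eq < 0) {
                argv.add("--" + part); // bare key becomes a flag
            } else {
                argv.add("--" + part.substring(0, eq));
                argv.add(part.substring(eq + 1)); // value may itself contain '='
            }
        }
        return argv;
    }

    public static void main(String[] args) {
        System.out.println(toArgv("program,key=value,anotherkey"));
    }
}
```

The alternative raised in the review, passing `program,--key,value,--anotherkey` verbatim, would reduce this to a plain split on commas, at the cost of a less readable VM option.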
> > src/java.base/unix/native/criuengine/criuengine.c line 105: > >> 103: "-D", imagedir, >> 104: "--shell-job", >> 105: "-v4", "-o", "dump4.log", // -D without -W makes criu cd to image dir for logs > > The dump4.log is a great debugging aid for checkpoint and restore failures, please keep it. > > I think a part of the reason for this change is to provide some logging to the console. I think the reporting should be implemented separately https://github.com/openjdk/crac/pull/64#discussion_r1221827050 By default the file is still used, see lines 113 - 116. The args just support overriding those. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/63#discussion_r1224029871 From duke at openjdk.org Fri Jun 9 09:11:12 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 9 Jun 2023 09:11:12 GMT Subject: [crac] RFR: Support passing extra options in CREngine In-Reply-To: References: Message-ID: On Thu, 8 Jun 2023 18:38:10 GMT, Anton Kozlov wrote: >> In addition to `-XX:CREngine=program` this adds support to `-XX:CREngine=program,key=value,anotherkey` that translates into invoking `program --key value --anotherkey`. >> >> This generic parameters support is utilized in `criuengine` that accepts `--verbosity` and `--log-file` options and relays them to `criu`. > > src/hotspot/share/runtime/globals.hpp line 2100: > >> 2098: "as a comma-separated list of key[=value] pairs; " \ >> 2099: "-XX:CREngine=program,key=value,anotherkey results in calling " \ >> 2100: "'program --key value --anotherkey'") \ > > There is CRAC_CRIU_OPTS that is used to pass additional options to CRIU. I see that remains in place, and this turns out to be alternative? > > Not every CREngine may want to follow this convention of argument passing. Usually, `program,--key,value,--anotherkey` is specified to call `program --key value --anotherkey`. > > This probably needs to be RESTORE_SETTABLE. Ouch, I admit that I've missed `CRAC_CRIU_OPTS` (haven't seen it documented anywhere). 
I would keep it for backwards compatibility, but having options that are explicitly parsed and handed over by criuengine seems more reliable. As in the case of the verbosity & log file: I wanted to override the default `-v1`. > Not every CREngine may want to follow this convention of argument passing. Usually, program,--key,value,--anotherkey is specified to call program --key value --anotherkey. I have selected the `key=value` syntax rather than simply concatenating args as that is used e.g. by FlightRecorder, and IMO it looks better when passing as VM options. Yes, it might be hypocrisy since this is just passing those arguments. We should also consider the dynamic library CREngine in #78 - in the draft I am passing the arguments parsed (and turned into `--key` form), but there might be better ways. I don't think it's really up to CREngine to decide, JVM is ruling the API and CREngine is here to adapt JVM expectations to the implementation API. CREngine has a second-class role here. > This probably needs to be RESTORE_SETTABLE. Yes, this predates the flag, I haven't merged last changes until first review. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/63#discussion_r1224048712 From duke at openjdk.org Fri Jun 9 09:30:12 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 9 Jun 2023 09:30:12 GMT Subject: [crac] RFR: Draft: Move more FD tracking to java layer In-Reply-To: References: <5S1ePK22x8ZfSuDvQHrh1vJkbSlBj3-KNGmNEmGVMIo=.3521df6d-fe72-48f5-bd23-552f6ab87561@github.com> Message-ID: On Thu, 8 Jun 2023 16:48:39 GMT, Anton Kozlov wrote: >> test/jdk/jdk/crac/fileDescriptors/OpenFileDetectionTest.java line 51: >> >>> 49: @Override >>> 50: public void exec() throws Exception { >>> 51: try (var file = new RandomAccessFile("/etc/passwd", "r")) { >> >> What would happen with the FileReader? It would be better to not remove test for particular case, but add one for RandomAccessFile.
> > FileReader was never claiming its own file descriptor, it was a mistake in the original version of the test. Now the test matches RandomAccessFile, which does the claiming. That was not a mistake; the point of the test was to show that the user *gets informed* that it was the `/etc/passwd` file that the FD was pointing to. The point of claiming, as I understand it, is to provide additional info. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/79#discussion_r1224074917 From duke at openjdk.org Fri Jun 9 09:42:14 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 9 Jun 2023 09:42:14 GMT Subject: [crac] RFR: Draft: Move more FD tracking to java layer In-Reply-To: References: Message-ID: On Thu, 8 Jun 2023 17:27:24 GMT, Anton Kozlov wrote: > Motivation for the PR is described in the description. You're right that the change is mostly a refactoring. To repeat, this continues non-Linux CRaC implementations with the semantic shared between them, and this is also a foundation for optional checkpoint-restore resource policies implemented in the Socket, File, ... classes. So, this is to support implementation on other platforms where you can't read what the FD/handle is pointing to? What platform is lacking this info? > A higher level object like Socket or File usually has more information regarding FD and the uses, thus it can provide more information or implement wider set of policies. With Socket, you can get the IPs and ports. JVM is mostly providing API for accessing OS-level info. The same for files. But you can get that info quite easily on the FD level without a hierarchical claiming. > The chain of ownership will become a real problem once on every other layer we have more information that we want to report. In your example, FileWriter will have to claim FD if it has more information about that. But this is not immediately clear.
Maybe if the API allowed claiming ownership of any object, rather than just FD (that would be propagated), and you'd provide information from all chain elements (rather than just the topmost). > Stack traces are a great addition to the debug workflow, they describe the state at the moment of creation. But heap dump describes the state at the checkpoint exception. They provide different aspects, one or another may suit better different conditions. OK, I won't argue on the utility, there might be cases. But can you really present a real-world case where the changes that you're introducing become useful? ------------- PR Comment: https://git.openjdk.org/crac/pull/79#issuecomment-1584281791 From duke at openjdk.org Fri Jun 9 09:59:15 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 9 Jun 2023 09:59:15 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v3] In-Reply-To: <6x2UTOOK7xjTnx4D8z9h-G1T1Dizs_RGlhoy7YVIGtk=.c23cdbfc-41b6-4fbd-9055-7f107a664317@github.com> References: <6x2UTOOK7xjTnx4D8z9h-G1T1Dizs_RGlhoy7YVIGtk=.c23cdbfc-41b6-4fbd-9055-7f107a664317@github.com> Message-ID: On Thu, 8 Jun 2023 18:05:01 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Effectively revert previous commit: Initialize logger in > > src/java.base/share/classes/java/io/FileDescriptor.java line 67: > >> 65: private String originalPath; >> 66: private int originalFlags; >> 67: private long originalOffset; > > Regarding the design, path and offset are features of a file descriptor referring to the filesystem, but they do not make sense to socket FD. Yes, you could add special classes for Sockets vs. Files. It's easier, though, to treat this in a single POJO and keep the path `null` if it does not make sense in this case.
> src/java.base/share/classes/java/io/FileDescriptor.java line 456: > >> 454: // this is used probably as the file moved on the filesystem but the contents >> 455: // are the same. >> 456: if (!reopen(resource.originalFd, path, resource.originalFlags, resource.originalOffset)) { > > Another option is a log file in the append mode, which may grow larger -- in that case we'd like to have that in append mode with position at the end. Probably that is handled by setting proper flags, but at least this would contradict with the comment. This handling is meant for read-only access, too. I don't think that the concept of 'appending' makes sense from CRaC POV - the append flag just says whether you've started at the beginning or at the end; but this aims at transparent continuation of what you were doing as if no checkpoint happened. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/69#discussion_r1224108861 PR Review Comment: https://git.openjdk.org/crac/pull/69#discussion_r1224107338 From duke at openjdk.org Fri Jun 9 10:04:09 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 9 Jun 2023 10:04:09 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v3] In-Reply-To: <6x2UTOOK7xjTnx4D8z9h-G1T1Dizs_RGlhoy7YVIGtk=.c23cdbfc-41b6-4fbd-9055-7f107a664317@github.com> References: <6x2UTOOK7xjTnx4D8z9h-G1T1Dizs_RGlhoy7YVIGtk=.c23cdbfc-41b6-4fbd-9055-7f107a664317@github.com> Message-ID: On Thu, 8 Jun 2023 18:12:15 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Effectively revert previous commit: Initialize logger in > > src/java.base/share/classes/jdk/crac/impl/OpenFDPolicies.java line 23: > >> 21: public class OpenFDPolicies

{ >> 22: public static final String CHECKPOINT_PROPERTY = "jdk.crac.fd-policy.checkpoint"; >> 23: public static final String RESTORE_PROPERTY = "jdk.crac.fd-policy.restore"; > > Having separate policies for checkpoint and restore enables some weird configurations, e.g. when checkpoint specifies CLOSE and restore specifies REOPEN. It would be better to have a combined, consistent checkpoint-restore policy that specifies both parts at once. I've started with a single policy enum but it turned out the inlined cross product of behaviours was rather long and repetitive. In fact CLOSE + REOPEN is the combination that makes perfect sense. You close the file on checkpoint (rather than error out and fail the checkpoint) and then you reopen the same file (rather than opening something else). ------------- PR Review Comment: https://git.openjdk.org/crac/pull/69#discussion_r1224113433 From duke at openjdk.org Fri Jun 9 10:19:17 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 9 Jun 2023 10:19:17 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v3] In-Reply-To: References: <6x2UTOOK7xjTnx4D8z9h-G1T1Dizs_RGlhoy7YVIGtk=.c23cdbfc-41b6-4fbd-9055-7f107a664317@github.com> Message-ID: On Fri, 9 Jun 2023 10:01:27 GMT, Radim Vansa wrote: >> src/java.base/share/classes/jdk/crac/impl/OpenFDPolicies.java line 23: >> >>> 21: public class OpenFDPolicies

{ >>> 22: public static final String CHECKPOINT_PROPERTY = "jdk.crac.fd-policy.checkpoint"; >>> 23: public static final String RESTORE_PROPERTY = "jdk.crac.fd-policy.restore"; >> >> Having separate policies for checkpoint and restore enables some weird configurations, e.g. when checkpoint specifies CLOSE and restore specifies REOPEN. It would be better to have a combined, consistent checkpoint-restore policy that specifies both parts at once. > > I've started with a single policy enum but it turned out the inlined cross product of behaviours was rather long and repetitive. > In fact CLOSE + REOPEN is the combination that makes perfect sense. You close the file on checkpoint (rather than error out and fail the checkpoint) and then you reopen the same file (rather than opening something else). And note that the restore configuration should be set on restore, because in case of opening other files you don't know where these are in the deployment (restore) environment during checkpoint. Had you specified the policy together you would be overriding the behaviour on checkpoint which was already executed, which is even more confusing.
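The two-phase argument above (one decision at checkpoint, a separate decision at restore, when the deployment paths are finally known) might be sketched like this. All names here are hypothetical: the enums `OnCheckpoint` and `OnRestore`, the class `FdPolicySketch` and both methods are illustrative only and are not the API of `OpenFDPolicies`; only the policy names CLOSE, REOPEN, ERROR and OPEN_OTHER are taken from the thread.

```java
// Hypothetical sketch of separate checkpoint-side and restore-side FD policies.
final class FdPolicySketch {
    enum OnCheckpoint { ERROR, CLOSE }      // fail the checkpoint, or close the FD
    enum OnRestore { REOPEN, OPEN_OTHER }   // reopen the same file, or a replacement

    /** Checkpoint phase: true if the checkpoint may proceed with the FD closed. */
    static boolean beforeCheckpoint(OnCheckpoint policy) {
        return policy == OnCheckpoint.CLOSE;
    }

    /**
     * Restore phase: picks the path to open. The replacement path is assumed to be
     * known only in the restore environment, so it is passed in at this point.
     */
    static String afterRestore(OnRestore policy, String originalPath, String otherPath) {
        switch (policy) {
            case REOPEN:
                return originalPath;
            case OPEN_OTHER:
                if (otherPath == null) {
                    throw new IllegalStateException("OPEN_OTHER needs a replacement path");
                }
                return otherPath;
            default:
                throw new AssertionError(policy);
        }
    }
}
```

With this split, CLOSE followed by REOPEN is simply `beforeCheckpoint(CLOSE)` at checkpoint time and `afterRestore(REOPEN, path, null)` at restore time; a combined policy, as proposed in the review, would instead have to validate the pair up front.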
------------- PR Review Comment: https://git.openjdk.org/crac/pull/69#discussion_r1224126609 From duke at openjdk.org Fri Jun 9 10:19:17 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 9 Jun 2023 10:19:17 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v3] In-Reply-To: <6x2UTOOK7xjTnx4D8z9h-G1T1Dizs_RGlhoy7YVIGtk=.c23cdbfc-41b6-4fbd-9055-7f107a664317@github.com> References: <6x2UTOOK7xjTnx4D8z9h-G1T1Dizs_RGlhoy7YVIGtk=.c23cdbfc-41b6-4fbd-9055-7f107a664317@github.com> Message-ID: On Thu, 8 Jun 2023 18:17:15 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Effectively revert previous commit: Initialize logger in > > src/java.base/share/classes/jdk/crac/impl/OpenFDPolicies.java line 228: > >> 226: return fifoPolicy; >> 227: } else if (type.equals("socket")) { >> 228: return socketPolicy; > > Obviously, only a single policy is possible for all sockets. In general, sockets are much harder than files, and there are not many cases where we can handle them automatically. I think some connection-less sockets _may_ be covered, and some listening sockets. > > If we continue this implementation, FileDescriptor (a rather simple object initially) will grow larger and larger, knowing about all possible uses in the JDK. Someone already asked for automatic reopening of listening sockets. The policy could then e.g. filter specific ports (similar to files). Yes, you can't do anything but KEEP_CLOSED for sockets tied to a connection. I assume that the restore policy will be set during restore, rather than during checkpoint, so there's no chance to handle a nonsensical configuration during checkpoint (because the configuration is not present yet). Growing FileDescriptor has a point, any complex implementation for sockets should be hosted next to other Socket related stuff. For files, I don't know if there's a better place, but we could refactor it to a util class.
------------- PR Review Comment: https://git.openjdk.org/crac/pull/69#discussion_r1224123951 From akozlov at openjdk.org Fri Jun 9 12:05:13 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 9 Jun 2023 12:05:13 GMT Subject: [crac] RFR: Linux file system watcher support [v4] In-Reply-To: <3Jl4KmUfedZAAzBzRZQZu1hJpBpdwvzilrUpKX4KcZo=.3c77ae78-59ee-481c-a0ff-41e03c32270a@github.com> References: <3Jl4KmUfedZAAzBzRZQZu1hJpBpdwvzilrUpKX4KcZo=.3c77ae78-59ee-481c-a0ff-41e03c32270a@github.com> Message-ID: On Thu, 25 May 2023 09:26:01 GMT, joeylee wrote: >> inotify monitors changes on the filesystem; this supports automatic restore for LinuxFileWatcher. >> >> FileWatcherAfterRestoreTest verifies watcher service works after restore. >> FileWatcherTest verifies automatic closing of the inotify fd >> >> The watcher keys are still managed by user, so exception will be thrown if no watcher keys are leaked, as in FileWatcherWithOpenKeysTest > > joeylee has updated the pull request incrementally with one additional commit since the last revision: > > update Changes requested by akozlov (Lead). src/java.base/linux/classes/sun/nio/fs/LinuxWatchService.java line 219: > 217: CheckpointRestoreState thisState; > 218: if (wdToKey.size() == 0) { > 219: unsafe.freeMemory(address); Although wait, where is the address reinitialized? It is created in the constructor, and initFDs() reinitializes only file descriptors. This is very likely to be the reason behind those crashes, I can reproduce one locally. I don't see any reason to free the memory, it can perfectly well be saved to the image. After removing this line, the test has passed 10 of 10 local runs.
------------- PR Review: https://git.openjdk.org/crac/pull/72#pullrequestreview-1471837213 PR Review Comment: https://git.openjdk.org/crac/pull/72#discussion_r1224214828 From duke at openjdk.org Fri Jun 9 13:02:14 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 9 Jun 2023 13:02:14 GMT Subject: [crac] Integrated: Make CheckpointException/RestoreException aggregate-only In-Reply-To: References: Message-ID: On Wed, 10 May 2023 12:20:07 GMT, Radim Vansa wrote: > CheckpointException and RestoreException (both jdk.crac and javax.crac) cannot carry any message, cause and don't collect stack trace. The sole purpose of these is to aggregate other exceptions in suppressed exceptions. This pull request has now been integrated. Changeset: a282698d Author: Radim Vansa Committer: Anton Kozlov URL: https://git.openjdk.org/crac/commit/a282698d2bf01588172e8f54c4cfedf40f203a68 Stats: 85 lines in 7 files changed: 15 ins; 43 del; 27 mod Make CheckpointException/RestoreException aggregate-only Reviewed-by: akozlov ------------- PR: https://git.openjdk.org/crac/pull/64 From akozlov at openjdk.org Fri Jun 9 13:51:10 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 9 Jun 2023 13:51:10 GMT Subject: [crac] RFR: Make CheckpointException/RestoreException aggregate-only [v8] In-Reply-To: References: Message-ID: On Thu, 8 Jun 2023 11:50:31 GMT, Radim Vansa wrote: >> CheckpointException and RestoreException (both jdk.crac and javax.crac) cannot carry any message, cause and don't collect stack trace. The sole purpose of these is to aggregate other exceptions in suppressed exceptions. > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Update javadoc, make final src/java.base/share/classes/jdk/crac/Core.java line 89: > 87: if (codes.length == 0) { > 88: exception.addSuppressed(new RuntimeException("Native checkpoint failed.")); > 89: } Turns out this fires on checkpoint dry runs. 
Suppose a Resource has thrown an exception, we go to JVM just to check no native FD is open. When everything is OK on native side (codes and messages are empty), this exception is thrown, although no native checkpoint is attempted. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/64#discussion_r1224336281 From akozlov at openjdk.org Fri Jun 9 15:20:13 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 9 Jun 2023 15:20:13 GMT Subject: [crac] RFR: Draft: Move more FD tracking to java layer In-Reply-To: References: Message-ID: <9PncKDptvBL3sNRSBPBIVt1udiTwQ8hLHlQ0cPKTyIU=.c4b0b877-885b-4339-bc8c-8017e0e62f05@github.com> On Fri, 9 Jun 2023 09:39:31 GMT, Radim Vansa wrote: > So, this is to support implementation on other platforms where you can't read what the FD/handle is pointing to? What platform is lacking this info? I'm really not sure about Windows. With this, Socket details will be reported on Windows without additional code for reporting. > With Socket, you can get the IPs and ports. JVM is mostly providing API for accessing OS-level info. The same for files. But you can get that info quite easily on the FD level without a hierarchical claiming. That requires additional code in the FD that needs to know all possible file descriptor types on the system, and new interfaces to extract information from them, like you had to do with Sockets. > Maybe if the API allowed claiming ownership of any object, rather than just FD (that would be propagated), and you'd provide information from all chain elements (rather than just the topmost). Yes, I also started thinking about something like this. That may work. We'll have a chance to polish the ClaimFd interface with concrete examples. > But can you really present a real-world case where the changes that you're introducing become useful? Since when did this become a prerequisite? :) Refactorings and non-functional changes are fine for the CRaC Project.
Before the patch, claimFdWeak and claimFd were proto versions of this claiming. And you can see, they required "external" ordering between multiple claimers, thus we had PRE_FILE_DESCRIPTOR and NATIVE_PRNG priorities before FILE_DESCRIPTOR, to override FileDescriptors throwing an exception. That did not scale well. ------------- PR Comment: https://git.openjdk.org/crac/pull/79#issuecomment-1584751820 From akozlov at openjdk.org Fri Jun 9 15:25:16 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 9 Jun 2023 15:25:16 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v3] In-Reply-To: References: <6x2UTOOK7xjTnx4D8z9h-G1T1Dizs_RGlhoy7YVIGtk=.c23cdbfc-41b6-4fbd-9055-7f107a664317@github.com> Message-ID: On Fri, 9 Jun 2023 09:55:06 GMT, Radim Vansa wrote: >> src/java.base/share/classes/java/io/FileDescriptor.java line 456: >> >>> 454: // this is used probably as the file moved on the filesystem but the contents >>> 455: // are the same. >>> 456: if (!reopen(resource.originalFd, path, resource.originalFlags, resource.originalOffset)) { >> >> Another option is a log file in the append mode, which may grow larger -- in that case we'd like to have that in append mode with position at the end. Probably that is handled by setting proper flags, but at least this would contradict the comment. > > This handling is meant for read-only access, too. > I don't think that the concept of 'appending' makes sense from CRaC POV - the append flag just says whether you've started at the beginning or at the end; but this aims at transparent continuation of what you were doing as if no checkpoint happened. I don't completely follow. Consider an example of a log file in append write mode (every write is guaranteed to append to the file, even with concurrent writes). As for me, it's perfectly fine to restore with that log, even with a different offset, since the offset does not make a lot of sense in append mode.
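The append-mode behaviour referred to in the log-file example can be observed with plain NIO. The sketch below is not CRaC code and restores nothing; it only shows that a channel opened with `StandardOpenOption.APPEND` writes at the current end of the file even after another writer has grown it, which is why a saved offset carries little meaning for such a descriptor. The class name `AppendModeSketch` and the temporary file are illustrative.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative only: demonstrates append-mode writes, not checkpoint/restore.
final class AppendModeSketch {
    static String demo() {
        try {
            Path log = Files.createTempFile("append-sketch", ".log");
            try {
                try (FileChannel ch = FileChannel.open(log,
                        StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
                    ch.write(ByteBuffer.wrap("a".getBytes(StandardCharsets.US_ASCII)));
                    // A second, independent writer grows the file in the meantime.
                    Files.write(log, "b".getBytes(StandardCharsets.US_ASCII),
                            StandardOpenOption.APPEND);
                    // In append mode the channel first moves to the current end of
                    // the file, so this lands after the other writer's data.
                    ch.write(ByteBuffer.wrap("c".getBytes(StandardCharsets.US_ASCII)));
                }
                return new String(Files.readAllBytes(log), StandardCharsets.US_ASCII);
            } finally {
                Files.deleteIfExists(log);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

`AppendModeSketch.demo()` returns the file contents after the three writes. Whether a restore policy should honour the original flags or the original offset is exactly the question debated in the thread; the sketch takes no side on that.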
------------- PR Review Comment: https://git.openjdk.org/crac/pull/69#discussion_r1224448829 From duke at openjdk.org Fri Jun 9 15:31:14 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 9 Jun 2023 15:31:14 GMT Subject: [crac] RFR: Make CheckpointException/RestoreException aggregate-only [v8] In-Reply-To: References: Message-ID: On Fri, 9 Jun 2023 13:48:32 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Update javadoc, make final > > src/java.base/share/classes/jdk/crac/Core.java line 89: > >> 87: if (codes.length == 0) { >> 88: exception.addSuppressed(new RuntimeException("Native checkpoint failed.")); >> 89: } > > Turns out this fires on checkpoint dry runs. Suppose a Resource has thrown an exception, we go to JVM just to check no native FD is open. When everything is OK on native side (codes and messages are empty), this exception is thrown, although no native checkpoint is attempted. Shouldn't a dry-run return `JVM_CHECKPOINT_OK`? Which test is that, `DryRunTest` works for me? ------------- PR Review Comment: https://git.openjdk.org/crac/pull/64#discussion_r1224455064 From akozlov at openjdk.org Fri Jun 9 15:34:16 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 9 Jun 2023 15:34:16 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v3] In-Reply-To: References: <6x2UTOOK7xjTnx4D8z9h-G1T1Dizs_RGlhoy7YVIGtk=.c23cdbfc-41b6-4fbd-9055-7f107a664317@github.com> Message-ID: On Fri, 9 Jun 2023 10:16:02 GMT, Radim Vansa wrote: >> I've started with a single policy enum but it turned out the inlined cross product of behaviours was rather a long and repetitive. >> In fact CLOSE + REOPEN is the combination that makes a perfect sense. You close the file on checkpoint (rather than error out and fail the checkpoint) and then you reopen the same file (rather than opening something else). 
> > And note that the restore configuration should be set on restore, because in case of opening other files you don't know where these are in the deployment (restore) environment during checkpoint. Had you specified the policy together you would be overriding the behaviour on checkpoint which was already executed, which is even more confusing. > I've started with a single policy enum but it turned out the inlined cross product of behaviours was rather a long and repetitive. Do you have artifacts of that? Because it may mean that cross-product may have a few invalid / unsafe behaviors. It would be interesting to look at the list. > In fact CLOSE + REOPEN is the combination that makes a perfect sense. Indeed, bad example. What about ERROR+OPEN_OTHER. Or for the sockets, for the current implementation we should not allow OPEN_OTHER, as the implementation can only throw RestoreException in this case. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/69#discussion_r1224457914 From akozlov at openjdk.org Fri Jun 9 15:37:10 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 9 Jun 2023 15:37:10 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v3] In-Reply-To: References: <6x2UTOOK7xjTnx4D8z9h-G1T1Dizs_RGlhoy7YVIGtk=.c23cdbfc-41b6-4fbd-9055-7f107a664317@github.com> Message-ID: On Fri, 9 Jun 2023 15:31:20 GMT, Anton Kozlov wrote: >> And note that the restore configuration should be set on restore, because in case of opening other files you don't know where these are in the deployment (restore) environment during checkpoint. Had you specified the policy together you would be overriding the behaviour on checkpoint which was already executed, which is even more confusing. > >> I've started with a single policy enum but it turned out the inlined cross product of behaviours was rather a long and repetitive. > > Do you have artifacts of that? 
Because it may mean that cross-product may have a few invalid / unsafe behaviors. It would be interesting to look at the list. > >> In fact CLOSE + REOPEN is the combination that makes a perfect sense. > > Indeed, bad example. What about ERROR+OPEN_OTHER. Or for the sockets, for the current implementation we should not allow OPEN_OTHER, as the implementation can only throw RestoreException in this case. > And note that the restore configuration should be set on restore, because in case of opening other files you don't know where these are in the deployment (restore) environment during checkpoint. Had you specified the policy together you would be overriding the behaviour on checkpoint which was already executed, which is even more confusing. The combined policy can have additional parameters settable on restore. But the combined policy should be able to limit policies or parameters for restore part of the policy. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/69#discussion_r1224461715 From duke at openjdk.org Fri Jun 9 15:41:10 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 9 Jun 2023 15:41:10 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v3] In-Reply-To: References: <6x2UTOOK7xjTnx4D8z9h-G1T1Dizs_RGlhoy7YVIGtk=.c23cdbfc-41b6-4fbd-9055-7f107a664317@github.com> Message-ID: On Fri, 9 Jun 2023 15:22:34 GMT, Anton Kozlov wrote: >> This handling is meant for read-only access, too. >> I don't think that the concept of 'appending' makes sense from CRaC POV - the append flag just says whether you've started at the beginning or at the end; but this aims at transparent continuation of what you were doing as if no checkpoint happened. > > I don't completely follow. Consider an example of a log file in append write mode (every write guaranteed to append to file, even with concurrent writes). 
As for me, it's perfectly fine to restore with that log, even with a different offset, since the offset does not make a lot of sense in append mode. What I was trying to say is that it's not about reading the 'append' flag when the file was opened, because that was just saying "let's open and then seek to the end of the file". We shouldn't conclude that we want to start at the 'new end' (provided that the file was modified) on restore. Having a policy that will reopen at end, regardless of original offset, is certainly a valid option. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/69#discussion_r1224466586 From duke at openjdk.org Fri Jun 9 15:48:07 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 9 Jun 2023 15:48:07 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v3] In-Reply-To: References: <6x2UTOOK7xjTnx4D8z9h-G1T1Dizs_RGlhoy7YVIGtk=.c23cdbfc-41b6-4fbd-9055-7f107a664317@github.com> Message-ID: On Fri, 9 Jun 2023 15:34:44 GMT, Anton Kozlov wrote: >>> I've started with a single policy enum but it turned out the inlined cross product of behaviours was rather long and repetitive. >> >> Do you have artifacts of that? Because it may mean that cross-product may have a few invalid / unsafe behaviors. It would be interesting to look at the list. >> >>> In fact CLOSE + REOPEN is the combination that makes perfect sense. >> >> Indeed, bad example. What about ERROR+OPEN_OTHER. Or for the sockets, for the current implementation we should not allow OPEN_OTHER, as the implementation can only throw RestoreException in this case. > >> And note that the restore configuration should be set on restore, because in case of opening other files you don't know where these are in the deployment (restore) environment during checkpoint. Had you specified the policy together you would be overriding the behaviour on checkpoint which was already executed, which is even more confusing.
> > The combined policy can have additional parameters settable on restore. But the combined policy should be able to limit policies or parameters for restore part of the policy. When the behaviour is ERROR, you won't have a chance to use OPEN_OTHER, because you never get to restore. If you set policy OPEN_OTHER on FD 42, the restore will throw an exception. We could completely fail the restore and exit, too. So yes, we do not allow that. We could fail when already reading policies that look like SOCKET=OPEN_OTHER:..., why not add an extra check. But failing when we actually try to do that is only a fraction of second later. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/69#discussion_r1224473302 From akozlov at openjdk.org Fri Jun 9 15:51:05 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 9 Jun 2023 15:51:05 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v3] In-Reply-To: References: <6x2UTOOK7xjTnx4D8z9h-G1T1Dizs_RGlhoy7YVIGtk=.c23cdbfc-41b6-4fbd-9055-7f107a664317@github.com> Message-ID: On Fri, 9 Jun 2023 10:12:56 GMT, Radim Vansa wrote: > Someone already asked for automatic reopening of listening sockets. It was me :) > The policy could then e.g. filter specific ports (similar to files). Yes, you can't to anything but KEEP_CLOSED for sockets related to connection. Connection-less sockets may be fine as well. Maybe even stream sockets, with a big disclaimer that the policy administrator is fully responsible for breaking communication logic. By default, we should do the most safest thing. > I assume that the restore policy will be set during restore, rather than during checkpoint, so there's no chance to handle a non-sense configuration during checkpoint (because the configuration is not present yet). 
I think this can be implemented https://github.com/openjdk/crac/pull/69#discussion_r1224461715 > Growing FileDescriptor has a point, any complex implementation for sockets should be hosted next to other Socket related stuff. For files, I don't know if there's a better place, but we could refactor it to a util class. #79 does that with JDKFileResource, which provides a limited hard-coded policy for files: exceptions are thrown only for files not in classpath. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/69#discussion_r1224477524 From akozlov at openjdk.org Fri Jun 9 16:10:07 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 9 Jun 2023 16:10:07 GMT Subject: [crac] RFR: Move more FD tracking to java layer In-Reply-To: References: Message-ID: On Wed, 7 Jun 2023 10:51:41 GMT, Anton Kozlov wrote: > The PR develops the idea of file descriptor tracking in Java started in #43. In general, that PR had two purposes. First, it provides CheckpointExceptions in terms that are clear for Java developers, improving the experience of developing for CRaC. So if a FileDescriptor causes an exception, it's possible to look at the heap dump and find references to the offending FD, or to look at the stack trace when FD was created. And second, Java FD tracking is independent of the platform, so that was the first step to bring CRaC to non-Linux platforms, but that is a somewhat longer road. > > We can eliminate manual heap inspection, and this is proposed in this PR. A FileDescriptor does not exist on its own but it is owned by some higher-level Java object implementation. So an object can "claim" a FileDescriptor and define how and if to report the FD to the user. E.g. Socket can describe its port and address without deep inspection of the process internals. Turns out, Socket.toString() provides enough information (but the reporting can be extended later if required).
>
>
> Suppressed: jdk.crac.impl.CheckpointOpenSocketException: Socket[addr=localhost/127.0.0.1,port=39957,localport=41464]
>     at java.base/java.net.SocketImpl$SocketResource.lambda$beforeCheckpoint$0(SocketImpl.java:123)
>     at java.base/jdk.crac.Core.lambda$checkpointRestore1$0(Core.java:128)
>     ... 7 more
>
>
> A FileDescriptor is claiming itself in case there is a bug in JDK that no higher-level object is claiming the FD. FD provides just a very short description just for debugging. With stack trace to FD (which is a very nice debugging aid!), that should be enough to find the containing object and implement claiming.
>
> I believe this overlaps with #69, which at first glance would benefit a lot from being able to define policies in the domain objects. I'll comment on this after a closer look at the other PR.

Now I spot an issue in the current state that may be related:

anton@mercury:~/proj/crac$ git show -s
commit a282698d2bf01588172e8f54c4cfedf40f203a68 (HEAD -> crac, jdk/crac/crac)
Author: Radim Vansa
Date:   Fri Jun 9 12:59:18 2023 +0000

    Make CheckpointException/RestoreException aggregate-only

    Reviewed-by: akozlov

There is an expected exception for a jar file provided to classpath with a very simple .java application.

anton@mercury:~/proj/crac/test-jtreg$ ./../build/linux-x86_64-server-release/images/jdk/bin/java -XX:+UnlockDiagnosticVMOptions -XX:+CRPrintResourcesOnCheckpoint -XX:CREngine=simengine -XX:+CRHeapDumpOnCheckpointException -XX:CRaCCheckpointTo=cr -Djdk.crac.ResourceManager.debug=true -DcallCR -Djdk.crac.collect-fd-stacktraces=true -cp /home/anton/.m2/repository/org/crac/crac/1.3.0/crac-1.3.0.jar Test.java
beforeCheckpoint
Jun 09, 2023 5:17:28 PM jdk.internal.crac.LoggerContainer info
INFO: /home/anton/.m2/repository/org/crac/crac/1.3.0/crac-1.3.0.jar is recorded as always available on restore
Jun 09, 2023 5:17:28 PM jdk.internal.crac.LoggerContainer info
INFO: /home/anton/.m2/repository/org/crac/crac/1.3.0/crac-1.3.0.jar is recorded as always available on restore
JVM: FD fd=0 type=character path="/dev/pts/43"OK: claimed by java code
JVM: FD fd=1 type=character path="/dev/pts/43"OK: claimed by java code
JVM: FD fd=2 type=character path="/dev/pts/43"OK: claimed by java code
JVM: FD fd=3 type=regular path="/home/anton/proj/crac/build/linux-x86_64-server-release/images/jdk/lib/modules"OK: inherited from process env
JVM: FD fd=4 type=regular path="/home/anton/.m2/repository/org/crac/crac/1.3.0/crac-1.3.0.jar"OK: claimed by java code
Dumping heap to java_pid525343.hprof ...
Heap dump file created [9190694 bytes in 0.022 secs]
afterRestore
Exception in thread "main" org.crac.CheckpointException
    at org.crac.Core$Compat.checkpointRestore(Core.java:144)
    at org.crac.Core.checkpointRestore(Core.java:237)
    at Test.main(Test.java:57)
    Suppressed: jdk.crac.impl.CheckpointOpenFileException: FileDescriptor 6 left open: /home/anton/.m2/repository/org/crac/crac/1.3.0/crac-1.3.0.jar (regular)
        at java.base/java.io.FileDescriptor.beforeCheckpoint(FileDescriptor.java:381)
        at java.base/java.io.FileDescriptor$Resource.beforeCheckpoint(FileDescriptor.java:82)
        at java.base/jdk.crac.impl.AbstractContext.invokeBeforeCheckpoint(AbstractContext.java:44)
        at java.base/jdk.crac.impl.AbstractContext.beforeCheckpoint(AbstractContext.java:59)
        at java.base/jdk.crac.impl.BlockingOrderedContext.beforeCheckpoint(BlockingOrderedContext.java:40)
        at java.base/jdk.crac.impl.AbstractContext.invokeBeforeCheckpoint(AbstractContext.java:44)
        at java.base/jdk.crac.impl.AbstractContext.beforeCheckpoint(AbstractContext.java:59)
        at java.base/jdk.crac.impl.BlockingOrderedContext.beforeCheckpoint(BlockingOrderedContext.java:40)
        at java.base/jdk.crac.Core.checkpointRestore1(Core.java:120)
        at java.base/jdk.crac.Core.checkpointRestore(Core.java:268)
        at java.base/jdk.crac.Core.checkpointRestore(Core.java:247)
        at java.base/javax.crac.Core.checkpointRestore(Core.java:71)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:568)
        at org.crac.Core$Compat.checkpointRestore(Core.java:141)
        at org.crac.Core.checkpointRestore(Core.java:237)
        at Test.main(Test.java:57)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:568)
        at jdk.compiler/com.sun.tools.javac.launcher.Main.execute(Main.java:419)
        at jdk.compiler/com.sun.tools.javac.launcher.Main.run(Main.java:192)
        at jdk.compiler/com.sun.tools.javac.launcher.Main.main(Main.java:132)
    Caused by: java.lang.Exception: This file descriptor was created by main at epoch:1686320248160 here
        at java.base/java.io.FileDescriptor$Resource.<init>(FileDescriptor.java:72)
        at java.base/java.io.FileDescriptor.<init>(FileDescriptor.java:97)
        at java.base/sun.nio.fs.UnixChannelFactory.open(UnixChannelFactory.java:290)
        at java.base/sun.nio.fs.UnixChannelFactory.newFileChannel(UnixChannelFactory.java:133)
        at java.base/sun.nio.fs.UnixChannelFactory.newFileChannel(UnixChannelFactory.java:146)
        at java.base/sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:217)
        at java.base/java.nio.file.Files.newByteChannel(Files.java:380)
        at java.base/java.nio.file.Files.newByteChannel(Files.java:432)
        at jdk.zipfs/jdk.nio.zipfs.ZipFileSystem.<init>(ZipFileSystem.java:172)
        at jdk.zipfs/jdk.nio.zipfs.ZipFileSystemProvider.getZipFileSystem(ZipFileSystemProvider.java:125)
        at jdk.zipfs/jdk.nio.zipfs.ZipFileSystemProvider.newFileSystem(ZipFileSystemProvider.java:120)
        at jdk.compiler/com.sun.tools.javac.file.JavacFileManager$ArchiveContainer.<init>(JavacFileManager.java:567)
        at jdk.compiler/com.sun.tools.javac.file.JavacFileManager.getContainer(JavacFileManager.java:331)
        at jdk.compiler/com.sun.tools.javac.file.JavacFileManager.pathsAndContainers(JavacFileManager.java:1075)
        at jdk.compiler/com.sun.tools.javac.file.JavacFileManager.indexPathsAndContainersByRelativeDirectory(JavacFileManager.java:1030)
        at java.base/java.util.HashMap.computeIfAbsent(HashMap.java:1219)
        at jdk.compiler/com.sun.tools.javac.file.JavacFileManager.pathsAndContainers(JavacFileManager.java:1018)
        at jdk.compiler/com.sun.tools.javac.file.JavacFileManager.list(JavacFileManager.java:774)
        at java.compiler@17-internal/javax.tools.ForwardingJavaFileManager.list(ForwardingJavaFileManager.java:79)
        at jdk.compiler/com.sun.tools.javac.code.ClassFinder.list(ClassFinder.java:737)
        at jdk.compiler/com.sun.tools.javac.code.ClassFinder.scanUserPaths(ClassFinder.java:681)
        at jdk.compiler/com.sun.tools.javac.code.ClassFinder.fillIn(ClassFinder.java:555)
        at jdk.compiler/com.sun.tools.javac.code.ClassFinder.complete(ClassFinder.java:299)
        at jdk.compiler/com.sun.tools.javac.code.Symtab.lambda$addRootPackageFor$8(Symtab.java:810)
        at jdk.compiler/com.sun.tools.javac.code.Symbol.complete(Symbol.java:682)
        at jdk.compiler/com.sun.tools.javac.comp.Enter.visitTopLevel(Enter.java:356)
        at jdk.compiler/com.sun.tools.javac.tree.JCTree$JCCompilationUnit.accept(JCTree.java:544)
        at jdk.compiler/com.sun.tools.javac.comp.Enter.classEnter(Enter.java:286)
        at jdk.compiler/com.sun.tools.javac.comp.Enter.classEnter(Enter.java:301)
        at jdk.compiler/com.sun.tools.javac.comp.Enter.complete(Enter.java:603)
        at jdk.compiler/com.sun.tools.javac.comp.Enter.main(Enter.java:587)
        at jdk.compiler/com.sun.tools.javac.main.JavaCompiler.enterTrees(JavaCompiler.java:1042)
        at jdk.compiler/com.sun.tools.javac.main.JavaCompiler.compile(JavaCompiler.java:917)
        at jdk.compiler/com.sun.tools.javac.api.JavacTaskImpl.lambda$doCall$0(JavacTaskImpl.java:104)
        at jdk.compiler/com.sun.tools.javac.api.JavacTaskImpl.invocationHelper(JavacTaskImpl.java:152)
        at jdk.compiler/com.sun.tools.javac.api.JavacTaskImpl.doCall(JavacTaskImpl.java:100)
        at jdk.compiler/com.sun.tools.javac.api.JavacTaskImpl.call(JavacTaskImpl.java:94)
        at jdk.compiler/com.sun.tools.javac.launcher.Main.compile(Main.java:383)
        at jdk.compiler/com.sun.tools.javac.launcher.Main.run(Main.java:189)
        ... 1 more

I think it may be related to the Persistent JarFile that had been claiming the FD: it was collected, and for some time only the FileDescriptor existed, so it had a chance to report the exception. For some reason I don't see the problem after this change, likely because discovering a Resource and generating the Exception are separated in time. I don't think the problem is completely fixed by this, but I think a fix will employ better claiming anyway.

-------------

PR Comment: https://git.openjdk.org/crac/pull/79#issuecomment-1584822600

From akozlov at openjdk.org Fri Jun 9 16:33:49 2023
From: akozlov at openjdk.org (Anton Kozlov)
Date: Fri, 9 Jun 2023 16:33:49 GMT
Subject: [crac] RFR: Move more FD tracking to java layer [v2]
In-Reply-To:
References:
Message-ID:

> The PR develops the idea of file descriptors tracking in Java started in #43. In general, that PR had two purposes. First, it provides CheckpointExceptions in terms that are clear for Java developers, improving the experience of developing for CRaC. So if a FileDescriptor causes an exception, it's possible to look at the heap dump and find references to the offending FD, or to look at the stack trace when FD was created. And second, Java FD tracking is independent of the platform, so that was the first step to bring CRaC to non-Linux platforms, but that is a bit longer road.
>
> We can eliminate manual heap inspection, and this is proposed in this PR. A FileDescriptor does not exist on its own but it is owned by some higher-level Java object implementation. So an object can "claim" a FileDescriptor and define how and if to report the FD to the user. E.g. Socket can describe its port and address without deep inspection of the process internals. Turns out, Socket.toString() provides enough information (but the reporting can be extended later if required).
>
>
> Suppressed: jdk.crac.impl.CheckpointOpenSocketException: Socket[addr=localhost/127.0.0.1,port=39957,localport=41464]
>     at java.base/java.net.SocketImpl$SocketResource.lambda$beforeCheckpoint$0(SocketImpl.java:123)
>     at java.base/jdk.crac.Core.lambda$checkpointRestore1$0(Core.java:128)
>     ... 7 more
>
>
> A FileDescriptor is claiming itself in case there is a bug in JDK that no higher-level object is claiming the FD. FD provides just a very short description just for debugging. With stack trace to FD (which is a very nice debugging aid!), that should be enough to find the containing object and implement claiming.
>
> I believe this overlaps with #69, which at first glance would benefit a lot from being able to define policies in the domain objects. I'll comment on this after a closer look at the other PR.

Anton Kozlov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 95 commits:

 - Merge remote-tracking branch 'jdk/crac/crac' into newfd
 - Fix a bug in classPathMatching
 - Remove misleading UnreachableSocket
 - Merge remote-tracking branch 'jdk/crac/crac' into newfd
 - Cleanup
 - Remove AcceptingSocketResource

   Does not need to exist
 - Cleanup and extend FileDescriptor nativeDesc
 - Update
 - Workaround for the blocking context
 - Claim closing socket
 - ... and 85 more: https://git.openjdk.org/crac/compare/a282698d...bcfd07a4

-------------

Changes: https://git.openjdk.org/crac/pull/79/files
 Webrev: https://webrevs.openjdk.org/?repo=crac&pr=79&range=01
  Stats: 724 lines in 22 files changed: 370 ins; 277 del; 77 mod
  Patch: https://git.openjdk.org/crac/pull/79.diff
  Fetch: git fetch https://git.openjdk.org/crac.git pull/79/head:pull/79

PR: https://git.openjdk.org/crac/pull/79

From akozlov at openjdk.org Fri Jun 9 17:47:17 2023
From: akozlov at openjdk.org (Anton Kozlov)
Date: Fri, 9 Jun 2023 17:47:17 GMT
Subject: [crac] RFR: Make CheckpointException/RestoreException aggregate-only [v8]
In-Reply-To:
References:
Message-ID:

On Fri, 9 Jun 2023 15:28:49 GMT, Radim Vansa wrote:

>> src/java.base/share/classes/jdk/crac/Core.java line 89:
>>
>>> 87: if (codes.length == 0) {
>>> 88: exception.addSuppressed(new RuntimeException("Native checkpoint failed."));
>>> 89: }
>>
>> Turns out this fires on checkpoint dry runs. Suppose a Resource has thrown an exception, we go to JVM just to check no native FD is open. When everything is OK on native side (codes and messages are empty), this exception is thrown, although no native checkpoint is attempted.
>
> Shouldn't a dry-run return `JVM_CHECKPOINT_OK`? Which test is that, `DryRunTest` works for me?

The test does not cover this case: when a Resource throws something and we go to dry run JVM inspection, but the inspection does not find anything. Turns out, ContextOrderTest fails for the same reason.

JVM dry-run now always returns CHECKPOINT_ERROR by implementation. But in general the code should depend on the inspection result: OK if no offending descriptors are found, ERROR (better to be FD_ERROR) when there are open file descriptors.
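
The rule in the last paragraph can be pictured with a minimal sketch. Everything here is hypothetical: the `Result` enum, its `OK`/`FD_ERROR` values, and `dryRunResult` are illustrative names, not the actual JVM return codes or CRaC sources. It only shows the result following the inspection outcome instead of being a constant error.

```java
import java.util.List;

public class DryRunSketch {
    // Hypothetical result codes; the real JVM constants may be named differently.
    enum Result { OK, FD_ERROR }

    // The dry-run result depends on what the inspection found:
    // OK when no offending descriptor is open, FD_ERROR otherwise.
    static Result dryRunResult(List<Integer> offendingFds) {
        return offendingFds.isEmpty() ? Result.OK : Result.FD_ERROR;
    }

    public static void main(String[] args) {
        System.out.println(dryRunResult(List.of()));   // prints "OK"
        System.out.println(dryRunResult(List.of(6)));  // prints "FD_ERROR"
    }
}
```

With a result like this, the Java side would only need a generic "native checkpoint failed" entry when the result is an error and no details came back.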

-------------

PR Review Comment: https://git.openjdk.org/crac/pull/64#discussion_r1224587774

From akozlov at azul.com Fri Jun 9 17:57:59 2023
From: akozlov at azul.com (Anton Kozlov)
Date: Fri, 9 Jun 2023 20:57:59 +0300
Subject: Project CRaC to track openjdk/jdk
Message-ID:

Project CRaC has been developed for a while and has attracted considerable interest in the Java community [1]. At this point, we need even further spread of the API, which is not possible without the CRaC API eventually appearing in the mainline. Thus, I propose to base future development on top of openjdk/jdk, tracking the master branch. This will also make developers' life a bit easier as we'll automatically get the most recent fixes.

The main focus of the development will be the new master-crac branch. But I also propose to keep a branch for jdk17-crac, with a snapshot of the state before the merge of openjdk/jdk, plus occasional backports of breaking API changes. So jdk17-crac will still be a version suitable for wider try-out.

At the moment we have a pretty long queue of PRs and unfinished work. I propose to concentrate on the long-standing PRs and the topics we've started working on, e.g. configurable CPU features, Context behaviors, and FileDescriptors. Before the transition, we need to create an EA build that is not worse than the previous one [1] in terms of quality and usability.

Exceptions are possible: we may accept a new enhancement, or postpone an existing PR if it is judged risky enough. The decision will be made on a case-by-case basis. I'll announce dates when we are closer to finishing the PRs.

[1] https://github.com/openjdk/crac/tree/crac-17+5 Thanks, Anton From jkratochvil at openjdk.org Sun Jun 11 10:27:46 2023 From: jkratochvil at openjdk.org (Jan Kratochvil) Date: Sun, 11 Jun 2023 10:27:46 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v31] In-Reply-To: References: Message-ID: > Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. > > 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. > 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC > > (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: > >> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine > > It is intended to specify the lowest common denominator of all CPUs in a farm. Instead of a possible crash of OpenJDK it will now refuse to run: > >> Error occurred during initialization of VM >> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi > > (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. 
That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. > > If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there is a precedent ... Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: Fix printing missing features on target CPU. ------------- Changes: - all: https://git.openjdk.org/crac/pull/41/files - new: https://git.openjdk.org/crac/pull/41/files/d3cb3b55..3bf6dfcd Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=30 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=29-30 Stats: 294 lines in 6 files changed: 151 ins; 117 del; 26 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41 From jkratochvil at openjdk.org Sun Jun 11 13:57:27 2023 From: jkratochvil at openjdk.org (Jan Kratochvil) Date: Sun, 11 Jun 2023 13:57:27 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v32] In-Reply-To: References: Message-ID: > Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. > > 1. 
An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. > 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC > > (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: > >> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine > > It is intended to specify the lowest common denominator of all CPUs in a farm. Instead of a possible crash of OpenJDK it will now refuse to run: > >> Error occurred during initialization of VM >> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi > > (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. 
> > If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there is a precedent ... Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: Simplify error reporting by err_msg(). ------------- Changes: - all: https://git.openjdk.org/crac/pull/41/files - new: https://git.openjdk.org/crac/pull/41/files/3bf6dfcd..9d2298ba Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=31 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=30-31 Stats: 134 lines in 2 files changed: 8 ins; 73 del; 53 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41 From jkratochvil at openjdk.org Sun Jun 11 13:59:15 2023 From: jkratochvil at openjdk.org (Jan Kratochvil) Date: Sun, 11 Jun 2023 13:59:15 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v30] In-Reply-To: References: Message-ID: <8HZ9khc4s5Ua19t_WqREjVVW1QiXFTu9vFKljdm4cu4=.542078d3-c292-49a7-9e75-adcbe31872ba@github.com> On Thu, 8 Jun 2023 13:30:40 GMT, Anton Kozlov wrote: > I meet with the same error as reported by GHA I have fixed that. I am mostly ignoring GHA as there are too many false positives. The problem was I was using only _release_ builds for some time where it did not reproduce. > Replacing vm_exit_during_initialization() with fatal() at least provides the hs_err and the stack trace. I did not want a stack trace. Those are normal error messages, stack trace would be rather confusing. Please request it again if you really mean it. 
> I'm not 100% sure fatal() is correct in that state, so I propose a vararg macro/function that expands to fatal(...), which can easily be replaced with something different. I have found there exists **err_msg()** so I have used it. That **jio_snprintf()**+**vm_exit_during_initialization()** are also used at some hotspot places which is why I used that originally like a copycat. Addressed now by: https://github.com/openjdk/crac/pull/41/commits/9d2298ba9d89c1a2f7e5143153c8e280da6dccec ------------- PR Comment: https://git.openjdk.org/crac/pull/41#issuecomment-1586176111 From jkratochvil at openjdk.org Sun Jun 11 14:05:16 2023 From: jkratochvil at openjdk.org (Jan Kratochvil) Date: Sun, 11 Jun 2023 14:05:16 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v32] In-Reply-To: References: Message-ID: <_R449Uk0rxsbNqq1haWsn9SrJC03rcrk02wh9CVJidU=.e1d82609-aee5-408a-94ed-ca7fc51d4f59@github.com> On Sun, 11 Jun 2023 13:57:27 GMT, Jan Kratochvil wrote: >> Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. >> >> 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. >> 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC >> >> (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: >> >>> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine >> >> It is intended to specify the lowest common denominator of all CPUs in a farm. 
Instead of a possible crash of OpenJDK it will now refuse to run: >> >>> Error occurred during initialization of VM >>> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi >> >> (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. >> >> If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfo... > > Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: > > Simplify error reporting by err_msg(). > Can we just avoid code-generatation on repeated calls of initialize_features? And drop thaw() completely? I have done so now. 
The original idea was calling it so that if a snapshot is done on old CPU and restored on a new CPU the hotspot will still benefit from new CPU for newly compiled JIT code. But it is true with #57 it may become more questionable what is correct to do. Also the last variant of this patch has no longer been benefiting from this code-regeneration anymore. ------------- PR Comment: https://git.openjdk.org/crac/pull/41#issuecomment-1586177848 From jkratochvil at openjdk.org Sun Jun 11 23:59:44 2023 From: jkratochvil at openjdk.org (Jan Kratochvil) Date: Sun, 11 Jun 2023 23:59:44 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v33] In-Reply-To: References: Message-ID: > Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. > > 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. > 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC > > (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: > >> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine > > It is intended to specify the lowest common denominator of all CPUs in a farm. 
Instead of a possible crash of OpenJDK it will now refuse to run: > >> Error occurred during initialization of VM >> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi > > (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. > > If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there is a precedent ... Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: Compatibility with non-X86; untested. 
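
The arithmetic behind the quoted error message can be reproduced with a short sketch. It assumes only that the feature sets are plain bit masks, as the hexadecimal values suggest; the class and method names are made up for illustration and are not HotSpot code.

```java
public class CpuFeatureMaskSketch {
    // Lowest common denominator of a farm: the bits set on every machine.
    // With no arguments this returns all ones, so pass at least one mask.
    static long commonFeatures(long... machineMasks) {
        long common = ~0L;
        for (long mask : machineMasks) {
            common &= mask;
        }
        return common;
    }

    // Features the checkpoint image was built with that this CPU lacks.
    static long missingFeatures(long imageMask, long thisCpuMask) {
        return imageMask & ~thisCpuMask;
    }

    public static void main(String[] args) {
        long image = 0x7fff9dfcfbf7L;    // features stored in the checkpoint (from the message)
        long thisCpu = 0x421801fcfbd7L;  // features of the restoring machine (from the message)

        // prints "missing = 0x3de79c000020"
        System.out.printf("missing = 0x%x%n", missingFeatures(image, thisCpu));
        // prints "-XX:CPUFeatures=0x421801fcfbd7"
        System.out.printf("-XX:CPUFeatures=0x%x%n", commonFeatures(image, thisCpu));
    }
}
```

Both printed values match the numbers in the error text: the missing set is the image mask with the local bits cleared, and the suggested option value is the intersection of the two machines.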
------------- Changes: - all: https://git.openjdk.org/crac/pull/41/files - new: https://git.openjdk.org/crac/pull/41/files/9d2298ba..5e79982b Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=32 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=31-32 Stats: 13 lines in 7 files changed: 5 ins; 6 del; 2 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41 From duke at openjdk.org Mon Jun 12 08:29:12 2023 From: duke at openjdk.org (Radim Vansa) Date: Mon, 12 Jun 2023 08:29:12 GMT Subject: [crac] RFR: Move more FD tracking to java layer [v2] In-Reply-To: References: Message-ID: On Fri, 9 Jun 2023 16:33:49 GMT, Anton Kozlov wrote: >> The PR develops the idea of file descriptors tracking in Java started in #43. In general, that PR had two purposes. First, it provides CheckpointExceptions in terms that are clear for Java developers, improving the experience of developing for CRac. So if a FileDescriptor causes an exception, it's possible to look at the heap dump and find references to the offending FD, or to look at the stack trace when FD was created. And second, Java FD tracking is independent of the platform, so that was the first step to bring CRaC to non-Linux platforms, but that is a bit longer road. >> >> We can eliminate manual heap inspection, and this is proposed in this PR. A FileDescriptor does not exist on its own but it is owned by some higher-level Java object implementation. So an object can "claim" a FileDescriptor and define how and if to report the FD to the user. E.g. Socket can describe the its port and address without deep inspection of the process internals. Turns out, Socket.toString() provides enough information (but the reporting can be extended later if required). 
>> >> >> Suppressed: jdk.crac.impl.CheckpointOpenSocketException: Socket[addr=localhost/127.0.0.1,port=39957,localport=41464] >> at java.base/java.net.SocketImpl$SocketResource.lambda$beforeCheckpoint$0(SocketImpl.java:123) >> at java.base/jdk.crac.Core.lambda$checkpointRestore1$0(Core.java:128) >> ... 7 more >> >> >> A FileDescriptor is claiming itself in case there is a bug in JDK that no higher-level object is claiming the FD. FD provides just a very short description just for debugging. With stack trace to FD (which is a very nice debugging aid!), that should be enough to find the containing object and implement claiming. >> >> I believe this overlaps with #69, which at first glance would benefit a lot from being able to define policies in the domain objects. I'll comment on this after a closer look at the other PR. > > Anton Kozlov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 95 commits: > > - Merge remote-tracking branch 'jdk/crac/crac' into newfd > - Fix a bug in classPathMatching > - Remove misleading UnreachableSocket > - Merge remote-tracking branch 'jdk/crac/crac' into newfd > - Cleanup > - Remove AcceptingSocketResource > > Does not need to exist > - Cleanup and extend FileDescriptor nativeDesc > - Update > - Workaround for the blocking context > - Claim closing socket > - ... and 85 more: https://git.openjdk.org/crac/compare/a282698d...bcfd07a4 > I'm really not sure about Windows. With this, Socket details will be reported on Windows without additional code for reporting. Being able to retrieve details from the Socket class is certainly nice, though you haven't removed the socket-describing code and you won't, as there are cases where sockets are created natively without the Socket class and you still want to show info about those. 
> That requires additional code in the FD that needs to know all possible file descriptors types on the system, and new interfaces to extract information from them, like you had to do with Sockets. ... and this is present again, but this time in the native code. I was trying to extract the bits in a form that Java understands, though getting them was not a one-liner in the native, so I understand that it might be tempting to just format that in native as well and have fewer native methods. I was hoping for eventual reuse. > Since when this becomes a prerequisite? :) Refactorings and non-functional changes are fine for CRaC Project. Two points: - having a case where this is actually helpful explains the general reasoning behind the refactoring - the example can help verifying that the changes are complete (to an extent) > Before the patch, claimFdWeak and claimFd were proto versions of this claiming. As you can see, they required "external" ordering between multiple claimers, thus we had PRE_FILE_DESCRIPTOR and NATIVE_PRNG priorities before FILE_DESRIPTOR, to override FileDescriptors throwing an exception. That did not scale well. The NATIVE_PRNG priority is still present, do you expect to remove those in a follow up? I have realized that this is removing some technical debt part of the `claimFd`/`claimFdWeak` part, and I am good with tracking the ownership graph. However I am worried that while you have 'simplified' some parts in FileDescriptor #69 would reintroduce those, or be forced to pass the FD state info in an opaque way. src/java.base/share/classes/java/io/FileDescriptor.java line 67: > 65: public void beforeCheckpoint(Context context) throws Exception { > 66: if (!closedByNIO && valid()) { > 67: ClaimedFDs ctx = Core.getClaimedFDs(); nit: not a context anymore Please check other uses of `getClaimedFDs` as well. 
src/java.base/share/classes/jdk/crac/impl/ExceptionHolder.java line 30: > 28: } > 29: > 30: public void reSuppress(Exception e) { Was thinking about a better name... `adoptSuppressed`? src/java.base/share/classes/jdk/internal/crac/ClaimedFDs.java line 50: > 48: } > 49: > 50: public static class Tuple { This class deserves a proper name, esp. since it's used outside of this class and doesn't need to be generic. src/java.base/share/classes/jdk/internal/crac/ClaimedFDs.java line 62: > 60: } > 61: > 62: public void claimFd(FileDescriptor fd, Object claimer, Supplier supplier, Object... suppressedClaimers) { Knowing all potential claimers seems limiting; it would require breaking encapsulation in a more complex chain. I suggest changing this to `claim(Object claimed, Object claimer, Supplier supplier)` and then assuming that `claimed == suppressedClaimer`. The implementation would handle the `claimed == claimer` case for FD a bit differently, but e.g. `ZipFile` would not claim the FD itself, only the `RandomAccessFile zfile`. All claims would be recorded (not overwriting anything) and the final claimant would be resolved only after all notifications complete. src/java.base/unix/native/libjava/FileDescriptor_md.c line 117: > 115: } > 116: > 117: static const char* family2str(int family) { I don't think this belongs to FileDescriptor. You have tried to let the Java part of FileDescriptor not know about its type, but only by moving socket-specific handling to native code, which contradicts the title of this PR.
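The deferred claim resolution suggested above for `ClaimedFDs` could look roughly like the following sketch. It is not code from the PR: the class `ClaimGraphSketch`, the nested `Claim` type, the `resolve` method and the `Supplier<Exception>` type parameter are all assumptions made for illustration. Every claim is recorded without overwriting anything, and the final claimant is found only afterwards by following the `claimed -> claimer` links, with a self-claim (`claimed == claimer`) used only as a fallback.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch; this is not the ClaimedFDs class from the PR.
class ClaimGraphSketch {
    // One recorded claim: who claimed an object and how to report it.
    static final class Claim {
        final Object claimer;
        final Supplier<Exception> supplier;

        Claim(Object claimer, Supplier<Exception> supplier) {
            this.claimer = claimer;
            this.supplier = supplier;
        }
    }

    // Identity-based map: distinct objects are never merged via equals().
    private final Map<Object, List<Claim>> claims = new IdentityHashMap<>();

    // Record the claim; nothing is overwritten or resolved at this point.
    void claim(Object claimed, Object claimer, Supplier<Exception> supplier) {
        claims.computeIfAbsent(claimed, k -> new ArrayList<>())
              .add(new Claim(claimer, supplier));
    }

    // To be called after all notifications: follow claimed -> claimer links
    // up to the outermost owner. A self-claim is kept only as a fallback.
    Claim resolve(Object claimed) {
        Set<Object> visited = Collections.newSetFromMap(new IdentityHashMap<>());
        Claim best = null;
        Object current = claimed;
        while (visited.add(current)) {
            Claim external = null;
            for (Claim c : claims.getOrDefault(current, Collections.emptyList())) {
                if (c.claimer != current) {
                    external = c;
                    break;
                }
                if (best == null) {
                    best = c; // self-claim, used only if no owner shows up
                }
            }
            if (external == null) {
                break;
            }
            best = external;
            current = external.claimer;
        }
        return best;
    }

    public static void main(String[] args) {
        Object fd = new Object();   // stands in for a FileDescriptor
        Object raf = new Object();  // stands in for the RandomAccessFile zfile
        Object zip = new Object();  // stands in for a ZipFile
        ClaimGraphSketch g = new ClaimGraphSketch();
        g.claim(fd, fd, () -> new Exception("FileDescriptor (fallback)"));
        g.claim(fd, raf, () -> new Exception("RandomAccessFile"));
        g.claim(raf, zip, () -> new Exception("ZipFile"));
        System.out.println(g.resolve(fd).supplier.get().getMessage());
    }
}
```

In this sketch the `ZipFile` stand-in never touches the descriptor; it only claims the object it owns directly, which is the encapsulation argument made in the review comment. How the real implementation would pick between several external claimers of the same object is left open here; the sketch simply takes the first one.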
------------- PR Review: https://git.openjdk.org/crac/pull/79#pullrequestreview-1474173204 PR Review Comment: https://git.openjdk.org/crac/pull/79#discussion_r1226195086 PR Review Comment: https://git.openjdk.org/crac/pull/79#discussion_r1226178016 PR Review Comment: https://git.openjdk.org/crac/pull/79#discussion_r1226179869 PR Review Comment: https://git.openjdk.org/crac/pull/79#discussion_r1226192993 PR Review Comment: https://git.openjdk.org/crac/pull/79#discussion_r1226205671 From duke at openjdk.org Mon Jun 12 08:32:23 2023 From: duke at openjdk.org (Radim Vansa) Date: Mon, 12 Jun 2023 08:32:23 GMT Subject: [crac] RFR: Move more FD tracking to java layer [v2] In-Reply-To: References: Message-ID: On Fri, 9 Jun 2023 16:33:49 GMT, Anton Kozlov wrote: >> The PR develops the idea of file descriptors tracking in Java started in #43. In general, that PR had two purposes. First, it provides CheckpointExceptions in terms that are clear for Java developers, improving the experience of developing for CRac. So if a FileDescriptor causes an exception, it's possible to look at the heap dump and find references to the offending FD, or to look at the stack trace when FD was created. And second, Java FD tracking is independent of the platform, so that was the first step to bring CRaC to non-Linux platforms, but that is a bit longer road. >> >> We can eliminate manual heap inspection, and this is proposed in this PR. A FileDescriptor does not exist on its own but it is owned by some higher-level Java object implementation. So an object can "claim" a FileDescriptor and define how and if to report the FD to the user. E.g. Socket can describe the its port and address without deep inspection of the process internals. Turns out, Socket.toString() provides enough information (but the reporting can be extended later if required). 
>> >> >> Suppressed: jdk.crac.impl.CheckpointOpenSocketException: Socket[addr=localhost/127.0.0.1,port=39957,localport=41464] >> at java.base/java.net.SocketImpl$SocketResource.lambda$beforeCheckpoint$0(SocketImpl.java:123) >> at java.base/jdk.crac.Core.lambda$checkpointRestore1$0(Core.java:128) >> ... 7 more >> >> >> A FileDescriptor is claiming itself in case there is a bug in JDK that no higher-level object is claiming the FD. FD provides just a very short description just for debugging. With stack trace to FD (which is a very nice debugging aid!), that should be enough to find the containing object and implement claiming. >> >> I believe this overlaps with #69, which at first glance would benefit a lot from being able to define policies in the domain objects. I'll comment on this after a closer look at the other PR. > > Anton Kozlov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 95 commits: > > - Merge remote-tracking branch 'jdk/crac/crac' into newfd > - Fix a bug in classPathMatching > - Remove misleading UnreachableSocket > - Merge remote-tracking branch 'jdk/crac/crac' into newfd > - Cleanup > - Remove AcceptingSocketResource > > Does not need to exist > - Cleanup and extend FileDescriptor nativeDesc > - Update > - Workaround for the blocking context > - Claim closing socket > - ... and 85 more: https://git.openjdk.org/crac/compare/a282698d...bcfd07a4 RE: the PersistenJarFile error - handling the classpath matching only in RandomAccessFile was a workaround, I agree with moving that lower to the FD's resource. 
------------- PR Comment: https://git.openjdk.org/crac/pull/79#issuecomment-1586836823 From duke at openjdk.org Mon Jun 12 09:37:16 2023 From: duke at openjdk.org (Radim Vansa) Date: Mon, 12 Jun 2023 09:37:16 GMT Subject: [crac] RFR: Make CheckpointException/RestoreException aggregate-only [v8] In-Reply-To: References: Message-ID: <1TpFzAoZDSbpfq_TxyazCAR5EiHL9jQEQoQV-kzWY_o=.6744ef72-ab83-428c-aa56-aef46fb46ad5@github.com> On Fri, 9 Jun 2023 17:44:04 GMT, Anton Kozlov wrote: >> Shouldn't a dry-run return `JVM_CHECKPOINT_OK`? Which test is that? `DryRunTest` works for me. > > The test does not cover this case: when a Resource throws something and we go to the dry-run JVM inspection, but the inspection does not find anything. > > Turns out, ContextOrderTest fails for the same reason. > > The JVM dry-run now always returns CHECKPOINT_ERROR by implementation. But in general the code should depend on the inspection result: OK if no offending descriptors are found, ERROR (better to be FD_ERROR) when there are open file descriptors. Fixed in https://github.com/openjdk/crac/pull/80 ------------- PR Review Comment: https://git.openjdk.org/crac/pull/64#discussion_r1226371508 From duke at openjdk.org Mon Jun 12 09:39:50 2023 From: duke at openjdk.org (Radim Vansa) Date: Mon, 12 Jun 2023 09:39:50 GMT Subject: [crac] RFR: Do not add generic native checkpoint error in dry runs Message-ID: When a Resource throws an exception we still enter the checks in native code. In such cases the native code should return a successful status if no further errors are found.
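The rule described above (the dry-run status should follow the inspection result instead of always being an error) can be illustrated with a small sketch. The real check lives in native JVM code; the Java class `DryRunStatusSketch`, the `Status` enum and the `inspectionStatus` method below are hypothetical names chosen for illustration, and `FD_ERROR` only mirrors the naming proposed in the quoted comment.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; names are not taken from the CRaC sources.
class DryRunStatusSketch {
    enum Status { OK, FD_ERROR }

    // offendingFds: descriptions of descriptors the inspection found still open.
    // An exception thrown earlier by a Resource does not influence this status.
    static Status inspectionStatus(List<String> offendingFds) {
        return offendingFds.isEmpty() ? Status.OK : Status.FD_ERROR;
    }

    public static void main(String[] args) {
        System.out.println(inspectionStatus(Collections.emptyList()));
        System.out.println(inspectionStatus(
                Arrays.asList("FileDescriptor 4: regular: /etc/passwd")));
    }
}
```

With this shape, a Resource failure and an open-descriptor failure stay independent: the former is reported through the Resource's own exception, the latter only when the inspection actually finds something.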
------------- Commit messages: - Do not add generic native checkpoint error in dry runs Changes: https://git.openjdk.org/crac/pull/80/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=80&range=00 Stats: 83 lines in 2 files changed: 78 ins; 1 del; 4 mod Patch: https://git.openjdk.org/crac/pull/80.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/80/head:pull/80 PR: https://git.openjdk.org/crac/pull/80 From jkratochvil at openjdk.org Mon Jun 12 09:48:36 2023 From: jkratochvil at openjdk.org (Jan Kratochvil) Date: Mon, 12 Jun 2023 09:48:36 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v34] In-Reply-To: References: Message-ID: > Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. > > 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. > 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC > > (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: > >> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine > > It is intended to specify the lowest common denominator of all CPUs in a farm. 
Instead of a possible crash of OpenJDK it will now refuse to run: > >> Error occurred during initialization of VM >> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi > > (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. > > If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there is a precedent ... Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: Fix slowdebug compilation. Split better get_processor_features_hardware/get_processor_features_hotspot(). 
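The numbers in the error message above are consistent with plain bit-mask arithmetic: the missing features are the bits set in the checkpoint image but not on the restoring machine. The sketch below reproduces that calculation. It assumes that the "lowest common denominator" of a farm is the bitwise AND of all machines' masks; the class and method names are hypothetical and this is not the HotSpot implementation.

```java
// Hypothetical sketch of the mask arithmetic; not HotSpot code.
class CpuFeaturesSketch {
    // Features the image relies on that this machine does not provide.
    static long missing(long imageFeatures, long machineFeatures) {
        return imageFeatures & ~machineFeatures;
    }

    // Assumed meaning of "lowest common denominator": features present on
    // every machine. With no arguments all bits stay set.
    static long commonDenominator(long... masks) {
        long common = -1L;
        for (long m : masks) {
            common &= m;
        }
        return common;
    }

    public static void main(String[] args) {
        long image = 0x7fff9dfcfbf7L;    // features recorded in the checkpoint
        long machine = 0x421801fcfbd7L;  // features of the restoring CPU
        System.out.printf("missing = 0x%x%n", missing(image, machine));
        System.out.printf("common  = 0x%x%n", commonDenominator(image, machine));
    }
}
```

For the two masks from the message, `missing` yields 0x3de79c000020 and the common mask is 0x421801fcfbd7, which is the value the message asks to pass as `-XX:CPUFeatures` when creating the checkpoint.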
------------- Changes: - all: https://git.openjdk.org/crac/pull/41/files - new: https://git.openjdk.org/crac/pull/41/files/5e79982b..767b5fd8 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=33 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=32-33 Stats: 21 lines in 1 file changed: 9 ins; 9 del; 3 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41 From jkratochvil at openjdk.org Mon Jun 12 09:56:32 2023 From: jkratochvil at openjdk.org (Jan Kratochvil) Date: Mon, 12 Jun 2023 09:56:32 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v35] In-Reply-To: References: Message-ID: > Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. > > 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. > 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC > > (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: > >> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine > > It is intended to specify the lowest common denominator of all CPUs in a farm. 
Instead of a possible crash of OpenJDK it will now refuse to run: > >> Error occurred during initialization of VM >> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi > > (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. > > If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there is a precedent ... Jan Kratochvil has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 110 commits: - Merge branch 'crac' into crac-altstack-cpu-cpuexplicit-strip - Fix slowdebug compilation. Split better get_processor_features_hardware/get_processor_features_hotspot(). - Compatibility with non-X86; untested. 
- Simplify error reporting by err_msg(). - Fix printing missing features on target CPU. - Fix hotspot 'ht' vs. glibc 'htt'. - CPUFeatures refactorization. Start CPU Features checking without libc. - Reintroduce initialize_processor_count() requiring -XX:+CRaCCPUCountInit. - requested by Anton Kozlov - Remove initialize_processor_count(). - requested by Anton Kozlov - it was crashing for me for 4 CPU <-> 16 CPU moves - Reintroduce the "leftover" code which was not leftover. - ... and 100 more: https://git.openjdk.org/crac/compare/a282698d...7c567e99 ------------- Changes: https://git.openjdk.org/crac/pull/41/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=41&range=34 Stats: 1085 lines in 18 files changed: 1048 ins; 12 del; 25 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41 From akozlov at openjdk.org Mon Jun 12 11:27:04 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Mon, 12 Jun 2023 11:27:04 GMT Subject: [crac] RFR: Do not add generic native checkpoint error in dry runs In-Reply-To: References: Message-ID: On Mon, 12 Jun 2023 09:33:42 GMT, Radim Vansa wrote: > When a Resource throws an exception we still enter the checks in native code. In such cases the native code should return successful status if there are no further errors found. Thank you, LGTM! ------------- Marked as reviewed by akozlov (Lead). 
PR Review: https://git.openjdk.org/crac/pull/80#pullrequestreview-1474681221 From duke at openjdk.org Mon Jun 12 11:43:16 2023 From: duke at openjdk.org (Radim Vansa) Date: Mon, 12 Jun 2023 11:43:16 GMT Subject: [crac] Integrated: Do not add generic native checkpoint error in dry runs In-Reply-To: References: Message-ID: <0F-Vd3yDWOrX1IjaDncyBEMCv0G65uACtWvML-q_lYc=.157ccb99-a27e-4c24-8b7e-071820f625eb@github.com> On Mon, 12 Jun 2023 09:33:42 GMT, Radim Vansa wrote: > When a Resource throws an exception we still enter the checks in native code. In such cases the native code should return successful status if there are no further errors found. This pull request has now been integrated. Changeset: ce9c9dac Author: Radim Vansa Committer: Anton Kozlov URL: https://git.openjdk.org/crac/commit/ce9c9dace119265b34fc8cca3524f8e979964839 Stats: 83 lines in 2 files changed: 78 ins; 1 del; 4 mod Do not add generic native checkpoint error in dry runs Reviewed-by: akozlov ------------- PR: https://git.openjdk.org/crac/pull/80 From akozlov at openjdk.org Mon Jun 12 12:13:18 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Mon, 12 Jun 2023 12:13:18 GMT Subject: [crac] RFR: Move more FD tracking to java layer [v2] In-Reply-To: References: Message-ID: <4uCclkYhz6GMyOB7AzMmmOQAtr4j0Dn8b79W1Lz7s5M=.bafec13e-8f2a-4063-bb60-38934a1af4cd@github.com> On Mon, 12 Jun 2023 08:26:02 GMT, Radim Vansa wrote: > you haven't removed the socket-describing code and you won't, as there are cases where sockets are created natively without the Socket class and you still want to show info about those. The native FileDescriptor is not employed in this scenario, the native descriptor detection is done by JVM, that code is mostly unchanged. > ... and this is present again, but this time in the native code. 
I was trying to extract the bits in a form that Java understands, though getting them was not a one-liner in the native, so I understand that it might be tempting to just format that in native as well and have fewer native methods. I was hoping for eventual reuse. As I said, FileDescriptor provides a minimal descriptor for debugging the JDK itself. At some point the code had no reporting [1], the original JVM reporting [2], and the reporting aligned with the current state [3]. I'm fine with all of them, actually. [1] https://github.com/openjdk/crac/pull/79/commits/78d8e0b09a232c559ded14a4602c691b9d51aa36 [2] the change prior to [1] [3] https://github.com/openjdk/crac/pull/79/commits/8db2fe487e4a45909351f8de2c8bff881f3d7f93 > Two points: > * having a case where this is actually helpful explains the general reasoning behind the refactoring Let's agree that a Socket is what the user creates, not a FileDescriptor (see the PR description). > * the example can help verifying that the changes are complete (to an extent) There should be no major drawback. Can you see something? > The NATIVE_PRNG priority is still present, do you expect to remove those in a follow up? No, I don't think that should be done. NATIVE_PRNG is still there to handle RNG state, independent of FDs. > I am worried that while you have 'simplified' some parts in FileDescriptor #69 would reintroduce those I think rebasing #69 on top of this would benefit the implementations of the policies; it will make them cleaner and more robust. E.g. a policy implementation in File can also prevent access to a FileDescriptor with the CLOSE policy during the small time window after the FD is closed but before the checkpoint has started.
------------- PR Comment: https://git.openjdk.org/crac/pull/79#issuecomment-1587211492 From akozlov at openjdk.org Mon Jun 12 12:19:16 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Mon, 12 Jun 2023 12:19:16 GMT Subject: [crac] RFR: Move more FD tracking to java layer [v2] In-Reply-To: References: Message-ID: On Mon, 12 Jun 2023 07:04:30 GMT, Radim Vansa wrote: >> Anton Kozlov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 95 commits: >> >> - Merge remote-tracking branch 'jdk/crac/crac' into newfd >> - Fix a bug in classPathMatching >> - Remove misleading UnreachableSocket >> - Merge remote-tracking branch 'jdk/crac/crac' into newfd >> - Cleanup >> - Remove AcceptingSocketResource >> >> Does not need to exist >> - Cleanup and extend FileDescriptor nativeDesc >> - Update >> - Workaround for the blocking context >> - Claim closing socket >> - ... and 85 more: https://git.openjdk.org/crac/compare/a282698d...bcfd07a4 > > src/java.base/share/classes/jdk/crac/impl/ExceptionHolder.java line 30: > >> 28: } >> 29: >> 30: public void reSuppress(Exception e) { > > Was thinking about a better name... `adoptSuppressed`? I don't like the suggestion. What is the problem with the original name? > src/java.base/share/classes/jdk/internal/crac/ClaimedFDs.java line 62: > >> 60: } >> 61: >> 62: public void claimFd(FileDescriptor fd, Object claimer, Supplier supplier, Object... suppressedClaimers) { > > Knowing all potential claimers seems limiting; it would require breaking encapsulation in a more complex chain. > I suggest changing this to `claim(Object claimed, Object claimer, Supplier supplier)` and then assuming that `claimed == suppressedClaimer`. The implementation would handle the `claimed == claimer` case for FD a bit differently, but e.g. `ZipFile` would not claim the FD itself, only the `RandomAccessFile zfile`.
> All claims would be recorded (not overwriting anything) and the final claimant would be resolved only after all notifications complete. In general, yes, that may work. But I want to postpone this until we find a case that will require the fix and do it with example in hands. > src/java.base/unix/native/libjava/FileDescriptor_md.c line 117: > >> 115: } >> 116: >> 117: static const char* family2str(int family) { > > I don't think this belongs to FileDescriptor. You have tried to let Java part of FileDescriptor not know about its type, but only by moving socket-specific handling to native code which contradicts the title of this PR. I can just delete this. I preserved this only because #43 introduced family reporting. Do you think this is useful? ------------- PR Review Comment: https://git.openjdk.org/crac/pull/79#discussion_r1226575666 PR Review Comment: https://git.openjdk.org/crac/pull/79#discussion_r1226571724 PR Review Comment: https://git.openjdk.org/crac/pull/79#discussion_r1226574946 From duke at openjdk.org Mon Jun 12 12:36:16 2023 From: duke at openjdk.org (Radim Vansa) Date: Mon, 12 Jun 2023 12:36:16 GMT Subject: [crac] RFR: Move more FD tracking to java layer [v2] In-Reply-To: <4uCclkYhz6GMyOB7AzMmmOQAtr4j0Dn8b79W1Lz7s5M=.bafec13e-8f2a-4063-bb60-38934a1af4cd@github.com> References: <4uCclkYhz6GMyOB7AzMmmOQAtr4j0Dn8b79W1Lz7s5M=.bafec13e-8f2a-4063-bb60-38934a1af4cd@github.com> Message-ID: On Mon, 12 Jun 2023 12:09:59 GMT, Anton Kozlov wrote: > The native FileDescriptor is not employed in this scenario, the native descriptor detection is done by JVM, that code is mostly unchanged. I am not talking about the native *detection*, but rather about this 'minimal descriptor'. That's the part that has to know about FD types, and moving that to native does not make it any simpler. > There should be no major drawback. Can you see something? Let's keep the test with FileReader please. 
I hate removing tested scenarios; the outcome could change (accompanied with a comment) but it's rare where the behaviour does not make sense after the refactoring. > I think rebasing https://github.com/openjdk/crac/pull/69 on top of this would benefit the implementations of the policies, it will make them cleaner and more roboust. E.g. policy implementaiton in File can also prevent access to FileDescriptor with CLOSE policy, during small time window between the point FD is closed, but checkpoint is not started yet. The way you made `getPath` abstract, expecting the owner of FD to know it has several effects: - You'd need to update every use of FD rather than having this in one place. - Some information (flags, offsets) could not be reproduced reliably, would be complicated or would have to be based on defaults, rather than the actual value. E.g. logical offset of reading differs from physical offset, there's buffering in place (in libc) - I ran into something like this in https://github.com/openjdk/crac/pull/69/files#diff-ac18cee51b4abf767b1ad66d127afe6e609aaef2e66349032ce23635efa6457aR59 (that's why the tested content is past 1MB - I don't remember what was the actual size of the buffer). ------------- PR Comment: https://git.openjdk.org/crac/pull/79#issuecomment-1587254735 From duke at openjdk.org Mon Jun 12 12:46:21 2023 From: duke at openjdk.org (Radim Vansa) Date: Mon, 12 Jun 2023 12:46:21 GMT Subject: [crac] RFR: Move more FD tracking to java layer [v2] In-Reply-To: References: Message-ID: On Mon, 12 Jun 2023 12:16:42 GMT, Anton Kozlov wrote: >> src/java.base/share/classes/jdk/crac/impl/ExceptionHolder.java line 30: >> >>> 28: } >>> 29: >>> 30: public void reSuppress(Exception e) { >> >> Was thinking about a better name... `adoptSuppressed`? > > I don't like the suggestion. What is the problem with the original name? OK. 
It's just unclear, when reading the use `re` would be some acronym (just related to **R**estore**E**xceptions), otherwise it doesn't need the camelcasing (https://en.wiktionary.org/wiki/resuppress). Forget about it if you're satisfied with the current one. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/79#discussion_r1226610847 From akozlov at openjdk.org Mon Jun 12 13:07:46 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Mon, 12 Jun 2023 13:07:46 GMT Subject: [crac] RFR: Move more FD tracking to java layer [v3] In-Reply-To: References: Message-ID: > The PR develops the idea of file descriptors tracking in Java started in #43. In general, that PR had two purposes. First, it provides CheckpointExceptions in terms that are clear for Java developers, improving the experience of developing for CRac. So if a FileDescriptor causes an exception, it's possible to look at the heap dump and find references to the offending FD, or to look at the stack trace when FD was created. And second, Java FD tracking is independent of the platform, so that was the first step to bring CRaC to non-Linux platforms, but that is a bit longer road. > > We can eliminate manual heap inspection, and this is proposed in this PR. A FileDescriptor does not exist on its own but it is owned by some higher-level Java object implementation. So an object can "claim" a FileDescriptor and define how and if to report the FD to the user. E.g. Socket can describe the its port and address without deep inspection of the process internals. Turns out, Socket.toString() provides enough information (but the reporting can be extended later if required). > > > Suppressed: jdk.crac.impl.CheckpointOpenSocketException: Socket[addr=localhost/127.0.0.1,port=39957,localport=41464] > at java.base/java.net.SocketImpl$SocketResource.lambda$beforeCheckpoint$0(SocketImpl.java:123) > at java.base/jdk.crac.Core.lambda$checkpointRestore1$0(Core.java:128) > ... 
7 more > > > A FileDescriptor is claiming itself in case there is a bug in JDK that no higher-level object is claiming the FD. FD provides just a very short description just for debugging. With stack trace to FD (which is a very nice debugging aid!), that should be enough to find the containing object and implement claiming. > > I believe this overlaps with #69, which at first glance would benefit a lot from being able to define policies in the domain objects. I'll comment on this after a closer look at the other PR. Anton Kozlov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 98 commits: - Merge remote-tracking branch 'jdk/crac/crac' into newfd - Improve Core <-> ClaimedFDs interface - ClaimedFDs is not ctx - Merge remote-tracking branch 'jdk/crac/crac' into newfd - Fix a bug in classPathMatching - Remove misleading UnreachableSocket - Merge remote-tracking branch 'jdk/crac/crac' into newfd - Cleanup - Remove AcceptingSocketResource Does not need to exist - Cleanup and extend FileDescriptor nativeDesc - ... and 88 more: https://git.openjdk.org/crac/compare/ce9c9dac...5e417905 ------------- Changes: https://git.openjdk.org/crac/pull/79/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=79&range=02 Stats: 752 lines in 21 files changed: 397 ins; 277 del; 78 mod Patch: https://git.openjdk.org/crac/pull/79.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/79/head:pull/79 PR: https://git.openjdk.org/crac/pull/79 From akozlov at openjdk.org Mon Jun 12 13:15:05 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Mon, 12 Jun 2023 13:15:05 GMT Subject: [crac] RFR: Move more FD tracking to java layer [v4] In-Reply-To: References: Message-ID: > The PR develops the idea of file descriptors tracking in Java started in #43. In general, that PR had two purposes. First, it provides CheckpointExceptions in terms that are clear for Java developers, improving the experience of developing for CRac. 
So if a FileDescriptor causes an exception, it's possible to look at the heap dump and find references to the offending FD, or to look at the stack trace when FD was created. And second, Java FD tracking is independent of the platform, so that was the first step to bring CRaC to non-Linux platforms, but that is a bit longer road. > > We can eliminate manual heap inspection, and this is proposed in this PR. A FileDescriptor does not exist on its own but it is owned by some higher-level Java object implementation. So an object can "claim" a FileDescriptor and define how and if to report the FD to the user. E.g. Socket can describe the its port and address without deep inspection of the process internals. Turns out, Socket.toString() provides enough information (but the reporting can be extended later if required). > > > Suppressed: jdk.crac.impl.CheckpointOpenSocketException: Socket[addr=localhost/127.0.0.1,port=39957,localport=41464] > at java.base/java.net.SocketImpl$SocketResource.lambda$beforeCheckpoint$0(SocketImpl.java:123) > at java.base/jdk.crac.Core.lambda$checkpointRestore1$0(Core.java:128) > ... 7 more > > > A FileDescriptor is claiming itself in case there is a bug in JDK that no higher-level object is claiming the FD. FD provides just a very short description just for debugging. With stack trace to FD (which is a very nice debugging aid!), that should be enough to find the containing object and implement claiming. > > I believe this overlaps with #69, which at first glance would benefit a lot from being able to define policies in the domain objects. I'll comment on this after a closer look at the other PR. 
Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: reSuppress -> resuppress ------------- Changes: - all: https://git.openjdk.org/crac/pull/79/files - new: https://git.openjdk.org/crac/pull/79/files/5e417905..cd87d5c8 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=79&range=03 - incr: https://webrevs.openjdk.org/?repo=crac&pr=79&range=02-03 Stats: 4 lines in 2 files changed: 0 ins; 1 del; 3 mod Patch: https://git.openjdk.org/crac/pull/79.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/79/head:pull/79 PR: https://git.openjdk.org/crac/pull/79 From akozlov at openjdk.org Mon Jun 12 13:48:15 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Mon, 12 Jun 2023 13:48:15 GMT Subject: [crac] RFR: Move more FD tracking to java layer [v4] In-Reply-To: References: Message-ID: On Mon, 12 Jun 2023 13:15:05 GMT, Anton Kozlov wrote: >> The PR develops the idea of file descriptors tracking in Java started in #43. In general, that PR had two purposes. First, it provides CheckpointExceptions in terms that are clear for Java developers, improving the experience of developing for CRac. So if a FileDescriptor causes an exception, it's possible to look at the heap dump and find references to the offending FD, or to look at the stack trace when FD was created. And second, Java FD tracking is independent of the platform, so that was the first step to bring CRaC to non-Linux platforms, but that is a bit longer road. >> >> We can eliminate manual heap inspection, and this is proposed in this PR. A FileDescriptor does not exist on its own but it is owned by some higher-level Java object implementation. So an object can "claim" a FileDescriptor and define how and if to report the FD to the user. E.g. Socket can describe the its port and address without deep inspection of the process internals. Turns out, Socket.toString() provides enough information (but the reporting can be extended later if required). 
>> >> >> Suppressed: jdk.crac.impl.CheckpointOpenSocketException: Socket[addr=localhost/127.0.0.1,port=39957,localport=41464] >> at java.base/java.net.SocketImpl$SocketResource.lambda$beforeCheckpoint$0(SocketImpl.java:123) >> at java.base/jdk.crac.Core.lambda$checkpointRestore1$0(Core.java:128) >> ... 7 more >> >> >> A FileDescriptor is claiming itself in case there is a bug in JDK that no higher-level object is claiming the FD. FD provides just a very short description just for debugging. With stack trace to FD (which is a very nice debugging aid!), that should be enough to find the containing object and implement claiming. >> >> I believe this overlaps with #69, which at first glance would benefit a lot from being able to define policies in the domain objects. I'll comment on this after a closer look at the other PR. > > Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > reSuppress -> resuppress > I am not talking about the native _detection_, but rather about this 'minimal descriptor'. That's the part that has to know about FD types, and moving that to native does not make it any simpler. Minimal FD reporting can be removed. I don't want to do this right now, as it's very likely that Java reporting does not cover many cases. But once we get more Java reporting, we'll be able to remove the file descriptor reporting altogether. Although, I can remove it right now https://github.com/openjdk/crac/pull/79#discussion_r1226574946 > > There should be no major drawback. Can you see something? > > Let's keep the test with FileReader please. I hate removing tested scenarios; the outcome could change (accompanied with a comment) but it's rare where the behaviour does not make sense after the refactoring.
Exception in thread "main" jdk.crac.CheckpointException
    Suppressed: jdk.crac.impl.CheckpointOpenResourceException: FileDescriptor 4: regular: /etc/passwd
        at java.base/java.io.FileDescriptor$Resource.lambda$beforeCheckpoint$0(FileDescriptor.java:76)
        at java.base/jdk.crac.Core.checkpointRestore1(Core.java:141)
        at java.base/jdk.crac.Core.checkpointRestore(Core.java:266)
        at java.base/jdk.crac.Core.checkpointRestore(Core.java:245)
        at OpenFileDetectionTest.exec(OpenFileDetectionTest.java:53)
        at jdk.test.lib.crac.CracTest.run(CracTest.java:157)
        at jdk.test.lib.crac.CracTest.main(CracTest.java:89)
    Caused by: java.lang.Exception: This file descriptor was created by main at epoch:1686575586305 here
        at java.base/jdk.internal.crac.JDKFdResource.<init>(JDKFdResource.java:25)
        at java.base/java.io.FileDescriptor$Resource.<init>(FileDescriptor.java:61)
        at java.base/java.io.FileDescriptor.<init>(FileDescriptor.java:87)
        at java.base/java.io.FileInputStream.<init>(FileInputStream.java:154)
        at java.base/java.io.FileInputStream.<init>(FileInputStream.java:111)
        at java.base/java.io.FileReader.<init>(FileReader.java:60)
        at OpenFileDetectionTest.exec(OpenFileDetectionTest.java:52)
        ... 2 more

The existing claiming code in RandomAccessFile does nothing in the current code state, as FileInputStream manages its own FileDescriptor. Reverting FileReader in the test would require a new implementation in FileInputStream and extending the test, as that does not cover a FileInputStream created with an fd. So the test was not adequate to the implementation, but now it is. FileReader/FileInputStream are better done separately, without bloating this PR.

> > I think rebasing #69 on top of this would benefit the implementations of the policies; it will make them cleaner and more robust. E.g. the policy implementation in File can also prevent access to FileDescriptor with the CLOSE policy during the small time window between the point the FD is closed and the point the checkpoint starts.
> > The way you made `getPath` abstract, expecting the owner of FD to know it has several effects: > > * You'd need to update every use of FD rather than having this in one place.

Yes, this is a design. Repeating code is abstracted into utility classes, e.g. JDKFileResource, but a more specific File class can implement something different compared to the default.

> * Some information (flags, offsets) could not be reproduced reliably, would be complicated or would have to be based on defaults, rather than the actual value. E.g. logical offset of reading differs from physical offset, there's buffering in place (in libc) - I ran into something like this in https://github.com/openjdk/crac/pull/69/files#diff-ac18cee51b4abf767b1ad66d127afe6e609aaef2e66349032ce23635efa6457aR59 (that's why the tested content is past 1MB - I don't remember what was the actual size of the buffer).

That's interesting, but anyway should not be handled by FileDescriptor.

------------- PR Comment: https://git.openjdk.org/crac/pull/79#issuecomment-1587375100

From duke at openjdk.org Mon Jun 12 14:54:19 2023
From: duke at openjdk.org (Radim Vansa)
Date: Mon, 12 Jun 2023 14:54:19 GMT
Subject: [crac] RFR: Move more FD tracking to java layer [v4]
In-Reply-To: References: Message-ID:

On Mon, 12 Jun 2023 13:45:26 GMT, Anton Kozlov wrote:

> The existing claiming code in RandomAccessFile does nothing in the current code state, as FileInputStream manages its own FileDescriptor. Reverting FileReader in the test would require a new implementation in FileInputStream and extending the test, as that does not cover a FileInputStream created with an fd. So the test was not adequate to the implementation, but now it is. FileReader/FileInputStream are better done separately, without bloating this PR.

I can't parse what you're trying to say. The test was never about RandomAccessFile.
The purpose was to validate that if an FD is left open:
- some exception is thrown from checkpoint
- this exception has sufficient information to help the user identify the cause (file path)

Both the past and the new implementation are sufficient to accomplish this; the reduction of specificity from CheckpointOpenFileException to CheckpointOpenResourceException is acceptable.

> Yes, this is a design. Repeating code is abstracted into utility classes, e.g. JDKFileResource, but a more specific File class can implement something different compared to the default.

This is highly impractical, and introduces more space for errors and for hunting down unhandled cases. I don't oppose having extra information through claiming ownership of resources, but for the policies it means that we'll need to invent new ways to obtain the necessary info for every higher-level resource, as opposed to just 2 platforms (POSIX and Windows).

> but anyway should not be handled by FileDescriptor

It should not be handled at all in the ideal case, and it does not need to be handled by FileDescriptor at all when the information is read straight from the OS. This is introducing significant complexity for no clear gains.

------------- PR Comment: https://git.openjdk.org/crac/pull/79#issuecomment-1587505879

From duke at openjdk.org Mon Jun 12 15:04:47 2023
From: duke at openjdk.org (Radim Vansa)
Date: Mon, 12 Jun 2023 15:04:47 GMT
Subject: [crac] RFR: Handle open file descriptors with configurable policies [v4]
In-Reply-To: References: Message-ID:

> When the application does not close some file descriptors through Resources we can use `jdk.crac.fd-policy.checkpoint` and `jdk.crac.fd-policy.restore` to configure the behaviour.
> > These properties can specify a list of File.pathSeparator-separated key=value pairs, where the key can be one of: > > * numeric file descriptor > * path using 'glob' pattern matching (see FileSystem.getPathMatcher() for details) > * keywords FIFO and SOCKET that match pipes and sockets > > The value should match one of possible values from OpenFDPolicies.BeforeCheckpoint and OpenFDPolicies.AfterRestore Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits: - Refactor FileDescriptor resource to separate class - Add REOPEN_AT_END and OPEN_OTHER_AT_END policies - Merge branch 'crac' into newfd-policies - Effectively revert previous commit: Initialize logger in - Simplify workarounds in SimpleConsoleLogger. - Merge branch 'crac' into newfd-policies - Handle open file descriptors with configurable policies When the application does not close some file descriptors through Resources we can use `jdk.crac.fd-policy.checkpoint` and `jdk.crac.fd-policy.restore` to configure the behaviour. 
These properties can specify a list of File.pathSeparator-separated key=value pairs, where the key can be one of: * numeric file descriptor * path using 'glob' pattern matching (see FileSystem.getPathMatcher() for details) * keywords FIFO and SOCKET that match pipes and sockets The value should match one of possible values from OpenFDPolicies.BeforeCheckpoint and OpenFDPolicies.AfterRestore ------------- Changes: https://git.openjdk.org/crac/pull/69/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=69&range=03 Stats: 1296 lines in 16 files changed: 1221 ins; 69 del; 6 mod Patch: https://git.openjdk.org/crac/pull/69.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/69/head:pull/69 PR: https://git.openjdk.org/crac/pull/69 From duke at openjdk.org Mon Jun 12 15:12:14 2023 From: duke at openjdk.org (Radim Vansa) Date: Mon, 12 Jun 2023 15:12:14 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v4] In-Reply-To: References: Message-ID: On Mon, 12 Jun 2023 15:04:47 GMT, Radim Vansa wrote: >> When the application does not close some file descriptors through Resources we can use `jdk.crac.fd-policy.checkpoint` and `jdk.crac.fd-policy.restore` to configure the behaviour. >> >> These properties can specify a list of File.pathSeparator-separated key=value pairs, where the key can be one of: >> >> * numeric file descriptor >> * path using 'glob' pattern matching (see FileSystem.getPathMatcher() for details) >> * keywords FIFO and SOCKET that match pipes and sockets >> >> The value should match one of possible values from OpenFDPolicies.BeforeCheckpoint and OpenFDPolicies.AfterRestore > > Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains seven commits: > > - Refactor FileDescriptor resource to separate class > - Add REOPEN_AT_END and OPEN_OTHER_AT_END policies > - Merge branch 'crac' into newfd-policies > - Effectively revert previous commit: Initialize logger in > - Simplify workarounds in SimpleConsoleLogger. > - Merge branch 'crac' into newfd-policies > - Handle open file descriptors with configurable policies > > When the application does not close some file descriptors through > Resources we can use `jdk.crac.fd-policy.checkpoint` and > `jdk.crac.fd-policy.restore` to configure the behaviour. > > These properties can specify a list of File.pathSeparator-separated > key=value pairs, where the key can be one of: > > * numeric file descriptor > * path using 'glob' pattern matching (see FileSystem.getPathMatcher() > for details) > * keywords FIFO and SOCKET that match pipes and sockets > > The value should match one of possible values from > OpenFDPolicies.BeforeCheckpoint and OpenFDPolicies.AfterRestore Updated: * merged-in last changes from `crac` branch * added `REOPEN_AT_END` and `OPEN_OTHER_AT_END` policies * refactored all the policy handling code into FileDescriptorResource (akin to JDKFileResource, name is not too important) As for the checkpoint policy, the ERROR one probably won't be used (it's the default), so until we support some form of not closing the FD at all (replication use case) anything the user would set would boil down to CLOSE/WARN_CLOSE. Therefore, if you want to set it in a one big policy this would probably mean brain-dead duplication of the options. If you want to present a few examples of how you'd like the user-facing options to look like, I am listening, but I don't see a better way myself. 
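
For readers who want to picture the property format discussed in this message, here is a minimal sketch of splitting such a value into rules. It is not the code from the PR: `FdPolicySketch`, `Rule` and `KeyKind` are hypothetical names, policy values are kept as opaque strings, and splitting each pair at the last `=` is an assumption made for this sketch.

```java
import java.io.File;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not the parser implemented in the PR.
final class FdPolicySketch {
    enum KeyKind { FD_NUMBER, FIFO, SOCKET, GLOB }

    static final class Rule {
        final KeyKind kind;
        final String key;
        final String policy; // kept as an opaque string, e.g. CLOSE or WARN_CLOSE

        Rule(KeyKind kind, String key, String policy) {
            this.kind = kind;
            this.key = key;
            this.policy = policy;
        }

        // Glob keys use the same syntax as FileSystem.getPathMatcher("glob:...").
        boolean matchesPath(String path) {
            return kind == KeyKind.GLOB
                    && FileSystems.getDefault().getPathMatcher("glob:" + key).matches(Path.of(path));
        }
    }

    // Splits a File.pathSeparator-separated list of key=value pairs into rules.
    static List<Rule> parse(String property) {
        List<Rule> rules = new ArrayList<>();
        if (property == null || property.isEmpty()) {
            return rules;
        }
        for (String pair : property.split(Pattern.quote(File.pathSeparator))) {
            int eq = pair.lastIndexOf('='); // assumption: the policy name never contains '='
            if (eq <= 0 || eq == pair.length() - 1) {
                throw new IllegalArgumentException("Expected key=value, got: " + pair);
            }
            String key = pair.substring(0, eq);
            rules.add(new Rule(classify(key), key, pair.substring(eq + 1)));
        }
        return rules;
    }

    static KeyKind classify(String key) {
        if (key.matches("\\d+")) return KeyKind.FD_NUMBER; // numeric file descriptor
        if (key.equals("FIFO")) return KeyKind.FIFO;        // keyword matching pipes
        if (key.equals("SOCKET")) return KeyKind.SOCKET;    // keyword matching sockets
        return KeyKind.GLOB;                                // anything else is a path glob
    }
}
```

With this sketch, a value such as `3=CLOSE` followed by the platform path separator and `*.log=WARN_CLOSE` would yield one FD_NUMBER rule and one GLOB rule.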
------------- PR Comment: https://git.openjdk.org/crac/pull/69#issuecomment-1587538322 From akozlov at openjdk.org Tue Jun 13 11:53:35 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 13 Jun 2023 11:53:35 GMT Subject: [crac] RFR: Move more FD tracking to java layer [v5] In-Reply-To: References: Message-ID: > The PR develops the idea of file descriptors tracking in Java started in #43. In general, that PR had two purposes. First, it provides CheckpointExceptions in terms that are clear for Java developers, improving the experience of developing for CRac. So if a FileDescriptor causes an exception, it's possible to look at the heap dump and find references to the offending FD, or to look at the stack trace when FD was created. And second, Java FD tracking is independent of the platform, so that was the first step to bring CRaC to non-Linux platforms, but that is a bit longer road. > > We can eliminate manual heap inspection, and this is proposed in this PR. A FileDescriptor does not exist on its own but it is owned by some higher-level Java object implementation. So an object can "claim" a FileDescriptor and define how and if to report the FD to the user. E.g. Socket can describe the its port and address without deep inspection of the process internals. Turns out, Socket.toString() provides enough information (but the reporting can be extended later if required). > > > Suppressed: jdk.crac.impl.CheckpointOpenSocketException: Socket[addr=localhost/127.0.0.1,port=39957,localport=41464] > at java.base/java.net.SocketImpl$SocketResource.lambda$beforeCheckpoint$0(SocketImpl.java:123) > at java.base/jdk.crac.Core.lambda$checkpointRestore1$0(Core.java:128) > ... 7 more > > > A FileDescriptor is claiming itself in case there is a bug in JDK that no higher-level object is claiming the FD. FD provides just a very short description just for debugging. 
With stack trace to FD (which is a very nice debugging aid!), that should be enough to find the containing object and implement claiming. > > I believe this overlaps with #69, which at first glance would benefit a lot from being able to define policies in the domain objects. I'll comment on this after a closer look at the other PR. Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: Revert FileInputStream test back ------------- Changes: - all: https://git.openjdk.org/crac/pull/79/files - new: https://git.openjdk.org/crac/pull/79/files/cd87d5c8..e416cc0e Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=79&range=04 - incr: https://webrevs.openjdk.org/?repo=crac&pr=79&range=03-04 Stats: 5 lines in 1 file changed: 3 ins; 0 del; 2 mod Patch: https://git.openjdk.org/crac/pull/79.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/79/head:pull/79 PR: https://git.openjdk.org/crac/pull/79 From duke at openjdk.org Tue Jun 13 12:13:18 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 13 Jun 2023 12:13:18 GMT Subject: [crac] RFR: Move more FD tracking to java layer [v5] In-Reply-To: References: Message-ID: On Tue, 13 Jun 2023 11:53:35 GMT, Anton Kozlov wrote: >> The PR develops the idea of file descriptors tracking in Java started in #43. In general, that PR had two purposes. First, it provides CheckpointExceptions in terms that are clear for Java developers, improving the experience of developing for CRac. So if a FileDescriptor causes an exception, it's possible to look at the heap dump and find references to the offending FD, or to look at the stack trace when FD was created. And second, Java FD tracking is independent of the platform, so that was the first step to bring CRaC to non-Linux platforms, but that is a bit longer road. >> >> We can eliminate manual heap inspection, and this is proposed in this PR. 
A FileDescriptor does not exist on its own but it is owned by some higher-level Java object implementation. So an object can "claim" a FileDescriptor and define how and if to report the FD to the user. E.g. Socket can describe the its port and address without deep inspection of the process internals. Turns out, Socket.toString() provides enough information (but the reporting can be extended later if required). >> >> >> Suppressed: jdk.crac.impl.CheckpointOpenSocketException: Socket[addr=localhost/127.0.0.1,port=39957,localport=41464] >> at java.base/java.net.SocketImpl$SocketResource.lambda$beforeCheckpoint$0(SocketImpl.java:123) >> at java.base/jdk.crac.Core.lambda$checkpointRestore1$0(Core.java:128) >> ... 7 more >> >> >> A FileDescriptor is claiming itself in case there is a bug in JDK that no higher-level object is claiming the FD. FD provides just a very short description just for debugging. With stack trace to FD (which is a very nice debugging aid!), that should be enough to find the containing object and implement claiming. >> >> I believe this overlaps with #69, which at first glance would benefit a lot from being able to define policies in the domain objects. I'll comment on this after a closer look at the other PR. > > Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Revert FileInputStream test back After discussing some parts of this in a call I approve this PR. ------------- Marked as reviewed by rvansa at github.com (no known OpenJDK username). 
PR Review: https://git.openjdk.org/crac/pull/79#pullrequestreview-1477005876

From duke at openjdk.org Tue Jun 13 13:03:20 2023
From: duke at openjdk.org (Radim Vansa)
Date: Tue, 13 Jun 2023 13:03:20 GMT
Subject: [crac] RFR: Support passing extra options in CREngine [v2]
In-Reply-To: References: Message-ID:

> In addition to `-XX:CREngine=program` this adds support to `-XX:CREngine=program,--key,value,--anotherkey` that translates into invoking `program --key value --anotherkey`. Commas support escaping with a backslash.
>
> This generic parameters support is utilized in `criuengine` that accepts `--verbosity` and `--log-file` options and relays them to `criu`.

Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits:

- Do not use dashes
- Merge branch 'crac' into engine_params
- Support passing extra options in CREngine

-------------

Changes: https://git.openjdk.org/crac/pull/63/files
Webrev: https://webrevs.openjdk.org/?repo=crac&pr=63&range=01
Stats: 157 lines in 5 files changed: 126 ins; 5 del; 26 mod
Patch: https://git.openjdk.org/crac/pull/63.diff
Fetch: git fetch https://git.openjdk.org/crac.git pull/63/head:pull/63
PR: https://git.openjdk.org/crac/pull/63

From rmarchenko at openjdk.org Tue Jun 13 13:22:11 2023
From: rmarchenko at openjdk.org (Roman Marchenko)
Date: Tue, 13 Jun 2023 13:22:11 GMT
Subject: [crac] RFR: Fixing VMOptionsTest.java
Message-ID:

test/jdk/jdk/crac/VMOptionsTest.java fails in a debug build because it expects NativeMemoryTracking to equal "off", which is not true in a debug build.
Exception in thread "main" java.lang.RuntimeException: assertEquals: expected off to equal summary at jdk.test.lib.Asserts.fail(Asserts.java:594) at jdk.test.lib.Asserts.assertEquals(Asserts.java:205) at jdk.test.lib.Asserts.assertEquals(Asserts.java:189) at VMOptionsTest.exec(VMOptionsTest.java:60) at jdk.test.lib.crac.CracTest.run(CracTest.java:155) at jdk.test.lib.crac.CracTest.main(CracTest.java:87) This change fixes the issue by retrieving NMT value before checkpoint, and comparing later. ------------- Commit messages: - Fixing test/jdk/jdk/crac/VMOptionsTest.java Changes: https://git.openjdk.org/crac/pull/81/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=81&range=00 Stats: 5 lines in 1 file changed: 2 ins; 1 del; 2 mod Patch: https://git.openjdk.org/crac/pull/81.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/81/head:pull/81 PR: https://git.openjdk.org/crac/pull/81 From akozlov at openjdk.org Tue Jun 13 13:40:23 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 13 Jun 2023 13:40:23 GMT Subject: [crac] RFR: Move more FD tracking to java layer [v5] In-Reply-To: References: Message-ID: On Tue, 13 Jun 2023 11:53:35 GMT, Anton Kozlov wrote: >> The PR develops the idea of file descriptors tracking in Java started in #43. In general, that PR had two purposes. First, it provides CheckpointExceptions in terms that are clear for Java developers, improving the experience of developing for CRac. So if a FileDescriptor causes an exception, it's possible to look at the heap dump and find references to the offending FD, or to look at the stack trace when FD was created. And second, Java FD tracking is independent of the platform, so that was the first step to bring CRaC to non-Linux platforms, but that is a bit longer road. >> >> We can eliminate manual heap inspection, and this is proposed in this PR. A FileDescriptor does not exist on its own but it is owned by some higher-level Java object implementation. 
So an object can "claim" a FileDescriptor and define how and if to report the FD to the user. E.g. Socket can describe the its port and address without deep inspection of the process internals. Turns out, Socket.toString() provides enough information (but the reporting can be extended later if required). >> >> >> Suppressed: jdk.crac.impl.CheckpointOpenSocketException: Socket[addr=localhost/127.0.0.1,port=39957,localport=41464] >> at java.base/java.net.SocketImpl$SocketResource.lambda$beforeCheckpoint$0(SocketImpl.java:123) >> at java.base/jdk.crac.Core.lambda$checkpointRestore1$0(Core.java:128) >> ... 7 more >> >> >> A FileDescriptor is claiming itself in case there is a bug in JDK that no higher-level object is claiming the FD. FD provides just a very short description just for debugging. With stack trace to FD (which is a very nice debugging aid!), that should be enough to find the containing object and implement claiming. >> >> I believe this overlaps with #69, which at first glance would benefit a lot from being able to define policies in the domain objects. I'll comment on this after a closer look at the other PR. > > Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Revert FileInputStream test back The test now checks both FileInputStream and RandomAccessFile. The resource handling in higher-level objects is indeed associated with some additional efforts with identifying them, but hopefully, eventually, it will make the implementations cleaner. We'll aim to reduce code duplication with higher-level objects using common utility methods and classes. Thank you! 
------------- PR Comment: https://git.openjdk.org/crac/pull/79#issuecomment-1589330737 From akozlov at openjdk.org Tue Jun 13 13:40:26 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 13 Jun 2023 13:40:26 GMT Subject: [crac] Integrated: Move more FD tracking to java layer In-Reply-To: References: Message-ID: On Wed, 7 Jun 2023 10:51:41 GMT, Anton Kozlov wrote: > The PR develops the idea of file descriptors tracking in Java started in #43. In general, that PR had two purposes. First, it provides CheckpointExceptions in terms that are clear for Java developers, improving the experience of developing for CRac. So if a FileDescriptor causes an exception, it's possible to look at the heap dump and find references to the offending FD, or to look at the stack trace when FD was created. And second, Java FD tracking is independent of the platform, so that was the first step to bring CRaC to non-Linux platforms, but that is a bit longer road. > > We can eliminate manual heap inspection, and this is proposed in this PR. A FileDescriptor does not exist on its own but it is owned by some higher-level Java object implementation. So an object can "claim" a FileDescriptor and define how and if to report the FD to the user. E.g. Socket can describe the its port and address without deep inspection of the process internals. Turns out, Socket.toString() provides enough information (but the reporting can be extended later if required). > > > Suppressed: jdk.crac.impl.CheckpointOpenSocketException: Socket[addr=localhost/127.0.0.1,port=39957,localport=41464] > at java.base/java.net.SocketImpl$SocketResource.lambda$beforeCheckpoint$0(SocketImpl.java:123) > at java.base/jdk.crac.Core.lambda$checkpointRestore1$0(Core.java:128) > ... 7 more > > > A FileDescriptor is claiming itself in case there is a bug in JDK that no higher-level object is claiming the FD. FD provides just a very short description just for debugging. 
With stack trace to FD (which is a very nice debugging aid!), that should be enough to find the containing object and implement claiming. > > I believe this overlaps with #69, which at first glance would benefit a lot from being able to define policies in the domain objects. I'll comment on this after a closer look at the other PR. This pull request has now been integrated. Changeset: d460f8c1 Author: Anton Kozlov URL: https://git.openjdk.org/crac/commit/d460f8c110376bd00dc7368df53da9cfddcb0c2b Stats: 756 lines in 21 files changed: 400 ins; 278 del; 78 mod Move more FD tracking to java layer ------------- PR: https://git.openjdk.org/crac/pull/79 From duke at openjdk.org Tue Jun 13 13:46:20 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 13 Jun 2023 13:46:20 GMT Subject: [crac] RFR: Support passing extra options in CREngine [v3] In-Reply-To: References: Message-ID: > In addition to `-XX:CREngine=program` this adds support to `-XX:CREngine=program,--key,value,--anotherkey` that translates into invoking `program --key value --anotherkey`. Commas support escaping with a backslash. > > This generic parameters support is utilized in `criuengine` that accepts `--verbosity` and `--log-file` options and relays them to `criu`. 
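
The splitting rule in the description above (comma-separated parts, with a backslash escaping a comma) can be illustrated with a short sketch. The real parsing lives in HotSpot's C++ code and is not reproduced here; `CREngineArgsSketch` is a hypothetical name, and keeping a backslash that is not followed by a comma as a literal character is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the described rule; not the HotSpot implementation.
final class CREngineArgsSketch {
    // Splits "program,--key,value" into [program, --key, value]; "\," yields a literal comma.
    static List<String> split(String spec) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < spec.length(); i++) {
            char c = spec.charAt(i);
            if (c == '\\' && i + 1 < spec.length() && spec.charAt(i + 1) == ',') {
                current.append(','); // escaped comma stays inside the current argument
                i++;                 // skip the comma that was just consumed
            } else if (c == ',') {
                parts.add(current.toString()); // unescaped comma ends the argument
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString());
        return parts;
    }
}
```

The first element of the result plays the role of the engine program and the remaining elements are the extra arguments passed to it.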
Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: fixup ------------- Changes: - all: https://git.openjdk.org/crac/pull/63/files - new: https://git.openjdk.org/crac/pull/63/files/7914b70c..945e95ed Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=63&range=02 - incr: https://webrevs.openjdk.org/?repo=crac&pr=63&range=01-02 Stats: 2 lines in 2 files changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/crac/pull/63.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/63/head:pull/63 PR: https://git.openjdk.org/crac/pull/63 From akozlov at openjdk.org Tue Jun 13 14:14:14 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 13 Jun 2023 14:14:14 GMT Subject: [crac] RFR: Support passing extra options in CREngine [v3] In-Reply-To: References: Message-ID: On Tue, 13 Jun 2023 13:46:20 GMT, Radim Vansa wrote: >> In addition to `-XX:CREngine=program` this adds support to `-XX:CREngine=program,--key,value,--anotherkey` that translates into invoking `program --key value --anotherkey`. Commas support escaping with a backslash. >> >> This generic parameters support is utilized in `criuengine` that accepts `--verbosity` and `--log-file` options and relays them to `criu`. > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > fixup The update made the code simpler, thank you! LGTM with small nits. src/hotspot/os/linux/os_linux.cpp line 440: > 438: static const char* _crengine = NULL; > 439: static char* _crengine_arg_str = NULL; > 440: static const char* _crengine_args[32] = { NULL, NULL, NULL }; Should it be really 3 NULLs? 
src/hotspot/os/linux/os_linux.cpp line 5945: > 5943: static void add_crengine_arg(const char *arg) { > 5944: for (size_t i = 2; i < ARRAY_SIZE(_crengine_args) - 1; ++i) { > 5945: if (_crengine_args[i] == NULL) { The last empty position can be remembered along _crengine_args ------------- PR Review: https://git.openjdk.org/crac/pull/63#pullrequestreview-1471893822 PR Review Comment: https://git.openjdk.org/crac/pull/63#discussion_r1228177212 PR Review Comment: https://git.openjdk.org/crac/pull/63#discussion_r1228185011 From rmarchenko at openjdk.org Tue Jun 13 14:15:20 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Tue, 13 Jun 2023 14:15:20 GMT Subject: [crac] RFR: Fixing VMOptionsTest.java In-Reply-To: References: Message-ID: On Tue, 13 Jun 2023 13:11:32 GMT, Roman Marchenko wrote: > test/jdk/jdk/crac/VMOptionsTest.java fails in debug build because it expects that NativeMemoryTracking equals to"off", that is not true in debug build. > > > Exception in thread "main" java.lang.RuntimeException: assertEquals: expected off to equal summary > at jdk.test.lib.Asserts.fail(Asserts.java:594) > at jdk.test.lib.Asserts.assertEquals(Asserts.java:205) > at jdk.test.lib.Asserts.assertEquals(Asserts.java:189) > at VMOptionsTest.exec(VMOptionsTest.java:60) > at jdk.test.lib.crac.CracTest.run(CracTest.java:155) > at jdk.test.lib.crac.CracTest.main(CracTest.java:87) > > > This change fixes the issue by retrieving NMT value before checkpoint, and comparing later. @rvansa Could you review please? 
------------- PR Comment: https://git.openjdk.org/crac/pull/81#issuecomment-1589400641 From duke at openjdk.org Tue Jun 13 15:02:14 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 13 Jun 2023 15:02:14 GMT Subject: [crac] RFR: Fixing VMOptionsTest.java In-Reply-To: References: Message-ID: On Tue, 13 Jun 2023 14:12:28 GMT, Roman Marchenko wrote: >> test/jdk/jdk/crac/VMOptionsTest.java fails in debug build because it expects that NativeMemoryTracking equals to"off", that is not true in debug build. >> >> >> Exception in thread "main" java.lang.RuntimeException: assertEquals: expected off to equal summary >> at jdk.test.lib.Asserts.fail(Asserts.java:594) >> at jdk.test.lib.Asserts.assertEquals(Asserts.java:205) >> at jdk.test.lib.Asserts.assertEquals(Asserts.java:189) >> at VMOptionsTest.exec(VMOptionsTest.java:60) >> at jdk.test.lib.crac.CracTest.run(CracTest.java:155) >> at jdk.test.lib.crac.CracTest.main(CracTest.java:87) >> >> >> This change fixes the issue by retrieving NMT value before checkpoint, and comparing later. > > @rvansa Could you review please? Hi @wkia , the PR looks correct, but I wonder if we need to complicate things if the NMT default is not the same in different builds. I've picked this simply as a random specimen of the non-manageable opts (with string args); could you just use something else, e.g. `OnOutOfMemoryError`, or if you want to go with non-empty default there's `MetaspaceReclaimPolicy`? 
------------- PR Comment: https://git.openjdk.org/crac/pull/81#issuecomment-1589485965 From duke at openjdk.org Tue Jun 13 15:10:14 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 13 Jun 2023 15:10:14 GMT Subject: [crac] RFR: Support passing extra options in CREngine [v3] In-Reply-To: References: Message-ID: <0W6YuCa2Zz6QNHvNnWDev8CPbkaB9kuIaTwyrKzV240=.0a18f0ec-4fa2-440b-93ee-c6614eb9a20a@github.com> On Tue, 13 Jun 2023 13:57:03 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> fixup > > src/hotspot/os/linux/os_linux.cpp line 440: > >> 438: static const char* _crengine = NULL; >> 439: static char* _crengine_arg_str = NULL; >> 440: static const char* _crengine_args[32] = { NULL, NULL, NULL }; > > Should it be really 3 NULLs? Yes. First is in place of the binary; second is the operation (`checkpoint`/`restore`) and we want the third position initialized to NULL - when add_crengine_arg is called it should find a NULL and paste the value in there, initializing the 'tail' to NULL. > src/hotspot/os/linux/os_linux.cpp line 5945: > >> 5943: static void add_crengine_arg(const char *arg) { >> 5944: for (size_t i = 2; i < ARRAY_SIZE(_crengine_args) - 1; ++i) { >> 5945: if (_crengine_args[i] == NULL) { > > The last empty position can be remembered along _crengine_args The fewer static vars the better IMO, but I can certainly use an explicit index var and drop the initialization to three NULLs if you think it's better. 
------------- PR Review Comment: https://git.openjdk.org/crac/pull/63#discussion_r1228293023 PR Review Comment: https://git.openjdk.org/crac/pull/63#discussion_r1228294803 From akozlov at openjdk.org Tue Jun 13 16:48:09 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 13 Jun 2023 16:48:09 GMT Subject: [crac] RFR: Support passing extra options in CREngine [v3] In-Reply-To: References: Message-ID: <_uBHobu5gJN_PqL_heBT_N0U5_-y9_lZjtO2QJO_HcI=.6718c2ca-4ba1-4cf0-881c-db4270ce3b18@github.com> On Tue, 13 Jun 2023 13:46:20 GMT, Radim Vansa wrote: >> In addition to `-XX:CREngine=program` this adds support to `-XX:CREngine=program,--key,value,--anotherkey` that translates into invoking `program --key value --anotherkey`. Commas support escaping with a backslash. >> >> This generic parameters support is utilized in `criuengine` that accepts `--verbosity` and `--log-file` options and relays them to `criu`. > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > fixup Marked as reviewed by akozlov (Lead). ------------- PR Review: https://git.openjdk.org/crac/pull/63#pullrequestreview-1477649648 From akozlov at openjdk.org Tue Jun 13 16:48:10 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 13 Jun 2023 16:48:10 GMT Subject: [crac] RFR: Support passing extra options in CREngine [v3] In-Reply-To: <0W6YuCa2Zz6QNHvNnWDev8CPbkaB9kuIaTwyrKzV240=.0a18f0ec-4fa2-440b-93ee-c6614eb9a20a@github.com> References: <0W6YuCa2Zz6QNHvNnWDev8CPbkaB9kuIaTwyrKzV240=.0a18f0ec-4fa2-440b-93ee-c6614eb9a20a@github.com> Message-ID: On Tue, 13 Jun 2023 15:05:51 GMT, Radim Vansa wrote: >> src/hotspot/os/linux/os_linux.cpp line 440: >> >>> 438: static const char* _crengine = NULL; >>> 439: static char* _crengine_arg_str = NULL; >>> 440: static const char* _crengine_args[32] = { NULL, NULL, NULL }; >> >> Should it be really 3 NULLs? > > Yes. 
First is in place of the binary; second is the operation (`checkpoint`/`restore`) and we want the third position initialized to NULL - when add_crengine_arg is called it should find a NULL and paste the value in there, initializing the 'tail' to NULL. OK, thanks, makes sense. >> src/hotspot/os/linux/os_linux.cpp line 5945: >> >>> 5943: static void add_crengine_arg(const char *arg) { >>> 5944: for (size_t i = 2; i < ARRAY_SIZE(_crengine_args) - 1; ++i) { >>> 5945: if (_crengine_args[i] == NULL) { >> >> The last empty position can be remembered along _crengine_args > > The fewer static vars the better IMO, but I can certainly use an explicit index var and drop the initialization to three NULLs if you think it's better. In general I agree, but since _crengine_args is already stateful, I don't see any trouble in adding more statics that are in fact part of the same state. But this is minor, so up to you. OK for me as it is. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/63#discussion_r1228426792 PR Review Comment: https://git.openjdk.org/crac/pull/63#discussion_r1228425988 From rmarchenko at openjdk.org Wed Jun 14 10:52:25 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Wed, 14 Jun 2023 10:52:25 GMT Subject: [crac] RFR: Fixing VMOptionsTest.java In-Reply-To: References: Message-ID: <3fwZHG0srCtICqyXwUpuPi7g8saYE1z94JDaXICPrCM=.f7780bc7-105d-4fe5-ae1c-bf24116087d6@github.com> On Tue, 13 Jun 2023 14:59:46 GMT, Radim Vansa wrote: >> @rvansa Could you review please? > > Hi @wkia , the PR looks correct, but I wonder if we need to complicate things if the NMT default is not the same in different builds. I've picked this simply as a random specimen of the non-manageable opts (with string args); could you just use something else, e.g. `OnOutOfMemoryError`, or if you want to go with non-empty default there's `MetaspaceReclaimPolicy`?
Hi @rvansa, If my understanding regarding NMT is correct, the test checks that non-manageable options cannot be set on restore. If so, there is no need to focus on _default_ values. OK, I will make the appropriate changes. ------------- PR Comment: https://git.openjdk.org/crac/pull/81#issuecomment-1590957046 From rmarchenko at openjdk.org Wed Jun 14 11:11:52 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Wed, 14 Jun 2023 11:11:52 GMT Subject: [crac] RFR: Fixing VMOptionsTest.java [v2] In-Reply-To: References: Message-ID: > test/jdk/jdk/crac/VMOptionsTest.java fails in debug build because it expects that NativeMemoryTracking equals to"off", that is not true in debug build. > > > Exception in thread "main" java.lang.RuntimeException: assertEquals: expected off to equal summary > at jdk.test.lib.Asserts.fail(Asserts.java:594) > at jdk.test.lib.Asserts.assertEquals(Asserts.java:205) > at jdk.test.lib.Asserts.assertEquals(Asserts.java:189) > at VMOptionsTest.exec(VMOptionsTest.java:60) > at jdk.test.lib.crac.CracTest.run(CracTest.java:155) > at jdk.test.lib.crac.CracTest.main(CracTest.java:87) > > > This change fixes the issue by retrieving NMT value before checkpoint, and comparing later.
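
The idea of reading the option value before the checkpoint and comparing it afterwards, instead of hard-coding `off`, can be sketched as follows. This is only an illustration and not the code from `VMOptionsTest.java`: the class `VmOptionSnapshot` and its `read` helper are hypothetical names, and the sketch assumes a HotSpot JVM where `com.sun.management.HotSpotDiagnosticMXBean` is available. The checkpoint and restore step itself is left out.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of the CRaC test library.
final class VmOptionSnapshot {
    // Returns the current value of a VM option as a string,
    // e.g. "off" or "summary" for NativeMemoryTracking.
    static String read(String name) {
        // A fresh bean lookup on every call keeps the two reads independent.
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption(name).getValue();
    }
}
```

A test would call `VmOptionSnapshot.read("NativeMemoryTracking")` once before the checkpoint, keep the result, and after restore assert that a second read returns the same string, so the check no longer depends on the build-specific default.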
Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: Making the test conditions stable ------------- Changes: - all: https://git.openjdk.org/crac/pull/81/files - new: https://git.openjdk.org/crac/pull/81/files/4a27d68d..cb2fa67a Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=81&range=01 - incr: https://webrevs.openjdk.org/?repo=crac&pr=81&range=00-01 Stats: 30 lines in 2 files changed: 12 ins; 2 del; 16 mod Patch: https://git.openjdk.org/crac/pull/81.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/81/head:pull/81 PR: https://git.openjdk.org/crac/pull/81 From duke at openjdk.org Wed Jun 14 11:11:54 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 14 Jun 2023 11:11:54 GMT Subject: [crac] RFR: Fixing VMOptionsTest.java [v2] In-Reply-To: References: Message-ID: <8xw93wdVx0gyLs37Q6t1uDDsQ1BJxPISgfNPQMJyVKM=.3b874b8d-90ba-48be-bf09-d315a1143db1@github.com> On Wed, 14 Jun 2023 11:06:40 GMT, Roman Marchenko wrote: >> test/jdk/jdk/crac/VMOptionsTest.java fails in debug build because it expects that NativeMemoryTracking equals to"off", that is not true in debug build. >> >> >> Exception in thread "main" java.lang.RuntimeException: assertEquals: expected off to equal summary >> at jdk.test.lib.Asserts.fail(Asserts.java:594) >> at jdk.test.lib.Asserts.assertEquals(Asserts.java:205) >> at jdk.test.lib.Asserts.assertEquals(Asserts.java:189) >> at VMOptionsTest.exec(VMOptionsTest.java:60) >> at jdk.test.lib.crac.CracTest.run(CracTest.java:155) >> at jdk.test.lib.crac.CracTest.main(CracTest.java:87) >> >> >> This change fixes the issue by retrieving NMT value before checkpoint, and comparing later. > > Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: > > Making the test conditions stable LGTM. Out of curiosity, did you find any issues with using the same bean after restore? 
------------- Marked as reviewed by rvansa at github.com (no known OpenJDK username). PR Review: https://git.openjdk.org/crac/pull/81#pullrequestreview-1479123593 From rmarchenko at openjdk.org Wed Jun 14 12:21:22 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Wed, 14 Jun 2023 12:21:22 GMT Subject: [crac] RFR: Fixing VMOptionsTest.java In-Reply-To: References: Message-ID: <7K4oIueqb5BoqidBKDdUKSdX0CS5ny7njftQ3N51jwY=.a1b16623-0272-4df6-86a3-32791d287ab9@github.com> On Tue, 13 Jun 2023 14:59:46 GMT, Radim Vansa wrote: >> @rvansa Could you review please? > > Hi @wkia , the PR looks correct, but I wonder if we need to complicate things if the NMT default is not the same in different builds. I've picked this simply as a random specimen of the non-manageable opts (with string args); could you just use something else, e.g. `OnOutOfMemoryError`, or if you want to go with non-empty default there's `MetaspaceReclaimPolicy`? @rvansa > did you find any issues with using the same bean after restore? No, there is no problem with it, I just tried to make the checks isolated to prevent any interference. ------------- PR Comment: https://git.openjdk.org/crac/pull/81#issuecomment-1591083283 From rmarchenko at openjdk.org Thu Jun 15 07:27:34 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Thu, 15 Jun 2023 07:27:34 GMT Subject: [crac] Integrated: Fixing VMOptionsTest.java In-Reply-To: References: Message-ID: On Tue, 13 Jun 2023 13:11:32 GMT, Roman Marchenko wrote: > test/jdk/jdk/crac/VMOptionsTest.java fails in debug build because it expects that NativeMemoryTracking equals to"off", that is not true in debug build. 
> > > Exception in thread "main" java.lang.RuntimeException: assertEquals: expected off to equal summary > at jdk.test.lib.Asserts.fail(Asserts.java:594) > at jdk.test.lib.Asserts.assertEquals(Asserts.java:205) > at jdk.test.lib.Asserts.assertEquals(Asserts.java:189) > at VMOptionsTest.exec(VMOptionsTest.java:60) > at jdk.test.lib.crac.CracTest.run(CracTest.java:155) > at jdk.test.lib.crac.CracTest.main(CracTest.java:87) > > > This change fixes the issue by retrieving NMT value before checkpoint, and comparing later. This pull request has now been integrated. Changeset: bd9cf0be Author: Roman Marchenko Committer: Anton Kozlov URL: https://git.openjdk.org/crac/commit/bd9cf0bee8dacef44b29028fa0ad07b82b2e5783 Stats: 27 lines in 2 files changed: 11 ins; 0 del; 16 mod Fixing VMOptionsTest.java Reviewed-by: rvansa ------------- PR: https://git.openjdk.org/crac/pull/81 From rvansa at openjdk.org Thu Jun 15 10:21:46 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Thu, 15 Jun 2023 10:21:46 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v5] In-Reply-To: References: Message-ID: <6Tlio9TSMnER-ICSeo5aihk8BSCm_8GM_QhRdWcqSsU=.06e0932d-f318-4fc1-b24c-4685d560c49b@github.com> > When the application does not close some file descriptors through Resources we can use `jdk.crac.fd-policy.checkpoint` and `jdk.crac.fd-policy.restore` to configure the behaviour. > > These properties can specify a list of File.pathSeparator-separated key=value pairs, where the key can be one of: > > * numeric file descriptor > * path using 'glob' pattern matching (see FileSystem.getPathMatcher() for details) > * keywords FIFO and SOCKET that match pipes and sockets > > The value should match one of possible values from OpenFDPolicies.BeforeCheckpoint and OpenFDPolicies.AfterRestore Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains eight commits: - Merge branch 'crac' into newfd-policies - Refactor FileDescriptor resource to separate class - Add REOPEN_AT_END and OPEN_OTHER_AT_END policies - Merge branch 'crac' into newfd-policies - Effectively revert previous commit: Initialize logger in - Simplify workarounds in SimpleConsoleLogger. - Merge branch 'crac' into newfd-policies - Handle open file descriptors with configurable policies When the application does not close some file descriptors through Resources we can use `jdk.crac.fd-policy.checkpoint` and `jdk.crac.fd-policy.restore` to configure the behaviour. These properties can specify a list of File.pathSeparator-separated key=value pairs, where the key can be one of: * numeric file descriptor * path using 'glob' pattern matching (see FileSystem.getPathMatcher() for details) * keywords FIFO and SOCKET that match pipes and sockets The value should match one of possible values from OpenFDPolicies.BeforeCheckpoint and OpenFDPolicies.AfterRestore ------------- Changes: https://git.openjdk.org/crac/pull/69/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=69&range=04 Stats: 2196 lines in 53 files changed: 2113 ins; 54 del; 29 mod Patch: https://git.openjdk.org/crac/pull/69.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/69/head:pull/69 PR: https://git.openjdk.org/crac/pull/69 From rvansa at openjdk.org Thu Jun 15 11:27:43 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Thu, 15 Jun 2023 11:27:43 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v6] In-Reply-To: References: Message-ID: > When the application does not close some file descriptors through Resources we can use `jdk.crac.file-policy.checkpoint`, `jdk.crac.file-policy.restore`, `jdk.crac.socket-policy.checkpoint` and `jdk.crac.socket-policy.checkpoint` to configure the behaviour. 
> > These properties can specify a list of semicolon-separated key=value pairs, where the key can be one of: > > * numeric file descriptor > * for `file-policy`, path using 'glob' pattern matching (see FileSystem.getPathMatcher() for details) - this matches named pipes, too > * for `file-policy` keyword `FIFO` matching all pipes (including anonymous) > * for `socket-policy` a `,` pair, with the `` part being optional. Both `` and `` can be unix socket path, IPv4/IPv6 address with optional colon and port number or wildcard `*` replacing any of those parts. > > The possible values are in OpenFilePolicies.BeforeCheckpoint, OpenFilePolicies.AfterRestore, OpenSocketPolicies.BeforeCheckpoint and OpenSocketPolicies.AfterRestore enums. Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: cleanup ------------- Changes: - all: https://git.openjdk.org/crac/pull/69/files - new: https://git.openjdk.org/crac/pull/69/files/91e58f17..a5c8a484 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=69&range=05 - incr: https://webrevs.openjdk.org/?repo=crac&pr=69&range=04-05 Stats: 44 lines in 6 files changed: 0 ins; 43 del; 1 mod Patch: https://git.openjdk.org/crac/pull/69.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/69/head:pull/69 PR: https://git.openjdk.org/crac/pull/69 From rvansa at openjdk.org Thu Jun 15 12:19:06 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Thu, 15 Jun 2023 12:19:06 GMT Subject: [crac] RFR: Support passing extra options in CREngine [v4] In-Reply-To: References: Message-ID: > In addition to `-XX:CREngine=program` this adds support to `-XX:CREngine=program,--key,value,--anotherkey` that translates into invoking `program --key value --anotherkey`. Commas support escaping with a backslash. > > This generic parameters support is utilized in `criuengine` that accepts `--verbosity` and `--log-file` options and relays them to `criu`. 
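
The comma-separated form of `-XX:CREngine` described above can be illustrated with a small Java sketch. The real parsing lives in the HotSpot C++ sources and may differ in details; `EngineOptionSplitter` is a hypothetical name, and the handling of a backslash that is not followed by a comma (kept literally here) is an assumption, since the description only says that commas can be escaped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of "program,--key,value" splitting.
final class EngineOptionSplitter {
    // Splits on unescaped commas; "\," yields a literal comma in the argument.
    static List<String> split(String option) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < option.length(); i++) {
            char c = option.charAt(i);
            if (c == '\\' && i + 1 < option.length() && option.charAt(i + 1) == ',') {
                current.append(',');
                i++; // skip the escaped comma
            } else if (c == ',') {
                parts.add(current.toString()); // end of one argument
                current.setLength(0);
            } else {
                current.append(c); // assumption: any other character is literal
            }
        }
        parts.add(current.toString()); // last (possibly empty) argument
        return parts;
    }
}
```

With this sketch, `program,--key,value,--anotherkey` becomes the four elements `program`, `--key`, `value`, `--anotherkey`, which corresponds to invoking `program --key value --anotherkey`.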
Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Use index var rather than looking up NULLs ------------- Changes: - all: https://git.openjdk.org/crac/pull/63/files - new: https://git.openjdk.org/crac/pull/63/files/945e95ed..3cf68256 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=63&range=03 - incr: https://webrevs.openjdk.org/?repo=crac&pr=63&range=02-03 Stats: 15 lines in 1 file changed: 3 ins; 4 del; 8 mod Patch: https://git.openjdk.org/crac/pull/63.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/63/head:pull/63 PR: https://git.openjdk.org/crac/pull/63 From rvansa at openjdk.org Thu Jun 15 12:22:30 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Thu, 15 Jun 2023 12:22:30 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v6] In-Reply-To: References: Message-ID: On Thu, 15 Jun 2023 11:27:43 GMT, Radim Vansa wrote: >> When the application does not close some file descriptors through Resources we can use `jdk.crac.file-policy.checkpoint`, `jdk.crac.file-policy.restore`, `jdk.crac.socket-policy.checkpoint` and `jdk.crac.socket-policy.checkpoint` to configure the behaviour. >> >> These properties can specify a list of semicolon-separated key=value pairs, where the key can be one of: >> >> * numeric file descriptor >> * for `file-policy`, path using 'glob' pattern matching (see FileSystem.getPathMatcher() for details) - this matches named pipes, too >> * for `file-policy` keyword `FIFO` matching all pipes (including anonymous) >> * for `socket-policy` a `,` pair, with the `` part being optional. Both `` and `` can be unix socket path, IPv4/IPv6 address with optional colon and port number or wildcard `*` replacing any of those parts. >> >> The possible values are in OpenFilePolicies.BeforeCheckpoint, OpenFilePolicies.AfterRestore, OpenSocketPolicies.BeforeCheckpoint and OpenSocketPolicies.AfterRestore enums. 
> > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > cleanup Merged in `crac` branch, and applied requested changes: * The detection and policies now live in JDKFileResource and JDKSocketResource * Policies for files and sockets are split One thing I am not so sure about is location of some native methods - I've left what's used in `JDKFileResource` in `FileDispatcher_md.c`, and `JDKSocketResource` uses methods both in `UnixDomainSockets.c` and some are in `Net.c`. This might deserve own files, but I've reused what was in those files. ------------- PR Comment: https://git.openjdk.org/crac/pull/69#issuecomment-1592937141 From akozlov at openjdk.org Thu Jun 15 12:39:42 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 15 Jun 2023 12:39:42 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v35] In-Reply-To: References: Message-ID: On Mon, 12 Jun 2023 09:56:32 GMT, Jan Kratochvil wrote: >> Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. >> >> 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. >> 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC >> >> (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: >> >>> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine >> >> It is intended to specify the lowest common denominator of all CPUs in a farm. 
Instead of a possible crash of OpenJDK it will now refuse to run: >> >>> Error occurred during initialization of VM >>> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi >> >> (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. >> >> If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfo... > > Jan Kratochvil has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 110 commits: > > - Merge branch 'crac' into crac-altstack-cpu-cpuexplicit-strip > - Fix slowdebug compilation. > Split better get_processor_features_hardware/get_processor_features_hotspot(). > - Compatibility with non-X86; untested. 
> - Simplify error reporting by err_msg(). > - Fix printing missing features on target CPU. > - Fix hotspot 'ht' vs. glibc 'htt'. > - CPUFeatures refactorization. > Start CPU Features checking without libc. > - Reintroduce initialize_processor_count() requiring -XX:+CRaCCPUCountInit. > - requested by Anton Kozlov > - Remove initialize_processor_count(). > - requested by Anton Kozlov > - it was crashing for me for 4 CPU <-> 16 CPU moves > - Reintroduce the "leftover" code which was not leftover. > - ... and 100 more: https://git.openjdk.org/crac/compare/a282698d...7c567e99 I could not spot anything, considering the size of the patch :) LGTM! src/hotspot/cpu/x86/vm_version_x86.cpp line 2564: > 2562: > 2563: if (ShowCPUFeatures) > 2564: nonlibc_tty_print_using_features_cr(); Do I understand correctly that at this point all features are checked, and this can be a regular printing function call, involving libc? src/hotspot/cpu/x86/vm_version_x86.cpp line 2608: > 2606: > 2607: if (ShowCPUFeatures) > 2608: nonlibc_tty_print_using_features_cr(); Can be a regular printing function call? ------------- Marked as reviewed by akozlov (Lead). PR Review: https://git.openjdk.org/crac/pull/41#pullrequestreview-1481394404 PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1230899173 PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1230908479 From rvansa at openjdk.org Thu Jun 15 12:40:49 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Thu, 15 Jun 2023 12:40:49 GMT Subject: [crac] RFR: Support passing extra options in CREngine [v5] In-Reply-To: References: Message-ID: > In addition to `-XX:CREngine=program` this adds support to `-XX:CREngine=program,--key,value,--anotherkey` that translates into invoking `program --key value --anotherkey`. Commas support escaping with a backslash. > > This generic parameters support is utilized in `criuengine` that accepts `--verbosity` and `--log-file` options and relays them to `criu`. 
Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits: - Merge branch 'crac' into engine_params - Use index var rather than looking up NULLs - fixup - Do not use dashes - Merge branch 'crac' into engine_params - Support passing extra options in CREngine ------------- Changes: https://git.openjdk.org/crac/pull/63/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=63&range=04 Stats: 157 lines in 5 files changed: 126 ins; 5 del; 26 mod Patch: https://git.openjdk.org/crac/pull/63.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/63/head:pull/63 PR: https://git.openjdk.org/crac/pull/63 From rvansa at openjdk.org Thu Jun 15 12:46:32 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Thu, 15 Jun 2023 12:46:32 GMT Subject: [crac] RFR: Support passing extra options in CREngine [v6] In-Reply-To: References: Message-ID: > In addition to `-XX:CREngine=program` this adds support to `-XX:CREngine=program,--key,value,--anotherkey` that translates into invoking `program --key value --anotherkey`. Commas support escaping with a backslash. > > This generic parameters support is utilized in `criuengine` that accepts `--verbosity` and `--log-file` options and relays them to `criu`. Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains seven commits: - Merge branch 'crac' into engine_params - Merge branch 'crac' into engine_params - Use index var rather than looking up NULLs - fixup - Do not use dashes - Merge branch 'crac' into engine_params - Support passing extra options in CREngine ------------- Changes: https://git.openjdk.org/crac/pull/63/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=63&range=05 Stats: 157 lines in 5 files changed: 126 ins; 5 del; 26 mod Patch: https://git.openjdk.org/crac/pull/63.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/63/head:pull/63 PR: https://git.openjdk.org/crac/pull/63 From rvansa at openjdk.org Thu Jun 15 12:46:33 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Thu, 15 Jun 2023 12:46:33 GMT Subject: [crac] RFR: Support passing extra options in CREngine [v4] In-Reply-To: References: Message-ID: On Thu, 15 Jun 2023 12:19:06 GMT, Radim Vansa wrote: >> In addition to `-XX:CREngine=program` this adds support to `-XX:CREngine=program,--key,value,--anotherkey` that translates into invoking `program --key value --anotherkey`. Commas support escaping with a backslash. >> >> This generic parameters support is utilized in `criuengine` that accepts `--verbosity` and `--log-file` options and relays them to `criu`. > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Use index var rather than looking up NULLs Updated using the index var. 
------------- PR Comment: https://git.openjdk.org/crac/pull/63#issuecomment-1592969385 From rvansa at openjdk.org Thu Jun 15 12:46:33 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Thu, 15 Jun 2023 12:46:33 GMT Subject: [crac] Integrated: Support passing extra options in CREngine In-Reply-To: References: Message-ID: On Wed, 10 May 2023 08:58:54 GMT, Radim Vansa wrote: > In addition to `-XX:CREngine=program` this adds support to `-XX:CREngine=program,--key,value,--anotherkey` that translates into invoking `program --key value --anotherkey`. Commas support escaping with a backslash. > > This generic parameters support is utilized in `criuengine` that accepts `--verbosity` and `--log-file` options and relays them to `criu`. This pull request has now been integrated. Changeset: ddf6a055 Author: Radim Vansa URL: https://git.openjdk.org/crac/commit/ddf6a0550ddb2d6aa61640bb82317c003e52df3d Stats: 157 lines in 5 files changed: 126 ins; 5 del; 26 mod Support passing extra options in CREngine Reviewed-by: akozlov ------------- PR: https://git.openjdk.org/crac/pull/63 From rmarchenko at openjdk.org Thu Jun 15 14:04:14 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Thu, 15 Jun 2023 14:04:14 GMT Subject: [crac] RFR: Draft: Win/Mac build for CRaC Message-ID: Draft: Win/Mac build for CRaC ------------- Commit messages: - Removing extra spaces - Merge remote-tracking branch 'origin/crac' into win_mac_build - Fixing crac::doit() - Merge branch 'test-fix' into win_mac_build - Making the test conditions stable - Fixing test/jdk/jdk/crac/VMOptionsTest.java - win/mac build prepared Changes: https://git.openjdk.org/crac/pull/82/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=82&range=00 Stats: 1738 lines in 42 files changed: 1068 ins; 643 del; 27 mod Patch: https://git.openjdk.org/crac/pull/82.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/82/head:pull/82 PR: https://git.openjdk.org/crac/pull/82 From rvansa at openjdk.org Thu Jun 15 14:28:27 2023 
From: rvansa at openjdk.org (Radim Vansa) Date: Thu, 15 Jun 2023 14:28:27 GMT Subject: [crac] RFR: Draft: Win/Mac build for CRaC In-Reply-To: References: Message-ID: On Thu, 15 Jun 2023 13:35:39 GMT, Roman Marchenko wrote: > Draft: Win/Mac build for CRaC I would probably have to do a more careful function-by-function comparison to check if any other changes were accidentally reverted. src/hotspot/os/windows/os_windows.cpp line 6166: > 6164: } > 6165: > 6166: void VM_Crac::report_ok_to_jcmd_if_any() { What happens when you execute JDK checkpoint from `jcmd` on Windows? src/hotspot/share/runtime/os.cpp line 104: > 102: > 103: // CRaC > 104: static const char* _crengine = NULL; Looks like you unintentionally reverted recent changes - search for `_crengine_args` and handling code. src/hotspot/share/runtime/os.cpp line 2177: > 2175: } > 2176: > 2177: bool VM_Crac::is_claimed_fd(int fd) { What about moving all this into `crac.cpp` rather than extending the already-long-enough `os.cpp`? src/java.base/share/native/launcher/main.c line 155: > 153: sigact.sa_sigaction = sighandler; > 154: > 155: for (int sig = 1; sig <= 31; ++sig) { Hardcoded constant? src/java.base/windows/native/pauseengine/pauseengine.c line 36: > 34: typedef int pid_t; > 35: > 36: #define RESTORE_SIGNAL (SIGRTMIN + 2) Not used, I assume? I would still rather see the simplified engines sharing codebase.
------------- PR Review: https://git.openjdk.org/crac/pull/82#pullrequestreview-1481675238 PR Review Comment: https://git.openjdk.org/crac/pull/82#discussion_r1231075475 PR Review Comment: https://git.openjdk.org/crac/pull/82#discussion_r1231081166 PR Review Comment: https://git.openjdk.org/crac/pull/82#discussion_r1231086673 PR Review Comment: https://git.openjdk.org/crac/pull/82#discussion_r1231093202 PR Review Comment: https://git.openjdk.org/crac/pull/82#discussion_r1231096407 From rmarchenko at openjdk.org Thu Jun 15 16:04:26 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Thu, 15 Jun 2023 16:04:26 GMT Subject: [crac] RFR: Draft: Win/Mac build for CRaC In-Reply-To: References: Message-ID: On Thu, 15 Jun 2023 14:25:18 GMT, Radim Vansa wrote: >> Please ignore this PR for now. >> Currently it is for testing purposes. > > I would probably have to do a more careful function-by-function comparison to check if any other changes were accidentally reverted. @rvansa Hi, You can ignore this PR for now. Currently it is for testing purposes. I will remove the "draft" mark when it is ready for review. > src/hotspot/share/runtime/os.cpp line 104: > >> 102: >> 103: // CRaC >> 104: static const char* _crengine = NULL; > > Looks like you unintentionally reverted recent changes - search for `_crengine_args` and handling code. Thanks, I will check this. > src/hotspot/share/runtime/os.cpp line 2177: > >> 2175: } >> 2176: >> 2177: bool VM_Crac::is_claimed_fd(int fd) { > > What about moving all this into `crac.cpp` rather than extending the already-long-enough `os.cpp`? This is being done now in another PR. > src/java.base/share/native/launcher/main.c line 155: > >> 153: sigact.sa_sigaction = sighandler; >> 154: >> 155: for (int sig = 1; sig <= 31; ++sig) { > > Hardcoded constant? Yes; there are no named constants with the same meaning across the different platforms. We could create one specifically for CRaC.
------------- PR Comment: https://git.openjdk.org/crac/pull/82#issuecomment-1593334909 PR Review Comment: https://git.openjdk.org/crac/pull/82#discussion_r1231235492 PR Review Comment: https://git.openjdk.org/crac/pull/82#discussion_r1231234419 PR Review Comment: https://git.openjdk.org/crac/pull/82#discussion_r1231232991 From rmarchenko at openjdk.org Fri Jun 16 09:47:38 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Fri, 16 Jun 2023 09:47:38 GMT Subject: [crac] RFR: Support passing extra options in CREngine [v6] In-Reply-To: References: Message-ID: On Thu, 15 Jun 2023 12:46:32 GMT, Radim Vansa wrote: >> In addition to `-XX:CREngine=program` this adds support to `-XX:CREngine=program,--key,value,--anotherkey` that translates into invoking `program --key value --anotherkey`. Commas support escaping with a backslash. >> >> This generic parameters support is utilized in `criuengine` that accepts `--verbosity` and `--log-file` options and relays them to `criu`. > > Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits: > > - Merge branch 'crac' into engine_params > - Merge branch 'crac' into engine_params > - Use index var rather than looking up NULLs > - fixup > - Do not use dashes > - Merge branch 'crac' into engine_params > - Support passing extra options in CREngine It seems that `jdk/crac/recursiveCheckpoint/Test.java` started failing with a timeout after this PR. Have you tested it locally? As far as I can see, the test is run with a comma appended to the CR engine name ("`-XX:CREngine=pauseengine,`"). As a result, `pauseengine` cannot recognize the `imagedir` name, because `arg[2]` is empty. Can you reproduce this locally?
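
The trailing-comma problem described above can be avoided on the test side by appending the separator only when there is at least one engine argument. The following is a minimal sketch, not the actual `CracBuilder` code: `EngineOptionBuilder` and `build` are hypothetical names, and escaping commas inside arguments with a backslash is an assumption based on the option description quoted above.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical helper showing how the -XX:CREngine value could be assembled.
final class EngineOptionBuilder {
    static String build(String engine, String... engineArgs) {
        // No arguments (null or empty array): emit the engine name only,
        // so no dangling comma produces an empty extra argument.
        if (engineArgs == null || engineArgs.length == 0) {
            return "-XX:CREngine=" + engine;
        }
        String joined = Arrays.stream(engineArgs)
                .map(arg -> arg.replace(",", "\\,")) // assumption: escape literal commas
                .collect(Collectors.joining(","));
        return "-XX:CREngine=" + engine + "," + joined;
    }
}
```

For example, `EngineOptionBuilder.build("pauseengine")` yields `-XX:CREngine=pauseengine` without a trailing comma, while `EngineOptionBuilder.build("criuengine", "--verbosity", "4")` yields `-XX:CREngine=criuengine,--verbosity,4` (the value `4` is purely illustrative).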
------------- PR Comment: https://git.openjdk.org/crac/pull/63#issuecomment-1594413916 From rmarchenko at openjdk.org Fri Jun 16 10:08:37 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Fri, 16 Jun 2023 10:08:37 GMT Subject: [crac] RFR: Support passing extra options in CREngine [v6] In-Reply-To: References: Message-ID: <7JSisQB0945MDMggkF9z9b-5htWfN-2HDzhTwR8snec=.13eda644-ff3d-4587-8b2a-cbffaa99439c@github.com> On Thu, 15 Jun 2023 12:46:32 GMT, Radim Vansa wrote: >> In addition to `-XX:CREngine=program` this adds support to `-XX:CREngine=program,--key,value,--anotherkey` that translates into invoking `program --key value --anotherkey`. Commas support escaping with a backslash. >> >> This generic parameters support is utilized in `criuengine` that accepts `--verbosity` and `--log-file` options and relays them to `criu`. > > Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits: > > - Merge branch 'crac' into engine_params > - Merge branch 'crac' into engine_params > - Use index var rather than looking up NULLs > - fixup > - Do not use dashes > - Merge branch 'crac' into engine_params > - Support passing extra options in CREngine test/lib/jdk/test/lib/crac/CracBuilder.java line 338: > 336: cmd.add("-ea"); > 337: if (engine != null) { > 338: String engArgs = engineArgs == null ? 
"" : "," + Arrays.stream(engineArgs) I suppose `engineArgs == null` should be replaced with something like `engineArgs == null || 0 == engineArgs.length` ------------- PR Review Comment: https://git.openjdk.org/crac/pull/63#discussion_r1232051616 From rvansa at openjdk.org Fri Jun 16 10:08:37 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Fri, 16 Jun 2023 10:08:37 GMT Subject: [crac] RFR: Support passing extra options in CREngine [v6] In-Reply-To: <7JSisQB0945MDMggkF9z9b-5htWfN-2HDzhTwR8snec=.13eda644-ff3d-4587-8b2a-cbffaa99439c@github.com> References: <7JSisQB0945MDMggkF9z9b-5htWfN-2HDzhTwR8snec=.13eda644-ff3d-4587-8b2a-cbffaa99439c@github.com> Message-ID: On Fri, 16 Jun 2023 10:04:34 GMT, Roman Marchenko wrote: >> Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits: >> >> - Merge branch 'crac' into engine_params >> - Merge branch 'crac' into engine_params >> - Use index var rather than looking up NULLs >> - fixup >> - Do not use dashes >> - Merge branch 'crac' into engine_params >> - Support passing extra options in CREngine > > test/lib/jdk/test/lib/crac/CracBuilder.java line 338: > >> 336: cmd.add("-ea"); >> 337: if (engine != null) { >> 338: String engArgs = engineArgs == null ? "" : "," + Arrays.stream(engineArgs) > > I suppose `engineArgs == null` should be replaced with something like `engineArgs == null || 0 == engineArgs.length` You're right, I must have had something cached because when I re-ran the test it was passing... until I added some debug logs to confirm what you're seeing. I'll file a fix in a sec. 
------------- PR Review Comment: https://git.openjdk.org/crac/pull/63#discussion_r1232052834 From rvansa at openjdk.org Fri Jun 16 10:13:42 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Fri, 16 Jun 2023 10:13:42 GMT Subject: [crac] RFR: Support passing extra options in CREngine [v6] In-Reply-To: References: <7JSisQB0945MDMggkF9z9b-5htWfN-2HDzhTwR8snec=.13eda644-ff3d-4587-8b2a-cbffaa99439c@github.com> Message-ID: On Fri, 16 Jun 2023 10:05:45 GMT, Radim Vansa wrote: >> test/lib/jdk/test/lib/crac/CracBuilder.java line 338: >> >>> 336: cmd.add("-ea"); >>> 337: if (engine != null) { >>> 338: String engArgs = engineArgs == null ? "" : "," + Arrays.stream(engineArgs) >> >> I suppose `engineArgs == null` should be replaced with something like `engineArgs == null || 0 == engineArgs.length` > > You're right, I must have had something cached because when I re-ran the test it was passing... until I added some debug logs to confirm what you're seeing. I'll file a fix in a sec. Filed https://github.com/openjdk/crac/pull/83 ------------- PR Review Comment: https://git.openjdk.org/crac/pull/63#discussion_r1232058111 From rvansa at openjdk.org Fri Jun 16 10:16:32 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Fri, 16 Jun 2023 10:16:32 GMT Subject: [crac] RFR: Fix recursiveCheckpoint/Test passing extra comma Message-ID: Caused by ddf6a0550ddb2d6aa61640bb82317c003e52df3d ------------- Commit messages: - Fix recursiveCheckpoint/Test passing extra comma Changes: https://git.openjdk.org/crac/pull/83/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=83&range=00 Stats: 4 lines in 1 file changed: 2 ins; 0 del; 2 mod Patch: https://git.openjdk.org/crac/pull/83.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/83/head:pull/83 PR: https://git.openjdk.org/crac/pull/83 From rvansa at openjdk.org Fri Jun 16 10:16:32 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Fri, 16 Jun 2023 10:16:32 GMT Subject: [crac] RFR: Fix recursiveCheckpoint/Test passing extra comma 
In-Reply-To: References: Message-ID: On Fri, 16 Jun 2023 10:08:34 GMT, Radim Vansa wrote: > Caused by ddf6a0550ddb2d6aa61640bb82317c003e52df3d @wkia Please review. ------------- PR Comment: https://git.openjdk.org/crac/pull/83#issuecomment-1594444977 From rmarchenko at openjdk.org Fri Jun 16 10:21:27 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Fri, 16 Jun 2023 10:21:27 GMT Subject: [crac] RFR: Fix recursiveCheckpoint/Test passing extra comma In-Reply-To: References: Message-ID: On Fri, 16 Jun 2023 10:08:34 GMT, Radim Vansa wrote: > Caused by ddf6a0550ddb2d6aa61640bb82317c003e52df3d Marked as reviewed by rmarchenko (no project role). ------------- PR Review: https://git.openjdk.org/crac/pull/83#pullrequestreview-1483166051 From rmarchenko at openjdk.org Fri Jun 16 10:31:12 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Fri, 16 Jun 2023 10:31:12 GMT Subject: [crac] RFR: Refactoring - extracted crac files Message-ID: <9ciP4AawRKr_fIjAGILl5CU_PDGZGL-jxqDKMaMQSGs=.28129bfd-9515-4afc-ac03-3cb8b192f398@github.com> CRaC-related functionality is moved to `crac*.hpp/cpp` files, and now it is in `crac` class instead of `os`. 
------------- Commit messages: - Removing trailing spaces - Refactoring - extracted crac* files Changes: https://git.openjdk.org/crac/pull/84/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=84&range=00 Stats: 2177 lines in 10 files changed: 1135 ins; 1036 del; 6 mod Patch: https://git.openjdk.org/crac/pull/84.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/84/head:pull/84 PR: https://git.openjdk.org/crac/pull/84 From rmarchenko at openjdk.org Fri Jun 16 10:31:12 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Fri, 16 Jun 2023 10:31:12 GMT Subject: [crac] RFR: Refactoring - extracted crac files In-Reply-To: <9ciP4AawRKr_fIjAGILl5CU_PDGZGL-jxqDKMaMQSGs=.28129bfd-9515-4afc-ac03-3cb8b192f398@github.com> References: <9ciP4AawRKr_fIjAGILl5CU_PDGZGL-jxqDKMaMQSGs=.28129bfd-9515-4afc-ac03-3cb8b192f398@github.com> Message-ID: On Fri, 16 Jun 2023 10:20:35 GMT, Roman Marchenko wrote: > CRaC-related functionality is moved to `crac*.hpp/cpp` files, and now it is in `crac` class instead of `os`. @AntonKozlov @rvansa Hi, As we discussed before, I extracted crac* files. Hopefully it will help with further Java version upgrades and the Windows/macOS build. ------------- PR Comment: https://git.openjdk.org/crac/pull/84#issuecomment-1594460000 From rvansa at openjdk.org Fri Jun 16 10:49:34 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Fri, 16 Jun 2023 10:49:34 GMT Subject: [crac] RFR: Refactoring - extracted crac files In-Reply-To: <9ciP4AawRKr_fIjAGILl5CU_PDGZGL-jxqDKMaMQSGs=.28129bfd-9515-4afc-ac03-3cb8b192f398@github.com> References: <9ciP4AawRKr_fIjAGILl5CU_PDGZGL-jxqDKMaMQSGs=.28129bfd-9515-4afc-ac03-3cb8b192f398@github.com> Message-ID: On Fri, 16 Jun 2023 10:20:35 GMT, Roman Marchenko wrote: > CRaC-related functionality is moved to `crac*.hpp/cpp` files, and now it is in `crac` class instead of `os`. LGTM, I've checked the stuff moved from os_linux.cpp and it matches. ------------- Marked as reviewed by rvansa (Committer).
PR Review: https://git.openjdk.org/crac/pull/84#pullrequestreview-1483202939 From akozlov at openjdk.org Fri Jun 16 11:41:31 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 16 Jun 2023 11:41:31 GMT Subject: [crac] RFR: Refactoring - extracted crac files In-Reply-To: <9ciP4AawRKr_fIjAGILl5CU_PDGZGL-jxqDKMaMQSGs=.28129bfd-9515-4afc-ac03-3cb8b192f398@github.com> References: <9ciP4AawRKr_fIjAGILl5CU_PDGZGL-jxqDKMaMQSGs=.28129bfd-9515-4afc-ac03-3cb8b192f398@github.com> Message-ID: <-nYRCU_T2HUVmfid9LYshQkDSBJkUxY9jByLj5KvDG8=.ab682e3a-d2f6-4bbb-9269-d9e4e01cdf8c@github.com> On Fri, 16 Jun 2023 10:20:35 GMT, Roman Marchenko wrote: > CRaC-related functionality is moved to `crac*.hpp/cpp` files, and now it is in `crac` class instead of `os`. src/hotspot/share/runtime/arguments.cpp line 3235: > 3233: #endif // CAN_SHOW_REGISTERS_ON_ASSERT > 3234: > 3235: if (CRaCCheckpointTo && !crac::Linux::prepare_checkpoint()) { Does it really make sense to have `crac::Linux` instead of just `crac::`? The most of the operations are not related to Linux, so it may be even better to start abstracting them from Linux in this PR, to avoid another similar renaming in the same place. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/84#discussion_r1232131731 From rmarchenko at openjdk.org Fri Jun 16 11:49:30 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Fri, 16 Jun 2023 11:49:30 GMT Subject: [crac] RFR: Refactoring - extracted crac files In-Reply-To: <-nYRCU_T2HUVmfid9LYshQkDSBJkUxY9jByLj5KvDG8=.ab682e3a-d2f6-4bbb-9269-d9e4e01cdf8c@github.com> References: <9ciP4AawRKr_fIjAGILl5CU_PDGZGL-jxqDKMaMQSGs=.28129bfd-9515-4afc-ac03-3cb8b192f398@github.com> <-nYRCU_T2HUVmfid9LYshQkDSBJkUxY9jByLj5KvDG8=.ab682e3a-d2f6-4bbb-9269-d9e4e01cdf8c@github.com> Message-ID: On Fri, 16 Jun 2023 11:35:22 GMT, Anton Kozlov wrote: >> CRaC-related functionality is moved to `crac*.hpp/cpp` files, and now it is in `crac` class instead of `os`. 
> > src/hotspot/share/runtime/arguments.cpp line 3235: > >> 3233: #endif // CAN_SHOW_REGISTERS_ON_ASSERT >> 3234: >> 3235: if (CRaCCheckpointTo && !crac::Linux::prepare_checkpoint()) { > > Does it really make sense to have `crac::Linux` instead of just `crac::`? The most of the operations are not related to Linux, so it may be even better to start abstracting them from Linux in this PR, to avoid another similar renaming in the same place. Ok, let's get rid of crac::linux now. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/84#discussion_r1232141645 From rvansa at openjdk.org Fri Jun 16 12:11:33 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Fri, 16 Jun 2023 12:11:33 GMT Subject: [crac] Integrated: Fix recursiveCheckpoint/Test passing extra comma In-Reply-To: References: Message-ID: On Fri, 16 Jun 2023 10:08:34 GMT, Radim Vansa wrote: > Caused by ddf6a0550ddb2d6aa61640bb82317c003e52df3d This pull request has now been integrated. Changeset: e75dbb26 Author: Radim Vansa URL: https://git.openjdk.org/crac/commit/e75dbb261285ddf07f47156967774e1a7a1e52ec Stats: 4 lines in 1 file changed: 2 ins; 0 del; 2 mod Fix recursiveCheckpoint/Test passing extra comma Reviewed-by: rmarchenko ------------- PR: https://git.openjdk.org/crac/pull/83 From rmarchenko at openjdk.org Fri Jun 16 13:20:39 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Fri, 16 Jun 2023 13:20:39 GMT Subject: [crac] RFR: Refactoring - extracted crac files [v2] In-Reply-To: <9ciP4AawRKr_fIjAGILl5CU_PDGZGL-jxqDKMaMQSGs=.28129bfd-9515-4afc-ac03-3cb8b192f398@github.com> References: <9ciP4AawRKr_fIjAGILl5CU_PDGZGL-jxqDKMaMQSGs=.28129bfd-9515-4afc-ac03-3cb8b192f398@github.com> Message-ID: <8z4QkradjxKPFRRIbYDciN_rx_59LeiwNmUP2pKtGiM=.701651a0-b29a-435f-bfe5-4b32ceeb2b60@github.com> > CRaC-related functionality is moved to `crac*.hpp/cpp` files, and now it is in `crac` class instead of `os`. 
Roman Marchenko has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: - Merge branch 'crac' into crac-extract - Getting rid of crac::Linux - Removing trailing spaces - Refactoring - extracted crac* files ------------- Changes: - all: https://git.openjdk.org/crac/pull/84/files - new: https://git.openjdk.org/crac/pull/84/files/7b169d95..e3c180f7 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=84&range=01 - incr: https://webrevs.openjdk.org/?repo=crac&pr=84&range=00-01 Stats: 64 lines in 8 files changed: 9 ins; 39 del; 16 mod Patch: https://git.openjdk.org/crac/pull/84.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/84/head:pull/84 PR: https://git.openjdk.org/crac/pull/84 From akozlov at openjdk.org Fri Jun 16 13:36:31 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 16 Jun 2023 13:36:31 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v6] In-Reply-To: References: Message-ID: On Thu, 15 Jun 2023 11:27:43 GMT, Radim Vansa wrote: >> When the application does not close some file descriptors through Resources we can use `jdk.crac.file-policy.checkpoint`, `jdk.crac.file-policy.restore`, `jdk.crac.socket-policy.checkpoint` and `jdk.crac.socket-policy.restore` to configure the behaviour. >> >> These properties can specify a list of semicolon-separated key=value pairs, where the key can be one of: >> >> * numeric file descriptor >> * for `file-policy`, path using 'glob' pattern matching (see FileSystem.getPathMatcher() for details) - this matches named pipes, too >> * for `file-policy` keyword `FIFO` matching all pipes (including anonymous) >> * for `socket-policy` a `,` pair, with the `` part being optional.
Both `` and `` can be unix socket path, IPv4/IPv6 address with optional colon and port number or wildcard `*` replacing any of those parts. >> >> The possible values are in OpenFilePolicies.BeforeCheckpoint, OpenFilePolicies.AfterRestore, OpenSocketPolicies.BeforeCheckpoint and OpenSocketPolicies.AfterRestore enums. > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > cleanup Some random notes from first reading, I'll do another pass. This is not a request to change anything. src/java.base/share/classes/java/io/FileInputStream.java line 539: > 537: } > 538: }; > 539: Providing the full range of policies for FileInputStreams seems inherently unsafe, e.g. *_AT_END. I also think we need to distinguish FileInputStreams created with a path from those created with an fd. src/java.base/share/classes/java/io/FileOutputStream.java line 470: > 468: > 469: @Override > 470: protected Supplier claimException(int fdNum, String path) { Here and everywhere else, please use the platform-independent `FileDescriptor`. src/java.base/share/classes/java/io/RandomAccessFile.java line 89: > 87: return fd; > 88: } > 89: }; The path known to RandomAccessFile and the one read from the OS may differ if the former is a symlink. So the RandomAccessFile may need to override getPath, getOffset, etc. Apparently, RandomAccessFile doesn't need operating system introspection at all. src/java.base/share/classes/java/lang/ProcessBuilder.java line 715: > 713: final FileDescriptor fd; > 714: @SuppressWarnings("unused") > 715: final JDKFileResource resource = new JDKFileResource(this) { Should a pipe really be handled as a File?
src/java.base/share/classes/java/net/DatagramSocketImpl.java line 79: > 77: // We don't know the protocol family when this socket is created and FD allocated, but it's not UNIX > 78: @SuppressWarnings("unused") > 79: private final JDKSocketResource resource = new JDKSocketResource(this, StandardProtocolFamily.INET, () -> fd); Is there family notion on java level? Having this comment, it seems we may stop considering family. src/java.base/share/classes/jdk/crac/impl/OpenFilePolicies.java line 220: > 218: * and opens the file at the end. > 219: */ > 220: OPEN_OTHER_AT_END, Semantically, these modes are the same, the second is for FD opened in APPEND mode. But having both modes allows changing the FD mode for the application. Such change will break app assumptions about the file descriptor. ------------- PR Review: https://git.openjdk.org/crac/pull/69#pullrequestreview-1474975958 PR Review Comment: https://git.openjdk.org/crac/pull/69#discussion_r1232257310 PR Review Comment: https://git.openjdk.org/crac/pull/69#discussion_r1232240489 PR Review Comment: https://git.openjdk.org/crac/pull/69#discussion_r1232210712 PR Review Comment: https://git.openjdk.org/crac/pull/69#discussion_r1232211126 PR Review Comment: https://git.openjdk.org/crac/pull/69#discussion_r1232212645 PR Review Comment: https://git.openjdk.org/crac/pull/69#discussion_r1232237388 From rvansa at openjdk.org Fri Jun 16 13:55:35 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Fri, 16 Jun 2023 13:55:35 GMT Subject: [crac] RFR: X11 CRaC reinitializing on CheckpointRestore [v4] In-Reply-To: References: Message-ID: On Mon, 12 Sep 2022 07:50:57 GMT, Ilya Kuznetsov wrote: >> Allows CRaC to perform a CheckpointRestore operation for applications using GUI (Swing, AWT) and X11 connection. >> >> Resources are registered only if the application uses the GUI. 
The order in which resources are reinitialized matters: Toolkit should be cleared before reference handling for a proper garbage collection, and GraphicsEnvironment after handling for a correct X11 disconnection. Some resources restore lazily. >> >> The `beforeCheckpoint()` operation dispose necessary toolkit and connection resources and disconnects from X11. This allows CRaC to perform a Checkpoint since there is no external connection. >> The `afterRestore()` operations reconnect to X11 and then restore necessary connection and toolkit resources. >> >> Thus, after the Restore operation, we have a clean X11 connection. It is ready to restore the original GUI state. > > Ilya Kuznetsov has updated the pull request incrementally with one additional commit since the last revision: > > Fix newline This PR looks abandoned, and while it involves quite some work it should be focused, and include a demo (test case). Looking forward to any further work on this, but in this state I think it is in no shape for integration, therefore closing. ------------- PR Comment: https://git.openjdk.org/crac/pull/19#issuecomment-1594713129 From rvansa at openjdk.org Fri Jun 16 13:55:36 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Fri, 16 Jun 2023 13:55:36 GMT Subject: [crac] Withdrawn: X11 CRaC reinitializing on CheckpointRestore In-Reply-To: References: Message-ID: On Fri, 8 Apr 2022 12:58:21 GMT, Ilya Kuznetsov wrote: > Allows CRaC to perform a CheckpointRestore operation for applications using GUI (Swing, AWT) and X11 connection. > > Resources are registered only if the application uses the GUI. The order in which resources are reinitialized matters: Toolkit should be cleared before reference handling for a proper garbage collection, and GraphicsEnvironment after handling for a correct X11 disconnection. Some resources restore lazily. > > The `beforeCheckpoint()` operation dispose necessary toolkit and connection resources and disconnects from X11. 
This allows CRaC to perform a Checkpoint since there is no external connection. > The `afterRestore()` operations reconnect to X11 and then restore necessary connection and toolkit resources. > > Thus, after the Restore operation, we have a clean X11 connection. It is ready to restore the original GUI state. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/crac/pull/19 From rvansa at openjdk.org Fri Jun 16 14:01:34 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Fri, 16 Jun 2023 14:01:34 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v2] In-Reply-To: References: <-i0uB8ZW7r54hoKQJ_wODUXNKVkOI5rH7SJTEhSHiDw=.75ebe53a-9081-40c6-911f-048b17e8850e@github.com> Message-ID: <-YgYSPpv8Z1ziwegtg1BN-nFMTzw17hTvRUIO0sDuuI=.bf7214a0-8ae6-460c-a5b4-ebf9d1147a2a@github.com> On Thu, 13 Apr 2023 15:45:50 GMT, Anton Kozlov wrote: >> @AntonKozlov >> >>> Crac-criu does not use restore timens [1] since once a bug in kernel or criu caused timedwait to return immediately every time it was called after restore. I don't remember the bug exactly (already fixed), but I believe it was discussed on this mailing list >> >> https://github.com/CRaC/criu/commit/1cb2f4a518a4ae471a1df7a9b540203c1efaf1ba commit is dated July 14, 2020, but the earliest crac-dev mailing list archive is from Sept 2021. Is there some other mailing list this was discussed on? I am interested in understanding the problem that prompted not using timens in criu. >> Since you mention it was a bug in kernel or criu and it has been almost 3 years since your commit, maybe it is worth enabling the criu changes again to see if the timedwait problem still exists, unless you have already done that. >> >>> In general, we should not depend on very obscure Linux abilities, as this reduces the chances we'd be able to run on something other than Linux. >> >> I don't think timens can be put in the category of obscure linux ability.
It has even made its way into container runtime spec: https://github.com/opencontainers/runtime-spec/pull/1151. > >> Since you mention it was a bug in kernel or criu and it has been almost 3 years since your commit, may be it is worth enabling the criu changes again to see if the timedwait problem still exists, unless you have already done that. > > AFAIK the bug is fixed, but I see no point of relying on OS here. Is there one? Timens that is not changed by CRIU provides correct values for our nanoTime() [1]. > >> The value returned represents nanoseconds since some fixed but arbitrary origin time (perhaps in the future, so values may be negative). The same origin is used by all invocations of this method in an instance of a Java virtual machine > > [1] https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/System.html#nanoTime() Hi @AntonKozlov was this only off your radar or is there anything that still needs consideration? ------------- PR Comment: https://git.openjdk.org/crac/pull/53#issuecomment-1594721393 From rvansa at openjdk.org Fri Jun 16 14:12:30 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Fri, 16 Jun 2023 14:12:30 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v6] In-Reply-To: References: Message-ID: On Fri, 16 Jun 2023 12:50:55 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> cleanup > > src/java.base/share/classes/java/lang/ProcessBuilder.java line 715: > >> 713: final FileDescriptor fd; >> 714: @SuppressWarnings("unused") >> 715: final JDKFileResource resource = new JDKFileResource(this) { > > Should really a pipe be handled as a File? I initially removed pipes from OpenFilePolicies and then returned that back in. Java handles pipes (like standard IO) using FileInputStream/FileOutputStream anyway, `Process.PipeInputStream` extends `FileInputStream` etc. 
From the streaming API access to named pipes is the same as accessing a file. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/69#discussion_r1232306034 From rvansa at openjdk.org Fri Jun 16 14:16:29 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Fri, 16 Jun 2023 14:16:29 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v6] In-Reply-To: References: Message-ID: On Fri, 16 Jun 2023 12:52:26 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> cleanup > > src/java.base/share/classes/java/net/DatagramSocketImpl.java line 79: > >> 77: // We don't know the protocol family when this socket is created and FD allocated, but it's not UNIX >> 78: @SuppressWarnings("unused") >> 79: private final JDKSocketResource resource = new JDKSocketResource(this, StandardProtocolFamily.INET, () -> fd); > > Is there family notion on java level? Having this comment, it seems we may stop considering family. The family here is used to distinguish network sockets vs. unix sockets; IPv4 and IPv6 is handled identically. I have considered making the arg just boolean, but using `family` seemed more future proof. POSIX defines many more types... In fact the IPv4/IPv6 is decided based on a static call on IPv6 support, so I could use a static helper method (would have to expose IPv6 availability, `Net.isIPv6Available` is package-private). 
------------- PR Review Comment: https://git.openjdk.org/crac/pull/69#discussion_r1232311357 From rvansa at openjdk.org Fri Jun 16 14:30:31 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Fri, 16 Jun 2023 14:30:31 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v6] In-Reply-To: References: Message-ID: On Fri, 16 Jun 2023 13:14:14 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> cleanup > > src/java.base/share/classes/jdk/crac/impl/OpenFilePolicies.java line 220: > >> 218: * and opens the file at the end. >> 219: */ >> 220: OPEN_OTHER_AT_END, > > Semantically, these modes are the same, the second is for FD opened in APPEND mode. But having both modes allows changing the FD mode for the application. Such change will break app assumptions about the file descriptor. There's many ways that you could break an application invariant if the file is changed/removed when the application thinks it has exclusive write access to it, e.g. it might not expect to write files bigger than X because it streams at most X bytes to that, but if you change the position during reopening you might violate that expectation. These are different options to handle that case: if you think the simplicity of use outweighs the choice I can make OPEN_OTHER use the _AT_END semantics when the file was opened for appending. 
------------- PR Review Comment: https://git.openjdk.org/crac/pull/69#discussion_r1232327454 From rvansa at openjdk.org Fri Jun 16 14:49:43 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Fri, 16 Jun 2023 14:49:43 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v6] In-Reply-To: References: Message-ID: On Fri, 16 Jun 2023 12:50:26 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> cleanup > > src/java.base/share/classes/java/io/RandomAccessFile.java line 89: > >> 87: return fd; >> 88: } >> 89: }; > > The path known to RandomAccessFile and the one read from the OS may differ if the former is a symlink. So the RandomAccessFile may need to override getPath, getOffset, etc. Apparently, RandomAccessFile doesn't need operating system introspection at all. The symlink has a point, from the 'application intention' view. On the other hand, if the file is changed, there's a higher chance that something unexpected will be read. In any case the user has the chance to override the target in policies. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/69#discussion_r1232351685 From rvansa at openjdk.org Fri Jun 16 15:50:11 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Fri, 16 Jun 2023 15:50:11 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v7] In-Reply-To: References: Message-ID: <3U03F-u20OLA58CnCY4Y82QCqLzgdmXbISnc6mKACO4=.2c1a0d06-2e71-4bd6-a9fe-0423326afe22@github.com> > When the application does not close some file descriptors through Resources we can use `jdk.crac.file-policy.checkpoint`, `jdk.crac.file-policy.restore`, `jdk.crac.socket-policy.checkpoint` and `jdk.crac.socket-policy.restore` to configure the behaviour.
> > These properties can specify a list of semicolon-separated key=value pairs, where the key can be one of: > > * numeric file descriptor > * for `file-policy`, path using 'glob' pattern matching (see FileSystem.getPathMatcher() for details) - this matches named pipes, too > * for `file-policy` keyword `FIFO` matching all pipes (including anonymous) > * for `socket-policy` a `,` pair, with the `` part being optional. Both `` and `` can be unix socket path, IPv4/IPv6 address with optional colon and port number or wildcard `*` replacing any of those parts. > > The possible values are in OpenFilePolicies.BeforeCheckpoint, OpenFilePolicies.AfterRestore, OpenSocketPolicies.BeforeCheckpoint and OpenSocketPolicies.AfterRestore enums. Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Don't use numeric FDs, remove _AT_END policies ------------- Changes: - all: https://git.openjdk.org/crac/pull/69/files - new: https://git.openjdk.org/crac/pull/69/files/a5c8a484..69d03522 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=69&range=06 - incr: https://webrevs.openjdk.org/?repo=crac&pr=69&range=05-06 Stats: 439 lines in 10 files changed: 220 ins; 196 del; 23 mod Patch: https://git.openjdk.org/crac/pull/69.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/69/head:pull/69 PR: https://git.openjdk.org/crac/pull/69 From rvansa at openjdk.org Fri Jun 16 15:50:14 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Fri, 16 Jun 2023 15:50:14 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v6] In-Reply-To: References: Message-ID: On Thu, 15 Jun 2023 11:27:43 GMT, Radim Vansa wrote: >> When the application does not close some file descriptors through Resources we can use `jdk.crac.file-policy.checkpoint`, `jdk.crac.file-policy.restore`, `jdk.crac.socket-policy.checkpoint` and `jdk.crac.socket-policy.restore` to configure the behaviour.
>> >> These properties can specify a list of semicolon-separated key=value pairs, where the key can be one of: >> >> * numeric file descriptor >> * for `file-policy`, path using 'glob' pattern matching (see FileSystem.getPathMatcher() for details) - this matches named pipes, too >> * for `file-policy` keyword `FIFO` matching all pipes (including anonymous) >> * for `socket-policy` a `,` pair, with the `` part being optional. Both `` and `` can be unix socket path, IPv4/IPv6 address with optional colon and port number or wildcard `*` replacing any of those parts. >> >> The possible values are in OpenFilePolicies.BeforeCheckpoint, OpenFilePolicies.AfterRestore, OpenSocketPolicies.BeforeCheckpoint and OpenSocketPolicies.AfterRestore enums. > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > cleanup Updated: * not using numeric FDs in claimExceptions * removed the _AT_END policies: I decided that the use is really niche * explicitly prevented creating a file through the policy (added to documentation): If the user really wants a new file he should touch it beforehand. ------------- PR Comment: https://git.openjdk.org/crac/pull/69#issuecomment-1594893753 From akozlov at openjdk.org Mon Jun 19 11:17:40 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Mon, 19 Jun 2023 11:17:40 GMT Subject: [crac] RFR: Refactoring - extracted crac files [v2] In-Reply-To: <8z4QkradjxKPFRRIbYDciN_rx_59LeiwNmUP2pKtGiM=.701651a0-b29a-435f-bfe5-4b32ceeb2b60@github.com> References: <9ciP4AawRKr_fIjAGILl5CU_PDGZGL-jxqDKMaMQSGs=.28129bfd-9515-4afc-ac03-3cb8b192f398@github.com> <8z4QkradjxKPFRRIbYDciN_rx_59LeiwNmUP2pKtGiM=.701651a0-b29a-435f-bfe5-4b32ceeb2b60@github.com> Message-ID: On Fri, 16 Jun 2023 13:20:39 GMT, Roman Marchenko wrote: >> CRaC-related functionality is moved to `crac*.hpp/cpp` files, and now it is in `crac` class instead of `os`. 
> > Roman Marchenko has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Merge branch 'crac' into crac-extract > - Getting rid of crac::Linux > - Removing trailing spaces > - Refactoring - extracted crac* files I would propose "Extract crac functionality into OS-agnostic files" as the title and eventual commit message, or something like that. LGTM in the rest! Thank you. ------------- Marked as reviewed by akozlov (Lead). PR Review: https://git.openjdk.org/crac/pull/84#pullrequestreview-1485989811 From akozlov at openjdk.org Mon Jun 19 11:57:44 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Mon, 19 Jun 2023 11:57:44 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v10] In-Reply-To: References: Message-ID: On Fri, 19 May 2023 10:25:18 GMT, Radim Vansa wrote: >> There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`. When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value. > > Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 23 commits: > > - Merge branch 'crac' into nanotime > - Merge branch 'crac' into nanotime > - More checks when reading boot ID > - Merge branch 'crac' into nanotime > - Do not use negative monotonic offset > - Merge branch 'crac' into nanotime > - Fix whitespaces > - Use image under ghcr.io/crac > - Ensure monotonicity for the same boot > - Set nanotime only if bootid changes > - ... and 13 more: https://git.openjdk.org/crac/compare/ed3efac0...7d7a4103 Sorry, LGTM! 
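The adjustment described in the quoted PR text (record wall clock time before checkpoint and after restore, then shift the monotonic value) can be illustrated roughly as below. This is only a sketch under stated assumptions: the class, method and parameter names are hypothetical, the real change is in HotSpot's C++ `os` code, and details such as the boot ID check from the commit list are not modelled here.

```java
// Hypothetical illustration; not the HotSpot implementation.
final class NanoTimeAdjust {
    /**
     * Computes an offset to add to the raw monotonic clock after restore so that
     * the adjusted value continues from the checkpoint-time reading.
     * All arguments are in nanoseconds; all names are assumptions.
     */
    static long offsetAfterRestore(long monoAtCheckpoint, long wallAtCheckpoint,
                                   long monoAtRestore, long wallAtRestore) {
        // Wall clock time that passed while the image was stored; a wall clock
        // that moved backwards is clamped to zero to keep the result monotonic.
        long elapsed = Math.max(0L, wallAtRestore - wallAtCheckpoint);
        long expected = monoAtCheckpoint + elapsed;
        return expected - monoAtRestore;
    }

    /** Applies a previously computed offset to a raw monotonic reading. */
    static long adjusted(long rawMonotonic, long offset) {
        return rawMonotonic + offset;
    }
}
```

The clamp means the adjusted value never falls below the last reading taken before the checkpoint, which is the property callers of `System.nanoTime()` rely on.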
------------- Marked as reviewed by akozlov (Lead). PR Review: https://git.openjdk.org/crac/pull/53#pullrequestreview-1486055855 From akozlov at openjdk.org Mon Jun 19 13:25:47 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Mon, 19 Jun 2023 13:25:47 GMT Subject: [crac] RFR: Extract crac functionality into OS-agnostic files [v2] In-Reply-To: <8z4QkradjxKPFRRIbYDciN_rx_59LeiwNmUP2pKtGiM=.701651a0-b29a-435f-bfe5-4b32ceeb2b60@github.com> References: <9ciP4AawRKr_fIjAGILl5CU_PDGZGL-jxqDKMaMQSGs=.28129bfd-9515-4afc-ac03-3cb8b192f398@github.com> <8z4QkradjxKPFRRIbYDciN_rx_59LeiwNmUP2pKtGiM=.701651a0-b29a-435f-bfe5-4b32ceeb2b60@github.com> Message-ID: <902rn2g7suwrHTDU_42VlaRXZ6G0e1b6YFOhK6B75J8=.e9147d89-8e52-4f67-927d-3ad803afd7a6@github.com> On Fri, 16 Jun 2023 13:20:39 GMT, Roman Marchenko wrote: >> CRaC-related functionality is moved to `crac*.hpp/cpp` files, and now it is in `crac` class instead of `os`. > > Roman Marchenko has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Merge branch 'crac' into crac-extract > - Getting rid of crac::Linux > - Removing trailing spaces > - Refactoring - extracted crac* files Marked as reviewed by akozlov (Lead). ------------- PR Review: https://git.openjdk.org/crac/pull/84#pullrequestreview-1486229382 From rvansa at openjdk.org Mon Jun 19 14:08:36 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Mon, 19 Jun 2023 14:08:36 GMT Subject: git: openjdk/crac: crac: Correct System.nanotime() value after restore Message-ID: <174ecde0-f49b-4467-9b34-30341a15d029@openjdk.org> Changeset: a8e69ded Author: Radim Vansa Date: 2023-06-19 14:06:51 +0000 URL: https://git.openjdk.org/crac/commit/a8e69dede4654f76916db19ac008f20902d752aa Correct System.nanotime() value after restore Reviewed-by: akozlov ! src/hotspot/os/linux/os_linux.cpp ! 
src/hotspot/os/posix/os_posix.cpp ! src/hotspot/share/runtime/os.cpp ! src/hotspot/share/runtime/os.hpp + test/jdk/jdk/crac/java/lang/System/NanoTimeTest.java ! test/lib/jdk/test/lib/containers/docker/DockerfileConfig.java ! test/lib/jdk/test/lib/crac/CracBuilder.java From rvansa at openjdk.org Mon Jun 19 14:09:51 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Mon, 19 Jun 2023 14:09:51 GMT Subject: [crac] Integrated: Correct System.nanotime() value after restore In-Reply-To: References: Message-ID: <1i2BoqmUaOFSOjl-PPnDPrdykQeGs2B5gvJAbjSE5Es=.c6496cf5-2433-4f60-ad26-543f932de43b@github.com> On Thu, 23 Mar 2023 15:38:35 GMT, Radim Vansa wrote: > There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`. When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value. This pull request has now been integrated. Changeset: a8e69ded Author: Radim Vansa URL: https://git.openjdk.org/crac/commit/a8e69dede4654f76916db19ac008f20902d752aa Stats: 312 lines in 7 files changed: 285 ins; 0 del; 27 mod Correct System.nanotime() value after restore Reviewed-by: akozlov ------------- PR: https://git.openjdk.org/crac/pull/53 From rvansa at openjdk.org Mon Jun 19 14:14:36 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Mon, 19 Jun 2023 14:14:36 GMT Subject: [crac] RFR: Extract crac functionality into OS-agnostic files In-Reply-To: References: <9ciP4AawRKr_fIjAGILl5CU_PDGZGL-jxqDKMaMQSGs=.28129bfd-9515-4afc-ac03-3cb8b192f398@github.com> Message-ID: On Fri, 16 Jun 2023 10:23:06 GMT, Roman Marchenko wrote: >> CRaC-related functionality is moved to `crac*.hpp/cpp` files, and now it is in `crac` class instead of `os`. > > @AntonKozlov @rvansa > Hi, > As we discussed before, I extracted crac* files. 
Hopefully it will help with further java version upgrade and win/mac build. @wkia Integrating #53 regrettably brought a conflict, shouldn't be difficult to resolve, though. Thanks! ------------- PR Comment: https://git.openjdk.org/crac/pull/84#issuecomment-1597265647 From rvansa at openjdk.org Mon Jun 19 14:52:55 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Mon, 19 Jun 2023 14:52:55 GMT Subject: [crac] RFR: Wake up all TIMED_WAITING threads after restore Message-ID: <7_xPH0xZdNTGBMiSBNgJE39bYB8Y41We3es0dOf7hrI=.6067ac3c-90b0-4dd3-9322-c39f652fb2bf@github.com> This is a fix for an issue found by @jankratochvil when testing #53: Threads that enter sleep or timed parking use absolute monotonic time for pthread_cond_wait(). When the monotonic time changes during C/R we need to wake all threads to readjust the timeout to the new absolute time. This introduces effectively a spurious wakeup; this is permitted for all the uses of pthread_timed_wait. Implementation either handles that transparently or propagates the wakeup to Java. This commit does not handle timed waiting in non-Java threads other than WatcherThread.
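The reason a forced wake-up is needed can be shown with a small arithmetic model. This is a sketch under assumptions, not HotSpot code: `Clock`, `make_deadline`, `raw_wait_until` and `remaining` are hypothetical names, and the model assumes the waiter keeps its deadline on an adjusted timeline (raw monotonic time plus a restore offset) and converts it to a raw absolute time each time it waits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the clocks involved (nanoseconds), not HotSpot code.
struct Clock {
    int64_t raw_monotonic;  // what CLOCK_MONOTONIC reports on the current machine
    int64_t offset;         // correction installed after restore (0 before checkpoint)
    int64_t adjusted() const { return raw_monotonic + offset; }
};

// Deadline on the adjusted timeline; this value stays meaningful across C/R.
int64_t make_deadline(const Clock& clock, int64_t timeout) {
    assert(timeout >= 0);
    return clock.adjusted() + timeout;
}

// Absolute raw-monotonic time to hand to a pthread_cond_timedwait()-style call.
// It has to be recomputed after restore because `offset` has changed.
int64_t raw_wait_until(const Clock& clock, int64_t deadline) {
    return deadline - clock.offset;
}

// Time still to wait, clamped at zero.
int64_t remaining(const Clock& clock, int64_t deadline) {
    const int64_t left = deadline - clock.adjusted();
    return left > 0 ? left : 0;
}
```

In this model a thread that started a 500 ns wait at raw time 1000000 and is restored on a machine whose raw clock reads 100 would, without a wake-up, keep waiting for the stale absolute time 1000500, which is 1000400 ns away instead of the 300 ns that actually remain.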
------------- Commit messages: - Wake up all TIMED_WAITING threads after restore Changes: https://git.openjdk.org/crac/pull/85/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=85&range=00 Stats: 236 lines in 2 files changed: 236 ins; 0 del; 0 mod Patch: https://git.openjdk.org/crac/pull/85.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/85/head:pull/85 PR: https://git.openjdk.org/crac/pull/85 From akozlov at openjdk.org Mon Jun 19 15:55:33 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Mon, 19 Jun 2023 15:55:33 GMT Subject: [crac] RFR: Wake up all TIMED_WAITING threads after restore In-Reply-To: <7_xPH0xZdNTGBMiSBNgJE39bYB8Y41We3es0dOf7hrI=.6067ac3c-90b0-4dd3-9322-c39f652fb2bf@github.com> References: <7_xPH0xZdNTGBMiSBNgJE39bYB8Y41We3es0dOf7hrI=.6067ac3c-90b0-4dd3-9322-c39f652fb2bf@github.com> Message-ID: On Mon, 19 Jun 2023 14:46:35 GMT, Radim Vansa wrote: > Threads that enter sleep or timed parking use absolute monotonic time for pthread_cond_wait(). Probably pthread_cond_timedwait()? Shouldn't other *_timedwait calls, like sem_timedwait(), be handled as well? > When the monotonic time changes during C/R we need to wake all threads to readjust the timeout to the new absolute time. I assume this happens in the implementation of pthread_cond_timedwait and not the JVM caller function? Otherwise I can't see the code doing the recalculation. > This introduces effectively a spurious wakeup; this is permitted for all the uses of pthread_timed_wait. > Implementation either handles that transparently or propagates the wakeup to Java. OS-level spurious wake-ups are fine, but propagating that to the Java level seems dangerous. I could not spot an example of the propagation. E.g. can that happen to Thread.sleep()?
src/hotspot/os/linux/os_linux.cpp line 6124: > 6122: assert(list != NULL, "Thread list is null"); > 6123: for (uint i = 0; i < list->length(); ++i) { > 6124: JavaThread* t = list->thread_at(i); Please use Threads::threads_do, which includes non-Java threads as well. src/hotspot/os/linux/os_linux.cpp line 6132: > 6130: t->interrupt(); > 6131: t->osthread()->set_interrupted(false); > 6132: } I think the problem of the native implementation of the timed wait is being fixed on the wrong level of abstraction. The interrupt() is a much higher-level operation than the pthread_cond_signal that we are actually targeting. And the latter would also be much cheaper. Could you replace the code with signaling every cond? ------------- PR Comment: https://git.openjdk.org/crac/pull/85#issuecomment-1597418613 PR Review Comment: https://git.openjdk.org/crac/pull/85#discussion_r1234210999 PR Review Comment: https://git.openjdk.org/crac/pull/85#discussion_r1234235422 From rvansa at openjdk.org Tue Jun 20 06:40:36 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Tue, 20 Jun 2023 06:40:36 GMT Subject: [crac] RFR: Wake up all TIMED_WAITING threads after restore In-Reply-To: References: <7_xPH0xZdNTGBMiSBNgJE39bYB8Y41We3es0dOf7hrI=.6067ac3c-90b0-4dd3-9322-c39f652fb2bf@github.com> Message-ID: On Mon, 19 Jun 2023 15:52:04 GMT, Anton Kozlov wrote: >> This is a fix for an issue found by @jankratochvil when testing #53: Threads that enter sleep or timed parking use absolute monotonic time for pthread_cond_wait(). When the monotonic time changes during C/R we need to wake all threads to readjust the timeout to the new absolute time. >> >> This introduces effectively a spurious wakeup; this is permitted for all the uses of pthread_timed_wait. Implementation either handles that transparently or propagates the wakeup to Java. >> >> This commit does not handle timed waiting in non-Java threads other than WatcherThread.
> > src/hotspot/os/linux/os_linux.cpp line 6132: > >> 6130: t->interrupt(); >> 6131: t->osthread()->set_interrupted(false); >> 6132: } > > I think the problem of the native implementation of the timed wait is being fixed on the wrong level of abstraction. The interrupt() is a much higher-level operation than the pthread_cond_signal that we are actually targeting. And the latter would also be much cheaper. Could you replace the code with signaling every cond? We are not triggering the Java InterruptedException; when you look into the implementation of `JavaThread::interrupt` you'll see that unparking from those 3 types of conditions is all it's actually doing, setting and immediately unsetting the interrupted flag is a noop on posix systems. So I don't think this can be any cheaper. We are operating on a bit higher level only to reuse code that we would otherwise inline in here. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/85#discussion_r1234801665 From rvansa at openjdk.org Tue Jun 20 06:46:36 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Tue, 20 Jun 2023 06:46:36 GMT Subject: [crac] RFR: Wake up all TIMED_WAITING threads after restore In-Reply-To: References: <7_xPH0xZdNTGBMiSBNgJE39bYB8Y41We3es0dOf7hrI=.6067ac3c-90b0-4dd3-9322-c39f652fb2bf@github.com> Message-ID: On Mon, 19 Jun 2023 15:26:37 GMT, Anton Kozlov wrote: >> This is a fix for an issue found by @jankratochvil when testing #53: Threads that enter sleep or timed parking use absolute monotonic time for pthread_cond_timedwait(). When the monotonic time changes during C/R we need to wake all threads to readjust the timeout to the new absolute time. >> >> This introduces effectively a spurious wakeup; this is permitted for all the uses of pthread_cond_timedwait. Implementation either handles that transparently or propagates the wakeup to Java. >> >> This commit does not handle timed waiting in non-Java threads other than WatcherThread.
> > src/hotspot/os/linux/os_linux.cpp line 6124: > >> 6122: assert(list != NULL, "Thread list is null"); >> 6123: for (uint i = 0; i < list->length(); ++i) { >> 6124: JavaThread* t = list->thread_at(i); > > Please use Threads::threads_do, which includes non-Java threads as well. Good point, I'll use `Threads::java_threads_do`. Not sure what I could do with non-Java threads; I haven't found any use of `pthread_cond_timedwait` from non-Java threads except the WatcherThread (and we can't fix non-JVM threads, as we would need to track every condition to signal). ------------- PR Review Comment: https://git.openjdk.org/crac/pull/85#discussion_r1234807274 From rvansa at openjdk.org Tue Jun 20 06:51:39 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Tue, 20 Jun 2023 06:51:39 GMT Subject: [crac] RFR: Wake up all TIMED_WAITING threads after restore In-Reply-To: References: <7_xPH0xZdNTGBMiSBNgJE39bYB8Y41We3es0dOf7hrI=.6067ac3c-90b0-4dd3-9322-c39f652fb2bf@github.com> Message-ID: On Mon, 19 Jun 2023 15:52:59 GMT, Anton Kozlov wrote: > Probably pthread_cond_timedwait()? Shouldn't other *_timedwait calls, like sem_timedwait(), be handled as well? Right, sorry for those typos. `sem_timedwait` is fine since it uses wall-clock time, while we're concerned about monotonic time shifts. > I assume this happens in the implementation of pthread_cond_timedwait and not the JVM caller function? Otherwise I can't see the code doing the recalculation. No, I remember that happened somewhere higher up in the JVM (I have to insert some debug logs to see that), will check it out again.
------------- PR Comment: https://git.openjdk.org/crac/pull/85#issuecomment-1598207941 From rvansa at openjdk.org Tue Jun 20 08:34:46 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Tue, 20 Jun 2023 08:34:46 GMT Subject: [crac] RFR: Wake up all TIMED_WAITING threads after restore [v2] In-Reply-To: <7_xPH0xZdNTGBMiSBNgJE39bYB8Y41We3es0dOf7hrI=.6067ac3c-90b0-4dd3-9322-c39f652fb2bf@github.com> References: <7_xPH0xZdNTGBMiSBNgJE39bYB8Y41We3es0dOf7hrI=.6067ac3c-90b0-4dd3-9322-c39f652fb2bf@github.com> Message-ID: > This is a fix for an issue found by @jankratochvil when testing #53: Threads that enter sleep or timed parking use absolute monotonic time for pthread_cond_timedwait(). When the monotonic time changes during C/R we need to wake all threads to readjust the timeout to the new absolute time. > > This introduces effectively a spurious wakeup; this is permitted for all the uses of pthread_cond_timedwait. Implementation either handles that transparently or propagates the wakeup to Java. > > This commit does not handle timed waiting in non-Java threads other than WatcherThread. 
Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Use Threads::java_threads_do, check spurious timeouts ------------- Changes: - all: https://git.openjdk.org/crac/pull/85/files - new: https://git.openjdk.org/crac/pull/85/files/4bf707b6..bae1c541 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=85&range=01 - incr: https://webrevs.openjdk.org/?repo=crac&pr=85&range=00-01 Stats: 81 lines in 3 files changed: 40 ins; 22 del; 19 mod Patch: https://git.openjdk.org/crac/pull/85.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/85/head:pull/85 PR: https://git.openjdk.org/crac/pull/85 From rvansa at openjdk.org Tue Jun 20 08:38:40 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Tue, 20 Jun 2023 08:38:40 GMT Subject: [crac] RFR: Wake up all TIMED_WAITING threads after restore In-Reply-To: References: <7_xPH0xZdNTGBMiSBNgJE39bYB8Y41We3es0dOf7hrI=.6067ac3c-90b0-4dd3-9322-c39f652fb2bf@github.com> Message-ID: On Mon, 19 Jun 2023 15:52:59 GMT, Anton Kozlov wrote: > OS-level spurious wake-ups are fine, but propagating that to the Java level seems dangerous. I could not spot an example of the propagation. E.g. can that happen to Thread.sleep()? I've added checks for the actual duration to the test (on my machine the C/R takes about 800 ms, but I've tried with a longer wait time to give it a chance to fail) and the methods that are supposed to guarantee waiting (Thread.sleep/join, tryLock...) work correctly. `Object.wait` and `LockSupport.parkUntil` return spuriously after restore, but spurious return is explicitly allowed by the Javadoc. `Condition.await` is allowed to return spuriously, but the implementation tracks the deadline internally and does not return before that.
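The "tracks the deadline internally" behaviour mentioned above can be sketched as a loop that re-checks the clock after every return from the underlying wait. This is a generic illustration, not the JDK implementation: `wait_until_deadline` is a hypothetical name, and the clock and the wait primitive are injected as callables so that the sketch does not depend on real threads or timers.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Waits until `deadline` on the clock read by `now`, tolerating early returns
// from `timed_wait`. Returns how many times it had to call `timed_wait`.
int wait_until_deadline(int64_t deadline,
                        const std::function<int64_t()>& now,
                        const std::function<void(int64_t)>& timed_wait) {
    assert(now && timed_wait);  // both callables must be set
    int waits = 0;
    for (;;) {
        const int64_t left = deadline - now();
        if (left <= 0) {
            return waits;  // deadline reached; only now may the caller proceed
        }
        timed_wait(left);  // may return early, e.g. after a forced wake-up
        ++waits;
    }
}
```

A caller built this way absorbs the extra wake-up after restore, whereas a caller that returns as soon as the wait primitive returns exposes it to the user, which is the difference between the guaranteed-wait methods and the spuriously returning ones described in the message above.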
------------- PR Comment: https://git.openjdk.org/crac/pull/85#issuecomment-1598353809 From rmarchenko at openjdk.org Tue Jun 20 08:59:41 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Tue, 20 Jun 2023 08:59:41 GMT Subject: [crac] RFR: Extract crac functionality into OS-agnostic files [v3] In-Reply-To: <9ciP4AawRKr_fIjAGILl5CU_PDGZGL-jxqDKMaMQSGs=.28129bfd-9515-4afc-ac03-3cb8b192f398@github.com> References: <9ciP4AawRKr_fIjAGILl5CU_PDGZGL-jxqDKMaMQSGs=.28129bfd-9515-4afc-ac03-3cb8b192f398@github.com> Message-ID: > CRaC-related functionality is moved to `crac*.hpp/cpp` files, and now it is in `crac` class instead of `os`. Roman Marchenko has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: - Merge branch 'crac' into crac-extract - Merge branch 'crac' into crac-extract - Getting rid of crac::Linux - Removing trailing spaces - Refactoring - extracted crac* files ------------- Changes: https://git.openjdk.org/crac/pull/84/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=84&range=02 Stats: 2355 lines in 12 files changed: 1209 ins; 1139 del; 7 mod Patch: https://git.openjdk.org/crac/pull/84.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/84/head:pull/84 PR: https://git.openjdk.org/crac/pull/84 From rmarchenko at openjdk.org Tue Jun 20 09:29:56 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Tue, 20 Jun 2023 09:29:56 GMT Subject: [crac] RFR: Extract crac functionality into OS-agnostic files [v4] In-Reply-To: <9ciP4AawRKr_fIjAGILl5CU_PDGZGL-jxqDKMaMQSGs=.28129bfd-9515-4afc-ac03-3cb8b192f398@github.com> References: <9ciP4AawRKr_fIjAGILl5CU_PDGZGL-jxqDKMaMQSGs=.28129bfd-9515-4afc-ac03-3cb8b192f398@github.com> Message-ID: > CRaC-related functionality is moved to `crac*.hpp/cpp` files, and now it is in `crac` class instead of `os`. Roman Marchenko has refreshed the contents of this pull request, and previous commits have been removed. 
The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: Merge branch 'crac' into crac-extract ------------- Changes: - all: https://git.openjdk.org/crac/pull/84/files - new: https://git.openjdk.org/crac/pull/84/files/40db7d60..780d889b Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=84&range=03 - incr: https://webrevs.openjdk.org/?repo=crac&pr=84&range=02-03 Stats: 7 lines in 1 file changed: 0 ins; 0 del; 7 mod Patch: https://git.openjdk.org/crac/pull/84.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/84/head:pull/84 PR: https://git.openjdk.org/crac/pull/84 From akozlov at openjdk.org Tue Jun 20 10:09:39 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 20 Jun 2023 10:09:39 GMT Subject: [crac] RFR: Wake up all TIMED_WAITING threads after restore [v2] In-Reply-To: References: <7_xPH0xZdNTGBMiSBNgJE39bYB8Y41We3es0dOf7hrI=.6067ac3c-90b0-4dd3-9322-c39f652fb2bf@github.com> Message-ID: On Tue, 20 Jun 2023 06:37:33 GMT, Radim Vansa wrote: >> src/hotspot/os/linux/os_linux.cpp line 6132: >> >>> 6130: t->interrupt(); >>> 6131: t->osthread()->set_interrupted(false); >>> 6132: } >> >> I think the problem of the native implementation of the timed wait is being fixed on the wrong level of abstraction. The interrupt() is a much higher-level operation than the pthread_cond_signal that we are actually targeting. And the latter would also be much cheaper. Could you replace the code with signaling every cond? > > We are not triggering the Java InterruptedException; when you look into the implementation of `JavaThread::interrupt` you'll see that unparking from those 3 types of conditions is all it's actually doing, setting and immediately unsetting the interrupted flag is a noop on posix systems. > So I don't think this can be any cheaper. We are operating on a bit higher level only to reuse code that we would otherwise inline in here.
t->interrupt() triggers unparks of two ParkEvents and of a Parker. They do some bookkeeping in addition to just pthread_cond_timedwait(). But OK, I see we need to actually unpark the thread for park() callers to recalculate the abs time. Anyway, Thread::interrupt looks like a hack. AFAICS unpark()s can be called without Java thread state check, which also will be more straightforward. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/85#discussion_r1235042673 From akozlov at openjdk.org Tue Jun 20 11:28:30 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 20 Jun 2023 11:28:30 GMT Subject: [crac] RFR: Wake up all TIMED_WAITING threads after restore [v2] In-Reply-To: References: <7_xPH0xZdNTGBMiSBNgJE39bYB8Y41We3es0dOf7hrI=.6067ac3c-90b0-4dd3-9322-c39f652fb2bf@github.com> Message-ID: <46mAH77gkz1rNnGimPqL6WVeZs8VPOYtvo4xtIQmlKQ=.53aeb205-6ff0-40e5-8a49-27bd62c8b299@github.com> On Tue, 20 Jun 2023 08:34:46 GMT, Radim Vansa wrote: >> This is a fix for an issue found by @jankratochvil when testing #53: Threads that enter sleep or timed parking use absolute monotonic time for pthread_cond_timedwait(). When the monotonic time changes during C/R we need to wake all threads to readjust the timeout to the new absolute time. >> >> This introduces effectively a spurious wakeup; this is permitted for all the uses of pthread_cond_timedwait. Implementation either handles that transparently or propagates the wakeup to Java. >> >> This commit does not handle timed waiting in non-Java threads other than WatcherThread. > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Use Threads::java_threads_do, check spurious timeouts src/hotspot/os/linux/os_linux.cpp line 6121: > 6119: class WakeupClosure: public ThreadClosure { > 6120: void do_thread(Thread* thread) { > 6121: if (thread->is_Java_thread()) { Redundant check? 
Since this is being called from java_threads_do ------------- PR Review Comment: https://git.openjdk.org/crac/pull/85#discussion_r1235043594 From rmarchenko at openjdk.org Tue Jun 20 17:53:00 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Tue, 20 Jun 2023 17:53:00 GMT Subject: [crac] RFR: PID adjustment on checkpoint Message-ID: On restore, there might be PID value conflicts because of small PID values, if it was checkpoint'ed in a container. Therefore, when checkpointing in a container, we need to move PID value for new processes to a particular value to avoid conflicts on restore. See https://github.com/CRaC/example-lambda/blob/master/checkpoint.cmd.sh#L8 for example. This PR contains implemented functionality similar to the example above, making this work out of the box. By default, if checkpointing, PID value for new processes starts from 128. To set a custom value, `CRAC_MIN_PID` environment variable should be used. Min `CRAC_MIN_PID` value is 1, max `CRAC_MIN_PID` is not implemented currently. ------------- Commit messages: - Implemented PID adjustment for checkpoint Changes: https://git.openjdk.org/crac/pull/86/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=86&range=00 Stats: 154 lines in 3 files changed: 144 ins; 0 del; 10 mod Patch: https://git.openjdk.org/crac/pull/86.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/86/head:pull/86 PR: https://git.openjdk.org/crac/pull/86 From rmarchenko at openjdk.org Wed Jun 21 06:07:36 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Wed, 21 Jun 2023 06:07:36 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v2] In-Reply-To: References: Message-ID: > On restore, there might be PID value conflicts because of small PID values, if it was checkpoint'ed in a container. Therefore, when checkpointing in a container, we need to move PID value for new processes to a particular value to avoid conflicts on restore. 
> > See https://github.com/CRaC/example-lambda/blob/master/checkpoint.cmd.sh#L8 for example. > > This PR contains implemented functionality similar to the example above, making this work out of the box. By default, if checkpointing, PID value for new processes starts from 128. > > To set a custom value, `CRAC_MIN_PID` environment variable should be used. > Min `CRAC_MIN_PID` value is 1, max `CRAC_MIN_PID` is not implemented currently. Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: Refactoring ------------- Changes: - all: https://git.openjdk.org/crac/pull/86/files - new: https://git.openjdk.org/crac/pull/86/files/0c4a9ba5..9575111d Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=86&range=01 - incr: https://webrevs.openjdk.org/?repo=crac&pr=86&range=00-01 Stats: 15 lines in 2 files changed: 2 ins; 2 del; 11 mod Patch: https://git.openjdk.org/crac/pull/86.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/86/head:pull/86 PR: https://git.openjdk.org/crac/pull/86 From rmarchenko at openjdk.org Wed Jun 21 06:17:39 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Wed, 21 Jun 2023 06:17:39 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: References: Message-ID: > On restore, there might be PID value conflicts because of small PID values, if it was checkpoint'ed in a container. Therefore, when checkpointing in a container, we need to move PID value for new processes to a particular value to avoid conflicts on restore. > > See https://github.com/CRaC/example-lambda/blob/master/checkpoint.cmd.sh#L8 for example. > > This PR contains implemented functionality similar to the example above, making this work out of the box. By default, if checkpointing, PID value for new processes starts from 128. > > To set a custom value, `CRAC_MIN_PID` environment variable should be used. > Min `CRAC_MIN_PID` value is 1, max `CRAC_MIN_PID` is not implemented currently. 
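As an illustration of the `CRAC_MIN_PID` handling described above, a minimal parsing sketch could look like the following. It is not the code from this PR: `parse_min_pid` and `kDefaultMinPid` are hypothetical names, and falling back to the default for unparsable or out-of-range input is an assumption of this sketch, since the description only states the default of 128 and the minimum of 1.

```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstdlib>

constexpr int kDefaultMinPid = 128;  // default quoted in the PR description

// Parses the text of CRAC_MIN_PID (may be null when the variable is unset).
// Assumption: anything that is not a whole decimal number >= 1 yields the default.
int parse_min_pid(const char* text) {
    int result = kDefaultMinPid;
    if (text != nullptr && *text != '\0') {
        errno = 0;
        char* end = nullptr;
        const long value = std::strtol(text, &end, 10);
        const bool parsed = (errno == 0) && (*end == '\0');
        if (parsed && value >= 1 && value <= INT_MAX) {
            result = static_cast<int>(value);
        }
    }
    assert(result >= 1);  // the documented minimum
    return result;
}
```

A launcher would call it as `parse_min_pid(std::getenv("CRAC_MIN_PID"))` before deciding how far to advance the PID counter.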
Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: Disabling the test for non-linux platforms ------------- Changes: - all: https://git.openjdk.org/crac/pull/86/files - new: https://git.openjdk.org/crac/pull/86/files/9575111d..dcbfa624 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=86&range=02 - incr: https://webrevs.openjdk.org/?repo=crac&pr=86&range=01-02 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/crac/pull/86.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/86/head:pull/86 PR: https://git.openjdk.org/crac/pull/86 From rvansa at openjdk.org Wed Jun 21 08:23:38 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Wed, 21 Jun 2023 08:23:38 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: References: Message-ID: On Wed, 21 Jun 2023 06:17:39 GMT, Roman Marchenko wrote: >> On restore, there might be PID value conflicts because of small PID values, if it was checkpoint'ed in a container. Therefore, when checkpointing in a container, we need to move PID value for new processes to a particular value to avoid conflicts on restore. >> >> See https://github.com/CRaC/example-lambda/blob/master/checkpoint.cmd.sh#L8 for example. >> >> This PR contains implemented functionality similar to the example above, making this work out of the box. By default, if checkpointing, PID value for new processes starts from 128. >> >> To set a custom value, `CRAC_MIN_PID` environment variable should be used. >> Min `CRAC_MIN_PID` value is 1, max `CRAC_MIN_PID` is not implemented currently. 
> > Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: > > Disabling the test for non-linux platforms src/java.base/share/native/launcher/main.c line 191: > 189: FILE *last_pid_file = fopen(last_pid_filename, "w"); > 190: if (!last_pid_file) { > 191: perror("last_pid_file fopen"); Should we have this message printed, even if we can achieve it through spinning? Looks like unnecessary noise when we're still 'fine'. src/java.base/share/native/launcher/main.c line 326: > 324: if (is_checkpoint) { > 325: const int crac_min_pid_default = 128; > 326: const char *env_min_pid_str = getenv("CRAC_MIN_PID"); Why use an environment variable rather than a JVM option? Env vars cannot be 'listed', so the user can't know about them without finding a specific place in the documentation. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1236594123 PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1236597693 From rvansa at openjdk.org Wed Jun 21 08:31:42 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Wed, 21 Jun 2023 08:31:42 GMT Subject: [crac] RFR: Extract crac functionality into OS-agnostic files [v4] In-Reply-To: References: <9ciP4AawRKr_fIjAGILl5CU_PDGZGL-jxqDKMaMQSGs=.28129bfd-9515-4afc-ac03-3cb8b192f398@github.com> Message-ID: On Tue, 20 Jun 2023 09:29:56 GMT, Roman Marchenko wrote: >> CRaC-related functionality is moved to `crac*.hpp/cpp` files, and now it is in `crac` class instead of `os`. > > Roman Marchenko has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > Merge branch 'crac' into crac-extract LGTM, thanks! ------------- Marked as reviewed by rvansa (Committer).
PR Review: https://git.openjdk.org/crac/pull/84#pullrequestreview-1490124073 From rvansa at openjdk.org Wed Jun 21 09:13:34 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Wed, 21 Jun 2023 09:13:34 GMT Subject: [crac] RFR: Wake up all TIMED_WAITING threads after restore [v2] In-Reply-To: References: <7_xPH0xZdNTGBMiSBNgJE39bYB8Y41We3es0dOf7hrI=.6067ac3c-90b0-4dd3-9322-c39f652fb2bf@github.com> Message-ID: On Tue, 20 Jun 2023 10:06:57 GMT, Anton Kozlov wrote: >> We are not triggering the Java InterruptedException; when you look into the implementation of `JavaThread::interrupt` you'll see that unparking from those 3 types of conditions is all it's actually doing, setting and immediately unsetting the interrupted flag is a noop on posix systems. >> So I don't think this can be any cheaper. We are operating on a bit higher level only to reuse code that we would otherwise inline in here. > > t->interrupt() triggers unparks of two ParkEvents and of a Parker. They do some bookkeeping in addition to just pthread_cond_timedwait(). But OK, I see we need to actually unpark the thread for park() callers to recalculate the abs time. Anyway, Thread::interrupt looks like a hack. AFAICS unpark()s can be called without Java thread state check, which also will be more straightforward. The thread state check might be more of an optimization, even though unparking a non-parked thread is a noop, it still means some work that's unnecessary. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/85#discussion_r1236673967 From jkratochvil at openjdk.org Wed Jun 21 09:46:53 2023 From: jkratochvil at openjdk.org (Jan Kratochvil) Date: Wed, 21 Jun 2023 09:46:53 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v36] In-Reply-To: References: Message-ID: > Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. > > 1. 
An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. > 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC > > (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: > >> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine > > It is intended to specify the lowest common denominator of all CPUs in a farm. Instead of a possible crash of OpenJDK it will now refuse to run: > >> Error occurred during initialization of VM >> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi > > (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. 
> > If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there is a precedent ... Jan Kratochvil has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 112 commits: - Merge branch 'crac-altstack' into crac-altstack-cpu-cpuexplicit-strip - Merge branch 'crac' into crac-altstack - Merge branch 'crac' into crac-altstack-cpu-cpuexplicit-strip - Fix slowdebug compilation. Split better get_processor_features_hardware/get_processor_features_hotspot(). - Compatibility with non-X86; untested. - Simplify error reporting by err_msg(). - Fix printing missing features on target CPU. - Fix hotspot 'ht' vs. glibc 'htt'. - CPUFeatures refactorization. Start CPU Features checking without libc. - Reintroduce initialize_processor_count() requiring -XX:+CRaCCPUCountInit. - requested by Anton Kozlov - ... 
and 102 more: https://git.openjdk.org/crac/compare/a8e69ded...39b11c79 ------------- Changes: https://git.openjdk.org/crac/pull/41/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=41&range=35 Stats: 1085 lines in 18 files changed: 1048 ins; 12 del; 25 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41 From jkratochvil at openjdk.org Wed Jun 21 09:56:51 2023 From: jkratochvil at openjdk.org (Jan Kratochvil) Date: Wed, 21 Jun 2023 09:56:51 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v37] In-Reply-To: References: Message-ID: <7nJSXdOYiiooyt_potwnu9H8aBJEcO7NfVRPMSuVaOg=.3aba0f96-7f6e-41d0-8aa1-2a47f9365808@github.com> > Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. > > 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. > 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC > > (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: > >> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine > > It is intended to specify the lowest common denominator of all CPUs in a farm. 
Instead of a possible crash of OpenJDK it will now refuse to run: > >> Error occurred during initialization of VM >> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi > > (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. > > If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there is a precedent ... Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: Rename nonlibc_tty_print_using_features_cr() -> print_using_features_cr(). 
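A note on the numbers in the error message quoted above: the "missing features" value is consistent with a plain bitmask difference between the features stored in the image and the features of the restoring CPU. A minimal C sketch of that arithmetic follows. The helper names `missing_features` and `can_restore` are hypothetical and are not taken from the patch, and the real check in HotSpot may involve more than this.

```c
#include <stdint.h>

/* Bits that are set in the checkpoint image but absent on the restoring CPU. */
uint64_t missing_features(uint64_t image, uint64_t machine) {
    return image & ~machine;
}

/* Under this simplified model a restore is acceptable only when the image
 * relies on no feature that the machine lacks. */
int can_restore(uint64_t image, uint64_t machine) {
    return missing_features(image, machine) == 0;
}
```

With the values from the message, `missing_features(0x7fff9dfcfbf7, 0x421801fcfbd7)` gives `0x3de79c000020`, which is the mask printed before the list of feature names.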
------------- Changes: - all: https://git.openjdk.org/crac/pull/41/files - new: https://git.openjdk.org/crac/pull/41/files/39b11c79..c0b9282d Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=36 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=35-36 Stats: 4 lines in 2 files changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41 From jkratochvil at openjdk.org Wed Jun 21 09:56:59 2023 From: jkratochvil at openjdk.org (Jan Kratochvil) Date: Wed, 21 Jun 2023 09:56:59 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v35] In-Reply-To: References: Message-ID: On Thu, 15 Jun 2023 11:56:15 GMT, Anton Kozlov wrote: >> Jan Kratochvil has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 110 commits: >> >> - Merge branch 'crac' into crac-altstack-cpu-cpuexplicit-strip >> - Fix slowdebug compilation. >> Split better get_processor_features_hardware/get_processor_features_hotspot(). >> - Compatibility with non-X86; untested. >> - Simplify error reporting by err_msg(). >> - Fix printing missing features on target CPU. >> - Fix hotspot 'ht' vs. glibc 'htt'. >> - CPUFeatures refactorization. >> Start CPU Features checking without libc. >> - Reintroduce initialize_processor_count() requiring -XX:+CRaCCPUCountInit. >> - requested by Anton Kozlov >> - Remove initialize_processor_count(). >> - requested by Anton Kozlov >> - it was crashing for me for 4 CPU <-> 16 CPU moves >> - Reintroduce the "leftover" code which was not leftover. >> - ... 
and 100 more: https://git.openjdk.org/crac/compare/a282698d...7c567e99 > > src/hotspot/cpu/x86/vm_version_x86.cpp line 2564: > >> 2562: >> 2563: if (ShowCPUFeatures) >> 2564: nonlibc_tty_print_using_features_cr(); > > Do I understand correctly that at this point all features are checked, and this can be a regular printing function call, involving libc? In fact it was already using a regular printing function (`print_cr()`). It was just a wrongly named function, renamed it to: `print_using_features_cr()` > src/hotspot/cpu/x86/vm_version_x86.cpp line 2608: > >> 2606: >> 2607: if (ShowCPUFeatures) >> 2608: nonlibc_tty_print_using_features_cr(); > > Can be a regular printing function call? In fact it was already using a regular printing function (`print_cr()`). It was just a wrongly named function, renamed it to: `print_using_features_cr()` ------------- PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1236723880 PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1236724418 From rmarchenko at openjdk.org Wed Jun 21 11:46:32 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Wed, 21 Jun 2023 11:46:32 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: References: Message-ID: On Wed, 21 Jun 2023 08:19:58 GMT, Radim Vansa wrote: >> Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: >> >> Disabling the test for non-linux platforms > > src/java.base/share/native/launcher/main.c line 326: > >> 324: if (is_checkpoint) { >> 325: const int crac_min_pid_default = 128; >> 326: const char *env_min_pid_str = getenv("CRAC_MIN_PID"); > > Why using an environment variable rather than JVM option? Env vars cannot be 'listed', so the user can't know about them without finding a specific place in documentation. 
This is because PID adjustment happens before the JVM initializes, so to use a JVM option we would need JVM option parsing implemented in main.c (at least for one option), which is more complicated than what was done in parse_checkpoint(). That is why I've decided to use an env var instead. OTOH maybe it is worth considering moving the PID adjustment & forking implementation to java.c, closer to the JVM_Init call. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1236864745 From rvansa at openjdk.org Wed Jun 21 12:45:14 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Wed, 21 Jun 2023 12:45:14 GMT Subject: [crac] RFR: Wake up all TIMED_WAITING threads after restore [v3] In-Reply-To: <7_xPH0xZdNTGBMiSBNgJE39bYB8Y41We3es0dOf7hrI=.6067ac3c-90b0-4dd3-9322-c39f652fb2bf@github.com> References: <7_xPH0xZdNTGBMiSBNgJE39bYB8Y41We3es0dOf7hrI=.6067ac3c-90b0-4dd3-9322-c39f652fb2bf@github.com> Message-ID: > This is a fix for an issue found by @jankratochvil when testing #53: Threads that enter sleep or timed parking use absolute monotonic time for pthread_cond_timedwait(). When the monotonic time changes during C/R we need to wake all threads to readjust the timeout to the new absolute time. > > This introduces effectively a spurious wakeup; this is permitted for all the uses of pthread_cond_timedwait. Implementation either handles that transparently or propagates the wakeup to Java. > > This commit does not handle timed waiting in non-Java threads other than WatcherThread.
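To illustrate what "readjust the timeout to the new absolute time" means in the description above: a waiter that keeps its timeout as a relative remainder can rebuild the absolute deadline from the current monotonic clock reading after each wakeup. The following is only a sketch under that assumption. `deadline_after` is a hypothetical helper, not a HotSpot function, and how the remaining time is tracked across checkpoint/restore is outside its scope.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NANOS_PER_SEC 1000000000LL

/* Absolute deadline = current monotonic reading + remaining relative timeout.
 * `now` is expected to be normalized (0 <= tv_nsec < 1e9). */
struct timespec deadline_after(struct timespec now, int64_t remaining_ns) {
    assert(remaining_ns >= 0);
    struct timespec abs_time;
    abs_time.tv_sec  = now.tv_sec + (time_t)(remaining_ns / NANOS_PER_SEC);
    abs_time.tv_nsec = now.tv_nsec + (long)(remaining_ns % NANOS_PER_SEC);
    if (abs_time.tv_nsec >= NANOS_PER_SEC) {
        /* Carry the nanosecond overflow into the seconds field. */
        abs_time.tv_sec  += 1;
        abs_time.tv_nsec -= NANOS_PER_SEC;
    }
    return abs_time;
}
```

After the wakeup a waiter would read the monotonic clock again, call `deadline_after()` with whatever timeout is still left, and pass the result to pthread_cond_timedwait() for the next wait.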
Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Inline Thread::interrupt ------------- Changes: - all: https://git.openjdk.org/crac/pull/85/files - new: https://git.openjdk.org/crac/pull/85/files/bae1c541..df9ed092 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=85&range=02 - incr: https://webrevs.openjdk.org/?repo=crac&pr=85&range=01-02 Stats: 16 lines in 2 files changed: 7 ins; 5 del; 4 mod Patch: https://git.openjdk.org/crac/pull/85.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/85/head:pull/85 PR: https://git.openjdk.org/crac/pull/85 From akozlov at openjdk.org Wed Jun 21 13:31:37 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 21 Jun 2023 13:31:37 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: References: Message-ID: On Wed, 21 Jun 2023 08:17:41 GMT, Radim Vansa wrote: >> Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: >> >> Disabling the test for non-linux platforms > > src/java.base/share/native/launcher/main.c line 191: > >> 189: FILE *last_pid_file = fopen(last_pid_filename, "w"); >> 190: if (!last_pid_file) { >> 191: perror("last_pid_file fopen"); > > Should we have this message printed, even if we can achieve it through spinning? Looks like a unnecessary noise when we're still 'fine'. This can be an error, as `ns_last_pid` is expected to exist and be openable. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1236950495 From akozlov at openjdk.org Wed Jun 21 13:31:39 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 21 Jun 2023 13:31:39 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: References: Message-ID: On Wed, 21 Jun 2023 06:17:39 GMT, Roman Marchenko wrote: >> On restore, there might be PID value conflicts because of small PID values, if it was checkpoint'ed in a container. 
Therefore, when checkpointing in a container, we need to move PID value for new processes to a particular value to avoid conflicts on restore. >> >> See https://github.com/CRaC/example-lambda/blob/master/checkpoint.cmd.sh#L8 for example. >> >> This PR contains implemented functionality similar to the example above, making this work out of the box. By default, if checkpointing, PID value for new processes starts from 128. >> >> To set a custom value, `CRAC_MIN_PID` environment variable should be used. >> Min `CRAC_MIN_PID` value is 1, max `CRAC_MIN_PID` is not implemented currently. > > Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: > > Disabling the test for non-linux platforms src/java.base/share/native/launcher/main.c line 194: > 192: return; > 193: } > 194: if (0 > fprintf(last_pid_file, "%d", pid)) { Honestly I'd prefer write() to fprintf, to eliminate any possibility of buffering causing a partial value to be written to the kernel-exposed file. src/java.base/share/native/launcher/main.c line 195: > 193: } > 194: if (0 > fprintf(last_pid_file, "%d", pid)) { > 195: perror("last_pid_file fprintf"); We expect EPERM/EACCES, in these cases there should be no error printed. src/java.base/share/native/launcher/main.c line 201: > 199: > 200: static void spin_last_pid(int pid) { > 201: for (pid_t child = fork(); child < (pid_t)pid; child = fork()) { Although this part is not performance critical, in my previous experiments, fork() was considerably more expensive compared to some minimalistic thread or vfork(). The best I found was `syscall(SYS_clone, SIGCHLD, NULL, NULL, 0)`. Could you check whether that can be made not too ugly? Maybe pthread_create? src/java.base/share/native/launcher/main.c line 209: > 207: perror("waitpid last pid"); > 208: break; > 209: } Can we drop this completely and rely on wait_for_children() below in the control flow?
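The write() suggestion in the first comment above can be sketched as follows: format the number into a local buffer and hand it to the kernel in a single write() call, so no stdio buffering sits between the caller and the file. `write_pid_to` is a hypothetical name and this is not the code from the PR. The sketch returns -1 without printing anything, leaving the decision about reporting to the caller.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write `pid` as decimal text to `path` with one write() call.
 * Returns 0 on success, -1 on any failure (nothing is printed). */
int write_pid_to(const char *path, int pid) {
    assert(path != NULL);
    char buf[32];
    int len = snprintf(buf, sizeof(buf), "%d", pid);
    if (len < 0 || (size_t)len >= sizeof(buf)) {
        return -1;
    }
    int fd = open(path, O_WRONLY);
    if (fd < 0) {
        return -1; /* e.g. the file is missing or not writable */
    }
    ssize_t written = write(fd, buf, (size_t)len);
    close(fd);
    return written == (ssize_t)len ? 0 : -1;
}
```

A caller in the spirit of the reviewed code could try `write_pid_to("/proc/sys/kernel/ns_last_pid", min_pid - 1)` and fall back to spinning when it returns -1.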
src/java.base/share/native/launcher/main.c line 328: > 326: const char *env_min_pid_str = getenv("CRAC_MIN_PID"); > 327: const int env_min_pid = env_min_pid_str ? atoi(env_min_pid_str) : 0; > 328: // TODO: should it be checked for max pid overflow? I don't quite follow the max_pid problem. Could you elaborate? As for me, I don't see the need for a max pid check. src/java.base/share/native/launcher/main.c line 331: > 329: const int crac_min_pid = 0 < env_min_pid ? env_min_pid : crac_min_pid_default; > 330: > 331: if (getpid() <= crac_min_pid) { A nit: probably `getpid() < crac_min_pid`? src/java.base/share/native/launcher/main.c line 339: > 337: spin_last_pid(crac_min_pid); > 338: } > 339: } I think we may drop get_last_pid completely. E.g. it should be enough to if (!set_last_pid(crac_min_pid - 1)) { // set_last_pid reports status spin_last_pid(...) } src/java.base/share/native/launcher/main.c line 344: > 342: // by creating the main process waiting for children before exit. > 343: g_child_pid = fork(); > 344: if (0 < g_child_pid) { Does it make sense to check if `g_child_pid < crac_min_pid`? As ns_last_pid is subject to races, and instead of guessing what the next pid will be, we'll have the ability to explicitly check if the child has the right pid. The child will have to compare its pid with the expected crac_min_pid.
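On the atoi() call and the max-pid TODO in the first snippet quoted in this message: atoi() cannot distinguish malformed input from 0 and applies no upper bound. One possible shape for a stricter parse is sketched below, using strtol() and an explicit range. `parse_min_pid` and its `max_pid` parameter are hypothetical; falling back to the default (rather than failing) on bad input is an assumption made here to match the existing `0 < env_min_pid ? env_min_pid : crac_min_pid_default` logic.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Parse `text` as a decimal PID in [1, max_pid].
 * Returns `fallback` when text is NULL, empty, malformed, or out of range. */
long parse_min_pid(const char *text, long fallback, long max_pid) {
    assert(max_pid >= 1);
    if (text == NULL || *text == '\0') {
        return fallback;
    }
    char *end = NULL;
    errno = 0;
    long value = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0') {
        return fallback; /* overflow or trailing garbage */
    }
    if (value < 1 || value > max_pid) {
        return fallback;
    }
    return value;
}
```

Usage would look like `parse_min_pid(getenv("CRAC_MIN_PID"), 128, some_upper_bound)`, where the upper bound is whatever limit the launcher decides to enforce.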
------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1236978265 PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1236948875 PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1236968013 PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1236968977 PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1236953632 PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1236957141 PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1236956565 PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1236976744 From akozlov at openjdk.org Wed Jun 21 13:31:39 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 21 Jun 2023 13:31:39 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: References: Message-ID: On Wed, 21 Jun 2023 11:43:38 GMT, Roman Marchenko wrote: >> src/java.base/share/native/launcher/main.c line 326: >> >>> 324: if (is_checkpoint) { >>> 325: const int crac_min_pid_default = 128; >>> 326: const char *env_min_pid_str = getenv("CRAC_MIN_PID"); >> >> Why using an environment variable rather than JVM option? Env vars cannot be 'listed', so the user can't know about them without finding a specific place in documentation. > > This is because of PID adjustment happens before JVM inits, so to use JVM option we need to have JVM option parsing implemented in main.c (at least for one option), more complicated than it was done in parse_checkpoint(). That is why I've decided to use env var instead. ~~OTOH maybe it worth considering moving PID adjustment & forking implementation to java.c, closer to JVM_Init call.~~ But no, option parsing is performed in JavaMain, running in a separate thread. This is not convenient for fork'ing. I would prefer -XX option as well. E.g. -XX:NativeMemoryTracking= is also handled in the launcher. 
Although in this case the option won't be handled in JVM at all, I think providing a common interface is better, since the option is expected to be changed in some product use-cases. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1236946959 From rmarchenko at openjdk.org Wed Jun 21 13:31:39 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Wed, 21 Jun 2023 13:31:39 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: References: Message-ID: On Wed, 21 Jun 2023 12:55:33 GMT, Anton Kozlov wrote: >> Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: >> >> Disabling the test for non-linux platforms > > src/java.base/share/native/launcher/main.c line 331: > >> 329: const int crac_min_pid = 0 < env_min_pid ? env_min_pid : crac_min_pid_default; >> 330: >> 331: if (getpid() <= crac_min_pid) { > > A nit: probably `getpid() < crac_min_pid`? This should work for pid==1 as well. The possible values for crac_min_pid are 1, 2, 3, ... In case of `getpid() < crac_min_pid`, it needs to have an additional check. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1237000098 From rmarchenko at openjdk.org Wed Jun 21 13:35:36 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Wed, 21 Jun 2023 13:35:36 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: References: Message-ID: <-C3Y4zUGHNcyD2ZP_YAEwqUBRXNW_h1DSJreGe4yFAI=.107ad82b-20b5-46b8-a4e8-501f6107c13f@github.com> On Wed, 21 Jun 2023 13:04:31 GMT, Anton Kozlov wrote: >> Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: >> >> Disabling the test for non-linux platforms > > src/java.base/share/native/launcher/main.c line 209: > >> 207: perror("waitpid last pid"); >> 208: break; >> 209: } > > Can we drop this completely and rely on wait_for_children() below in the control flow? 
Sure, however, will it be OK if we create hundreds or thousands of child processes at the same time, even if they exit soon? ------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1237010834 From akozlov at openjdk.org Wed Jun 21 13:44:50 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 21 Jun 2023 13:44:50 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v37] In-Reply-To: <7nJSXdOYiiooyt_potwnu9H8aBJEcO7NfVRPMSuVaOg=.3aba0f96-7f6e-41d0-8aa1-2a47f9365808@github.com> References: <7nJSXdOYiiooyt_potwnu9H8aBJEcO7NfVRPMSuVaOg=.3aba0f96-7f6e-41d0-8aa1-2a47f9365808@github.com> Message-ID: On Wed, 21 Jun 2023 09:56:51 GMT, Jan Kratochvil wrote: >> Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. >> >> 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. >> 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC >> >> (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: >> >>> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine >> >> It is intended to specify the lowest common denominator of all CPUs in a farm.
Instead of a possible crash of OpenJDK it will now refuse to run: >> >>> Error occurred during initialization of VM >>> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi >> >> (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. >> >> If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfo... > > Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: > > Rename nonlibc_tty_print_using_features_cr() -> print_using_features_cr(). Marked as reviewed by akozlov (Lead). 
------------- PR Review: https://git.openjdk.org/crac/pull/41#pullrequestreview-1490776465 From rmarchenko at openjdk.org Wed Jun 21 13:46:39 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Wed, 21 Jun 2023 13:46:39 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: References: Message-ID: On Wed, 21 Jun 2023 12:52:51 GMT, Anton Kozlov wrote: >> Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: >> >> Disabling the test for non-linux platforms > > src/java.base/share/native/launcher/main.c line 328: > >> 326: const char *env_min_pid_str = getenv("CRAC_MIN_PID"); >> 327: const int env_min_pid = env_min_pid_str ? atoi(env_min_pid_str) : 0; >> 328: // TODO: should it be checked for max pid overflow? > > I don't quite follow max_pid problem. Could you elaborate? As for me, I don't see the need for max pid check As far as I understand the max PID is not INT_MAX, it is 2^22. In case of CRAC_MIN_PID is greater than 2^22, we'd have a possible infinite loop while spinning the last PID. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1237027167 From rmarchenko at openjdk.org Wed Jun 21 13:53:35 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Wed, 21 Jun 2023 13:53:35 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: References: Message-ID: On Wed, 21 Jun 2023 12:55:05 GMT, Anton Kozlov wrote: >> Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: >> >> Disabling the test for non-linux platforms > > src/java.base/share/native/launcher/main.c line 339: > >> 337: spin_last_pid(crac_min_pid); >> 338: } >> 339: } > > I think we may drop get_last_pid completely. E.g. it should be enough to > > if (!set_last_pid(crac_min_pid - 1)) { // set_last_pid reports status > spin_last_pid(...) 
> } The last PID may be greater than the current PID, so there might be the last PID satisfying CRaC requirements. If so, there is no need to set the last PID at all in such case. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1237037065 From rmarchenko at openjdk.org Wed Jun 21 13:58:40 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Wed, 21 Jun 2023 13:58:40 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: References: Message-ID: On Wed, 21 Jun 2023 12:50:37 GMT, Anton Kozlov wrote: >> src/java.base/share/native/launcher/main.c line 191: >> >>> 189: FILE *last_pid_file = fopen(last_pid_filename, "w"); >>> 190: if (!last_pid_file) { >>> 191: perror("last_pid_file fopen"); >> >> Should we have this message printed, even if we can achieve it through spinning? Looks like a unnecessary noise when we're still 'fine'. > > This can be an error, as `ns_last_pid` is expected to exist and be openable. In non-privileged containers, ns_last_pid cannot be even open for writing. So this message may be considered as a warning, or info, or suppressed at all as Radim suggested. 
------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1237045021 From rmarchenko at openjdk.org Wed Jun 21 13:58:40 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Wed, 21 Jun 2023 13:58:40 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: References: Message-ID: On Wed, 21 Jun 2023 13:03:44 GMT, Anton Kozlov wrote: >> Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: >> >> Disabling the test for non-linux platforms > > src/java.base/share/native/launcher/main.c line 201: > >> 199: >> 200: static void spin_last_pid(int pid) { >> 201: for (pid_t child = fork(); child < (pid_t)pid; child = fork()) { > > Although this part is not peformance critical, in my previous experiments, fork() was considerably more expensive compared to some minimalistic thread or vfork(). The best I found was `syscall(SYS_clone, SIGCHLD, NULL, NULL, 0)`. Could you check can that be made not too ugly? Maybe pthread_create? ok, I will check ------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1237046166 From rmarchenko at openjdk.org Wed Jun 21 14:06:39 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Wed, 21 Jun 2023 14:06:39 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: References: Message-ID: On Wed, 21 Jun 2023 13:10:37 GMT, Anton Kozlov wrote: >> Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: >> >> Disabling the test for non-linux platforms > > src/java.base/share/native/launcher/main.c line 344: > >> 342: // by creating the main process waiting for children before exit. >> 343: g_child_pid = fork(); >> 344: if (0 < g_child_pid) { > > Does it make sense to check if `g_child_pid < crac_min_pid`? 
As the ns_last_pid is a subject for races, and instead of guessing what the next pid will be, we'll have the ability to explicity check if the child has the right pid. The child will have to compare its pid with the expected crac_min_pid. I'm not sure I understand your thoughts. Could you explain what do you mean, please? `if (0 < g_child_pid)` here is just a detector for child/parent code to execute, nothing more. At this line, there is no need for child to check its PID. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1237060402 From rvansa at openjdk.org Wed Jun 21 14:20:36 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Wed, 21 Jun 2023 14:20:36 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: References: Message-ID: On Wed, 21 Jun 2023 13:43:49 GMT, Roman Marchenko wrote: >> src/java.base/share/native/launcher/main.c line 328: >> >>> 326: const char *env_min_pid_str = getenv("CRAC_MIN_PID"); >>> 327: const int env_min_pid = env_min_pid_str ? atoi(env_min_pid_str) : 0; >>> 328: // TODO: should it be checked for max pid overflow? >> >> I don't quite follow max_pid problem. Could you elaborate? As for me, I don't see the need for max pid check > > As far as I understand the max PID is not INT_MAX, it is 2^22. In case of CRAC_MIN_PID is greater than 2^22, we'd have a possible infinite loop while spinning the last PID. I wouldn't test what's in `/proc/sys/kernel/pid_max` explicitly, but if PID #N+1 is lower than PID #N we have probably overflowed and that's what we should detect, and fail gracefully. 
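The wrap-around check suggested in the last message above (a new PID lower than the previous one probably means the PID counter overflowed) could look roughly like this. It is a sketch only: `classify_pid` and `spin_to_min_pid` are hypothetical names, and the loop reaps each child immediately instead of relying on wait_for_children().

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef enum { SPIN_CONTINUE, SPIN_DONE, SPIN_WRAPPED } spin_step;

/* Decide what to do after observing a freshly allocated PID `cur`,
 * given the previously observed PID `prev` (0 if there was none). */
spin_step classify_pid(pid_t prev, pid_t cur, pid_t min_pid) {
    if (cur >= min_pid) {
        return SPIN_DONE;
    }
    if (prev > 0 && cur < prev) {
        return SPIN_WRAPPED; /* PIDs went backwards: counter wrapped */
    }
    return SPIN_CONTINUE;
}

/* Fork short-lived children until a PID >= min_pid has been handed out.
 * Returns 0 on success, -1 on fork failure or detected wrap-around. */
int spin_to_min_pid(pid_t min_pid) {
    pid_t prev = 0;
    for (;;) {
        pid_t child = fork();
        if (child < 0) {
            return -1;
        }
        if (child == 0) {
            _exit(0); /* the child exists only to consume a PID */
        }
        waitpid(child, NULL, 0);
        switch (classify_pid(prev, child, min_pid)) {
        case SPIN_DONE:
            return 0;
        case SPIN_WRAPPED:
            return -1;
        case SPIN_CONTINUE:
            break;
        }
        prev = child;
    }
}
```

Keeping the decision in a separate pure function makes the termination conditions easy to check without forking; what the launcher does on a -1 result (fail or warn) is a separate policy question.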
------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1237083787 From rvansa at openjdk.org Wed Jun 21 14:24:36 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Wed, 21 Jun 2023 14:24:36 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: References: Message-ID: On Wed, 21 Jun 2023 06:17:39 GMT, Roman Marchenko wrote: >> On restore, there might be PID value conflicts because of small PID values, if it was checkpoint'ed in a container. Therefore, when checkpointing in a container, we need to move PID value for new processes to a particular value to avoid conflicts on restore. >> >> See https://github.com/CRaC/example-lambda/blob/master/checkpoint.cmd.sh#L8 for example. >> >> This PR contains implemented functionality similar to the example above, making this work out of the box. By default, if checkpointing, PID value for new processes starts from 128. >> >> To set a custom value, `CRAC_MIN_PID` environment variable should be used. >> Min `CRAC_MIN_PID` value is 1, max `CRAC_MIN_PID` is not implemented currently. > > Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: > > Disabling the test for non-linux platforms I wonder, given your other work, what will be the path to support other platforms? If that's going to be just `#ifdef LINUX` you could already guard platform-specific code with that. ------------- PR Comment: https://git.openjdk.org/crac/pull/86#issuecomment-1600931750 From rmarchenko at openjdk.org Wed Jun 21 14:33:35 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Wed, 21 Jun 2023 14:33:35 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: References: Message-ID: <9O29dOymiI4vImxlX5ziJrX1_8R1fKZuG4Bw5ygTZuc=.48b5cf5d-e86d-4876-a4d4-e60350bcbab3@github.com> On Wed, 21 Jun 2023 14:21:21 GMT, Radim Vansa wrote: > I wonder, given your other work, what will be the path to support other platforms? 
If that's going to be just `#ifdef LINUX` you could already guard platform-specific code with that. Currently it's guarded with JAVAW and WIN32 ifdef's. And you're right, some part of this PR is Linux-specific. ------------- PR Comment: https://git.openjdk.org/crac/pull/86#issuecomment-1600948426 From akozlov at openjdk.org Wed Jun 21 16:41:29 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 21 Jun 2023 16:41:29 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: References: Message-ID: On Wed, 21 Jun 2023 13:55:03 GMT, Roman Marchenko wrote: >> This can be an error, as `ns_last_pid` is expected to exist and be openable. > > In non-privileged containers, ns_last_pid cannot be even open for writing. So this message may be considered as a warning, or info, or suppressed at all as Radim suggested. Oh, of course. In case we cannot even open that, and that is expected, we should not print anything. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1237291645 From akozlov at openjdk.org Wed Jun 21 16:41:29 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 21 Jun 2023 16:41:29 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: References: Message-ID: On Wed, 21 Jun 2023 12:49:43 GMT, Anton Kozlov wrote: >> Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: >> >> Disabling the test for non-linux platforms > > src/java.base/share/native/launcher/main.c line 195: > >> 193: } >> 194: if (0 > fprintf(last_pid_file, "%d", pid)) { >> 195: perror("last_pid_file fprintf"); > > We expect EPERM/EACCES, in these cases there should be no error printed. I was wrong, the error is expected on open(). 
------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1237292350 From akozlov at openjdk.org Wed Jun 21 16:53:30 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 21 Jun 2023 16:53:30 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: <-C3Y4zUGHNcyD2ZP_YAEwqUBRXNW_h1DSJreGe4yFAI=.107ad82b-20b5-46b8-a4e8-501f6107c13f@github.com> References: <-C3Y4zUGHNcyD2ZP_YAEwqUBRXNW_h1DSJreGe4yFAI=.107ad82b-20b5-46b8-a4e8-501f6107c13f@github.com> Message-ID: On Wed, 21 Jun 2023 13:32:28 GMT, Roman Marchenko wrote: >> src/java.base/share/native/launcher/main.c line 209: >> >>> 207: perror("waitpid last pid"); >>> 208: break; >>> 209: } >> >> Can we drop this completely and rely on wait_for_children() below in the control flow? > > Sure, however, will it be OK if we create hundreds or thousands of child processes at the same time, even if they exit soon? Performance-wise, it looks even better to collect them after java has started execution. But you're right, that will create a burst of short living processes. OK, probably this is a safer option. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1237306617 From akozlov at openjdk.org Wed Jun 21 17:01:30 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 21 Jun 2023 17:01:30 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: References: Message-ID: On Wed, 21 Jun 2023 14:18:09 GMT, Radim Vansa wrote: >> As far as I understand the max PID is not INT_MAX, it is 2^22. In case of CRAC_MIN_PID is greater than 2^22, we'd have a possible infinite loop while spinning the last PID. > > I wouldn't test what's in `/proc/sys/kernel/pid_max` explicitly, but if PID #N+1 is lower than PID #N we have probably overflowed and that's what we should detect, and fail gracefully. Got it, in case user has provided some big value. For spin_last_pid, yes, it makes a lot of sense to build some safety, like Radim's suggestion. 
Or to limit the number of attempts e.g. by the value itself (it should not take more than min_pid attempts to get pid>min_pid). ------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1237316064 From akozlov at openjdk.org Wed Jun 21 17:20:32 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 21 Jun 2023 17:20:32 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: References: Message-ID: On Wed, 21 Jun 2023 13:50:18 GMT, Roman Marchenko wrote: >> src/java.base/share/native/launcher/main.c line 339: >> >>> 337: spin_last_pid(crac_min_pid); >>> 338: } >>> 339: } >> >> I think we may drop get_last_pid completely. E.g. it should be enough to >> >> if (!set_last_pid(crac_min_pid - 1)) { // set_last_pid reports status >> spin_last_pid(...) >> } > > The last PID may be greater than the current PID, so there might be the last PID satisfying CRaC requirements. If so, there is no need to set the last PID at all in such case. But to understand that our next pid is fine, we need an extra read before write, and that read alone is of the same cost as the unconditional write. >> src/java.base/share/native/launcher/main.c line 344: >> >>> 342: // by creating the main process waiting for children before exit. >>> 343: g_child_pid = fork(); >>> 344: if (0 < g_child_pid) { >> >> Does it make sense to check if `g_child_pid < crac_min_pid`? As the ns_last_pid is a subject for races, and instead of guessing what the next pid will be, we'll have the ability to explicitly check if the child has the right pid. The child will have to compare its pid with the expected crac_min_pid. > > I'm not sure I understand your thoughts. Could you explain what do you mean, please? > > `if (0 < g_child_pid)` here is just a detector for child/parent code to execute, nothing more. At this line, there is no need for child to check its PID. I mean to check whether the child pid (that is, `g_child_pid`) satisfies our min pid requirements.
Right now, we have to check ns_last_pid and _assume_ what will be the child pid. Instead, we may create the child and based on the pid it gets, decide if the requirement is satisfied, or we need another fork. Although this will require one extra fork, if ns_last_pid is writeable. This actually relates to https://github.com/openjdk/crac/pull/86#discussion_r1236956565. OK, please disregard this. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1237338111 PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1237339338 From rmarchenko at openjdk.org Thu Jun 22 08:43:35 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Thu, 22 Jun 2023 08:43:35 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: References: Message-ID: <2IhE_q4IAWo-POKIZfvrGXcJErW7KOVQqGAznyqCu7U=.7c5ee4b3-3862-44e4-ba9e-f3e3d3a3c434@github.com> On Wed, 21 Jun 2023 16:58:47 GMT, Anton Kozlov wrote: >> I wouldn't test what's in `/proc/sys/kernel/pid_max` explicitly, but if PID #N+1 is lower than PID #N we have probably overflowed and that's what we should detect, and fail gracefully. > > Got it, in case user has provided some big value. For spin_last_pid, yes, it makes a lot of sense to build some safety, like Radim's suggestion. Or to limit number of attempts e.g. by the value itself (it should not take more than min_pid attempts to get pid>min_pid). Should Java fail if PID cannot be moved to a desired PID value? Or just warn and go on?
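The two safety nets suggested in this exchange (detect a PID that goes backwards, and cap the number of attempts) can be combined in one loop. The following is only a minimal sketch under stated assumptions, not the code from the PR: `spin_until_min_pid`, `next_pid_fn`, `enum spin_result` and the fake PID source are hypothetical names introduced for illustration. In a real launcher the callback would create a short-lived child and reap it; here it is abstracted so the loop can be exercised without forking.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical result codes, not taken from the patch. */
enum spin_result { SPIN_REACHED, SPIN_WRAPPED, SPIN_GAVE_UP };

/* Hypothetical callback that returns the PID of one freshly created,
 * short-lived child (fork() + waitpid() in a real launcher), or -1 on error. */
typedef pid_t (*next_pid_fn)(void *ctx);

/* Burns PIDs until one at or above min_pid is seen. Stops early when the
 * sequence goes backwards (wraparound) or after max_attempts children. */
enum spin_result spin_until_min_pid(next_pid_fn next, void *ctx, pid_t min_pid,
                                    long max_attempts, pid_t *reached)
{
    assert(next != NULL && reached != NULL && min_pid > 0);
    pid_t prev = 0;
    for (long i = 0; i < max_attempts; i++) {
        pid_t cur = next(ctx);
        if (cur < 0) {
            return SPIN_GAVE_UP;      /* could not create a child */
        }
        *reached = cur;
        if (cur >= min_pid) {
            return SPIN_REACHED;
        }
        if (cur < prev) {
            return SPIN_WRAPPED;      /* PID #N+1 is lower than PID #N */
        }
        prev = cur;
    }
    return SPIN_GAVE_UP;              /* attempt budget exhausted */
}

/* Fake PID source used only to exercise the loop without forking. */
struct fake_pids { pid_t next; pid_t max; };

pid_t fake_next_pid(void *ctx)
{
    struct fake_pids *f = (struct fake_pids *)ctx;
    pid_t p = f->next;
    f->next = (p >= f->max) ? 2 : p + 1;   /* wrap like a tiny pid_max */
    return p;
}
```

Whether `SPIN_WRAPPED` and `SPIN_GAVE_UP` should abort the launcher or only produce a warning is exactly the open question above; the sketch deliberately leaves that decision to the caller.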
------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1238201039 From rvansa at openjdk.org Thu Jun 22 12:32:41 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Thu, 22 Jun 2023 12:32:41 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: <2IhE_q4IAWo-POKIZfvrGXcJErW7KOVQqGAznyqCu7U=.7c5ee4b3-3862-44e4-ba9e-f3e3d3a3c434@github.com> References: <2IhE_q4IAWo-POKIZfvrGXcJErW7KOVQqGAznyqCu7U=.7c5ee4b3-3862-44e4-ba9e-f3e3d3a3c434@github.com> Message-ID: On Thu, 22 Jun 2023 08:40:27 GMT, Roman Marchenko wrote: >> Got it, in case user has provided some big value. For spin_last_pid, yes, it makes a lot of sense to build some safety, like Radim's suggestion. Or to limit number of attempts e.g. by the value itself (it should not take more than min_pid attempts to get pid>min_pid). > > Should Java fail if PID cannot be moved to a desired PID value? Or just warn and go on? If the value was explicitly set, I think it would be better to fail. When it's trying to get to PID 128 by default I think it is sufficient to warn the user **and** tell him that he could switch off the warning setting `-XX:CRMinPid=1`. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1238459303 From rmarchenko at openjdk.org Thu Jun 22 13:33:39 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Thu, 22 Jun 2023 13:33:39 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: References: <2IhE_q4IAWo-POKIZfvrGXcJErW7KOVQqGAznyqCu7U=.7c5ee4b3-3862-44e4-ba9e-f3e3d3a3c434@github.com> Message-ID: <-4g74w8wHfidpoHvVKOSEaKfT-NSnSqVBN4727duTMs=.80cd3698-ed79-48bc-bcad-5999a12d963a@github.com> On Thu, 22 Jun 2023 12:29:48 GMT, Radim Vansa wrote: >> Should Java fail if PID cannot be moved to a desired PID value? Or just warn and go on? > > If the value was explicitly set, I think it would be better to fail. 
When it's trying to get to PID 128 by default I think it is sufficient to warn the user **and** tell him that he could switch off the warning setting `-XX:CRMinPid=1`. I did some experiments with PID spinning and a desired PID value that exceeds `pid_max`. It takes too long to spin PID until PID overflows. In case of a wrong value set by a user, this may look as if Java hangs, and the user is unlikely to wait long enough to see the error message. This is also true for a valid desired PID value which is pretty big, e.g. 2_000_000. We could remove the `waitpid()` call to speed up PID spinning, but by removing this, we can easily reach the container's resource limits (I tested this), so we cannot remove `waitpid` easily. To avoid reading from `kernel/pid_max` and to avoid hanging on PID spinning, we could introduce a maximum number of spin tries, say 10_000. If we reach this limit while spinning PIDs, we'd stop spinning and continue running Java with the currently reached PID. It actually seems doubtful that users want to move PID to 2M starting with PID=1 or 8 in a container. If users have some processes running in their container, on checkpoint they'd adjust desirable PID value in accordance with the state of the container, limited by the max try count we are introducing. This solution seems portable for POSIX-like platforms. Or, since things are becoming so complicated, it'd be easier to read `pid_max`, only for Linux though. Please note that I'm talking about PID spinning, i.e. a case when writing to `ns_last_pid` hasn't worked for some reason. Are there any additional pros/cons?
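For comparison, the Linux-only alternative of reading `pid_max` up front could look roughly like the sketch below. It is an illustration under assumptions, not code from the PR: `parse_pid_limit`, `read_pid_limit` and `min_pid_is_reachable` are hypothetical helper names. The only external fact relied on is that `/proc/sys/kernel/pid_max` holds a decimal value that is one greater than the largest PID the kernel will allocate.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Parses the decimal text of a pid_max-style file.
 * Returns 0 and stores the value on success, -1 on malformed input. */
int parse_pid_limit(const char *text, long *out)
{
    assert(text != NULL && out != NULL);
    char *end = NULL;
    errno = 0;
    long v = strtol(text, &end, 10);
    if (errno != 0 || end == text || v <= 0) {
        return -1;
    }
    while (*end == '\n' || *end == ' ') {
        end++;                        /* tolerate the trailing newline */
    }
    if (*end != '\0') {
        return -1;
    }
    *out = v;
    return 0;
}

/* Reads the limit from `path`; on Linux that would be
 * "/proc/sys/kernel/pid_max". Returns 0 on success, -1 otherwise. */
int read_pid_limit(const char *path, long *out)
{
    char buf[32];
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        return -1;
    }
    char *line = fgets(buf, sizeof buf, f);
    fclose(f);
    return line != NULL ? parse_pid_limit(buf, out) : -1;
}

/* PIDs are allocated strictly below the limit, so a min_pid at or above it
 * can never be reached by spinning. */
int min_pid_is_reachable(long min_pid, long pid_limit)
{
    return min_pid >= 1 && min_pid < pid_limit;
}
```

With such a check the launcher could reject an impossible value immediately instead of spinning until the PID counter wraps, at the cost of a platform-specific code path.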
------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1238530470 From rmarchenko at openjdk.org Thu Jun 22 13:53:41 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Thu, 22 Jun 2023 13:53:41 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: References: Message-ID: On Wed, 21 Jun 2023 13:11:44 GMT, Anton Kozlov wrote: >> Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: >> >> Disabling the test for non-linux platforms > > src/java.base/share/native/launcher/main.c line 194: > >> 192: return; >> 193: } >> 194: if (0 > fprintf(last_pid_file, "%d", pid)) { > > Honestly I'd prefer write() to fprintf, to eliminate any possibility of buffering to write partial value to the kernel-exposed file. It seems like I've just met a case when `fprintf` returns ok (i.e. the correct number of bytes written), but the file content isn't changed. `strace`'ing shows that underlying `write` call returns `EPERM`, which probably wasn't forward to a caller. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1238562535 From rvansa at openjdk.org Thu Jun 22 13:53:41 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Thu, 22 Jun 2023 13:53:41 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: <-4g74w8wHfidpoHvVKOSEaKfT-NSnSqVBN4727duTMs=.80cd3698-ed79-48bc-bcad-5999a12d963a@github.com> References: <2IhE_q4IAWo-POKIZfvrGXcJErW7KOVQqGAznyqCu7U=.7c5ee4b3-3862-44e4-ba9e-f3e3d3a3c434@github.com> <-4g74w8wHfidpoHvVKOSEaKfT-NSnSqVBN4727duTMs=.80cd3698-ed79-48bc-bcad-5999a12d963a@github.com> Message-ID: On Thu, 22 Jun 2023 13:28:28 GMT, Roman Marchenko wrote: >> If the value was explicitly set, I think it would be better to fail. When it's trying to get to PID 128 by default I think it is sufficient to warn the user **and** tell him that he could switch off the warning setting `-XX:CRMinPid=1`. 
> > I did some experiments with PID spinning and a desired PID value that exceeds max_pid. It takes too long to spin PID until PID overflows. In case of a wrong value set by an user, this may seem like java hangs, so the user cannot wait so long to see the error message. This is also true for a valid desired PID value which is pretty big, e.g. 2_000_000. > > We could remove `waitpid()` call to speed up PID spinning, but by removing this, we can easily reach container's resource limits (I tested this), so we cannot remove `waitpid` easily. > > To avoid reading from `kernel/pid_max` and to avoid hanging on PID spinning, we could introduce max number of spin tries, say 10_000. If we reach this limit while spinning PIDs, we'd stop spinning and continue run Java with the currently reached PID. It actually seems doubtful that users want to move PID to 2M starting with PID=1 or 8 in a container. If users have some processes running in their container, on checkpoint they'd adjust desirable PID value in accordance with the state of the container, limited by a max try count we introducing. This solution seems portable for POSIX-like platforms. > > Or, since things're becoming so complicated, it'd be easier to read `pid_max`, only for Linux though. > > Please note that I'm talking about PID spinning, i.e. a case when writing to `ns_last_pid` haven't worked for some reasons. > > Are there any additional pro/cons? I agree that setting min pid to 2M is unlikely to be practical. Spinning fixed number of times is simpler, but if anything like this happens it would make sense to print a warning message (e.g. We are cycling PIDs due to -XX:CRMinPID=200000) after certain elapsed time (3 seconds?) rather than just number of iterations. Then the user can decide to cancel. Please make sure the stream gets flushed (in case this is written to a file or so...). 
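The time-based warning suggested here can be sketched as a small helper that is polled from the spin loop. This is an assumption-laden illustration rather than the PR's implementation: `struct spin_warning`, `spin_warning_poll` and `spin_warning_poll_now` are hypothetical names, the message text is invented, and the 3-second threshold is just the value floated above. The helper uses a monotonic clock so wall-clock adjustments do not affect it, and flushes the stream so the message is visible even when output goes to a file.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() */
#endif
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper state; names are illustrative only. */
struct spin_warning {
    struct timespec start;   /* when spinning began */
    double after_seconds;    /* how long to stay quiet */
    int printed;             /* warn at most once */
};

static double seconds_between(const struct timespec *a, const struct timespec *b)
{
    return (double)(b->tv_sec - a->tv_sec)
         + (double)(b->tv_nsec - a->tv_nsec) / 1e9;
}

/* Prints one warning once `now` is at least after_seconds past the start.
 * Returns 1 if this call printed the warning, 0 otherwise. */
int spin_warning_poll(struct spin_warning *w, const struct timespec *now,
                      long min_pid, FILE *out)
{
    assert(w != NULL && now != NULL && out != NULL);
    if (w->printed || seconds_between(&w->start, now) < w->after_seconds) {
        return 0;
    }
    fprintf(out, "Warning: still cycling PIDs to reach %ld; interrupt to cancel\n",
            min_pid);
    fflush(out);             /* make sure the user sees it right away */
    w->printed = 1;
    return 1;
}

/* Convenience wrapper that samples the monotonic clock itself. */
int spin_warning_poll_now(struct spin_warning *w, long min_pid, FILE *out)
{
    struct timespec now;
    if (clock_gettime(CLOCK_MONOTONIC, &now) != 0) {
        return 0;
    }
    return spin_warning_poll(w, &now, min_pid, out);
}
```

A spin loop would record `start` with `clock_gettime(CLOCK_MONOTONIC, ...)` before the first child is created and call `spin_warning_poll_now()` every iteration (or every few hundred iterations to keep the overhead negligible).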
------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1238559758 From akozlov at openjdk.org Fri Jun 23 12:12:31 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 23 Jun 2023 12:12:31 GMT Subject: [crac] RFR: Wake up all TIMED_WAITING threads after restore [v3] In-Reply-To: References: <7_xPH0xZdNTGBMiSBNgJE39bYB8Y41We3es0dOf7hrI=.6067ac3c-90b0-4dd3-9322-c39f652fb2bf@github.com> Message-ID: On Wed, 21 Jun 2023 09:10:30 GMT, Radim Vansa wrote: >> t->interrupt() triggers unparks of two ParkEvents and of a Parker. They do some bookkeeping in addition to just pthread_cond_timedwait(). But OK, I see we need to actually unpark the thread for park() callers to recalculate the abs time. Anyway, Thread::interrupt looks like a hack. AFAICS unpark()s can be called without Java thread state check, which also will be more straightforward. > > The thread state check might be more of an optimization, even though unparking a non-parked thread is a noop, it still means some work that's unnecessary. https://github.com/openjdk/crac/blob/crac/src/hotspot/share/runtime/osThread.hpp#L39: // The thread states represented by the ThreadState values are platform-specific // and are likely to be only approximate, because most OSes don't give you access // to precise thread state information. // Note: the ThreadState is legacy code and is not correctly implemented. // Uses of ThreadState need to be replaced by the state in the JavaThread. Relying on the ThreadState creates a risk of missing an unpark, which does not pay off, since the operation is performed only once, on restore. It's possible that ThreadState will become more broken over time as the rest of the HotSpot implementation evolves.
------------- PR Review Comment: https://git.openjdk.org/crac/pull/85#discussion_r1239740373 From akozlov at openjdk.org Fri Jun 23 12:23:02 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 23 Jun 2023 12:23:02 GMT Subject: [crac] RFR: Wake up all TIMED_WAITING threads after restore [v3] In-Reply-To: References: <7_xPH0xZdNTGBMiSBNgJE39bYB8Y41We3es0dOf7hrI=.6067ac3c-90b0-4dd3-9322-c39f652fb2bf@github.com> Message-ID: On Wed, 21 Jun 2023 12:45:14 GMT, Radim Vansa wrote: >> This is a fix for an issue found by @jankratochvil when testing #53: Threads that enter sleep or timed parking use absolute monotonic time for pthread_cond_timedwait(). When the monotonic time changes during C/R we need to wake all threads to readjust the timeout to the new absolute time. >> >> This introduces effectively a spurious wakeup; this is permitted for all the uses of pthread_cond_timedwait. Implementation either handles that transparently or propagates the wakeup to Java. >> >> This commit does not handle timed waiting in non-Java threads other than WatcherThread. > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Inline Thread::interrupt src/hotspot/os/linux/os_linux.cpp line 6128: > 6126: > 6127: assert(thread->is_Java_thread(), "must be called from java_threads_do"); > 6128: JavaThread *jt = (JavaThread *) thread; A nit, just to mention: Suggestion: JavaThread* jt = thread->as_Java_thread(); ------------- PR Review Comment: https://git.openjdk.org/crac/pull/85#discussion_r1239749024 From rmarchenko at openjdk.org Fri Jun 23 13:03:47 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Fri, 23 Jun 2023 13:03:47 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v4] In-Reply-To: References: Message-ID: > On restore, there might be PID value conflicts because of small PID values, if it was checkpoint'ed in a container. 
Therefore, when checkpointing in a container, we need to move PID value for new processes to a particular value to avoid conflicts on restore. > > See https://github.com/CRaC/example-lambda/blob/master/checkpoint.cmd.sh#L8 for example. > > This PR contains implemented functionality similar to the example above, making this work out of the box. By default, if checkpointing, PID value for new processes starts from 128. > > To set a custom value, `CRAC_MIN_PID` environment variable should be used. > Min `CRAC_MIN_PID` value is 1, max `CRAC_MIN_PID` is not implemented currently. Roman Marchenko has updated the pull request incrementally with five additional commits since the last revision: - Implemented vm option instead of env var - Added test for the last pid overflow while spinning - Re-worked spin_last_pid procedure - Re-worked set_last_pid procedure - Replacing fork() with clone() for Linux platform ------------- Changes: - all: https://git.openjdk.org/crac/pull/86/files - new: https://git.openjdk.org/crac/pull/86/files/dcbfa624..63eff039 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=86&range=03 - incr: https://webrevs.openjdk.org/?repo=crac&pr=86&range=02-03 Stats: 158 lines in 5 files changed: 105 ins; 12 del; 41 mod Patch: https://git.openjdk.org/crac/pull/86.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/86/head:pull/86 PR: https://git.openjdk.org/crac/pull/86 From rmarchenko at openjdk.org Fri Jun 23 13:03:49 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Fri, 23 Jun 2023 13:03:49 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: References: Message-ID: On Wed, 21 Jun 2023 12:48:18 GMT, Anton Kozlov wrote: >> This is because of PID adjustment happens before JVM inits, so to use JVM option we need to have JVM option parsing implemented in main.c (at least for one option), more complicated than it was done in parse_checkpoint(). That is why I've decided to use env var instead. 
~~OTOH maybe it worth considering moving PID adjustment & forking implementation to java.c, closer to JVM_Init call.~~ But no, option parsing is performed in JavaMain, running in a separate thread. This is not convenient for fork'ing. > > I would prefer -XX option as well. E.g. -XX:NativeMemoryTracking= is also handled in the launcher. Although in this case the option won't be handled in JVM at all, I think providing a common interface is better, since the option is expected to be changed in some product use-cases. Implemented -XX:CRaCMinPid option ------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1239785664 From rmarchenko at openjdk.org Fri Jun 23 13:03:51 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Fri, 23 Jun 2023 13:03:51 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: References: <2IhE_q4IAWo-POKIZfvrGXcJErW7KOVQqGAznyqCu7U=.7c5ee4b3-3862-44e4-ba9e-f3e3d3a3c434@github.com> <-4g74w8wHfidpoHvVKOSEaKfT-NSnSqVBN4727duTMs=.80cd3698-ed79-48bc-bcad-5999a12d963a@github.com> Message-ID: On Thu, 22 Jun 2023 13:48:33 GMT, Radim Vansa wrote: >> I did some experiments with PID spinning and a desired PID value that exceeds max_pid. It takes too long to spin PID until PID overflows. In case of a wrong value set by an user, this may seem like java hangs, so the user cannot wait so long to see the error message. This is also true for a valid desired PID value which is pretty big, e.g. 2_000_000. >> >> We could remove `waitpid()` call to speed up PID spinning, but by removing this, we can easily reach container's resource limits (I tested this), so we cannot remove `waitpid` easily. >> >> To avoid reading from `kernel/pid_max` and to avoid hanging on PID spinning, we could introduce max number of spin tries, say 10_000. If we reach this limit while spinning PIDs, we'd stop spinning and continue run Java with the currently reached PID. 
It actually seems doubtful that users want to move PID to 2M starting with PID=1 or 8 in a container. If users have some processes running in their container, on checkpoint they'd adjust desirable PID value in accordance with the state of the container, limited by a max try count we introducing. This solution seems portable for POSIX-like platforms. >> >> Or, since things're becoming so complicated, it'd be easier to read `pid_max`, only for Linux though. >> >> Please note that I'm talking about PID spinning, i.e. a case when writing to `ns_last_pid` haven't worked for some reasons. >> >> Are there any additional pro/cons? > > I agree that setting min pid to 2M is unlikely to be practical. Spinning fixed number of times is simpler, but if anything like this happens it would make sense to print a warning message (e.g. We are cycling PIDs due to -XX:CRMinPID=200000) after certain elapsed time (3 seconds?) rather than just number of iterations. Then the user can decide to cancel. Please make sure the stream gets flushed (in case this is written to a file or so...). Implemented PID spinning up to 10000 times. The 10000 is hardcoded now, we'd introduce an option if needed in future. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1239787503 From rmarchenko at openjdk.org Fri Jun 23 14:59:53 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Fri, 23 Jun 2023 14:59:53 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v5] In-Reply-To: References: Message-ID: <5ZXVSqB1y_x-MsUQYHSVpfkXSjqy7mbYOtrXOKuuw44=.cbf6f60e-c993-45c2-8e9b-efbe200a1adc@github.com> > On restore, there might be PID value conflicts because of small PID values, if it was checkpoint'ed in a container. Therefore, when checkpointing in a container, we need to move PID value for new processes to a particular value to avoid conflicts on restore. > > See https://github.com/CRaC/example-lambda/blob/master/checkpoint.cmd.sh#L8 for example. 
> > This PR contains implemented functionality similar to the example above, making this work out of the box. By default, if checkpointing, PID value for new processes starts from 128. > > To set a custom value, `CRAC_MIN_PID` environment variable should be used. > Min `CRAC_MIN_PID` value is 1, max `CRAC_MIN_PID` is not implemented currently. Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: Removing unnecessary test code ------------- Changes: - all: https://git.openjdk.org/crac/pull/86/files - new: https://git.openjdk.org/crac/pull/86/files/63eff039..05afa134 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=86&range=04 - incr: https://webrevs.openjdk.org/?repo=crac&pr=86&range=03-04 Stats: 8 lines in 1 file changed: 0 ins; 8 del; 0 mod Patch: https://git.openjdk.org/crac/pull/86.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/86/head:pull/86 PR: https://git.openjdk.org/crac/pull/86 From akozlov at openjdk.org Fri Jun 23 15:58:35 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 23 Jun 2023 15:58:35 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v4] In-Reply-To: References: Message-ID: On Fri, 23 Jun 2023 13:03:47 GMT, Roman Marchenko wrote: >> On restore, there might be PID value conflicts because of small PID values, if it was checkpoint'ed in a container. Therefore, when checkpointing in a container, we need to move PID value for new processes to a particular value to avoid conflicts on restore. >> >> See https://github.com/CRaC/example-lambda/blob/master/checkpoint.cmd.sh#L8 for example. >> >> This PR contains implemented functionality similar to the example above, making this work out of the box. By default, if checkpointing, PID value for new processes starts from 128. >> >> To set a custom value, `CRAC_MIN_PID` environment variable should be used. >> Min `CRAC_MIN_PID` value is 1, max `CRAC_MIN_PID` is not implemented currently. 
> > Roman Marchenko has updated the pull request incrementally with five additional commits since the last revision: > > - Implemented vm option instead of env var > - Added test for the last pid overflow while spinning > - Re-worked spin_last_pid procedure > - Re-worked set_last_pid procedure > - Replacing fork() with clone() for Linux platform src/hotspot/share/runtime/globals.hpp line 2100: > 2098: "Path to image for restore, replaces the initializing VM on success") \ > 2099: \ > 2100: product(ccstr, CRaCMinPid, NULL, RESTORE_SETTABLE, \ `product(int, CRaCMinPid, 128`? Why is it RESTORE_SETTABLE, if the option is ignored on the restore (per the doc below)? src/java.base/share/native/launcher/main.c line 199: > 197: int res = 0; > 198: if (0 > write(last_pid_file, buf, len)) { > 199: res = errno; Nit: Extra spaces `= errno` ------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1239872323 PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1239874641 From akozlov at openjdk.org Fri Jun 23 15:58:35 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 23 Jun 2023 15:58:35 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v5] In-Reply-To: <5ZXVSqB1y_x-MsUQYHSVpfkXSjqy7mbYOtrXOKuuw44=.cbf6f60e-c993-45c2-8e9b-efbe200a1adc@github.com> References: <5ZXVSqB1y_x-MsUQYHSVpfkXSjqy7mbYOtrXOKuuw44=.cbf6f60e-c993-45c2-8e9b-efbe200a1adc@github.com> Message-ID: On Fri, 23 Jun 2023 14:59:53 GMT, Roman Marchenko wrote: >> On restore, there might be PID value conflicts because of small PID values, if it was checkpoint'ed in a container. Therefore, when checkpointing in a container, we need to move PID value for new processes to a particular value to avoid conflicts on restore. >> >> See https://github.com/CRaC/example-lambda/blob/master/checkpoint.cmd.sh#L8 for example. >> >> This PR contains implemented functionality similar to the example above, making this work out of the box. 
By default, if checkpointing, PID value for new processes starts from 128. >> >> To set a custom value, `CRAC_MIN_PID` environment variable should be used. >> Min `CRAC_MIN_PID` value is 1, max `CRAC_MIN_PID` is not implemented currently. > > Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: > > Removing unnecessary test code src/java.base/share/native/launcher/main.c line 359: > 357: } else if (0 != res) { > 358: errno = res; > 359: perror("set_last_pid"); This apparently prints a textual version of the error. Better to avoid modifying the global state: Suggestion: fprintf(stderr, "set_last_pid: %s\n", strerror(res)) ------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1239984260 From jkratochvil at openjdk.org Mon Jun 26 05:04:52 2023 From: jkratochvil at openjdk.org (Jan Kratochvil) Date: Mon, 26 Jun 2023 05:04:52 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v38] In-Reply-To: References: Message-ID: > Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. > > 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. > 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC > > (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: > >> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine > > It is intended to specify the lowest common denominator of all CPUs in a farm. 
Instead of a possible crash of OpenJDK it will now refuse to run: > >> Error occurred during initialization of VM >> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi > > (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. > > If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there is a precedent ... 
Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: Do not remove the altstack fix which got committed as a part of: https://github.com/openjdk/crac/commit/4b0dc2dc9722945579c9772b335a44fa79f1729f It was originally submitted as: https://github.com/openjdk/crac/pull/37 ------------- Changes: - all: https://git.openjdk.org/crac/pull/41/files - new: https://git.openjdk.org/crac/pull/41/files/c0b9282d..0bc9afea Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=37 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=36-37 Stats: 11 lines in 1 file changed: 9 ins; 0 del; 2 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41 From jkratochvil at openjdk.org Mon Jun 26 05:04:55 2023 From: jkratochvil at openjdk.org (Jan Kratochvil) Date: Mon, 26 Jun 2023 05:04:55 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v37] In-Reply-To: <7nJSXdOYiiooyt_potwnu9H8aBJEcO7NfVRPMSuVaOg=.3aba0f96-7f6e-41d0-8aa1-2a47f9365808@github.com> References: <7nJSXdOYiiooyt_potwnu9H8aBJEcO7NfVRPMSuVaOg=.3aba0f96-7f6e-41d0-8aa1-2a47f9365808@github.com> Message-ID: On Wed, 21 Jun 2023 09:56:51 GMT, Jan Kratochvil wrote: >> Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. >> >> 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. >> 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC >> >> (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. 
Therefore there had to be implemented a new OpenJDK option: >> >>> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine >> >> It is intended to specify the lowest common denominator of all CPUs in a farm. Instead of a possible crash of OpenJDK it will now refuse to run: >> >>> Error occurred during initialization of VM >>> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi >> >> (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. >> >> If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfo... 
> > Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: > > Rename nonlibc_tty_print_using_features_cr() -> print_using_features_cr(). It should get a reapproval so that the patch does not remove the `altstack` fix. ------------- PR Comment: https://git.openjdk.org/crac/pull/41#issuecomment-1606637399 From akozlov at openjdk.org Mon Jun 26 09:14:33 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Mon, 26 Jun 2023 09:14:33 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v38] In-Reply-To: References: Message-ID: On Mon, 26 Jun 2023 05:04:52 GMT, Jan Kratochvil wrote: >> Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. >> >> 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. >> 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC >> >> (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: >> >>> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine >> >> It is intended to specify the lowest common denominator of all CPUs in a farm. 
Instead of a possible crash of OpenJDK it will now refuse to run: >> >>> Error occurred during initialization of VM >>> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi >> >> (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. >> >> If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfo... > > Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: > > Do not remove the altstack fix which got committed as a part of: > https://github.com/openjdk/crac/commit/4b0dc2dc9722945579c9772b335a44fa79f1729f > It was originally submitted as: > https://github.com/openjdk/crac/pull/37 Marked as reviewed by akozlov (Lead). 
------------- PR Review: https://git.openjdk.org/crac/pull/41#pullrequestreview-1498031301 From jkratochvil at openjdk.org Mon Jun 26 11:13:41 2023 From: jkratochvil at openjdk.org (Jan Kratochvil) Date: Mon, 26 Jun 2023 11:13:41 GMT Subject: [crac] Integrated: RFC: -XX:CPUFeatures=0xnumber for CPU migration In-Reply-To: References: Message-ID: On Mon, 30 Jan 2023 02:44:23 GMT, Jan Kratochvil wrote: > Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. > > 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. > 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC > > (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: > >> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine > > It is intended to specify the lowest common denominator of all CPUs in a farm. 
Instead of a possible crash of OpenJDK it will now refuse to run: > >> Error occurred during initialization of VM >> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi > > (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. > > If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there is a precedent ... This pull request has now been integrated. 
Changeset: 146c49a9 Author: Jan Kratochvil Committer: Anton Kozlov URL: https://git.openjdk.org/crac/commit/146c49a9001620e8f27082adbaa969f61c85e2c8 Stats: 1074 lines in 17 files changed: 1048 ins; 3 del; 23 mod RFC: -XX:CPUFeatures=0xnumber for CPU migration Reviewed-by: akozlov ------------- PR: https://git.openjdk.org/crac/pull/41 From rmarchenko at openjdk.org Mon Jun 26 16:08:48 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Mon, 26 Jun 2023 16:08:48 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v6] In-Reply-To: References: Message-ID: <4ZYvzZDue5FwMTzT5RdEE9j7p5FPCRZwYBD2d23puqQ=.17d5a1d5-562a-46d1-8c2a-e7a9dabcc3f4@github.com> > On restore, there might be PID value conflicts because of small PID values, if it was checkpoint'ed in a container. Therefore, when checkpointing in a container, we need to move PID value for new processes to a particular value to avoid conflicts on restore. > > See https://github.com/CRaC/example-lambda/blob/master/checkpoint.cmd.sh#L8 for example. > > This PR contains implemented functionality similar to the example above, making this work out of the box. By default, if checkpointing, PID value for new processes starts from 128. > > To set a custom value, `CRAC_MIN_PID` environment variable should be used. > Min `CRAC_MIN_PID` value is 1, max `CRAC_MIN_PID` is not implemented currently. Roman Marchenko has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 13 additional commits since the last revision: - Adapting tests - Fixing JVM parameter definition - Set MaxSpinCount to PID value - Merge branch 'crac' into pid-adjustment - Removing unnecessary test code - Implemented vm option instead of env var - Added test for the last pid overflow while spinning - Re-worked spin_last_pid procedure - Re-worked set_last_pid procedure - Replacing fork() with clone() for Linux platform - ... and 3 more: https://git.openjdk.org/crac/compare/4ff40b58...67dbf0ed ------------- Changes: - all: https://git.openjdk.org/crac/pull/86/files - new: https://git.openjdk.org/crac/pull/86/files/05afa134..67dbf0ed Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=86&range=05 - incr: https://webrevs.openjdk.org/?repo=crac&pr=86&range=04-05 Stats: 1116 lines in 20 files changed: 1074 ins; 8 del; 34 mod Patch: https://git.openjdk.org/crac/pull/86.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/86/head:pull/86 PR: https://git.openjdk.org/crac/pull/86 From akozlov at openjdk.org Mon Jun 26 16:18:45 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Mon, 26 Jun 2023 16:18:45 GMT Subject: [crac] RFR: Selectable Global Context implementation Message-ID: <0CYcbVOeKGGd2yN3N3VqFbCYVRgr8_PxXWdUE4QDDzQ=.5c5630b6-10ec-443f-b3a1-ed5af9336ee0@github.com> A number of problems were caused by switching to BlockingOrderedContext in the JDK code. To prevent the same in the potential users of EA builds, I temporarily switch the default implementation to the previous OrderedContext. The BlockingContext is still available via a command line option. After an EA build and transition of CRaC development to track newer version openjdk/jdk, the default implementation will be changed. To make properties available, it was required to delay Reference resource registration. 
Otherwise, the GlobalContext implementation decision had to be done too early, in the Reference initialization during JDK bootstrapping, before Properties are available. ------------- Commit messages: - WIP - Selectable Context Changes: https://git.openjdk.org/crac/pull/87/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=87&range=00 Stats: 73 lines in 4 files changed: 55 ins; 16 del; 2 mod Patch: https://git.openjdk.org/crac/pull/87.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/87/head:pull/87 PR: https://git.openjdk.org/crac/pull/87 From rmarchenko at openjdk.org Tue Jun 27 08:57:28 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Tue, 27 Jun 2023 08:57:28 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v7] In-Reply-To: References: Message-ID: <_ZVuvPzsTkI1FrkdNwdklBDg5wBC49lVatN5c1iYEiY=.df70edef-002f-4ea3-96ed-aa7699037089@github.com> > On restore, there might be PID value conflicts because of small PID values, if it was checkpoint'ed in a container. Therefore, when checkpointing in a container, we need to move PID value for new processes to a particular value to avoid conflicts on restore. > > See https://github.com/CRaC/example-lambda/blob/master/checkpoint.cmd.sh#L8 for example. > > This PR contains implemented functionality similar to the example above, making this work out of the box. By default, if checkpointing, PID value for new processes starts from 128. > > To set a custom value, `CRAC_MIN_PID` environment variable should be used. > Min `CRAC_MIN_PID` value is 1, max `CRAC_MIN_PID` is not implemented currently. 
Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: Now CracMinPid option must be set explicitly to adjust PID ------------- Changes: - all: https://git.openjdk.org/crac/pull/86/files - new: https://git.openjdk.org/crac/pull/86/files/67dbf0ed..b3d66800 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=86&range=06 - incr: https://webrevs.openjdk.org/?repo=crac&pr=86&range=05-06 Stats: 57 lines in 3 files changed: 2 ins; 19 del; 36 mod Patch: https://git.openjdk.org/crac/pull/86.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/86/head:pull/86 PR: https://git.openjdk.org/crac/pull/86 From rmarchenko at openjdk.org Tue Jun 27 09:04:37 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Tue, 27 Jun 2023 09:04:37 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v3] In-Reply-To: References: Message-ID: On Wed, 21 Jun 2023 13:25:01 GMT, Roman Marchenko wrote: >> src/java.base/share/native/launcher/main.c line 331: >> >>> 329: const int crac_min_pid = 0 < env_min_pid ? env_min_pid : crac_min_pid_default; >>> 330: >>> 331: if (getpid() <= crac_min_pid) { >> >> A nit: probably `getpid() < crac_min_pid`? > > This should work for pid==1 as well. > The possible values for crac_min_pid are 1, 2, 3, ... > In case of `getpid() < crac_min_pid`, it needs to have an additional check. As we discussed this offline, now PID adjustment is not applied by default, it works only if CracMinPid option is set by an user. So 'fork & wait' now is applied for pid==1 or for PID adjustment. 
------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1243389704 From rmarchenko at openjdk.org Tue Jun 27 11:45:56 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Tue, 27 Jun 2023 11:45:56 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v8] In-Reply-To: References: Message-ID: > On restore, there might be PID value conflicts because of small PID values, if it was checkpoint'ed in a container. Therefore, when checkpointing in a container, we need to move PID value for new processes to a particular value to avoid conflicts on restore. > > See https://github.com/CRaC/example-lambda/blob/master/checkpoint.cmd.sh#L8 for example. > > This PR contains implemented functionality similar to the example above, making this work out of the box. By default, if checkpointing, PID is not adjusted. To adjust PID for a checkpoint'ed process, `-XX:CRaCMinPid=` option should be used along with `CRaCCheckpointTo`. Min `CRaCMinPid` value is 1, max `CRaCMinPid` value is `UINT_MAX`, but it is actually limited by OS's pid_max. Roman Marchenko has updated the pull request incrementally with two additional commits since the last revision: - Fixing review comments - Revert "Now CracMinPid option must be set explicitly to adjust PID" This reverts commit b3d66800d6ea441fb86498fdbb229400747eb44f. 
------------- Changes: - all: https://git.openjdk.org/crac/pull/86/files - new: https://git.openjdk.org/crac/pull/86/files/b3d66800..cc527e7e Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=86&range=07 - incr: https://webrevs.openjdk.org/?repo=crac&pr=86&range=06-07 Stats: 39 lines in 3 files changed: 13 ins; 1 del; 25 mod Patch: https://git.openjdk.org/crac/pull/86.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/86/head:pull/86 PR: https://git.openjdk.org/crac/pull/86 From snazarki at openjdk.org Tue Jun 27 12:09:32 2023 From: snazarki at openjdk.org (Sergey Nazarkin) Date: Tue, 27 Jun 2023 12:09:32 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v8] In-Reply-To: References: Message-ID: On Tue, 27 Jun 2023 11:45:56 GMT, Roman Marchenko wrote: >> On restore, there might be PID value conflicts because of small PID values, if it was checkpoint'ed in a container. Therefore, when checkpointing in a container, we need to move PID value for new processes to a particular value to avoid conflicts on restore. >> >> See https://github.com/CRaC/example-lambda/blob/master/checkpoint.cmd.sh#L8 for example. >> >> This PR contains implemented functionality similar to the example above, making this work out of the box. By default, if checkpointing, PID is adjusted only if Java's PID is 1 that means Java is run in a container. To adjust PID manually for a checkpoint'ed process, `-XX:CRaCMinPid=` option should be used along with `CRaCCheckpointTo`. Min `CRaCMinPid` value is 1, max `CRaCMinPid` value is `UINT_MAX`, but it is actually limited by OS's pid_max. > > Roman Marchenko has updated the pull request incrementally with two additional commits since the last revision: > > - Fixing review comments > - Revert "Now CracMinPid option must be set explicitly to adjust PID" > > This reverts commit b3d66800d6ea441fb86498fdbb229400747eb44f. Changes requested by snazarki (no project role). 
src/java.base/share/native/launcher/main.c line 122: > 120: const int len = strlen(checkpoint_arg); > 121: if (0 == strncmp(arg, checkpoint_arg, len)) { > 122: crac_min_pid = atoi(arg + len); atoi is not recommended to use anymore as it returns 0 on error. "It is recommended to instead use the strtol() and strtoul() family of functions in new programs." src/java.base/share/native/launcher/main.c line 195: > 193: } > 194: const char *last_pid_filename = "/proc/sys/kernel/ns_last_pid"; > 195: const int last_pid_file = open(last_pid_filename, O_WRONLY|O_CREAT|O_TRUNC, 0666); O_CREAT looks redundant. And this file requires special capability for the process. Shouldn't we address this in the doc? src/java.base/share/native/launcher/main.c line 200: > 198: } > 199: int res = 0; > 200: if (0 > write(last_pid_file, buf, len)) { I'd compare with len, just to handle all "write" return values ------------- PR Review: https://git.openjdk.org/crac/pull/86#pullrequestreview-1500668309 PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1243612548 PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1243617425 PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1243620761 From rmarchenko at openjdk.org Tue Jun 27 15:09:36 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Tue, 27 Jun 2023 15:09:36 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v8] In-Reply-To: References: Message-ID: <42qYEOgHLI32EWZXTiKJNbctMH6q56xWH6nijMl0hDQ=.67baa9d7-7b41-4259-af85-5cd05969b46c@github.com> On Tue, 27 Jun 2023 11:59:54 GMT, Sergey Nazarkin wrote: >> Roman Marchenko has updated the pull request incrementally with two additional commits since the last revision: >> >> - Fixing review comments >> - Revert "Now CracMinPid option must be set explicitly to adjust PID" >> >> This reverts commit b3d66800d6ea441fb86498fdbb229400747eb44f. 
> > src/java.base/share/native/launcher/main.c line 195: > >> 193: } >> 194: const char *last_pid_filename = "/proc/sys/kernel/ns_last_pid"; >> 195: const int last_pid_file = open(last_pid_filename, O_WRONLY|O_CREAT|O_TRUNC, 0666); > > O_CREAT looks redundant. > And this file requires special capability for the process. Shouldn't we address this in the doc? Agreed. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1243914093 From rmarchenko at openjdk.org Tue Jun 27 15:13:50 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Tue, 27 Jun 2023 15:13:50 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v8] In-Reply-To: References: Message-ID: On Tue, 27 Jun 2023 11:55:18 GMT, Sergey Nazarkin wrote: >> Roman Marchenko has updated the pull request incrementally with two additional commits since the last revision: >> >> - Fixing review comments >> - Revert "Now CracMinPid option must be set explicitly to adjust PID" >> >> This reverts commit b3d66800d6ea441fb86498fdbb229400747eb44f. > > src/java.base/share/native/launcher/main.c line 122: > >> 120: const int len = strlen(checkpoint_arg); >> 121: if (0 == strncmp(arg, checkpoint_arg, len)) { >> 122: crac_min_pid = atoi(arg + len); > > atoi is not recommended to use anymore as it returns 0 on error. > "It is recommended to instead use the strtol() and strtoul() family of functions in new programs." You're right, but in this implementation we don't care about errno. It's enough to get 0 as a parsed wrong value, because CRaCMinPid is introduced as JVM option with the min value =1, so in case of wrong value, JVM prints an error message and stops. 
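For comparison with the `atoi` call discussed above, here is a sketch of what a `strtol`-based parse could look like. `parse_min_pid` is a hypothetical helper, not code from the patch. It returns 0 for any malformed input, so it keeps the property the reply relies on (0 is below the option's minimum of 1, so the JVM rejects it), while also rejecting trailing garbage that `atoi` would silently accept.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper, not part of the patch: parse a decimal PID value.
 * Returns the parsed value, or 0 on any error (empty string, trailing
 * characters, non-positive or out-of-range number). */
long parse_min_pid(const char *s) {
    char *end = NULL;
    errno = 0;
    const long value = strtol(s, &end, 10);
    if (end == s || *end != '\0') {
        return 0;  /* no digits consumed, or junk after the number */
    }
    if (errno == ERANGE || value < 1 || value > INT_MAX) {
        return 0;  /* overflow, or outside the plausible PID range */
    }
    return value;
}
```

Note that `atoi("12abc")` yields 12, whereas this sketch treats the same input as invalid. Whether that stricter behavior is wanted in the launcher is a design choice for the PR author.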
------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1243920877 From rmarchenko at openjdk.org Tue Jun 27 15:26:58 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Tue, 27 Jun 2023 15:26:58 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v9] In-Reply-To: References: Message-ID: > On restore, there might be PID value conflicts because of small PID values, if it was checkpoint'ed in a container. Therefore, when checkpointing in a container, we need to move PID value for new processes to a particular value to avoid conflicts on restore. > > See https://github.com/CRaC/example-lambda/blob/master/checkpoint.cmd.sh#L8 for example. > > This PR contains implemented functionality similar to the example above, making this work out of the box. By default, if checkpointing, PID is adjusted only if Java's PID is 1 that means Java is run in a container. To adjust PID manually for a checkpoint'ed process, `-XX:CRaCMinPid=` option should be used along with `CRaCCheckpointTo`. Min `CRaCMinPid` value is 1, max `CRaCMinPid` value is `UINT_MAX`, but it is actually limited by OS's pid_max. 
>
> There are the following possible scenarios for CRaC running in a container:
>
>     // getpid   CRaCMinPid  | set_last_pid   fork
>     // ------------------------------------------------
>     // 1        -           | yes (default)  yes
>     // 1        1           | no             yes
>     // 1        >1          | yes            yes
>     // >1       -           | no             no
>     // >1       <=getpid    | no             no
>     // >1       getpid<     | yes            yes

Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision:

  Fixing review comments

-------------

Changes:
  - all: https://git.openjdk.org/crac/pull/86/files
  - new: https://git.openjdk.org/crac/pull/86/files/cc527e7e..bae34d3e

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=crac&pr=86&range=08
 - incr: https://webrevs.openjdk.org/?repo=crac&pr=86&range=07-08

  Stats: 46 lines in 1 file changed: 8 ins; 8 del; 30 mod
  Patch: https://git.openjdk.org/crac/pull/86.diff
  Fetch: git fetch https://git.openjdk.org/crac.git pull/86/head:pull/86

PR: https://git.openjdk.org/crac/pull/86

From rmarchenko at openjdk.org Tue Jun 27 15:27:02 2023
From: rmarchenko at openjdk.org (Roman Marchenko)
Date: Tue, 27 Jun 2023 15:27:02 GMT
Subject: [crac] RFR: PID adjustment on checkpoint [v8]
In-Reply-To:
References:
Message-ID:

On Tue, 27 Jun 2023 12:02:55 GMT, Sergey Nazarkin wrote:

>> Roman Marchenko has updated the pull request incrementally with two additional commits since the last revision:
>>
>>  - Fixing review comments
>>  - Revert "Now CracMinPid option must be set explicitly to adjust PID"
>>
>>    This reverts commit b3d66800d6ea441fb86498fdbb229400747eb44f.
> > src/java.base/share/native/launcher/main.c line 200: > >> 198: } >> 199: int res = 0; >> 200: if (0 > write(last_pid_file, buf, len)) { > > I'd compare with len, just to handle all "write" return values Done ------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1243932966 From rmarchenko at openjdk.org Tue Jun 27 16:36:54 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Tue, 27 Jun 2023 16:36:54 GMT Subject: [crac] RFR: Extract crac functionality into OS-agnostic files [v5] In-Reply-To: <9ciP4AawRKr_fIjAGILl5CU_PDGZGL-jxqDKMaMQSGs=.28129bfd-9515-4afc-ac03-3cb8b192f398@github.com> References: <9ciP4AawRKr_fIjAGILl5CU_PDGZGL-jxqDKMaMQSGs=.28129bfd-9515-4afc-ac03-3cb8b192f398@github.com> Message-ID: > CRaC-related functionality is moved to `crac*.hpp/cpp` files, and now it is in `crac` class instead of `os`. Roman Marchenko has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits: - Merge branch 'crac' into crac-extract - Merge branch 'crac' into crac-extract - Merge branch 'crac' into crac-extract - Getting rid of crac::Linux - Removing trailing spaces - Refactoring - extracted crac* files ------------- Changes: https://git.openjdk.org/crac/pull/84/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=84&range=04 Stats: 2377 lines in 12 files changed: 1223 ins; 1147 del; 7 mod Patch: https://git.openjdk.org/crac/pull/84.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/84/head:pull/84 PR: https://git.openjdk.org/crac/pull/84 From rmarchenko at openjdk.org Tue Jun 27 16:36:56 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Tue, 27 Jun 2023 16:36:56 GMT Subject: [crac] RFR: Extract crac functionality into OS-agnostic files [v4] In-Reply-To: References: <9ciP4AawRKr_fIjAGILl5CU_PDGZGL-jxqDKMaMQSGs=.28129bfd-9515-4afc-ac03-3cb8b192f398@github.com> Message-ID: 
<77vhnj67AYL1pQaL3fubJXagbekIEBA1tngqN2CqscM=.e0d927f3-e4b2-4968-bc4d-1134d8435367@github.com> On Tue, 20 Jun 2023 09:29:56 GMT, Roman Marchenko wrote: >> CRaC-related functionality is moved to `crac*.hpp/cpp` files, and now it is in `crac` class instead of `os`. > > Roman Marchenko has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > Merge branch 'crac' into crac-extract @jankratochvil Hi, could you review this regarding your recent changes, please? ------------- PR Comment: https://git.openjdk.org/crac/pull/84#issuecomment-1609866337 From jkratochvil at openjdk.org Wed Jun 28 03:21:35 2023 From: jkratochvil at openjdk.org (Jan Kratochvil) Date: Wed, 28 Jun 2023 03:21:35 GMT Subject: [crac] RFR: Extract crac functionality into OS-agnostic files [v4] In-Reply-To: <77vhnj67AYL1pQaL3fubJXagbekIEBA1tngqN2CqscM=.e0d927f3-e4b2-4968-bc4d-1134d8435367@github.com> References: <9ciP4AawRKr_fIjAGILl5CU_PDGZGL-jxqDKMaMQSGs=.28129bfd-9515-4afc-ac03-3cb8b192f398@github.com> <77vhnj67AYL1pQaL3fubJXagbekIEBA1tngqN2CqscM=.e0d927f3-e4b2-4968-bc4d-1134d8435367@github.com> Message-ID: <0l1c71wmsd4wNwf2ycy6YiYQcLIqnxMjDOY2GFyKaw0=.069e00dc-c8af-43b0-8558-9b44cb322a68@github.com> On Tue, 27 Jun 2023 16:32:41 GMT, Roman Marchenko wrote: > @jankratochvil Hi, could you review this regarding your recent changes, please? 
https://github.com/wkia/crac/pull/1 ------------- PR Comment: https://git.openjdk.org/crac/pull/84#issuecomment-1610624941 From rmarchenko at openjdk.org Wed Jun 28 08:30:07 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Wed, 28 Jun 2023 08:30:07 GMT Subject: [crac] RFR: Extract crac functionality into OS-agnostic files [v6] In-Reply-To: <9ciP4AawRKr_fIjAGILl5CU_PDGZGL-jxqDKMaMQSGs=.28129bfd-9515-4afc-ac03-3cb8b192f398@github.com> References: <9ciP4AawRKr_fIjAGILl5CU_PDGZGL-jxqDKMaMQSGs=.28129bfd-9515-4afc-ac03-3cb8b192f398@github.com> Message-ID: <745VE3Xw5cQ42qgaJcAY8v9Lp9UrY0wSHXr0WrnqOf4=.49c9b597-dc02-4ad8-9949-21b1b847f14c@github.com> > CRaC-related functionality is moved to `crac*.hpp/cpp` files, and now it is in `crac` class instead of `os`. Roman Marchenko has updated the pull request incrementally with two additional commits since the last revision: - Merge pull request #1 from jankratochvil/crac-extract-jan Fix compilation errors: - Fix compilation errors: commit 5d2fe3461534d56b0408788c8d7c1b94f85530c0 ../../src/hotspot/share/runtime/arguments.cpp: In static member function 'static jint Arguments::finalize_vm_init_args(bool)': ../../src/hotspot/share/runtime/arguments.cpp:3235:28: error: 'crac' has not been declared 3235 | if (CRaCCheckpointTo && !crac::prepare_checkpoint()) { | ^~~~ ../../src/hotspot/share/prims/jvm.cpp: In function '_jobjectArray* JVM_Checkpoint(JNIEnv*, jarray, jobjectArray, jboolean, jlong)': ../../src/hotspot/share/prims/jvm.cpp:3853:16: error: 'crac' has not been declared 3853 | Handle ret = crac::checkpoint(fd_arr, obj_arr, dry_run, jcmd_stream, CHECK_NULL); | ^~~~ ../../src/hotspot/share/services/management.cpp: In function 'jlong get_long_attribute(jmmLongAttribute)': ../../src/hotspot/share/services/management.cpp:957:12: error: 'crac' has not been declared 957 | return crac::restore_start_time(); | ^~~~ ../../src/hotspot/share/services/management.cpp:961:21: error: 'crac' has not been declared 961 | jlong ticks 
= crac::uptime_since_restore(); | ^~~~ Remove a no longer used declaration of os::Linux::checkpoint_restore(). ------------- Changes: - all: https://git.openjdk.org/crac/pull/84/files - new: https://git.openjdk.org/crac/pull/84/files/5d2fe346..c49a3a88 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=84&range=05 - incr: https://webrevs.openjdk.org/?repo=crac&pr=84&range=04-05 Stats: 6 lines in 5 files changed: 4 ins; 2 del; 0 mod Patch: https://git.openjdk.org/crac/pull/84.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/84/head:pull/84 PR: https://git.openjdk.org/crac/pull/84 From rmarchenko at openjdk.org Wed Jun 28 08:54:41 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Wed, 28 Jun 2023 08:54:41 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v10] In-Reply-To: References: Message-ID: > On restore, there might be PID value conflicts because of small PID values, if it was checkpoint'ed in a container. Therefore, when checkpointing in a container, we need to move PID value for new processes to a particular value to avoid conflicts on restore. > > See https://github.com/CRaC/example-lambda/blob/master/checkpoint.cmd.sh#L8 for example. > > This PR contains implemented functionality similar to the example above, making this work out of the box. By default, if checkpointing, PID is adjusted only if Java's PID is 1 that means Java is run in a container. To adjust PID manually for a checkpoint'ed process, `-XX:CRaCMinPid=` option should be used along with `CRaCCheckpointTo`. Min `CRaCMinPid` value is 1, max `CRaCMinPid` value is `UINT_MAX`, but it is actually limited by OS's pid_max. 
>
> There are the following possible scenarios for CRaC running in a container:
>
>     // getpid   CRaCMinPid  | set_last_pid   fork
>     // ------------------------------------------------
>     // 1        -           | yes (default)  yes
>     // 1        1           | no             yes
>     // 1        >1          | yes            yes
>     // >1       -           | no             no
>     // >1       <=getpid    | no             no
>     // >1       getpid<     | yes            yes

Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision:

  Added FIXME for further steps

-------------

Changes:
  - all: https://git.openjdk.org/crac/pull/86/files
  - new: https://git.openjdk.org/crac/pull/86/files/bae34d3e..87601a1c

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=crac&pr=86&range=09
 - incr: https://webrevs.openjdk.org/?repo=crac&pr=86&range=08-09

  Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/crac/pull/86.diff
  Fetch: git fetch https://git.openjdk.org/crac.git pull/86/head:pull/86

PR: https://git.openjdk.org/crac/pull/86

From rvansa at openjdk.org Wed Jun 28 11:45:28 2023
From: rvansa at openjdk.org (Radim Vansa)
Date: Wed, 28 Jun 2023 11:45:28 GMT
Subject: [crac] RFR: Selectable Global Context implementation
In-Reply-To: <0CYcbVOeKGGd2yN3N3VqFbCYVRgr8_PxXWdUE4QDDzQ=.5c5630b6-10ec-443f-b3a1-ed5af9336ee0@github.com>
References: <0CYcbVOeKGGd2yN3N3VqFbCYVRgr8_PxXWdUE4QDDzQ=.5c5630b6-10ec-443f-b3a1-ed5af9336ee0@github.com>
Message-ID:

On Mon, 26 Jun 2023 16:12:24 GMT, Anton Kozlov wrote:

> A number of problems were caused by switching to BlockingOrderedContext in the JDK code. To prevent the same in the potential users of EA builds, I temporarily switch the default implementation to the previous OrderedContext. The BlockingContext is still available via a command line option. After an EA build and transition of CRaC development to track newer version openjdk/jdk, the default implementation will be changed.
>
> To make properties available, it was required to delay Reference resource registration.
> Otherwise, the GlobalContext implementation decision had to be done too early, in the Reference initialization during JDK bootstrapping, before Properties are available.

src/java.base/share/classes/jdk/crac/Core.java line 77:

> 75:     private static final Context globalContext = GlobalContext.createGlobalContextImpl();
> 76:
> 77:     private static class ReferenceHandlerResource implements JDKResource {

I wouldn't pollute the `Core` class with domain-specific code; if the 'reliability' of registration is of concern I think it's fine to expose the static `ensureRegistered`/`register` method.
Or, at least refactor this into its own file.

src/java.base/share/classes/jdk/crac/impl/GlobalContext.java line 8:

> 6:
> 7: public class GlobalContext {
> 8:     private static final String GLOBAL_CONTEXT_IMPL_PROP = "jdk.crac.globalContext.impl";

As mentioned in off-GH conversation, I don't think this option would ever be used, and if you are going to change the default I'd rather drop it to keep the code simple.

-------------

PR Review Comment: https://git.openjdk.org/crac/pull/87#discussion_r1245086510
PR Review Comment: https://git.openjdk.org/crac/pull/87#discussion_r1245089177

From akozlov at openjdk.org Wed Jun 28 12:19:46 2023
From: akozlov at openjdk.org (Anton Kozlov)
Date: Wed, 28 Jun 2023 12:19:46 GMT
Subject: [crac] RFR: Fix JVMCI after #41
Message-ID: <-WR30gYJtf2g3rcDrRjciI5hog0ZXrxWK2PYz1OuSyY=.5478c7a5-c20d-4742-a33a-d86fe620d976@github.com>

The recently added CPU_MAX feature went out of sync with JVMCI code [1]. Since this is not a real feature but an auxiliary value, a distinct name for the maximum value solves the issue.

[1] jtreg_test_hotspot_jtreg_tier1_compiler/compiler/jvmci/JVM_GetJVMCIRuntimeTest.jtr:

jdk.vm.ci.common.JVMCIError: Missing CPU feature constants: [MAX]
    at jdk.internal.vm.ci/jdk.vm.ci.hotspot.HotSpotJVMCIBackendFactory.convertFeatures(HotSpotJVMCIBackendFactory.java:79)
    at jdk.internal.vm.ci/jdk.vm.ci.hotspot.amd64.AMD64HotSpotJVMCIBackendFactory.computeFeatures(AMD64HotSpotJVMCIBackendFactory.java:53)
    at jdk.internal.vm.ci/jdk.vm.ci.hotspot.amd64.AMD64HotSpotJVMCIBackendFactory.createTarget(AMD64HotSpotJVMCIBackendFactory.java:74)
    at jdk.internal.vm.ci/jdk.vm.ci.hotspot.amd64.AMD64HotSpotJVMCIBackendFactory.createJVMCIBackend(AMD64HotSpotJVMCIBackendFactory.java:109)
    at jdk.internal.vm.ci/jdk.vm.ci.hotspot.HotSpotJVMCIRuntime.<init>(HotSpotJVMCIRuntime.java:549)
    at jdk.internal.vm.ci/jdk.vm.ci.hotspot.HotSpotJVMCIRuntime.runtime(HotSpotJVMCIRuntime.java:176)
    at jdk.internal.vm.ci/jdk.vm.ci.runtime.JVMCI.initializeRuntime(Native Method)
    at jdk.internal.vm.ci/jdk.vm.ci.runtime.JVMCI.getRuntime(JVMCI.java:65)
    at compiler.jvmci.JVM_GetJVMCIRuntimeTest.run(JVM_GetJVMCIRuntimeTest.java:77)
    at compiler.jvmci.JVM_GetJVMCIRuntimeTest.main(JVM_GetJVMCIRuntimeTest.java:70)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
    at java.base/java.lang.Thread.run(Thread.java:833)

-------------

Commit messages:
 - Fix JVMCI after #41

Changes: https://git.openjdk.org/crac/pull/88/files
 Webrev: https://webrevs.openjdk.org/?repo=crac&pr=88&range=00
  Stats: 12 lines in 2 files changed: 2 ins; 3 del; 7 mod
  Patch: https://git.openjdk.org/crac/pull/88.diff
  Fetch: git fetch https://git.openjdk.org/crac.git pull/88/head:pull/88

PR: https://git.openjdk.org/crac/pull/88

From akozlov at openjdk.org Wed Jun 28 12:36:36 2023
From: akozlov at openjdk.org (Anton Kozlov)
Date: Wed, 28 Jun 2023 12:36:36 GMT
Subject: [crac] RFR: Selectable Global Context implementation
In-Reply-To:
References: <0CYcbVOeKGGd2yN3N3VqFbCYVRgr8_PxXWdUE4QDDzQ=.5c5630b6-10ec-443f-b3a1-ed5af9336ee0@github.com>
Message-ID:

On Wed, 28 Jun 2023 11:39:58 GMT, Radim Vansa wrote:

>> A number of problems were caused by switching to BlockingOrderedContext in the JDK code. To prevent the same in the potential users of EA builds, I temporarily switch the default implementation to the previous OrderedContext. The BlockingContext is still available via a command line option. After an EA build and transition of CRaC development to track newer version openjdk/jdk, the default implementation will be changed.
>>
>> To make properties available, it was required to delay Reference resource registration. Otherwise, the GlobalContext implementation decision had to be done too early, in the Reference initialization during JDK bootstrapping, before Properties are available.
>
> src/java.base/share/classes/jdk/crac/Core.java line 77:
>
>> 75:     private static final Context globalContext = GlobalContext.createGlobalContextImpl();
>> 76:
>> 77:     private static class ReferenceHandlerResource implements JDKResource {
>
> I wouldn't pollute the `Core` class with domain-specific code; if the 'reliability' of registration is of concern I think it's fine to expose the static `ensureRegistered`/`register` method.
> Or, at least refactor this into its own file.

I considered a register() interface. The waitForReferenceProcessing is an internal JDK interface. Reference processing itself does not require special management on checkpoint, but the checkpoint procedure requires that no reference processing is in progress at a certain point.
The class is boilerplate code to call two methods; it is not shared and is unlikely to be, so a separate file is not justified. > src/java.base/share/classes/jdk/crac/impl/GlobalContext.java line 8: > >> 6: >> 7: public class GlobalContext { >> 8: private static final String GLOBAL_CONTEXT_IMPL_PROP = "jdk.crac.globalContext.impl"; > > As mentioned in off-GH conversation, I don't think this option would be ever used, and if you are going to change the default I'd rather drop it to keep the code simple. Since there is no way to employ BlockingContext other than as the global context implementation, I prefer to make it selectable. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/87#discussion_r1245141056 PR Review Comment: https://git.openjdk.org/crac/pull/87#discussion_r1245108228 From rmarchenko at openjdk.org Thu Jun 29 06:19:27 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Thu, 29 Jun 2023 06:19:27 GMT Subject: [crac] RFR: Fix JVMCI after #41 In-Reply-To: <-WR30gYJtf2g3rcDrRjciI5hog0ZXrxWK2PYz1OuSyY=.5478c7a5-c20d-4742-a33a-d86fe620d976@github.com> References: <-WR30gYJtf2g3rcDrRjciI5hog0ZXrxWK2PYz1OuSyY=.5478c7a5-c20d-4742-a33a-d86fe620d976@github.com> Message-ID: On Wed, 28 Jun 2023 12:13:16 GMT, Anton Kozlov wrote: > The recently added CPU_MAX feature went out of sync with JVMCI code [1]. Since this is not a real feature, but an auxiliary value, a distinct name for the maximum value solves the issue.
> > [1] jtreg_test_hotspot_jtreg_tier1_compiler/compiler/jvmci/JVM_GetJVMCIRuntimeTest.jtr: > > jdk.vm.ci.common.JVMCIError: Missing CPU feature constants: [MAX] > at jdk.internal.vm.ci/jdk.vm.ci.hotspot.HotSpotJVMCIBackendFactory.convertFeatures(HotSpotJVMCIBackendFactory.java:79) > at jdk.internal.vm.ci/jdk.vm.ci.hotspot.amd64.AMD64HotSpotJVMCIBackendFactory.computeFeatures(AMD64HotSpotJVMCIBackendFactory.java:53) > at jdk.internal.vm.ci/jdk.vm.ci.hotspot.amd64.AMD64HotSpotJVMCIBackendFactory.createTarget(AMD64HotSpotJVMCIBackendFactory.java:74) > at jdk.internal.vm.ci/jdk.vm.ci.hotspot.amd64.AMD64HotSpotJVMCIBackendFactory.createJVMCIBackend(AMD64HotSpotJVMCIBackendFactory.java:109) > at jdk.internal.vm.ci/jdk.vm.ci.hotspot.HotSpotJVMCIRuntime.(HotSpotJVMCIRuntime.java:549) > at jdk.internal.vm.ci/jdk.vm.ci.hotspot.HotSpotJVMCIRuntime.runtime(HotSpotJVMCIRuntime.java:176) > at jdk.internal.vm.ci/jdk.vm.ci.runtime.JVMCI.initializeRuntime(Native Method) > at jdk.internal.vm.ci/jdk.vm.ci.runtime.JVMCI.getRuntime(JVMCI.java:65) > at compiler.jvmci.JVM_GetJVMCIRuntimeTest.run(JVM_GetJVMCIRuntimeTest.java:77) > at compiler.jvmci.JVM_GetJVMCIRuntimeTest.main(JVM_GetJVMCIRuntimeTest.java:70) > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) > at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:568) > at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127) > at java.base/java.lang.Thread.run(Thread.java:833) Marked as reviewed by rmarchenko (no project role). 
------------- PR Review: https://git.openjdk.org/crac/pull/88#pullrequestreview-1504616371 From rvansa at openjdk.org Thu Jun 29 06:58:27 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Thu, 29 Jun 2023 06:58:27 GMT Subject: [crac] RFR: Selectable Global Context implementation In-Reply-To: References: <0CYcbVOeKGGd2yN3N3VqFbCYVRgr8_PxXWdUE4QDDzQ=.5c5630b6-10ec-443f-b3a1-ed5af9336ee0@github.com> Message-ID: On Wed, 28 Jun 2023 12:33:32 GMT, Anton Kozlov wrote: >> src/java.base/share/classes/jdk/crac/Core.java line 77: >> >>> 75: private static final Context globalContext = GlobalContext.createGlobalContextImpl(); >>> 76: >>> 77: private static class ReferenceHandlerResource implements JDKResource { >> >> I wouldn't pollute the `Core` class with domain-specific code; if the 'reliability' of registration is of concern I think it's fine to expose the static `ensureRegistered`/`register` method. >> Or, at least refactor this into its own file. > > I considered a register() interface. The waitForReferenceProcessing is an internal JDK interface. The reference processing itself does not require special management on checkpoint, but the checkpoint procedure requires that no reference processing is in progress at a certain point. > > The class is boilerplate code to call two methods; it is not shared and is unlikely to be, so a separate file is not justified. If it's a requirement (and not an optimization), how come it's not prevented from running? (the blocking was there for a time before you reverted that, wasn't it?) >> src/java.base/share/classes/jdk/crac/impl/GlobalContext.java line 8: >> >>> 6: >>> 7: public class GlobalContext { >>> 8: private static final String GLOBAL_CONTEXT_IMPL_PROP = "jdk.crac.globalContext.impl"; >> >> As mentioned in off-GH conversation, I don't think this option would be ever used, and if you are going to change the default I'd rather drop it to keep the code simple.
> > Since there is no way to employ BlockingContext other than as the global context implementation, I prefer to make it selectable. OK, so in the future you intend to just change the default and provide non-blocking implementation as the workaround for cases where the registration is not synchronized properly? In that case it makes sense... ------------- PR Review Comment: https://git.openjdk.org/crac/pull/87#discussion_r1246188158 PR Review Comment: https://git.openjdk.org/crac/pull/87#discussion_r1246192507 From rmarchenko at openjdk.org Thu Jun 29 07:56:27 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Thu, 29 Jun 2023 07:56:27 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v11] In-Reply-To: References: Message-ID: > On restore, there might be PID value conflicts because of small PID values, if it was checkpoint'ed in a container. Therefore, when checkpointing in a container, we need to move PID value for new processes to a particular value to avoid conflicts on restore. > > See https://github.com/CRaC/example-lambda/blob/master/checkpoint.cmd.sh#L8 for example. > > This PR contains implemented functionality similar to the example above, making this work out of the box. By default, if checkpointing, PID is adjusted only if Java's PID is 1 that means Java is run in a container. To adjust PID manually for a checkpoint'ed process, `-XX:CRaCMinPid=` option should be used along with `CRaCCheckpointTo`. Min `CRaCMinPid` value is 1, max `CRaCMinPid` value is `UINT_MAX`, but it is actually limited by OS's pid_max. 
> > There are the following possible scenarios for CRaC running in a container: > > // getpid CRaCMinPid | set_last_pid fork > // ------------------------------------------------ > // 1 - | yes (default) yes > // 1 1 | no yes > // 1 >1 | yes yes > // >1 - | no no > // >1 <=getpid | no no > // >1 getpid< | yes yes Roman Marchenko has updated the pull request incrementally with two additional commits since the last revision: - Merge branch 'pid1-tests' into pid-adjustment - Implemented running command in container directly ------------- Changes: - all: https://git.openjdk.org/crac/pull/86/files - new: https://git.openjdk.org/crac/pull/86/files/87601a1c..9deb1152 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=86&range=10 - incr: https://webrevs.openjdk.org/?repo=crac&pr=86&range=09-10 Stats: 78 lines in 2 files changed: 50 ins; 6 del; 22 mod Patch: https://git.openjdk.org/crac/pull/86.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/86/head:pull/86 PR: https://git.openjdk.org/crac/pull/86 From rmarchenko at openjdk.org Thu Jun 29 08:35:46 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Thu, 29 Jun 2023 08:35:46 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v12] In-Reply-To: References: Message-ID: > On restore, there might be PID value conflicts because of small PID values, if it was checkpoint'ed in a container. Therefore, when checkpointing in a container, we need to move PID value for new processes to a particular value to avoid conflicts on restore. > > See https://github.com/CRaC/example-lambda/blob/master/checkpoint.cmd.sh#L8 for example. > > This PR contains implemented functionality similar to the example above, making this work out of the box. By default, if checkpointing, PID is adjusted only if Java's PID is 1 that means Java is run in a container. To adjust PID manually for a checkpoint'ed process, `-XX:CRaCMinPid=` option should be used along with `CRaCCheckpointTo`. 
Min `CRaCMinPid` value is 1, max `CRaCMinPid` value is `UINT_MAX`, but it is actually limited by OS's pid_max. > > There are the following possible scenarios for CRaC running in a container: > > // getpid CRaCMinPid | set_last_pid fork > // ------------------------------------------------ > // 1 - | yes (default) yes > // 1 1 | no yes > // 1 >1 | yes yes > // >1 - | no no > // >1 <=getpid | no no > // >1 getpid< | yes yes Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: Added test case ------------- Changes: - all: https://git.openjdk.org/crac/pull/86/files - new: https://git.openjdk.org/crac/pull/86/files/9deb1152..260dc9cc Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=86&range=11 - incr: https://webrevs.openjdk.org/?repo=crac&pr=86&range=10-11 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/crac/pull/86.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/86/head:pull/86 PR: https://git.openjdk.org/crac/pull/86 From akozlov at openjdk.org Thu Jun 29 11:22:21 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 29 Jun 2023 11:22:21 GMT Subject: [crac] RFR: Fix JVMCI after #41 In-Reply-To: <-WR30gYJtf2g3rcDrRjciI5hog0ZXrxWK2PYz1OuSyY=.5478c7a5-c20d-4742-a33a-d86fe620d976@github.com> References: <-WR30gYJtf2g3rcDrRjciI5hog0ZXrxWK2PYz1OuSyY=.5478c7a5-c20d-4742-a33a-d86fe620d976@github.com> Message-ID: On Wed, 28 Jun 2023 12:13:16 GMT, Anton Kozlov wrote: > The recently added CPU_MAX feature went out of sync with JVMCI code [1]. Since this is not a real feature, but an auxiliary value, a distinct name for the maximum value solves the issue.
> > [1] jtreg_test_hotspot_jtreg_tier1_compiler/compiler/jvmci/JVM_GetJVMCIRuntimeTest.jtr: > > jdk.vm.ci.common.JVMCIError: Missing CPU feature constants: [MAX] > at jdk.internal.vm.ci/jdk.vm.ci.hotspot.HotSpotJVMCIBackendFactory.convertFeatures(HotSpotJVMCIBackendFactory.java:79) > at jdk.internal.vm.ci/jdk.vm.ci.hotspot.amd64.AMD64HotSpotJVMCIBackendFactory.computeFeatures(AMD64HotSpotJVMCIBackendFactory.java:53) > at jdk.internal.vm.ci/jdk.vm.ci.hotspot.amd64.AMD64HotSpotJVMCIBackendFactory.createTarget(AMD64HotSpotJVMCIBackendFactory.java:74) > at jdk.internal.vm.ci/jdk.vm.ci.hotspot.amd64.AMD64HotSpotJVMCIBackendFactory.createJVMCIBackend(AMD64HotSpotJVMCIBackendFactory.java:109) > at jdk.internal.vm.ci/jdk.vm.ci.hotspot.HotSpotJVMCIRuntime.(HotSpotJVMCIRuntime.java:549) > at jdk.internal.vm.ci/jdk.vm.ci.hotspot.HotSpotJVMCIRuntime.runtime(HotSpotJVMCIRuntime.java:176) > at jdk.internal.vm.ci/jdk.vm.ci.runtime.JVMCI.initializeRuntime(Native Method) > at jdk.internal.vm.ci/jdk.vm.ci.runtime.JVMCI.getRuntime(JVMCI.java:65) > at compiler.jvmci.JVM_GetJVMCIRuntimeTest.run(JVM_GetJVMCIRuntimeTest.java:77) > at compiler.jvmci.JVM_GetJVMCIRuntimeTest.main(JVM_GetJVMCIRuntimeTest.java:70) > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) > at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:568) > at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127) > at java.base/java.lang.Thread.run(Thread.java:833) Thanks! 
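For readers following the thread, here is a minimal sketch of why an auxiliary constant named like a feature can trip a name-based check. It assumes that the check matches VM constants to Java enum constants by a `CPU_` name prefix, which the "Missing CPU feature constants: [MAX]" message suggests but this thread does not confirm. `CpuFeature`, `missingFeatures` and the replacement name `MAX_CPU` are hypothetical names for illustration only; this is not the actual JVMCI or HotSpot code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FeatureNameCheck {
    // Hypothetical stand-in for the Java-side feature enum (not the real JVMCI type).
    enum CpuFeature { SSE, AVX }

    // Collects every VM constant that carries the "CPU_" prefix but has no
    // enum constant of the same name on the Java side.
    static List<String> missingFeatures(Map<String, Long> vmConstants) {
        List<String> missing = new ArrayList<>();
        for (String key : vmConstants.keySet()) {
            if (!key.startsWith("CPU_")) {
                continue; // not named like a feature, so it is ignored
            }
            String name = key.substring("CPU_".length());
            try {
                CpuFeature.valueOf(name);
            } catch (IllegalArgumentException e) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<String, Long> constants = new LinkedHashMap<>();
        constants.put("CPU_SSE", 1L << 0);
        constants.put("CPU_AVX", 1L << 1);
        constants.put("CPU_MAX", 2L); // auxiliary bound, but named like a feature
        System.out.println(missingFeatures(constants)); // prints [MAX]

        constants.remove("CPU_MAX");
        constants.put("MAX_CPU", 2L); // hypothetical distinct name, no "CPU_" prefix
        System.out.println(missingFeatures(constants)); // prints []
    }
}
```

Under that assumption, giving the bound a name outside the feature namespace keeps it out of the comparison without touching the Java-side enum.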
------------- PR Comment: https://git.openjdk.org/crac/pull/88#issuecomment-1612932582 From akozlov at openjdk.org Thu Jun 29 11:22:21 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 29 Jun 2023 11:22:21 GMT Subject: [crac] Integrated: Fix JVMCI after #41 In-Reply-To: <-WR30gYJtf2g3rcDrRjciI5hog0ZXrxWK2PYz1OuSyY=.5478c7a5-c20d-4742-a33a-d86fe620d976@github.com> References: <-WR30gYJtf2g3rcDrRjciI5hog0ZXrxWK2PYz1OuSyY=.5478c7a5-c20d-4742-a33a-d86fe620d976@github.com> Message-ID: On Wed, 28 Jun 2023 12:13:16 GMT, Anton Kozlov wrote: > The recently added CPU_MAX feature went out of sync with JVMCI code [1]. Since this is not a real feature, but an auxiliary value, a distinct name for the maximum value solves the issue. > > [1] jtreg_test_hotspot_jtreg_tier1_compiler/compiler/jvmci/JVM_GetJVMCIRuntimeTest.jtr: > > jdk.vm.ci.common.JVMCIError: Missing CPU feature constants: [MAX] > at jdk.internal.vm.ci/jdk.vm.ci.hotspot.HotSpotJVMCIBackendFactory.convertFeatures(HotSpotJVMCIBackendFactory.java:79) > at jdk.internal.vm.ci/jdk.vm.ci.hotspot.amd64.AMD64HotSpotJVMCIBackendFactory.computeFeatures(AMD64HotSpotJVMCIBackendFactory.java:53) > at jdk.internal.vm.ci/jdk.vm.ci.hotspot.amd64.AMD64HotSpotJVMCIBackendFactory.createTarget(AMD64HotSpotJVMCIBackendFactory.java:74) > at jdk.internal.vm.ci/jdk.vm.ci.hotspot.amd64.AMD64HotSpotJVMCIBackendFactory.createJVMCIBackend(AMD64HotSpotJVMCIBackendFactory.java:109) > at jdk.internal.vm.ci/jdk.vm.ci.hotspot.HotSpotJVMCIRuntime.(HotSpotJVMCIRuntime.java:549) > at jdk.internal.vm.ci/jdk.vm.ci.hotspot.HotSpotJVMCIRuntime.runtime(HotSpotJVMCIRuntime.java:176) > at jdk.internal.vm.ci/jdk.vm.ci.runtime.JVMCI.initializeRuntime(Native Method) > at jdk.internal.vm.ci/jdk.vm.ci.runtime.JVMCI.getRuntime(JVMCI.java:65) > at compiler.jvmci.JVM_GetJVMCIRuntimeTest.run(JVM_GetJVMCIRuntimeTest.java:77) > at compiler.jvmci.JVM_GetJVMCIRuntimeTest.main(JVM_GetJVMCIRuntimeTest.java:70) > at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) > at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:568) > at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127) > at java.base/java.lang.Thread.run(Thread.java:833) This pull request has now been integrated. Changeset: a9f87061 Author: Anton Kozlov URL: https://git.openjdk.org/crac/commit/a9f87061492024457d65cc94ac8aa9eb102285d7 Stats: 12 lines in 2 files changed: 2 ins; 3 del; 7 mod Fix JVMCI after #41 Reviewed-by: rmarchenko ------------- PR: https://git.openjdk.org/crac/pull/88 From akozlov at openjdk.org Thu Jun 29 12:34:27 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 29 Jun 2023 12:34:27 GMT Subject: [crac] RFR: Selectable Global Context implementation In-Reply-To: References: <0CYcbVOeKGGd2yN3N3VqFbCYVRgr8_PxXWdUE4QDDzQ=.5c5630b6-10ec-443f-b3a1-ed5af9336ee0@github.com> Message-ID: On Thu, 29 Jun 2023 06:51:05 GMT, Radim Vansa wrote: >> I considered a register() interface. The waitForReferenceProcessing is an internal JDK interface. The reference processing itself does not require special management on checkpoint, but the checkpoint procedure requires that no reference processing is in progress at a certain point. >> >> The class is boilerplate code to call two methods; it is not shared and is unlikely to be, so a separate file is not justified. > > If it's a requirement (and not an optimization), how come it's not prevented from running? (the blocking was there for a time before you reverted that, wasn't it?) This piece of code is related to this part of the description: > To make properties available, it was required to delay Reference resource registration.
Otherwise, the GlobalContext implementation decision had to be done too early, in the Reference initialization during JDK bootstrapping, before Properties are available. Without this code, the ReferenceResource was registered during Reference initialization. At that point, Properties were not initialized yet, so any attempt to call System.getProperty() threw an NPE (regardless of the parameters). I.e., if the default context implementation is configured via properties, we need to delay ReferenceResource registration. >> Since there is no way to employ BlockingContext other than as the global context implementation, I prefer to make it selectable. > > OK, so in the future you intend to just change the default and provide non-blocking implementation as the workaround for cases where the registration is not synchronized properly? In that case it makes sense... Exactly. At some point we'll settle on a single, best implementation. But after this experience, I think we'll need the configuration for a while even after we change the default. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/87#discussion_r1246553267 PR Review Comment: https://git.openjdk.org/crac/pull/87#discussion_r1246555823 From rmarchenko at openjdk.org Thu Jun 29 12:45:57 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Thu, 29 Jun 2023 12:45:57 GMT Subject: [crac] RFR: Extract crac functionality into OS-agnostic files [v7] In-Reply-To: <9ciP4AawRKr_fIjAGILl5CU_PDGZGL-jxqDKMaMQSGs=.28129bfd-9515-4afc-ac03-3cb8b192f398@github.com> References: <9ciP4AawRKr_fIjAGILl5CU_PDGZGL-jxqDKMaMQSGs=.28129bfd-9515-4afc-ac03-3cb8b192f398@github.com> Message-ID: > CRaC-related functionality is moved to `crac*.hpp/cpp` files, and now it is in `crac` class instead of `os`. Roman Marchenko has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains ten commits: - Merge branch 'openjdk:crac' into crac-extract - Merge pull request #1 from jankratochvil/crac-extract-jan Fix compilation errors: - Fix compilation errors: commit 5d2fe3461534d56b0408788c8d7c1b94f85530c0 ../../src/hotspot/share/runtime/arguments.cpp: In static member function 'static jint Arguments::finalize_vm_init_args(bool)': ../../src/hotspot/share/runtime/arguments.cpp:3235:28: error: 'crac' has not been declared 3235 | if (CRaCCheckpointTo && !crac::prepare_checkpoint()) { | ^~~~ ../../src/hotspot/share/prims/jvm.cpp: In function '_jobjectArray* JVM_Checkpoint(JNIEnv*, jarray, jobjectArray, jboolean, jlong)': ../../src/hotspot/share/prims/jvm.cpp:3853:16: error: 'crac' has not been declared 3853 | Handle ret = crac::checkpoint(fd_arr, obj_arr, dry_run, jcmd_stream, CHECK_NULL); | ^~~~ ../../src/hotspot/share/services/management.cpp: In function 'jlong get_long_attribute(jmmLongAttribute)': ../../src/hotspot/share/services/management.cpp:957:12: error: 'crac' has not been declared 957 | return crac::restore_start_time(); | ^~~~ ../../src/hotspot/share/services/management.cpp:961:21: error: 'crac' has not been declared 961 | jlong ticks = crac::uptime_since_restore(); | ^~~~ Remove a no longer used declaration of os::Linux::checkpoint_restore(). 
- Merge branch 'crac' into crac-extract - Merge branch 'crac' into crac-extract - Merge branch 'crac' into crac-extract - Getting rid of crac::Linux - Removing trailing spaces - Refactoring - extracted crac* files ------------- Changes: https://git.openjdk.org/crac/pull/84/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=84&range=06 Stats: 2383 lines in 12 files changed: 1227 ins; 1149 del; 7 mod Patch: https://git.openjdk.org/crac/pull/84.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/84/head:pull/84 PR: https://git.openjdk.org/crac/pull/84 From rmarchenko at openjdk.org Thu Jun 29 12:47:21 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Thu, 29 Jun 2023 12:47:21 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v13] In-Reply-To: References: Message-ID: > On restore, there might be PID value conflicts because of small PID values, if it was checkpoint'ed in a container. Therefore, when checkpointing in a container, we need to move PID value for new processes to a particular value to avoid conflicts on restore. > > See https://github.com/CRaC/example-lambda/blob/master/checkpoint.cmd.sh#L8 for example. > > This PR contains implemented functionality similar to the example above, making this work out of the box. By default, if checkpointing, PID is adjusted only if Java's PID is 1 that means Java is run in a container. To adjust PID manually for a checkpoint'ed process, `-XX:CRaCMinPid=` option should be used along with `CRaCCheckpointTo`. Min `CRaCMinPid` value is 1, max `CRaCMinPid` value is `UINT_MAX`, but it is actually limited by OS's pid_max. 
> > There are the following possible scenarios for CRaC running in a container: > > // getpid CRaCMinPid | set_last_pid fork > // ------------------------------------------------ > // 1 - | yes (default) yes > // 1 1 | no yes > // 1 >1 | yes yes > // >1 - | no no > // >1 <=getpid | no no > // >1 getpid< | yes yes Roman Marchenko has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 22 additional commits since the last revision: - Merge branch 'openjdk:crac' into pid-adjustment - Added test case - Merge branch 'pid1-tests' into pid-adjustment - Implemented running command in container directly - Added FIXME for further steps - Fixing review comments - Fixing review comments - Revert "Now CracMinPid option must be set explicitly to adjust PID" This reverts commit b3d66800d6ea441fb86498fdbb229400747eb44f. - Now CracMinPid option must be set explicitly to adjust PID - Adapting tests - ... 
and 12 more: https://git.openjdk.org/crac/compare/581706d9...b164f67b ------------- Changes: - all: https://git.openjdk.org/crac/pull/86/files - new: https://git.openjdk.org/crac/pull/86/files/260dc9cc..b164f67b Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=86&range=12 - incr: https://webrevs.openjdk.org/?repo=crac&pr=86&range=11-12 Stats: 12 lines in 2 files changed: 2 ins; 3 del; 7 mod Patch: https://git.openjdk.org/crac/pull/86.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/86/head:pull/86 PR: https://git.openjdk.org/crac/pull/86 From rvansa at openjdk.org Thu Jun 29 15:54:47 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Thu, 29 Jun 2023 15:54:47 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v8] In-Reply-To: References: Message-ID: > When the application does not close some file descriptors through Resources we can use `jdk.crac.file-policy.checkpoint`, `jdk.crac.file-policy.restore`, `jdk.crac.socket-policy.checkpoint` and `jdk.crac.socket-policy.checkpoint` to configure the behaviour. > > These properties can specify a list of semicolon-separated key=value pairs, where the key can be one of: > > * numeric file descriptor > * for `file-policy`, path using 'glob' pattern matching (see FileSystem.getPathMatcher() for details) - this matches named pipes, too > * for `file-policy` keyword `FIFO` matching all pipes (including anonymous) > * for `socket-policy` a `,` pair, with the `` part being optional. Both `` and `` can be unix socket path, IPv4/IPv6 address with optional colon and port number or wildcard `*` replacing any of those parts. > > The possible values are in OpenFilePolicies.BeforeCheckpoint, OpenFilePolicies.AfterRestore, OpenSocketPolicies.BeforeCheckpoint and OpenSocketPolicies.AfterRestore enums. Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 12 commits: - Merge remote-tracking branch 'origin/crac' into newfd-policies - Another rework, without native methods - Don't use numeric FDs, remote _AT_END policies - cleanup - Merge branch 'crac' into newfd-policies - Refactor FileDescriptor resource to separate class - Add REOPEN_AT_END and OPEN_OTHER_AT_END policies - Merge branch 'crac' into newfd-policies - Effectively revert previous commit: Initialize logger in - Simplify workarounds in SimpleConsoleLogger. - ... and 2 more: https://git.openjdk.org/crac/compare/a9f87061...a0075b13 ------------- Changes: https://git.openjdk.org/crac/pull/69/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=69&range=07 Stats: 2156 lines in 50 files changed: 2031 ins; 53 del; 72 mod Patch: https://git.openjdk.org/crac/pull/69.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/69/head:pull/69 PR: https://git.openjdk.org/crac/pull/69 From rvansa at openjdk.org Thu Jun 29 16:09:58 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Thu, 29 Jun 2023 16:09:58 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v9] In-Reply-To: References: Message-ID: > When the application does not close some file descriptors through Resources we can use `jdk.crac.file-policy.checkpoint`, `jdk.crac.file-policy.restore`, `jdk.crac.socket-policy.checkpoint` and `jdk.crac.socket-policy.checkpoint` to configure the behaviour. > > These properties can specify a list of semicolon-separated key=value pairs, where the key can be one of: > > * numeric file descriptor > * for `file-policy`, path using 'glob' pattern matching (see FileSystem.getPathMatcher() for details) - this matches named pipes, too > * for `file-policy` keyword `FIFO` matching all pipes (including anonymous) > * for `socket-policy` a `,` pair, with the `` part being optional. Both `` and `` can be unix socket path, IPv4/IPv6 address with optional colon and port number or wildcard `*` replacing any of those parts. 
> > The possible values are in OpenFilePolicies.BeforeCheckpoint, OpenFilePolicies.AfterRestore, OpenSocketPolicies.BeforeCheckpoint and OpenSocketPolicies.AfterRestore enums. Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Cosmetic fixes (imports) ------------- Changes: - all: https://git.openjdk.org/crac/pull/69/files - new: https://git.openjdk.org/crac/pull/69/files/a0075b13..a245477a Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=69&range=08 - incr: https://webrevs.openjdk.org/?repo=crac&pr=69&range=07-08 Stats: 34 lines in 8 files changed: 21 ins; 10 del; 3 mod Patch: https://git.openjdk.org/crac/pull/69.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/69/head:pull/69 PR: https://git.openjdk.org/crac/pull/69 From rvansa at openjdk.org Thu Jun 29 16:12:24 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Thu, 29 Jun 2023 16:12:24 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v6] In-Reply-To: References: Message-ID: On Fri, 16 Jun 2023 13:33:37 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> cleanup > > Some random notes from first reading, I'll do another pass. This is not a request to change anything. @AntonKozlov Updated. 
* all the native code should be gone * the resources are using only existing methods * policies are in a single file (both files, sockets and pipes...), no separation to checkpoint and restore * removed the possibility to open another file instead ------------- PR Comment: https://git.openjdk.org/crac/pull/69#issuecomment-1613471442 From akozlov at openjdk.org Thu Jun 29 16:16:36 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 29 Jun 2023 16:16:36 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v13] In-Reply-To: References: Message-ID: On Thu, 29 Jun 2023 12:47:21 GMT, Roman Marchenko wrote: >> On restore, there might be PID value conflicts because of small PID values, if it was checkpoint'ed in a container. Therefore, when checkpointing in a container, we need to move PID value for new processes to a particular value to avoid conflicts on restore. >> >> See https://github.com/CRaC/example-lambda/blob/master/checkpoint.cmd.sh#L8 for example. >> >> This PR contains implemented functionality similar to the example above, making this work out of the box. By default, if checkpointing, PID is adjusted only if Java's PID is 1 that means Java is run in a container. To adjust PID manually for a checkpoint'ed process, `-XX:CRaCMinPid=` option should be used along with `CRaCCheckpointTo`. Min `CRaCMinPid` value is 1, max `CRaCMinPid` value is `UINT_MAX`, but it is actually limited by OS's pid_max. >> >> There are the following possible scenarios for CRaC running in a container: >> >> // getpid CRaCMinPid | set_last_pid fork >> // ------------------------------------------------ >> // 1 - | yes (default) yes >> // 1 1 | no yes >> // 1 >1 | yes yes >> // >1 - | no no >> // >1 <=getpid | no no >> // >1 getpid< | yes yes > > Roman Marchenko has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 22 additional commits since the last revision: > > - Merge branch 'openjdk:crac' into pid-adjustment > - Added test case > - Merge branch 'pid1-tests' into pid-adjustment > - Implemented running command in container directly > - Added FIXME for further steps > - Fixing review comments > - Fixing review comments > - Revert "Now CracMinPid option must be set explicitly to adjust PID" > > This reverts commit b3d66800d6ea441fb86498fdbb229400747eb44f. > - Now CracMinPid option must be set explicitly to adjust PID > - Adapting tests > - ... and 12 more: https://git.openjdk.org/crac/compare/c47fe131...b164f67b The code changes look almost good; see the comment. test/jdk/jdk/crac/ContainerPidAdjustmentTest.java line 81: > 79: > 80: @Override > 81: // FIXME: need to add a test for default values, for Java's PID==1. Is it still valid? ------------- PR Review: https://git.openjdk.org/crac/pull/86#pullrequestreview-1502920184 PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1246854792 From akozlov at openjdk.org Thu Jun 29 16:16:41 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 29 Jun 2023 16:16:41 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v10] In-Reply-To: References: Message-ID: On Wed, 28 Jun 2023 08:54:41 GMT, Roman Marchenko wrote: >> On restore, there might be PID value conflicts because of small PID values, if it was checkpoint'ed in a container. Therefore, when checkpointing in a container, we need to move PID value for new processes to a particular value to avoid conflicts on restore. >> >> See https://github.com/CRaC/example-lambda/blob/master/checkpoint.cmd.sh#L8 for example. >> >> This PR contains implemented functionality similar to the example above, making this work out of the box. By default, if checkpointing, PID is adjusted only if Java's PID is 1 that means Java is run in a container.
To adjust PID manually for a checkpoint'ed process, `-XX:CRaCMinPid=` option should be used along with `CRaCCheckpointTo`. Min `CRaCMinPid` value is 1, max `CRaCMinPid` value is `UINT_MAX`, but it is actually limited by OS's pid_max. >> >> There are the following possible scenarios for CRaC running in a container: >> >> // getpid CRaCMinPid | set_last_pid fork >> // ------------------------------------------------ >> // 1 - | yes (default) yes >> // 1 1 | no yes >> // 1 >1 | yes yes >> // >1 - | no no >> // >1 <=getpid | no no >> // >1 getpid< | yes yes > > Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: > > Added FIXME for further steps src/java.base/share/native/launcher/main.c line 213: > 211: static void spin_last_pid(int pid) { > 212: const int MaxSpinCount = pid < 1000 ? 1000 : pid; > 213: for (int child = fork(), prev = 0, cnt = MaxSpinCount; child < pid; child = fork(), --cnt) { Since waitpid is called only if `child < pid`, does this mean the last child that satisfies the pid requirement is left unwaited? ------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1245104030 From rmarchenko at openjdk.org Thu Jun 29 17:13:32 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Thu, 29 Jun 2023 17:13:32 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v10] In-Reply-To: References: Message-ID: <2OF97RyCQtGb_2MaJTZcOX7wIqTUYlz0HxuL-KAbets=.1f2f226f-0aea-4981-b8c3-30c35935b39b@github.com> On Wed, 28 Jun 2023 11:57:56 GMT, Anton Kozlov wrote: >> Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: >> >> Added FIXME for further steps > > src/java.base/share/native/launcher/main.c line 213: > >> 211: static void spin_last_pid(int pid) { >> 212: const int MaxSpinCount = pid < 1000 ?
1000 : pid; >> 213: for (int child = fork(), prev = 0, cnt = MaxSpinCount; child < pid; child = fork(), --cnt) { > > Since waitpid is called only if `child < pid`, does this mean the last child that satisfy pid requirement is left unwaited? Yes, you're correct. Do you think it's potentially dangerous or consumes resources? ------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1246915249 From rmarchenko at openjdk.org Thu Jun 29 17:13:35 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Thu, 29 Jun 2023 17:13:35 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v13] In-Reply-To: References: Message-ID: On Thu, 29 Jun 2023 16:12:21 GMT, Anton Kozlov wrote: >> Roman Marchenko has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 22 additional commits since the last revision: >> >> - Merge branch 'openjdk:crac' into pid-adjustment >> - Added test case >> - Merge branch 'pid1-tests' into pid-adjustment >> - Implemented running command in container directly >> - Added FIXME for further steps >> - Fixing review comments >> - Fixing review comments >> - Revert "Now CracMinPid option must be set explicitly to adjust PID" >> >> This reverts commit b3d66800d6ea441fb86498fdbb229400747eb44f. >> - Now CracMinPid option must be set explicitly to adjust PID >> - Adapting tests >> - ... and 12 more: https://git.openjdk.org/crac/compare/3b3ab11d...b164f67b > > test/jdk/jdk/crac/ContainerPidAdjustmentTest.java line 81: > >> 79: >> 80: @Override >> 81: // FIXME: need to add a test for default values, for Java's PID==1. > > Is it still valid? No, I'll remove this. 
------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1246916170 From rvansa at openjdk.org Fri Jun 30 06:42:20 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Fri, 30 Jun 2023 06:42:20 GMT Subject: [crac] RFR: Selectable Global Context implementation In-Reply-To: <0CYcbVOeKGGd2yN3N3VqFbCYVRgr8_PxXWdUE4QDDzQ=.5c5630b6-10ec-443f-b3a1-ed5af9336ee0@github.com> References: <0CYcbVOeKGGd2yN3N3VqFbCYVRgr8_PxXWdUE4QDDzQ=.5c5630b6-10ec-443f-b3a1-ed5af9336ee0@github.com> Message-ID: <8n9kK5D7PrNcB9xKgG4GuJOdZXLnb_tk7eKBlfvyyX4=.64c1d741-2f68-4941-a5d4-cc0a731a84c5@github.com> On Mon, 26 Jun 2023 16:12:24 GMT, Anton Kozlov wrote: > A number of problems were caused by switching to BlockingOrderedContext in the JDK code. To prevent the same in the potential users of EA builds, I temporarily switch the default implementation to the previous OrderedContext. The BlockingContext is still available via a command line option. After an EA build and transition of CRaC development to track newer version openjdk/jdk, the default implementation will be changed. > > To make properties available, it was required to delay Reference resource registration. Otherwise, the GlobalContext implementation decision had to be done too early, in the Reference initialization during JDK bootstrapping, before Properties are available. Marked as reviewed by rvansa (Committer). ------------- PR Review: https://git.openjdk.org/crac/pull/87#pullrequestreview-1506598767 From rvansa at openjdk.org Fri Jun 30 06:42:21 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Fri, 30 Jun 2023 06:42:21 GMT Subject: [crac] RFR: Selectable Global Context implementation In-Reply-To: References: <0CYcbVOeKGGd2yN3N3VqFbCYVRgr8_PxXWdUE4QDDzQ=.5c5630b6-10ec-443f-b3a1-ed5af9336ee0@github.com> Message-ID: On Thu, 29 Jun 2023 12:29:26 GMT, Anton Kozlov wrote: >> If it's a requirement (and not an optimization), how come it's not prevented from running? 
(the blocking was there for a time before you reverted that, wasn't it?) > > This piece of code is related to this part of the description > >> To make properties available, it was required to delay Reference resource registration. Otherwise, the GlobalContext implementation decision had to be done too early, in the Reference initialization during JDK bootstrapping, before Properties are available. > > Without this code, the ReferenceResource was registered during Reference initialization. At that point, Properties were not initialized yet, so any attempt to call System.getProperty() threw NPE (regardless of parameters). > > I.e. if we have the default context implementation configuration via properties, we need to delay ReferenceResource registration. I understand moving the registration from Reference clinit, and if you'd rather keep the implementation here, let's keep it that way. I was referring to > it's checkpoint requirement as a procedure that no reference processing is in progress at some certain point - I am not sure what creates this requirement, and it is not 100% guaranteed the way it's implemented now. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/87#discussion_r1247492822 From rmarchenko at openjdk.org Fri Jun 30 08:00:55 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Fri, 30 Jun 2023 08:00:55 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v14] In-Reply-To: References: Message-ID: > On restore, there might be PID value conflicts because of small PID values, if it was checkpoint'ed in a container. Therefore, when checkpointing in a container, we need to move PID value for new processes to a particular value to avoid conflicts on restore. > > See https://github.com/CRaC/example-lambda/blob/master/checkpoint.cmd.sh#L8 for example. > > This PR contains implemented functionality similar to the example above, making this work out of the box.
By default, if checkpointing, PID is adjusted only if Java's PID is 1 that means Java is run in a container. To adjust PID manually for a checkpoint'ed process, `-XX:CRaCMinPid=` option should be used along with `CRaCCheckpointTo`. Min `CRaCMinPid` value is 1, max `CRaCMinPid` value is `UINT_MAX`, but it is actually limited by OS's pid_max. > > There are the following possible scenarios for CRaC running in a container: > > // getpid CRaCMinPid | set_last_pid fork > // ------------------------------------------------ > // 1 - | yes (default) yes > // 1 1 | no yes > // 1 >1 | yes yes > // >1 - | no no > // >1 <=getpid | no no > // >1 getpid< | yes yes Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: Removed unnecessary FIXME ------------- Changes: - all: https://git.openjdk.org/crac/pull/86/files - new: https://git.openjdk.org/crac/pull/86/files/b164f67b..a1410c9c Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=86&range=13 - incr: https://webrevs.openjdk.org/?repo=crac&pr=86&range=12-13 Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod Patch: https://git.openjdk.org/crac/pull/86.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/86/head:pull/86 PR: https://git.openjdk.org/crac/pull/86 From rvansa at openjdk.org Fri Jun 30 08:08:16 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Fri, 30 Jun 2023 08:08:16 GMT Subject: [crac] RFR: Wake up all TIMED_WAITING threads after restore [v4] In-Reply-To: <7_xPH0xZdNTGBMiSBNgJE39bYB8Y41We3es0dOf7hrI=.6067ac3c-90b0-4dd3-9322-c39f652fb2bf@github.com> References: <7_xPH0xZdNTGBMiSBNgJE39bYB8Y41We3es0dOf7hrI=.6067ac3c-90b0-4dd3-9322-c39f652fb2bf@github.com> Message-ID: > This is a fix for an issue found by @jankratochvil when testing #53: Threads that enter sleep or timed parking use absolute monotonic time for pthread_cond_timedwait(). 
When the monotonic time changes during C/R we need to wake all threads to readjust the timeout to the new absolute time. > > This introduces effectively a spurious wakeup; this is permitted for all the uses of pthread_cond_timedwait. Implementation either handles that transparently or propagates the wakeup to Java. > > This commit does not handle timed waiting in non-Java threads other than WatcherThread. Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: - Merge remote-tracking branch 'origin/crac' into timed_wait - Don't use OS thread state - Inline Thread::interrupt - Use Threads::java_threads_do, check spurious timeouts - Wake up all TIMED_WAITING threads after restore Threads that enter sleep or timed parking use absolute monotonic time for pthread_cond_wait(). The specification permits spurious wakeups; implementation either handles that transparently or propagates the wakeup. This commit does not handle timed waiting in non-Java threads other than WatcherThread. 
------------- Changes: - all: https://git.openjdk.org/crac/pull/85/files - new: https://git.openjdk.org/crac/pull/85/files/df9ed092..4ddd1f05 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=85&range=03 - incr: https://webrevs.openjdk.org/?repo=crac&pr=85&range=02-03 Stats: 1081 lines in 17 files changed: 1047 ins; 10 del; 24 mod Patch: https://git.openjdk.org/crac/pull/85.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/85/head:pull/85 PR: https://git.openjdk.org/crac/pull/85 From rvansa at openjdk.org Fri Jun 30 08:08:18 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Fri, 30 Jun 2023 08:08:18 GMT Subject: [crac] RFR: Wake up all TIMED_WAITING threads after restore In-Reply-To: References: <7_xPH0xZdNTGBMiSBNgJE39bYB8Y41We3es0dOf7hrI=.6067ac3c-90b0-4dd3-9322-c39f652fb2bf@github.com> Message-ID: On Mon, 19 Jun 2023 15:52:59 GMT, Anton Kozlov wrote: >> This is a fix for an issue found by @jankratochvil when testing #53: Threads that enter sleep or timed parking use absolute monotonic time for pthread_cond_timedwait(). When the monotonic time changes during C/R we need to wake all threads to readjust the timeout to the new absolute time. >> >> This introduces effectively a spurious wakeup; this is permitted for all the uses of pthread_cond_timedwait. Implementation either handles that transparently or propagates the wakeup to Java. >> >> This commit does not handle timed waiting in non-Java threads other than WatcherThread. > >> Threads that enter sleep or timed parking use absolute monotonic time for pthread_cond_wait(). > > Probably pthread_cond_timedwait()? Shouldn't other *_timedwait calls be handled as well, like sem_timedwait()? > >> When the monotonic time changes during C/R we need to wake all threads to readjust the timeout to the new absolute time. > > I assume this happens in the implementation of pthread_cond_timedwait and not the JVM caller function? Otherwise I can't see the code doing the recalculation.
> >> This introduces effectively a spurious wakeup; this is permitted for all the uses of pthread_timed_wait. >> Implementation either handles that transparently or propagates the wakeup to Java. > > OS-level spurious wake-ups are fine, but propagating that to the Java level seems dangerous. I could not spot an example of the propagation. E.g. can that happen to Thread.sleep()? @AntonKozlov Updated, removing the OS thread state and using `as_Java_thread()` ------------- PR Comment: https://git.openjdk.org/crac/pull/85#issuecomment-1614284093 From rmarchenko at openjdk.org Fri Jun 30 10:16:35 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Fri, 30 Jun 2023 10:16:35 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v15] In-Reply-To: References: Message-ID: <1iTxsU8QJ-Dwe25gYSWjDdq05mjoGy9nCBConNAekkk=.fceca535-3fd1-4ef2-ad63-86fdcdc865e4@github.com> > On restore, there might be PID value conflicts because of small PID values, if it was checkpoint'ed in a container. Therefore, when checkpointing in a container, we need to move PID value for new processes to a particular value to avoid conflicts on restore. > > See https://github.com/CRaC/example-lambda/blob/master/checkpoint.cmd.sh#L8 for example. > > This PR contains implemented functionality similar to the example above, making this work out of the box. By default, if checkpointing, PID is adjusted only if Java's PID is 1 that means Java is run in a container. To adjust PID manually for a checkpoint'ed process, `-XX:CRaCMinPid=` option should be used along with `CRaCCheckpointTo`. Min `CRaCMinPid` value is 1, max `CRaCMinPid` value is `UINT_MAX`, but it is actually limited by OS's pid_max.
> > There are the following possible scenarios for CRaC running in a container: > > // getpid CRaCMinPid | set_last_pid fork > // ------------------------------------------------ > // 1 - | yes (default) yes > // 1 1 | no yes > // 1 >1 | yes yes > // >1 - | no no > // >1 <=getpid | no no > // >1 getpid< | yes yes Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: Adapting ResolveTest ------------- Changes: - all: https://git.openjdk.org/crac/pull/86/files - new: https://git.openjdk.org/crac/pull/86/files/a1410c9c..9b32061f Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=86&range=14 - incr: https://webrevs.openjdk.org/?repo=crac&pr=86&range=13-14 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/crac/pull/86.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/86/head:pull/86 PR: https://git.openjdk.org/crac/pull/86 From jkratochvil at openjdk.org Fri Jun 30 10:26:31 2023 From: jkratochvil at openjdk.org (Jan Kratochvil) Date: Fri, 30 Jun 2023 10:26:31 GMT Subject: [crac] RFR: Workaround JDK-8311164: CPU_HT is set randomly on hybrid CPUs like Alder Lake Message-ID: SSIA ------------- Commit messages: - Workaround JDK-8311164: CPU_HT is set randomly on hybrid CPUs like Alder Lake Changes: https://git.openjdk.org/crac/pull/89/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=89&range=00 Stats: 8 lines in 1 file changed: 8 ins; 0 del; 0 mod Patch: https://git.openjdk.org/crac/pull/89.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/89/head:pull/89 PR: https://git.openjdk.org/crac/pull/89 From akozlov at openjdk.org Fri Jun 30 12:06:23 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 30 Jun 2023 12:06:23 GMT Subject: [crac] RFR: Workaround JDK-8311164: CPU_HT is set randomly on hybrid CPUs like Alder Lake In-Reply-To: References: Message-ID: On Fri, 30 Jun 2023 10:19:20 GMT, Jan Kratochvil wrote: > SSIA Marked as reviewed by 
akozlov (Lead). src/hotspot/cpu/x86/vm_version_x86.cpp line 2597: > 2595: > 2596: // Workaround JDK-8311164: CPU_HT is set randomly on hybrid CPUs like Alder Lake. > 2597: features_missing &= ~CPU_HT; Note: I just realized these missing features are different from the others, it's more "extra features" -- they are specified via the option, but not available on the current CPU. ------------- PR Review: https://git.openjdk.org/crac/pull/89#pullrequestreview-1507067082 PR Review Comment: https://git.openjdk.org/crac/pull/89#discussion_r1247787447 From jkratochvil at openjdk.org Fri Jun 30 12:13:24 2023 From: jkratochvil at openjdk.org (Jan Kratochvil) Date: Fri, 30 Jun 2023 12:13:24 GMT Subject: [crac] RFR: Workaround JDK-8311164: CPU_HT is set randomly on hybrid CPUs like Alder Lake In-Reply-To: References: Message-ID: On Fri, 30 Jun 2023 12:03:51 GMT, Anton Kozlov wrote: >> SSIA > > src/hotspot/cpu/x86/vm_version_x86.cpp line 2597: > >> 2595: >> 2596: // Workaround JDK-8311164: CPU_HT is set randomly on hybrid CPUs like Alder Lake. >> 2597: features_missing &= ~CPU_HT; > > Note: I just realized these missing features are different from the others, it's more "extra features" -- they are specified via the option, but not available on the current CPU. I am not sure what you mean here. This part of the code checks that the `-XX:CPUFeatures=0x...,0x...` specifies only a subset of the current options, never a superset. Or do you suspect that there exist some other options besides `CPU_HT` which are not fatal to use even if the CPU does not support them?
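The subset check described above can be illustrated with plain bitmasks. This is only a sketch under stated assumptions, not the HotSpot code: the `FEATURE_*` constants, their bit positions, and the `missing_features` helper are hypothetical names invented for the illustration. A non-zero result means the requested set is not a subset of what the CPU offers; bits listed in `ignored` are masked out first, in the spirit of `features_missing &= ~CPU_HT`.

```c
#include <stdint.h>

/* Hypothetical bit assignments for illustration; not HotSpot's real values. */
#define FEATURE_HT   (UINT64_C(1) << 0)
#define FEATURE_AVX2 (UINT64_C(1) << 1)
#define FEATURE_SHA  (UINT64_C(1) << 2)

/*
 * Returns the requested feature bits that the current CPU lacks, after
 * dropping bits that are known to be reported unreliably. A result of 0
 * means the requested set is an acceptable subset of the available set.
 */
static uint64_t missing_features(uint64_t requested, uint64_t available,
                                 uint64_t ignored) {
    return requested & ~available & ~ignored;
}
```

With `FEATURE_HT` passed as `ignored`, a request that differs from the available set only in the hyper-threading bit yields 0, while a request for an absent `FEATURE_SHA` is still reported.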
------------- PR Review Comment: https://git.openjdk.org/crac/pull/89#discussion_r1247792981 From akozlov at openjdk.org Fri Jun 30 12:18:30 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 30 Jun 2023 12:18:30 GMT Subject: [crac] RFR: Selectable Global Context implementation In-Reply-To: References: <0CYcbVOeKGGd2yN3N3VqFbCYVRgr8_PxXWdUE4QDDzQ=.5c5630b6-10ec-443f-b3a1-ed5af9336ee0@github.com> Message-ID: On Fri, 30 Jun 2023 06:38:36 GMT, Radim Vansa wrote: >> This piece of code is related to this part of the description >> >>> To make properties available, it was required to delay Reference resource registration. Otherwise, the GlobalContext implementation decision had to be done too early, in the Reference initialization during JDK bootstrapping, before Properties are available. >> >> Without this code, the ReferenceResource was registered during Reference initialization. At that point, Properties were not initialized yet, so any attempt to call System.getProperty() threw NPE (regardless of parameters). >> >> I.e. if we have the default context implementation configuration via properties, we need to delay ReferenceResource registration. > > I understand moving the registration from Reference clinit, and if you'd rather keep the implementation here, let's keep it that way. I was referring to > >> it's checkpoint requirement as a procedure that no reference processing is in progress at some certain point > > - I am not sure what creates this requirement, and it is not 100% guaranteed the way it's implemented now. The requirement is stated in the quoted part :) "no reference processing is in progress at some certain point" =(at REFERENCE_HANDLER priority). It's required by the following CLEANERS (hard requirement, otherwise a cleaner may miss something). And in general, reference processing should be quiescent -- to avoid checkpointing reference processing in the middle without a reason (soft requirement). I don't like how it's done at the moment.
I've tried to make CLEANERS ensure reference processing on its own, to be more straightforward. But in the absence of any CLEANER (e.g. the whole Cleaner package is not referenced and is not initialized), we won't satisfy the soft requirement, which still makes sense. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/87#discussion_r1247797235 From akozlov at openjdk.org Fri Jun 30 12:21:23 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 30 Jun 2023 12:21:23 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v10] In-Reply-To: <2OF97RyCQtGb_2MaJTZcOX7wIqTUYlz0HxuL-KAbets=.1f2f226f-0aea-4981-b8c3-30c35935b39b@github.com> References: <2OF97RyCQtGb_2MaJTZcOX7wIqTUYlz0HxuL-KAbets=.1f2f226f-0aea-4981-b8c3-30c35935b39b@github.com> Message-ID: <6ifvHEaNx1QGCNgJ8R4U-KbgmsBy6KuRgHWS_M6O7eU=.2fcd1554-450e-480f-9615-fcad2e51e291@github.com> On Thu, 29 Jun 2023 17:10:00 GMT, Roman Marchenko wrote: >> src/java.base/share/native/launcher/main.c line 213: >> >>> 211: static void spin_last_pid(int pid) { >>> 212: const int MaxSpinCount = pid < 1000 ? 1000 : pid; >>> 213: for (int child = fork(), prev = 0, cnt = MaxSpinCount; child < pid; child = fork(), --cnt) { >> >> Since waitpid is called only if `child < pid`, does this mean the last child that satisfy pid requirement is left unwaited? > > Yes, you're correct. > Do you think it's potentially dangerous or consumes resources? Without wait, the child will occupy the PID number forever, without any need for that. So that's wrong. Maybe we can simplify the loop, so we won't leak the PID?
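One way to shape such a loop so that no child is left unwaited is sketched below. It is not the code from the pull request: `spin_pid_sketch`, `min_pid`, and `max_attempts` are hypothetical names, and the sketch assumes a Linux-like kernel that hands out increasing PIDs to newly forked processes until the counter wraps. The point it illustrates is that `waitpid` runs for every child before the exit condition is tested, so the child that finally satisfies the PID requirement is reaped too.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Forks short-lived children until one of them receives a PID >= min_pid
 * or the attempt budget is exhausted. Each child exits immediately and is
 * reaped before the loop decides whether to continue.
 * Returns the highest child PID observed, 0 if max_attempts <= 0,
 * or -1 if fork() failed.
 */
static pid_t spin_pid_sketch(pid_t min_pid, int max_attempts) {
    pid_t highest = 0;
    for (int i = 0; i < max_attempts; ++i) {
        pid_t child = fork();
        if (child < 0) {
            return -1;              /* fork failed, give up */
        }
        if (child == 0) {
            _exit(0);               /* child: nothing to do but exit */
        }
        int status;
        /* Reap this child unconditionally, retrying if interrupted. */
        while (waitpid(child, &status, 0) < 0 && errno == EINTR) {
        }
        if (child > highest) {
            highest = child;
        }
        if (child >= min_pid) {
            break;                  /* target reached, child already reaped */
        }
    }
    return highest;
}
```

Bounding the loop with `max_attempts` plays the role that `MaxSpinCount` has in the quoted code; how large that bound should be is left open here.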
------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1247799207 From rvansa at openjdk.org Fri Jun 30 12:27:17 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Fri, 30 Jun 2023 12:27:17 GMT Subject: [crac] RFR: Workaround JDK-8311164: CPU_HT is set randomly on hybrid CPUs like Alder Lake In-Reply-To: References: Message-ID: On Fri, 30 Jun 2023 10:19:20 GMT, Jan Kratochvil wrote: > SSIA The testsuite passes on my Alder Lake (i7-12700H) ------------- Marked as reviewed by rvansa (Committer). PR Review: https://git.openjdk.org/crac/pull/89#pullrequestreview-1507095866 From akozlov at openjdk.org Fri Jun 30 12:58:20 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 30 Jun 2023 12:58:20 GMT Subject: [crac] RFR: Workaround JDK-8311164: CPU_HT is set randomly on hybrid CPUs like Alder Lake In-Reply-To: References: Message-ID: On Fri, 30 Jun 2023 12:10:42 GMT, Jan Kratochvil wrote: > This part of the code checks that the -XX:CPUFeatures=0x...,0x... specifies only a subset of the current options, never superset. Good, that was my understanding as well. I mean just the variables names are the same, but the meaning is different. But I don't suppose it should be fixed. The PR is good. Thanks! ------------- PR Review Comment: https://git.openjdk.org/crac/pull/89#discussion_r1247831808 From rmarchenko at openjdk.org Fri Jun 30 13:20:58 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Fri, 30 Jun 2023 13:20:58 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v16] In-Reply-To: References: Message-ID: > On restore, there might be PID value conflicts because of small PID values, if it was checkpoint'ed in a container. Therefore, when checkpointing in a container, we need to move PID value for new processes to a particular value to avoid conflicts on restore. > > See https://github.com/CRaC/example-lambda/blob/master/checkpoint.cmd.sh#L8 for example. 
> > This PR contains implemented functionality similar to the example above, making this work out of the box. By default, if checkpointing, PID is adjusted only if Java's PID is 1 that means Java is run in a container. To adjust PID manually for a checkpoint'ed process, `-XX:CRaCMinPid=` option should be used along with `CRaCCheckpointTo`. Min `CRaCMinPid` value is 1, max `CRaCMinPid` value is `UINT_MAX`, but it is actually limited by OS's pid_max. > > There are the following possible scenarios for CRaC running in a container: > > // getpid CRaCMinPid | set_last_pid fork > // ------------------------------------------------ > // 1 - | yes (default) yes > // 1 1 | no yes > // 1 >1 | yes yes > // >1 - | no no > // >1 <=getpid | no no > // >1 getpid< | yes yes Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: Waiting for the last child while spinning PID ------------- Changes: - all: https://git.openjdk.org/crac/pull/86/files - new: https://git.openjdk.org/crac/pull/86/files/9b32061f..e95cb986 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=86&range=15 - incr: https://webrevs.openjdk.org/?repo=crac&pr=86&range=14-15 Stats: 7 lines in 1 file changed: 5 ins; 0 del; 2 mod Patch: https://git.openjdk.org/crac/pull/86.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/86/head:pull/86 PR: https://git.openjdk.org/crac/pull/86 From akozlov at openjdk.org Fri Jun 30 13:25:39 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 30 Jun 2023 13:25:39 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v16] In-Reply-To: References: Message-ID: On Fri, 30 Jun 2023 13:20:58 GMT, Roman Marchenko wrote: >> On restore, there might be PID value conflicts because of small PID values, if it was checkpoint'ed in a container. Therefore, when checkpointing in a container, we need to move PID value for new processes to a particular value to avoid conflicts on restore. 
>> >> See https://github.com/CRaC/example-lambda/blob/master/checkpoint.cmd.sh#L8 for example. >> >> This PR contains implemented functionality similar to the example above, making this work out of the box. By default, if checkpointing, PID is adjusted only if Java's PID is 1 that means Java is run in a container. To adjust PID manually for a checkpoint'ed process, `-XX:CRaCMinPid=` option should be used along with `CRaCCheckpointTo`. Min `CRaCMinPid` value is 1, max `CRaCMinPid` value is `UINT_MAX`, but it is actually limited by OS's pid_max. >> >> There are the following possible scenarios for CRaC running in a container: >> >> // getpid CRaCMinPid | set_last_pid fork >> // ------------------------------------------------ >> // 1 - | yes (default) yes >> // 1 1 | no yes >> // 1 >1 | yes yes >> // >1 - | no no >> // >1 <=getpid | no no >> // >1 getpid< | yes yes > > Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: > > Waiting for the last child while spinning PID Thanks, LGTM! ------------- Marked as reviewed by akozlov (Lead). PR Review: https://git.openjdk.org/crac/pull/86#pullrequestreview-1507188765 From rmarchenko at openjdk.org Fri Jun 30 13:39:25 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Fri, 30 Jun 2023 13:39:25 GMT Subject: [crac] RFR: PID adjustment on checkpoint [v10] In-Reply-To: <6ifvHEaNx1QGCNgJ8R4U-KbgmsBy6KuRgHWS_M6O7eU=.2fcd1554-450e-480f-9615-fcad2e51e291@github.com> References: <2OF97RyCQtGb_2MaJTZcOX7wIqTUYlz0HxuL-KAbets=.1f2f226f-0aea-4981-b8c3-30c35935b39b@github.com> <6ifvHEaNx1QGCNgJ8R4U-KbgmsBy6KuRgHWS_M6O7eU=.2fcd1554-450e-480f-9615-fcad2e51e291@github.com> Message-ID: On Fri, 30 Jun 2023 12:18:31 GMT, Anton Kozlov wrote: >> Yes, you're correct. >> Do you think it's potentially dangerous or consumes resources? > > Without wait, the child will occupy the PID number forever, without any need for that. So that's wrong. 
Maybe we can simplify the loop, so we won't leak the PID? Done ------------- PR Review Comment: https://git.openjdk.org/crac/pull/86#discussion_r1247874566 From jkratochvil at openjdk.org Fri Jun 30 14:20:23 2023 From: jkratochvil at openjdk.org (Jan Kratochvil) Date: Fri, 30 Jun 2023 14:20:23 GMT Subject: [crac] Integrated: Workaround JDK-8311164: CPU_HT is set randomly on hybrid CPUs like Alder Lake In-Reply-To: References: Message-ID: On Fri, 30 Jun 2023 10:19:20 GMT, Jan Kratochvil wrote: > SSIA This pull request has now been integrated. Changeset: 74f047c5 Author: Jan Kratochvil Committer: Anton Kozlov URL: https://git.openjdk.org/crac/commit/74f047c5bc7a7d5dc59d8a44c5e419ed7fe76db0 Stats: 8 lines in 1 file changed: 8 ins; 0 del; 0 mod Workaround JDK-8311164: CPU_HT is set randomly on hybrid CPUs like Alder Lake Reviewed-by: akozlov, rvansa ------------- PR: https://git.openjdk.org/crac/pull/89 From akozlov at openjdk.org Fri Jun 30 14:22:22 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 30 Jun 2023 14:22:22 GMT Subject: [crac] RFR: Wake up all TIMED_WAITING threads after restore [v4] In-Reply-To: References: <7_xPH0xZdNTGBMiSBNgJE39bYB8Y41We3es0dOf7hrI=.6067ac3c-90b0-4dd3-9322-c39f652fb2bf@github.com> Message-ID: On Fri, 30 Jun 2023 08:08:16 GMT, Radim Vansa wrote: >> This is a fix for an issue found by @jankratochvil when testing #53: Threads that enter sleep or timed parking use absolute monotonic time for pthread_cond_timedwait(). When the monotonic time changes during C/R we need to wake all threads to readjust the timeout to the new absolute time. >> >> This introduces effectively a spurious wakeup; this is permitted for all the uses of pthread_cond_timedwait. Implementation either handles that transparently or propagates the wakeup to Java. >> >> This commit does not handle timed waiting in non-Java threads other than WatcherThread. > > Radim Vansa has updated the pull request with a new target base due to a merge or a rebase.
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Merge remote-tracking branch 'origin/crac' into timed_wait > - Don't use OS thread state > - Inline Thread::interrupt > - Use Threads::java_threads_do, check spurious timeouts > - Wake up all TIMED_WAITING threads after restore > > Threads that enter sleep or timed parking use absolute monotonic time > for pthread_cond_wait(). The specification permits spurious wakeups; > implementation either handles that transparently or propagates > the wakeup. > > This commit does not handle timed waiting in non-Java threads other than > WatcherThread. LGTM, thank you! ------------- Marked as reviewed by akozlov (Lead). PR Review: https://git.openjdk.org/crac/pull/85#pullrequestreview-1507288596 From rvansa at openjdk.org Fri Jun 30 15:38:28 2023 From: rvansa at openjdk.org (Radim Vansa) Date: Fri, 30 Jun 2023 15:38:28 GMT Subject: [crac] Integrated: Wake up all TIMED_WAITING threads after restore In-Reply-To: <7_xPH0xZdNTGBMiSBNgJE39bYB8Y41We3es0dOf7hrI=.6067ac3c-90b0-4dd3-9322-c39f652fb2bf@github.com> References: <7_xPH0xZdNTGBMiSBNgJE39bYB8Y41We3es0dOf7hrI=.6067ac3c-90b0-4dd3-9322-c39f652fb2bf@github.com> Message-ID: On Mon, 19 Jun 2023 14:46:35 GMT, Radim Vansa wrote: > This is a fix for an issue found by @jankratochvil when testing #53: Threads that enter sleep or timed parking use absolute monotonic time for pthread_cond_timedwait(). When the monotonic time changes during C/R we need to wake all threads to readjust the timeout to the new absolute time. > > This introduces effectively a spurious wakeup; this is permitted for all the uses of pthread_cond_timedwait. Implementation either handles that transparently or propagates the wakeup to Java. > > This commit does not handle timed waiting in non-Java threads other than WatcherThread. This pull request has now been integrated. 
Changeset: 7d3e7bfe Author: Radim Vansa URL: https://git.openjdk.org/crac/commit/7d3e7bfe63ed07f533923805e0bfccbb20325348 Stats: 249 lines in 4 files changed: 249 ins; 0 del; 0 mod Wake up all TIMED_WAITING threads after restore Reviewed-by: akozlov ------------- PR: https://git.openjdk.org/crac/pull/85