From duke at openjdk.org Wed Mar 1 08:59:33 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Wed, 1 Mar 2023 08:59:33 GMT Subject: [crac] RFR: Document CRaCCheckpointTo and CRaCRestoreFrom in java(1) man page Message-ID: Document CRaCCheckpointTo and CRaCRestoreFrom in java(1) man page. ------------- Commit messages: - Document CRaCCheckpointTo and CRaCRestoreFrom in java(1) man page. Changes: https://git.openjdk.org/crac/pull/49/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=49&range=00 Stats: 17 lines in 1 file changed: 17 ins; 0 del; 0 mod Patch: https://git.openjdk.org/crac/pull/49.diff Fetch: git fetch https://git.openjdk.org/crac pull/49/head:pull/49 PR: https://git.openjdk.org/crac/pull/49 From akozlov at openjdk.org Wed Mar 1 08:59:33 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 1 Mar 2023 08:59:33 GMT Subject: [crac] RFR: Document CRaCCheckpointTo and CRaCRestoreFrom in java(1) man page In-Reply-To: References: Message-ID: On Thu, 23 Feb 2023 11:42:21 GMT, Jan Kratochvil wrote: > Document CRaCCheckpointTo and CRaCRestoreFrom in java(1) man page. LGTM, thanks! But could you satisfy Skara check by providing a description for the PR? ------------- PR: https://git.openjdk.org/crac/pull/49 From duke at openjdk.org Wed Mar 1 11:49:41 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Wed, 1 Mar 2023 11:49:41 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v8] In-Reply-To: References: Message-ID: On Tue, 28 Feb 2023 17:28:52 GMT, Anton Kozlov wrote: > I think in this PR we can concentrate on CPU features, as CPU core number is a different problem, that can arise even with the same feature set. I could split the patch but it is not testable/usable without the CPU count fix/hack. But I am now preparing the IFUNC patch for glibc upstreaming, whether it will be accepted or not. As long as you do not want a temporary solution in CRaC we can suspend this patch until glibc upstreaming gets resolved. 
------------- PR: https://git.openjdk.org/crac/pull/41 From duke at openjdk.org Wed Mar 1 14:50:51 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 1 Mar 2023 14:50:51 GMT Subject: [crac] RFR: Add CRaC-specific tests to GHA In-Reply-To: References: Message-ID: On Wed, 22 Feb 2023 15:08:29 GMT, Radim Vansa wrote: > Existing GitHub Actions run test tier1 but since most changes in this project focus on the CRaC capabilities we should run them in an automated fashion, too. > Right now the tests are mostly failing: this should be addressed in https://github.com/openjdk/crac/pull/47 Sure. Should I really rebase, or merge `crac` into this branch instead? ------------- PR: https://git.openjdk.org/crac/pull/48 From akozlov at openjdk.org Wed Mar 1 15:28:10 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 1 Mar 2023 15:28:10 GMT Subject: [crac] RFR: Add CRaC-specific tests to GHA In-Reply-To: References: Message-ID: On Wed, 22 Feb 2023 15:08:29 GMT, Radim Vansa wrote: > Existing GitHub Actions run test tier1 but since most changes in this project focus on the CRaC capabilities we should run them in an automated fashion, too. > Right now the tests are mostly failing: this should be addressed in https://github.com/openjdk/crac/pull/47 Please use merge. This will preserve all changes in this PR, and on the eventual integration Skara will squash all changes anyway. ------------- PR: https://git.openjdk.org/crac/pull/48 From rmarchenko at openjdk.org Wed Mar 1 16:10:49 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Wed, 1 Mar 2023 16:10:49 GMT Subject: [crac] RFR: CRaC may exit before image dump is completed [v3] In-Reply-To: References: Message-ID: > When running CRaC with docker, java may exit before CRIU has finished dumping because CRIU kills the original java process, and then docker immediately exits. 
> > It could be reproduced with a simple Java test: > > public class Test { > public static void main(String args[]) throws Exception { > jdk.crac.Core.checkpointRestore(); > System.out.println("finish"); > } > } > > and run with docker: > `docker $JAVA_HOME/java -XX:CRaCCheckpointTo=./cr Test.java` > > After the command above finishes, `cr/cppath` is absent in the case of failure. And/or it will fail on restore: > `docker $JAVA_HOME/java -XX:CRaCRestoreFrom=./cr` > > This change fixes the issue by forking the main process in case of PID=1 (pid=1 means it was run with docker), and waiting until the child processes have finished. This ensures that CRIU has finalized the dump, if any. At the same time, there is no conflict with PIDs on restore, since the process being restored has a PID not equal to 1, if restoring with the command above. Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: Returning status of the child only ------------- Changes: - all: https://git.openjdk.org/crac/pull/46/files - new: https://git.openjdk.org/crac/pull/46/files/fd074cfc..a0a78087 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=46&range=02 - incr: https://webrevs.openjdk.org/?repo=crac&pr=46&range=01-02 Stats: 5 lines in 1 file changed: 2 ins; 2 del; 1 mod Patch: https://git.openjdk.org/crac/pull/46.diff Fetch: git fetch https://git.openjdk.org/crac pull/46/head:pull/46 PR: https://git.openjdk.org/crac/pull/46 From akozlov at openjdk.org Wed Mar 1 16:11:14 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 1 Mar 2023 16:11:14 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v8] In-Reply-To: References: Message-ID: On Wed, 1 Mar 2023 11:46:18 GMT, Jan Kratochvil wrote: > I could split the patch but it is not testable/usable without the CPU count fix/hack. Maybe getting CPU count change first is a way to go? 
I understand that migration between different CPU features depends on CPU count change, but the CPU count is isolated and can be done upfront, right? Since we are waiting for glibc anyway for CPU features. But still we can meet with a different number of CPUs with the same feature set. Does the CPU count problem appear if a container is started with different quotas, like docker's `--cpu`? [1] Or does it require a different number of physical CPU cores? https://docs.docker.com/config/containers/resource_constraints/#cpu ------------- PR: https://git.openjdk.org/crac/pull/41 From akozlov at openjdk.org Wed Mar 1 16:18:56 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 1 Mar 2023 16:18:56 GMT Subject: [crac] RFR: Document CRaCCheckpointTo and CRaCRestoreFrom in java(1) man page In-Reply-To: References: Message-ID: On Thu, 23 Feb 2023 11:42:21 GMT, Jan Kratochvil wrote: > Document CRaCCheckpointTo and CRaCRestoreFrom in java(1) man page. Thanks! ------------- Marked as reviewed by akozlov (Lead). PR: https://git.openjdk.org/crac/pull/49 From duke at openjdk.org Wed Mar 1 16:20:43 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 1 Mar 2023 16:20:43 GMT Subject: [crac] Integrated: Fix failing CRaC tests In-Reply-To: References: Message-ID: On Wed, 22 Feb 2023 13:04:00 GMT, Radim Vansa wrote: > Before https://github.com/openjdk/crac/pull/16 the arguments for restored process were ignored; tests were written with this in mind and current behaviour breaks them. > > In addition I've added missing `-XX:+UnlockDiagnosticVMOptions` flag and fixed the `recursiveCheckpoint` test runner which got stuck as with `pauseengine` the checkpointed process does not terminate. This pull request has now been integrated. 
Changeset: 80cab698 Author: Radim Vansa Committer: Anton Kozlov URL: https://git.openjdk.org/crac/commit/80cab698b2fc3877f563f2e72aa728f47b76b11c Stats: 42 lines in 18 files changed: 3 ins; 4 del; 35 mod Fix failing CRaC tests Reviewed-by: akozlov ------------- PR: https://git.openjdk.org/crac/pull/47 From akozlov at openjdk.org Wed Mar 1 16:34:41 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 1 Mar 2023 16:34:41 GMT Subject: [crac] RFR: CRaC may exit before image dump is completed [v3] In-Reply-To: References: Message-ID: <_lLFnw-pmrBhAQCc7tE3ScOUC_xYlP7FTSfHLNk5Z8Y=.efb73ac6-4f06-4b41-ac49-034660a7272d@github.com> On Wed, 1 Mar 2023 16:10:49 GMT, Roman Marchenko wrote: >> When running CRaC with docker, java may exit before CRIU has finished dumping because CRIU kills the original java process, and then docker immediately exits. >> >> It could be reproduced with a simple Java test: >> >> public class Test { >> public static void main(String args[]) throws Exception { >> jdk.crac.Core.checkpointRestore(); >> System.out.println("finish"); >> } >> } >> >> and run with docker: >> `docker $JAVA_HOME/java -XX:CRaCCheckpointTo=./cr Test.java` >> >> After the command above finishes, `cr/cppath` is absent in the case of failure. And/or it will fail on restore: >> `docker $JAVA_HOME/java -XX:CRaCRestoreFrom=./cr` >> >> This change fixes the issue by forking the main process in case of PID=1 (pid=1 means it was run with docker), and waiting until the child processes have finished. This ensures that CRIU has finalized the dump, if any. At the same time, there is no conflict with PIDs on restore, since the process being restored has a PID not equal to 1, if restoring with the command above. 
> > Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: > > Returning status of the child only src/java.base/share/native/launcher/main.c line 120: > 118: pid = wait(&st); > 119: if (pid == g_child_pid && WIFEXITED(st)) { > 120: status = WEXITSTATUS(st); Sorry for nit-picking, but now if the java was killed (`WIFEXITED == false`) we won't update status and will return `0`, which does not look correct. `restorewait` in this situation returns `1` [1], although better, also does not look perfect. Here I suggest be at least consistent with restorewait. Or we can fix restorewait as well, indicating being killed by returning `128+signal`, as described in the bash manual [2]. How does it sound? > When a command terminates on a fatal signal N, bash uses the value of 128+N as the exit status. [1] https://github.com/openjdk/crac/blob/crac/src/java.base/unix/native/criuengine/criuengine.c#L306 [2] https://linux.die.net/man/1/bash ------------- PR: https://git.openjdk.org/crac/pull/46 From duke at openjdk.org Thu Mar 2 13:49:22 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Thu, 2 Mar 2023 13:49:22 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v9] In-Reply-To: References: Message-ID: > Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. > > 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. > 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC > > (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. 
Therefore there had to be implemented a new OpenJDK option: > >> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine > > It is intended to specify the lowest common denominator of all CPUs in a farm. Instead of a possible crash of OpenJDK it will now refuse to run: > >> Error occurred during initialization of VM >> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi > > (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. > > If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. 
Unfortunately there is a precedent OpenJDK upstream has already rejected such re-exec idea in the past: https://github.com/openjdk/crac/pull/31#issuecomment-1275707621 > > That IMO does not preclude trying the same for this case. > > - Debian 11 x86_64: It does not work, glibc is too different and inlined there. > - Debian 12 x86_64: It works even without libc6-dbg as its offsets are the default. > - Fedora 36 x86_64: It works as on Fedoras glibc debuginfo is embedded. Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: src/hotspot/os/linux/os_linux_ifunc.cpp: Fix an assertion not really checking anything. ------------- Changes: - all: https://git.openjdk.org/crac/pull/41/files - new: https://git.openjdk.org/crac/pull/41/files/dd60871e..8fdb17d2 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=08 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=07-08 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41 From duke at openjdk.org Thu Mar 2 15:20:54 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Thu, 2 Mar 2023 15:20:54 GMT Subject: [crac] Integrated: Document CRaCCheckpointTo and CRaCRestoreFrom in java(1) man page In-Reply-To: References: Message-ID: <_VhS5m99KK27XYCFNyeDEZ8Bioj-zJhghRPqYnoK2v0=.f30444d4-d3c8-4e26-bfbc-3bf67f1721af@github.com> On Thu, 23 Feb 2023 11:42:21 GMT, Jan Kratochvil wrote: > Document CRaCCheckpointTo and CRaCRestoreFrom in java(1) man page. This pull request has now been integrated. 
Changeset: dff32bca Author: Jan Kratochvil Committer: Anton Kozlov URL: https://git.openjdk.org/crac/commit/dff32bcaf423eb9dfd339e2d043e8f412c901de0 Stats: 17 lines in 1 file changed: 17 ins; 0 del; 0 mod Document CRaCCheckpointTo and CRaCRestoreFrom in java(1) man page Reviewed-by: akozlov ------------- PR: https://git.openjdk.org/crac/pull/49 From duke at openjdk.org Thu Mar 2 16:52:51 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 2 Mar 2023 16:52:51 GMT Subject: [crac] RFR: Convert CRaC tests from shell scripts to Java Message-ID: Right now this is a draft as it builds on top of https://github.com/openjdk/crac/pull/47 Please see `CracTest` javadoc for detailed info. ------------- Commit messages: - Merge remote-tracking branch 'origin/crac' into test-crac-java - Remove test name from the @run JTreg tag - Use default main and args from CracTest - Merge remote-tracking branch 'origin/crac' into test-crac-java - Add docker to CracBuilder - Rename enum for `simengine` to SIMULATE - Convert CRaC tests from shell scripts to Java - Remove somebody's forgotten overrides - Fix failing CRaC tests Changes: https://git.openjdk.org/crac/pull/50/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=50&range=00 Stats: 3582 lines in 54 files changed: 1683 ins; 1680 del; 219 mod Patch: https://git.openjdk.org/crac/pull/50.diff Fetch: git fetch https://git.openjdk.org/crac pull/50/head:pull/50 PR: https://git.openjdk.org/crac/pull/50 From akozlov at openjdk.org Thu Mar 2 16:52:53 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 2 Mar 2023 16:52:53 GMT Subject: [crac] RFR: Convert CRaC tests from shell scripts to Java In-Reply-To: References: Message-ID: On Fri, 24 Feb 2023 09:20:05 GMT, Radim Vansa wrote: > Right now this is a draft as it builds on top of https://github.com/openjdk/crac/pull/47 > > Please see `CracTest` javadoc for detailed info. This looks very nice! While this is a draft, what should be especially looked? 
As an approach, this is what these tests were missing! test/jdk/jdk/crac/LazyProps.java line 37: > 35: public static void main(String[] args) throws Exception { > 36: CracTest.run(LazyProps.class, args); > 37: } I assume this is the pattern that every test should be following. Is it possible to avoid this boiler plate by e.g. @run driver CracTest LazyProps and by making CracTest.main to look for the class? test/jdk/jdk/crac/LazyProps.java line 41: > 39: @Override > 40: public void test() throws Exception { > 41: new CracBuilder().engine(CracEngine.SIMULATE).main(LazyProps.class).args(CracTest.args()) `main(LazyProps.class).args(CracTest.args())` seems to be repeated, would it harm to make them default? test/jdk/jdk/crac/LazyProps.java line 42: > 40: public void test() throws Exception { > 41: new CracBuilder().engine(CracEngine.SIMULATE).main(LazyProps.class).args(CracTest.args()) > 42: .captureOutput(true) the same question, would it harm to make this default? Or is it false by default in the existing JDK testing libraries? ------------- PR: https://git.openjdk.org/crac/pull/50 From duke at openjdk.org Thu Mar 2 16:52:54 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 2 Mar 2023 16:52:54 GMT Subject: [crac] RFR: Convert CRaC tests from shell scripts to Java In-Reply-To: References: Message-ID: On Tue, 28 Feb 2023 18:18:34 GMT, Anton Kozlov wrote: >> Right now this is a draft as it builds on top of https://github.com/openjdk/crac/pull/47 >> >> Please see `CracTest` javadoc for detailed info. > > test/jdk/jdk/crac/LazyProps.java line 37: > >> 35: public static void main(String[] args) throws Exception { >> 36: CracTest.run(LazyProps.class, args); >> 37: } > > I assume this is the pattern that every test should be following. Is it possible to avoid this boiler plate by e.g. > > @run driver CracTest LazyProps > > and by making CracTest.main to look for the class? Good idea! 
From docs I didn't really get what's the difference between `@run driver` and `@run main`, do you know? > test/jdk/jdk/crac/LazyProps.java line 41: > >> 39: @Override >> 40: public void test() throws Exception { >> 41: new CracBuilder().engine(CracEngine.SIMULATE).main(LazyProps.class).args(CracTest.args()) > > `main(LazyProps.class).args(CracTest.args())` seems to be repeated, would it harm to make them default? I kind of thought about not having CracBuilder and CracTest interdependent but it's probably ok (and reduces boilerplate), so I'll do that. > test/jdk/jdk/crac/LazyProps.java line 42: > >> 40: public void test() throws Exception { >> 41: new CracBuilder().engine(CracEngine.SIMULATE).main(LazyProps.class).args(CracTest.args()) >> 42: .captureOutput(true) > > the same question, would it harm to make this default? Or is it false by default in the existing JDK testing libraries? Many tests don't use process output, and it was useful for debugging to have it printed straight out. If the process gets stuck I won't see that (right now we process the output only after it exits) but there's probably a way to both process it and print it out. ------------- PR: https://git.openjdk.org/crac/pull/50 From akozlov at openjdk.org Thu Mar 2 16:52:55 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 2 Mar 2023 16:52:55 GMT Subject: [crac] RFR: Convert CRaC tests from shell scripts to Java In-Reply-To: References: Message-ID: On Wed, 1 Mar 2023 07:52:19 GMT, Radim Vansa wrote: >> test/jdk/jdk/crac/LazyProps.java line 37: >> >>> 35: public static void main(String[] args) throws Exception { >>> 36: CracTest.run(LazyProps.class, args); >>> 37: } >> >> I assume this is the pattern that every test should be following. Is it possible to avoid this boiler plate by e.g. >> >> @run driver CracTest LazyProps >> >> and by making CracTest.main to look for the class? > > Good idea! 
From docs I didn't really get what's the difference between `@run driver` and `@run main`, do you know? I have not tried that myself, but from jtreg FAQ [1] > ... `@run driver`. This is the same as `@run main` with the exception that any VM options specified on the command line will not be used when running the specified class. Since we won't provide VM options, AFAIU there is no effective difference, but using `driver` specifies the intent better. [1] https://openjdk.org/jtreg/faq.html >> test/jdk/jdk/crac/LazyProps.java line 42: >> >>> 40: public void test() throws Exception { >>> 41: new CracBuilder().engine(CracEngine.SIMULATE).main(LazyProps.class).args(CracTest.args()) >>> 42: .captureOutput(true) >> >> the same question, would it harm to make this default? Or is it false by default in the existing JDK testing libraries? > > Many tests don't use process output, and it was useful for debugging to have it printed straight out. If the process gets stuck I won't see that (right now we process the output only after it exits) but there's probably a way to both process it and print it out. OK, thanks, something like this perfectly describes why it should not be default. Being explicit is not bad. ------------- PR: https://git.openjdk.org/crac/pull/50 From duke at openjdk.org Thu Mar 2 16:53:55 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 2 Mar 2023 16:53:55 GMT Subject: [crac] RFR: Convert CRaC tests from shell scripts to Java In-Reply-To: References: Message-ID: On Tue, 28 Feb 2023 18:35:27 GMT, Anton Kozlov wrote: >> Right now this is a draft as it builds on top of https://github.com/openjdk/crac/pull/47 >> >> Please see `CracTest` javadoc for detailed info. > > This looks very nice! While this is a draft, what should be especially looked? As an approach, this is what these tests were missing! @AntonKozlov I've merged with recent `crac`, also updating the tests integrated in the meantime and adding more goodies for Docker. 
The main and args parameters to CracBuilder now use test defaults, and `@run driver jdk.test.lib.crac.CracTest` is used. The class name is omitted; it can be discovered through `test.file` system property. Regrettably I have to add the `@build MyTest` tag, though, because (even if I had it in `@run driver`) the test class would not be compiled. I've experimented with compiling myself, but had classpath issues (see CracTest.main for details). ------------- PR: https://git.openjdk.org/crac/pull/50 From duke at openjdk.org Thu Mar 2 16:57:41 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 2 Mar 2023 16:57:41 GMT Subject: [crac] RFR: Add CRaC-specific tests to GHA [v2] In-Reply-To: References: Message-ID: <9TPCQqLAaT2eCaEqQvKOrKmR-tcaGc9x8P9xQECe1dE=.b750e051-fc93-4e02-af66-f4361abef3ae@github.com> > Existing GitHub Actions run test tier1 but since most changes in this project focus on the CRaC capabilities we should run them in an automated fashion, too. > Right now the tests are mostly failing: this should be addressed in https://github.com/openjdk/crac/pull/47 Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains two additional commits since the last revision: - Merge remote-tracking branch 'origin/crac' into test-crac-in-gha - Add CRaC-specific tests to GHA ------------- Changes: - all: https://git.openjdk.org/crac/pull/48/files - new: https://git.openjdk.org/crac/pull/48/files/a3e77b56..8a9bb6e0 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=48&range=01 - incr: https://webrevs.openjdk.org/?repo=crac&pr=48&range=00-01 Stats: 565 lines in 29 files changed: 510 ins; 16 del; 39 mod Patch: https://git.openjdk.org/crac/pull/48.diff Fetch: git fetch https://git.openjdk.org/crac pull/48/head:pull/48 PR: https://git.openjdk.org/crac/pull/48 From duke at openjdk.org Fri Mar 3 07:20:37 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 3 Mar 2023 07:20:37 GMT Subject: [crac] RFR: Add CRaC-specific tests to GHA [v2] In-Reply-To: <9TPCQqLAaT2eCaEqQvKOrKmR-tcaGc9x8P9xQECe1dE=.b750e051-fc93-4e02-af66-f4361abef3ae@github.com> References: <9TPCQqLAaT2eCaEqQvKOrKmR-tcaGc9x8P9xQECe1dE=.b750e051-fc93-4e02-af66-f4361abef3ae@github.com> Message-ID: On Thu, 2 Mar 2023 16:57:41 GMT, Radim Vansa wrote: >> Existing GitHub Actions run test tier1 but since most changes in this project focus on the CRaC capabilities we should run them in an automated fashion, too. >> Right now the tests are mostly failing: this should be addressed in https://github.com/openjdk/crac/pull/47 > > Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge remote-tracking branch 'origin/crac' into test-crac-in-gha > - Add CRaC-specific tests to GHA Apparently there's an issue with permissions, and/or the testsuite does not import correct CRIU version. I'll investigate. 
------------- PR: https://git.openjdk.org/crac/pull/48 From duke at openjdk.org Fri Mar 3 07:47:22 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 3 Mar 2023 07:47:22 GMT Subject: [crac] RFR: Add CRaC-specific tests to GHA [v3] In-Reply-To: References: Message-ID: > Existing GitHub Actions run test tier1 but since most changes in this project focus on the CRaC capabilities we should run them in an automated fashion, too. > Right now the tests are mostly failing: this should be addressed in https://github.com/openjdk/crac/pull/47 Radim Vansa has updated the pull request incrementally with two additional commits since the last revision: - Debug GHA - Debug GHA ------------- Changes: - all: https://git.openjdk.org/crac/pull/48/files - new: https://git.openjdk.org/crac/pull/48/files/8a9bb6e0..358a4816 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=48&range=02 - incr: https://webrevs.openjdk.org/?repo=crac&pr=48&range=01-02 Stats: 50 lines in 2 files changed: 15 ins; 0 del; 35 mod Patch: https://git.openjdk.org/crac/pull/48.diff Fetch: git fetch https://git.openjdk.org/crac pull/48/head:pull/48 PR: https://git.openjdk.org/crac/pull/48 From duke at openjdk.org Fri Mar 3 07:54:08 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 3 Mar 2023 07:54:08 GMT Subject: [crac] RFR: Add CRaC-specific tests to GHA [v4] In-Reply-To: References: Message-ID: > Existing GitHub Actions run test tier1 but since most changes in this project focus on the CRaC capabilities we should run them in an automated fashion, too. 
> Right now the tests are mostly failing: this should be addressed in https://github.com/openjdk/crac/pull/47 Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Debug GHA ------------- Changes: - all: https://git.openjdk.org/crac/pull/48/files - new: https://git.openjdk.org/crac/pull/48/files/358a4816..04996e37 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=48&range=03 - incr: https://webrevs.openjdk.org/?repo=crac&pr=48&range=02-03 Stats: 5 lines in 1 file changed: 0 ins; 0 del; 5 mod Patch: https://git.openjdk.org/crac/pull/48.diff Fetch: git fetch https://git.openjdk.org/crac pull/48/head:pull/48 PR: https://git.openjdk.org/crac/pull/48 From duke at openjdk.org Fri Mar 3 08:13:57 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 3 Mar 2023 08:13:57 GMT Subject: [crac] Withdrawn: Add CRaC-specific tests to GHA In-Reply-To: References: Message-ID: On Wed, 22 Feb 2023 15:08:29 GMT, Radim Vansa wrote: > Existing GitHub Actions run test tier1 but since most changes in this project focus on the CRaC capabilities we should run them in an automated fashion, too. > Right now the tests are mostly failing: this should be addressed in https://github.com/openjdk/crac/pull/47 This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/crac/pull/48 From duke at openjdk.org Fri Mar 3 14:51:43 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 3 Mar 2023 14:51:43 GMT Subject: [crac] RFR: Improved open file descriptors tracking [v4] In-Reply-To: References: Message-ID: <3jVi-WQXjL57QYgnjRf6HuEdL3z2P3YOS0M-k1PXtvY=.ad24945d-2a2b-49a0-a3c9-0c57e9cae746@github.com> On Thu, 16 Feb 2023 11:18:34 GMT, Radim Vansa wrote: >> Tracks `java.io.FileDescriptor` instances as CRaC resource; before checkpoint these are reported and if not allow-listed (e.g. as opened as standard descriptors) an exception is thrown. 
Further investigation can use system property `jdk.crac.collect-fd-stacktraces=true` to record origin of those file descriptors. >> File descriptors claimed in Java code are passed to native; native code checks all open file descriptors and reports error if there's an unexpected FD that is not included in the list passed previously. > > Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 16 additional commits since the last revision: > > - Merge remote-tracking branch 'origin/crac' into newfd > - Use descriptor access rather than extending API > - 8272472: StackGuardPages test doesn't build with glibc 2.34 > > Backport-of: f77a1a156f3da9068d012d9227c7ee0fee58f571 > - Empty commit to trigger GHA > - Drop native FDs tracking > - Avoid claiming invalid FileDescriptor > - Whitelist RandomAccessFile opening classpath files > > This is a workaround for some frameworks opening classpath files in > a non-standard way. > - Add tracking of FD origin > - Track FileDescriptors closed by NIO > - Track native FDs from EPoll > - ... and 6 more: https://git.openjdk.org/crac/compare/18460694...9b5e7edd @AntonKozlov Bump, I think the comments were handled. 
------------- PR: https://git.openjdk.org/crac/pull/43 From rmarchenko at openjdk.org Mon Mar 6 12:43:13 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Mon, 6 Mar 2023 12:43:13 GMT Subject: [crac] RFR: CRaC may exit before image dump is completed [v3] In-Reply-To: <_lLFnw-pmrBhAQCc7tE3ScOUC_xYlP7FTSfHLNk5Z8Y=.efb73ac6-4f06-4b41-ac49-034660a7272d@github.com> References: <_lLFnw-pmrBhAQCc7tE3ScOUC_xYlP7FTSfHLNk5Z8Y=.efb73ac6-4f06-4b41-ac49-034660a7272d@github.com> Message-ID: On Wed, 1 Mar 2023 16:30:47 GMT, Anton Kozlov wrote: >> Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: >> >> Returning status of the child only > > src/java.base/share/native/launcher/main.c line 120: > >> 118: pid = wait(&st); >> 119: if (pid == g_child_pid && WIFEXITED(st)) { >> 120: status = WEXITSTATUS(st); > > Sorry for nit-picking, but now if the java was killed (`WIFEXITED == false`) we won't update status and will return `0`, which does not look correct. `restorewait` in this situation returns `1` [1], although better, also does not look perfect. Here I suggest be at least consistent with restorewait. > > Or we can fix restorewait as well, indicating being killed by returning `128+signal`, as described in the bash manual [2]. How does it sound? > >> When a command terminates on a fatal signal N, bash uses the value of 128+N as the exit status. > > [1] https://github.com/openjdk/crac/blob/crac/src/java.base/unix/native/criuengine/criuengine.c#L306 > [2] https://linux.die.net/man/1/bash @AntonKozlov I personally would prefer to exit 0 on checkpoint to indicate that the process finished successfully. On the other hand, I have no idea how we can recognize cases where the child process is actually killed by someone else. So I agree with the idea of returning an appropriate code 128+N, and doing the same for restorewait, to keep it consistent. 
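For illustration, the `128+N` convention discussed above could look roughly like the following. This is only a sketch: `exit_code_from_wait_status` is a hypothetical helper name, not a function from main.c or criuengine.c, and returning `1` for anything that is neither a normal exit nor a signal death is an assumption.

```c
#include <assert.h>
#include <sys/wait.h>

/*
 * Hypothetical helper (not taken from main.c or criuengine.c):
 * convert a status filled in by wait()/waitpid() into an exit code
 * for the launcher to return.
 */
int exit_code_from_wait_status(int st) {
    if (WIFEXITED(st)) {
        /* Normal termination: pass the child's own exit code through. */
        return WEXITSTATUS(st);
    }
    if (WIFSIGNALED(st)) {
        /* Killed by signal N: report 128+N, the convention quoted from bash. */
        int sig = WTERMSIG(st);
        assert(sig > 0 && sig < 128);
        return 128 + sig;
    }
    /* Neither exited nor signaled (assumption: treat as a generic failure). */
    return 1;
}
```

With such a helper the caller would pass the `st` filled in by `wait(&st)` and return the result, so a java process killed by SIGKILL (signal 9) would surface as exit code 137 instead of 0.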
------------- PR: https://git.openjdk.org/crac/pull/46 From rmarchenko at openjdk.org Tue Mar 7 08:35:33 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Tue, 7 Mar 2023 08:35:33 GMT Subject: [crac] RFR: CRaC may exit before image dump is completed [v3] In-Reply-To: References: <_lLFnw-pmrBhAQCc7tE3ScOUC_xYlP7FTSfHLNk5Z8Y=.efb73ac6-4f06-4b41-ac49-034660a7272d@github.com> Message-ID: On Mon, 6 Mar 2023 12:40:27 GMT, Roman Marchenko wrote: >> src/java.base/share/native/launcher/main.c line 120: >> >>> 118: pid = wait(&st); >>> 119: if (pid == g_child_pid && WIFEXITED(st)) { >>> 120: status = WEXITSTATUS(st); >> >> Sorry for nit-picking, but now if the java was killed (`WIFEXITED == false`) we won't update status and will return `0`, which does not look correct. `restorewait` in this situation returns `1` [1], although better, also does not look perfect. Here I suggest being at least consistent with restorewait. >> >> Or we can fix restorewait as well, indicating being killed by returning `128+signal`, as described in the bash manual [2]. How does it sound? >> >>> When a command terminates on a fatal signal N, bash uses the value of 128+N as the exit status. >> >> [1] https://github.com/openjdk/crac/blob/crac/src/java.base/unix/native/criuengine/criuengine.c#L306 >> [2] https://linux.die.net/man/1/bash > > @AntonKozlov > I personally would prefer to exit 0 on checkpoint to indicate the process has finished successfully. On the other hand, I have no idea how we can recognize cases where the child process is actually killed by someone else. So I agree with the idea to return an appropriate code 128+N, as well as for restorewait, to keep it consistent. To make things iterative, I suggest implementing wait_for_children() in the same way as restorewait() for now, and then creating a follow-up PR to make the appropriate changes related to signal handling and returning exit codes. 
------------- PR: https://git.openjdk.org/crac/pull/46 From rmarchenko at openjdk.org Tue Mar 7 09:42:24 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Tue, 7 Mar 2023 09:42:24 GMT Subject: [crac] RFR: CRaC may exit before image dump is completed [v4] In-Reply-To: References: Message-ID: > When running CRaC with docker, java may exit before CRIU has finished dumping because CRIU kills the original java process, and then docker immediately exits. > > It can be reproduced with a simple Java test: > > public class Test { > public static void main(String args[]) throws Exception { > jdk.crac.Core.checkpointRestore(); > System.out.println("finish"); > } > } > > and run with docker: > `docker $JAVA_HOME/java -XX:CRaCCheckpointTo=./cr Test.java` > > After the command above finishes, `cr/cppath` is absent in the case of failure. And/or it will fail on restore: > `docker $JAVA_HOME/java -XX:CRaCRestoreFrom=./cr` > > This change fixes the issue by forking the main process in case of PID=1 (pid=1 means it was run with docker), and waiting until all child processes have finished. This ensures that CRIU has finalized the dump, if any. At the same time, there is no conflict with PIDs on restore, since the process being restored has a PID not equal to 1 when restoring with the command above. 
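Editorial sketch of the "wait until all children have finished" step described above. The helper name `reap_all_children` is hypothetical; the patch itself refers to `wait_for_children()` and `g_child_pid`, whose exact code is not reproduced here. The default result of `1` for a child that did not exit normally is an assumption based on the thread's decision to stay consistent with `restorewait`.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical helper (not the patch's wait_for_children()): reap every
 * child of the calling process and return the exit code of main_child.
 * Returns 1 if main_child did not exit normally or was never reaped. */
int reap_all_children(pid_t main_child) {
    int result = 1;
    for (;;) {
        int st = 0;
        pid_t pid = wait(&st);
        if (pid < 0) {
            if (errno == EINTR) {
                continue;   /* interrupted by a signal: retry */
            }
            break;          /* ECHILD: no children left to wait for */
        }
        if (pid == main_child && WIFEXITED(st)) {
            result = WEXITSTATUS(st);
        }
        /* Other children (for example a dump helper re-parented to
         * PID 1) are reaped too, but do not affect the result. */
    }
    return result;
}
```

A launcher following the description would presumably call this only when `getpid() == 1`: fork, let the child continue as the JVM, and have the parent return `reap_all_children(child)`, so that the container's init process outlives the dump.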
Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: Fixing review comments ------------- Changes: - all: https://git.openjdk.org/crac/pull/46/files - new: https://git.openjdk.org/crac/pull/46/files/a0a78087..4191fb7f Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=46&range=03 - incr: https://webrevs.openjdk.org/?repo=crac&pr=46&range=02-03 Stats: 8 lines in 1 file changed: 5 ins; 0 del; 3 mod Patch: https://git.openjdk.org/crac/pull/46.diff Fetch: git fetch https://git.openjdk.org/crac pull/46/head:pull/46 PR: https://git.openjdk.org/crac/pull/46 From duke at openjdk.org Tue Mar 7 12:10:56 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 7 Mar 2023 12:10:56 GMT Subject: [crac] RFR: Add CRaC-specific tests to GHA [v5] In-Reply-To: References: Message-ID: <5anBWxOXjtuZhYx0Xl7fhDqnGiosvxdLdQJ_WsEVDGc=.994f145f-f2cb-4355-b4d9-aa251c7676cd@github.com> > Existing GitHub Actions run test tier1 but since most changes in this project focus on the CRaC capabilities we should run them in an automated fashion, too. > Right now the tests are mostly failing: this should be addressed in https://github.com/openjdk/crac/pull/47 Radim Vansa has updated the pull request incrementally with 16 additional commits since the last revision: - Finally fix ResolveTest - Add seemingly non-vital logs - Revert all debug info and hope for the best - what's wrong with cppath - Fix path - fix path - Fix criu SHA256 - Print dump log - Use newer criu - Debug failing ResolveTest in GHA - ... 
and 6 more: https://git.openjdk.org/crac/compare/04996e37...1f9645ee ------------- Changes: - all: https://git.openjdk.org/crac/pull/48/files - new: https://git.openjdk.org/crac/pull/48/files/04996e37..1f9645ee Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=48&range=04 - incr: https://webrevs.openjdk.org/?repo=crac&pr=48&range=03-04 Stats: 90 lines in 4 files changed: 40 ins; 15 del; 35 mod Patch: https://git.openjdk.org/crac/pull/48.diff Fetch: git fetch https://git.openjdk.org/crac pull/48/head:pull/48 PR: https://git.openjdk.org/crac/pull/48 From duke at openjdk.org Tue Mar 7 12:10:59 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 7 Mar 2023 12:10:59 GMT Subject: [crac] RFR: Add CRaC-specific tests to GHA [v4] In-Reply-To: References: Message-ID: On Fri, 3 Mar 2023 07:54:08 GMT, Radim Vansa wrote: >> Existing GitHub Actions run test tier1 but since most changes in this project focus on the CRaC capabilities we should run them in an automated fashion, too. >> Right now the tests are mostly failing: this should be addressed in https://github.com/openjdk/crac/pull/47 > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Debug GHA Closing for now to save people from notifications on push. I was finally able to resolve some test failures. Part of that problem was using an older version of CRIU (the one README links to, not release 1.3), but also ResolveTest was prone to the issue that should be fixed in #46 (added a workaround to the test). 
------------- PR: https://git.openjdk.org/crac/pull/48 From akozlov at openjdk.org Tue Mar 7 12:53:48 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 7 Mar 2023 12:53:48 GMT Subject: [crac] RFR: Convert CRaC tests from shell scripts to Java In-Reply-To: References: Message-ID: On Fri, 24 Feb 2023 09:20:05 GMT, Radim Vansa wrote: > Right now this is a draft as it builds on top of https://github.com/openjdk/crac/pull/47 > > Please see `CracTest` javadoc for detailed info. This looks really promising! Thank you very much for working on this! I've asked folks to give a closer look to tests changes for any subtle semanitc change, etc. test/jdk/jdk/crac/java/net/InetAddress/ResolveTest.java line 56: > 54: * @run driver jdk.test.lib.crac.CracTest > 55: */ > 56: public class ResolveTest implements CracTest { Also, that's strange, byt when running tests in batch, I get a problem with docker tests. Apparently CracTest is not copied to the classes dir. When I rerun the test individually, it fails. But if I clean the jtreg workdir and run the test individually, it passes. anton at mercury:~/proj/crac$ ~/Downloads/jtreg-6+1/bin/jtreg -nr -jdk:build/linux-x86_64-server-fastdebug/images/jdk/ test/jdk/jdk/crac/ Directory "JTwork" not found: creating Test results: passed: 25; failed: 1 Results written to /home/anton/proj/crac/JTwork Error: Some tests failed or other problems occurred. JTwork/jdk/crac/java/net/InetAddress/ResolveTest.jtr:20:execStatus=Failed. Execution failed: `main' threw exception: java.util.concurrent.CancellationException anton at mercury:~/proj/crac$ ~/Downloads/jtreg-6+1/bin/jtreg -nr -jdk:build/linux-x86_64-server-fastdebug/images/jdk/ test/jdk/jdk/crac/java/net/InetAddress/ResolveTest.java Test results: failed: 1 Results written to /home/anton/proj/crac/JTwork Error: Some tests failed or other problems occurred. 
----------System.err:(21/1653)---------- Starting docker container: docker run --rm -d --privileged --init --volume /home/anton/proj/crac/JTwork/classes/jdk/crac/java/net/InetAddress/ResolveTest.d:/test-classes/ --volume cr:/cr --volume /home/anton/proj/crac/build/linux-x86_64-server-fastdebug/images/jdk/lib/criu:/criu --env CRAC_CRIU_PATH=/criu --name crac-test --add-host some.test.hostname.example.com:192.168.12.34 jdk-internal:test-inet-address sleep 3600 Starting process to be checkpointed: docker exec crac-test /jdk/bin/java -ea -cp /test-classes -XX:CRaCCheckpointTo=cr jdk.test.lib.crac.CracTest __run_test__ ResolveTest some.test.hostname.example.com /second-run ERROR: Error: Could not find or load main class jdk.test.lib.crac.CracTest ERROR: Caused by: java.lang.ClassNotFoundException: jdk.test.lib.crac.CracTest java.util.concurrent.CancellationException at java.base/java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2478) at ResolveTest.lambda$test$1(ResolveTest.java:85) at jdk.test.lib.crac.CracProcess$1.processLine(CracProcess.java:105) at jdk.test.lib.process.StreamPumper.lambda$run$0(StreamPumper.java:127) at java.base/java.lang.Iterable.forEach(Iterable.java:75) at jdk.test.lib.process.StreamPumper.run(StreamPumper.java:127) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.lang.Thread.run(Thread.java:833) anton at mercury:~/proj/crac$ ls JTwork/classes/jdk/crac/java/net/InetAddress/ResolveTest.d/jdk/test/lib/crac/CracTest.class ls: cannot access 'JTwork/classes/jdk/crac/java/net/InetAddress/ResolveTest.d/jdk/test/lib/crac/CracTest.class': No such file or directory anton at mercury:~/proj/crac$ ~/Downloads/jtreg-6+1/bin/jtreg -w:JTwork.resolve -nr -jdk:build/linux-x86_64-server-fastdebug/images/jdk/ test/jdk/jdk/crac/java/net/InetAddress/ResolveTest.java Directory "JTwork.resolve" not found: creating Test results: 
passed: 1 Results written to /home/anton/proj/crac/JTwork.resolve test/lib/jdk/test/lib/crac/CracEngine.java line 4: > 2: > 3: public enum CracEngine { > 4: CRIU("criu"), This should be `criuengine` ------------- PR: https://git.openjdk.org/crac/pull/50 From akozlov at openjdk.org Tue Mar 7 13:05:12 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 7 Mar 2023 13:05:12 GMT Subject: [crac] RFR: CRaC may exit before image dump is completed [v3] In-Reply-To: References: <_lLFnw-pmrBhAQCc7tE3ScOUC_xYlP7FTSfHLNk5Z8Y=.efb73ac6-4f06-4b41-ac49-034660a7272d@github.com> Message-ID: On Tue, 7 Mar 2023 08:31:55 GMT, Roman Marchenko wrote: >> @AntonKozlov >> I personally would prefer to exit 0 on checkpoing to indicate the process is successfully finished. On the other hand I have no idea how can we recognize cases the child process is actually killed by someone else. So I agree with idea to return an appropriate code 128+N, as well as for restorewait, to keep it consistent. > > To make things iterative, I suggest to implement wait_for_children() in the same way as restorewait() for now, and then create the next PR to make appropriate changes related to signal handling and returning exit codes. OK, makes sense for me as well. ------------- PR: https://git.openjdk.org/crac/pull/46 From rmarchenko at openjdk.org Tue Mar 7 13:37:21 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Tue, 7 Mar 2023 13:37:21 GMT Subject: [crac] RFR: CRaC may exit before image dump is completed [v5] In-Reply-To: References: Message-ID: > When running CRaC with docker, java may exit before CRIU is finished dumpring because CRIU kills the original java process, and then docker immediately exits. 
> > It could be reproduced with a simple Java test: > > public class Test { > public static void main(String args[]) throws Exception { > jdk.crac.Core.checkpointRestore(); > System.out.println("finish"); > } > } > > and run with docker: > `docker $JAVA_HOME/java -XX:CRaCCheckpointTo=./cr Test.java` > > After the command above finishes, `cr/cppath` is absent in the case of failure. Or/and it will fail on restore: > `docker $JAVA_HOME/java -XX:CRaCRestoreFrom=./cr` > > This change fixes the issue by forkin'g the main process in case of PID=1 (pid=1 means it was run with docker), and waiting for children processes are finished. This makes us sure that CRIU finalized the dump, if any. At the same time, there is no conflict with PIDs on restore, since the process being restored has PID not equal to 1, if restoring with the command above.. Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: Minor change ------------- Changes: - all: https://git.openjdk.org/crac/pull/46/files - new: https://git.openjdk.org/crac/pull/46/files/4191fb7f..c117d076 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=46&range=04 - incr: https://webrevs.openjdk.org/?repo=crac&pr=46&range=03-04 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/crac/pull/46.diff Fetch: git fetch https://git.openjdk.org/crac pull/46/head:pull/46 PR: https://git.openjdk.org/crac/pull/46 From akozlov at openjdk.org Tue Mar 7 14:21:37 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 7 Mar 2023 14:21:37 GMT Subject: [crac] RFR: CRaC may exit before image dump is completed [v5] In-Reply-To: References: Message-ID: <8F9GccVSYHZOSmIuD5WAJY85g_QUkQLy4xC0fAW9MWU=.41dee883-06cb-4ab0-b040-f38f99190f8d@github.com> On Tue, 7 Mar 2023 13:37:21 GMT, Roman Marchenko wrote: >> When running CRaC with docker, java may exit before CRIU is finished dumpring because CRIU kills the original java process, and then docker 
immediately exits. >> >> It could be reproduced with a simple Java test: >> >> public class Test { >> public static void main(String args[]) throws Exception { >> jdk.crac.Core.checkpointRestore(); >> System.out.println("finish"); >> } >> } >> >> and run with docker: >> `docker $JAVA_HOME/java -XX:CRaCCheckpointTo=./cr Test.java` >> >> After the command above finishes, `cr/cppath` is absent in the case of failure. Or/and it will fail on restore: >> `docker $JAVA_HOME/java -XX:CRaCRestoreFrom=./cr` >> >> This change fixes the issue by forkin'g the main process in case of PID=1 (pid=1 means it was run with docker), and waiting for children processes are finished. This makes us sure that CRIU finalized the dump, if any. At the same time, there is no conflict with PIDs on restore, since the process being restored has PID not equal to 1, if restoring with the command above.. > > Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: > > Minor change LGTM, thanks! ------------- Marked as reviewed by akozlov (Lead). PR: https://git.openjdk.org/crac/pull/46 From duke at openjdk.org Tue Mar 7 14:23:33 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 7 Mar 2023 14:23:33 GMT Subject: [crac] RFR: Convert CRaC tests from shell scripts to Java [v2] In-Reply-To: References: Message-ID: <89CHDWoectWwKOK14rsDvpNNqhW4a_ewWDHfafkStJE=.6b3ebc3e-e6af-4df8-a8da-c4e317da3757@github.com> On Tue, 7 Mar 2023 12:51:12 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Some other improvements and fixes > > test/jdk/jdk/crac/java/net/InetAddress/ResolveTest.java line 56: > >> 54: * @run driver jdk.test.lib.crac.CracTest >> 55: */ >> 56: public class ResolveTest implements CracTest { > > Also, that's strange, byt when running tests in batch, I get a problem with docker tests. Apparently CracTest is not copied to the classes dir. 
When I rerun the test individually, it fails. But if I clean the jtreg workdir and run the test individually, it passes. > > > anton at mercury:~/proj/crac$ ~/Downloads/jtreg-6+1/bin/jtreg -nr -jdk:build/linux-x86_64-server-fastdebug/images/jdk/ test/jdk/jdk/crac/ > Directory "JTwork" not found: creating > Test results: passed: 25; failed: 1 > Results written to /home/anton/proj/crac/JTwork > Error: Some tests failed or other problems occurred. > > > JTwork/jdk/crac/java/net/InetAddress/ResolveTest.jtr:20:execStatus=Failed. Execution failed: `main' threw > exception: java.util.concurrent.CancellationException > > > anton at mercury:~/proj/crac$ ~/Downloads/jtreg-6+1/bin/jtreg -nr -jdk:build/linux-x86_64-server-fastdebug/images/jdk/ test/jdk/jdk/crac/java/net/InetAddress/ResolveTest.java > Test results: failed: 1 > Results written to /home/anton/proj/crac/JTwork > Error: Some tests failed or other problems occurred. > > > > ----------System.err:(21/1653)---------- > Starting docker container: > docker run --rm -d --privileged --init --volume /home/anton/proj/crac/JTwork/classes/jdk/crac/java/net/InetAddress/ResolveTest.d:/test-classes/ --volume cr:/cr --volume /home/anton/proj/crac/build/linux-x86_64-server-fastdebug/images/jdk/lib/criu:/criu --env CRAC_CRIU_PATH=/criu --name crac-test --add-host some.test.hostname.example.com:192.168.12.34 jdk-internal:test-inet-address sleep 3600 > Starting process to be checkpointed: > docker exec crac-test /jdk/bin/java -ea -cp /test-classes -XX:CRaCCheckpointTo=cr jdk.test.lib.crac.CracTest __run_test__ ResolveTest some.test.hostname.example.com /second-run > ERROR: Error: Could not find or load main class jdk.test.lib.crac.CracTest > ERROR: Caused by: java.lang.ClassNotFoundException: jdk.test.lib.crac.CracTest > java.util.concurrent.CancellationException > at java.base/java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2478) > at ResolveTest.lambda$test$1(ResolveTest.java:85) > at 
jdk.test.lib.crac.CracProcess$1.processLine(CracProcess.java:105) > at jdk.test.lib.process.StreamPumper.lambda$run$0(StreamPumper.java:127) > at java.base/java.lang.Iterable.forEach(Iterable.java:75) > at jdk.test.lib.process.StreamPumper.run(StreamPumper.java:127) > at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at java.base/java.lang.Thread.run(Thread.java:833) > > > > anton at mercury:~/proj/crac$ ls JTwork/classes/jdk/crac/java/net/InetAddress/ResolveTest.d/jdk/test/lib/crac/CracTest.class > ls: cannot access 'JTwork/classes/jdk/crac/java/net/InetAddress/ResolveTest.d/jdk/test/lib/crac/CracTest.class': No such file or directory > > > anton at mercury:~/proj/crac$ ~/Downloads/jtreg-6+1/bin/jtreg -w:JTwork.resolve -nr -jdk:build/linux-x86_64-server-fastdebug/images/jdk/ test/jdk/jdk/crac/java/net/InetAddress/ResolveTest.java > Directory "JTwork.resolve" not found: creating > Test results: passed: 1 > Results written to /home/anton/proj/crac/JTwork.resolve You might be using an old version of JTReg? I don't have that issue with 7.1.1, so some bugs may have been fixed. Although I must admit that when I was trying to avoid the `@build ResolveTest` in the headers I had similar problems with classes not being compiled. > test/lib/jdk/test/lib/crac/CracEngine.java line 4: > >> 2: >> 3: public enum CracEngine { >> 4: CRIU("criu"), > > This should be `criuengine` fixed ------------- PR: https://git.openjdk.org/crac/pull/50 From duke at openjdk.org Tue Mar 7 14:23:28 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 7 Mar 2023 14:23:28 GMT Subject: [crac] RFR: Convert CRaC tests from shell scripts to Java [v2] In-Reply-To: References: Message-ID: > Right now this is a draft as it builds on top of https://github.com/openjdk/crac/pull/47 > > Please see `CracTest` javadoc for detailed info. 
Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Some other improvements and fixes ------------- Changes: - all: https://git.openjdk.org/crac/pull/50/files - new: https://git.openjdk.org/crac/pull/50/files/3cfa1cf0..ea9debd3 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=50&range=01 - incr: https://webrevs.openjdk.org/?repo=crac&pr=50&range=00-01 Stats: 81 lines in 4 files changed: 43 ins; 25 del; 13 mod Patch: https://git.openjdk.org/crac/pull/50.diff Fetch: git fetch https://git.openjdk.org/crac pull/50/head:pull/50 PR: https://git.openjdk.org/crac/pull/50 From akozlov at openjdk.org Tue Mar 7 14:33:27 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 7 Mar 2023 14:33:27 GMT Subject: [crac] RFR: Add CRaC-specific tests to GHA [v5] In-Reply-To: <5anBWxOXjtuZhYx0Xl7fhDqnGiosvxdLdQJ_WsEVDGc=.994f145f-f2cb-4355-b4d9-aa251c7676cd@github.com> References: <5anBWxOXjtuZhYx0Xl7fhDqnGiosvxdLdQJ_WsEVDGc=.994f145f-f2cb-4355-b4d9-aa251c7676cd@github.com> Message-ID: On Tue, 7 Mar 2023 12:10:56 GMT, Radim Vansa wrote: >> Existing GitHub Actions run test tier1 but since most changes in this project focus on the CRaC capabilities we should run them in an automated fashion, too. >> Right now the tests are mostly failing: this should be addressed in https://github.com/openjdk/crac/pull/47 > > Radim Vansa has updated the pull request incrementally with 16 additional commits since the last revision: > > - Finally fix ResolveTest > - Add seemingly non-vital logs > - Revert all debug info and hope for the best > - what's wrong with cppath > - Fix path > - fix path > - Fix criu SHA256 > - Print dump log > - Use newer criu > - Debug failing ResolveTest in GHA > - ... 
and 6 more: https://git.openjdk.org/crac/compare/04996e37...1f9645ee test/jdk/jdk/crac/java/net/InetAddress/ResolveTest.java line 105: > 103: // not be able to restore the process. > 104: // This workaround should not be necessary when https://github.com/openjdk/crac/pull/46 > 105: // is integrated. I propose to wait until #46 is integrated and remove the mention (and workaround) test/jdk/jdk/crac/java/net/InetAddress/ResolveTest.java line 107: > 105: // is integrated. > 106: List javaCmd = new ArrayList<>(); > 107: javaCmd.addAll(Arrays.asList("/jdk/bin/java", "-cp", "/test-classes/", "-XX:CRaCCheckpointTo=/cr")); This looks belonging to #50. Could this be the reason for strange failures of this test I've reported there? ------------- PR: https://git.openjdk.org/crac/pull/48 From duke at openjdk.org Tue Mar 7 14:37:31 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 7 Mar 2023 14:37:31 GMT Subject: [crac] RFR: Add CRaC-specific tests to GHA [v5] In-Reply-To: References: <5anBWxOXjtuZhYx0Xl7fhDqnGiosvxdLdQJ_WsEVDGc=.994f145f-f2cb-4355-b4d9-aa251c7676cd@github.com> Message-ID: On Tue, 7 Mar 2023 14:23:50 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with 16 additional commits since the last revision: >> >> - Finally fix ResolveTest >> - Add seemingly non-vital logs >> - Revert all debug info and hope for the best >> - what's wrong with cppath >> - Fix path >> - fix path >> - Fix criu SHA256 >> - Print dump log >> - Use newer criu >> - Debug failing ResolveTest in GHA >> - ... and 6 more: https://git.openjdk.org/crac/compare/04996e37...1f9645ee > > test/jdk/jdk/crac/java/net/InetAddress/ResolveTest.java line 105: > >> 103: // not be able to restore the process. >> 104: // This workaround should not be necessary when https://github.com/openjdk/crac/pull/46 >> 105: // is integrated. 
> > I propose to wait until #46 is integrated and remove the mention (and workaround) I would integrate code when it's ready reagardless of other PRs (haven't actually checked if this works with #46). This code will be anyway rewritten with #50 ------------- PR: https://git.openjdk.org/crac/pull/48 From duke at openjdk.org Tue Mar 7 14:40:13 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 7 Mar 2023 14:40:13 GMT Subject: [crac] RFR: Add CRaC-specific tests to GHA [v5] In-Reply-To: References: <5anBWxOXjtuZhYx0Xl7fhDqnGiosvxdLdQJ_WsEVDGc=.994f145f-f2cb-4355-b4d9-aa251c7676cd@github.com> Message-ID: On Tue, 7 Mar 2023 14:29:01 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with 16 additional commits since the last revision: >> >> - Finally fix ResolveTest >> - Add seemingly non-vital logs >> - Revert all debug info and hope for the best >> - what's wrong with cppath >> - Fix path >> - fix path >> - Fix criu SHA256 >> - Print dump log >> - Use newer criu >> - Debug failing ResolveTest in GHA >> - ... and 6 more: https://git.openjdk.org/crac/compare/04996e37...1f9645ee > > test/jdk/jdk/crac/java/net/InetAddress/ResolveTest.java line 107: > >> 105: // is integrated. >> 106: List javaCmd = new ArrayList<>(); >> 107: javaCmd.addAll(Arrays.asList("/jdk/bin/java", "-cp", "/test-classes/", "-XX:CRaCCheckpointTo=/cr")); > > This looks belonging to #50. Could this be the reason for strange failures of this test I've reported there? Why would you think that? There's a different variant of this test in #50. This particular change, while correct, has no effect when we concatenate the list anyway, it's there only for clarity. I had to change the line when experimenting with `--init` version without the bash loop workaround. 
------------- PR: https://git.openjdk.org/crac/pull/48 From duke at openjdk.org Tue Mar 7 14:49:27 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 7 Mar 2023 14:49:27 GMT Subject: [crac] RFR: Convert CRaC tests from shell scripts to Java [v3] In-Reply-To: References: Message-ID: > Right now this is a draft as it builds on top of https://github.com/openjdk/crac/pull/47 > > Please see `CracTest` javadoc for detailed info. Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Remove forgotten debugging command ------------- Changes: - all: https://git.openjdk.org/crac/pull/50/files - new: https://git.openjdk.org/crac/pull/50/files/ea9debd3..57e73973 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=50&range=02 - incr: https://webrevs.openjdk.org/?repo=crac&pr=50&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod Patch: https://git.openjdk.org/crac/pull/50.diff Fetch: git fetch https://git.openjdk.org/crac pull/50/head:pull/50 PR: https://git.openjdk.org/crac/pull/50 From akozlov at openjdk.org Wed Mar 8 14:39:35 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 8 Mar 2023 14:39:35 GMT Subject: [crac] RFR: Convert CRaC tests from shell scripts to Java [v3] In-Reply-To: <89CHDWoectWwKOK14rsDvpNNqhW4a_ewWDHfafkStJE=.6b3ebc3e-e6af-4df8-a8da-c4e317da3757@github.com> References: <89CHDWoectWwKOK14rsDvpNNqhW4a_ewWDHfafkStJE=.6b3ebc3e-e6af-4df8-a8da-c4e317da3757@github.com> Message-ID: The message from this sender included one or more files which could not be scanned for virus detection; do not open these files unless you are certain of the sender's intent. 
---------------------------------------------------------------------- On Tue, 7 Mar 2023 14:18:39 GMT, Radim Vansa wrote: >> test/jdk/jdk/crac/java/net/InetAddress/ResolveTest.java line 56: >> >>> 54: * @run driver jdk.test.lib.crac.CracTest >>> 55: */ >>> 56: public class ResolveTest implements CracTest { >> >> Also, that's strange, byt when running tests in batch, I get a problem with docker tests. Apparently CracTest is not copied to the classes dir. When I rerun the test individually, it fails. But if I clean the jtreg workdir and run the test individually, it passes. >> >> >> anton at mercury:~/proj/crac$ ~/Downloads/jtreg-6+1/bin/jtreg -nr -jdk:build/linux-x86_64-server-fastdebug/images/jdk/ test/jdk/jdk/crac/ >> Directory "JTwork" not found: creating >> Test results: passed: 25; failed: 1 >> Results written to /home/anton/proj/crac/JTwork >> Error: Some tests failed or other problems occurred. >> >> >> JTwork/jdk/crac/java/net/InetAddress/ResolveTest.jtr:20:execStatus=Failed. Execution failed: `main' threw >> exception: java.util.concurrent.CancellationException >> >> >> anton at mercury:~/proj/crac$ ~/Downloads/jtreg-6+1/bin/jtreg -nr -jdk:build/linux-x86_64-server-fastdebug/images/jdk/ test/jdk/jdk/crac/java/net/InetAddress/ResolveTest.java >> Test results: failed: 1 >> Results written to /home/anton/proj/crac/JTwork >> Error: Some tests failed or other problems occurred. 
>> >> >> >> ----------System.err:(21/1653)---------- >> Starting docker container: >> docker run --rm -d --privileged --init --volume /home/anton/proj/crac/JTwork/classes/jdk/crac/java/net/InetAddress/ResolveTest.d:/test-classes/ --volume cr:/cr --volume /home/anton/proj/crac/build/linux-x86_64-server-fastdebug/images/jdk/lib/criu:/criu --env CRAC_CRIU_PATH=/criu --name crac-test --add-host some.test.hostname.example.com:192.168.12.34 jdk-internal:test-inet-address sleep 3600 >> Starting process to be checkpointed: >> docker exec crac-test /jdk/bin/java -ea -cp /test-classes -XX:CRaCCheckpointTo=cr jdk.test.lib.crac.CracTest __run_test__ ResolveTest some.test.hostname.example.com /second-run >> ERROR: Error: Could not find or load main class jdk.test.lib.crac.CracTest >> ERROR: Caused by: java.lang.ClassNotFoundException: jdk.test.lib.crac.CracTest >> java.util.concurrent.CancellationException >> at java.base/java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2478) >> at ResolveTest.lambda$test$1(ResolveTest.java:85) >> at jdk.test.lib.crac.CracProcess$1.processLine(CracProcess.java:105) >> at jdk.test.lib.process.StreamPumper.lambda$run$0(StreamPumper.java:127) >> at java.base/java.lang.Iterable.forEach(Iterable.java:75) >> at jdk.test.lib.process.StreamPumper.run(StreamPumper.java:127) >> at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) >> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) >> at java.base/java.lang.Thread.run(Thread.java:833) >> >> >> >> anton at mercury:~/proj/crac$ ls JTwork/classes/jdk/crac/java/net/InetAddress/ResolveTest.d/jdk/test/lib/crac/CracTest.class >> ls: cannot access 'JTwork/classes/jdk/crac/java/net/InetAddress/ResolveTest.d/jdk/test/lib/crac/CracTest.class': No such file or directory >> >> >> anton at mercury:~/proj/crac$ ~/Downloads/jtreg-6+1/bin/jtreg -w:JTwork.resolve -nr -jdk:build/linux-x86_64-server-fastdebug/images/jdk/ 
test/jdk/jdk/crac/java/net/InetAddress/ResolveTest.java >> Directory "JTwork.resolve" not found: creating >> Test results: passed: 1 >> Results written to /home/anton/proj/crac/JTwork.resolve > > You might be using an old version of JTReg? I don't have that issue with 7.1.1, so some bugs may have been fixed. Although I must admit that when I was trying to avoid the `@build ResolveTest` in the headers I had similar problems with classes not being compiled. After switching to jtreg-7.1+1, the did not reproduce. OK, let's assume that was a jtreg problem. ------------- PR: https://git.openjdk.org/crac/pull/50 From akozlov at openjdk.org Fri Mar 10 09:15:00 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 10 Mar 2023 09:15:00 GMT Subject: [crac] RFR: Convert CRaC tests from shell scripts to Java [v3] In-Reply-To: References: Message-ID: On Tue, 7 Mar 2023 14:49:27 GMT, Radim Vansa wrote: >> Right now this is a draft as it builds on top of https://github.com/openjdk/crac/pull/47 >> >> Please see `CracTest` javadoc for detailed info. > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Remove forgotten debugging command test/jdk/jdk/crac/JarFileFactoryCacheTest/JarFileFactoryCacheTest.java line 61: > 59: } > 60: assert temp.toFile().delete(); > 61: } Looks fine, although not very expected. The test is adjusted to generate the text file, so it is also deleted after the jar is generated. A nit: probably `temp.resolve("test.txt")` worth to be computed once. test/jdk/jdk/crac/SecureRandom/InterlockTest.java line 119: > 117: public static void main(String[] args) throws Exception { > 118: CracTest.run(InterlockTest.class, args); > 119: } Is it a leftover that can be deleted, or do we need to help CracTest to find the test class in this test? 
test/jdk/jdk/crac/Selector/Test970/ChannelResource.java line 34: > 32: public enum SelectionType {SELECT, SELECT_TIMEOUT, SELECT_NOW} > 33: > 34: ; Something wrong with formatting here test/jdk/jdk/crac/Selector/Test970/ChannelResource.java line 57: > 55: } > 56: > 57: @java.lang.Override Maybe just `@Override`? ------------- PR: https://git.openjdk.org/crac/pull/50 From duke at openjdk.org Fri Mar 10 13:46:38 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 10 Mar 2023 13:46:38 GMT Subject: [crac] RFR: Convert CRaC tests from shell scripts to Java [v3] In-Reply-To: References: Message-ID: On Wed, 8 Mar 2023 14:39:25 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove forgotten debugging command > > test/jdk/jdk/crac/JarFileFactoryCacheTest/JarFileFactoryCacheTest.java line 61: > >> 59: } >> 60: assert temp.toFile().delete(); >> 61: } > > Looks fine, although not very expected. The test is adjusted to generate the text file, so it is also deleted after the jar is generated. > > A nit: probably `temp.resolve("test.txt")` worth to be computed once. I am not sure what you tried to say in the first paragraph; is this a suggestion? Yes, I changed the code to get rid of extra resource file. I'll put the resolved path to a var. ------------- PR: https://git.openjdk.org/crac/pull/50 From duke at openjdk.org Fri Mar 10 14:42:25 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 10 Mar 2023 14:42:25 GMT Subject: [crac] RFR: Convert CRaC tests from shell scripts to Java [v4] In-Reply-To: References: Message-ID: > Right now this is a draft as it builds on top of https://github.com/openjdk/crac/pull/47 > > Please see `CracTest` javadoc for detailed info. 
Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Address review comments ------------- Changes: - all: https://git.openjdk.org/crac/pull/50/files - new: https://git.openjdk.org/crac/pull/50/files/57e73973..4114106a Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=50&range=03 - incr: https://webrevs.openjdk.org/?repo=crac&pr=50&range=02-03 Stats: 15 lines in 3 files changed: 3 ins; 4 del; 8 mod Patch: https://git.openjdk.org/crac/pull/50.diff Fetch: git fetch https://git.openjdk.org/crac pull/50/head:pull/50 PR: https://git.openjdk.org/crac/pull/50 From duke at openjdk.org Fri Mar 10 14:42:26 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 10 Mar 2023 14:42:26 GMT Subject: [crac] RFR: Convert CRaC tests from shell scripts to Java [v3] In-Reply-To: References: Message-ID: <-5LQCF5WB7DIK_s2LIfe8mzYVuADo9lBLGPlRGaiY-c=.c61dd21c-0c25-4b06-a445-324780d4b404@github.com> On Tue, 7 Mar 2023 14:49:27 GMT, Radim Vansa wrote: >> Right now this is a draft as it builds on top of https://github.com/openjdk/crac/pull/47 >> >> Please see `CracTest` javadoc for detailed info. > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Remove forgotten debugging command Thanks for spotting those, TBH I haven't read through the ChannelResource after refactoring. Addressed. ------------- PR: https://git.openjdk.org/crac/pull/50 From akozlov at openjdk.org Fri Mar 10 16:22:43 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 10 Mar 2023 16:22:43 GMT Subject: [crac] RFR: Convert CRaC tests from shell scripts to Java [v4] In-Reply-To: References: Message-ID: <8gTUTK93RwI2TvLmuVPDq-DplEnwcEXhiZikRQIKMr8=.3f0a5ce8-d543-414b-b55d-40eea922fde0@github.com> On Fri, 10 Mar 2023 14:42:25 GMT, Radim Vansa wrote: >> Right now this is a draft as it builds on top of https://github.com/openjdk/crac/pull/47 >> >> Please see `CracTest` javadoc for detailed info. 
> > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments LGTM. Thank you a lot! ------------- Marked as reviewed by akozlov (Lead). PR: https://git.openjdk.org/crac/pull/50 From duke at openjdk.org Tue Mar 14 13:27:54 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Tue, 14 Mar 2023 13:27:54 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v10] In-Reply-To: References: Message-ID: > Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. > > 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. > 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC > > (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: > >> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine > > It is intended to specify the lowest common denominator of all CPUs in a farm. 
Instead of a possible crash of OpenJDK it will now refuse to run: > >> Error occurred during initialization of VM >> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi > > (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. > > If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there is a precedent OpenJDK upstream has already rejected such re-exec idea in the past: https://github.com/openjdk/crac/pull/31#issuecomment-1275707621 > > That IMO does not preclude trying the same for this case. > > - Debian 11 x86_64: It does not work, glibc is too different and inlined there. > - Debian 12 x86_64: It works even without libc6-dbg as its offsets are the default. > - Fedora 36 x86_64: It works as on Fedoras glibc debuginfo is embedded. Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: Fix strlen_local for gcc optimizations; but it did not happen with OpenJDK. ------------- Changes: - all: https://git.openjdk.org/crac/pull/41/files - new: https://git.openjdk.org/crac/pull/41/files/8fdb17d2..ff0f4eab Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=09 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=08-09 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41
From duke at openjdk.org Tue Mar 14 13:55:32 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Tue, 14 Mar 2023 13:55:32 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v11] In-Reply-To: References: Message-ID: Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: Fix a race as found by Adhemerval Zanella Netto. ------------- Changes: - all: https://git.openjdk.org/crac/pull/41/files - new: https://git.openjdk.org/crac/pull/41/files/ff0f4eab..f9116d6e Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=10 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=09-10 Stats: 83 lines in 1 file changed: 28 ins; 26 del; 29 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41
From duke at openjdk.org Tue Mar 14 14:25:51 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Tue, 14 Mar 2023 14:25:51 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v12] In-Reply-To: References: Message-ID: <5LsUWnfiXcUJ0JCrZer8Yw7-0Wkj78WwcEXbJ3rBDI0=.58e3bdc4-11ad-425d-872e-f2a163981e2c@github.com> Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: Fix whitespaces. ------------- Changes: - all: https://git.openjdk.org/crac/pull/41/files - new: https://git.openjdk.org/crac/pull/41/files/f9116d6e..ac4825d7 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=11 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=10-11 Stats: 7 lines in 1 file changed: 0 ins; 0 del; 7 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41
From akozlov at openjdk.org Wed Mar 15 14:16:15 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 15 Mar 2023 14:16:15 GMT Subject: [crac] RFR: Improved open file descriptors tracking [v4] In-Reply-To: References: Message-ID: <4wkAuIGHgohQBmSVv1ot-kU3aQfRGI_Vqss610uAFsk=.298e2342-b56a-444f-b7d5-dd445b53dae9@github.com> On Thu, 16 Feb 2023 11:18:34 GMT, Radim Vansa wrote: >> Tracks `java.io.FileDescriptor` instances as CRaC resource; before checkpoint these are reported and if not allow-listed (e.g. as opened as standard descriptors) an exception is thrown. Further investigation can use system property `jdk.crac.collect-fd-stacktraces=true` to record origin of those file descriptors. >> File descriptors claimed in Java code are passed to native; native code checks all open file descriptors and reports error if there's an unexpected FD that is not included in the list passed previously. > > Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains 16 additional commits since the last revision: > > - Merge remote-tracking branch 'origin/crac' into newfd > - Use descriptor access rather than extending API > - 8272472: StackGuardPages test doesn't build with glibc 2.34 > > Backport-of: f77a1a156f3da9068d012d9227c7ee0fee58f571 > - Empty commit to trigger GHA > - Drop native FDs tracking > - Avoid claiming invalid FileDescriptor > - Whitelist RandomAccessFile opening classpath files > > This is a workaround for some frameworks opening classpath files in > a non-standard way. > - Add tracking of FD origin > - Track FileDescriptors closed by NIO > - Track native FDs from EPoll > - ... and 6 more: https://git.openjdk.org/crac/compare/b3868d62...9b5e7edd src/java.base/share/classes/java/io/FileDescriptor.java line 370: > 368: JDKContext ctx = jdk.internal.crac.Core.getJDKContext(); > 369: if (ctx.claimFdWeak(this, this)) { > 370: throw new CheckpointOpenFileException("FileDescriptor " + this.fd + " left open. " + JDKContext.COLLECT_FD_STACKTRACES_HINT, resource.stackTraceHolder); The hint is apparently printed even if that is enabled. Maybe let CheckpointOpenResource decide about the hint? Or have a centralized message formatter somewhere else. src/java.base/share/classes/jdk/crac/Core.java line 138: > 136: fdArr[i] = claimedPairs.get(i).getKey(); > 137: objArr[i] = claimedPairs.get(i).getValue(); > 138: System.out.printf("%d %s\n", fdArr[i], objArr[i]); Left-over from a debugging apparently src/java.base/unix/classes/sun/nio/ch/FileDispatcherImpl.java line 252: > 250: throws IOException; > 251: > 252: static native void close0(FileDescriptor fd) throws IOException; Maybe make it private? To prevent uses without `markClosed`. 
------------- PR: https://git.openjdk.org/crac/pull/43 From rmarchenko at openjdk.org Wed Mar 15 14:18:58 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Wed, 15 Mar 2023 14:18:58 GMT Subject: [crac] Integrated: CRaC may exit before image dump is completed In-Reply-To: References: Message-ID: On Tue, 21 Feb 2023 13:50:23 GMT, Roman Marchenko wrote: > When running CRaC with docker, java may exit before CRIU has finished dumping because CRIU kills the original java process, and then docker immediately exits. > > It can be reproduced with a simple Java test: > > public class Test { > public static void main(String args[]) throws Exception { > jdk.crac.Core.checkpointRestore(); > System.out.println("finish"); > } > } > > and run with docker: > `docker $JAVA_HOME/java -XX:CRaCCheckpointTo=./cr Test.java` > > After the command above finishes, `cr/cppath` is absent in the case of failure. And/or it will fail on restore: > `docker $JAVA_HOME/java -XX:CRaCRestoreFrom=./cr` > > This change fixes the issue by forking the main process in case of PID=1 (pid=1 means it was run with docker), and waiting until the child processes have finished. This ensures that CRIU has finalized the dump, if any. At the same time, there is no conflict with PIDs on restore, since the process being restored has PID not equal to 1, if restoring with the command above. This pull request has now been integrated.
Changeset: a11b46a0 Author: Roman Marchenko Committer: Anton Kozlov URL: https://git.openjdk.org/crac/commit/a11b46a0f135d522f701d7fbf8833e7670198a19 Stats: 76 lines in 1 file changed: 76 ins; 0 del; 0 mod CRaC may exit before image dump is completed Reviewed-by: akozlov ------------- PR: https://git.openjdk.org/crac/pull/46 From akozlov at azul.com Wed Mar 15 14:36:33 2023 From: akozlov at azul.com (Anton Kozlov) Date: Wed, 15 Mar 2023 16:36:33 +0200 Subject: Proposal for documentation and snapsafety In-Reply-To: <1c0a8870-371f-fe2e-d0a8-f5df241a842d@azul.com> References: <1c0a8870-371f-fe2e-d0a8-f5df241a842d@azul.com> Message-ID: <16df8468-400a-c238-97f5-90ded3b1856a@azul.com> On 2/27/23 11:01, Radim Vansa wrote: > While all JDK code can be eventually fixed in a similar way to SecureRandom I > think that it's clear that not everything can be encapsulated. A good example > are the environment variables, but also number of processors and many others > [2]. It looks the problem here is not technical correctness, but still the user experience, right? I.e. by providing good documentation (javadoc == Java SE API), we can provide a good level of specification of what to expect on checkpoint and on restore. But adhering to that specification is complicated for users, as it is still a hard task to find uses of a method with a semantic that has been changed in CRaC. Is this understanding of the problem correct? > I propose to tag any method/constructor that returns data that could be > expected to stay constant in non-C/R app but often changes after restore, or > an object that will need handling through a registered Resource, with > @CracSensitive (3) annotation. We will provide a tool that will report > places that could call these methods, unless marked with @CracSafe > annotation. This tool could work in a static way (scanning set of JARs, > probably with a thin Maven plugin as well?) and as a javaagent, scanning > classes as these get loaded. 
> > Naturally not all code invoking non-snap-safe methods is from user code, many > cases come from the libraries. Alternative way to allow-list places calling > @CracSensitive methods in places that cannot be changed directly would be > provided, though eventually we aim at encouraging that the libraries adopt > the @CracSafe internally. While technically this is possible, there are a few drawbacks IMHO. First, the tool and annotations are interdependent, although the dependency of annotations on the tool is implicit. But anyway, annotations do not make any sense without the tool checking them. So, either the tool and annotations are somehow should be completely external to the JDK, or both of them should be in the JDK. But, I'm not sure the tool is the best approach. That does not take advantage of being able to track exact calls of the annotated methods before the checkpoint and after restore. For example, querying the number of processors is fine if happens after the restore. So the tool would need somehow to distinguish calls of annotated methods before checkpoint (where previously returned results may become obsolete) and those after restore, otherwise, there will be some number of false positives, and those false positives would require some way to silence them after consideration. Also, even before the checkpoint, having a call in the code does not mean that will be actually called e.g. because of some specific configuration that disables detection of the number of processes. So it seems without pretty complex static dataflow analysis we'll have another source of false positives. Have you considered taking advantage of actually running the program? E.g. recording stack traces for methods calls and reporting them on the checkpoint, like in PR #43 [1]. Compared to the separate tool, the call recording reports only calls that have happened, and only before the checkpoint. 
The stack trace provides some information about how the result will be used (although not complete info on how the result of the method is going to be used). The implementation will probably be very simple, and by some convention, we can agree on a way to exclude some stack traces from reporting, e.g. by having a specific stack trace element. Does the tool have an advantage over the recording of method calls and stack traces? [1] https://github.com/openjdk/crac/pull/43 Thanks, Anton From duke at openjdk.org Mon Mar 20 09:59:00 2023 From: duke at openjdk.org (Radim Vansa) Date: Mon, 20 Mar 2023 09:59:00 GMT Subject: [crac] RFR: Improved open file descriptors tracking [v4] In-Reply-To: <4wkAuIGHgohQBmSVv1ot-kU3aQfRGI_Vqss610uAFsk=.298e2342-b56a-444f-b7d5-dd445b53dae9@github.com> References: <4wkAuIGHgohQBmSVv1ot-kU3aQfRGI_Vqss610uAFsk=.298e2342-b56a-444f-b7d5-dd445b53dae9@github.com> Message-ID: On Mon, 13 Mar 2023 14:03:52 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 16 additional commits since the last revision: >> >> - Merge remote-tracking branch 'origin/crac' into newfd >> - Use descriptor access rather than extending API >> - 8272472: StackGuardPages test doesn't build with glibc 2.34 >> >> Backport-of: f77a1a156f3da9068d012d9227c7ee0fee58f571 >> - Empty commit to trigger GHA >> - Drop native FDs tracking >> - Avoid claiming invalid FileDescriptor >> - Whitelist RandomAccessFile opening classpath files >> >> This is a workaround for some frameworks opening classpath files in >> a non-standard way. >> - Add tracking of FD origin >> - Track FileDescriptors closed by NIO >> - Track native FDs from EPoll >> - ... 
and 6 more: https://git.openjdk.org/crac/compare/6a0834b4...9b5e7edd > > src/java.base/share/classes/jdk/crac/Core.java line 138: > >> 136: fdArr[i] = claimedPairs.get(i).getKey(); >> 137: objArr[i] = claimedPairs.get(i).getValue(); >> 138: System.out.printf("%d %s\n", fdArr[i], objArr[i]); > > Left-over from a debugging apparently I am not so sure about that; the `objArr` is later ignored by the `VM_Crac` so I thought this is the place that should report open FDs, even though claimed by Java. ------------- PR: https://git.openjdk.org/crac/pull/43 From rvansa at azul.com Mon Mar 20 10:53:05 2023 From: rvansa at azul.com (Radim Vansa) Date: Mon, 20 Mar 2023 11:53:05 +0100 Subject: Proposal for documentation and snapsafety In-Reply-To: <16df8468-400a-c238-97f5-90ded3b1856a@azul.com> References: <1c0a8870-371f-fe2e-d0a8-f5df241a842d@azul.com> <16df8468-400a-c238-97f5-90ded3b1856a@azul.com> Message-ID: On 15. 03. 23 16:36, Anton Kozlov wrote: > On 2/27/23 11:01, Radim Vansa wrote: >> While all JDK code can be eventually fixed in a similar way to >> SecureRandom I >> think that it's clear that not everything can be encapsulated. A good >> example >> are the environment variables, but also number of processors and many >> others >> [2]. > > It looks the problem here is not technical correctness, but still the > user > experience, right? I.e. by providing good documentation (javadoc == > Java SE > API), we can provide a good level of specification of what to expect on > checkpoint and on restore. But adhering to that specification is > complicated > for users, as it is still a hard task to find uses of a method with a > semantic > that has been changed in CRaC. Is this understanding of the problem > correct? Yes, though I would like to stress that these usages might occur in legacy/3rd party code the user has little knowledge about.
> >> I propose to tag any method/constructor that returns data that could be >> expected to stay constant in non-C/R app but often changes after >> restore, or >> an object that will need handling through a registered Resource, with >> @CracSensitive (3) annotation. We will provide a tool that will report >> places that could call these methods, unless marked with @CracSafe >> annotation. This tool could work in a static way (scanning set of JARs, >> probably with a thin Maven plugin as well?) and as a javaagent, scanning >> classes as these get loaded. >> >> Naturally not all code invoking non-snap-safe methods is from user >> code, many >> cases come from the libraries. Alternative way to allow-list places >> calling >> @CracSensitive methods in places that cannot be changed directly >> would be >> provided, though eventually we aim at encouraging that the libraries >> adopt >> the @CracSafe internally. > > While technically this is possible, there are a few drawbacks IMHO. > First, the > tool and annotations are interdependent, although the dependency of > annotations > on the tool is implicit. But anyway, annotations do not make any > sense without > the tool checking them. So, either the tool and annotations are > somehow should > be completely external to the JDK, or both of them should be in the JDK. I agree; I assumed that the core of the tool would live in JDK, though for practical reasons there would be external integrations (e.g. Maven plugin) outside JDK repo. Based on our offline discussion I agree that listing the crac-sensitive methods externally would work. I was considering piggy-backing on the free-form contents of the @SuppressWarnings annotation (making sure that "crac" or a similar string is accepted) to implement this rather than standalone annotations, but this one has source-level retention, therefore it can't be used if the tool is supposed to analyze usages in dependencies, too. > > But, I'm not sure the tool is the best approach.
That does not take > advantage > of being able to track exact calls of the annotated methods before the > checkpoint and after restore. For example, querying the number of > processors > is fine if happens after the restore. So the tool would need somehow to > distinguish calls of annotated methods before checkpoint (where > previously > returned results may become obsolete) and those after restore, > otherwise, there > will be some number of false positives, and those false positives > would require > some way to silence them after consideration. Also, even before the > checkpoint, having a call in the code does not mean that will be actually > called e.g. because of some specific configuration that disables > detection of > the number of processes. So it seems without pretty complex static > dataflow > analysis we'll have another source of false positives. I never intended to perform complex analysis in the tooling, that would turn into a can of worms. 'False positives' is a rather misleading term here as the tool is not supposed to provide definitive guidance, but let's not spend time on nomenclature. Silencing the possible problems is an integral part of the proposal, that's what the @CracSafe would be used for. This line of thought led me to question whether the API (e.g. Core class) could expose a method to find the C/R generation (e.g. 0 = before first checkpoint, 1 = after restore, 2 = after restore from second checkpoint once we support that in the implementation...) to be able to write assertions based on that (e.g. method should not be called before checkpoint). It's very simple to implement a resource tracking this on your own, but a standard way might be better. > > Have you considered taking advantage of actually running the program? > E.g. > recording stack traces for methods calls and reporting them on the > checkpoint, > like in PR #43 [1].
Compared to the separate tool, the call recording > reports > only calls that have happened, and only before the checkpoint. The > stack trace > provides some information about how the result will be used (although not > complete info on how the result of the method is going to be used). The > implementation will probably be very simple, and by some convention, > we can > agree on a way to exclude some stack traces from reporting, e.g. by > having a > specific stack trace element. > > Does the tool have an advantage over the recording of method calls and > stack > traces? I consider being able to run ahead a big advantage; this is probably a personal preference and depends on how much the checkpoint-restore scenario would be tested in practice - I expect a rather optimistic approach during adoption that will skip many cases that could happen in practice. A runtime-checking system is definitely possible, too, and these two solutions could complement each other, leaving the choice up to the user. It would be useful to share the 'declarative part' (marking problematic methods and verified usages), though. Checking at runtime would give you more details, fewer 'false positives' but also no guarantees for code paths not executed. Any runtime checker implementation should make sure, though, that there's no impact in production. You're right that javaagent checking loaded code could be too much of a middle-ground, we could instrument the call sites to go through validating proxies in this mode.
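The C/R generation counter mentioned above could look roughly like the following sketch. It is only an illustration under stated assumptions: `CheckpointListener` and `GenerationTracker` are hypothetical names that do not come from the thread or from the CRaC code base. `CheckpointListener` merely stands in for the checkpoint/restore callbacks so that the example compiles on a stock JDK; in a CRaC build the tracker would presumably implement the project's `Resource` interface and be registered with the global context instead.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical stand-in for the callbacks a CRaC resource receives.
 * It exists only to keep this sketch self-contained.
 */
interface CheckpointListener {
    void beforeCheckpoint() throws Exception;
    void afterRestore() throws Exception;
}

/**
 * Counts how many times the process has been restored:
 * 0 = before the first checkpoint, 1 = after the first restore, and so on.
 */
final class GenerationTracker implements CheckpointListener {
    private final AtomicInteger generation = new AtomicInteger();

    @Override
    public void beforeCheckpoint() {
        // Nothing to do: the generation only changes once a restore happens.
    }

    @Override
    public void afterRestore() {
        generation.incrementAndGet();
    }

    /** Current C/R generation. */
    int generation() {
        return generation.get();
    }

    /** Assertion helper for code that must not run before the first restore. */
    void requireRestored(String what) {
        if (generation.get() == 0) {
            throw new IllegalStateException(what + " used before the first restore");
        }
    }
}
```

Code that caches a restore-sensitive value could call `requireRestored(...)` before reading it. How such a check would be wired to the annotated methods discussed in this thread is left open here.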
Radim > > [1] https://github.com/openjdk/crac/pull/43 > > Thanks, > Anton From akozlov at openjdk.org Mon Mar 20 13:46:49 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Mon, 20 Mar 2023 13:46:49 GMT Subject: [crac] RFR: Improved open file descriptors tracking [v4] In-Reply-To: References: <4wkAuIGHgohQBmSVv1ot-kU3aQfRGI_Vqss610uAFsk=.298e2342-b56a-444f-b7d5-dd445b53dae9@github.com> Message-ID: On Mon, 20 Mar 2023 09:56:02 GMT, Radim Vansa wrote: >> src/java.base/share/classes/jdk/crac/Core.java line 138: >> >>> 136: fdArr[i] = claimedPairs.get(i).getKey(); >>> 137: objArr[i] = claimedPairs.get(i).getValue(); >>> 138: System.out.printf("%d %s\n", fdArr[i], objArr[i]); >> >> Left-over from a debugging apparently > > I am not so sure about that; the `objArr` is later ignored by the `VM_Crac` so I thought this it the place that should report open FDs, even though claimed by Java. I see, reporting here could be useful, but then it'd better use a Logger or something like that. BTW this was implemented in ad-hoc manner in https://github.com/openjdk/crac/pull/38/files, these should be unified. ------------- PR: https://git.openjdk.org/crac/pull/43 From duke at openjdk.org Mon Mar 20 13:54:57 2023 From: duke at openjdk.org (Radim Vansa) Date: Mon, 20 Mar 2023 13:54:57 GMT Subject: [crac] Integrated: Convert CRaC tests from shell scripts to Java In-Reply-To: References: Message-ID: On Fri, 24 Feb 2023 09:20:05 GMT, Radim Vansa wrote: > Right now this is a draft as it builds on top of https://github.com/openjdk/crac/pull/47 > > Please see `CracTest` javadoc for detailed info. This pull request has now been integrated. 
Changeset: 2c83b9c3 Author: Radim Vansa Committer: Anton Kozlov URL: https://git.openjdk.org/crac/commit/2c83b9c3c7b629a16b24de42fd527018fe598b5f Stats: 3601 lines in 54 files changed: 1693 ins; 1674 del; 234 mod Convert CRaC tests from shell scripts to Java Reviewed-by: akozlov ------------- PR: https://git.openjdk.org/crac/pull/50 From rmarchenko at openjdk.org Mon Mar 20 15:15:21 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Mon, 20 Mar 2023 15:15:21 GMT Subject: [crac] RFR: Implement CRaC exit codes Message-ID: <08Y5ItCJ6hUexw9wA_q3B-759KqEXzvvE-CawT-wbs8=.44a46cb5-ec66-47a0-9862-f2670a957804@github.com> Implement CRaC exit codes in case it catches a signal. This is a continuation of #46 inspired by [this comment](https://github.com/openjdk/crac/pull/46#discussion_r1122014722) So, in case the child is terminated by a signal, CRaC should be terminated in similar way too, if possible. Otherwise it returns 128+n exit code as bash does. ------------- Commit messages: - Implented re-raising signals to terminate a parent process Changes: https://git.openjdk.org/crac/pull/52/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=52&range=00 Stats: 22 lines in 2 files changed: 22 ins; 0 del; 0 mod Patch: https://git.openjdk.org/crac/pull/52.diff Fetch: git fetch https://git.openjdk.org/crac pull/52/head:pull/52 PR: https://git.openjdk.org/crac/pull/52 From akozlov at openjdk.org Mon Mar 20 15:24:14 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Mon, 20 Mar 2023 15:24:14 GMT Subject: [crac] RFR: Implement CRaC exit codes In-Reply-To: <08Y5ItCJ6hUexw9wA_q3B-759KqEXzvvE-CawT-wbs8=.44a46cb5-ec66-47a0-9862-f2670a957804@github.com> References: <08Y5ItCJ6hUexw9wA_q3B-759KqEXzvvE-CawT-wbs8=.44a46cb5-ec66-47a0-9862-f2670a957804@github.com> Message-ID: On Mon, 20 Mar 2023 15:08:18 GMT, Roman Marchenko wrote: > Implement CRaC exit codes in case it catches a signal. 
> > This is a continuation of #46 inspired by [this comment](https://github.com/openjdk/crac/pull/46#discussion_r1122014722) > > So, in case the child is terminated by a signal, CRaC should be terminated in similar way too, if possible. Otherwise it returns 128+n exit code as bash does. LGTM, thanks! ------------- PR: https://git.openjdk.org/crac/pull/52 From akozlov at openjdk.org Tue Mar 21 12:55:25 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 21 Mar 2023 12:55:25 GMT Subject: [crac] RFR: Add CRaC-specific tests to GHA [v5] In-Reply-To: References: <5anBWxOXjtuZhYx0Xl7fhDqnGiosvxdLdQJ_WsEVDGc=.994f145f-f2cb-4355-b4d9-aa251c7676cd@github.com> Message-ID: On Tue, 7 Mar 2023 14:35:04 GMT, Radim Vansa wrote: >> test/jdk/jdk/crac/java/net/InetAddress/ResolveTest.java line 105: >> >>> 103: // not be able to restore the process. >>> 104: // This workaround should not be necessary when https://github.com/openjdk/crac/pull/46 >>> 105: // is integrated. >> >> I propose to wait until #46 is integrated and remove the mention (and workaround) > > I would integrate code when it's ready reagardless of other PRs (haven't actually checked if this works with #46). This code will be anyway rewritten with #50 Could you update the code, since #46 and #50 are merged? ------------- PR Review Comment: https://git.openjdk.org/crac/pull/48#discussion_r1143327394 From akozlov at azul.com Tue Mar 21 14:25:07 2023 From: akozlov at azul.com (Anton Kozlov) Date: Tue, 21 Mar 2023 16:25:07 +0200 Subject: Project CRaC Guidelines (draft) Message-ID: Please find a draft of Project CRaC Guidelines below. They aim to provide a programmer-centric scope of the project, as well as to set a foundation for programming model discussions, like [1]. [1] https://github.com/openjdk/crac/pull/30 Thanks, Anton CRaC Project Guidelines ======================= 0. 
CRaC Project suggests a way to solve real Java problems with start-up and warm-up with a checkpoint-restore mechanism, which implies a new programming model and a new Java API to support the model. The new programming model is not very different from the conventional one. The difference is that the Java VM can be voluntarily paused from execution, with upfront notification before the pause ("checkpoint") and after the pause is complete ("restore"). While paused, the environment where the VM executes can change in an arbitrary way; the VM can also be replicated. The new API should provide a way for programs to prepare for the checkpoint and to be able to react to environmental changes after the restore.

1. CRaC Project is to develop a new Coordination API that provides all components of Java VM, the standard library, and an application with notifications. The API is intended to be simple and conservative. It should be abstract enough to allow different CRaC implementations in different environments.

2. The project also is to develop a reference implementation (RI).

2.1. The RI should approach solving real problems with start-up and warm-up. Although being the first user of the API, the implementation does not need to be conservative. Given that no existing computing environment is similar to checkpoint-restore, the CRaC implementation is to develop a set of assumptions and constraints for programming, and to communicate these assumptions and constraints to users.

2.2. The implementation should demonstrate the best practices of using the new API and demonstrate approaches for common problems.
From duke at openjdk.org Thu Mar 23 07:43:08 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 23 Mar 2023 07:43:08 GMT Subject: [crac] RFR: Add CRaC-specific tests to GHA [v6] In-Reply-To: References: Message-ID: <0RxdBRuX81LtkiThnX4itopWi0PjHuyi91FHvkT5kdE=.5c959db3-ceb6-49ef-ba1c-d168fe5152af@github.com> > Existing GitHub Actions run test tier1 but since most changes in this project focus on the CRaC capabilities we should run them in an automated fashion, too. > Right now the tests are mostly failing: this should be addressed in https://github.com/openjdk/crac/pull/47 Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 22 commits: - Merge remote-tracking branch 'origin/crac' into test-crac-in-gha - Finally fix ResolveTest - Add seemingly non-vital logs - Revert all debug info and hope for the best - what's wrong with cppath - Fix path - fix path - Fix criu SHA256 - Print dump log - Use newer criu - ... and 12 more: https://git.openjdk.org/crac/compare/2c83b9c3...e9b8f717 ------------- Changes: https://git.openjdk.org/crac/pull/48/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=48&range=05 Stats: 39 lines in 2 files changed: 39 ins; 0 del; 0 mod Patch: https://git.openjdk.org/crac/pull/48.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/48/head:pull/48 PR: https://git.openjdk.org/crac/pull/48 From duke at openjdk.org Thu Mar 23 15:06:24 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 23 Mar 2023 15:06:24 GMT Subject: [crac] RFR: Improved open file descriptors tracking [v5] In-Reply-To: References: Message-ID: > Tracks `java.io.FileDescriptor` instances as CRaC resource; before checkpoint these are reported and if not allow-listed (e.g. as opened as standard descriptors) an exception is thrown. Further investigation can use system property `jdk.crac.collect-fd-stacktraces=true` to record origin of those file descriptors. 
> File descriptors claimed in Java code are passed to native; native code checks all open file descriptors and reports error if there's an unexpected FD that is not included in the list passed previously. Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 19 commits: - Delete LazyProps test (debug logging now configured differently) - Do not allow registering resources with lower or equal priority during beforeCheckpoint - Address review comments - Merge remote-tracking branch 'origin/crac' into newfd - Use descriptor access rather than extending API - 8272472: StackGuardPages test doesn't build with glibc 2.34 Backport-of: f77a1a156f3da9068d012d9227c7ee0fee58f571 - Empty commit to trigger GHA - Drop native FDs tracking - Avoid claiming invalid FileDescriptor - Whitelist RandomAccessFile opening classpath files This is a workaround for some frameworks opening classpath files in a non-standard way. - ... and 9 more: https://git.openjdk.org/crac/compare/2c83b9c3...d6757344 ------------- Changes: https://git.openjdk.org/crac/pull/43/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=43&range=04 Stats: 763 lines in 29 files changed: 389 ins; 293 del; 81 mod Patch: https://git.openjdk.org/crac/pull/43.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/43/head:pull/43 PR: https://git.openjdk.org/crac/pull/43 From duke at openjdk.org Thu Mar 23 15:06:26 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 23 Mar 2023 15:06:26 GMT Subject: [crac] RFR: Improved open file descriptors tracking [v4] In-Reply-To: References: Message-ID: <29XKFWErP6ox23xj908dd39fZiXUou5KBhZzDyhuhfE=.b284bd20-0ef6-4ff3-bc92-8993ff59e756@github.com> On Thu, 16 Feb 2023 11:18:34 GMT, Radim Vansa wrote: >> Tracks `java.io.FileDescriptor` instances as CRaC resource; before checkpoint these are reported and if not allow-listed (e.g. as opened as standard descriptors) an exception is thrown. 
Further investigation can use system property `jdk.crac.collect-fd-stacktraces=true` to record origin of those file descriptors. >> File descriptors claimed in Java code are passed to native; native code checks all open file descriptors and reports error if there's an unexpected FD that is not included in the list passed previously. > > Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 16 commits: > > - Merge remote-tracking branch 'origin/crac' into newfd > - Use descriptor access rather than extending API > - 8272472: StackGuardPages test doesn't build with glibc 2.34 > > Backport-of: f77a1a156f3da9068d012d9227c7ee0fee58f571 > - Empty commit to trigger GHA > - Drop native FDs tracking > - Avoid claiming invalid FileDescriptor > - Whitelist RandomAccessFile opening classpath files > > This is a workaround for some frameworks opening classpath files in > a non-standard way. > - Add tracking of FD origin > - Track FileDescriptors closed by NIO > - Track native FDs from EPoll > - ... and 6 more: https://git.openjdk.org/crac/compare/5925959f...9b5e7edd @AntonKozlov I had to add the changes for locking of resources registered during beforeCheckpoint already into this commit; when running the test I've ran into deadlocks. ------------- PR Comment: https://git.openjdk.org/crac/pull/43#issuecomment-1481354315 From duke at openjdk.org Thu Mar 23 15:21:53 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 23 Mar 2023 15:21:53 GMT Subject: [crac] RFR: Add CRaC-specific tests to GHA [v7] In-Reply-To: References: Message-ID: > Existing GitHub Actions run test tier1 but since most changes in this project focus on the CRaC capabilities we should run them in an automated fashion, too. 
> Right now the tests are mostly failing: this should be addressed in https://github.com/openjdk/crac/pull/47 Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Let ReseedTest use stdout rather than exit codes There was a 1:256 chance of test randomly failing because the random generators produced the same number, despite differently seeded. ------------- Changes: - all: https://git.openjdk.org/crac/pull/48/files - new: https://git.openjdk.org/crac/pull/48/files/e9b8f717..704f50a4 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=48&range=06 - incr: https://webrevs.openjdk.org/?repo=crac&pr=48&range=05-06 Stats: 5 lines in 1 file changed: 1 ins; 0 del; 4 mod Patch: https://git.openjdk.org/crac/pull/48.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/48/head:pull/48 PR: https://git.openjdk.org/crac/pull/48 From duke at openjdk.org Thu Mar 23 15:47:02 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 23 Mar 2023 15:47:02 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore Message-ID: There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`. When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value. 
------------- Commit messages: - Merge remote-tracking branch 'origin/crac' into nanotime - Adjust System.nanoTime() to keep consistent time origin after restore - Merge remote-tracking branch 'origin/crac' into test-crac-java - Remove test name from the @run JTreg tag - Use default main and args from CracTest - Merge remote-tracking branch 'origin/crac' into test-crac-java - Add docker to CracBuilder - Rename enum for `simengine` to SIMULATE - Convert CRaC tests from shell scripts to Java - Remove somebody's forgotten overrides - ... and 1 more: https://git.openjdk.org/crac/compare/2c83b9c3...b7f54604 Changes: https://git.openjdk.org/crac/pull/53/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=53&range=00 Stats: 152 lines in 7 files changed: 150 ins; 0 del; 2 mod Patch: https://git.openjdk.org/crac/pull/53.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/53/head:pull/53 PR: https://git.openjdk.org/crac/pull/53 From duke at openjdk.org Thu Mar 23 15:48:18 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 23 Mar 2023 15:48:18 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore Message-ID: There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`. When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value. 
------------- Commit messages: - Merge remote-tracking branch 'origin/crac' into nanotime - Adjust System.nanoTime() to keep consistent time origin after restore - Merge remote-tracking branch 'origin/crac' into test-crac-java - Remove test name from the @run JTreg tag - Use default main and args from CracTest - Merge remote-tracking branch 'origin/crac' into test-crac-java - Add docker to CracBuilder - Rename enum for `simengine` to SIMULATE - Convert CRaC tests from shell scripts to Java - Remove somebody's forgotten overrides - ... and 1 more: https://git.openjdk.org/crac/compare/2c83b9c3...b7f54604 Changes: https://git.openjdk.org/crac/pull/53/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=53&range=00 Stats: 152 lines in 7 files changed: 150 ins; 0 del; 2 mod Patch: https://git.openjdk.org/crac/pull/53.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/53/head:pull/53 PR: https://git.openjdk.org/crac/pull/53 From duke at openjdk.org Thu Mar 23 16:19:40 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 23 Mar 2023 16:19:40 GMT Subject: [crac] RFR: Resize ForkJoinPool and some concurrent data structures Message-ID: Some parts of JDK expect that #cpus is constant; update those places after restore. 
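To illustrate the kind of adjustment this PR is about, here is a loose analogy rather than the PR's code: the PR changes ForkJoinPool internals, while the sketch below resizes a plain `ThreadPoolExecutor`, whose `setCorePoolSize` and `setMaximumPoolSize` methods are public. `PoolResizer` is a hypothetical name; in a CRaC-aware application `resize` would presumably be called from an after-restore callback with `Runtime.getRuntime().availableProcessors()`.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: keeps a fixed-size pool in line with the current CPU count. */
final class PoolResizer {
    /** Creates a pool sized for the CPU count observed at start-up. */
    static ThreadPoolExecutor newPool(int cpus) {
        return new ThreadPoolExecutor(cpus, cpus, 60L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>());
    }

    /** Resizes the pool after the CPU count changed, e.g. after a restore. */
    static void resize(ThreadPoolExecutor pool, int cpus) {
        if (cpus < 1) {
            throw new IllegalArgumentException("cpus must be positive");
        }
        // The executor rejects core > max, so the order of the two setters
        // depends on whether the pool grows or shrinks.
        if (cpus > pool.getMaximumPoolSize()) {
            pool.setMaximumPoolSize(cpus);
            pool.setCorePoolSize(cpus);
        } else {
            pool.setCorePoolSize(cpus);
            pool.setMaximumPoolSize(cpus);
        }
    }
}
```

The same concern (a size derived once from the CPU count and then treated as constant) is what makes the ForkJoinPool case harder, since there the value is baked into internal data structures rather than exposed through setters.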
------------- Commit messages: - Allow changes to NCPU - Resize ForkJoinPools when number of CPUs changes Changes: https://git.openjdk.org/crac/pull/54/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=54&range=00 Stats: 322 lines in 7 files changed: 292 ins; 17 del; 13 mod Patch: https://git.openjdk.org/crac/pull/54.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/54/head:pull/54 PR: https://git.openjdk.org/crac/pull/54 From duke at openjdk.org Thu Mar 23 17:17:45 2023 From: duke at openjdk.org (Ashutosh Mehra) Date: Thu, 23 Mar 2023 17:17:45 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore In-Reply-To: References: Message-ID: On Thu, 23 Mar 2023 15:38:35 GMT, Radim Vansa wrote: > There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`. When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value. src/hotspot/share/runtime/os.cpp line 2051: > 2049: diff_millis = 0; > 2050: } > 2051: javaTimeNanos_offset = checkpoint_nanos - javaTimeNanos() + diff_millis * 1000000L; Can you please explain why isn't `javaTimeNanos_offset = checkpoint_nanos - javaTimeNanos()` sufficient? What am I missing here? ------------- PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1146521174 From heidinga at openjdk.org Thu Mar 23 17:34:46 2023 From: heidinga at openjdk.org (Dan Heidinga) Date: Thu, 23 Mar 2023 17:34:46 GMT Subject: [crac] RFR: Resize ForkJoinPool and some concurrent data structures In-Reply-To: References: Message-ID: On Thu, 23 Mar 2023 16:10:44 GMT, Radim Vansa wrote: > Some parts of JDK expect that #cpus is constant; update those places after restore. 
@rvansa This was discussed on the list back in May 2022 [0] and it was noted that Doug Lea updated FJP to support changing the number of threads using a new ::setParallelism(int) [1] method. Is this patch a backport of that work? Backporting the FJP changes to the CRaC branch and basing our updates on the ::setParallelism method will provide the best continuity to users as they move to using CRaC on newer releases. [0] https://mail.openjdk.org/pipermail/crac-dev/2022-May/000229.html [1] https://github.com/openjdk/crac/blob/f235955eefb1141a2e72116dfcf345e40416f059/src/java.base/share/classes/java/util/concurrent/ForkJoinPool.java#L2938 ------------- PR Comment: https://git.openjdk.org/crac/pull/54#issuecomment-1481603976 From duke at openjdk.org Thu Mar 23 17:41:38 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 23 Mar 2023 17:41:38 GMT Subject: [crac] RFR: Resize ForkJoinPool and some concurrent data structures In-Reply-To: References: Message-ID: On Thu, 23 Mar 2023 17:32:18 GMT, Dan Heidinga wrote: >> Some parts of JDK expect that #cpus is constant; update those places after restore. > > @rvansa This was discussed on the list back in May 2022 [0] and it was noted that Doug Lea updated FJP to support changing the number of threads using a new ::setParallelism(int) [1] method. Is this patch a backport of that work? > > Backporting the FJP changes to the CRaC branch and basing our updates on the ::setParallelism method will provide the best continuity to users as they move to using CRaC on newer releases. > > [0] https://mail.openjdk.org/pipermail/crac-dev/2022-May/000229.html > [1] https://github.com/openjdk/crac/blob/f235955eefb1141a2e72116dfcf345e40416f059/src/java.base/share/classes/java/util/concurrent/ForkJoinPool.java#L2938 @DanHeidinga No, this is my original work. I went through the changes in more recent JDK and those changes were far from trivial, therefore a backport would not be straightforward. 
If CRaC is rebased on a more recent JDK we should scratch these changes and use the `setParallelism` instead; this PR intends to provide a minimalistic solution in JDK 17. ------------- PR Comment: https://git.openjdk.org/crac/pull/54#issuecomment-1481613726 From duke at openjdk.org Thu Mar 23 17:47:12 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 23 Mar 2023 17:47:12 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore In-Reply-To: References: Message-ID: On Thu, 23 Mar 2023 17:15:12 GMT, Ashutosh Mehra wrote: >> There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`. When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value. > > src/hotspot/share/runtime/os.cpp line 2051: > >> 2049: diff_millis = 0; >> 2050: } >> 2051: javaTimeNanos_offset = checkpoint_nanos - javaTimeNanos() + diff_millis * 1000000L; > > Can you please explain why isn't `javaTimeNanos_offset = checkpoint_nanos - javaTimeNanos()` sufficient? What am I missing here? @ashu-mehra Had we ignored the `diff_millis` part, the diff of nanotime before and after checkpoint would indicate that no time elapsed between checkpoint and restore. With this in place the difference does not have the expected accuracy but at least it's monotonic; wall clock time difference gives us the best possible estimate. 
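The arithmetic behind the offset discussed above can be worked through in a small Java sketch. The actual change lives in C++ in `os.cpp`; the class and method names below are hypothetical, and the clamping of a negative wall-clock difference to zero is an assumption based on the `diff_millis = 0` line visible in the quoted snippet.

```java
/** Hypothetical model of the nanoTime offset computed at restore. */
final class NanoTimeOffset {
    static final long NANOS_PER_MILLI = 1_000_000L;

    /**
     * @param checkpointNanos  monotonic reading taken just before the checkpoint
     * @param checkpointMillis wall-clock time at the checkpoint
     * @param restoreNanos     raw monotonic reading on the machine doing the restore
     * @param restoreMillis    wall-clock time at the restore
     * @return offset to add to raw monotonic readings after the restore
     */
    static long computeOffset(long checkpointNanos, long checkpointMillis,
                              long restoreNanos, long restoreMillis) {
        long diffMillis = restoreMillis - checkpointMillis;
        if (diffMillis < 0) {
            // Assumed: a wall clock that went backwards is treated as "no time elapsed".
            diffMillis = 0;
        }
        return checkpointNanos - restoreNanos + diffMillis * NANOS_PER_MILLI;
    }

    /** Value reported to callers after the restore. */
    static long adjusted(long rawNanos, long offset) {
        return rawNanos + offset;
    }
}
```

With a checkpoint reading of 5 s, a freshly booted restore host reading 100 ns, and 2 s of wall-clock time between checkpoint and restore, the first adjusted reading is 7 s. Without the wall-clock term it would be 5 s, i.e. the program would observe no elapsed time across the pause.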
------------- PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1146570603 From duke at openjdk.org Thu Mar 23 17:51:47 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 23 Mar 2023 17:51:47 GMT Subject: [crac] RFR: Add CRaC-specific tests to GHA In-Reply-To: References: Message-ID: On Wed, 1 Mar 2023 15:25:42 GMT, Anton Kozlov wrote: >> Existing GitHub Actions run test tier1 but since most changes in this project focus on the CRaC capabilities we should run them in an automated fashion, too. >> Right now the tests are mostly failing: this should be addressed in https://github.com/openjdk/crac/pull/47 > > Please use merge. This will preserve all changes in this PR, and on the eventual integration Skara will squash all changes anyway. @AntonKozlov I've merged in recent changes and resolved the flakiness in ReseedTest. GHA now fails due to infrastructure timeout, the `crac` part of testsuite is green. ------------- PR Comment: https://git.openjdk.org/crac/pull/48#issuecomment-1481628442 From akozlov at openjdk.org Thu Mar 23 17:56:56 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 23 Mar 2023 17:56:56 GMT Subject: [crac] RFR: Add CRaC-specific tests to GHA [v7] In-Reply-To: References: Message-ID: On Thu, 23 Mar 2023 15:21:53 GMT, Radim Vansa wrote: >> Existing GitHub Actions run test tier1 but since most changes in this project focus on the CRaC capabilities we should run them in an automated fashion, too. >> Right now the tests are mostly failing: this should be addressed in https://github.com/openjdk/crac/pull/47 > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Let ReseedTest use stdout rather than exit codes > > There was a 1:256 chance of test randomly failing because the random > generators produced the same number, despite differently seeded. LGTM, thank you! ------------- Marked as reviewed by akozlov (Lead). 
PR Review: https://git.openjdk.org/crac/pull/48#pullrequestreview-1355218847 From rmarchenko at openjdk.org Fri Mar 24 08:12:08 2023 From: rmarchenko at openjdk.org (Roman Marchenko) Date: Fri, 24 Mar 2023 08:12:08 GMT Subject: [crac] Integrated: Implement CRaC exit codes In-Reply-To: <08Y5ItCJ6hUexw9wA_q3B-759KqEXzvvE-CawT-wbs8=.44a46cb5-ec66-47a0-9862-f2670a957804@github.com> References: <08Y5ItCJ6hUexw9wA_q3B-759KqEXzvvE-CawT-wbs8=.44a46cb5-ec66-47a0-9862-f2670a957804@github.com> Message-ID: On Mon, 20 Mar 2023 15:08:18 GMT, Roman Marchenko wrote: > Implement CRaC exit codes in case it catches a signal. > > This is a continuation of #46 inspired by [this comment](https://github.com/openjdk/crac/pull/46#discussion_r1122014722) > > So, in case the child is terminated by a signal, CRaC should be terminated in similar way too, if possible. Otherwise it returns 128+n exit code as bash does. This pull request has now been integrated. Changeset: 3ff95be7 Author: Roman Marchenko Committer: Anton Kozlov URL: https://git.openjdk.org/crac/commit/3ff95be7ac8a4db6a07f37ff96243bb1d2cbab48 Stats: 22 lines in 2 files changed: 22 ins; 0 del; 0 mod Implement CRaC exit codes ------------- PR: https://git.openjdk.org/crac/pull/52 From akozlov at openjdk.org Fri Mar 24 08:24:05 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 24 Mar 2023 08:24:05 GMT Subject: [crac] RFR: Add CRaC-specific tests to GHA [v7] In-Reply-To: References: Message-ID: On Thu, 23 Mar 2023 15:21:53 GMT, Radim Vansa wrote: >> Existing GitHub Actions run test tier1 but since most changes in this project focus on the CRaC capabilities we should run them in an automated fashion, too. 
>> Right now the tests are mostly failing: this should be addressed in https://github.com/openjdk/crac/pull/47 > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Let ReseedTest use stdout rather than exit codes > > There was a 1:256 chance of test randomly failing because the random > generators produced the same number, despite differently seeded. test/jdk/jdk/crac/SecureRandom/ReseedTest.java line 75: > 73: > 74: System.out.println(sr.nextInt()); > 75: System.exit(0); This change will also help when something is wrong with the restore, the previous version might confuse that situation with legitimate return with non-zero error. On the second thought, could you extract this into a separate PR? ------------- PR Review Comment: https://git.openjdk.org/crac/pull/48#discussion_r1147247593 From duke at openjdk.org Fri Mar 24 09:35:04 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 24 Mar 2023 09:35:04 GMT Subject: [crac] RFR: Improved open file descriptors tracking [v6] In-Reply-To: References: Message-ID: <0qJav6bg_RE1VF1m5-Qd-VgwTXwBil4XDUAT_eOkDRg=.eb2a0837-105a-4606-93e1-a86ce0333faf@github.com> > Tracks `java.io.FileDescriptor` instances as CRaC resource; before checkpoint these are reported and if not allow-listed (e.g. as opened as standard descriptors) an exception is thrown. Further investigation can use system property `jdk.crac.collect-fd-stacktraces=true` to record origin of those file descriptors. > File descriptors claimed in Java code are passed to native; native code checks all open file descriptors and reports error if there's an unexpected FD that is not included in the list passed previously. 
Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Move properties to inner class, initialized later ------------- Changes: - all: https://git.openjdk.org/crac/pull/43/files - new: https://git.openjdk.org/crac/pull/43/files/d6757344..3a13fe35 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=43&range=05 - incr: https://webrevs.openjdk.org/?repo=crac&pr=43&range=04-05 Stats: 74 lines in 5 files changed: 69 ins; 0 del; 5 mod Patch: https://git.openjdk.org/crac/pull/43.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/43/head:pull/43 PR: https://git.openjdk.org/crac/pull/43 From duke at openjdk.org Fri Mar 24 10:23:09 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 24 Mar 2023 10:23:09 GMT Subject: [crac] RFR: Add CRaC-specific tests to GHA [v7] In-Reply-To: References: Message-ID: On Fri, 24 Mar 2023 08:21:27 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Let ReseedTest use stdout rather than exit codes >> >> There was a 1:256 chance of test randomly failing because the random >> generators produced the same number, despite differently seeded. > > test/jdk/jdk/crac/SecureRandom/ReseedTest.java line 75: > >> 73: >> 74: System.out.println(sr.nextInt()); >> 75: System.exit(0); > > This change will also help when something is wrong with the restore, the previous version might confuse that situation with legitimate return with non-zero error. > > On the second thought, could you extract this into a separate PR? 
If you promise me that you'll integrate right away, I can :) ------------- PR Review Comment: https://git.openjdk.org/crac/pull/48#discussion_r1147384676 From duke at openjdk.org Fri Mar 24 10:31:53 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 24 Mar 2023 10:31:53 GMT Subject: [crac] RFR: Add CRaC-specific tests to GHA [v8] In-Reply-To: References: Message-ID: > Existing GitHub Actions run test tier1 but since most changes in this project focus on the CRaC capabilities we should run them in an automated fashion, too. > Right now the tests are mostly failing: this should be addressed in https://github.com/openjdk/crac/pull/47 Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Revert "Let ReseedTest use stdout rather than exit codes" This reverts commit 704f50a4ea9591ae1dd04aae7559492441ea4c2e. ------------- Changes: - all: https://git.openjdk.org/crac/pull/48/files - new: https://git.openjdk.org/crac/pull/48/files/704f50a4..9359b6b0 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=48&range=07 - incr: https://webrevs.openjdk.org/?repo=crac&pr=48&range=06-07 Stats: 5 lines in 1 file changed: 0 ins; 1 del; 4 mod Patch: https://git.openjdk.org/crac/pull/48.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/48/head:pull/48 PR: https://git.openjdk.org/crac/pull/48 From duke at openjdk.org Fri Mar 24 10:31:56 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 24 Mar 2023 10:31:56 GMT Subject: [crac] RFR: Add CRaC-specific tests to GHA [v7] In-Reply-To: References: Message-ID: On Fri, 24 Mar 2023 10:20:41 GMT, Radim Vansa wrote: >> test/jdk/jdk/crac/SecureRandom/ReseedTest.java line 75: >> >>> 73: >>> 74: System.out.println(sr.nextInt()); >>> 75: System.exit(0); >> >> This change will also help when something is wrong with the restore, the previous version might confuse that situation with legitimate return with non-zero error. 
>> >> On the second thought, could you extract this into a separate PR? > > If you promise me that you'll integrate right away, I can :) Here it goes: https://github.com/openjdk/crac/pull/55 I've reverted the commit in this PR ------------- PR Review Comment: https://git.openjdk.org/crac/pull/48#discussion_r1147390843 From duke at openjdk.org Fri Mar 24 10:32:10 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 24 Mar 2023 10:32:10 GMT Subject: [crac] RFR: Let ReseedTest use stdout rather than exit codes Message-ID: There was a 1:256 chance of test randomly failing because the random generators produced the same number, despite differently seeded. ------------- Commit messages: - Let ReseedTest use stdout rather than exit codes Changes: https://git.openjdk.org/crac/pull/55/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=55&range=00 Stats: 5 lines in 1 file changed: 1 ins; 1 del; 3 mod Patch: https://git.openjdk.org/crac/pull/55.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/55/head:pull/55 PR: https://git.openjdk.org/crac/pull/55 From akozlov at openjdk.org Fri Mar 24 12:12:31 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 24 Mar 2023 12:12:31 GMT Subject: [crac] RFR: Let ReseedTest use stdout rather than exit codes In-Reply-To: References: Message-ID: On Fri, 24 Mar 2023 10:25:41 GMT, Radim Vansa wrote: > There was a 1:256 chance of test randomly failing because the random generators produced the same number, despite differently seeded. LGTM, thanks! ------------- Marked as reviewed by akozlov (Lead). 
PR Review: https://git.openjdk.org/crac/pull/55#pullrequestreview-1356574174 From heidinga at openjdk.org Fri Mar 24 12:46:03 2023 From: heidinga at openjdk.org (Dan Heidinga) Date: Fri, 24 Mar 2023 12:46:03 GMT Subject: [crac] RFR: Resize ForkJoinPool and some concurrent data structures In-Reply-To: References: Message-ID: On Thu, 23 Mar 2023 16:10:44 GMT, Radim Vansa wrote: > Some parts of JDK expect that #cpus is constant; update those places after restore. I agree that the changes Doug made are "far from trivial" which is due to the complexity of the problem space. Changing the # of cpus at runtime is those algorithms is a tricky problem which makes me hesitant to review a minimal solution. Have you considered backporting the entire set of j.u.c.* classes that Doug updated in the mainline to jdk17? That would probably provide a more consistent solution and any bugs would be shared with the mainline ------------- PR Comment: https://git.openjdk.org/crac/pull/54#issuecomment-1482739602 From duke at openjdk.org Fri Mar 24 13:58:06 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 24 Mar 2023 13:58:06 GMT Subject: [crac] RFR: Resize ForkJoinPool and some concurrent data structures In-Reply-To: References: Message-ID: On Thu, 23 Mar 2023 16:10:44 GMT, Radim Vansa wrote: > Some parts of JDK expect that #cpus is constant; update those places after restore. If those changes are backported it should happen in 17-dev and only then propagate to CRaC, otherwise it stops being clear what contributes to CRaC implementation. Do you think the backport would be accepted in main 17-dev when it extends the API with another method? 
------------- PR Comment: https://git.openjdk.org/crac/pull/54#issuecomment-1482842107 From duke at openjdk.org Fri Mar 24 14:29:05 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 24 Mar 2023 14:29:05 GMT Subject: [crac] RFR: Improved open file descriptors tracking [v7] In-Reply-To: References: Message-ID: <8g6_fAk2sVgQ18WRgijvQ6DUmil-F6GHDTnY23xozUs=.4451696e-6778-4edb-b775-4e351fabb788@github.com> > Tracks `java.io.FileDescriptor` instances as CRaC resource; before checkpoint these are reported and if not allow-listed (e.g. as opened as standard descriptors) an exception is thrown. Further investigation can use system property `jdk.crac.collect-fd-stacktraces=true` to record origin of those file descriptors. > File descriptors claimed in Java code are passed to native; native code checks all open file descriptors and reports error if there's an unexpected FD that is not included in the list passed previously. Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Fix logging formatter ------------- Changes: - all: https://git.openjdk.org/crac/pull/43/files - new: https://git.openjdk.org/crac/pull/43/files/3a13fe35..4f506168 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=43&range=06 - incr: https://webrevs.openjdk.org/?repo=crac&pr=43&range=05-06 Stats: 7 lines in 2 files changed: 5 ins; 0 del; 2 mod Patch: https://git.openjdk.org/crac/pull/43.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/43/head:pull/43 PR: https://git.openjdk.org/crac/pull/43 From heidinga at openjdk.org Fri Mar 24 14:36:06 2023 From: heidinga at openjdk.org (Dan Heidinga) Date: Fri, 24 Mar 2023 14:36:06 GMT Subject: [crac] RFR: Resize ForkJoinPool and some concurrent data structures In-Reply-To: References: Message-ID: On Fri, 24 Mar 2023 13:55:22 GMT, Radim Vansa wrote: > If those changes are backported it should happen in 17-dev and only then propagate to CRaC, otherwise it stops being clear what contributes 
to CRaC implementation. Do you think the backport would be accepted in main 17-dev when it extends the API with another method? I'm not a 17-dev committer or maintainer so please take my response with a grain of salt. I'd be surprised if there was support for backporting these changes to 17-dev given they don't address an issue commonly experienced by jdk17 users and this would be effectively adding new functionality to a long-since shipped release which may change performance / behaviour of a critical component of the release. Having said that, I still think it makes sense to back port them to the CRaC project as we'll need to eventually merge CRaC into the mainline development branches. That gives us more leeway to take on extra patches that ease that eventual merge. ------------- PR Comment: https://git.openjdk.org/crac/pull/54#issuecomment-1482903223 From duke at openjdk.org Fri Mar 24 15:39:26 2023 From: duke at openjdk.org (Ashutosh Mehra) Date: Fri, 24 Mar 2023 15:39:26 GMT Subject: [crac] RFR: Resize ForkJoinPool and some concurrent data structures In-Reply-To: References: Message-ID: On Thu, 23 Mar 2023 16:10:44 GMT, Radim Vansa wrote: > Some parts of JDK expect that #cpus is constant; update those places after restore. Reading up the discussion on [FJP::setParallelism](https://mail.openjdk.org/pipermail/crac-dev/2022-May/000229.html), it was suggested [here](https://mail.openjdk.org/pipermail/crac-dev/2022-May/000230.html) to have crac branch receiving updates from openjdk/jdk/master, and crac-17u from openjdk/jdk17u/master. If we go with that plan, then this PR can be targeted to crac-17u while the crac branch can make use of Doug's patch. Although it would fragment the implementation across the two branches, but as long as they are providing the same functionality, I guess we can live with that. Does that make sense? 
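For readers unfamiliar with the upstream API referenced above: `ForkJoinPool.setParallelism(int)` exists in mainline since JDK 19 and returns the previous parallelism level. The following is a minimal sketch of how a post-restore step might use it, assuming a JDK that has the method; the helper name `resizeToCpus` is hypothetical and no CRaC API is involved.

```java
import java.util.concurrent.ForkJoinPool;

public class ResizePoolSketch {
    // Hypothetical helper: sets the pool's target parallelism to the given
    // CPU count (at least 1) and returns the previous parallelism level.
    static int resizeToCpus(ForkJoinPool pool, int cpus) {
        return pool.setParallelism(Math.max(1, cpus));
    }

    public static void main(String[] args) {
        ForkJoinPool pool = new ForkJoinPool(4);
        // After a restore one would typically pass
        // Runtime.getRuntime().availableProcessors(); a fixed value is
        // used here to keep the example deterministic.
        int previous = resizeToCpus(pool, 2);
        System.out.println(previous + " -> " + pool.getParallelism());
        pool.shutdown();
    }
}
```

On jdk17u the method is absent, which is why the thread weighs a backport against a CRaC-specific change.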
------------- PR Comment: https://git.openjdk.org/crac/pull/54#issuecomment-1483007502 From duke at openjdk.org Fri Mar 24 16:51:39 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 24 Mar 2023 16:51:39 GMT Subject: [crac] RFR: Resize ForkJoinPool and some concurrent data structures In-Reply-To: References: Message-ID: On Fri, 24 Mar 2023 15:36:30 GMT, Ashutosh Mehra wrote: >> Some parts of JDK expect that #cpus is constant; update those places after restore. > > Reading up the discussion on [FJP::setParallelism](https://mail.openjdk.org/pipermail/crac-dev/2022-May/000229.html), it was suggested [here](https://mail.openjdk.org/pipermail/crac-dev/2022-May/000230.html) to have crac branch receiving updates from openjdk/jdk/master, and crac-17u from openjdk/jdk17u/master. > > If we go with that plan, then this PR can be targeted to crac-17u while the crac branch can make use of Doug's patch. > Although it would fragment the implementation across the two branches, but as long as they are providing the same functionality, I guess we can live with that. Does that make sense? @ashu-mehra that was my intention; I expect these changes to be overwritten when main crac rebases to later JDK, and keep them only in crac-17u. @DanHeidinga I would expect this kind of reaction for 17 as well, but since I am not so positive in maintaining them in CRaC I went rather for this independent PR. There won't be a perfect solution; let's see what other stakeholders would have to say. 
------------- PR Comment: https://git.openjdk.org/crac/pull/54#issuecomment-1483109437 From duke at openjdk.org Fri Mar 24 18:48:18 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 24 Mar 2023 18:48:18 GMT Subject: [crac] Integrated: Let ReseedTest use stdout rather than exit codes In-Reply-To: References: Message-ID: On Fri, 24 Mar 2023 10:25:41 GMT, Radim Vansa wrote: > There was a 1:256 chance of test randomly failing because the random generators produced the same number, despite differently seeded. This pull request has now been integrated. Changeset: 47b87233 Author: Radim Vansa Committer: Anton Kozlov URL: https://git.openjdk.org/crac/commit/47b872335f7513b6d71495c4aba8144496e2a973 Stats: 5 lines in 1 file changed: 1 ins; 1 del; 3 mod Let ReseedTest use stdout rather than exit codes Reviewed-by: akozlov ------------- PR: https://git.openjdk.org/crac/pull/55 From duke at openjdk.org Fri Mar 24 18:50:11 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 24 Mar 2023 18:50:11 GMT Subject: [crac] Integrated: Add CRaC-specific tests to GHA In-Reply-To: References: Message-ID: <86bfCK70HBO6r21qtagXGyDQBlwdNi3rJdBnBzAWSE8=.77f1a1dd-4f39-48b0-8cb0-b8401f2a04c6@github.com> On Wed, 22 Feb 2023 15:08:29 GMT, Radim Vansa wrote: > Existing GitHub Actions run test tier1 but since most changes in this project focus on the CRaC capabilities we should run them in an automated fashion, too. > Right now the tests are mostly failing: this should be addressed in https://github.com/openjdk/crac/pull/47 This pull request has now been integrated. 
Changeset: f9fed1e4 Author: Radim Vansa Committer: Anton Kozlov URL: https://git.openjdk.org/crac/commit/f9fed1e4bcfd255dae43a72bfd6a1a2d412f2808 Stats: 39 lines in 2 files changed: 39 ins; 0 del; 0 mod Add CRaC-specific tests to GHA Reviewed-by: akozlov ------------- PR: https://git.openjdk.org/crac/pull/48 From akozlov at openjdk.org Fri Mar 24 19:07:58 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 24 Mar 2023 19:07:58 GMT Subject: [crac] RFR: Harden criuengine cppath reading Message-ID: On some older OSes I see a few `fgets error` coming from criuengine restore function. They are intermittent and hard to debug. After replacing libc fopen/fgets invocations with open/read, the problem went away. I'm still not completely sure why the problem with libc file functions exists but I suspect that EINTR is not correctly handled there, or the fact we store a sequence in characters without `\0` or `\n` in the file we read. So I propose using the lower-level interface to read the file. ------------- Commit messages: - Merge remote-tracking branch 'jdk/crac/crac' into crac - Harden criuengine cppath reading Changes: https://git.openjdk.org/crac/pull/56/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=56&range=00 Stats: 19 lines in 1 file changed: 11 ins; 2 del; 6 mod Patch: https://git.openjdk.org/crac/pull/56.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/56/head:pull/56 PR: https://git.openjdk.org/crac/pull/56 From akozlov at openjdk.org Mon Mar 27 12:41:04 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Mon, 27 Mar 2023 12:41:04 GMT Subject: [crac] RFR: Resize ForkJoinPool and some concurrent data structures In-Reply-To: References: Message-ID: <00MK7rssGlw-v9hyyguLD7nGXivN3nFnhe0lr1Sl21g=.ee70ea0c-a073-453f-ad98-5076d2edae6e@github.com> On Thu, 23 Mar 2023 16:10:44 GMT, Radim Vansa wrote: > Some parts of JDK expect that #cpus is constant; update those places after restore. 
From a purely technical point of view, backporting those changes from upstream makes the CRaC implementation better understood, tested, etc. Imagine we have a space like crac-17u (jdk17u with CRaC changes), and a policy allowing backporting something extra compared to jdk17u; would we need a special CRaC version to resize FJP, or would backports be enough? ------------- PR Comment: https://git.openjdk.org/crac/pull/54#issuecomment-1485054289 PR Comment: https://git.openjdk.org/crac/pull/54#issuecomment-1485054744 From heidinga at openjdk.org Mon Mar 27 13:02:01 2023 From: heidinga at openjdk.org (Dan Heidinga) Date: Mon, 27 Mar 2023 13:02:01 GMT Subject: [crac] RFR: Resize ForkJoinPool and some concurrent data structures In-Reply-To: <00MK7rssGlw-v9hyyguLD7nGXivN3nFnhe0lr1Sl21g=.ee70ea0c-a073-453f-ad98-5076d2edae6e@github.com> References: <00MK7rssGlw-v9hyyguLD7nGXivN3nFnhe0lr1Sl21g=.ee70ea0c-a073-453f-ad98-5076d2edae6e@github.com> Message-ID: On Mon, 27 Mar 2023 12:37:41 GMT, Anton Kozlov wrote: > From a purely technical point of view, backporting those changes from upstream makes the CRaC implementation better understood, tested, etc. > +1 > Imagine we have a space like crac-17u (jdk17u with CRaC changes), and a policy allowing backporting something extra compared to jdk17u; would we need a special CRaC version to resize FJP, or would backports be enough? I think the backports would cover the tricky changes to FJP though we'd still need a JDKResource::{beforeCheckpoint, afterCheckpoint} implementation.
An ::afterCheckpoint method to call ::setParallelism to match the number of cpu's on the restored system. The ::beforeCheckpoint implementation could be used to resize the pool to an expected number of cpus (default 4?) as an optional optimization. ------------- PR Comment: https://git.openjdk.org/crac/pull/54#issuecomment-1485067770 From akozlov at openjdk.org Mon Mar 27 17:28:22 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Mon, 27 Mar 2023 17:28:22 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore In-Reply-To: References: Message-ID: On Thu, 23 Mar 2023 15:38:35 GMT, Radim Vansa wrote: > There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`.
When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value. src/hotspot/share/runtime/os.cpp line 2051: > 2049: diff_millis = 0; > 2050: } > 2051: javaTimeNanos_offset = checkpoint_nanos - javaTimeNanos() + diff_millis * 1000000L; Will `uptime_since_restore` still report the correct time after the change? https://github.com/openjdk/crac/blob/9ed961106a255145274de777e151577863b013ea/src/hotspot/os/linux/os_linux.cpp#L5697 ------------- PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1149577706 From duke at openjdk.org Mon Mar 27 19:21:06 2023 From: duke at openjdk.org (Ashutosh Mehra) Date: Mon, 27 Mar 2023 19:21:06 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore In-Reply-To: References: Message-ID: On Thu, 23 Mar 2023 15:38:35 GMT, Radim Vansa wrote: > There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`. When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value. I understand this change is trying to adjust the return value of any calls made to System.nanoTime() after restore to take into account the elapsed time between checkpoint and restore. In principle this idea is very similar to CLOCK_BOOTTIME [0] which takes into account the time system has spent in suspend state. I came across an issue [1] in golang which was suggested to replace CLOCK_MONOTONIC with CLOCK_BOOTTIME but it was considered ill-advised and was closed. 
There was even a linux kernel patch [2] to make CLOCK_MONOTONIC behave as CLOCK_BOOTTIME which was reverted [3] immediately because it broke many of the existing userland softwares. Considering this precedent, I think we should also consider the impact of this change on existing frameworks and libraries, that is, can this change break the existing code patterns that use System.nanoTime()? [0] https://man7.org/linux/man-pages/man3/clock_gettime.3.html [1] https://github.com/golang/go/issues/24595 [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d6ed449afdb38f89a7b38ec50e367559e1b8f71f [3] https://www.spinics.net/lists/linux-tip-commits/msg43709.html ------------- PR Comment: https://git.openjdk.org/crac/pull/53#issuecomment-1485734201 From duke at openjdk.org Tue Mar 28 06:56:06 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 28 Mar 2023 06:56:06 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore In-Reply-To: References: Message-ID: On Mon, 27 Mar 2023 17:25:22 GMT, Anton Kozlov wrote: >> There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`. When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value. > > src/hotspot/share/runtime/os.cpp line 2051: > >> 2049: diff_millis = 0; >> 2050: } >> 2051: javaTimeNanos_offset = checkpoint_nanos - javaTimeNanos() + diff_millis * 1000000L; > > Will `uptime_since_restore` still report the correct time after the change? > > https://github.com/openjdk/crac/blob/9ed961106a255145274de777e151577863b013ea/src/hotspot/os/linux/os_linux.cpp#L5697 It should, since both times are read after restore. 
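To make the quoted `javaTimeNanos_offset` line concrete, here is a minimal Java model of the arithmetic; it is not the HotSpot code. The clamp condition (a negative wall-clock difference is treated as zero) is an assumption inferred from the `diff_millis = 0` fragment in the quoted hunk, and all names and sample values are illustrative.

```java
public class NanoTimeOffsetSketch {
    // Models: offset = checkpoint_nanos - javaTimeNanos() + diff_millis * 1000000L
    // rawNanosAtRestore stands for the unadjusted monotonic reading taken
    // on the machine where the process is restored.
    static long computeOffset(long checkpointNanos, long checkpointWallMillis,
                              long rawNanosAtRestore, long restoreWallMillis) {
        long diffMillis = restoreWallMillis - checkpointWallMillis;
        if (diffMillis < 0) {
            diffMillis = 0; // assumed: wall clock went backwards, count no elapsed time
        }
        return checkpointNanos - rawNanosAtRestore + diffMillis * 1_000_000L;
    }

    // The value nanoTime() would report after restore.
    static long adjusted(long rawNanos, long offset) {
        return rawNanos + offset;
    }

    public static void main(String[] args) {
        long checkpointNanos = 5_000_000_000_000L; // old machine, up for about 83 minutes
        long checkpointWall  = 1_000_000L;         // wall clock at checkpoint, in ms
        long rawAtRestore    = 20_000_000_000L;    // new machine, up for 20 seconds
        long restoreWall     = 1_060_000L;         // restored 60 seconds later

        long offset = computeOffset(checkpointNanos, checkpointWall, rawAtRestore, restoreWall);
        long firstReading = adjusted(rawAtRestore, offset);

        // The adjusted clock continues from the checkpoint value plus the
        // 60 s of wall-clock time that passed, instead of jumping backwards.
        System.out.println(firstReading - checkpointNanos); // 60000000000
    }
}
```

Differences between two adjusted readings are unaffected by the offset; only a difference that mixes an adjusted and an unadjusted reading would include it.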
------------- PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1150107225 From duke at openjdk.org Tue Mar 28 07:17:04 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 28 Mar 2023 07:17:04 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore In-Reply-To: References: Message-ID: On Mon, 27 Mar 2023 19:18:19 GMT, Ashutosh Mehra wrote: >> There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`. When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value. > > I understand this change is trying to adjust the return value of any calls made to System.nanoTime() after restore to take into account the elapsed time between checkpoint and restore. > In principle this idea is very similar to CLOCK_BOOTTIME [0] which takes into account the time system has spent in suspend state. > I came across an issue [1] in golang which was suggested to replace CLOCK_MONOTONIC with CLOCK_BOOTTIME but it was considered ill-advised and was closed. There was even a linux kernel patch [2] to make CLOCK_MONOTONIC behave as CLOCK_BOOTTIME which was reverted [3] immediately because it broke many of the existing userland softwares. > Considering this precedent, I think we should also consider the impact of this change on existing frameworks and libraries, that is, can this change break the existing code patterns that use System.nanoTime()? 
> > [0] https://man7.org/linux/man-pages/man3/clock_gettime.3.html > [1] https://github.com/golang/go/issues/24595 > [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d6ed449afdb38f89a7b38ec50e367559e1b8f71f > [3] https://www.spinics.net/lists/linux-tip-commits/msg43709.html @ashu-mehra The main point of this change is *not* about whether the time being suspended should be observed or not; I am rather worried about moving the process to another system and getting totally nonsense results from nanotime diffs, and broken code. I understand that observing the suspended can be a subject to further discussion, though I incline towards the visibility of such interval, as implemented here. Since this fixes some use cases and does not change what wouldn't be broken (on a single system the paused time with system running would be observed anyway unless the whole machine was suspended) I suggest to merge this as-is without considering the topic resolved forever. The fact that some timers use this as the time source rather than wall clock time is an implementation detail. Applications performing checkpoint and restore will require some tweaks to perform correctly and I intend to work on ways to deal with timers. ------------- PR Comment: https://git.openjdk.org/crac/pull/53#issuecomment-1486336186 From akozlov at openjdk.org Tue Mar 28 09:35:09 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 28 Mar 2023 09:35:09 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore In-Reply-To: References: Message-ID: On Tue, 28 Mar 2023 06:53:21 GMT, Radim Vansa wrote: >> src/hotspot/share/runtime/os.cpp line 2051: >> >>> 2049: diff_millis = 0; >>> 2050: } >>> 2051: javaTimeNanos_offset = checkpoint_nanos - javaTimeNanos() + diff_millis * 1000000L; >> >> Will `uptime_since_restore` still report the correct time after the change? 
>> >> https://github.com/openjdk/crac/blob/9ed961106a255145274de777e151577863b013ea/src/hotspot/os/linux/os_linux.cpp#L5697 > > It should, since both times are read after restore. restore_start_counter is indeed read on restore, in the process that execs into CREngine, so the value is not adjusted. The value is then transferred to the restored VM via SHM [1] and used to compute the difference from javaTimeNanos(), which is going to be adjusted. I'm worried that the difference between an unadjusted and an adjusted value may be meaningless. Could you extend the NanoTimeTest to demonstrate that `uptime_since_restore` cannot become negative or "too" big? [1] https://github.com/openjdk/crac/commit/9ed961106a255145274de777e151577863b013ea#diff-aeec57d804d56002f26a85359fc4ac8b48cfc249d57c656a30a63fc6bf3457adR6385 ------------- PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1150304555 From duke at openjdk.org Tue Mar 28 14:24:47 2023 From: duke at openjdk.org (Ashutosh Mehra) Date: Tue, 28 Mar 2023 14:24:47 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore In-Reply-To: References: Message-ID: On Tue, 28 Mar 2023 07:14:38 GMT, Radim Vansa wrote: > I am rather worried about moving the process to another system and getting totally nonsense results from nanotime diffs, and broken code. @rvansa I am assuming you are seeing time going backwards with System.nanoTime() calls after restore. Is that correct? It's interesting if you are seeing such absurd results after restore, because IIUC criu is using the time namespace to update the clock offsets in `/proc/<pid>/timens_offsets`, so I wouldn't expect System.nanoTime() to give absurd results on restore. I did some tests with a C program that calls `clock_gettime(CLOCK_MONOTONIC)` before and after restore. In between I restarted my system. I didn't see time going backwards; all I could observe is that the elapsed time between checkpoint and restore was not taken into account.
Can you please provide more details about the nonsense results you observed? Does the system where the absurd behavior is observed support time namespaces? Which version of criu did you use? ------------- PR Comment: https://git.openjdk.org/crac/pull/53#issuecomment-1486989848 From duke at openjdk.org Tue Mar 28 14:33:52 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 28 Mar 2023 14:33:52 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore In-Reply-To: References: Message-ID: On Tue, 28 Mar 2023 14:21:20 GMT, Ashutosh Mehra wrote: >> @ashu-mehra The main point of this change is *not* about whether the time being suspended should be observed or not; I am rather worried about moving the process to another system and getting totally nonsense results from nanotime diffs, and broken code. >> >> I understand that observing the suspended can be a subject to further discussion, though I incline towards the visibility of such interval, as implemented here. Since this fixes some use cases and does not change what wouldn't be broken (on a single system the paused time with system running would be observed anyway unless the whole machine was suspended) I suggest to merge this as-is without considering the topic resolved forever. >> >> The fact that some timers use this as the time source rather than wall clock time is an implementation detail. Applications performing checkpoint and restore will require some tweaks to perform correctly and I intend to work on ways to deal with timers. > >> I am rather worried about moving the process to another system and getting totally nonsense results from nanotime diffs, and broken code. > > @rvansa I am assuming you are seeing time going backwards with System.nanoTime() calls after restore. Is that correct?
> It's interesting if you are seeing such absurd results after restore, because IIUC criu is using the time namespace to update the clock offsets in `/proc/<pid>/timens_offsets`, so I wouldn't expect System.nanoTime() to give absurd results on restore. > I did some tests with a C program that calls `clock_gettime(CLOCK_MONOTONIC)` before and after restore. In between I restarted my system. I didn't see time going backwards; all I could observe is that the elapsed time between checkpoint and restore was not taken into account. > > Can you please provide more details about the nonsense results you observed? Does the system where the absurd behavior is observed support time namespaces? Which version of criu did you use? @ashu-mehra To be honest I didn't know about this being handled in CRIU - I thought that the offsets can be set only once in the newly created namespace. Does CRIU create another ns for the process? I have not observed the problem in practice; rather, I anticipated it because a while ago I ran into a similar problem with GraalVM inlining nanotime in a static final variable (it caused 100% CPU usage on some machines and normal behaviour on others). I shall rerun the enclosed test without the fix applied but I think it was failing. ------------- PR Comment: https://git.openjdk.org/crac/pull/53#issuecomment-1487007327 From duke at openjdk.org Tue Mar 28 14:52:46 2023 From: duke at openjdk.org (Ashutosh Mehra) Date: Tue, 28 Mar 2023 14:52:46 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore In-Reply-To: References: Message-ID: On Tue, 28 Mar 2023 14:31:20 GMT, Radim Vansa wrote: > Does CRIU create another ns for the process? If I am reading it correctly, criu does seem to create a new time ns [0]. The complete patch can be seen here [1].
[0] https://github.com/checkpoint-restore/criu/blob/7977416ca821b4895fd5dbc498eb95f7e77f2ede/criu/timens.c#L79 [1] https://github.com/checkpoint-restore/criu/commit/4127ef4ab769dc4417c22d0ce0a4ddaaca4193b4 ------------- PR Comment: https://git.openjdk.org/crac/pull/53#issuecomment-1487039286 From duke at openjdk.org Wed Mar 29 14:58:26 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 29 Mar 2023 14:58:26 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore In-Reply-To: References: Message-ID: <4Mb_Vz0qmjZDqc_ckDzbJUSo8XqMg5okFXyigwu2FKI=.70a921c4-53be-4e05-93f8-c91e671dd3d5@github.com> On Thu, 23 Mar 2023 15:38:35 GMT, Radim Vansa wrote: > There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`. When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value. Thanks for those pointers. I've checked that the NanoTimeTest fails without the 'fix' in place; The dump creates `timens-0.img` (35 bytes) but it looks like this is not used later on; when I run restore with `-v4` I see these log messages: (00.003124) Reading image tree (00.003138) Add mnt ns 6 pid 8 (00.003139) Add net ns 2 pid 8 (00.003140) Add pid ns 1 pid 8 (00.003141) pstree pid_max=8 (00.003143) Will restore in 0 namespaces (00.003144) NS mask to use 0 and I can't find log entry that would report this: pr_debug("timens: monotonic %ld %ld\n", ts.tv_sec, ts.tv_nsec); Any clues why this is not applied? 
------------- PR Comment: https://git.openjdk.org/crac/pull/53#issuecomment-1488786017 From duke at openjdk.org Wed Mar 29 15:08:20 2023 From: duke at openjdk.org (Ashutosh Mehra) Date: Wed, 29 Mar 2023 15:08:20 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore In-Reply-To: References: Message-ID: On Thu, 23 Mar 2023 15:38:35 GMT, Radim Vansa wrote: > There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`. When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value. what is the criu version you are on? ------------- PR Comment: https://git.openjdk.org/crac/pull/53#issuecomment-1488805317 From duke at openjdk.org Thu Mar 30 06:21:05 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 30 Mar 2023 06:21:05 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore In-Reply-To: References: Message-ID: On Thu, 23 Mar 2023 15:38:35 GMT, Radim Vansa wrote: > There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`. When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value. 
I am using this one: https://github.com/CRaC/criu/releases/tag/release-1.3 Version: 3.17.1-crac GitID: 8926431 ------------- PR Comment: https://git.openjdk.org/crac/pull/53#issuecomment-1489756964 From duke at openjdk.org Thu Mar 30 07:55:37 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 30 Mar 2023 07:55:37 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v2] In-Reply-To: References: Message-ID: On Tue, 28 Mar 2023 09:32:15 GMT, Anton Kozlov wrote: >> It should, since both times are read after restore. > > restore_start_counter is indeed read on restore, in the process that execs into CREngine, so the value is not adjusted. The value then transfered to the restored VM via SHM [1] and then used for the difference between the javaTimeNanos() which is going to be adjusted. I'm worried the difference between not adjusted and adjusted value may lose sense. > > Could you extend the NanoTimeTest to demonstrate the `uptime_since_restore` cannot become negative or "too" big? > > [1] https://github.com/openjdk/crac/commit/9ed961106a255145274de777e151577863b013ea#diff-aeec57d804d56002f26a85359fc4ac8b48cfc249d57c656a30a63fc6bf3457adR6385 You were right, I've applied the correction. I also took the liberty of renaming `_restore_start_counter` (and friends) to `_restore_start_nanos` as the 'counter' meaning might be a bit misleading. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1152868423 From duke at openjdk.org Thu Mar 30 07:55:35 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 30 Mar 2023 07:55:35 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v2] In-Reply-To: References: Message-ID: <-i0uB8ZW7r54hoKQJ_wODUXNKVkOI5rH7SJTEhSHiDw=.75ebe53a-9081-40c6-911f-048b17e8850e@github.com> > There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`. 
When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value. Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Correct time since restore ------------- Changes: - all: https://git.openjdk.org/crac/pull/53/files - new: https://git.openjdk.org/crac/pull/53/files/b7f54604..35a9b128 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=53&range=01 - incr: https://webrevs.openjdk.org/?repo=crac&pr=53&range=00-01 Stats: 38 lines in 3 files changed: 9 ins; 9 del; 20 mod Patch: https://git.openjdk.org/crac/pull/53.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/53/head:pull/53 PR: https://git.openjdk.org/crac/pull/53 From akozlov at openjdk.org Thu Mar 30 11:09:56 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 30 Mar 2023 11:09:56 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v2] In-Reply-To: <-i0uB8ZW7r54hoKQJ_wODUXNKVkOI5rH7SJTEhSHiDw=.75ebe53a-9081-40c6-911f-048b17e8850e@github.com> References: <-i0uB8ZW7r54hoKQJ_wODUXNKVkOI5rH7SJTEhSHiDw=.75ebe53a-9081-40c6-911f-048b17e8850e@github.com> Message-ID: On Thu, 30 Mar 2023 07:55:35 GMT, Radim Vansa wrote: >> There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`. When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value. 
> > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Correct time since restore Crac-criu does not use timens on restore [1], because a bug in the kernel or in CRIU once caused timedwait to return immediately every time it was called after restore. I don't remember the bug exactly (it is already fixed), but I believe it was discussed on this mailing list. In general, we should not depend on very obscure Linux abilities, as this reduces the chances that we'd be able to run on something other than Linux. Having the code in the JVM provides better control over the implementation of the Java spec. [1] https://github.com/CRaC/criu/commit/1cb2f4a518a4ae471a1df7a9b540203c1efaf1ba ------------- PR Comment: https://git.openjdk.org/crac/pull/53#issuecomment-1490112327 From akozlov at openjdk.org Thu Mar 30 11:28:56 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 30 Mar 2023 11:28:56 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v2] In-Reply-To: <-i0uB8ZW7r54hoKQJ_wODUXNKVkOI5rH7SJTEhSHiDw=.75ebe53a-9081-40c6-911f-048b17e8850e@github.com> References: <-i0uB8ZW7r54hoKQJ_wODUXNKVkOI5rH7SJTEhSHiDw=.75ebe53a-9081-40c6-911f-048b17e8850e@github.com> Message-ID: On Thu, 30 Mar 2023 07:55:35 GMT, Radim Vansa wrote: >> There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`. When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value.
> > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Correct time since restore src/hotspot/share/runtime/os.cpp line 2041: > 2039: checkpoint_millis = javaTimeMillis(); > 2040: checkpoint_nanos = javaTimeNanos(); > 2041: } Sorry, I don't follow why, for repeated checkpoint-restore, it's enough or desirable to store checkpoint_nanos once. Suppose we've done the first restore on the same machine that was used for the checkpoint (the millis diff roughly equals the nanos diff, so the implemented time adjustment does not contribute anything for the first restore), then checkpoint again. And then we perform the 2nd restore on another machine after a very short time, suppose immediately. Then the `checkpoint_nanos - javaTimeNanos()` adjustment can become any value, depending on the difference between the clocks on the two machines, which are completely unrelated to each other. src/hotspot/share/runtime/os.cpp line 2051: > 2049: diff_millis = 0; > 2050: } > 2051: javaTimeNanos_offset = checkpoint_nanos - javaTimeNanos() + diff_millis * 1000000L; So all the difference in MONOTONIC clocks is eliminated and replaced with a REALTIME estimation, even if the restore was performed on the same machine and the monotonic clock difference made sense. That may invalidate some measurements done with System.nanoTime() around checkpoint-restore: long before = System.nanoTime(); // ... checkpoint & restore .. long after = System.nanoTime(); System.out.println(after - before); One way to fix that is to fix just the monotonicity of System.nanoTime(), preventing it from going backward: long diff = javaTimeNanos() - checkpoint_nanos; if (diff < 0) { javaTimeNanos_offset = -diff; } But that will fail anyway if the timens has been changed (which we can probably detect) or when the image is transferred to another machine (which should be possible, but is probably trickier).
Do we have any way to detect that we've been transferred to another machine and report a warning in this case, with possible millis approximation enabled? ------------- PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1150348959 PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1150356508 From duke at openjdk.org Thu Mar 30 13:02:19 2023 From: duke at openjdk.org (Ashutosh Mehra) Date: Thu, 30 Mar 2023 13:02:19 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v2] In-Reply-To: References: <-i0uB8ZW7r54hoKQJ_wODUXNKVkOI5rH7SJTEhSHiDw=.75ebe53a-9081-40c6-911f-048b17e8850e@github.com> Message-ID: On Thu, 30 Mar 2023 11:07:31 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Correct time since restore > > Crac-criu does not use timens on restore [1], because a bug in the kernel or in CRIU once caused timedwait to return immediately every time it was called after restore. I don't remember the bug exactly (it is already fixed), but I believe it was discussed on this mailing list. > > In general, we should not depend on very obscure Linux abilities, as this reduces the chances that we'd be able to run on something other than Linux. Having the code in the JVM provides better control over the implementation of the Java spec. > > [1] https://github.com/CRaC/criu/commit/1cb2f4a518a4ae471a1df7a9b540203c1efaf1ba @AntonKozlov > Crac-criu does not use timens on restore [1], because a bug in the kernel or in CRIU once caused timedwait to return immediately every time it was called after restore. I don't remember the bug exactly (it is already fixed), but I believe it was discussed on this mailing list. The https://github.com/CRaC/criu/commit/1cb2f4a518a4ae471a1df7a9b540203c1efaf1ba commit is dated July 14, 2020, but the earliest crac-dev mailing list archive is from Sept 2021. Is there some other mailing list this was discussed on?
I am interested in understanding the problem that prompted the decision not to use timens in CRIU. Since you mention it was a bug in the kernel or CRIU and it has been almost 3 years since your commit, maybe it is worth enabling the CRIU changes again to see if the timedwait problem still exists, unless you have already done that. > In general, we should not depend on very obscure Linux abilities, as this reduces the chances that we'd be able to run on something other than Linux. I don't think timens can be put in the category of obscure Linux abilities. It has even made its way into the container runtime spec: https://github.com/opencontainers/runtime-spec/pull/1151. ------------- PR Comment: https://git.openjdk.org/crac/pull/53#issuecomment-1490262494 From duke at openjdk.org Thu Mar 30 14:21:18 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 30 Mar 2023 14:21:18 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v2] In-Reply-To: References: <-i0uB8ZW7r54hoKQJ_wODUXNKVkOI5rH7SJTEhSHiDw=.75ebe53a-9081-40c6-911f-048b17e8850e@github.com> Message-ID: On Tue, 28 Mar 2023 10:06:08 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Correct time since restore > > src/hotspot/share/runtime/os.cpp line 2041: > >> 2039: checkpoint_millis = javaTimeMillis(); >> 2040: checkpoint_nanos = javaTimeNanos(); >> 2041: } > > Sorry, I don't follow why, for repeated checkpoint-restore, it's enough or desirable to store checkpoint_nanos once. Suppose we've done the first restore on the same machine that was used for the checkpoint (the millis diff roughly equals the nanos diff, so the implemented time adjustment does not contribute anything for the first restore), then checkpoint again. And then we perform the 2nd restore on another machine after a very short time, suppose immediately.
Then the `checkpoint_nanos - javaTimeNanos()` adjustment can become any value, depending on the difference between the clocks on the two machines, which are completely unrelated to each other. It establishes the relation between real time and monotonic time, and it's sufficient to do that just once: // 1st checkpoint checkpoint_millis = 1 checkpoint_nanos = 1_000_000 // almost immediate restore javaTimeMillis() -> 2 monotonic nanos -> 2_000_000 diff_millis = 1 javaTimeNanos_offset = 0 // second checkpoint does not record anything // 2nd restore javaTimeMillis() -> 3 monotonic nanos -> 100_000_000 diff_millis = 2 javaTimeNanos_offset = 1_000_000 - 100_000_000 + 2 * 1_000_000 = -97_000_000 javaTimeNanos() -> system monotonic nanos + offset = 3_000_000 In the last step we've read a value that makes sense to compare to any nanoTime in any previous phase. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1153334965 From duke at openjdk.org Thu Mar 30 14:28:48 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 30 Mar 2023 14:28:48 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v2] In-Reply-To: References: <-i0uB8ZW7r54hoKQJ_wODUXNKVkOI5rH7SJTEhSHiDw=.75ebe53a-9081-40c6-911f-048b17e8850e@github.com> Message-ID: On Thu, 30 Mar 2023 14:17:01 GMT, Radim Vansa wrote: >> src/hotspot/share/runtime/os.cpp line 2041: >> >>> 2039: checkpoint_millis = javaTimeMillis(); >>> 2040: checkpoint_nanos = javaTimeNanos(); >>> 2041: } >> >> Sorry, I don't follow why, for repeated checkpoint-restore, it's enough or desirable to store checkpoint_nanos once. Suppose we've done the first restore on the same machine that was used for the checkpoint (the millis diff roughly equals the nanos diff, so the implemented time adjustment does not contribute anything for the first restore), then checkpoint again. And then we perform the 2nd restore on another machine after a very short time, suppose immediately.
Then the `checkpoint_nanos - javaTimeNanos()` adjustment can become any value, depending on the difference between the clocks on the two machines, which are completely unrelated to each other. > > It establishes the relation between real time and monotonic time, and it's sufficient to do that just once: > > // 1st checkpoint > checkpoint_millis = 1 > checkpoint_nanos = 1_000_000 > // almost immediate restore > javaTimeMillis() -> 2 > monotonic nanos -> 2_000_000 > diff_millis = 1 > javaTimeNanos_offset = 0 > // second checkpoint does not record anything > // 2nd restore > javaTimeMillis() -> 3 > monotonic nanos -> 100_000_000 > diff_millis = 2 > javaTimeNanos_offset = 1_000_000 - 100_000_000 + 2 * 1_000_000 = -97_000_000 > javaTimeNanos() -> system monotonic nanos + offset = 3_000_000 > > In the last step we've read a value that makes sense to compare to any nanoTime in any previous phase. Oh wait a sec, you're partially right: since we always use javaTimeNanos(), if the offset calculated after the first restore weren't zero, we wouldn't get this right. I should zero the offset before calculating it again. Too bad I can't create a test for that yet.
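The worked example above can be replayed as a small simulation. This is only a sketch under stated assumptions, not the HotSpot code: the class `NanoTimeOffsetSketch`, its simulated clock fields `wallMillis` and `rawMonotonicNanos`, and the `reset()` helper are hypothetical names introduced for illustration. The sketch records the checkpoint values once, clamps a negative millis difference to zero, and zeroes the previous offset before recomputing it, as suggested at the end of the message.

```java
// Sketch only: simulates the offset bookkeeping discussed in the thread.
// Field and method names are hypothetical, not taken from os.cpp.
final class NanoTimeOffsetSketch {
    static long checkpointMillis;
    static long checkpointNanos;
    static boolean recorded;
    static long offset;

    // Simulated clocks, set by the caller instead of read from the OS.
    static long wallMillis;
    static long rawMonotonicNanos;

    static void reset() {
        checkpointMillis = 0;
        checkpointNanos = 0;
        recorded = false;
        offset = 0;
    }

    // Adjusted monotonic reading: raw clock plus the current offset.
    static long javaTimeNanos() {
        return rawMonotonicNanos + offset;
    }

    // Only the first checkpoint records the wall-clock/monotonic pair.
    static void beforeCheckpoint() {
        if (!recorded) {
            checkpointMillis = wallMillis;
            checkpointNanos = javaTimeNanos();
            recorded = true;
        }
    }

    static void afterRestore() {
        long diffMillis = wallMillis - checkpointMillis;
        if (diffMillis < 0) {
            diffMillis = 0; // wall clock went backward: assume no time passed
        }
        offset = 0; // zero the old offset so javaTimeNanos() reads the raw clock
        offset = checkpointNanos - javaTimeNanos() + diffMillis * 1_000_000L;
    }

    public static void main(String[] args) {
        reset();
        wallMillis = 1; rawMonotonicNanos = 1_000_000L;
        beforeCheckpoint();                       // 1st checkpoint
        wallMillis = 2; rawMonotonicNanos = 2_000_000L;
        afterRestore();                           // almost immediate restore
        System.out.println(offset);               // 0
        beforeCheckpoint();                       // 2nd checkpoint records nothing
        wallMillis = 3; rawMonotonicNanos = 100_000_000L;
        afterRestore();                           // 2nd restore, unrelated clock
        System.out.println(offset);               // -97000000
        System.out.println(javaTimeNanos());      // 3000000
    }
}
```

With the numbers from the message, the second restore yields an offset of -97_000_000 and an adjusted reading of 3_000_000, which matches the hand calculation.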
------------- PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1153348585 From duke at openjdk.org Thu Mar 30 14:36:59 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 30 Mar 2023 14:36:59 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v2] In-Reply-To: References: <-i0uB8ZW7r54hoKQJ_wODUXNKVkOI5rH7SJTEhSHiDw=.75ebe53a-9081-40c6-911f-048b17e8850e@github.com> Message-ID: On Tue, 28 Mar 2023 10:12:34 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Correct time since restore > > src/hotspot/share/runtime/os.cpp line 2051: > >> 2049: diff_millis = 0; >> 2050: } >> 2051: javaTimeNanos_offset = checkpoint_nanos - javaTimeNanos() + diff_millis * 1000000L; > > So all the difference in MONOTONIC clocks is eliminated and replaced with a REALTIME estimation, even if the restore was performed on the same machine and the monotonic clock difference made sense. That may invalidate some measurements done with System.nanoTime() around checkpoint-restore: > > > long before = System.nanoTime(); > // ... checkpoint & restore .. > long after = System.nanoTime(); > System.out.println(after - before); > > > One way to fix that is to fix just the monotonicity of System.nanoTime(), preventing it from going backward: > > > long diff = javaTimeNanos() - checkpoint_nanos; > if (diff < 0) { > javaTimeNanos_offset = -diff; > } > > > But that will fail anyway if the timens has been changed (which we can probably detect) or when the image is transferred to another machine (which should be possible, but is probably trickier). Do we have any way to detect that we've been transferred to another machine and report a warning in this case, with possible millis approximation enabled?
If the monotonic time on the machine advanced by X, the offset can't be lower than -X (as the millis part is never negative) and therefore any difference between times read before and after will be at least 0. Therefore the snapshot might seem to take no time, but System.nanoTime() is still monotonic. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1153360551
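To make the two positions in this exchange concrete, here is a minimal sketch that compares the wall-clock based offset from the reviewed line 2051 with the "only prevent going backward" alternative proposed in the review. The class and method names (`RestoreOffsetStrategies`, `wallClockOffset`, `backwardOnlyOffset`, `elapsed`) are hypothetical and exist only for this illustration; the inputs are plain numbers rather than real clock reads.

```java
// Sketch only: compares two ways of computing the nanoTime offset on restore.
// All names are hypothetical; neither method is the actual HotSpot code.
final class RestoreOffsetStrategies {
    static final long NANOS_PER_MILLI = 1_000_000L;

    // Formula discussed for os.cpp line 2051, with the millis part clamped at 0.
    static long wallClockOffset(long checkpointNanos, long rawNanosAfterRestore, long diffMillis) {
        long clamped = Math.max(0L, diffMillis);
        return checkpointNanos - rawNanosAfterRestore + clamped * NANOS_PER_MILLI;
    }

    // Alternative from the review: adjust only if the raw clock went backward.
    static long backwardOnlyOffset(long checkpointNanos, long rawNanosAfterRestore) {
        long diff = rawNanosAfterRestore - checkpointNanos;
        return diff < 0 ? -diff : 0L;
    }

    // Interval an application would measure across checkpoint and restore.
    static long elapsed(long checkpointNanos, long rawNanosAfterRestore, long offset) {
        return rawNanosAfterRestore + offset - checkpointNanos;
    }

    public static void main(String[] args) {
        long checkpoint = 10_000_000_000L;

        // Same machine, monotonic clock advanced by 5 s, wall clock stepped back.
        long sameMachine = 15_000_000_000L;
        long a = wallClockOffset(checkpoint, sameMachine, -3);
        long b = backwardOnlyOffset(checkpoint, sameMachine);
        System.out.println(elapsed(checkpoint, sameMachine, a)); // 0
        System.out.println(elapsed(checkpoint, sameMachine, b)); // 5000000000

        // Different machine with a smaller uptime, restored immediately.
        long otherMachine = 2_000_000_000L;
        long c = wallClockOffset(checkpoint, otherMachine, 0);
        long d = backwardOnlyOffset(checkpoint, otherMachine);
        System.out.println(elapsed(checkpoint, otherMachine, c)); // 0
        System.out.println(elapsed(checkpoint, otherMachine, d)); // 0
    }
}
```

In both strategies the measured interval is never negative. They differ in the same-machine case: the wall-clock formula discards the monotonic advance X and can report zero elapsed time, while the backward-only variant keeps X but has no estimate when the raw clock comes from an unrelated machine.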