From chf at redhat.com Fri Oct 1 16:06:34 2021
From: chf at redhat.com (Christine Flood)
Date: Fri, 1 Oct 2021 12:06:34 -0400
Subject: CFV: New CRaC Committer: Gil Tene
Message-ID:

Vote: yes

From heidinga at redhat.com Tue Oct 5 14:26:24 2021
From: heidinga at redhat.com (Dan Heidinga)
Date: Tue, 5 Oct 2021 10:26:24 -0400
Subject: CFV: New CRaC Committer: Gil Tene
In-Reply-To: <80ed0ed9-80e6-cd6a-53e2-cface1412cb0@azul.com>
References: <80ed0ed9-80e6-cd6a-53e2-cface1412cb0@azul.com>
Message-ID:

Vote: yes

--Dan

On Wed, Sep 29, 2021 at 9:03 AM Anton Kozlov wrote:

> I hereby nominate Gil Tene to CRaC Committer.
>
> Gil is the CTO of Azul who is actively participating in all design
> discussions and decisions for the CRaC project. Gil is the one who has
> brought an idea of coordinating Checkpoint/Restore with Java Runtime. He
> contributed a lot of insights during discussions of approaches, details of
> the implementation, and the analysis of the prototype. Gil is going to
> contribute to the further development of the OpenJDK CRaC project helping
> to solve complex problems.
>
> Votes are due by Wednesday, 13 October 2021, 14:00:00 GMT.
>
> Only current CRaC Committers [1] are eligible to vote
> on this nomination. Votes must be cast in the open by replying
> to this mailing list.
>
> For Lazy Consensus voting instructions, see [2].
>
> Thanks,
> Anton
>
> [1] https://openjdk.java.net/census
> [2] https://openjdk.java.net/projects/#committer-vote

From akozlov at azul.com Mon Oct 11 16:17:26 2021
From: akozlov at azul.com (Anton Kozlov)
Date: Mon, 11 Oct 2021 19:17:26 +0300
Subject: CFV: New CRaC Committer: Gil Tene
In-Reply-To: <80ed0ed9-80e6-cd6a-53e2-cface1412cb0@azul.com>
References: <80ed0ed9-80e6-cd6a-53e2-cface1412cb0@azul.com>
Message-ID:

Vote: yes

On 9/29/21 16:02, Anton Kozlov wrote:
> I hereby nominate Gil Tene to CRaC Committer.
>
> Gil is the CTO of Azul who is actively participating in all design discussions
> and decisions for the CRaC project.
> Gil is the one who has brought an idea of
> coordinating Checkpoint/Restore with Java Runtime. He contributed a lot of
> insights during discussions of approaches, details of the implementation, and
> the analysis of the prototype. Gil is going to contribute to the further
> development of the OpenJDK CRaC project helping to solve complex problems.
>
> Votes are due by Wednesday, 13 October 2021, 14:00:00 GMT.
>
> Only current CRaC Committers [1] are eligible to vote
> on this nomination. Votes must be cast in the open by replying
> to this mailing list.
>
> For Lazy Consensus voting instructions, see [2].
>
> Thanks,
> Anton
>
> [1] https://openjdk.java.net/census
> [2] https://openjdk.java.net/projects/#committer-vote

From akozlov at azul.com Tue Oct 12 00:56:41 2021
From: akozlov at azul.com (akozlov at azul.com)
Date: Tue, 12 Oct 2021 00:56:41 -0000
Subject: [crac] RFR: Merge jdk-17+35
Message-ID:

Merge tag jdk-17+35. All CRaC-specific tests pass, although there are only few of them.

-------------

Commit messages:
 - Remove duplicating CRaC specific workflow
 - Merge tag 'jdk-17+35' into crac
 - 8270872: Final nroff manpage update for JDK 17
 - 8271588: JFR Recorder Thread crashed with SIGSEGV in write_klass
 - 8271863: ProblemList serviceability/sa/TestJmapCore.java on linux-x64 with ZGC
 - 8271894: ProblemList javax/swing/JComponent/7154030/bug7154030.java in JDK17
 - 8271877: ProblemList jdk/jfr/event/gc/detailed/TestEvacuationFailedEvent.java in JDK17
 - 8271064: ZGC several jvm08 perf regressions after JDK-8268372
 - 8067223: [TESTBUG] Rename Whitebox API package
 - 8271150: Remove EA from JDK 17 version string starting with Initial RC promotion on Aug 5, 2021(B34)
 - ...
   and 7570 more: https://git.openjdk.java.net/crac/compare/c53fb564...9e54b029

The webrevs contain the adjustments done while merging with regards to each parent branch:
 - crac: https://webrevs.openjdk.java.net/?repo=crac&pr=2&range=00.0
 - jdk-17+35: https://webrevs.openjdk.java.net/?repo=crac&pr=2&range=00.1

Changes: https://git.openjdk.java.net/crac/pull/2/files
  Stats: 4805557 lines in 33019 files changed: 2496279 ins; 2134443 del; 174835 mod
  Patch: https://git.openjdk.java.net/crac/pull/2.diff
  Fetch: git fetch https://git.openjdk.java.net/crac pull/2/head:pull/2

PR: https://git.openjdk.java.net/crac/pull/2

From akozlov at azul.com Tue Oct 12 01:15:05 2021
From: akozlov at azul.com (akozlov at azul.com)
Date: Tue, 12 Oct 2021 01:15:05 -0000
Subject: [crac] RFR: Merge jdk-17+35
In-Reply-To:
References:
Message-ID:

On Fri, 24 Sep 2021 16:21:48 GMT, Anton Kozlov wrote:

> Merge tag jdk-17+35. All CRaC-specific tests pass, although there are only few of them.

I'm going to tag the merge as `crac-17+1`. That is, first build of JDK17 with CRaC.

-------------

PR: https://git.openjdk.java.net/crac/pull/2

From akozlov at azul.com Thu Oct 14 10:44:50 2021
From: akozlov at azul.com (Anton Kozlov)
Date: Thu, 14 Oct 2021 13:44:50 +0300
Subject: Result: New CRaC Committer: Gil Tene
Message-ID: <9918fbdf-f42e-0dcc-8532-4a14d5dd3a8e@azul.com>

Voting for Gil Tene [1] is now closed.

Yes: 3
Veto: 0
Abstain: 0

According to the Bylaws definition of Lazy Consensus, this is
sufficient to approve the nomination.

Thanks,
Anton

[1] https://mail.openjdk.java.net/pipermail/crac-dev/2021-September/000016.html

From heidinga at redhat.com Fri Oct 15 19:52:42 2021
From: heidinga at redhat.com (Dan Heidinga)
Date: Fri, 15 Oct 2021 15:52:42 -0400
Subject: Portability of checkpoints?
Message-ID:

How portable should CRaC checkpoints be?

When we were looking at checkpoint / restore in OpenJ9, one of the issues
we ran into early on was related to the portability of the checkpoints.
The use case was checkpointing an application server during a CI build and
then restoring it multiple times - basically speeding up the deployments by
shifting the work to the CI system.

The implication of this approach is that a checkpoint created on one
machine may not be valid on another due to changes in the target
architecture in addition to changes in the environment. It would be good
if we could surface a list of the things that will need to be changed in
the jvm and in the class libraries to address this.

I see a number of places in the CRaC code that have implemented
jdk.crac.Resource to add hooks to address environment changes. I don't see
a corresponding set of changes for the JVM itself though.

As an example, in OpenJ9 we added a commandline option to tell the jit to
generate more conservative code - to jit code as though running on an older
architecture so that the code was applicable across a greater set of target
machines. Does Hotspot have similar options already or do we need to pursue
adding them as part of this project?

The discussion in [1] covers some of the background on determining default
processor features and [2] is a list of differences between
creation/restore environments that will need to be addressed for
portability.

Looking forward to hearing others' thoughts on this,
--Dan

[1] https://github.com/eclipse-openj9/openj9/issues/7966
[2] https://github.com/eclipse-openj9/openj9/issues/12484

From akozlov at azul.com Tue Oct 19 14:10:14 2021
From: akozlov at azul.com (Anton Kozlov)
Date: Tue, 19 Oct 2021 17:10:14 +0300
Subject: Portability of checkpoints?
In-Reply-To:
References:
Message-ID: <488772ec-3d75-b149-7cbb-1ac3c89e2ffa@azul.com>

On 10/15/21 22:52, Dan Heidinga wrote:
> How portable should CRaC checkpoints be?

The ability to run the image in a different environment makes CRaC useful.
So I think the answer is as much as practical.

> The implication of this approach is that a checkpoint created on one
> machine may not be valid on another due to changes in the target
> architecture in addition to changes in the environment. It would be good
> if we could surface a list of the things that will need to be changed in
> the jvm and in the class libraries to address this.

This is a good use case that we'd like to support. But what java class level
changes would we need in the context of the different CPU?

> As an example, in OpenJ9 we added a commandline option to tell the jit to
> generate more conservative code ... Does Hotspot have similar options already
> or do we need to pursue adding them as part of this project?

There are no such options now, and it will be great to have them as the first
step toward. AFAICS, some kind of framework for CPU flags presents in aarch64
and x86 [3] and is used e.g. in [4]. But before implementation, it is worth
asking for a review of a plan on hotspot-dev maillist [5].

> The discussion in [1] covers some of the background on determining default
> processor features and [2] is a list of differences between
> creation/restore environments that will need to be addressed for
> portability.

This is valuable info, such a list could you copy it here? [1]. The list [2] is
very valuable in the context of CRaC and all preliminary discussions about the
project.

Leaving topics that require changes to the Java API, it looks like there are
different levels of portability.

0. No portability. Able to restore on the same machine: CPU, operating system,
and, probably, the OS has not restarted since the checkpoint. But it still may
be useful for similar java programs like a scaling microservice; or javac [6].

1. Between machines with the same operating system distribution. The CPU
features set is a good example of this. Also, available memory resources can
change between checkpoint and restore. We'll likely need to change JVM to
handle the difference.
Here we have containers -- it's interesting that even
when starting on the same physical machine (same CPU), a container instance
used for the checkpoint and a container for the restore may have different
hard memory limits.

2. Between different distributions of the same operating system e.g. GNU/Linux.
For checkpoint/restore implemented on top of CRIU it will be a problem since it
stores a complete process memory. It captures the internal layout and the
state of the system libraries such as libc, which may change between
distributions. This level corresponds to the portability of a regular java
build.

There could be more levels, like:

3. Portability between different operating systems, e.g. Linux and Windows.
Unlikely it will be practical to implement and we'll be unable to transfer any
JNI code.

-1. No portability, single restore. This can be implemented by sending Unix
signals SIGSTOP/CONT, is it useful for testing?..

Thanks,
Anton

[1] https://mail.openjdk.java.net/pipermail/hotspot-dev/2021-October/055049.html
[2] https://github.com/eclipse-openj9/openj9/issues/12484
[3] https://github.com/openjdk/crac/blob/master/src/hotspot/cpu/x86/vm_version_x86.hpp#L305
[4] https://github.com/openjdk/crac/blob/master/src/hotspot/share/jvmci/vmStructs_jvmci.cpp#L760
[5] https://mail.openjdk.java.net/pipermail/hotspot-dev/
[6] https://mail.openjdk.java.net/pipermail/discuss/2021-February/005714.html

From heidinga at redhat.com Wed Oct 20 14:00:10 2021
From: heidinga at redhat.com (Dan Heidinga)
Date: Wed, 20 Oct 2021 10:00:10 -0400
Subject: Portability of checkpoints?
In-Reply-To: <488772ec-3d75-b149-7cbb-1ac3c89e2ffa@azul.com>
References: <488772ec-3d75-b149-7cbb-1ac3c89e2ffa@azul.com>
Message-ID:

> This is a good use case that we'd like to support. But what java class level
> changes would we need in the context of the different CPU?

ForkJoinPool is the canonical example here - the common pool, used by
parallel streams, is initially sized based on the number of available
processors. See code in [7] which is called by the static
initializer. This can change when snapshotting on one machine and
restoring on another.

All uses of `Runtime::availableProcessors` will need to be evaluated -
of which there are several in j.u.c and in the nio ThreadPool - to see
if they also need to be adapted. There are sure to be other apis that
need to be similarly investigated - Unsafe.pageSize() is one that
comes to mind. MxBeans may be another.

> > As an example, in OpenJ9 we added a commandline option to tell the jit to
> > generate more conservative code ... Does Hotspot have similar options already
> > or do we need to pursue adding them as part of this project?
>
> There are no such options now, and it will be great to have them as the first
> step toward. AFAICS, some kind of framework for CPU flags presents in aarch64
> and x86 [3] and is used e.g. in [4]. But before implementation, it is worth
> asking for a review of a plan on hotspot-dev maillist [5].

Thanks for the links. Looks like I have some reading to do to figure
out what's already available and where it might need to evolve to.

> > The discussion in [1] covers some of the background on determining default
> > processor features and [2] is a list of differences between
> > creation/restore environments that will need to be addressed for
> > portability.
>
> This is valuable info, such a list could you copy it here? [1]. The list [2] is
> very valuable in the context of CRaC and all preliminary discussions about the
> project.

As per John's note [1], I've hosted the content on crojn at [8] to
avoid pasting a chunk of markdown to the list.

> Leaving topics that require changes to the Java API, it looks like there are
> different levels of portability.
>
> 0. No portability.
> Able to restore on the same machine: CPU, operating system,
> and, probably, the OS has not restarted since the checkpoint. But it still may
> be useful for similar java programs like a scaling microservice; or javac [6].

Depending on the checkpoint/restore mechanism, this may also require
that files (ie: logs) haven't changed between the checkpoint & the
restore. CRaC stops the checkpoint if there are open file handles but
that's not a strict requirement of the underlying checkpoint mechanism
(ie: CRIU is able to restore them). There's a fine line between
"things that cause restore failures" and "things that prevent
portability". I may be falling into the first category here.

No portability seems very similar to the use cases for Java daemons
such as Nailgun [9]. Useful for some small set of cases but less
applicable?

> 1. Between machines with the same operating system distribution. The CPU
> features set is a good example of this. Also, available memory resources can
> change between checkpoint and restore. We'll likely need to change JVM to
> handle the difference. Here we have containers -- it's interesting that even
> when starting on the same physical machine (same CPU), a container instance
> used for the checkpoint and a container for the restore may have different
> hard memory limits.
>
> 2. Between different distributions of the same operating system e.g. GNU/Linux.
> For checkpoint/restore implemented on top of CRIU it will be a problem since it
> stores a complete process memory. It captures the internal layout and the
> state of the system libraries such as libc, which may change between
> distributions. This level corresponds to the portability of a regular java
> build.

The "sweet spot" for checkpoint/restore may be in containers as they
constrain the environment reducing the set of things to deal with.
Though, as you point out above, even that's not a perfect solution as
limits (memory / cpu / etc) can still change for container
deployments.
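
To make the point about changing limits concrete, here is a minimal sketch
in Java of a component that re-reads the CPU count when it is told a restore
has happened, rather than trusting a value captured before the checkpoint.
The class RestoreAwarePool and its onRestore() method are hypothetical names
chosen for illustration; they are not part of CRaC or the JDK. In a CRaC
build the call would presumably come from a jdk.crac.Resource hook, which is
left out here so the sketch runs on a stock JDK.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.IntSupplier;

/** Hypothetical sketch: a pool holder that re-sizes itself after a restore. */
public class RestoreAwarePool {
    // The CPU count is injected (an assumption of this sketch) so that the
    // behaviour can be exercised without changing real container limits.
    private final IntSupplier cpuCount;
    private ExecutorService pool;
    private int parallelism;

    public RestoreAwarePool(IntSupplier cpuCount) {
        this.cpuCount = cpuCount;
        resize();
    }

    /** Hypothetical hook: call once the process has been restored. */
    public synchronized void onRestore() {
        resize();
    }

    // Re-read the limit and rebuild the pool only if the limit changed.
    private void resize() {
        int now = Math.max(1, cpuCount.getAsInt());
        if (pool != null && now == parallelism) {
            return;
        }
        if (pool != null) {
            pool.shutdown(); // already submitted tasks still run to completion
        }
        pool = Executors.newFixedThreadPool(now);
        parallelism = now;
    }

    public synchronized int parallelism() {
        return parallelism;
    }

    public synchronized ExecutorService pool() {
        return pool;
    }

    public synchronized void close() {
        pool.shutdown();
    }

    public static void main(String[] args) {
        RestoreAwarePool p =
            new RestoreAwarePool(() -> Runtime.getRuntime().availableProcessors());
        System.out.println("parallelism at start: " + p.parallelism());
        p.onRestore(); // simulated restore; the limit may or may not differ
        System.out.println("parallelism after simulated restore: " + p.parallelism());
        p.close();
    }
}
```

The value cached at construction stays stale until onRestore() runs, which
is the situation described above for code that sizes itself once from
availableProcessors. A memory-sized cache could follow the same pattern with
Runtime.maxMemory().
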
>
> There could be more levels, like:
>
> 3. Portability between different operating systems, e.g. Linux and Windows.
> Unlikely it will be practical to implement and we'll be unable to transfer any
> JNI code.
>
> -1. No portability, single restore. This can be implemented by sending Unix
> signals SIGSTOP/CONT, is it useful for testing?..

I don't see either of these two levels as particularly useful. (Would
love to hear contrary opinions though)

--Dan

[7] https://github.com/openjdk/jdk/blob/895e2bd7c0bded5283eca8792fbfb287bb75016b/src/java.base/share/classes/java/util/concurrent/ForkJoinPool.java#L2564
[8] http://cr.openjdk.java.net/~heidinga/crac/snapshot_env_differences.md
[9] http://martiansoftware.com/nailgun/background.html

>
> Thanks,
> Anton
>
> [1] https://mail.openjdk.java.net/pipermail/hotspot-dev/2021-October/055049.html
> [2] https://github.com/eclipse-openj9/openj9/issues/12484
> [3] https://github.com/openjdk/crac/blob/master/src/hotspot/cpu/x86/vm_version_x86.hpp#L305
> [4] https://github.com/openjdk/crac/blob/master/src/hotspot/share/jvmci/vmStructs_jvmci.cpp#L760
> [5] https://mail.openjdk.java.net/pipermail/hotspot-dev/
> [6] https://mail.openjdk.java.net/pipermail/discuss/2021-February/005714.html
>

From akozlov at azul.com Thu Oct 21 13:46:28 2021
From: akozlov at azul.com (Anton Kozlov)
Date: Thu, 21 Oct 2021 16:46:28 +0300
Subject: Portability of checkpoints?
In-Reply-To:
References: <488772ec-3d75-b149-7cbb-1ac3c89e2ffa@azul.com>
Message-ID: <1448966f-da28-d3b7-7868-de8fdd857564@azul.com>

On 10/20/21 17:00, Dan Heidinga wrote:
>> This is a good use case that we'd like to support. But what java class level
>> changes would we need in the context of the different CPU?
>
> ForkJoinPool is the canonical example here - the common pool, used by
> parallel streams, is initially sized based on the number of available
> processors. See code in [7] which is called by the static
> initializer.
> This can change when snapshotting on one machine and
> restoring on another.
>
> All uses of `Runtime::availableProcessors` will need to be evaluated -
> of which there are several in j.u.c and in the nio ThreadPool - to see
> if they also need to be adapted. There are sure to be other apis that
> need to be similarly investigated - Unsafe.pageSize() is one that
> comes to mind. MxBeans may be another.

Oh, got it. Completely agree. I initially considered these as a part of the
bigger problem, as nothing prevents users from implementing similar but own
ForkJoin. Here assumptions about base methods such as
j.l.Runtime.availableProcessors need examination, as well as callers in the
JDK, and there may be users' code that similarly uses base methods.

Interesting that availableProcessors can return different values over the
lifetime for a long time already [1]. It seems ForkJoinPool just needs
cooperation with CRaC, the API is good here.

> As per John's note [1], I've hosted the content on crojn at [8] to
> avoid pasting a chunk of markdown to the list.

Great, thanks!

There is a set of things for which users will need to fix their code. An
example is j.u.Locale.getDefault [2], for which we may need the specification.
Another set of things hopefully can be fixed without changes to the exposed
API, at most requiring CSR [3], like the target CPU architecture requiring a
set of new Hotspot flags. We are lucky availableProcessors is already able to
return different values, although I assume not so much code actually expects
this. Probably we may try to translate this list into a list of API sites
that need thinking about.

> Depending on the checkpoint/restore mechanism, this may also require
> that files (ie: logs) haven't changed between the checkpoint & the
> restore. CRaC stops the checkpoint if there are open file handles but
> that's not a strict requirement of the underlying checkpoint mechanism
> (ie: CRIU is able to restore them).
> There's a fine line between
> "things that cause restore failures" and "things that prevent
> portability". I may be falling into the first category here.

To me it sounds like portability is about implementation and restore failures
are about semantics and the Java API. Would it be correct to continue to
think so?

> No portability seems very similar to the use cases for Java daemons
> such as Nailgun [9]. Useful for some small set of cases but less
> applicable?

Great link. As a bit crazy and not a mature idea: would not applications
running under Nailgun benefit from CRaC and be able to reinitialize? Nailgun
could be another checkpoint/restore engine.

>> -1. No portability, single restore. This can be implemented by sending Unix
>> signals SIGSTOP/CONT, is it useful for testing?..
>
> I don't see either of these two levels as particularly useful. (Would
> love to hear contrary opinions though)

I'm tempted to think about level -1 as the simplest non-CRIU based
implementation to try another mechanism. It may also show some time-based
effects of checkpoint/restore. Not something of real-world use.

Thanks,
Anton

[1] https://docs.oracle.com/javase/7/docs/api/java/lang/Runtime.html#availableProcessors()
[2] https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/Locale.html#getDefault()
[3] https://wiki.openjdk.java.net/display/csr/CSR+FAQs

From duke at openjdk.java.net Thu Oct 28 14:52:08 2021
From: duke at openjdk.java.net (Dan Heidinga)
Date: Thu, 28 Oct 2021 14:52:08 GMT
Subject: [crac] RFR: Decouple JVM and CRIU
In-Reply-To:
References:
Message-ID:

On Thu, 28 Oct 2021 09:12:39 GMT, Anton Kozlov wrote:

> CREngine option to specify a program to do checkpoint/restore.
> * `criuengine` to use CRIU as it was used before.
> * `pauseengine` that stores PID in a file and pauses JVM until `java -XX:CRaCRestoreFrom=` is called. This should allow external mechanisms that may operate on a process.
> * `simengine` is something that simulates restore immediately, with the same effect as `-XX:+CRAllowToSkipCheckpoint`. It is intended to be simplest example and could be a starting point to implement Checkpoint/Restore with e.g. the help of a VM/container/... in which JVM is running.

src/hotspot/os/linux/os_linux.cpp line 5793:

> 5791:   os::jvm_path(path, len);
> 5792:   // path is ".../lib/server/libjvm.so"
> 5793:   char *after_elem;

Not sure what the expectations are in openjdk for this but I've always preferred to see every variable be initialized at declaration:

Suggestion:

  char *after_elem = NULL;

src/java.base/unix/native/pauseengine/pauseengine.c line 65:

> 63:
> 64:   } else if (!strcmp(action, "restore")) {
> 65:     FILE *pidfile = fopen(pidpath, "r");

The `FILE*` is being leaked in both these blocks

-------------

PR: https://git.openjdk.java.net/crac/pull/3