From akozlov at azul.com Wed Jun 1 18:31:33 2022
From: akozlov at azul.com (Anton Kozlov)
Date: Wed, 1 Jun 2022 21:31:33 +0300
Subject: AppCDS / AOT thoughts based on CLI app experience
In-Reply-To:
References:
Message-ID: <159565a0-5c7d-84b1-a2cd-e30e0b509faa@azul.com>

Thank you for the excellent write-up!

Although many problems you've mentioned are not solved (and sometimes
are made worse) by CRaC, I can't resist mentioning a CRaC change for CLI
apps [1]. But this is off-topic, so BCCing leyden-dev and CCing crac-dev.

On 6/1/22 16:03, Mike Hearn wrote:
> What about CRaC? It's Linux only so isn't interesting to us, given
> that most devs are on Windows/macOS. The benefits for Linux servers
> are clear though. Obvious question - can you make a snapshot on one
> machine/Linux distro, and resume them on a totally different one, or
> does it require a homogenous infrastructure?

In the current implementation, we've not started working on this. By the
model, CRaC prevents file dependencies at the checkpoint and allows the
VM to coordinate restore. So eventually we should deliver images that do
not depend on the particular CPU and distribution.

The feasibility of a full implementation for macOS and Windows is
unclear. But I think a reasonable effort will be required to provide an
implementation for testing and developing programs on those OSes, which
will match the behavior of the Linux CRaC implementation.

Thanks,
Anton

From asmehra at redhat.com Tue Jun 7 16:05:04 2022
From: asmehra at redhat.com (Ashutosh Mehra)
Date: Tue, 7 Jun 2022 12:05:04 -0400
Subject: CRaC + maven-daemon: An experience report
Message-ID:

Hi,

With the aim of trying out CRaC on a real-world project, to understand
what it would take for developers to adopt this kind of technology in
their projects, we picked maven-daemon and tried to use CRaC to improve
the startup time of the daemon processes.

We have written a report [1] on our efforts. Some of the aspects that
the report touches upon:
1. different approaches a developer can take to use CRaC
2. the kind of code refactoring required to adopt CRaC
3. the change in relationship of entities due to the introduction of
   checkpoint-restore events in the app life cycle
4. the importance of the point at which the snapshot is taken (in terms
   of effect on performance and amount of refactoring required)
5. a discussion on "snapsafety" in the context of this experiment

Any questions or feedback are always welcome.

[1] http://cr.openjdk.java.net/~heidinga/crac/CRaC%20in%20maven-daemon%20project%20-%20final.pdf

Regards,
Ashutosh Mehra

From akozlov at openjdk.java.net Tue Jun 7 18:38:59 2022
From: akozlov at openjdk.java.net (Anton Kozlov)
Date: Tue, 7 Jun 2022 18:38:59 GMT
Subject: [crac] RFR: Fix crash after shm_open failure
Message-ID: <3YL1BkbpsjpF0wN644mT2K98maGnHrq9-9BpAZHK_RI=.2f6eac00-6b97-4936-a62e-2a0bae73dbe0@github.com>

When `_restore_parameters` is not set (e.g. after shm_open failure [0]),
the VM may crash on a NULL dereference [1]. The change makes
`_restore_parameters` always valid.

[0] https://github.com/openjdk/crac/blob/b2783c90a8ad81f6a8564e6cacf97a1ea0190ccd/src/hotspot/os/linux/os_linux.cpp#L6142
[1] https://github.com/openjdk/crac/blob/b2783c90a8ad81f6a8564e6cacf97a1ea0190ccd/src/hotspot/os/linux/os_linux.cpp#L415

shm_open: Function not implemented
shm_open (ignoring new args): Function not implemented
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f85bce8ad37, pid=131, tid=146
#
# JRE version: OpenJDK Runtime Environment (17.0) (build 17-internal+0-adhoc..crac)
# Java VM: OpenJDK 64-Bit Server VM (17-internal+0-adhoc..crac, mixed mode, tiered, compressed oops, compressed class ptrs, serial gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0xc47d37] os::Linux::checkpoint(bool, JavaThread*)+0x107
#
# Core dump will be written. Default location: /tmp/core.%e.131
#
# An error report file with more information is saved as:
# /tmp/hs_err_pid131.log
#
# If you would like to submit a bug report, please visit:
# https://bugreport.java.com/bugreport/crash.jsp
#

-------------

Commit messages:
 - Fix crash after shm_open failure

Changes: https://git.openjdk.java.net/crac/pull/23/files
 Webrev: https://webrevs.openjdk.java.net/?repo=crac&pr=23&range=00
  Stats: 5 lines in 1 file changed: 1 ins; 2 del; 2 mod
  Patch: https://git.openjdk.java.net/crac/pull/23.diff
  Fetch: git fetch https://git.openjdk.java.net/crac pull/23/head:pull/23

PR: https://git.openjdk.java.net/crac/pull/23

From akozlov at azul.com Wed Jun 8 11:06:27 2022
From: akozlov at azul.com (Anton Kozlov)
Date: Wed, 8 Jun 2022 14:06:27 +0300
Subject: CRaC + maven-daemon: An experience report
In-Reply-To:
References:
Message-ID: <3d4bf036-1700-6547-3928-5db1a5e615fc@azul.com>

On 6/7/22 19:05, Ashutosh Mehra wrote:
> we picked up maven-daemon and tried to use CRaC to improve the start
> up time of the daemon processes.

Cool! Did that pay off?

> [Checkpoint] is done by the first daemon process started by a client.
> After performing the requested build, it takes the checkpoint and
> exits.

Checkpoint after a few client invocations may provide better results,
e.g. at the daemon termination.

> There are two options depending on the execution model. You can read
> more about the execution models with CRaC in the blog post on
> phase-aware source code.
> 1. In the first approach, the execution on restore continues from the
>    point where the checkpoint was taken.
> 2. In the second approach, the execution on restore starts from an
>    initialized image but as though the MavenDaemon's main() was being
>    invoked anew.

CRaC does not embed these two modes. Apps may adopt these or similar
ones, but the modes relate to CRaC the way a design pattern relates to
the language: just as the singleton can be implemented in the Java
language, either of these modes can be implemented with CRaC.

The execution in CRaC always continues from the checkpoint. You may
make it exit as soon as it is restored and therefore provide option 2.
So these options are points on a spectrum, rather than a complete set
of alternatives.

In Approach 2:
> This also allows us to shift the time of checkpoint after the Server
> instance has been closed normally Although this change is enough to
> use checkpoint-restore correctly, it does not produce the expected
> benefits in daemon startup time.

It will be interesting to look at the changes, to get a better
understanding of the implementations of the two approaches.

It looks like in Approach 2 you've handled components of the
application, rather than e.g. object fields as in Approach 1. Making the
CRaC code "higher level", handling bigger parts of an application, may
simplify coding but also reduce the benefit. Have you considered making
the code "lower level" than Approach 1, e.g. modifying the server socket
abstraction in the app, if any, to hide the details of the handling
w.r.t. checkpoint and restore? I assume this may make that abstraction
more complex, but simplify using it. If the abstraction becomes general
enough, then it can be considered for inclusion in the CRaC JDK.

> Snapsafety

I still struggle to understand what it is. Is it a property of the code
(e.g. if you use these classes, you are safe w.r.t. checkpoint and
restore and don't need to coordinate explicitly)? Or is it a property of
the state (the state can be safely checkpointed and restored -- what is
safe in this case)?

> Each checkpoint wants to use the same PID when restored.

Two problems here. The current implementation indeed does not allow
changing the PID, but it is possible to some extent. The second problem,
therefore, is what is expected by the Java code. The PID cannot change
during execution right now, but the javadoc does not explicitly state
that. I think it's worth clarifying and seeking consensus in the
broader OpenJDK community.

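To make the PID expectation concrete, here is a minimal sketch (an
illustration only, not code from CRaC or maven-daemon) of Java code that
caches the PID once and so silently assumes it never changes. `PidCache`,
`refresh()` and `isStale()` are hypothetical names; the only real JDK
call is `ProcessHandle.current().pid()`.

```java
// Hypothetical example class; only ProcessHandle is a real JDK API (Java 9+).
final class PidCache {
    // Read once at construction: the implicit "PID never changes" assumption.
    private long cachedPid = ProcessHandle.current().pid();

    long cached() {
        return cachedPid;
    }

    // Would have to be called after a restore if the PID may differ there.
    void refresh() {
        cachedPid = ProcessHandle.current().pid();
    }

    // True when the cached value no longer matches the running process.
    boolean isStale() {
        return cachedPid != ProcessHandle.current().pid();
    }
}

final class PidCacheDemo {
    public static void main(String[] args) {
        PidCache cache = new PidCache();
        System.out.println("cached pid = " + cache.cached()
                + ", stale = " + cache.isStale());
        cache.refresh();
        System.out.println("after refresh, stale = " + cache.isStale());
    }
}
```

If the PID were allowed to change on restore, every holder of such a
cached value would need a hook like `refresh()`; if the javadoc
guaranteed a stable PID, none of them would.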
From heidinga at redhat.com Wed Jun 8 13:19:46 2022
From: heidinga at redhat.com (Dan Heidinga)
Date: Wed, 8 Jun 2022 09:19:46 -0400
Subject: CRaC + maven-daemon: An experience report
In-Reply-To: <3d4bf036-1700-6547-3928-5db1a5e615fc@azul.com>
References: <3d4bf036-1700-6547-3928-5db1a5e615fc@azul.com>
Message-ID:

> > Snapsafety
>
> I still struggle to understand what is it. Is it a property of the code
> (e.g. if you use these classes, you are safe w.r.t. checkpoint and
> restore and don't need to coordinate explicitly)? Or is it a property of
> the state (the state can be safely checkpoint and restore -- what is
> safe in this case)?

This is exactly the point Ashu's making in the document. As much as I
think we would all like snapsafety to be a static property of the
source code so we could analyze it easily with some static analysis,
it's unfortunately more complicated than a static property.

There's a temporal aspect to it - when the checkpoint is taken affects
the safety of the operation. When the snapshot is taken determines
what would need to be fixed up (and much of that is based on
application-specific invariants).

The execution model on restore [0] also impacts the snapsafety. As
Ashu says, using the checkpoint to create an initialized base image
has a different concept of "safety" than migrating a computation from
one host to another. Different pieces of state will need to be
modified in each case and different invariants will hold (or be
broken).

The .NET community took an interesting approach in their "Native AOT"
story for "trimming" applications [1] that may be reusable for
snapsafety - they added warnings for certain operations that are
incompatible with trimming (dead code elimination) and then require
library authors to annotate methods that do generate the warnings.
The annotations bubble up the call chain to the public APIs and then
library consumers can determine whether to call such APIs or not.

Building on this idea, if methods and classes are correctly annotated
(with what annotations? tbd) it may be possible to do some analysis
when the checkpoint is created to determine whether the current state
is "snapsafe" or not. This is not so much a static property that can
be statically analyzed, but one that must be checked when taking the
checkpoint as it may require walking stacks (currently executing
methods), examining loaded classes, heap walks(?), etc.

I don't have a fully fleshed out idea here but wanted to float some
early thoughts. The Leyden project may benefit from some of this
exploration as well, as it will have to tread similar ground.

--Dan

[0] https://danheidinga.github.io/phase-aware-source-code/
[1] https://docs.microsoft.com/en-us/dotnet/core/deploying/trimming/prepare-libraries-for-trimming

From heidinga at openjdk.java.net Wed Jun 8 13:30:00 2022
From: heidinga at openjdk.java.net (Dan Heidinga)
Date: Wed, 8 Jun 2022 13:30:00 GMT
Subject: [crac] RFR: Fix crash after shm_open failure
In-Reply-To: <3YL1BkbpsjpF0wN644mT2K98maGnHrq9-9BpAZHK_RI=.2f6eac00-6b97-4936-a62e-2a0bae73dbe0@github.com>
References: <3YL1BkbpsjpF0wN644mT2K98maGnHrq9-9BpAZHK_RI=.2f6eac00-6b97-4936-a62e-2a0bae73dbe0@github.com>
Message-ID:

On Tue, 7 Jun 2022 18:30:12 GMT, Anton Kozlov wrote:

> When `_restore_parameters` is not set (e.g. after shm_open failure[0]), VM may crash on NULL dereference [1]. The change makes _restore_parameter always valid.
>
> [0] https://github.com/openjdk/crac/blob/b2783c90a8ad81f6a8564e6cacf97a1ea0190ccd/src/hotspot/os/linux/os_linux.cpp#L6142
> [1] https://github.com/openjdk/crac/blob/b2783c90a8ad81f6a8564e6cacf97a1ea0190ccd/src/hotspot/os/linux/os_linux.cpp#L415

src/hotspot/os/linux/os_linux.cpp line 6146:

> 6144:
> 6145:   delete _restore_parameters;
> 6146:   _restore_parameters = CracRestoreParameters::read_from(shmfd);

`CracRestoreParameters::read_from` can return NULL. If we need to ensure
`_restore_parameters` is not null, then we need to do something like
this instead:

Suggestion:

  CracRestoreParameters *original_parameters = _restore_parameters;
  _restore_parameters = CracRestoreParameters::read_from(shmfd);
  if (_restore_parameters == NULL) {
    _restore_parameters = original_parameters;
  } else {
    delete original_parameters;
  }

-------------

PR: https://git.openjdk.java.net/crac/pull/23

From asmehra at redhat.com Wed Jun 8 14:08:39 2022
From: asmehra at redhat.com (Ashutosh Mehra)
Date: Wed, 8 Jun 2022 10:08:39 -0400
Subject: CRaC + maven-daemon: An experience report
In-Reply-To: <3d4bf036-1700-6547-3928-5db1a5e615fc@azul.com>
References: <3d4bf036-1700-6547-3928-5db1a5e615fc@azul.com>
Message-ID:

> > we picked up maven-daemon and tried to use CRaC to improve the start
> > up time of the daemon processes.
> Cool! Did that pay off?

Yes, it definitely improved the startup time of the daemon process. I
haven't measured the actual improvement, but it is very apparent -
almost instant startup on restore.

> Checkpoint after a few client invocations may provide better results,
> e.g. at the daemon termination.

It is difficult to quantify "few" here without doing some performance
measurements. So for the PoC I took the snapshot after the first
request, which seems to provide considerable startup improvement.

> It will be interesting to look at the changes, to get a better
> understanding of the implementations of the two approaches.

Sorry, the links didn't work in the earlier document. We have fixed that
now. You should be able to check out the changesets from the links in
the doc.

> Have you considered making the code
> "lower level" than Approach 1, e.g. to modify server socket abstraction
> in the app, if any, to hide the details of the handling w.r.t.
> checkpoint and restore.

Unfortunately it does not use any abstraction over ServerSocketChannel.
So the Server class has to take care of the channel on
checkpoint-restore events.

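For illustration, a minimal sketch of what such an abstraction over
ServerSocketChannel could look like. This is not the maven-daemon code:
`CheckpointAware` is a hypothetical stand-in for the
beforeCheckpoint/afterRestore callbacks of CRaC's Resource interface (the
real callbacks also take a Context argument), used so the sketch compiles
on a stock JDK, and `RestorableServerChannel` is an invented name.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;

// Hypothetical stand-in for the two callbacks of CRaC's Resource interface.
interface CheckpointAware {
    void beforeCheckpoint();
    void afterRestore();
}

// Hypothetical wrapper: owns the channel and hides its checkpoint handling,
// so the code using it never sees a closed channel around a checkpoint.
final class RestorableServerChannel implements CheckpointAware {
    private ServerSocketChannel channel;

    RestorableServerChannel() {
        open();
    }

    private void open() {
        try {
            channel = ServerSocketChannel.open();
            // Port 0 lets the OS pick a free port on the loopback interface.
            channel.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // No open socket may remain when the process image is taken.
    @Override
    public void beforeCheckpoint() {
        try {
            channel.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Re-create the listening socket in the restored process.
    @Override
    public void afterRestore() {
        open();
    }

    boolean isOpen() {
        return channel.isOpen();
    }

    int port() {
        return channel.socket().getLocalPort();
    }
}

final class RestorableServerChannelDemo {
    public static void main(String[] args) {
        RestorableServerChannel server = new RestorableServerChannel();
        System.out.println("listening on port " + server.port());
        server.beforeCheckpoint();   // simulated checkpoint
        System.out.println("open during checkpoint: " + server.isOpen());
        server.afterRestore();       // simulated restore
        System.out.println("listening again on port " + server.port());
    }
}
```

Because the sketch binds to port 0, the port after restore may differ
from the one before the checkpoint, so anything that published the old
port would need refreshing as well.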
> The PID cannot change during
> execution right now, but the javadoc does not explicitly state that. I
> think it's worth clarifying and seeking the consensus in the broader
> OpenJDK community.

It really depends on the state captured in the checkpoint image. If the
checkpoint only captures the Java program state, then the effect of
changing the PID is within the boundary of the JVM/JDK and can be taken
care of. But in the case of CRIU, the whole process state is serialized,
which includes native libraries, OS resources and what not. They may not
play well if the PID is changed.

Regards,
Ashutosh Mehra

From asmehra at redhat.com Wed Jun 8 14:11:04 2022
From: asmehra at redhat.com (Ashutosh Mehra)
Date: Wed, 8 Jun 2022 10:11:04 -0400
Subject: CRaC + maven-daemon: An experience report
In-Reply-To:
References:
Message-ID:

Sorry for the broken links in the document [1]. We have fixed that now.

[1] http://cr.openjdk.java.net/~heidinga/crac/CRaC%20in%20maven-daemon%20project%20-%20final.pdf

Regards,
Ashutosh Mehra

From akozlov at openjdk.java.net Wed Jun 8 15:46:00 2022
From: akozlov at openjdk.java.net (Anton Kozlov)
Date: Wed, 8 Jun 2022 15:46:00 GMT
Subject: [crac] RFR: Fix crash after shm_open failure [v2]
In-Reply-To: <3YL1BkbpsjpF0wN644mT2K98maGnHrq9-9BpAZHK_RI=.2f6eac00-6b97-4936-a62e-2a0bae73dbe0@github.com>
References: <3YL1BkbpsjpF0wN644mT2K98maGnHrq9-9BpAZHK_RI=.2f6eac00-6b97-4936-a62e-2a0bae73dbe0@github.com>
Message-ID:

> When `_restore_parameters` is not set (e.g. after shm_open failure[0]), VM may crash on NULL dereference [1].
> The change makes `_restore_parameters` always valid.
>
> [0] https://github.com/openjdk/crac/blob/b2783c90a8ad81f6a8564e6cacf97a1ea0190ccd/src/hotspot/os/linux/os_linux.cpp#L6142
> [1] https://github.com/openjdk/crac/blob/b2783c90a8ad81f6a8564e6cacf97a1ea0190ccd/src/hotspot/os/linux/os_linux.cpp#L415

Anton Kozlov has updated the pull request incrementally with one
additional commit since the last revision:

  Handle read_from returning NULL

-------------

Changes:
  - all: https://git.openjdk.java.net/crac/pull/23/files
  - new: https://git.openjdk.java.net/crac/pull/23/files/62e4bb7a..ba466c16

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=crac&pr=23&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=crac&pr=23&range=00-01

  Stats: 5 lines in 1 file changed: 3 ins; 0 del; 2 mod
  Patch: https://git.openjdk.java.net/crac/pull/23.diff
  Fetch: git fetch https://git.openjdk.java.net/crac pull/23/head:pull/23

PR: https://git.openjdk.java.net/crac/pull/23

From akozlov at openjdk.java.net Wed Jun 8 15:46:01 2022
From: akozlov at openjdk.java.net (Anton Kozlov)
Date: Wed, 8 Jun 2022 15:46:01 GMT
Subject: [crac] RFR: Fix crash after shm_open failure [v2]
In-Reply-To:
References: <3YL1BkbpsjpF0wN644mT2K98maGnHrq9-9BpAZHK_RI=.2f6eac00-6b97-4936-a62e-2a0bae73dbe0@github.com>
Message-ID:

On Wed, 8 Jun 2022 13:25:49 GMT, Dan Heidinga wrote:

>> Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Handle read_from returning NULL
>
> src/hotspot/os/linux/os_linux.cpp line 6146:
>
>> 6144:
>> 6145:   delete _restore_parameters;
>> 6146:   _restore_parameters = CracRestoreParameters::read_from(shmfd);
>
> `CracRestoreParameters::read_from` can return NULL. If we need to ensure `_restore_parameters` is not null, then we need to do something like this instead:
>
> Suggestion:
>
>   CracRestoreParameters *original_parameters = _restore_parameters;
>   _restore_parameters = CracRestoreParameters::read_from(shmfd);
>   if (_restore_parameters == NULL) {
>     _restore_parameters = original_parameters;
>   } else {
>     delete original_parameters;
>   }

Indeed, thanks! Fixed

-------------

PR: https://git.openjdk.java.net/crac/pull/23

From heidinga at openjdk.java.net Wed Jun 8 15:50:03 2022
From: heidinga at openjdk.java.net (Dan Heidinga)
Date: Wed, 8 Jun 2022 15:50:03 GMT
Subject: [crac] RFR: Fix crash after shm_open failure [v2]
In-Reply-To:
References: <3YL1BkbpsjpF0wN644mT2K98maGnHrq9-9BpAZHK_RI=.2f6eac00-6b97-4936-a62e-2a0bae73dbe0@github.com>
Message-ID:

On Wed, 8 Jun 2022 15:46:00 GMT, Anton Kozlov wrote:

>> When `_restore_parameters` is not set (e.g. after shm_open failure[0]), VM may crash on NULL dereference [1]. The change makes _restore_parameter always valid.
>>
>> [0] https://github.com/openjdk/crac/blob/b2783c90a8ad81f6a8564e6cacf97a1ea0190ccd/src/hotspot/os/linux/os_linux.cpp#L6142
>> [1] https://github.com/openjdk/crac/blob/b2783c90a8ad81f6a8564e6cacf97a1ea0190ccd/src/hotspot/os/linux/os_linux.cpp#L415
>
> Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision:
>
>   Handle read_from returning NULL

lgtm

-------------

Marked as reviewed by heidinga (Committer).

PR: https://git.openjdk.java.net/crac/pull/23

From akozlov at azul.com Fri Jun 10 13:29:18 2022
From: akozlov at azul.com (Anton Kozlov)
Date: Fri, 10 Jun 2022 16:29:18 +0300
Subject: CRaC + maven-daemon: An experience report
In-Reply-To:
References: <3d4bf036-1700-6547-3928-5db1a5e615fc@azul.com>
Message-ID: <2bd4db04-7218-0d49-cd64-971cf63f98ce@azul.com>

On 6/8/22 16:19, Dan Heidinga wrote:
>>> Snapsafety
>>
>> I still struggle to understand what is it. Is it a property of the code
>> (e.g. if you use these classes, you are safe w.r.t. checkpoint and
>> restore and don't need to coordinate explicitly)?
>> Or is it a property of
>> the state (the state can be safely checkpoint and restore -- what is
>> safe in this case)?
>
> This is exactly the point Ashu's making in the document. As much as I
> think we would all like snapsafety to be a static property of the
> source code so we could analyze it easily with some static analysis,
> it's unfortunately more complicated than a static property.

Hence my question :) Let's say it is a property of the object state.
Then a class has the property if all objects of the class have it in
all possible states.

I don't see any other way for correctness and security to be defined
for a class, other than providing a Resource implementation on a
per-class basis, and taking care of class functionality and internal
invariants. That is, no mark or predicate on the code or the Resource
implementation can imply real safety -- that will always remain a
non-formal property that should be aligned with the surrounding context
in the class. For example, adding another field and its initialization
can change the class from safe to unsafe. Such a change is hard to
correlate with the necessary changes in the Resource implementation.

> There's a temporal aspect to it - when the checkpoint is taken affects
> the safety of the operation. When the snapshot is taken determines
> what would need to be fixed up (and much of that is based on
> application specific invariants).
>
> The execution model on restore [0] also impacts the snapsafety. As
> Ashu says, using the checkpoint to create an initialized base image
> has a different concept of "safety" than migrating a computation from
> one host to another. Different pieces of state will need to be
> modified in each case and different invariants will hold (or be
> broken).

Indeed. Is the formal property worth pursuing then? This was going to
be a language aid for app developers to annotate safe parts of their
programs, and for us to annotate parts of the JDK. While we can attempt
to annotate the JDK correctly and fully, we cannot control how the
language feature will be used by users. And for them, a better
annotation mechanism or programming model (like reactive programming)
may exist. How about letting users decide how and when to annotate
their programs, and concentrating first on JDK needs and on how the JDK
is used by applications, as we understand these better? For example,
what's missing in the JDK so that the app won't need changing at all?
And what parts of the app are absolutely necessary to change? Is it
possible for the JDK to provide a set of utilities to ease those
changes?

> The .NET community took an interesting approach in their "Native AOT"
> story for "trimming" applications [1] that may be reusable for
> snapsaftey - they added warnings for certain operations that are
> incompatible with trimming (dead code elimination) and then require
> library authors to annotate methods that do generate the warnings.
> The annotations bubble up the call chain to the public apis and then
> library consumers can determine whether to call such apis or not.
>
> Building on this idea, if methods and classes are correctly annotated
> (with what annotations? tbd) it may be possible to do some analysis
> when the checkpoint is created to determine whether the current state
> is "snapsafe" or not. This is not so much a static property that can
> be statically analyzed, but one that must be checked when taking the
> checkpoint as it may require walking stacks (currently executing
> methods), examining loaded classes, heap walks(?), etc.

It is already possible to create a runtime check for object state
safety, namely by implementing Resource's beforeCheckpoint. An unsafe
object may always throw an exception. Won't this be even more flexible?

This relates a lot to Snapsafety of core library classes [1]; I'll
reply there.

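A minimal sketch of such a runtime check, under stated assumptions:
`CheckpointGuard` is a hypothetical stand-in for the beforeCheckpoint
callback of CRaC's Resource interface, and `CheckpointCoordinator`
imitates only the "ask every registered resource" step of the runtime.
`ConnectionPool` and `CheckpointRefusedException` are likewise invented
for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Resource.beforeCheckpoint.
interface CheckpointGuard {
    void beforeCheckpoint();
}

// Hypothetical exception type used to veto a checkpoint.
final class CheckpointRefusedException extends RuntimeException {
    CheckpointRefusedException(String message) {
        super(message);
    }
}

// The object itself decides, from its current state, whether it is safe.
final class ConnectionPool implements CheckpointGuard {
    private int openConnections;

    void acquire() {
        openConnections++;
    }

    void release() {
        openConnections--;
    }

    @Override
    public void beforeCheckpoint() {
        if (openConnections > 0) {
            throw new CheckpointRefusedException(
                    openConnections + " connection(s) still open");
        }
    }
}

// Hypothetical coordinator: the checkpoint proceeds only if no guard objects.
final class CheckpointCoordinator {
    private final List<CheckpointGuard> guards = new ArrayList<>();

    void register(CheckpointGuard guard) {
        guards.add(guard);
    }

    boolean tryCheckpoint() {
        for (CheckpointGuard guard : guards) {
            try {
                guard.beforeCheckpoint();
            } catch (CheckpointRefusedException e) {
                return false; // some live object is not in a safe state
            }
        }
        return true;
    }
}

final class CheckpointGuardDemo {
    public static void main(String[] args) {
        ConnectionPool pool = new ConnectionPool();
        CheckpointCoordinator coordinator = new CheckpointCoordinator();
        coordinator.register(pool);

        pool.acquire();
        System.out.println("busy pool allows checkpoint: " + coordinator.tryCheckpoint());
        pool.release();
        System.out.println("idle pool allows checkpoint: " + coordinator.tryCheckpoint());
    }
}
```

The same class is checkpointable in one state and not in another, which
is the flexibility a static mark on the class cannot express.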
Thanks for bringing more context,
-- Anton

[1] https://mail.openjdk.java.net/pipermail/crac-dev/2022-May/000222.html

From akozlov at azul.com Fri Jun 10 13:51:05 2022
From: akozlov at azul.com (Anton Kozlov)
Date: Fri, 10 Jun 2022 16:51:05 +0300
Subject: Snapsafety of core library classes
In-Reply-To:
References:
Message-ID:

On 5/20/22 16:38, Dan Heidinga wrote:
> On Thu, May 19, 2022 at 8:09 AM Volker Simonis wrote:
>> I wonder if anybody has thought about how snapsafety for the core
>> library classes should be implemented in CRaC? By "snapsafety" I mean
>> correct and secure operation after restoring a JVM process which was
>> previously checkpointed and possibly cloned.

Apparently two major problems (likely overlapping) are mentioned here.
Does the JVM state make sense, and is it safe, after it was restored as
another instance in a possibly different environment? And the same
question for a cloned state. Do we care only about these two
situations? I'm trying to understand the set of conditions such that
once something is proved to be secure and correct in each of them, it
will be considered secure and correct universally.

Still, safety and correctness are properties that are hard to
formalize. They mean different things for different classes. And a
single class may be correct or not depending on the context in which it
is used.

> This is currently being developed on an ad-hoc basis in CRaC. Look
> for classes that implement the jdk.crac.Resource interface and the
> actions they take in the ::afterRestore / ::beforeCheckpoint methods
> to see how each class has addressed its own "snapsafety".
>
> To your point, I think we're still exploring and determining the cases
> that are snapsafe (or not). We can look at the classes GraalVM has
> patched with Substitutions as a starting set of classes that will need
> adaptation to be snapsafe. That will help identify a starting set but
> the full set will be larger.

A good starting point, but I think Substitutions contains more than we need, like providing java equalents for otherwise unanalyzable native code. >> The first question is about deciding which classes can be considered >> snapsafe? Naively any class whose objects hold some state will be >> affected by snapshotting and cloning. For simple classes like String >> or Integer we know that their objects are constant and cloning them >> doesn't do any harm. Objects of other classes might however contain >> more sensitive state like caches, unique identifiers, certificates, >> encryption keys etc. which shouldn't be cloned or which become invalid >> after restore. Interesting, these examples are related to things outside of JVM. Initially, I've considered these to be the source of most of the problems. Good to know the j.l. Random example that's problematic if cloned. It will be interesting to find others, that would uncover more situations that we'll need to consider. > Agreed. Though each class will need to be individually examined to > ensure that the changes to make it snapshot don't break the invariants > of the class. This looks the same for me. > Looking just at caches as an example, it may seem safe > to clean out the cache before a checkpoint but doing so may break > invariants about canonicalization of values as those looked up prior > to the checkpoint may be different (not ==) to those looked up after > restore. Isn't this just breaking the sequential logic of the program? If an operation could not be triggered at a random moment without breaking application invariants, I suppose this is just not a good implementation of the Resource. . This probably should be the Rule 0, don't break yourself :). >> By looking at the current CRaC repository [1] I can see that some >> classes (e.g. sun.security.provider.SecureRandom or >> sun.security.provider.NativePRNG.RandomIO) directly implement >> j.i.c.JDKResource in order to make them snapsafe. 
>> But all the classes which do so, are non-public. This means that snapsafety is currently a
>> "hidden", implicit feature of some classes in the core library (i.e.
>> if I create a new j.s.SecureRandom object, I can not know if it will
>> be snapsafe or not).
>>
>> Do we want to make snapsafety an undocumented, implicit feature or do
>> we want to explicitly call it out in the JavaDoc, e.g. by forcing
>> classes which want to be snapsafe to implement javax.crac.Resource
>> (similar to implementing Serializable)?

I think this should be text in the javadoc describing what happens on checkpoint and restore. That way, we'll be able to specify the behavior in terms of the original code. By reading the text, users will be able to decide whether their programs are correct and safe or whether they need to do something app-specific. And we'll be able to document intentionally omitted handling, e.g. why j.l.Random is not reinitialized.

> Bringing snapsafety into the language makes sense. Implementing
> Resource is probably overkill for most classes as their safety is an
> emergent property of the field's snap safety. Can we reverse this to
> tag "snap-unsafe" classes and have javac warn / error when compiling a
> class with snap-unsafe fields unless they implement Resource?
>
> Does the concept of snapsafety need to differentiate between the
> static state of the class and its instances?
>

After some thought, I think that a formal checkable property may do harm. By introducing one, we'll create two different programming models, regular Java and "snapsafe Java", splitting the language without need. If we mark safe classes and treat classes as unsafe by default (an option from below), we'll make a lot of valid states non-checkpointable. We'll likely enlarge the set of safe classes over time, making our new programming model a moving target. If classes are safe by default, what is the reason to mark anything?
If we instead mark unsafe classes and treat classes as safe by default, wouldn't it be better to fix the unsafe ones? Or unconditionally throw in beforeCheckpoint, which is already supported -- then only a live object that is incompatible with checkpoint will prevent the checkpoint, not the fact that dangerous code was referenced in the past.

>> I think both approaches have their pros and cons. If we make
>> snapsafety an explicit feature, we tell users that the corresponding
>> classes will behave correctly on snapshot and restore events. But what
>> about all the other classes in the core libraries. Are they all
>> snapsafe or snapunsafe by default
>>
>> If we make snapsafety an implicit feature it would become an
>> "implementation detail". This means we could have JDKs which are
>> snapsafe while other are not. It also means we could make older JDK
>> version snapsafe which would not be possible with the explicit model
>> because it is impossible to retrofit classes in older releases to
>> implement new interfaces.

I don't see the cons of this :) As for negating the pros of the explicit approach (users don't know if they are safe on checkpoint): we cannot tell in advance that a class satisfies all possible user needs and expectations. When the behavior is unclear from the doc, that is a bug in the JDK that needs fixing by providing the doc and, optionally, actions for checkpoint and restore.

> I'd prefer to make it explicit in the programming model to avoid the
> "sins of serialization". Brian wrote a document titled "Towards
> Better Serialization" [A]

(Thanks for the great link!)

> where it outlines the issues with
> serialization, including:
> * "Pretends to be a library feature, but isn't", "Serialization pretends to be a library feature -- you opt in by implementing the Serializable interface, and serialize with ObjectOutputStream.
> In reality, though, serialization extracts object state and recreates objects via privileged, extralinguistic mechanisms, bypassing constructors and ignoring class and field accessibility."

What part of CRaC exhibits a similar problem? I think the major problem for serialization here is that encapsulation is broken. Our notification API is implemented in pure Java, without reflection, etc.

> * "Pretends to be a statically typed feature, but isn't", and "Serializability is a function of an object's dynamic type, not its static type; implements Serializable doesn't actually mean that instances are serializable, just that they are not overtly serialization-hostile. So, despite the requirement to opt-in via the static type system, doing so gives you little confidence that your instances are actually serializable."

Not sure what is meant by dynamic type, but if we introduce a mark for the code, we'll get the same low level of confidence that the safety was implemented correctly.

> * "Magic methods and fields". "There are a number of "magic" methods and fields (in the sense that they are not specified by any base class or interface) that affect the behavior of serialization. ... Because these do not exist in any public type, they're hard to discover, and one cannot easily navigate to their specification. They are also easy to accidentally get wrong; if you spell them wrong, or get the signature wrong, or make them static members when they should be instance members, no one tells you."

I think in Resource we don't have this problem.
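
To make the two Resource behaviours discussed in this message a bit more concrete (re-initializing state after restore, and unconditionally throwing in beforeCheckpoint so that only a live object blocks the checkpoint), here is a minimal sketch. It rests on several assumptions: `Resource` below is a local stand-in with simplified signatures, not the real jdk.crac.Resource; `ReseedingIdGenerator`, `NonCheckpointableHandle` and `tryCheckpoint` are hypothetical names; and an AtomicLong stands in for a real entropy source so the behaviour is deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class SnapsafetySketch {

    /** Local stand-in for a CRaC-style resource; NOT the real jdk.crac.Resource. */
    interface Resource {
        void beforeCheckpoint() throws Exception;
        void afterRestore() throws Exception;
    }

    /** Hypothetical class whose state must not be shared by clones: it reseeds on restore. */
    static final class ReseedingIdGenerator implements Resource {
        private final AtomicLong seedSource; // assumption: stands in for an entropy source
        private long seed;
        private long counter;

        ReseedingIdGenerator(AtomicLong seedSource) {
            this.seedSource = seedSource;
            this.seed = seedSource.incrementAndGet();
        }

        String nextId() {
            return seed + "-" + (counter++);
        }

        @Override
        public void beforeCheckpoint() {
            // Nothing to release; the state is simply replaced after restore.
        }

        @Override
        public void afterRestore() {
            // Every restored instance picks up a fresh seed.
            seed = seedSource.incrementAndGet();
            counter = 0;
        }
    }

    /** Hypothetical object that is incompatible with checkpoint for as long as it is registered. */
    static final class NonCheckpointableHandle implements Resource {
        @Override
        public void beforeCheckpoint() {
            throw new IllegalStateException("live handle prevents checkpoint");
        }

        @Override
        public void afterRestore() {
            // Never reached in this sketch: the checkpoint is vetoed first.
        }
    }

    /** Toy coordinator: returns true only if no registered resource vetoes the checkpoint. */
    static boolean tryCheckpoint(List<Resource> registry) {
        try {
            for (Resource r : registry) {
                r.beforeCheckpoint();
            }
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ReseedingIdGenerator ids = new ReseedingIdGenerator(new AtomicLong());
        NonCheckpointableHandle handle = new NonCheckpointableHandle();
        List<Resource> registry = new ArrayList<>();
        registry.add(ids);
        registry.add(handle);

        System.out.println("checkpoint allowed with live handle: " + tryCheckpoint(registry));
        registry.remove(handle); // the handle is no longer live
        System.out.println("checkpoint allowed without handle: " + tryCheckpoint(registry));

        String before = ids.nextId();
        ids.afterRestore();
        System.out.println("ids differ across restore: " + !before.equals(ids.nextId()));
    }
}
```

In this sketch the veto depends only on which objects are registered at checkpoint time, not on which code ran earlier, which is the point made above about throwing from beforeCheckpoint.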
Thanks, Anton From akozlov at openjdk.java.net Fri Jun 10 13:59:43 2022 From: akozlov at openjdk.java.net (Anton Kozlov) Date: Fri, 10 Jun 2022 13:59:43 GMT Subject: [crac] RFR: Fix crash after shm_open failure [v2] In-Reply-To: References: <3YL1BkbpsjpF0wN644mT2K98maGnHrq9-9BpAZHK_RI=.2f6eac00-6b97-4936-a62e-2a0bae73dbe0@github.com> Message-ID: On Wed, 8 Jun 2022 15:46:00 GMT, Anton Kozlov wrote: >> When `_restore_parameters` is not set (e.g. after shm_open failure[0]), VM may crash on NULL dereference [1]. The change makes _restore_parameter always valid. >> >> [0] https://github.com/openjdk/crac/blob/b2783c90a8ad81f6a8564e6cacf97a1ea0190ccd/src/hotspot/os/linux/os_linux.cpp#L6142 >> [1] https://github.com/openjdk/crac/blob/b2783c90a8ad81f6a8564e6cacf97a1ea0190ccd/src/hotspot/os/linux/os_linux.cpp#L415 >> >> >> shm_open: Function not implemented >> shm_open (ignoring new args): Function not implemented >> # >> # A fatal error has been detected by the Java Runtime Environment: >> # >> # SIGSEGV (0xb) at pc=0x00007f85bce8ad37, pid=131, tid=146 >> # >> # JRE version: OpenJDK Runtime Environment (17.0) (build 17-internal+0-adhoc..crac) >> # Java VM: OpenJDK 64-Bit Server VM (17-internal+0-adhoc..crac, mixed mode, tiered, compressed oops, compressed class ptrs, serial gc, linux-amd64) >> # Problematic frame: >> # V [libjvm.so+0xc47d37] os::Linux::checkpoint(bool, JavaThread*)+0x107 >> # >> # Core dump will be written. Default location: /tmp/core.%e.131 >> # >> # An error report file with more information is saved as: >> # /tmp/hs_err_pid131.log >> # >> # If you would like to submit a bug report, please visit: >> # https://bugreport.java.com/bugreport/crash.jsp >> # > > Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Handle read_from returning NULL Thanks! 
------------- PR: https://git.openjdk.org/crac/pull/23 From akozlov at openjdk.java.net Fri Jun 10 13:59:43 2022 From: akozlov at openjdk.java.net (Anton Kozlov) Date: Fri, 10 Jun 2022 13:59:43 GMT Subject: [crac] Integrated: Fix crash after shm_open failure In-Reply-To: <3YL1BkbpsjpF0wN644mT2K98maGnHrq9-9BpAZHK_RI=.2f6eac00-6b97-4936-a62e-2a0bae73dbe0@github.com> References: <3YL1BkbpsjpF0wN644mT2K98maGnHrq9-9BpAZHK_RI=.2f6eac00-6b97-4936-a62e-2a0bae73dbe0@github.com> Message-ID: On Tue, 7 Jun 2022 18:30:12 GMT, Anton Kozlov wrote: > When `_restore_parameters` is not set (e.g. after shm_open failure[0]), VM may crash on NULL dereference [1]. The change makes _restore_parameter always valid. > > [0] https://github.com/openjdk/crac/blob/b2783c90a8ad81f6a8564e6cacf97a1ea0190ccd/src/hotspot/os/linux/os_linux.cpp#L6142 > [1] https://github.com/openjdk/crac/blob/b2783c90a8ad81f6a8564e6cacf97a1ea0190ccd/src/hotspot/os/linux/os_linux.cpp#L415 > > > shm_open: Function not implemented > shm_open (ignoring new args): Function not implemented > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x00007f85bce8ad37, pid=131, tid=146 > # > # JRE version: OpenJDK Runtime Environment (17.0) (build 17-internal+0-adhoc..crac) > # Java VM: OpenJDK 64-Bit Server VM (17-internal+0-adhoc..crac, mixed mode, tiered, compressed oops, compressed class ptrs, serial gc, linux-amd64) > # Problematic frame: > # V [libjvm.so+0xc47d37] os::Linux::checkpoint(bool, JavaThread*)+0x107 > # > # Core dump will be written. Default location: /tmp/core.%e.131 > # > # An error report file with more information is saved as: > # /tmp/hs_err_pid131.log > # > # If you would like to submit a bug report, please visit: > # https://bugreport.java.com/bugreport/crash.jsp > # This pull request has now been integrated. 
Changeset: a45937b1 Author: Anton Kozlov URL: https://git.openjdk.org/crac/commit/a45937b190f10b87d21abac1c20138bbb12fbd90 Stats: 9 lines in 1 file changed: 4 ins; 2 del; 3 mod Fix crash after shm_open failure Reviewed-by: heidinga ------------- PR: https://git.openjdk.org/crac/pull/23 From inakonechnyy at openjdk.java.net Sat Jun 11 00:30:50 2022 From: inakonechnyy at openjdk.java.net (Ilarion Nakonechnyy) Date: Sat, 11 Jun 2022 00:30:50 GMT Subject: [crac] RFR: Report checkpoint processing to jcmd Message-ID: pass output stream from diagnosticCommand.cpp through java code into os_linux.cpp::VM_crac::doit() ------------- Commit messages: - whitespaces - Properly pass return code to jcmd peer, - Check the opened socket, if it is UNIX socked from jcmd. - Implementation of reporting Changes: https://git.openjdk.org/crac/pull/10/files Webrev: https://webrevs.openjdk.java.net/?repo=crac&pr=10&range=00 Stats: 273 lines in 8 files changed: 216 ins; 17 del; 40 mod Patch: https://git.openjdk.org/crac/pull/10.diff Fetch: git fetch https://git.openjdk.org/crac pull/10/head:pull/10 PR: https://git.openjdk.org/crac/pull/10 From inakonechnyy at openjdk.java.net Sat Jun 11 00:30:50 2022 From: inakonechnyy at openjdk.java.net (Ilarion Nakonechnyy) Date: Sat, 11 Jun 2022 00:30:50 GMT Subject: [crac] RFR: Report checkpoint processing to jcmd In-Reply-To: References: Message-ID: On Tue, 25 Jan 2022 15:07:55 GMT, Ilarion Nakonechnyy wrote: > pass output stream from diagnosticCommand.cpp through java code into os_linux.cpp::VM_crac::doit() The example of output in jcmd: ------------- PR: https://git.openjdk.org/crac/pull/10 From inakonechnyy at openjdk.java.net Sun Jun 12 22:27:25 2022 From: inakonechnyy at openjdk.java.net (Ilarion Nakonechnyy) Date: Sun, 12 Jun 2022 22:27:25 GMT Subject: [crac] RFR: Report checkpoint processing to jcmd In-Reply-To: References: Message-ID: <8muRKp5Bitj0gIVFcF2NeSzwfIHA-S8bAJfzdnqJmS4=.88d0c7c8-5a18-45dc-b035-1ee6b55414af@github.com> On Tue, 25 Jan 
2022 15:07:55 GMT, Ilarion Nakonechnyy wrote: > pass output stream from diagnosticCommand.cpp through java code into os_linux.cpp::VM_crac::doit() With the new changes, the opened socket (from jcmd) is checked "if the socket is from jcmd" - by parsing the /proc//net/unix file and getting inode number, comparing inode with file descriptor information in checkpoint processing ( function `VM_Crac::doit()` ) The opened socket is closed after writing all information regarding command processing just before calling the CRIU engine. An output example: oot at be23d3635404:/home/source/git/crac# /home/source/git/crac/build/linux-x86_64-server-release/images/jdk/bin/jcmd target/example-jetty-1.0-SNAPSHOT.jar JDK.checkpoint 72107: JDK.checkpoint command start processing JVM: FD fd=0 type=character: details1="/dev/pts/0" OK: inherited from process env JVM: FD fd=1 type=character: details1="/dev/pts/0" OK: inherited from process env JVM: FD fd=2 type=character: details1="/dev/pts/0" OK: inherited from process env JVM: FD fd=3 type=regular: details1="/home/source/git/crac/build/linux-x86_64-server-release/images/jdk/lib/modules" OK: inherited from process env JVM: FD fd=4 type=regular: details1="/home/zulu-17381/example-jetty/target/example-jetty-1.0-SNAPSHOT.jar" OK: in classpath JVM: FD fd=5 type=regular: details1="/home/zulu-17381/example-jetty/target/dependency/jetty-server-9.4.30.v20200611.jar" OK: assured persistent JVM: FD fd=6 type=regular: details1="/home/zulu-17381/example-jetty/target/dependency/javax.servlet-api-3.1.0.jar" OK: assured persistent JVM: FD fd=7 type=regular: details1="/home/zulu-17381/example-jetty/target/dependency/jetty-http-9.4.30.v20200611.jar" OK: assured persistent JVM: FD fd=8 type=regular: details1="/home/zulu-17381/example-jetty/target/dependency/jetty-util-9.4.30.v20200611.jar" OK: assured persistent JVM: FD fd=9 type=regular: details1="/home/zulu-17381/example-jetty/target/dependency/jetty-io-9.4.30.v20200611.jar" OK: assured persistent JVM: 
FD fd=10 type=regular: details1="/home/zulu-17381/example-jetty/target/dependency/crac-999-20211028.191702-5.jar" OK: assured persistent JVM: FD fd=23 type=socket: details1="socket:[735739]" issock, details2="socket:[735739]" OK: jcmd socket CR: CR Checkpoint ... ------------- PR: https://git.openjdk.org/crac/pull/10 From inakonechnyy at openjdk.org Fri Jun 17 14:37:52 2022 From: inakonechnyy at openjdk.org (Ilarion Nakonechnyy) Date: Fri, 17 Jun 2022 14:37:52 GMT Subject: [crac] RFR: Force closing a "redirected to the file" standard io descriptor Message-ID: CRIU restore fail after Checkpoint stdout to file. Redirecting a stdout to the pipeline doesn't break the restore. A proposed approach forces the close of stdout, stdin, and stderr file descriptors if they are redirected to the files, on checkpoint resources verification. ------------- Commit messages: - jchek corrections - corrections - corrections - Close stdin, stderr, stdout fd before checkpoint Changes: https://git.openjdk.org/crac/pull/24/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=24&range=00 Stats: 11 lines in 1 file changed: 11 ins; 0 del; 0 mod Patch: https://git.openjdk.org/crac/pull/24.diff Fetch: git fetch https://git.openjdk.org/crac pull/24/head:pull/24 PR: https://git.openjdk.org/crac/pull/24 From volker.simonis at gmail.com Fri Jun 17 18:08:58 2022 From: volker.simonis at gmail.com (Volker Simonis) Date: Fri, 17 Jun 2022 20:08:58 +0200 Subject: CRaC + maven-daemon: An experience report In-Reply-To: <2bd4db04-7218-0d49-cd64-971cf63f98ce@azul.com> References: <3d4bf036-1700-6547-3928-5db1a5e615fc@azul.com> <2bd4db04-7218-0d49-cd64-971cf63f98ce@azul.com> Message-ID: On Fri, Jun 10, 2022 at 3:29 PM Anton Kozlov wrote: > > On 6/8/22 16:19, Dan Heidinga wrote: > >>> Snapsafety > >> > >> I still struggle to understand what is it. Is it a property of the code > >> (e.g. if you use these classes, you are safe w.r.t. checkpoint and > >> restore and don't need to coordinate explicitly)? 
Or is it a property of > >> the state (the state can be safely checkpoint and restore -- what is > >> safe in this case)? > >> > > > > This is exactly the point Ashu's making in the document. As much as I > > think we would all like snapsafety to be a static property of the > > source code so we could analyze it easily with some static analysis, > > it's unfortunately more complicated than a static property. > > I agree. And it also depends on the snapshotting mechanism. E.g. with CRaC+CRIU there's a lot of code to properly close/reopen files and handle file descriptors correctly. If we are using CRaC with Firecracker instead which takes a complete OS snapshot, that all becomes unnecessary, because all the files will still be there with the exact same state after the restore. > > Hence my question :) Let's say that to be a property of the object > state. Then a class has a property if all objects of the class have the > property in all possible states. I don't see any other way for > correctness and secureness to be defined for a class, other than > providing Resource implementation on a per-class basis, and taking care > of class functionality and internal invariants. That is, no mark or a > predicate on the code or the Resource implementation can imply real > safety -- that will always remain a non-formal property that should be > aligned with the surrounding context in the class. For example, adding > another field and its initializaion can change the class from safe to > unsafe. Such change is hard to correlate with necessary changes in > Resource implementation. > > > There's a temporal aspect to it - when the checkpoint is taken affects > > the safety of the operation. When the snapshot is taken determines > > what would need to be fixed up (and much of that is based on > > application specific invariants). > > > > The execution model on restore [0] also impacts the snapsafety. 
As > > Ashu says, using the checkpoint to create an initialized base image > > has a different concept of "safety" than migrating a computation from > > one host to another. Different pieces of state will need to be > > modified in each case and different invariants will hold (or be > > broken). > > Indeed. Is the formal property worth pursuing then? This was going to > be a language aid for app developers to annotate safe parts of their > programs, and for us to annotate parts of JDK. While we can attempt to > annotate JDK correctly and fully, we cannot control how the language > feature will be used by users. And for them, a better annotation > mechanism or a programming model (like reactive programming) may exist. > How about letting users decide how and when to annotate their programs, > and concentrate on JDK needs and how JDK is used by applications first, > as we understand these better? For example, what's missing in the JDK > so the app won't need changing at all? And what parts of the app are > absolutely necessary to change. Is it possible for JDK to provide a set > of utilities to ease those changes? > > > The .NET community took an interesting approach in their "Native AOT" > > story for "trimming" applications [1] that may be reusable for > > snapsaftey - they added warnings for certain operations that are > > incompatible with trimming (dead code elimination) and then require > > library authors to annotate methods that do generate the warnings. > > The annotations bubble up the call chain to the public apis and then > > library consumers can determine whether to call such apis or not. > > > > Building on this idea, if methods and classes are correctly annotated > > (with what annotations? tbd) it may be possible to do some analysis > > when the checkpoint is created to determine whether the current state > > is "snapsafe" or not. 
> > This is not so much a static property that can
> > be statically analyzed, but one that must be checked when taking the
> > checkpoint as it may require walking stacks (currently executing
> > methods), examining loaded classes, heap walks(?), etc.
> >
> Now it's possible to create a runtime check for the object state safety,
> that is to create Resource's beforeCheckpoint. An unsafe object may
> always throw an Exception. Won't this be even more flexible? This
> relates a lot to Snapsafety of core library classes [1], I'll reply
> there.

In the context of my above comment on the different checkpoint mechanisms (i.e. CRIU vs. Firecracker) I was already thinking about annotating the CRaC callbacks such that they will only be called if necessary, based on the snapshotting mechanism. The question is whether it will be possible at all to come up with a fixed, predefined set of such abstract "snapsafety annotations" or whether there are just too many different use cases and contexts.

>
> Thanks for bringing more context,
> -- Anton
>
> [1] https://mail.openjdk.java.net/pipermail/crac-dev/2022-May/000222.html
>

From simonis at openjdk.org Wed Jun 29 08:31:38 2022
From: simonis at openjdk.org (Volker Simonis)
Date: Wed, 29 Jun 2022 08:31:38 GMT
Subject: [crac] RFR: Don't crash the VM if checkpointing fails
Message-ID:

Currently, if checkpointing fails (because the `criu` executable can't be found or because `criu dump` returns an error) the JVM will crash with:

$ java -XX:+CRPrintResourcesOnCheckpoint -XX:CCheckpointTo=/tmp/crac HelloWait
HelloWorld
JVM: FD fd=0 type=character: details1="/dev/pts/9" OK: inherited from process env
JVM: FD fd=1 type=character: details1="/dev/pts/9" OK: inherited from process env
JVM: FD fd=2 type=character: details1="/dev/pts/9" OK: inherited from process env
JVM: FD fd=3 type=regular: details1="/output/crac-dbg/images/jdk/lib/modules" OK: inherited from process env
CR: Checkpoint ...
------------- Commit messages: - Don't crash the VM if checkpointing fails Changes: https://git.openjdk.org/crac/pull/25/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=25&range=00 Stats: 26 lines in 3 files changed: 0 ins; 25 del; 1 mod Patch: https://git.openjdk.org/crac/pull/25.diff Fetch: git fetch https://git.openjdk.org/crac pull/25/head:pull/25 PR: https://git.openjdk.org/crac/pull/25 From akozlov at openjdk.org Wed Jun 29 09:00:12 2022 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 29 Jun 2022 09:00:12 GMT Subject: [crac] RFR: Don't crash the VM if checkpointing fails In-Reply-To: References: Message-ID: <3ph9_l-Q37Y62RYhf0bPmW3MT1azA5suRVsZIo7EsHE=.8468df1d-81ba-4477-8e7f-8c61dbe17e3c@github.com> On Wed, 29 Jun 2022 08:23:21 GMT, Volker Simonis wrote: > Currently, if checkpointing is failing (because the `criu` executable can't be find or if `criu dump` returns an error) the JVM will crash with: > > $ java -XX:+CRPrintResourcesOnCheckpoint -XX:CCheckpointTo=/tmp/crac HelloWait > HelloWorld > JVM: FD fd=0 type=character: details1="/dev/pts/9" OK: inherited from process env > JVM: FD fd=1 type=character: details1="/dev/pts/9" OK: inherited from process env > JVM: FD fd=2 type=character: details1="/dev/pts/9" OK: inherited from process env > JVM: FD fd=3 type=regular: details1="/output/crac-dbg/images/jdk/lib/modules" OK: inherited from process env > CR: Checkpoint ... The change looks good. Thanks for fixing this. I was trying to optimize failure case, but at some point it apparently went out of sync with the usual path. ------------- Marked as reviewed by akozlov (Lead). 
PR: https://git.openjdk.org/crac/pull/25 From simonis at openjdk.org Wed Jun 29 12:12:03 2022 From: simonis at openjdk.org (Volker Simonis) Date: Wed, 29 Jun 2022 12:12:03 GMT Subject: [crac] RFR: Don't crash the VM if checkpointing fails In-Reply-To: References: Message-ID: <5BimgxA9aUBRz3o31eFq_kmKukrY_vERn4-YctH5G3E=.d57e045a-b482-4051-87ec-5f8d5fa50b93@github.com> On Wed, 29 Jun 2022 08:23:21 GMT, Volker Simonis wrote: > Currently, if checkpointing is failing (because the `criu` executable can't be find or if `criu dump` returns an error) the JVM will crash with: > > $ java -XX:+CRPrintResourcesOnCheckpoint -XX:CCheckpointTo=/tmp/crac HelloWait > HelloWorld > JVM: FD fd=0 type=character: details1="/dev/pts/9" OK: inherited from process env > JVM: FD fd=1 type=character: details1="/dev/pts/9" OK: inherited from process env > JVM: FD fd=2 type=character: details1="/dev/pts/9" OK: inherited from process env > JVM: FD fd=3 type=regular: details1="/output/crac-dbg/images/jdk/lib/modules" OK: inherited from process env > CR: Checkpoint ... 
Thanks for the quick review :) ------------- PR: https://git.openjdk.org/crac/pull/25 From simonis at openjdk.org Wed Jun 29 12:15:13 2022 From: simonis at openjdk.org (Volker Simonis) Date: Wed, 29 Jun 2022 12:15:13 GMT Subject: [crac] Integrated: Don't crash the VM if checkpointing fails In-Reply-To: References: Message-ID: On Wed, 29 Jun 2022 08:23:21 GMT, Volker Simonis wrote: > Currently, if checkpointing is failing (because the `criu` executable can't be find or if `criu dump` returns an error) the JVM will crash with: > > $ java -XX:+CRPrintResourcesOnCheckpoint -XX:CCheckpointTo=/tmp/crac HelloWait > HelloWorld > JVM: FD fd=0 type=character: details1="/dev/pts/9" OK: inherited from process env > JVM: FD fd=1 type=character: details1="/dev/pts/9" OK: inherited from process env > JVM: FD fd=2 type=character: details1="/dev/pts/9" OK: inherited from process env > JVM: FD fd=3 type=regular: details1="/output/crac-dbg/images/jdk/lib/modules" OK: inherited from process env > CR: Checkpoint ... This pull request has now been integrated. Changeset: 28621cd2 Author: Volker Simonis URL: https://git.openjdk.org/crac/commit/28621cd2ef51dcb37aae5f0c414fe1d5e2475283 Stats: 26 lines in 3 files changed: 0 ins; 25 del; 1 mod Don't crash the VM if checkpointing fails Reviewed-by: akozlov ------------- PR: https://git.openjdk.org/crac/pull/25 From simonis at openjdk.org Wed Jun 29 13:06:39 2022 From: simonis at openjdk.org (Volker Simonis) Date: Wed, 29 Jun 2022 13:06:39 GMT Subject: [crac] RFR: Support extra criu flags from the environment Message-ID: When testing with different `criu` versions or `criu` configurations it might be useful to be able to modify the default `criu` command line parameters. This PR introduces the new environment variable `CRAC_CRIU_OPTS` which will be interpreted as a space separated list of `criu` flags to be appended to the hard-coded list of command line parameters. E.g. 
$ CRAC_CRIU_OPTS="-v4 -o resume.log -W /tmp/crac" java -XX:CRaCRestoreFrom=/tmp/crac This will set the logging level to 4 (thus overriding the hard-coded logging level of 1 for resuming), redirect the log to the file `resume.log` and change `criu`'s working directory (which will contain the log file) to `/tmp/crac`. ------------- Commit messages: - Support extra criu flags from the environment Changes: https://git.openjdk.org/crac/pull/26/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=26&range=00 Stats: 39 lines in 1 file changed: 28 ins; 7 del; 4 mod Patch: https://git.openjdk.org/crac/pull/26.diff Fetch: git fetch https://git.openjdk.org/crac pull/26/head:pull/26 PR: https://git.openjdk.org/crac/pull/26 From heidinga at redhat.com Wed Jun 29 13:22:38 2022 From: heidinga at redhat.com (Dan Heidinga) Date: Wed, 29 Jun 2022 09:22:38 -0400 Subject: CRaC + maven-daemon: An experience report In-Reply-To: References: <3d4bf036-1700-6547-3928-5db1a5e615fc@azul.com> <2bd4db04-7218-0d49-cd64-971cf63f98ce@azul.com> Message-ID: On Fri, Jun 17, 2022 at 2:09 PM Volker Simonis wrote: > > On Fri, Jun 10, 2022 at 3:29 PM Anton Kozlov wrote: > > > > On 6/8/22 16:19, Dan Heidinga wrote: > > >>> Snapsafety > > >> > > >> I still struggle to understand what is it. Is it a property of the code > > >> (e.g. if you use these classes, you are safe w.r.t. checkpoint and > > >> restore and don't need to coordinate explicitly)? Or is it a property of > > >> the state (the state can be safely checkpoint and restore -- what is > > >> safe in this case)? > > >> > > > > > > This is exactly the point Ashu's making in the document. As much as I > > > think we would all like snapsafety to be a static property of the > > > source code so we could analyze it easily with some static analysis, > > > it's unfortunately more complicated than a static property. > > > > > I agree. And it also depends on the snapshotting mechanism. E.g. 
with > CRaC+CRIU there's a lot of code to properly close/reopen files and > handle file descriptors correctly. If we are using CRaC with > Firecracker instead which takes a complete OS snapshot, that all > becomes unnecessary, because all the files will still be there with > the exact same state after the restore. With the OpenJ9 CRIU approach, we've found that by targeting containers, we don't need to force all files to be closed/reopened as CRIU handles it well. Only files that are mounted into the container need special handling and that's basically user-configuration. Not forcing files to be opened / closed side steps a whole host of problems, but makes it much harder to run on bare metal and makes the system more dependent on the checkpointing mechanism. So tradeoffs. I feel a bit like a broken record saying this (and maybe that's just due to repeating it internally for so long =), but I think the programming model is critical here. If we find a better way to express relationships between dependencies and "phases", we will end up with programs that are both more amenable to checkpoint/restore, and also to being pre-initialized (a la Leyden). This may make it harder to retrofit existing programs but will provide a more stable & most importantly, predictable base to build new programs on. > > > > > Hence my question :) Let's say that to be a property of the object > > state. Then a class has a property if all objects of the class have the > > property in all possible states. I don't see any other way for > > correctness and secureness to be defined for a class, other than > > providing Resource implementation on a per-class basis, and taking care > > of class functionality and internal invariants. That is, no mark or a > > predicate on the code or the Resource implementation can imply real > > safety -- that will always remain a non-formal property that should be > > aligned with the surrounding context in the class. 
For example, adding > > another field and its initializaion can change the class from safe to > > unsafe. Such change is hard to correlate with necessary changes in > > Resource implementation. > > > > > There's a temporal aspect to it - when the checkpoint is taken affects > > > the safety of the operation. When the snapshot is taken determines > > > what would need to be fixed up (and much of that is based on > > > application specific invariants). > > > > > > The execution model on restore [0] also impacts the snapsafety. As > > > Ashu says, using the checkpoint to create an initialized base image > > > has a different concept of "safety" than migrating a computation from > > > one host to another. Different pieces of state will need to be > > > modified in each case and different invariants will hold (or be > > > broken). > > > > Indeed. Is the formal property worth pursuing then? This was going to > > be a language aid for app developers to annotate safe parts of their > > programs, and for us to annotate parts of JDK. While we can attempt to > > annotate JDK correctly and fully, we cannot control how the language > > feature will be used by users. And for them, a better annotation > > mechanism or a programming model (like reactive programming) may exist. > > How about letting users decide how and when to annotate their programs, > > and concentrate on JDK needs and how JDK is used by applications first, > > as we understand these better? For example, what's missing in the JDK > > so the app won't need changing at all? And what parts of the app are > > absolutely necessary to change. Is it possible for JDK to provide a set > > of utilities to ease those changes? 
> > > > > The .NET community took an interesting approach in their "Native AOT" > > > story for "trimming" applications [1] that may be reusable for > > > snapsaftey - they added warnings for certain operations that are > > > incompatible with trimming (dead code elimination) and then require > > > library authors to annotate methods that do generate the warnings. > > > The annotations bubble up the call chain to the public apis and then > > > library consumers can determine whether to call such apis or not. > > > > > > Building on this idea, if methods and classes are correctly annotated > > > (with what annotations? tbd) it may be possible to do some analysis > > > when the checkpoint is created to determine whether the current state > > > is "snapsafe" or not. This is not so much a static property that can > > > be statically analyzed, but one that must be checked when taking the > > > checkpoint as it may require walking stacks (currently executing > > > methods), examining loaded classes, heap walks(?), etc. > > > > Now it's possible to create a runtime check for the object state safety, > > that is to create Resource's beforeCheckpoint. An unsafe object may > > always throw an Exception. Won't this be even more flexible? This > > relates a lot to Snapsafety of core library classes [1], I'll reply > > there. > > In the context of my above comment on the different checkpoint > mechanisms (i.e. CRIU vs. Firecracker) I was already thinking about > annotating the CRaC callbacks such that they will only be called if > necessary, based on the snapshotting mechanism. The question is if it > will possible at all to come up with a fixed, predefined set of such > abstract "snapsafety annotations" or if there are just too many > different use cases and contexts? The more points we can identify as needing fixups, the clearer a picture we'll have of the landscape. The CRaC callbacks provide one set of use cases and contexts. 
GraalVM's SubstrateVM Substitution mechanism provides another view of the places that need to be fixed up. OpenJ9's J9InternalCheckpointHookAPI::register{PreCheckpoint/PostRestore}Hook APIs are another data point. We're still early days in identifying the places that need to be fixed up and it will require trying to run (more) applications with CRaC to find the long tail of required fixups. If annotating the existing CRaC callbacks helps to skip some fixups and test a broader set of applications - I'm all for it! What kind of annotations were you thinking? Maybe start with Firecracker-specific ones that we can generalize from? @FirecrackerSkip? --Dan > > > > > Thanks for bringing more context, > > -- Anton > > > > [1] https://mail.openjdk.java.net/pipermail/crac-dev/2022-May/000222.html > > > From volker.simonis at gmail.com Wed Jun 29 18:44:24 2022 From: volker.simonis at gmail.com (Volker Simonis) Date: Wed, 29 Jun 2022 20:44:24 +0200 Subject: CRaC + maven-daemon: An experience report In-Reply-To: References: <3d4bf036-1700-6547-3928-5db1a5e615fc@azul.com> <2bd4db04-7218-0d49-cd64-971cf63f98ce@azul.com> Message-ID: On Wed, Jun 29, 2022 at 3:22 PM Dan Heidinga wrote: > > On Fri, Jun 17, 2022 at 2:09 PM Volker Simonis wrote: > > > > On Fri, Jun 10, 2022 at 3:29 PM Anton Kozlov wrote: > > > > > > On 6/8/22 16:19, Dan Heidinga wrote: > > > >>> Snapsafety > > > >> > > > >> I still struggle to understand what it is. Is it a property of the code > > > >> (e.g. if you use these classes, you are safe w.r.t. checkpoint and > > > >> restore and don't need to coordinate explicitly)? Or is it a property of > > > >> the state (the state can be safely checkpointed and restored -- what is > > > >> safe in this case)? > > > >> > > > > > > > > This is exactly the point Ashu's making in the document.
As much as I > > > > think we would all like snapsafety to be a static property of the > > > > source code so we could analyze it easily with some static analysis, > > > > it's unfortunately more complicated than a static property. > > > > > > > > I agree. And it also depends on the snapshotting mechanism. E.g. with > > CRaC+CRIU there's a lot of code to properly close/reopen files and > > handle file descriptors correctly. If we are using CRaC with > > Firecracker instead which takes a complete OS snapshot, that all > > becomes unnecessary, because all the files will still be there with > > the exact same state after the restore. > > With the OpenJ9 CRIU approach, we've found that by targeting > containers, we don't need to force all files to be closed/reopened as > CRIU handles it well. Only files that are mounted into the container > need special handling and that's basically user-configuration. Not > forcing files to be opened / closed side steps a whole host of > problems, but makes it much harder to run on bare metal and makes the > system more dependent on the checkpointing mechanism. So tradeoffs. > That's another interesting aspect. > I feel a bit like a broken record saying this (and maybe that's just > due to repeating it internally for so long =), but I think the > programming model is critical here. If we find a better way to > express relationships between dependencies and "phases", we will end > up with programs that are both more amenable to checkpoint/restore, > and also to being pre-initialized (a la Leyden). This may make it > harder to retrofit existing programs but will provide a more stable & > most importantly, predictable base to build new programs on. > > > > > > > > > Hence my question :) Let's say that to be a property of the object > > > state. Then a class has a property if all objects of the class have the > > > property in all possible states. 
I don't see any other way for > > > correctness and security to be defined for a class, other than > > > providing Resource implementation on a per-class basis, and taking care > > > of class functionality and internal invariants. That is, no mark or a > > > predicate on the code or the Resource implementation can imply real > > > safety -- that will always remain a non-formal property that should be > > > aligned with the surrounding context in the class. For example, adding > > > another field and its initialization can change the class from safe to > > > unsafe. Such change is hard to correlate with necessary changes in > > > Resource implementation. > > > > > > > There's a temporal aspect to it - when the checkpoint is taken affects > > > > the safety of the operation. When the snapshot is taken determines > > > > what would need to be fixed up (and much of that is based on > > > > application-specific invariants). > > > > > > > > The execution model on restore [0] also impacts the snapsafety. As > > > > Ashu says, using the checkpoint to create an initialized base image > > > > has a different concept of "safety" than migrating a computation from > > > > one host to another. Different pieces of state will need to be > > > > modified in each case and different invariants will hold (or be > > > > broken). > > > > > > Indeed. Is the formal property worth pursuing then? This was going to > > > be a language aid for app developers to annotate safe parts of their > > > programs, and for us to annotate parts of JDK. While we can attempt to > > > annotate JDK correctly and fully, we cannot control how the language > > > feature will be used by users. And for them, a better annotation > > > mechanism or a programming model (like reactive programming) may exist. > > > How about letting users decide how and when to annotate their programs, > > > and concentrate on JDK needs and how JDK is used by applications first, > > > as we understand these better?
For example, what's missing in the JDK > > > so the app won't need changing at all? And what parts of the app are > > > absolutely necessary to change? Is it possible for JDK to provide a set > > > of utilities to ease those changes? > > > > > > > The .NET community took an interesting approach in their "Native AOT" > > > > story for "trimming" applications [1] that may be reusable for > > > > snapsafety - they added warnings for certain operations that are > > > > incompatible with trimming (dead code elimination) and then require > > > > library authors to annotate methods that do generate the warnings. > > > > The annotations bubble up the call chain to the public APIs and then > > > > library consumers can determine whether to call such APIs or not. > > > > > > > > Building on this idea, if methods and classes are correctly annotated > > > > (with what annotations? tbd) it may be possible to do some analysis > > > > when the checkpoint is created to determine whether the current state > > > > is "snapsafe" or not. This is not so much a static property that can > > > > be statically analyzed, but one that must be checked when taking the > > > > checkpoint as it may require walking stacks (currently executing > > > > methods), examining loaded classes, heap walks(?), etc. > > > > > > Now it's possible to create a runtime check for the object state safety, > > > that is to create Resource's beforeCheckpoint. An unsafe object may > > > always throw an Exception. Won't this be even more flexible? This > > > relates a lot to Snapsafety of core library classes [1], I'll reply > > > there. > > > > In the context of my above comment on the different checkpoint > > mechanisms (i.e. CRIU vs. Firecracker) I was already thinking about > > annotating the CRaC callbacks such that they will only be called if > > necessary, based on the snapshotting mechanism.
The question is if it > > will be possible at all to come up with a fixed, predefined set of such > > abstract "snapsafety annotations" or if there are just too many > > different use cases and contexts? > > The more points we can identify as needing fixups, the clearer a > picture we'll have of the landscape. The CRaC callbacks provide one > set of use cases and contexts. GraalVM's SubstrateVM Substitution > mechanism provides another view of the places that need to be fixed > up. OpenJ9's J9InternalCheckpointHookAPI::register{PreCheckpoint/PostRestore}Hook > APIs are another data point. > > We're still early days in identifying the places that need to be fixed > up and it will require trying to run (more) applications with CRaC to > find the long tail of required fixups. If annotating the existing > CRaC callbacks helps to skip some fixups and test a broader set of > applications - I'm all for it! > > What kind of annotations were you thinking? Maybe start with > Firecracker-specific ones that we can generalize from? > @FirecrackerSkip? > I was thinking more of "semantic" annotations like "File", "Network", etc., which could be interpreted according to the snapshotting mechanism. E.g. when snapshotting with Firecracker we could skip all the file-related fixups, whereas with CRIU we would need to take care of them. But your example above with CRIU in containers is interesting because it shows that we need a more fine-grained distinction here between mounted and local files.
> --Dan > > > > > > > > > Thanks for bringing more context, > > > -- Anton > > > > > > [1] https://mail.openjdk.java.net/pipermail/crac-dev/2022-May/000222.html > > > > > > From akozlov at openjdk.org Thu Jun 30 06:48:01 2022 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 30 Jun 2022 06:48:01 GMT Subject: [crac] RFR: Support extra criu flags from the environment In-Reply-To: References: Message-ID: On Wed, 29 Jun 2022 13:00:59 GMT, Volker Simonis wrote: > When testing with different `criu` versions or `criu` configurations it might be useful to be able to modify the default `criu` command line parameters. This PR introduces the new environment variable `CRAC_CRIU_OPTS` which will be interpreted as a space separated list of `criu` flags to be appended to the hard-coded list of command line parameters. E.g. > > > $ CRAC_CRIU_OPTS="-v4 -o resume.log -W /tmp/crac" java -XX:CRaCRestoreFrom=/tmp/crac > > > This will set the logging level to 4 (thus overriding the hard-coded logging level of 1 for resuming), redirect the log to the file `resume.log` and change `criu`'s working directory (which will contain the log file) to `/tmp/crac`. Changes requested by akozlov (Lead). src/java.base/unix/native/criuengine/criuengine.c line 116: > 114: } > 115: *arg++ = NULL; > 116: assert(ARRAY_SIZE(args) >= (size_t)(arg - args)); I'm not sure if we are compiling with asserts. Could you turn this and another assert below into a runtime check and explicit failure if arguments do not fit? ------------- PR: https://git.openjdk.org/crac/pull/26 From simonis at openjdk.org Thu Jun 30 17:12:02 2022 From: simonis at openjdk.org (Volker Simonis) Date: Thu, 30 Jun 2022 17:12:02 GMT Subject: [crac] RFR: Support extra criu flags from the environment [v2] In-Reply-To: References: Message-ID: > When testing with different `criu` versions or `criu` configurations it might be useful to be able to modify the default `criu` command line parameters. 
This PR introduces the new environment variable `CRAC_CRIU_OPTS` which will be interpreted as a space separated list of `criu` flags to be appended to the hard-coded list of command line parameters. E.g. > > > $ CRAC_CRIU_OPTS="-v4 -o resume.log -W /tmp/crac" java -XX:CRaCRestoreFrom=/tmp/crac > > > This will set the logging level to 4 (thus overriding the hard-coded logging level of 1 for resuming), redirect the log to the file `resume.log` and change `criu`'s working directory (which will contain the log file) to `/tmp/crac`. Volker Simonis has updated the pull request incrementally with one additional commit since the last revision: Drop surplus arguments from CRAC_CRIU_OPTS and issue a warning ------------- Changes: - all: https://git.openjdk.org/crac/pull/26/files - new: https://git.openjdk.org/crac/pull/26/files/35412641..df71ae8d Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=26&range=01 - incr: https://webrevs.openjdk.org/?repo=crac&pr=26&range=00-01 Stats: 18 lines in 1 file changed: 10 ins; 6 del; 2 mod Patch: https://git.openjdk.org/crac/pull/26.diff Fetch: git fetch https://git.openjdk.org/crac pull/26/head:pull/26 PR: https://git.openjdk.org/crac/pull/26 From simonis at openjdk.org Thu Jun 30 17:12:04 2022 From: simonis at openjdk.org (Volker Simonis) Date: Thu, 30 Jun 2022 17:12:04 GMT Subject: [crac] RFR: Support extra criu flags from the environment [v2] In-Reply-To: References: Message-ID: <0TXno9yrXdzw6SZmmkR_7myXQyloTxgEoybyoz7S6cE=.e91b9f92-6e7e-4475-a7bd-94bf348ce1cc@github.com> On Thu, 30 Jun 2022 06:43:52 GMT, Anton Kozlov wrote: >> Volker Simonis has updated the pull request incrementally with one additional commit since the last revision: >> >> Drop surplus arguments from CRAC_CRIU_OPTS and issue a warning > > src/java.base/unix/native/criuengine/criuengine.c line 116: > >> 114: } >> 115: *arg++ = NULL; >> 116: assert(ARRAY_SIZE(args) >= (size_t)(arg - args)); > > I'm not sure if we are compiling with asserts. 
Could you turn this and another assert below into a runtime check and explicit failure if arguments do not fit? You're right. We compile product builds with `-DNDEBUG`. I've reworked the code to drop surplus arguments from `CRAC_CRIU_OPTS` and issue a warning if this was necessary. ------------- PR: https://git.openjdk.org/crac/pull/26
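The rework described in the reply above (a runtime check instead of an `assert` that disappears under `-DNDEBUG`, with surplus `CRAC_CRIU_OPTS` arguments dropped and a warning issued) could look roughly like the sketch below. This is not the code from the pull request: the helper name `append_opts`, its signature, the warning text and the choice to exit when the array is already full are all assumptions made for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/*
 * Hypothetical helper (not taken from criuengine.c): split `opts` on spaces
 * and append the tokens to args[*used...], always keeping one slot free for
 * the terminating NULL. `opts` is modified in place, so the caller must pass
 * a writable copy. Returns the number of tokens that did not fit.
 */
static size_t append_opts(char *opts, const char **args, size_t capacity, size_t *used)
{
    /* Runtime check that stays active in product builds, unlike assert(). */
    if (*used >= capacity) {
        fprintf(stderr, "append_opts: argument array is already full\n");
        exit(1);
    }

    size_t dropped = 0;
    char *p = opts;
    for (;;) {
        p += strspn(p, " ");            /* skip separators */
        if (*p == '\0') {
            break;
        }
        char *tok = p;
        p += strcspn(p, " ");           /* advance to the end of the token */
        if (*p != '\0') {
            *p++ = '\0';                /* terminate the token in place */
        }
        if (*used + 1 < capacity) {     /* leave room for the NULL terminator */
            args[(*used)++] = tok;
        } else {
            dropped++;
        }
    }
    args[*used] = NULL;

    if (dropped > 0) {
        fprintf(stderr, "Warning: too many arguments in CRAC_CRIU_OPTS, dropped %zu\n",
                dropped);
    }
    return dropped;
}
```

A caller would first fill `args` with the hard-coded `criu` parameters, then pass a writable copy of the environment value (the string returned by `getenv` should not be modified) together with `ARRAY_SIZE(args)` and the current argument count. The tokens point into that copy, so it has to stay alive until the argument vector has been used.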