From duke at openjdk.org Mon Oct 3 08:05:23 2022
From: duke at openjdk.org (Roman Marchenko)
Date: Mon, 3 Oct 2022 08:05:23 GMT
Subject: [crac] RFR: Environment vars propagation into restored process [v2]
In-Reply-To:
References:
Message-ID: <_jJh7kod6xG23iUiMv4UYppb8hIFh2HnQQt7SRasPIw=.714c693d-b64d-4cb7-8d79-568a282162bd@github.com>

> This PR provides functionality to propagate actual environment variables to a restored process, as well as the test for this functionality.
>
> Env propagation is done in a few steps:
> - Store the actual environment before restoring
> - After restoring, replace the restored `environ` with a new one.
> - On `afterRestore` event, propagate the new environment into a restored process via `ProcessEnvironment`.

Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision:

  Update src/hotspot/os/linux/os_linux.cpp

  Fixing review comments

  Co-authored-by: Anton Kozlov

-------------

Changes:
  - all: https://git.openjdk.org/crac/pull/30/files
  - new: https://git.openjdk.org/crac/pull/30/files/0270dc4b..8eb394e7

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=crac&pr=30&range=01
 - incr: https://webrevs.openjdk.org/?repo=crac&pr=30&range=00-01

  Stats: 2 lines in 1 file changed: 0 ins; 1 del; 1 mod
  Patch: https://git.openjdk.org/crac/pull/30.diff
  Fetch: git fetch https://git.openjdk.org/crac pull/30/head:pull/30

PR: https://git.openjdk.org/crac/pull/30

From duke at openjdk.org Mon Oct 3 08:21:56 2022
From: duke at openjdk.org (Roman Marchenko)
Date: Mon, 3 Oct 2022 08:21:56 GMT
Subject: [crac] RFR: Environment vars propagation into restored process [v3]
In-Reply-To:
References:
Message-ID:

> This PR provides functionality to propagate actual environment variables to a restored process, as well as the test for this functionality.
>
> Env propagation is done in a few steps:
> - Store the actual environment before restoring
> - After restoring, replace the restored `environ` with a new one.
> - On `afterRestore` event, propagate the new environment into a restored process via `ProcessEnvironment`. Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: Fixing review comments ------------- Changes: - all: https://git.openjdk.org/crac/pull/30/files - new: https://git.openjdk.org/crac/pull/30/files/8eb394e7..46792f4c Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=30&range=02 - incr: https://webrevs.openjdk.org/?repo=crac&pr=30&range=01-02 Stats: 11 lines in 1 file changed: 0 ins; 2 del; 9 mod Patch: https://git.openjdk.org/crac/pull/30.diff Fetch: git fetch https://git.openjdk.org/crac pull/30/head:pull/30 PR: https://git.openjdk.org/crac/pull/30 From duke at openjdk.org Mon Oct 3 14:26:39 2022 From: duke at openjdk.org (Roman Marchenko) Date: Mon, 3 Oct 2022 14:26:39 GMT Subject: [crac] RFR: Environment vars propagation into restored process In-Reply-To: <4bABOyN_ecZeOziPY8Wsmfbco0VPwgYWNiD4IZFSntA=.3ca718e6-f1db-469e-a466-b55fe9ebde17@github.com> References: <4bABOyN_ecZeOziPY8Wsmfbco0VPwgYWNiD4IZFSntA=.3ca718e6-f1db-469e-a466-b55fe9ebde17@github.com> Message-ID: On Fri, 30 Sep 2022 14:35:29 GMT, Dan Heidinga wrote: >> This PR provides functionality to propagate actual environment variables to a restored process, as well as the test for this functionality. >> >> Env propagation is done in few steps: >> - Store the actual environment before restoring >> - After restoring, replace the restored `environ` with a new one. >> - On `afterRestore` event, propagate the new environment into a restored process via `ProcessEnvironment`. > > One concern with this approach - it means that environment variables will change values after a restore. > > It seems odd to say this is a concern when it's the intended behaviour of this PR but it is a concern. Users typically cache environment variables in static fields or use them to make a one time decision. 
They don't expect them (at least at the Java layer) to change value throughout a run of the same process. > > This change means two reads of the same env var can give different results at different times which may put unsuspecting applications into inconsistent states if two locations read the env var before vs after a restore. That's going to be a hard to debug issue. > > The VM may also read env vars and bind tightly to the value. Native code after a restore will still have the original env while java code the modified env. Do we foresee any issues there? @DanHeidinga Hi, You're right in your concerns. Indeed the suggested enhancement changes the usual workflow, so users may be confused. That is why we expect users to explicitly adapt their applications in accordance with the behaviour and make sure it works, otherwise there is no guarantee the application run with CRaC is successful. ------------- PR: https://git.openjdk.org/crac/pull/30 From heidinga at openjdk.org Mon Oct 3 15:24:38 2022 From: heidinga at openjdk.org (Dan Heidinga) Date: Mon, 3 Oct 2022 15:24:38 GMT Subject: [crac] RFR: Environment vars propagation into restored process In-Reply-To: References: <4bABOyN_ecZeOziPY8Wsmfbco0VPwgYWNiD4IZFSntA=.3ca718e6-f1db-469e-a466-b55fe9ebde17@github.com> Message-ID: On Mon, 3 Oct 2022 14:23:04 GMT, Roman Marchenko wrote: >> One concern with this approach - it means that environment variables will change values after a restore. >> >> It seems odd to say this is a concern when it's the intended behaviour of this PR but it is a concern. Users typically cache environment variables in static fields or use them to make a one time decision. They don't expect them (at least at the Java layer) to change value throughout a run of the same process. >> >> This change means two reads of the same env var can give different results at different times which may put unsuspecting applications into inconsistent states if two locations read the env var before vs after a restore. 
That's going to be a hard to debug issue. >> >> The VM may also read env vars and bind tightly to the value. Native code after a restore will still have the original env while java code the modified env. Do we foresee any issues there? > > @DanHeidinga > Hi, > You're right in your concerns. Indeed the suggested enhancement changes the usual workflow, so users may be confused. > That is why we expect users to explicitly adapt their applications in accordance with the behaviour and make sure it works, otherwise there is no guarantee the application run with CRaC is successful. @wkia You're right the users will need to adapt their applications to work with CRaC. 100% agree there. The challenge for them will be when they use 3rd party libraries or update their existing applications to work. It's really easy to miss updating something or not realize the full blast radius of changes requiring updates when an env var becomes "stale" after a restart. To be safe, I think we need to review the use of env vars in the JDK and ensure that both the native code and the class libraries take correct action on changed env vars. We should also consider doing something similar to the OpenJ9 approach where we restrict the set of env vars available prior to the checkpoint (minimize the accidental use of checkpoint env), and limit the env var changes to only add new env vars (no inconsistencies). This got them a long ways in their work with Liberty though they did find it necessary to eventually support overriding some env vars. With the approach in this PR, it will be hard for service engineers to know what the original env was and to debug issues related to changed env vars. Are there bread crumbs we can leave to make that service work go more smoothly? 
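The add-only approach with a limited set of overrides, plus the bread crumbs asked about above, might look roughly like the following. This is a minimal sketch, not the OpenJ9 or CRaC implementation: the class `RestoreEnvMerge`, its `merge` method, and the variable names used in `main` are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; none of these names come from CRaC or OpenJ9.
public class RestoreEnvMerge {

    /**
     * Builds the environment a restored process would see. New variables are
     * added; existing ones keep their checkpoint value unless they are listed
     * as overridable. Every decision is appended to breadcrumbs so that a
     * service engineer can later see what changed at restore.
     */
    public static Map<String, String> merge(Map<String, String> checkpointEnv,
                                            Map<String, String> restoreEnv,
                                            Set<String> overridable,
                                            List<String> breadcrumbs) {
        Map<String, String> merged = new LinkedHashMap<>(checkpointEnv);
        for (Map.Entry<String, String> e : restoreEnv.entrySet()) {
            String name = e.getKey();
            String newValue = e.getValue();
            String oldValue = checkpointEnv.get(name);
            if (oldValue == null) {
                merged.put(name, newValue);
                breadcrumbs.add("added " + name);
            } else if (oldValue.equals(newValue)) {
                continue; // unchanged, nothing to record
            } else if (overridable.contains(name)) {
                merged.put(name, newValue);
                breadcrumbs.add("overrode " + name + " (was " + oldValue + ")");
            } else {
                breadcrumbs.add("ignored change to " + name);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> checkpoint = new LinkedHashMap<>();
        checkpoint.put("JAVA_HOME", "/opt/jdk");
        checkpoint.put("LOG_LEVEL", "info");

        Map<String, String> restore = new LinkedHashMap<>();
        restore.put("JAVA_HOME", "/usr/lib/jvm"); // changed, not overridable
        restore.put("LOG_LEVEL", "debug");        // changed, overridable
        restore.put("DB_HOST", "db.internal");    // new at restore

        List<String> breadcrumbs = new ArrayList<>();
        Map<String, String> env =
                merge(checkpoint, restore, Set.of("LOG_LEVEL"), breadcrumbs);
        System.out.println(env);
        breadcrumbs.forEach(System.out::println);
    }
}
```

In this sketch a variable that exists at checkpoint but is missing at restore simply keeps its checkpoint value; whether removals should be honoured is a separate design decision that the thread does not settle.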
------------- PR: https://git.openjdk.org/crac/pull/30 From jkratochvil at azul.com Tue Oct 4 09:16:01 2022 From: jkratochvil at azul.com (Jan Kratochvil) Date: Tue, 4 Oct 2022 11:16:01 +0200 Subject: [crac] RFR: Environment vars propagation into restored process In-Reply-To: References: Message-ID: On Thu, 29 Sep 2022 17:45:01 +0200, Roman Marchenko wrote: > This PR provides functionality to propagate actual environment variables to > a restored process, as well as the test for this functionality. It would be nice to know what was the reason for introducing this feature. This mail thread discusses its disadvantages but not its advantages. Jan From heidinga at redhat.com Tue Oct 4 13:03:36 2022 From: heidinga at redhat.com (Dan Heidinga) Date: Tue, 4 Oct 2022 09:03:36 -0400 Subject: [crac] RFR: Environment vars propagation into restored process In-Reply-To: References: Message-ID: On Tue, Oct 4, 2022 at 5:16 AM Jan Kratochvil wrote: > On Thu, 29 Sep 2022 17:45:01 +0200, Roman Marchenko wrote: > > This PR provides functionality to propagate actual environment variables > to > > a restored process, as well as the test for this functionality. > > It would be nice to know what was the reason for introducing this feature. > When deploying a container to K8, it's pretty common to configure the application using env vars - things like connection ports, host names, etc get injected into the container. If a checkpoint is taken in the CI environment, we don't want to bind in those env vars as it's too early to know what the values will be for a particular deployment. If the restores happen when the image is deployed, we need a way to inject the final configuration data (the env vars) into the restored image. > > This mail thread discusses its disadvantages but not its advantages. > Less so the disadvantages and more consequences - the ability to set env vars on restore is important for containers but it has knock on effects. 
Existing applications haven't been architected to expect this so there are some corner cases that need to be worked through. I think we need this kind of ability for CRaC but am now looking at how to design it so that it limits the risk to applications, is easy to service, and has a clear model that developers and users can reason about.

--Dan

>
>
> Jan
>
>

From jkratochvil at azul.com Tue Oct 4 13:48:44 2022
From: jkratochvil at azul.com (Jan Kratochvil)
Date: Tue, 4 Oct 2022 15:48:44 +0200
Subject: [crac] RFR: Environment vars propagation into restored process
In-Reply-To:
References:
Message-ID:

On Tue, 04 Oct 2022 15:03:36 +0200, Dan Heidinga wrote:
> When deploying a container to K8, it's pretty common to configure the
> application using env vars - things like connection ports, host names, etc
> get injected into the container. If a checkpoint is taken in the CI
> environment, we don't want to bind in those env vars as it's too early to
> know what the values will be for a particular deployment. If the restores
> happen when the image is deployed, we need a way to inject the final
> configuration data (the env vars) into the restored image.

OK, that makes sense, thanks.

Jan

From duke at openjdk.org Wed Oct 5 11:54:50 2022
From: duke at openjdk.org (Roman Marchenko)
Date: Wed, 5 Oct 2022 11:54:50 GMT
Subject: [crac] RFR: Environment vars propagation into restored process
In-Reply-To:
References: <4bABOyN_ecZeOziPY8Wsmfbco0VPwgYWNiD4IZFSntA=.3ca718e6-f1db-469e-a466-b55fe9ebde17@github.com>
Message-ID:

On Mon, 3 Oct 2022 15:21:05 GMT, Dan Heidinga wrote:

>> @DanHeidinga
>> Hi,
>> You're right in your concerns. Indeed the suggested enhancement changes the usual workflow, so users may be confused.
>> That is why we expect users to explicitly adapt their applications in accordance with the behaviour and make sure it works, otherwise there is no guarantee the application run with CRaC is successful. > > @wkia You're right the users will need to adapt their applications to work with CRaC. 100% agree there. > > The challenge for them will be when they use 3rd party libraries or update their existing applications to work. It's really easy to miss updating something or not realize the full blast radius of changes requiring updates when an env var becomes "stale" after a restart. > > To be safe, I think we need to review the use of env vars in the JDK and ensure that both the native code and the class libraries take correct action on changed env vars. > > We should also consider doing something similar to the OpenJ9 approach where we restrict the set of env vars available prior to the checkpoint (minimize the accidental use of checkpoint env), and limit the env var changes to only add new env vars (no inconsistencies). This got them a long ways in their work with Liberty though they did find it necessary to eventually support overriding some env vars. > > With the approach in this PR, it will be hard for service engineers to know what the original env was and to debug issues related to changed env vars. Are there bread crumbs we can leave to make that service work go more smoothly? @DanHeidinga For a simple scenario, when nothing is changed in the environment, user applications don't need to be changed, the applications work out-of-the-box. Could we consider a scenario when a container has a different environment for restoring a process rather than the environment it was checkpoint'ed? The different environment means something has changed in the system, it doesn't matter why. 
In case the application is not prepared and the process doesn't expect that environment could be changed after restoration (and/or doesn't handle this correctly), the process may have an outdated view of the env. So the process may need to be reconfigured to continue working.

It's not about particular env var values, but system changes. In case we don't propagate all the changed env vars to the process, the process doesn't have a chance to reconfigure itself. So propagation of all the changed env vars seems necessary.

Of course, users need to make necessary changes to prepare their applications for checkpoint/restore events, otherwise the applications won't work properly.

Speaking about debugging, currently users are able to create simple apps to print out vars, see the example below:

    > java -XX:CRaCRestoreFrom=./restore_folder TestApp

    import java.util.Map;

    public class TestApp {
        public static void main(String[] args) throws Exception {
            // Print every environment variable visible to the restored process.
            for (Map.Entry<String, String> e : System.getenv().entrySet()) {
                System.out.println(e.getKey() + " = " + e.getValue());
            }
        }
    }

-------------

PR: https://git.openjdk.org/crac/pull/30

From heidinga at openjdk.org Wed Oct 5 13:32:01 2022
From: heidinga at openjdk.org (Dan Heidinga)
Date: Wed, 5 Oct 2022 13:32:01 GMT
Subject: [crac] RFR: Environment vars propagation into restored process
In-Reply-To:
References: <4bABOyN_ecZeOziPY8Wsmfbco0VPwgYWNiD4IZFSntA=.3ca718e6-f1db-469e-a466-b55fe9ebde17@github.com>
Message-ID:

On Mon, 3 Oct 2022 15:21:05 GMT, Dan Heidinga wrote:

>> @DanHeidinga
>> Hi,
>> You're right in your concerns. Indeed the suggested enhancement changes the usual workflow, so users may be confused.
>> That is why we expect users to explicitly adapt their applications in accordance with the behaviour and make sure it works, otherwise there is no guarantee the application run with CRaC is successful.
>
> @wkia You're right the users will need to adapt their applications to work with CRaC. 100% agree there.
> > The challenge for them will be when they use 3rd party libraries or update their existing applications to work. It's really easy to miss updating something or not realize the full blast radius of changes requiring updates when an env var becomes "stale" after a restart. > > To be safe, I think we need to review the use of env vars in the JDK and ensure that both the native code and the class libraries take correct action on changed env vars. > > We should also consider doing something similar to the OpenJ9 approach where we restrict the set of env vars available prior to the checkpoint (minimize the accidental use of checkpoint env), and limit the env var changes to only add new env vars (no inconsistencies). This got them a long ways in their work with Liberty though they did find it necessary to eventually support overriding some env vars. > > With the approach in this PR, it will be hard for service engineers to know what the original env was and to debug issues related to changed env vars. Are there bread crumbs we can leave to make that service work go more smoothly? > @DanHeidinga For a simple scenario, when nothing is changed in the environment, user applications don't need to be changed, the applications work out-of-the-box. > Agreed. We restore the process and the environment is the same so the application code doesn't need to be updated to make "new" decisions regarding env vars. > Could we consider a scenario when a container has a different environment for restoring a process rather than the environment it was checkpoint'ed? The different environment means something has changed in the system, it doesn't matter why. When using CRIU, we're restoring a full process so on restore, the env is the **same** as it was at the checkpoint time. The only way to make it different is to **inject** something. We're not talking just about env changes but about how to inject new configuration into the system so that system can respond to the change. 
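One way to picture "injecting new configuration so that the system can respond to the change" is a component that re-reads its settings from a restore callback. The sketch below is only an illustration under stated assumptions: `RestoreHook` is a hypothetical stand-in for a CRaC-style `afterRestore` callback (it is not the real API), the environment is passed in as a `Supplier` so the example can run without an actual checkpoint, and `SERVICE_HOST` / `SERVICE_PORT` are made-up variable names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

public class ReconfigurableService {

    /** Hypothetical stand-in for a restore callback; not the real CRaC API. */
    public interface RestoreHook {
        void afterRestore();
    }

    /** Holds connection settings and refreshes them when told a restore happened. */
    public static final class ConnectionConfig implements RestoreHook {
        private final Supplier<Map<String, String>> envSource;
        private volatile String host;
        private volatile int port;

        public ConnectionConfig(Supplier<Map<String, String>> envSource) {
            this.envSource = envSource;
            load();
        }

        // Reads the (assumed) variables, falling back to defaults when unset.
        private void load() {
            Map<String, String> env = envSource.get();
            host = env.getOrDefault("SERVICE_HOST", "localhost");
            port = Integer.parseInt(env.getOrDefault("SERVICE_PORT", "8080"));
        }

        @Override
        public void afterRestore() {
            load(); // pick up whatever was injected at restore time
        }

        public String endpoint() {
            return host + ":" + port;
        }
    }

    public static void main(String[] args) {
        // A mutable map simulates the process environment in this sketch.
        Map<String, String> env = new HashMap<>();
        env.put("SERVICE_HOST", "ci-host");
        env.put("SERVICE_PORT", "9000");

        ConnectionConfig config = new ConnectionConfig(() -> env);
        System.out.println("before restore: " + config.endpoint());

        env.put("SERVICE_HOST", "prod-host"); // configuration injected at restore
        env.put("SERVICE_PORT", "443");
        config.afterRestore();
        System.out.println("after restore:  " + config.endpoint());
    }
}
```

In a real application the supplier would presumably be `System::getenv`, and the hook would be registered with whatever restore notification mechanism the runtime provides.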
> In case the application is not prepared and the process doesn't expect that environment could be changed after restoration (and/or doesn't handle this correctly), the process may have an outdated view of the env. So the process may need to be reconfigured to continue working.

We agree here. I want to point out the challenge here is that most Java applications are not written to expect env vars to change. We have a huge body of applications and libraries that expect the env to be stable and allowing the env to change will result in strange inconsistencies from 3rd party code that the user has no idea ever used the env for anything.

Think of those old dusty jar files that no one has source for any more but are still in wide use. Those are the dark corners these kinds of changes trip over.

Java has no standard mechanism for updating env vars so applications don't expect them to change.

> It's not about particular env var values, but system changes. In case we don't propagate all the changed env vars to the process, the process doesn't have a chance to reconfigure itself. So propagation of all the changed env vars seems necessary.

So we need a way to inject new configuration data into the restored application. Env vars are a convenient way to do this as they are already used when deploying containers to inject configuration. Wholesale replacement of the env seems like a really scary way to pull the rug out from under existing applications that may have been only partially configured at the time of the checkpoint.

A full scale env replacement mechanism requires every existing library to be reviewed for use of `System.getenv` and updated to reconfigure if its env vars change. And that means all users of those libraries also need to be reviewed and potentially reconfigured. It makes adoption harder and less safe.
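The inconsistency described here, one piece of code caching a value while another reads it fresh, can be reproduced in a few lines. This is a minimal simulation, not CRaC code: a plain `HashMap` stands in for the process environment, and `CachingLibrary`, `FreshReader`, and the `REGION` variable are hypothetical names chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;

public class StaleEnvDemo {

    /** Mimics a library that reads an env var once and caches it forever. */
    public static final class CachingLibrary {
        private final String region;

        public CachingLibrary(Map<String, String> env) {
            this.region = env.get("REGION");
        }

        public String region() {
            return region;
        }
    }

    /** Mimics code that looks the variable up on every call. */
    public static final class FreshReader {
        private final Map<String, String> env;

        public FreshReader(Map<String, String> env) {
            this.env = env;
        }

        public String region() {
            return env.get("REGION");
        }
    }

    public static void main(String[] args) {
        Map<String, String> env = new HashMap<>(); // simulated environment
        env.put("REGION", "ci");

        CachingLibrary library = new CachingLibrary(env); // read before "checkpoint"
        FreshReader reader = new FreshReader(env);

        env.put("REGION", "prod"); // simulated env replacement at restore

        // The two components now disagree about the same variable.
        System.out.println("cached: " + library.region());
        System.out.println("fresh:  " + reader.region());
    }
}
```

Running it prints `cached: ci` followed by `fresh:  prod`, which is the kind of split view that is hard to spot when the caching code lives in a third-party jar.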
By limiting access to the env prior to checkpoint (only a subset of env vars are available, maybe configurable?), allowing new env vars to be injected at restore, and having a limited way to override env vars, we contain the potential side effects and allow users to reason about the code they are going to run more easily.

There's a really good discussion of this problem in the OpenJ9 issues [0] and you can see how the design evolved from add only, to eventually a limited amount of override. There's also a writeup from the GraalVM team about capturing build time state that applies here too. [1] Both are worth a read.

[0] https://github.com/eclipse-openj9/openj9/issues/13545
[1] https://github.com/graalvm/taming-build-time-initialization#host-machine-data-leakage

> Of course, users need to make necessary changes to prepare their applications for checkpoint/restore events, otherwise the applications won't work properly.
>

Agreed. Our design here can make adoption easier or harder. The more we can take into account the existing code that applications depend on, the easier we can make the adoption path. We don't want to make the old code behave in new ways, but we can make it safer (or not) to use that old code.

> Speaking about debugging, currently users are able to create simple apps to print out vars, see the example below:
>
> ```
> > java -XX:CRaCRestoreFrom=./restore_folder TestApp
>
> import java.util.Map;
>
> public class TestApp {
>     public static void main(String[] args) throws Exception {
>         // Print every environment variable visible to the restored process.
>         for (Map.Entry<String, String> e : System.getenv().entrySet()) {
>             System.out.println(e.getKey() + " = " + e.getValue());
>         }
>     }
> }
> ```

This is one really cool capability with CRaC in that we can run different applications from the same checkpoint. It doesn't really address the serviceability concerns though - when dealing with end user problems, we might get the `java -version` output and a system core file.
To make it debuggable we need a way to see that the env vars were changed to avoid chasing inconsistencies related to timing of env var lookup (pre checkpoint vs post restore). ------------- PR: https://git.openjdk.org/crac/pull/30 From duke at openjdk.org Fri Oct 7 13:20:07 2022 From: duke at openjdk.org (Roman Marchenko) Date: Fri, 7 Oct 2022 13:20:07 GMT Subject: [crac] RFR: Environment vars propagation into restored process In-Reply-To: References: <4bABOyN_ecZeOziPY8Wsmfbco0VPwgYWNiD4IZFSntA=.3ca718e6-f1db-469e-a466-b55fe9ebde17@github.com> Message-ID: <9vdQjImAi-PFkW5o9UpjhJlaY2Olw7D0OYCfaW9otfE=.7ca116d0-1372-48f0-8e54-ea3b1dcac117@github.com> On Wed, 5 Oct 2022 13:27:58 GMT, Dan Heidinga wrote: >> @wkia You're right the users will need to adapt their applications to work with CRaC. 100% agree there. >> >> The challenge for them will be when they use 3rd party libraries or update their existing applications to work. It's really easy to miss updating something or not realize the full blast radius of changes requiring updates when an env var becomes "stale" after a restart. >> >> To be safe, I think we need to review the use of env vars in the JDK and ensure that both the native code and the class libraries take correct action on changed env vars. >> >> We should also consider doing something similar to the OpenJ9 approach where we restrict the set of env vars available prior to the checkpoint (minimize the accidental use of checkpoint env), and limit the env var changes to only add new env vars (no inconsistencies). This got them a long ways in their work with Liberty though they did find it necessary to eventually support overriding some env vars. >> >> With the approach in this PR, it will be hard for service engineers to know what the original env was and to debug issues related to changed env vars. Are there bread crumbs we can leave to make that service work go more smoothly? 
> >> @DanHeidinga For a simple scenario, when nothing is changed in the environment, user applications don't need to be changed, the applications work out-of-the-box. >> > > Agreed. We restore the process and the environment is the same so the application code doesn't need to be updated to make "new" decisions regarding env vars. > >> Could we consider a scenario when a container has a different environment for restoring a process rather than the environment it was checkpoint'ed? The different environment means something has changed in the system, it doesn't matter why. > > When using CRIU, we're restoring a full process so on restore, the env is the **same** as it was at the checkpoint time. The only way to make it different is to **inject** something. We're not talking just about env changes but about how to inject new configuration into the system so that system can respond to the change. > >> In case the application is not prepared and the process doesn't expect that environment could be changed after restoration (and/or doesn't handle this correctly), the process may have outdated view on the env. So the process may need to be reconfigured to continue working. > > We agree here. I want to point out the challenge here is that most Java applications are not written to expect env vars to change. We have a huge body of applications and libraries that expect the env to be stable and allowing the env to change will result in strange inconsistencies from 3rd party code that the user has no idea ever used the env for anything. > > Think of those old dusty jar files that no one has source for any more but is still in wide use. Those are the dark corners these kinds of changes trip over. > > Java has no standard mechanism for updating env vars so applications don't expect them to change. > >> It's not about particular env var values, but system changes. 
In case we don't propagate all the changed env vars to the process, the process doesn't have a chance to reconfigure itself. So propagation of all the changed env vars seems necessary. > > So we need a way to inject new configuration data into the restored application. Env vars are a convenient way to do this as they are already used when deploying containers to inject configuration. Wholesale replacement of the env seems like a really scary way to pull the rug out from under existing applications that may have been only partially configured at the time of the checkpoint. > > A full scale env replacement mechanism requires every existing library needs to be reviewed for use of System.getEnv and updated to reconfigure if their env vars change. And that means all users of those libraries also need to be reviewed and potentially reconfigured. It makes adoption harder and less safe. > > By limiting access to the env prior to checkpoint (only a subset of env vars are available, maybe configurable?), allowing new env vars to be injected at restore, and having a limited way to override env vars, we contain the potential side effects and allow users to reason about the code they are going to run more easily. > > There's a really good discussion of this problem in the OpenJ9 issues [0] and you can see how the design evolved from add only, to eventually a limited amount of override. There's also a writeup from the GraalVM team about capturing build time state that applies here too. [1] Both are worth a read. > > [0] https://github.com/eclipse-openj9/openj9/issues/13545 > [1] https://github.com/graalvm/taming-build-time-initialization#host-machine-data-leakage > >> Of course, users need to make necessary changes to prepare their applications for checkpoint/restore events, otherwise the applications couldn't work properly. >> > > Agreed. Our design here can make adoption easier or harder. 
> The more we can take into account the existing code that applications depend on, the easier we can make the adoption path. We don't want to make the old code behave in new ways, but we can make it safer (or not) to use that old code.
>
>> Speaking about debugging, currently users are able to create simple apps to print out vars, see the example below:
>>
>> ```
>> > java -XX:CRaCRestoreFrom=./restore_folder TestApp
>>
>> import java.util.Map;
>>
>> public class TestApp {
>>     public static void main(String[] args) throws Exception {
>>         // Print every environment variable visible to the restored process.
>>         for (Map.Entry<String, String> e : System.getenv().entrySet()) {
>>             System.out.println(e.getKey() + " = " + e.getValue());
>>         }
>>     }
>> }
>> ```
>
> This is one really cool capability with CRaC in that we can run different applications from the same checkpoint. It doesn't really address the serviceability concerns though - when dealing with end user problems, we might get the `java -version` output and a system core file. To make it debuggable we need a way to see that the env vars were changed to avoid chasing inconsistencies related to timing of env var lookup (pre checkpoint vs post restore).

@DanHeidinga
With env propagation there is a chance of getting an app into an inconsistent state (when the app both caches a var and reads it via getenv in different pieces of code). That's true, but only for old applications which are not prepared for checkpoint/restore events. There is no problem for new apps adapted for C/R. (And as I mentioned before, old apps would have problems anyway if a changed env is not propagated to the app.)

OTOH there are env vars which depend on each other. If we only add new vars, there is the same chance of getting an inconsistent state, because a new var is propagated but its dependent var is not (because it already exists). This might be a problem for both old apps and new apps.

Yet another scenario: an app uses an env var to know which port number to open.
After the app is restored, it turns out that the port is already busy, so we could change the var value to reconfigure the app, but the app cannot reconfigure itself because we don't propagate the env into the restored app. I guess there may be a lot of similar scenarios.

-------------

PR: https://git.openjdk.org/crac/pull/30

From akozlov at openjdk.org Tue Oct 11 09:23:38 2022
From: akozlov at openjdk.org (Anton Kozlov)
Date: Tue, 11 Oct 2022 09:23:38 GMT
Subject: [crac] RFR: Disable rseq in libc on checkpoint
Message-ID:

Restartable sequences (rseq) [0] may be used by glibc [1]. Without proper support of rseq in ptrace [2], CRIU fails to create the checkpoint [3]. Some paravirtualized environments like Docker on Mac and WSL, which are commonly used during development, still do not provide proper rseq support, leading to the error. A simple usability workaround is to disable rseq in glibc. The patch disables rseq if the JVM is started with -XX:CRaCCheckpointTo and there is no explicit setting of rseq for glibc. The workaround is not going to live forever, just until rseq support is implemented in the majority of environments. Alternatives like more clever detection of rseq in ptrace, or detection of a paravirtualized environment, seem too complex, given that the positive impact of rseq on Java performance is unknown.

[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d7822b1e24f2
[1] https://sourceware.org/git/?p=glibc.git;a=commit;h=95e114a0919d844d8fe07839cb6538b7f5ee920e
[2] https://github.com/CRaC/criu/blob/cc01f191639a5c2c988f49f1e314d17b055497b2/criu/kerndat.c#L944 [3] https://github.com/CRaC/criu/blob/cc01f191639a5c2c988f49f1e314d17b055497b2/criu/cr-dump.c#L225 ------------- Commit messages: - Fix possible null string formatting - Disable rseq in case the kernel does not support ptrace for rseq Changes: https://git.openjdk.org/crac/pull/31/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=31&range=00 Stats: 35 lines in 1 file changed: 35 ins; 0 del; 0 mod Patch: https://git.openjdk.org/crac/pull/31.diff Fetch: git fetch https://git.openjdk.org/crac pull/31/head:pull/31 PR: https://git.openjdk.org/crac/pull/31 From akozlov at openjdk.org Tue Oct 11 10:36:47 2022 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 11 Oct 2022 10:36:47 GMT Subject: [crac] RFR: Add CRAC_CRIU_LEAVE_RUNNING option Message-ID: The patch adds an option to the CRaC-CRIU glue code to continue running the original instance after the checkpoint. The central part is adding the right option to the CRIU command line. But after the checkpoint is done by CRIU, it's also necessary to communicate to the JVM that it can continue. ------------- Commit messages: - Process leave_running before buffer overflow possible - Add CRAC_CRIU_LEAVE_RUNNING option Changes: https://git.openjdk.org/crac/pull/32/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=32&range=00 Stats: 79 lines in 2 files changed: 79 ins; 0 del; 0 mod Patch: https://git.openjdk.org/crac/pull/32.diff Fetch: git fetch https://git.openjdk.org/crac pull/32/head:pull/32 PR: https://git.openjdk.org/crac/pull/32 From duke at openjdk.org Tue Oct 11 13:40:28 2022 From: duke at openjdk.org (Roman Marchenko) Date: Tue, 11 Oct 2022 13:40:28 GMT Subject: [crac] RFR: Disable rseq in libc on checkpoint In-Reply-To: References: Message-ID: On Tue, 11 Oct 2022 09:15:31 GMT, Anton Kozlov wrote: > Restartable sequences (rseq) [0] may be used by the glibc [1]. 
Without proper support of rseq in the ptrace [2], CRIU fails to create the checkpoint [3]. Some paravirtualized environments like Docker on Mac, and WSL, which are commonly used during development, still do not provide a proper rseq support, leading to the error. A simple usability workaround is to disable rseq in glibc. The patch disables rseq if JVM is started with -XX:CRaCheckpointTo and there is no explicit setting of rseq for glibc. The workaround is not going to live forever, just until rseq support is implemented in the majority of environments. Alternatives like more clever detection of rseq in the ptrace, or detection of a paravirtualized environment seem too complex, having that positive impact from rseq on java performance is unknown. > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d7822b1e24f2 > [1] https://sourceware.org/git/?p=glibc.git;a=commit;h=95e114a0919d844d8fe07839cb6538b7f5ee920e. > [2] https://github.com/CRaC/criu/blob/cc01f191639a5c2c988f49f1e314d17b055497b2/criu/kerndat.c#L944 > [3] https://github.com/CRaC/criu/blob/cc01f191639a5c2c988f49f1e314d17b055497b2/criu/cr-dump.c#L225 The fix seems working on WSL. Checked on Ubuntu 22.04 (5.10.16.3-microsoft-standard-WSL2 x86_64 GNU/Linux) ------------- PR: https://git.openjdk.org/crac/pull/31 From simonis at openjdk.org Wed Oct 12 07:25:55 2022 From: simonis at openjdk.org (Volker Simonis) Date: Wed, 12 Oct 2022 07:25:55 GMT Subject: [crac] RFR: Disable rseq in libc on checkpoint In-Reply-To: References: Message-ID: On Tue, 11 Oct 2022 09:15:31 GMT, Anton Kozlov wrote: > Restartable sequences (rseq) [0] may be used by the glibc [1]. Without proper support of rseq in the ptrace [2], CRIU fails to create the checkpoint [3]. Some paravirtualized environments like Docker on Mac, and WSL, which are commonly used during development, still do not provide a proper rseq support, leading to the error. A simple usability workaround is to disable rseq in glibc. 
The patch disables rseq if JVM is started with -XX:CRaCheckpointTo and there is no explicit setting of rseq for glibc. The workaround is not going to live forever, just until rseq support is implemented in the majority of environments. Alternatives like more clever detection of rseq in the ptrace, or detection of a paravirtualized environment seem too complex, having that positive impact from rseq on java performance is unknown. > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d7822b1e24f2 > [1] https://sourceware.org/git/?p=glibc.git;a=commit;h=95e114a0919d844d8fe07839cb6538b7f5ee920e. > [2] https://github.com/CRaC/criu/blob/cc01f191639a5c2c988f49f1e314d17b055497b2/criu/kerndat.c#L944 > [3] https://github.com/CRaC/criu/blob/cc01f191639a5c2c988f49f1e314d17b055497b2/criu/cr-dump.c#L225 This fix seems overly complicated and I don't like the fact that we exec in the launcher. This doesn't seem to be a very common use case so why do we not simply document it and instruct users who are affected to set the environment variable themselves? That looks like a pretty simple workaround to me. ------------- PR: https://git.openjdk.org/crac/pull/31 From simonis at openjdk.org Wed Oct 12 07:30:52 2022 From: simonis at openjdk.org (Volker Simonis) Date: Wed, 12 Oct 2022 07:30:52 GMT Subject: [crac] RFR: Add CRAC_CRIU_LEAVE_RUNNING option In-Reply-To: References: Message-ID: On Tue, 11 Oct 2022 10:31:46 GMT, Anton Kozlov wrote: > The patch adds an option to the CRaC-CRIU glue code to continue running the original instance after the checkpoint. The central part is adding the right option to the CRIU command line. But after the checkpoint is done by CRIU, it's also necessary to communicate to the JVM that it can continue. Looks good to me. ------------- Marked as reviewed by simonis (Committer).
PR: https://git.openjdk.org/crac/pull/32 From akozlov at openjdk.org Wed Oct 12 11:28:37 2022 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 12 Oct 2022 11:28:37 GMT Subject: [crac] RFR: Add CRAC_CRIU_LEAVE_RUNNING option In-Reply-To: References: Message-ID: On Tue, 11 Oct 2022 10:31:46 GMT, Anton Kozlov wrote: > The patch adds an option to the CRaC-CRIU glue code to continue running the original instance after the checkpoint. The central part is adding the right option to the CRIU command line. But after the checkpoint is done by CRIU, it's also necessary to communicate to the JVM that it can continue. Thanks! ------------- PR: https://git.openjdk.org/crac/pull/32 From akozlov at openjdk.org Wed Oct 12 11:30:20 2022 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 12 Oct 2022 11:30:20 GMT Subject: [crac] Integrated: Add CRAC_CRIU_LEAVE_RUNNING option In-Reply-To: References: Message-ID: On Tue, 11 Oct 2022 10:31:46 GMT, Anton Kozlov wrote: > The patch adds an option to the CRaC-CRIU glue code to continue running the original instance after the checkpoint. The central part is adding the right option to the CRIU command line. But after the checkpoint is done by CRIU, it's also necessary to communicate to the JVM that it can continue. This pull request has now been integrated. Changeset: 4cb7a965 Author: Anton Kozlov URL: https://git.openjdk.org/crac/commit/4cb7a965194623f49b8751a80ec44357cdd3a951 Stats: 79 lines in 2 files changed: 79 ins; 0 del; 0 mod Add CRAC_CRIU_LEAVE_RUNNING option Reviewed-by: simonis ------------- PR: https://git.openjdk.org/crac/pull/32 From akozlov at openjdk.org Wed Oct 12 11:37:34 2022 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 12 Oct 2022 11:37:34 GMT Subject: [crac] RFR: Disable rseq in libc on checkpoint In-Reply-To: References: Message-ID: On Tue, 11 Oct 2022 09:15:31 GMT, Anton Kozlov wrote: > Restartable sequences (rseq) [0] may be used by the glibc [1]. 
Without proper support of rseq in the ptrace [2], CRIU fails to create the checkpoint [3]. Some paravirtualized environments like Docker on Mac, and WSL, which are commonly used during development, still do not provide a proper rseq support, leading to the error. A simple usability workaround is to disable rseq in glibc. The patch disables rseq if JVM is started with -XX:CRaCheckpointTo and there is no explicit setting of rseq for glibc. The workaround is not going to live forever, just until rseq support is implemented in the majority of environments. Alternatives like more clever detection of rseq in the ptrace, or detection of a paravirtualized environment seem too complex, having that positive impact from rseq on java performance is unknown. > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d7822b1e24f2 > [1] https://sourceware.org/git/?p=glibc.git;a=commit;h=95e114a0919d844d8fe07839cb6538b7f5ee920e. > [2] https://github.com/CRaC/criu/blob/cc01f191639a5c2c988f49f1e314d17b055497b2/criu/kerndat.c#L944 > [3] https://github.com/CRaC/criu/blob/cc01f191639a5c2c988f49f1e314d17b055497b2/criu/cr-dump.c#L225 Hmm, that's reasonable. Since we anyway need to mention how to disable the workaround, documenting the workaround seems to be enough indeed. Thanks for looking! I'm closing this PR. ------------- PR: https://git.openjdk.org/crac/pull/31 From akozlov at openjdk.org Wed Oct 12 11:37:34 2022 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 12 Oct 2022 11:37:34 GMT Subject: [crac] Withdrawn: Disable rseq in libc on checkpoint In-Reply-To: References: Message-ID: <_T8FgamSiw6sSCwL3XQ5K4VySe5YvEhM40Y71xo97JE=.e8587566-76a0-47bd-90a9-91f507885368@github.com> On Tue, 11 Oct 2022 09:15:31 GMT, Anton Kozlov wrote: > Restartable sequences (rseq) [0] may be used by the glibc [1]. Without proper support of rseq in the ptrace [2], CRIU fails to create the checkpoint [3]. 
Some paravirtualized environments like Docker on Mac, and WSL, which are commonly used during development, still do not provide a proper rseq support, leading to the error. A simple usability workaround is to disable rseq in glibc. The patch disables rseq if JVM is started with -XX:CRaCheckpointTo and there is no explicit setting of rseq for glibc. The workaround is not going to live forever, just until rseq support is implemented in the majority of environments. Alternatives like more clever detection of rseq in the ptrace, or detection of a paravirtualized environment seem too complex, having that positive impact from rseq on java performance is unknown. > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d7822b1e24f2 > [1] https://sourceware.org/git/?p=glibc.git;a=commit;h=95e114a0919d844d8fe07839cb6538b7f5ee920e. > [2] https://github.com/CRaC/criu/blob/cc01f191639a5c2c988f49f1e314d17b055497b2/criu/kerndat.c#L944 > [3] https://github.com/CRaC/criu/blob/cc01f191639a5c2c988f49f1e314d17b055497b2/criu/cr-dump.c#L225 This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/crac/pull/31 From akozlov at openjdk.org Tue Oct 18 07:36:48 2022 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 18 Oct 2022 07:36:48 GMT Subject: [crac] RFR: Environment vars propagation into restored process [v3] In-Reply-To: References: Message-ID: On Mon, 3 Oct 2022 08:21:56 GMT, Roman Marchenko wrote: >> This PR provides functionality to propagate actual environment variables to a restored process, as well as the test for this functionality. >> >> Env propagation is done in few steps: >> - Store the actual environment before restoring >> - After restoring, replace the restored `environ` with a new one. >> - On `afterRestore` event, propagate the new environment into a restored process via `ProcessEnvironment`. 
> > Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: > > Fixing review comments Thank you to all participants of this interesting discussion. I see the points in different aspects, in Roman's implementation of the mechanism and in Dan's concerns about policy how it should be used, and how to debug possible issues. I propose to move with smaller steps and implement missing features and polish the policy in the subsequent PR(s), on top of the simpler implementation. ------------- PR: https://git.openjdk.org/crac/pull/30 From heidinga at openjdk.org Tue Oct 18 13:43:38 2022 From: heidinga at openjdk.org (Dan Heidinga) Date: Tue, 18 Oct 2022 13:43:38 GMT Subject: [crac] RFR: Environment vars propagation into restored process [v3] In-Reply-To: References: Message-ID: On Mon, 3 Oct 2022 08:21:56 GMT, Roman Marchenko wrote: >> This PR provides functionality to propagate actual environment variables to a restored process, as well as the test for this functionality. >> >> Env propagation is done in few steps: >> - Store the actual environment before restoring >> - After restoring, replace the restored `environ` with a new one. >> - On `afterRestore` event, propagate the new environment into a restored process via `ProcessEnvironment`. > > Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: > > Fixing review comments Is there any interest in working together on a document that outlines the overall approach to these kinds of issues in CRaC? I can see we'll have a similar discussion when we start updating "-D" system properties on restore, when we want to deal with addressing time deltas (System.nanoTime vs System.currentTimeMillis) and other areas. Working on some overall principles on how CRaC should operate will help us be consistent in the approach we take and ensure we have a solid foundation to reason about each change.
It'll also help us to identify and remember design constraints like debuggability, serviceability, performance and, of course, usability. I don't want to block this PR on such an effort, but I think we'd greatly benefit from such a document before users start trying CRaC in anger. ------------- PR: https://git.openjdk.org/crac/pull/30 From akozlov at openjdk.org Wed Oct 19 14:27:41 2022 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 19 Oct 2022 14:27:41 GMT Subject: [crac] RFR: Environment vars propagation into restored process [v3] In-Reply-To: References: Message-ID: On Mon, 3 Oct 2022 08:21:56 GMT, Roman Marchenko wrote: >> This PR provides functionality to propagate actual environment variables to a restored process, as well as the test for this functionality. >> >> Env propagation is done in few steps: >> - Store the actual environment before restoring >> - After restoring, replace the restored `environ` with a new one. >> - On `afterRestore` event, propagate the new environment into a restored process via `ProcessEnvironment`. > > Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: > > Fixing review comments Makes sense. I'll try to draft something to start from. ------------- PR: https://git.openjdk.org/crac/pull/30 From duke at openjdk.org Thu Oct 20 06:49:45 2022 From: duke at openjdk.org (Roman Marchenko) Date: Thu, 20 Oct 2022 06:49:45 GMT Subject: [crac] RFR: Environment vars propagation into restored process [v3] In-Reply-To: References: Message-ID: On Mon, 3 Oct 2022 08:21:56 GMT, Roman Marchenko wrote: >> This PR provides functionality to propagate actual environment variables to a restored process, as well as the test for this functionality. >> >> Env propagation is done in few steps: >> - Store the actual environment before restoring >> - After restoring, replace the restored `environ` with a new one.
>> - On `afterRestore` event, propagate the new environment into a restored process via `ProcessEnvironment`. > > Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: > > Fixing review comments Thank you all for your comments! As far as I understand, we've got some different opinions about further directions, so we need to work on principles to make the approaches consistent. Does the document creation stop us from merging this PR now? If no, I'd merge it. ------------- PR: https://git.openjdk.org/crac/pull/30 From akozlov at openjdk.org Thu Oct 20 07:24:17 2022 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 20 Oct 2022 07:24:17 GMT Subject: [crac] RFR: Environment vars propagation into restored process [v3] In-Reply-To: References: Message-ID: On Mon, 3 Oct 2022 08:21:56 GMT, Roman Marchenko wrote: >> This PR provides functionality to propagate actual environment variables to a restored process, as well as the test for this functionality. >> >> Env propagation is done in few steps: >> - Store the actual environment before restoring >> - After restoring, replace the restored `environ` with a new one. >> - On `afterRestore` event, propagate the new environment into a restored process via `ProcessEnvironment`. > > Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: > > Fixing review comments Merging is fine to me. I think policy, etc will be implemented on top, extending the functionality of this PR. 
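
[Editorial sketch, not part of the original messages.] The caching concern raised earlier in the thread and the `afterRestore` step of the PR can be illustrated at application level. This is a minimal sketch under stated assumptions: `RestoreListener` is a hypothetical stand-in for a CRaC after-restore callback (it is not the real CRaC `Resource` interface), `APP_PORT` and the default port are made-up names, and the environment is injected as a `Supplier` so the example runs on any JDK. It only shows that a value read once stays stale until the application re-reads it after restore.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for an after-restore callback; NOT the real CRaC API.
interface RestoreListener {
    void afterRestore();
}

// Caches a port taken from the environment and re-reads it after a restore.
final class PortConfig implements RestoreListener {
    private static final int DEFAULT_PORT = 8080;      // assumed default, for illustration only
    private final Supplier<Map<String, String>> env;   // a real program could pass System::getenv
    private volatile int port;

    PortConfig(Supplier<Map<String, String>> env) {
        this.env = env;
        this.port = readPort();                        // one-time read, as many applications do
    }

    private int readPort() {
        String value = env.get().get("APP_PORT");      // hypothetical variable name
        return value == null ? DEFAULT_PORT : Integer.parseInt(value);
    }

    int port() {
        return port;
    }

    @Override
    public void afterRestore() {
        port = readPort();                             // pick up the propagated environment
    }
}

public class EnvRefreshDemo {
    public static void main(String[] args) {
        Map<String, String> fakeEnv = new HashMap<>();
        fakeEnv.put("APP_PORT", "8080");
        PortConfig config = new PortConfig(() -> fakeEnv);

        fakeEnv.put("APP_PORT", "9090");               // simulates a changed environment on restore
        System.out.println("cached before callback: " + config.port());
        config.afterRestore();
        System.out.println("after callback: " + config.port());
    }
}
```

The point of the sketch is the design choice discussed in the thread: propagating the new environment into the process is the mechanism, while deciding which cached values to refresh remains a policy question for the application or for later PRs.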
------------- PR: https://git.openjdk.org/crac/pull/30 From heidinga at openjdk.org Thu Oct 20 12:51:20 2022 From: heidinga at openjdk.org (Dan Heidinga) Date: Thu, 20 Oct 2022 12:51:20 GMT Subject: [crac] RFR: Environment vars propagation into restored process [v3] In-Reply-To: References: Message-ID: <8EFWs_rBIVHbjcVeA3DnZaiyLMxlIcUaf82w9ZSz1is=.ecb7933a-156d-4da2-b6dc-06bafe4474ef@github.com> On Thu, 20 Oct 2022 07:22:04 GMT, Anton Kozlov wrote: > Merging is fine to me. I think policy, etc will be implemented on top, extending the functionality of this PR. Agreed. Let's merge this, define the principles, and then iterate on it. ------------- PR: https://git.openjdk.org/crac/pull/30 From duke at openjdk.org Mon Oct 24 12:07:24 2022 From: duke at openjdk.org (Roman Marchenko) Date: Mon, 24 Oct 2022 12:07:24 GMT Subject: [crac] Integrated: Environment vars propagation into restored process In-Reply-To: References: Message-ID: <6xR-GCByqbJR6gio9JGvUtiki3jtJ2S0NTfIBtQck34=.ca7a68d3-b418-4208-8665-e455f5014a95@github.com> On Thu, 29 Sep 2022 15:38:52 GMT, Roman Marchenko wrote: > This PR provides functionality to propagate actual environment variables to a restored process, as well as the test for this functionality. > > Env propagation is done in few steps: > - Store the actual environment before restoring > - After restoring, replace the restored `environ` with a new one. > - On `afterRestore` event, propagate the new environment into a restored process via `ProcessEnvironment`. This pull request has now been integrated. 
Changeset: 217d5bc6 Author: Roman Marchenko Committer: Anton Kozlov URL: https://git.openjdk.org/crac/commit/217d5bc6b3eb97239cdf699a9ca85372465699a2 Stats: 175 lines in 4 files changed: 171 ins; 0 del; 4 mod Environment vars propagation into restored process Reviewed-by: akozlov ------------- PR: https://git.openjdk.org/crac/pull/30 From heidinga at openjdk.org Mon Oct 24 13:00:25 2022 From: heidinga at openjdk.org (Dan Heidinga) Date: Mon, 24 Oct 2022 13:00:25 GMT Subject: [crac] RFR: Environment vars propagation into restored process [v3] In-Reply-To: References: Message-ID: On Thu, 20 Oct 2022 06:47:29 GMT, Roman Marchenko wrote: >> Roman Marchenko has updated the pull request incrementally with one additional commit since the last revision: >> >> Fixing review comments > > Thank you all for your comments! As far as I understand, we've got some different opinions about further directions, so we need to work on principles to make the approaches consistent. > Does the document creation stop us from merging this PR now? If no, I'd merge it. I wanted to follow up on this and thank everyone involved in the discussion for being open to working through the issues here. And a big thanks to @wkia for persevering on this patch! Looking forward to collaborating with all of you on the principles document. ------------- PR: https://git.openjdk.org/crac/pull/30 From inakonechnyy at openjdk.org Mon Oct 31 16:16:22 2022 From: inakonechnyy at openjdk.org (Ilarion Nakonechnyy) Date: Mon, 31 Oct 2022 16:16:22 GMT Subject: [crac] RFR: Report checkpoint processing to jcmd [v25] In-Reply-To: References: Message-ID: > pass output stream from diagnosticCommand.cpp through java code into os_linux.cpp::VM_crac::doit() Ilarion Nakonechnyy has updated the pull request with a new target base due to a merge or a rebase. 
------------- Changes: https://git.openjdk.org/crac/pull/10/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=10&range=24 Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod Patch: https://git.openjdk.org/crac/pull/10.diff Fetch: git fetch https://git.openjdk.org/crac pull/10/head:pull/10 PR: https://git.openjdk.org/crac/pull/10 From inakonechnyy at openjdk.org Mon Oct 31 16:16:23 2022 From: inakonechnyy at openjdk.org (Ilarion Nakonechnyy) Date: Mon, 31 Oct 2022 16:16:23 GMT Subject: [crac] Withdrawn: Report checkpoint processing to jcmd In-Reply-To: References: Message-ID: On Tue, 25 Jan 2022 15:07:55 GMT, Ilarion Nakonechnyy wrote: > pass output stream from diagnosticCommand.cpp through java code into os_linux.cpp::VM_crac::doit() This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/crac/pull/10 From gcardosi at redhat.com Mon Oct 10 13:29:17 2022 From: gcardosi at redhat.com (Gabriele Cardosi) Date: Mon, 10 Oct 2022 13:29:17 -0000 Subject: CRac example usage Message-ID: Hi all, I'm trying to run the spring-boot example from the github repo. I tried with the JDK 17 build, but it did not work due to the "java.base does not 'opens java.lang'" issue. Then I tried with the JDK 14 build, but this time I got the "jdk.crac.impl.CheckpointOpenFileException: /var/lib/sss/mc/passwd" exception. I also tried the workaround explained here (https://mail.openjdk.org/pipermail/crac-dev/2022-January/000079.html) but it did not work on my machine (RHEL 8.6). Do you have any suggestions on how to proceed? Many thanks Best Regards Gabriele -- GABRIELE CARDOSI SENIOR SOFTWARE ENGINEERS, MW Red Hat Ltd gcardosi at redhat.com M: +39-3461717132