[crac] RFR: Environment vars propagation into restored process

Wed Oct 5 13:32:01 UTC 2022

On Mon, 3 Oct 2022 15:21:05 GMT, Dan Heidinga <heidinga at openjdk.org> wrote:

>> @DanHeidinga 
>> Hi, 
>> You're right in your concerns. Indeed the suggested enhancement changes the usual workflow, so users may be confused. 
>> That is why we expect users to explicitly adapt their applications in accordance with the behaviour and make sure it works, otherwise there is no guarantee the application run with CRaC is successful.
>
> @wkia You're right the users will need to adapt their applications to work with CRaC.  100% agree there.
> 
> The challenge for them will be when they use 3rd party libraries or update their existing applications to work.  It's really easy to miss updating something or not realize the full blast radius of changes requiring updates when an env var becomes "stale" after a restart.
> 
> To be safe, I think we need to review the use of env vars in the JDK and ensure that both the native code and the class libraries take correct action on changed env vars.
> 
> We should also consider doing something similar to the OpenJ9 approach where we restrict the set of env vars available prior to the checkpoint (minimize the accidental use of checkpoint env), and limit the env var changes to only add new env vars (no inconsistencies).  This got them a long ways in their work with Liberty though they did find it necessary to eventually support overriding some env vars.
> 
> With the approach in this PR, it will be hard for service engineers to know what the original env was and to debug issues related to changed env vars. Are there bread crumbs we can leave to make that service work go more smoothly?

> @DanHeidinga For a simple scenario, when nothing is changed in the environment, user applications don't need to be changed, the applications work out-of-the-box.
> 

Agreed.  We restore the process and the environment is the same so the application code doesn't need to be updated to make "new" decisions regarding env vars.

> Could we consider a scenario when a container has a different environment for restoring a process rather than the environment it was checkpoint'ed? The different environment means something has changed in the system, it doesn't matter why. 

When using CRIU, we're restoring a full process so on restore, the env is the **same** as it was at the checkpoint time.  The only way to make it different is to **inject** something.  We're not talking just about env changes but about how to inject new configuration into the system so that the system can respond to the change.

> In case the application is not prepared and the process doesn't expect that environment could be changed after restoration (and/or doesn't handle this correctly), the process may have outdated view on the env. So the process may need to be reconfigured to continue working. 

We agree here.  I want to point out that the challenge is that most Java applications are not written to expect env vars to change.  We have a huge body of applications and libraries that expect the env to be stable, and allowing the env to change will result in strange inconsistencies from 3rd party code that the user had no idea ever used the env for anything.

Think of those old dusty jar files that no one has source for any more but that are still in wide use.  Those are the dark corners these kinds of changes trip over.

Java has no standard mechanism for updating env vars so applications don't expect them to change.
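
To make the stale-value problem concrete, here is a minimal sketch of the pattern I'm worried about.  `CachedConfig` and the `SERVICE_ENDPOINT` variable are hypothetical names, and a plain map stands in for the process environment so the example runs on any JDK without an actual checkpoint:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical library class: it reads its configuration once and keeps it,
// which is how a lot of existing code treats System.getenv().
class CachedConfig {
    private final String endpoint;

    CachedConfig(Function<String, String> env) {
        // Looked up once, at construction time (think: before the checkpoint).
        this.endpoint = env.apply("SERVICE_ENDPOINT");
    }

    String endpoint() {
        return endpoint;
    }
}

class StaleEnvDemo {
    public static void main(String[] args) {
        // A mutable map stands in for the process environment.
        Map<String, String> env = new HashMap<>();
        env.put("SERVICE_ENDPOINT", "http://build-host:8080");

        CachedConfig config = new CachedConfig(env::get);

        // Simulate a restore where the env was replaced wholesale.
        env.put("SERVICE_ENDPOINT", "http://prod-host:8080");

        System.out.println("env now: " + env.get("SERVICE_ENDPOINT"));
        System.out.println("library: " + config.endpoint()); // still the build-host value
    }
}
```

Nothing tells `CachedConfig` that its value went stale, and code that looks the variable up late sees a different answer than code that looked it up early.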

> It's not about particular env var values, but system changes. In case we don't propagate all the changed env vars to the process, the process doesn't have a chance to reconfigure itself. So propagation of all the changed env vars seems necessary. 

So we need a way to inject new configuration data into the restored application.  Env vars are a convenient way to do this as they are already used when deploying containers to inject configuration.  Wholesale replacement of the env seems like a really scary way to pull the rug out from under existing applications that may have been only partially configured at the time of the checkpoint.  

A full-scale env replacement mechanism requires that every existing library be reviewed for use of `System.getenv` and updated to reconfigure itself if its env vars change.  And that means all users of those libraries also need to be reviewed and potentially reconfigured.  It makes adoption harder and less safe.

By limiting access to the env prior to checkpoint (only a subset of env vars are available, maybe configurable?), allowing new env vars to be injected at restore, and having a limited way to override env vars, we contain the potential side effects and allow users to reason about the code they are going to run more easily.
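
As a rough sketch of the "add new, override only a limited set" rule (not what this PR or OpenJ9 actually implements), the merge could look like the following.  `RestoreEnvPolicy`, `merge` and the `overridable` allow-list are names I'm making up for illustration, and the restriction of the pre-checkpoint env is not modeled here:

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical policy: restore-time env vars may be added freely, but an
// existing checkpoint-time var changes only if it is explicitly allow-listed.
class RestoreEnvPolicy {
    static Map<String, String> merge(Map<String, String> checkpointEnv,
                                     Map<String, String> restoreEnv,
                                     Set<String> overridable) {
        // Start from what the application already saw before the checkpoint.
        Map<String, String> result = new TreeMap<>(checkpointEnv);
        for (Map.Entry<String, String> e : restoreEnv.entrySet()) {
            String name = e.getKey();
            boolean isNew = !checkpointEnv.containsKey(name);
            if (isNew || overridable.contains(name)) {
                result.put(name, e.getValue());
            }
            // Otherwise the checkpoint-time value wins, so old code stays consistent.
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> atCheckpoint = Map.of("HOME", "/home/app", "LOG_LEVEL", "info");
        Map<String, String> atRestore =
                Map.of("HOME", "/root", "LOG_LEVEL", "debug", "DB_URL", "jdbc:example");

        System.out.println(merge(atCheckpoint, atRestore, Set.of("LOG_LEVEL")));
        // prints {DB_URL=jdbc:example, HOME=/home/app, LOG_LEVEL=debug}
    }
}
```

In this sketch a variable that is missing at restore keeps its checkpoint-time value, so nothing the application already read can silently disappear.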

There's a really good discussion of this problem in the OpenJ9 issues [0] and you can see how the design evolved from add only, to eventually a limited amount of override.  There's also a writeup from the GraalVM team about capturing build time state that applies here too. [1] Both are worth a read.

[0] https://github.com/eclipse-openj9/openj9/issues/13545
[1] https://github.com/graalvm/taming-build-time-initialization#host-machine-data-leakage

> Of course, users need to make necessary changes to prepare their applications for checkpoint/restore events, otherwise the applications couldn't work properly.
> 

Agreed.  Our design here can make adoption easier or harder.  The more we can take into account the existing code that applications depend on, the easier we can make the adoption path.  We don't want to make the old code behave in new ways, but we can make it safer (or not) to use that old code.

> Speaking about debugging, currently users are able to create simple apps to print out vars, see the example below:
> 
> ```
> > java -XX:CRaCRestoreFrom=./restore_folder TestApp
> 
> import java.util.Map;
> 
> public class TestApp {
>     public static void main(String[] args) throws Exception {
>         for (Map.Entry<String, String> e : System.getenv().entrySet()) {
>             System.out.println(e.getKey() + " = " + e.getValue());
>         }
>     }
> }
> ```

This is one really cool capability of CRaC: we can run different applications from the same checkpoint.  It doesn't really address the serviceability concerns though.  When dealing with end user problems, we might get only the `java -version` output and a system core file.  To make it debuggable we need a way to see that the env vars were changed, to avoid chasing inconsistencies related to the timing of env var lookup (pre-checkpoint vs post-restore).
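
One possible bread crumb, sketched under the assumption that something on the restore path can see both the checkpoint-time and restore-time env, is a record of which variables were added, removed or changed.  `EnvDiff` and `describeChanges` are hypothetical names, not an existing CRaC API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper: summarizes how the env differs between checkpoint and restore.
class EnvDiff {
    static List<String> describeChanges(Map<String, String> before, Map<String, String> after) {
        // Visit every variable name seen on either side, in sorted order.
        TreeSet<String> names = new TreeSet<>(before.keySet());
        names.addAll(after.keySet());

        List<String> changes = new ArrayList<>();
        for (String name : names) {
            String oldValue = before.get(name);
            String newValue = after.get(name);
            if (oldValue == null) {
                changes.add("added " + name);
            } else if (newValue == null) {
                changes.add("removed " + name);
            } else if (!oldValue.equals(newValue)) {
                changes.add("changed " + name);
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        Map<String, String> before = Map.of("HOME", "/home/app", "TMPDIR", "/tmp");
        Map<String, String> after = Map.of("HOME", "/root", "DB_URL", "jdbc:example");

        describeChanges(before, after).forEach(System.out::println);
        // prints:
        // added DB_URL
        // changed HOME
        // removed TMPDIR
    }
}
```

The sketch records names only, since env values often carry credentials; whether such a record belongs in a log, a core file annotation or somewhere else is an open question.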

-------------

PR: https://git.openjdk.org/crac/pull/30