[crac] RFR: Environment vars propagation into restored process

Roman Marchenko duke at openjdk.org
Wed Oct 5 11:54:50 UTC 2022


On Mon, 3 Oct 2022 15:21:05 GMT, Dan Heidinga <heidinga at openjdk.org> wrote:

>> @DanHeidinga 
>> Hi, 
>> You're right in your concerns. Indeed the suggested enhancement changes the usual workflow, so users may be confused. 
>> That is why we expect users to explicitly adapt their applications in accordance with the behaviour and make sure it works, otherwise there is no guarantee the application run with CRaC is successful.
>
> @wkia You're right the users will need to adapt their applications to work with CRaC.  100% agree there.
> 
> The challenge for them will be when they use 3rd party libraries or update their existing applications to work.  It's really easy to miss updating something or not realize the full blast radius of changes requiring updates when an env var becomes "stale" after a restart.
> 
> To be safe, I think we need to review the use of env vars in the JDK and ensure that both the native code and the class libraries take correct action on changed env vars.
> 
> We should also consider doing something similar to the OpenJ9 approach where we restrict the set of env vars available prior to the checkpoint (minimize the accidental use of checkpoint env), and limit the env var changes to only add new env vars (no inconsistencies).  This got them a long ways in their work with Liberty though they did find it necessary to eventually support overriding some env vars.
> 
> With the approach in this PR, it will be hard for service engineers to know what the original env was and to debug issues related to changed env vars. Are there bread crumbs we can leave to make that service work go more smoothly?

@DanHeidinga 
In the simple scenario where nothing in the environment changes, user applications don't need to be modified; they work out of the box.

Could we consider a scenario where a container restores a process in a different environment from the one in which it was checkpointed? A different environment means something has changed in the system; it doesn't matter why. If the application is not prepared for this and the process doesn't expect the environment to change after restore (or doesn't handle the change correctly), the process may have an outdated view of the environment, so it may need to be reconfigured to continue working. This is not about particular env var values but about system changes. If we don't propagate all the changed env vars to the process, it has no chance to reconfigure itself, so propagating all of them seems necessary. Of course, users need to make the necessary changes to prepare their applications for checkpoint/restore events; otherwise the applications may not work properly.
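To illustrate the kind of adaptation meant here, below is a minimal sketch of a component that caches a value derived from an env var and re-reads it after restore. It is not code from this PR. The class name `EnvBackedConfig`, the variable `SERVICE_ENDPOINT`, and the default value are made up for the example, and the plain `afterRestore()` method is only a stand-in for a real restore callback (in an actual CRaC application this logic would presumably live in a `jdk.crac.Resource` registered with the global context).

```java
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical component that caches configuration read from the environment.
public class EnvBackedConfig {
    // Source of the environment; injected so the sketch runs without CRaC.
    private final Supplier<Map<String, String>> envSource;
    // Cached value; becomes stale if the environment changes across restore.
    private volatile String endpoint;

    public EnvBackedConfig(Supplier<Map<String, String>> envSource) {
        this.envSource = envSource;
        reload();
    }

    // Stand-in for a restore callback (assumption: invoked once after restore).
    public void afterRestore() {
        reload();
    }

    // Re-reads the (hypothetical) SERVICE_ENDPOINT variable from the environment.
    private void reload() {
        endpoint = envSource.get().getOrDefault("SERVICE_ENDPOINT", "localhost:8080");
    }

    public String endpoint() {
        return endpoint;
    }

    public static void main(String[] args) {
        EnvBackedConfig cfg = new EnvBackedConfig(System::getenv);
        System.out.println("endpoint = " + cfg.endpoint());
    }
}
```

Without the reload step, the cached value would keep reflecting the checkpoint-time environment even if the new env vars are propagated, which is why propagation alone is not enough and the application has to cooperate.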

As for debugging, users can already write a simple app that prints out the env vars, as in the example below:


> java -XX:CRaCRestoreFrom=./restore_folder TestApp

import java.util.Map;

public class TestApp {
    public static void main(String[] args) throws Exception {
        // Print every environment variable visible to the process.
        for (Map.Entry<String, String> e : System.getenv().entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
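Regarding bread crumbs for service engineers, one possible approach (a sketch only, not something this PR implements) is for the application to snapshot its environment before the checkpoint and report the differences after restore. The class `EnvDiff` and the `+`/`-`/`~` output format are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: reports how the environment differs between two snapshots.
public class EnvDiff {
    // Returns one line per added (+), removed (-) or changed (~) variable,
    // ordered by variable name.
    public static List<String> diff(Map<String, String> before, Map<String, String> after) {
        List<String> out = new ArrayList<>();
        Set<String> keys = new TreeSet<>(before.keySet());
        keys.addAll(after.keySet());
        for (String k : keys) {
            String b = before.get(k);
            String a = after.get(k);
            if (b == null) {
                out.add("+ " + k + " = " + a);
            } else if (a == null) {
                out.add("- " + k + " = " + b);
            } else if (!b.equals(a)) {
                out.add("~ " + k + ": " + b + " -> " + a);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Copy of the environment as it would be taken before the checkpoint.
        Map<String, String> atCheckpoint = new TreeMap<>(System.getenv());
        // ... a checkpoint and a later restore would happen here ...
        // After restore, log whatever changed relative to the snapshot.
        diff(atCheckpoint, System.getenv()).forEach(System.out::println);
    }
}
```

Run as a plain Java program this prints nothing, since the environment does not change within a single run; the output only becomes interesting when the second call happens after a restore into a different environment.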

-------------

PR: https://git.openjdk.org/crac/pull/30

