[crac] RFR: Fix ordering of invocation on Resources

Mon Apr 24 08:28:16 UTC 2023

On Mon, 24 Apr 2023 07:03:31 GMT, Radim Vansa <duke at openjdk.org> wrote:

>> src/java.base/share/classes/jdk/crac/Resource.java line 41:
>> 
>>> 39:      * Invoked by a {@code Context} as a notification about checkpoint.
>>> 40:      * Order of checkpoint notification is the reverse order of
>>> 41:      * {@link Context#register(Resource) registration}.
>> 
>> This is correct for the Global Context, but the order can be different in other Contexts.
>> 
>> I don't think Resource javadoc should describe the ordering (otherwise, a Context that uses a different ordering should refuse registration, but the Context does not have any mean to know the resource assumes some particular ordering).
>
> Alright, I think that the order for the global context is rather hidden in the `package-info.java`. So, I'll add a note here that the order is defined by the context, and move this info to `Core.getGlobalContext` javadoc.
> 
> Can we say that the fact that afterRestore is called in an inverse order to beforeCheckpoint is an universal rule, or would you keep that specific to context impl?

I think that should be leaved to Context, as ordering does not look quite Resource concern. Contextes were made to hande ordering.

>> src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java line 54:
>> 
>>> 52:             locked = true;
>>> 53:             // This is important for the case of recursive registration
>>> 54:             throwIfCheckpointInProgress(priority);
>> 
>> What is expected handling of this exception? And in the current form the exception is not checked, so in most cases that exception won't be expected. Having that in most cases resources are registered during class or object initialization, those entities will be caught in partially constructed states, likely leaving the whole system stuck.
>> 
>> Would it not better to allow registration, but throw CheckpointException at the end of Context.beforeCheckpoint? That is pretty legit exception that means the checkpoint can be attempted again, and another attempt can be pretty well successful, if all Resources do not throw and no new Resources are registered.
>
> So, making `register` throw a checked exception wouldn't be a good solution; most of the time there wouldn't be a better handling than to log && rethrow.
> 
> What you propose - having a form of exception stack in the context and checking it anything was added after `beforeCheckpoint` could work, though this is definitely not a pattern that would be natural in Java - it smells of JNI developer :)
> 
> I think that a situation when the ISE is thrown is a programming error, rather than reacting on the state of 'outer world' (as in case of e.g. I/O...) and as such this can stay as unchecked exception, possibly leading to inconsistent state of the program.

I admit that Resource is usually registered implicitly, and the current state (silently ignore newly registered Resources when checkpoint is in progress) is not adequate and may leads to some state leaking to the image, while that is not supposed to (if Resource would be registered some time before).

With unchecked ISE, the user will have to somehow ensure that registration likley being implicit, changing the code to break the abstraction above that Resource/registration.

With the check for new Resources, we'll detect exactly the situtaion user will likely like to know (if everything in the state ready to be stored to the image), and if not, a recovery will be trying again. Every Resource may throw  CheckpointException for their own reasons, and CheckpointException("timeout for waiting some critical clean up, try again") is expected to be something very possible and is expected.

>> src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java line 117:
>> 
>>> 115:                     restoreQ.add(r);
>>> 116:                 } catch (CheckpointException e) {
>>> 117:                     enqueueIfContext(r);
>> 
>> Ohh, I see the problem
>> 
>>> When Context.beforeCheckpoint throws, invoke Context.afterRestore anyway (otherwise some resources stay in suspended state).
>> 
>> So this is proposed to be handled in the parent context? Have you considered fixing that in the child context, run afterRestore for successfully checkpointed resources, before throwing CheckpointException to the parent context?
>
> Right now we run the checkpoint for all resources even if some fail. Are you fine with erroring out on the first exception?

Why is that necessary? If we are trying to avoid second afterRestore for sucessfully checkpointed Resources, that would mean that Context.afterRestore is called, that controverts the problem definition.

Assuming afterRestore is not called for the Resource with failed beforeCheckpoint. Suppose we have R1, R2 in a child Context, and R1.beforeCheckpoint succeed, and R2.beforeCheckpoint fails. We can call R1.afterRestore, before throwing from Ctx.beforeCheckpoint. That will restore successfully checkpointed resources.

But I checked the code and it seems that we do call afterRestore regardless of the result of beforeCheckpoint 

https://github.com/openjdk/crac/blob/95394e84683f1a816c0283f8c834072324516fba/src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java#L54

Every Resource available is collected in List resources and then restore queue is exactly the reverse of that list, it does not matter if any of resources throwed CheckpointException. Right?

-------------

PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1174890341
PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1174909443
PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1174946729