[crac] RFR: Fix ordering of invocation on Resources [v3]

Radim Vansa duke at openjdk.org
Tue May 2 07:23:49 UTC 2023


On Fri, 28 Apr 2023 12:58:59 GMT, Anton Kozlov <akozlov at openjdk.org> wrote:

>> Radim Vansa has updated the pull request incrementally with two additional commits since the last revision:
>> 
>>  - More fine-grained synchronization
>>  - Rework context ordering (round 2)
>>    
>>    * call afterRestore even if beforeCheckpoint throws
>>    * registering resource in previous/running context does not trigger exception immediately
>>    ** instead this will be one of the recorded exceptions and the resource has a chance to fire next time
>>    * we don't guarantee threads not deadlocking when trying to register a resource, though
>
> src/java.base/share/classes/jdk/crac/impl/OrderedContext.java line 60:
> 
>> 58:         // It is possible that something registers to us during restore but before
>> 59:         // this context's afterRestore was called.
>> 60:         if (checkpointing && !Core.isRestoring()) {
> 
> There is a small window between the point where all beforeCheckpoint() calls have finished and the checkpoint. In this window we'll call setModified(). And there is another window between restore and the start of afterRestore() processing, where we won't call setModified(). Getting the exception or not will be the result of a race between checkpoint/restore (the actual event with near-zero duration, without calling Resources) and registration.
> 
> A Resource may also have an empty beforeCheckpoint() and some afterRestore() clean-up. We'll register the resource for the next round of checkpoint/restore and will be silent about the newly registered Resource. But since beforeCheckpoint() is empty, the original intent could be to do something useful on restore, which won't be done.

Does that matter, though? The registration itself will be done for the next C/R in all cases. The situation where the registrar entered the `register()` method, added the resource and recorded an exception that was then silently discarded is equivalent to the situation where it entered `register()` just after the restore was completed. If the registration and the C/R are independent, you cannot force it to fail reliably, unless something in running the resource notifications depends on the registration (and nothing does, as all resources have already been notified).

If the intent was to do something after the current restore, the resource should have been registered in a blocking way before the checkpoint or as part of the blocking beforeCheckpoint, not completely independently.
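
To illustrate the equivalence argued above, here is a minimal single-threaded sketch. It is a toy model, not the OrderedContext code from this PR: `ToyContext`, `ToyResource`, `CountingResource` and `checkpointRestore` are hypothetical names, the racing registration is modelled as a callback run between the two notification phases, and notification order is simplified.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these types are the real jdk.crac classes.
class ToyContext {
    interface ToyResource {
        void beforeCheckpoint();
        void afterRestore();
    }

    // Resource with an empty beforeCheckpoint() and some work in afterRestore().
    static class CountingResource implements ToyResource {
        int restores;
        @Override public void beforeCheckpoint() { }
        @Override public void afterRestore() { restores++; }
    }

    private final List<ToyResource> resources = new ArrayList<>();

    void register(ToyResource r) {
        resources.add(r);
    }

    // One C/R round. The callback stands in for an independent registration
    // that happens to land between beforeCheckpoint and afterRestore.
    void checkpointRestore(Runnable racingRegistration) {
        // Only resources known at the start of the round are notified.
        List<ToyResource> notified = new ArrayList<>(resources);
        for (ToyResource r : notified) {
            r.beforeCheckpoint();
        }
        racingRegistration.run();
        for (ToyResource r : notified) {
            r.afterRestore();
        }
    }

    // Returns {early after round 1, late after round 1,
    //          early after round 2, late after round 2}.
    static int[] demo() {
        ToyContext ctx = new ToyContext();
        CountingResource early = new CountingResource();
        CountingResource late = new CountingResource();

        ctx.register(early);                             // registered before the checkpoint
        ctx.checkpointRestore(() -> ctx.register(late)); // registered mid-round
        int earlyFirst = early.restores;
        int lateFirst = late.restores;

        ctx.checkpointRestore(() -> { });                // next round: both are notified
        return new int[] { earlyFirst, lateFirst, early.restores, late.restores };
    }
}
```

In this model the late resource behaves exactly as if it had been registered right after the first restore finished: it misses that round and takes part in the next one. Only the resource registered before the checkpoint sees the current restore.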

-------------

PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182178695

