[crac] RFR: Introduce per-Priority Context with different policies [v3]
Radim Vansa
duke at openjdk.org
Tue May 30 14:32:23 UTC 2023
On Tue, 30 May 2023 12:36:27 GMT, Anton Kozlov <akozlov at openjdk.org> wrote:
>> One of the points in #60 was to make user errors *visible*, rather than silently passing - that was one reason I pushed that PR hard. Now looking again into `OrderedContext` that's what this PR brings and I strongly disagree. There's two ways to cope with those errors in the order of registration I know of: either throw an error (and I don't mind if it's from the `Core.checkpointRestore` or straight from `register`) or block, and you've chosen the latter option, stating that the problem is clearly visible from the threaddump.
>
> It was burried along other things that PR was doing. After some thought [1] Resources are used for two different purposes: fix-up the state to ensure checkpoint won't fail later (close file descriptors, which otherwise will be detected later); and later we discovered some resources are ensuring properties of the image (e.g. cleaning secrets). The blocking behavior makes sense for the latter, and for real concurrent registration it uses the race to delay registration and thus the problem is resolved without user intervention. But it brings significant usability drawback with the deadlock with registration from the same thread when notification has been started. And this is IMO more frequent that a registration from another thread, which is waited for by the thread executing e.g. beforeCheckpoint. For the clean-up, blocking
>
> We should not impose such a big inconvenience for the more frequent use case, regardless how the error is detected or reported. Applications, in their turn, can implement a blocking context with necessary guarantees (or use some implementation, which we eventually expose, once we got this contextes shaked out).
>
> My take on blocking is preffered to an exception was in context of JDK implementation: we do not expose the problem via API, but auto-deadlocks is a bug in the JDK implementaiton.
>
> [1] https://github.com/openjdk/crac/pull/60#issuecomment-1545853147
You won't know whether the resource registered in global context is to be invoked optionally or if it's mandatory; any checks like those for FDs is only a last line of defense and debugging help. Therefore not being executed before C/R is a problem the platform should detect, preventing C/R from happening, to be on the safe side. We should not choose the unsafe for default, because users likely won't find out that something fishy is happening without any 'inconvenience'. As you say, users can change the semantics in their own context once they know.
I agree that deadlock is not the best way to report this, if you want to choice different means for the global context I am okay with that. We've talked about `register` throwing might not be the best option as the exception could be swallowed, exception from `Core.checkpointRestore` along with stack trace of the registration would be preferable.
-------------
PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1210371421
More information about the crac-dev
mailing list