From duke at openjdk.org Tue May 2 06:40:45 2023
From: duke at openjdk.org (Radim Vansa)
Date: Tue, 2 May 2023 06:40:45 GMT
Subject: [crac] RFR: Fix ordering of invocation on Resources [v3]
In-Reply-To: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com>
References:

<-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com>
Message-ID:

On Fri, 28 Apr 2023 11:51:19 GMT, Anton Kozlov wrote:

>> Radim Vansa has updated the pull request incrementally with two additional commits since the last revision:
>>
>> - More fine-grained synchronization
>> - Rework context ordering (round 2)
>>
>> * call afterRestore even if beforeCheckpoint throws
>> * registering resource in previous/running context does not trigger exception immediately
>> ** instead this will be one of the recorded exceptions and the resource has a chance to fire next time
>> * we don't guarantee threads not deadlocking when trying to register a resource, though
>
> src/java.base/share/classes/jdk/crac/impl/PriorityContext.java line 21:
>
>> 19: // CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
>> 20: // ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
>> 21: // POSSIBILITY OF SUCH DAMAGE.
>
> Use standard copyright please

Uh, OK - I copy-pasted what's in AbstractContextImpl, as I forked the class...

-------------

PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182142187

From duke at openjdk.org Tue May 2 06:48:42 2023
From: duke at openjdk.org (Radim Vansa)
Date: Tue, 2 May 2023 06:48:42 GMT
Subject: [crac] RFR: Fix ordering of invocation on Resources [v3]
In-Reply-To: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com>
References:

<-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com>
Message-ID:

On Fri, 28 Apr 2023 11:59:17 GMT, Anton Kozlov wrote:

>> Radim Vansa has updated the pull request incrementally with two additional commits since the last revision:
>>
>> - More fine-grained synchronization
>> - Rework context ordering (round 2)
>>
>> * call afterRestore even if beforeCheckpoint throws
>> * registering resource in previous/running context does not trigger exception immediately
>> ** instead this will be one of the recorded exceptions and the resource has a chance to fire next time
>> * we don't guarantee threads not deadlocking when trying to register a resource, though
>
> src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java line 59:
>
>> 57:     recordExceptions(e);
>> 58: } catch (Exception e) {
>> 59:     Core.recordException(e);
>
> Why is there the distinction? I think we should throw all exceptions from the context, rather than publishing them to a central store, otherwise the parent Context (if any) won't be able to do anything about those.

The contexts are supposed to run all resources regardless of whether any failed, therefore there's no point in propagating and re-wrapping the exceptions. We could have a method to check whether this C/R is 'marked for rollback' (has any exceptions), but I don't see why the parent context would decide on the type, because it would get just a CheckpointException with a list of other failures from components it does not understand.

About the distinction: the difference is that if you get a CheckpointException you'd unwrap it, recording only the inner suppressed ones. But I should push that to `recordExceptions` and decide based on the CheckpointException message rather than on the number of suppressions.
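The recording scheme described above can be sketched roughly as follows. This is not the code from the PR: `WrapperException` is a hypothetical stand-in for `CheckpointException`, `FailureLog` is a hypothetical stand-in for the central store behind `Core.recordException`, and `Step` is a hypothetical stand-in for a resource callback. The sketch runs every step even when some fail, and when a nested context reports a wrapper, only the suppressed inner failures are recorded. It detects the wrapper by type, which is a simplification of the message-based decision mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

final class RecordingDemo {
    // Runs every step even if earlier ones fail, recording failures instead of rethrowing.
    static FailureLog runAll(List<Step> steps) {
        FailureLog log = new FailureLog();
        for (Step step : steps) {
            try {
                step.run();
            } catch (Exception e) {
                log.record(e);
            }
        }
        return log;
    }

    public static void main(String[] args) {
        // A nested "context" with one failing step.
        List<Step> nested = new ArrayList<>();
        nested.add(() -> { throw new IllegalStateException("socket still open"); });
        FailureLog inner = runAll(nested);

        List<Step> top = new ArrayList<>();
        top.add(() -> { throw new IllegalArgumentException("file still open"); });
        top.add(() -> { });  // a succeeding step still runs after a failure
        top.add(() -> { throw new WrapperException(inner.failures()); });

        for (Throwable t : runAll(top).failures()) {
            System.out.println(t);
        }
    }
}

// Hypothetical stand-in for a resource callback that may fail.
interface Step {
    void run() throws Exception;
}

// Hypothetical stand-in for CheckpointException: failures travel as suppressed exceptions.
final class WrapperException extends Exception {
    WrapperException(List<? extends Throwable> failures) {
        super("wrapper");
        for (Throwable failure : failures) {
            addSuppressed(failure);
        }
    }
}

// Hypothetical stand-in for the central store of recorded failures.
final class FailureLog {
    private final List<Throwable> failures = new ArrayList<>();

    void record(Throwable t) {
        if (t instanceof WrapperException) {
            // Unwrap: keep only the inner failures, never the wrapper itself.
            for (Throwable inner : t.getSuppressed()) {
                record(inner);
            }
        } else {
            failures.add(t);
        }
    }

    List<Throwable> failures() {
        return failures;
    }
}
```

With this shape the top-level log ends up with the two leaf failures and no wrapper, so nothing has to be copied or re-wrapped on the way up.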
-------------

PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182150531

From duke at openjdk.org Tue May 2 07:06:01 2023
From: duke at openjdk.org (Radim Vansa)
Date: Tue, 2 May 2023 07:06:01 GMT
Subject: [crac] RFR: Fix ordering of invocation on Resources [v3]
In-Reply-To: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com>
References:

<-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com>
Message-ID: <3Y237gR4g69wXdE-awYTVqazWTXKFYNsaJSjwEeVWbA=.56181d61-a6bd-4706-8a12-adf5cf2fff7b@github.com>

On Fri, 28 Apr 2023 12:01:42 GMT, Anton Kozlov wrote:

>> Radim Vansa has updated the pull request incrementally with two additional commits since the last revision:
>>
>> - More fine-grained synchronization
>> - Rework context ordering (round 2)
>>
>> * call afterRestore even if beforeCheckpoint throws
>> * registering resource in previous/running context does not trigger exception immediately
>> ** instead this will be one of the recorded exceptions and the resource has a chance to fire next time
>> * we don't guarantee threads not deadlocking when trying to register a resource, though
>
> src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java line 78:
>
>> 76: }
>> 77:
>> 78: protected abstract void runBeforeCheckpoint();
>
> This is intended to be overridden (becomes a part of the class interface). The intent behind the separate method is not evident. The corresponding runAfterRestore is private, though.
>
> After AbstractContextImpl lost parameter P and the comparator, the distinction between AbstractContextImpl and OrderedContext has been lost. Merging AbstractContextImpl into OrderedContext will likely provide clearer code.

ACI (AbstractContextImpl) is extended by both OrderedContext and PriorityContext, and PC is quite different from OC.

-------------

PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182164379

From duke at openjdk.org Tue May 2 07:09:46 2023
From: duke at openjdk.org (Radim Vansa)
Date: Tue, 2 May 2023 07:09:46 GMT
Subject: [crac] RFR: Fix ordering of invocation on Resources [v3]
In-Reply-To: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com>
References:

<-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com>
Message-ID:

On Fri, 28 Apr 2023 12:06:19 GMT, Anton Kozlov wrote:

>> Radim Vansa has updated the pull request incrementally with two additional commits since the last revision:
>>
>> - More fine-grained synchronization
>> - Rework context ordering (round 2)
>>
>> * call afterRestore even if beforeCheckpoint throws
>> * registering resource in previous/running context does not trigger exception immediately
>> ** instead this will be one of the recorded exceptions and the resource has a chance to fire next time
>> * we don't guarantee threads not deadlocking when trying to register a resource, though
>
> src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java line 55:
>
>> 53: restoreQ.add(resource);
>> 54: try {
>> 55:     resource.beforeCheckpoint(semanticContext());
>
> Does this mean a Resource may get another Context and not the one to which it has been registered? This may be very unexpected for the Resource implementation.

Theoretically, the method could do that. However, here the purpose of `semanticContext()` is to pass the context to which the resource was registered rather than the subcontext where it is stored (the latter is an implementation detail that the resource does not know about).

-------------

PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182166930

From duke at openjdk.org Tue May 2 07:23:49 2023
From: duke at openjdk.org (Radim Vansa)
Date: Tue, 2 May 2023 07:23:49 GMT
Subject: [crac] RFR: Fix ordering of invocation on Resources [v3]
In-Reply-To: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com>
References:

<-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com>
Message-ID:

On Fri, 28 Apr 2023 12:58:59 GMT, Anton Kozlov wrote:

>> Radim Vansa has updated the pull request incrementally with two additional commits since the last revision:
>>
>> - More fine-grained synchronization
>> - Rework context ordering (round 2)
>>
>> * call afterRestore even if beforeCheckpoint throws
>> * registering resource in previous/running context does not trigger exception immediately
>> ** instead this will be one of the recorded exceptions and the resource has a chance to fire next time
>> * we don't guarantee threads not deadlocking when trying to register a resource, though
>
> src/java.base/share/classes/jdk/crac/impl/OrderedContext.java line 60:
>
>> 58: // It is possible that something registers to us during restore but before
>> 59: // this context's afterRestore was called.
>> 60: if (checkpointing && !Core.isRestoring()) {
>
> There is a small window between the point where all beforeCheckpoint() calls have finished and the checkpoint. In this window we'll call setModified(). And there is another window between restore and the start of afterRestore() processing, where we won't call setModified(). Whether we get the exception will be the result of a race between checkpoint/restore (the actual event with near-zero duration, without calling Resources) and registration.
>
> A Resource may also have an empty beforeCheckpoint() and some afterRestore() clean-up. We'll register the resource for the next round of checkpoint/restore and will be silent about the newly registered Resource. But since beforeCheckpoint() is empty, the original intent could be to do something useful on restore, which won't be done.

Does that matter, though? The registration itself will be done for the next C/R in all cases. The situation where the registrar entered the `register()` method, added the resource and recorded an exception that was then silently discarded is equivalent to the situation where it entered `register()` just after restore was completed. If these are independent you cannot force it to fail reliably, unless something in running the resource notifications depends on the registration (but nothing does, as all resources have already been notified).

If the intent was to do something after the current restore, the resource should have been registered in a blocking way before the checkpoint or as part of the blocking beforeCheckpoint, not completely independently.

-------------

PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182178695

From duke at openjdk.org Tue May 2 07:27:48 2023
From: duke at openjdk.org (Radim Vansa)
Date: Tue, 2 May 2023 07:27:48 GMT
Subject: [crac] RFR: Fix ordering of invocation on Resources [v3]
In-Reply-To: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com>
References:

<-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com>
Message-ID:

On Fri, 28 Apr 2023 13:14:20 GMT, Anton Kozlov wrote:

>> Radim Vansa has updated the pull request incrementally with two additional commits since the last revision:
>>
>> - More fine-grained synchronization
>> - Rework context ordering (round 2)
>>
>> * call afterRestore even if beforeCheckpoint throws
>> * registering resource in previous/running context does not trigger exception immediately
>> ** instead this will be one of the recorded exceptions and the resource has a chance to fire next time
>> * we don't guarantee threads not deadlocking when trying to register a resource, though
>
> src/java.base/share/classes/jdk/crac/Core.java line 104:
>
>> 102: * Order of invoking {@link Resource#afterRestore(Context)} is the reverse
>> 103: * of the order of {@link Resource#beforeCheckpoint(Context) checkpoint notification},
>> 104: * hence the same as the order of {@link Context#register(Resource) registration}.
>
> How about moving the Global Context description from the package level to here (and removing it there)? In javax.crac it should be fine to link to here IMO.

Okay, I can remove the description in the package. As for `javax.crac` - I thought that this should really be a mirror of `jdk.crac`, so why the distinction? Another way might be to make OrderedContext a marker interface (moving the implementation to OrderedContextImpl) and put the description there, using this interface for `getGlobalContext()`.

-------------

PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182182075

From duke at openjdk.org Tue May 2 07:41:44 2023
From: duke at openjdk.org (Radim Vansa)
Date: Tue, 2 May 2023 07:41:44 GMT
Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore
In-Reply-To:
References:

Message-ID:

On Fri, 28 Apr 2023 14:56:42 GMT, Anton Kozlov wrote:

>> When a JVM option is MANAGEABLE it can be set at any time during runtime, therefore it is safe to change it during the restore operation. Rather than silently ignoring JVM options passed along with -XX:CRaCRestoreFrom we send them to the restored process and either update them or print a warning if a given option cannot be changed.
>
> src/hotspot/os/linux/os_linux.cpp line 6557:
>
>> 6555: ::_restore_start_counter = hdr->_restore_counter;
>> 6556:
>> 6557: for (int i = 0; i < hdr->_nflags; i++) {
>
> This check can be done in the "bootstrap" process (the one that execs to CREngine), just to avoid restoring and only then finding out the problem. See the other comment about producing the error.

Makes sense, I'll turn those into errors. I should probably also check for the presence of `-jar` and `-cp`/`--class-path` and produce a nice explanation; otherwise the code would interpret those as the new main class and its arguments.

> src/hotspot/share/runtime/globals.hpp line 2096:
>
>> 2094: /* It does not make sense to change this flag in runtime but we'll tag */ \
>> 2095: /* it MANAGEABLE to prevent warnings when setting this on restore. */ \
>> 2096: product(ccstr, CRaCRestoreFrom, NULL, MANAGEABLE, \
>
> This is an example of why we want a "can be set on restore" (RESTORABLE?) flag. So MANAGEABLE will imply RESTORABLE, but not every RESTORABLE will be MANAGEABLE.

Below we see that `CRaCIgnoredFileDescriptors` is RESTORABLE (I would rather use SET_ON_RESTORE, as all flags are restored to their previous values) but not MANAGEABLE, so that's not perfect either.

The reason why I stayed with MANAGEABLE was to avoid changing every `MANAGEABLE` to `MANAGEABLE | SET_ON_RESTORE`, which would complicate any backport from mainline.

I can do the SET_ON_RESTORE (as a superset of MANAGEABLE), but I don't think that the few exceptions, which could instead be documented by a few lines of comment, really deserve a separate flag.
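To make the "superset" idea concrete, here is a minimal model of it in Java. It is only an illustration under assumptions: HotSpot declares flags through C++ macros in globals.hpp, and nothing below is taken from that code. `FlagKind`, `settableOnRestore` and `settableAtRuntime` are hypothetical names. The point of the sketch is that the implication "MANAGEABLE implies settable on restore" can live in a single check, so existing MANAGEABLE declarations would not need to be rewritten as `MANAGEABLE | SET_ON_RESTORE`.

```java
import java.util.EnumSet;
import java.util.Set;

final class RestoreFlagDemo {
    // Single place that encodes "MANAGEABLE implies settable on restore".
    static boolean settableOnRestore(Set<FlagKind> kinds) {
        return kinds.contains(FlagKind.MANAGEABLE) || kinds.contains(FlagKind.SET_ON_RESTORE);
    }

    // Settable at any time while the VM runs: only MANAGEABLE qualifies.
    static boolean settableAtRuntime(Set<FlagKind> kinds) {
        return kinds.contains(FlagKind.MANAGEABLE);
    }

    public static void main(String[] args) {
        Set<FlagKind> manageable = EnumSet.of(FlagKind.MANAGEABLE);
        Set<FlagKind> restoreOnly = EnumSet.of(FlagKind.SET_ON_RESTORE);
        Set<FlagKind> plain = EnumSet.noneOf(FlagKind.class);

        System.out.println("manageable on restore:   " + settableOnRestore(manageable));
        System.out.println("restore-only on restore: " + settableOnRestore(restoreOnly));
        System.out.println("restore-only at runtime: " + settableAtRuntime(restoreOnly));
        System.out.println("plain on restore:        " + settableOnRestore(plain));
    }
}

// Hypothetical flag kinds; SET_ON_RESTORE is the name floated in this thread.
enum FlagKind {
    MANAGEABLE,
    SET_ON_RESTORE
}
```

Whether the few restore-only flags justify a separate kind, as opposed to a comment on each of them, is exactly the open question above; the sketch only shows that the superset relation itself is cheap to express.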
-------------

PR Review Comment: https://git.openjdk.org/crac/pull/61#discussion_r1182193699
PR Review Comment: https://git.openjdk.org/crac/pull/61#discussion_r1182191415

From akozlov at openjdk.org Tue May 2 09:45:47 2023
From: akozlov at openjdk.org (Anton Kozlov)
Date: Tue, 2 May 2023 09:45:47 GMT
Subject: [crac] RFR: Fix ordering of invocation on Resources [v3]
In-Reply-To:
References:

<-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com>
Message-ID:

On Tue, 2 May 2023 06:45:45 GMT, Radim Vansa wrote:

>> src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java line 59:
>>
>>> 57:     recordExceptions(e);
>>> 58: } catch (Exception e) {
>>> 59:     Core.recordException(e);
>>
>> Why is there the distinction? I think we should throw all exceptions from the context, rather than publishing them to a central store, otherwise the parent Context (if any) won't be able to do anything about those.
>
> The contexts are supposed to run all resources regardless of whether any failed, therefore there's no point in propagating and re-wrapping the exceptions. We could have a method to check whether this C/R is 'marked for rollback' (has any exceptions), but I don't see why the parent context would decide on the type, because it would get just a CheckpointException with a list of other failures from components it does not understand.
>
> About the distinction: the difference is that if you get a CheckpointException you'd unwrap it, recording only the inner suppressed ones. But I should push that to `recordExceptions` and decide based on the CheckpointException message rather than on the number of suppressions.

> The contexts are supposed to run all resources regardless of whether any failed, therefore there's no point in propagating and re-wrapping the exceptions. ... I don't see why the parent context would decide on the type, because it would get just a CheckpointException with a list of other failures from components it does not understand.

The parent Context may implement arbitrary handling (for example, unloading a component completely if it throws an exception). So throwing an exception is useful.

Given that, the new Core.recordException is a completely new exception flow that just optimizes something the generic throw scheme already provides. The generic scheme should already be good enough; we don't need to complicate the interface, the code, etc. CheckpointExceptions are still exceptions, that is, we don't expect them often, so there is no need to optimize them.

-------------

PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182328171

From akozlov at openjdk.org Tue May 2 10:06:43 2023
From: akozlov at openjdk.org (Anton Kozlov)
Date: Tue, 2 May 2023 10:06:43 GMT
Subject: [crac] RFR: Fix ordering of invocation on Resources [v3]
In-Reply-To:
References:

<-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com>
Message-ID:

On Tue, 2 May 2023 07:06:48 GMT, Radim Vansa wrote:

>> src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java line 55:
>>
>>> 53: restoreQ.add(resource);
>>> 54: try {
>>> 55:     resource.beforeCheckpoint(semanticContext());
>>
>> Does this mean a Resource may get another Context and not the one to which it has been registered? This may be very unexpected for the Resource implementation.
>
> Theoretically, the method could do that. However, here the purpose of `semanticContext()` is to pass the context to which the resource was registered rather than the subcontext where it is stored (the latter is an implementation detail that the resource does not know about).

Just realized that this is required for the PriorityContext implementation. But that is the implementation of that class; it's wrong that ACI has to care about that.

>> src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java line 78:
>>
>>> 76: }
>>> 77:
>>> 78: protected abstract void runBeforeCheckpoint();
>>
>> This is intended to be overridden (becomes a part of the class interface). The intent behind the separate method is not evident. The corresponding runAfterRestore is private, though.
>>
>> After AbstractContextImpl lost parameter P and the comparator, the distinction between AbstractContextImpl and OrderedContext has been lost. Merging AbstractContextImpl into OrderedContext will likely provide clearer code.
>
> ACI (AbstractContextImpl) is extended by both OrderedContext and PriorityContext, and PC is quite different from OC.

I'm trying to describe the class hierarchy and failing. The patch tries to reverse the ACI-PC relation. ACI was for partially-ordered Resources (defined by a Comparator), and now it's for totally-ordered Resources (ordered by a long). Trying to fit the partially-ordering PC as a subclass of the totally-ordering ACI feels unnatural. Can we have a cleaner hierarchy of the classes?
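For readers following the hierarchy question, one possible shape of a base class that leaves ordering entirely to an abstract `runBeforeCheckpoint` hook is sketched below. It is not the PR's code and all names are hypothetical: `Res` is a simplified resource without a Context argument, `BaseContext` plays the role of ACI, and the two subclasses only loosely correspond to OrderedContext and PriorityContext. Exception handling is left out. The base class only remembers which resources were notified, so that afterRestore runs in the reverse of the notification order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

final class HierarchyDemo {
    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        BaseContext ctx = new ReverseRegistrationContext();
        ctx.register(new LoggingRes("a", log));
        ctx.register(new LoggingRes("b", log));
        ctx.checkpointAndRestore();
        System.out.println(log);
    }
}

// Hypothetical, simplified resource: two callbacks, no Context argument.
interface Res {
    void beforeCheckpoint();
    void afterRestore();
}

// Test helper that records the order in which callbacks fire.
final class LoggingRes implements Res {
    final String name;
    private final List<String> log;

    LoggingRes(String name, List<String> log) {
        this.name = name;
        this.log = log;
    }

    @Override public void beforeCheckpoint() { log.add(name + ":before"); }
    @Override public void afterRestore() { log.add(name + ":after"); }
}

// Base class: knows nothing about ordering, only tracks what was notified.
abstract class BaseContext {
    protected final List<Res> registered = new ArrayList<>();
    private final Deque<Res> restoreQ = new ArrayDeque<>();

    void register(Res r) {
        registered.add(r);
    }

    // Subclasses choose the order by calling notifyCheckpoint per resource.
    protected abstract void runBeforeCheckpoint();

    protected final void notifyCheckpoint(Res r) {
        restoreQ.push(r);  // last notified will be restored first
        r.beforeCheckpoint();
    }

    final void checkpointAndRestore() {
        runBeforeCheckpoint();
        while (!restoreQ.isEmpty()) {
            restoreQ.pop().afterRestore();
        }
    }
}

// Notifies in reverse registration order, so afterRestore runs in registration order.
final class ReverseRegistrationContext extends BaseContext {
    @Override protected void runBeforeCheckpoint() {
        for (int i = registered.size() - 1; i >= 0; i--) {
            notifyCheckpoint(registered.get(i));
        }
    }
}

// Notifies in the order given by a comparator.
final class ComparatorContext extends BaseContext {
    private final Comparator<Res> order;

    ComparatorContext(Comparator<Res> order) {
        this.order = order;
    }

    @Override protected void runBeforeCheckpoint() {
        List<Res> sorted = new ArrayList<>(registered);
        sorted.sort(order);
        sorted.forEach(this::notifyCheckpoint);
    }
}
```

In this shape neither subclass is "more ordered" than the base; whether that matches what the patch actually does is for the discussion to settle.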
-------------

PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182349575
PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182347913

From duke at openjdk.org Tue May 2 10:30:44 2023
From: duke at openjdk.org (Radim Vansa)
Date: Tue, 2 May 2023 10:30:44 GMT
Subject: [crac] RFR: Fix ordering of invocation on Resources [v3]
In-Reply-To:
References:

<-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com>

Message-ID:

On Tue, 2 May 2023 10:02:34 GMT, Anton Kozlov wrote:

>> ACI (AbstractContextImpl) is extended by both OrderedContext and PriorityContext, and PC is quite different from OC.
>
> I'm trying to describe the class hierarchy and failing. The patch tries to reverse the ACI-PC relation. ACI was for partially-ordered Resources (defined by a Comparator), and now it's for totally-ordered Resources (ordered by a long). Trying to fit the partially-ordering PC as a subclass of the totally-ordering ACI feels unnatural. Can we have a cleaner hierarchy of the classes?

No; ACI was totally-ordered in the previous version of the PR, but now it doesn't care about ordering at all; it's the abstract `runBeforeCheckpoint` that decides on the order.

-------------

PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182370220

From duke at openjdk.org Tue May 2 10:44:48 2023
From: duke at openjdk.org (Radim Vansa)
Date: Tue, 2 May 2023 10:44:48 GMT
Subject: [crac] RFR: Fix ordering of invocation on Resources [v3]
In-Reply-To:
References:

<-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com>

Message-ID: <50suyCtjP8Ke3x6c2hjzSA-v-9sg2HjQNTuIJtrUYx8=.026b7b21-e44e-463d-af72-2ed789a44c41@github.com>

On Tue, 2 May 2023 10:04:21 GMT, Anton Kozlov wrote:

>> Theoretically, the method could do that. However, here the purpose of `semanticContext()` is to pass the context to which the resource was registered rather than the subcontext where it is stored (the latter is an implementation detail that the resource does not know about).
>
> Just realized that this is required for the PriorityContext implementation. But that is the implementation of that class; it's wrong that ACI has to care about that.

Yes, it's a bit of enforced flexibility in the base class (through allowing a method to be overridden), though the base class doesn't need to know about the use case. I could use just a list rather than sub-contexts, but that would require duplicated code.

>> The contexts are supposed to run all resources regardless of whether any failed, therefore there's no point in propagating and re-wrapping the exceptions. We could have a method to check whether this C/R is 'marked for rollback' (has any exceptions), but I don't see why the parent context would decide on the type, because it would get just a CheckpointException with a list of other failures from components it does not understand.
>>
>> About the distinction: the difference is that if you get a CheckpointException you'd unwrap it, recording only the inner suppressed ones. But I should push that to `recordExceptions` and decide based on the CheckpointException message rather than on the number of suppressions.
>
>> The contexts are supposed to run all resources regardless of whether any failed, therefore there's no point in propagating and re-wrapping the exceptions. ... I don't see why the parent context would decide on the type, because it would get just a CheckpointException with a list of other failures from components it does not understand.
>
> The parent Context may implement arbitrary handling (for example, unloading a component completely if it throws an exception). So throwing an exception is useful.
>
> Given that, the new Core.recordException is a completely new exception flow that just optimizes something the generic throw scheme already provides. The generic scheme should already be good enough; we don't need to complicate the interface, the code, etc. CheckpointExceptions are still exceptions, that is, we don't expect them often, so there is no need to optimize them.

It's not about optimization (in the sense of performance) but about removing code bloat, and the need for the parent context to tediously copy failures (deciding whether something was a wrapper exception or the actual failure), when all you need to do in the end is to report them in bulk. It's not a new exception flow; it's removing the flow, as there is no exception flow needed.

Your example is invalid: if a throwing resource is to be removed, that is the task of the parent context - and that one will see the exception. The parent context should not remove its child context just because one of the N resources in the child context is failing.

-------------

PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182382341
PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182379465

From akozlov at openjdk.org Tue May 2 10:57:41 2023
From: akozlov at openjdk.org (Anton Kozlov)
Date: Tue, 2 May 2023 10:57:41 GMT
Subject: [crac] RFR: Fix ordering of invocation on Resources [v3]
In-Reply-To:
References:

<-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com>
Message-ID:

On Tue, 2 May 2023 07:21:24 GMT, Radim Vansa wrote:

> Does that matter, though? The registration itself will be done for the next C/R in all cases. The situation where the registrar entered the `register()` method, added the resource and recorded an exception that was then silently discarded is equivalent to the situation where it entered `register()` just after restore was completed.

Suppose afterRestore() has some really important side effect that absolutely must run on restore. Then it falls into the category of problems this PR tries to solve (silently ignoring registered resources).

> If these are independent you cannot force it to fail reliably, unless something in running the resource notifications depends on the registration (but nothing does, as all resources have already been notified).
>
> If the intent was to do something after the current restore, the resource should have been registered in a blocking way before the checkpoint or as part of the blocking beforeCheckpoint, not completely independently.

Some form of guarantee would be good. The blocking registration: I assume it is something like register() returning only when the argument has been successfully registered? This looks like a viable approach, i.e. it specifies the behavior around a possible race and does not affect the "normal" workflow where registration happens well before the checkpoint. Do you see any problem with blocking registration?

-------------

PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182393741

From akozlov at openjdk.org Tue May 2 11:38:43 2023
From: akozlov at openjdk.org (Anton Kozlov)
Date: Tue, 2 May 2023 11:38:43 GMT
Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore
In-Reply-To:
References:

Message-ID:

On Tue, 2 May 2023 07:35:57 GMT, Radim Vansa wrote:

> The reason why I stayed with MANAGEABLE was to avoid changing every `MANAGEABLE` to `MANAGEABLE | SET_ON_RESTORE`, which would complicate any backport from mainline.

I totally support MANAGEABLE => RESTORABLE, as it makes sense.

> I can do the SET_ON_RESTORE (as a superset of MANAGEABLE), but I don't think that the few exceptions, which could instead be documented by a few lines of comment, really deserve a separate flag.

Option kinds are also a form of documentation for the flags, so the new class of options deserves its own name.

> Below we see that `CRaCIgnoredFileDescriptors` is RESTORABLE but not MANAGEABLE, so that's not perfect either.

Yes, that is another one. And there are many more, like PrintCompilation, or any other product flag whose handling in the VM allows an update on restore. Over time the set of RESTORABLE flags can grow as the VM implementation allows, while there is a higher bar for including a flag in the MANAGEABLE set [1].

[1] https://github.com/openjdk/crac/blob/crac/src/hotspot/share/runtime/globals.hpp#L86

> (I would rather use SET_ON_RESTORE, as all flags are restored to their previous values)

I don't like RESTORABLE either (it relates to RESTORE, but how is not clear). The new name should fit into the existing set: DIAGNOSTIC, EXPERIMENTAL, or MANAGEABLE.

-------------

PR Review Comment: https://git.openjdk.org/crac/pull/61#discussion_r1182430278

From duke at openjdk.org Tue May 2 12:31:51 2023
From: duke at openjdk.org (Radim Vansa)
Date: Tue, 2 May 2023 12:31:51 GMT
Subject: [crac] RFR: Fix ordering of invocation on Resources [v3]
In-Reply-To:
References:

<-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com>

Message-ID:

On Tue, 2 May 2023 10:54:51 GMT, Anton Kozlov wrote:

>> Does that matter, though? The registration itself will be done for the next C/R in all cases. The situation where the registrar entered the `register()` method, added the resource and recorded an exception that was then silently discarded is equivalent to the situation where it entered `register()` just after restore was completed. If these are independent you cannot force it to fail reliably, unless something in running the resource notifications depends on the registration (but nothing does, as all resources have already been notified).
>>
>> If the intent was to do something after the current restore, the resource should have been registered in a blocking way before the checkpoint or as part of the blocking beforeCheckpoint, not completely independently.
>
>> Does that matter, though? The registration itself will be done for the next C/R in all cases. The situation where the registrar entered the `register()` method, added the resource and recorded an exception that was then silently discarded is equivalent to the situation where it entered `register()` just after restore was completed.
>
> Suppose afterRestore() has some really important side effect that absolutely must run on restore. Then it falls into the category of problems this PR tries to solve (silently ignoring registered resources).
>
>> If these are independent you cannot force it to fail reliably, unless something in running the resource notifications depends on the registration (but nothing does, as all resources have already been notified).
>>
>> If the intent was to do something after the current restore, the resource should have been registered in a blocking way before the checkpoint or as part of the blocking beforeCheckpoint, not completely independently.
>
> Some form of guarantee would be good. The blocking registration: I assume it is something like register() returning only when the argument has been successfully registered? This looks like a viable approach, i.e. it specifies the behavior around a possible race and does not affect the "normal" workflow where registration happens well before the checkpoint. Do you see any problem with blocking registration?

I don't think we understand each other. Let's say you have code like this:

    new Thread(() -> {
        Resource another = /* ... */;
        Core.getGlobalContext().register(another);
    }).start();
    Core.checkpointRestore();

You insisted on a `register()` that does not throw. What implementation could ensure that something eventually makes `Core.checkpointRestore()` throw? There's no guarantee that the registration will run before the checkpoint completes; the code does not order these in any way. The only result the user can expect is that the resource will be registered, eventually. Had you added a `CountDownLatch` that is triggered after calling `register()` and waited for it somewhere in at least one of the `beforeCheckpoint` methods, the race would not happen.

-------------

PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182481551

From akozlov at openjdk.org Tue May 2 14:03:17 2023
From: akozlov at openjdk.org (Anton Kozlov)
Date: Tue, 2 May 2023 14:03:17 GMT
Subject: [crac] RFR: Backout new API to sync with Reference Handler [v3]
In-Reply-To:
References:
Message-ID: <-y7D8kIbwPs5UJx2iOrmXhahEEraCfG7r46yCbGdpjs=.882bb53e-a8d0-4b32-b1de-19116c37f325@github.com>

> This reverts commit 9cf1995693eead85d3807fb4c83ab38c14e27042 and makes #22 obsolete.
>
> The API introduced in 9cf1995693eead85d3807fb4c83ab38c14e27042 (waitForWaiters) and changed in #22 waits for the state when all discovered references are processed. So waitForWaiters is used to implement predictable Reference Handling, ensuring that clean-up actions have fired for an object after it becomes unreachable.
>
> I think that API was a mistake and should be reverted.
>
> In general, the problem of predictable Reference Handling is independent of CRaC. So I thought about extracting that out of CRaC and found a few issues with the approach. A user needs to know which RefQueue receives References after an object becomes unreachable, in order to call waitForWaiters on that queue. The queue is not necessarily evident, so a deep understanding of the refs and queues in an application is required to select the proper queue to wait on, and to work out the right order in which to wait on them. Also, the user somehow needs to know the number of threads servicing a queue. And there are situations when waitForWaiters may report that all refs are processed, but some of them are not -- consider a thread that polls a queue and gets refs to be processed but then buffers them in another queue for later; in this example waitForWaiters does not guarantee that the corresponding clean-up actions were performed.
>
> The common and more straightforward way to have predictable clean-up is to call an explicit method like close()/release()/cleanup() that performs object-specific clean-up actions predictably.
Anton Kozlov has updated the pull request incrementally with three additional commits since the last revision: - Bring back the test - Use in-place Resource - Bring back parts of the commit ------------- Changes: - all: https://git.openjdk.org/crac/pull/34/files - new: https://git.openjdk.org/crac/pull/34/files/63bc1847..73870426 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=34&range=02 - incr: https://webrevs.openjdk.org/?repo=crac&pr=34&range=01-02 Stats: 128 lines in 4 files changed: 126 ins; 0 del; 2 mod Patch: https://git.openjdk.org/crac/pull/34.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/34/head:pull/34 PR: https://git.openjdk.org/crac/pull/34 From akozlov at openjdk.org Tue May 2 16:21:44 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 2 May 2023 16:21:44 GMT Subject: [crac] RFR: Backout new API to sync with Reference Handler [v3] In-Reply-To: <-y7D8kIbwPs5UJx2iOrmXhahEEraCfG7r46yCbGdpjs=.882bb53e-a8d0-4b32-b1de-19116c37f325@github.com> References: <-y7D8kIbwPs5UJx2iOrmXhahEEraCfG7r46yCbGdpjs=.882bb53e-a8d0-4b32-b1de-19116c37f325@github.com> Message-ID: On Tue, 2 May 2023 14:03:17 GMT, Anton Kozlov wrote: >> This reverts commit 9cf1995693eead85d3807fb4c83ab38c14e27042 and makes #22 obsolete. >> >> The API introduced in 9cf1995693eead85d3807fb4c83ab38c14e27042 (waitForWaiters) and changed in #22 waits for the state when all discovered references are processed. So WaitForWaiters is used to implement predictable Reference Handling, ensuring that clean-up actions have fired for an object after it becomes unreachable. >> >> I think that API was a mistake and should be reverted. >> >> In general, the problem of predictable Reference Handling is independent of CRaC. So I thought about extracting that out of CRaC and found a few issues with the approach. A user needs to know what RefQueue gets References after an object becomes unreachable, to call waitForWaiters on that queue. 
The queue is not necessarily evident, so a deep understanding of refs and queues in an application is required to select the proper queue to wait on, and to build the right order of them to wait on. Also, it's required somehow to know the number of threads servicing a queue. And there are situations when waitForWaiters may report that all refs are processed, but some of them are not -- consider a thread that is polling a queue and gets refs to be processed but then buffers them in another queue for later, in this example waitForWaiters does not provide the guarantee that corresponding clean-up actions were performed. >> >> The common and more straightforward way to have predictable clean-up is to call an explicit method like close()/release()/cleanup() that performs object-specific clean-up actions predictably. > > Anton Kozlov has updated the pull request incrementally with three additional commits since the last revision: > > - Bring back the test > - Use in-place Resource > - Bring back parts of the commit Turns out we cannot avoid synchronizing Cleaners with the checkpoint, otherwise, some native resources may remain open, if they are released as a result of running a Cleaner. This happens with JarFileFactory, which maintains a cache of URLJarFiles. To avoid their tracking, a lightweight GC-based tracking is implemented [1], that only requires that unused entries are reachable only from the cache. But it relies on Cleaner to run all the actions. The latest version declares PhantomCleanableRef's to be Resources, which triggers clean() on the checkpoint, rather than waiting for the ref to be processed by a reference processor thread. This preserves Cleaner behavior w.r.t. checkpoint, so the test is also retained. 
[1] https://github.com/openjdk/crac/blob/95394e84683f1a816c0283f8c834072324516fba/src/java.base/unix/classes/sun/net/www/protocol/jar/JarFileFactory.java#L255 ------------- PR Comment: https://git.openjdk.org/crac/pull/34#issuecomment-1531754292 From akozlov at openjdk.org Tue May 2 17:15:45 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 2 May 2023 17:15:45 GMT Subject: [crac] RFR: Improved open file descriptors tracking [v9] In-Reply-To: References:

Message-ID: <2uXm_DAQyImefAYUr1Ujxgxwu8bP0gqzhCTfvWNbDzE=.5ce26d49-9bbd-4097-9138-689391145509@github.com> On Mon, 24 Apr 2023 13:27:30 GMT, Radim Vansa wrote: >> Tracks `java.io.FileDescriptor` instances as CRaC resource; before checkpoint these are reported and if not allow-listed (e.g. as opened as standard descriptors) an exception is thrown. Further investigation can use system property `jdk.crac.collect-fd-stacktraces=true` to record origin of those file descriptors. >> File descriptors claimed in Java code are passed to native; native code checks all open file descriptors and reports error if there's an unexpected FD that is not included in the list passed previously. > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Provide more information for file descriptors src/hotspot/os/linux/os_linux.cpp line 6092: > 6090: int ilen = snprintf(msg, maxinfo, "FD fd=%d type=%s path=", i, type); > 6091: ilen = ilen > maxinfo ? maxinfo : ilen; > 6092: strncpy(msg + ilen, detailsbuf, buflen - ilen); `ilen >= maxinfo` if the output was truncated [1]. `strncpy` may leave the string unterminated if `buflen-ilen` is smaller than the length of the details. So `snprintf(msg, maxinfo, "FD fd=%d type=%s path=%s", i, type, details)` will be a safer option. [1] RETURN in https://linux.die.net/man/3/snprintf src/java.base/share/classes/java/io/FileDescriptor.java line 400: > 398: } else { > 399: info = (path != null ? path : "unknown path") + " (" + (type != null ? type : "unknown") + ")"; > 400: } This has too many socket-related details, and also a number of java/native transitions that will be unavoidable if we adopt the proposed interface. src/java.base/share/classes/java/net/Socket.java line 1939: > 1937: * @return Textual representation of the type. > 1938: */ > 1939: public static native String getType(int fd); `int fd` is a Unix platform detail. I propose a single function `public static String getDescription(FileDescriptor socket)`. 
That e.g. returns `null` if the FileDescriptor is not a socket. The method can be a native or not, depends on the implementation. src/java.base/unix/native/libnet/SocketImpl.c line 115: > 113: return NULL; > 114: } > 115: } No need for `} else {` here and everywhere else since the previous block has anyway terminated the function. This will make the code more streamlined. Suggestion: } localAddr = create_isa(env, isa_class, isa_ctor, &local); if (localAddr == NULL) { JNU_ThrowOutOfMemoryError(env, "java.net.InetSocketAddres"); return NULL; } ------------- PR Review Comment: https://git.openjdk.org/crac/pull/43#discussion_r1182827339 PR Review Comment: https://git.openjdk.org/crac/pull/43#discussion_r1182802505 PR Review Comment: https://git.openjdk.org/crac/pull/43#discussion_r1182794395 PR Review Comment: https://git.openjdk.org/crac/pull/43#discussion_r1182815623 From duke at openjdk.org Wed May 3 06:54:38 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 3 May 2023 06:54:38 GMT Subject: [crac] RFR: Backout new API to sync with Reference Handler [v3] In-Reply-To: <-y7D8kIbwPs5UJx2iOrmXhahEEraCfG7r46yCbGdpjs=.882bb53e-a8d0-4b32-b1de-19116c37f325@github.com> References: <-y7D8kIbwPs5UJx2iOrmXhahEEraCfG7r46yCbGdpjs=.882bb53e-a8d0-4b32-b1de-19116c37f325@github.com> Message-ID: On Tue, 2 May 2023 14:03:17 GMT, Anton Kozlov wrote: >> This reverts commit 9cf1995693eead85d3807fb4c83ab38c14e27042 and makes #22 obsolete. >> >> The API introduced in 9cf1995693eead85d3807fb4c83ab38c14e27042 (waitForWaiters) and changed in #22 waits for the state when all discovered references are processed. So WaitForWaiters is used to implement predictable Reference Handling, ensuring that clean-up actions have fired for an object after it becomes unreachable. >> >> I think that API was a mistake and should be reverted. >> >> In general, the problem of predictable Reference Handling is independent of CRaC. 
So I thought about extracting that out of CRaC and found a few issues with the approach. A user needs to know what RefQueue gets References after an object becomes unreachable, to call waitForWaiters on that queue. The queue is not necessarily evident, so a deep understanding of refs and queues in an application is required to select the proper queue to wait on, and to build the right order of them to wait on. Also, it's required somehow to know the number of threads servicing a queue. And there are situations when waitForWaiters may report that all refs are processed, but some of them are not -- consider a thread that is polling a queue and gets refs to be processed but then buffers them in another queue for later, in this example waitForWaiters does not provide the guarantee that corresponding clean-up actions were performed. >> >> The common and more straightforward way to have predictable clean-up is to call an explicit method like close()/release()/cleanup() that performs object-specific clean-up actions predictably. > > Anton Kozlov has updated the pull request incrementally with three additional commits since the last revision: > > - Bring back the test > - Use in-place Resource > - Bring back parts of the commit I've commented out the `PhantomCleanableRef` registration for a test and in one case the test passed (other attempts failed) - I wish we could have the test failing more reliably. Nevertheless LGTM. ------------- PR Comment: https://git.openjdk.org/crac/pull/34#issuecomment-1532527018 From duke at openjdk.org Wed May 3 09:45:09 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 3 May 2023 09:45:09 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore [v2] In-Reply-To: References: Message-ID: > When a JVM option is MANAGEABLE it can be set at any time during runtime, therefore it is safe to change it during the restore operation. 
Rather than silently ignoring JVM options passed along with -XX:CRaCRestoreFrom we send them to the restored process and either update or print a warning if given option cannot be changed. Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Use RESTORE_SETTABLE on JVM flags * Fail early when using non settable flags * CracBuilder fix: don't use classpath during restore ------------- Changes: - all: https://git.openjdk.org/crac/pull/61/files - new: https://git.openjdk.org/crac/pull/61/files/b2a73eb9..21ec5e80 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=61&range=01 - incr: https://webrevs.openjdk.org/?repo=crac&pr=61&range=00-01 Stats: 126 lines in 8 files changed: 64 ins; 10 del; 52 mod Patch: https://git.openjdk.org/crac/pull/61.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/61/head:pull/61 PR: https://git.openjdk.org/crac/pull/61 From duke at openjdk.org Wed May 3 09:47:44 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 3 May 2023 09:47:44 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore In-Reply-To: References: Message-ID: On Fri, 28 Apr 2023 07:24:06 GMT, Radim Vansa wrote: > When a JVM option is MANAGEABLE it can be set at any time during runtime, therefore it is safe to change it during the restore operation. Rather than silently ignoring JVM options passed along with -XX:CRaCRestoreFrom we send them to the restored process and either update or print a warning if given option cannot be changed. @AntonKozlov I created a separate bit (RESTORE_SETTABLE, an adjective like the others), and using non-settable flags now fails early. I also tried to prohibit explicitly setting `-cp` and friends, but in `parse_options_for_restore` I can no longer tell where the value comes from, so I have not incorporated this in this PR. There are a few extra changes in `CracBuilder` that would help with testing the above, but I find them useful in general, so I've included those. 
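
The "fail early" behaviour mentioned above can be modelled roughly as follows. This is only a simplified Java model of the idea, not the HotSpot implementation (which is C++ code around `globals.hpp`). The class `RestoreFlagsSketch`, the method `applyOnRestore`, and the flag names `ExampleSettableFlag` and `ExampleFixedFlag` are all hypothetical, and no claim is made here about which real flags carry RESTORE_SETTABLE.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified model of "validate all flags first, then apply"; not HotSpot code.
final class RestoreFlagsSketch {
    // Hypothetical set of flags that may be changed on restore.
    static final Set<String> RESTORE_SETTABLE = Set.of("ExampleSettableFlag");

    // Current flag values of the (modelled) restored process.
    final Map<String, String> values = new LinkedHashMap<>();

    // Rejects the whole request if any flag is not settable on restore,
    // so a bad request never leaves the process half-updated.
    void applyOnRestore(Map<String, String> requested) {
        List<String> rejected = new ArrayList<>();
        for (String name : requested.keySet()) {
            if (!RESTORE_SETTABLE.contains(name)) {
                rejected.add(name);
            }
        }
        if (!rejected.isEmpty()) {
            throw new IllegalArgumentException("cannot be set on restore: " + rejected);
        }
        values.putAll(requested);
    }
}
```

Checking every requested flag before applying any of them is what makes the failure "early": the caller sees an error instead of a restored process that silently ignored some options.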
------------- PR Comment: https://git.openjdk.org/crac/pull/61#issuecomment-1532735769 From akozlov at openjdk.org Wed May 3 09:57:49 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 3 May 2023 09:57:49 GMT Subject: [crac] RFR: Backout new API to sync with Reference Handler [v3] In-Reply-To: <-y7D8kIbwPs5UJx2iOrmXhahEEraCfG7r46yCbGdpjs=.882bb53e-a8d0-4b32-b1de-19116c37f325@github.com> References: <-y7D8kIbwPs5UJx2iOrmXhahEEraCfG7r46yCbGdpjs=.882bb53e-a8d0-4b32-b1de-19116c37f325@github.com> Message-ID: On Tue, 2 May 2023 14:03:17 GMT, Anton Kozlov wrote: >> This reverts commit 9cf1995693eead85d3807fb4c83ab38c14e27042 and makes #22 obsolete. >> >> The API introduced in 9cf1995693eead85d3807fb4c83ab38c14e27042 (waitForWaiters) and changed in #22 waits for the state when all discovered references are processed. So WaitForWaiters is used to implement predictable Reference Handling, ensuring that clean-up actions have fired for an object after it becomes unreachable. >> >> I think that API was a mistake and should be reverted. >> >> In general, the problem of predictable Reference Handling is independent of CRaC. So I thought about extracting that out of CRaC and found a few issues with the approach. A user needs to know what RefQueue gets References after an object becomes unreachable, to call waitForWaiters on that queue. The queue is not necessarily evident, so a deep understanding of refs and queues in an application is required to select the proper queue to wait on, and to build the right order of them to wait on. Also, it's required somehow to know the number of threads servicing a queue. And there are situations when waitForWaiters may report that all refs are processed, but some of them are not -- consider a thread that is polling a queue and gets refs to be processed but then buffers them in another queue for later, in this example waitForWaiters does not provide the guarantee that corresponding clean-up actions were performed. 
>> >> The common and more straightforward way to have predictable clean-up is to call an explicit method like close()/release()/cleanup() that performs object-specific clean-up actions predictably. > > Anton Kozlov has updated the pull request incrementally with three additional commits since the last revision: > > - Bring back the test > - Use in-place Resource > - Bring back parts of the commit Thanks for the review! ------------- PR Comment: https://git.openjdk.org/crac/pull/34#issuecomment-1532747527 From akozlov at openjdk.org Wed May 3 09:57:49 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 3 May 2023 09:57:49 GMT Subject: [crac] Integrated: Backout new API to sync with Reference Handler In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 15:34:23 GMT, Anton Kozlov wrote: > This reverts commit 9cf1995693eead85d3807fb4c83ab38c14e27042 and makes #22 obsolete. > > The API introduced in 9cf1995693eead85d3807fb4c83ab38c14e27042 (waitForWaiters) and changed in #22 waits for the state when all discovered references are processed. So WaitForWaiters is used to implement predictable Reference Handling, ensuring that clean-up actions have fired for an object after it becomes unreachable. > > I think that API was a mistake and should be reverted. > > In general, the problem of predictable Reference Handling is independent of CRaC. So I thought about extracting that out of CRaC and found a few issues with the approach. A user needs to know what RefQueue gets References after an object becomes unreachable, to call waitForWaiters on that queue. The queue is not necessarily evident, so a deep understanding of refs and queues in an application is required to select the proper queue to wait on, and to build the right order of them to wait on. Also, it's required somehow to know the number of threads servicing a queue. 
And there are situations when waitForWaiters may report that all refs are processed, but some of them are not -- consider a thread that is polling a queue and gets refs to be processed but then buffers them in another queue for later, in this example waitForWaiters does not provide the guarantee that corresponding clean-up actions were performed. > > The common and more straightforward way to have predictable clean-up is to call an explicit method like close()/release()/cleanup() that performs object-specific clean-up actions predictably. This pull request has now been integrated. Changeset: ccf33231 Author: Anton Kozlov URL: https://git.openjdk.org/crac/commit/ccf33231110c8e8dd3c47bae0a079d25a34ac8b5 Stats: 53 lines in 2 files changed: 19 ins; 32 del; 2 mod Backout new API to sync with Reference Handler ------------- PR: https://git.openjdk.org/crac/pull/34 From duke at openjdk.org Wed May 3 11:06:44 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 3 May 2023 11:06:44 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v3] In-Reply-To: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com> References:

<-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com> Message-ID: On Fri, 28 Apr 2023 11:48:16 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with two additional commits since the last revision: >> >> - More fine-grained synchronization >> - Rework context ordering (round 2) >> >> * call afterRestore even if beforeCheckpoint throws >> * registering resource in previous/running context does not trigger exception immediatelly >> ** instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * we don't guarantee threads not deadlocking when trying to register a resource, though > > src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java line 42: > >> 40: for (Throwable t : suppressed) { >> 41: Core.recordException(t); >> 42: } > > Unwrap Checkpoint/RestoreException only? This is how it's actually used; can't use union type in Java. I'll add an assertion to make this clear... ------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1183542544 From akozlov at openjdk.org Wed May 3 11:28:38 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 3 May 2023 11:28:38 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore [v2] In-Reply-To: References:

Message-ID: On Wed, 3 May 2023 09:45:09 GMT, Radim Vansa wrote: >> When a JVM option is MANAGEABLE it can be set at any time during runtime, therefore it is safe to change it during the restore operation. Rather than silently ignoring JVM options passed along with -XX:CRaCRestoreFrom we send them to the restored process and either update or print a warning if given option cannot be changed. > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Use RESTORE_SETTABLE on JVM flags > > * Fail early when using non settable flags > * CracBuilder fix: don't use classpath during restore Thanks, looks mostly good except for a few nits. Do you have an idea why jdk/crac/recursiveCheckpoint/Test.java has failed? src/hotspot/share/runtime/globals.hpp line 57: > 55: > 56: // The optional extra_attrs parameter may have one of the following values: > 57: // DIAGNOSTIC, EXPERIMENTAL, MANAGEABLE and RESTORE_SETTABLE. Currently ` and` -> `, or` src/hotspot/share/runtime/globals.hpp line 2094: > 2092: "Trace optimized upcall stub generation") \ > 2093: \ > 2094: product(ccstr, CRaCCheckpointTo, NULL, MANAGEABLE, \ I see reasons for CRaCCheckpointTo to be MANAGEABLE, but at the moment the implementation assumes the flag is set on the command line, e.g. os::Linux::{vm_create_start,prepare_checkpoint} are called depending on the flag value, and that can happen only during VM initialization. A set of changes is required before the option can become MANAGEABLE. The test should also be updated once the option is reverted. src/hotspot/share/runtime/globals.hpp line 2129: > 2127: "Throw CheckpointException if uncheckpointable resource handle found")\ > 2128: \ > 2129: product(bool, CRTrace, true, MANAGEABLE, "Minimal C/R tracing") \ Was RESTORE_SETTABLE meant here? Please don't mix MANAGEABLE flags into this PR if that was intended. 
------------- PR Comment: https://git.openjdk.org/crac/pull/61#issuecomment-1532862561 PR Comment: https://git.openjdk.org/crac/pull/61#issuecomment-1532864110 PR Review Comment: https://git.openjdk.org/crac/pull/61#discussion_r1183538871 PR Review Comment: https://git.openjdk.org/crac/pull/61#discussion_r1183553692 PR Review Comment: https://git.openjdk.org/crac/pull/61#discussion_r1183485336 From akozlov at openjdk.org Wed May 3 11:28:40 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 3 May 2023 11:28:40 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore [v2] In-Reply-To: References:

Message-ID: On Fri, 28 Apr 2023 15:07:41 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Use RESTORE_SETTABLE on JVM flags >> >> * Fail early when using non settable flags >> * CracBuilder fix: don't use classpath during restore > > src/hotspot/os/linux/os_linux.cpp line 6579: > >> 6577: } >> 6578: if (result != JVMFlag::Error::SUCCESS) { >> 6579: warning("VM Option '%s' cannot be changed, ignoring: %s", > > A significant set of options cannot be set on restore at the moment, so it will be even better to highlight that they have no effect and produce an error. It may be useful to revert to a warning (with e.g. an option), but by default it should be disabled (leading to the error). The place can be `guarantee(result == JVMFlag::Error::SUCCESS, "...")` since the possibility was checked earlier. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/61#discussion_r1183558035 From duke at openjdk.org Wed May 3 11:28:40 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 3 May 2023 11:28:40 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore [v2] In-Reply-To: References:

Message-ID: On Wed, 3 May 2023 10:03:40 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Use RESTORE_SETTABLE on JVM flags >> >> * Fail early when using non settable flags >> * CracBuilder fix: don't use classpath during restore > > src/hotspot/share/runtime/globals.hpp line 2129: > >> 2127: "Throw CheckpointException if uncheckpointable resource handle found")\ >> 2128: \ >> 2129: product(bool, CRTrace, true, MANAGEABLE, "Minimal C/R tracing") \ > > RESTORE_SETTABLE was meant here? Please don't mix in MANAGEABLE flags into this PR if that was inteded. Looking for usages (actually only one) of the flag, it qualifies to be set at any time. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/61#discussion_r1183560669 From duke at openjdk.org Wed May 3 12:40:43 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 3 May 2023 12:40:43 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore [v2] In-Reply-To: References:

Message-ID: <8nZsMojvW82BDqhDm-UBSM4NVRXeLPeTseO1w93_U3w=.75769e3d-cabf-4b85-99e2-66271be65bf9@github.com> On Wed, 3 May 2023 11:17:17 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Use RESTORE_SETTABLE on JVM flags >> >> * Fail early when using non settable flags >> * CracBuilder fix: don't use classpath during restore > > src/hotspot/share/runtime/globals.hpp line 2094: > >> 2092: "Trace optimized upcall stub generation") \ >> 2093: \ >> 2094: product(ccstr, CRaCCheckpointTo, NULL, MANAGEABLE, \ > > I see reasons for CRaCCheckpointTo to be MANAGEABLE, but at the moment the flag is assumed to be set in the command line by the implementation, e.g. os::Linux::{vm_create_start,prepare_checkpoint} are called depending on the flag value, and that can happen only during VM initialization. > > A set of changes are required before the option can become MANAGEABLE. > > The test should also updated once the option is reverted. Well spotted, I was thinking about changing the path but these two places need to be called in order to prepare for a checkpoint. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/61#discussion_r1183629613 From duke at openjdk.org Wed May 3 14:01:49 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Wed, 3 May 2023 14:01:49 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v13] In-Reply-To: References: Message-ID: > Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. > > 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. > 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. 
https://sourceware.org/glibc/wiki/GNU_IFUNC > > (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: > >> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine > > It is intended to specify the lowest common denominator of all CPUs in a farm. Instead of a possible crash of OpenJDK it will now refuse to run: > >> Error occurred during initialization of VM >> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi > > (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. > > If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. 
In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there is a precedent OpenJDK upstream has already rejected such re-exec idea in the past: https://github.com/openjdk/crac/pull/31#issuecomment-1275707621 > > That IMO does not preclude trying the same for this case. > > - Debian 11 x86_64: It does not work, glibc is too different and inlined there. > - Debian 12 x86_64: It works even without libc6-dbg as its offsets are the default. > - Fedora 36 x86_64: It works as on Fedoras glibc debuginfo is embedded. Jan Kratochvil has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 80 commits: - Merge branch 'crac-altstack' into crac-altstack-tunables - Merge branch 'crac' into crac-altstack - Merge branch 'crac' into crac-altstack - 2b0f56b7: - ec18a208: - Fix the glibc SSE2 exception. - c446cae3: - Use CPU_FEATURE_ACTIVE. - Compatibility with old glibcs. - Fix a crash. - ... and 70 more: https://git.openjdk.org/crac/compare/ccf33231...9e6faf67 ------------- Changes: https://git.openjdk.org/crac/pull/41/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=41&range=12 Stats: 720 lines in 19 files changed: 697 ins; 3 del; 20 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41 From duke at openjdk.org Wed May 3 14:09:06 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Wed, 3 May 2023 14:09:06 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v13] In-Reply-To: References:

Message-ID: On Wed, 3 May 2023 14:01:49 GMT, Jan Kratochvil wrote: >> Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. >> >> 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. >> 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC >> >> (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: >> >>> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine >> >> It is intended to specify the lowest common denominator of all CPUs in a farm. Instead of a possible crash of OpenJDK it will now refuse to run: >> >>> Error occurred during initialization of VM >>> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi >> >> (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. 
It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. >> >> If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there is a precedent OpenJDK upstream has already rejected such re-exec idea in the past: https://github.com/openjdk/crac/pull/31#issuecomment-1275707621 >> >> That IMO does not preclude trying the same for this case. >> >> - Debian 11 x86_64: It does not work, glibc is too different and inlined there. >> - Debian 12 x86_64: It works even without libc6-dbg as its offsets are the default. >> - Fedora 36 x86_64: It works as on Fedoras glibc debuginfo is embedded. > > Jan Kratochvil has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 80 commits: > > - Merge branch 'crac-altstack' into crac-altstack-tunables > - Merge branch 'crac' into crac-altstack > - Merge branch 'crac' into crac-altstack > - 2b0f56b7: > - ec18a208: > - Fix the glibc SSE2 exception. > - c446cae3: > - Use CPU_FEATURE_ACTIVE. > - Compatibility with old glibcs. > - Fix a crash. > - ... 
and 70 more: https://git.openjdk.org/crac/compare/ccf33231...9e6faf67 Here is a variant which should be compatible with any glibc version using: **GLIBC_TUNABLES=:[glibc.cpu.hwcaps](https://www.gnu.org/software/libc/manual/html_node/Hardware-Capability-Tunables.html)=...** > `./build/linux-x86_64-server-slowdebug/jdk/bin/java -XX:CPUFeatures=generic -XX:+ShowCPUFeatures Hello.java` - On older glibcs not supporting macro `CPU_FEATURE_ACTIVE` the disabling of glibc features has no effect (and it may crash the migration even after using `-XX:CPUFeatures=generic`). - Also with newer glibcs than what the OpenJDK/CRaC sources support it may crash due to some new glibc feature not disabled by OpenJDK/CRaC. - I have setup (for me) tracking of [sysdeps/x86/sys/platform/x86.h](https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=sysdeps/x86/sys/platform/x86.h;hb=HEAD) and [sysdeps/x86/bits/platform/x86.h](https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=sysdeps/x86/bits/platform/x86.h;hb=HEAD). ------------- PR Comment: https://git.openjdk.org/crac/pull/41#issuecomment-1533091216 From duke at openjdk.org Thu May 4 07:17:54 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 4 May 2023 07:17:54 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore [v3] In-Reply-To: References: Message-ID: > When a JVM option is MANAGEABLE it can be set at any time during runtime, therefore it is safe to change it during the restore operation. Rather than silently ignoring JVM options passed along with -XX:CRaCRestoreFrom we send them to the restored process and either update or print a warning if given option cannot be changed. 
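For readers unfamiliar with manageable flags, here is a minimal sketch of the "update or print a warning" behaviour described above. It uses the standard `HotSpotDiagnosticMXBean` on a running HotSpot JVM and is not the CRaC restore path itself. The class and method names (`ManageableFlagSketch`, `updateOrWarn`) are made up for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;

final class ManageableFlagSketch {
    /**
     * Hypothetical helper: updates the flag if the JVM reports it as writeable
     * (i.e. manageable), otherwise prints a warning. Returns true on update.
     */
    static boolean updateOrWarn(String name, String value) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        VMOption option;
        try {
            option = bean.getVMOption(name);
        } catch (IllegalArgumentException e) {
            // getVMOption throws IllegalArgumentException for an unknown option.
            System.err.println("warning: unknown VM option " + name);
            return false;
        }
        if (!option.isWriteable()) {
            System.err.println("warning: VM option " + name + " cannot be changed at runtime");
            return false;
        }
        bean.setVMOption(name, value);
        return true;
    }

    public static void main(String[] args) {
        // HeapDumpOnOutOfMemoryError is a manageable flag in HotSpot; MaxHeapSize is not.
        System.out.println(updateOrWarn("HeapDumpOnOutOfMemoryError", "true"));
        System.out.println(updateOrWarn("MaxHeapSize", "1g"));
    }
}
```

Checking `isWriteable()` first keeps the warning path explicit instead of relying on the exception that `setVMOption` throws for a non-writeable option.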
Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Fixup ------------- Changes: - all: https://git.openjdk.org/crac/pull/61/files - new: https://git.openjdk.org/crac/pull/61/files/21ec5e80..75ce1b64 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=61&range=02 - incr: https://webrevs.openjdk.org/?repo=crac&pr=61&range=01-02 Stats: 16 lines in 3 files changed: 6 ins; 5 del; 5 mod Patch: https://git.openjdk.org/crac/pull/61.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/61/head:pull/61 PR: https://git.openjdk.org/crac/pull/61 From duke at openjdk.org Thu May 4 07:17:54 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 4 May 2023 07:17:54 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore [v2] In-Reply-To: References:

Message-ID: On Wed, 3 May 2023 11:26:13 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Use RESTORE_SETTABLE on JVM flags >> >> * Fail early when using non settable flags >> * CracBuilder fix: don't use classpath during restore > > Do you have an idea why jdk/crac/recursiveCheckpoint/Test.java has failed? @AntonKozlov Updated. The test failed rightfully, I've stopped adding the `CREngine` flag but that breaks the restore with non-default engine. Now made the flag RESTORE_SETTABLE. ------------- PR Comment: https://git.openjdk.org/crac/pull/61#issuecomment-1534200247 From duke at openjdk.org Thu May 4 07:17:54 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 4 May 2023 07:17:54 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore [v2] In-Reply-To: References:

<-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com> Message-ID: On Fri, 28 Apr 2023 11:57:03 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with two additional commits since the last revision: >> >> - More fine-grained synchronization >> - Rework context ordering (round 2) >> >> * call afterRestore even if beforeCheckpoint throws >> * registering resource in previous/running context does not trigger exception immediatelly >> ** instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * we don't guarantee threads not deadlocking when trying to register a resource, though > > src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java line 75: > >> 73: restoreQ = new ArrayList<>(); >> 74: runBeforeCheckpoint(); >> 75: Collections.reverse(restoreQ); > > Smelly code, restoreQ should be maintained either here or in runBeforeCheckpoint() Not really; the task of ACI subclass (OC, PC...) is to call `invokeBeforeCheckpoint` on some resources (ACI does not know which ones) in some order. The task of ACI is to remember the order of invocations and in `afterRestore` call this in a reversed order; the subclass does not need to know about any collection used for that. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1184634719 From duke at openjdk.org Thu May 4 09:54:05 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Thu, 4 May 2023 09:54:05 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v14] In-Reply-To: References: Message-ID: > Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. > > 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. > 2. 
A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC > > (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: > >> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine > > It is intended to specify the lowest common denominator of all CPUs in a farm. Instead of a possible crash of OpenJDK it will now refuse to run: > >> Error occurred during initialization of VM >> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi > > (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. 
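The three feature masks in the error message quoted above are related by plain bit arithmetic. The sketch below reproduces the numbers from that message. It assumes each bit stands for one CPU feature; the class and method names are hypothetical and not taken from the patch.

```java
final class CpuFeatureMaskSketch {
    /** Features recorded in the image that the restoring machine lacks. */
    static long missingFeatures(long imageFeatures, long machineFeatures) {
        return imageFeatures & ~machineFeatures;
    }

    /** Lowest common denominator of a farm: bits present on every machine. */
    static long commonDenominator(long... machineFeatures) {
        long common = ~0L; // start with all bits set, then intersect
        for (long features : machineFeatures) {
            common &= features;
        }
        return common;
    }

    public static void main(String[] args) {
        long image = 0x7fff9dfcfbf7L;   // mask stored in the checkpoint image
        long machine = 0x421801fcfbd7L; // mask of the machine doing the restore
        // prints 0x3de79c000020, the "missing features" value from the message
        System.out.println("0x" + Long.toHexString(missingFeatures(image, machine)));
        // prints 0x421801fcfbd7, the value the message asks to pass to -XX:CPUFeatures
        System.out.println("0x" + Long.toHexString(commonDenominator(image, machine)));
    }
}
```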
> > If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there is a precedent OpenJDK upstream has already rejected such re-exec idea in the past: https://github.com/openjdk/crac/pull/31#issuecomment-1275707621 > > That IMO does not preclude trying the same for this case. > > - Debian 11 x86_64: It does not work, glibc is too different and inlined there. > - Debian 12 x86_64: It works even without libc6-dbg as its offsets are the default. > - Fedora 36 x86_64: It works as on Fedoras glibc debuginfo is embedded. Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: -altstack ------------- Changes: - all: https://git.openjdk.org/crac/pull/41/files - new: https://git.openjdk.org/crac/pull/41/files/9e6faf67..b22bb537 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=13 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=12-13 Stats: 11 lines in 1 file changed: 0 ins; 9 del; 2 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41 From duke at openjdk.org Thu May 4 10:27:49 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 4 May 2023 10:27:49 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v4] In-Reply-To: References: Message-ID: <1xFtRyDqF-BREE5Vq1SSog1fbMG4y6JGjF1SlC4rBmQ=.8df4faf4-8e82-4d16-a638-9a64687ae1a9@github.com> > * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws > * allows to register a resource in a context that did not start 
beforeCheckpoint invocations yet > * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately > * instead this will be one of the recorded exceptions and the resource has a chance to fire next time > * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: - Fix javadoc and minor refactoring - Merge branch 'crac' into context_order - More fine-grained synchronization - Rework context ordering (round 2) * call afterRestore even if beforeCheckpoint throws * registering resource in previous/running context does not trigger exception immediatelly ** instead this will be one of the recorded exceptions and the resource has a chance to fire next time * we don't guarantee threads not deadlocking when trying to register a resource, though - Fix docs & package - Fix ordering of invocation on Resources * When Context.beforeCheckpoint throws, invoke Context.afterRestore anyway (otherwise some resources stay in suspended state). 
* Handle Resource.beforeCheckpoint triggering a registration of another resource ** Do not cause deadlock when registering from another thread ** Global resource can register JDKResource ** JDKResource can register resource with higher priority ** Other registrations are prohibited ------------- Changes: - all: https://git.openjdk.org/crac/pull/60/files - new: https://git.openjdk.org/crac/pull/60/files/1f2c7b39..eafdb841 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=60&range=03 - incr: https://webrevs.openjdk.org/?repo=crac&pr=60&range=02-03 Stats: 253 lines in 10 files changed: 97 ins; 102 del; 54 mod Patch: https://git.openjdk.org/crac/pull/60.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/60/head:pull/60 PR: https://git.openjdk.org/crac/pull/60 From duke at openjdk.org Thu May 4 13:20:45 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 4 May 2023 13:20:45 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v5] In-Reply-To: References: Message-ID: <6wYbr4YOdVQDkBozfKYt4XHUQ29BrybZZCX43Z_clng=.4015de97-0f1a-40b0-b7c6-0ced46df2fc3@github.com> > * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws > * allows to register a resource in a context that did not start beforeCheckpoint invocations yet > * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately > * instead this will be one of the recorded exceptions and the resource has a chance to fire next time > * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Fix javadoc build ------------- Changes: - all: https://git.openjdk.org/crac/pull/60/files - new: https://git.openjdk.org/crac/pull/60/files/eafdb841..33226eba Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=60&range=04 - 
incr: https://webrevs.openjdk.org/?repo=crac&pr=60&range=03-04 Stats: 4 lines in 2 files changed: 4 ins; 0 del; 0 mod Patch: https://git.openjdk.org/crac/pull/60.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/60/head:pull/60 PR: https://git.openjdk.org/crac/pull/60 From akozlov at openjdk.org Thu May 4 13:41:56 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 4 May 2023 13:41:56 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain Message-ID: If as a result of beforeCheckpoint() no more non-daemon threads remain, it's possible that VM exits prematurely, before one of afterRestore() get a chance to create another non-daemon thread that will keep VM alive. Triggering checkpoint via jcmd (so checkpointRestore() method is executed on daemon attach-listener thread), increases probability to step in the problem. The change ensures all notifications are done while there is at least one non-daemon thread. The notification methods are still called from the original thread. 
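The idea described above can be sketched roughly as follows: a non-daemon helper thread stays alive while the notifications run on the calling thread. This is only an illustration under assumed names (`KeepAliveSketch`, `runWithKeepAlive`), not the code from the pull request.

```java
import java.util.concurrent.CountDownLatch;

final class KeepAliveSketch {
    /** Runs the notifications on the calling thread while a non-daemon helper is alive. */
    static void runWithKeepAlive(Runnable notifications) throws InterruptedException {
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch finished = new CountDownLatch(1);
        Thread keepAlive = new Thread(() -> {
            started.countDown();
            try {
                finished.await(); // park until the notifications are done
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "keep-alive");
        // Explicit because a new thread inherits the daemon status of its creator,
        // and the caller may be a daemon thread (e.g. when triggered via jcmd).
        keepAlive.setDaemon(false);
        keepAlive.start();
        started.await();
        try {
            notifications.run(); // still executed on the original thread
        } finally {
            finished.countDown(); // release the helper
            keepAlive.join();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        runWithKeepAlive(() ->
                System.out.println("notifications on " + Thread.currentThread().getName()));
    }
}
```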
------------- Commit messages: - Fix copyright - Notify on the original thread - Ensure all notifications finish even if only daemon threads remain Changes: https://git.openjdk.org/crac/pull/62/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=62&range=00 Stats: 146 lines in 2 files changed: 136 ins; 3 del; 7 mod Patch: https://git.openjdk.org/crac/pull/62.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/62/head:pull/62 PR: https://git.openjdk.org/crac/pull/62 From duke at openjdk.org Thu May 4 15:42:55 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 4 May 2023 15:42:55 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain In-Reply-To: References: Message-ID: On Thu, 4 May 2023 13:34:47 GMT, Anton Kozlov wrote: > If as a result of beforeCheckpoint() no more non-daemon threads remain, it's possible that VM exits prematurely, before one of afterRestore() get a chance to create another non-daemon thread that will keep VM alive. Triggering checkpoint via jcmd (so checkpointRestore() method is executed on daemon attach-listener thread), increases probability to step in the problem. > > The change ensures all notifications are done while there is at least one non-daemon thread. The notification methods are still called from the original thread. The code is doing something (starting a thread) before the checkpoint, and after restore (join the thread). Rather than clobbering the 'general' C/R code with a fix for one issue, there is an interface perfectly suited to host this - a Resource. There's a catch, though - if we register it to the JDKContext this would be run *after* user Resources, and we need to run this *before*. Right now we don't have any means to order things this way. One solution could be reversing the JDKContext/GlobalContext relationship - JDKContext would be the parent, and would use a specific priority class for GlobalContext (something before `NORMAL`). 
And there would be one more priority class even before that, and this `KeepaliveResource` would be registered at it. src/java.base/share/classes/jdk/crac/Core.java line 254: > 252: // The notifications are done on the original thread. > 253: CountDownLatch start = new CountDownLatch(1); > 254: CountDownLatch finish = new CountDownLatch(1); Rather than 2 CountDownLatches you could use either a single `CyclicBarrier` (awaiting twice), or even better a `Phaser`, which has an API for non-interruptible wait. test/jdk/jdk/crac/DaemonAfterRestore.java line 82: > 80: System.out.println("worker thread finish"); > 81: }); > 82: workerThread.setDaemon(false); Unnecessary, a thread created from the main thread (non-daemon) is not a daemon either. test/jdk/jdk/crac/DaemonAfterRestore.java line 94: > 92: @Override > 93: public void beforeCheckpoint(Context context) throws Exception { > 94: finish.countDown(); It might be worth asserting that here we are running in a daemon thread. test/jdk/jdk/crac/DaemonAfterRestore.java line 98: > 96: @Override > 97: public void afterRestore(Context context) throws Exception { > 98: Thread.sleep(3000); Could we avoid extending the testsuite run by 3 seconds? I know we're trying to assert that 'nothing happens' rather than replacing waiting for an event by a timed wait. If we can't check that the destroy VM thread had a chance to work, I suggest at least reducing this (say 100ms?).
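To make the `Phaser` suggestion concrete, here is a rough sketch of the same keep-alive handshake with one two-party `Phaser` instead of two latches. `arriveAndAwaitAdvance()` does not throw `InterruptedException`, which is the non-interruptible wait mentioned above. The class and method names are hypothetical.

```java
import java.util.concurrent.Phaser;

final class PhaserKeepAliveSketch {
    /** Runs the notifications on the calling thread while a non-daemon helper is alive. */
    static void runWithKeepAlive(Runnable notifications) {
        Phaser phaser = new Phaser(2); // two parties: the caller and the helper
        Thread keepAlive = new Thread(() -> {
            phaser.arriveAndAwaitAdvance(); // phase 0: helper is running
            phaser.arriveAndAwaitAdvance(); // phase 1: wait for the notifications to end
        }, "keep-alive");
        keepAlive.setDaemon(false);
        keepAlive.start();
        phaser.arriveAndAwaitAdvance(); // phase 0: wait until the helper is running
        try {
            notifications.run();
        } finally {
            phaser.arriveAndAwaitAdvance(); // phase 1: release the helper
        }
    }

    public static void main(String[] args) {
        runWithKeepAlive(() -> System.out.println("notifications done"));
    }
}
```

The trade-off raised in the thread still applies: the barrier is reusable, so both sides call the same method and the two events are told apart only by phase order.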
------------- PR Review: https://git.openjdk.org/crac/pull/62#pullrequestreview-1413342072 PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1185159712 PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1185171160 PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1185175830 PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1185196948 From akozlov at openjdk.org Thu May 4 16:47:52 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 4 May 2023 16:47:52 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v5] In-Reply-To: <6wYbr4YOdVQDkBozfKYt4XHUQ29BrybZZCX43Z_clng=.4015de97-0f1a-40b0-b7c6-0ced46df2fc3@github.com> References: <6wYbr4YOdVQDkBozfKYt4XHUQ29BrybZZCX43Z_clng=.4015de97-0f1a-40b0-b7c6-0ced46df2fc3@github.com> Message-ID: <1z4j4m_Jk6Cb2QfjGxTFEJp0ah_pOsxxSfckjRvv8CI=.02a74c4c-87e5-4975-a528-245d3caa00c2@github.com> On Thu, 4 May 2023 13:20:45 GMT, Radim Vansa wrote: >> * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws >> * allows to register a resource in a context that did not start beforeCheckpoint invocations yet >> * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately >> * instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Fix javadoc build This change gets too complex with refactoring, API changes, documentation, and fixes. Let's simplify this. Regarding fix, I'd like the minimal change without refactorings, API changes, doc changes. I believe ACI changes are not necessary for that. The test is nice. 
src/java.base/share/classes/javax/crac/Core.java line 53: > 51: * reference to the resource - otherwise the garbage collector > 52: * is free to trash the resource and notifications on this resource > 53: * will not be invoked. Instead, highlight the rationale behind weak registration: it does not change the life cycle of the object, so any object may register with the Context without any implications other than notification. But if the object is not strongly reachable, it can be collected before the notification. src/java.base/share/classes/javax/crac/package-info.java line 87: > 85: *

When an exception is thrown during notificaion, it is caught by the {@code Context} and is suppressed by a {@code CheckpointException} or {@code RestoreException}, depends on the throwing method. > 86: *

> 87: *

<-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com>

<50suyCtjP8Ke3x6c2hjzSA-v-9sg2HjQNTuIJtrUYx8=.026b7b21-e44e-463d-af72-2ed789a44c41@github.com> Message-ID: On Tue, 2 May 2023 10:41:36 GMT, Radim Vansa wrote: >> Just realized that this is required for PriorityContext implementation. But that is the implementation of that class, it's wrong ACI has to care about that. > > Yes, it's a bit of enforced flexibility of the base class (through allowing to override a method), though it doesn't need to know about the use case. > > I could use just a list rather than sub-contexts but that would require duplicated code. That would be a cleaner approach compared to interdependencies of ACI-PC (details of PC leak to ACI) >>> The contexts are supposed to run all resources regardless of whether any failed, therefore there's no point in propagating and re-wrapping the exceptions. ... I don't see why would the parent context decide on the type - because it would get just CheckpointException with a list of other failures from components it does not understand. >> >> The parent Context may implement arbitrary handling (for example, unloading a component completely if that is throwing an exception). So throwing an exception is something useful. >> >> With that, the new Core.recordException is a completely new exception flow, that just optimizes something in the generic throw scheme. With that, the generic schemes should be something good enough already, we don't need to complicate the interface, the code,.. CheckpointExceptions are still exceptions, that is, we don't expect them often, there is no need to optimize them. > > It's not about optimization (in the sense of performance) but about removing code bloat, and the need for parent context to tediously copy failures (deciding whether something was a wrapper exception or the actual failure), when all you need to do in the end is to report them in bulk. It's not a new exception flow, it's removing the flow as there is no exception flow needed.
> > Your example is invalid: If a throwing resource is to be removed, it's the task of the parent context - and that one will see the exception. The parent context should not remove its child context since one of the N resources in the child context is failing. You're trading additional code for a complicated interface. What is the guidance on which to use: Core.recordException() vs throw new Exception()? These two very similar interfaces are a sign that we are trying to do something strange. > Your example is invalid: If a throwing resource is to be removed, it's the task of the parent context - and that one will see the exception. The parent context should not remove its child context since one of the N resources in the child context is failing. Could you elaborate? I'm lost on what "remove child context" means. I was referring to the parent context being able to handle a CheckpointException from the child context. >> I'm trying to describe class hierarchy and failing. The patch tries to reverse ACI-PC relation. ACI was for partially-ordered Resources (defined by a Comparator), and now it's for totally-ordered Resources (ordered by long). Trying to fit the partially-ordering PC as a subclass of the totally-ordering ACI feels unnatural. Can we have a cleaner hierarchy of the classes? > > No; ACI was totally-ordered in the previous version of the PR, but now it doesn't care about ordering at all; it's the abstract `runBeforeCheckpoint` that decides on the order. I see, thanks. So ACI's task is to track the beforeCheckpoint order and provide afterRestore in the opposite order; this seems the most useful part of it. But its interface does not seem to be very consistent. Does it really need an abstract base class that also maintains calling, logging and exception propagation? It would be nice to separate all these concerns for greater flexibility, e.g. with composition of some classes in a Context implementation, rather than extending the base class with all of them combined.
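The order-tracking part discussed above (remember the beforeCheckpoint order, replay afterRestore in reverse) can be sketched as a small standalone component. This is an illustration only: `Resource` here is a hypothetical stand-in rather than the CRaC interface, and collecting failures as suppressed exceptions of one wrapper is an assumption modelled on the thread, not the patch's code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ReverseOrderSketch {
    /** Hypothetical stand-in for a checkpoint/restore resource. */
    interface Resource {
        void beforeCheckpoint() throws Exception;
        void afterRestore() throws Exception;
    }

    private List<Resource> restoreQueue = new ArrayList<>();

    /** Calls every resource even if some fail, remembering the invocation order. */
    void beforeCheckpoint(List<Resource> inCallerOrder) throws Exception {
        List<Resource> invoked = new ArrayList<>();
        Exception failure = null;
        for (Resource r : inCallerOrder) {
            invoked.add(r); // recorded even on failure, so afterRestore is still called
            try {
                r.beforeCheckpoint();
            } catch (Exception e) {
                if (failure == null) failure = new Exception("checkpoint failed");
                failure.addSuppressed(e);
            }
        }
        Collections.reverse(invoked);
        restoreQueue = invoked;
        if (failure != null) throw failure;
    }

    /** Replays the recorded resources in the opposite order. */
    void afterRestore() throws Exception {
        Exception failure = null;
        for (Resource r : restoreQueue) {
            try {
                r.afterRestore();
            } catch (Exception e) {
                if (failure == null) failure = new Exception("restore failed");
                failure.addSuppressed(e);
            }
        }
        if (failure != null) throw failure;
    }

    /** Test helper: a resource that logs its calls and optionally fails before checkpoint. */
    static Resource logging(String name, List<String> log, boolean failBefore) {
        return new Resource() {
            @Override public void beforeCheckpoint() throws Exception {
                log.add("before " + name);
                if (failBefore) throw new IllegalStateException(name + " failed");
            }
            @Override public void afterRestore() {
                log.add("after " + name);
            }
        };
    }

    public static void main(String[] args) throws Exception {
        List<String> log = new ArrayList<>();
        ReverseOrderSketch tracker = new ReverseOrderSketch();
        try {
            tracker.beforeCheckpoint(List.of(
                    logging("a", log, false), logging("b", log, true), logging("c", log, false)));
        } catch (Exception e) {
            System.out.println(e.getMessage() + ", suppressed: " + e.getSuppressed().length);
        }
        tracker.afterRestore();
        System.out.println(log);
    }
}
```

Because the tracker only needs the list of invoked resources, a Context implementation could hold it by composition, which is the separation of concerns asked about above.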
------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1185090313 PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1185113136 PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1185223155 From akozlov at openjdk.org Thu May 4 17:17:46 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 4 May 2023 17:17:46 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v2] In-Reply-To: References: Message-ID: > If as a result of beforeCheckpoint() no more non-daemon threads remain, it's possible that VM exits prematurely, before one of afterRestore() get a chance to create another non-daemon thread that will keep VM alive. Triggering checkpoint via jcmd (so checkpointRestore() method is executed on daemon attach-listener thread), increases probability to step in the problem. > > The change ensures all notifications are done while there is at least one non-daemon thread. The notification methods are still called from the original thread. Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: Test update ------------- Changes: - all: https://git.openjdk.org/crac/pull/62/files - new: https://git.openjdk.org/crac/pull/62/files/398cc79e..1dd4e20e Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=62&range=01 - incr: https://webrevs.openjdk.org/?repo=crac&pr=62&range=00-01 Stats: 3 lines in 1 file changed: 3 ins; 0 del; 0 mod Patch: https://git.openjdk.org/crac/pull/62.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/62/head:pull/62 PR: https://git.openjdk.org/crac/pull/62 From akozlov at openjdk.org Thu May 4 17:25:46 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 4 May 2023 17:25:46 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v2] In-Reply-To: References:

Message-ID: On Thu, 4 May 2023 17:17:46 GMT, Anton Kozlov wrote: >> If as a result of beforeCheckpoint() no more non-daemon threads remain, it's possible that VM exits prematurely, before one of afterRestore() get a chance to create another non-daemon thread that will keep VM alive. Triggering checkpoint via jcmd (so checkpointRestore() method is executed on daemon attach-listener thread), increases probability to step in the problem. >> >> The change ensures all notifications are done while there is at least one non-daemon thread. The notification methods are still called from the original thread. > > Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Test update Thank you for the review. > Rather than clobbering the 'general' C/R code with a fix for one issue, there is an interface perfectly suited to host this - a Resource. Having the Resource abstraction does not mean that it is necessary to use it. Resources are suited for objects which may or may not exist and still need to receive notifications. Here we have no problem with doing something directly. So we don't need to rely on some implicit ordering, nor do we need to change the structure of the Contexts. ------------- PR Review: https://git.openjdk.org/crac/pull/62#pullrequestreview-1413543170 From akozlov at openjdk.org Thu May 4 17:25:48 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 4 May 2023 17:25:48 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v2] In-Reply-To: References:

Message-ID: <-d4Kyg0bXgZ03thUNHiuc4qbedXjxmevSfwwHgzejT4=.d716ad99-573c-46ff-bf75-e84c851936db@github.com> On Thu, 4 May 2023 15:04:11 GMT, Radim Vansa wrote: >> Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Test update > > src/java.base/share/classes/jdk/crac/Core.java line 254: > >> 252: // The notifications are done on the original thread. >> 253: CountDownLatch start = new CountDownLatch(1); >> 254: CountDownLatch finish = new CountDownLatch(1); > > Rather than 2 CountDownLatches you could use either a single `CyclicBarrier` (awaiting twice), or even better a `Phaser`, which has an API for non-interruptible wait. Since both of the suggested options are reusable, having separate events is cleaner and reduces the chance of accidentally awaiting when that was not intended. > test/jdk/jdk/crac/DaemonAfterRestore.java line 82: > >> 80: System.out.println("worker thread finish"); >> 81: }); >> 82: workerThread.setDaemon(false); > > Unnecessary, a thread created from the main thread (non-daemon) is not a daemon either. Otherwise there should be an assert, so it's more straightforward to set the mode. > test/jdk/jdk/crac/DaemonAfterRestore.java line 98: > >> 96: @Override >> 97: public void afterRestore(Context context) throws Exception { >> 98: Thread.sleep(3000); > > Could we avoid extending the testsuite run by 3 seconds? I know we're trying to assert that 'nothing happens' rather than replacing waiting for an event by a timed wait. If we can't check that the destroy VM thread had a chance to work, I suggest at least reducing this (say 100ms?). 100ms does not look like a good candidate (comparable to a single scheduling period). So 3 sec does not look too bad as a way to avoid an intermittent false positive caused by sleep() finishing before the VM has terminated.
------------- PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1185290868 PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1185291965 PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1185295106 From akozlov at openjdk.org Thu May 4 18:13:43 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 4 May 2023 18:13:43 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v14] In-Reply-To: References:

Message-ID: On Thu, 4 May 2023 09:54:05 GMT, Jan Kratochvil wrote: >> Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. >> >> 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. >> 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC >> >> (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: >> >>> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine >> >> It is intended to specify the lowest common denominator of all CPUs in a farm. Instead of a possible crash of OpenJDK it will now refuse to run: >> >>> Error occurred during initialization of VM >>> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi >> >> (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. 
It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. >> >> If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there is a precedent OpenJDK upstream has already rejected such re-exec idea in the past: https://github.com/openjdk/crac/pull/31#issuecomment-1275707621 >> >> That IMO does not preclude trying the same for this case. >> >> - Debian 11 x86_64: It does not work, glibc is too different and inlined there. >> - Debian 12 x86_64: It works even without libc6-dbg as its offsets are the default. >> - Fedora 36 x86_64: It works as on Fedoras glibc debuginfo is embedded. > > Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: > > -altstack > * On older glibcs not supporting macro `CPU_FEATURE_ACTIVE` the disabling of glibc features has no effect (and it may crash the migration even after using `-XX:CPUFeatures=generic`). Which GLIBC version supports the flag? We're used to building the JDK on some older platform and assuming that it will work on every newer platform. And it turns out that on the platform I use for the builds the option is not supported. Since it's required to specify TUNABLES in text form, can we just define the needed option names? I'll continue reviewing this PR.
src/hotspot/cpu/x86/vm_version_x86.cpp line 679: > 677: > 678: uint64_t disable_CPU = 0; > 679: uint64_t disable_GLIBC = 0; Are these used in the EXCESSIVEx macro? If so, could you please move them below, closer to their use? src/hotspot/cpu/x86/vm_version_x86.cpp line 717: > 715: if ((excessive_CPU & CPU_SSE3) || > 716: (excessive_GLIBC & (GLIBC_CMPXCHG16 | GLIBC_LAHFSAHF))) { > 717: assert(!(excessive_CPU & CPU_SSE4_2), "(_features & CPU_SSE4_2) cannot happen"); A failed assert prints the failed condition, so there is no need to repeat it in the message. src/hotspot/cpu/x86/vm_version_x86.cpp line 731: > 729: // glibc: && CPU_FEATURE_USABLE_P (cpu_features, FMA) > 730: // glibc: && CPU_FEATURE_USABLE_P (cpu_features, LZCNT) > 731: // glibc: && CPU_FEATURE_USABLE_P (cpu_features, MOVBE)) SKARA complains about this line. src/hotspot/cpu/x86/vm_version_x86.cpp line 767: > 765: #else > 766: # define IF_ASSERT(x) > 767: #endif This is exactly the definition of `DEBUG_ONLY`; please use that macro. src/hotspot/share/runtime/stubCodeGenerator.cpp line 62: > 60: void StubCodeDesc::thaw() { > 61: assert(_frozen, "repeated thaw operation"); > 62: _frozen = false; Is this still necessary? I tried commenting this line out, and checkpoint-restore succeeded for me. ------------- PR Review: https://git.openjdk.org/crac/pull/41#pullrequestreview-1413589870 PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1185335055 PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1185336541 PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1185347208 PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1185333573 PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1185318814 From duke at openjdk.org Thu May 4 20:32:49 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 4 May 2023 20:32:49 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v2] In-Reply-To: References:

Message-ID: On Thu, 4 May 2023 17:17:46 GMT, Anton Kozlov wrote: >> If as a result of beforeCheckpoint() no more non-daemon threads remain, it's possible that VM exits prematurely, before one of afterRestore() get a chance to create another non-daemon thread that will keep VM alive. Triggering checkpoint via jcmd (so checkpointRestore() method is executed on daemon attach-listener thread), increases probability to step in the problem. >> >> The change ensures all notifications are done while there is at least one non-daemon thread. The notification methods are still called from the original thread. > > Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Test update If not using a Resource, at least refactor that into a standalone (stateful) component, which may be called directly. The way it's written here mixes in some implementation details into a higher-level flow. ------------- PR Comment: https://git.openjdk.org/crac/pull/62#issuecomment-1535371199 From duke at openjdk.org Thu May 4 20:43:40 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 4 May 2023 20:43:40 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v5] In-Reply-To: <1z4j4m_Jk6Cb2QfjGxTFEJp0ah_pOsxxSfckjRvv8CI=.02a74c4c-87e5-4975-a528-245d3caa00c2@github.com> References: <6wYbr4YOdVQDkBozfKYt4XHUQ29BrybZZCX43Z_clng=.4015de97-0f1a-40b0-b7c6-0ced46df2fc3@github.com> <1z4j4m_Jk6Cb2QfjGxTFEJp0ah_pOsxxSfckjRvv8CI=.02a74c4c-87e5-4975-a528-245d3caa00c2@github.com> Message-ID: <64JIfZfl78X-cwOqXZrMb-NfHx4iSUEUmcqFK3F-imM=.8ea7d9ce-5b9c-4d81-9369-72c6102ac1aa@github.com> On Thu, 4 May 2023 13:59:20 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix javadoc build > > src/java.base/share/classes/javax/crac/package-info.java line 87: > >> 85: *

>> 87: *

<-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com>

<50suyCtjP8Ke3x6c2hjzSA-v-9sg2HjQNTuIJtrUYx8=.026b7b21-e44e-463d-af72-2ed789a44c41@github.com> Message-ID: On Thu, 4 May 2023 14:35:46 GMT, Anton Kozlov wrote: > You're trading additional code for a complicated interface. What are the directions what to use: Core.recordException() vs throw new Exception() ? These two very similar interfaces are a sign we are trying to do something strange. That's because of the non-standard requirement to continue after encountering an exception, and multiple rewraps in a hierarchy of contexts. Resources should normally throw exceptions; Contexts are here to call into the Resource and aggregate errors. There's no point in propagating the error higher up the hierarchy. > Could you elaborate?.. I'm lost what "remove child context" means. I was referring to the parent context being able to handle a CheckpointException from the child context. And why would the child context throw the CheckpointException? The most common reason would be that one of X resources in the child context has thrown. It's fine if a (child) Context removes the inner Resource after throwing. But that error should not propagate any higher, because if the parent context were to remove its (throwing) child context it would remove it along with the other X - 1 resources that are working correctly. >> No; ACI was totally-ordered in the previous version of the PR, but now it doesn't care about ordering at all; it's the abstract `runBeforeCheckpoint` that decides on the order. > > I see, thanks. So ACI task is to track beforeCheckpoint order and provide afterRestore in the opposite order, this seems the most useful part of it. But its interface does not seem to be very consistent. > > Does it really need an abstract base class that also maintains calling, logging and exception propagation? It will be nice to separate all these concerns for greater flexibility e.g. 
with composition of some classes in a Context implementation, rather than that to extend the base class with all of them combined. We can further fine-tune the separation when the need arises and we have a concrete example. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1185508424 PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1185510445 From duke at openjdk.org Thu May 4 21:35:41 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 4 May 2023 21:35:41 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v6] In-Reply-To: References: Message-ID: > * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws > * allows to register a resource in a context that did not start beforeCheckpoint invocations yet > * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately > * instead this will be one of the recorded exceptions and the resource has a chance to fire next time > * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Updated Global Context javadoc, removed semanticContext() ------------- Changes: - all: https://git.openjdk.org/crac/pull/60/files - new: https://git.openjdk.org/crac/pull/60/files/33226eba..53ccc062 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=60&range=05 - incr: https://webrevs.openjdk.org/?repo=crac&pr=60&range=04-05 Stats: 50 lines in 4 files changed: 36 ins; 1 del; 13 mod Patch: https://git.openjdk.org/crac/pull/60.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/60/head:pull/60 PR: https://git.openjdk.org/crac/pull/60 From duke at openjdk.org Thu May 4 21:52:50 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 4 May 2023 21:52:50 GMT Subject: [crac] RFR: Fix ordering of 
invocation on Resources [v7] In-Reply-To: References: Message-ID: > * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws > * allows to register a resource in a context that did not start beforeCheckpoint invocations yet > * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately > * instead this will be one of the recorded exceptions and the resource has a chance to fire next time > * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Add the missing javadoc ------------- Changes: - all: https://git.openjdk.org/crac/pull/60/files - new: https://git.openjdk.org/crac/pull/60/files/53ccc062..729a4537 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=60&range=06 - incr: https://webrevs.openjdk.org/?repo=crac&pr=60&range=05-06 Stats: 12 lines in 1 file changed: 11 ins; 0 del; 1 mod Patch: https://git.openjdk.org/crac/pull/60.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/60/head:pull/60 PR: https://git.openjdk.org/crac/pull/60 From duke at openjdk.org Thu May 4 21:54:47 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 4 May 2023 21:54:47 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v6] In-Reply-To: References:

Message-ID: On Thu, 4 May 2023 21:35:41 GMT, Radim Vansa wrote: >> * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws >> * allows to register a resource in a context that did not start beforeCheckpoint invocations yet >> * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately >> * instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Updated Global Context javadoc, removed semanticContext() I've reworded Global Context javadoc, added forgotten javadoc to `recordException`, and replaced calling `semanticContext()` with inlined versions of the `invokeBeforeCheckpoint` and `invokeAfterRestore` methods. All changes in this PR are related to contexts. I see little point in dosing doc changes individually; API changes are here to support actual implementation of the fix. ------------- PR Comment: https://git.openjdk.org/crac/pull/60#issuecomment-1535456836 From duke at openjdk.org Thu May 4 22:09:04 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 4 May 2023 22:09:04 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v7] In-Reply-To: References:

Message-ID: On Thu, 4 May 2023 21:52:50 GMT, Radim Vansa wrote: >> * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws >> * allows to register a resource in a context that did not start beforeCheckpoint invocations yet >> * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately >> * instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Add the missing javadoc There is one negligence in this PR and that is the lack of test for concurrent registration and notification in the JDK context. I've omitted that partly as this was happening reproducibly in the `newfd` branch, therefore I thought a synthetic test was not necessary. I could add it, though, to demonstrate the behaviour. ------------- PR Comment: https://git.openjdk.org/crac/pull/60#issuecomment-1535468906 From duke at openjdk.org Fri May 5 05:49:44 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 5 May 2023 05:49:44 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v3] In-Reply-To: References:

<-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com>

<50suyCtjP8Ke3x6c2hjzSA-v-9sg2HjQNTuIJtrUYx8=.026b7b21-e44e-463d-af72-2ed789a44c41@github.com>

Message-ID: On Thu, 4 May 2023 20:59:42 GMT, Radim Vansa wrote: >> You're trading additional code for a complicated interface. What are the directions what to use: Core.recordException() vs throw new Exception() ? These two very similar interfaces are a sign we are trying to do something strange. >> >>> Your example is invalid: If a throwing resource is to be removed, it's the task of the parent context - and that one will see the exception. The parent context should not remove its child context since one of the N resources in the child context is failing. >> >> Could you elaborate?.. I'm lost what "remove child context" means. I was referring to the parent context being able to handle a CheckpointException from the child context. > >> You're trading additional code for a complicated interface. What are the directions what to use: Core.recordException() vs throw new Exception() ? These two very similar interfaces are a sign we are trying to do something strange. > > That's because of the non-standard requirement to continue after encountering an exception, and multiple rewraps in a hierarchy of contexts. > Resources should normally throw exceptions; Contexts are here to call into the Resource and aggregate errors. There's no point in propagating the error higher up the hierarchy. > >> Could you elaborate?.. I'm lost what "remove child context" means. I was referring to the parent context being able to handle a CheckpointException from the child context. > > And why would the child context throw the CheckpointException? The most common reason would be that one of X resources in the child context has thrown. It's fine if a (child) Context removes the inner Resource after throwing. But that error should not propagate any higher, because if the parent context were to remove its (throwing) child context it would remove it along with the other X - 1 resources that are working correctly. 
I realized I haven't stressed enough one motivation for `Core.recordException`: when the context has finished its `beforeCheckpoint` (it won't be called again) and someone calls into this Context's `register` we cannot propagate the failure through throwing. So we cannot avoid an API change - and in my view it's natural to use the same method for reporting the failure during `beforeCheckpoint` execution, too. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1185718419 From akozlov at openjdk.org Fri May 5 10:54:54 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 5 May 2023 10:54:54 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v3] In-Reply-To: References: Message-ID: > If as a result of beforeCheckpoint() no more non-daemon threads remain, it's possible that VM exits prematurely, before one of afterRestore() get a chance to create another non-daemon thread that will keep VM alive. Triggering checkpoint via jcmd (so checkpointRestore() method is executed on daemon attach-listener thread), increases probability to step in the problem. > > The change ensures all notifications are done while there is at least one non-daemon thread. The notification methods are still called from the original thread. 
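The idea in the description above can be sketched roughly as follows. This is only an illustration under assumptions: `KeepAliveSketch` and its methods are hypothetical names and not the class from the PR. A non-daemon thread is parked for the duration of the notifications, so the VM cannot exit even if every other non-daemon thread has finished.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical illustration only; not the KeepAlive class from the PR.
final class KeepAliveSketch implements AutoCloseable {
    private final CountDownLatch done = new CountDownLatch(1);
    private final Thread thread;

    KeepAliveSketch() {
        thread = new Thread(() -> {
            try {
                done.await(); // park until close() is called
            } catch (InterruptedException e) {
                // Preserve the interrupted status instead of swallowing it.
                Thread.currentThread().interrupt();
            }
        }, "keep-alive-sketch");
        // Set explicitly: the creating thread may itself be a daemon
        // (e.g. an attach-listener thread), and the flag is inherited.
        thread.setDaemon(false);
        thread.start();
    }

    Thread thread() {
        return thread;
    }

    // Waits up to the given time for the helper thread to finish;
    // returns true if it has exited.
    boolean awaitExit(long millis) {
        try {
            thread.join(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !thread.isAlive();
    }

    @Override
    public void close() {
        done.countDown(); // lets the helper thread finish
    }
}
```

With such a helper, the notifications could run inside `try (KeepAliveSketch k = new KeepAliveSketch()) { ... }` on the original thread, which matches the statement that the notification methods are still called from the original thread.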
Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: Fix recursiveCheckpoint test ------------- Changes: - all: https://git.openjdk.org/crac/pull/62/files - new: https://git.openjdk.org/crac/pull/62/files/1dd4e20e..86fe7eb7 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=62&range=02 - incr: https://webrevs.openjdk.org/?repo=crac&pr=62&range=01-02 Stats: 47 lines in 2 files changed: 24 ins; 19 del; 4 mod Patch: https://git.openjdk.org/crac/pull/62.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/62/head:pull/62 PR: https://git.openjdk.org/crac/pull/62 From akozlov at openjdk.org Fri May 5 11:16:50 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 5 May 2023 11:16:50 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v4] In-Reply-To: References: Message-ID: > If as a result of beforeCheckpoint() no more non-daemon threads remain, it's possible that VM exits prematurely, before one of afterRestore() get a chance to create another non-daemon thread that will keep VM alive. Triggering checkpoint via jcmd (so checkpointRestore() method is executed on daemon attach-listener thread), increases probability to step in the problem. > > The change ensures all notifications are done while there is at least one non-daemon thread. The notification methods are still called from the original thread. 
Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: Cleanup ------------- Changes: - all: https://git.openjdk.org/crac/pull/62/files - new: https://git.openjdk.org/crac/pull/62/files/86fe7eb7..c2cfae5d Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=62&range=03 - incr: https://webrevs.openjdk.org/?repo=crac&pr=62&range=02-03 Stats: 46 lines in 1 file changed: 28 ins; 10 del; 8 mod Patch: https://git.openjdk.org/crac/pull/62.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/62/head:pull/62 PR: https://git.openjdk.org/crac/pull/62 From duke at openjdk.org Fri May 5 16:03:53 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 5 May 2023 16:03:53 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v4] In-Reply-To: References:

Message-ID: On Fri, 5 May 2023 11:16:50 GMT, Anton Kozlov wrote: >> If as a result of beforeCheckpoint() no more non-daemon threads remain, it's possible that VM exits prematurely, before one of afterRestore() get a chance to create another non-daemon thread that will keep VM alive. Triggering checkpoint via jcmd (so checkpointRestore() method is executed on daemon attach-listener thread), increases probability to step in the problem. >> >> The change ensures all notifications are done while there is at least one non-daemon thread. The notification methods are still called from the original thread. > > Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Cleanup src/java.base/share/classes/jdk/crac/Core.java line 287: > 285: try { > 286: keepAlive = new KeepAlive(); > 287: } catch (InterruptedException e) { Upon catching InterruptedException you should restore the thread's interrupted status. Any reason to use RuntimeException rather than CheckpointException? (Preferably explained in a comment.) ------------- PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1186258857 From duke at openjdk.org Fri May 5 16:03:53 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 5 May 2023 16:03:53 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v4] In-Reply-To: References:

Message-ID: On Fri, 5 May 2023 15:57:32 GMT, Radim Vansa wrote: >> Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Cleanup > > src/java.base/share/classes/jdk/crac/Core.java line 287: > >> 285: try { >> 286: keepAlive = new KeepAlive(); >> 287: } catch (InterruptedException e) { > > Upon catching InterruptedException you should restore the thread's interrupted status. > > Any reason to use RuntimeException rather than CheckpointException? (Preferably explained in a comment.) Also, if you're just rethrowing a runtime exception you could move the try-catch into the class. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1186262728 From duke at openjdk.org Fri May 5 16:03:54 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 5 May 2023 16:03:54 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v4] In-Reply-To: <-d4Kyg0bXgZ03thUNHiuc4qbedXjxmevSfwwHgzejT4=.d716ad99-573c-46ff-bf75-e84c851936db@github.com> References:

<-d4Kyg0bXgZ03thUNHiuc4qbedXjxmevSfwwHgzejT4=.d716ad99-573c-46ff-bf75-e84c851936db@github.com> Message-ID: On Thu, 4 May 2023 17:12:32 GMT, Anton Kozlov wrote: >> test/jdk/jdk/crac/DaemonAfterRestore.java line 82: >> >>> 80: System.out.println("worker thread finish"); >>> 81: }); >>> 82: workerThread.setDaemon(false); >> >> Unnecessary, thread created from main thread (non-daemon) is not a daemon either. > > Otherwise there should be an assert, so it's more straightforward to set the mode. I agree with the assert ------------- PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1186259922 From duke at openjdk.org Tue May 9 15:34:09 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 9 May 2023 15:34:09 GMT Subject: [crac] RFR: Improved open file descriptors tracking [v9] In-Reply-To: <2uXm_DAQyImefAYUr1Ujxgxwu8bP0gqzhCTfvWNbDzE=.5ce26d49-9bbd-4097-9138-689391145509@github.com> References:

<2uXm_DAQyImefAYUr1Ujxgxwu8bP0gqzhCTfvWNbDzE=.5ce26d49-9bbd-4097-9138-689391145509@github.com> Message-ID: On Tue, 2 May 2023 16:50:11 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Provide more information for file descriptors > > src/java.base/share/classes/java/io/FileDescriptor.java line 400: > >> 398: } else { >> 399: info = (path != null ? path : "unknown path") + " (" + (type != null ? type : "unknown") + ")"; >> 400: } > > This has too many socket-related details, and also a number of Java/native transitions that will be unavoidable if we adopt the proposed interface. I am thinking about this also in the context of `newfd-policies`, where we have to record not only the type & path but also things like current offset etc. Since this is used only in error handling, I wouldn't worry about the native transitions (or performance in general) too much. If you want, I could create a POJO to cross the native border only once. However this goes a bit against the principle that we want to handle as much as we can in Java rather than native code. That's also the reason why I opted for the formatting in Java even though it might be just as simple to do it in native code. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/43#discussion_r1188785933 From duke at openjdk.org Tue May 9 15:38:11 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 9 May 2023 15:38:11 GMT Subject: [crac] RFR: Improved open file descriptors tracking [v9] In-Reply-To: <2uXm_DAQyImefAYUr1Ujxgxwu8bP0gqzhCTfvWNbDzE=.5ce26d49-9bbd-4097-9138-689391145509@github.com> References:

<2uXm_DAQyImefAYUr1Ujxgxwu8bP0gqzhCTfvWNbDzE=.5ce26d49-9bbd-4097-9138-689391145509@github.com> Message-ID: On Tue, 2 May 2023 17:00:11 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Provide more information for file descriptors > > src/java.base/unix/native/libnet/SocketImpl.c line 115: > >> 113: return NULL; >> 114: } >> 115: } > > No need for `} else {` here and everywhere else since the previous block has anyway terminated the function. > > This will make the code more streamlined. > > Suggestion: > > } > > localAddr = create_isa(env, isa_class, isa_ctor, &local); > if (localAddr == NULL) { > JNU_ThrowOutOfMemoryError(env, "java.net.InetSocketAddres"); > return NULL; > } The code was supposed to mirror closer the remote part where we can't do that, but OK. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/43#discussion_r1188790339 From duke at openjdk.org Tue May 9 16:01:30 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 9 May 2023 16:01:30 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v8] In-Reply-To: References: Message-ID: <49t0sZrR1pRQ2NQLIqqx_MJ9ZKfjeTW6pSLkSX-4pd8=.cde19724-931c-4b95-af4d-17833947afb0@github.com> > * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws > * allows to register a resource in a context that did not start beforeCheckpoint invocations yet > * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately > * instead this will be one of the recorded exceptions and the resource has a chance to fire next time > * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Revert API change, force blocking registration ------------- 
Changes: - all: https://git.openjdk.org/crac/pull/60/files - new: https://git.openjdk.org/crac/pull/60/files/729a4537..868baeae Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=60&range=07 - incr: https://webrevs.openjdk.org/?repo=crac&pr=60&range=06-07 Stats: 338 lines in 8 files changed: 168 ins; 88 del; 82 mod Patch: https://git.openjdk.org/crac/pull/60.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/60/head:pull/60 PR: https://git.openjdk.org/crac/pull/60 From akozlov at openjdk.org Tue May 9 18:24:53 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 9 May 2023 18:24:53 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v8] In-Reply-To: <64JIfZfl78X-cwOqXZrMb-NfHx4iSUEUmcqFK3F-imM=.8ea7d9ce-5b9c-4d81-9369-72c6102ac1aa@github.com> References: <6wYbr4YOdVQDkBozfKYt4XHUQ29BrybZZCX43Z_clng=.4015de97-0f1a-40b0-b7c6-0ced46df2fc3@github.com> <1z4j4m_Jk6Cb2QfjGxTFEJp0ah_pOsxxSfckjRvv8CI=.02a74c4c-87e5-4975-a528-245d3caa00c2@github.com> <64JIfZfl78X-cwOqXZrMb-NfHx4iSUEUmcqFK3F-imM=.8ea7d9ce-5b9c-4d81-9369-72c6102ac1aa@github.com> Message-ID: On Thu, 4 May 2023 20:41:11 GMT, Radim Vansa wrote: >> src/java.base/share/classes/javax/crac/package-info.java line 87: >> >>> 85: *

>>> 87: *

When the {@code Resource} is a {@code Context} and it throws {@code CheckpointException} or {@code RestoreException}, exceptions suppressed by the original exception are suppressed by another {@code CheckpointException} or {@code RestoreException}, depends on the throwing method. >> >> This specification for child context was lost > > Intentionally. With reporting the exception centrally there's no reason to inform parent context about an error in a resource in child component. The reply is no longer valid. Please keep modifications in the javadoc to a minimum as well. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1188966993 From akozlov at openjdk.org Tue May 9 18:24:54 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 9 May 2023 18:24:54 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v8] In-Reply-To: <49t0sZrR1pRQ2NQLIqqx_MJ9ZKfjeTW6pSLkSX-4pd8=.cde19724-931c-4b95-af4d-17833947afb0@github.com> References: <49t0sZrR1pRQ2NQLIqqx_MJ9ZKfjeTW6pSLkSX-4pd8=.cde19724-931c-4b95-af4d-17833947afb0@github.com> Message-ID: On Tue, 9 May 2023 16:01:30 GMT, Radim Vansa wrote: >> * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws >> * allows to register a resource in a context that did not start beforeCheckpoint invocations yet >> * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately >> * instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Revert API change, force blocking registration src/java.base/share/classes/jdk/crac/Resource.java line 65: > 63: * resource throwing an exception when {@link #beforeCheckpoint(Context) > 
64: * beforeCheckpoint}. > 65: * Therefore, the resource should not have assumptions about it state; it The Resource can be sure that beforeCheckpoint was called, and that the object is exactly in the state in which beforeCheckpoint left it. src/java.base/share/classes/jdk/crac/impl/OrderedContext.java line 76: > 74: @Override > 75: public void afterRestore(Context context) throws RestoreException { > 76: // Note: a resource might attempt to Is this comment truncated? ------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1188970172 PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1188980196 From duke at openjdk.org Wed May 10 06:27:40 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 06:27:40 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v2] In-Reply-To: References:

Message-ID: On Thu, 4 May 2023 17:23:28 GMT, Anton Kozlov wrote: >> Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Test update > > Thank you for the review. > >> Rather than clobbering the 'general' C/R code with a fix for one issue, there is an interface perfectly suited to host this - a Resource. > > Having the Resource abstraction does not mean it is necessary to use it. Resources are suited for objects which may or may not exist and still need to receive notifications. Here we have no problems with doing something directly. So we don't need to rely on some implicit ordering, nor do we need to change the structure of Contexts. @AntonKozlov I've addressed the interrupts and moved KeepAlive to a separate impl class in https://github.com/rvansa/crac/tree/daemon-after-restore - could you ff-merge into your branch to avoid opening another PR? ------------- PR Comment: https://git.openjdk.org/crac/pull/62#issuecomment-1541421555 From duke at openjdk.org Wed May 10 07:11:35 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 07:11:35 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v9] In-Reply-To: References: Message-ID: > * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws > * allows to register a resource in a context that did not start beforeCheckpoint invocations yet > * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately > * instead this will be one of the recorded exceptions and the resource has a chance to fire next time > * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Update docs ------------- Changes: - all: https://git.openjdk.org/crac/pull/60/files - new: 
https://git.openjdk.org/crac/pull/60/files/868baeae..9e81179f Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=60&range=08 - incr: https://webrevs.openjdk.org/?repo=crac&pr=60&range=07-08 Stats: 16 lines in 4 files changed: 10 ins; 2 del; 4 mod Patch: https://git.openjdk.org/crac/pull/60.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/60/head:pull/60 PR: https://git.openjdk.org/crac/pull/60 From duke at openjdk.org Wed May 10 07:11:36 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 07:11:36 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v8] In-Reply-To: <49t0sZrR1pRQ2NQLIqqx_MJ9ZKfjeTW6pSLkSX-4pd8=.cde19724-931c-4b95-af4d-17833947afb0@github.com> References: <49t0sZrR1pRQ2NQLIqqx_MJ9ZKfjeTW6pSLkSX-4pd8=.cde19724-931c-4b95-af4d-17833947afb0@github.com> Message-ID: On Tue, 9 May 2023 16:01:30 GMT, Radim Vansa wrote: >> * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws >> * allows to register a resource in a context that did not start beforeCheckpoint invocations yet >> * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately >> * instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Revert API change, force blocking registration Updated docs, no code changes. 
------------- PR Comment: https://git.openjdk.org/crac/pull/60#issuecomment-1541463175 From duke at openjdk.org Wed May 10 07:11:38 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 07:11:38 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v8] In-Reply-To: References: <49t0sZrR1pRQ2NQLIqqx_MJ9ZKfjeTW6pSLkSX-4pd8=.cde19724-931c-4b95-af4d-17833947afb0@github.com> Message-ID: On Tue, 9 May 2023 18:14:31 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Revert API change, force blocking registration > > src/java.base/share/classes/jdk/crac/Resource.java line 65: > >> 63: * resource throwing an exception when {@link #beforeCheckpoint(Context) >> 64: * beforeCheckpoint}. >> 65: * Therefore, the resource should not have assumptions about it state; it > > A Resource can be sure that beforeCheckpoint was called, and that the object is exactly in the state in which beforeCheckpoint left it. I'll reword. The intent here was to guide the user in handling unexpected states, e.g. if beforeCheckpoint acquires a lock, afterRestore should not blindly unlock it. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1189445759 From duke at openjdk.org Wed May 10 09:05:38 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 09:05:38 GMT Subject: [crac] RFR: Support passing extra options in CREngine Message-ID: In addition to `-XX:CREngine=program`, this adds support for `-XX:CREngine=program,key=value,anotherkey`, which translates into invoking `program --key value --anotherkey`. This generic parameter support is used by `criuengine`, which accepts the `--verbosity` and `--log-file` options and relays them to `criu`. 
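The translation described above can be sketched in a few lines of Java. This is only an illustration of the mapping from the option string to an argument vector, not the code in the PR (the real parsing lives in the VM); the class name `CREngineOptions`, the method `toCommandLine`, and the sample values are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the "program,key=value,anotherkey" syntax.
class CREngineOptions {
    // Turns "program,key=value,anotherkey" into
    // ["program", "--key", "value", "--anotherkey"].
    static List<String> toCommandLine(String spec) {
        String[] parts = spec.split(",");
        List<String> argv = new ArrayList<>();
        argv.add(parts[0]);                      // the engine program itself
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i];
            if (part.isEmpty()) {
                continue;                        // ignore empty segments such as "a,,b"
            }
            int eq = part.indexOf('=');
            if (eq < 0) {
                argv.add("--" + part);           // bare key becomes a flag
            } else {
                argv.add("--" + part.substring(0, eq));
                argv.add(part.substring(eq + 1)); // value follows as a separate argument
            }
        }
        return argv;
    }

    public static void main(String[] args) {
        // Sample values are made up; only the option names come from the mail above.
        System.out.println(toCommandLine("criuengine,verbosity=4,log-file=/tmp/criu.log"));
    }
}
```

Only the first `=` in a segment separates key from value, so a value may itself contain `=`; whether the actual implementation behaves the same way is not stated in the mail.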
------------- Commit messages: - Support passing extra options in CREngine Changes: https://git.openjdk.org/crac/pull/63/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=63&range=00 Stats: 150 lines in 3 files changed: 124 ins; 5 del; 21 mod Patch: https://git.openjdk.org/crac/pull/63.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/63/head:pull/63 PR: https://git.openjdk.org/crac/pull/63 From duke at openjdk.org Wed May 10 09:38:50 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 09:38:50 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v10] In-Reply-To: References: Message-ID: > * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws > * allows to register a resource in a context that did not start beforeCheckpoint invocations yet > * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately > * instead this will be one of the recorded exceptions and the resource has a chance to fire next time > * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Remove mention of single-threaded notifications ------------- Changes: - all: https://git.openjdk.org/crac/pull/60/files - new: https://git.openjdk.org/crac/pull/60/files/9e81179f..286d820d Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=60&range=09 - incr: https://webrevs.openjdk.org/?repo=crac&pr=60&range=08-09 Stats: 10 lines in 2 files changed: 0 ins; 6 del; 4 mod Patch: https://git.openjdk.org/crac/pull/60.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/60/head:pull/60 PR: https://git.openjdk.org/crac/pull/60 From akozlov at openjdk.org Wed May 10 10:08:51 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 10 May 2023 10:08:51 GMT Subject: [crac] RFR: 
Fix ordering of invocation on Resources [v10] In-Reply-To: References:

Message-ID: <-gXSpBhareHQmkkgHmzZqGv010PEdhYW8C1HZjcirF4=.69e3174c-32dd-4d82-96c0-7d48ab169001@github.com> On Wed, 10 May 2023 09:38:50 GMT, Radim Vansa wrote: >> * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws >> * allows to register a resource in a context that did not start beforeCheckpoint invocations yet >> * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately >> * instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Remove mention of single-threaded notifications First iteration; I have not yet looked carefully at the doc and context changes. src/java.base/share/classes/jdk/internal/crac/JDKContext.java line 42: > 40: }; > 41: > 42: public JDKContext() { Why `public`? src/java.base/share/classes/jdk/internal/crac/LoggerContainer.java line 10: > 8: */ > 9: public class LoggerContainer { > 10: public static final System.Logger logger = System.getLogger("jdk.internal.crac"); Please keep `jdk.crac`: having the code in jdk.internal.crac is an implementation detail, but this name is a configuration interface for users. test/jdk/jdk/crac/LazyProps.java line 54: > 52: Core.getGlobalContext().register(resource); > 53: > 54: System.setProperty("jdk.crac.debug", "true"); The test was added in https://github.com/openjdk/crac/pull/12 to prevent problems when the properties are accessed before j.l.System is initialized. But the test checks that logging can be enabled as late as possible before checkpoint. Since logging is now enabled on the command line, the test makes little sense... 
------------- PR Review: https://git.openjdk.org/crac/pull/60#pullrequestreview-1420232750 PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1189650678 PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1189655234 PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1189666134 From akozlov at openjdk.org Wed May 10 10:40:43 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 10 May 2023 10:40:43 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v10] In-Reply-To: References:

Message-ID: On Wed, 10 May 2023 09:38:50 GMT, Radim Vansa wrote: >> * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws >> * allows to register a resource in a context that did not start beforeCheckpoint invocations yet >> * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately >> * instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Remove mention of single-threaded notifications Context changes are mostly good, except a small nit src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java line 58: > 56: } > 57: if (e.getMessage() != null) { > 58: ce.addSuppressed(e); What happens with exceptions suppressed by `e`? will we have the same set of exceptions suppressed by `ce` and `e`? ------------- PR Review: https://git.openjdk.org/crac/pull/60#pullrequestreview-1420308681 PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1189698760 From duke at openjdk.org Wed May 10 10:50:49 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 10:50:49 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v10] In-Reply-To: References:

Message-ID: <5PQTBR3qsRGJ0sTa0goQO-3MIHbUgEkl5406pgVKc_8=.496ff324-d3b3-46e6-97be-7f4c8cc78d8f@github.com> On Wed, 10 May 2023 09:38:50 GMT, Radim Vansa wrote: >> * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws >> * allows to register a resource in a context that did not start beforeCheckpoint invocations yet >> * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately >> * instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Remove mention of single-threaded notifications test/jdk/jdk/crac/ContextOrderTest.java line 54: > 52: System.setProperty("java.util.logging.config.file", Utils.TEST_SRC + "/logging.properties"); > 53: > 54: // testOrder(); Oops, noticed this got commented out. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1189726828 From duke at openjdk.org Wed May 10 10:56:41 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 10:56:41 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v10] In-Reply-To: References:

Message-ID: On Wed, 10 May 2023 10:22:55 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove mention of single-threaded notifications > > src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java line 58: > >> 56: } >> 57: if (e.getMessage() != null) { >> 58: ce.addSuppressed(e); > > What happens with exceptions suppressed by `e`? will we have the same set of exceptions suppressed by `ce` and `e`? Yes; there's no way to remove already suppressed exceptions, and the message is too valuable to lose it. I could create an exception with the same message but without suppressed exceptions - then I'd lose its stack trace, though. I don't see much issues in this duplication. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1189732758 From duke at openjdk.org Wed May 10 11:08:05 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 11:08:05 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v11] In-Reply-To: References: Message-ID: > * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws > * allows to register a resource in a context that did not start beforeCheckpoint invocations yet > * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately > * instead this will be one of the recorded exceptions and the resource has a chance to fire next time > * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though Radim Vansa has updated the pull request incrementally with three additional commits since the last revision: - Fix build - Minified the set of changes - Fixup ------------- Changes: - all: https://git.openjdk.org/crac/pull/60/files - new: https://git.openjdk.org/crac/pull/60/files/286d820d..4adad664 Webrevs: - full: 
https://webrevs.openjdk.org/?repo=crac&pr=60&range=10 - incr: https://webrevs.openjdk.org/?repo=crac&pr=60&range=09-10 Stats: 268 lines in 15 files changed: 62 ins; 178 del; 28 mod Patch: https://git.openjdk.org/crac/pull/60.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/60/head:pull/60 PR: https://git.openjdk.org/crac/pull/60 From duke at openjdk.org Wed May 10 11:14:00 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 11:14:00 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v12] In-Reply-To: References: Message-ID: > * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws > * allows to register a resource in a context that did not start beforeCheckpoint invocations yet > * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately > * instead this will be one of the recorded exceptions and the resource has a chance to fire next time > * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Make comparator private ------------- Changes: - all: https://git.openjdk.org/crac/pull/60/files - new: https://git.openjdk.org/crac/pull/60/files/4adad664..29272639 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=60&range=11 - incr: https://webrevs.openjdk.org/?repo=crac&pr=60&range=10-11 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/crac/pull/60.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/60/head:pull/60 PR: https://git.openjdk.org/crac/pull/60 From duke at openjdk.org Wed May 10 12:26:48 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 12:26:48 GMT Subject: [crac] RFR: Minor code cleanup and improvements Message-ID: Extracted non-essential changes from other PR. 
------------- Commit messages: - Minor code cleanup and improvements Changes: https://git.openjdk.org/crac/pull/64/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=64&range=00 Stats: 81 lines in 8 files changed: 47 ins; 16 del; 18 mod Patch: https://git.openjdk.org/crac/pull/64.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/64/head:pull/64 PR: https://git.openjdk.org/crac/pull/64 From duke at openjdk.org Wed May 10 12:49:03 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 12:49:03 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v13] In-Reply-To: References: Message-ID: > * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws > * allows to register a resource in a context that did not start beforeCheckpoint invocations yet > * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately > * instead this will be one of the recorded exceptions and the resource has a chance to fire next time > * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Revert removing the logging configuration ------------- Changes: - all: https://git.openjdk.org/crac/pull/60/files - new: https://git.openjdk.org/crac/pull/60/files/29272639..841d0989 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=60&range=12 - incr: https://webrevs.openjdk.org/?repo=crac&pr=60&range=11-12 Stats: 4 lines in 1 file changed: 4 ins; 0 del; 0 mod Patch: https://git.openjdk.org/crac/pull/60.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/60/head:pull/60 PR: https://git.openjdk.org/crac/pull/60 From duke at openjdk.org Wed May 10 12:49:14 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 12:49:14 GMT Subject: [crac] RFR: Refactored javadocs with 
additional details Message-ID: Improve API documentation. ------------- Commit messages: - Refactored javadocs with additional details Changes: https://git.openjdk.org/crac/pull/65/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=65&range=00 Stats: 160 lines in 6 files changed: 105 ins; 46 del; 9 mod Patch: https://git.openjdk.org/crac/pull/65.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/65/head:pull/65 PR: https://git.openjdk.org/crac/pull/65 From akozlov at openjdk.org Wed May 10 13:14:53 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 10 May 2023 13:14:53 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v13] In-Reply-To: References:

Message-ID: <0xqgNmVgOMlldlClyg_1cfS-cyPC4as5CRyhdSJUwgU=.3e67c52e-0f7c-4572-9d97-dfb69a0f45ff@github.com> On Wed, 10 May 2023 10:54:03 GMT, Radim Vansa wrote: >> src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java line 58: >> >>> 56: } >>> 57: if (e.getMessage() != null) { >>> 58: ce.addSuppressed(e); >> >> What happens with exceptions suppressed by `e`? will we have the same set of exceptions suppressed by `ce` and `e`? > > Yes; there's no way to remove already suppressed exceptions, and the message is too valuable to lose it. I could create an exception with the same message but without suppressed exceptions - then I'd lose its stack trace, though. > I don't see much issues in this duplication. It seems we should have allowed the message in the CheckpointException, to avoid this problem. The only place where CheckpointException with the message is called is recursive checkpoint detection from https://github.com/openjdk/crac/pull/6. OK for this PR. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1189889412 From duke at openjdk.org Wed May 10 13:50:54 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 13:50:54 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v10] In-Reply-To: <0xqgNmVgOMlldlClyg_1cfS-cyPC4as5CRyhdSJUwgU=.3e67c52e-0f7c-4572-9d97-dfb69a0f45ff@github.com> References:

<0xqgNmVgOMlldlClyg_1cfS-cyPC4as5CRyhdSJUwgU=.3e67c52e-0f7c-4572-9d97-dfb69a0f45ff@github.com> Message-ID: On Wed, 10 May 2023 13:10:24 GMT, Anton Kozlov wrote: >> Yes; there's no way to remove already suppressed exceptions, and the message is too valuable to lose it. I could create an exception with the same message but without suppressed exceptions - then I'd lose its stack trace, though. >> I don't see much issues in this duplication. > > It seems we should have allowed the message in the CheckpointException, to avoid this problem. The only place where CheckpointException with the message is called is recursive checkpoint detection from https://github.com/openjdk/crac/pull/6. > > OK for this PR. That's not entirely true; CheckpointException is a public API and Resources can use it. In fact I find it quite natural to use for reporting any 'generic' error during checkpoint; if we want to limit its usage to internal code we need to make the constructors package-private. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1189940797 From akozlov at openjdk.org Wed May 10 14:50:53 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 10 May 2023 14:50:53 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v5] In-Reply-To: References: Message-ID: <1h2s-b7sop_69bx5PbavykMnj1upfRIfczQOx4Ylwtg=.7ad2d784-5ccf-41a1-96f8-2e6ccd72ca48@github.com> > If as a result of beforeCheckpoint() no more non-daemon threads remain, it's possible that VM exits prematurely, before one of afterRestore() get a chance to create another non-daemon thread that will keep VM alive. Triggering checkpoint via jcmd (so checkpointRestore() method is executed on daemon attach-listener thread), increases probability to step in the problem. > > The change ensures all notifications are done while there is at least one non-daemon thread. The notification methods are still called from the original thread. 
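The idea in the description above (hold one non-daemon thread for the duration of the notifications, while still running them on the calling thread) can be sketched roughly as follows. This is not the `KeepAlive` class from the PR; the class name `KeepAliveSketch`, its methods, and the latch-based waiting are assumptions chosen for illustration.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch: a non-daemon thread that does nothing except exist
// until the notifications have finished, so the VM cannot exit in between.
class KeepAliveSketch implements AutoCloseable {
    private final CountDownLatch done = new CountDownLatch(1);
    private final Thread thread;

    KeepAliveSketch() {
        thread = new Thread(() -> {
            boolean interrupted = false;
            while (true) {
                try {
                    done.await();          // block until close() is called
                    break;
                } catch (InterruptedException e) {
                    interrupted = true;    // an interrupt must not end the keep-alive early
                }
            }
            if (interrupted) {
                Thread.currentThread().interrupt(); // restore the interrupt status
            }
        }, "keep-alive-sketch");
        thread.setDaemon(false);           // the whole point: this thread is non-daemon
        thread.start();
    }

    boolean isAlive()  { return thread.isAlive(); }
    boolean isDaemon() { return thread.isDaemon(); }

    @Override
    public void close() {
        done.countDown();                  // let the keep-alive thread terminate
    }

    // Runs the notifications on the *calling* thread, guarded by the keep-alive.
    static void runWithKeepAlive(Runnable notifications) {
        try (KeepAliveSketch ignored = new KeepAliveSketch()) {
            notifications.run();
        }
    }
}
```

A caller on a daemon thread (such as the attach-listener thread mentioned above) would wrap the beforeCheckpoint/afterRestore round in `runWithKeepAlive(...)`; by the time the keep-alive thread is released, any non-daemon thread created in afterRestore already exists.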
Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: Move KeepAlive to separate class, handle interrupts ------------- Changes: - all: https://git.openjdk.org/crac/pull/62/files - new: https://git.openjdk.org/crac/pull/62/files/c2cfae5d..a3b1242e Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=62&range=04 - incr: https://webrevs.openjdk.org/?repo=crac&pr=62&range=03-04 Stats: 132 lines in 3 files changed: 75 ins; 52 del; 5 mod Patch: https://git.openjdk.org/crac/pull/62.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/62/head:pull/62 PR: https://git.openjdk.org/crac/pull/62 From akozlov at openjdk.org Wed May 10 14:50:53 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 10 May 2023 14:50:53 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v4] In-Reply-To: References:

Message-ID: On Fri, 5 May 2023 11:16:50 GMT, Anton Kozlov wrote: >> If as a result of beforeCheckpoint() no more non-daemon threads remain, it's possible that VM exits prematurely, before one of afterRestore() get a chance to create another non-daemon thread that will keep VM alive. Triggering checkpoint via jcmd (so checkpointRestore() method is executed on daemon attach-listener thread), increases probability to step in the problem. >> >> The change ensures all notifications are done while there is at least one non-daemon thread. The notification methods are still called from the original thread. > > Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Cleanup Thanks for the follow-up! I've added the commit to this PR. ------------- PR Comment: https://git.openjdk.org/crac/pull/62#issuecomment-1542330092 From akozlov at openjdk.org Wed May 10 15:06:45 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 10 May 2023 15:06:45 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore [v2] In-Reply-To: References:

Message-ID: On Wed, 3 May 2023 11:25:11 GMT, Radim Vansa wrote: >> src/hotspot/share/runtime/globals.hpp line 2129: >> >>> 2127: "Throw CheckpointException if uncheckpointable resource handle found")\ >>> 2128: \ >>> 2129: product(bool, CRTrace, true, MANAGEABLE, "Minimal C/R tracing") \ >> >> RESTORE_SETTABLE was meant here? Please don't mix in MANAGEABLE flags into this PR if that was intended. > > Looking for usages (actually only one) of the flag, it qualifies to be set at any time. // MANAGEABLE flags are writeable external product flags. // They are dynamically writeable through the JDK management interface // (com.sun.management.HotSpotDiagnosticMXBean API) and also through JConsole. // These flags are external exported interface (see CCC). The list of // manageable flags can be queried programmatically through the management // interface. Manageable does not mean "can be set at any time". All product flags are part of the VM interface, but MANAGEABLE is stricter. Actually, the flag should be eliminated and replaced with Unified Logging, so let's just set RESTORE_SETTABLE for this. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/61#discussion_r1190047188 From akozlov at openjdk.org Wed May 10 15:21:58 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 10 May 2023 15:21:58 GMT Subject: git: openjdk/crac: crac: Fix ordering of invocation on Resources Message-ID: <3d284819-82c1-4d44-8563-d8481f190898@openjdk.org> Changeset: ef2437e7 Author: Radim Vansa Committer: Anton Kozlov Date: 2023-05-10 15:21:01 +0000 URL: https://git.openjdk.org/crac/commit/ef2437e7aaaabcbb58366eb84efbb7ebe5934c1f Fix ordering of invocation on Resources Reviewed-by: akozlov ! src/java.base/share/classes/javax/crac/Core.java ! src/java.base/share/classes/jdk/crac/Core.java ! src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java ! 
src/java.base/share/classes/jdk/crac/impl/OrderedContext.java + src/java.base/share/classes/jdk/crac/impl/PriorityContext.java ! src/java.base/share/classes/jdk/internal/crac/JDKContext.java + src/java.base/share/classes/jdk/internal/crac/LoggerContainer.java ! src/java.base/share/classes/jdk/internal/util/jar/PersistentJarFile.java + test/jdk/jdk/crac/ContextOrderTest.java - test/jdk/jdk/crac/LazyProps.java - test/jdk/jdk/crac/ResourceTest.java + test/jdk/jdk/crac/logging.properties From duke at openjdk.org Wed May 10 15:23:25 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 15:23:25 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore [v4] In-Reply-To: References: Message-ID: > When a JVM option is MANAGEABLE it can be set at any time during runtime, therefore it is safe to change it during the restore operation. Rather than silently ignoring JVM options passed along with -XX:CRaCRestoreFrom we send them to the restored process and either update or print a warning if given option cannot be changed. 
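For readers unfamiliar with MANAGEABLE flags: on a HotSpot-based JDK they can be listed through `com.sun.management.HotSpotDiagnosticMXBean`, the management interface named in the globals.hpp comment quoted earlier in this thread. The sketch below only shows that standard query; it is unrelated to how the restore path in this PR delivers options to the restored process, and the class name `ManageableFlags` is made up for the example.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper listing the flags that the management interface
// reports as dynamically writeable (assumes a HotSpot-based JDK).
class ManageableFlags {
    static List<VMOption> manageable() {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // getDiagnosticOptions() returns the writeable (manageable) VM options.
        return bean.getDiagnosticOptions();
    }

    public static void main(String[] args) {
        for (VMOption option : manageable()) {
            System.out.println(option.getName() + " = " + option.getValue()
                    + " (writeable: " + option.isWriteable() + ")");
        }
    }
}
```

The same bean offers `setVMOption(name, value)` for changing such a flag at run time, which is the "can be set dynamically" property the discussion refers to.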
Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Make CRTrace RESTORE_SETTABLE rather than MANAGEABLE ------------- Changes: - all: https://git.openjdk.org/crac/pull/61/files - new: https://git.openjdk.org/crac/pull/61/files/75ce1b64..920b3be8 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=61&range=03 - incr: https://webrevs.openjdk.org/?repo=crac&pr=61&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/crac/pull/61.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/61/head:pull/61 PR: https://git.openjdk.org/crac/pull/61 From akozlov at openjdk.org Wed May 10 15:24:50 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 10 May 2023 15:24:50 GMT Subject: [crac] RFR: Minor code cleanup and improvements In-Reply-To: References: Message-ID: <_AMDGDW7jBHD0sWCap1QKdw3fpk8l4yG8SquNCZdRRY=.efb87be3-45f8-4c99-b470-b7cba2669ed7@github.com> On Wed, 10 May 2023 12:20:07 GMT, Radim Vansa wrote: > Extracted non-essential changes from other PR. src/java.base/share/classes/javax/crac/CheckpointException.java line 50: > 48: * @param message the detail message. > 49: */ > 50: public CheckpointException(String message) { What if we remove this constructor and hide the other one? https://github.com/openjdk/crac/pull/60/files#r1190070472 ------------- PR Review Comment: https://git.openjdk.org/crac/pull/64#discussion_r1190072341 From akozlov at openjdk.org Wed May 10 15:25:56 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 10 May 2023 15:25:56 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v10] In-Reply-To: References:

<0xqgNmVgOMlldlClyg_1cfS-cyPC4as5CRyhdSJUwgU=.3e67c52e-0f7c-4572-9d97-dfb69a0f45ff@github.com> Message-ID: On Wed, 10 May 2023 13:48:08 GMT, Radim Vansa wrote: >> It seems we should have allowed the message in the CheckpointException, to avoid this problem. The only place where CheckpointException with the message is called is recursive checkpoint detection from https://github.com/openjdk/crac/pull/6. >> >> OK for this PR. > > That's not entirely true; CheckpointException is a public API and Resources can use it. In fact I find it quite natural to use for reporting any 'generic' error during checkpoint; if we want to limit its usage to internal code we need to make the constructors package-private. Hiding all constructors makes sense. I think it would be helpful to make CheckpointException, CheckpointResourceExceptions and their descendants something formal, something that can be programmatically queried for the problem. I propose to investigate the possibility in #64, which touches related pieces. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1190070472 From duke at openjdk.org Wed May 10 15:26:00 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 15:26:00 GMT Subject: [crac] Integrated: Fix ordering of invocation on Resources In-Reply-To: References: Message-ID: On Fri, 21 Apr 2023 09:01:07 GMT, Radim Vansa wrote: > * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws > * allows to register a resource in a context that did not start beforeCheckpoint invocations yet > * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately > * instead this will be one of the recorded exceptions and the resource has a chance to fire next time > * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though This pull request has now been integrated. 
Changeset: ef2437e7 Author: Radim Vansa Committer: Anton Kozlov URL: https://git.openjdk.org/crac/commit/ef2437e7aaaabcbb58366eb84efbb7ebe5934c1f Stats: 885 lines in 12 files changed: 622 ins; 203 del; 60 mod Fix ordering of invocation on Resources Reviewed-by: akozlov ------------- PR: https://git.openjdk.org/crac/pull/60 From akozlov at openjdk.org Wed May 10 15:40:44 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 10 May 2023 15:40:44 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore [v4] In-Reply-To: References:

Message-ID: On Wed, 10 May 2023 15:23:25 GMT, Radim Vansa wrote: >> When a JVM option is MANAGEABLE it can be set at any time during runtime, therefore it is safe to change it during the restore operation. Rather than silently ignoring JVM options passed along with -XX:CRaCRestoreFrom we send them to the restored process and either update or print a warning if given option cannot be changed. > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Make CRTrace RESTORE_SETTABLE rather than MANAGEABLE Thank you! LGTM ------------- Marked as reviewed by akozlov (Lead). PR Review: https://git.openjdk.org/crac/pull/61#pullrequestreview-1420925330 From duke at openjdk.org Wed May 10 16:06:45 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 16:06:45 GMT Subject: [crac] RFR: Improved open file descriptors tracking [v10] In-Reply-To: References: Message-ID: > Tracks `java.io.FileDescriptor` instances as CRaC resource; before checkpoint these are reported and if not allow-listed (e.g. as opened as standard descriptors) an exception is thrown. Further investigation can use system property `jdk.crac.collect-fd-stacktraces=true` to record origin of those file descriptors. > File descriptors claimed in Java code are passed to native; native code checks all open file descriptors and reports error if there's an unexpected FD that is not included in the list passed previously. Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 51 commits: - Merge remote-tracking branch 'origin/crac' into newfd - Fixup - Merge branch 'context_order' into newfd - Revert removing the logging configuration - Make comparator private - Fix build - Minified the set of changes - Fixup - Remove mention of single-threaded notifications - Update docs - ... 
and 41 more: https://git.openjdk.org/crac/compare/ef2437e7...8fd3566c ------------- Changes: https://git.openjdk.org/crac/pull/43/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=43&range=09 Stats: 999 lines in 33 files changed: 682 ins; 276 del; 41 mod Patch: https://git.openjdk.org/crac/pull/43.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/43/head:pull/43 PR: https://git.openjdk.org/crac/pull/43 From duke at openjdk.org Wed May 10 16:06:45 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 16:06:45 GMT Subject: [crac] RFR: Improved open file descriptors tracking [v9] In-Reply-To: References:

Message-ID: <-aDBFxAmClx32o8FzumSOu80J2TaWlZO7bOTEr6hXHE=.85a43788-8ea4-4920-8c83-9a7d9e6e0c17@github.com> On Mon, 24 Apr 2023 13:27:30 GMT, Radim Vansa wrote: >> Tracks `java.io.FileDescriptor` instances as CRaC resource; before checkpoint these are reported and if not allow-listed (e.g. as opened as standard descriptors) an exception is thrown. Further investigation can use system property `jdk.crac.collect-fd-stacktraces=true` to record origin of those file descriptors. >> File descriptors claimed in Java code are passed to native; native code checks all open file descriptors and reports error if there's an unexpected FD that is not included in the list passed previously. > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Provide more information for file descriptors Converting to draft until #60 gets integrated. Also note that RefQueueTest is failing (at least on my machine) because of race documented in that test. ------------- PR Comment: https://git.openjdk.org/crac/pull/43#issuecomment-1540461583 From akozlov at openjdk.org Wed May 10 17:44:44 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 10 May 2023 17:44:44 GMT Subject: [crac] RFR: Improved open file descriptors tracking [v10] In-Reply-To: References:

Message-ID: On Wed, 10 May 2023 16:06:45 GMT, Radim Vansa wrote: >> Tracks `java.io.FileDescriptor` instances as CRaC resource; before checkpoint these are reported and if not allow-listed (e.g. as opened as standard descriptors) an exception is thrown. Further investigation can use system property `jdk.crac.collect-fd-stacktraces=true` to record origin of those file descriptors. >> File descriptors claimed in Java code are passed to native; native code checks all open file descriptors and reports error if there's an unexpected FD that is not included in the list passed previously. > > Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 51 commits: > > - Merge remote-tracking branch 'origin/crac' into newfd > - Fixup > - Merge branch 'context_order' into newfd > - Revert removing the logging configuration > - Make comparator private > - Fix build > - Minified the set of changes > - Fixup > - Remove mention of single-threaded notifications > - Update docs > - ... and 41 more: https://git.openjdk.org/crac/compare/ef2437e7...8fd3566c I have a follow-up change based on this, but it's rather massive. So I propose to integrate this, and I will create another PR immediately. Sounds good? ------------- Marked as reviewed by akozlov (Lead). 
PR Review: https://git.openjdk.org/crac/pull/43#pullrequestreview-1421118278 From duke at openjdk.org Thu May 11 06:16:13 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 11 May 2023 06:16:13 GMT Subject: [crac] RFR: Minor code cleanup and improvements In-Reply-To: <_AMDGDW7jBHD0sWCap1QKdw3fpk8l4yG8SquNCZdRRY=.efb87be3-45f8-4c99-b470-b7cba2669ed7@github.com> References: <_AMDGDW7jBHD0sWCap1QKdw3fpk8l4yG8SquNCZdRRY=.efb87be3-45f8-4c99-b470-b7cba2669ed7@github.com> Message-ID: <_HutEHb0_FgiGAmImfmICJMIl2KJX8IlDVwpxTdAw28=.db1c4633-74a9-462f-a809-7ef3c4cc236f@github.com> On Wed, 10 May 2023 15:21:57 GMT, Anton Kozlov wrote: >> Extracted non-essential changes from other PR. > > src/java.base/share/classes/javax/crac/CheckpointException.java line 50: > >> 48: * @param message the detail message. >> 49: */ >> 50: public CheckpointException(String message) { > > What if we remove this constructor and hide the other one? https://github.com/openjdk/crac/pull/60/files#r1190070472 First of all, I think that `javax.crac` should mirror `jdk.crac` API- and docs-wise. It will be much easier if everyone can just change the imports. About the constructor with a message: I find it a bit confusing when an exception is thrown because of some problem with `criu` but there's no actionable message. I have added a simple 'Native checkpoint failed' but we should probably point the user to the dump4.log file. (`criuengine` should also make some sanity checks on permissions but that's another thing). I wouldn't object to hiding it, though. About the one without a message: `Context` narrows the `throws` to `CheckpointException`/`RestoreException` (CE/RE), and since we expect users to implement Context, hiding it would give them no chance to throw checked exceptions, not even the aggregating one. Hiding that won't work. 
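To illustrate the last point above, here is a minimal sketch of a user-written context that has to construct the aggregating exception itself. It is an assumption-laden illustration, not code from the PR: `CheckpointException`, `Resource` and `UserContext` below are simplified, hypothetical stand-ins declared locally so the example runs on a stock JDK (the real `jdk.crac` interfaces take a `Context` argument and also cover restore).

```java
import java.util.ArrayList;
import java.util.List;

public class AggregatingContextSketch {

    // Hypothetical stand-in for CheckpointException; NOT the real jdk.crac/javax.crac class.
    static class CheckpointException extends Exception {
        public CheckpointException() { super(); }
        public CheckpointException(String message) { super(message); }
    }

    // Hypothetical, simplified stand-in for a resource (checkpoint half only).
    interface Resource {
        void beforeCheckpoint() throws Exception;
    }

    // A user-written context: runs every resource, then reports all failures at once.
    static class UserContext {
        private final List<Resource> resources = new ArrayList<>();

        void register(Resource r) {
            resources.add(r);
        }

        // The throws clause is narrowed to CheckpointException, so this is the
        // only checked exception the implementation may raise.
        void beforeCheckpoint() throws CheckpointException {
            List<Throwable> failures = new ArrayList<>();
            for (Resource r : resources) {
                try {
                    r.beforeCheckpoint();
                } catch (Exception e) {
                    failures.add(e); // keep going: all resources must be notified
                }
            }
            if (!failures.isEmpty()) {
                // This line only compiles if the no-message constructor is accessible.
                CheckpointException aggregate = new CheckpointException();
                failures.forEach(aggregate::addSuppressed);
                throw aggregate;
            }
        }
    }

    // Registers three resources, two of which fail, and counts the recorded failures.
    static int runAndCountFailures() {
        UserContext ctx = new UserContext();
        ctx.register(() -> { throw new IllegalStateException("socket still open"); });
        ctx.register(() -> { });
        ctx.register(() -> { throw new java.io.IOException("file still open"); });
        try {
            ctx.beforeCheckpoint();
            return 0;
        } catch (CheckpointException e) {
            return e.getSuppressed().length;
        }
    }

    public static void main(String[] args) {
        System.out.println("suppressed failures: " + runAndCountFailures());
    }
}
```

If the constructor were hidden, the `new CheckpointException()` line in `UserContext` would not compile, which is the situation the reply describes.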
------------- PR Review Comment: https://git.openjdk.org/crac/pull/64#discussion_r1190679470 From duke at openjdk.org Thu May 11 06:21:08 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 11 May 2023 06:21:08 GMT Subject: [crac] RFR: Improved open file descriptors tracking [v10] In-Reply-To: References:

Message-ID: <0bG9Cm3G8opo4NCbEzb3KMLRnuGeinvXf1Dxip-AyNQ=.2ec4c2e9-7436-48a5-a65f-1419c49df2f4@github.com> On Wed, 10 May 2023 16:06:45 GMT, Radim Vansa wrote: >> Tracks `java.io.FileDescriptor` instances as CRaC resource; before checkpoint these are reported and if not allow-listed (e.g. as opened as standard descriptors) an exception is thrown. Further investigation can use system property `jdk.crac.collect-fd-stacktraces=true` to record origin of those file descriptors. >> File descriptors claimed in Java code are passed to native; native code checks all open file descriptors and reports error if there's an unexpected FD that is not included in the list passed previously. > > Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 51 commits: > > - Merge remote-tracking branch 'origin/crac' into newfd > - Fixup > - Merge branch 'context_order' into newfd > - Revert removing the logging configuration > - Make comparator private > - Fix build > - Minified the set of changes > - Fixup > - Remove mention of single-threaded notifications > - Update docs > - ... and 41 more: https://git.openjdk.org/crac/compare/ef2437e7...8fd3566c Alright, go for it. 
------------- PR Comment: https://git.openjdk.org/crac/pull/43#issuecomment-1543400877 From duke at openjdk.org Thu May 11 06:25:09 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 11 May 2023 06:25:09 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v5] In-Reply-To: <1h2s-b7sop_69bx5PbavykMnj1upfRIfczQOx4Ylwtg=.7ad2d784-5ccf-41a1-96f8-2e6ccd72ca48@github.com> References: <1h2s-b7sop_69bx5PbavykMnj1upfRIfczQOx4Ylwtg=.7ad2d784-5ccf-41a1-96f8-2e6ccd72ca48@github.com> Message-ID: On Wed, 10 May 2023 14:50:53 GMT, Anton Kozlov wrote: >> If as a result of beforeCheckpoint() no more non-daemon threads remain, it's possible that VM exits prematurely, before one of afterRestore() get a chance to create another non-daemon thread that will keep VM alive. Triggering checkpoint via jcmd (so checkpointRestore() method is executed on daemon attach-listener thread), increases probability to step in the problem. >> >> The change ensures all notifications are done while there is at least one non-daemon thread. The notification methods are still called from the original thread. > > Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Move KeepAlive to separate class, handle interrupts I am practically approving my own changes, but ok. ------------- Marked as reviewed by rvansa at github.com (no known OpenJDK username). PR Review: https://git.openjdk.org/crac/pull/62#pullrequestreview-1421830947 From duke at openjdk.org Thu May 11 14:28:15 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Thu, 11 May 2023 14:28:15 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v6] In-Reply-To: References:

Message-ID: On Thu, 27 Apr 2023 11:55:53 GMT, Radim Vansa wrote:

>> There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`. When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value.
>
> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision:
>
> Use image under ghcr.io/crac

src/hotspot/share/runtime/os.cpp line 2069:

> 2067: }
> 2068: }
> 2069: }

No newline at end of file

src/java.base/share/classes/java/lang/System.java line 69:

> 67: import java.util.concurrent.ConcurrentHashMap;
> 68: import java.util.stream.Stream;
> 69:

Excessive whitespace change.

src/java.base/share/classes/java/lang/System.java line 2453:

> 2451: });
> 2452: }
> 2453:

Excessive whitespace change.

test/hotspot/jtreg/testlibrary/jittester/conf/exclude.methods.lst line 29:

> 27: java/lang/System::loadLibrary(Ljava/lang/String;)
> 28: java/lang/System::mapLibraryName(Ljava/lang/String;)
> 29: java/lang/System::nanoTime0()

Is this change really needed?

test/jdk/jdk/crac/java/lang/System/NanoTimeTest.java line 82:

> 80: "-e", "LD_PRELOAD=/opt/path-mapping-quiet.so",
> 81: "-e", "PATH_MAPPING=/proc/sys/kernel/random/boot_id:/fake_boot_id",
> 82: CracBuilder.CONTAINER_NAME, CracBuilder.DOCKER_JAVA);

On Fedora 36 x86_64 the testcase does not work for me:

Starting docker container:
docker run --rm -d --privileged --init --volume /home/azul/azul/crac-git/JTwork/classes/jdk/crac/java/lang/System/NanoTimeTest.d:/cp/0 --volume /home/azul/azul/crac-git/JTwork/classes/test/lib:/cp/1 --volume cr:/cr --volume /home/azul/azul/crac-git/build/linux-x86_64-server-fastdebug/jdk/lib/criu:/criu --env CRAC_CRIU_PATH=/criu --name crac-test -v /tmp/NanoTimeTest-3201524983642970594-boot_id:/fake_boot_id jdk-internal:test-system-nanotime sleep 3600
Starting process to be checkpointed:
docker exec -e LD_PRELOAD=/opt/path-mapping-quiet.so -e PATH_MAPPING=/proc/sys/kernel/random/boot_id:/fake_boot_id crac-test /jdk/bin/java -ea -cp /cp/0:/cp/1: -XX:CRaCCheckpointTo=cr jdk.test.lib.crac.CracTest __run_test__ NanoTimeTest 0 true
/criu: error while loading shared libraries: libbsd.so.0: cannot open shared object file: No such file or directory
Exception in thread "main" jdk.crac.CheckpointException

-------------
PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1191248589
PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1191248975
PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1191249120
PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1191249848
PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1191253534
From duke at openjdk.org Thu May 11 14:32:16 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Thu, 11 May 2023 14:32:16 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v6] In-Reply-To: References:

Message-ID: On Thu, 27 Apr 2023 11:55:53 GMT, Radim Vansa wrote:

>> There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`. When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value.
>
> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision:
>
> Use image under ghcr.io/crac

On Fedora 36 x86_64 when I snapshot the image, reboot and restore it with boottime earlier than it was during the snapshot I get a hanging restore:

#0 0x00007f43778899b9 in __futex_abstimed_wait_common () from /lib64/libc.so.6
#1 0x00007f437788e983 in __pthread_clockjoin_ex () from /lib64/libc.so.6
#2 0x00007f4377abe655 in CallJavaMainInNewThread (stack_size=1048576, args=0x7ffdd525b5d0) at ../src/java.base/unix/native/libjli/java_md.c:681
#3 0x00007f4377abb7fd in ContinueInNewThread (ifn=0x7ffdd525b6d0, ifn@entry=0x0, threadStackSize=, argc=, argv=0x5641c59c75f8, mode=mode@entry=1, what=0x5641c59c7380 "NanoTime", what@entry=0x0, ret=0) at ../src/java.base/share/native/libjli/java.c:2362
#4 0x00007f4377abe709 in JVMInit (ifn=0x0, ifn@entry=0x7ffdd525b6d0, threadStackSize=, argc=, argv=, mode=mode@entry=1, what=0x0, what@entry=0x5641c59c7380 "NanoTime", ret=) at ../src/java.base/unix/native/libjli/java_md.c:706
#5 0x00007f4377abc4e0 in JLI_Launch (argc=, argc@entry=3, argv=, argv@entry=0x5641c59c72c0, jargc=jargc@entry=0, jargv=jargv@entry=0x5641c3cfa008 , appclassc=appclassc@entry=0, appclassv=appclassv@entry=0x0, fullversion=0x5641c3cf8088 "17-internal+0-adhoc.azul.crac-git", dotversion=0x5641c3cf8079 "0.0", pname=0x5641c3cf8074 "java", lname=0x5641c3cf806c "openjdk", javaargs=0 '\000', cpwildcard=1 '\001', javaw=0 '\000', ergo=0) at ../src/java.base/share/native/libjli/java.c:342
#6 0x00005641c3cf730a in main (argc=, argv=) at ../src/java.base/share/native/launcher/main.c:282

-------------
PR Comment: https://git.openjdk.org/crac/pull/53#issuecomment-1544091438
From akozlov at openjdk.org Thu May 11 14:39:27 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 11 May 2023 14:39:27 GMT Subject: [crac] RFR: Improved open file descriptors tracking [v10] In-Reply-To: References:

Message-ID: <2Lfs9rZauU41wjDfYBTKBI-C8QVFCZyLFpw125nMa2M=.617a691c-ae7b-4ffe-b1be-143505718f7a@github.com> On Wed, 10 May 2023 16:06:45 GMT, Radim Vansa wrote: >> Tracks `java.io.FileDescriptor` instances as CRaC resource; before checkpoint these are reported and if not allow-listed (e.g. as opened as standard descriptors) an exception is thrown. Further investigation can use system property `jdk.crac.collect-fd-stacktraces=true` to record origin of those file descriptors. >> File descriptors claimed in Java code are passed to native; native code checks all open file descriptors and reports error if there's an unexpected FD that is not included in the list passed previously. > > Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 51 commits: > > - Merge remote-tracking branch 'origin/crac' into newfd > - Fixup > - Merge branch 'context_order' into newfd > - Revert removing the logging configuration > - Make comparator private > - Fix build > - Minified the set of changes > - Fixup > - Remove mention of single-threaded notifications > - Update docs > - ... and 41 more: https://git.openjdk.org/crac/compare/ef2437e7...8fd3566c Thanks for making progress on this! And for getting things moving :) ------------- PR Comment: https://git.openjdk.org/crac/pull/43#issuecomment-1544103294 From duke at openjdk.org Thu May 11 14:43:31 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Thu, 11 May 2023 14:43:31 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v2] In-Reply-To: References: <-i0uB8ZW7r54hoKQJ_wODUXNKVkOI5rH7SJTEhSHiDw=.75ebe53a-9081-40c6-911f-048b17e8850e@github.com>

Message-ID: On Thu, 13 Apr 2023 15:45:50 GMT, Anton Kozlov wrote:

>> @AntonKozlov
>>
>>> Crac-criu does not use restore timens [1] since once a bug in kernel or criu caused timedwait to return immediately every time that is called after restore. I don't remember the bug exactly (already fixed), but I believe it was discussed on this mailing list
>>
>> https://github.com/CRaC/criu/commit/1cb2f4a518a4ae471a1df7a9b540203c1efaf1ba commit is dated July 14, 2020, but the crac-dev archives has earliest mailing list from Sept 2021. Is there some other mailing list this was discussed on? I am interested in understanding the problem that prompted not to use timens in criu.
>> Since you mention it was a bug in kernel or criu and it has been almost 3 years since your commit, maybe it is worth enabling the criu changes again to see if the timedwait problem still exists, unless you have already done that.
>>
>>> In general, we should not depend on very obscure linux abilities, as this reduce chances we'd be able to run on something rather than linux.
>>
>> I don't think timens can be put in the category of obscure linux ability. It has even made its way into container runtime spec: https://github.com/opencontainers/runtime-spec/pull/1151.
>
>> Since you mention it was a bug in kernel or criu and it has been almost 3 years since your commit, maybe it is worth enabling the criu changes again to see if the timedwait problem still exists, unless you have already done that.
>
> AFAIK the bug is fixed, but I see no point of relying on OS here. Is there one? Timens that is not changed by CRIU provides correct values for our nanoTime() [1].
>
>> The value returned represents nanoseconds since some fixed but arbitrary origin time (perhaps in the future, so values may be negative). The same origin is used by all invocations of this method in an instance of a Java virtual machine
>
> [1] https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/System.html#nanoTime()

Upstream criu does provide the time namespace as stated by @AntonKozlov:

CLOCK_MONOTONIC=301.735134591 CLOCK_BOOTTIME=301.735155494
CLOCK_MONOTONIC=302.735345917 CLOCK_BOOTTIME=302.735358109
Warn (compel/arch/x86/src/lib/infect.c:356): Will restore 7757 with interrupted system call
[1]+ Killed ./clock_gettime
restore: CLOCK_MONOTONIC=302.803360137 CLOCK_BOOTTIME=302.803373299
restore: CLOCK_MONOTONIC=302.806677876 CLOCK_BOOTTIME=302.806696098

I do not see why the JVM should reimplement what CRIU already does. One can solve that in the future, when CRaC is really going to run on a non-Linux system. One would need to port or reimplement CRIU there in the first place.

-------------
PR Comment: https://git.openjdk.org/crac/pull/53#issuecomment-1544109680
From duke at openjdk.org Thu May 11 14:43:35 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Thu, 11 May 2023 14:43:35 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v6] In-Reply-To: References:

Message-ID: On Thu, 11 May 2023 14:25:49 GMT, Jan Kratochvil wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Use image under ghcr.io/crac > > test/jdk/jdk/crac/java/lang/System/NanoTimeTest.java line 82: > >> 80: "-e", "LD_PRELOAD=/opt/path-mapping-quiet.so", >> 81: "-e", "PATH_MAPPING=/proc/sys/kernel/random/boot_id:/fake_boot_id", >> 82: CracBuilder.CONTAINER_NAME, CracBuilder.DOCKER_JAVA); > > On Fedora 36 x86_64 the testcase does not work for me: > > Starting docker container: > docker run --rm -d --privileged --init --volume /home/azul/azul/crac-git/JTwork/classes/jdk/crac/java/lang/System/NanoTimeTest.d:/cp/0 --volume /home/azul/azul/crac-git/JTwork/classes/test/lib:/cp/1 --volume cr:/cr --volume /home/azul/azul/crac-git/build/linux-x86_64-server-fastdebug/jdk/lib/criu:/criu --env CRAC_CRIU_PATH=/criu --name crac-test -v /tmp/NanoTimeTest-3201524983642970594-boot_id:/fake_boot_id jdk-internal:test-system-nanotime sleep 3600 > Starting process to be checkpointed: > docker exec -e LD_PRELOAD=/opt/path-mapping-quiet.so -e PATH_MAPPING=/proc/sys/kernel/random/boot_id:/fake_boot_id crac-test /jdk/bin/java -ea -cp /cp/0:/cp/1: -XX:CRaCCheckpointTo=cr jdk.test.lib.crac.CracTest __run_test__ NanoTimeTest 0 true > /criu: error while loading shared libraries: libbsd.so.0: cannot open shared object file: No such file or directory > Exception in thread "main" jdk.crac.CheckpointException Are you using your own build of CRIU? Looks like you have CRIU build with libbsd support. When you remove `libbsd-devel` and rebuild CRIU this should work. 
------------- PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1191307243 From duke at openjdk.org Thu May 11 15:09:21 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 11 May 2023 15:09:21 GMT Subject: [crac] Integrated: Improved open file descriptors tracking In-Reply-To: References: Message-ID: <3tCebXARUaioV_D2GWZg7GPnugyz5940VFHZtg3jFPc=.f3b1808f-9501-4284-a1db-ee153b347f14@github.com> On Tue, 31 Jan 2023 10:25:36 GMT, Radim Vansa wrote: > Tracks `java.io.FileDescriptor` instances as CRaC resource; before checkpoint these are reported and if not allow-listed (e.g. as opened as standard descriptors) an exception is thrown. Further investigation can use system property `jdk.crac.collect-fd-stacktraces=true` to record origin of those file descriptors. > File descriptors claimed in Java code are passed to native; native code checks all open file descriptors and reports error if there's an unexpected FD that is not included in the list passed previously. This pull request has now been integrated. Changeset: 4b0dc2dc Author: Radim Vansa Committer: Anton Kozlov URL: https://git.openjdk.org/crac/commit/4b0dc2dc9722945579c9772b335a44fa79f1729f Stats: 999 lines in 33 files changed: 682 ins; 276 del; 41 mod Improved open file descriptors tracking Reviewed-by: akozlov ------------- PR: https://git.openjdk.org/crac/pull/43 From duke at openjdk.org Thu May 11 15:18:17 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Thu, 11 May 2023 15:18:17 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v6] In-Reply-To: References:

Message-ID: On Thu, 11 May 2023 14:54:58 GMT, Radim Vansa wrote: >> test/jdk/jdk/crac/java/lang/System/NanoTimeTest.java line 82: >> >>> 80: "-e", "LD_PRELOAD=/opt/path-mapping-quiet.so", >>> 81: "-e", "PATH_MAPPING=/proc/sys/kernel/random/boot_id:/fake_boot_id", >>> 82: CracBuilder.CONTAINER_NAME, CracBuilder.DOCKER_JAVA); >> >> On Fedora 36 x86_64 the testcase does not work for me: >> >> Starting docker container: >> docker run --rm -d --privileged --init --volume /home/azul/azul/crac-git/JTwork/classes/jdk/crac/java/lang/System/NanoTimeTest.d:/cp/0 --volume /home/azul/azul/crac-git/JTwork/classes/test/lib:/cp/1 --volume cr:/cr --volume /home/azul/azul/crac-git/build/linux-x86_64-server-fastdebug/jdk/lib/criu:/criu --env CRAC_CRIU_PATH=/criu --name crac-test -v /tmp/NanoTimeTest-3201524983642970594-boot_id:/fake_boot_id jdk-internal:test-system-nanotime sleep 3600 >> Starting process to be checkpointed: >> docker exec -e LD_PRELOAD=/opt/path-mapping-quiet.so -e PATH_MAPPING=/proc/sys/kernel/random/boot_id:/fake_boot_id crac-test /jdk/bin/java -ea -cp /cp/0:/cp/1: -XX:CRaCCheckpointTo=cr jdk.test.lib.crac.CracTest __run_test__ NanoTimeTest 0 true >> /criu: error while loading shared libraries: libbsd.so.0: cannot open shared object file: No such file or directory >> Exception in thread "main" jdk.crac.CheckpointException > > Are you using your own build of CRIU? Looks like you have CRIU build with libbsd support. When you remove `libbsd-devel` and rebuild CRIU this should work. Yes. When I remove `libbsd-devel` I get: /usr/bin/ld: /home/azul/azul/criu-git/criu/apparmor.c:127: undefined reference to `strlcpy' /home/azul/azul/criu-git/criu/crtools.c:130: undefined reference to `setproctitle_init' Besides that, no package should break (even if only its testcase does) when some additional system feature is available. 
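For readers following the nanoTime side of this thread: the PR description says the change records wall clock time before checkpoint and after restore and uses it to adjust the value returned by nanoTime(). A minimal sketch of that arithmetic follows. It is an assumption about how such an offset could be computed, not the code from the PR; the class name, method names and parameters are all hypothetical.

```java
public class NanoTimeOffsetSketch {

    /**
     * Computes the offset to add to the raw monotonic clock after restore so
     * that the adjusted value continues from where it was at checkpoint, plus
     * the wall-clock time spent in the image. All parameter names are hypothetical.
     */
    static long computeOffset(long monotonicAtCheckpointNanos,
                              long wallAtCheckpointMillis,
                              long monotonicAfterRestoreNanos,
                              long wallAfterRestoreMillis) {
        // Never let a wall clock that went backwards move the adjusted time back.
        long elapsedMillis = Math.max(0L, wallAfterRestoreMillis - wallAtCheckpointMillis);
        long expectedNowNanos = monotonicAtCheckpointNanos + elapsedMillis * 1_000_000L;
        return expectedNowNanos - monotonicAfterRestoreNanos;
    }

    /** Applies the offset to a raw monotonic reading. */
    static long adjustedNanoTime(long rawNanos, long offsetNanos) {
        return rawNanos + offsetNanos;
    }

    public static void main(String[] args) {
        // Checkpoint taken 1000 s after the old machine booted; restore happens
        // 3 s later (by wall clock) on a machine that booted only 50 s ago.
        long offset = computeOffset(1_000_000_000_000L, 5_000L, 50_000_000_000L, 8_000L);
        System.out.println(adjustedNanoTime(50_000_000_000L, offset)); // prints 1003000000000
    }
}
```

The clamp with `Math.max` is one possible way to keep the adjusted value monotonic when the wall clock of the restoring machine is behind; whether the PR does the same is not stated in this thread.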
------------- PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1191337406 From akozlov at openjdk.org Thu May 11 16:11:51 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 11 May 2023 16:11:51 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v13] In-Reply-To: References:

Message-ID: On Wed, 10 May 2023 12:49:03 GMT, Radim Vansa wrote:

>> * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws
>> * allows to register a resource in a context that did not start beforeCheckpoint invocations yet
>> * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately
>> * instead this will be one of the recorded exceptions and the resource has a chance to fire next time
>> * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though
>
> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision:
>
> Revert removing the logging configuration

I got this after some unrelated modifications (FileDescriptor.beforeCheckpoint uses lambda), and I apparently get auto-deadlock with a single thread involved:

"main" #1 prio=5 os_prio=0 cpu=88.95ms elapsed=21.61s tid=0x00007fd670025160 nid=0x27228a in Object.wait() [0x00007fd6747fd000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(java.base@17-internal/Native Method)
    - waiting on <0x0000000418002088> (a jdk.internal.crac.JDKContext)
    at java.lang.Object.wait(java.base@17-internal/Object.java:338)
    at jdk.crac.impl.AbstractContextImpl.waitWhileCheckpointIsInProgress(java.base@17-internal/AbstractContextImpl.java:102)
    at jdk.crac.impl.PriorityContext.register(java.base@17-internal/PriorityContext.java:44)
    - locked <0x0000000418002088> (a jdk.internal.crac.JDKContext)
    at jdk.internal.crac.JDKContext.register(java.base@17-internal/JDKContext.java:97)
    at jdk.internal.ref.CleanerImpl$PhantomCleanableRef.<init>(java.base@17-internal/CleanerImpl.java:170)
    at java.lang.ref.Cleaner.register(java.base@17-internal/Cleaner.java:220)
    at java.lang.invoke.MethodHandleNatives$CallSiteContext.make(java.base@17-internal/MethodHandleNatives.java:90)
    at java.lang.invoke.CallSite.<init>(java.base@17-internal/CallSite.java:144)
    at java.lang.invoke.ConstantCallSite.<init>(java.base@17-internal/ConstantCallSite.java:50)
    at java.lang.invoke.InnerClassLambdaMetafactory.buildCallSite(java.base@17-internal/InnerClassLambdaMetafactory.java:270)
    at java.lang.invoke.LambdaMetafactory.metafactory(java.base@17-internal/LambdaMetafactory.java:341)
    at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base@17-internal/DirectMethodHandle$Holder)
    at java.lang.invoke.Invokers$Holder.invokeExact_MT(java.base@17-internal/Invokers$Holder)
    at java.lang.invoke.BootstrapMethodInvoker.invoke(java.base@17-internal/BootstrapMethodInvoker.java:134)
    at java.lang.invoke.CallSite.makeSite(java.base@17-internal/CallSite.java:315)
    at java.lang.invoke.MethodHandleNatives.linkCallSiteImpl(java.base@17-internal/MethodHandleNatives.java:281)
    at java.lang.invoke.MethodHandleNatives.linkCallSite(java.base@17-internal/MethodHandleNatives.java:271)
    at java.io.FileDescriptor$Resource.beforeCheckpoint(java.base@17-internal/FileDescriptor.java:74)
    at jdk.crac.impl.PriorityContext$SubContext.invokeBeforeCheckpoint(java.base@17-internal/PriorityContext.java:107)
    at jdk.crac.impl.OrderedContext.runBeforeCheckpoint(java.base@17-internal/OrderedContext.java:70)
    at jdk.crac.impl.AbstractContextImpl.beforeCheckpoint(java.base@17-internal/AbstractContextImpl.java:81)
    at jdk.crac.impl.AbstractContextImpl.invokeBeforeCheckpoint(java.base@17-internal/AbstractContextImpl.java:41)
    at jdk.crac.impl.PriorityContext.runBeforeCheckpoint(java.base@17-internal/PriorityContext.java:70)
    at jdk.crac.impl.AbstractContextImpl.beforeCheckpoint(java.base@17-internal/AbstractContextImpl.java:81)
    at jdk.internal.crac.JDKContext.beforeCheckpoint(java.base@17-internal/JDKContext.java:85)
    at jdk.crac.impl.AbstractContextImpl.invokeBeforeCheckpoint(java.base@17-internal/AbstractContextImpl.java:41)
    at jdk.crac.impl.OrderedContext.runBeforeCheckpoint(java.base@17-internal/OrderedContext.java:70)
    at jdk.crac.impl.AbstractContextImpl.beforeCheckpoint(java.base@17-internal/AbstractContextImpl.java:81)
    at jdk.crac.Core.checkpointRestore1(java.base@17-internal/Core.java:116)
    at jdk.crac.Core.checkpointRestore(java.base@17-internal/Core.java:256)
    - locked <0x0000000418002118> (a java.lang.Object)
    at jdk.crac.Core.checkpointRestore(java.base@17-internal/Core.java:241)
    at CheckpointWithOpenFdsTest.exec(CheckpointWithOpenFdsTest.java:49)
    at jdk.test.lib.crac.CracTest.run(CracTest.java:157)
    at jdk.test.lib.crac.CracTest.main(CracTest.java:89)

-------------
PR Comment: https://git.openjdk.org/crac/pull/60#issuecomment-1544268184
From duke at openjdk.org Fri May 12 02:00:13 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Fri, 12 May 2023 02:00:13 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v6] In-Reply-To: References:

Message-ID: On Thu, 11 May 2023 15:15:23 GMT, Jan Kratochvil wrote: >> Are you using your own build of CRIU? Looks like you have CRIU build with libbsd support. When you remove `libbsd-devel` and rebuild CRIU this should work. > > Yes. When I remove `libbsd-devel` I get: > > /usr/bin/ld: /home/azul/azul/criu-git/criu/apparmor.c:127: undefined reference to `strlcpy' > /home/azul/azul/criu-git/criu/crtools.c:130: undefined reference to `setproctitle_init' > > Besides that, no package should break (even if only its testcase does) when some additional system feature is available. Apart from the discussion of this patch vs. native CRIU timens support, I find it easier (YMMV) and less error-prone to just use a simple `LD_PRELOAD`+`RTLD_NEXT` interposer to simulate the system reboots: https://stackoverflow.com/a/6083624/2995591 ------------- PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1191835421 From duke at openjdk.org Fri May 12 05:49:28 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 12 May 2023 05:49:28 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v7] In-Reply-To: References: Message-ID: > There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`. When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value. Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 18 commits: - Merge branch 'crac' into nanotime - Fix whitespaces - Use image under ghcr.io/crac - Ensure monotonicity for the same boot - Set nanotime only if bootid changes - Reset nanotime offset before calculating it again - Correct time since restore - Merge remote-tracking branch 'origin/crac' into nanotime - Adjust System.nanoTime() to keep consistent time origin after restore - Merge remote-tracking branch 'origin/crac' into test-crac-java - ... and 8 more: https://git.openjdk.org/crac/compare/ef2437e7...87d19a12 ------------- Changes: https://git.openjdk.org/crac/pull/53/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=53&range=06 Stats: 295 lines in 7 files changed: 268 ins; 0 del; 27 mod Patch: https://git.openjdk.org/crac/pull/53.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/53/head:pull/53 PR: https://git.openjdk.org/crac/pull/53 From duke at openjdk.org Fri May 12 06:04:12 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 12 May 2023 06:04:12 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v6] In-Reply-To: References:

Message-ID: On Fri, 12 May 2023 01:56:58 GMT, Jan Kratochvil wrote: >> Yes. When I remove `libbsd-devel` I get: >> >> /usr/bin/ld: /home/azul/azul/criu-git/criu/apparmor.c:127: undefined reference to `strlcpy' >> /home/azul/azul/criu-git/criu/crtools.c:130: undefined reference to `setproctitle_init' >> >> Besides that, no package should break (even if only its testcase does) when some additional system feature is available. > > Apart from the discussion of this patch vs. native CRIU timens support, I find it easier (YMMV) and less error-prone to just use a simple `LD_PRELOAD`+`RTLD_NEXT` interposer to simulate the system reboots: https://stackoverflow.com/a/6083624/2995591 I agree with your line of thinking, and since this test actually uses a custom Docker image I can add those libraries to the base image. But in general it's a problem of criu packaging with implicit dependencies; an officially installed binary should get this handled by dependency management. Since those few functions from libbsd are reimplemented in criu, we could just disable the dependency in our fork. However, we'll run into the same problem with libnftables - I found that with the nftables support the unprivileged restore as root in a Docker container does not work anymore. And that got compiled in only because the -dev library was present; there's no compile-time switch. The error you describe can be solved with `git clean -f -x`; regrettably, `make clean` does not do its job very well. I'm not sure which function you are suggesting to replace. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1191938800 From duke at openjdk.org Fri May 12 06:48:24 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 12 May 2023 06:48:24 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v13] In-Reply-To: References:

Message-ID: On Thu, 11 May 2023 16:04:32 GMT, Anton Kozlov wrote:

>> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Revert removing the logging configuration
>
> I got this after some unrelated modifications (FileDescriptor.beforeCheckpoint uses lambda), and I apparently get auto-deadlock with a single thread involved:
>
>
> "main" #1 prio=5 os_prio=0 cpu=88.95ms elapsed=21.61s tid=0x00007fd670025160 nid=0x27228a in Object.wait() [0x00007fd6747fd000]
> java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(java.base@17-internal/Native Method)
> - waiting on <0x0000000418002088> (a jdk.internal.crac.JDKContext)
> at java.lang.Object.wait(java.base@17-internal/Object.java:338)
> at jdk.crac.impl.AbstractContextImpl.waitWhileCheckpointIsInProgress(java.base@17-internal/AbstractContextImpl.java:102)
> at jdk.crac.impl.PriorityContext.register(java.base@17-internal/PriorityContext.java:44)
> - locked <0x0000000418002088> (a jdk.internal.crac.JDKContext)
> at jdk.internal.crac.JDKContext.register(java.base@17-internal/JDKContext.java:97)
> at jdk.internal.ref.CleanerImpl$PhantomCleanableRef.<init>(java.base@17-internal/CleanerImpl.java:170)
> at java.lang.ref.Cleaner.register(java.base@17-internal/Cleaner.java:220)
> at java.lang.invoke.MethodHandleNatives$CallSiteContext.make(java.base@17-internal/MethodHandleNatives.java:90)
> at java.lang.invoke.CallSite.<init>(java.base@17-internal/CallSite.java:144)
> at java.lang.invoke.ConstantCallSite.