From duke at openjdk.org Tue May 2 06:40:45 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 2 May 2023 06:40:45 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v3] In-Reply-To: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com> References: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com> Message-ID: On Fri, 28 Apr 2023 11:51:19 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with two additional commits since the last revision: >> >> - More fine-grained synchronization >> - Rework context ordering (round 2) >> >> * call afterRestore even if beforeCheckpoint throws >> * registering resource in previous/running context does not trigger exception immediatelly >> ** instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * we don't guarantee threads not deadlocking when trying to register a resource, though > > src/java.base/share/classes/jdk/crac/impl/PriorityContext.java line 21: > >> 19: // CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) >> 20: // ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE >> 21: // POSSIBILITY OF SUCH DAMAGE. > > Use standard copyright please Uh, OK - I've copy pasted what's in the AbstractContextImpl as I've forked the class... 
------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182142187 From duke at openjdk.org Tue May 2 06:48:42 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 2 May 2023 06:48:42 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v3] In-Reply-To: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com> References: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com> Message-ID: On Fri, 28 Apr 2023 11:59:17 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with two additional commits since the last revision: >> >> - More fine-grained synchronization >> - Rework context ordering (round 2) >> >> * call afterRestore even if beforeCheckpoint throws >> * registering resource in previous/running context does not trigger exception immediatelly >> ** instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * we don't guarantee threads not deadlocking when trying to register a resource, though > > src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java line 59: > >> 57: recordExceptions(e); >> 58: } catch (Exception e) { >> 59: Core.recordException(e); > > Why is there is the distinction? I think we should throw all exceptions from the context, rather than publishing them to a central store, otherwise the parent Context (if any), won't be able to do anything about those. The contexts are supposed to run all resources regardless of whether any failed, therefore there's no point in propagating and re-wrapping the exceptions. We could have a method to check whether this C/R is 'marked for rollback' (has any exceptions), but I don't see why would the parent context decide on the type - because it would get just CheckpointException with a list of other failures from components it does not understand. 
About the distinction: the difference is that if you get a CheckpointException you'd unwrap it, recording only the inner suppressed ones. But I should push that to `recordExceptions` and rather decide based on CheckpointException message than number of suppressions. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182150531 From duke at openjdk.org Tue May 2 07:06:01 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 2 May 2023 07:06:01 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v3] In-Reply-To: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com> References: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com> Message-ID: <3Y237gR4g69wXdE-awYTVqazWTXKFYNsaJSjwEeVWbA=.56181d61-a6bd-4706-8a12-adf5cf2fff7b@github.com> On Fri, 28 Apr 2023 12:01:42 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with two additional commits since the last revision: >> >> - More fine-grained synchronization >> - Rework context ordering (round 2) >> >> * call afterRestore even if beforeCheckpoint throws >> * registering resource in previous/running context does not trigger exception immediatelly >> ** instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * we don't guarantee threads not deadlocking when trying to register a resource, though > > src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java line 78: > >> 76: } >> 77: >> 78: protected abstract void runBeforeCheckpoint(); > > This is intended to be overwritten (becomes a part of the class interface). The intent behind the separate method is not evident. Corresponding runAfterRestore is private though. > > After AbstractContexImpl has lost parameter P and comparator, a distinction between AbstractContexImpl and OrderedContext has been lost. 
Merging AbstractContexImpl into OrderedContext likely will provided clearer code. ACI is implemented both by OrderedContext and PriorityContext, while PC is quite different from OC. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182164379 From duke at openjdk.org Tue May 2 07:09:46 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 2 May 2023 07:09:46 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v3] In-Reply-To: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com> References: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com> Message-ID: On Fri, 28 Apr 2023 12:06:19 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with two additional commits since the last revision: >> >> - More fine-grained synchronization >> - Rework context ordering (round 2) >> >> * call afterRestore even if beforeCheckpoint throws >> * registering resource in previous/running context does not trigger exception immediatelly >> ** instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * we don't guarantee threads not deadlocking when trying to register a resource, though > > src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java line 55: > >> 53: restoreQ.add(resource); >> 54: try { >> 55: resource.beforeCheckpoint(semanticContext()); > > Does this mean a Resource may get another Context and not the one to which it has been registered? This may be very unexpected for the Resource implementation. Theoretically, the method could do that. However, here the purpose of `semanticContext()` is to pass the context to which it was registered rather than the subcontext where it is stored (but this is an implementation detail that the resource does not know about). 
------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182166930 From duke at openjdk.org Tue May 2 07:23:49 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 2 May 2023 07:23:49 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v3] In-Reply-To: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com> References: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com> Message-ID: On Fri, 28 Apr 2023 12:58:59 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with two additional commits since the last revision: >> >> - More fine-grained synchronization >> - Rework context ordering (round 2) >> >> * call afterRestore even if beforeCheckpoint throws >> * registering resource in previous/running context does not trigger exception immediatelly >> ** instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * we don't guarantee threads not deadlocking when trying to register a resource, though > > src/java.base/share/classes/jdk/crac/impl/OrderedContext.java line 60: > >> 58: // It is possible that something registers to us during restore but before >> 59: // this context's afterRestore was called. >> 60: if (checkpointing && !Core.isRestoring()) { > > There is a small window between all beforeCheckpoint() are finished and checkpoint. In this window we'll call setModified(). An there is another window between restore and afterRestore() processing is started, where we'll won't call setModified(). Getting the exception or not will be a result of a race between checkpoint/restore (actual event with near-zero duration, without calling Resources) and registration. > > A Resource may also have an empty beforeCheckpoint() and some afterRestore() clean up. We'll register the resource for the next round of checkpoint/restore and will be silence about newly registered Resource. 
But since beforeCheckpoint() is empty, the original intent could be to do something useful on restore, which won't be done.

Does that matter, though? The registration itself will be done for next C/R in all cases. The situation where the registrar entered the `register()` method, added the resource and recorded an exception but this has been silently discarded is equivalent to the situation where it entered the `register()` just after restore was completed. If these are independent you cannot force it to fail reliably, unless there is something in running the resource notifications that depends on the registration (but there isn't, as all resources have already been notified).

If the intent was to do something after the current restore, the resource should have been registered in a blocking way before the checkpoint or as part of the blocking beforeCheckpoint, not absolutely independently.

-------------

PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182178695

From duke at openjdk.org Tue May 2 07:27:48 2023
From: duke at openjdk.org (Radim Vansa)
Date: Tue, 2 May 2023 07:27:48 GMT
Subject: [crac] RFR: Fix ordering of invocation on Resources [v3]
In-Reply-To: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com>
References: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com>
Message-ID: 

On Fri, 28 Apr 2023 13:14:20 GMT, Anton Kozlov wrote:

>> Radim Vansa has updated the pull request incrementally with two additional commits since the last revision:
>>
>> - More fine-grained synchronization
>> - Rework context ordering (round 2)
>>
>> * call afterRestore even if beforeCheckpoint throws
>> * registering resource in previous/running context does not trigger exception immediately
>> ** instead this will be one of the recorded exceptions and the resource has a chance to fire next time
>> * we don't guarantee threads not deadlocking when trying to register a resource, though
>
>
src/java.base/share/classes/jdk/crac/Core.java line 104: > >> 102: * Order of invoking {@link Resource#afterRestore(Context)} is the reverse >> 103: * of the order of {@link Resource#beforeCheckpoint(Context) checkpoint notification}, >> 104: * hence the same as the order of {@link Context#register(Resource) registration}. > > How about moving the Global Context description from the package level here (removing there). In javax.crac it should be fine to link to here IMO. Okay, I can remove the description in package. As for `javax.crac` - I thought that this should be really a mirror of `jdk.crac`, why the distinction? Another way might be to make OrderedContext a marker interface (move the implementation to OrderedContextImpl) and put the description there, using this interface for `getGlobalContext()`. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182182075 From duke at openjdk.org Tue May 2 07:41:44 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 2 May 2023 07:41:44 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore In-Reply-To: References: Message-ID: On Fri, 28 Apr 2023 14:56:42 GMT, Anton Kozlov wrote: >> When a JVM option is MANAGEABLE it can be set at any time during runtime, therefore it is safe to change it during the restore operation. Rather than silently ignoring JVM options passed along with -XX:CRaCRestoreFrom we send them to the restored process and either update or print a warning if given option cannot be changed. > > src/hotspot/os/linux/os_linux.cpp line 6557: > >> 6555: ::_restore_start_counter = hdr->_restore_counter; >> 6556: >> 6557: for (int i = 0; i < hdr->_nflags; i++) { > > This check can be done in the "bootstrap" process (the one that execs to CREngine): just to avoid restoring and finding out the problem. See the other comment about producing the error. Makes sense, I'll turn those into errors. 
I should probably also check the presence of `-jar` and `-cp`/`--classpath` and produce a nice explanation; otherwise the code would interpret those as the new main class and its arguments. > src/hotspot/share/runtime/globals.hpp line 2096: > >> 2094: /* It does not make sense to change this flag in runtime but we'll tag */ \ >> 2095: /* it MANAGEABLE to prevent warnings when setting this on restore. */ \ >> 2096: product(ccstr, CRaCRestoreFrom, NULL, MANAGEABLE, \ > > This is an example why we want "can be set on restore" (RESTOREBLE?) flag. So MANAGABLE will be implying RESTORABLE, but not every RESTORABLE will be MANAGEABLE. Below we see that `CRaCIgnoredFileDescriptors` is RESTORABLE (I would rather use SET_ON_RESTORE as all flags are restored at its previous value) but not MANAGEABLE, so that's not perfect either. The reason why I rather stayed on MANAGEABLE was to prevent changing every `MANAGEABLE` to `MANAGEABLE | SET_ON_RESTORE`, that would complicate any backport from mainline. I can do the SET_ON_RESTORE (as superset of MANAGEABLE) but I don't think that the few exceptions that could be rather documented by a few lines of comment really deserve a separate flag. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/61#discussion_r1182193699 PR Review Comment: https://git.openjdk.org/crac/pull/61#discussion_r1182191415 From akozlov at openjdk.org Tue May 2 09:45:47 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 2 May 2023 09:45:47 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v3] In-Reply-To: References: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com> Message-ID: On Tue, 2 May 2023 06:45:45 GMT, Radim Vansa wrote: >> src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java line 59: >> >>> 57: recordExceptions(e); >>> 58: } catch (Exception e) { >>> 59: Core.recordException(e); >> >> Why is there is the distinction? 
I think we should throw all exceptions from the context, rather than publishing them to a central store, otherwise the parent Context (if any) won't be able to do anything about those.
>
> The contexts are supposed to run all resources regardless of whether any failed, therefore there's no point in propagating and re-wrapping the exceptions. We could have a method to check whether this C/R is 'marked for rollback' (has any exceptions), but I don't see why would the parent context decide on the type - because it would get just CheckpointException with a list of other failures from components it does not understand.
>
> About the distinction: the difference is that if you get a CheckpointException you'd unwrap it, recording only the inner suppressed ones. But I should push that to `recordExceptions` and rather decide based on CheckpointException message than number of suppressions.

> The contexts are supposed to run all resources regardless of whether any failed, therefore there's no point in propagating and re-wrapping the exceptions. ... I don't see why would the parent context decide on the type - because it would get just CheckpointException with a list of other failures from components it does not understand.

The parent Context may implement arbitrary handling (for example, unloading a component completely if it throws an exception). So throwing an exception is useful.

With that, the new Core.recordException is a completely new exception flow that just optimizes something the generic throw scheme already does. The generic scheme should be good enough already; we don't need to complicate the interface and the code. CheckpointExceptions are still exceptions, that is, we don't expect them often, so there is no need to optimize them.
------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182328171 From akozlov at openjdk.org Tue May 2 10:06:43 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 2 May 2023 10:06:43 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v3] In-Reply-To: References: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com> Message-ID: On Tue, 2 May 2023 07:06:48 GMT, Radim Vansa wrote: >> src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java line 55: >> >>> 53: restoreQ.add(resource); >>> 54: try { >>> 55: resource.beforeCheckpoint(semanticContext()); >> >> Does this mean a Resource may get another Context and not the one to which it has been registered? This may be very unexpected for the Resource implementation. > > Theoretically, the method could do that. However, here the purpose of `semanticContext()` is to pass the context to which it was registered rather than the subcontext where it is stored (but this is an implementation detail that the resource does not know about). Just realized that this is required for PriorityContext implementation. But that is the implementation of that class, it's wrong ACI has to care about that. >> src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java line 78: >> >>> 76: } >>> 77: >>> 78: protected abstract void runBeforeCheckpoint(); >> >> This is intended to be overwritten (becomes a part of the class interface). The intent behind the separate method is not evident. Corresponding runAfterRestore is private though. >> >> After AbstractContexImpl has lost parameter P and comparator, a distinction between AbstractContexImpl and OrderedContext has been lost. Merging AbstractContexImpl into OrderedContext likely will provided clearer code. > > ACI is implemented both by OrderedContext and PriorityContext, while PC is quite different from OC. I'm trying to describe class hierachy and failing. 
The patch tries to reverse the ACI-PC relation. ACI was for partially-ordered Resources (defined by a Comparator), and now it's for totally-ordered Resources (ordered by long). Trying to fit the partially-ordering PC as a subclass of the totally-ordering ACI feels unnatural. Can we have a cleaner hierarchy of the classes?

-------------

PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182349575
PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182347913

From duke at openjdk.org Tue May 2 10:30:44 2023
From: duke at openjdk.org (Radim Vansa)
Date: Tue, 2 May 2023 10:30:44 GMT
Subject: [crac] RFR: Fix ordering of invocation on Resources [v3]
In-Reply-To: 
References: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com>
Message-ID: 

On Tue, 2 May 2023 10:02:34 GMT, Anton Kozlov wrote:

>> ACI is implemented both by OrderedContext and PriorityContext, while PC is quite different from OC.
>
> I'm trying to describe the class hierarchy and failing. The patch tries to reverse the ACI-PC relation. ACI was for partially-ordered Resources (defined by a Comparator), and now it's for totally-ordered Resources (ordered by long). Trying to fit the partially-ordering PC as a subclass of the totally-ordering ACI feels unnatural. Can we have a cleaner hierarchy of the classes?

No; ACI was totally-ordered in the previous version of the PR, but now it doesn't care about ordering at all; it's the abstract `runBeforeCheckpoint` that decides on the order.
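
To illustrate the point about the abstract method deciding the order, here is a minimal sketch. All class names below are hypothetical stand-ins, not the classes in the PR; resources are reduced to plain strings, and the priority rule (higher value first) is an arbitrary assumption made only for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class OrderingSketchDemo {
    public static void main(String[] args) {
        ReverseRegistrationSketch rev = new ReverseRegistrationSketch();
        rev.register("a");
        rev.register("b");
        System.out.println(rev.beforeCheckpoint());

        PrioritySketch prio = new PrioritySketch();
        prio.register(1L, "low");
        prio.register(5L, "high");
        System.out.println(prio.beforeCheckpoint());
    }
}

// Hypothetical stand-in for a base context; not the jdk.crac.impl class.
abstract class BaseContextSketch {
    private final List<String> notified = new ArrayList<>();

    // Shared step: "notify" one resource (here we only record its name).
    protected final void notifyResource(String name) {
        notified.add(name);
    }

    // The base class has no opinion on ordering; subclasses decide here.
    protected abstract void runBeforeCheckpoint();

    final List<String> beforeCheckpoint() {
        notified.clear();
        runBeforeCheckpoint();
        return List.copyOf(notified);
    }
}

// Notifies in reverse order of registration.
final class ReverseRegistrationSketch extends BaseContextSketch {
    private final List<String> resources = new ArrayList<>();

    void register(String name) {
        resources.add(name);
    }

    @Override
    protected void runBeforeCheckpoint() {
        for (int i = resources.size() - 1; i >= 0; i--) {
            notifyResource(resources.get(i));
        }
    }
}

// Notifies by numeric priority, highest first (assumed rule).
final class PrioritySketch extends BaseContextSketch {
    private final TreeMap<Long, List<String>> byPriority = new TreeMap<>();

    void register(long priority, String name) {
        byPriority.computeIfAbsent(priority, p -> new ArrayList<>()).add(name);
    }

    @Override
    protected void runBeforeCheckpoint() {
        for (List<String> group : byPriority.descendingMap().values()) {
            group.forEach(this::notifyResource);
        }
    }
}
```

Both subclasses reuse the same notification bookkeeping and differ only in the overridden method, which is the shape being argued for here.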
-------------

PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182370220

From duke at openjdk.org Tue May 2 10:44:48 2023
From: duke at openjdk.org (Radim Vansa)
Date: Tue, 2 May 2023 10:44:48 GMT
Subject: [crac] RFR: Fix ordering of invocation on Resources [v3]
In-Reply-To: 
References: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com>
Message-ID: <50suyCtjP8Ke3x6c2hjzSA-v-9sg2HjQNTuIJtrUYx8=.026b7b21-e44e-463d-af72-2ed789a44c41@github.com>

On Tue, 2 May 2023 10:04:21 GMT, Anton Kozlov wrote:

>> Theoretically, the method could do that. However, here the purpose of `semanticContext()` is to pass the context to which it was registered rather than the subcontext where it is stored (but this is an implementation detail that the resource does not know about).
>
> Just realized that this is required for PriorityContext implementation. But that is the implementation of that class, it's wrong ACI has to care about that.

Yes, it's a bit of enforced flexibility in the base class (by allowing a method to be overridden), though the base class doesn't need to know about the use case. I could use just a list rather than sub-contexts, but that would require duplicated code.

>> The contexts are supposed to run all resources regardless of whether any failed, therefore there's no point in propagating and re-wrapping the exceptions. We could have a method to check whether this C/R is 'marked for rollback' (has any exceptions), but I don't see why would the parent context decide on the type - because it would get just CheckpointException with a list of other failures from components it does not understand.
>>
>> About the distinction: the difference is that if you get a CheckpointException you'd unwrap it, recording only the inner suppressed ones. But I should push that to `recordExceptions` and rather decide based on CheckpointException message than number of suppressions.
>
>> The contexts are supposed to run all resources regardless of whether any failed, therefore there's no point in propagating and re-wrapping the exceptions. ... I don't see why would the parent context decide on the type - because it would get just CheckpointException with a list of other failures from components it does not understand.
>
> The parent Context may implement arbitrary handling (for example, unloading a component completely if it throws an exception). So throwing an exception is useful.
>
> With that, the new Core.recordException is a completely new exception flow that just optimizes something the generic throw scheme already does. The generic scheme should be good enough already; we don't need to complicate the interface and the code. CheckpointExceptions are still exceptions, that is, we don't expect them often, so there is no need to optimize them.

It's not about optimization (in the sense of performance) but about removing code bloat, and the need for the parent context to tediously copy failures (deciding whether something was a wrapper exception or the actual failure), when all you need to do in the end is to report them in bulk. It's not a new exception flow, it's removing the flow as there is no exception flow needed.

Your example is invalid: If a throwing resource is to be removed, it's the task of the parent context - and that one will see the exception. The parent context should not remove its child context just because one of the N resources in the child context is failing.
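
A rough sketch of the "run everything, report in bulk" idea, to make the argument concrete. This is not the PR's code: `Res`, `FailureLog` and the plain `Exception` wrapper are hypothetical stand-ins for `Resource`, the central store behind `Core.recordException`, and `CheckpointException`.

```java
import java.util.ArrayList;
import java.util.List;

class RunAllResourcesSketch {
    // Stand-in for a resource with a single checkpoint callback.
    interface Res {
        void beforeCheckpoint() throws Exception;
    }

    // Stand-in for the central failure store; nested contexts share one instance.
    static final class FailureLog {
        private final List<Exception> failures = new ArrayList<>();

        void record(Exception e) {
            failures.add(e);
        }

        // Reports all recorded failures at once, as suppressed exceptions.
        void throwIfAny() throws Exception {
            if (failures.isEmpty()) {
                return;
            }
            Exception bulk = new Exception("checkpoint failed");
            failures.forEach(bulk::addSuppressed);
            throw bulk;
        }
    }

    // Every resource is notified even if an earlier one failed.
    static void runAll(List<Res> resources, FailureLog log) {
        for (Res r : resources) {
            try {
                r.beforeCheckpoint();
            } catch (Exception e) {
                log.record(e);
            }
        }
    }

    // Returns the number of failures reported in bulk for a nested setup.
    static int demo() {
        FailureLog log = new FailureLog();
        List<Res> child = List.of(
                () -> { throw new IllegalStateException("child resource failed"); },
                () -> { });
        List<Res> parent = List.of(
                () -> runAll(child, log), // the child context is just another resource
                () -> { throw new IllegalStateException("parent resource failed"); });
        runAll(parent, log);
        try {
            log.throwIfAny();
            return 0;
        } catch (Exception e) {
            return e.getSuppressed().length;
        }
    }

    public static void main(String[] args) {
        System.out.println("failures reported in bulk: " + demo());
    }
}
```

With a shared store the failures end up in one flat list, so no level has to unwrap and re-wrap what the level below produced. The trade-off raised in the review still applies: the parent does not see the child's failures as a thrown exception.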
-------------

PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182382341
PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182379465

From akozlov at openjdk.org Tue May 2 10:57:41 2023
From: akozlov at openjdk.org (Anton Kozlov)
Date: Tue, 2 May 2023 10:57:41 GMT
Subject: [crac] RFR: Fix ordering of invocation on Resources [v3]
In-Reply-To: 
References: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com>
Message-ID: 

On Tue, 2 May 2023 07:21:24 GMT, Radim Vansa wrote:

> Does that matter, though? The registration itself will be done for next C/R in all cases. The situation where the registrar entered the `register()` method, added the resource and recorded an exception but this has been silently discarded is equivalent to the situation where it entered the `register()` just after restore was completed.

Suppose afterRestore() has some really important side effect and it's absolutely necessary to run it on restore. Then it falls into the category of problems this PR tries to solve (silently ignoring registered resources).

> If these are independent you cannot force it to fail reliably, unless there is something in running the resource notifications that depends on the registration (but it's not as all resources have been notified).
>
> If the intend was to do something after the current restore the resource should have been registered in a blocking way before the checkpoint or as part of the blocking beforeCheckpoint, not absolutely independently.

Some form of guarantee would be good. By blocking registration, I assume you mean something like register() returning only when the argument is successfully registered? This looks like a viable approach, e.g. it specifies the behavior around a possible race and does not affect the "normal" workflow when registration happens well before the checkpoint. Do you see any problem with blocking registration?
-------------

PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182393741

From akozlov at openjdk.org Tue May 2 11:38:43 2023
From: akozlov at openjdk.org (Anton Kozlov)
Date: Tue, 2 May 2023 11:38:43 GMT
Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore
In-Reply-To: 
References: 
Message-ID: 

On Tue, 2 May 2023 07:35:57 GMT, Radim Vansa wrote:

> The reason why I rather stayed on MANAGEABLE was to prevent changing every `MANAGEABLE` to `MANAGEABLE | SET_ON_RESTORE`, that would complicate any backport from mainline.

I totally support MANAGEABLE => RESTORABLE, as it makes sense.

> I can do the SET_ON_RESTORE (as superset of MANAGEABLE) but I don't think that the few exceptions that could be rather documented by a few lines of comment really deserve a separate flag.

Options are also a form of documentation for the flags, so the new class of options deserves its own name.

> Below we see that `CRaCIgnoredFileDescriptors` is RESTORABLE but not MANAGEABLE, so that's not perfect either.

Yes, that is another one. And there are many more, like PrintCompilation, or any other product flag whose handling in the VM allows an update on restore. Over time the set of RESTORABLE flags can grow as the VM implementation allows, while there is a higher bar to include a flag in the MANAGEABLE set [1].

[1] https://github.com/openjdk/crac/blob/crac/src/hotspot/share/runtime/globals.hpp#L86

> (I would rather use SET_ON_RESTORE as all flags are restored at its previous value)

I don't like RESTORABLE either (it relates to RESTORE, but how is not clear). The new name had better fit into the existing set of DIAGNOSTIC, EXPERIMENTAL, or MANAGEABLE.
------------- PR Review Comment: https://git.openjdk.org/crac/pull/61#discussion_r1182430278 From duke at openjdk.org Tue May 2 12:31:51 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 2 May 2023 12:31:51 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v3] In-Reply-To: References: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com> Message-ID: On Tue, 2 May 2023 10:54:51 GMT, Anton Kozlov wrote: >> Does that matter, though? The registration itself will be done for next C/R in all cases. The situation where the registrar entered the `register()` method, added the resource and recorded an exception but this has been silently discarded is equivalent to the situation where it entered the `register()` just after restore was completed. If these are independent you cannot force it to fail reliably, unless there is something in running the resource notifications that depends on the registration (but it's not as all resources have been notified). >> >> If the intend was to do something after the current restore the resource should have been registered in a blocking way before the checkpoint or as part of the blocking beforeCheckpoint, not absolutely independently. > >> Does that matter, though? The registration itself will be done for next C/R in all cases. The situation where the registrar entered the `register()` method, added the resource and recorded an exception but this has been silently discarded is equivalent to the situation where it entered the `register()` just after restore was completed. > > Suppose aR() has some really important side-effect, it's totally necessarily to run that on restore. Then it falls to the category of problems this PR tries to solve (silently ignoring registered resources). 
>
>> If these are independent you cannot force it to fail reliably, unless there is something in running the resource notifications that depends on the registration (but it's not as all resources have been notified).
>>
>> If the intent was to do something after the current restore the resource should have been registered in a blocking way before the checkpoint or as part of the blocking beforeCheckpoint, not absolutely independently.
>
> Some form of guarantee would be good. By blocking registration, I assume you mean something like register() returning only when the argument is successfully registered? This looks like a viable approach, e.g. it specifies the behavior around a possible race and does not affect the "normal" workflow when registration happens well before the checkpoint. Do you see any problem with blocking registration?

I don't think we understand each other. Let's say you have code like this:

    new Thread(() -> {
        Resource another = /* ... */;
        Core.getGlobalContext().register(another);
    }).start();
    Core.checkpointRestore();

You insisted on a `register()` that does not throw. What implementation could ensure that something eventually makes `Core.checkpointRestore()` throw? There's no guarantee that the registration will run before the checkpoint completes; the code does not order these in any way. The only result the user can expect is that the resource will be registered, eventually. Had you added a `CountDownLatch` that is counted down after calling `register()` and awaited somewhere in one of the `beforeCheckpoint` methods, the race would not happen.
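
The latch idea could look roughly like this. It is a minimal sketch with stand-in types: `Res` and the plain list are hypothetical replacements for `Resource` and the global context, and the direct call to `gate.beforeCheckpoint()` stands in for the checkpoint notifying an already registered resource. How a real context treats a registration that arrives during the checkpoint is deliberately left out.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

class LatchedRegistrationSketch {
    // Stand-in for a resource with a single checkpoint callback.
    interface Res {
        void beforeCheckpoint() throws InterruptedException;
    }

    // Returns true if the registration was visible by the time the gate let the checkpoint proceed.
    static boolean demo() throws InterruptedException {
        List<Res> registered = new CopyOnWriteArrayList<>(); // stand-in for the context
        CountDownLatch registrationDone = new CountDownLatch(1);
        boolean[] seen = new boolean[1];

        Res another = () -> { };

        // Registered up front; its callback blocks until the registrar thread has finished.
        Res gate = () -> {
            registrationDone.await();
            seen[0] = registered.contains(another);
        };
        registered.add(gate);

        Thread registrar = new Thread(() -> {
            registered.add(another);      // stand-in for register(another)
            registrationDone.countDown(); // signal only after the registration returned
        });
        registrar.start();

        gate.beforeCheckpoint(); // stand-in for the checkpoint notifying the gate resource
        registrar.join();
        return seen[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("registration ordered before checkpoint: " + demo());
    }
}
```

The latch gives the happens-before edge that the original snippet lacks: the checkpoint cannot get past the gate until `register()` has returned in the other thread.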
------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1182481551 From akozlov at openjdk.org Tue May 2 14:03:17 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 2 May 2023 14:03:17 GMT Subject: [crac] RFR: Backout new API to sync with Reference Handler [v3] In-Reply-To: References: Message-ID: <-y7D8kIbwPs5UJx2iOrmXhahEEraCfG7r46yCbGdpjs=.882bb53e-a8d0-4b32-b1de-19116c37f325@github.com> > This reverts commit 9cf1995693eead85d3807fb4c83ab38c14e27042 and makes #22 obsolete. > > The API introduced in 9cf1995693eead85d3807fb4c83ab38c14e27042 (waitForWaiters) and changed in #22 waits for the state when all discovered references are processed. So WaitForWaiters is used to implement predictable Reference Handling, ensuring that clean-up actions have fired for an object after it becomes unreachable. > > I think that API was a mistake and should be reverted. > > In general, the problem of predictable Reference Handling is independent of CRaC. So I thought about extracting that out of CRaC and found a few issues with the approach. A user needs to know what RefQueue gets References after an object becomes unreachable, to call waitForWaiters on that queue. The queue is not necessarily evident, so a deep understanding of refs and queues in an application is required to select the proper queue to wait on, and to build the right order of them to wait on. Also, it's required somehow to know the number of threads servicing a queue. And there are situations when waitForWaiters may report that all refs are processed, but some of them are not -- consider a thread that is polling a queue and gets refs to be processed but then buffers them in another queue for later, in this example waitForWaiters does not provide the guarantee that corresponding clean-up actions were performed. 
> > The common and more straightforward way to have predictable clean-up is to call an explicit method like close()/release()/cleanup() that performs object-specific clean-up actions predictably. Anton Kozlov has updated the pull request incrementally with three additional commits since the last revision: - Bring back the test - Use in-place Resource - Bring back parts of the commit ------------- Changes: - all: https://git.openjdk.org/crac/pull/34/files - new: https://git.openjdk.org/crac/pull/34/files/63bc1847..73870426 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=34&range=02 - incr: https://webrevs.openjdk.org/?repo=crac&pr=34&range=01-02 Stats: 128 lines in 4 files changed: 126 ins; 0 del; 2 mod Patch: https://git.openjdk.org/crac/pull/34.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/34/head:pull/34 PR: https://git.openjdk.org/crac/pull/34 From akozlov at openjdk.org Tue May 2 16:21:44 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 2 May 2023 16:21:44 GMT Subject: [crac] RFR: Backout new API to sync with Reference Handler [v3] In-Reply-To: <-y7D8kIbwPs5UJx2iOrmXhahEEraCfG7r46yCbGdpjs=.882bb53e-a8d0-4b32-b1de-19116c37f325@github.com> References: <-y7D8kIbwPs5UJx2iOrmXhahEEraCfG7r46yCbGdpjs=.882bb53e-a8d0-4b32-b1de-19116c37f325@github.com> Message-ID: On Tue, 2 May 2023 14:03:17 GMT, Anton Kozlov wrote: >> This reverts commit 9cf1995693eead85d3807fb4c83ab38c14e27042 and makes #22 obsolete. >> >> The API introduced in 9cf1995693eead85d3807fb4c83ab38c14e27042 (waitForWaiters) and changed in #22 waits for the state when all discovered references are processed. So WaitForWaiters is used to implement predictable Reference Handling, ensuring that clean-up actions have fired for an object after it becomes unreachable. >> >> I think that API was a mistake and should be reverted. >> >> In general, the problem of predictable Reference Handling is independent of CRaC. 
So I thought about extracting that out of CRaC and found a few issues with the approach. A user needs to know what RefQueue gets References after an object becomes unreachable, to call waitForWaiters on that queue. The queue is not necessarily evident, so a deep understanding of refs and queues in an application is required to select the proper queue to wait on, and to build the right order of them to wait on. Also, it's required somehow to know the number of threads servicing a queue. And there are situations when waitForWaiters may report that all refs are processed, but some of them are not -- consider a thread that is polling a queue and gets refs to be processed but then buffers them in another queue for later, in this example waitForWaiters does not provide the guarantee that corresponding clean-up actions were performed. >> >> The common and more straightforward way to have predictable clean-up is to call an explicit method like close()/release()/cleanup() that performs object-specific clean-up actions predictably. > > Anton Kozlov has updated the pull request incrementally with three additional commits since the last revision: > > - Bring back the test > - Use in-place Resource > - Bring back parts of the commit Turns out we cannot avoid synchronizing Cleaners with the checkpoint, otherwise, some native resources may remain open, if they are released as a result of running a Cleaner. This happens with JarFileFactory, which maintains a cache of URLJarFiles. To avoid their tracking, a lightweight GC-based tracking is implemented [1], that only requires that unused entries are reachable only from the cache. But it relies on Cleaner to run all the actions. The latest version declares PhantomCleanableRef's to be Resources, which triggers clean() on the checkpoint, rather than waiting for the ref to be processed by a reference processor thread. This preserves Cleaner behavior w.r.t. checkpoint, so the test is also retained. 
[1] https://github.com/openjdk/crac/blob/95394e84683f1a816c0283f8c834072324516fba/src/java.base/unix/classes/sun/net/www/protocol/jar/JarFileFactory.java#L255 ------------- PR Comment: https://git.openjdk.org/crac/pull/34#issuecomment-1531754292 From akozlov at openjdk.org Tue May 2 17:15:45 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 2 May 2023 17:15:45 GMT Subject: [crac] RFR: Improved open file descriptors tracking [v9] In-Reply-To: References: Message-ID: <2uXm_DAQyImefAYUr1Ujxgxwu8bP0gqzhCTfvWNbDzE=.5ce26d49-9bbd-4097-9138-689391145509@github.com> On Mon, 24 Apr 2023 13:27:30 GMT, Radim Vansa wrote: >> Tracks `java.io.FileDescriptor` instances as CRaC resource; before checkpoint these are reported and if not allow-listed (e.g. as opened as standard descriptors) an exception is thrown. Further investigation can use system property `jdk.crac.collect-fd-stacktraces=true` to record origin of those file descriptors. >> File descriptors claimed in Java code are passed to native; native code checks all open file descriptors and reports error if there's an unexpected FD that is not included in the list passed previously. > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Provide more information for file descriptors src/hotspot/os/linux/os_linux.cpp line 6092: > 6090: int ilen = snprintf(msg, maxinfo, "FD fd=%d type=%s path=", i, type); > 6091: ilen = ilen > maxinfo ? maxinfo : ilen; > 6092: strncpy(msg + ilen, detailsbuf, buflen - ilen); `ilen >= maxinfo` if the output was truncated [1]. `strncpy` may leave the string unterminated if `buflen - ilen` is smaller than the length of the details. So `snprintf(msg, maxinfo, "FD fd=%d type=%s path=%s", i, type, details)` would be a safer option. [1] RETURN in https://linux.die.net/man/3/snprintf src/java.base/share/classes/java/io/FileDescriptor.java line 400: > 398: } else { > 399: info = (path != null ? path : "unknown path") + " (" + (type != null ? 
type : "unknown") + ")"; > 400: } This has too many socket-related details, and also a number of Java/native transitions that will be unavoidable if we adopt the proposed interface. src/java.base/share/classes/java/net/Socket.java line 1939: > 1937: * @return Textual representation of the type. > 1938: */ > 1939: public static native String getType(int fd); `int fd` is a Unix platform detail. I propose a single function `public static String getDescription(FileDescriptor socket)` that, e.g., returns `null` if the FileDescriptor is not a socket. The method can be native or not, depending on the implementation. src/java.base/unix/native/libnet/SocketImpl.c line 115: > 113: return NULL; > 114: } > 115: } No need for `} else {` here and everywhere else, since the previous block has already terminated the function. This will make the code more streamlined. Suggestion: } localAddr = create_isa(env, isa_class, isa_ctor, &local); if (localAddr == NULL) { JNU_ThrowOutOfMemoryError(env, "java.net.InetSocketAddres"); return NULL; } ------------- PR Review Comment: https://git.openjdk.org/crac/pull/43#discussion_r1182827339 PR Review Comment: https://git.openjdk.org/crac/pull/43#discussion_r1182802505 PR Review Comment: https://git.openjdk.org/crac/pull/43#discussion_r1182794395 PR Review Comment: https://git.openjdk.org/crac/pull/43#discussion_r1182815623 From duke at openjdk.org Wed May 3 06:54:38 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 3 May 2023 06:54:38 GMT Subject: [crac] RFR: Backout new API to sync with Reference Handler [v3] In-Reply-To: <-y7D8kIbwPs5UJx2iOrmXhahEEraCfG7r46yCbGdpjs=.882bb53e-a8d0-4b32-b1de-19116c37f325@github.com> References: <-y7D8kIbwPs5UJx2iOrmXhahEEraCfG7r46yCbGdpjs=.882bb53e-a8d0-4b32-b1de-19116c37f325@github.com> Message-ID: On Tue, 2 May 2023 14:03:17 GMT, Anton Kozlov wrote: >> This reverts commit 9cf1995693eead85d3807fb4c83ab38c14e27042 and makes #22 obsolete. 
>> >> The API introduced in 9cf1995693eead85d3807fb4c83ab38c14e27042 (waitForWaiters) and changed in #22 waits for the state when all discovered references are processed. So WaitForWaiters is used to implement predictable Reference Handling, ensuring that clean-up actions have fired for an object after it becomes unreachable. >> >> I think that API was a mistake and should be reverted. >> >> In general, the problem of predictable Reference Handling is independent of CRaC. So I thought about extracting that out of CRaC and found a few issues with the approach. A user needs to know what RefQueue gets References after an object becomes unreachable, to call waitForWaiters on that queue. The queue is not necessarily evident, so a deep understanding of refs and queues in an application is required to select the proper queue to wait on, and to build the right order of them to wait on. Also, it's required somehow to know the number of threads servicing a queue. And there are situations when waitForWaiters may report that all refs are processed, but some of them are not -- consider a thread that is polling a queue and gets refs to be processed but then buffers them in another queue for later, in this example waitForWaiters does not provide the guarantee that corresponding clean-up actions were performed. >> >> The common and more straightforward way to have predictable clean-up is to call an explicit method like close()/release()/cleanup() that performs object-specific clean-up actions predictably. > > Anton Kozlov has updated the pull request incrementally with three additional commits since the last revision: > > - Bring back the test > - Use in-place Resource > - Bring back parts of the commit I've commented out the `PhantomCleanableRef` registration for a test and in one case the test passed (other attempts failed) - I wish we could have the test failing more reliably. Nevertheless LGTM. 
------------- PR Comment: https://git.openjdk.org/crac/pull/34#issuecomment-1532527018 From duke at openjdk.org Wed May 3 09:45:09 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 3 May 2023 09:45:09 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore [v2] In-Reply-To: References: Message-ID: > When a JVM option is MANAGEABLE it can be set at any time during runtime, therefore it is safe to change it during the restore operation. Rather than silently ignoring JVM options passed along with -XX:CRaCRestoreFrom we send them to the restored process and either update or print a warning if given option cannot be changed. Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Use RESTORE_SETTABLE on JVM flags * Fail early when using non settable flags * CracBuilder fix: don't use classpath during restore ------------- Changes: - all: https://git.openjdk.org/crac/pull/61/files - new: https://git.openjdk.org/crac/pull/61/files/b2a73eb9..21ec5e80 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=61&range=01 - incr: https://webrevs.openjdk.org/?repo=crac&pr=61&range=00-01 Stats: 126 lines in 8 files changed: 64 ins; 10 del; 52 mod Patch: https://git.openjdk.org/crac/pull/61.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/61/head:pull/61 PR: https://git.openjdk.org/crac/pull/61 From duke at openjdk.org Wed May 3 09:47:44 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 3 May 2023 09:47:44 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore In-Reply-To: References: Message-ID: On Fri, 28 Apr 2023 07:24:06 GMT, Radim Vansa wrote: > When a JVM option is MANAGEABLE it can be set at any time during runtime, therefore it is safe to change it during the restore operation. 
Rather than silently ignoring JVM options passed along with -XX:CRaCRestoreFrom we send them to the restored process and either update or print a warning if given option cannot be changed. @AntonKozlov Created a separate bit (RESTORE_SETTABLE - adjective as the others), and using non-settable flags fails early. I've also tried to prohibit explicitly setting `-cp` and friends but in the `parse_options_for_restore` I already cannot tell where this comes from so I have not incorporated this in this PR. There's a few extra changes in `CracBuilder` that would help with testing ^ but I find them useful in general, so I've included those. ------------- PR Comment: https://git.openjdk.org/crac/pull/61#issuecomment-1532735769 From akozlov at openjdk.org Wed May 3 09:57:49 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 3 May 2023 09:57:49 GMT Subject: [crac] RFR: Backout new API to sync with Reference Handler [v3] In-Reply-To: <-y7D8kIbwPs5UJx2iOrmXhahEEraCfG7r46yCbGdpjs=.882bb53e-a8d0-4b32-b1de-19116c37f325@github.com> References: <-y7D8kIbwPs5UJx2iOrmXhahEEraCfG7r46yCbGdpjs=.882bb53e-a8d0-4b32-b1de-19116c37f325@github.com> Message-ID: On Tue, 2 May 2023 14:03:17 GMT, Anton Kozlov wrote: >> This reverts commit 9cf1995693eead85d3807fb4c83ab38c14e27042 and makes #22 obsolete. >> >> The API introduced in 9cf1995693eead85d3807fb4c83ab38c14e27042 (waitForWaiters) and changed in #22 waits for the state when all discovered references are processed. So WaitForWaiters is used to implement predictable Reference Handling, ensuring that clean-up actions have fired for an object after it becomes unreachable. >> >> I think that API was a mistake and should be reverted. >> >> In general, the problem of predictable Reference Handling is independent of CRaC. So I thought about extracting that out of CRaC and found a few issues with the approach. A user needs to know what RefQueue gets References after an object becomes unreachable, to call waitForWaiters on that queue. 
The queue is not necessarily evident, so a deep understanding of refs and queues in an application is required to select the proper queue to wait on, and to build the right order of them to wait on. Also, it's required somehow to know the number of threads servicing a queue. And there are situations when waitForWaiters may report that all refs are processed, but some of them are not -- consider a thread that is polling a queue and gets refs to be processed but then buffers them in another queue for later, in this example waitForWaiters does not provide the guarantee that corresponding clean-up actions were performed. >> >> The common and more straightforward way to have predictable clean-up is to call an explicit method like close()/release()/cleanup() that performs object-specific clean-up actions predictably. > > Anton Kozlov has updated the pull request incrementally with three additional commits since the last revision: > > - Bring back the test > - Use in-place Resource > - Bring back parts of the commit Thanks for the review! ------------- PR Comment: https://git.openjdk.org/crac/pull/34#issuecomment-1532747527 From akozlov at openjdk.org Wed May 3 09:57:49 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 3 May 2023 09:57:49 GMT Subject: [crac] Integrated: Backout new API to sync with Reference Handler In-Reply-To: References: Message-ID: On Thu, 10 Nov 2022 15:34:23 GMT, Anton Kozlov wrote: > This reverts commit 9cf1995693eead85d3807fb4c83ab38c14e27042 and makes #22 obsolete. > > The API introduced in 9cf1995693eead85d3807fb4c83ab38c14e27042 (waitForWaiters) and changed in #22 waits for the state when all discovered references are processed. So WaitForWaiters is used to implement predictable Reference Handling, ensuring that clean-up actions have fired for an object after it becomes unreachable. > > I think that API was a mistake and should be reverted. > > In general, the problem of predictable Reference Handling is independent of CRaC. 
So I thought about extracting that out of CRaC and found a few issues with the approach. A user needs to know what RefQueue gets References after an object becomes unreachable, to call waitForWaiters on that queue. The queue is not necessarily evident, so a deep understanding of refs and queues in an application is required to select the proper queue to wait on, and to build the right order of them to wait on. Also, it's required somehow to know the number of threads servicing a queue. And there are situations when waitForWaiters may report that all refs are processed, but some of them are not -- consider a thread that is polling a queue and gets refs to be processed but then buffers them in another queue for later, in this example waitForWaiters does not provide the guarantee that corresponding clean-up actions were performed. > > The common and more straightforward way to have predictable clean-up is to call an explicit method like close()/release()/cleanup() that performs object-specific clean-up actions predictably. This pull request has now been integrated. 
Changeset: ccf33231 Author: Anton Kozlov URL: https://git.openjdk.org/crac/commit/ccf33231110c8e8dd3c47bae0a079d25a34ac8b5 Stats: 53 lines in 2 files changed: 19 ins; 32 del; 2 mod Backout new API to sync with Reference Handler ------------- PR: https://git.openjdk.org/crac/pull/34 From duke at openjdk.org Wed May 3 11:06:44 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 3 May 2023 11:06:44 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v3] In-Reply-To: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com> References: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com> Message-ID: On Fri, 28 Apr 2023 11:48:16 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with two additional commits since the last revision: >> >> - More fine-grained synchronization >> - Rework context ordering (round 2) >> >> * call afterRestore even if beforeCheckpoint throws >> * registering resource in previous/running context does not trigger exception immediatelly >> ** instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * we don't guarantee threads not deadlocking when trying to register a resource, though > > src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java line 42: > >> 40: for (Throwable t : suppressed) { >> 41: Core.recordException(t); >> 42: } > > Unwrap Checkpoint/RestoreException only? This is how it's actually used; can't use union type in Java. I'll add an assertion to make this clear... 
------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1183542544 From akozlov at openjdk.org Wed May 3 11:28:38 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 3 May 2023 11:28:38 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore [v2] In-Reply-To: References: Message-ID: On Wed, 3 May 2023 09:45:09 GMT, Radim Vansa wrote: >> When a JVM option is MANAGEABLE it can be set at any time during runtime, therefore it is safe to change it during the restore operation. Rather than silently ignoring JVM options passed along with -XX:CRaCRestoreFrom we send them to the restored process and either update or print a warning if given option cannot be changed. > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Use RESTORE_SETTABLE on JVM flags > > * Fail early when using non settable flags > * CracBuilder fix: don't use classpath during restore Thanks, looks mostly good except for a few nits. Do you have an idea why jdk/crac/recursiveCheckpoint/Test.java has failed? src/hotspot/share/runtime/globals.hpp line 57: > 55: > 56: // The optional extra_attrs parameter may have one of the following values: > 57: // DIAGNOSTIC, EXPERIMENTAL, MANAGEABLE and RESTORE_SETTABLE. Currently ` and` -> `, or` src/hotspot/share/runtime/globals.hpp line 2094: > 2092: "Trace optimized upcall stub generation") \ > 2093: \ > 2094: product(ccstr, CRaCCheckpointTo, NULL, MANAGEABLE, \ I see reasons for CRaCCheckpointTo to be MANAGEABLE, but at the moment the implementation assumes the flag is set on the command line, e.g. os::Linux::{vm_create_start,prepare_checkpoint} are called depending on the flag value, and that can happen only during VM initialization. A set of changes is required before the option can become MANAGEABLE. The test should also be updated once the option is reverted. 
src/hotspot/share/runtime/globals.hpp line 2129: > 2127: "Throw CheckpointException if uncheckpointable resource handle found")\ > 2128: \ > 2129: product(bool, CRTrace, true, MANAGEABLE, "Minimal C/R tracing") \ Was RESTORE_SETTABLE meant here? Please don't mix MANAGEABLE flags into this PR if that was intended. ------------- PR Comment: https://git.openjdk.org/crac/pull/61#issuecomment-1532862561 PR Comment: https://git.openjdk.org/crac/pull/61#issuecomment-1532864110 PR Review Comment: https://git.openjdk.org/crac/pull/61#discussion_r1183538871 PR Review Comment: https://git.openjdk.org/crac/pull/61#discussion_r1183553692 PR Review Comment: https://git.openjdk.org/crac/pull/61#discussion_r1183485336 From akozlov at openjdk.org Wed May 3 11:28:40 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 3 May 2023 11:28:40 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore [v2] In-Reply-To: References: Message-ID: On Fri, 28 Apr 2023 15:07:41 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Use RESTORE_SETTABLE on JVM flags >> >> * Fail early when using non settable flags >> * CracBuilder fix: don't use classpath during restore > > src/hotspot/os/linux/os_linux.cpp line 6579: > >> 6577: } >> 6578: if (result != JVMFlag::Error::SUCCESS) { >> 6579: warning("VM Option '%s' cannot be changed, ignoring: %s", > > A significant set of options cannot be set on restore at the moment, so it would be even better to highlight that they have no effect and produce an error. It may be useful to be able to revert to a warning (e.g. with an option), but by default that should be disabled (leading to the error). This place can use `guarantee(result == JVMFlag::Error::SUCCESS, "...")` since the possibility was checked earlier. 
------------- PR Review Comment: https://git.openjdk.org/crac/pull/61#discussion_r1183558035 From duke at openjdk.org Wed May 3 11:28:40 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 3 May 2023 11:28:40 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore [v2] In-Reply-To: References: Message-ID: On Wed, 3 May 2023 10:03:40 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Use RESTORE_SETTABLE on JVM flags >> >> * Fail early when using non settable flags >> * CracBuilder fix: don't use classpath during restore > > src/hotspot/share/runtime/globals.hpp line 2129: > >> 2127: "Throw CheckpointException if uncheckpointable resource handle found")\ >> 2128: \ >> 2129: product(bool, CRTrace, true, MANAGEABLE, "Minimal C/R tracing") \ > > RESTORE_SETTABLE was meant here? Please don't mix in MANAGEABLE flags into this PR if that was inteded. Looking for usages (actually only one) of the flag, it qualifies to be set at any time. 
------------- PR Review Comment: https://git.openjdk.org/crac/pull/61#discussion_r1183560669 From duke at openjdk.org Wed May 3 12:40:43 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 3 May 2023 12:40:43 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore [v2] In-Reply-To: References: Message-ID: <8nZsMojvW82BDqhDm-UBSM4NVRXeLPeTseO1w93_U3w=.75769e3d-cabf-4b85-99e2-66271be65bf9@github.com> On Wed, 3 May 2023 11:17:17 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Use RESTORE_SETTABLE on JVM flags >> >> * Fail early when using non settable flags >> * CracBuilder fix: don't use classpath during restore > > src/hotspot/share/runtime/globals.hpp line 2094: > >> 2092: "Trace optimized upcall stub generation") \ >> 2093: \ >> 2094: product(ccstr, CRaCCheckpointTo, NULL, MANAGEABLE, \ > > I see reasons for CRaCCheckpointTo to be MANAGEABLE, but at the moment the flag is assumed to be set in the command line by the implementation, e.g. os::Linux::{vm_create_start,prepare_checkpoint} are called depending on the flag value, and that can happen only during VM initialization. > > A set of changes are required before the option can become MANAGEABLE. > > The test should also updated once the option is reverted. Well spotted, I was thinking about changing the path but these two places need to be called in order to prepare for a checkpoint. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/61#discussion_r1183629613 From duke at openjdk.org Wed May 3 14:01:49 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Wed, 3 May 2023 14:01:49 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v13] In-Reply-To: References: Message-ID: > Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. > > 1. 
An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. > 2. A second reason is that glibc has a similar problem: its PLT entries point to CPU-optimized functions, which also crash on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC > > (1) could be solved automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore a new OpenJDK option had to be implemented: > >> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine > > It is intended to specify the lowest common denominator of all CPUs in a farm. Instead of possibly crashing, OpenJDK will now refuse to run: > >> Error occurred during initialization of VM >> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi > > (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. The current problem is that this has turned out to be very invasive into private glibc structures. It could work with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed, but that has been considered an unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept, and I will propose such a feature for glibc upstream, where it should be easy to implement. 
> > If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there is a precedent OpenJDK upstream has already rejected such re-exec idea in the past: https://github.com/openjdk/crac/pull/31#issuecomment-1275707621 > > That IMO does not preclude trying the same for this case. > > - Debian 11 x86_64: It does not work, glibc is too different and inlined there. > - Debian 12 x86_64: It works even without libc6-dbg as its offsets are the default. > - Fedora 36 x86_64: It works as on Fedoras glibc debuginfo is embedded. Jan Kratochvil has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 80 commits: - Merge branch 'crac-altstack' into crac-altstack-tunables - Merge branch 'crac' into crac-altstack - Merge branch 'crac' into crac-altstack - 2b0f56b7: - ec18a208: - Fix the glibc SSE2 exception. - c446cae3: - Use CPU_FEATURE_ACTIVE. - Compatibility with old glibcs. - Fix a crash. - ... 
and 70 more: https://git.openjdk.org/crac/compare/ccf33231...9e6faf67 ------------- Changes: https://git.openjdk.org/crac/pull/41/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=41&range=12 Stats: 720 lines in 19 files changed: 697 ins; 3 del; 20 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41 From duke at openjdk.org Wed May 3 14:09:06 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Wed, 3 May 2023 14:09:06 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v13] In-Reply-To: References: Message-ID: On Wed, 3 May 2023 14:01:49 GMT, Jan Kratochvil wrote: >> Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. >> >> 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. >> 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC >> >> (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: >> >>> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine >> >> It is intended to specify the lowest common denominator of all CPUs in a farm. 
Instead of a possible crash of OpenJDK it will now refuse to run: >> >>> Error occurred during initialization of VM >>> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi >> >> (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. >> >> If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there is a precedent OpenJDK upstream has already rejected such re-exec idea in the past: https://github.com/openjdk/crac/pull/31#issuecomment-1275707621 >> >> That IMO does not preclude trying the same for this case. >> >> - Debian 11 x86_64: It does not work, glibc is too different and inlined there. >> - Debian 12 x86_64: It works even without libc6-dbg as its offsets are the default. 
>> - Fedora 36 x86_64: It works as on Fedoras glibc debuginfo is embedded. > > Jan Kratochvil has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 80 commits: > > - Merge branch 'crac-altstack' into crac-altstack-tunables > - Merge branch 'crac' into crac-altstack > - Merge branch 'crac' into crac-altstack > - 2b0f56b7: > - ec18a208: > - Fix the glibc SSE2 exception. > - c446cae3: > - Use CPU_FEATURE_ACTIVE. > - Compatibility with old glibcs. > - Fix a crash. > - ... and 70 more: https://git.openjdk.org/crac/compare/ccf33231...9e6faf67 Here is a variant which should be compatible with any glibc version using: **GLIBC_TUNABLES=:[glibc.cpu.hwcaps](https://www.gnu.org/software/libc/manual/html_node/Hardware-Capability-Tunables.html)=...** > `./build/linux-x86_64-server-slowdebug/jdk/bin/java -XX:CPUFeatures=generic -XX:+ShowCPUFeatures Hello.java` - On older glibcs not supporting macro `CPU_FEATURE_ACTIVE` the disabling of glibc features has no effect (and it may crash the migration even after using `-XX:CPUFeatures=generic`). - Also with newer glibcs than what the OpenJDK/CRaC sources support it may crash due to some new glibc feature not disabled by OpenJDK/CRaC. - I have setup (for me) tracking of [sysdeps/x86/sys/platform/x86.h](https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=sysdeps/x86/sys/platform/x86.h;hb=HEAD) and [sysdeps/x86/bits/platform/x86.h](https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=sysdeps/x86/bits/platform/x86.h;hb=HEAD). 
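As a side note on the numbers in the error message quoted in this thread: if the `-XX:CPUFeatures` values are read as plain bitmasks (an assumption based on the hexadecimal format; the real check lives in HotSpot's C++ code and may differ in detail), the "missing features" value is the image mask with the machine's bits cleared, and the "lowest common denominator" of a farm is the bitwise AND of all machine masks. A small sketch, with hypothetical class and method names:

```java
public class CpuFeatureMask {
    // Features recorded in the checkpoint image that this machine lacks.
    static long missingFeatures(long imageFeatures, long machineFeatures) {
        return imageFeatures & ~machineFeatures;
    }

    // Lowest common denominator of a farm: bits set on every machine.
    static long commonFeatures(long... machineMasks) {
        long common = ~0L;
        for (long mask : machineMasks) {
            common &= mask;
        }
        return common;
    }

    // Restore is only safe when nothing the image relies on is missing.
    static boolean canRestore(long imageFeatures, long machineFeatures) {
        return missingFeatures(imageFeatures, machineFeatures) == 0;
    }

    public static void main(String[] args) {
        long image = 0x7fff9dfcfbf7L;   // features in the checkpoint image (from the error message)
        long machine = 0x421801fcfbd7L; // features of the restoring machine (from the error message)
        System.out.println("missing = 0x" + Long.toHexString(missingFeatures(image, machine)));
        System.out.println("common  = 0x" + Long.toHexString(commonFeatures(image, machine)));
        System.out.println("can restore: " + canRestore(image, machine));
    }
}
```

With the two masks from the message, the missing-features value works out to the `0x3de79c000020` that the VM reports, and the common mask equals the weaker machine's mask, which is the value the message tells the user to pass when creating the checkpoint.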
------------- PR Comment: https://git.openjdk.org/crac/pull/41#issuecomment-1533091216 From duke at openjdk.org Thu May 4 07:17:54 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 4 May 2023 07:17:54 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore [v3] In-Reply-To: References: Message-ID: > When a JVM option is MANAGEABLE it can be set at any time during runtime, therefore it is safe to change it during the restore operation. Rather than silently ignoring JVM options passed along with -XX:CRaCRestoreFrom we send them to the restored process and either update or print a warning if given option cannot be changed. Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Fixup ------------- Changes: - all: https://git.openjdk.org/crac/pull/61/files - new: https://git.openjdk.org/crac/pull/61/files/21ec5e80..75ce1b64 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=61&range=02 - incr: https://webrevs.openjdk.org/?repo=crac&pr=61&range=01-02 Stats: 16 lines in 3 files changed: 6 ins; 5 del; 5 mod Patch: https://git.openjdk.org/crac/pull/61.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/61/head:pull/61 PR: https://git.openjdk.org/crac/pull/61 From duke at openjdk.org Thu May 4 07:17:54 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 4 May 2023 07:17:54 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore [v2] In-Reply-To: References: Message-ID: On Wed, 3 May 2023 11:26:13 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Use RESTORE_SETTABLE on JVM flags >> >> * Fail early when using non settable flags >> * CracBuilder fix: don't use classpath during restore > > Do you have an idea why jdk/crac/recursiveCheckpoint/Test.java has failed? @AntonKozlov Updated. 
The test failed rightfully, I've stopped adding the `CREngine` flag but that breaks the restore with non-default engine. Now made the flag RESTORE_SETTABLE. ------------- PR Comment: https://git.openjdk.org/crac/pull/61#issuecomment-1534200247 From duke at openjdk.org Thu May 4 07:17:54 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 4 May 2023 07:17:54 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore [v2] In-Reply-To: References: Message-ID: On Wed, 3 May 2023 09:45:09 GMT, Radim Vansa wrote: >> When a JVM option is MANAGEABLE it can be set at any time during runtime, therefore it is safe to change it during the restore operation. Rather than silently ignoring JVM options passed along with -XX:CRaCRestoreFrom we send them to the restored process and either update or print a warning if given option cannot be changed. > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Use RESTORE_SETTABLE on JVM flags > > * Fail early when using non settable flags > * CracBuilder fix: don't use classpath during restore I think that we could write the name of the engine used to the checkpoint directory, that way the user won't need to pass the engine on restore. 
------------- PR Comment: https://git.openjdk.org/crac/pull/61#issuecomment-1534201741 From duke at openjdk.org Thu May 4 07:24:42 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 4 May 2023 07:24:42 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v3] In-Reply-To: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com> References: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com> Message-ID: On Fri, 28 Apr 2023 11:57:03 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with two additional commits since the last revision: >> >> - More fine-grained synchronization >> - Rework context ordering (round 2) >> >> * call afterRestore even if beforeCheckpoint throws >> * registering resource in previous/running context does not trigger exception immediatelly >> ** instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * we don't guarantee threads not deadlocking when trying to register a resource, though > > src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java line 75: > >> 73: restoreQ = new ArrayList<>(); >> 74: runBeforeCheckpoint(); >> 75: Collections.reverse(restoreQ); > > Smelly code, restoreQ should be maintained either here or in runBeforeCheckpoint() Not really; the task of ACI subclass (OC, PC...) is to call `invokeBeforeCheckpoint` on some resources (ACI does not know which ones) in some order. The task of ACI is to remember the order of invocations and in `afterRestore` call this in a reversed order; the subclass does not need to know about any collection used for that. 
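The division of labor described in this reply (the subclass picks the resources and their order, the base class only remembers the invocation order and replays it in reverse) can be sketched roughly as follows. This is a minimal illustration and not the actual `AbstractContextImpl`: the names `ReverseOrderSketch`, `SimpleResource` and `OrderTrackingContext` are hypothetical, and exception recording is left out entirely.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ReverseOrderSketch {
    // Hypothetical stand-in for a Resource, reduced to what the sketch needs.
    interface SimpleResource {
        void beforeCheckpoint();
        void afterRestore();
    }

    // Base class: tracks invocation order, knows nothing about ordering policy.
    static abstract class OrderTrackingContext {
        private List<SimpleResource> restoreQ;

        // Subclasses decide which resources are notified and in what order.
        protected abstract void runBeforeCheckpoint();

        // Subclasses call this for every resource they want to notify.
        protected final void invokeBeforeCheckpoint(SimpleResource r) {
            restoreQ.add(r); // recorded first, so afterRestore runs even if r throws
            r.beforeCheckpoint();
        }

        public final void beforeCheckpoint() {
            restoreQ = new ArrayList<>();
            runBeforeCheckpoint();
            Collections.reverse(restoreQ);
        }

        public final void afterRestore() {
            if (restoreQ == null) {
                return; // no checkpoint was started
            }
            for (SimpleResource r : restoreQ) {
                r.afterRestore();
            }
            restoreQ = null;
        }
    }

    // Runs three resources A, B, C and returns the observed call order.
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        OrderTrackingContext ctx = new OrderTrackingContext() {
            @Override
            protected void runBeforeCheckpoint() {
                for (String name : List.of("A", "B", "C")) {
                    invokeBeforeCheckpoint(new SimpleResource() {
                        @Override public void beforeCheckpoint() { log.add("before " + name); }
                        @Override public void afterRestore() { log.add("after " + name); }
                    });
                }
            }
        };
        ctx.beforeCheckpoint();
        ctx.afterRestore();
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the subclass never touches the list, which is the point made in the reply: only the base class knows that a collection is used to remember the order.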
------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1184634719 From duke at openjdk.org Thu May 4 09:54:05 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Thu, 4 May 2023 09:54:05 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v14] In-Reply-To: References: Message-ID: > Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. > > 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. > 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC > > (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: > >> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine > > It is intended to specify the lowest common denominator of all CPUs in a farm. Instead of a possible crash of OpenJDK it will now refuse to run: > >> Error occurred during initialization of VM >> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi > > (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. 
That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. > > If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there is a precedent OpenJDK upstream has already rejected such re-exec idea in the past: https://github.com/openjdk/crac/pull/31#issuecomment-1275707621 > > That IMO does not preclude trying the same for this case. > > - Debian 11 x86_64: It does not work, glibc is too different and inlined there. > - Debian 12 x86_64: It works even without libc6-dbg as its offsets are the default. > - Fedora 36 x86_64: It works as on Fedoras glibc debuginfo is embedded. 
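The numbers in the error message quoted above are consistent with the feature sets being plain bit masks, one bit per CPU feature. Under that assumption, the "missing features" value and the "lowest common denominator" for a farm can be reproduced with two bitwise operations. This is only an illustrative sketch; the class and method names are hypothetical and the real check lives in HotSpot's C++ code.

```java
class CpuFeatureMaskSketch {
    // Features the checkpoint image relies on that the restoring machine lacks.
    static long missingFeatures(long imageFeatures, long machineFeatures) {
        return imageFeatures & ~machineFeatures;
    }

    // Lowest common denominator: only the bits set on every machine of the farm.
    // With no arguments this returns all bits set, which is not meaningful.
    static long commonFeatures(long... machines) {
        long common = -1L;
        for (long m : machines) {
            common &= m;
        }
        return common;
    }

    public static void main(String[] args) {
        long image = 0x7fff9dfcfbf7L;   // features recorded in the checkpoint (from the message)
        long machine = 0x421801fcfbd7L; // features of the restoring machine (from the message)
        System.out.printf("missing = 0x%x%n", missingFeatures(image, machine));
        System.out.printf("common  = 0x%x%n", commonFeatures(image, machine));
    }
}
```

For the two masks in the message, the first call yields 0x3de79c000020, the value the VM reports as missing, and the second yields 0x421801fcfbd7, the value the VM suggests passing as `-XX:CPUFeatures`.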
Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: -altstack ------------- Changes: - all: https://git.openjdk.org/crac/pull/41/files - new: https://git.openjdk.org/crac/pull/41/files/9e6faf67..b22bb537 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=13 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=12-13 Stats: 11 lines in 1 file changed: 0 ins; 9 del; 2 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41 From duke at openjdk.org Thu May 4 10:27:49 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 4 May 2023 10:27:49 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v4] In-Reply-To: References: Message-ID: <1xFtRyDqF-BREE5Vq1SSog1fbMG4y6JGjF1SlC4rBmQ=.8df4faf4-8e82-4d16-a638-9a64687ae1a9@github.com> > * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws > * allows to register a resource in a context that did not start beforeCheckpoint invocations yet > * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately > * instead this will be one of the recorded exceptions and the resource has a chance to fire next time > * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains six additional commits since the last revision: - Fix javadoc and minor refactoring - Merge branch 'crac' into context_order - More fine-grained synchronization - Rework context ordering (round 2) * call afterRestore even if beforeCheckpoint throws * registering resource in previous/running context does not trigger exception immediatelly ** instead this will be one of the recorded exceptions and the resource has a chance to fire next time * we don't guarantee threads not deadlocking when trying to register a resource, though - Fix docs & package - Fix ordering of invocation on Resources * When Context.beforeCheckpoint throws, invoke Context.afterRestore anyway (otherwise some resources stay in suspended state). * Handle Resource.beforeCheckpoint triggering a registration of another resource ** Do not cause deadlock when registering from another thread ** Global resource can register JDKResource ** JDKResource can register resource with higher priority ** Other registrations are prohibited ------------- Changes: - all: https://git.openjdk.org/crac/pull/60/files - new: https://git.openjdk.org/crac/pull/60/files/1f2c7b39..eafdb841 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=60&range=03 - incr: https://webrevs.openjdk.org/?repo=crac&pr=60&range=02-03 Stats: 253 lines in 10 files changed: 97 ins; 102 del; 54 mod Patch: https://git.openjdk.org/crac/pull/60.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/60/head:pull/60 PR: https://git.openjdk.org/crac/pull/60 From duke at openjdk.org Thu May 4 13:20:45 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 4 May 2023 13:20:45 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v5] In-Reply-To: References: Message-ID: <6wYbr4YOdVQDkBozfKYt4XHUQ29BrybZZCX43Z_clng=.4015de97-0f1a-40b0-b7c6-0ced46df2fc3@github.com> > * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws > * allows to register a resource in a 
context that did not start beforeCheckpoint invocations yet > * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately > * instead this will be one of the recorded exceptions and the resource has a chance to fire next time > * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Fix javadoc build ------------- Changes: - all: https://git.openjdk.org/crac/pull/60/files - new: https://git.openjdk.org/crac/pull/60/files/eafdb841..33226eba Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=60&range=04 - incr: https://webrevs.openjdk.org/?repo=crac&pr=60&range=03-04 Stats: 4 lines in 2 files changed: 4 ins; 0 del; 0 mod Patch: https://git.openjdk.org/crac/pull/60.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/60/head:pull/60 PR: https://git.openjdk.org/crac/pull/60 From akozlov at openjdk.org Thu May 4 13:41:56 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 4 May 2023 13:41:56 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain Message-ID: If as a result of beforeCheckpoint() no more non-daemon threads remain, it's possible that VM exits prematurely, before one of afterRestore() get a chance to create another non-daemon thread that will keep VM alive. Triggering checkpoint via jcmd (so checkpointRestore() method is executed on daemon attach-listener thread), increases probability to step in the problem. The change ensures all notifications are done while there is at least one non-daemon thread. The notification methods are still called from the original thread. 
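The idea of the change can be sketched as follows: a helper non-daemon thread is parked for the duration of the notifications, which still run on the calling thread. This is a simplified illustration built on two `CountDownLatch` instances as in the review snippet; the class name, method names and the thread name are hypothetical, and the actual code in `Core.java` may differ.

```java
import java.util.concurrent.CountDownLatch;

class KeepAliveSketch {
    // Waits for the latch, retrying on interrupt and restoring the interrupt flag afterwards.
    private static void awaitUninterruptibly(CountDownLatch latch) {
        boolean interrupted = false;
        while (true) {
            try {
                latch.await();
                break;
            } catch (InterruptedException e) {
                interrupted = true;
            }
        }
        if (interrupted) {
            Thread.currentThread().interrupt();
        }
    }

    // Runs the notifications on the calling thread while a non-daemon thread keeps the VM alive.
    static void runWithKeepAlive(Runnable notifications) {
        CountDownLatch start = new CountDownLatch(1);
        CountDownLatch finish = new CountDownLatch(1);
        Thread keepAlive = new Thread(() -> {
            start.countDown();            // signal that the non-daemon thread is running
            awaitUninterruptibly(finish); // stay alive until the notifications are done
        }, "keep-alive-sketch");
        keepAlive.setDaemon(false);       // explicit: the caller may itself be a daemon thread
        keepAlive.start();
        awaitUninterruptibly(start);
        try {
            notifications.run();          // beforeCheckpoint / afterRestore would go here
        } finally {
            finish.countDown();           // release the keep-alive thread
        }
    }

    public static void main(String[] args) {
        runWithKeepAlive(() -> System.out.println("notifications ran"));
    }
}
```

The explicit `setDaemon(false)` matters in the jcmd case described above, because a thread inherits the daemon status of its creator.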
------------- Commit messages: - Fix copyright - Notify on the original thread - Ensure all notifications finish even if only daemon threads remain Changes: https://git.openjdk.org/crac/pull/62/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=62&range=00 Stats: 146 lines in 2 files changed: 136 ins; 3 del; 7 mod Patch: https://git.openjdk.org/crac/pull/62.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/62/head:pull/62 PR: https://git.openjdk.org/crac/pull/62 From duke at openjdk.org Thu May 4 15:42:55 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 4 May 2023 15:42:55 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain In-Reply-To: References: Message-ID: On Thu, 4 May 2023 13:34:47 GMT, Anton Kozlov wrote: > If as a result of beforeCheckpoint() no more non-daemon threads remain, it's possible that VM exits prematurely, before one of afterRestore() get a chance to create another non-daemon thread that will keep VM alive. Triggering checkpoint via jcmd (so checkpointRestore() method is executed on daemon attach-listener thread), increases probability to step in the problem. > > The change ensures all notifications are done while there is at least one non-daemon thread. The notification methods are still called from the original thread. The code is doing something (starting a thread) before the checkpoint, and after restore (join the thread). Rather than clobbering the 'general' C/R code with a fix for one issue, there is an interface perfectly suited to host this - a Resource. There's a catch, though - if we register it to the JDKContext this would be run *after* user Resources, and we need to run this *before*. Right now we don't have any means to order things this way. One solution could be reversing the JDKContext/GlobalContext relationship - JDKContext would be the parent, and would use a specific priority class for GlobalContext (something before `NORMAL`). 
And there would be one more priority class even before that, and this `KeepaliveResource` would be registered at it. src/java.base/share/classes/jdk/crac/Core.java line 254: > 252: // The notifications are done on the original thread. > 253: CountDownLatch start = new CountDownLatch(1); > 254: CountDownLatch finish = new CountDownLatch(1); Rather than 2 CountDownLatches you could use either a single `CyclicBarrier` (awaiting twice), or even better a `Phaser`, which has an API for non-interruptible waits. test/jdk/jdk/crac/DaemonAfterRestore.java line 82: > 80: System.out.println("worker thread finish"); > 81: }); > 82: workerThread.setDaemon(false); Unnecessary: a thread created from the main (non-daemon) thread is not a daemon either. test/jdk/jdk/crac/DaemonAfterRestore.java line 94: > 92: @Override > 93: public void beforeCheckpoint(Context context) throws Exception { > 94: finish.countDown(); It might be worth asserting that here we are running in a daemon thread. test/jdk/jdk/crac/DaemonAfterRestore.java line 98: > 96: @Override > 97: public void afterRestore(Context context) throws Exception { > 98: Thread.sleep(3000); Could we avoid extending the testsuite run by 3 seconds? I know we're trying to assert that 'nothing happens' rather than replacing a wait for an event with a timed wait. If we can't check that the destroy VM thread had a chance to work, I suggest at least reducing this (say 100ms?).
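For comparison, the `Phaser` alternative suggested in the review could look roughly like this. It is a sketch of the reviewer's suggestion rather than code from the PR; the class and method names are hypothetical. `Phaser.arriveAndAwaitAdvance()` does not throw `InterruptedException`, so no retry loop is needed.

```java
import java.util.concurrent.Phaser;

class PhaserKeepAliveSketch {
    // Same idea as with two latches, using one Phaser with two registered parties.
    static void runWithKeepAlive(Runnable notifications) {
        Phaser phaser = new Phaser(2); // parties: the caller and the keep-alive thread
        Thread keepAlive = new Thread(() -> {
            phaser.arriveAndAwaitAdvance(); // phase 0: keep-alive thread is running
            phaser.arriveAndAwaitAdvance(); // phase 1: wait until notifications are done
        }, "phaser-keep-alive-sketch");
        keepAlive.setDaemon(false);
        keepAlive.start();
        phaser.arriveAndAwaitAdvance();     // phase 0: wait for the keep-alive thread
        try {
            notifications.run();            // still executed on the calling thread
        } finally {
            phaser.arrive();                // phase 1: release the keep-alive thread without blocking
        }
    }

    public static void main(String[] args) {
        runWithKeepAlive(() -> System.out.println("notifications ran"));
    }
}
```

The trade-off is that one reusable synchronizer carries both events, so the meaning of each call depends on the current phase, whereas two single-use latches name the events separately.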
------------- PR Review: https://git.openjdk.org/crac/pull/62#pullrequestreview-1413342072 PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1185159712 PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1185171160 PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1185175830 PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1185196948 From akozlov at openjdk.org Thu May 4 16:47:52 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 4 May 2023 16:47:52 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v5] In-Reply-To: <6wYbr4YOdVQDkBozfKYt4XHUQ29BrybZZCX43Z_clng=.4015de97-0f1a-40b0-b7c6-0ced46df2fc3@github.com> References: <6wYbr4YOdVQDkBozfKYt4XHUQ29BrybZZCX43Z_clng=.4015de97-0f1a-40b0-b7c6-0ced46df2fc3@github.com> Message-ID: <1z4j4m_Jk6Cb2QfjGxTFEJp0ah_pOsxxSfckjRvv8CI=.02a74c4c-87e5-4975-a528-245d3caa00c2@github.com> On Thu, 4 May 2023 13:20:45 GMT, Radim Vansa wrote: >> * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws >> * allows to register a resource in a context that did not start beforeCheckpoint invocations yet >> * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately >> * instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Fix javadoc build This change gets too complex with refactoring, API changes, documentation, and fixes. Let's simplify this. Regarding fix, I'd like the minimal change without refactorings, API changes, doc changes. I believe ACI changes are not necessary for that. The test is nice. 
src/java.base/share/classes/javax/crac/Core.java line 53: > 51: * reference to the resource - otherwise the garbage collector > 52: * is free to trash the resource and notifications on this resource > 53: * will not be invoked. Instead, highlight the rationale behind weak registration: it does not change the life cycle of the object, so any object may register with the Context with no implications other than receiving notifications. But if the object is not strongly reachable, it can be collected before the notification. src/java.base/share/classes/javax/crac/package-info.java line 87: > 85: *
  • When an exception is thrown during notificaion, it is caught by the {@code Context} and is suppressed by a {@code CheckpointException} or {@code RestoreException}, depends on the throwing method. > 86: *
  • > 87: *
  • When the {@code Resource} is a {@code Context} and it throws {@code CheckpointException} or {@code RestoreException}, exceptions suppressed by the original exception are suppressed by another {@code CheckpointException} or {@code RestoreException}, depends on the throwing method. This specification for the child context was lost. ------------- PR Review: https://git.openjdk.org/crac/pull/60#pullrequestreview-1413175828 PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1185072093 PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1185063640 From akozlov at openjdk.org Thu May 4 16:47:52 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 4 May 2023 16:47:52 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v3] In-Reply-To: <50suyCtjP8Ke3x6c2hjzSA-v-9sg2HjQNTuIJtrUYx8=.026b7b21-e44e-463d-af72-2ed789a44c41@github.com> References: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com> <50suyCtjP8Ke3x6c2hjzSA-v-9sg2HjQNTuIJtrUYx8=.026b7b21-e44e-463d-af72-2ed789a44c41@github.com> Message-ID: On Tue, 2 May 2023 10:41:36 GMT, Radim Vansa wrote: >> Just realized that this is required for PriorityContext implementation. But that is the implementation of that class, it's wrong ACI has to care about that. > > Yes, it's a bit of enforced flexibility of the base class (through allowing to override a method), though it doesn't need about the use case. > > I could use just a list rather than sub-contexts but that would require duplicated code. That would be a cleaner approach compared to the interdependencies of ACI-PC (details of PC leak into ACI). >>> The contexts are supposed to run all resources regardless of whether any failed, therefore there's no point in propagating and re-wrapping the exceptions. ... I don't see why would the parent context decide on the type - because it would get just CheckpointException with a list of other failures from components it does not understand.
>> >> The parent Context may implement an arbitrary handling (for example, unloading a component completely if that is throwing an exception). So throwing an exception is something useful. >> >> With that, the new Core.recordException is completely new exception flow, that just optimizes something in the generic throw scheme. With that, the generic schemes should be something good enough already, we don't need to complicate the interface, the code,.. CheckpointExceptions are still exceptions, that is, we don't expect them often, there is no need to optimize them. > > It's not about optimization (in the sense of performance) but about removing code bloat, and the need for parent context to tediously copy failures (deciding whether something was a wrapper exception or the actual failure), when all you need to do in the end is to report them in bulk. It's not a new exception flow, it's removing the flow as there is no exception flow needed. > > Your example is invalid: If a throwing resource is to be removed, it's the task of the parent context - and that one will see the exception. The parent context should not remove its child context since one of the N resources in the child context is failing. You're trading additional code for a complicated interface. What is the guidance on which one to use: Core.recordException() or throw new Exception()? These two very similar interfaces are a sign that we are trying to do something strange. > Your example is invalid: If a throwing resource is to be removed, it's the task of the parent context - and that one will see the exception. The parent context should not remove its child context since one of the N resources in the child context is failing. Could you elaborate? I'm lost as to what "remove child context" means. I was referring to the parent context being able to handle a CheckpointException from the child context. >> I'm trying to describe class hierarchy and failing. The patch tries to reverse ACI-PC relation.
ACI was for partially-ordered Resources (defined by a Comparator), and now it's for totally-ordered Resources (ordered by long). Trying to fit the partially-ordering PC as a subclass of the totally-ordering ACI feels unnatural. Can we have a cleaner hierarchy of the classes? > > No; ACI was totally-ordered in the previous version of the PR, but now it doesn't care about ordering at all; it's the abstract `runBeforeCheckpoint` that decides on the order. I see, thanks. So ACI's task is to track the beforeCheckpoint order and to run afterRestore in the opposite order; this seems to be the most useful part of it. But its interface does not seem to be very consistent. Does it really need an abstract base class that also maintains calling, logging and exception propagation? It would be nice to separate all these concerns for greater flexibility, e.g. by composing some classes in a Context implementation, rather than extending a base class that combines all of them. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1185090313 PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1185113136 PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1185223155 From akozlov at openjdk.org Thu May 4 17:17:46 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 4 May 2023 17:17:46 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v2] In-Reply-To: References: Message-ID: > If as a result of beforeCheckpoint() no more non-daemon threads remain, it's possible that VM exits prematurely, before one of afterRestore() get a chance to create another non-daemon thread that will keep VM alive. Triggering checkpoint via jcmd (so checkpointRestore() method is executed on daemon attach-listener thread), increases probability to step in the problem. > > The change ensures all notifications are done while there is at least one non-daemon thread.
The notification methods are still called from the original thread. Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: Test update ------------- Changes: - all: https://git.openjdk.org/crac/pull/62/files - new: https://git.openjdk.org/crac/pull/62/files/398cc79e..1dd4e20e Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=62&range=01 - incr: https://webrevs.openjdk.org/?repo=crac&pr=62&range=00-01 Stats: 3 lines in 1 file changed: 3 ins; 0 del; 0 mod Patch: https://git.openjdk.org/crac/pull/62.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/62/head:pull/62 PR: https://git.openjdk.org/crac/pull/62 From akozlov at openjdk.org Thu May 4 17:25:46 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 4 May 2023 17:25:46 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v2] In-Reply-To: References: Message-ID: On Thu, 4 May 2023 17:17:46 GMT, Anton Kozlov wrote: >> If as a result of beforeCheckpoint() no more non-daemon threads remain, it's possible that VM exits prematurely, before one of afterRestore() get a chance to create another non-daemon thread that will keep VM alive. Triggering checkpoint via jcmd (so checkpointRestore() method is executed on daemon attach-listener thread), increases probability to step in the problem. >> >> The change ensures all notifications are done while there is at least one non-daemon thread. The notification methods are still called from the original thread. > > Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Test update Thank you for review. > Rather than clobbering the 'general' C/R code with a fix for one issue, there is an interface perfectly suited to host this - a Resource. Having Resource abstraction does not mean that is necessary to use. Resources are suited for objects which may or may not exist and still need to receive notifications. 
Here we have no problem doing something directly. So we need neither to rely on some implicit ordering nor to change the structure of the Contexts. ------------- PR Review: https://git.openjdk.org/crac/pull/62#pullrequestreview-1413543170 From akozlov at openjdk.org Thu May 4 17:25:48 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 4 May 2023 17:25:48 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v2] In-Reply-To: References: Message-ID: <-d4Kyg0bXgZ03thUNHiuc4qbedXjxmevSfwwHgzejT4=.d716ad99-573c-46ff-bf75-e84c851936db@github.com> On Thu, 4 May 2023 15:04:11 GMT, Radim Vansa wrote: >> Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Test update > > src/java.base/share/classes/jdk/crac/Core.java line 254: > >> 252: // The notifications are done on the original thread. >> 253: CountDownLatch start = new CountDownLatch(1); >> 254: CountDownLatch finish = new CountDownLatch(1); > > Rather than 2 CountDownLatches you could use either single `CyclicBarrier` (awaiting twice), or even better a `Phaser` that has API for non-interruptible wait. Since both of the suggested options are reusable, having separate events is cleaner and reduces the chance of awaiting by accident. > test/jdk/jdk/crac/DaemonAfterRestore.java line 82: > >> 80: System.out.println("worker thread finish"); >> 81: }); >> 82: workerThread.setDaemon(false); > > Unnecessary, thread created from main thread (non-daemon) is not a daemon either. Otherwise there should be an assert, so it's more straightforward to set the mode. > test/jdk/jdk/crac/DaemonAfterRestore.java line 98: > >> 96: @Override >> 97: public void afterRestore(Context context) throws Exception { >> 98: Thread.sleep(3000); > > Could we avoid extending the testsuite run by 3 seconds? I know we're trying to assert that 'nothing happens' rather then replacing waiting for an event by a timed wait.
If we can't check that the destroy VM thread had a chance to work, I suggest at least reducing this (say 100ms?). 100ms does not look like a good candidate (it is comparable to a single scheduling period). So 3 sec does not look too bad as a way to avoid intermittent false positives caused by sleep() finishing before the VM has terminated. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1185290868 PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1185291965 PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1185295106 From akozlov at openjdk.org Thu May 4 18:13:43 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 4 May 2023 18:13:43 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v14] In-Reply-To: References: Message-ID: On Thu, 4 May 2023 09:54:05 GMT, Jan Kratochvil wrote: >> Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. >> >> 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. >> 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC >> >> (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: >> >>> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine >> >> It is intended to specify the lowest common denominator of all CPUs in a farm.
Instead of a possible crash of OpenJDK it will now refuse to run: >> >>> Error occurred during initialization of VM >>> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi >> >> (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. >> >> If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there is a precedent OpenJDK upstream has already rejected such re-exec idea in the past: https://github.com/openjdk/crac/pull/31#issuecomment-1275707621 >> >> That IMO does not preclude trying the same for this case. >> >> - Debian 11 x86_64: It does not work, glibc is too different and inlined there. >> - Debian 12 x86_64: It works even without libc6-dbg as its offsets are the default. 
>> - Fedora 36 x86_64: It works as on Fedoras glibc debuginfo is embedded. > > Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: > > -altstack > * On older glibcs not supporting macro `CPU_FEATURE_ACTIVE` the disabling of glibc features has no effect (and it may crash the migration even after using `-XX:CPUFeatures=generic`). Which glibc version supports the flag? We are used to building the JDK on an older platform and assuming the result will work on every newer platform. It turns out that the option is not supported on the platform I use for builds. Since TUNABLES have to be specified in text form anyway, can we just define the needed option names? I'll continue reviewing this PR. src/hotspot/cpu/x86/vm_version_x86.cpp line 679: > 677: > 678: uint64_t disable_CPU = 0; > 679: uint64_t disable_GLIBC = 0; Are these used in the EXCESSIVEx macro? If so, could you please move them below, closer to the use? src/hotspot/cpu/x86/vm_version_x86.cpp line 717: > 715: if ((excessive_CPU & CPU_SSE3) || > 716: (excessive_GLIBC & (GLIBC_CMPXCHG16 | GLIBC_LAHFSAHF))) { > 717: assert(!(excessive_CPU & CPU_SSE4_2), "(_features & CPU_SSE4_2) cannot happen"); A failed assert prints the failed condition, so there is no need to repeat it in the message. src/hotspot/cpu/x86/vm_version_x86.cpp line 731: > 729: // glibc: && CPU_FEATURE_USABLE_P (cpu_features, FMA) > 730: // glibc: && CPU_FEATURE_USABLE_P (cpu_features, LZCNT) > 731: // glibc: && CPU_FEATURE_USABLE_P (cpu_features, MOVBE)) SKARA complains about this line. src/hotspot/cpu/x86/vm_version_x86.cpp line 767: > 765: #else > 766: # define IF_ASSERT(x) > 767: #endif This is exactly the definition of `DEBUG_ONLY`, please use that macro. src/hotspot/share/runtime/stubCodeGenerator.cpp line 62: > 60: void StubCodeDesc::thaw() { > 61: assert(_frozen, "repeated thaw operation"); > 62: _frozen = false; Is it still necessary? I tried commenting this line out, and checkpoint-restore succeeded for me. 
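For readers following the quoted error message: the "missing features" value is the set of bits present in the image's mask but absent from this machine's mask. The sketch below reproduces that arithmetic in Java with the three hex values from the message. The class and method names are made up for illustration; the actual check is done in HotSpot's C++ code and is not shown in this thread.

```java
// Illustrative only: CpuFeatureMasks and its methods are hypothetical names,
// not code from the PR (the real check lives in HotSpot's vm_version_x86.cpp).
final class CpuFeatureMasks {
    // Bits required by the checkpoint image that the restoring CPU lacks.
    static long missing(long imageFeatures, long machineFeatures) {
        return imageFeatures & ~machineFeatures;
    }

    // "Lowest common denominator" of a farm: bits present on every machine.
    // With no arguments this returns all bits set, which is meaningless.
    static long commonDenominator(long... machineFeatures) {
        long common = -1L; // start with all bits set
        for (long f : machineFeatures) {
            common &= f;
        }
        return common;
    }

    public static void main(String[] args) {
        long image = 0x7fff9dfcfbf7L;   // features recorded in the image
        long machine = 0x421801fcfbd7L; // features of the restoring CPU
        // prints "missing = 0x3de79c000020", as in the quoted message
        System.out.println("missing = 0x" + Long.toHexString(missing(image, machine)));
        System.out.println("common  = 0x" + Long.toHexString(commonDenominator(image, machine)));
    }
}
```

Under these assumptions, the value to pass as `-XX:CPUFeatures=` at checkpoint time would be the common mask of all machines that may restore the image.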
------------- PR Review: https://git.openjdk.org/crac/pull/41#pullrequestreview-1413589870 PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1185335055 PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1185336541 PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1185347208 PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1185333573 PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1185318814 From duke at openjdk.org Thu May 4 20:32:49 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 4 May 2023 20:32:49 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v2] In-Reply-To: References: Message-ID: On Thu, 4 May 2023 17:17:46 GMT, Anton Kozlov wrote: >> If as a result of beforeCheckpoint() no more non-daemon threads remain, it's possible that VM exits prematurely, before one of afterRestore() get a chance to create another non-daemon thread that will keep VM alive. Triggering checkpoint via jcmd (so checkpointRestore() method is executed on daemon attach-listener thread), increases probability to step in the problem. >> >> The change ensures all notifications are done while there is at least one non-daemon thread. The notification methods are still called from the original thread. > > Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Test update If not using a Resource, at least refactor that into a standalone (stateful) component, which may be called directly. The way it's written here mixes in some implementation details into a higher-level flow. 
------------- PR Comment: https://git.openjdk.org/crac/pull/62#issuecomment-1535371199 From duke at openjdk.org Thu May 4 20:43:40 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 4 May 2023 20:43:40 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v5] In-Reply-To: <1z4j4m_Jk6Cb2QfjGxTFEJp0ah_pOsxxSfckjRvv8CI=.02a74c4c-87e5-4975-a528-245d3caa00c2@github.com> References: <6wYbr4YOdVQDkBozfKYt4XHUQ29BrybZZCX43Z_clng=.4015de97-0f1a-40b0-b7c6-0ced46df2fc3@github.com> <1z4j4m_Jk6Cb2QfjGxTFEJp0ah_pOsxxSfckjRvv8CI=.02a74c4c-87e5-4975-a528-245d3caa00c2@github.com> Message-ID: <64JIfZfl78X-cwOqXZrMb-NfHx4iSUEUmcqFK3F-imM=.8ea7d9ce-5b9c-4d81-9369-72c6102ac1aa@github.com> On Thu, 4 May 2023 13:59:20 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix javadoc build > > src/java.base/share/classes/javax/crac/package-info.java line 87: > >> 85: *
  • When an exception is thrown during notificaion, it is caught by the {@code Context} and is suppressed by a {@code CheckpointException} or {@code RestoreException}, depends on the throwing method. >> 86: *
  • >> 87: *
  • When the {@code Resource} is a {@code Context} and it throws {@code CheckpointException} or {@code RestoreException}, exceptions suppressed by the original exception are suppressed by another {@code CheckpointException} or {@code RestoreException}, depends on the throwing method. > > This specification for child context was lost Intentionally. With the exception reported centrally there's no reason to inform the parent context about an error in a resource in a child component. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1185492544 From duke at openjdk.org Thu May 4 21:04:54 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 4 May 2023 21:04:54 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v3] In-Reply-To: References: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com> <50suyCtjP8Ke3x6c2hjzSA-v-9sg2HjQNTuIJtrUYx8=.026b7b21-e44e-463d-af72-2ed789a44c41@github.com> Message-ID: On Thu, 4 May 2023 14:35:46 GMT, Anton Kozlov wrote: > You're trading additional code for a complicated interface. What are the directions what to use: Core.recordException() vs throw new Exception() ? This two very similar interfaces are the sign we are trying to do something strange. That's because of the non-standard requirement to continue after encountering an exception, and multiple rewraps in a hierarchy of contexts. Resources should normally throw exceptions; Contexts are here to call into the Resource and aggregate errors. There's no point in propagating the error higher up the hierarchy. > Could you elaborate?.. I'm lost what "remove child context" means. I was referring to the parent context being able to handle a CheckpointException from the child context. And why would the child context throw the CheckpointException? The most common reason would be that one of the X resources in the child context has thrown. It's fine if a (child) Context removes the inner Resource after throwing. 
But that error should not propagate any higher, because if the parent context were to remove its (throwing) child context it would remove it along with the other X - 1 resources that are working correctly. >> No; ACI was totally-ordered in the previous version of the PR, but now it doesn't care about ordering at all; it's the abstract `runBeforeCheckpoint` that decides on the order. > > I see, thanks. So ACI task is to track beforeCheckpoint order and provide afterRestore in the opposite order, this seems the most useful part of it. But its interface does not seem to be very consistent. > > Does it really needs an abstract base class that also maintains calling, logging and exception propagation? It will be nice to separate all this concerns for greater flexibility e.g. with composition of some classes in a Context implementation, rather than that to extend the base class with all of them combined. We can further fine-tune the separation when the need arises and we have a concrete example. 
------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1185508424 PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1185510445 From duke at openjdk.org Thu May 4 21:35:41 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 4 May 2023 21:35:41 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v6] In-Reply-To: References: Message-ID: > * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws > * allows to register a resource in a context that did not start beforeCheckpoint invocations yet > * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately > * instead this will be one of the recorded exceptions and the resource has a chance to fire next time > * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Updated Global Context javadoc, removed semanticContext() ------------- Changes: - all: https://git.openjdk.org/crac/pull/60/files - new: https://git.openjdk.org/crac/pull/60/files/33226eba..53ccc062 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=60&range=05 - incr: https://webrevs.openjdk.org/?repo=crac&pr=60&range=04-05 Stats: 50 lines in 4 files changed: 36 ins; 1 del; 13 mod Patch: https://git.openjdk.org/crac/pull/60.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/60/head:pull/60 PR: https://git.openjdk.org/crac/pull/60 From duke at openjdk.org Thu May 4 21:52:50 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 4 May 2023 21:52:50 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v7] In-Reply-To: References: Message-ID: > * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws > * allows to register a resource in a context that did 
not start beforeCheckpoint invocations yet > * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately > * instead this will be one of the recorded exceptions and the resource has a chance to fire next time > * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Add the missing javadoc ------------- Changes: - all: https://git.openjdk.org/crac/pull/60/files - new: https://git.openjdk.org/crac/pull/60/files/53ccc062..729a4537 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=60&range=06 - incr: https://webrevs.openjdk.org/?repo=crac&pr=60&range=05-06 Stats: 12 lines in 1 file changed: 11 ins; 0 del; 1 mod Patch: https://git.openjdk.org/crac/pull/60.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/60/head:pull/60 PR: https://git.openjdk.org/crac/pull/60 From duke at openjdk.org Thu May 4 21:54:47 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 4 May 2023 21:54:47 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v6] In-Reply-To: References: Message-ID: On Thu, 4 May 2023 21:35:41 GMT, Radim Vansa wrote: >> * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws >> * allows to register a resource in a context that did not start beforeCheckpoint invocations yet >> * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately >> * instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Updated Global Context 
javadoc, removed semanticContext() I've reworded Global Context javadoc, added forgotten javadoc to `recordException`, and replaced calling `semanticContext()` with inlined versions of the `invokeBeforeCheckpoint` and `invokeAfterRestore` methods. All changes in this PR are related to contexts. I see little point in dosing doc changes individually; API changes are here to support actual implementation of the fix. ------------- PR Comment: https://git.openjdk.org/crac/pull/60#issuecomment-1535456836 From duke at openjdk.org Thu May 4 22:09:04 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 4 May 2023 22:09:04 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v7] In-Reply-To: References: Message-ID: On Thu, 4 May 2023 21:52:50 GMT, Radim Vansa wrote: >> * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws >> * allows to register a resource in a context that did not start beforeCheckpoint invocations yet >> * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately >> * instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Add the missing javadoc There is one negligence in this PR and that is the lack of test for concurrent registration and notification in the JDK context. I've omitted that partly as this was happening reproducibly in the `newfd` branch, therefore I thought a synthetic test was not necessary. I could add it, though, to demonstrate the behaviour. 
------------- PR Comment: https://git.openjdk.org/crac/pull/60#issuecomment-1535468906 From duke at openjdk.org Fri May 5 05:49:44 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 5 May 2023 05:49:44 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v3] In-Reply-To: References: <-aCCM-B8gVgwEFggEN0XsbkdxBZTh_-g5blSYd5AXN8=.6ddeed9f-a8bb-44b8-89e4-985b7b9e3143@github.com> <50suyCtjP8Ke3x6c2hjzSA-v-9sg2HjQNTuIJtrUYx8=.026b7b21-e44e-463d-af72-2ed789a44c41@github.com> Message-ID: On Thu, 4 May 2023 20:59:42 GMT, Radim Vansa wrote: >> You're trading additional code for a complicated interface. What are the directions what to use: Core.recordException() vs throw new Exception() ? This two very similar interfaces are the sign we are trying to do something strange. >> >>> Your example is invalid: If a throwing resource is to be removed, it's the task of the parent context - and that one will see the exception. The parent context should not remove its child context since one of the N resources in the child context is failing. >> >> Could you elaborate?.. I'm lost what "remove child context" means. I was reffering to the parent context being able to handle a CheckpointException from the child context. > >> You're trading additional code for a complicated interface. What are the directions what to use: Core.recordException() vs throw new Exception() ? This two very similar interfaces are the sign we are trying to do something strange. > > That's because of the non-standard requirement to continue after encountering an exception, and multiple rewraps in a hierarchy of contexts. > Resources should normally throw exceptions; Contexts are here to call into the Resource and aggregate errors. There's no point of propagating error higher up the hierarchy. > >> Could you elaborate?.. I'm lost what "remove child context" means. I was reffering to the parent context being able to handle a CheckpointException from the child context. 
> > And why would the child context throw the CheckpointException? The most common reason would be that one of X resources in child context has thrown. It's fine if a (child) Context removes the inner Resource after throwing. But that error should not propagate any higher, because if parent context were to remove it's (throwing) child context it would remove it along with the other X - 1 resources that are working correctly. I realized I haven't stressed enough one motivation for `Core.registerException`: when the context has finished its `beforeCheckpoint` (it won't be called again) and someone calls into this Context's `register` we cannot propagate the failure through throwing. So we won't avoid API change - and in my view it's natural to use the same method for reporting the failure during `beforeCheckpoint` execution, too. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1185718419 From akozlov at openjdk.org Fri May 5 10:54:54 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 5 May 2023 10:54:54 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v3] In-Reply-To: References: Message-ID: > If as a result of beforeCheckpoint() no more non-daemon threads remain, it's possible that VM exits prematurely, before one of afterRestore() get a chance to create another non-daemon thread that will keep VM alive. Triggering checkpoint via jcmd (so checkpointRestore() method is executed on daemon attach-listener thread), increases probability to step in the problem. > > The change ensures all notifications are done while there is at least one non-daemon thread. The notification methods are still called from the original thread. 
Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: Fix recursiveCheckpoint test ------------- Changes: - all: https://git.openjdk.org/crac/pull/62/files - new: https://git.openjdk.org/crac/pull/62/files/1dd4e20e..86fe7eb7 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=62&range=02 - incr: https://webrevs.openjdk.org/?repo=crac&pr=62&range=01-02 Stats: 47 lines in 2 files changed: 24 ins; 19 del; 4 mod Patch: https://git.openjdk.org/crac/pull/62.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/62/head:pull/62 PR: https://git.openjdk.org/crac/pull/62 From akozlov at openjdk.org Fri May 5 11:16:50 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 5 May 2023 11:16:50 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v4] In-Reply-To: References: Message-ID: > If as a result of beforeCheckpoint() no more non-daemon threads remain, it's possible that VM exits prematurely, before one of afterRestore() get a chance to create another non-daemon thread that will keep VM alive. Triggering checkpoint via jcmd (so checkpointRestore() method is executed on daemon attach-listener thread), increases probability to step in the problem. > > The change ensures all notifications are done while there is at least one non-daemon thread. The notification methods are still called from the original thread. 
Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: Cleanup ------------- Changes: - all: https://git.openjdk.org/crac/pull/62/files - new: https://git.openjdk.org/crac/pull/62/files/86fe7eb7..c2cfae5d Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=62&range=03 - incr: https://webrevs.openjdk.org/?repo=crac&pr=62&range=02-03 Stats: 46 lines in 1 file changed: 28 ins; 10 del; 8 mod Patch: https://git.openjdk.org/crac/pull/62.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/62/head:pull/62 PR: https://git.openjdk.org/crac/pull/62 From duke at openjdk.org Fri May 5 16:03:53 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 5 May 2023 16:03:53 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v4] In-Reply-To: References: Message-ID: On Fri, 5 May 2023 11:16:50 GMT, Anton Kozlov wrote: >> If as a result of beforeCheckpoint() no more non-daemon threads remain, it's possible that VM exits prematurely, before one of afterRestore() get a chance to create another non-daemon thread that will keep VM alive. Triggering checkpoint via jcmd (so checkpointRestore() method is executed on daemon attach-listener thread), increases probability to step in the problem. >> >> The change ensures all notifications are done while there is at least one non-daemon thread. The notification methods are still called from the original thread. > > Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Cleanup src/java.base/share/classes/jdk/crac/Core.java line 287: > 285: try { > 286: keepAlive = new KeepAlive(); > 287: } catch (InterruptedException e) { Upon catching InterruptedException you should restore the thread's interrupted status. Is there any reason to use RuntimeException rather than CheckpointException? (Preferably explained in a comment.) 
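The first point in the comment above (restoring the interrupted status) can be sketched as follows. This is only an illustration: `KeepAliveSketch` is a hypothetical stand-in, since the real `KeepAlive` class is not shown in this thread, and `IllegalStateException` is used merely as some unchecked exception. Which exception type the PR should throw is exactly the open question raised above.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for the KeepAlive class under review. It starts a
// non-daemon thread that stays alive until release() is called.
final class KeepAliveSketch {
    private final CountDownLatch started = new CountDownLatch(1);
    private final CountDownLatch released = new CountDownLatch(1);

    KeepAliveSketch() throws InterruptedException {
        Thread t = new Thread(() -> {
            started.countDown();
            try {
                released.await(); // keeps the VM alive until released
            } catch (InterruptedException ignored) {
                // the thread is exiting anyway
            }
        }, "keep-alive-sketch");
        t.setDaemon(false);
        t.start();
        try {
            started.await(); // throws if the caller is interrupted
        } catch (InterruptedException e) {
            released.countDown(); // do not leak the helper thread
            throw e;
        }
    }

    void release() {
        released.countDown();
    }

    // Catches the interruption, re-asserts the caller's interrupted status,
    // and only then translates the exception.
    static KeepAliveSketch createOrFail() {
        try {
            return new KeepAliveSketch();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the status
            throw new IllegalStateException("interrupted while starting keep-alive", e);
        }
    }
}
```

Placing the try-catch in a factory method inside the class, as here, keeps the interrupt handling out of the caller's higher-level flow.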
------------- PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1186258857 From duke at openjdk.org Fri May 5 16:03:53 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 5 May 2023 16:03:53 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v4] In-Reply-To: References: Message-ID: On Fri, 5 May 2023 15:57:32 GMT, Radim Vansa wrote: >> Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Cleanup > > src/java.base/share/classes/jdk/crac/Core.java line 287: > >> 285: try { >> 286: keepAlive = new KeepAlive(); >> 287: } catch (InterruptedException e) { > > Upon catching InterruptedException you should set thread interrupted status. > > Any reason to use RuntimeException than CheckpointException? (preferrably in a comment). Also, if you're just rethrowing runtime exception you could move the try-catch into the class. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1186262728 From duke at openjdk.org Fri May 5 16:03:54 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 5 May 2023 16:03:54 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v4] In-Reply-To: <-d4Kyg0bXgZ03thUNHiuc4qbedXjxmevSfwwHgzejT4=.d716ad99-573c-46ff-bf75-e84c851936db@github.com> References: <-d4Kyg0bXgZ03thUNHiuc4qbedXjxmevSfwwHgzejT4=.d716ad99-573c-46ff-bf75-e84c851936db@github.com> Message-ID: On Thu, 4 May 2023 17:12:32 GMT, Anton Kozlov wrote: >> test/jdk/jdk/crac/DaemonAfterRestore.java line 82: >> >>> 80: System.out.println("worker thread finish"); >>> 81: }); >>> 82: workerThread.setDaemon(false); >> >> Unnecessary, thread created from main thread (non-daemon) is not a daemon either. > > Otherwise there should be an assert, so it's more straightforward to set the mode. 
I agree with the assert ------------- PR Review Comment: https://git.openjdk.org/crac/pull/62#discussion_r1186259922 From duke at openjdk.org Tue May 9 15:34:09 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 9 May 2023 15:34:09 GMT Subject: [crac] RFR: Improved open file descriptors tracking [v9] In-Reply-To: <2uXm_DAQyImefAYUr1Ujxgxwu8bP0gqzhCTfvWNbDzE=.5ce26d49-9bbd-4097-9138-689391145509@github.com> References: <2uXm_DAQyImefAYUr1Ujxgxwu8bP0gqzhCTfvWNbDzE=.5ce26d49-9bbd-4097-9138-689391145509@github.com> Message-ID: On Tue, 2 May 2023 16:50:11 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Provide more information for file descriptors > > src/java.base/share/classes/java/io/FileDescriptor.java line 400: > >> 398: } else { >> 399: info = (path != null ? path : "unknown path") + " (" + (type != null ? type : "unknown") + ")"; >> 400: } > > This have too many socket-related details, also a number of java/native transitions that will be unavoidable if we adopt the proposed interface. I am thinking about this also in a context of `newfd-policies`, where we have to record not only the type & path but also things like current offset etc. Since this is used only in error handling, I wouldn't mind about the native transitions (or performance in general) too much. If you want, I could create a POJO to cross the native border only once. However this goes a bit against the principle that we want to handle as much as we can in Java rather than native code. That's also the reason why I opted for the formatting in Java even though it might be just as simple to do it in the native. 
------------- PR Review Comment: https://git.openjdk.org/crac/pull/43#discussion_r1188785933 From duke at openjdk.org Tue May 9 15:38:11 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 9 May 2023 15:38:11 GMT Subject: [crac] RFR: Improved open file descriptors tracking [v9] In-Reply-To: <2uXm_DAQyImefAYUr1Ujxgxwu8bP0gqzhCTfvWNbDzE=.5ce26d49-9bbd-4097-9138-689391145509@github.com> References: <2uXm_DAQyImefAYUr1Ujxgxwu8bP0gqzhCTfvWNbDzE=.5ce26d49-9bbd-4097-9138-689391145509@github.com> Message-ID: On Tue, 2 May 2023 17:00:11 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Provide more information for file descriptors > > src/java.base/unix/native/libnet/SocketImpl.c line 115: > >> 113: return NULL; >> 114: } >> 115: } > > No need for `} else {` here and everywhere else since the previous block has anyway terminated the function. > > This will make the code more streamlined. > > Suggestion: > > } > > localAddr = create_isa(env, isa_class, isa_ctor, &local); > if (localAddr == NULL) { > JNU_ThrowOutOfMemoryError(env, "java.net.InetSocketAddres"); > return NULL; > } The code was supposed to mirror closer the remote part where we can't do that, but OK. 
------------- PR Review Comment: https://git.openjdk.org/crac/pull/43#discussion_r1188790339 From duke at openjdk.org Tue May 9 16:01:30 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 9 May 2023 16:01:30 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v8] In-Reply-To: References: Message-ID: <49t0sZrR1pRQ2NQLIqqx_MJ9ZKfjeTW6pSLkSX-4pd8=.cde19724-931c-4b95-af4d-17833947afb0@github.com> > * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws > * allows to register a resource in a context that did not start beforeCheckpoint invocations yet > * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately > * instead this will be one of the recorded exceptions and the resource has a chance to fire next time > * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Revert API change, force blocking registration ------------- Changes: - all: https://git.openjdk.org/crac/pull/60/files - new: https://git.openjdk.org/crac/pull/60/files/729a4537..868baeae Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=60&range=07 - incr: https://webrevs.openjdk.org/?repo=crac&pr=60&range=06-07 Stats: 338 lines in 8 files changed: 168 ins; 88 del; 82 mod Patch: https://git.openjdk.org/crac/pull/60.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/60/head:pull/60 PR: https://git.openjdk.org/crac/pull/60 From akozlov at openjdk.org Tue May 9 18:24:53 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 9 May 2023 18:24:53 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v8] In-Reply-To: <64JIfZfl78X-cwOqXZrMb-NfHx4iSUEUmcqFK3F-imM=.8ea7d9ce-5b9c-4d81-9369-72c6102ac1aa@github.com> References: 
<6wYbr4YOdVQDkBozfKYt4XHUQ29BrybZZCX43Z_clng=.4015de97-0f1a-40b0-b7c6-0ced46df2fc3@github.com> <1z4j4m_Jk6Cb2QfjGxTFEJp0ah_pOsxxSfckjRvv8CI=.02a74c4c-87e5-4975-a528-245d3caa00c2@github.com> <64JIfZfl78X-cwOqXZrMb-NfHx4iSUEUmcqFK3F-imM=.8ea7d9ce-5b9c-4d81-9369-72c6102ac1aa@github.com> Message-ID: On Thu, 4 May 2023 20:41:11 GMT, Radim Vansa wrote: >> src/java.base/share/classes/javax/crac/package-info.java line 87: >> >>> 85: *
  • When an exception is thrown during notificaion, it is caught by the {@code Context} and is suppressed by a {@code CheckpointException} or {@code RestoreException}, depends on the throwing method. >>> 86: *
  • >>> 87: *
  • When the {@code Resource} is a {@code Context} and it throws {@code CheckpointException} or {@code RestoreException}, exceptions suppressed by the original exception are suppressed by another {@code CheckpointException} or {@code RestoreException}, depends on the throwing method. >> >> This specification for child context was lost > > Intentionally. With reporting the exception centrally there's no reason to inform parent context about an error in a resource in child component. That reply is no longer valid. Please keep modifications to the javadoc to a minimum as well. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1188966993 From akozlov at openjdk.org Tue May 9 18:24:54 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 9 May 2023 18:24:54 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v8] In-Reply-To: <49t0sZrR1pRQ2NQLIqqx_MJ9ZKfjeTW6pSLkSX-4pd8=.cde19724-931c-4b95-af4d-17833947afb0@github.com> References: <49t0sZrR1pRQ2NQLIqqx_MJ9ZKfjeTW6pSLkSX-4pd8=.cde19724-931c-4b95-af4d-17833947afb0@github.com> Message-ID: On Tue, 9 May 2023 16:01:30 GMT, Radim Vansa wrote: >> * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws >> * allows to register a resource in a context that did not start beforeCheckpoint invocations yet >> * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately >> * instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Revert API change, force blocking registration src/java.base/share/classes/jdk/crac/Resource.java line 65: > 63: * resource throwing an exception when {@link 
#beforeCheckpoint(Context) > 64: * beforeCheckpoint}. > 65: * Therefore, the resource should not have assumptions about it state; it The Resource can be sure that beforeCheckpoint was called and that the object is exactly in the state in which beforeCheckpoint left it. src/java.base/share/classes/jdk/crac/impl/OrderedContext.java line 76: > 74: @Override > 75: public void afterRestore(Context context) throws RestoreException { > 76: // Note: a resource might attempt to Is this comment truncated? ------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1188970172 PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1188980196 From duke at openjdk.org Wed May 10 06:27:40 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 06:27:40 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v2] In-Reply-To: References: Message-ID: On Thu, 4 May 2023 17:23:28 GMT, Anton Kozlov wrote: >> Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Test update > > Thank you for review. > >> Rather than clobbering the 'general' C/R code with a fix for one issue, there is an interface perfectly suited to host this - a Resource. > > Having Resource abstraction does not mean that is necessary to use. Resources are suited for objects which may or may not exist and still need to receive notifications. Here we have no problems with doing something directly. So we don't need to rely on some implicit ordering, nor don't need to change Contextes structure. @AntonKozlov I've addressed the interrupts and moved KeepAlive to a separate impl class in https://github.com/rvansa/crac/tree/daemon-after-restore - could you ff-merge it into your branch to avoid opening another PR? 
------------- PR Comment: https://git.openjdk.org/crac/pull/62#issuecomment-1541421555 From duke at openjdk.org Wed May 10 07:11:35 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 07:11:35 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v9] In-Reply-To: References: Message-ID: > * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws > * allows to register a resource in a context that did not start beforeCheckpoint invocations yet > * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately > * instead this will be one of the recorded exceptions and the resource has a chance to fire next time > * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Update docs ------------- Changes: - all: https://git.openjdk.org/crac/pull/60/files - new: https://git.openjdk.org/crac/pull/60/files/868baeae..9e81179f Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=60&range=08 - incr: https://webrevs.openjdk.org/?repo=crac&pr=60&range=07-08 Stats: 16 lines in 4 files changed: 10 ins; 2 del; 4 mod Patch: https://git.openjdk.org/crac/pull/60.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/60/head:pull/60 PR: https://git.openjdk.org/crac/pull/60 From duke at openjdk.org Wed May 10 07:11:36 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 07:11:36 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v8] In-Reply-To: <49t0sZrR1pRQ2NQLIqqx_MJ9ZKfjeTW6pSLkSX-4pd8=.cde19724-931c-4b95-af4d-17833947afb0@github.com> References: <49t0sZrR1pRQ2NQLIqqx_MJ9ZKfjeTW6pSLkSX-4pd8=.cde19724-931c-4b95-af4d-17833947afb0@github.com> Message-ID: On Tue, 9 May 2023 16:01:30 GMT, Radim Vansa wrote: >> * keeps the original handling of 
exceptions: afterRestore is called even if beforeCheckpoint throws >> * allows to register a resource in a context that did not start beforeCheckpoint invocations yet >> * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately >> * instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Revert API change, force blocking registration Updated docs, no code changes. ------------- PR Comment: https://git.openjdk.org/crac/pull/60#issuecomment-1541463175 From duke at openjdk.org Wed May 10 07:11:38 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 07:11:38 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v8] In-Reply-To: References: <49t0sZrR1pRQ2NQLIqqx_MJ9ZKfjeTW6pSLkSX-4pd8=.cde19724-931c-4b95-af4d-17833947afb0@github.com> Message-ID: On Tue, 9 May 2023 18:14:31 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Revert API change, force blocking registration > > src/java.base/share/classes/jdk/crac/Resource.java line 65: > >> 63: * resource throwing an exception when {@link #beforeCheckpoint(Context) >> 64: * beforeCheckpoint}. >> 65: * Therefore, the resource should not have assumptions about it state; it > > Resource can be sure the beforeCheckpoint was called, and object is exactly in the state at which the beforeCheckpoint has left it. I'll reword. The intent here was to guide the user toward handling unexpected states: for example, if beforeCheckpoint locks a lock, afterRestore should not blindly unlock it. 
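To make the locking remark concrete, here is a minimal sketch. It assumes nothing about the real API beyond the two callbacks: the nested `Resource` interface is a hypothetical stand-in for `jdk.crac.Resource` (the real one takes a `Context` argument), and `LockingResource` and `roundTrip` are illustrative names, not part of the PR.

```java
import java.util.concurrent.locks.ReentrantLock;

class GuardedLockResource {
    // Hypothetical stand-in for jdk.crac.Resource so the sketch compiles on a plain JDK.
    interface Resource {
        void beforeCheckpoint() throws Exception;
        void afterRestore() throws Exception;
    }

    static class LockingResource implements Resource {
        final ReentrantLock lock = new ReentrantLock();
        private final boolean failBeforeLocking;

        LockingResource(boolean failBeforeLocking) {
            this.failBeforeLocking = failBeforeLocking;
        }

        @Override
        public void beforeCheckpoint() throws Exception {
            if (failBeforeLocking) {
                // Simulates a failure that happens before the lock is taken.
                throw new IllegalStateException("simulated failure in beforeCheckpoint");
            }
            lock.lock();
        }

        @Override
        public void afterRestore() {
            // afterRestore may run even though beforeCheckpoint threw,
            // so verify the state instead of unlocking blindly.
            if (lock.isHeldByCurrentThread()) {
                lock.unlock();
            }
        }
    }

    // Invokes both callbacks; afterRestore is called regardless of the first outcome.
    static boolean roundTrip(LockingResource r) {
        boolean checkpointFailed = false;
        try {
            r.beforeCheckpoint();
        } catch (Exception e) {
            checkpointFailed = true;
        }
        r.afterRestore(); // a blind unlock() here would throw IllegalMonitorStateException
        return checkpointFailed;
    }
}
```

With the guard in place both the normal and the failing path leave the lock released; without it the failing path would raise IllegalMonitorStateException from afterRestore.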
------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1189445759 From duke at openjdk.org Wed May 10 09:05:38 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 09:05:38 GMT Subject: [crac] RFR: Support passing extra options in CREngine Message-ID: In addition to `-XX:CREngine=program` this adds support to `-XX:CREngine=program,key=value,anotherkey` that translates into invoking `program --key value --anotherkey`. This generic parameters support is utilized in `criuengine` that accepts `--verbosity` and `--log-file` options and relays them to `criu`. ------------- Commit messages: - Support passing extra options in CREngine Changes: https://git.openjdk.org/crac/pull/63/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=63&range=00 Stats: 150 lines in 3 files changed: 124 ins; 5 del; 21 mod Patch: https://git.openjdk.org/crac/pull/63.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/63/head:pull/63 PR: https://git.openjdk.org/crac/pull/63 From duke at openjdk.org Wed May 10 09:38:50 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 09:38:50 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v10] In-Reply-To: References: Message-ID: > * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws > * allows to register a resource in a context that did not start beforeCheckpoint invocations yet > * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately > * instead this will be one of the recorded exceptions and the resource has a chance to fire next time > * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Remove mention of single-threaded notifications ------------- Changes: - all: 
https://git.openjdk.org/crac/pull/60/files - new: https://git.openjdk.org/crac/pull/60/files/9e81179f..286d820d Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=60&range=09 - incr: https://webrevs.openjdk.org/?repo=crac&pr=60&range=08-09 Stats: 10 lines in 2 files changed: 0 ins; 6 del; 4 mod Patch: https://git.openjdk.org/crac/pull/60.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/60/head:pull/60 PR: https://git.openjdk.org/crac/pull/60 From akozlov at openjdk.org Wed May 10 10:08:51 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 10 May 2023 10:08:51 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v10] In-Reply-To: References: Message-ID: <-gXSpBhareHQmkkgHmzZqGv010PEdhYW8C1HZjcirF4=.69e3174c-32dd-4d82-96c0-7d48ab169001@github.com> On Wed, 10 May 2023 09:38:50 GMT, Radim Vansa wrote: >> * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws >> * allows to register a resource in a context that did not start beforeCheckpoint invocations yet >> * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately >> * instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Remove mention of single-threaded notifications First iteration; I have not looked carefully at the doc and context changes src/java.base/share/classes/jdk/internal/crac/JDKContext.java line 42: > 40: }; > 41: > 42: public JDKContext() { Why `public`? 
src/java.base/share/classes/jdk/internal/crac/LoggerContainer.java line 10: > 8: */ > 9: public class LoggerContainer { > 10: public static final System.Logger logger = System.getLogger("jdk.internal.crac"); Please keep `jdk.crac`: having the code in jdk.internal.crac is an implementation detail, but this name is a configuration interface for users. test/jdk/jdk/crac/LazyProps.java line 54: > 52: Core.getGlobalContext().register(resource); > 53: > 54: System.setProperty("jdk.crac.debug", "true"); The test was added in https://github.com/openjdk/crac/pull/12 to prevent problems when the properties are accessed before j.l.System is initialized. But the test checks that logging can be enabled as late as possible before the checkpoint. Since logging is now enabled on the command line, the test makes little sense... ------------- PR Review: https://git.openjdk.org/crac/pull/60#pullrequestreview-1420232750 PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1189650678 PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1189655234 PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1189666134 From akozlov at openjdk.org Wed May 10 10:40:43 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 10 May 2023 10:40:43 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v10] In-Reply-To: References: Message-ID: On Wed, 10 May 2023 09:38:50 GMT, Radim Vansa wrote: >> * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws >> * allows to register a resource in a context that did not start beforeCheckpoint invocations yet >> * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately >> * instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though 
> > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Remove mention of single-threaded notifications Context changes are mostly good, except a small nit src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java line 58: > 56: } > 57: if (e.getMessage() != null) { > 58: ce.addSuppressed(e); What happens with exceptions suppressed by `e`? will we have the same set of exceptions suppressed by `ce` and `e`? ------------- PR Review: https://git.openjdk.org/crac/pull/60#pullrequestreview-1420308681 PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1189698760 From duke at openjdk.org Wed May 10 10:50:49 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 10:50:49 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v10] In-Reply-To: References: Message-ID: <5PQTBR3qsRGJ0sTa0goQO-3MIHbUgEkl5406pgVKc_8=.496ff324-d3b3-46e6-97be-7f4c8cc78d8f@github.com> On Wed, 10 May 2023 09:38:50 GMT, Radim Vansa wrote: >> * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws >> * allows to register a resource in a context that did not start beforeCheckpoint invocations yet >> * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately >> * instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Remove mention of single-threaded notifications test/jdk/jdk/crac/ContextOrderTest.java line 54: > 52: System.setProperty("java.util.logging.config.file", Utils.TEST_SRC + "/logging.properties"); > 53: > 54: // testOrder(); Oops, noticed this got commented out. 
------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1189726828 From duke at openjdk.org Wed May 10 10:56:41 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 10:56:41 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v10] In-Reply-To: References: Message-ID: On Wed, 10 May 2023 10:22:55 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove mention of single-threaded notifications > > src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java line 58: > >> 56: } >> 57: if (e.getMessage() != null) { >> 58: ce.addSuppressed(e); > > What happens with exceptions suppressed by `e`? will we have the same set of exceptions suppressed by `ce` and `e`? Yes; there's no way to remove already suppressed exceptions, and the message is too valuable to lose. I could create an exception with the same message but without suppressed exceptions - then I'd lose its stack trace, though. I don't see much of an issue with this duplication. 
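The duplication being discussed can be reproduced with plain `Throwable` semantics. The sketch below is an assumption about the shape of the aggregation, not the code in AbstractContextImpl: it uses a plain `Exception` as a stand-in for CheckpointException, and the helper name `aggregate` is made up for illustration.

```java
class SuppressedDuplication {
    // Stand-in for the aggregating CheckpointException; a plain Exception keeps this JDK-only.
    static Exception aggregate(Exception e) {
        Exception ce = new Exception("aggregate");
        // Assumed step: lift the nested failures into the aggregate.
        for (Throwable t : e.getSuppressed()) {
            ce.addSuppressed(t);
        }
        // Keep e itself when it carries a message worth preserving. Throwable offers
        // no way to clear e's own suppressed list, so the nested failures are now
        // reachable both directly from ce and through e.
        if (e.getMessage() != null) {
            ce.addSuppressed(e);
        }
        return ce;
    }
}
```

Building a fresh exception with the same message would avoid the duplicate entries, at the cost of the original stack trace, which is the trade-off described in the reply above.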
------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1189732758 From duke at openjdk.org Wed May 10 11:08:05 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 11:08:05 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v11] In-Reply-To: References: Message-ID: > * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws > * allows to register a resource in a context that did not start beforeCheckpoint invocations yet > * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately > * instead this will be one of the recorded exceptions and the resource has a chance to fire next time > * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though Radim Vansa has updated the pull request incrementally with three additional commits since the last revision: - Fix build - Minified the set of changes - Fixup ------------- Changes: - all: https://git.openjdk.org/crac/pull/60/files - new: https://git.openjdk.org/crac/pull/60/files/286d820d..4adad664 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=60&range=10 - incr: https://webrevs.openjdk.org/?repo=crac&pr=60&range=09-10 Stats: 268 lines in 15 files changed: 62 ins; 178 del; 28 mod Patch: https://git.openjdk.org/crac/pull/60.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/60/head:pull/60 PR: https://git.openjdk.org/crac/pull/60 From duke at openjdk.org Wed May 10 11:14:00 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 11:14:00 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v12] In-Reply-To: References: Message-ID: > * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws > * allows to register a resource in a context that did not start beforeCheckpoint invocations yet > * registering resource in 
previous/running context fails the checkpoint but does not trigger exception immediately > * instead this will be one of the recorded exceptions and the resource has a chance to fire next time > * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Make comparator private ------------- Changes: - all: https://git.openjdk.org/crac/pull/60/files - new: https://git.openjdk.org/crac/pull/60/files/4adad664..29272639 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=60&range=11 - incr: https://webrevs.openjdk.org/?repo=crac&pr=60&range=10-11 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/crac/pull/60.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/60/head:pull/60 PR: https://git.openjdk.org/crac/pull/60 From duke at openjdk.org Wed May 10 12:26:48 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 12:26:48 GMT Subject: [crac] RFR: Minor code cleanup and improvements Message-ID: Extracted non-essential changes from other PR. 
------------- Commit messages: - Minor code cleanup and improvements Changes: https://git.openjdk.org/crac/pull/64/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=64&range=00 Stats: 81 lines in 8 files changed: 47 ins; 16 del; 18 mod Patch: https://git.openjdk.org/crac/pull/64.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/64/head:pull/64 PR: https://git.openjdk.org/crac/pull/64 From duke at openjdk.org Wed May 10 12:49:03 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 12:49:03 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v13] In-Reply-To: References: Message-ID: > * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws > * allows to register a resource in a context that did not start beforeCheckpoint invocations yet > * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately > * instead this will be one of the recorded exceptions and the resource has a chance to fire next time > * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Revert removing the logging configuration ------------- Changes: - all: https://git.openjdk.org/crac/pull/60/files - new: https://git.openjdk.org/crac/pull/60/files/29272639..841d0989 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=60&range=12 - incr: https://webrevs.openjdk.org/?repo=crac&pr=60&range=11-12 Stats: 4 lines in 1 file changed: 4 ins; 0 del; 0 mod Patch: https://git.openjdk.org/crac/pull/60.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/60/head:pull/60 PR: https://git.openjdk.org/crac/pull/60 From duke at openjdk.org Wed May 10 12:49:14 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 12:49:14 GMT Subject: [crac] RFR: Refactored javadocs with 
additional details Message-ID: Improve API documentation. ------------- Commit messages: - Refactored javadocs with additional details Changes: https://git.openjdk.org/crac/pull/65/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=65&range=00 Stats: 160 lines in 6 files changed: 105 ins; 46 del; 9 mod Patch: https://git.openjdk.org/crac/pull/65.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/65/head:pull/65 PR: https://git.openjdk.org/crac/pull/65 From akozlov at openjdk.org Wed May 10 13:14:53 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 10 May 2023 13:14:53 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v13] In-Reply-To: References: Message-ID: On Wed, 10 May 2023 12:49:03 GMT, Radim Vansa wrote: >> * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws >> * allows to register a resource in a context that did not start beforeCheckpoint invocations yet >> * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately >> * instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Revert removing the logging configuration LGTM! Let's wait for tests result ------------- Marked as reviewed by akozlov (Lead). 
PR Review: https://git.openjdk.org/crac/pull/60#pullrequestreview-1420614877 From akozlov at openjdk.org Wed May 10 13:14:53 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 10 May 2023 13:14:53 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v10] In-Reply-To: References: Message-ID: <0xqgNmVgOMlldlClyg_1cfS-cyPC4as5CRyhdSJUwgU=.3e67c52e-0f7c-4572-9d97-dfb69a0f45ff@github.com> On Wed, 10 May 2023 10:54:03 GMT, Radim Vansa wrote: >> src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java line 58: >> >>> 56: } >>> 57: if (e.getMessage() != null) { >>> 58: ce.addSuppressed(e); >> >> What happens with exceptions suppressed by `e`? will we have the same set of exceptions suppressed by `ce` and `e`? > > Yes; there's no way to remove already suppressed exceptions, and the message is too valuable to lose it. I could create an exception with the same message but without suppressed exceptions - then I'd lose its stack trace, though. > I don't see much issues in this duplication. It seems we should have allowed the message in the CheckpointException, to avoid this problem. The only place where CheckpointException with the message is called is recursive checkpoint detection from https://github.com/openjdk/crac/pull/6. OK for this PR. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1189889412 From duke at openjdk.org Wed May 10 13:50:54 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 13:50:54 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v10] In-Reply-To: <0xqgNmVgOMlldlClyg_1cfS-cyPC4as5CRyhdSJUwgU=.3e67c52e-0f7c-4572-9d97-dfb69a0f45ff@github.com> References: <0xqgNmVgOMlldlClyg_1cfS-cyPC4as5CRyhdSJUwgU=.3e67c52e-0f7c-4572-9d97-dfb69a0f45ff@github.com> Message-ID: On Wed, 10 May 2023 13:10:24 GMT, Anton Kozlov wrote: >> Yes; there's no way to remove already suppressed exceptions, and the message is too valuable to lose it. 
I could create an exception with the same message but without suppressed exceptions - then I'd lose its stack trace, though. >> I don't see much issues in this duplication. > > It seems we should have allowed the message in the CheckpointException, to avoid this problem. The only place where CheckpointException with the message is called is recursive checkpoint detection from https://github.com/openjdk/crac/pull/6. > > OK for this PR. That's not entirely true; CheckpointException is a public API and Resources can use it. In fact I find it quite natural to use for reporting any 'generic' error during checkpoint; if we want to limit its usage to internal code we need to make the constructors package-private. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1189940797 From akozlov at openjdk.org Wed May 10 14:50:53 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 10 May 2023 14:50:53 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v5] In-Reply-To: References: Message-ID: <1h2s-b7sop_69bx5PbavykMnj1upfRIfczQOx4Ylwtg=.7ad2d784-5ccf-41a1-96f8-2e6ccd72ca48@github.com> > If, as a result of beforeCheckpoint(), no more non-daemon threads remain, it's possible that the VM exits prematurely, before one of the afterRestore() calls gets a chance to create another non-daemon thread that will keep the VM alive. Triggering the checkpoint via jcmd (so the checkpointRestore() method is executed on the daemon attach-listener thread) increases the probability of hitting the problem. > > The change ensures all notifications are done while there is at least one non-daemon thread. The notification methods are still called from the original thread. 
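The idea of keeping at least one non-daemon thread alive while notifications run can be sketched as follows. This is not the KeepAlive class from the PR; `KeepAliveSketch` and its members are hypothetical names, and the sketch only shows the general technique of parking a non-daemon thread on a latch and tolerating interrupts until it is released.

```java
import java.util.concurrent.CountDownLatch;

class KeepAliveSketch implements AutoCloseable {
    private final CountDownLatch done = new CountDownLatch(1);
    final Thread thread;

    KeepAliveSketch() {
        thread = new Thread(() -> {
            // Stay parked until close() is called; an interrupt alone must not end the thread.
            while (done.getCount() > 0) {
                try {
                    done.await();
                } catch (InterruptedException ignored) {
                    // Loop and wait again until the latch is released.
                }
            }
        }, "keep-alive-sketch");
        thread.setDaemon(false); // a live non-daemon thread prevents the VM from exiting
        thread.start();
    }

    @Override
    public void close() {
        done.countDown(); // lets the helper thread finish once notifications are done
    }
}
```

A caller would wrap the notification phase in try-with-resources, so the helper thread is released on both the normal and the exceptional path while the notifications themselves still run on the calling thread.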
Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: Move KeepAlive to separate class, handle interrupts ------------- Changes: - all: https://git.openjdk.org/crac/pull/62/files - new: https://git.openjdk.org/crac/pull/62/files/c2cfae5d..a3b1242e Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=62&range=04 - incr: https://webrevs.openjdk.org/?repo=crac&pr=62&range=03-04 Stats: 132 lines in 3 files changed: 75 ins; 52 del; 5 mod Patch: https://git.openjdk.org/crac/pull/62.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/62/head:pull/62 PR: https://git.openjdk.org/crac/pull/62 From akozlov at openjdk.org Wed May 10 14:50:53 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 10 May 2023 14:50:53 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v4] In-Reply-To: References: Message-ID: On Fri, 5 May 2023 11:16:50 GMT, Anton Kozlov wrote: >> If as a result of beforeCheckpoint() no more non-daemon threads remain, it's possible that VM exits prematurely, before one of afterRestore() get a chance to create another non-daemon thread that will keep VM alive. Triggering checkpoint via jcmd (so checkpointRestore() method is executed on daemon attach-listener thread), increases probability to step in the problem. >> >> The change ensures all notifications are done while there is at least one non-daemon thread. The notification methods are still called from the original thread. > > Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Cleanup Thanks for the follow-up! I've added the commit to this PR. 
------------- PR Comment: https://git.openjdk.org/crac/pull/62#issuecomment-1542330092 From akozlov at openjdk.org Wed May 10 15:06:45 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 10 May 2023 15:06:45 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore [v2] In-Reply-To: References: Message-ID: On Wed, 3 May 2023 11:25:11 GMT, Radim Vansa wrote: >> src/hotspot/share/runtime/globals.hpp line 2129: >> >>> 2127: "Throw CheckpointException if uncheckpointable resource handle found")\ >>> 2128: \ >>> 2129: product(bool, CRTrace, true, MANAGEABLE, "Minimal C/R tracing") \ >> >> RESTORE_SETTABLE was meant here? Please don't mix in MANAGEABLE flags into this PR if that was intended. > > Looking for usages (actually only one) of the flag, it qualifies to be set at any time. // MANAGEABLE flags are writeable external product flags. // They are dynamically writeable through the JDK management interface // (com.sun.management.HotSpotDiagnosticMXBean API) and also through JConsole. // These flags are external exported interface (see CCC). The list of // manageable flags can be queried programmatically through the management // interface. Manageable does not mean "can be set at any time". All product flags are part of the VM interface, but MANAGEABLE is stricter. Actually, the flag should be eliminated and replaced with Unified Logging, so let's just set RESTORE_SETTABLE for this. 
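For readers unfamiliar with the management interface mentioned in the quoted globals.hpp comment, the list of dynamically writeable flags can be queried roughly like this. The sketch assumes a HotSpot-based JDK with the `jdk.management` module available; `ManageableFlags` and `writeableFlagNames` are illustrative names.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

class ManageableFlags {
    // Returns the names of VM options that the management interface reports as writeable.
    static List<String> writeableFlagNames() {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getDiagnosticOptions().stream()
                .filter(VMOption::isWriteable)
                .map(VMOption::getName)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The exact set of names depends on the JVM build.
        writeableFlagNames().forEach(System.out::println);
    }
}
```

Whether a flag shows up here says nothing about RESTORE_SETTABLE, which is a CRaC-specific attribute and is not exposed through this bean as far as this thread indicates.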
------------- PR Review Comment: https://git.openjdk.org/crac/pull/61#discussion_r1190047188 From akozlov at openjdk.org Wed May 10 15:21:58 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 10 May 2023 15:21:58 GMT Subject: git: openjdk/crac: crac: Fix ordering of invocation on Resources Message-ID: <3d284819-82c1-4d44-8563-d8481f190898@openjdk.org> Changeset: ef2437e7 Author: Radim Vansa Committer: Anton Kozlov Date: 2023-05-10 15:21:01 +0000 URL: https://git.openjdk.org/crac/commit/ef2437e7aaaabcbb58366eb84efbb7ebe5934c1f Fix ordering of invocation on Resources Reviewed-by: akozlov ! src/java.base/share/classes/javax/crac/Core.java ! src/java.base/share/classes/jdk/crac/Core.java ! src/java.base/share/classes/jdk/crac/impl/AbstractContextImpl.java ! src/java.base/share/classes/jdk/crac/impl/OrderedContext.java + src/java.base/share/classes/jdk/crac/impl/PriorityContext.java ! src/java.base/share/classes/jdk/internal/crac/JDKContext.java + src/java.base/share/classes/jdk/internal/crac/LoggerContainer.java ! src/java.base/share/classes/jdk/internal/util/jar/PersistentJarFile.java + test/jdk/jdk/crac/ContextOrderTest.java - test/jdk/jdk/crac/LazyProps.java - test/jdk/jdk/crac/ResourceTest.java + test/jdk/jdk/crac/logging.properties From duke at openjdk.org Wed May 10 15:23:25 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 15:23:25 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore [v4] In-Reply-To: References: Message-ID: > When a JVM option is MANAGEABLE it can be set at any time during runtime, therefore it is safe to change it during the restore operation. Rather than silently ignoring JVM options passed along with -XX:CRaCRestoreFrom we send them to the restored process and either update or print a warning if given option cannot be changed. 
Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Make CRTrace RESTORE_SETTABLE rather than MANAGEABLE ------------- Changes: - all: https://git.openjdk.org/crac/pull/61/files - new: https://git.openjdk.org/crac/pull/61/files/75ce1b64..920b3be8 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=61&range=03 - incr: https://webrevs.openjdk.org/?repo=crac&pr=61&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/crac/pull/61.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/61/head:pull/61 PR: https://git.openjdk.org/crac/pull/61 From akozlov at openjdk.org Wed May 10 15:24:50 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 10 May 2023 15:24:50 GMT Subject: [crac] RFR: Minor code cleanup and improvements In-Reply-To: References: Message-ID: <_AMDGDW7jBHD0sWCap1QKdw3fpk8l4yG8SquNCZdRRY=.efb87be3-45f8-4c99-b470-b7cba2669ed7@github.com> On Wed, 10 May 2023 12:20:07 GMT, Radim Vansa wrote: > Extracted non-essential changes from other PR. src/java.base/share/classes/javax/crac/CheckpointException.java line 50: > 48: * @param message the detail message. > 49: */ > 50: public CheckpointException(String message) { What if we remove this constructor and hide the other one? https://github.com/openjdk/crac/pull/60/files#r1190070472 ------------- PR Review Comment: https://git.openjdk.org/crac/pull/64#discussion_r1190072341 From akozlov at openjdk.org Wed May 10 15:25:56 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 10 May 2023 15:25:56 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v10] In-Reply-To: References: <0xqgNmVgOMlldlClyg_1cfS-cyPC4as5CRyhdSJUwgU=.3e67c52e-0f7c-4572-9d97-dfb69a0f45ff@github.com> Message-ID: On Wed, 10 May 2023 13:48:08 GMT, Radim Vansa wrote: >> It seems we should have allowed the message in the CheckpointException, to avoid this problem. 
The only place where CheckpointException with the message is called is recursive checkpoint detection from https://github.com/openjdk/crac/pull/6. >> >> OK for this PR. > > That's not entirely true; CheckpointException is a public API and Resources can use it. In fact I find it quite natural to use for reporting any 'generic' error during checkpoint; if we want to limit usage to internal we need to make the constructors package-private. Hiding all constructors makes sense. I think it will be helpful to make CheckpointException, CheckpointResourceExceptions and descendants something formal, something that can be programmatically queried for the problem. I propose to investigate the possibility in #64, which touches related pieces. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/60#discussion_r1190070472 From duke at openjdk.org Wed May 10 15:26:00 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 15:26:00 GMT Subject: [crac] Integrated: Fix ordering of invocation on Resources In-Reply-To: References: Message-ID: On Fri, 21 Apr 2023 09:01:07 GMT, Radim Vansa wrote: > * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws > * allows to register a resource in a context that did not start beforeCheckpoint invocations yet > * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately > * instead this will be one of the recorded exceptions and the resource has a chance to fire next time > * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though This pull request has now been integrated. 
Changeset: ef2437e7 Author: Radim Vansa Committer: Anton Kozlov URL: https://git.openjdk.org/crac/commit/ef2437e7aaaabcbb58366eb84efbb7ebe5934c1f Stats: 885 lines in 12 files changed: 622 ins; 203 del; 60 mod Fix ordering of invocation on Resources Reviewed-by: akozlov ------------- PR: https://git.openjdk.org/crac/pull/60 From akozlov at openjdk.org Wed May 10 15:40:44 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 10 May 2023 15:40:44 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore [v4] In-Reply-To: References: Message-ID: On Wed, 10 May 2023 15:23:25 GMT, Radim Vansa wrote: >> When a JVM option is MANAGEABLE it can be set at any time during runtime, therefore it is safe to change it during the restore operation. Rather than silently ignoring JVM options passed along with -XX:CRaCRestoreFrom we send them to the restored process and either update or print a warning if given option cannot be changed. > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Make CRTrace RESTORE_SETTABLE rather than MANAGEABLE Thank you! LGTM ------------- Marked as reviewed by akozlov (Lead). PR Review: https://git.openjdk.org/crac/pull/61#pullrequestreview-1420925330 From duke at openjdk.org Wed May 10 16:06:45 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 16:06:45 GMT Subject: [crac] RFR: Improved open file descriptors tracking [v10] In-Reply-To: References: Message-ID: > Tracks `java.io.FileDescriptor` instances as CRaC resource; before checkpoint these are reported and if not allow-listed (e.g. as opened as standard descriptors) an exception is thrown. Further investigation can use system property `jdk.crac.collect-fd-stacktraces=true` to record origin of those file descriptors. 
> File descriptors claimed in Java code are passed to native; native code checks all open file descriptors and reports error if there's an unexpected FD that is not included in the list passed previously. Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 51 commits: - Merge remote-tracking branch 'origin/crac' into newfd - Fixup - Merge branch 'context_order' into newfd - Revert removing the logging configuration - Make comparator private - Fix build - Minified the set of changes - Fixup - Remove mention of single-threaded notifications - Update docs - ... and 41 more: https://git.openjdk.org/crac/compare/ef2437e7...8fd3566c ------------- Changes: https://git.openjdk.org/crac/pull/43/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=43&range=09 Stats: 999 lines in 33 files changed: 682 ins; 276 del; 41 mod Patch: https://git.openjdk.org/crac/pull/43.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/43/head:pull/43 PR: https://git.openjdk.org/crac/pull/43 From duke at openjdk.org Wed May 10 16:06:45 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 10 May 2023 16:06:45 GMT Subject: [crac] RFR: Improved open file descriptors tracking [v9] In-Reply-To: References: Message-ID: <-aDBFxAmClx32o8FzumSOu80J2TaWlZO7bOTEr6hXHE=.85a43788-8ea4-4920-8c83-9a7d9e6e0c17@github.com> On Mon, 24 Apr 2023 13:27:30 GMT, Radim Vansa wrote: >> Tracks `java.io.FileDescriptor` instances as CRaC resource; before checkpoint these are reported and if not allow-listed (e.g. as opened as standard descriptors) an exception is thrown. Further investigation can use system property `jdk.crac.collect-fd-stacktraces=true` to record origin of those file descriptors. >> File descriptors claimed in Java code are passed to native; native code checks all open file descriptors and reports error if there's an unexpected FD that is not included in the list passed previously. 
> > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Provide more information for file descriptors Converting to draft until #60 gets integrated. Also note that RefQueueTest is failing (at least on my machine) because of race documented in that test. ------------- PR Comment: https://git.openjdk.org/crac/pull/43#issuecomment-1540461583 From akozlov at openjdk.org Wed May 10 17:44:44 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 10 May 2023 17:44:44 GMT Subject: [crac] RFR: Improved open file descriptors tracking [v10] In-Reply-To: References: Message-ID: On Wed, 10 May 2023 16:06:45 GMT, Radim Vansa wrote: >> Tracks `java.io.FileDescriptor` instances as CRaC resource; before checkpoint these are reported and if not allow-listed (e.g. as opened as standard descriptors) an exception is thrown. Further investigation can use system property `jdk.crac.collect-fd-stacktraces=true` to record origin of those file descriptors. >> File descriptors claimed in Java code are passed to native; native code checks all open file descriptors and reports error if there's an unexpected FD that is not included in the list passed previously. > > Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 51 commits: > > - Merge remote-tracking branch 'origin/crac' into newfd > - Fixup > - Merge branch 'context_order' into newfd > - Revert removing the logging configuration > - Make comparator private > - Fix build > - Minified the set of changes > - Fixup > - Remove mention of single-threaded notifications > - Update docs > - ... and 41 more: https://git.openjdk.org/crac/compare/ef2437e7...8fd3566c I have a follow-up change based on this, but it's rather massive. So I propose to integrate this and for me to create an another PR immediately. Sounds good? ------------- Marked as reviewed by akozlov (Lead). 
PR Review: https://git.openjdk.org/crac/pull/43#pullrequestreview-1421118278 From duke at openjdk.org Thu May 11 06:16:13 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 11 May 2023 06:16:13 GMT Subject: [crac] RFR: Minor code cleanup and improvements In-Reply-To: <_AMDGDW7jBHD0sWCap1QKdw3fpk8l4yG8SquNCZdRRY=.efb87be3-45f8-4c99-b470-b7cba2669ed7@github.com> References: <_AMDGDW7jBHD0sWCap1QKdw3fpk8l4yG8SquNCZdRRY=.efb87be3-45f8-4c99-b470-b7cba2669ed7@github.com> Message-ID: <_HutEHb0_FgiGAmImfmICJMIl2KJX8IlDVwpxTdAw28=.db1c4633-74a9-462f-a809-7ef3c4cc236f@github.com> On Wed, 10 May 2023 15:21:57 GMT, Anton Kozlov wrote: >> Extracted non-essential changes from other PR. > > src/java.base/share/classes/javax/crac/CheckpointException.java line 50: > >> 48: * @param message the detail message. >> 49: */ >> 50: public CheckpointException(String message) { > > What if we remove this constructor and hide the other one? https://github.com/openjdk/crac/pull/60/files#r1190070472 First of all, I think that `javax.crac` should mirror `jdk.crac` API- and docs-wise. Migration will be much easier if everyone can just change the imports. About the constructor with a message: I find it a bit confusing when an exception is thrown because of some problem with `criu` but there's no actionable message. I have added a simple 'Native checkpoint failed', but we should probably point the user to the dump4.log file. (`criuengine` should also make some sanity checks on permissions, but that's another thing.) I wouldn't object to hiding it, though. About the one without a message: `Context` narrows the `throws` clause to CE/RE (`CheckpointException`/`RestoreException`), and since we expect users to implement `Context`, hiding it would give them no chance to throw checked exceptions, not even the aggregating one. Hiding that one won't work.
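The point about the narrowed `throws` clause can be illustrated with a minimal sketch. Everything below is a hypothetical, simplified stand-in (the real `Resource` and `Context` methods take a context argument and live in `jdk.crac`/`javax.crac`); it only shows why a user-written context needs an accessible `CheckpointException` constructor in order to report aggregated failures.

```java
import java.util.ArrayList;
import java.util.List;

public class NarrowedThrowsSketch {
    // Hypothetical stand-in for CheckpointException; not the real class.
    public static class CheckpointException extends Exception {
        public CheckpointException() { super(); }
        public CheckpointException(String message) { super(message); }
    }

    // Simplified resource: may throw any checked exception.
    public interface Resource {
        void beforeCheckpoint() throws Exception;
    }

    // Simplified context: narrows the throws clause, as described in the mail.
    public interface Context extends Resource {
        @Override
        void beforeCheckpoint() throws CheckpointException;
    }

    // A user-written context. Because of the narrowed throws clause, the only
    // checked exception it can raise is CheckpointException, so it must be
    // able to construct one itself.
    public static class UserContext implements Context {
        private final List<Resource> resources = new ArrayList<>();

        public void register(Resource r) { resources.add(r); }

        @Override
        public void beforeCheckpoint() throws CheckpointException {
            CheckpointException aggregate = null;
            for (Resource r : resources) {
                try {
                    r.beforeCheckpoint();          // keep going after a failure
                } catch (Exception e) {
                    if (aggregate == null) {
                        aggregate = new CheckpointException();
                    }
                    aggregate.addSuppressed(e);    // collect every failure
                }
            }
            if (aggregate != null) {
                throw aggregate;
            }
        }
    }

    // Helper: number of failures collected by the context, 0 if none.
    public static int countFailures(Context ctx) {
        try {
            ctx.beforeCheckpoint();
            return 0;
        } catch (CheckpointException e) {
            return e.getSuppressed().length;
        }
    }

    public static void main(String[] args) {
        UserContext ctx = new UserContext();
        ctx.register(() -> { throw new java.io.IOException("fd still open"); });
        ctx.register(() -> { });
        ctx.register(() -> { throw new IllegalStateException("not ready"); });
        System.out.println("failures: " + countFailures(ctx));
    }
}
```

If the no-argument constructor were hidden in this sketch, `UserContext` would have no checked exception left that it is allowed to throw.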
------------- PR Review Comment: https://git.openjdk.org/crac/pull/64#discussion_r1190679470 From duke at openjdk.org Thu May 11 06:21:08 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 11 May 2023 06:21:08 GMT Subject: [crac] RFR: Improved open file descriptors tracking [v10] In-Reply-To: References: Message-ID: <0bG9Cm3G8opo4NCbEzb3KMLRnuGeinvXf1Dxip-AyNQ=.2ec4c2e9-7436-48a5-a65f-1419c49df2f4@github.com> On Wed, 10 May 2023 16:06:45 GMT, Radim Vansa wrote: >> Tracks `java.io.FileDescriptor` instances as CRaC resource; before checkpoint these are reported and if not allow-listed (e.g. as opened as standard descriptors) an exception is thrown. Further investigation can use system property `jdk.crac.collect-fd-stacktraces=true` to record origin of those file descriptors. >> File descriptors claimed in Java code are passed to native; native code checks all open file descriptors and reports error if there's an unexpected FD that is not included in the list passed previously. > > Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 51 commits: > > - Merge remote-tracking branch 'origin/crac' into newfd > - Fixup > - Merge branch 'context_order' into newfd > - Revert removing the logging configuration > - Make comparator private > - Fix build > - Minified the set of changes > - Fixup > - Remove mention of single-threaded notifications > - Update docs > - ... and 41 more: https://git.openjdk.org/crac/compare/ef2437e7...8fd3566c Alright, go for it. 
------------- PR Comment: https://git.openjdk.org/crac/pull/43#issuecomment-1543400877 From duke at openjdk.org Thu May 11 06:25:09 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 11 May 2023 06:25:09 GMT Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v5] In-Reply-To: <1h2s-b7sop_69bx5PbavykMnj1upfRIfczQOx4Ylwtg=.7ad2d784-5ccf-41a1-96f8-2e6ccd72ca48@github.com> References: <1h2s-b7sop_69bx5PbavykMnj1upfRIfczQOx4Ylwtg=.7ad2d784-5ccf-41a1-96f8-2e6ccd72ca48@github.com> Message-ID: On Wed, 10 May 2023 14:50:53 GMT, Anton Kozlov wrote: >> If as a result of beforeCheckpoint() no more non-daemon threads remain, it's possible that VM exits prematurely, before one of afterRestore() get a chance to create another non-daemon thread that will keep VM alive. Triggering checkpoint via jcmd (so checkpointRestore() method is executed on daemon attach-listener thread), increases probability to step in the problem. >> >> The change ensures all notifications are done while there is at least one non-daemon thread. The notification methods are still called from the original thread. > > Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Move KeepAlive to separate class, handle interrupts I am practically approving my own changes, but ok. ------------- Marked as reviewed by rvansa at github.com (no known OpenJDK username). PR Review: https://git.openjdk.org/crac/pull/62#pullrequestreview-1421830947 From duke at openjdk.org Thu May 11 14:28:15 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Thu, 11 May 2023 14:28:15 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v6] In-Reply-To: References: Message-ID: On Thu, 27 Apr 2023 11:55:53 GMT, Radim Vansa wrote: >> There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`. 
When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value. > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Use image under ghcr.io/crac src/hotspot/share/runtime/os.cpp line 2069: > 2067: } > 2068: } > 2069: } No newline at end of file src/java.base/share/classes/java/lang/System.java line 69: > 67: import java.util.concurrent.ConcurrentHashMap; > 68: import java.util.stream.Stream; > 69: Excessive whitespace change. src/java.base/share/classes/java/lang/System.java line 2453: > 2451: }); > 2452: } > 2453: Excessive whitespace change. test/hotspot/jtreg/testlibrary/jittester/conf/exclude.methods.lst line 29: > 27: java/lang/System::loadLibrary(Ljava/lang/String;) > 28: java/lang/System::mapLibraryName(Ljava/lang/String;) > 29: java/lang/System::nanoTime0() Is this change really needed? 
test/jdk/jdk/crac/java/lang/System/NanoTimeTest.java line 82: > 80: "-e", "LD_PRELOAD=/opt/path-mapping-quiet.so", > 81: "-e", "PATH_MAPPING=/proc/sys/kernel/random/boot_id:/fake_boot_id", > 82: CracBuilder.CONTAINER_NAME, CracBuilder.DOCKER_JAVA); On Fedora 36 x86_64 the testcase does not work for me: Starting docker container: docker run --rm -d --privileged --init --volume /home/azul/azul/crac-git/JTwork/classes/jdk/crac/java/lang/System/NanoTimeTest.d:/cp/0 --volume /home/azul/azul/crac-git/JTwork/classes/test/lib:/cp/1 --volume cr:/cr --volume /home/azul/azul/crac-git/build/linux-x86_64-server-fastdebug/jdk/lib/criu:/criu --env CRAC_CRIU_PATH=/criu --name crac-test -v /tmp/NanoTimeTest-3201524983642970594-boot_id:/fake_boot_id jdk-internal:test-system-nanotime sleep 3600 Starting process to be checkpointed: docker exec -e LD_PRELOAD=/opt/path-mapping-quiet.so -e PATH_MAPPING=/proc/sys/kernel/random/boot_id:/fake_boot_id crac-test /jdk/bin/java -ea -cp /cp/0:/cp/1: -XX:CRaCCheckpointTo=cr jdk.test.lib.crac.CracTest __run_test__ NanoTimeTest 0 true /criu: error while loading shared libraries: libbsd.so.0: cannot open shared object file: No such file or directory Exception in thread "main" jdk.crac.CheckpointException ------------- PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1191248589 PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1191248975 PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1191249120 PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1191249848 PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1191253534 From duke at openjdk.org Thu May 11 14:32:16 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Thu, 11 May 2023 14:32:16 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v6] In-Reply-To: References: Message-ID: On Thu, 27 Apr 2023 11:55:53 GMT, Radim Vansa wrote: >> There are various places both inside JDK and in 
libraries that rely on monotonicity of `System.nanotime()`. When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value. > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Use image under ghcr.io/crac On Fedora 36 x86_64 when I snapshot the image, reboot and restore it with boottime earlier than it was during the snapshot I get a hanging restore: #0 0x00007f43778899b9 in __futex_abstimed_wait_common () from /lib64/libc.so.6 #1 0x00007f437788e983 in __pthread_clockjoin_ex () from /lib64/libc.so.6 #2 0x00007f4377abe655 in CallJavaMainInNewThread (stack_size=1048576, args=0x7ffdd525b5d0) at ../src/java.base/unix/native/libjli/java_md.c:681 #3 0x00007f4377abb7fd in ContinueInNewThread (ifn=0x7ffdd525b6d0, ifn at entry=0x0, threadStackSize=, argc=, argv=0x5641c59c75f8, mode=mode at entry=1, what=0x5641c59c7380 "NanoTime", what at entry=0x0, ret=0) at ../src/java.base/share/native/libjli/java.c:2362 #4 0x00007f4377abe709 in JVMInit (ifn=0x0, ifn at entry=0x7ffdd525b6d0, threadStackSize=, argc=, argv=, mode=mode at entry=1, what=0x0, what at entry=0x5641c59c7380 "NanoTime", ret=) at ../src/java.base/unix/native/libjli/java_md.c:706 #5 0x00007f4377abc4e0 in JLI_Launch (argc=, argc at entry=3, argv=, argv at entry=0x5641c59c72c0, jargc=jargc at entry=0, jargv=jargv at entry=0x5641c3cfa008 , appclassc=appclassc at entry=0, appclassv=appclassv at entry=0x0, fullversion=0x5641c3cf8088 "17-internal+0-adhoc.azul.crac-git", dotversion=0x5641c3cf8079 "0.0", pname=0x5641c3cf8074 "java", lname=0x5641c3cf806c "openjdk", javaargs=0 '\000', cpwildcard=1 '\001', javaw=0 '\000', ergo=0) at ../src/java.base/share/native/libjli/java.c:342 #6 0x00005641c3cf730a in main (argc=, argv=) at 
../src/java.base/share/native/launcher/main.c:282 ------------- PR Comment: https://git.openjdk.org/crac/pull/53#issuecomment-1544091438 From akozlov at openjdk.org Thu May 11 14:39:27 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 11 May 2023 14:39:27 GMT Subject: [crac] RFR: Improved open file descriptors tracking [v10] In-Reply-To: References: Message-ID: <2Lfs9rZauU41wjDfYBTKBI-C8QVFCZyLFpw125nMa2M=.617a691c-ae7b-4ffe-b1be-143505718f7a@github.com> On Wed, 10 May 2023 16:06:45 GMT, Radim Vansa wrote: >> Tracks `java.io.FileDescriptor` instances as CRaC resource; before checkpoint these are reported and if not allow-listed (e.g. as opened as standard descriptors) an exception is thrown. Further investigation can use system property `jdk.crac.collect-fd-stacktraces=true` to record origin of those file descriptors. >> File descriptors claimed in Java code are passed to native; native code checks all open file descriptors and reports error if there's an unexpected FD that is not included in the list passed previously. > > Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 51 commits: > > - Merge remote-tracking branch 'origin/crac' into newfd > - Fixup > - Merge branch 'context_order' into newfd > - Revert removing the logging configuration > - Make comparator private > - Fix build > - Minified the set of changes > - Fixup > - Remove mention of single-threaded notifications > - Update docs > - ... and 41 more: https://git.openjdk.org/crac/compare/ef2437e7...8fd3566c Thanks for making the progress in this! 
And for keeping things moving :) ------------- PR Comment: https://git.openjdk.org/crac/pull/43#issuecomment-1544103294 From duke at openjdk.org Thu May 11 14:43:31 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Thu, 11 May 2023 14:43:31 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v2] In-Reply-To: References: <-i0uB8ZW7r54hoKQJ_wODUXNKVkOI5rH7SJTEhSHiDw=.75ebe53a-9081-40c6-911f-048b17e8850e@github.com> Message-ID: On Thu, 13 Apr 2023 15:45:50 GMT, Anton Kozlov wrote: >> @AntonKozlov >> >>> Crac-criu does not use restore timens [1] since once a bug in kernel or criu caused timedwait to return immediately every time it is called after restore. I don't remember the bug exactly (already fixed), but I believe it was discussed on this mailing list >> >> https://github.com/CRaC/criu/commit/1cb2f4a518a4ae471a1df7a9b540203c1efaf1ba commit is dated July 14, 2020, but the earliest posts in the crac-dev archives are from Sept 2021. Is there some other mailing list this was discussed on? I am interested in understanding the problem that prompted not to use timens in criu. >> Since you mention it was a bug in kernel or criu and it has been almost 3 years since your commit, maybe it is worth enabling the criu changes again to see if the timedwait problem still exists, unless you have already done that. >> >>> In general, we should not depend on very obscure linux abilities, as this reduces the chances we'd be able to run on something other than linux. >> >> I don't think timens can be put in the category of obscure linux ability. It has even made its way into container runtime spec: https://github.com/opencontainers/runtime-spec/pull/1151. > >> Since you mention it was a bug in kernel or criu and it has been almost 3 years since your commit, maybe it is worth enabling the criu changes again to see if the timedwait problem still exists, unless you have already done that. > > AFAIK the bug is fixed, but I see no point in relying on the OS here. Is there one?
Timens that is not changed by CRIU provides correct values for our nanoTime() [1]. > >> The value returned represents nanoseconds since some fixed but arbitrary origin time (perhaps in the future, so values may be negative). The same origin is used by all invocations of this method in an instance of a Java virtual machine > > [1] https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/System.html#nanoTime() Upstream criu does provide the time namespace as stated by @AntonKozlov: CLOCK_MONOTONIC=301.735134591 CLOCK_BOOTTIME=301.735155494 CLOCK_MONOTONIC=302.735345917 CLOCK_BOOTTIME=302.735358109 Warn (compel/arch/x86/src/lib/infect.c:356): Will restore 7757 with interrupted system call [1]+ Killed ./clock_gettime restore: CLOCK_MONOTONIC=302.803360137 CLOCK_BOOTTIME=302.803373299 restore: CLOCK_MONOTONIC=302.806677876 CLOCK_BOOTTIME=302.806696098 I do not see why JVM should reimplement what CRIU already does. One can solve that in the future when CRaC is really going to run on non-Linux system. One will need to port or reimplement there CRIU in the first place. ------------- PR Comment: https://git.openjdk.org/crac/pull/53#issuecomment-1544109680 From duke at openjdk.org Thu May 11 14:43:35 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Thu, 11 May 2023 14:43:35 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v6] In-Reply-To: References: Message-ID: On Thu, 27 Apr 2023 11:55:53 GMT, Radim Vansa wrote: >> There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`. When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value. 
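The adjustment described in the PR summary just above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the PR: the class and method names are hypothetical, clock readings are passed in as parameters so the logic can be followed without a real checkpoint, and the wall clocks of the two machines are assumed to be roughly in sync.

```java
public class NanoTimeOffsetSketch {
    // Offset added to the raw monotonic clock so that the origin stays fixed
    // across restores. Hypothetical field, for illustration only.
    private long offsetNanos = 0;

    private long checkpointWallMillis;      // wall clock at checkpoint
    private long checkpointAdjustedNanos;   // adjusted nanoTime at checkpoint

    // Record both clocks right before the checkpoint.
    public void beforeCheckpoint(long wallMillis, long rawMonotonicNanos) {
        checkpointWallMillis = wallMillis;
        checkpointAdjustedNanos = rawMonotonicNanos + offsetNanos;
    }

    // After restore, estimate how much time passed using the wall clock and
    // pick an offset that makes the adjusted clock continue from there.
    public void afterRestore(long wallMillis, long rawMonotonicNanos) {
        // If the wall clock appears to have gone backwards, assume no time
        // passed, so the adjusted value never moves backwards.
        long elapsedMillis = Math.max(0L, wallMillis - checkpointWallMillis);
        long expectedNanos = checkpointAdjustedNanos + elapsedMillis * 1_000_000L;
        offsetNanos = expectedNanos - rawMonotonicNanos;
    }

    // The value a caller of the adjusted clock would observe.
    public long nanoTime(long rawMonotonicNanos) {
        return rawMonotonicNanos + offsetNanos;
    }
}
```

With this shape the raw clock may restart near zero on the new machine, yet callers still see a value that is not smaller than the last one observed before the checkpoint. The accuracy is bounded by the wall clock, which is why the summary only promises a "reasonably correct" value.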
> > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Use image under ghcr.io/crac test/jdk/jdk/crac/java/lang/System/NanoTimeTest.java line 121: > 119: assertGTE(boottimeAfter, boottimeBefore + 86_400_000, "Boottime was not changed"); > 120: RuntimeMXBean runtimeMX = ManagementFactory.getRuntimeMXBean(); > 121: assertGTE(runtimeMX.getUptime(), 0L,"VM Uptime is negative!"); whitespace ------------- PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1191281406 From duke at openjdk.org Thu May 11 14:57:28 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 11 May 2023 14:57:28 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v6] In-Reply-To: References: Message-ID: On Thu, 11 May 2023 14:25:49 GMT, Jan Kratochvil wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Use image under ghcr.io/crac > > test/jdk/jdk/crac/java/lang/System/NanoTimeTest.java line 82: > >> 80: "-e", "LD_PRELOAD=/opt/path-mapping-quiet.so", >> 81: "-e", "PATH_MAPPING=/proc/sys/kernel/random/boot_id:/fake_boot_id", >> 82: CracBuilder.CONTAINER_NAME, CracBuilder.DOCKER_JAVA); > > On Fedora 36 x86_64 the testcase does not work for me: > > Starting docker container: > docker run --rm -d --privileged --init --volume /home/azul/azul/crac-git/JTwork/classes/jdk/crac/java/lang/System/NanoTimeTest.d:/cp/0 --volume /home/azul/azul/crac-git/JTwork/classes/test/lib:/cp/1 --volume cr:/cr --volume /home/azul/azul/crac-git/build/linux-x86_64-server-fastdebug/jdk/lib/criu:/criu --env CRAC_CRIU_PATH=/criu --name crac-test -v /tmp/NanoTimeTest-3201524983642970594-boot_id:/fake_boot_id jdk-internal:test-system-nanotime sleep 3600 > Starting process to be checkpointed: > docker exec -e LD_PRELOAD=/opt/path-mapping-quiet.so -e PATH_MAPPING=/proc/sys/kernel/random/boot_id:/fake_boot_id crac-test /jdk/bin/java -ea -cp /cp/0:/cp/1: 
-XX:CRaCCheckpointTo=cr jdk.test.lib.crac.CracTest __run_test__ NanoTimeTest 0 true > /criu: error while loading shared libraries: libbsd.so.0: cannot open shared object file: No such file or directory > Exception in thread "main" jdk.crac.CheckpointException Are you using your own build of CRIU? Looks like you have CRIU build with libbsd support. When you remove `libbsd-devel` and rebuild CRIU this should work. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1191307243 From duke at openjdk.org Thu May 11 15:09:21 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 11 May 2023 15:09:21 GMT Subject: [crac] Integrated: Improved open file descriptors tracking In-Reply-To: References: Message-ID: <3tCebXARUaioV_D2GWZg7GPnugyz5940VFHZtg3jFPc=.f3b1808f-9501-4284-a1db-ee153b347f14@github.com> On Tue, 31 Jan 2023 10:25:36 GMT, Radim Vansa wrote: > Tracks `java.io.FileDescriptor` instances as CRaC resource; before checkpoint these are reported and if not allow-listed (e.g. as opened as standard descriptors) an exception is thrown. Further investigation can use system property `jdk.crac.collect-fd-stacktraces=true` to record origin of those file descriptors. > File descriptors claimed in Java code are passed to native; native code checks all open file descriptors and reports error if there's an unexpected FD that is not included in the list passed previously. This pull request has now been integrated. 
Changeset: 4b0dc2dc Author: Radim Vansa Committer: Anton Kozlov URL: https://git.openjdk.org/crac/commit/4b0dc2dc9722945579c9772b335a44fa79f1729f Stats: 999 lines in 33 files changed: 682 ins; 276 del; 41 mod Improved open file descriptors tracking Reviewed-by: akozlov ------------- PR: https://git.openjdk.org/crac/pull/43 From duke at openjdk.org Thu May 11 15:18:17 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Thu, 11 May 2023 15:18:17 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v6] In-Reply-To: References: Message-ID: On Thu, 11 May 2023 14:54:58 GMT, Radim Vansa wrote: >> test/jdk/jdk/crac/java/lang/System/NanoTimeTest.java line 82: >> >>> 80: "-e", "LD_PRELOAD=/opt/path-mapping-quiet.so", >>> 81: "-e", "PATH_MAPPING=/proc/sys/kernel/random/boot_id:/fake_boot_id", >>> 82: CracBuilder.CONTAINER_NAME, CracBuilder.DOCKER_JAVA); >> >> On Fedora 36 x86_64 the testcase does not work for me: >> >> Starting docker container: >> docker run --rm -d --privileged --init --volume /home/azul/azul/crac-git/JTwork/classes/jdk/crac/java/lang/System/NanoTimeTest.d:/cp/0 --volume /home/azul/azul/crac-git/JTwork/classes/test/lib:/cp/1 --volume cr:/cr --volume /home/azul/azul/crac-git/build/linux-x86_64-server-fastdebug/jdk/lib/criu:/criu --env CRAC_CRIU_PATH=/criu --name crac-test -v /tmp/NanoTimeTest-3201524983642970594-boot_id:/fake_boot_id jdk-internal:test-system-nanotime sleep 3600 >> Starting process to be checkpointed: >> docker exec -e LD_PRELOAD=/opt/path-mapping-quiet.so -e PATH_MAPPING=/proc/sys/kernel/random/boot_id:/fake_boot_id crac-test /jdk/bin/java -ea -cp /cp/0:/cp/1: -XX:CRaCCheckpointTo=cr jdk.test.lib.crac.CracTest __run_test__ NanoTimeTest 0 true >> /criu: error while loading shared libraries: libbsd.so.0: cannot open shared object file: No such file or directory >> Exception in thread "main" jdk.crac.CheckpointException > > Are you using your own build of CRIU? Looks like you have CRIU build with libbsd support. 
When you remove `libbsd-devel` and rebuild CRIU this should work. Yes. When I remove `libbsd-devel` I get: /usr/bin/ld: /home/azul/azul/criu-git/criu/apparmor.c:127: undefined reference to `strlcpy' /home/azul/azul/criu-git/criu/crtools.c:130: undefined reference to `setproctitle_init' Besides that no package should break (despite only its testcase) when some additional system feature is available. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1191337406 From akozlov at openjdk.org Thu May 11 16:11:51 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 11 May 2023 16:11:51 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v13] In-Reply-To: References: Message-ID: On Wed, 10 May 2023 12:49:03 GMT, Radim Vansa wrote: >> * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws >> * allows to register a resource in a context that did not start beforeCheckpoint invocations yet >> * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately >> * instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Revert removing the logging configuration I got this after some unrelated modifications (FileDescriptor.beforeCheckpoint uses lambda), and I apparently get auto-deadlock with a single thread involved: "main" #1 prio=5 os_prio=0 cpu=88.95ms elapsed=21.61s tid=0x00007fd670025160 nid=0x27228a in Object.wait() [0x00007fd6747fd000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(java.base at 17-internal/Native Method) - waiting on <0x0000000418002088> (a jdk.internal.crac.JDKContext) at 
java.lang.Object.wait(java.base at 17-internal/Object.java:338) at jdk.crac.impl.AbstractContextImpl.waitWhileCheckpointIsInProgress(java.base at 17-internal/AbstractContextImpl.java:102) at jdk.crac.impl.PriorityContext.register(java.base at 17-internal/PriorityContext.java:44) - locked <0x0000000418002088> (a jdk.internal.crac.JDKContext) at jdk.internal.crac.JDKContext.register(java.base at 17-internal/JDKContext.java:97) at jdk.internal.ref.CleanerImpl$PhantomCleanableRef.(java.base at 17-internal/CleanerImpl.java:170) at java.lang.ref.Cleaner.register(java.base at 17-internal/Cleaner.java:220) at java.lang.invoke.MethodHandleNatives$CallSiteContext.make(java.base at 17-internal/MethodHandleNatives.java:90) at java.lang.invoke.CallSite.(java.base at 17-internal/CallSite.java:144) at java.lang.invoke.ConstantCallSite.(java.base at 17-internal/ConstantCallSite.java:50) at java.lang.invoke.InnerClassLambdaMetafactory.buildCallSite(java.base at 17-internal/InnerClassLambdaMetafactory.java:270) at java.lang.invoke.LambdaMetafactory.metafactory(java.base at 17-internal/LambdaMetafactory.java:341) at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base at 17-internal/DirectMethodHandle$Holder) at java.lang.invoke.Invokers$Holder.invokeExact_MT(java.base at 17-internal/Invokers$Holder) at java.lang.invoke.BootstrapMethodInvoker.invoke(java.base at 17-internal/BootstrapMethodInvoker.java:134) at java.lang.invoke.CallSite.makeSite(java.base at 17-internal/CallSite.java:315) at java.lang.invoke.MethodHandleNatives.linkCallSiteImpl(java.base at 17-internal/MethodHandleNatives.java:281) at java.lang.invoke.MethodHandleNatives.linkCallSite(java.base at 17-internal/MethodHandleNatives.java:271) at java.io.FileDescriptor$Resource.beforeCheckpoint(java.base at 17-internal/FileDescriptor.java:74) at jdk.crac.impl.PriorityContext$SubContext.invokeBeforeCheckpoint(java.base at 17-internal/PriorityContext.java:107) at 
jdk.crac.impl.OrderedContext.runBeforeCheckpoint(java.base at 17-internal/OrderedContext.java:70) at jdk.crac.impl.AbstractContextImpl.beforeCheckpoint(java.base at 17-internal/AbstractContextImpl.java:81) at jdk.crac.impl.AbstractContextImpl.invokeBeforeCheckpoint(java.base at 17-internal/AbstractContextImpl.java:41) at jdk.crac.impl.PriorityContext.runBeforeCheckpoint(java.base at 17-internal/PriorityContext.java:70) at jdk.crac.impl.AbstractContextImpl.beforeCheckpoint(java.base at 17-internal/AbstractContextImpl.java:81) at jdk.internal.crac.JDKContext.beforeCheckpoint(java.base at 17-internal/JDKContext.java:85) at jdk.crac.impl.AbstractContextImpl.invokeBeforeCheckpoint(java.base at 17-internal/AbstractContextImpl.java:41) at jdk.crac.impl.OrderedContext.runBeforeCheckpoint(java.base at 17-internal/OrderedContext.java:70) at jdk.crac.impl.AbstractContextImpl.beforeCheckpoint(java.base at 17-internal/AbstractContextImpl.java:81) at jdk.crac.Core.checkpointRestore1(java.base at 17-internal/Core.java:116) at jdk.crac.Core.checkpointRestore(java.base at 17-internal/Core.java:256) - locked <0x0000000418002118> (a java.lang.Object) at jdk.crac.Core.checkpointRestore(java.base at 17-internal/Core.java:241) at CheckpointWithOpenFdsTest.exec(CheckpointWithOpenFdsTest.java:49) at jdk.test.lib.crac.CracTest.run(CracTest.java:157) at jdk.test.lib.crac.CracTest.main(CracTest.java:89) ------------- PR Comment: https://git.openjdk.org/crac/pull/60#issuecomment-1544268184 From duke at openjdk.org Fri May 12 02:00:13 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Fri, 12 May 2023 02:00:13 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v6] In-Reply-To: References: Message-ID: On Thu, 11 May 2023 15:15:23 GMT, Jan Kratochvil wrote: >> Are you using your own build of CRIU? Looks like you have CRIU build with libbsd support. When you remove `libbsd-devel` and rebuild CRIU this should work. > > Yes. 
When I remove `libbsd-devel` I get: > > /usr/bin/ld: /home/azul/azul/criu-git/criu/apparmor.c:127: undefined reference to `strlcpy' > /home/azul/azul/criu-git/criu/crtools.c:130: undefined reference to `setproctitle_init' > > Besides that no package should break (despite only its testcase) when some additional system feature is available. Apart from the discussion of this patch vs. native CRIU timens support, I find it easier (YMMV) and less error-prone to just use a simple `LD_PRELOAD`+`RTLD_NEXT` to simulate the system reboots: https://stackoverflow.com/a/6083624/2995591 ------------- PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1191835421 From duke at openjdk.org Fri May 12 05:49:28 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 12 May 2023 05:49:28 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v7] In-Reply-To: References: Message-ID: > There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`. When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value. Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 18 commits: - Merge branch 'crac' into nanotime - Fix whitespaces - Use image under ghcr.io/crac - Ensure monotonicity for the same boot - Set nanotime only if bootid changes - Reset nanotime offset before calculating it again - Correct time since restore - Merge remote-tracking branch 'origin/crac' into nanotime - Adjust System.nanoTime() to keep consistent time origin after restore - Merge remote-tracking branch 'origin/crac' into test-crac-java - ...
and 8 more: https://git.openjdk.org/crac/compare/ef2437e7...87d19a12 ------------- Changes: https://git.openjdk.org/crac/pull/53/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=53&range=06 Stats: 295 lines in 7 files changed: 268 ins; 0 del; 27 mod Patch: https://git.openjdk.org/crac/pull/53.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/53/head:pull/53 PR: https://git.openjdk.org/crac/pull/53 From duke at openjdk.org Fri May 12 06:04:12 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 12 May 2023 06:04:12 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v6] In-Reply-To: References: Message-ID: On Fri, 12 May 2023 01:56:58 GMT, Jan Kratochvil wrote: >> Yes. When I remove `libbsd-devel` I get: >> >> /usr/bin/ld: /home/azul/azul/criu-git/criu/apparmor.c:127: undefined reference to `strlcpy' >> /home/azul/azul/criu-git/criu/crtools.c:130: undefined reference to `setproctitle_init' >> >> Besides that no package should break (despite only its testcase) when some additional system feature is available. > > Besides discussion of this patch vs. native CRIU timens support I find easier (YMMV) and less error-prone to just use simple `LD_PRELOAD`+`RTLD_NEXT` to simulate the system reboots: https://stackoverflow.com/a/6083624/2995591 I agree with your line of thinking, and since this test actually uses custom docker image I can add those libraries to the base image. But in general it's a problem of criu packaging with implicit dependencies, officially installed binary should get this handled in dependency management. Since those few functions from libbsd are reimplemented in criu, we could just disable the dependency in our fork. However we'll run into the same problem with libnftables - I found that with the nftables support the unprivileged restore as root in Docker container does not work anymore. And that got compiled in only because of the -dev library present, there's no compile time switch. 
The error you describe can be solved with `git clean -f -x`; regrettably, `make clean` does not do its job very well. Not sure which function you are suggesting to replace? ------------- PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1191938800 From duke at openjdk.org Fri May 12 06:48:24 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 12 May 2023 06:48:24 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v13] In-Reply-To: References: Message-ID: On Wed, 10 May 2023 12:49:03 GMT, Radim Vansa wrote: >> * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws >> * allows to register a resource in a context that did not start beforeCheckpoint invocations yet >> * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately >> * instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Revert removing the logging configuration Yes, I've run into this a couple of times as well, when invoking a lambda or using a method reference after the CLEANERS priority. The solution is either to remove those dynamically linked call sites or to initialize them eagerly beforehand. It calls for a general solution, though: adding a priority level for cleaners used only for call sites (a special method would be required in the cleaners). I'll try to sketch that out. One other idea I had, as I prefer exceptions over deadlocks, is to detect this.
Since you don't want to add any methods to the API that would be generally usable by *any* context, it might be still useful to let the C/R code in Core set a static var/thread-local flag in `AbstractContextImpl` and this method would throw rather than block when it detects that it is the same thread. That breaks encapsulation a bit, though - we could have a resource that sets that flag, but it should run before the 'normal' resources, which leads me to the inversion of JDKContext and Global Context and `EARLY` priority I've suggested before. ------------- PR Comment: https://git.openjdk.org/crac/pull/60#issuecomment-1545254988 From duke at openjdk.org Fri May 12 07:25:29 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 12 May 2023 07:25:29 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v13] In-Reply-To: References: Message-ID: On Thu, 11 May 2023 16:04:32 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Revert removing the logging configuration > > I got this after some unrelated modifications (FileDescriptor.beforeCheckpoint uses lambda), and I apparently get auto-deadlock with a single thread involved:
>
> "main" #1 prio=5 os_prio=0 cpu=88.95ms elapsed=21.61s tid=0x00007fd670025160 nid=0x27228a in Object.wait() [0x00007fd6747fd000]
>    java.lang.Thread.State: WAITING (on object monitor)
>     at java.lang.Object.wait(java.base@17-internal/Native Method)
>     - waiting on <0x0000000418002088> (a jdk.internal.crac.JDKContext)
>     at java.lang.Object.wait(java.base@17-internal/Object.java:338)
>     at jdk.crac.impl.AbstractContextImpl.waitWhileCheckpointIsInProgress(java.base@17-internal/AbstractContextImpl.java:102)
>     at jdk.crac.impl.PriorityContext.register(java.base@17-internal/PriorityContext.java:44)
>     - locked <0x0000000418002088> (a jdk.internal.crac.JDKContext)
>     at jdk.internal.crac.JDKContext.register(java.base@17-internal/JDKContext.java:97)
>     at jdk.internal.ref.CleanerImpl$PhantomCleanableRef.<init>(java.base@17-internal/CleanerImpl.java:170)
>     at java.lang.ref.Cleaner.register(java.base@17-internal/Cleaner.java:220)
>     at java.lang.invoke.MethodHandleNatives$CallSiteContext.make(java.base@17-internal/MethodHandleNatives.java:90)
>     at java.lang.invoke.CallSite.<init>(java.base@17-internal/CallSite.java:144)
>     at java.lang.invoke.ConstantCallSite.<init>(java.base@17-internal/ConstantCallSite.java:50)
>     at java.lang.invoke.InnerClassLambdaMetafactory.buildCallSite(java.base@17-internal/InnerClassLambdaMetafactory.java:270)
>     at java.lang.invoke.LambdaMetafactory.metafactory(java.base@17-internal/LambdaMetafactory.java:341)
>     at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base@17-internal/DirectMethodHandle$Holder)
>     at java.lang.invoke.Invokers$Holder.invokeExact_MT(java.base@17-internal/Invokers$Holder)
>     at java.lang.invoke.BootstrapMethodInvoker.invoke(java.base@17-internal/BootstrapMethodInvoker.java:134)
>     at java.lang.invoke.CallSite.makeSite(java.base@17-internal/CallSite.java:315)
>     at java.lang.invoke.MethodHandleNatives.linkCallSiteImpl(java.base@17-internal/MethodHandleNatives.java:281)
>     at java.lang.invoke.MethodHandleNatives.linkCallSite(java.base@17-internal/MethodHandleNatives.java:271)
>     at java.io.FileDescriptor$Resource.beforeCheckpoint(java.base@17-internal/FileDescriptor.java:74)
>     at jdk.crac.impl.PriorityContext$SubContext.invokeBeforeCheckpoint(java.base@17-internal/PriorityContext.java:107)
>     at jdk.crac.impl.OrderedContext.runBeforeCheckpoint(java.base@17-in...

@AntonKozlov Please discuss the general solution in here: https://github.com/openjdk/crac/pull/66 ------------- PR Comment: https://git.openjdk.org/crac/pull/60#issuecomment-1545295344 From duke at openjdk.org Fri May 12 07:30:08 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 12 May 2023 07:30:08 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v2] In-Reply-To: References: <-i0uB8ZW7r54hoKQJ_wODUXNKVkOI5rH7SJTEhSHiDw=.75ebe53a-9081-40c6-911f-048b17e8850e@github.com> Message-ID: <12Kg_ikrM_osO_MwuZGpj6sXIiFd7MLu8JCdAzWXulA=.4937fec8-49e6-4d9c-99e4-0fa89ec3deee@github.com> On Thu, 11 May 2023 14:39:21 GMT, Jan Kratochvil wrote: >>> Since you mention it was a bug in kernel or criu and it has been almost 3 years since your commit, may be it is worth enabling the criu changes again to see if the timedwait problem still exists, unless you have already done that. >> >> AFAIK the bug is fixed, but I see no point of relying on OS here. Is there one? Timens that is not changed by CRIU provides correct values for our nanoTime() [1]. >> >>> The value returned represents nanoseconds since some fixed but arbitrary origin time (perhaps in the future, so values may be negative). The same origin is used by all invocations of this method in an instance of a Java virtual machine >> >> [1] https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/System.html#nanoTime() > > Upstream criu does provide the time namespace as stated by @AntonKozlov: > > CLOCK_MONOTONIC=301.735134591 CLOCK_BOOTTIME=301.735155494 > CLOCK_MONOTONIC=302.735345917 CLOCK_BOOTTIME=302.735358109 > Warn (compel/arch/x86/src/lib/infect.c:356): Will restore 7757 with interrupted system call > [1]+ Killed ./clock_gettime > restore: > CLOCK_MONOTONIC=302.803360137 CLOCK_BOOTTIME=302.803373299 > restore: > CLOCK_MONOTONIC=302.806677876 CLOCK_BOOTTIME=302.806696098 > > I do not see why JVM should reimplement what CRIU already does. 
One can solve that in the future when CRaC is really going to run on non-Linux system. One will need to port or reimplement there CRIU in the first place. @jankratochvil After trying to run C/R with fewer privileges in containers I am starting to see the value of moving away from CRIU; it does more than JVM usually needs, but this requires elevated privileges. Leaving it all at once would be a tremendous endeavour, though, so slowly dropping functionality, even through adding some switches into CRaC fork to disable some parts makes sense to me. ------------- PR Comment: https://git.openjdk.org/crac/pull/53#issuecomment-1545300602 From duke at openjdk.org Fri May 12 07:43:13 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 12 May 2023 07:43:13 GMT Subject: [crac] RFR: List open FDs through reading /proc/self/fd Message-ID: <8XFKhgK6WTjIrbDLGm58oyeKWhTIQbYRW63Hyg6pk3Q=.9f07ed12-b98f-4af0-9600-ec765cfd80bf@github.com> Previously the code was iterating through all possible FD values, up to highest allowed FD number, and required allocation of possibly huge array. Reading /proc/self/fd into a compact array is both more memory efficient and does not require excessive syscalls. 
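For readers following along, here is a minimal Java sketch of the idea described above: enumerate only the entries of `/proc/self/fd` instead of probing every possible descriptor number. It is an illustration only and not the code from this pull request; the class `OpenFdsSketch` and the methods `parseFdDir` and `listOpenFds` are hypothetical names, and the directory is passed in as a parameter purely so that the logic can be exercised on a non-Linux machine.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical illustration; not the implementation from the PR.
class OpenFdsSketch {

    /** Returns the sorted descriptor numbers named by the entries of fdDir. */
    static int[] parseFdDir(Path fdDir) throws IOException {
        int[] fds = new int[16];   // compact array, grown only when needed
        int count = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(fdDir)) {
            for (Path entry : entries) {
                int fd;
                try {
                    fd = Integer.parseInt(entry.getFileName().toString());
                } catch (NumberFormatException e) {
                    continue;      // ignore anything that is not a descriptor number
                }
                if (count == fds.length) {
                    fds = Arrays.copyOf(fds, count * 2);
                }
                fds[count++] = fd;
            }
        }
        int[] result = Arrays.copyOf(fds, count);
        Arrays.sort(result);
        return result;
    }

    /** Linux only: lists the descriptors currently open in this process. */
    static int[] listOpenFds() throws IOException {
        return parseFdDir(Path.of("/proc/self/fd"));
    }
}
```

One caveat with this approach: the listing may include the descriptor that was opened to read the directory itself, so a real implementation probably has to filter that one out.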
------------- Commit messages: - List open FDs through reading /proc/self/fd Changes: https://git.openjdk.org/crac/pull/67/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=67&range=00 Stats: 97 lines in 1 file changed: 21 ins; 18 del; 58 mod Patch: https://git.openjdk.org/crac/pull/67.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/67/head:pull/67 PR: https://git.openjdk.org/crac/pull/67 From duke at openjdk.org Fri May 12 08:46:14 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 12 May 2023 08:46:14 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v7] In-Reply-To: References: Message-ID: On Fri, 12 May 2023 05:49:28 GMT, Radim Vansa wrote: >> There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`. When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value. > > Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 18 commits: > > - Merge branch 'crac' into nanotime > - Fix whitespaces > - Use image under ghcr.io/crac > - Ensure monotonicity for the same boot > - Set nanotime only if bootid changes > - Reset nanotime offset before calculating it again > - Correct time since restore > - Merge remote-tracking branch 'origin/crac' into nanotime > - Adjust System.nanoTime() to keep consistent time origin after restore > - Merge remote-tracking branch 'origin/crac' into test-crac-java > - ... and 8 more: https://git.openjdk.org/crac/compare/ef2437e7...87d19a12 Actually thanks to rebooting I was able to run into the issue CI encountered - using negative monotonic offset could end up setting the time to negative value. 
I've fixed this by shifting the monotonic offset for the checkpoint rather than for restore in this case. ------------- PR Comment: https://git.openjdk.org/crac/pull/53#issuecomment-1545391284 From duke at openjdk.org Fri May 12 08:46:14 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 12 May 2023 08:46:14 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v8] In-Reply-To: References: Message-ID: > There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`. When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value. Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 20 commits: - Merge branch 'crac' into nanotime - Do not use negative monotonic offset - Merge branch 'crac' into nanotime - Fix whitespaces - Use image under ghcr.io/crac - Ensure monotonicity for the same boot - Set nanotime only if bootid changes - Reset nanotime offset before calculating it again - Correct time since restore - Merge remote-tracking branch 'origin/crac' into nanotime - ... 
and 10 more: https://git.openjdk.org/crac/compare/4b0dc2dc...4c850312 ------------- Changes: https://git.openjdk.org/crac/pull/53/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=53&range=07 Stats: 301 lines in 7 files changed: 274 ins; 0 del; 27 mod Patch: https://git.openjdk.org/crac/pull/53.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/53/head:pull/53 PR: https://git.openjdk.org/crac/pull/53 From duke at openjdk.org Fri May 12 10:59:07 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 12 May 2023 10:59:07 GMT Subject: [crac] RFR: Register CallSite cleaners with higher priority Message-ID: <1xzPSqM6zHVVoheq8x_qH2mJCAXMn03sl2Gnkuoj9_s=.73899947-0d48-46b4-beb8-30260433f4a2@github.com> Cleaners for CallSites cannot be registered with regular CLEANERS priority as this would hang/throw exceptions any time a lambda or method reference is used during or after processing this resource priority class; therefore we postpone as the last priority. ------------- Commit messages: - Register CallSite cleaners with higher priority Changes: https://git.openjdk.org/crac/pull/66/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=66&range=00 Stats: 21 lines in 4 files changed: 17 ins; 0 del; 4 mod Patch: https://git.openjdk.org/crac/pull/66.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/66/head:pull/66 PR: https://git.openjdk.org/crac/pull/66 From akozlov at openjdk.org Fri May 12 10:59:07 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 12 May 2023 10:59:07 GMT Subject: [crac] RFR: Register CallSite cleaners with higher priority In-Reply-To: <1xzPSqM6zHVVoheq8x_qH2mJCAXMn03sl2Gnkuoj9_s=.73899947-0d48-46b4-beb8-30260433f4a2@github.com> References: <1xzPSqM6zHVVoheq8x_qH2mJCAXMn03sl2Gnkuoj9_s=.73899947-0d48-46b4-beb8-30260433f4a2@github.com> Message-ID: On Fri, 12 May 2023 07:21:36 GMT, Radim Vansa wrote: > Cleaners for CallSites cannot be registered with regular CLEANERS priority as this would hang/throw exceptions any time a 
lambda or method reference is used during or after processing this resource priority class; therefore we postpone as the last priority. LGTM as a workaround, but indeed we need to generalize the problem and think about possible solutions. src/java.base/share/classes/java/lang/invoke/MethodHandleNatives.java line 94: > 92: // CleanerFactory class) until cleanup is performed. > 93: new CleanerImpl.PhantomCleanableRef(cs, CleanerFactory.cleaner(), > 94: newContext, JDKResource.Priority.CALL_SITES); This is based on impl details. Is it possible to use e.g. a `CleanerFactory.criticalCleaner().register(...)` that will use different priorities? ------------- Marked as reviewed by akozlov (Lead). PR Review: https://git.openjdk.org/crac/pull/66#pullrequestreview-1424208186 PR Review Comment: https://git.openjdk.org/crac/pull/66#discussion_r1192189130 From duke at openjdk.org Fri May 12 10:59:07 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 12 May 2023 10:59:07 GMT Subject: [crac] RFR: Register CallSite cleaners with higher priority In-Reply-To: References: <1xzPSqM6zHVVoheq8x_qH2mJCAXMn03sl2Gnkuoj9_s=.73899947-0d48-46b4-beb8-30260433f4a2@github.com> Message-ID: <6ybYhEXxMj8U_8OH8luQSy_KqbUBk0LI7OuCB3WtuHU=.06fe882f-6dc4-43d2-8624-5fd576721f43@github.com> On Fri, 12 May 2023 10:20:41 GMT, Anton Kozlov wrote: >> Cleaners for CallSites cannot be registered with regular CLEANERS priority as this would hang/throw exceptions any time a lambda or method reference is used during or after processing this resource priority class; therefore we postpone as the last priority. > > src/java.base/share/classes/java/lang/invoke/MethodHandleNatives.java line 94: > >> 92: // CleanerFactory class) until cleanup is performed. >> 93: new CleanerImpl.PhantomCleanableRef(cs, CleanerFactory.cleaner(), >> 94: newContext, JDKResource.Priority.CALL_SITES); > > This is based on impl details. Is it possible to use e.g. 
a `CleanerFactory.criticalCleaner().register(...)` that will use different priorities? 1) `CleanerFactory` could get a new method since it is in `jdk.internal.ref`, but such alternative implementation would spin up another thread. 2) `Cleaner` is a public interface, so I cannot add an alternative method. Regrettably it's also a final class, so I can't make CleanerFactory wrap the common cleaner and override just the `register()` method. I find that the info that we need to do something different already leaks to MethodHandleNatives and since your suggested solution has impact on memory usage, I would rather choose breaking the encapsulation. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/66#discussion_r1192216893 From duke at openjdk.org Fri May 12 11:02:34 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 12 May 2023 11:02:34 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies Message-ID: When the application does not close some file descriptors through Resources we can use `jdk.crac.fd-policy.checkpoint` and `jdk.crac.fd-policy.restore` to configure the behaviour. 
These properties can specify a list of File.pathSeparator-separated key=value pairs, where the key can be one of: * numeric file descriptor * path using 'glob' pattern matching (see FileSystem.getPathMatcher() for details) * keywords FIFO and SOCKET that match pipes and sockets The value should match one of possible values from OpenFDPolicies.BeforeCheckpoint and OpenFDPolicies.AfterRestore ------------- Commit messages: - Handle open file descriptors with configurable policies Changes: https://git.openjdk.org/crac/pull/69/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=69&range=00 Stats: 1072 lines in 15 files changed: 1035 ins; 15 del; 22 mod Patch: https://git.openjdk.org/crac/pull/69.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/69/head:pull/69 PR: https://git.openjdk.org/crac/pull/69 From duke at openjdk.org Fri May 12 11:02:34 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 12 May 2023 11:02:34 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies In-Reply-To: References: Message-ID: On Fri, 12 May 2023 10:54:50 GMT, Radim Vansa wrote: > When the application does not close some file descriptors through Resources we can use `jdk.crac.fd-policy.checkpoint` and `jdk.crac.fd-policy.restore` to configure the behaviour. > > These properties can specify a list of File.pathSeparator-separated key=value pairs, where the key can be one of: > > * numeric file descriptor > * path using 'glob' pattern matching (see FileSystem.getPathMatcher() for details) > * keywords FIFO and SOCKET that match pipes and sockets > > The value should match one of possible values from OpenFDPolicies.BeforeCheckpoint and OpenFDPolicies.AfterRestore Some workarounds in this PR could be reverted if #66 gets integrated. I've also moved OpenFDPolicies from `jdk.crac` to `jdk.crac.impl` at the last moment. 
While the static methods should stay in impl, I wonder if we should publish the property names and enum values in `jdk.crac` and make such 'configuration' a part of the API. In the current state I don't know where should the documentation for these policies land. ------------- PR Comment: https://git.openjdk.org/crac/pull/69#issuecomment-1545561787 From akozlov at openjdk.org Fri May 12 11:15:52 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 12 May 2023 11:15:52 GMT Subject: git: openjdk/crac: crac: Register CallSite cleaners with higher priority Message-ID: <10a3198b-5e43-4309-b260-ab94676e8710@openjdk.org> Changeset: 4d9c616f Author: Radim Vansa Committer: Anton Kozlov Date: 2023-05-12 11:15:18 +0000 URL: https://git.openjdk.org/crac/commit/4d9c616f2d2ca6770e20f3ee90de640c0f40a7c4 Register CallSite cleaners with higher priority Reviewed-by: akozlov ! src/java.base/share/classes/java/lang/invoke/MethodHandleNatives.java ! src/java.base/share/classes/java/lang/ref/Cleaner.java ! src/java.base/share/classes/jdk/internal/crac/JDKResource.java ! src/java.base/share/classes/jdk/internal/ref/CleanerImpl.java From duke at openjdk.org Fri May 12 11:18:13 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 12 May 2023 11:18:13 GMT Subject: [crac] Integrated: Register CallSite cleaners with higher priority In-Reply-To: <1xzPSqM6zHVVoheq8x_qH2mJCAXMn03sl2Gnkuoj9_s=.73899947-0d48-46b4-beb8-30260433f4a2@github.com> References: <1xzPSqM6zHVVoheq8x_qH2mJCAXMn03sl2Gnkuoj9_s=.73899947-0d48-46b4-beb8-30260433f4a2@github.com> Message-ID: On Fri, 12 May 2023 07:21:36 GMT, Radim Vansa wrote: > Cleaners for CallSites cannot be registered with regular CLEANERS priority as this would hang/throw exceptions any time a lambda or method reference is used during or after processing this resource priority class; therefore we postpone as the last priority. This pull request has now been integrated. 
Changeset: 4d9c616f Author: Radim Vansa Committer: Anton Kozlov URL: https://git.openjdk.org/crac/commit/4d9c616f2d2ca6770e20f3ee90de640c0f40a7c4 Stats: 21 lines in 4 files changed: 17 ins; 0 del; 4 mod Register CallSite cleaners with higher priority Reviewed-by: akozlov ------------- PR: https://git.openjdk.org/crac/pull/66 From akozlov at openjdk.org Fri May 12 11:24:17 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 12 May 2023 11:24:17 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v13] In-Reply-To: References: Message-ID: On Wed, 10 May 2023 12:49:03 GMT, Radim Vansa wrote: >> * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws >> * allows to register a resource in a context that did not start beforeCheckpoint invocations yet >> * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately >> * instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Revert removing the logging configuration Stack trace after merging #66 => the jdk.crac code also cannot use lambdas:

"main" #1 prio=5 os_prio=0 cpu=89.78ms elapsed=16.55s tid=0x00007f0a68025160 nid=0x28513b in Object.wait() [0x00007f0a6d9fd000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(java.base@17-internal/Native Method)
    - waiting on <0x000000041a802088> (a jdk.internal.crac.JDKContext)
    at java.lang.Object.wait(java.base@17-internal/Object.java:338)
    at jdk.crac.impl.AbstractContextImpl.waitWhileCheckpointIsInProgress(java.base@17-internal/AbstractContextImpl.java:102)
    at jdk.crac.impl.PriorityContext.register(java.base@17-internal/PriorityContext.java:44)
    - locked <0x000000041a802088> (a jdk.internal.crac.JDKContext)
    at jdk.internal.crac.JDKContext.register(java.base@17-internal/JDKContext.java:97)
    at jdk.internal.ref.CleanerImpl$PhantomCleanableRef.<init>(java.base@17-internal/CleanerImpl.java:173)
    at java.lang.invoke.MethodHandleNatives$CallSiteContext.make(java.base@17-internal/MethodHandleNatives.java:93)
    at java.lang.invoke.CallSite.<init>(java.base@17-internal/CallSite.java:144)
    at java.lang.invoke.ConstantCallSite.<init>(java.base@17-internal/ConstantCallSite.java:50)
    at java.lang.invoke.InnerClassLambdaMetafactory.buildCallSite(java.base@17-internal/InnerClassLambdaMetafactory.java:262)
    at java.lang.invoke.LambdaMetafactory.metafactory(java.base@17-internal/LambdaMetafactory.java:341)
    at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base@17-internal/DirectMethodHandle$Holder)
    at java.lang.invoke.Invokers$Holder.invokeExact_MT(java.base@17-internal/Invokers$Holder)
    at java.lang.invoke.BootstrapMethodInvoker.invoke(java.base@17-internal/BootstrapMethodInvoker.java:134)
    at java.lang.invoke.CallSite.makeSite(java.base@17-internal/CallSite.java:315)
    at java.lang.invoke.MethodHandleNatives.linkCallSiteImpl(java.base@17-internal/MethodHandleNatives.java:285)
    at java.lang.invoke.MethodHandleNatives.linkCallSite(java.base@17-internal/MethodHandleNatives.java:275)
    at jdk.internal.crac.JDKContext.getClaimedFds(java.base@17-internal/JDKContext.java:102)
    at jdk.crac.Core.checkpointRestore1(java.base@17-internal/Core.java:125)
    at jdk.crac.Core.checkpointRestore(java.base@17-internal/Core.java:256)
    - locked <0x000000041a802118> (a java.lang.Object)
    at jdk.crac.Core.checkpointRestore(java.base@17-internal/Core.java:241)
    at CheckpointWithOpenFdsTest.exec(CheckpointWithOpenFdsTest.java:49)
    at jdk.test.lib.crac.CracTest.run(CracTest.java:157)
    at jdk.test.lib.crac.CracTest.main(CracTest.java:89)

------------- PR Comment:
https://git.openjdk.org/crac/pull/60#issuecomment-1545588281 From duke at openjdk.org Fri May 12 11:32:24 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 12 May 2023 11:32:24 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v13] In-Reply-To: References: Message-ID: <99MBuA8VL5N63UGKEIWFawnWQQ6lWjAo8cBVYz0VGag=.4f51f14c-1a3a-491a-8355-9112630642e0@github.com> On Wed, 10 May 2023 12:49:03 GMT, Radim Vansa wrote: >> * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws >> * allows to register a resource in a context that did not start beforeCheckpoint invocations yet >> * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately >> * instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Revert removing the logging configuration Did you change this method? In #43 this uses a loop rather than streams. I think that in the Core code the need to remove lambdas is unavoidable; the scope of what's happening after all notifications and before the `afterRestore` is minimal. This is similar to the early stages of the JDK boot; in some classes you would run into an NPE had you tried to use lambdas (e.g. when initializing `JDKContext.PRIORITY_COMPARATOR`).
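To make the "loop rather than streams" point concrete, below is a small hedged sketch of the kind of rewrite being discussed. Lambdas and method references compile to `invokedynamic` call sites that are linked on first use (the `LambdaMetafactory` frames in the stack traces in this thread), whereas an anonymous class and a plain `for` loop need no such linkage. Everything here is hypothetical illustration: `NoLambdaSketch`, `Claim`, `BY_PRIORITY` and `claimedFds` are made-up names, not the actual `JDKContext` code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of lambda-free code; not taken from JDKContext.
class NoLambdaSketch {

    /** Made-up stand-in for a resource that claims a file descriptor. */
    static final class Claim {
        final int fd;
        final int priority;

        Claim(int fd, int priority) {
            this.fd = fd;
            this.priority = priority;
        }
    }

    // Anonymous class instead of Comparator.comparingInt(c -> c.priority):
    // it is compiled to an ordinary class, so no call site is linked at run time.
    static final Comparator<Claim> BY_PRIORITY = new Comparator<Claim>() {
        @Override
        public int compare(Claim a, Claim b) {
            return Integer.compare(a.priority, b.priority);
        }
    };

    // Plain loop instead of claims.stream().sorted(...).mapToInt(c -> c.fd).toArray().
    static int[] claimedFds(List<Claim> claims) {
        List<Claim> sorted = new ArrayList<>(claims);
        sorted.sort(BY_PRIORITY);
        int[] fds = new int[sorted.size()];
        for (int i = 0; i < fds.length; i++) {
            fds[i] = sorted.get(i).fd;
        }
        return fds;
    }
}
```

The trade-off is purely stylistic verbosity; the behaviour is the same as the stream version, but nothing in it triggers call site creation during checkpoint.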
------------- PR Comment: https://git.openjdk.org/crac/pull/60#issuecomment-1545597739 From duke at openjdk.org Fri May 12 13:15:26 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 12 May 2023 13:15:26 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v2] In-Reply-To: References: Message-ID: > When the application does not close some file descriptors through Resources we can use `jdk.crac.fd-policy.checkpoint` and `jdk.crac.fd-policy.restore` to configure the behaviour. > > These properties can specify a list of File.pathSeparator-separated key=value pairs, where the key can be one of: > > * numeric file descriptor > * path using 'glob' pattern matching (see FileSystem.getPathMatcher() for details) > * keywords FIFO and SOCKET that match pipes and sockets > > The value should match one of possible values from OpenFDPolicies.BeforeCheckpoint and OpenFDPolicies.AfterRestore Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - Simplify workarounds in SimpleConsoleLogger. - Merge branch 'crac' into newfd-policies - Handle open file descriptors with configurable policies When the application does not close some file descriptors through Resources we can use `jdk.crac.fd-policy.checkpoint` and `jdk.crac.fd-policy.restore` to configure the behaviour. 
These properties can specify a list of File.pathSeparator-separated key=value pairs, where the key can be one of: * numeric file descriptor * path using 'glob' pattern matching (see FileSystem.getPathMatcher() for details) * keywords FIFO and SOCKET that match pipes and sockets The value should match one of possible values from OpenFDPolicies.BeforeCheckpoint and OpenFDPolicies.AfterRestore ------------- Changes: - all: https://git.openjdk.org/crac/pull/69/files - new: https://git.openjdk.org/crac/pull/69/files/be29c783..37cc0ac4 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=69&range=01 - incr: https://webrevs.openjdk.org/?repo=crac&pr=69&range=00-01 Stats: 30 lines in 6 files changed: 17 ins; 6 del; 7 mod Patch: https://git.openjdk.org/crac/pull/69.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/69/head:pull/69 PR: https://git.openjdk.org/crac/pull/69 From duke at openjdk.org Fri May 12 13:29:08 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 12 May 2023 13:29:08 GMT Subject: [crac] RFR: Handle open file descriptors with configurable policies [v3] In-Reply-To: References: Message-ID: > When the application does not close some file descriptors through Resources we can use `jdk.crac.fd-policy.checkpoint` and `jdk.crac.fd-policy.restore` to configure the behaviour. 
> > These properties can specify a list of File.pathSeparator-separated key=value pairs, where the key can be one of: > > * numeric file descriptor > * path using 'glob' pattern matching (see FileSystem.getPathMatcher() for details) > * keywords FIFO and SOCKET that match pipes and sockets > > The value should match one of possible values from OpenFDPolicies.BeforeCheckpoint and OpenFDPolicies.AfterRestore Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Effectively revert previous commit: Initialize logger in ------------- Changes: - all: https://git.openjdk.org/crac/pull/69/files - new: https://git.openjdk.org/crac/pull/69/files/37cc0ac4..d9d9e48c Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=69&range=02 - incr: https://webrevs.openjdk.org/?repo=crac&pr=69&range=01-02 Stats: 7 lines in 1 file changed: 4 ins; 0 del; 3 mod Patch: https://git.openjdk.org/crac/pull/69.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/69/head:pull/69 PR: https://git.openjdk.org/crac/pull/69 From duke at openjdk.org Fri May 12 13:57:16 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 12 May 2023 13:57:16 GMT Subject: [crac] RFR: Synchronize concurrent clean() in PhantomCleanableRef Message-ID: Fixes failures in RefQueueTest and JarFileFactoryCacheTest. 
------------- Commit messages: - Synchronize concurrent clean() in PhantomCleanableRef Changes: https://git.openjdk.org/crac/pull/70/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=70&range=00 Stats: 21 lines in 3 files changed: 16 ins; 4 del; 1 mod Patch: https://git.openjdk.org/crac/pull/70.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/70/head:pull/70 PR: https://git.openjdk.org/crac/pull/70 From akozlov at openjdk.org Fri May 12 14:34:15 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 12 May 2023 14:34:15 GMT Subject: [crac] RFR: Synchronize concurrent clean() in PhantomCleanableRef In-Reply-To: References: Message-ID: On Fri, 12 May 2023 13:50:34 GMT, Radim Vansa wrote: > Fixes failures in RefQueueTest and JarFileFactoryCacheTest. Another option is to synchronize with the thread performing cleanup, e.g. by blocking it, and run cleaning for all uncleaned refs, which are in the ad-hoc doubly linked list. Do these options have pros/cons compared to each other?
------------- PR Comment: https://git.openjdk.org/crac/pull/70#issuecomment-1545838863 From akozlov at openjdk.org Fri May 12 14:44:16 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 12 May 2023 14:44:16 GMT Subject: [crac] RFR: Fix ordering of invocation on Resources [v13] In-Reply-To: References: Message-ID: On Wed, 10 May 2023 12:49:03 GMT, Radim Vansa wrote: >> * keeps the original handling of exceptions: afterRestore is called even if beforeCheckpoint throws >> * allows to register a resource in a context that did not start beforeCheckpoint invocations yet >> * registering resource in previous/running context fails the checkpoint but does not trigger exception immediately >> * instead this will be one of the recorded exceptions and the resource has a chance to fire next time >> * allowed registration of resources can be invoked from other thread without deadlock; illegal registration can deadlock, though > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Revert removing the logging configuration Yes, I did. But in general, these limitations do not look very pleasant for programming. I think we are trying to find a perfect policy for context when there are too many sets of requirements: - We want some resources to be certainly called: if beforeCheckpoint does a critical clean-up (e.g. wiping security keys), not doing that is a bug. But these don't care about ordering with other resources. In this case, running the notification right on registration looks like an option (as you've proposed in some of the threads), or collecting such late registrations and running all of them, before - Some of the resources that do care about ordering do not necessarily care about being certainly called, like those which do an optimization clean-up such as clearing an extra cache, which does not affect correctness.
I'm leaning toward introducing a few JDKContext variants that will provide different guarantees for the resources registered in them. The global context would still accommodate them.

GlobalContext:
- PriorityContext
- ...
- Cleanup // everything that needs clean up, immediate runs of the notification
- ...User Resources...

The scheme actually is not something specific to JDK, user code may have something similar or different, depending on the needs.

-------------

PR Comment: https://git.openjdk.org/crac/pull/60#issuecomment-1545853147

From duke at openjdk.org Fri May 12 14:48:05 2023
From: duke at openjdk.org (Radim Vansa)
Date: Fri, 12 May 2023 14:48:05 GMT
Subject: [crac] RFR: Synchronize concurrent clean() in PhantomCleanableRef
In-Reply-To:
References:
Message-ID:

On Fri, 12 May 2023 13:50:34 GMT, Radim Vansa wrote:

> Fixes failures in RefQueueTest and JarFileFactoryCacheTest.

I think that you won't avoid waiting for the cleanup thread - there would have to be a point in time where we'd switch a global (or per-resource?) flag to enter the blocking behavior (cannot be on all the time, because the checkpoint may never happen), and then wait for the in-flight resources to complete. It would be possible to achieve the synchronization through switching the flag on in the cleaner thread, but that only means waiting for it as well.

All in all it seems much more complicated. Are you aware of any pros? Maybe self-deadlocking rather than deadlock in 2 threads (that's why I had to add the eager initialization for FileDispatcherImpl, C/R was already at CLEANERS when the cleaner thread tried to initialize it and register with NORMAL priority) - if that's any advantage.
------------- PR Comment: https://git.openjdk.org/crac/pull/70#issuecomment-1545859013 From duke at openjdk.org Fri May 12 14:59:22 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Fri, 12 May 2023 14:59:22 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v6] In-Reply-To: References: Message-ID: On Fri, 12 May 2023 06:01:45 GMT, Radim Vansa wrote: >> Besides discussion of this patch vs. native CRIU timens support I find easier (YMMV) and less error-prone to just use simple `LD_PRELOAD`+`RTLD_NEXT` to simulate the system reboots: https://stackoverflow.com/a/6083624/2995591 > > I agree with your line of thinking, and since this test actually uses custom docker image I can add those libraries to the base image. But in general it's a problem of criu packaging with implicit dependencies, officially installed binary should get this handled in dependency management. Since those few functions from libbsd are reimplemented in criu, we could just disable the dependency in our fork. > However we'll run into the same problem with libnftables - I found that with the nftables support the unprivileged restore as root in Docker container does not work anymore. And that got compiled in only because of the -dev library present, there's no compile time switch. > > The error you describe can be solved with `git clean -f -x`, regrettably `make clean` does not do its job very well. > > Not sure what function are you suggesting to replace? I mean to provide normal LD_PRELOAD library and then one no longer needs to deal with any containers with their associated problems discussed above. 
[libvirtual.c.txt](https://github.com/openjdk/crac/files/11465377/libvirtual.c.txt) [clock_gettime.cpp.txt](https://github.com/openjdk/crac/files/11465400/clock_gettime.cpp.txt) rm -rf dir;mkdir dir;LD_PRELOAD=$PWD/libvirtual.so CLOCK_MONOTONIC=10 BOOT_ID=$PWD/fake_boot_id ./clock_gettime &p=$!;sleep 2;sudo ./build/linux-x86_64-server-fastdebug/jdk/lib/criu dump --shell-job -t $p -D dir CLOCK_MONOTONIC=88547. 87286504 CLOCK_BOOTTIME=88547. 87304536 CLOCK_MONOTONIC=88548. 87433771 CLOCK_BOOTTIME=88548. 87445951 CLOCK_MONOTONIC=88549. 87564393 CLOCK_BOOTTIME=88549. 87573859 Warn (criu/pages-compress.c:216): decompress: Can't open 'pages-1.comp.img' for dfd=3, errno=2 [--- that is some bug] CLOCK_MONOTONIC=10. 82447 CLOCK_BOOTTIME=88556.959892251 CLOCK_MONOTONIC=11. 179010 CLOCK_BOOTTIME=88557.959969582 CLOCK_MONOTONIC=12. 264924 CLOCK_BOOTTIME=88558.960056407 With original JVM it correctly crashes: # Internal Error (../../src/hotspot/share/runtime/thread.cpp:2470), pid=74780, tid=74781 # assert(false) failed: unexpected time moving backwards detected in JavaThread::sleep() With your patch it runs although somehow slowly. It may need some adjustments, I wrote the library as a proof of concept to prevent the containers. One cannot use the library to test native CRIU timens as that happens at kernel-level. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1192482912 From duke at openjdk.org Sat May 13 13:25:22 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Sat, 13 May 2023 13:25:22 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v15] In-Reply-To: References: Message-ID: <8YL9mDsEGdgoB2GrjqWd0Zvk-0mI-j19qcnX4KIbQ80=.89acb87e-a72e-4dff-8dca-d3d128c5cedb@github.com> > Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. > > 1. 
An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. > 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC > > (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: > >> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine > > It is intended to specify the lowest common denominator of all CPUs in a farm. Instead of a possible crash of OpenJDK it will now refuse to run: > >> Error occurred during initialization of VM >> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi > > (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. 
> > If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there... Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: IF_ASSERT -> DEBUG_ONLY - suggested by Anton Kozlov. ------------- Changes: - all: https://git.openjdk.org/crac/pull/41/files - new: https://git.openjdk.org/crac/pull/41/files/b22bb537..c79621da Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=14 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=13-14 Stats: 6 lines in 1 file changed: 0 ins; 3 del; 3 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41 From duke at openjdk.org Sat May 13 13:36:16 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Sat, 13 May 2023 13:36:16 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v16] In-Reply-To: References: Message-ID: > Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. > > 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. > 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC > > (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. 
Therefore there had to be implemented a new OpenJDK option: > >> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine > > It is intended to specify the lowest common denominator of all CPUs in a farm. Instead of a possible crash of OpenJDK it will now refuse to run: > >> Error occurred during initialization of VM >> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi > > (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. > > If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there... Jan Kratochvil has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 84 commits: - Merge branch 'crac-altstack' into crac-altstack-cpu-cpuexplicit-strip - Merge branch 'crac' into crac-altstack - IF_ASSERT -> DEBUG_ONLY - suggested by Anton Kozlov. - -altstack - Merge branch 'crac-altstack' into crac-altstack-tunables - Merge branch 'crac' into crac-altstack - Merge branch 'crac' into crac-altstack - 2b0f56b7: - ec18a208: - Fix the glibc SSE2 exception. - ... and 74 more: https://git.openjdk.org/crac/compare/4d9c616f...04a11ef3 ------------- Changes: https://git.openjdk.org/crac/pull/41/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=41&range=15 Stats: 717 lines in 19 files changed: 685 ins; 12 del; 20 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41 From duke at openjdk.org Sun May 14 04:06:23 2023 From: duke at openjdk.org (joeylee) Date: Sun, 14 May 2023 04:06:23 GMT Subject: [crac] RFR: Linux file system watcher support Message-ID: `inotify` monitors changes on filesystem, this support automatic restore for LinuxFileWatcher. FileWatcherAfterRestoreTest verifies watcher service works after restore. 
FileWatcherTest verifies automatic closing of the inotify fd

The watcher keys are still managed by user, so exception will be thrown if no watcher keys are leaked, as in FileWatcherWithOpenKeysTest

-------------

Commit messages:
 - Linux file system watcher support

Changes: https://git.openjdk.org/crac/pull/71/files
Webrev: https://webrevs.openjdk.org/?repo=crac&pr=71&range=00
Stats: 345 lines in 4 files changed: 314 ins; 25 del; 6 mod
Patch: https://git.openjdk.org/crac/pull/71.diff
Fetch: git fetch https://git.openjdk.org/crac.git pull/71/head:pull/71

PR: https://git.openjdk.org/crac/pull/71

From duke at openjdk.org Sun May 14 09:56:07 2023
From: duke at openjdk.org (joeylee)
Date: Sun, 14 May 2023 09:56:07 GMT
Subject: [crac] Withdrawn: Linux file system watcher support
In-Reply-To:
References:
Message-ID:

On Sun, 14 May 2023 03:49:38 GMT, joeylee wrote:

> `inotify` monitors changes on the filesystem; this adds support for automatic restore of LinuxFileWatcher.
>
> FileWatcherAfterRestoreTest verifies watcher service works after restore.
> FileWatcherTest verifies automatic closing of the inotify fd
>
> The watcher keys are still managed by user, so exception will be thrown if no watcher keys are leaked, as in FileWatcherWithOpenKeysTest

This pull request has been closed without being integrated.

-------------

PR: https://git.openjdk.org/crac/pull/71

From duke at openjdk.org Sun May 14 10:07:28 2023
From: duke at openjdk.org (joeylee)
Date: Sun, 14 May 2023 10:07:28 GMT
Subject: [crac] RFR: Linux file system watcher support
Message-ID:

inotify monitors changes on the filesystem; this adds support for automatic restore of LinuxFileWatcher.

FileWatcherAfterRestoreTest verifies watcher service works after restore.
FileWatcherTest verifies automatic closing inotiify fd The watcher keys are still managed by user, so exception will be thrown if no watcher keys are leaked, as in FileWatcherWithOpenKeysTest ------------- Commit messages: - Linux file system watcher support Changes: https://git.openjdk.org/crac/pull/72/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=72&range=00 Stats: 340 lines in 4 files changed: 309 ins; 25 del; 6 mod Patch: https://git.openjdk.org/crac/pull/72.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/72/head:pull/72 PR: https://git.openjdk.org/crac/pull/72 From duke at openjdk.org Mon May 15 06:02:15 2023 From: duke at openjdk.org (Radim Vansa) Date: Mon, 15 May 2023 06:02:15 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v6] In-Reply-To: References: Message-ID: On Fri, 12 May 2023 14:56:11 GMT, Jan Kratochvil wrote: >> I agree with your line of thinking, and since this test actually uses custom docker image I can add those libraries to the base image. But in general it's a problem of criu packaging with implicit dependencies, officially installed binary should get this handled in dependency management. Since those few functions from libbsd are reimplemented in criu, we could just disable the dependency in our fork. >> However we'll run into the same problem with libnftables - I found that with the nftables support the unprivileged restore as root in Docker container does not work anymore. And that got compiled in only because of the -dev library present, there's no compile time switch. >> >> The error you describe can be solved with `git clean -f -x`, regrettably `make clean` does not do its job very well. >> >> Not sure what function are you suggesting to replace? > > I mean to provide normal LD_PRELOAD library and then one no longer needs to deal with any containers with their associated problems discussed above. 
> [libvirtual.c.txt](https://github.com/openjdk/crac/files/11465377/libvirtual.c.txt) > [clock_gettime.cpp.txt](https://github.com/openjdk/crac/files/11465400/clock_gettime.cpp.txt) > > rm -rf dir;mkdir dir;LD_PRELOAD=$PWD/libvirtual.so CLOCK_MONOTONIC=10 BOOT_ID=$PWD/fake_boot_id ./clock_gettime &p=$!;sleep 2;sudo ./build/linux-x86_64-server-fastdebug/jdk/lib/criu dump --shell-job -t $p -D dir > CLOCK_MONOTONIC=88547. 87286504 CLOCK_BOOTTIME=88547. 87304536 > CLOCK_MONOTONIC=88548. 87433771 CLOCK_BOOTTIME=88548. 87445951 > CLOCK_MONOTONIC=88549. 87564393 CLOCK_BOOTTIME=88549. 87573859 > Warn (criu/pages-compress.c:216): decompress: Can't open 'pages-1.comp.img' for dfd=3, errno=2 [--- that is some bug] > CLOCK_MONOTONIC=10. 82447 CLOCK_BOOTTIME=88556.959892251 > CLOCK_MONOTONIC=11. 179010 CLOCK_BOOTTIME=88557.959969582 > CLOCK_MONOTONIC=12. 264924 CLOCK_BOOTTIME=88558.960056407 > > With original JVM it correctly crashes: > > # Internal Error (../../src/hotspot/share/runtime/thread.cpp:2470), pid=74780, tid=74781 > # assert(false) failed: unexpected time moving backwards detected in JavaThread::sleep() > > With your patch it runs although somehow slowly. > It may need some adjustments, I wrote the library as a proof of concept to prevent the containers. > One cannot use the library to test native CRIU timens as that happens at kernel-level. How would you compile that library as part of of the jtreg? Just do `Runtime.getRuntime().exec("gcc", ...)` during the test? Can you create some native libraries as a part of the test library build? 
------------- PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1193356916 From duke at openjdk.org Mon May 15 07:10:16 2023 From: duke at openjdk.org (Radim Vansa) Date: Mon, 15 May 2023 07:10:16 GMT Subject: [crac] RFR: Linux file system watcher support In-Reply-To: References: Message-ID: <140Zj7oo7kRyFYqSARRLaxNAucyQa2RCy7rcRrEuYGo=.22f2507b-09ef-4512-a8ac-96e5c4df6e65@github.com> On Sun, 14 May 2023 10:00:26 GMT, joeylee wrote: > inotify monitors changes on filesystem, this support automatic restore for LinuxFileWatcher. > > FileWatcherAfterRestoreTest verifies watcher service works after restore. > FileWatcherTest verifies automatic closing inotiify fd > > The watcher keys are still managed by user, so exception will be thrown if no watcher keys are leaked, as in FileWatcherWithOpenKeysTest Thanks for the contribution! I wonder if you have an example of code where it's useful to automatically suspend the service but leave `WatchKey` management up to the user. To me this would be either fully transparent or fully managed, breaking the cycle that polls the events and closing the service as well. src/java.base/linux/classes/sun/nio/fs/LinuxWatchService.java line 176: > 174: private final long address; > 175: private volatile CheckpointRestoreState checkpointState = CheckpointRestoreState.NORMAL_OPERATION; > 176: private final Object checkpointLock = new Object(); Do we need another object just for locking, or would synchronizing on the Poller instance be sufficient? src/java.base/linux/classes/sun/nio/fs/LinuxWatchService.java line 185: > 183: this.address = unsafe.allocateMemory(BUFFER_SIZE); > 184: this.socketpair = new int[2]; > 185: initFDs(); When `initFD` throws an exception the native memory would stay allocated (you've flipped the order of allocation and FD initialization). 
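The leak described in the last comment can be illustrated with a small Java sketch. Everything here is hypothetical: `Allocator` and `FdInit` stand in for `Unsafe.allocateMemory`/`freeMemory` and the PR's FD initialization, and `PollerInitSketch` is not the actual `Poller` class. It shows one way to keep the "allocate first" order while still releasing the memory when the second step throws.

```java
import java.io.IOException;

final class PollerInitSketch {
    // Hypothetical stand-ins for Unsafe.allocateMemory/freeMemory.
    interface Allocator {
        long allocate(long size);
        void free(long address);
    }

    // Hypothetical stand-in for the step that opens the inotify and socketpair FDs.
    interface FdInit {
        void init() throws IOException;
    }

    final long address;

    PollerInitSketch(Allocator alloc, FdInit fds, long size) throws IOException {
        long addr = alloc.allocate(size);
        try {
            fds.init();
        } catch (Throwable t) {
            // The constructor fails, so nobody else can ever free this buffer.
            alloc.free(addr);
            throw t;
        }
        this.address = addr;
    }
}
```

The simpler alternative is to restore the original order, initializing the FDs before allocating, so that a failure happens before there is anything to free.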
src/java.base/linux/classes/sun/nio/fs/LinuxWatchService.java line 368: > 366: try { > 367: do { > 368: nReady = poll(ifd, socketpair[0]); When the `poll` is followed by a C/R we call it second time, ignoring the old value. Will this forget any events? Obviously all the events happening when the application is in snapshot will be lost, but I wonder whether we should queue and replay anything that's already recorded. In the future (not necessarily in this PR) it would be nice to detect if anything happened when the application was in snapshot and generate events to keep its view up to date. src/java.base/linux/classes/sun/nio/fs/LinuxWatchService.java line 501: > 499: synchronized (checkpointLock) { > 500: checkpointState = CheckpointRestoreState.CHECKPOINT_TRANSITION; > 501: write(socketpair[1], address, 1); Any reason to do this directly rather than calling `wakeup()`? It seems that there's some handling for translating errno into messages... src/java.base/linux/classes/sun/nio/fs/LinuxWatchService.java line 509: > 507: } > 508: if (checkpointState == CheckpointRestoreState.CHECKPOINT_ERROR) { > 509: throw new IllegalSelectorException(); This is not the right exception type. src/java.base/linux/classes/sun/nio/fs/LinuxWatchService.java line 519: > 517: return; > 518: } > 519: synchronized (checkpointLock) { I wonder, if the `initFD` throws during restore, shouldn't we throw here as well? I know you were inspired in EPollSelector, but let's give it a thought. test/jdk/jdk/crac/fileDescriptors/FileWatcherAfterRestoreTest.java line 48: > 46: @Override > 47: public void exec() throws Exception { > 48: WatchService watchService = FileSystems.getDefault().newWatchService(); Could you call `close()` on this service at the end of the test to check that it works as it should after C/R? 
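As a rough idea of what the requested `close()` check could assert, here is a self-contained Java sketch using only the public `java.nio.file` API. The class and method names are hypothetical, and the checkpoint/restore step of the real test is omitted because it needs the CRaC test library.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.ClosedWatchServiceException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

final class WatchServiceCloseSketch {
    // Returns true when close() invalidated the key and the closed service
    // rejects further use, which is what the WatchService API documents.
    static boolean closeInvalidatesService() {
        try {
            Path dir = Files.createTempDirectory("watch-sketch");
            WatchService ws = FileSystems.getDefault().newWatchService();
            WatchKey key = dir.register(ws, StandardWatchEventKinds.ENTRY_CREATE);
            // A CRaC test would checkpoint and restore at this point (omitted).
            ws.close();
            boolean rejected;
            try {
                ws.poll();
                rejected = false;
            } catch (ClosedWatchServiceException expected) {
                rejected = true;
            }
            Files.delete(dir);
            return rejected && !key.isValid();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the actual test the interesting part is that these assertions still hold after the service has gone through a checkpoint and a restore.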
test/jdk/jdk/crac/fileDescriptors/FileWatcherAfterRestoreTest.java line 54: > 52: Path directory = Paths.get(System.getProperty("user.dir")); > 53: WatchKey key = directory.register(watchService, StandardWatchEventKinds.ENTRY_CREATE); > 54: Thread.sleep(200); Why the sleep? test/jdk/jdk/crac/fileDescriptors/FileWatcherTest.java line 36: > 34: * @run driver jdk.test.lib.crac.CracTest > 35: */ > 36: public class FileWatcherTest implements CracTest { Does this test anything more than what `FileWatcherAfterRestoreTest` does anyway? ------------- PR Review: https://git.openjdk.org/crac/pull/72#pullrequestreview-1425849060 PR Review Comment: https://git.openjdk.org/crac/pull/72#discussion_r1193381783 PR Review Comment: https://git.openjdk.org/crac/pull/72#discussion_r1193383268 PR Review Comment: https://git.openjdk.org/crac/pull/72#discussion_r1193392586 PR Review Comment: https://git.openjdk.org/crac/pull/72#discussion_r1193394358 PR Review Comment: https://git.openjdk.org/crac/pull/72#discussion_r1193394899 PR Review Comment: https://git.openjdk.org/crac/pull/72#discussion_r1193400167 PR Review Comment: https://git.openjdk.org/crac/pull/72#discussion_r1193405625 PR Review Comment: https://git.openjdk.org/crac/pull/72#discussion_r1193401546 PR Review Comment: https://git.openjdk.org/crac/pull/72#discussion_r1193400802 From duke at openjdk.org Mon May 15 13:37:19 2023 From: duke at openjdk.org (joeylee) Date: Mon, 15 May 2023 13:37:19 GMT Subject: [crac] RFR: Linux file system watcher support In-Reply-To: <140Zj7oo7kRyFYqSARRLaxNAucyQa2RCy7rcRrEuYGo=.22f2507b-09ef-4512-a8ac-96e5c4df6e65@github.com> References: <140Zj7oo7kRyFYqSARRLaxNAucyQa2RCy7rcRrEuYGo=.22f2507b-09ef-4512-a8ac-96e5c4df6e65@github.com> Message-ID: <3D8DGTzPv-Fg_c1N81jZkTa8DHS8P3LLyRgnA_wBYso=.c051ea20-911d-4300-85a1-2c8a69ddbbdd@github.com> On Mon, 15 May 2023 07:07:53 GMT, Radim Vansa wrote: >> inotify monitors changes on filesystem, this support automatic restore for LinuxFileWatcher. 
>>
>> FileWatcherAfterRestoreTest verifies watcher service works after restore.
>> FileWatcherTest verifies automatic closing of the inotify fd
>>
>> The watcher keys are still managed by user, so exception will be thrown if no watcher keys are leaked, as in FileWatcherWithOpenKeysTest
>
> Thanks for the contribution!
>
> I wonder if you have an example of code where it's useful to automatically suspend the service but leave `WatchKey` management up to the user. To me this would be either fully transparent or fully managed, breaking the cycle that polls the events and closing the service as well.

@rvansa Thanks for your comments, I am working on my patch based on these recommendations.

I noticed an occasional crash in FileWatcherAfterRestoreTest when restoring the process from the dump, as in this [github action check](https://github.com/joeyleeeeeee97/crac/actions/runs/4971753576/jobs/8896779795). It crashes where the main thread is exiting; this is the [crash log](https://gist.github.com/joeyleeeeeee97/42711c35e02b0d2c530bad22aa2c5dd0). I could reproduce it locally with `while ~/jtreg/jtreg/bin/jtreg -v ./test/jdk/jdk/crac/fileDescriptors/FileWatcherAfterRestoreTest.java; do :; done` and `/home/ubuntu/jdk/build/linux-x86_64-server-release/images/jdk/bin/java -XX:CRaCRestoreFrom=cr`.

Could I get some input on where my updates might have caused the crash? Or could it be a separate issue?

-------------

PR Comment: https://git.openjdk.org/crac/pull/72#issuecomment-1547870645

From duke at openjdk.org Mon May 15 13:44:35 2023
From: duke at openjdk.org (Radim Vansa)
Date: Mon, 15 May 2023 13:44:35 GMT
Subject: [crac] RFR: Linux file system watcher support
In-Reply-To:
References:
Message-ID:

On Sun, 14 May 2023 10:00:26 GMT, joeylee wrote:

> inotify monitors changes on the filesystem; this adds support for automatic restore of LinuxFileWatcher.
>
> FileWatcherAfterRestoreTest verifies watcher service works after restore.
> FileWatcherTest verifies automatic closing of the inotify fd
>
> The watcher keys are still managed by user, so exception will be thrown if no watcher keys are leaked, as in FileWatcherWithOpenKeysTest

Can't tell off the top of my head - the crash.log has the VM error log printed just after calling `ls` rather than `java -ea -XX:CRaCCheckpointTo=cr jdk.test.lib.crac.CracTest __run_test__ FileWatcherAfterRestoreTest` - have you edited it?

Could you try to reproduce with a `CONF=fastdebug` build?

-------------

PR Comment: https://git.openjdk.org/crac/pull/72#issuecomment-1547882244

From akozlov at openjdk.org Mon May 15 14:10:17 2023
From: akozlov at openjdk.org (Anton Kozlov)
Date: Mon, 15 May 2023 14:10:17 GMT
Subject: [crac] RFR: Synchronize concurrent clean() in PhantomCleanableRef
In-Reply-To:
References:
Message-ID:

On Fri, 12 May 2023 14:45:15 GMT, Radim Vansa wrote:

> I think that you won't avoid waiting for the cleanup thread - there would have to be a point in time where we'd switch a global (or per-resource?) flag to enter the blocking behavior (cannot be on all the time, because the checkpoint may never happen), and then wait for the in-flight resources to complete. It would be possible to achieve the synchronization through switching the flag on in the cleaner thread, but that only means waiting for it as well.

Agree, synchronization is required. What about adding a special synchronizing Ref to the cleaner queue, so that the checkpoint can only proceed once that ref's "cleanup" has started, and which blocks the cleaner until restore completes?

> All in all it seems much more complicated. Are you aware of any pros?

Asking because synchronizing a single cleaner looks more efficient compared to synchronizing each PhantomCleanableRef. The performCleanup() is proposed to have an additional synchronized section, which is required only for CRaC.
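The "synchronizing Ref" idea can be sketched in Java as a barrier task queued behind the pending cleanup work. This is only an illustration under stated assumptions: a single-thread `ExecutorService` plays the role of the cleaner thread, `CleanerBarrierSketch` and its methods are hypothetical names, and nothing here is taken from the JDK's `Cleaner` implementation.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class CleanerBarrierSketch {
    private final CountDownLatch reached = new CountDownLatch(1);
    private final CountDownLatch released = new CountDownLatch(1);

    // Queues the barrier behind already-queued cleanup work and returns once
    // the cleaner thread is parked inside it.
    void park(ExecutorService cleanerThread) throws InterruptedException {
        cleanerThread.execute(() -> {
            reached.countDown();
            try {
                released.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reached.await();
    }

    // Called after restore: lets the cleaner thread continue.
    void release() {
        released.countDown();
    }

    // Small demonstration; the returned log shows the order of events.
    static List<String> demo() {
        List<String> log = new CopyOnWriteArrayList<>();
        ExecutorService cleaner = Executors.newSingleThreadExecutor();
        CleanerBarrierSketch barrier = new CleanerBarrierSketch();
        try {
            cleaner.execute(() -> log.add("clean-1"));
            barrier.park(cleaner);                      // clean-1 has finished by now
            cleaner.execute(() -> log.add("clean-2"));  // queued behind the barrier
            log.add("checkpoint+restore");              // cleaner is idle during C/R
            barrier.release();
            cleaner.shutdown();
            if (!cleaner.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("cleaner did not terminate");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            barrier.release();
            cleaner.shutdownNow();
        }
        return log;
    }
}
```

Compared with a lock per cleanable, this costs one wait per cleaner thread at checkpoint time and nothing on the normal cleanup path, which appears to be the efficiency argument made above.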
------------- PR Comment: https://git.openjdk.org/crac/pull/70#issuecomment-1547927252 From akozlov at openjdk.org Mon May 15 14:14:18 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Mon, 15 May 2023 14:14:18 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore [v4] In-Reply-To: References: Message-ID: On Wed, 10 May 2023 15:23:25 GMT, Radim Vansa wrote: >> When a JVM option is MANAGEABLE it can be set at any time during runtime, therefore it is safe to change it during the restore operation. Rather than silently ignoring JVM options passed along with -XX:CRaCRestoreFrom we send them to the restored process and either update or print a warning if given option cannot be changed. > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Make CRTrace RESTORE_SETTABLE rather than MANAGEABLE Oops, there is a conflict. Could you please update the PR? ------------- PR Comment: https://git.openjdk.org/crac/pull/61#issuecomment-1547939974 From duke at openjdk.org Mon May 15 14:34:19 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Mon, 15 May 2023 14:34:19 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v8] In-Reply-To: References: Message-ID: On Fri, 12 May 2023 08:46:14 GMT, Radim Vansa wrote: >> There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`. When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value. > > Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 20 commits:
>
> - Merge branch 'crac' into nanotime
> - Do not use negative monotonic offset
> - Merge branch 'crac' into nanotime
> - Fix whitespaces
> - Use image under ghcr.io/crac
> - Ensure monotonicity for the same boot
> - Set nanotime only if bootid changes
> - Reset nanotime offset before calculating it again
> - Correct time since restore
> - Merge remote-tracking branch 'origin/crac' into nanotime
> - ... and 10 more: https://git.openjdk.org/crac/compare/4b0dc2dc...4c850312

src/hotspot/os/linux/os_linux.cpp line 6400:

> 6398: }
> 6399:
> 6400: bool os::read_bootid(char *dest, size_t size) {

The `size` parameter is not useful when it must always be `UUID_LENGTH + 1`.

src/hotspot/os/linux/os_linux.cpp line 6421:

> 6419: }
> 6420: rd += r;
> 6421: } while (rd < size - 1);

I would expect it would be good to check that the file really has the expected length and that the UUID ends with `'\n'`.

src/hotspot/os/linux/os_linux.cpp line 6422:

> 6420: rd += r;
> 6421: } while (rd < size - 1);
> 6422: dest[size - 1] = '\0';

When the function supports only `UUID_LENGTH` files it does not make much sense to zero-terminate it. It is easier to do a `memcmp` of `UUID_LENGTH` bytes instead.

src/hotspot/os/linux/os_linux.cpp line 6423:

> 6421: } while (rd < size - 1);
> 6422: dest[size - 1] = '\0';
> 6423: ::close(fd);

Error check of `close()`?

src/hotspot/share/runtime/os.cpp line 2051:

> 2049: // won't offset the time based on wall-clock time as this change in monotonic
> 2050: // time is likely intentional.
> 2051: if (!read_bootid(buf, sizeof(buf)) || strncmp(buf, checkpoint_bootid, UUID_LENGTH) != 0) {

Currently `read_bootid` always zero-terminates the string. Therefore it is easier to use just `strcmp`. If the zero-termination is removed as I proposed above, one would rather use `memcmp` instead of `strncmp`.
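The boot-id comparison in the last comment decides whether the wall-clock based correction of `System.nanoTime()` is applied at all. The arithmetic can be sketched in Java as follows. This is a simplified assumption about the intent described in the PR summary (record wall-clock time at checkpoint and at restore, adjust only when the boot id changed); the class, method and parameter names are hypothetical, and the actual HotSpot logic is C++ and handles more cases.

```java
final class NanoTimeOffsetSketch {
    // Returns the value to add to the raw monotonic clock after restore so
    // that the adjusted clock continues from the checkpoint-time value.
    static long offsetAfterRestore(String checkpointBootId, String restoreBootId,
                                   long checkpointNanos, long checkpointWallMillis,
                                   long restoreNanos, long restoreWallMillis) {
        boolean sameBoot = restoreBootId != null && restoreBootId.equals(checkpointBootId);
        if (sameBoot) {
            // Assumption: on the same boot the raw clock is left alone.
            return 0L;
        }
        // Wall-clock time that passed while the process was in the image;
        // clamped so that a wall clock set backwards cannot move time back.
        long elapsedNanos = Math.max(0L, restoreWallMillis - checkpointWallMillis) * 1_000_000L;
        return checkpointNanos + elapsedNanos - restoreNanos;
    }
}
```

For example, with a checkpoint taken at raw value 5000 ns, two seconds of wall-clock time in the image, and a raw value of 100 ns on the new machine, the offset is 5000 + 2,000,000,000 - 100 = 2,000,004,900 ns.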
-------------

PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1193925337
PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1193927193
PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1193928930
PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1193931013
PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1193930500

From duke at openjdk.org Tue May 16 03:36:10 2023
From: duke at openjdk.org (joeylee)
Date: Tue, 16 May 2023 03:36:10 GMT
Subject: [crac] RFR: Linux file system watcher support
In-Reply-To:
References:
Message-ID: <4atKXqdR0-3eWpnFmtU0OlXmi0jOErOa6xVeYx780QU=.7726f3ac-b87f-4c67-8b3f-a0f3eb88691e@github.com>

On Mon, 15 May 2023 13:41:44 GMT, Radim Vansa wrote:

>> inotify monitors changes on the filesystem; this adds support for automatic restore of LinuxFileWatcher.
>>
>> FileWatcherAfterRestoreTest verifies watcher service works after restore.
>> FileWatcherTest verifies automatic closing of the inotify fd
>>
>> The watcher keys are still managed by user, so exception will be thrown if no watcher keys are leaked, as in FileWatcherWithOpenKeysTest
>
> Can't tell off the top of my head - the crash.log has the VM error log printed just after calling `ls` rather than `java -ea -XX:CRaCCheckpointTo=cr jdk.test.lib.crac.CracTest __run_test__ FileWatcherAfterRestoreTest` - have you edited it?
>
> Could you try to reproduce with a `CONF=fastdebug` build?

@rvansa, sorry, it looks like I accidentally pasted some of the bash history into the gist; the error is caused by `java -ea -XX:CRaCCheckpointTo=cr jdk.test.lib.crac.CracTest __run_test__ FileWatcherAfterRestoreTest`. And here is the `fastdebug` level [crash log](https://gist.github.com/joeyleeeeeee97/42711c35e02b0d2c530bad22aa2c5dd0). I am not editing any related logic; this happens after the restored process has finished execution and starts exiting.

V [libjvm.so+0x8d4275] oop RawAccessBarrier<594020ul>::oop_load(void*)+0x15 V [libjvm.so+0x8d4905] AccessInternal::PostRuntimeDispatch, (AccessInternal::BarrierType)2, 594020ul>::oop_access_barrier(void*)+0x15 V [libjvm.so+0x137d20b] ObjectMonitor::object_peek() const+0x1b V [libjvm.so+0x1798354] ObjectSynchronizer::release_monitors_owned_by_thread(JavaThread*)+0xf4 V [libjvm.so+0x180f447] JavaThread::exit(bool, JavaThread::ExitType)+0x8d7 V [libjvm.so+0xe91a4d] jni_DetachCurrentThread+0x8d C [libjli.so+0x4c71] JavaMain+0xc41 C [libjli.so+0x7a19] ThreadJavaMain+0x9 ------------- PR Comment: https://git.openjdk.org/crac/pull/72#issuecomment-1548926529 From duke at openjdk.org Tue May 16 07:30:26 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 16 May 2023 07:30:26 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore [v5] In-Reply-To: References: Message-ID: > When a JVM option is MANAGEABLE it can be set at any time during runtime, therefore it is safe to change it during the restore operation. Rather than silently ignoring JVM options passed along with -XX:CRaCRestoreFrom we send them to the restored process and either update or print a warning if given option cannot be changed. Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: - Merge branch 'crac' into vmoptions - Make CRTrace RESTORE_SETTABLE rather than MANAGEABLE - Fixup - Use RESTORE_SETTABLE on JVM flags * Fail early when using non settable flags * CracBuilder fix: don't use classpath during restore - Support updating MANAGEABLE JVM options during restore When a JVM option is MANAGEABLE it can be set at any time during runtime, therefore it is safe to change it during the restore operation. Rather than silently ignoring JVM options passed along with -XX:CRaCRestoreFrom we send them to the restored process and either update or print a warning if given option cannot be changed. 
------------- Changes: https://git.openjdk.org/crac/pull/61/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=61&range=04 Stats: 254 lines in 12 files changed: 190 ins; 10 del; 54 mod Patch: https://git.openjdk.org/crac/pull/61.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/61/head:pull/61 PR: https://git.openjdk.org/crac/pull/61 From rvansa at azul.com Tue May 16 08:17:17 2023 From: rvansa at azul.com (Radim Vansa) Date: Tue, 16 May 2023 10:17:17 +0200 Subject: Replacing mmap with userfaultfd Message-ID: <42cdb9e5-6f51-d2b4-9859-6451eb155c67@azul.com> Hi all, I was exploring alternative options to support repeated checkpoints [1] and I'd like to share results for review and further suggestions. Currently the CRaC fork of CRIU uses by default --mmap-page-image during restore - that significantly speeds up loading the image, but if another checkpoint were performed we would keep the mapping to directory with previous checkpoint. In [1] I've temporarily mapped those pages read-only, blocking any thread that would access it, copied the data to newly allocated memory and then remapped the copy to the original address space. This solution has two downsides: 1) there is asymmetry in the design as one part (mapping) is handled in CRIU while the 'fixup' before next checkpoint happens in JVM 2) handling of writes to those write-protected pages happens through handling SIGSEGV. JVM already supports user handling of signal (in addition to its own), but from POSIX view the state of process after SIGSEGV is undefined so this is rather crude solution. Newer Linux kernels support handling missing pages through `userfaultfd`. This would be a great solution for (2), but it is not possible to register this handling on memory mapped to file (except tmpfs on recent kernels). 
As for loading all memory through userfaultfd on demand, CRIU already supports that with --lazy-pages, where a server process is started and the starting application fetches pages as it needs them. If we're looking into a more symmetric (1) solution we could implant the handler directly into the JVM process. When testing that, though, I found that there's still a big gap between this lazy loading and a simple mmapped file. To check the possible performance I've refactored and modified [2] - the source code for the test is attached.

In order to run this I had to enable non-root userfaultfd, set the number of huge pages in my system and generate a 512 MB testing file:

echo 1 | sudo tee /proc/sys/vm/unprivileged_userfaultfd
echo 256 | sudo tee /proc/sys/vm/nr_hugepages
dd if=/dev/random of=/home/rvansa/random bs=1048576 count=512

After compiling and running this on my machine I got these results pretty consistently:

Tested file has 512 MB (536870912 bytes)
mmaped file:
Page size: 4096
Num pages: 131072
TOTAL 35222423, AVG 268, MIN 12, MAX 47249 ... VALUE -920981768
--------------------------------------
Userfaultfd, regular pages:
Page size: 4096
Num pages: 131072
TOTAL 729833293, AVG 5568, MIN 4358, MAX 126324 ... VALUE -920981768
--------------------------------------
Userfaultfd, huge pages:
Page size: 2097152
Num pages: 256
TOTAL 104413970, AVG 407867, MIN 351902, MAX 2019050 ... VALUE 899929091

This shows that userfaultfd with 4kB pages (within a single process) is 20x slower, and with 2MB pages 3x slower, than simply reading the mmapped file. The most critical factor is probably the number of context switches and the latency of all of that; the need to copy the data multiple times still slows the hugepages version significantly, though.

We could use a hybrid approach that would proactively load the data, trying to limit the number of page faults that actually need to be handled; I am not sure how I could quickly build a proof of concept of such an approach, though.
In any case I think that getting to performance comparable to mmaped files would be very hard. Thanks for any further suggestions. Radim [1] https://github.com/openjdk/crac/pull/57 [2] https://noahdesu.github.io/2016/10/10/userfaultfd-hello-world.html -------------- next part -------------- A non-text attachment was scrubbed... Name: experiment.c Type: text/x-csrc Size: 7192 bytes Desc: not available URL: From duke at openjdk.org Tue May 16 08:33:17 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 16 May 2023 08:33:17 GMT Subject: [crac] RFR: Linux file system watcher support In-Reply-To: References: Message-ID: On Sun, 14 May 2023 10:00:26 GMT, joeylee wrote: > inotify monitors changes on filesystem, this support automatic restore for LinuxFileWatcher. > > FileWatcherAfterRestoreTest verifies watcher service works after restore. > FileWatcherTest verifies automatic closing inotiify fd > > The watcher keys are still managed by user, so exception will be thrown if no watcher keys are leaked, as in FileWatcherWithOpenKeysTest I was hoping that a fastdebug build with assertions on would fail somewhere earlier but it is not the case. I would suspect that since this NULL dereference is happening when the thread already exits there's an earlier event that goes wrong. ------------- PR Comment: https://git.openjdk.org/crac/pull/72#issuecomment-1549229817 From duke at openjdk.org Tue May 16 08:40:14 2023 From: duke at openjdk.org (joeylee) Date: Tue, 16 May 2023 08:40:14 GMT Subject: [crac] RFR: Linux file system watcher support In-Reply-To: References: Message-ID: On Tue, 16 May 2023 08:30:41 GMT, Radim Vansa wrote: >> inotify monitors changes on filesystem, this support automatic restore for LinuxFileWatcher. >> >> FileWatcherAfterRestoreTest verifies watcher service works after restore. 
>> FileWatcherTest verifies automatic closing inotiify fd
>> 
>> The watcher keys are still managed by user, so exception will be thrown if no watcher keys are leaked, as in FileWatcherWithOpenKeysTest
> 
> I was hoping that a fastdebug build with assertions on would fail somewhere earlier but it is not the case. I would suspect that since this NULL dereference is happening when the thread already exits there's an earlier event that goes wrong.

@rvansa Yes, it seems that an object monitor in the in-use list is wrong. This is my [gdb debug history](https://gist.github.com/joeyleeeeeee97/96213d4dfd655894d86f7e36531f03f2).
I noticed:

(gdb) p ObjectSynchronizer::_in_use_list->_head->_next_om->_next_om->_object->_obj
$9 = (oop *) 0x36383631706d6574

-------------

PR Comment: https://git.openjdk.org/crac/pull/72#issuecomment-1549238512

From duke at openjdk.org Tue May 16 09:29:13 2023
From: duke at openjdk.org (Radim Vansa)
Date: Tue, 16 May 2023 09:29:13 GMT
Subject: [crac] RFR: Correct System.nanotime() value after restore [v9]
In-Reply-To: 
References: 
Message-ID: 

> There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`. When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value.

Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 22 commits:

 - Merge branch 'crac' into nanotime
 - More checks when reading boot ID
 - Merge branch 'crac' into nanotime
 - Do not use negative monotonic offset
 - Merge branch 'crac' into nanotime
 - Fix whitespaces
 - Use image under ghcr.io/crac
 - Ensure monotonicity for the same boot
 - Set nanotime only if bootid changes
 - Reset nanotime offset before calculating it again
 - ...
and 12 more: https://git.openjdk.org/crac/compare/4d9c616f...0b45adeb ------------- Changes: https://git.openjdk.org/crac/pull/53/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=53&range=08 Stats: 312 lines in 7 files changed: 285 ins; 0 del; 27 mod Patch: https://git.openjdk.org/crac/pull/53.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/53/head:pull/53 PR: https://git.openjdk.org/crac/pull/53 From duke at openjdk.org Tue May 16 09:29:14 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 16 May 2023 09:29:14 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v2] In-Reply-To: References: <-i0uB8ZW7r54hoKQJ_wODUXNKVkOI5rH7SJTEhSHiDw=.75ebe53a-9081-40c6-911f-048b17e8850e@github.com> Message-ID: On Thu, 11 May 2023 14:39:21 GMT, Jan Kratochvil wrote: >>> Since you mention it was a bug in kernel or criu and it has been almost 3 years since your commit, may be it is worth enabling the criu changes again to see if the timedwait problem still exists, unless you have already done that. >> >> AFAIK the bug is fixed, but I see no point of relying on OS here. Is there one? Timens that is not changed by CRIU provides correct values for our nanoTime() [1]. >> >>> The value returned represents nanoseconds since some fixed but arbitrary origin time (perhaps in the future, so values may be negative). 
The same origin is used by all invocations of this method in an instance of a Java virtual machine
>> 
>> [1] https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/System.html#nanoTime()
> 
> Upstream criu does provide the time namespace as stated by @AntonKozlov:
> 
> CLOCK_MONOTONIC=301.735134591 CLOCK_BOOTTIME=301.735155494
> CLOCK_MONOTONIC=302.735345917 CLOCK_BOOTTIME=302.735358109
> Warn (compel/arch/x86/src/lib/infect.c:356): Will restore 7757 with interrupted system call
> [1]+ Killed ./clock_gettime
> restore:
> CLOCK_MONOTONIC=302.803360137 CLOCK_BOOTTIME=302.803373299
> restore:
> CLOCK_MONOTONIC=302.806677876 CLOCK_BOOTTIME=302.806696098
> 
> I do not see why JVM should reimplement what CRIU already does. One can solve that in the future when CRaC is really going to run on non-Linux system. One will need to port or reimplement there CRIU in the first place.

Thanks for the comments on boot ID, @jankratochvil; incorporated.

-------------

PR Comment: https://git.openjdk.org/crac/pull/53#issuecomment-1549313713

From rvansa at azul.com Tue May 16 10:53:12 2023
From: rvansa at azul.com (Radim Vansa)
Date: Tue, 16 May 2023 12:53:12 +0200
Subject: Replacing mmap with userfaultfd
In-Reply-To: <42cdb9e5-6f51-d2b4-9859-6451eb155c67@azul.com>
References: <42cdb9e5-6f51-d2b4-9859-6451eb155c67@azul.com>
Message-ID: <64fefae5-bf9f-580a-6704-68209ad3482b@azul.com>

I've experimented a bit more with two things:

The first was to remove the copy by allocating new anonymous memory, reading the data directly into it, remapping it to the original location and waking up the thread. This proved to be even slower than UFFDIO_COPY in both cases.

The second was to use the MAP_POPULATE flag with the mmapped file: the mmap operation then took significantly longer (almost as if it were synchronous rather than a hint) but the total time for mmap and reading would be lower.
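The MAP_POPULATE variant described above can be sketched roughly as follows. This is not the attached experiment.c: `touch_pages` and `now_ns` are hypothetical helpers, and the use of a private read-only mapping and of one touched byte per page are assumptions made only to keep the example small.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock reading in nanoseconds. */
static int64_t now_ns(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (int64_t) ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Hypothetical helper: maps `path` read-only (optionally with MAP_POPULATE),
 * touches one byte in every page and returns the sum of those bytes,
 * or -1 on error. Timings are reported on stderr. */
static long touch_pages(const char *path, int populate) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) {
    return -1;
  }
  struct stat st;
  if (fstat(fd, &st) != 0 || st.st_size == 0) {
    close(fd);
    return -1;
  }
  size_t len = (size_t) st.st_size;
  size_t page = (size_t) sysconf(_SC_PAGESIZE);

  int flags = MAP_PRIVATE | (populate ? MAP_POPULATE : 0);
  int64_t t0 = now_ns();
  unsigned char *mem = mmap(NULL, len, PROT_READ, flags, fd, 0);
  int64_t t1 = now_ns();
  close(fd);  /* the mapping stays valid after the descriptor is closed */
  if (mem == MAP_FAILED) {
    return -1;
  }

  long sum = 0;
  for (size_t off = 0; off < len; off += page) {
    sum += mem[off];  /* first access to a page may trigger a page fault */
  }
  int64_t t2 = now_ns();

  fprintf(stderr, "populate=%d: mmap call took %lld ns, touching took %lld ns\n",
          populate, (long long) (t1 - t0), (long long) (t2 - t1));
  munmap(mem, len);
  return sum;
}
```

Calling `touch_pages(path, 0)` and `touch_pages(path, 1)` on the same file allows comparing how the cost shifts between the mmap call itself and the subsequent page accesses; the page cache state between runs will of course influence the numbers.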
Results from attached test:

Tested file has 512 MB (536870912 bytes)
mmap call took 872 ns
mmapped file:
Page size: 4096
Num pages: 131072
TOTAL 35825555 (33845372), AVG 258, MIN 12, MAX 31720 ... VALUE -920981768
--------------------------------------
mmap call took 20329669 ns
mmapped file, populated:
Page size: 4096
Num pages: 131072
TOTAL 8158203 (6215554), AVG 47, MIN 12, MAX 8432 ... VALUE -920981768
--------------------------------------
Userfaultfd, regular pages, copy:
Page size: 4096
Num pages: 131072
TOTAL 673848374 (671470568), AVG 5122, MIN 4305, MAX 285817 ... VALUE -920981768
--------------------------------------
Userfaultfd, regular pages, remap:
Page size: 4096
Num pages: 131072
TOTAL 1126162943 (1123755653), AVG 8573, MIN 7070, MAX 203940 ... VALUE -920981768
--------------------------------------
Userfaultfd, huge pages, copy:
Page size: 2097152
Num pages: 256
TOTAL 105654950 (105609575), AVG 412537, MIN 356688, MAX 2008166 ... VALUE 899929091
--------------------------------------
Userfaultfd, huge pages, remap:
Page size: 2097152
Num pages: 256
TOTAL 165370794 (165326676), AVG 645807, MIN 575171, MAX 3516155 ... VALUE 899929091
--------------------------------------

Radim

-------------- next part --------------
A non-text attachment was scrubbed...
Name: experiment.c
Type: text/x-csrc
Size: 9092 bytes
Desc: not available
URL: 

From duke at openjdk.org Tue May 16 11:56:12 2023
From: duke at openjdk.org (Radim Vansa)
Date: Tue, 16 May 2023 11:56:12 GMT
Subject: [crac] RFR: Synchronize concurrent clean() in PhantomCleanableRef
In-Reply-To: 
References: 
Message-ID: 

On Fri, 12 May 2023 13:50:34 GMT, Radim Vansa wrote:

> Fixes failures in RefQueueTest and JarFileFactoryCacheTest.
Alternative implementation: https://github.com/openjdk/crac/pull/73 ------------- PR Comment: https://git.openjdk.org/crac/pull/70#issuecomment-1549518062 From duke at openjdk.org Tue May 16 11:59:12 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 16 May 2023 11:59:12 GMT Subject: [crac] RFR: Prevent concurrent cleanup by cleaner thread and checkpoint notifications Message-ID: We block the cleaner thread to prevent race conditions between this thread and checkpointing thread invoking clean(). When the cleanup starts in cleaner thread the checkpoint will skip it, but without waiting for the cleanup to finish (which might be critical for the checkpoint, e.g. closing FDs). The limitation is that code performing C/R must not wait on any task completed by the cleaner. ------------- Commit messages: - Prevent concurrent cleanup by cleaner thread and checkpoint notifications Changes: https://git.openjdk.org/crac/pull/73/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=73&range=00 Stats: 73 lines in 3 files changed: 64 ins; 7 del; 2 mod Patch: https://git.openjdk.org/crac/pull/73.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/73/head:pull/73 PR: https://git.openjdk.org/crac/pull/73 From duke at openjdk.org Tue May 16 11:59:12 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 16 May 2023 11:59:12 GMT Subject: [crac] RFR: Prevent concurrent cleanup by cleaner thread and checkpoint notifications In-Reply-To: References: Message-ID: On Tue, 16 May 2023 11:52:54 GMT, Radim Vansa wrote: > We block the cleaner thread to prevent race conditions between this thread and checkpointing thread invoking clean(). > When the cleanup starts in cleaner thread the checkpoint will skip it, but without waiting for the cleanup to finish (which might be critical for the checkpoint, e.g. closing FDs). > The limitation is that code performing C/R must not wait on any task completed by the cleaner. 
This is an alternative to https://github.com/openjdk/crac/pull/70 ------------- PR Comment: https://git.openjdk.org/crac/pull/73#issuecomment-1549518441 From duke at openjdk.org Tue May 16 12:29:16 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Tue, 16 May 2023 12:29:16 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v6] In-Reply-To: References: Message-ID: On Mon, 15 May 2023 05:59:11 GMT, Radim Vansa wrote: >> I mean to provide normal LD_PRELOAD library and then one no longer needs to deal with any containers with their associated problems discussed above. >> [libvirtual.c.txt](https://github.com/openjdk/crac/files/11465377/libvirtual.c.txt) >> [clock_gettime.cpp.txt](https://github.com/openjdk/crac/files/11465400/clock_gettime.cpp.txt) >> >> rm -rf dir;mkdir dir;LD_PRELOAD=$PWD/libvirtual.so CLOCK_MONOTONIC=10 BOOT_ID=$PWD/fake_boot_id ./clock_gettime &p=$!;sleep 2;sudo ./build/linux-x86_64-server-fastdebug/jdk/lib/criu dump --shell-job -t $p -D dir >> CLOCK_MONOTONIC=88547. 87286504 CLOCK_BOOTTIME=88547. 87304536 >> CLOCK_MONOTONIC=88548. 87433771 CLOCK_BOOTTIME=88548. 87445951 >> CLOCK_MONOTONIC=88549. 87564393 CLOCK_BOOTTIME=88549. 87573859 >> Warn (criu/pages-compress.c:216): decompress: Can't open 'pages-1.comp.img' for dfd=3, errno=2 [--- that is some bug] >> CLOCK_MONOTONIC=10. 82447 CLOCK_BOOTTIME=88556.959892251 >> CLOCK_MONOTONIC=11. 179010 CLOCK_BOOTTIME=88557.959969582 >> CLOCK_MONOTONIC=12. 264924 CLOCK_BOOTTIME=88558.960056407 >> >> With original JVM it correctly crashes: >> >> # Internal Error (../../src/hotspot/share/runtime/thread.cpp:2470), pid=74780, tid=74781 >> # assert(false) failed: unexpected time moving backwards detected in JavaThread::sleep() >> >> With your patch it runs although somehow slowly. >> It may need some adjustments, I wrote the library as a proof of concept to prevent the containers. >> One cannot use the library to test native CRIU timens as that happens at kernel-level. 
> > How would you compile that library as part of of the jtreg? Just do `Runtime.getRuntime().exec("gcc", ...)` during the test? Can you create some native libraries as a part of the test library build? I do not know much jtreg. One can try `Runtime.getRuntime().exec("gcc", ...)` and Java experts can improve it later. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1195086680 From duke at openjdk.org Tue May 16 13:09:20 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 16 May 2023 13:09:20 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v6] In-Reply-To: References: Message-ID: On Tue, 16 May 2023 12:26:23 GMT, Jan Kratochvil wrote: >> How would you compile that library as part of of the jtreg? Just do `Runtime.getRuntime().exec("gcc", ...)` during the test? Can you create some native libraries as a part of the test library build? > > I do not know much jtreg. One can try `Runtime.getRuntime().exec("gcc", ...)` and Java experts can improve it later. I've checked that test sources actually contain various *.c files, but it doesn't seem like this would be compiled from the test. It would make sense to do that from the Makefiles, but haven't found explicit mentions... Anyway I find packaging stuff in containers quite convenient (and we don't have to add any extra C sources). I can add the released build of CRIU into the test-base image and remove the part mounting it from local system, would that work for you? ------------- PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1195134303 From duke at openjdk.org Tue May 16 13:09:20 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Tue, 16 May 2023 13:09:20 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v6] In-Reply-To: References: Message-ID: On Tue, 16 May 2023 13:04:02 GMT, Radim Vansa wrote: >> I do not know much jtreg. One can try `Runtime.getRuntime().exec("gcc", ...)` and Java experts can improve it later. 
> > I've checked that test sources actually contain various *.c files, but it doesn't seem like this would be compiled from the test. It would make sense to do that from the Makefiles, but haven't found explicit mentions... > Anyway I find packaging stuff in containers quite convenient (and we don't have to add any extra C sources). I can add the released build of CRIU into the test-base image and remove the part mounting it from local system, would that work for you? I am not sure what it then tests when everything is bundled. But I am fine if the test does PASS. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1195137313 From duke at openjdk.org Tue May 16 13:11:15 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Tue, 16 May 2023 13:11:15 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v14] In-Reply-To: References: Message-ID: On Thu, 4 May 2023 18:07:40 GMT, Anton Kozlov wrote: >> Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: >> >> -altstack > > src/hotspot/cpu/x86/vm_version_x86.cpp line 731: > >> 729: // glibc: && CPU_FEATURE_USABLE_P (cpu_features, FMA) >> 730: // glibc: && CPU_FEATURE_USABLE_P (cpu_features, LZCNT) >> 731: // glibc: && CPU_FEATURE_USABLE_P (cpu_features, MOVBE)) > > SKARA complians on this line What does "SKARA complians" (complains) mean? I haven't found any warning anywhere. 
------------- PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1195140306 From duke at openjdk.org Tue May 16 13:21:24 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Tue, 16 May 2023 13:21:24 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v14] In-Reply-To: References: Message-ID: On Thu, 4 May 2023 17:56:28 GMT, Anton Kozlov wrote: >> Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: >> >> -altstack > > src/hotspot/cpu/x86/vm_version_x86.cpp line 717: > >> 715: if ((excessive_CPU & CPU_SSE3) || >> 716: (excessive_GLIBC & (GLIBC_CMPXCHG16 | GLIBC_LAHFSAHF))) { >> 717: assert(!(excessive_CPU & CPU_SSE4_2), "(_features & CPU_SSE4_2) cannot happen"); > > Failed assert prints the failed condition, no need to repeat in the message. `_features` and `excessive_CPU` are two distinct variables. I have tried to rephrase the message: assert(!(excessive_CPU & CPU_SSE4_2), "CPU_SSE4_2 in both _features and excessive_CPU cannot happen"); ------------- PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1195149131 From duke at openjdk.org Tue May 16 13:21:24 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Tue, 16 May 2023 13:21:24 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v17] In-Reply-To: References: Message-ID: > Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. > > 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. > 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC > > (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. 
But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: > >> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine > > It is intended to specify the lowest common denominator of all CPUs in a farm. Instead of a possible crash of OpenJDK it will now refuse to run: > >> Error occurred during initialization of VM >> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi > > (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. > > If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there... 
Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: Adjust assertion messages. - review by Anton Kozlov ------------- Changes: - all: https://git.openjdk.org/crac/pull/41/files - new: https://git.openjdk.org/crac/pull/41/files/04a11ef3..e7ddad55 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=16 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=15-16 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41 From duke at openjdk.org Tue May 16 13:26:44 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Tue, 16 May 2023 13:26:44 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v18] In-Reply-To: References: Message-ID: > Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. > > 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. > 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC > > (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: > >> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine > > It is intended to specify the lowest common denominator of all CPUs in a farm. 
Instead of a possible crash of OpenJDK it will now refuse to run: > >> Error occurred during initialization of VM >> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi > > (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. > > If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there... 
Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: Move disable_* more down where it is being used - reviewed by Anton Kozlov ------------- Changes: - all: https://git.openjdk.org/crac/pull/41/files - new: https://git.openjdk.org/crac/pull/41/files/e7ddad55..4f8a6a43 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=17 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=16-17 Stats: 4 lines in 1 file changed: 2 ins; 2 del; 0 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41 From akozlov at openjdk.org Tue May 16 13:26:48 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 16 May 2023 13:26:48 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v14] In-Reply-To: References: Message-ID: On Tue, 16 May 2023 13:08:40 GMT, Jan Kratochvil wrote: >> src/hotspot/cpu/x86/vm_version_x86.cpp line 731: >> >>> 729: // glibc: && CPU_FEATURE_USABLE_P (cpu_features, FMA) >>> 730: // glibc: && CPU_FEATURE_USABLE_P (cpu_features, LZCNT) >>> 731: // glibc: && CPU_FEATURE_USABLE_P (cpu_features, MOVBE)) >> >> SKARA complians on this line > > What does "SKARA complians" (complains) mean? I haven't found any warning anywhere. Hmm, I still see an error: https://github.com/openjdk/crac/pull/41/checks?check_run_id=13518791480 ------------- PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1195155670 From akozlov at openjdk.org Tue May 16 14:06:49 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 16 May 2023 14:06:49 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v6] In-Reply-To: References: Message-ID: On Tue, 16 May 2023 13:06:20 GMT, Jan Kratochvil wrote: >> I've checked that test sources actually contain various *.c files, but it doesn't seem like this would be compiled from the test. 
It would make sense to do that from the Makefiles, but haven't found explicit mentions... >> Anyway I find packaging stuff in containers quite convenient (and we don't have to add any extra C sources). I can add the released build of CRIU into the test-base image and remove the part mounting it from local system, would that work for you? > > I am not sure what it then tests when everything is bundled. But I am fine if the test does PASS. JTReg tests has so-called native-image, a part of the test-image, consisting of the native code required for the tests to pass. It's created in during jdk build. There are a few .c files in the test/hotspot/jtreg/, probably they would help... ------------- PR Review Comment: https://git.openjdk.org/crac/pull/53#discussion_r1195219192 From duke at openjdk.org Tue May 16 14:29:44 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Tue, 16 May 2023 14:29:44 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v19] In-Reply-To: References: Message-ID: > Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. > > 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. > 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC > > (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: > >> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine > > It is intended to specify the lowest common denominator of all CPUs in a farm. 
Instead of a possible crash of OpenJDK it will now refuse to run: > >> Error occurred during initialization of VM >> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi > > (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. > > If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there... Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: Whitespace fix. 
- review by Anton Kozlov ------------- Changes: - all: https://git.openjdk.org/crac/pull/41/files - new: https://git.openjdk.org/crac/pull/41/files/4f8a6a43..caaa3a27 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=18 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=17-18 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41 From duke at openjdk.org Wed May 17 12:19:50 2023 From: duke at openjdk.org (joeylee) Date: Wed, 17 May 2023 12:19:50 GMT Subject: [crac] RFR: Linux file system watcher support [v2] In-Reply-To: References: Message-ID: > inotify monitors changes on filesystem, this support automatic restore for LinuxFileWatcher. > > FileWatcherAfterRestoreTest verifies watcher service works after restore. > FileWatcherTest verifies automatic closing inotiify fd > > The watcher keys are still managed by user, so exception will be thrown if no watcher keys are leaked, as in FileWatcherWithOpenKeysTest joeylee has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. 
The pull request contains one new commit since the last revision: Linux file system watcher support ------------- Changes: - all: https://git.openjdk.org/crac/pull/72/files - new: https://git.openjdk.org/crac/pull/72/files/20f4bee1..36dfd224 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=72&range=01 - incr: https://webrevs.openjdk.org/?repo=crac&pr=72&range=00-01 Stats: 73 lines in 3 files changed: 8 ins; 54 del; 11 mod Patch: https://git.openjdk.org/crac/pull/72.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/72/head:pull/72 PR: https://git.openjdk.org/crac/pull/72 From duke at openjdk.org Wed May 17 12:19:52 2023 From: duke at openjdk.org (joeylee) Date: Wed, 17 May 2023 12:19:52 GMT Subject: [crac] RFR: Linux file system watcher support [v2] In-Reply-To: <140Zj7oo7kRyFYqSARRLaxNAucyQa2RCy7rcRrEuYGo=.22f2507b-09ef-4512-a8ac-96e5c4df6e65@github.com> References: <140Zj7oo7kRyFYqSARRLaxNAucyQa2RCy7rcRrEuYGo=.22f2507b-09ef-4512-a8ac-96e5c4df6e65@github.com> Message-ID: On Mon, 15 May 2023 07:03:18 GMT, Radim Vansa wrote: >> joeylee has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: >> >> Linux file system watcher support > > test/jdk/jdk/crac/fileDescriptors/FileWatcherAfterRestoreTest.java line 48: > >> 46: @Override >> 47: public void exec() throws Exception { >> 48: WatchService watchService = FileSystems.getDefault().newWatchService(); > > Could you call `close()` on this service at the end of the test to check that it works as it should after C/R? 
ok ------------- PR Review Comment: https://git.openjdk.org/crac/pull/72#discussion_r1196413922 From duke at openjdk.org Wed May 17 12:36:15 2023 From: duke at openjdk.org (joeylee) Date: Wed, 17 May 2023 12:36:15 GMT Subject: [crac] RFR: Linux file system watcher support [v2] In-Reply-To: <140Zj7oo7kRyFYqSARRLaxNAucyQa2RCy7rcRrEuYGo=.22f2507b-09ef-4512-a8ac-96e5c4df6e65@github.com> References: <140Zj7oo7kRyFYqSARRLaxNAucyQa2RCy7rcRrEuYGo=.22f2507b-09ef-4512-a8ac-96e5c4df6e65@github.com> Message-ID: On Mon, 15 May 2023 06:48:08 GMT, Radim Vansa wrote: >> joeylee has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: >> >> Linux file system watcher support > > src/java.base/linux/classes/sun/nio/fs/LinuxWatchService.java line 367: > >> 365: // wait for close or inotify event >> 366: try { >> 367: do { > > When the `poll` is followed by a C/R we call it second time, ignoring the old value. Will this forget any events? > Obviously all the events happening when the application is in snapshot will be lost, but I wonder whether we should queue and replay anything that's already recorded. In the future (not necessarily in this PR) it would be nice to detect if anything happened when the application was in snapshot and generate events to keep its view up to date. I think it's ok to ignore the previous poll result, in this case: 1. wakeup() called by other thread, and polling thread received the notify 2. the process begin checkpoint and block at processCheckpointRestore, forget the previous notify 3. the process restored, and proceed with no notify. 
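Since this PR leaves watch keys to the user, application code would presumably cancel its keys before the checkpoint and register them again after restore. The following is only a sketch of that pattern, not code from the PR: the class `UserManagedWatchKeys` and its method names are hypothetical. In a CRaC application the two callbacks would be the `beforeCheckpoint` and `afterRestore` methods of a registered `Resource`; plain methods are used here so that the sketch runs on a stock JDK.

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: the application, not the JDK, owns the watch keys across C/R. */
final class UserManagedWatchKeys implements AutoCloseable {
    private final WatchService service;
    private final List<Path> watched = new ArrayList<>();  // directories to re-register
    private final List<WatchKey> keys = new ArrayList<>(); // currently valid keys

    UserManagedWatchKeys() throws IOException {
        service = FileSystems.getDefault().newWatchService();
    }

    WatchKey watch(Path dir) throws IOException {
        WatchKey key = dir.register(service, StandardWatchEventKinds.ENTRY_CREATE);
        watched.add(dir);
        keys.add(key);
        return key;
    }

    /** Stand-in for Resource.beforeCheckpoint: no key may stay open. */
    void beforeCheckpoint() {
        for (WatchKey key : keys) {
            key.cancel();
        }
        keys.clear();
    }

    /** Stand-in for Resource.afterRestore: register again what still exists. */
    void afterRestore() throws IOException {
        for (Path dir : watched) {
            // The directory may be missing in the environment we were restored into.
            if (Files.isDirectory(dir)) {
                keys.add(dir.register(service, StandardWatchEventKinds.ENTRY_CREATE));
            }
        }
    }

    List<WatchKey> keys() {
        return keys;
    }

    @Override
    public void close() throws IOException {
        service.close();
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("watch-sketch");
        try (UserManagedWatchKeys watcher = new UserManagedWatchKeys()) {
            WatchKey first = watcher.watch(dir);
            watcher.beforeCheckpoint();
            System.out.println("old key valid after cancel: " + first.isValid());
            watcher.afterRestore();
            System.out.println("keys after restore: " + watcher.keys().size());
        }
    }
}
```

Events that arrive between `beforeCheckpoint()` and `afterRestore()` are simply not observed in this sketch, which matches the "drop the notify" behaviour discussed here.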
my previous design was to auto save and reopen the inotify and socketpair, and let user control the watch keys, the only place where notify is used is during request, after restore all keys should be re-registered by user, so I thought it's ok to drop the notify. `I wonder if you have an example of code where it's useful to automatically suspend the service but leave WatchKey management up to the user. ` I thought the watch key path is dependent on the running environment, so during restore I couldn't take over, because at restore step the path might not exist or change, so I am leaving watch keys management for users. > src/java.base/linux/classes/sun/nio/fs/LinuxWatchService.java line 519: > >> 517: return; >> 518: } >> 519: synchronized (checkpointLock) { > > I wonder, if the `initFD` throws during restore, shouldn't we throw here as well? I know you were inspired in EPollSelector, but let's give it a thought. I added a check at after restore to check if initFD succeed. if (checkpointState != CheckpointRestoreState.NORMAL_OPERATION) { throw new CheckpointException("LinuxWatchService restore exception"); } > test/jdk/jdk/crac/fileDescriptors/FileWatcherAfterRestoreTest.java line 54: > >> 52: Path directory = Paths.get(System.getProperty("user.dir")); >> 53: WatchKey key = directory.register(watchService, StandardWatchEventKinds.ENTRY_CREATE); >> 54: Thread.sleep(200); > > Why the sleep? removed useless sleep > test/jdk/jdk/crac/fileDescriptors/FileWatcherTest.java line 36: > >> 34: * @run driver jdk.test.lib.crac.CracTest >> 35: */ >> 36: public class FileWatcherTest implements CracTest { > > Does this test anything more than what `FileWatcherAfterRestoreTest` does anyway? 
removed duplicate test ------------- PR Review Comment: https://git.openjdk.org/crac/pull/72#discussion_r1196437208 PR Review Comment: https://git.openjdk.org/crac/pull/72#discussion_r1196440583 PR Review Comment: https://git.openjdk.org/crac/pull/72#discussion_r1196441418 PR Review Comment: https://git.openjdk.org/crac/pull/72#discussion_r1196441121 From duke at openjdk.org Wed May 17 12:48:14 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 17 May 2023 12:48:14 GMT Subject: [crac] RFR: Linux file system watcher support [v2] In-Reply-To: References: <140Zj7oo7kRyFYqSARRLaxNAucyQa2RCy7rcRrEuYGo=.22f2507b-09ef-4512-a8ac-96e5c4df6e65@github.com> Message-ID: On Wed, 17 May 2023 12:30:17 GMT, joeylee wrote: >> src/java.base/linux/classes/sun/nio/fs/LinuxWatchService.java line 367: >> >>> 365: // wait for close or inotify event >>> 366: try { >>> 367: do { >> >> When the `poll` is followed by a C/R we call it second time, ignoring the old value. Will this forget any events? >> Obviously all the events happening when the application is in snapshot will be lost, but I wonder whether we should queue and replay anything that's already recorded. In the future (not necessarily in this PR) it would be nice to detect if anything happened when the application was in snapshot and generate events to keep its view up to date. > > I think it's ok to ignore the previous poll result, in this case: > 1. wakeup() called by other thread, and polling thread received the notify > 2. the process begin checkpoint and block at processCheckpointRestore, forget the previous notify > 3. the process restored, and proceed with no notify. > > my previous design was to auto save and reopen the inotify and socketpair, and let user control the watch keys, the only place where notify is used is during request, after restore all keys should be re-registered by user, so I thought it's ok to drop the notify. 
> > `I wonder if you have an example of code where it's useful to automatically suspend the service but leave WatchKey management up to the user. ` > I thought the watch key path is dependent on the running environment, so during restore I couldn't take over, because at restore step the path might not exist or change, so I am leaving watch keys management for users. Let's explain the design first, please, then we can think about the lost wakeup. I wrote the comment actually before I realized that watch keys are not handled. I understand the hesitation to manage watch keys automatically. What I am missing, though, is an example of idiomatic code where the user actually manages the watch keys during C/R - you only provide a test that checks that C/R fails when the key is left open. There I would like to demonstrate that closing the notify service automatically simplifies the code, rather than just adding a close/reopen to the watch key management. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/72#discussion_r1196459627 From akozlov at openjdk.org Wed May 17 15:04:10 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 17 May 2023 15:04:10 GMT Subject: [crac] RFR: Introduce per-Priority Context with different policies Message-ID: A follow-up work for #60: * Each priority now has a dedicated context, so contextes may provide different policies. CALL_SITE now uses new CriticalUnorderedContext that runs beforeCheckpoint on concurrent registration, fixes [1]. Whether or not CALL_SITE needs to be registered to at all is an open question and out of scope of this PR. * the Global Context reverted from BlockingOrderedContext to OrderedContext, as that may have a huge impact on users. Probably we'll want to expose blocking/criticalUnorderd context along the global one, or at some point expose an implementation. But this is also out of scope of the PR. 
* hierachy of the Context implementations are cleaned up a bit [2] The JDKContext is now just a holder of ClaimedFDs, I'll address this in a follow-up that depends on this Context follow-up. [1] https://github.com/openjdk/crac/pull/60#issuecomment-1545588281 [2] https://github.com/openjdk/crac/pull/60#discussion_r1185510445 ------------- Commit messages: - Cleanup - Cleanup - Update - All done Changes: https://git.openjdk.org/crac/pull/74/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=74&range=00 Stats: 993 lines in 29 files changed: 372 ins; 549 del; 72 mod Patch: https://git.openjdk.org/crac/pull/74.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/74/head:pull/74 PR: https://git.openjdk.org/crac/pull/74 From jack.koenig3 at gmail.com Thu May 18 01:37:18 2023 From: jack.koenig3 at gmail.com (Jack Koenig) Date: Wed, 17 May 2023 18:37:18 -0700 Subject: Problems with /var/lib/sss/mc/passwd Message-ID: Hello everyone, This is more of a user question, so I apologize if this is the wrong venue--please direct me to the right place as appropriate. I am attempting to checkpoint my application but I get an exception saying that /var/lib/sss/mc/passwd is open: An exception during a checkpoint operation: jdk.internal.crac.CheckpointException at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:141) at java.base/jdk.internal.crac.Core.checkpointRestore(Core.java:246) at java.base/jdk.internal.crac.Core.checkpointRestoreInternal(Core.java:262) Suppressed: jdk.internal.crac.impl.CheckpointOpenFileException: /var/lib/sss/mc/passwd at java.base/jdk.internal.crac.Core.translateJVMExceptions(Core.java:87) at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:145) ... 
2 more The only thing I've found mentioning a similar issue is this old thread: https://mail.openjdk.org/pipermail/crac-dev/2022-January/000079.html The workaround posted there involves system-level configuration changes, but I am an unprivileged user on a shared RHEL8 machine so cannot apply such a workaround. Is there anything I can do to resolve or at least workaround this issue? Cheers, Jack From duke at openjdk.org Thu May 18 06:54:23 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 18 May 2023 06:54:23 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v2] In-Reply-To: References: <-i0uB8ZW7r54hoKQJ_wODUXNKVkOI5rH7SJTEhSHiDw=.75ebe53a-9081-40c6-911f-048b17e8850e@github.com> Message-ID: On Thu, 11 May 2023 14:39:21 GMT, Jan Kratochvil wrote: >>> Since you mention it was a bug in kernel or criu and it has been almost 3 years since your commit, may be it is worth enabling the criu changes again to see if the timedwait problem still exists, unless you have already done that. >> >> AFAIK the bug is fixed, but I see no point of relying on OS here. Is there one? Timens that is not changed by CRIU provides correct values for our nanoTime() [1]. >> >>> The value returned represents nanoseconds since some fixed but arbitrary origin time (perhaps in the future, so values may be negative).
The same origin is used by all invocations of this method in an instance of a Java virtual machine >> >> [1] https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/System.html#nanoTime() > > Upstream criu does provide the time namespace as stated by @AntonKozlov: > > CLOCK_MONOTONIC=301.735134591 CLOCK_BOOTTIME=301.735155494 > CLOCK_MONOTONIC=302.735345917 CLOCK_BOOTTIME=302.735358109 > Warn (compel/arch/x86/src/lib/infect.c:356): Will restore 7757 with interrupted system call > [1]+ Killed ./clock_gettime > restore: > CLOCK_MONOTONIC=302.803360137 CLOCK_BOOTTIME=302.803373299 > restore: > CLOCK_MONOTONIC=302.806677876 CLOCK_BOOTTIME=302.806696098 > > I do not see why JVM should reimplement what CRIU already does. One can solve that in the future when CRaC is really going to run on non-Linux system. One will need to port or reimplement there CRIU in the first place. @jankratochvil The problem with 'hanging' restore is actually not specific to Fedora at all; from the description I thought that it is unresponsive in native part, but `jstack` tells that it was your testcase sleeping in `Thread.sleep()`. Turns out `os::PlatformEvent::park` creates an absolute timestamp on monotonic time and calls `pthread_cond_timedwait` - when the time runs backward during restore this does not get reevaluated. Based on the codepath I think that this would not be limited to `Thread.sleep()` but many other 'waiting' synchronization primitives. While I consider this a serious issue (and I'll be thinking about a solution), I think that it's independent from this PR - this handles the value we're reading directly through `System.nanoTime()`. I'll also test whether it happens when timens is used by CRIU. 
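The stale-deadline effect described above can be shown with plain arithmetic. This is a minimal sketch and not HotSpot code: the class `StaleDeadlineSketch`, its method names and the clock readings are hypothetical, and the 300-second backward jump is an assumed value chosen only to make the effect visible.

```java
/**
 * Illustration only: models a wait whose absolute deadline was computed on a
 * monotonic clock before checkpoint and is not re-evaluated after restore.
 * All names and numbers are hypothetical, not taken from HotSpot.
 */
final class StaleDeadlineSketch {

    /** Absolute deadline computed once, when the wait starts. */
    static long absoluteDeadline(long nowNanos, long timeoutNanos) {
        return nowNanos + timeoutNanos;
    }

    /** Time still to wait if the old deadline is compared against the current clock. */
    static long remainingWithStaleDeadline(long deadlineNanos, long nowNanos) {
        return deadlineNanos - nowNanos;
    }

    /** Time still to wait if only the part of the timeout that has not elapsed is kept. */
    static long remainingWithRebasedDeadline(long timeoutNanos, long elapsedBeforeCheckpointNanos) {
        return Math.max(0L, timeoutNanos - elapsedBeforeCheckpointNanos);
    }

    public static void main(String[] args) {
        long second = 1_000_000_000L;
        long beforeCheckpoint = 302 * second; // assumed monotonic reading when the sleep starts
        long timeout = 1 * second;            // e.g. a one-second Thread.sleep
        long deadline = absoluteDeadline(beforeCheckpoint, timeout);

        // Assume the restored process sees a monotonic clock 300 s behind the old one.
        long afterRestore = beforeCheckpoint - 300 * second;

        System.out.println("stale deadline waits (s):   "
                + remainingWithStaleDeadline(deadline, afterRestore) / second);
        System.out.println("rebased deadline waits (s): "
                + remainingWithRebasedDeadline(timeout, 0L) / second);
    }
}
```

With these assumed readings the stale deadline leaves 301 seconds to wait for a one-second sleep, which is how a restored thread can look as if it hangs.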
------------- PR Comment: https://git.openjdk.org/crac/pull/53#issuecomment-1552571027 From duke at openjdk.org Thu May 18 09:37:23 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 18 May 2023 09:37:23 GMT Subject: [crac] RFR: Support updating MANAGEABLE JVM options during restore [v4] In-Reply-To: References: Message-ID: On Mon, 15 May 2023 14:11:08 GMT, Anton Kozlov wrote: >> Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: >> >> Make CRTrace RESTORE_SETTABLE rather than MANAGEABLE > > Oops, there is a conflict. Could you please update the PR? @AntonKozlov could you type `/sponsor` once more, please? ------------- PR Comment: https://git.openjdk.org/crac/pull/61#issuecomment-1552793516 From rvansa at azul.com Thu May 18 09:58:04 2023 From: rvansa at azul.com (Radim Vansa) Date: Thu, 18 May 2023 11:58:04 +0200 Subject: Problems with /var/lib/sss/mc/passwd In-Reply-To: References: Message-ID: <1f4ed460-d22e-8d60-6e9c-e1859bdcfc89@azul.com> Hello Jack, the proper venue could be the Foojay.io forums [1] (yes, only recently created) or #crac channel on Foojay slack, but this list will do :) Can you try running the checkpoint with `-XX:CRaCIgnoredFileDescriptors=/var/lib/sss/mc/passwd` ? This should bypass the checks, though problems may arise on restore if this file changes when the application is in checkpoint. Radim [1] https://forums.foojay.io/forums/forum/coordinated-restore-at-checkpoint-crac/ On 18. 05. 23 3:37, Jack Koenig wrote: > Hello everyone, > > This is more of a user question, so I apologize if this is the wrong > venue--please direct me to the right place as appropriate.
> > I am attempting to checkpoint my application but I get an exception > saying that /var/lib/sss/mc/passwd is open: > > An exception during a checkpoint operation: > > jdk.internal.crac.CheckpointException > at > java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:141) > at > java.base/jdk.internal.crac.Core.checkpointRestore(Core.java:246) > at > java.base/jdk.internal.crac.Core.checkpointRestoreInternal(Core.java:262) > Suppressed: > jdk.internal.crac.impl.CheckpointOpenFileException: /var/lib/sss/mc/passwd > at > java.base/jdk.internal.crac.Core.translateJVMExceptions(Core.java:87) > at > java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:145) > ... 2 more > > The only thing I've found mentioning a similar issue is this old > thread: > https://mail.openjdk.org/pipermail/crac-dev/2022-January/000079.html > > The workaround posted there involves system-level configuration > changes, but I am an unprivileged user on a shared RHEL8 machine so > cannot apply such a workaround. > > Is there anything I can do to resolve or at least workaround this issue? > > Cheers, > Jack From duke at openjdk.org Thu May 18 10:48:25 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 18 May 2023 10:48:25 GMT Subject: [crac] Integrated: Support updating MANAGEABLE JVM options during restore In-Reply-To: References: Message-ID: On Fri, 28 Apr 2023 07:24:06 GMT, Radim Vansa wrote: > When a JVM option is MANAGEABLE it can be set at any time during runtime, therefore it is safe to change it during the restore operation. Rather than silently ignoring JVM options passed along with -XX:CRaCRestoreFrom we send them to the restored process and either update or print a warning if given option cannot be changed. This pull request has now been integrated.
Changeset: ed3efac0 Author: Radim Vansa Committer: Anton Kozlov URL: https://git.openjdk.org/crac/commit/ed3efac0d047d7822203536df851ccdea585d8ac Stats: 254 lines in 12 files changed: 190 ins; 10 del; 54 mod Support updating MANAGEABLE JVM options during restore Reviewed-by: akozlov ------------- PR: https://git.openjdk.org/crac/pull/61 From akozlov at azul.com Thu May 18 11:32:00 2023 From: akozlov at azul.com (Anton Kozlov) Date: Thu, 18 May 2023 14:32:00 +0300 Subject: CFV: New CRaC Committer: Radim Vansa Message-ID: I hereby nominate Radim Vansa to CRaC Committer. Radim is an engineer at Azul and has contributed 11 patches [3], and more to come. Becoming the Committer acknowledges his contributions and passion to make the project successful, easing future works on the project. Votes are due by Thursday, 1 June 2023, 12:00 UTC. Only current CRaC Committers [1] are eligible to vote on this nomination. Votes must be cast in the open by replying to this mailing list. For Lazy Consensus voting instructions, see [2]. Anton Kozlov [1] https://openjdk.org/census [2] https://openjdk.org/projects/#committer-vote [3] https://github.com/openjdk/crac/pulls?q=is%3Apr+author%3Arvansa+is%3Aclosed+label%3Aintegrated From duke at openjdk.org Thu May 18 12:40:15 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 18 May 2023 12:40:15 GMT Subject: [crac] RFR: Introduce per-Priority Context with different policies In-Reply-To: References: Message-ID: On Wed, 17 May 2023 14:44:08 GMT, Anton Kozlov wrote: > A follow-up work for #60: > > * Each priority now has a dedicated context, so contextes may provide different policies. CALL_SITE now uses new CriticalUnorderedContext that runs beforeCheckpoint on concurrent registration, fixes [1]. Whether or not CALL_SITE needs to be registered to at all is an open question and out of scope of this PR. > * the Global Context reverted from BlockingOrderedContext to OrderedContext, as that may have a huge impact on users. 
Probably we'll want to expose blocking/criticalUnorderd context along the global one, or at some point expose an implementation. But this is also out of scope of the PR. > * hierachy of the Context implementations are cleaned up a bit [2] > > The JDKContext is now just a holder of ClaimedFDs, I'll address this in a follow-up that depends on this Context follow-up. > > [1] https://github.com/openjdk/crac/pull/60#issuecomment-1545588281 > [2] https://github.com/openjdk/crac/pull/60#discussion_r1185510445 My main objection is allowing the checkpoint to proceed even if a resource in the `CriticalUnorderedContext` throws an exception. However, when I started the review I was hoping that you'll be able to remove the special priority for call sites and broken encapsulation in `MethodHandleNatives`. You've made this change a explicitly out of scope, but without this and no extra test that would demonstrate a previously flawed behavior it is unclear what is the actual benefit this PR brings. src/java.base/share/classes/jdk/crac/impl/BlockingOrderedContext.java line 20: > 18: // We won't cause IllegalStateException because this is not an unexpected state > 19: // from the point of CRaC - it probably tried to register some code before. > 20: throw new RuntimeException("Interrupted thread tried to block in registration of " + resource + " in " + this); The use of `this` in the exception relies on naming the context and the `toString()` method for easy identification. Since you've removed these it will show only class type rather than Global Context/custom name/JDK resource priority. src/java.base/share/classes/jdk/crac/impl/CriticalUnorderedContext.java line 54: > 52: synchronized (this) { > 53: ExceptionHolder e = concurrentCheckpointException; > 54: concurrentCheckpointException = new ExceptionHolder<>(CheckpointException::new); Rather than replacing the instance, could we make `throwIfAny` reset its state before throwing? 
That way we won't get the same exception thrown twice by accident. src/java.base/share/classes/jdk/crac/impl/CriticalUnorderedContext.java line 83: > 81: if (concurrentCheckpointException != null) { > 82: try { > 83: invokeBeforeCheckpoint(resource); This context is blocking as well - since you hold the monitor on this object if the code tries to indirectly (via operation in another thread) register other resource, it will deadlock, and cannot be woken up via interrupt. If the only usage is call sites we know that there won't be a recursive registration, so this not such a big flaw, though. src/java.base/share/classes/jdk/crac/impl/CriticalUnorderedContext.java line 85: > 83: invokeBeforeCheckpoint(resource); > 84: } catch (Exception e) { > 85: concurrentCheckpointException.handle(e); I **really** dislike the fact that the exception is reported only during restore, which may never happen. The checkpoint should be marked for failure in here. src/java.base/share/classes/jdk/crac/impl/ExceptionHolder.java line 37: > 35: E exception = get(); > 36: if (exception.getClass() == e.getClass()) { > 37: for (Throwable t : e.getSuppressed()) { We're losing the message and stack trace here. Previously, if the message was present it was added to the suppressed list as well. If you want to create aggregate-only exceptions (only suppressed list would be relevant) these should be declared as final, with only one no-arg constructor and stack trace collection disabled - but that would have some problems as discussed in https://github.com/openjdk/crac/pull/64/files#r1190679470 src/java.base/share/classes/jdk/internal/crac/JDKResource.java line 31: > 29: import jdk.crac.Resource; > 30: > 31: public interface JDKResource extends Resource { Could we drop the interface completely? 
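Returning to the `throwIfAny` suggestion above, one way a holder could reset itself before throwing is sketched below. This is not the `ExceptionHolder` from the PR: the class `ResettingExceptionHolder` is hypothetical, and only the method names `handle` and `throwIfAny` are borrowed from the review for readability.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for an aggregate-exception holder; not the PR's ExceptionHolder. */
final class ResettingExceptionHolder<E extends Exception> {
    private final Supplier<E> factory;
    private E exception; // null while nothing has been recorded

    ResettingExceptionHolder(Supplier<E> factory) {
        this.factory = factory;
    }

    /** Records a failure as a suppressed exception of the lazily created aggregate. */
    synchronized void handle(Throwable failure) {
        if (exception == null) {
            exception = factory.get();
        }
        exception.addSuppressed(failure);
    }

    /** Throws the aggregate if anything was recorded, clearing the state first. */
    synchronized void throwIfAny() throws E {
        E pending = exception;
        exception = null; // reset before throwing so a second call cannot rethrow it
        if (pending != null) {
            throw pending;
        }
    }

    public static void main(String[] args) {
        ResettingExceptionHolder<Exception> holder =
                new ResettingExceptionHolder<>(() -> new Exception("checkpoint failed"));
        holder.handle(new IllegalStateException("resource A"));
        try {
            holder.throwIfAny();
        } catch (Exception e) {
            System.out.println("first call threw with " + e.getSuppressed().length + " suppressed");
        }
        try {
            holder.throwIfAny();
            System.out.println("second call threw nothing");
        } catch (Exception e) {
            System.out.println("unexpected: " + e);
        }
    }
}
```

Because the field is cleared inside the same synchronized method that throws, the owning context would not need to swap in a fresh holder instance, which is the simplification the review comment asks about.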
test/jdk/jdk/crac/ContextOrderTest.java line 67: > 65: var recorder = new LinkedList(); > 66: getGlobalContext().register(new MockResource(recorder, null, "regular1")); > 67: JDKResource resource2 = new MockResource(recorder, NORMAL, "jdk-normal"); Looks like the local vars are not used? test/jdk/jdk/crac/ContextOrderTest.java line 106: > 104: > 105: // blocks register into the same OrderedContext > 106: Context context = new BlockingOrderedContext<>(); Why does this create a children context rather than using global one directly? test/jdk/jdk/crac/ContextOrderTest.java line 256: > 254: > 255: if (priority != null) { > 256: priority.getContext().register(this); Registering to global context explicitly and to local one implicitly is confusing; if the mock does not require the `priority` field just remove the constructor arg and register it directly as well. test/jdk/jdk/crac/ContextOrderTest.java line 288: > 286: private CreatingResource(List recorder, Priority priority, String id, Priority childPriority) { > 287: super(recorder, priority, id); > 288: this.childContext = (Context) childPriority.getContext(); // XXX `// XXX` ------------- Changes requested by rvansa at github.com (no known OpenJDK username). 
PR Review: https://git.openjdk.org/crac/pull/74#pullrequestreview-1432550985 PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1197759128 PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1197722674 PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1197740085 PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1197730194 PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1197726046 PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1197762902 PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1197751634 PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1197747285 PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1197756264 PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1197748743 From duke at openjdk.org Thu May 18 14:00:19 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Thu, 18 May 2023 14:00:19 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v14] In-Reply-To: References: Message-ID: On Thu, 4 May 2023 17:39:25 GMT, Anton Kozlov wrote: >> Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: >> >> -altstack > > src/hotspot/share/runtime/stubCodeGenerator.cpp line 62: > >> 60: void StubCodeDesc::thaw() { >> 61: assert(_frozen, "repeated thaw operation"); >> 62: _frozen = false; > > Is it still necessary? I've tried to comment this line out, and checkpoint-restore succeded for me. I get during restore: # Internal Error (../../src/hotspot/share/runtime/stubCodeGenerator.hpp:72), pid=12265, tid=12273 # assert(!_frozen) failed: no modifications allowed Did you really use a `*debug` build (and not `release` build)? The crash above has been generated on `slowdebug`. 
-------------

PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1197863674

From heidinga at redhat.com Thu May 18 14:45:17 2023
From: heidinga at redhat.com (Dan Heidinga)
Date: Thu, 18 May 2023 10:45:17 -0400
Subject: CFV: New CRaC Committer: Radim Vansa
In-Reply-To:
References:
Message-ID:

Vote: yes

--Dan

On Thu, May 18, 2023 at 7:41 AM Anton Kozlov wrote:

> I hereby nominate Radim Vansa to CRaC Committer.
>
> Radim is an engineer at Azul and has contributed 11 patches [3], and more
> to come. Becoming the Committer acknowledges his contributions and passion
> to make the project successful, easing future works on the project.
>
> Votes are due by Thursday, 1 June 2023, 12:00 UTC.
>
> Only current CRaC Committers [1] are eligible to vote
> on this nomination. Votes must be cast in the open by replying
> to this mailing list.
>
> For Lazy Consensus voting instructions, see [2].
>
> Anton Kozlov
>
> [1] https://openjdk.org/census
> [2] https://openjdk.org/projects/#committer-vote
> [3] https://github.com/openjdk/crac/pulls?q=is%3Apr+author%3Arvansa+is%3Aclosed+label%3Aintegrated

From akozlov at openjdk.org Thu May 18 15:38:38 2023
From: akozlov at openjdk.org (Anton Kozlov)
Date: Thu, 18 May 2023 15:38:38 GMT
Subject: [crac] RFR: Ensure all notifications finish even if only daemon threads remain [v6]
In-Reply-To:
References:
Message-ID:

> If, as a result of beforeCheckpoint(), no non-daemon threads remain, the VM may exit prematurely, before any afterRestore() gets a chance to create another non-daemon thread that would keep the VM alive. Triggering the checkpoint via jcmd (so that the checkpointRestore() method is executed on the daemon attach-listener thread) increases the probability of hitting the problem.
>
> The change ensures all notifications are done while there is at least one non-daemon thread.
The notification methods are still called from the original thread. Anton Kozlov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits: - Fix test build - Merge remote-tracking branch 'jdk/crac/crac' into daemon-after-restore - Move KeepAlive to separate class, handle interrupts - Cleanup - Fix recursiveCheckpoint test - Test update - Fix copyright - Notify on the original thread - Ensure all notifications finish even if only daemon threads remain ------------- Changes: https://git.openjdk.org/crac/pull/62/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=62&range=05 Stats: 203 lines in 4 files changed: 188 ins; 7 del; 8 mod Patch: https://git.openjdk.org/crac/pull/62.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/62/head:pull/62 PR: https://git.openjdk.org/crac/pull/62 From akozlov at openjdk.org Thu May 18 15:59:15 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 18 May 2023 15:59:15 GMT Subject: [crac] RFR: Introduce per-Priority Context with different policies In-Reply-To: References: Message-ID: On Thu, 18 May 2023 11:48:12 GMT, Radim Vansa wrote: >> A follow-up work for #60: >> >> * Each priority now has a dedicated context, so contextes may provide different policies. CALL_SITE now uses new CriticalUnorderedContext that runs beforeCheckpoint on concurrent registration, fixes [1]. Whether or not CALL_SITE needs to be registered to at all is an open question and out of scope of this PR. >> * the Global Context reverted from BlockingOrderedContext to OrderedContext, as that may have a huge impact on users. Probably we'll want to expose blocking/criticalUnorderd context along the global one, or at some point expose an implementation. But this is also out of scope of the PR. 
>> * hierachy of the Context implementations are cleaned up a bit [2] >> >> The JDKContext is now just a holder of ClaimedFDs, I'll address this in a follow-up that depends on this Context follow-up. >> >> [1] https://github.com/openjdk/crac/pull/60#issuecomment-1545588281 >> [2] https://github.com/openjdk/crac/pull/60#discussion_r1185510445 > > src/java.base/share/classes/jdk/crac/impl/CriticalUnorderedContext.java line 54: > >> 52: synchronized (this) { >> 53: ExceptionHolder e = concurrentCheckpointException; >> 54: concurrentCheckpointException = new ExceptionHolder<>(CheckpointException::new); > > Rather than replacing the instance, could we make `throwIfAny` reset its state before throwing? That way we won't get the same exception thrown twice by accident. Exception holder is inteded to be a short-living object. Doing that is possible, but will also create a temptation to reuse-the object, that will complicate its lifecycle without a need. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1197996500 From akozlov at openjdk.org Thu May 18 16:16:23 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 18 May 2023 16:16:23 GMT Subject: [crac] RFR: Minor code cleanup and improvements In-Reply-To: <_HutEHb0_FgiGAmImfmICJMIl2KJX8IlDVwpxTdAw28=.db1c4633-74a9-462f-a809-7ef3c4cc236f@github.com> References: <_AMDGDW7jBHD0sWCap1QKdw3fpk8l4yG8SquNCZdRRY=.efb87be3-45f8-4c99-b470-b7cba2669ed7@github.com> <_HutEHb0_FgiGAmImfmICJMIl2KJX8IlDVwpxTdAw28=.db1c4633-74a9-462f-a809-7ef3c4cc236f@github.com> Message-ID: On Thu, 11 May 2023 06:13:12 GMT, Radim Vansa wrote: >> src/java.base/share/classes/javax/crac/CheckpointException.java line 50: >> >>> 48: * @param message the detail message. >>> 49: */ >>> 50: public CheckpointException(String message) { >> >> What if we remove this constructor and hide the other one? 
https://github.com/openjdk/crac/pull/60/files#r1190070472 > > First of all, I think that `javax.crac` should mirror `jdk.crac` API- and docs-wise. It will be much easier when everyone will be able to just change the imports. > > About the constructor with message: I find a bit confusing when an exception is thrown because of some problem with `criu` but there's no actionable message. I have added a simple 'Native checkpoint failed' but we should probably point user to the dump4.log file. (`criuengine` should also make some sanity checks on permissions but that's another thing). I wouldn't object to hiding it, though. > > About the one without: `Context` narrows the `throws` to CE/RE and since we expect users to implement Context this would give them no chance to throw checked exceptions, not even the aggregating one. Hiding that won't work. Yes, that is a bug javax.crac does not follow jdk.crac. Agree about the constructor without args. And still want to delete the a constructor with message. We can introduce a jdk.crac.impl.CheckpointMessageException (along j.c.i.CheckpointOpenResourceException) which we use to communicate different reason(s) checkpoint is not successful. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/64#discussion_r1198015311 From akozlov at openjdk.org Thu May 18 16:17:40 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 18 May 2023 16:17:40 GMT Subject: [crac] RFR: Introduce per-Priority Context with different policies In-Reply-To: References: Message-ID: <7Bh4cgurwfXYb_q3l6ZPmAghMtyhKWGZrUfYhcrevfs=.075e66c5-9f58-4c82-b596-3317335419bb@github.com> On Thu, 18 May 2023 11:52:30 GMT, Radim Vansa wrote: >> A follow-up work for #60: >> >> * Each priority now has a dedicated context, so contextes may provide different policies. CALL_SITE now uses new CriticalUnorderedContext that runs beforeCheckpoint on concurrent registration, fixes [1]. 
Whether or not CALL_SITE needs to be registered to at all is an open question and out of scope of this PR. >> * the Global Context reverted from BlockingOrderedContext to OrderedContext, as that may have a huge impact on users. Probably we'll want to expose blocking/criticalUnorderd context along the global one, or at some point expose an implementation. But this is also out of scope of the PR. >> * hierachy of the Context implementations are cleaned up a bit [2] >> >> The JDKContext is now just a holder of ClaimedFDs, I'll address this in a follow-up that depends on this Context follow-up. >> >> [1] https://github.com/openjdk/crac/pull/60#issuecomment-1545588281 >> [2] https://github.com/openjdk/crac/pull/60#discussion_r1185510445 > > src/java.base/share/classes/jdk/crac/impl/ExceptionHolder.java line 37: > >> 35: E exception = get(); >> 36: if (exception.getClass() == e.getClass()) { >> 37: for (Throwable t : e.getSuppressed()) { > > We're losing the message and stack trace here. Previously, if the message was present it was added to the suppressed list as well. > If you want to create aggregate-only exceptions (only suppressed list would be relevant) these should be declared as final, with only one no-arg constructor and stack trace collection disabled - but that would have some problems as discussed in https://github.com/openjdk/crac/pull/64/files#r1190679470 To avoid this PR grow too much, I basically agree with the comments, but still propose an aggregate exception, replied to the comment. 
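
The aggregate-exception idea discussed in this thread can be sketched in plain Java. This is only an illustration under stated assumptions: `AggregatingHolder` is a hypothetical name, not the `ExceptionHolder` class from the PR, and the policy shown (flatten a message-less exception of the aggregate's own type, keep everything else whole so its message and stack trace survive) is one possible reading of the review comments rather than what was merged.

```java
import java.util.function.Supplier;

/** Hypothetical holder that aggregates failures via the suppressed list. */
final class AggregatingHolder<E extends Exception> {
    private final Supplier<E> factory;
    private E aggregate; // created lazily on the first recorded failure

    AggregatingHolder(Supplier<E> factory) {
        this.factory = factory;
    }

    void record(Exception e) {
        if (aggregate == null) {
            aggregate = factory.get();
        }
        if (e.getClass() == aggregate.getClass() && e.getMessage() == null) {
            // Aggregate-only exception of the same type: lift its children.
            for (Throwable t : e.getSuppressed()) {
                aggregate.addSuppressed(t);
            }
        } else {
            // Carries its own message or type: keep it whole so that the
            // message and stack trace are not lost.
            aggregate.addSuppressed(e);
        }
    }

    void throwIfAny() throws E {
        if (aggregate != null) {
            throw aggregate;
        }
    }
}
```

A caller would create one short-lived holder per checkpoint attempt, call `record` for each failing resource, and finish with `throwIfAny()`.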
------------- PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1198015862 From akozlov at openjdk.org Thu May 18 16:23:25 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 18 May 2023 16:23:25 GMT Subject: [crac] RFR: Introduce per-Priority Context with different policies In-Reply-To: References: Message-ID: On Thu, 18 May 2023 12:09:36 GMT, Radim Vansa wrote: >> A follow-up work for #60: >> >> * Each priority now has a dedicated context, so contextes may provide different policies. CALL_SITE now uses new CriticalUnorderedContext that runs beforeCheckpoint on concurrent registration, fixes [1]. Whether or not CALL_SITE needs to be registered to at all is an open question and out of scope of this PR. >> * the Global Context reverted from BlockingOrderedContext to OrderedContext, as that may have a huge impact on users. Probably we'll want to expose blocking/criticalUnorderd context along the global one, or at some point expose an implementation. But this is also out of scope of the PR. >> * hierachy of the Context implementations are cleaned up a bit [2] >> >> The JDKContext is now just a holder of ClaimedFDs, I'll address this in a follow-up that depends on this Context follow-up. >> >> [1] https://github.com/openjdk/crac/pull/60#issuecomment-1545588281 >> [2] https://github.com/openjdk/crac/pull/60#discussion_r1185510445 > > src/java.base/share/classes/jdk/crac/impl/CriticalUnorderedContext.java line 83: > >> 81: if (concurrentCheckpointException != null) { >> 82: try { >> 83: invokeBeforeCheckpoint(resource); > > This context is blocking as well - since you hold the monitor on this object if the code tries to indirectly (via operation in another thread) register other resource, it will deadlock, and cannot be woken up via interrupt. > If the only usage is call sites we know that there won't be a recursive registration, so this not such a big flaw, though. 
Right, this context implementation is something that should suit JDK needs, and the point is to verify that the user is capable of doing something similar with the existing API. So, indeed, there is a possibility of a deadlock with another thread. Fixing this does not seem trivial, so let's assume the JDK will have to do something about this if the problem becomes possible. > test/jdk/jdk/crac/ContextOrderTest.java line 106: > >> 104: >> 105: // blocks register into the same OrderedContext >> 106: Context context = new BlockingOrderedContext<>(); > > Why does this create a children context rather than using global one directly? The Global Context was changed to the non-blocking OrderedContext, as stated in the PR description, but the test needs a BlockingOrderedContext. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1198023550 PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1198024463 From akozlov at openjdk.org Thu May 18 16:30:20 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 18 May 2023 16:30:20 GMT Subject: [crac] RFR: Introduce per-Priority Context with different policies In-Reply-To: References: Message-ID: On Thu, 18 May 2023 12:23:27 GMT, Radim Vansa wrote: >> A follow-up work for #60: >> >> * Each priority now has a dedicated context, so contextes may provide different policies. CALL_SITE now uses new CriticalUnorderedContext that runs beforeCheckpoint on concurrent registration, fixes [1]. Whether or not CALL_SITE needs to be registered to at all is an open question and out of scope of this PR. >> * the Global Context reverted from BlockingOrderedContext to OrderedContext, as that may have a huge impact on users. Probably we'll want to expose blocking/criticalUnorderd context along the global one, or at some point expose an implementation. But this is also out of scope of the PR.
>> * hierachy of the Context implementations are cleaned up a bit [2] >> >> The JDKContext is now just a holder of ClaimedFDs, I'll address this in a follow-up that depends on this Context follow-up. >> >> [1] https://github.com/openjdk/crac/pull/60#issuecomment-1545588281 >> [2] https://github.com/openjdk/crac/pull/60#discussion_r1185510445 > > test/jdk/jdk/crac/ContextOrderTest.java line 67: > >> 65: var recorder = new LinkedList(); >> 66: getGlobalContext().register(new MockResource(recorder, null, "regular1")); >> 67: JDKResource resource2 = new MockResource(recorder, NORMAL, "jdk-normal"); > > Looks like the local vars are not used? Overlooked Resources are creating strong refs in their constructors. > test/jdk/jdk/crac/ContextOrderTest.java line 288: > >> 286: private CreatingResource(List recorder, Priority priority, String id, Priority childPriority) { >> 287: super(recorder, priority, id); >> 288: this.childContext = (Context) childPriority.getContext(); // XXX > > `// XXX` Indeed, I feel something wrong with the resulting types in the test, but not sure how to fix. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1198030533 PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1198028792 From duke at openjdk.org Fri May 19 06:55:21 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 19 May 2023 06:55:21 GMT Subject: [crac] RFR: Minor code cleanup and improvements In-Reply-To: References: <_AMDGDW7jBHD0sWCap1QKdw3fpk8l4yG8SquNCZdRRY=.efb87be3-45f8-4c99-b470-b7cba2669ed7@github.com> <_HutEHb0_FgiGAmImfmICJMIl2KJX8IlDVwpxTdAw28=.db1c4633-74a9-462f-a809-7ef3c4cc236f@github.com> Message-ID: <6DVU9HYy46e5dglqxqUXNfl17sKEM7XuP1H5kplhEC8=.caead7f0-602c-4fa3-a1e5-97b09a50beb4@github.com> On Thu, 18 May 2023 16:13:58 GMT, Anton Kozlov wrote: >> First of all, I think that `javax.crac` should mirror `jdk.crac` API- and docs-wise. It will be much easier when everyone will be able to just change the imports. 
>> >> About the constructor with message: I find a bit confusing when an exception is thrown because of some problem with `criu` but there's no actionable message. I have added a simple 'Native checkpoint failed' but we should probably point user to the dump4.log file. (`criuengine` should also make some sanity checks on permissions but that's another thing). I wouldn't object to hiding it, though. >> >> About the one without: `Context` narrows the `throws` to CE/RE and since we expect users to implement Context this would give them no chance to throw checked exceptions, not even the aggregating one. Hiding that won't work. > > Yes, that is a bug javax.crac does not follow jdk.crac. > > Agree about the constructor without args. > > And still want to delete the a constructor with message. We can introduce a jdk.crac.impl.CheckpointMessageException (along j.c.i.CheckpointOpenResourceException) which we use to communicate different reason(s) checkpoint is not successful. What about turning the inheritance the other way: final CheckpointAggregateException extends CheckpointException? This would be easily distinguishable, no need to change anything on Context interface. The no-arg constructor in CheckpointException would be protected. It's kind of natural that exceptions carry messages. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/64#discussion_r1198609760 From duke at openjdk.org Fri May 19 07:04:12 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 19 May 2023 07:04:12 GMT Subject: [crac] RFR: Introduce per-Priority Context with different policies In-Reply-To: References: Message-ID: On Thu, 18 May 2023 16:21:07 GMT, Anton Kozlov wrote: >> test/jdk/jdk/crac/ContextOrderTest.java line 106: >> >>> 104: >>> 105: // blocks register into the same OrderedContext >>> 106: Context context = new BlockingOrderedContext<>(); >> >> Why does this create a children context rather than using global one directly? 
> > Global Context was changed to non-blocking OrderedContext as stated in the PR description, but the test needs BlockingOrderedContext. Yes, but it does not explain much about the **reason** but "... as that may have a huge impact on users". I don't think there's much concern about backwards compatibility at this point; it's more important to have the best UX even in case that users do an error. If you want to conserve certain behaviour, write a test that will validate what happens when you attempt to register into global context in one of those notifications. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1198615862 From duke at openjdk.org Fri May 19 10:25:18 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 19 May 2023 10:25:18 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v10] In-Reply-To: References: Message-ID: > There are various places both inside JDK and in libraries that rely on monotonicity of `System.nanotime()`. When the process is restored on a different machine the value will likely differ as the implementation provides time since machine boot. This PR records wall clock time before checkpoint and after restore and tries to adjust the value provided by nanotime() to reasonably correct value. Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 23 commits: - Merge branch 'crac' into nanotime - Merge branch 'crac' into nanotime - More checks when reading boot ID - Merge branch 'crac' into nanotime - Do not use negative monotonic offset - Merge branch 'crac' into nanotime - Fix whitespaces - Use image under ghcr.io/crac - Ensure monotonicity for the same boot - Set nanotime only if bootid changes - ... 
and 13 more: https://git.openjdk.org/crac/compare/ed3efac0...7d7a4103 ------------- Changes: https://git.openjdk.org/crac/pull/53/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=53&range=09 Stats: 313 lines in 7 files changed: 285 ins; 0 del; 28 mod Patch: https://git.openjdk.org/crac/pull/53.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/53/head:pull/53 PR: https://git.openjdk.org/crac/pull/53 From duke at openjdk.org Fri May 19 10:27:23 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 19 May 2023 10:27:23 GMT Subject: [crac] RFR: Correct System.nanotime() value after restore [v2] In-Reply-To: References: <-i0uB8ZW7r54hoKQJ_wODUXNKVkOI5rH7SJTEhSHiDw=.75ebe53a-9081-40c6-911f-048b17e8850e@github.com> Message-ID: <3rjnAFLrr83oKnGW5Jlg9wIgKgon3a4ttfGW6dXs1C8=.8487ce48-3a6b-4f27-865f-7cc851ad46b5@github.com> On Thu, 11 May 2023 14:39:21 GMT, Jan Kratochvil wrote: >>> Since you mention it was a bug in kernel or criu and it has been almost 3 years since your commit, may be it is worth enabling the criu changes again to see if the timedwait problem still exists, unless you have already done that. >> >> AFAIK the bug is fixed, but I see no point of relying on OS here. Is there one? Timens that is not changed by CRIU provides correct values for our nanoTime() [1]. >> >>> The value returned represents nanoseconds since some fixed but arbitrary origin time (perhaps in the future, so values may be negative). 
The same origin is used by all invocations of this method in an instance of a Java virtual machine >> >> [1] https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/System.html#nanoTime() > > Upstream criu does provide the time namespace as stated by @AntonKozlov: > > CLOCK_MONOTONIC=301.735134591 CLOCK_BOOTTIME=301.735155494 > CLOCK_MONOTONIC=302.735345917 CLOCK_BOOTTIME=302.735358109 > Warn (compel/arch/x86/src/lib/infect.c:356): Will restore 7757 with interrupted system call > [1]+ Killed ./clock_gettime > restore: > CLOCK_MONOTONIC=302.803360137 CLOCK_BOOTTIME=302.803373299 > restore: > CLOCK_MONOTONIC=302.806677876 CLOCK_BOOTTIME=302.806696098 > > I do not see why JVM should reimplement what CRIU already does. One can solve that in the future when CRaC is really going to run on non-Linux system. One will need to port or reimplement there CRIU in the first place. @jankratochvil I have a fix for the sleepers in https://github.com/rvansa/crac/tree/timed_wait , will publish it as PR once this gets integrated. Also, your trouble with libbsd should be resolved once https://github.com/CRaC/container-images/pull/4 gets integrated. ------------- PR Comment: https://git.openjdk.org/crac/pull/53#issuecomment-1554361596 From duke at openjdk.org Fri May 19 12:19:00 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 19 May 2023 12:19:00 GMT Subject: [crac] RFR: Minor code cleanup and improvements [v2] In-Reply-To: References: Message-ID: <9BU3VY9dp5Ox-OEQ5AT-gCNC5QHEJ_aDG-DiMB549Pk=.1b3634ef-cfd4-45d7-b1f4-60051036e4d7@github.com> > Extracted non-essential changes from other PR. 
Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Use Combiner exception class for aggregate-only exceptions ------------- Changes: - all: https://git.openjdk.org/crac/pull/64/files - new: https://git.openjdk.org/crac/pull/64/files/a811566e..64e8ac66 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=64&range=01 - incr: https://webrevs.openjdk.org/?repo=crac&pr=64&range=00-01 Stats: 97 lines in 9 files changed: 69 ins; 4 del; 24 mod Patch: https://git.openjdk.org/crac/pull/64.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/64/head:pull/64 PR: https://git.openjdk.org/crac/pull/64 From duke at openjdk.org Fri May 19 12:24:15 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 19 May 2023 12:24:15 GMT Subject: [crac] RFR: Minor code cleanup and improvements [v3] In-Reply-To: References: Message-ID: > Extracted non-essential changes from other PR. Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Handle .Combined in Core ------------- Changes: - all: https://git.openjdk.org/crac/pull/64/files - new: https://git.openjdk.org/crac/pull/64/files/64e8ac66..70182912 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=64&range=02 - incr: https://webrevs.openjdk.org/?repo=crac&pr=64&range=01-02 Stats: 11 lines in 1 file changed: 4 ins; 3 del; 4 mod Patch: https://git.openjdk.org/crac/pull/64.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/64/head:pull/64 PR: https://git.openjdk.org/crac/pull/64 From duke at openjdk.org Fri May 19 12:41:38 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 19 May 2023 12:41:38 GMT Subject: [crac] RFR: Use special class for exception aggregates [v4] In-Reply-To: References: Message-ID: > Extracted non-essential changes from other PR. Radim Vansa has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains four commits: - Merge branch 'crac' into code_cleanup - Handle .Combined in Core - Use Combiner exception class for aggregate-only exceptions - Minor code cleanup and improvements ------------- Changes: https://git.openjdk.org/crac/pull/64/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=64&range=03 Stats: 203 lines in 12 files changed: 118 ins; 39 del; 46 mod Patch: https://git.openjdk.org/crac/pull/64.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/64/head:pull/64 PR: https://git.openjdk.org/crac/pull/64 From duke at openjdk.org Fri May 19 13:17:14 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 19 May 2023 13:17:14 GMT Subject: [crac] RFR: Linux file system watcher support [v2] In-Reply-To: References: Message-ID: On Wed, 17 May 2023 12:19:50 GMT, joeylee wrote: >> inotify monitors changes on filesystem, this support automatic restore for LinuxFileWatcher. >> >> FileWatcherAfterRestoreTest verifies watcher service works after restore. >> FileWatcherTest verifies automatic closing inotiify fd >> >> The watcher keys are still managed by user, so exception will be thrown if no watcher keys are leaked, as in FileWatcherWithOpenKeysTest > > joeylee has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > Linux file system watcher support Thanks for fixing the nitpicks; I'll wait for the test that demonstrates code managing WatchKey with this concurrent checkpoint. 
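
As a rough illustration of what "code managing WatchKey" could look like on the user side, here is a minimal sketch using only the standard `java.nio.file` API. It does not use any CRaC class; the name `WatchKeyLifecycle` is hypothetical, and whether cancelling the key is exactly what the PR's leak check expects before a checkpoint is an assumption, not something stated in the thread.

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

final class WatchKeyLifecycle {
    /**
     * Registers {@code dir} with a fresh watcher, then cancels the key the
     * way user code might do before a checkpoint.
     * Returns {validBeforeCancel, validAfterCancel}.
     */
    static boolean[] registerAndCancel(Path dir) throws IOException {
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            WatchKey key = dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            boolean before = key.isValid();
            key.cancel();                  // user code releases its own key
            boolean after = key.isValid(); // a cancelled key is no longer valid
            return new boolean[] { before, after };
        }
    }
}
```

A test for the concurrent case would additionally need a checkpoint racing with `register` and `cancel`, which this sketch deliberately leaves out.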
-------------

PR Review: https://git.openjdk.org/crac/pull/72#pullrequestreview-1434402460

From jack.koenig3 at gmail.com Fri May 19 20:58:16 2023
From: jack.koenig3 at gmail.com (Jack Koenig)
Date: Fri, 19 May 2023 13:58:16 -0700
Subject: Problems with /var/lib/sss/mc/passwd
In-Reply-To:
References:
Message-ID:

Hello Radim,

Thank you for your response, sorry for breaking the thread--I had digests on and cannot figure out how to set "In-Reply-To" from gmail.

`-XX:CRaCIgnoredFileDescriptors=/var/lib/sss/mc/passwd` sounds like exactly what I need. Unfortunately it doesn't seem to work in this case; no idea why, but with it set I get the exact same error. I have tried to reproduce in both CentOS and Ubuntu Docker containers but have been unsuccessful--the circumstances that lead to this situation are beyond my Linux knowledge.

In any case, I was able to make forward progress by using gdb to force close the file descriptor (lol). For anyone in the future who comes across this thread, you can just determine the PID of the process you wish to checkpoint, and determine the file descriptor number for /var/lib/sss/mc/passwd (for me it was always 4 which is interesting), then do the following:

    $ gdb -p <PID>
    (gdb) call (int)close(<FD>)
    (gdb) quit

After force closing the file descriptor I was able to take a checkpoint.

Now, with a successful checkpoint I then tried to restore from the checkpoint and failed with:

    Error (criu/cr-restore.c:1335): Failed to write 897973 to /proc/sys/kernel/ns_last_pid: Operation not permitted
    Error (criu/cr-restore.c:1506): Can't fork for 897974: Operation not permitted
    Error (criu/cr-restore.c:2593): Restoring FAILED.
    Error (criu/cr-restore.c:1823): Pid 915630 do not match expected 897974

Since my goal is to create many processes from the same checkpoint, needing the same PID is going to be problematic, so I've started trying to see if I can use unshare to create a namespace.
When I create a new namespace with:

    unshare -mrp --mount-proc --fork

And then run the process I wish to checkpoint, to my pleasant surprise, /var/lib/sss/mc/passwd is not open, so this seems to coincidentally solve that issue. However, I am not able to create a checkpoint; when I run `jcmd <PID> JDK.checkpoint` I get:

    JVM: invalid info for restore provided: queued code -1
    An exception during a checkpoint operation:
    jdk.internal.crac.CheckpointException
            at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:141)
            at java.base/jdk.internal.crac.Core.checkpointRestore(Core.java:246)
            at java.base/jdk.internal.crac.Core.checkpointRestoreInternal(Core.java:262)

The error isn't super precise, but I suspect the issue is that jcmd cannot find the process: if I run `jcmd -l`, nothing shows up. Note I am running this jcmd in the same namespace, but clearly I have done something wrong.

If I try to create a checkpoint from outside the namespace using the real PID, the process prints a stack trace and the checkpoint fails with:

    com.sun.tools.attach.AttachNotSupportedException: Unable to open socket file: target process not responding or HotSpot VM not loaded
            at sun.tools.attach.LinuxVirtualMachine.<init>(LinuxVirtualMachine.java:106)
            at sun.tools.attach.LinuxAttachProvider.attachVirtualMachine(LinuxAttachProvider.java:63)
            at com.sun.tools.attach.VirtualMachine.attach(VirtualMachine.java:208)
            at sun.tools.jcmd.JCmd.executeCommandForPid(JCmd.java:147)
            at sun.tools.jcmd.JCmd.main(JCmd.java:131)

Does anyone have any experience here? Is this approach of using unshare to create a new namespace going in the right direction?

Thank you!
Jack

On Thu, 18 May 2023 11:58:04 +0200 Radim Vansa wrote:
>
> Hello Jack,
>
> the proper venue could be the Foojay.io forums [1] (yes, only recently
> created) or #crac channel on Foojay slack, but this list will do :)
>
> Can you try running the checkpoint with
> `-XX:CRaCIgnoredFileDescriptors=/var/lib/sss/mc/passwd` ?
> This should bypass the checks, though problems may arise on restore if this file
> changes when the application is in checkpoint.
>
> Radim
>
> [1] https://forums.foojay.io/forums/forum/coordinated-restore-at-checkpoint-crac/
>
> On 18. 05. 23 3:37, Jack Koenig wrote:
> >
> > Hello everyone,
> >
> > This is more of a user question, so I apologize if this is the wrong
> > venue--please direct me to the right place as appropriate.
> >
> > I am attempting to checkpoint my application but I get an exception
> > saying that /var/lib/sss/mc/passwd is open:
> >
> > An exception during a checkpoint operation:
> >
> > jdk.internal.crac.CheckpointException
> >         at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:141)
> >         at java.base/jdk.internal.crac.Core.checkpointRestore(Core.java:246)
> >         at java.base/jdk.internal.crac.Core.checkpointRestoreInternal(Core.java:262)
> >         Suppressed: jdk.internal.crac.impl.CheckpointOpenFileException: /var/lib/sss/mc/passwd
> >                 at java.base/jdk.internal.crac.Core.translateJVMExceptions(Core.java:87)
> >                 at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:145)
> >                 ... 2 more
> >
> > The only thing I've found mentioning a similar issue is this old
> > thread:
> > https://mail.openjdk.org/pipermail/crac-dev/2022-January/000079.html
> >
> > The workaround posted there involves system-level configuration
> > changes, but I am an unprivileged user on a shared RHEL8 machine so
> > cannot apply such a workaround.
> >
> > Is there anything I can do to resolve or at least workaround this issue?
> >
> > Cheers,
> > Jack

From duke at openjdk.org Sat May 20 04:06:08 2023
From: duke at openjdk.org (Jan Kratochvil)
Date: Sat, 20 May 2023 04:06:08 GMT
Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v20]
In-Reply-To: 
References: 
Message-ID: <9lVsQ6-3CqNf4igJ1eVOqe_C05bmjWNNKwGD6xO_XXk=.16fef9fd-e790-4358-9653-39e62edc579e@github.com>

> Currently, if you checkpoint with `-XX:CRaCCheckpointTo` on a better CPU and restore with `-XX:CRaCRestoreFrom` on a worse CPU, the restored OpenJDK will crash.
>
> 1. An obvious reason is that JIT-compiled code uses CPU features not implemented on the CPU where the image is restored.
> 2. A second reason is that glibc has a similar problem: its PLT entries point to CPU-optimized functions, which also crash on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC
>
> (1) could be solved automatically by deoptimizing and re-JITing all the JIT-compiled code, but that would defeat the performance goal of restoring a ready image in the first place. Therefore a new OpenJDK option had to be implemented:
>
>> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine
>
> It is intended to specify the lowest common denominator of all CPUs in a farm. Instead of possibly crashing, OpenJDK will now refuse to run:
>
>> Error occurred during initialization of VM
>> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi
>
> (2) has been implemented according to Anton Kozlov's idea that glibc can reset its IFUNC PLT entries at any later time (after restore), not just during its first initialization. The current problem is that this has turned out to be very invasive into private glibc structures. It could work with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed, but that has been considered an unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept, and I will propose such a feature to glibc upstream, where it should be easy to implement.
>
> If the upstream glibc maintainers do not like the IFUNC reset idea, then I do not think this hacky IFUNC reset, which patches many internal glibc data structures, is a good way forward for a third-party implementation like CRaC/OpenJDK. In that case I believe one should switch to the GLIBC_TUNABLES environment variable, re-executing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into the glibc GLIBC_TUNABLES format. Unfortunately there...

Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision:

  Support !INCLUDE_CPU_FEATURE_ACTIVE via ld.so execution.
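
The feature-mask arithmetic in the error message quoted above can be checked independently. The following is a minimal sketch, assuming (as the message suggests) that each CPU feature is a single bit of the `-XX:CPUFeatures` value. The class and method names are hypothetical and are not part of the patch; only the three hexadecimal constants come from the message.

```java
// Hypothetical helper, not part of OpenJDK: it only illustrates the bitmask arithmetic.
public class CpuFeatureMasks {

    // Features recorded in the checkpoint image that the restoring CPU lacks.
    public static long missing(long imageFeatures, long machineFeatures) {
        return imageFeatures & ~machineFeatures;
    }

    // "Lowest common denominator" of a farm: only bits set on every CPU survive.
    public static long lowestCommon(long... cpuMasks) {
        long common = -1L; // all bits set, the identity element for AND
        for (long mask : cpuMasks) {
            common &= mask;
        }
        return common;
    }

    public static void main(String[] args) {
        long image = 0x7fff9dfcfbf7L;   // CPU features stored in the checkpoint (from the message)
        long machine = 0x421801fcfbd7L; // CPU features of the restoring machine (from the message)
        System.out.printf("missing = 0x%x%n", missing(image, machine));
        System.out.printf("common  = 0x%x%n", lowestCommon(image, machine));
    }
}
```

Under that one-bit-per-feature assumption, `missing` reproduces the `0x3de79c000020` reported in the message, and `lowestCommon` yields the `0x421801fcfbd7` that the message asks to pass at checkpoint time.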
-------------

Changes:
  - all: https://git.openjdk.org/crac/pull/41/files
  - new: https://git.openjdk.org/crac/pull/41/files/caaa3a27..476746f4

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=19
 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=18-19

  Stats: 219 lines in 1 file changed: 182 ins; 2 del; 35 mod
  Patch: https://git.openjdk.org/crac/pull/41.diff
  Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41

PR: https://git.openjdk.org/crac/pull/41

From duke at openjdk.org Sun May 21 00:17:27 2023
From: duke at openjdk.org (Jan Kratochvil)
Date: Sun, 21 May 2023 00:17:27 GMT
Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v21]
In-Reply-To: 
References: 
Message-ID: 

Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision:

  +ld_so_list_diagnostics

-------------

Changes:
  - all: https://git.openjdk.org/crac/pull/41/files
  - new: https://git.openjdk.org/crac/pull/41/files/476746f4..19dc7250

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=20
 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=19-20

  Stats: 158 lines in 4 files changed: 147 ins; 4 del; 7 mod
  Patch: https://git.openjdk.org/crac/pull/41.diff
  Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41

PR: https://git.openjdk.org/crac/pull/41

From rvansa at azul.com Mon May 22 08:25:00 2023
From: rvansa at azul.com (Radim Vansa)
Date: Mon, 22 May 2023 10:25:00 +0200
Subject: Problems with /var/lib/sss/mc/passwd
In-Reply-To: 
References: 
Message-ID: <604acfe2-4eb8-1209-b150-1305ef1eb389@azul.com>

Hi,

I've replied on the forums [1], please continue in there.

Cheers,

Radim

[1] https://forums.foojay.io/forums/topic/problems-with-var-lib-sss-mc-passwd/#post-138

On 19. 05. 23 22:58, Jack Koenig wrote:
> Hello Radim,
>
> Thank you for your response, sorry for breaking the thread--I had
> digests on and cannot figure out how to set "In-Reply-To" from gmail.
>
> `-XX:CRaCIgnoredFileDescriptors=/var/lib/sss/mc/passwd` sounds like
> exactly what I need, unfortunately it doesn't seem to work in this
> case, no idea why but with it set I get the exact same error. I have
> tried to reproduce in both CentOS and Ubuntu Docker containers but
> have been unsuccessful--the circumstances that lead to this situation
> are beyond my Linux knowledge.
>
> In any case, I was able to make forward progress by using gdb to force
> close the file descriptor (lol). For anyone in the future who comes
> across this thread, you can just determine the PID of the process you
> wish to checkpoint, and determine the file descriptor number for
> /var/lib/sss/mc/passwd (for me it was always 4 which is interesting),
> then do the following:
> $ gdb -p <PID>
> (gdb) call (int)close(<FD>)
> (gdb) quit
>
> After force closing the file descriptor I was able to take a checkpoint.
>
> Now, with a successful checkpoint I then tried to restore from the
> checkpoint and failed with:
>
> Error (criu/cr-restore.c:1335): Failed to write 897973 to
> /proc/sys/kernel/ns_last_pid: Operation not permitted
> Error (criu/cr-restore.c:1506): Can't fork for 897974: Operation not permitted
> Error (criu/cr-restore.c:2593): Restoring FAILED.
> Error (criu/cr-restore.c:1823): Pid 915630 do not match expected 897974
>
> Since my goal is to create many processes from the same checkpoint,
> needing the same PID is going to be problematic, so I've started
> trying to see if I can use unshare to create a namespace.
>
> When I create a new namespace with:
> unshare -mrp --mount-proc --fork
>
> And then run the process I wish to checkpoint, to my pleasant
> surprise, /var/lib/sss/mc/passwd is not open, so this seems to
> coincidentally solve that issue.
>
> However, I am not able to create a checkpoint, when I run
> `jcmd <PID> JDK.checkpoint` I get:
>
> JVM: invalid info for restore provided: queued code -1
> An exception during a checkpoint operation:
> jdk.internal.crac.CheckpointException
> at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:141)
> at java.base/jdk.internal.crac.Core.checkpointRestore(Core.java:246)
> at java.base/jdk.internal.crac.Core.checkpointRestoreInternal(Core.java:262)
>
> The error isn't super precise, but I suspect the issue is that jcmd
> cannot find the process, if I run `jcmd -l`, nothing shows up. Note I
> am running this jcmd in the same namespace, but clearly I have done
> something wrong.
>
> If I try to create a checkpoint from outside the namespace using the
> real PID, the process prints a stack trace and the checkpoint fails
> with:
>
> com.sun.tools.attach.AttachNotSupportedException: Unable to open
> socket file: target process not responding or HotSpot VM not loaded
> at sun.tools.attach.LinuxVirtualMachine.<init>(LinuxVirtualMachine.java:106)
> at sun.tools.attach.LinuxAttachProvider.attachVirtualMachine(LinuxAttachProvider.java:63)
> at com.sun.tools.attach.VirtualMachine.attach(VirtualMachine.java:208)
> at sun.tools.jcmd.JCmd.executeCommandForPid(JCmd.java:147)
> at sun.tools.jcmd.JCmd.main(JCmd.java:131)
>
> Does anyone have any experience here? Is this approach of using
> unshare to create a new namespace going in the right direction?
>
> Thank you!
> Jack
>
> On Thu, 18 May 2023 11:58:04 +0200 Radim Vansa wrote:
>> Hello Jack,
>>
>> the proper venue could be the Foojay.io forums [1] (yes, only recently
>> created) or #crac channel on Foojay slack, but this list will do :)
>>
>> Can you try running the checkpoint with
>> `-XX:CRaCIgnoredFileDescriptors=/var/lib/sss/mc/passwd` ? This should
>> bypass the checks, though problems may arise on restore if this file
>> changes when the application is in checkpoint.
>>
>> Radim
>>
>> [1]
>> https://forums.foojay.io/forums/forum/coordinated-restore-at-checkpoint-crac/
>>
>> On 18. 05. 23 3:37, Jack Koenig wrote:
>>>
>>> Hello everyone,
>>>
>>> This is more of a user question, so I apologize if this is the wrong
>>> venue--please direct me to the right place as appropriate.
>>>
>>> I am attempting to checkpoint my application but I get an exception
>>> saying that /var/lib/sss/mc/passwd is open:
>>>
>>> An exception during a checkpoint operation:
>>> jdk.internal.crac.CheckpointException
>>>         at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:141)
>>>         at java.base/jdk.internal.crac.Core.checkpointRestore(Core.java:246)
>>>         at java.base/jdk.internal.crac.Core.checkpointRestoreInternal(Core.java:262)
>>>         Suppressed: jdk.internal.crac.impl.CheckpointOpenFileException: /var/lib/sss/mc/passwd
>>>                 at java.base/jdk.internal.crac.Core.translateJVMExceptions(Core.java:87)
>>>                 at java.base/jdk.internal.crac.Core.checkpointRestore1(Core.java:145)
>>>                 ... 2 more
>>>
>>> The only thing I've found mentioning a similar issue is this old
>>> thread:
>>> https://mail.openjdk.org/pipermail/crac-dev/2022-January/000079.html
>>>
>>> The workaround posted there involves system-level configuration
>>> changes, but I am an unprivileged user on a shared RHEL8 machine so
>>> cannot apply such a workaround.
>>>
>>> Is there anything I can do to resolve or at least workaround this issue?
>>>
>>> Cheers,
>>> Jack

From duke at openjdk.org Mon May 22 11:48:22 2023
From: duke at openjdk.org (Jan Kratochvil)
Date: Mon, 22 May 2023 11:48:22 GMT
Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v22]
In-Reply-To: 
References: 
Message-ID: 

Jan Kratochvil has updated the pull request incrementally with four additional commits since the last revision: - refactor - refactor - refactor - Fix el7 compatibility. ------------- Changes: - all: https://git.openjdk.org/crac/pull/41/files - new: https://git.openjdk.org/crac/pull/41/files/19dc7250..96174b55 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=21 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=20-21 Stats: 250 lines in 2 files changed: 128 ins; 112 del; 10 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41 From duke at openjdk.org Mon May 22 12:46:28 2023 From: duke at openjdk.org (Jan Kratochvil) Date: Mon, 22 May 2023 12:46:28 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v14] In-Reply-To: References: Message-ID: <8a9vpNshEy1Mrcx_S2ShDI5PZqQnI2gUhU3J3qflBwk=.bfce96b1-0624-4dce-bdc1-518041383071@github.com> On Thu, 4 May 2023 18:11:01 GMT, Anton Kozlov wrote: > What is GLIBC version supporting the flag? We're used to build JDK on some older platform and assume that will work on every newer platform. > > And it turns out on my platform used for the builds the option is not supported. That should be fixed now. 
-------------

PR Comment: https://git.openjdk.org/crac/pull/41#issuecomment-1557153247

From hakdogan75 at gmail.com Tue May 23 14:28:55 2023
From: hakdogan75 at gmail.com (Huseyin Akdogan)
Date: Tue, 23 May 2023 17:28:55 +0300
Subject: CheckpointException
Message-ID: 

Hi all,

I'm playing with CRaC through a simple example as follows:

    package org.jugistanbul.crac;

    public class Greetings {

        private static int counter;

        public static void main(String[] args) throws InterruptedException {
            // Print a numbered greeting once per second, forever.
            while (true) {
                System.out.println(String.format("%sth greetings from Istanbul", ++counter));
                Thread.sleep(1000);
            }
        }
    }

javac org/jugistanbul/crac/Greetings.java
jar cfm greetings.jar manifest.txt org/jugistanbul/crac/Greetings.class
java -jar greetings.jar

1th greetings from Istanbul
2th greetings from Istanbul
3th greetings from Istanbul
...

Everything works as expected. Then I run

java -XX:CRaCCheckpointTo=image -jar greetings.jar
(or java -XX:CRaCCheckpointTo=image -cp ./greetings.jar org.jugistanbul.crac.Greetings)

jcmd greetings.jar JDK.checkpoint

and get this error:

1th greetings from Istanbul
2th greetings from Istanbul
3th greetings from Istanbul
4th greetings from Istanbul
5th greetings from Istanbul
May 23, 2023 2:14:04 PM jdk.internal.util.jar.PersistentJarFile beforeCheckpoint
INFO: /home/ubuntu/greetings/greetings.jar is recorded as always available on restore
JVM: invalid info for restore provided: queued code -1
An exception during a checkpoint operation:
jdk.crac.CheckpointException
        at java.base/jdk.crac.Core.checkpointRestore1(Core.java:141)
        at java.base/jdk.crac.Core.checkpointRestore(Core.java:246)
        at java.base/jdk.crac.Core.checkpointRestoreInternal(Core.java:262)
6th greetings from Istanbul
7th greetings from Istanbul
8th greetings from Istanbul
9th greetings from Istanbul
10th greetings from Istanbul

This is the line pointed to in the exception output:
https://github.com/openjdk/crac/blob/ed3efac0d047d7822203536df851ccdea585d8ac/src/java.base/share/classes/jdk/crac/Core.java#L141

However, that line did not give me a clear idea of what the problem is.
Any ideas to enlighten me?

java -version
openjdk version "17-crac" 2021-09-14
OpenJDK Runtime Environment (build 17-crac+5-19)
OpenJDK 64-Bit Server VM (build 17-crac+5-19, mixed mode, sharing)

uname -v
#26~22.04.1-Ubuntu SMP Mon Apr 24 01:58:15 UTC 2023

Sorry if this mailing list is the wrong place to post the question; if so, please ignore this message.

Best

--
*Hüseyin Akdoğan*
Expert Software Consultant

From duke at openjdk.org Tue May 23 15:28:47 2023
From: duke at openjdk.org (joeylee)
Date: Tue, 23 May 2023 15:28:47 GMT
Subject: [crac] RFR: Linux file system watcher support [v3]
In-Reply-To: 
References: 
Message-ID: 

> inotify monitors changes on the filesystem; this patch supports automatic restore for LinuxFileWatcher.
>
> FileWatcherAfterRestoreTest verifies that the watcher service works after restore.
> FileWatcherTest verifies automatic closing of the inotify fd.
>
> The watcher keys are still managed by user, so exception will be thrown if no watcher keys are leaked, as in FileWatcherWithOpenKeysTest

joeylee has updated the pull request incrementally with two additional commits since the last revision:

 - update test
 - update test

-------------

Changes:
  - all: https://git.openjdk.org/crac/pull/72/files
  - new: https://git.openjdk.org/crac/pull/72/files/36dfd224..40191d9b

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=crac&pr=72&range=02
 - incr: https://webrevs.openjdk.org/?repo=crac&pr=72&range=01-02

  Stats: 35 lines in 1 file changed: 25 ins; 1 del; 9 mod
  Patch: https://git.openjdk.org/crac/pull/72.diff
  Fetch: git fetch https://git.openjdk.org/crac.git pull/72/head:pull/72

PR: https://git.openjdk.org/crac/pull/72

From duke at openjdk.org Tue May 23 15:28:59 2023
From: duke at openjdk.org (joeylee)
Date: Tue, 23 May 2023 15:28:59 GMT
Subject: [crac] RFR: Linux file system watcher support
In-Reply-To: 
References: 
Message-ID: 

On Tue, 16 May 2023 08:30:41 GMT, Radim Vansa wrote:

>> inotify monitors changes on the filesystem; this patch supports automatic restore for LinuxFileWatcher.
>>
>> FileWatcherAfterRestoreTest verifies that the watcher service works after restore.
>> FileWatcherTest verifies automatic closing of the inotify fd.
>>
>> The watcher keys are still managed by user, so exception will be thrown if no watcher keys are leaked, as in FileWatcherWithOpenKeysTest
>
> I was hoping that a fastdebug build with assertions on would fail somewhere earlier but it is not the case. I would suspect that since this NULL dereference is happening when the thread already exits there's an earlier event that goes wrong.

Hi @rvansa, thanks for your review.

I have updated FileWatcherAfterRestoreTest with a minimal case where the watch key could be different after restore; the test manually re-registers an alternative path.

Closing the watch service automatically simplifies user code more than just adding a close/reopen step to the watch key management would. Where this patch could help simplify code:

1. When users manage all keys manually, this patch only saves them a `close()` and an `open()` call before and after the checkpoint.
2. In some cases, however, closing and reopening a watch service can be very troublesome, as in the example below. Notice that the `WatchService` field could be final in some third-party libraries; in that case we cannot reopen the service and assign a new value to the field. This patch allows the user to focus only on the keys, which might change after restore.

    import jdk.crac.Core;

    import java.io.IOException;
    import java.nio.file.FileSystems;
    import java.nio.file.WatchService;

    public class Case {

        // Final field: it cannot be reassigned to a fresh WatchService after restore.
        private final WatchService watcher;

        Case() {
            watcher = createWatcher();
        }

        private static WatchService createWatcher() {
            try {
                return FileSystems.getDefault().newWatchService();
            } catch (IOException e) {
                return null;
            }
        }
    }

-------------

PR Comment: https://git.openjdk.org/crac/pull/72#issuecomment-1559112302

From akozlov at openjdk.org Wed May 24 12:53:28 2023
From: akozlov at openjdk.org (Anton Kozlov)
Date: Wed, 24 May 2023 12:53:28 GMT
Subject: [crac] RFR: Introduce per-Priority Context with different policies
In-Reply-To: 
References: 
Message-ID: 

On Fri, 19 May 2023 07:01:01 GMT, Radim Vansa wrote:

>> Global Context was changed to non-blocking OrderedContext as stated in the PR description, but the test needs BlockingOrderedContext.
>
> Yes, but it does not explain much about the **reason** but "... as that may have a huge impact on users". I don't think there's much concern about backwards compatibility at this point; it's more important to have the best UX even in case that users make an error.
>
> If you want to conserve certain behaviour, write a test that will validate what happens when you attempt to register into global context in one of those notifications.

Apparently you are asking why the global context implementation has been changed, not why the test has to use an explicit implementation.

Honestly, I did not realize the global context properties were changed in #60, so I just want to revert that for a while. This was a rather big change, and it deserved a separate line in the javadoc, which is, by the way, the specification, not the tests (although they are definitely useful). I lean toward using BlockingContext, but want to think a bit more about that, together with the spec change.

-------------

PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1204074255

From akozlov at openjdk.org Wed May 24 13:08:29 2023
From: akozlov at openjdk.org (Anton Kozlov)
Date: Wed, 24 May 2023 13:08:29 GMT
Subject: [crac] RFR: Introduce per-Priority Context with different policies
In-Reply-To: 
References: 
Message-ID: 

On Thu, 18 May 2023 12:36:22 GMT, Radim Vansa wrote:

>> A follow-up work for #60:
>>
>> * Each priority now has a dedicated context, so contexts may provide different policies. CALL_SITE now uses the new CriticalUnorderedContext that runs beforeCheckpoint on concurrent registration, fixes [1]. Whether or not CALL_SITE needs to be registered to at all is an open question and out of scope of this PR.
>> * The Global Context is reverted from BlockingOrderedContext to OrderedContext, as that may have a huge impact on users. Probably we'll want to expose blocking/criticalUnordered contexts along with the global one, or at some point expose an implementation. But this is also out of scope of the PR.
>> * The hierarchy of the Context implementations is cleaned up a bit [2]
>>
>> The JDKContext is now just a holder of ClaimedFDs, I'll address this in a follow-up that depends on this Context follow-up.
>>
>> [1] https://github.com/openjdk/crac/pull/60#issuecomment-1545588281
>> [2] https://github.com/openjdk/crac/pull/60#discussion_r1185510445
>
> src/java.base/share/classes/jdk/internal/crac/JDKResource.java line 31:
>
>> 29: import jdk.crac.Resource;
>> 30:
>> 31: public interface JDKResource extends Resource {
>
> Could we drop the interface completely?

Probably, but I propose to do this a bit later. E.g. it may be useful to override beforeCheckpoint here, declaring that no Exception is thrown.
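
A minimal sketch of the override idea mentioned above. It assumes stand-in `Context` and `Resource` types declared locally that only mirror the general shape of the `jdk.crac` interfaces (the real ones are not reproduced here), and `FdHolder` is a hypothetical resource used purely for illustration:

```java
public class NarrowedThrowsSketch {

    // Stand-ins for the jdk.crac types so the sketch compiles on a stock JDK.
    // Their exact shape is an assumption made for illustration.
    interface Context<R extends Resource> { }

    interface Resource {
        void beforeCheckpoint(Context<? extends Resource> context) throws Exception;
        void afterRestore(Context<? extends Resource> context) throws Exception;
    }

    // Re-declaring the method without "throws Exception" narrows the contract:
    // implementations can no longer throw checked exceptions from beforeCheckpoint,
    // and callers holding a JDKResource need no try/catch around the call.
    interface JDKResource extends Resource {
        @Override
        void beforeCheckpoint(Context<? extends Resource> context);
    }

    // Hypothetical JDK-internal resource.
    static final class FdHolder implements JDKResource {
        boolean closed;

        @Override
        public void beforeCheckpoint(Context<? extends Resource> context) {
            closed = true; // e.g. release a file descriptor before the checkpoint
        }

        @Override
        public void afterRestore(Context<? extends Resource> context) {
            closed = false; // e.g. reopen it after restore
        }
    }

    public static boolean run() {
        FdHolder holder = new FdHolder();
        JDKResource resource = holder;
        resource.beforeCheckpoint(null); // compiles without handling Exception
        return holder.closed;
    }

    public static void main(String[] args) {
        System.out.println("closed before checkpoint: " + run());
    }
}
```

The same call through a plain `Resource` reference would still require the caller to handle `Exception`; the narrowing only helps code that is typed against the sub-interface.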
------------- PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1204099085 From akozlov at openjdk.org Wed May 24 13:29:25 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 24 May 2023 13:29:25 GMT Subject: [crac] RFR: Prevent concurrent cleanup by cleaner thread and checkpoint notifications In-Reply-To: References: Message-ID: On Tue, 16 May 2023 11:52:54 GMT, Radim Vansa wrote: > We block the cleaner thread to prevent race conditions between this thread and checkpointing thread invoking clean(). > When the cleanup starts in cleaner thread the checkpoint will skip it, but without waiting for the cleanup to finish (which might be critical for the checkpoint, e.g. closing FDs). > The limitation is that code performing C/R must not wait on any task completed by the cleaner. src/java.base/share/classes/jdk/internal/ref/CleanerImpl.java line 151: > 149: while (blockForCheckpoint) { > 150: wait(); > 151: } Once we've got here, is it possible to ensure Cleaners has been called, and drop separate registration of Cleaners? A concurrent cleaner registration is not a problem, as that depends on GC which is not predictable. I.e. if that happens, a slight race may also cause the cleaner to run after restore. src/java.base/share/classes/jdk/internal/ref/CleanerImpl.java line 180: > 178: // critical for the checkpoint, e.g. closing FDs). > 179: // The limitation is that code performing C/R must not wait on any task > 180: // completed by the cleaner. Could you elaborate why there is this limitation? src/java.base/share/classes/jdk/internal/ref/CleanerImpl.java line 182: > 180: // completed by the cleaner. > 181: blockForCheckpoint = true; > 182: thread.interrupt(); Why it has to be interrupt and not notify(), for example? 

-------------

PR Review Comment: https://git.openjdk.org/crac/pull/73#discussion_r1204137253
PR Review Comment: https://git.openjdk.org/crac/pull/73#discussion_r1204117380
PR Review Comment: https://git.openjdk.org/crac/pull/73#discussion_r1204118532

From akozlov at azul.com Wed May 24 14:09:29 2023
From: akozlov at azul.com (Anton Kozlov)
Date: Wed, 24 May 2023 17:09:29 +0300
Subject: CheckpointException
In-Reply-To: 
References: 
Message-ID: <91bedf8e-c119-6963-8c68-fe679906c37c@azul.com>

Hi Huseyin,

On 5/23/23 17:28, Huseyin Akdogan wrote:
> JVM: invalid info for restore provided: queued code -1
> An exception during a checkpoint operation:
> jdk.crac.CheckpointException

> openjdk version "17-crac" 2021-09-14
> OpenJDK Runtime Environment (build 17-crac+5-19)
> OpenJDK 64-Bit Server VM (build 17-crac+5-19, mixed mode, sharing)

I suspect the problem is related to the installation [1]. So please file an issue at [2].

Thanks,
Anton

[1] https://github.com/CRaC/docs#jdk
[2] https://github.com/CRaC/openjdk-builds/issues

From hakdogan75 at gmail.com Wed May 24 15:18:13 2023
From: hakdogan75 at gmail.com (Huseyin Akdogan)
Date: Wed, 24 May 2023 18:18:13 +0300
Subject: CheckpointException
In-Reply-To: <91bedf8e-c119-6963-8c68-fe679906c37c@azul.com>
References: <91bedf8e-c119-6963-8c68-fe679906c37c@azul.com>
Message-ID: 

Hi Anton,

Thank you for the reply and referral. I got it done:
https://github.com/CRaC/openjdk-builds/issues/3

Best

On Wed, 24 May 2023 at 17:09, Anton Kozlov wrote:

> Hi Huseyin,
>
> On 5/23/23 17:28, Huseyin Akdogan wrote:
> > JVM: invalid info for restore provided: queued code -1
> > An exception during a checkpoint operation:
> > jdk.crac.CheckpointException
>
> > openjdk version "17-crac" 2021-09-14
> > OpenJDK Runtime Environment (build 17-crac+5-19)
> > OpenJDK 64-Bit Server VM (build 17-crac+5-19, mixed mode, sharing)
>
> I suspect the problem is related to the installation [1]. So please file
> an issue at [2].
>
> Thanks,
> Anton
>
> [1] https://github.com/CRaC/docs#jdk
> [2] https://github.com/CRaC/openjdk-builds/issues
>

--
*Hüseyin Akdoğan*
Expert Software Consultant

From duke at openjdk.org Thu May 25 06:25:25 2023
From: duke at openjdk.org (Radim Vansa)
Date: Thu, 25 May 2023 06:25:25 GMT
Subject: [crac] RFR: Introduce per-Priority Context with different policies
In-Reply-To: 
References: 
Message-ID: 

On Wed, 24 May 2023 12:50:47 GMT, Anton Kozlov wrote:

>> Yes, but it does not explain much about the **reason** but "... as that may have a huge impact on users". I don't think there's much concern about backwards compatibility at this point; it's more important to have the best UX even in case that users do an error.
>>
>> If you want to conserve certain behaviour, write a test that will validate what happens when you attempt to register into global context in one of those notifications.
>
> Apparently you mean why the global context implementation has been changed, not why the test has to use an explicit implementation.
>
> Honestly I did not realize the global context properties were changed in #60, so I just want to revert that for a while. This was rather big change. And that deserved a separate line in the javadoc, which is btw the specification, not the tests (although they are definetely useful). I lean toward using BlockingContext, but want to think a bit more about that, with the spec change.

Yes, the behaviour should be specified through javadoc, but the javadoc does not say anything about registration while a checkpoint is in progress. By your logic that behaviour is unspecified and hence the change doesn't matter. Moreover, you explicitly asked to postpone the javadoc changes to #65, which has not been reviewed in 2 weeks. Rather than changing the implementation back and forth, you could explain why a certain behaviour is more useful.
Now you've removed a test for certain codepath - creating resource in global context - if you think it should behave in a different way you should keep it and assert different outcome. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1205045522 From duke at openjdk.org Thu May 25 06:59:19 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 25 May 2023 06:59:19 GMT Subject: [crac] RFR: Prevent concurrent cleanup by cleaner thread and checkpoint notifications In-Reply-To: References: Message-ID: <82OoQQrh1BZ_epMMmT9P-qabiJtcHnyoBeYD8sEAiDo=.84f640c4-8eb7-46b5-ad86-0c6d3b8ddd38@github.com> On Wed, 24 May 2023 13:16:43 GMT, Anton Kozlov wrote: >> We block the cleaner thread to prevent race conditions between this thread and checkpointing thread invoking clean(). >> When the cleanup starts in cleaner thread the checkpoint will skip it, but without waiting for the cleanup to finish (which might be critical for the checkpoint, e.g. closing FDs). >> The limitation is that code performing C/R must not wait on any task completed by the cleaner. > > src/java.base/share/classes/jdk/internal/ref/CleanerImpl.java line 182: > >> 180: // completed by the cleaner. >> 181: blockForCheckpoint = true; >> 182: thread.interrupt(); > > Why it has to be interrupt and not notify(), for example? The interrupt wakes up cleaner in `queue.remove()` (line 161) in case it's blocking for next task. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/73#discussion_r1205075139 From duke at openjdk.org Thu May 25 07:28:20 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 25 May 2023 07:28:20 GMT Subject: [crac] RFR: Prevent concurrent cleanup by cleaner thread and checkpoint notifications In-Reply-To: References: Message-ID: On Wed, 24 May 2023 13:26:47 GMT, Anton Kozlov wrote: >> We block the cleaner thread to prevent race conditions between this thread and checkpointing thread invoking clean(). 
>> When the cleanup starts in cleaner thread the checkpoint will skip it, but without waiting for the cleanup to finish (which might be critical for the checkpoint, e.g. closing FDs). >> The limitation is that code performing C/R must not wait on any task completed by the cleaner. > > src/java.base/share/classes/jdk/internal/ref/CleanerImpl.java line 151: > >> 149: while (blockForCheckpoint) { >> 150: wait(); >> 151: } > > Once we've got here, is it possible to ensure Cleaners has been called, and drop separate registration of Cleaners? > > A concurrent cleaner registration is not a problem, as that depends on GC which is not predictable. I.e. if that happens, a slight race may also cause the cleaner to run after restore. I don't follow. PhantomCleanableRefs are cleaned strictly after this point (these have later priority), other cleaners will be queued up by GC but not called until restore. > src/java.base/share/classes/jdk/internal/ref/CleanerImpl.java line 180: > >> 178: // critical for the checkpoint, e.g. closing FDs). >> 179: // The limitation is that code performing C/R must not wait on any task >> 180: // completed by the cleaner. > > Could you elaborate why there is this limitation? The thread is blocked until `afterRestore`. PhantomCleanableRefs are cleaned independently within its own priority class (after the thread is stopped), but any other cleanup tasks simply won't happen. Therefore if C/R after this point waits for the cleanup, it will deadlock. 
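To make the interrupt-versus-notify point above concrete, here is a minimal sketch of a worker that blocks in `ReferenceQueue.remove()` and is parked for a checkpoint. It is not the CleanerImpl patch under review: the class `BlockingCleanerSketch`, the `parked` flag and the `stop()` helper are hypothetical names introduced only for illustration. The idea it tries to show is that `notify()` on the worker's own monitor would not reach a thread that is waiting inside the queue, whereas `interrupt()` makes `remove()` throw, so the worker returns to the loop and sees `blockForCheckpoint`.

```java
import java.lang.ref.ReferenceQueue;

// Hypothetical sketch, not the CleanerImpl code from the PR.
public class BlockingCleanerSketch implements Runnable {
    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    private final Thread thread = new Thread(this, "sketch-cleaner");
    private boolean blockForCheckpoint; // guarded by 'this'
    private boolean parked;             // guarded by 'this'
    private volatile boolean shutdown;

    public void start() {
        thread.setDaemon(true);
        thread.start();
    }

    @Override
    public void run() {
        while (!shutdown) {
            synchronized (this) {
                while (blockForCheckpoint) {
                    parked = true;
                    notifyAll(); // tell beforeCheckpoint() that we are parked
                    try {
                        wait();
                    } catch (InterruptedException e) {
                        // A pending interrupt; loop and re-check the flag.
                    }
                }
                parked = false;
            }
            try {
                // A real cleaner would run the cleanup action of the result.
                queue.remove(60_000);
            } catch (InterruptedException e) {
                // Woken by beforeCheckpoint() or stop(); re-check the flags.
            }
        }
    }

    // Parks the worker; returns once it no longer touches the queue.
    public synchronized void beforeCheckpoint() throws InterruptedException {
        blockForCheckpoint = true;
        thread.interrupt(); // notify() here would not wake queue.remove()
        while (!parked) {
            wait();
        }
    }

    public synchronized void afterRestore() {
        blockForCheckpoint = false;
        notifyAll();
    }

    public synchronized boolean isParked() {
        return parked;
    }

    public void stop() throws InterruptedException {
        synchronized (this) {
            shutdown = true;
            blockForCheckpoint = false;
            notifyAll();
        }
        thread.interrupt();
        thread.join();
    }
}
```

While the worker is parked nothing drains the queue, which is the limitation discussed above: code that waits for a cleanup between `beforeCheckpoint()` and `afterRestore()` would wait forever.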
------------- PR Review Comment: https://git.openjdk.org/crac/pull/73#discussion_r1205103396 PR Review Comment: https://git.openjdk.org/crac/pull/73#discussion_r1205100248 From duke at openjdk.org Thu May 25 07:41:07 2023 From: duke at openjdk.org (joeylee) Date: Thu, 25 May 2023 07:41:07 GMT Subject: [crac] RFR: Remove code trigger register during checkpointRestore Message-ID: Register higher priority context during checkpoint could lead to dead lock, this patch removes the code that triggers register during checkpoint call. Thread dump for `CracOptionTest` without this patch 2023-05-23 18:10:39 Full thread dump OpenJDK 64-Bit Server VM (17-internal+0-adhoc.ubuntu.jdk mixed mode, sharing): Threads class SMR info: _java_thread_list=0x00007f30a8002610, length=13, elements={ 0x00007f30ec025690, 0x00007f30ec0b3d30, 0x00007f30ec0b5ae0, 0x00007f30ec0baa40, 0x00007f30ec0bbdf0, 0x00007f30ec0bd200, 0x00007f30ec0bebb0, 0x00007f30ec0c00e0, 0x00007f30ec0c1550, 0x00007f30ec0c9350, 0x00007f30ec0cc3d0, 0x00007f30ec0e5c20, 0x00007f30a8001650 } "main" #1 prio=5 os_prio=0 cpu=2.19ms elapsed=12.30s tid=0x00007f30ec025690 nid=0x69b0 in Object.wait() [0x00007f30f392f000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(java.base at 17-internal/Native Method) - waiting on <0x00000000ca413e78> (a jdk.internal.crac.JDKContext) at java.lang.Object.wait(java.base at 17-internal/Object.java:338) at jdk.crac.impl.AbstractContextImpl.waitWhileCheckpointIsInProgress(java.base at 17-internal/AbstractContextImpl.java:102) at jdk.crac.impl.PriorityContext.register(java.base at 17-internal/PriorityContext.java:44) - locked <0x00000000ca413e78> (a jdk.internal.crac.JDKContext) at jdk.internal.crac.JDKContext.register(java.base at 17-internal/JDKContext.java:108) at jdk.internal.ref.CleanerImpl$PhantomCleanableRef.(java.base at 17-internal/CleanerImpl.java:173) at java.lang.invoke.MethodHandleNatives$CallSiteContext.make(java.base at 
17-internal/MethodHandleNatives.java:93) at java.lang.invoke.CallSite.(java.base at 17-internal/CallSite.java:144) at java.lang.invoke.ConstantCallSite.(java.base at 17-internal/ConstantCallSite.java:50) at java.lang.invoke.InnerClassLambdaMetafactory.buildCallSite(java.base at 17-internal/InnerClassLambdaMetafactory.java:262) at java.lang.invoke.LambdaMetafactory.metafactory(java.base at 17-internal/LambdaMetafactory.java:341) at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base at 17-internal/DirectMethodHandle$Holder) at java.lang.invoke.Invokers$Holder.invokeExact_MT(java.base at 17-internal/Invokers$Holder) at java.lang.invoke.BootstrapMethodInvoker.invoke(java.base at 17-internal/BootstrapMethodInvoker.java:134) at java.lang.invoke.CallSite.makeSite(java.base at 17-internal/CallSite.java:315) at java.lang.invoke.MethodHandleNatives.linkCallSiteImpl(java.base at 17-internal/MethodHandleNatives.java:285) at java.lang.invoke.MethodHandleNatives.linkCallSite(java.base at 17-internal/MethodHandleNatives.java:275) at jdk.crac.Core.checkpointRestore1(java.base at 17-internal/Core.java:176) at jdk.crac.Core.checkpointRestore(java.base at 17-internal/Core.java:264) - locked <0x00000000ca413f00> (a java.lang.Object) at jdk.crac.Core.checkpointRestore(java.base at 17-internal/Core.java:249) at FileWatcherAfterRestoreTest.exec(FileWatcherAfterRestoreTest.java:89) at jdk.test.lib.crac.CracTest.run(CracTest.java:157) at jdk.test.lib.crac.CracTest.main(CracTest.java:89) "Reference Handler" #2 daemon prio=10 os_prio=0 cpu=0.07ms elapsed=12.28s tid=0x00007f30ec0b3d30 nid=0x69b8 waiting on condition [0x00007f30deefc000] java.lang.Thread.State: RUNNABLE at java.lang.ref.Reference.waitForReferencePendingList(java.base at 17-internal/Native Method) at java.lang.ref.Reference.processPendingReferences(java.base at 17-internal/Reference.java:258) at java.lang.ref.Reference$ReferenceHandler.run(java.base at 17-internal/Reference.java:218) "Finalizer" #3 daemon prio=8 
os_prio=0 cpu=0.09ms elapsed=12.28s tid=0x00007f30ec0b5ae0 nid=0x69b9 in Object.wait() [0x00007f30dedfb000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(java.base at 17-internal/Native Method) - waiting on <0x00000000ca4143f8> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(java.base at 17-internal/ReferenceQueue.java:155) - locked <0x00000000ca4143f8> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(java.base at 17-internal/ReferenceQueue.java:176) at java.lang.ref.Finalizer$FinalizerThread.run(java.base at 17-internal/Finalizer.java:172) "Signal Dispatcher" #4 daemon prio=9 os_prio=0 cpu=0.37ms elapsed=12.28s tid=0x00007f30ec0baa40 nid=0x69ba waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE "Service Thread" #5 daemon prio=9 os_prio=0 cpu=0.06ms elapsed=12.28s tid=0x00007f30ec0bbdf0 nid=0x69bb runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE "Monitor Deflation Thread" #6 daemon prio=9 os_prio=0 cpu=0.26ms elapsed=12.28s tid=0x00007f30ec0bd200 nid=0x69bc runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE "C2 CompilerThread0" #7 daemon prio=9 os_prio=0 cpu=0.12ms elapsed=12.28s tid=0x00007f30ec0bebb0 nid=0x69bd waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE No compile task "C1 CompilerThread0" #8 daemon prio=9 os_prio=0 cpu=2.12ms elapsed=12.28s tid=0x00007f30ec0c00e0 nid=0x69be waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE No compile task "Sweeper thread" #9 daemon prio=9 os_prio=0 cpu=0.07ms elapsed=12.28s tid=0x00007f30ec0c1550 nid=0x69bf runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE "Notification Thread" #10 daemon prio=9 os_prio=0 cpu=0.06ms elapsed=12.27s tid=0x00007f30ec0c9350 nid=0x69c0 runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE "Common-Cleaner" #11 daemon prio=8 os_prio=0 cpu=0.08ms elapsed=12.27s tid=0x00007f30ec0cc3d0 
nid=0x69c2 in Object.wait() [0x00007f30de2b7000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(java.base at 17-internal/Native Method) - waiting on <0x00000000ca428060> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(java.base at 17-internal/ReferenceQueue.java:155) - locked <0x00000000ca428060> (a java.lang.ref.ReferenceQueue$Lock) at jdk.internal.ref.CleanerImpl.run(java.base at 17-internal/CleanerImpl.java:144) at java.lang.Thread.run(java.base at 17-internal/Thread.java:833) at jdk.internal.misc.InnocuousThread.run(java.base at 17-internal/InnocuousThread.java:162) "FileSystemWatchService" #12 daemon prio=5 os_prio=0 cpu=0.06ms elapsed=12.25s tid=0x00007f30ec0e5c20 nid=0x69c3 in Object.wait() [0x00007f30ddd8d000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(java.base at 17-internal/Native Method) - waiting on <0x00000000ca43abb8> (a sun.nio.fs.LinuxWatchService$Poller) at java.lang.Object.wait(java.base at 17-internal/Object.java:338) at sun.nio.fs.LinuxWatchService$Poller.processCheckpointRestore(java.base at 17-internal/LinuxWatchService.java:233) - locked <0x00000000ca43abb8> (a sun.nio.fs.LinuxWatchService$Poller) at sun.nio.fs.LinuxWatchService$Poller.run(java.base at 17-internal/LinuxWatchService.java:369) at java.lang.Thread.run(java.base at 17-internal/Thread.java:833) "Attach Listener" #13 daemon prio=9 os_prio=0 cpu=0.21ms elapsed=0.10s tid=0x00007f30a8001650 nid=0x6a13 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE "VM Thread" os_prio=0 cpu=1.99ms elapsed=12.28s tid=0x00007f30ec0afe10 nid=0x69b7 runnable "GC Thread#0" os_prio=0 cpu=1.21ms elapsed=12.29s tid=0x00007f30ec050f80 nid=0x69b2 runnable "GC Thread#1" os_prio=0 cpu=0.41ms elapsed=12.21s tid=0x00007f30ac000f70 nid=0x69c4 runnable "G1 Main Marker" os_prio=0 cpu=0.27ms elapsed=12.29s tid=0x00007f30ec0590d0 nid=0x69b3 runnable "G1 Conc#0" os_prio=0 cpu=6.10ms 
elapsed=12.29s tid=0x00007f30ec05a030 nid=0x69b4 runnable "G1 Refine#0" os_prio=0 cpu=0.08ms elapsed=12.29s tid=0x00007f30ec0874e0 nid=0x69b5 runnable "G1 Service" os_prio=0 cpu=1.24ms elapsed=12.29s tid=0x00007f30ec0883d0 nid=0x69b6 runnable "VM Periodic Task Thread" os_prio=0 cpu=4.56ms elapsed=12.27s tid=0x00007f30ec0cac90 nid=0x69c1 waiting on condition JNI global refs: 7, weak refs: 0 ------------- Commit messages: - Remove code trigger register during checkpointRestore Changes: https://git.openjdk.org/crac/pull/75/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=75&range=00 Stats: 79 lines in 2 files changed: 75 ins; 0 del; 4 mod Patch: https://git.openjdk.org/crac/pull/75.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/75/head:pull/75 PR: https://git.openjdk.org/crac/pull/75 From duke at openjdk.org Thu May 25 08:36:21 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 25 May 2023 08:36:21 GMT Subject: [crac] RFR: Linux file system watcher support [v3] In-Reply-To: References: Message-ID: On Tue, 23 May 2023 15:28:47 GMT, joeylee wrote: >> inotify monitors changes on filesystem, this support automatic restore for LinuxFileWatcher. >> >> FileWatcherAfterRestoreTest verifies watcher service works after restore. >> FileWatcherTest verifies automatic closing inotiify fd >> >> The watcher keys are still managed by user, so exception will be thrown if no watcher keys are leaked, as in FileWatcherWithOpenKeysTest > > joeylee has updated the pull request incrementally with two additional commits since the last revision: > > - update test > - update test Alright, the test does not really show what I was asking for (an independent thread using the watch service and **reacting** to checkpoint triggered elsewhere, e.g. through `jcmd JDK.checkpoint`) but your comment explains that this change *might* be useful if the watch service is provided as static from a library beyond our control. 
Therefore I think that once the technicalities are resolved this could be integrated @AntonKozlov . src/java.base/linux/classes/sun/nio/fs/LinuxWatchService.java line 504: > 502: try { > 503: this.wait(); > 504: } catch (InterruptedException e) { You shouldn't just ignore the interrupt; interrupt should break a loop and probably enter errored state and rethrow. Someone wanted us to stop working. src/java.base/linux/classes/sun/nio/fs/LinuxWatchService.java line 524: > 522: try { > 523: this.wait(); > 524: } catch (InterruptedException e) { Same here. test/jdk/jdk/crac/fileDescriptors/FileWatcherAfterRestoreTest.java line 62: > 60: directory = Paths.get(System.getProperty("user.dir"), "workdir"); > 61: directory.toFile().mkdir(); > 62: Files.createTempFile(directory, "temp", ".txt"); You're creating the file inside `isMatchFound`, too. ------------- PR Review: https://git.openjdk.org/crac/pull/72#pullrequestreview-1443332992 PR Review Comment: https://git.openjdk.org/crac/pull/72#discussion_r1205174591 PR Review Comment: https://git.openjdk.org/crac/pull/72#discussion_r1205174952 PR Review Comment: https://git.openjdk.org/crac/pull/72#discussion_r1205159796 From akozlov at openjdk.org Thu May 25 08:49:23 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 25 May 2023 08:49:23 GMT Subject: [crac] RFR: Remove code trigger register during checkpointRestore In-Reply-To: References: Message-ID: On Thu, 25 May 2023 07:33:16 GMT, joeylee wrote: > Register higher priority context during checkpoint could lead to dead lock, this patch removes the code that triggers register during checkpoint call. 
> > Thread dump for `CracOptionTest` without this patch > > [...] LGTM, except for a few nits. src/java.base/share/classes/jdk/crac/Core.java line 178: > 176: > 177: if (newProperties != null && newProperties.length > 0) { > 178: Arrays.stream(newProperties).map(new Function() { A comment referring to the issue and describing the solution (do not use lambda) would be helpful! test/jdk/jdk/crac/CracOptionTest.java line 57: > 55: directory = Paths.get(System.getProperty("user.dir"), "workdir"); > 56: directory.toFile().mkdir(); > 57: Files.createTempFile(directory, "temp", ".txt"); Are these lines required for the test? It looks like the rest of the test passes -Dk=v to both checkpoint and restore, triggering the issue. But these do not seem to be related. ------------- PR Review: https://git.openjdk.org/crac/pull/75#pullrequestreview-1443386078 PR Review Comment: https://git.openjdk.org/crac/pull/75#discussion_r1205194337 PR Review Comment: https://git.openjdk.org/crac/pull/75#discussion_r1205197191 From duke at openjdk.org Thu May 25 08:49:25 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 25 May 2023 08:49:25 GMT Subject: [crac] RFR: Remove code trigger register during checkpointRestore In-Reply-To: References: Message-ID: On Thu, 25 May 2023 07:33:16 GMT, joeylee wrote: > Register higher priority context during checkpoint could lead to dead lock, this patch removes the code that triggers register during checkpoint call. 
> > Thread dump for `CracOptionTest` without this patch > > [...] Right, I had this in some code branches (awaiting as PR) but can be integrated independently. @AntonKozlov actually wants to fix this in #74 but that PR needs yet some more love. test/jdk/jdk/crac/CracOptionTest.java line 38: > 36: > 37: /** > 38: * @test DryRunTest Wrong test name. ------------- Marked as reviewed by rvansa at github.com (no known OpenJDK username). PR Review: https://git.openjdk.org/crac/pull/75#pullrequestreview-1443391362 PR Review Comment: https://git.openjdk.org/crac/pull/75#discussion_r1205195428 From duke at openjdk.org Thu May 25 09:10:44 2023 From: duke at openjdk.org (joeylee) Date: Thu, 25 May 2023 09:10:44 GMT Subject: [crac] RFR: Remove code trigger register during checkpointRestore [v2] In-Reply-To: References: Message-ID: > Register higher priority context during checkpoint could lead to dead lock, this patch removes the code that triggers register during checkpoint call. 
> > Thread dump for `CracOptionTest` without this patch > > [...] joeylee has updated the pull request incrementally with one additional commit since the last revision: update ------------- Changes: - all: https://git.openjdk.org/crac/pull/75/files - new: https://git.openjdk.org/crac/pull/75/files/ac5410f7..71378f6c Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=75&range=01 - incr: https://webrevs.openjdk.org/?repo=crac&pr=75&range=00-01 Stats: 8 lines in 2 files changed: 2 ins; 5 del; 1 mod Patch: https://git.openjdk.org/crac/pull/75.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/75/head:pull/75 PR: https://git.openjdk.org/crac/pull/75 From duke at openjdk.org Thu May 25 09:10:47 2023 From: duke at openjdk.org (joeylee) Date: Thu, 25 May 2023 09:10:47 GMT Subject: [crac] RFR: Remove code trigger register during checkpointRestore [v2] In-Reply-To: References: Message-ID: On Thu, 25 May 2023 08:45:56 GMT, Anton Kozlov wrote: >> joeylee has updated the pull request incrementally with one additional commit since the last revision: >> >> update > > test/jdk/jdk/crac/CracOptionTest.java line 57: > >> 55: directory = Paths.get(System.getProperty("user.dir"), "workdir"); >> 56: directory.toFile().mkdir(); >> 57: Files.createTempFile(directory, "temp", ".txt"); > > Are these lines required for the test? It looks like the rest of the test passes -Dk=v to both checkpoint and restore, triggering the issue. But these do not seem to be related. Right, I can still reproduce the issue without these lines. Removed. 
------------- PR Review Comment: https://git.openjdk.org/crac/pull/75#discussion_r1205220407 From duke at openjdk.org Thu May 25 09:26:01 2023 From: duke at openjdk.org (joeylee) Date: Thu, 25 May 2023 09:26:01 GMT Subject: [crac] RFR: Linux file system watcher support [v4] In-Reply-To: References: Message-ID: <3Jl4KmUfedZAAzBzRZQZu1hJpBpdwvzilrUpKX4KcZo=.3c77ae78-59ee-481c-a0ff-41e03c32270a@github.com> > inotify monitors changes on filesystem, this support automatic restore for LinuxFileWatcher. > > FileWatcherAfterRestoreTest verifies watcher service works after restore. > FileWatcherTest verifies automatic closing inotiify fd > > The watcher keys are still managed by user, so exception will be thrown if no watcher keys are leaked, as in FileWatcherWithOpenKeysTest joeylee has updated the pull request incrementally with one additional commit since the last revision: update ------------- Changes: - all: https://git.openjdk.org/crac/pull/72/files - new: https://git.openjdk.org/crac/pull/72/files/40191d9b..d764613d Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=72&range=03 - incr: https://webrevs.openjdk.org/?repo=crac&pr=72&range=02-03 Stats: 5 lines in 2 files changed: 4 ins; 1 del; 0 mod Patch: https://git.openjdk.org/crac/pull/72.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/72/head:pull/72 PR: https://git.openjdk.org/crac/pull/72 From akozlov at openjdk.org Thu May 25 12:02:25 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 25 May 2023 12:02:25 GMT Subject: [crac] RFR: Remove code trigger register during checkpointRestore [v2] In-Reply-To: References: Message-ID: <4hz71wPij3f8qlD3R5PVrGXbF-vJMhWY8amx7mSEuBQ=.9a4e30b7-c502-4871-9c3d-a60cb6e84e0b@github.com> On Thu, 25 May 2023 09:10:44 GMT, joeylee wrote: >> Register higher priority context during checkpoint could lead to dead lock, this patch removes the code that triggers register during checkpoint call. 
>> >> Thread dump for `CracOptionTest` without this patch >> >> [...] > > joeylee has updated the pull request incrementally with one additional commit since the last revision: > > update Marked as reviewed by akozlov (Lead). ------------- PR Review: https://git.openjdk.org/crac/pull/75#pullrequestreview-1443726043 From akozlov at openjdk.org Thu May 25 12:18:39 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 25 May 2023 12:18:39 GMT Subject: [crac] RFR: Do not register MethodHandleNatives Message-ID: <3DSagRsUhYjN-NWSz1b8U-MEkV907zmdouRhsWUF20w=.63dd233b-8193-44a8-aac5-afdebb05bdf1@github.com> A simple solution for the lambda problem in the CRaC Core: do not register CALL_SITES at all. A second part of the PR introduces an interface for Cleaner.register() with priority parameter. 
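For readers following the "lambda problem": in the thread dump from #75 the first use of a lambda inside `Core.checkpointRestore1` goes through `LambdaMetafactory`, creates a `CallSite`, and ends up registering a `PhantomCleanableRef` in the JDK context while the checkpoint is in progress. The workaround reviewed in #75 avoids that by using an anonymous class instead of a lambda. Below is a rough sketch of that shape only; `NoLambdaSketch` and `parse` are hypothetical names, and the `k=v` splitting is an assumption about what the properties look like, not code taken from `Core.java`.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical illustration; the real Core.java code is not reproduced here.
public class NoLambdaSketch {
    // Splits "key=value" entries without any lambda or method reference in
    // this class, so this code itself needs no invokedynamic call site.
    static Map<String, String> parse(String[] newProperties) {
        Map<String, String> result = new LinkedHashMap<>();
        if (newProperties == null || newProperties.length == 0) {
            return result;
        }
        Arrays.stream(newProperties)
              .map(new Function<String, String[]>() {
                  @Override
                  public String[] apply(String prop) {
                      return prop.split("=", 2);
                  }
              })
              .forEach(new Consumer<String[]>() {
                  @Override
                  public void accept(String[] pair) {
                      // Entries without '=' get an empty value (assumption).
                      result.put(pair[0], pair.length == 2 ? pair[1] : "");
                  }
              });
        return result;
    }
}
```

An anonymous class is compiled to an ordinary class file, so using it does not go through the bootstrap path seen in the dump. Not registering CALL_SITES, as this PR proposes, addresses the same deadlock from the other side.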
------------- Commit messages: - Use internal interfaces for registration - Do not register MethodHandleNatives Changes: https://git.openjdk.org/crac/pull/76/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=76&range=00 Stats: 39 lines in 6 files changed: 28 ins; 8 del; 3 mod Patch: https://git.openjdk.org/crac/pull/76.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/76/head:pull/76 PR: https://git.openjdk.org/crac/pull/76 From akozlov at openjdk.org Thu May 25 12:44:28 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 25 May 2023 12:44:28 GMT Subject: [crac] Integrated: Ensure all notifications finish even if only daemon threads remain In-Reply-To: References: Message-ID: On Thu, 4 May 2023 13:34:47 GMT, Anton Kozlov wrote: > If as a result of beforeCheckpoint() no more non-daemon threads remain, it's possible that VM exits prematurely, before one of afterRestore() get a chance to create another non-daemon thread that will keep VM alive. Triggering checkpoint via jcmd (so checkpointRestore() method is executed on daemon attach-listener thread), increases probability to step in the problem. > > The change ensures all notifications are done while there is at least one non-daemon thread. The notification methods are still called from the original thread. This pull request has now been integrated. 
Changeset: 68de4bed Author: Anton Kozlov URL: https://git.openjdk.org/crac/commit/68de4bed4e3984929ce3dfad29565d24d04b0790 Stats: 203 lines in 4 files changed: 188 ins; 7 del; 8 mod Ensure all notifications finish even if only daemon threads remain Co-authored-by: Radim Vansa ------------- PR: https://git.openjdk.org/crac/pull/62 From heidinga at openjdk.org Thu May 25 13:17:28 2023 From: heidinga at openjdk.org (Dan Heidinga) Date: Thu, 25 May 2023 13:17:28 GMT Subject: [crac] RFR: Do not register MethodHandleNatives In-Reply-To: <3DSagRsUhYjN-NWSz1b8U-MEkV907zmdouRhsWUF20w=.63dd233b-8193-44a8-aac5-afdebb05bdf1@github.com> References: <3DSagRsUhYjN-NWSz1b8U-MEkV907zmdouRhsWUF20w=.63dd233b-8193-44a8-aac5-afdebb05bdf1@github.com> Message-ID: <6lzE0_F3CFyFrfYDBUubclkCoJVXWI-PYAmR8GQC8Jk=.c4050741-5224-4447-81c0-728fb90dcce3@github.com> On Thu, 25 May 2023 12:12:24 GMT, Anton Kozlov wrote: > A simple solution for the lambda problem in the CRaC Core: do not register CALL_SITES at all. > > A second part of the PR introduces an interface for Cleaner.register() with priority parameter. src/java.base/share/classes/java/lang/invoke/MethodHandleNatives.java line 94: > 92: // CleanerFactory class) until cleanup is performed. > 93: // This PhantomCleanableRef is not registered in any Context as > 94: // registration cauesed by the core CRaC code leads to deadlock. Suggestion: // registration caused by the core CRaC code leads to deadlock. 
------------- PR Review Comment: https://git.openjdk.org/crac/pull/76#discussion_r1205505791 From akozlov at openjdk.org Thu May 25 13:18:26 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 25 May 2023 13:18:26 GMT Subject: [crac] RFR: List open FDs through reading /proc/self/fd In-Reply-To: <8XFKhgK6WTjIrbDLGm58oyeKWhTIQbYRW63Hyg6pk3Q=.9f07ed12-b98f-4af0-9600-ec765cfd80bf@github.com> References: <8XFKhgK6WTjIrbDLGm58oyeKWhTIQbYRW63Hyg6pk3Q=.9f07ed12-b98f-4af0-9600-ec765cfd80bf@github.com> Message-ID: On Fri, 12 May 2023 07:36:33 GMT, Radim Vansa wrote: > Previously the code was iterating through all possible FD values, up to highest allowed FD number, and required allocation of possibly huge array. Reading /proc/self/fd into a compact array is both more memory efficient and does not require excessive syscalls. Mostly LGTM! src/hotspot/os/linux/os_linux.cpp line 193: > 191: }; > 192: > 193: bool same_fd(int fd1, int fd2); The declaration has wrong parameter names. src/hotspot/os/linux/os_linux.cpp line 5814: > 5812: // skip "." and ".." > 5813: continue; > 5814: } May be just look at the first char `== '.'`? ------------- Marked as reviewed by akozlov (Lead). 
PR Review: https://git.openjdk.org/crac/pull/67#pullrequestreview-1443824365 PR Review Comment: https://git.openjdk.org/crac/pull/67#discussion_r1205496914 PR Review Comment: https://git.openjdk.org/crac/pull/67#discussion_r1205475025 From akozlov at openjdk.org Thu May 25 13:24:09 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 25 May 2023 13:24:09 GMT Subject: [crac] RFR: Do not register MethodHandleNatives [v2] In-Reply-To: <3DSagRsUhYjN-NWSz1b8U-MEkV907zmdouRhsWUF20w=.63dd233b-8193-44a8-aac5-afdebb05bdf1@github.com> References: <3DSagRsUhYjN-NWSz1b8U-MEkV907zmdouRhsWUF20w=.63dd233b-8193-44a8-aac5-afdebb05bdf1@github.com> Message-ID: <0K2wND-L7KI8xQyP3RmffiXSqTNy9FwcsHQ2lqAGnkc=.08df7ac9-1e16-4d21-9dbc-51f7aefa170b@github.com> > A simple solution for the lambda problem in the CRaC Core: do not register CALL_SITES at all. > > A second part of the PR introduces an interface for Cleaner.register() with priority parameter. Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: Update src/java.base/share/classes/java/lang/invoke/MethodHandleNatives.java Co-authored-by: Dan Heidinga ------------- Changes: - all: https://git.openjdk.org/crac/pull/76/files - new: https://git.openjdk.org/crac/pull/76/files/9e4291d5..dc83f2be Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=76&range=01 - incr: https://webrevs.openjdk.org/?repo=crac&pr=76&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/crac/pull/76.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/76/head:pull/76 PR: https://git.openjdk.org/crac/pull/76 From akozlov at openjdk.org Thu May 25 13:24:10 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 25 May 2023 13:24:10 GMT Subject: [crac] RFR: Do not register MethodHandleNatives [v2] In-Reply-To: <6lzE0_F3CFyFrfYDBUubclkCoJVXWI-PYAmR8GQC8Jk=.c4050741-5224-4447-81c0-728fb90dcce3@github.com> References: 
<3DSagRsUhYjN-NWSz1b8U-MEkV907zmdouRhsWUF20w=.63dd233b-8193-44a8-aac5-afdebb05bdf1@github.com> <6lzE0_F3CFyFrfYDBUubclkCoJVXWI-PYAmR8GQC8Jk=.c4050741-5224-4447-81c0-728fb90dcce3@github.com> Message-ID: On Thu, 25 May 2023 13:15:03 GMT, Dan Heidinga wrote: >> Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Update src/java.base/share/classes/java/lang/invoke/MethodHandleNatives.java >> >> Co-authored-by: Dan Heidinga > > src/java.base/share/classes/java/lang/invoke/MethodHandleNatives.java line 94: > >> 92: // CleanerFactory class) until cleanup is performed. >> 93: // This PhantomCleanableRef is not registered in any Context as >> 94: // registration cauesed by the core CRaC code leads to deadlock. > > Suggestion: > > // registration caused by the core CRaC code leads to deadlock. Thanks! :) ------------- PR Review Comment: https://git.openjdk.org/crac/pull/76#discussion_r1205510311 From duke at openjdk.org Thu May 25 13:58:30 2023 From: duke at openjdk.org (joeylee) Date: Thu, 25 May 2023 13:58:30 GMT Subject: [crac] RFR: Remove code trigger register during checkpointRestore [v2] In-Reply-To: <4hz71wPij3f8qlD3R5PVrGXbF-vJMhWY8amx7mSEuBQ=.9a4e30b7-c502-4871-9c3d-a60cb6e84e0b@github.com> References: <4hz71wPij3f8qlD3R5PVrGXbF-vJMhWY8amx7mSEuBQ=.9a4e30b7-c502-4871-9c3d-a60cb6e84e0b@github.com> Message-ID: On Thu, 25 May 2023 11:59:32 GMT, Anton Kozlov wrote: >> joeylee has updated the pull request incrementally with one additional commit since the last revision: >> >> update > > Marked as reviewed by akozlov (Lead). @AntonKozlov @rvansa Could you help sponsor this? thanks! 
------------- PR Comment: https://git.openjdk.org/crac/pull/75#issuecomment-1562955309 From duke at openjdk.org Thu May 25 14:24:32 2023 From: duke at openjdk.org (joeylee) Date: Thu, 25 May 2023 14:24:32 GMT Subject: [crac] Integrated: Remove code trigger register during checkpointRestore In-Reply-To: References: Message-ID: On Thu, 25 May 2023 07:33:16 GMT, joeylee wrote: > Register higher priority context during checkpoint could lead to dead lock, this patch removes the code that triggers register during checkpoint call. > > Thread dump for `CracOptionTest` without this patch > > 2023-05-23 18:10:39 > Full thread dump OpenJDK 64-Bit Server VM (17-internal+0-adhoc.ubuntu.jdk mixed mode, sharing): > > Threads class SMR info: > _java_thread_list=0x00007f30a8002610, length=13, elements={ > 0x00007f30ec025690, 0x00007f30ec0b3d30, 0x00007f30ec0b5ae0, 0x00007f30ec0baa40, > 0x00007f30ec0bbdf0, 0x00007f30ec0bd200, 0x00007f30ec0bebb0, 0x00007f30ec0c00e0, > 0x00007f30ec0c1550, 0x00007f30ec0c9350, 0x00007f30ec0cc3d0, 0x00007f30ec0e5c20, > 0x00007f30a8001650 > } > > "main" #1 prio=5 os_prio=0 cpu=2.19ms elapsed=12.30s tid=0x00007f30ec025690 nid=0x69b0 in Object.wait() [0x00007f30f392f000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(java.base at 17-internal/Native Method) > - waiting on <0x00000000ca413e78> (a jdk.internal.crac.JDKContext) > at java.lang.Object.wait(java.base at 17-internal/Object.java:338) > at jdk.crac.impl.AbstractContextImpl.waitWhileCheckpointIsInProgress(java.base at 17-internal/AbstractContextImpl.java:102) > at jdk.crac.impl.PriorityContext.register(java.base at 17-internal/PriorityContext.java:44) > - locked <0x00000000ca413e78> (a jdk.internal.crac.JDKContext) > at jdk.internal.crac.JDKContext.register(java.base at 17-internal/JDKContext.java:108) > at jdk.internal.ref.CleanerImpl$PhantomCleanableRef.(java.base at 17-internal/CleanerImpl.java:173) > at 
java.lang.invoke.MethodHandleNatives$CallSiteContext.make(java.base at 17-internal/MethodHandleNatives.java:93) > at java.lang.invoke.CallSite.(java.base at 17-internal/CallSite.java:144) > at java.lang.invoke.ConstantCallSite.(java.base at 17-internal/ConstantCallSite.java:50) > at java.lang.invoke.InnerClassLambdaMetafactory.buildCallSite(java.base at 17-internal/InnerClassLambdaMetafactory.java:262) > at java.lang.invoke.LambdaMetafactory.metafactory(java.base at 17-internal/LambdaMetafactory.java:341) > at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base at 17-internal/DirectMethodHandle$Holder) > at java.lang.invoke.Invokers$Holder.invokeExact_MT(java.base at 17-internal/Invokers$Holder) > at java.lang.invoke.BootstrapMethodInvoker.invoke(java.base at 17-internal/BootstrapMethodInvoker.java:134) > at java.lang.invoke.CallSite.makeSite(java.base at 17-internal/CallSite.java:315) > at java.lang.invoke.MethodHandleNatives.linkCallSiteImpl(ja... This pull request has now been integrated. Changeset: e80da081 Author: joeyleeeeeee97 Committer: Anton Kozlov URL: https://git.openjdk.org/crac/commit/e80da0811aa4e1107a43c9db9d4e0d0a4ba5271a Stats: 76 lines in 2 files changed: 72 ins; 0 del; 4 mod Remove code trigger register during checkpointRestore Reviewed-by: akozlov ------------- PR: https://git.openjdk.org/crac/pull/75 From duke at openjdk.org Thu May 25 14:50:30 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 25 May 2023 14:50:30 GMT Subject: [crac] RFR: List open FDs through reading /proc/self/fd In-Reply-To: References: <8XFKhgK6WTjIrbDLGm58oyeKWhTIQbYRW63Hyg6pk3Q=.9f07ed12-b98f-4af0-9600-ec765cfd80bf@github.com> Message-ID: On Thu, 25 May 2023 12:52:46 GMT, Anton Kozlov wrote: >> Previously the code was iterating through all possible FD values, up to highest allowed FD number, and required allocation of possibly huge array. Reading /proc/self/fd into a compact array is both more memory efficient and does not require excessive syscalls. 
> > src/hotspot/os/linux/os_linux.cpp line 5814: > >> 5812: // skip "." and ".." >> 5813: continue; >> 5814: } > > May be just look at the first char `== '.'`? I think that checking parse-ability is better, but I don't insist, the dir should not contain anything else. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/67#discussion_r1205633273 From duke at openjdk.org Thu May 25 14:57:44 2023 From: duke at openjdk.org (Radim Vansa) Date: Thu, 25 May 2023 14:57:44 GMT Subject: [crac] RFR: List open FDs through reading /proc/self/fd [v2] In-Reply-To: <8XFKhgK6WTjIrbDLGm58oyeKWhTIQbYRW63Hyg6pk3Q=.9f07ed12-b98f-4af0-9600-ec765cfd80bf@github.com> References: <8XFKhgK6WTjIrbDLGm58oyeKWhTIQbYRW63Hyg6pk3Q=.9f07ed12-b98f-4af0-9600-ec765cfd80bf@github.com> Message-ID: > Previously the code was iterating through all possible FD values, up to highest allowed FD number, and required allocation of possibly huge array. Reading /proc/self/fd into a compact array is both more memory efficient and does not require excessive syscalls. 
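The review question above (skip entries whose first character is `.`, or skip anything that does not parse as a number) can be illustrated with a minimal Java sketch. The real change is C++ in os_linux.cpp; the class and method names below are hypothetical, and the entry names are passed in as strings so the sketch does not depend on /proc being available.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the JDK or of the PR under review.
class FdEntryParser {
    // Returns the descriptor number encoded in a directory entry name, or -1
    // when the name is not a plain non-negative decimal integer. This rejects
    // "." and ".." as well as any other unexpected entry.
    static int parseFd(String name) {
        if (name.isEmpty()) {
            return -1;
        }
        long value = 0;
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c < '0' || c > '9') {
                return -1;
            }
            value = value * 10 + (c - '0');
            if (value > Integer.MAX_VALUE) {
                return -1; // too large to be a valid descriptor number
            }
        }
        return (int) value;
    }

    // Keeps only the entries that parse, producing a compact list of descriptors.
    static List<Integer> parseAll(List<String> names) {
        List<Integer> fds = new ArrayList<>();
        for (String name : names) {
            int fd = parseFd(name);
            if (fd >= 0) {
                fds.add(fd);
            }
        }
        return fds;
    }
}
```

Checking parseability covers the `.` and `..` cases for free and also tolerates any other non-numeric entry, at the cost of a few more comparisons per name.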
Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: Last touches ------------- Changes: - all: https://git.openjdk.org/crac/pull/67/files - new: https://git.openjdk.org/crac/pull/67/files/019650b6..e547cc64 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=67&range=01 - incr: https://webrevs.openjdk.org/?repo=crac&pr=67&range=00-01 Stats: 7 lines in 1 file changed: 3 ins; 2 del; 2 mod Patch: https://git.openjdk.org/crac/pull/67.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/67/head:pull/67 PR: https://git.openjdk.org/crac/pull/67 From akozlov at openjdk.org Thu May 25 15:27:33 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 25 May 2023 15:27:33 GMT Subject: [crac] RFR: Do not register MethodHandleNatives [v2] In-Reply-To: <0K2wND-L7KI8xQyP3RmffiXSqTNy9FwcsHQ2lqAGnkc=.08df7ac9-1e16-4d21-9dbc-51f7aefa170b@github.com> References: <3DSagRsUhYjN-NWSz1b8U-MEkV907zmdouRhsWUF20w=.63dd233b-8193-44a8-aac5-afdebb05bdf1@github.com> <0K2wND-L7KI8xQyP3RmffiXSqTNy9FwcsHQ2lqAGnkc=.08df7ac9-1e16-4d21-9dbc-51f7aefa170b@github.com> Message-ID: On Thu, 25 May 2023 13:24:09 GMT, Anton Kozlov wrote: >> A simple solution for the lambda problem in the CRaC Core: do not register CALL_SITES at all. >> >> A second part of the PR introduces an interface for Cleaner.register() with priority parameter. 
> > Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Update src/java.base/share/classes/java/lang/invoke/MethodHandleNatives.java > > Co-authored-by: Dan Heidinga Tests fail by independent reason (see #77) ------------- PR Comment: https://git.openjdk.org/crac/pull/76#issuecomment-1563100073 From akozlov at openjdk.org Thu May 25 16:02:38 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Thu, 25 May 2023 16:02:38 GMT Subject: [crac] RFR: List open FDs through reading /proc/self/fd [v2] In-Reply-To: References: <8XFKhgK6WTjIrbDLGm58oyeKWhTIQbYRW63Hyg6pk3Q=.9f07ed12-b98f-4af0-9600-ec765cfd80bf@github.com> Message-ID: On Thu, 25 May 2023 14:57:44 GMT, Radim Vansa wrote: >> Previously the code was iterating through all possible FD values, up to highest allowed FD number, and required allocation of possibly huge array. Reading /proc/self/fd into a compact array is both more memory efficient and does not require excessive syscalls. > > Radim Vansa has updated the pull request incrementally with one additional commit since the last revision: > > Last touches Thanks! ------------- Marked as reviewed by akozlov (Lead). PR Review: https://git.openjdk.org/crac/pull/67#pullrequestreview-1444223762 From duke at openjdk.org Fri May 26 07:02:26 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 26 May 2023 07:02:26 GMT Subject: [crac] RFR: Do not register MethodHandleNatives [v2] In-Reply-To: <0K2wND-L7KI8xQyP3RmffiXSqTNy9FwcsHQ2lqAGnkc=.08df7ac9-1e16-4d21-9dbc-51f7aefa170b@github.com> References: <3DSagRsUhYjN-NWSz1b8U-MEkV907zmdouRhsWUF20w=.63dd233b-8193-44a8-aac5-afdebb05bdf1@github.com> <0K2wND-L7KI8xQyP3RmffiXSqTNy9FwcsHQ2lqAGnkc=.08df7ac9-1e16-4d21-9dbc-51f7aefa170b@github.com> Message-ID: On Thu, 25 May 2023 13:24:09 GMT, Anton Kozlov wrote: >> A simple solution for the lambda problem in the CRaC Core: do not register CALL_SITES at all. 
>> >> A second part of the PR introduces an interface for Cleaner.register() with priority parameter. > > Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Update src/java.base/share/classes/java/lang/invoke/MethodHandleNatives.java > > Co-authored-by: Dan Heidinga LGTM, except the few typos and naming. src/java.base/share/classes/java/lang/ref/Cleaner.java line 225: > 223: > 224: /** > 225: * Register an object and object and also register the underlying Reference with a CRaC priority. an object and action src/java.base/share/classes/jdk/internal/access/JavaLangRefAccess.java line 52: > 50: > 51: /** > 52: * Registers an object and an action in a cleaner, with action synhronized with a CRaC priority. typo: `synhronized` src/java.base/share/classes/jdk/internal/access/JavaLangRefAccess.java line 54: > 52: * Registers an object and an action in a cleaner, with action synhronized with a CRaC priority. > 53: */ > 54: Cleaner.Cleanable register(Cleaner cleaner, Object obj, Runnable action, JDKResource.Priority priority); The method should be probably called something like `registerCleanable`, or `createCleanable`, given that this is a package-wide access. ------------- Marked as reviewed by rvansa at github.com (no known OpenJDK username). 
PR Review: https://git.openjdk.org/crac/pull/76#pullrequestreview-1444097256 PR Review Comment: https://git.openjdk.org/crac/pull/76#discussion_r1206312316 PR Review Comment: https://git.openjdk.org/crac/pull/76#discussion_r1205647804 PR Review Comment: https://git.openjdk.org/crac/pull/76#discussion_r1205651624 From duke at openjdk.org Fri May 26 09:35:28 2023 From: duke at openjdk.org (Radim Vansa) Date: Fri, 26 May 2023 09:35:28 GMT Subject: [crac] Integrated: List open FDs through reading /proc/self/fd In-Reply-To: <8XFKhgK6WTjIrbDLGm58oyeKWhTIQbYRW63Hyg6pk3Q=.9f07ed12-b98f-4af0-9600-ec765cfd80bf@github.com> References: <8XFKhgK6WTjIrbDLGm58oyeKWhTIQbYRW63Hyg6pk3Q=.9f07ed12-b98f-4af0-9600-ec765cfd80bf@github.com> Message-ID: On Fri, 12 May 2023 07:36:33 GMT, Radim Vansa wrote: > Previously the code was iterating through all possible FD values, up to highest allowed FD number, and required allocation of possibly huge array. Reading /proc/self/fd into a compact array is both more memory efficient and does not require excessive syscalls. This pull request has now been integrated. Changeset: ab3c5123 Author: Radim Vansa Committer: Anton Kozlov URL: https://git.openjdk.org/crac/commit/ab3c51235038c92f06668c227d8f720d85c0de82 Stats: 102 lines in 1 file changed: 27 ins; 23 del; 52 mod List open FDs through reading /proc/self/fd Reviewed-by: akozlov ------------- PR: https://git.openjdk.org/crac/pull/67 From akozlov at openjdk.org Fri May 26 09:40:05 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 26 May 2023 09:40:05 GMT Subject: [crac] RFR: Do not register MethodHandleNatives [v3] In-Reply-To: <3DSagRsUhYjN-NWSz1b8U-MEkV907zmdouRhsWUF20w=.63dd233b-8193-44a8-aac5-afdebb05bdf1@github.com> References: <3DSagRsUhYjN-NWSz1b8U-MEkV907zmdouRhsWUF20w=.63dd233b-8193-44a8-aac5-afdebb05bdf1@github.com> Message-ID: > A simple solution for the lambda problem in the CRaC Core: do not register CALL_SITES at all. 
> > A second part of the PR introduces an interface for Cleaner.register() with priority parameter. Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: Update ------------- Changes: - all: https://git.openjdk.org/crac/pull/76/files - new: https://git.openjdk.org/crac/pull/76/files/dc83f2be..d33323c3 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=76&range=02 - incr: https://webrevs.openjdk.org/?repo=crac&pr=76&range=01-02 Stats: 5 lines in 4 files changed: 0 ins; 0 del; 5 mod Patch: https://git.openjdk.org/crac/pull/76.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/76/head:pull/76 PR: https://git.openjdk.org/crac/pull/76 From akozlov at openjdk.org Fri May 26 11:56:37 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 26 May 2023 11:56:37 GMT Subject: [crac] RFR: Do not register MethodHandleNatives [v4] In-Reply-To: <3DSagRsUhYjN-NWSz1b8U-MEkV907zmdouRhsWUF20w=.63dd233b-8193-44a8-aac5-afdebb05bdf1@github.com> References: <3DSagRsUhYjN-NWSz1b8U-MEkV907zmdouRhsWUF20w=.63dd233b-8193-44a8-aac5-afdebb05bdf1@github.com> Message-ID: > A simple solution for the lambda problem in the CRaC Core: do not register CALL_SITES at all. > > A second part of the PR introduces an interface for Cleaner.register() with priority parameter. Anton Kozlov has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains five additional commits since the last revision: - Merge branch 'openjdk:crac' into do-not-register-mhn - Update - Update src/java.base/share/classes/java/lang/invoke/MethodHandleNatives.java Co-authored-by: Dan Heidinga - Use internal interfaces for registration - Do not register MethodHandleNatives ------------- Changes: - all: https://git.openjdk.org/crac/pull/76/files - new: https://git.openjdk.org/crac/pull/76/files/d33323c3..62164cd3 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=76&range=03 - incr: https://webrevs.openjdk.org/?repo=crac&pr=76&range=02-03 Stats: 381 lines in 6 files changed: 287 ins; 30 del; 64 mod Patch: https://git.openjdk.org/crac/pull/76.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/76/head:pull/76 PR: https://git.openjdk.org/crac/pull/76 From akozlov at openjdk.org Fri May 26 15:20:35 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 26 May 2023 15:20:35 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v22] In-Reply-To: References: Message-ID: On Mon, 22 May 2023 11:48:22 GMT, Jan Kratochvil wrote: >> Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. >> >> 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. >> 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC >> >> (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. 
Therefore there had to be implemented a new OpenJDK option: >> >>> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine >> >> It is intended to specify the lowest common denominator of all CPUs in a farm. Instead of a possible crash of OpenJDK it will now refuse to run: >> >>> Error occurred during initialization of VM >>> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi >> >> (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. >> >> If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfo... 
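The error message quoted above lists three masks, and the "missing features" value appears to be the set of bits present in the image's mask but absent from this machine's. A small Java check of that arithmetic, using only the numbers from the message, is sketched below. The class and method names are hypothetical, this is not HotSpot code, and the two-mask form printed by later revisions of the PR is not modelled.

```java
// Hypothetical helper for reasoning about -XX:CPUFeatures masks; not HotSpot code.
class CpuFeatureMasks {
    // Bits required by the checkpoint image that the restoring machine lacks.
    static long missing(long imageFeatures, long machineFeatures) {
        return imageFeatures & ~machineFeatures;
    }

    // Lowest common denominator: features present on every listed machine.
    // With no arguments this returns all bits set, i.e. no restriction.
    static long commonDenominator(long... machineFeatures) {
        long common = ~0L;
        for (long features : machineFeatures) {
            common &= features;
        }
        return common;
    }

    public static void main(String[] args) {
        long image = 0x7fff9dfcfbf7L;   // features recorded in the checkpoint
        long machine = 0x421801fcfbd7L; // features of the restoring machine
        System.out.printf("missing = 0x%x%n", missing(image, machine));
        System.out.printf("common  = 0x%x%n", commonDenominator(image, machine));
    }
}
```

With the values from the message, `missing` evaluates to 0x3de79c000020 and the common denominator to 0x421801fcfbd7, which is the value the message asks to pass to `-XX:CPUFeatures` when creating the checkpoint.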
> > Jan Kratochvil has updated the pull request incrementally with four additional commits since the last revision: > > - refactor > - refactor > - refactor > - Fix el7 compatibility. src/hotspot/cpu/x86/vm_version_x86.cpp line 1297: > 1295: } > 1296: > 1297: glibc_not_using((CPU_MAX - 1) & ~_features, (GLIBC_MAX - 1) & ~_glibc_features); Apparently this triggers re-exec even with default command line parameters, which does not look expected, see below. Also, we need a way to disable CPU feature reduction as a workaround to possible bugs in the implementation. $ ./build/linux-x86_64-server-release/images/jdk/bin/java -XX:+ShowCPUFeatures -version This machine's CPU features are: -XX:CPUFeatures=0x61805fdfbff,0x1e6 Re-exec of java with new environment variable: GLIBC_TUNABLES=:glibc.cpu.hwcaps=,-RTM,-AVX512F,-AVX512CD,-AVX512BW,-AVX512DQ,-AVX512ER,-AVX512PF,-AVX512VL,-IBT,-FMA4,-SHSTK This machine's CPU features are: -XX:CPUFeatures=0x61805fdfbff,0x1e6 Environment variable already set, both glibc CPU_FEATURE_ACTIVE and ld.so --list-diagnostics are unavailable - re-exec suppressed: GLIBC_TUNABLES=:glibc.cpu.hwcaps=,-RTM,-AVX512F,-AVX512CD,-AVX512BW,-AVX512DQ,-AVX512ER,-AVX512PF,-AVX512VL,-IBT,-FMA4,-SHSTK openjdk version "17-internal" 2021-09-14 OpenJDK Runtime Environment (build 17-internal+0-adhoc..crac) OpenJDK 64-Bit Server VM (build 17-internal+0-adhoc..crac, mixed mode, sharing) ------------- PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1206944881 From akozlov at openjdk.org Fri May 26 15:26:27 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Fri, 26 May 2023 15:26:27 GMT Subject: [crac] Integrated: Do not register MethodHandleNatives In-Reply-To: <3DSagRsUhYjN-NWSz1b8U-MEkV907zmdouRhsWUF20w=.63dd233b-8193-44a8-aac5-afdebb05bdf1@github.com> References: <3DSagRsUhYjN-NWSz1b8U-MEkV907zmdouRhsWUF20w=.63dd233b-8193-44a8-aac5-afdebb05bdf1@github.com> Message-ID: On Thu, 25 May 2023 12:12:24 GMT, Anton Kozlov wrote: > A simple solution 
for the lambda problem in the CRaC Core: do not register CALL_SITES at all. > > A second part of the PR introduces an interface for Cleaner.register() with priority parameter. This pull request has now been integrated. Changeset: cdf4c35d Author: Anton Kozlov URL: https://git.openjdk.org/crac/commit/cdf4c35d7ba74690421648b1faa9d0ef66665a3c Stats: 39 lines in 6 files changed: 28 ins; 8 del; 3 mod Do not register MethodHandleNatives ------------- PR: https://git.openjdk.org/crac/pull/76 From akozlov at openjdk.org Mon May 29 17:08:27 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Mon, 29 May 2023 17:08:27 GMT Subject: [crac] RFR: Introduce per-Priority Context with different policies In-Reply-To: References: Message-ID: On Thu, 25 May 2023 06:22:08 GMT, Radim Vansa wrote: >> Apparently you mean why the global context implementation has been changed, not why the test has to use an explicit implementation. >> >> Honestly I did not realize the global context properties were changed in #60, so I just want to revert that for a while. This was rather big change. And that deserved a separate line in the javadoc, which is btw the specification, not the tests (although they are definetely useful). I lean toward using BlockingContext, but want to think a bit more about that, with the spec change. > > Yes, the behaviour should be specified through javadoc but that does not say anything about registration when the checkpoint is proceeded. Per your logic that behaviour is unspecified and hence the change doesn't matter. Moreover, you've explicitly asked to postpone javadoc changes into #65 which did not get review in 2 weeks. > > Rather than changing implementation forth and back you could explain why certain behaviour is more useful. Now you've removed a test for certain codepath - creating resource in global context - if you think it should behave in a different way you should keep it and assert different outcome. 
Given that we've hit so many self-deadlocks after the blocking was introduced, it seems the same will happen to users, who will run into the same problem with the global context, which we suggest by default. I'm still trying to untangle the consequences of #60, with this PR in particular. You can see the policy for the global context is going to be specified by a single line of code. Anyway, sorry for the reviews taking so long. You can simplify the reviewer's job by separating functional and non-functional changes. Rephrasing javadoc for readability is non-functional, changing the meaning is functional. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1209468675 From akozlov at openjdk.org Mon May 29 18:22:39 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Mon, 29 May 2023 18:22:39 GMT Subject: [crac] RFR: Introduce per-Priority Context with different policies [v2] In-Reply-To: References: Message-ID: > A follow-up work for #60: > > * Each priority now has a dedicated context, so contexts may provide different policies. CALL_SITE now uses new CriticalUnorderedContext that runs beforeCheckpoint on concurrent registration, fixes [1]. Whether or not CALL_SITE needs to be registered at all is an open question and out of scope of this PR. > * the Global Context reverted from BlockingOrderedContext to OrderedContext, as that may have a huge impact on users. Probably we'll want to expose blocking/criticalUnordered context along the global one, or at some point expose an implementation. But this is also out of scope of the PR. > * hierarchy of the Context implementations is cleaned up a bit [2] > > The JDKContext is now just a holder of ClaimedFDs, I'll address this in a follow-up that depends on this Context follow-up.
> > [1] https://github.com/openjdk/crac/pull/60#issuecomment-1545588281 > [2] https://github.com/openjdk/crac/pull/60#discussion_r1185510445 Anton Kozlov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 18 commits: - Merge remote-tracking branch 'a/crac/context-update' into context-update - Cleanup - Cleanup - Update - All done - Revert global Context - Final touches - Interrupt does not work - Update test - Merge remote-tracking branch 'jdk/crac/crac' into context-update - ... and 8 more: https://git.openjdk.org/crac/compare/cdf4c35d...5182ba92 ------------- Changes: https://git.openjdk.org/crac/pull/74/files Webrev: https://webrevs.openjdk.org/?repo=crac&pr=74&range=01 Stats: 1011 lines in 30 files changed: 375 ins; 551 del; 85 mod Patch: https://git.openjdk.org/crac/pull/74.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/74/head:pull/74 PR: https://git.openjdk.org/crac/pull/74 From akozlov at openjdk.org Mon May 29 18:28:31 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Mon, 29 May 2023 18:28:31 GMT Subject: [crac] RFR: Introduce per-Priority Context with different policies [v2] In-Reply-To: References: Message-ID: On Thu, 18 May 2023 11:57:28 GMT, Radim Vansa wrote: >> Anton Kozlov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 18 commits: >> >> - Merge remote-tracking branch 'a/crac/context-update' into context-update >> - Cleanup >> - Cleanup >> - Update >> - All done >> - Revert global Context >> - Final touches >> - Interrupt does not work >> - Update test >> - Merge remote-tracking branch 'jdk/crac/crac' into context-update >> - ... 
and 8 more: https://git.openjdk.org/crac/compare/cdf4c35d...5182ba92 > > src/java.base/share/classes/jdk/crac/impl/CriticalUnorderedContext.java line 85: > >> 83: invokeBeforeCheckpoint(resource); >> 84: } catch (Exception e) { >> 85: concurrentCheckpointException.handle(e); > > I **really** dislike the fact that the exception is reported only during restore, which may never happen. The checkpoint should be marked for failure in here. This makes sense. As a thought exercise, it is possible to add a last Resource that will query all pending exceptions in all previous contexts, or make this context block once all beforeCheckpoint are done and there is nowhere to report exceptions. Both are also possible for user contexts, outside of the system classes. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1209500968 From akozlov at openjdk.org Mon May 29 18:32:23 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Mon, 29 May 2023 18:32:23 GMT Subject: [crac] RFR: Use special class for exception aggregates [v4] In-Reply-To: <6DVU9HYy46e5dglqxqUXNfl17sKEM7XuP1H5kplhEC8=.caead7f0-602c-4fa3-a1e5-97b09a50beb4@github.com> References: <_AMDGDW7jBHD0sWCap1QKdw3fpk8l4yG8SquNCZdRRY=.efb87be3-45f8-4c99-b470-b7cba2669ed7@github.com> <_HutEHb0_FgiGAmImfmICJMIl2KJX8IlDVwpxTdAw28=.db1c4633-74a9-462f-a809-7ef3c4cc236f@github.com> <6DVU9HYy46e5dglqxqUXNfl17sKEM7XuP1H5kplhEC8=.caead7f0-602c-4fa3-a1e5-97b09a50beb4@github.com> Message-ID: On Fri, 19 May 2023 06:52:47 GMT, Radim Vansa wrote: >> Yes, it is a bug that javax.crac does not follow jdk.crac. >> >> Agree about the constructor without args. >> >> And I still want to delete the constructor with a message. We can introduce a jdk.crac.impl.CheckpointMessageException (alongside j.c.i.CheckpointOpenResourceException) which we use to communicate the different reasons a checkpoint is not successful.
> > What about turning the inheritance the other way: final CheckpointAggregateException extends CheckpointException? This would be easily distinguishable, no need to change anything on Context interface. The no-arg constructor in CheckpointException would be protected. > It's kind of natural that exceptions carry messages. What will be the point of CheckpointException then? A more specific exception will also be preferable instead of that, isn't it? ------------- PR Review Comment: https://git.openjdk.org/crac/pull/64#discussion_r1209502345 From duke at openjdk.org Tue May 30 06:22:43 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 30 May 2023 06:22:43 GMT Subject: [crac] RFR: Introduce per-Priority Context with different policies [v2] In-Reply-To: References: Message-ID: On Mon, 29 May 2023 17:05:47 GMT, Anton Kozlov wrote: >> Yes, the behaviour should be specified through javadoc but that does not say anything about registration when the checkpoint is proceeded. Per your logic that behaviour is unspecified and hence the change doesn't matter. Moreover, you've explicitly asked to postpone javadoc changes into #65 which did not get review in 2 weeks. >> >> Rather than changing implementation forth and back you could explain why certain behaviour is more useful. Now you've removed a test for certain codepath - creating resource in global context - if you think it should behave in a different way you should keep it and assert different outcome. > > Having that we've met so many self-deadlocks after the blocking was introduced, it seems it will happen with the users who will run into the same problem with the global context, which we suggest by default. I'm still trying to untagle with consequences of #60, with this PR in particular. You can see the policy for the global context is going to be specified by a single line of code. Anyway sorry for reviews taking so long. You can simplify reviewer job by separating functional and non-functional changes. 
Rephrasing javadoc for readability is non functional, changing the meaning is functional. One of the points in #60 was to make user errors *visible*, rather than silently passing - that was one reason I pushed that PR hard. Now looking again into `OrderedContext` that's what this PR brings and I strongly disagree. There's two ways to cope with those errors in the order of registration I know of: either throw an error (and I don't mind if it's from the `Core.checkpointRestore` or straight from `register`) or block, and you've chosen the latter option, stating that the problem is clearly visible from the threaddump. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1209754899 From duke at openjdk.org Tue May 30 06:27:52 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 30 May 2023 06:27:52 GMT Subject: [crac] RFR: Introduce per-Priority Context with different policies [v2] In-Reply-To: References: Message-ID: On Mon, 29 May 2023 18:26:02 GMT, Anton Kozlov wrote: >> src/java.base/share/classes/jdk/crac/impl/CriticalUnorderedContext.java line 85: >> >>> 83: invokeBeforeCheckpoint(resource); >>> 84: } catch (Exception e) { >>> 85: concurrentCheckpointException.handle(e); >> >> I **really** dislike the fact that the exception is reported only during restore, which may never happen. The checkpoint should be marked for failure in here. > > This makes sense. As a thought excercise, it is possible to have add a last Resource, that will query all pending exceptions in all previous contextes, or make this context block once all beforeCheckpoint are done and there is nowhere to report exceptions. Both are also possible for user contextes, outside of the system classes. If you're speaking about querying Contexts, that means extending its API for traversing and fetching those. Alternatively, should the contexts report exceptions to a well-known resource this resource would become a part of the API. 
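The two ways of handling a registration that arrives while a checkpoint is running, throwing or blocking, can be contrasted with a small sketch of the throwing variant. Everything here is hypothetical (`SketchResource`, `ThrowingRegistrationContext`); it is not the jdk.crac implementation, only an illustration of how throwing from `register` turns the ordering error into a recorded failure instead of a hung thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a resource; not the jdk.crac.Resource interface.
interface SketchResource {
    void beforeCheckpoint() throws Exception;
}

// Hypothetical context that rejects registration while a checkpoint is running.
class ThrowingRegistrationContext {
    private final List<SketchResource> resources = new ArrayList<>();
    private boolean checkpointInProgress;

    synchronized void register(SketchResource resource) {
        if (checkpointInProgress) {
            // Fail fast instead of waiting for the checkpoint to finish.
            throw new IllegalStateException("registration during checkpoint");
        }
        resources.add(resource);
    }

    // Notifies every resource, newest first in this sketch, and returns all
    // failures instead of stopping at the first one.
    List<Exception> beforeCheckpoint() {
        List<SketchResource> snapshot;
        synchronized (this) {
            checkpointInProgress = true;
            snapshot = new ArrayList<>(resources);
        }
        List<Exception> failures = new ArrayList<>();
        try {
            for (int i = snapshot.size() - 1; i >= 0; i--) {
                try {
                    // The lock is not held here, so a resource that calls
                    // register() gets an exception rather than a deadlock.
                    snapshot.get(i).beforeCheckpoint();
                } catch (Exception e) {
                    failures.add(e);
                }
            }
        } finally {
            synchronized (this) {
                checkpointInProgress = false;
            }
        }
        return failures;
    }
}
```

A blocking variant would replace the `throw` with a wait on the same flag, which is where a resource registering from its own `beforeCheckpoint` would hang.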
------------- PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1209758424 From duke at openjdk.org Tue May 30 06:35:44 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 30 May 2023 06:35:44 GMT Subject: [crac] RFR: Use special class for exception aggregates [v4] In-Reply-To: References: <_AMDGDW7jBHD0sWCap1QKdw3fpk8l4yG8SquNCZdRRY=.efb87be3-45f8-4c99-b470-b7cba2669ed7@github.com> <_HutEHb0_FgiGAmImfmICJMIl2KJX8IlDVwpxTdAw28=.db1c4633-74a9-462f-a809-7ef3c4cc236f@github.com> <6DVU9HYy46e5dglqxqUXNfl17sKEM7XuP1H5kplhEC8=.caead7f0-602c-4fa3-a1e5-97b09a50beb4@github.com> Message-ID: On Mon, 29 May 2023 18:29:38 GMT, Anton Kozlov wrote: >> What about turning the inheritance the other way: final CheckpointAggregateException extends CheckpointException? This would be easily distinguishable, no need to change anything on Context interface. The no-arg constructor in CheckpointException would be protected. >> It's kind of natural that exceptions carry messages. > > What will be the point of CheckpointException then? A more specific exception will also be preferable instead of that, isn't it? `Core` would throw a generic exception when something failed in the native part (`criuengine` returned exit code 1...). A custom implementation of `Context` would throw it when it can't call its children for some reason (but it's not a failure in the Resource itself). ------------- PR Review Comment: https://git.openjdk.org/crac/pull/64#discussion_r1209764923 From akozlov at openjdk.org Tue May 30 10:26:35 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 30 May 2023 10:26:35 GMT Subject: [crac] RFR: Introduce per-Priority Context with different policies [v2] In-Reply-To: References: Message-ID: On Tue, 30 May 2023 06:25:08 GMT, Radim Vansa wrote: >> This makes sense. 
As a thought exercise, it is possible to add a last Resource that will query all pending exceptions in all previous contexts, or make this context block once all beforeCheckpoint are done and there is nowhere to report exceptions. Both are also possible for user contexts, outside of the system classes.
>
> If you're speaking about querying Contexts, that means extending its API for traversing and fetching those. Alternatively, should the contexts report exceptions to a well-known resource this resource would become a part of the API.

It is possible to add a query method to JDK contexts without changing the public context interface. In the same way, the alternative won't require changes in the public API (it will be a convention between JDK contexts and the well-known resource).

-------------

PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1210065601

From duke at openjdk.org Tue May 30 11:01:27 2023
From: duke at openjdk.org (Radim Vansa)
Date: Tue, 30 May 2023 11:01:27 GMT
Subject: [crac] RFR: Introduce per-Priority Context with different policies [v2]
In-Reply-To:
References:
Message-ID:

On Tue, 30 May 2023 10:23:52 GMT, Anton Kozlov wrote:

>> If you're speaking about querying Contexts, that means extending its API for traversing and fetching those. Alternatively, should the contexts report exceptions to a well-known resource this resource would become a part of the API.
>
> It is possible to add a query method to JDK contexts without changing the public context interface. In the same way, the alternative won't require changes in the public API (it will be a convention between JDK contexts and the well-known resource).

Sure, if we're talking only about JDKContext and this is used nowhere except in JDKContext the options are wider, and your proposed solution would work.
------------- PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1210102573 From akozlov at openjdk.org Tue May 30 12:39:25 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 30 May 2023 12:39:25 GMT Subject: [crac] RFR: Introduce per-Priority Context with different policies [v2] In-Reply-To: References: Message-ID: On Tue, 30 May 2023 06:20:24 GMT, Radim Vansa wrote: >> Having that we've met so many self-deadlocks after the blocking was introduced, it seems it will happen with the users who will run into the same problem with the global context, which we suggest by default. I'm still trying to untagle with consequences of #60, with this PR in particular. You can see the policy for the global context is going to be specified by a single line of code. Anyway sorry for reviews taking so long. You can simplify reviewer job by separating functional and non-functional changes. Rephrasing javadoc for readability is non functional, changing the meaning is functional. > > One of the points in #60 was to make user errors *visible*, rather than silently passing - that was one reason I pushed that PR hard. Now looking again into `OrderedContext` that's what this PR brings and I strongly disagree. There's two ways to cope with those errors in the order of registration I know of: either throw an error (and I don't mind if it's from the `Core.checkpointRestore` or straight from `register`) or block, and you've chosen the latter option, stating that the problem is clearly visible from the threaddump. It was burried along other things that PR was doing. After some thought [1] Resources are used for two different purposes: fix-up the state to ensure checkpoint won't fail later (close file descriptors, which otherwise will be detected later); and later we discovered some resources are ensuring properties of the image (e.g. cleaning secrets). 
The blocking behavior makes sense for the latter, and for real concurrent registration it uses the race to delay registration and thus the problem is resolved without user intervention. But it brings a significant usability drawback: a deadlock on registration from the same thread once notification has started. And this is IMO more frequent than a registration from another thread, which is waited for by the thread executing e.g. beforeCheckpoint. For the clean-up, blocking

We should not impose such a big inconvenience for the more frequent use case, regardless of how the error is detected or reported. Applications, in their turn, can implement a blocking context with the necessary guarantees (or use some implementation, which we eventually expose, once we have these contexts shaken out).

My take that blocking is preferred to an exception was in the context of the JDK implementation: we do not expose the problem via the API, but auto-deadlocks are a bug in the JDK implementation.

[1] https://github.com/openjdk/crac/pull/60#issuecomment-1545853147

-------------

PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1210208633

From akozlov at openjdk.org Tue May 30 12:46:22 2023
From: akozlov at openjdk.org (Anton Kozlov)
Date: Tue, 30 May 2023 12:46:22 GMT
Subject: [crac] RFR: Prevent concurrent cleanup by cleaner thread and checkpoint notifications
In-Reply-To:
References:
Message-ID:

On Thu, 25 May 2023 07:25:17 GMT, Radim Vansa wrote:

>> src/java.base/share/classes/jdk/internal/ref/CleanerImpl.java line 151:
>>
>>> 149: while (blockForCheckpoint) {
>>> 150: wait();
>>> 151: }
>>
>> Once we've got here, is it possible to ensure Cleaners have been called, and drop separate registration of Cleaners?
>>
>> A concurrent cleaner registration is not a problem, as that depends on GC which is not predictable. I.e. if that happens, a slight race may also cause the cleaner to run after restore.
>
> I don't follow.
PhantomCleanableRefs are cleaned strictly after this point (these have later priority), other cleaners will be queued up by GC but not called until restore. I mean, once we synchronized cleaner thread from picking up PhantomCleanableRefs, there is apparently no reason to go to the next priority and discover cleanups through registrations in the CRaC -- instead, all cleanups are discoverable thorugh the list maintained by the cleaner. It seems we can go through the list right from this Resource and run cleanups. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/73#discussion_r1210217106 From akozlov at openjdk.org Tue May 30 14:08:28 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 30 May 2023 14:08:28 GMT Subject: [crac] RFR: Introduce per-Priority Context with different policies [v3] In-Reply-To: References: Message-ID: <5tWVRK3bl2hPFqiiSptqK8RuBcYM0qEIfq26noI_v9I=.4c0967dd-71f1-432f-9833-8980f803c62d@github.com> > A follow-up work for #60: > > * Each priority now has a dedicated context, so contextes may provide different policies. > * the Global Context reverted from BlockingOrderedContext to OrderedContext, as that may have a huge impact on users. Probably we'll want to expose blocking/criticalUnorderd context along the global one, or at some point expose an implementation. But this is also out of scope of the PR. > * hierachy of the Context implementations are cleaned up a bit [2] > > The JDKContext is now just a holder of ClaimedFDs, I'll address this in a follow-up that depends on this Context follow-up. 
> > [1] https://github.com/openjdk/crac/pull/60#issuecomment-1545588281 > [2] https://github.com/openjdk/crac/pull/60#discussion_r1185510445 Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: Actually remove CriticalUnorderedContext ------------- Changes: - all: https://git.openjdk.org/crac/pull/74/files - new: https://git.openjdk.org/crac/pull/74/files/5182ba92..0eac2591 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=74&range=02 - incr: https://webrevs.openjdk.org/?repo=crac&pr=74&range=01-02 Stats: 91 lines in 2 files changed: 0 ins; 91 del; 0 mod Patch: https://git.openjdk.org/crac/pull/74.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/74/head:pull/74 PR: https://git.openjdk.org/crac/pull/74 From duke at openjdk.org Tue May 30 14:32:23 2023 From: duke at openjdk.org (Radim Vansa) Date: Tue, 30 May 2023 14:32:23 GMT Subject: [crac] RFR: Introduce per-Priority Context with different policies [v3] In-Reply-To: References: Message-ID: On Tue, 30 May 2023 12:36:27 GMT, Anton Kozlov wrote: >> One of the points in #60 was to make user errors *visible*, rather than silently passing - that was one reason I pushed that PR hard. Now looking again into `OrderedContext` that's what this PR brings and I strongly disagree. There's two ways to cope with those errors in the order of registration I know of: either throw an error (and I don't mind if it's from the `Core.checkpointRestore` or straight from `register`) or block, and you've chosen the latter option, stating that the problem is clearly visible from the threaddump. > > It was burried along other things that PR was doing. After some thought [1] Resources are used for two different purposes: fix-up the state to ensure checkpoint won't fail later (close file descriptors, which otherwise will be detected later); and later we discovered some resources are ensuring properties of the image (e.g. cleaning secrets). 
The blocking behavior makes sense for the latter, and for real concurrent registration it uses the race to delay registration and thus the problem is resolved without user intervention. But it brings a significant usability drawback: a deadlock on registration from the same thread once notification has started. And this is IMO more frequent than a registration from another thread, which is waited for by the thread executing e.g. beforeCheckpoint. For the clean-up, blocking
>
> We should not impose such a big inconvenience for the more frequent use case, regardless of how the error is detected or reported. Applications, in their turn, can implement a blocking context with the necessary guarantees (or use some implementation, which we eventually expose, once we have these contexts shaken out).
>
> My take that blocking is preferred to an exception was in the context of the JDK implementation: we do not expose the problem via the API, but auto-deadlocks are a bug in the JDK implementation.
>
> [1] https://github.com/openjdk/crac/pull/60#issuecomment-1545853147

You won't know whether a resource registered in the global context is to be invoked optionally or if it's mandatory; any checks like those for FDs are only a last line of defense and a debugging help. Therefore not being executed before C/R is a problem the platform should detect, preventing C/R from happening, to be on the safe side. We should not choose the unsafe option as the default, because users likely won't find out that something fishy is happening without any 'inconvenience'. As you say, users can change the semantics in their own context once they know.

I agree that deadlock is not the best way to report this; if you want to choose different means for the global context I am okay with that. We've talked about how `register` throwing might not be the best option as the exception could be swallowed; an exception from `Core.checkpointRestore` along with the stack trace of the registration would be preferable.
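
The non-blocking, non-throwing option could look roughly like the sketch below. It is only an illustration under stated assumptions: `Resource`, `RecordingContext` and `CheckpointFailed` are hypothetical names made up for this example, not the jdk.crac API or any class from the PR. A registration that arrives while notifications are running is recorded together with its stack trace and reported by the checkpoint call, and the late resource stays registered so it can fire next time.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: Resource, RecordingContext and CheckpointFailed are
// hypothetical names, not the jdk.crac API.
public class LateRegistrationSketch {

    public interface Resource {
        void beforeCheckpoint() throws Exception;
    }

    // Aggregate failure; each recorded problem is attached as a suppressed exception.
    public static final class CheckpointFailed extends Exception {
        CheckpointFailed(List<Throwable> problems) {
            super(problems.size() + " problem(s) recorded during checkpoint");
            problems.forEach(this::addSuppressed);
        }
    }

    public static final class RecordingContext {
        private final List<Resource> resources = new ArrayList<>();
        private final List<Throwable> problems = new ArrayList<>();
        private boolean notifying;

        // Never blocks and never throws. A registration that arrives while
        // notifications are running is recorded; the exception created here
        // carries the stack trace of the registering code.
        public synchronized void register(Resource resource) {
            if (notifying) {
                problems.add(new IllegalStateException(
                        "Registered during checkpoint notifications: " + resource));
            }
            resources.add(resource); // the resource gets its chance next time
        }

        public void checkpoint() throws CheckpointFailed {
            List<Resource> snapshot;
            synchronized (this) {
                notifying = true;
                problems.clear();
                snapshot = new ArrayList<>(resources);
            }
            // Run every resource regardless of earlier failures. The lock is
            // not held here, so a resource may call register() without deadlock.
            for (Resource resource : snapshot) {
                try {
                    resource.beforeCheckpoint();
                } catch (Exception e) {
                    synchronized (this) {
                        problems.add(e);
                    }
                }
            }
            synchronized (this) {
                notifying = false;
                if (!problems.isEmpty()) {
                    throw new CheckpointFailed(new ArrayList<>(problems));
                }
            }
        }
    }

    // Registers a resource whose beforeCheckpoint registers another one, then
    // returns the problems reported by the failed checkpoint.
    public static Throwable[] demo() {
        RecordingContext context = new RecordingContext();
        context.register(() -> context.register(() -> { }));
        try {
            context.checkpoint();
            return new Throwable[0];
        } catch (CheckpointFailed e) {
            return e.getSuppressed();
        }
    }

    public static void main(String[] args) {
        for (Throwable problem : demo()) {
            problem.printStackTrace();
        }
    }
}
```

In this sketch the user error cannot be swallowed at the `register` call site, because nothing is thrown there; it surfaces from the checkpoint call, and the printed stack trace points at the code that registered too late.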
------------- PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1210371421 From akozlov at openjdk.org Tue May 30 16:03:32 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 30 May 2023 16:03:32 GMT Subject: [crac] RFR: Introduce per-Priority Context with different policies [v3] In-Reply-To: References: Message-ID: On Tue, 30 May 2023 14:29:41 GMT, Radim Vansa wrote: >> It was burried along other things that PR was doing. After some thought [1] Resources are used for two different purposes: fix-up the state to ensure checkpoint won't fail later (close file descriptors, which otherwise will be detected later); and later we discovered some resources are ensuring properties of the image (e.g. cleaning secrets). The blocking behavior makes sense for the latter, and for real concurrent registration it uses the race to delay registration and thus the problem is resolved without user intervention. But it brings significant usability drawback with the deadlock with registration from the same thread when notification has been started. And this is IMO more frequent that a registration from another thread, which is waited for by the thread executing e.g. beforeCheckpoint. For the clean-up, blocking >> >> We should not impose such a big inconvenience for the more frequent use case, regardless how the error is detected or reported. Applications, in their turn, can implement a blocking context with necessary guarantees (or use some implementation, which we eventually expose, once we got this contextes shaked out). >> >> My take on blocking is preffered to an exception was in context of JDK implementation: we do not expose the problem via API, but auto-deadlocks is a bug in the JDK implementaiton. 
>> >> [1] https://github.com/openjdk/crac/pull/60#issuecomment-1545853147 > > You won't know whether the resource registered in global context is to be invoked optionally or if it's mandatory; any checks like those for FDs is only a last line of defense and debugging help. Therefore not being executed before C/R is a problem the platform should detect, preventing C/R from happening, to be on the safe side. We should not choose the unsafe for default, because users likely won't find out that something fishy is happening without any 'inconvenience'. As you say, users can change the semantics in their own context once they know. > > I agree that deadlock is not the best way to report this, if you want to choice different means for the global context I am okay with that. We've talked about `register` throwing might not be the best option as the exception could be swallowed, exception from `Core.checkpointRestore` along with stack trace of the registration would be preferable. Exactly, that's why existing global context cannot satisfy all requirements. I'm fine to continue this discussion, but as a part of separate discussion about the global context behavior, separated from the refactoring. OK, to make it more comfortable, let's use BlockingContext, once I allowed that to sleep in. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1210499871 From akozlov at openjdk.org Tue May 30 16:14:10 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 30 May 2023 16:14:10 GMT Subject: [crac] RFR: Introduce per-Priority Context with different policies [v4] In-Reply-To: References: Message-ID: > A follow-up work for #60: > > * Each priority now has a dedicated context, so contextes may provide different policies. > * the Global Context reverted from BlockingOrderedContext to OrderedContext, as that may have a huge impact on users. 
Probably we'll want to expose blocking/criticalUnorderd context along the global one, or at some point expose an implementation. But this is also out of scope of the PR. > * hierachy of the Context implementations are cleaned up a bit [2] > > The JDKContext is now just a holder of ClaimedFDs, I'll address this in a follow-up that depends on this Context follow-up. > > [1] https://github.com/openjdk/crac/pull/60#issuecomment-1545588281 > [2] https://github.com/openjdk/crac/pull/60#discussion_r1185510445 Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: Update ------------- Changes: - all: https://git.openjdk.org/crac/pull/74/files - new: https://git.openjdk.org/crac/pull/74/files/0eac2591..9d857c37 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=74&range=03 - incr: https://webrevs.openjdk.org/?repo=crac&pr=74&range=02-03 Stats: 3 lines in 3 files changed: 0 ins; 1 del; 2 mod Patch: https://git.openjdk.org/crac/pull/74.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/74/head:pull/74 PR: https://git.openjdk.org/crac/pull/74 From akozlov at openjdk.org Tue May 30 16:24:24 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Tue, 30 May 2023 16:24:24 GMT Subject: [crac] RFR: Prevent concurrent cleanup by cleaner thread and checkpoint notifications In-Reply-To: <82OoQQrh1BZ_epMMmT9P-qabiJtcHnyoBeYD8sEAiDo=.84f640c4-8eb7-46b5-ad86-0c6d3b8ddd38@github.com> References: <82OoQQrh1BZ_epMMmT9P-qabiJtcHnyoBeYD8sEAiDo=.84f640c4-8eb7-46b5-ad86-0c6d3b8ddd38@github.com> Message-ID: On Thu, 25 May 2023 06:56:51 GMT, Radim Vansa wrote: >> src/java.base/share/classes/jdk/internal/ref/CleanerImpl.java line 182: >> >>> 180: // completed by the cleaner. >>> 181: blockForCheckpoint = true; >>> 182: thread.interrupt(); >> >> Why it has to be interrupt and not notify(), for example? > > The interrupt wakes up cleaner in `queue.remove()` (line 161) in case it's blocking for next task. OK, thanks. 
But what if the interrupt arrives to the thread during a cleaning while that runs, it may interrupt the cleaning, right? ------------- PR Review Comment: https://git.openjdk.org/crac/pull/73#discussion_r1210523210 From jkratochvil at openjdk.org Tue May 30 16:42:06 2023 From: jkratochvil at openjdk.org (Jan Kratochvil) Date: Tue, 30 May 2023 16:42:06 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v23] In-Reply-To: References: Message-ID: > Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. > > 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. > 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC > > (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: > >> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine > > It is intended to specify the lowest common denominator of all CPUs in a farm. 
Instead of a possible crash of OpenJDK it will now refuse to run: > >> Error occurred during initialization of VM >> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi > > (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. > > If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there... Jan Kratochvil has updated the pull request incrementally with three additional commits since the last revision: - +"CPU features being used are" - Fix glibc setting of -XX:CPUFeatures=native. - [!INCLUDE_CPU_FEATURE_ACTIVE && !INCLUDE_LD_SO_LIST_DIAGNOSTICS] Fix useless re-exec. Add -XX:CPUFeatures=verify. 
------------- Changes: - all: https://git.openjdk.org/crac/pull/41/files - new: https://git.openjdk.org/crac/pull/41/files/96174b55..18e93d56 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=22 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=21-22 Stats: 33 lines in 2 files changed: 29 ins; 0 del; 4 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41 From jkratochvil at openjdk.org Tue May 30 17:02:57 2023 From: jkratochvil at openjdk.org (Jan Kratochvil) Date: Tue, 30 May 2023 17:02:57 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v24] In-Reply-To: References: Message-ID: > Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. > > 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. > 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC > > (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: > >> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine > > It is intended to specify the lowest common denominator of all CPUs in a farm. 
Instead of a possible crash of OpenJDK it will now refuse to run: > >> Error occurred during initialization of VM >> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi > > (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. > > If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there... 
Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: +-XX:CPUFeatures=ignore ------------- Changes: - all: https://git.openjdk.org/crac/pull/41/files - new: https://git.openjdk.org/crac/pull/41/files/18e93d56..aabbcd79 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=23 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=22-23 Stats: 25 lines in 2 files changed: 15 ins; 3 del; 7 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41 From jkratochvil at openjdk.org Tue May 30 17:03:51 2023 From: jkratochvil at openjdk.org (Jan Kratochvil) Date: Tue, 30 May 2023 17:03:51 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v22] In-Reply-To: References: Message-ID: On Fri, 26 May 2023 15:17:25 GMT, Anton Kozlov wrote: >> Jan Kratochvil has updated the pull request incrementally with four additional commits since the last revision: >> >> - refactor >> - refactor >> - refactor >> - Fix el7 compatibility. > > src/hotspot/cpu/x86/vm_version_x86.cpp line 1297: > >> 1295: } >> 1296: >> 1297: glibc_not_using((CPU_MAX - 1) & ~_features, (GLIBC_MAX - 1) & ~_glibc_features); > > Apparently this triggers re-exec even with default command line parameters, which does not look expected, see below. > > Also, we need a way to disable CPU feature reduction as a workaround to possible bugs in the implementation. 
> > > $ ./build/linux-x86_64-server-release/images/jdk/bin/java -XX:+ShowCPUFeatures -version > This machine's CPU features are: -XX:CPUFeatures=0x61805fdfbff,0x1e6 > Re-exec of java with new environment variable: GLIBC_TUNABLES=:glibc.cpu.hwcaps=,-RTM,-AVX512F,-AVX512CD,-AVX512BW,-AVX512DQ,-AVX512ER,-AVX512PF,-AVX512VL,-IBT,-FMA4,-SHSTK > This machine's CPU features are: -XX:CPUFeatures=0x61805fdfbff,0x1e6 > Environment variable already set, both glibc CPU_FEATURE_ACTIVE and ld.so --list-diagnostics are unavailable - re-exec suppressed: GLIBC_TUNABLES=:glibc.cpu.hwcaps=,-RTM,-AVX512F,-AVX512CD,-AVX512BW,-AVX512DQ,-AVX512ER,-AVX512PF,-AVX512VL,-IBT,-FMA4,-SHSTK > openjdk version "17-internal" 2021-09-14 > OpenJDK Runtime Environment (build 17-internal+0-adhoc..crac) > OpenJDK 64-Bit Server VM (build 17-internal+0-adhoc..crac, mixed mode, sharing) I have remembered that was intentional for more safety of what jdk vs. glibc detect from CPU. I agree it is too expensive to do a jdk re-exec each time. The behavior you have seen can be now achieved by `-XX:CPUFeatures=verify`. Usefulness of this new option may not be too big. To disable everything around the CPU features is now possible by `-XX:CPUFeatures=ignore`. I agree it may be useful if the glibc interface for these CPU features gets somehow incompatible. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1210569697 From duke at openjdk.org Wed May 31 06:27:27 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 31 May 2023 06:27:27 GMT Subject: [crac] RFR: Introduce per-Priority Context with different policies [v4] In-Reply-To: References: Message-ID: <4yw-YAG8uKkOQIZssvfiWhsgrBFrw7Z_STdvrxz2Ous=.3f84e9f7-6e35-4add-b5d0-6159798c571a@github.com> On Tue, 30 May 2023 16:14:10 GMT, Anton Kozlov wrote: >> A follow-up work for #60: >> >> * Each priority now has a dedicated context, so contextes may provide different policies. 
>> * the Global Context reverted from BlockingOrderedContext to OrderedContext, as that may have a huge impact on users. Probably we'll want to expose blocking/criticalUnorderd context along the global one, or at some point expose an implementation. But this is also out of scope of the PR. >> * hierachy of the Context implementations are cleaned up a bit [2] >> >> The JDKContext is now just a holder of ClaimedFDs, I'll address this in a follow-up that depends on this Context follow-up. >> >> [1] https://github.com/openjdk/crac/pull/60#issuecomment-1545588281 >> [2] https://github.com/openjdk/crac/pull/60#discussion_r1185510445 > > Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Update src/java.base/share/classes/java/lang/ref/Cleaner.java line 225: > 223: > 224: /** > 225: * Register an object and object and also register the underlying Reference with a CRaC priority. Typo src/java.base/share/classes/javax/crac/Core.java line 40: > 38: } > 39: > 40: private static final Context globalContext = new ContextWrapper(new OrderedContext<>()); Please use BlockingOrderedContext here as well, for symmetry. src/java.base/share/classes/jdk/crac/Core.java line 124: > 122: > 123: try { > 124: jdk.internal.crac.Core.getJDKContext().beforeCheckpoint(null); Why don't you register this as a resource? Besides, the way you did it does not work correctly if exceptions are thrown. src/java.base/share/classes/jdk/crac/impl/OrderedContext.java line 38: > 36: * @param > 37: */ > 38: public class OrderedContext extends AbstractContext { This class should not be used anywhere directly, and is not exposed to users => should be abstract. src/java.base/share/classes/jdk/internal/crac/Core.java line 35: > 33: private static JDKContext jdkContext = new JDKContext(); > 34: > 35: public static JDKContext getJDKContext() { Please rename the method along with `JDKContext` rename. 
src/java.base/share/classes/jdk/internal/crac/JDKContext.java line 46: > 44: import java.util.WeakHashMap; > 45: > 46: public class JDKContext implements JDKResource { Since this is not a context anymore I believe that in the current form the class would deserve a rename (`ClaimedFileDescriptors`?) test/jdk/jdk/crac/ContextOrderTest.java line 102: > 100: private static void testRegisterBlocks() throws Exception { > 101: var recorder = new LinkedList(); > 102: BlockingOrderedContext blockingCtx = new BlockingOrderedContext(); We don't need separate contexts now that global one is blocking. test/jdk/jdk/crac/ContextOrderTest.java line 147: > 145: thread.interrupt(); > 146: thread.join(TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime())); > 147: System.out.println(thread.getState() + " " + thread.isAlive()); Looks like forgotten debug logging ------------- PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1211119470 PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1211120888 PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1211123029 PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1211127722 PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1211135018 PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1211131166 PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1211138301 PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1211139415 From duke at openjdk.org Wed May 31 06:27:27 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 31 May 2023 06:27:27 GMT Subject: [crac] RFR: Introduce per-Priority Context with different policies [v4] In-Reply-To: References: Message-ID: On Thu, 18 May 2023 12:31:53 GMT, Radim Vansa wrote: >> Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Update > > 
src/java.base/share/classes/jdk/crac/impl/BlockingOrderedContext.java line 20: > >> 18: // We won't cause IllegalStateException because this is not an unexpected state >> 19: // from the point of CRaC - it probably tried to register some code before. >> 20: throw new RuntimeException("Interrupted thread tried to block in registration of " + resource + " in " + this); > > The use of `this` in the exception relies on naming the context and the `toString()` method for easy identification. Since you've removed these it will show only class type rather than Global Context/custom name/JDK resource priority. I am not sure if this was missed - you've created named context in the tests (so the message in there looks right) but this still does not work well for any contexts regularly used (global/core priority contexts). ------------- PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1211123987 From duke at openjdk.org Wed May 31 07:57:19 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 31 May 2023 07:57:19 GMT Subject: [crac] RFR: Prevent concurrent cleanup by cleaner thread and checkpoint notifications In-Reply-To: References: <82OoQQrh1BZ_epMMmT9P-qabiJtcHnyoBeYD8sEAiDo=.84f640c4-8eb7-46b5-ad86-0c6d3b8ddd38@github.com> Message-ID: <-RFlB7ltnNOwaEW8_tYPvxmM5zEfiKwE_4kR5nXN_S4=.9fb8e3df-b478-4446-a2b8-ccc59dbb27cb@github.com> On Tue, 30 May 2023 16:19:24 GMT, Anton Kozlov wrote: >> The interrupt wakes up cleaner in `queue.remove()` (line 161) in case it's blocking for next task. > > OK, thanks. But what if the interrupt arrives to the thread during a cleaning while that runs, it may interrupt the cleaning, right? That's correct; I don't see any way to mitigate that, though, besides changing the queue implementation to allow more targeted wakeup. 
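To make the two cases in this exchange concrete, here is a minimal sketch (not code from the PR). It assumes a plain `ReferenceQueue` and a hypothetical class name `InterruptWakeupSketch`; the `Thread.sleep` call merely stands in for any interruptible operation that a cleanup action might perform.

```java
import java.lang.ref.ReferenceQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class InterruptWakeupSketch {

    // Waits up to five seconds for the latch; never throws a checked exception.
    private static boolean awaitQuietly(CountDownLatch latch) {
        try {
            return latch.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Case 1: the worker is blocked (or about to block) in queue.remove();
    // the interrupt is exactly the wakeup we want.
    static boolean interruptWakesRemove() {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        CountDownLatch woken = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            try {
                queue.remove(); // nothing is ever enqueued, so this would block forever
            } catch (InterruptedException e) {
                woken.countDown();
            }
        });
        worker.setDaemon(true);
        worker.start();
        worker.interrupt();
        return awaitQuietly(woken);
    }

    // Case 2: the worker is already running a (simulated) cleanup action;
    // the same interrupt now lands inside the cleanup instead.
    static boolean interruptHitsRunningCleanup() {
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch disturbed = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            started.countDown();
            try {
                Thread.sleep(10_000); // stand-in for an interruptible step of a cleanup
            } catch (InterruptedException e) {
                disturbed.countDown();
            }
        });
        worker.setDaemon(true);
        worker.start();
        if (!awaitQuietly(started)) {
            return false;
        }
        worker.interrupt();
        return awaitQuietly(disturbed);
    }

    public static void main(String[] args) {
        System.out.println("remove() woken by interrupt: " + interruptWakesRemove());
        System.out.println("cleanup disturbed by interrupt: " + interruptHitsRunningCleanup());
    }
}
```

The interrupting thread cannot tell which of the two states the worker is in, which is why a more targeted wakeup would need support from the queue itself.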
------------- PR Review Comment: https://git.openjdk.org/crac/pull/73#discussion_r1211242978 From duke at openjdk.org Wed May 31 08:11:24 2023 From: duke at openjdk.org (Radim Vansa) Date: Wed, 31 May 2023 08:11:24 GMT Subject: [crac] RFR: Prevent concurrent cleanup by cleaner thread and checkpoint notifications In-Reply-To: References: Message-ID: On Tue, 30 May 2023 12:43:44 GMT, Anton Kozlov wrote: >> I don't follow. PhantomCleanableRefs are cleaned strictly after this point (these have later priority), other cleaners will be queued up by GC but not called until restore. > > I mean, once we have synchronized the cleaner thread to keep it from picking up PhantomCleanableRefs, there is apparently no reason to go to the next priority and discover cleanups through registrations in the CRaC -- instead, all cleanups are discoverable through the list maintained by the cleaner. It seems we can go through the list right from this Resource and run cleanups. Alright, you mean that PCRs won't be Resources anymore and we'll clean them up whether or not they are already in the `queue`, without waiting for the GC to move them there. 
------------- PR Review Comment: https://git.openjdk.org/crac/pull/73#discussion_r1211258883 From akozlov at openjdk.org Wed May 31 08:51:26 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 31 May 2023 08:51:26 GMT Subject: [crac] RFR: Introduce per-Priority Context with different policies [v4] In-Reply-To: <4yw-YAG8uKkOQIZssvfiWhsgrBFrw7Z_STdvrxz2Ous=.3f84e9f7-6e35-4add-b5d0-6159798c571a@github.com> References: <4yw-YAG8uKkOQIZssvfiWhsgrBFrw7Z_STdvrxz2Ous=.3f84e9f7-6e35-4add-b5d0-6159798c571a@github.com> Message-ID: On Wed, 31 May 2023 05:58:00 GMT, Radim Vansa wrote: >> Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Update > > src/java.base/share/classes/jdk/crac/Core.java line 124: > >> 122: >> 123: try { >> 124: jdk.internal.crac.Core.getJDKContext().beforeCheckpoint(null); > > Why don't you register this as a resource? Besides, the way you did it does not work correctly if exceptions are thrown. This part was intentionally omitted. JDKContext does not throw. > The JDKContext is now just a holder of ClaimedFDs, I'll address this in a follow-up that depends on this Context follow-up. > src/java.base/share/classes/jdk/crac/impl/OrderedContext.java line 38: > >> 36: * @param >> 37: */ >> 38: public class OrderedContext extends AbstractContext { > > This class should not be used anywhere directly, and is not exposed to users => should be abstract. On its own it's a valid Context implementation. > src/java.base/share/classes/jdk/internal/crac/JDKContext.java line 46: > >> 44: import java.util.WeakHashMap; >> 45: >> 46: public class JDKContext implements JDKResource { > > Since this is not a context anymore I believe that in the current form the class would deserve a rename (`ClaimedFileDescriptors`?) 
See other comment https://github.com/openjdk/crac/pull/74#discussion_r1211309286 > test/jdk/jdk/crac/ContextOrderTest.java line 102: > >> 100: private static void testRegisterBlocks() throws Exception { >> 101: var recorder = new LinkedList(); >> 102: BlockingOrderedContext blockingCtx = new BlockingOrderedContext(); > > We don't need separate contexts now that global one is blocking. The test checks blocking properties, while blocking behavior of the global context is not specified. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1211309286 PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1211310150 PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1211310555 PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1211311935 From akozlov at openjdk.org Wed May 31 08:57:08 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 31 May 2023 08:57:08 GMT Subject: [crac] RFR: Introduce per-Priority Context with different policies [v5] In-Reply-To: References: Message-ID: > A follow-up work for #60: > > * Each priority now has a dedicated context, so contextes may provide different policies. > * the Global Context reverted from BlockingOrderedContext to OrderedContext, as that may have a huge impact on users. Probably we'll want to expose blocking/criticalUnorderd context along the global one, or at some point expose an implementation. But this is also out of scope of the PR. > * hierachy of the Context implementations are cleaned up a bit [2] > > The JDKContext is now just a holder of ClaimedFDs, I'll address this in a follow-up that depends on this Context follow-up. 
> > [1] https://github.com/openjdk/crac/pull/60#issuecomment-1545588281 > [2] https://github.com/openjdk/crac/pull/60#discussion_r1185510445 Anton Kozlov has updated the pull request incrementally with one additional commit since the last revision: Update ------------- Changes: - all: https://git.openjdk.org/crac/pull/74/files - new: https://git.openjdk.org/crac/pull/74/files/9d857c37..ff3ef243 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=74&range=04 - incr: https://webrevs.openjdk.org/?repo=crac&pr=74&range=03-04 Stats: 4 lines in 3 files changed: 1 ins; 1 del; 2 mod Patch: https://git.openjdk.org/crac/pull/74.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/74/head:pull/74 PR: https://git.openjdk.org/crac/pull/74 From akozlov at openjdk.org Wed May 31 08:57:30 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 31 May 2023 08:57:30 GMT Subject: [crac] RFR: Introduce per-Priority Context with different policies [v5] In-Reply-To: References: Message-ID: On Wed, 31 May 2023 05:59:31 GMT, Radim Vansa wrote: >> src/java.base/share/classes/jdk/crac/impl/BlockingOrderedContext.java line 20: >> >>> 18: // We won't cause IllegalStateException because this is not an unexpected state >>> 19: // from the point of CRaC - it probably tried to register some code before. >>> 20: throw new RuntimeException("Interrupted thread tried to block in registration of " + resource + " in " + this); >> >> The use of `this` in the exception relies on naming the context and the `toString()` method for easy identification. Since you've removed these it will show only class type rather than Global Context/custom name/JDK resource priority. > > I am not sure if this was missed - you've created named context in the tests (so the message in there looks right) but this still does not work well for any contexts regularly used (global/core priority contexts). The name complicates the interface and from the stack trace the context will be completely evident. 
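The trade-off debated in this thread can be sketched as follows. Both classes are hypothetical stand-ins invented for illustration, not the CRaC context classes; only the message format is taken from the quoted `BlockingOrderedContext` line.

```java
public class ContextNamingSketch {

    // Hypothetical context without a name: falls back to Object.toString(),
    // which prints the class name and an identity hash.
    static class UnnamedContext {
    }

    // Hypothetical context that carries a human-readable name.
    static class NamedContext {
        private final String name;

        NamedContext(String name) {
            this.name = name;
        }

        @Override
        public String toString() {
            return name;
        }
    }

    // Same message shape as in the quoted exception.
    static String message(Object resource, Object context) {
        return "Interrupted thread tried to block in registration of " + resource + " in " + context;
    }

    public static void main(String[] args) {
        System.out.println(message("myResource", new UnnamedContext()));
        System.out.println(message("myResource", new NamedContext("Global Context")));
    }
}
```

Without a name the message identifies only the class of the context, so the reader has to rely on the stack trace to tell which instance was involved; with a name the message is self-describing at the cost of an extra constructor parameter.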
------------- PR Review Comment: https://git.openjdk.org/crac/pull/74#discussion_r1211321284 From akozlov at openjdk.org Wed May 31 09:24:31 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 31 May 2023 09:24:31 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v24] In-Reply-To: References: Message-ID: On Tue, 30 May 2023 17:02:57 GMT, Jan Kratochvil wrote: >> Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. >> >> 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. >> 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC >> >> (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: >> >>> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine >> >> It is intended to specify the lowest common denominator of all CPUs in a farm. 
Instead of a possible crash of OpenJDK it will now refuse to run: >> >>> Error occurred during initialization of VM >>> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi >> >> (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. >> >> If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfo... > > Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: > > +-XX:CPUFeatures=ignore src/hotspot/cpu/x86/vm_version_x86.cpp line 648: > 646: if (strcmp(ccstr, "ignore") == 0) { > 647: return _features; > 648: } Should not be here `ignore_glibc_not_using = true` ? 
I cannot see a place where ignore_glibc_not_using is set to true, could you point to it? src/hotspot/cpu/x86/vm_version_x86.cpp line 1319: > 1317: if (ShowCPUFeatures) { > 1318: if (ignore_glibc_not_using) { > 1319: tty->print_cr("CPU features are being kept intact as requested by -XX:CPUFeatures=ignore"); Could you check the whitespace error https://github.com/openjdk/crac/pull/41/checks?check_run_id=13867663789 ------------- PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1211337287 PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1211335374 From akozlov at openjdk.org Wed May 31 09:45:24 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 31 May 2023 09:45:24 GMT Subject: [crac] RFR: Prevent concurrent cleanup by cleaner thread and checkpoint notifications In-Reply-To: References: Message-ID: On Wed, 31 May 2023 08:08:13 GMT, Radim Vansa wrote: >> I mean, once we synchronized cleaner thread from picking up PhantomCleanableRefs, there is apparently no reason to go to the next priority and discover cleanups through registrations in the CRaC -- instead, all cleanups are discoverable through the list maintained by the cleaner. It seems we can go through the list right from this Resource and run cleanups. > > Alright, you mean that PCRs won't be Resources anymore and we'll clean them up whether are these already in the `queue` or not, without waiting for the GC to move them into that. Something like that, but the interaction between RefQueue and the list maintained by the cleaner should be reviewed to check whether it is safe and correct to do that. 
------------- PR Review Comment: https://git.openjdk.org/crac/pull/73#discussion_r1211383901 From akozlov at openjdk.org Wed May 31 09:50:31 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 31 May 2023 09:50:31 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v24] In-Reply-To: References: Message-ID: <8xx4BgFoTXFFUZJvB4Rgg6UrF9hJDt-sCUrE_Nz8cRc=.895586ba-1428-4672-9dd7-ebaffd8b6146@github.com> On Tue, 30 May 2023 17:02:57 GMT, Jan Kratochvil wrote: >> Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. >> >> 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. >> 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC >> >> (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: >> >>> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine >> >> It is intended to specify the lowest common denominator of all CPUs in a farm. 
Instead of a possible crash of OpenJDK it will now refuse to run: >> >>> Error occurred during initialization of VM >>> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi >> >> (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. >> >> If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfo... > > Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: > > +-XX:CPUFeatures=ignore src/hotspot/os/linux/os_linux.cpp line 5962: > 5960: initialize_processor_count(); > 5961: if (_cpu_to_node != NULL) > 5962: rebuild_cpu_to_node_map(); It seems the only place where number of processors is updated, so it's not clear how safe the operation. 
Please add a debug option for processor count update, I propose that to be disabled by default. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1211390026 From akozlov at openjdk.org Wed May 31 11:06:28 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 31 May 2023 11:06:28 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v22] In-Reply-To: References: Message-ID: On Tue, 30 May 2023 17:01:43 GMT, Jan Kratochvil wrote: > I have remembered that was intentional for more safety of what jdk vs. glibc detect from CPU. Could you elaborate a bit about the safety? In JDK we should have the same algorithm for determing glibc features. Do you mean the situation when JDK and GLIBC feature detection have diverged, e.g. due to update in the glibc? ------------- PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1211531331 From akozlov at openjdk.org Wed May 31 11:11:30 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 31 May 2023 11:11:30 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v24] In-Reply-To: References: Message-ID: On Tue, 30 May 2023 17:02:57 GMT, Jan Kratochvil wrote: >> Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. >> >> 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. >> 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC >> >> (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. 
Therefore there had to be implemented a new OpenJDK option: >> >>> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine >> >> It is intended to specify the lowest common denominator of all CPUs in a farm. Instead of a possible crash of OpenJDK it will now refuse to run: >> >>> Error occurred during initialization of VM >>> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi >> >> (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. >> >> If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfo... 
> > Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: > > +-XX:CPUFeatures=ignore src/hotspot/cpu/x86/vm_version_x86.cpp line 1335: > 1333: glibc_not_using( features_expected & ~ _features, > 1334: glibc_features_expected & ~_glibc_features); > 1335: } I'm trying the code, and getting anton at mercury:~/proj/crac$ ./build/linux-x86_64-server-release/images/jdk/bin/java -XX:+ShowCPUFeatures -XX:CPUFeatures=generic -version This machine's CPU features are: -XX:CPUFeatures=0x61805fdfbff,0x1e6 CPU features being used are: -XX:CPUFeatures=0x200000080d7,0x0 openjdk version "17-internal" 2021-09-14 OpenJDK Runtime Environment (build 17-internal+0-adhoc..crac) OpenJDK 64-Bit Server VM (build 17-internal+0-adhoc..crac, mixed mode, sharing) I would expect a re-exec, but for some reason it was not performed? src/hotspot/os_cpu/linux_x86/os_linux_x86.cpp line 224: > 222: stub = VM_Version::cpuinfo_cont_addr(); > 223: } > 224: } else Why do we need this block? Isn't this duplicating the logic from line 251? ------------- PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1211538964 PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1211536242 From akozlov at openjdk.org Wed May 31 11:55:24 2023 From: akozlov at openjdk.org (Anton Kozlov) Date: Wed, 31 May 2023 11:55:24 GMT Subject: [crac] RFR: Prevent concurrent cleanup by cleaner thread and checkpoint notifications In-Reply-To: <-RFlB7ltnNOwaEW8_tYPvxmM5zEfiKwE_4kR5nXN_S4=.9fb8e3df-b478-4446-a2b8-ccc59dbb27cb@github.com> References: <82OoQQrh1BZ_epMMmT9P-qabiJtcHnyoBeYD8sEAiDo=.84f640c4-8eb7-46b5-ad86-0c6d3b8ddd38@github.com> <-RFlB7ltnNOwaEW8_tYPvxmM5zEfiKwE_4kR5nXN_S4=.9fb8e3df-b478-4446-a2b8-ccc59dbb27cb@github.com> Message-ID: On Wed, 31 May 2023 07:54:59 GMT, Radim Vansa wrote: >> OK, thanks. But what if the interrupt arrives to the thread during a cleaning while that runs, it may interrupt the cleaning, right? 
> > That's correct; I don't see any way to mitigate that, though, besides changing the queue implementation to allow more targeted wakeup. Hmm, this is a problem. ReferenceQueue could get a JDK-internal interface via SharedSecrets to interrupt the wait, or an interface to remove entries without any wait. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/73#discussion_r1211586523 From jkratochvil at openjdk.org Wed May 31 13:21:33 2023 From: jkratochvil at openjdk.org (Jan Kratochvil) Date: Wed, 31 May 2023 13:21:33 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v22] In-Reply-To: References: Message-ID: <7TlK6MqAjjFLvF0KZ1Vvg5ptoVLRieT-wIir34gIs1I=.3c6f5640-38d6-4525-965a-a9df377eb4ea@github.com> On Wed, 31 May 2023 11:03:44 GMT, Anton Kozlov wrote: >> I have remembered that was intentional for more safety of what jdk vs. glibc detect from CPU. I agree it is too expensive to do a jdk re-exec each time. The behavior you have seen can be now achieved by `-XX:CPUFeatures=verify`. Usefulness of this new option may not be too big. >> >> To disable everything around the CPU features is now possible by `-XX:CPUFeatures=ignore`. I agree it may be useful if the glibc interface for these CPU features gets somehow incompatible. > >> I have remembered that was intentional for more safety of what jdk vs. glibc detect from CPU. > > Could you elaborate a bit about the safety? In JDK we should have the same algorithm for determining glibc features. Do you mean the situation when JDK and GLIBC feature detection have diverged, e.g. due to an update in glibc? I have removed the option `-XX:CPUFeatures=verify`. It had some effect only on old glibcs and even there it only checked jdk against jdk's own logic, not against glibc. 
Diversion of JDK and GLIBC feature detection is tested by default but only if either `CPU_FEATURE_ACTIVE` or `ld.so --list-diagnostics` are available ------------- PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1211706577 From jkratochvil at openjdk.org Wed May 31 13:40:29 2023 From: jkratochvil at openjdk.org (Jan Kratochvil) Date: Wed, 31 May 2023 13:40:29 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v25] In-Reply-To: References: Message-ID: <6zX1-VRXs-7o6ap7LxpyFKuSeNMAqOKSrhUnyTZJj88=.30f86973-b567-4f2b-af13-f87f455bad69@github.com> > Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. > > 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. > 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC > > (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: > >> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine > > It is intended to specify the lowest common denominator of all CPUs in a farm. 
Instead of a possible crash of OpenJDK it will now refuse to run: > >> Error occurred during initialization of VM >> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi > > (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. > > If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there... Jan Kratochvil has updated the pull request incrementally with two additional commits since the last revision: - Fix missing "ignore_glibc_not_using = true". - bugreported by Anton Kozlov - Remove -XX:CPUFeatures=verify as not useful. Fix !INCLUDE_CPU_FEATURE_ACTIVE && !INCLUDE_LD_SO_LIST_DIAGNOSTICS. 
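The numbers in the error message quoted in the description above appear consistent with a plain bitmask difference: the "missing features" are the bits set in the checkpoint image's mask but clear in the restoring machine's mask, the same `expected & ~actual` shape that shows up in the reviewed `glibc_not_using(...)` call. The following is only a sketch of that arithmetic; the class and method names are hypothetical, and it handles a single mask, whereas the `-XX:+ShowCPUFeatures` output in this thread prints two comma-separated values.

```java
public class MissingCpuFeaturesSketch {

    // Bits required by the checkpoint image that the restoring machine lacks.
    static long missingFeatures(long imageFeatures, long machineFeatures) {
        return imageFeatures & ~machineFeatures;
    }

    // Restore is only safe (in this simplified model) when nothing is missing.
    static boolean canRestore(long imageFeatures, long machineFeatures) {
        return missingFeatures(imageFeatures, machineFeatures) == 0;
    }

    public static void main(String[] args) {
        long image = 0x7fff9dfcfbf7L;   // "file contains CPU features" from the quoted message
        long machine = 0x421801fcfbd7L; // "this machine's CPU features" from the quoted message
        System.out.println("missing: 0x" + Long.toHexString(missingFeatures(image, machine)));
        System.out.println("can restore: " + canRestore(image, machine));
        // Checkpointing with the machine's own mask leaves nothing missing.
        System.out.println("can restore with lowest common mask: " + canRestore(machine, machine));
    }
}
```

Under this model, picking `-XX:CPUFeatures` as the bitwise AND of the masks of all machines in the farm would give the "lowest common denominator" the description refers to.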
------------- Changes: - all: https://git.openjdk.org/crac/pull/41/files - new: https://git.openjdk.org/crac/pull/41/files/aabbcd79..6a8a2b8e Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=24 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=23-24 Stats: 32 lines in 2 files changed: 6 ins; 22 del; 4 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41 From jkratochvil at openjdk.org Wed May 31 13:40:30 2023 From: jkratochvil at openjdk.org (Jan Kratochvil) Date: Wed, 31 May 2023 13:40:30 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v24] In-Reply-To: References: Message-ID: <0qBsXwpqjQplYjGq7gq6KE9vPW7KYy_lHJWeITPH4oY=.f1dc4f29-a1fb-43d2-9bd1-949f206501ad@github.com> On Wed, 31 May 2023 09:07:48 GMT, Anton Kozlov wrote: >> Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: >> >> +-XX:CPUFeatures=ignore > > src/hotspot/cpu/x86/vm_version_x86.cpp line 648: > >> 646: if (strcmp(ccstr, "ignore") == 0) { >> 647: return _features; >> 648: } > > Should not be here `ignore_glibc_not_using = true` ? I cannot see a place where ignore_glibc_not_using is set to true, could you point? I agree, fixed. > src/hotspot/cpu/x86/vm_version_x86.cpp line 1319: > >> 1317: if (ShowCPUFeatures) { >> 1318: if (ignore_glibc_not_using) { >> 1319: tty->print_cr("CPU features are being kept intact as requested by -XX:CPUFeatures=ignore"); > > Could you check whitespace error https://github.com/openjdk/crac/pull/41/checks?check_run_id=13867663789 Sorry, I have provided some local reconfigurations so that it hopefully should not happen again. 
> src/hotspot/cpu/x86/vm_version_x86.cpp line 1335: > >> 1333: glibc_not_using( features_expected & ~ _features, >> 1334: glibc_features_expected & ~_glibc_features); >> 1335: } > > I'm trying the code, and getting > > > anton at mercury:~/proj/crac$ ./build/linux-x86_64-server-release/images/jdk/bin/java -XX:+ShowCPUFeatures -XX:CPUFeatures=generic -version > This machine's CPU features are: -XX:CPUFeatures=0x61805fdfbff,0x1e6 > CPU features being used are: -XX:CPUFeatures=0x200000080d7,0x0 > openjdk version "17-internal" 2021-09-14 > OpenJDK Runtime Environment (build 17-internal+0-adhoc..crac) > OpenJDK 64-Bit Server VM (build 17-internal+0-adhoc..crac, mixed mode, sharing) > > > I would expect re-exec, but for some reason it was not perfromed? That was a regression/bug, thanks, fixed. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1211725314 PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1211722499 PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1211726817 From jkratochvil at openjdk.org Wed May 31 13:59:42 2023 From: jkratochvil at openjdk.org (Jan Kratochvil) Date: Wed, 31 May 2023 13:59:42 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v26] In-Reply-To: References: Message-ID: <-6-8R_QvguI5VIHHN-qivBe54RmpObCNru0-teJTBA0=.ce5d8fec-764d-4c11-bfde-ea4ca9f0a79a@github.com> > Currently if you `-XX:CRaCCheckpointTo` on a better CPU and `-XX:CRaCRestoreFrom` on a worse CPU the restored OpenJDK will crash. > > 1. An obvious reason is that JIT-compiled code is using CPU features not implemented on the CPU where the image is restored. > 2. A second reason is that glibc has a similar problem, its PLT entries point to CPU optimized functions also crashing on the worse CPU. https://sourceware.org/glibc/wiki/GNU_IFUNC > > (1) could be solved somehow automatically by deoptimizing and re-JITing all the JIT code. 
But that would defeat the performance goal of restoring a ready image in the first place. Therefore there had to be implemented a new OpenJDK option: > >> use -XX:CPUFeatures=0xnumber with -XX:CRaCCheckpointTo when you get an error during -XX:CRaCRestoreFrom on a different machine > > It is intended to specify the lowest common denominator of all CPUs in a farm. Instead of a possible crash of OpenJDK it will now refuse to run: > >> Error occurred during initialization of VM >> You have to specify -XX:CPUFeatures=0x421801fcfbd7 during -XX:CRaCCheckpointTo making of the checkpoint; specified -XX:CRaCRestoreFrom file contains CPU features 0x7fff9dfcfbf7; this machine's CPU features are 0x421801fcfbd7; missing features of this CPU are 0x3de79c000020 = 3dnowpref, adx, avx512f, avx512dq, avx512cd, avx512bw, avx512vl, sha, avx512_vpopcntdq, avx512_vpclmulqdq, avx512_vaes, avx512_vnni, clflushopt, clwb, avx512_vbmi2, avx512_vbmi > > (2) has been implemented according to Anton Kozlov's idea that glibc can just reset its IFUNC PLT entries any time later (after restore), not just during the first initialization of glibc. That has currently a problem that it has turned out to be very invasive into private glibc structures. It could work somehow with glibc debuginfo (*-debuginfo.rpm or *-dbg.deb) installed but that has been considered as unacceptable requirement just to run CRaC. Therefore I have provided this proof of concept while I will propose such feature for glibc upstream where it is sure easily implementable. > > If upstream glibc maintainers do not like the IFUNC reset idea then I do not think this hacky IFUNC reset patching many glibc internal data structures is a good way forward for a 3rd party implementation like CRaC/OpenJDK. In such case I believe one should switch to using GLIBC_TUNABLES environment variable, re-execing OpenJDK after converting the `-XX:CPUFeatures` OpenJDK format into glibc GLIBC_TUNABLES format. Unfortunately there... 
Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: Remove leftover code. ------------- Changes: - all: https://git.openjdk.org/crac/pull/41/files - new: https://git.openjdk.org/crac/pull/41/files/6a8a2b8e..28a7b591 Webrevs: - full: https://webrevs.openjdk.org/?repo=crac&pr=41&range=25 - incr: https://webrevs.openjdk.org/?repo=crac&pr=41&range=24-25 Stats: 7 lines in 1 file changed: 0 ins; 7 del; 0 mod Patch: https://git.openjdk.org/crac/pull/41.diff Fetch: git fetch https://git.openjdk.org/crac.git pull/41/head:pull/41 PR: https://git.openjdk.org/crac/pull/41 From jkratochvil at openjdk.org Wed May 31 13:59:46 2023 From: jkratochvil at openjdk.org (Jan Kratochvil) Date: Wed, 31 May 2023 13:59:46 GMT Subject: [crac] RFR: RFC: -XX:CPUFeatures=0xnumber for CPU migration [v24] In-Reply-To: References: Message-ID: On Wed, 31 May 2023 11:06:25 GMT, Anton Kozlov wrote: >> Jan Kratochvil has updated the pull request incrementally with one additional commit since the last revision: >> >> +-XX:CPUFeatures=ignore > > src/hotspot/os_cpu/linux_x86/os_linux_x86.cpp line 224: > >> 222: stub = VM_Version::cpuinfo_cont_addr(); >> 223: } >> 224: } else > > Why do we need this block? Is not this duplicating logic from line 251? It was needed for some previous variant of the patch and it is a leftover. Removed. Sorry for not self-reviewing this version of my patch. ------------- PR Review Comment: https://git.openjdk.org/crac/pull/41#discussion_r1211761221