<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Apr 5, 2023 at 12:01 PM Radim Vansa <<a href="mailto:rvansa@azul.com">rvansa@azul.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Lot of interesting ideas, comments inline....<br>
<br>
On 05. 04. 23 16:28, Dan Heidinga wrote:<br>
> Hi Radim,<br>
><br>
> Thanks for the write up of the various options in this space.<br>
><br>
> On Tue, Apr 4, 2023 at 2:49 AM Radim Vansa <<a href="mailto:rvansa@azul.com" target="_blank">rvansa@azul.com</a><mailto:<a href="mailto:rvansa@azul.com" target="_blank">rvansa@azul.com</a>>> wrote:<br>
> Hi Christian,<br>
><br>
> I believe this is a common problem when porting an existing architecture<br>
> to CRaC; the obvious solution is to guard access to the resource<br>
> (ProcessorContext in this case) with a RW lock that is read-acquired<br>
> by 'regular' access and write-acquired in beforeCheckpoint/released<br>
> in afterRestore. However, this introduces extra synchronization (at least<br>
> in the form of volatile writes) even when C/R is not used at all,<br>
> especially if the support is added to libraries.<br>
><br>
> I've seen variations of this approach go by in code reviews but have we written up a good example of how to do this well? Having a canonical pattern would help to highlight the best way to do it today and make the tradeoffs explicit.<br>
<br>
TBH I am rather shy to demo a solution that's quite imperfect unless we <br>
find that there's no way around that.<br></blockquote><div><br></div><div>That's fair. Sometimes it's easier for people to start from something and modify it to be a better pattern than to have to generate it from scratch. </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
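</blockquote><div><br></div><div>To make the starting point concrete, this is roughly the shape I have in mind - a minimal, untested sketch against the jdk.crac API, with a made-up ProcessorContext (and its init()/close()) standing in for the guarded resource, and assuming beforeCheckpoint/afterRestore run on the same C/R thread (the write lock must be released by its owner):</div><div><pre>import jdk.crac.Context;
import jdk.crac.Core;
import jdk.crac.Resource;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Consumer;

// Regular callers take the read lock; checkpoint/restore takes the write lock,
// so beforeCheckpoint blocks until no caller is using the context.
class GuardedContext implements Resource {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private ProcessorContext ctx = ProcessorContext.init();   // hypothetical resource

    GuardedContext() {
        Core.getGlobalContext().register(this);
    }

    void process(Consumer&lt;ProcessorContext&gt; work) {
        lock.readLock().lock();        // the cost paid on every call, even with no C/R in sight
        try {
            work.accept(ctx);
        } finally {
            lock.readLock().unlock();
        }
    }

    @Override
    public void beforeCheckpoint(Context&lt;? extends Resource&gt; context) throws Exception {
        lock.writeLock().lock();       // waits until all readers have left
        ctx.close();                   // deinit before the image is taken
    }

    @Override
    public void afterRestore(Context&lt;? extends Resource&gt; context) throws Exception {
        ctx = ProcessorContext.init(); // re-init before readers resume
        lock.writeLock().unlock();
    }
}
</pre></div><div>The read-lock acquire/release in process() is exactly the per-call overhead you're pointing at, which is why it isn't a great default for libraries.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">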
<br>
<br>
><br>
><br>
> Anton Kozlov proposed techniques like RCU [1], but at this point there's<br>
> no support for this in Java. Even the Linux implementation might require<br>
> some additional properties from the code in the critical (read) section,<br>
> such as not calling any blocking code; this might be too limiting.<br>
><br>
> The situation is simpler if the application uses a single-threaded<br>
> event loop; beforeCheckpoint can enqueue a task that would, upon its<br>
> execution, block on a primitive and notify the C/R notification thread<br>
> that it may now deinit the resource; in afterRestore the resource is<br>
> initialized and the event loop is unblocked. This way we don't impose any<br>
> extra overhead unless C/R is actually happening.<br>
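> Roughly, as an untested sketch (ProcessorContext and its init()/close() are made<br>
> up, borrowed from Christian's scenario):<br>
><br>
> import jdk.crac.*;<br>
> import java.util.concurrent.CountDownLatch;<br>
> import java.util.concurrent.ExecutorService;<br>
><br>
> class ProcessorContextResource implements Resource {<br>
>     private final ExecutorService eventLoop;   // the single-threaded processor loop<br>
>     private ProcessorContext ctx;              // only used while the loop is parked or from loop tasks<br>
>     private volatile CountDownLatch restored;<br>
><br>
>     ProcessorContextResource(ExecutorService eventLoop, ProcessorContext ctx) {<br>
>         this.eventLoop = eventLoop;<br>
>         this.ctx = ctx;<br>
>         Core.getGlobalContext().register(this);<br>
>     }<br>
><br>
>     @Override<br>
>     public void beforeCheckpoint(Context&lt;? extends Resource&gt; context) throws Exception {<br>
>         CountDownLatch parked = new CountDownLatch(1);<br>
>         restored = new CountDownLatch(1);<br>
>         eventLoop.submit(() -> {<br>
>             parked.countDown();                // the loop reached this task; ctx is idle<br>
>             try { restored.await(); } catch (InterruptedException ignored) { }<br>
>         });<br>
>         parked.await();                        // block the C/R thread until the loop is parked<br>
>         ctx.close();                           // now safe to deinit<br>
>     }<br>
><br>
>     @Override<br>
>     public void afterRestore(Context&lt;? extends Resource&gt; context) throws Exception {<br>
>         ctx = ProcessorContext.init();         // re-init before the loop resumes<br>
>         restored.countDown();                  // unblock the event loop<br>
>     }<br>
> }<br>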
><br>
> That's a nice idea!<br>
><br>
><br>
> To avoid extra synchronization it would technically be possible to<br>
> modify the CRaC implementation to keep all other threads frozen during<br>
> restore. There's a risk of some form of deadlock if the thread<br>
> performing C/R required other threads to make progress, though, so any<br>
> such solution would require extra thought. Besides, this does not<br>
> guarantee exclusivity, so afterRestore would need to restore the<br>
> resource to *exactly* the same state (as some of its before-checkpoint<br>
> state might have leaked to the thread in the Processor). In my opinion this<br>
> is not the best way.<br>
><br>
> This is the approach that OpenJ9 took to solve the consistency problems introduced by updating resources before / after checkpoints. OpenJ9 enters "single threaded mode" when creating the checkpoint and executing the before-checkpoint fixups. On restore, it continues in single-threaded mode while executing the after-checkpoint fixups. This makes it easier to avoid additional runtime costs related to per-resource locking for checkpoints, but complicates locking and wait/notify in general.<br>
><br>
> This means a checkpoint hook operation can't wait on another thread (would block indefinitely as other threads are paused), can't wait on a lock being held by another thread (again, would deadlock), and sending notify may result in inconsistent behaviour (wrong number of notifies received by other threads). See "The checkpointJVM() API" section of their blog post on CRIU for more details [0].<br>
<br>
Great post, I should probably go through the whole blog. I think that <br>
single-threaded mode is conceptually simple to reason about and, with <br>
the @NotCheckpointSafe annotation, deals well with the issue. Have you <br>
run into any edge cases where this doesn't work well? For example, I've <br>
seen a deadlock because something in beforeCheckpoint was supposed to <br>
run in the reference cleaner thread.<br></blockquote><div><br></div><div>I think we're still playing whack-a-mole with the places that may need @NotCheckpointSafe. There's no good static analysis that will indicate where the annotation is needed, so we're finding places to add it as we go. Examples include: MethodType interning - if a halted thread was interning a MethodType, and the checkpoint thread needs to resolve a MethodType, we deadlock. Also, ClassValue::initializeMap(), ClassSpecializer::findSpecies, ConcurrentHashMap::computeIfAbsent.</div><div><br></div><div>The more places we add the annotation, though, the harder it will be to find a "safe" point to take the checkpoint without a livelock. It's a mostly working band-aid rather than a general solution, but it still might be good enough to make most applications work.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
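</blockquote><div><br></div><div>To make the failure shape concrete, the typical hang under single-threaded mode looks like this (a contrived sketch written against the jdk.crac Resource API for illustration; the shared executor is made up): the before-checkpoint hook hands work to a pooled thread and waits for it, but that thread is halted.</div><div><pre>import jdk.crac.Context;
import jdk.crac.Resource;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;

// Contrived: the hook waits for work that only another - now halted - thread can do.
class HangingHook implements Resource {
    private final ExecutorService sharedExecutor;   // application-wide pool

    HangingHook(ExecutorService sharedExecutor) {
        this.sharedExecutor = sharedExecutor;
    }

    @Override
    public void beforeCheckpoint(Context&lt;? extends Resource&gt; context) throws Exception {
        CountDownLatch done = new CountDownLatch(1);
        sharedExecutor.submit(done::countDown);     // never runs while the other threads are halted
        done.await();                               // hangs here; the checkpoint never completes
    }

    @Override
    public void afterRestore(Context&lt;? extends Resource&gt; context) { }
}
</pre></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">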
<br>
Also, did you run any tests to see the performance impact of your <br>
changes to wait/notify? Do you also have to tweak higher-level <br>
synchronization such as java.util.concurrent.*?<br></blockquote><div><br></div><div><br></div><div>The OpenJ9 locking code is quite different, so results may vary. I don't recall any measurable impact due to the wait/notify changes, as they already tend to be slow paths.</div><div><br></div><div>I'm not aware of any changes to the j.u.c. synchronization classes. As I said, it's whack-a-mole, and I don't think this "mole" has poked its head up yet.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
<br>
><br>
><br>
> The problem with RCU is tracking which threads are in the critical<br>
> section. I've found RCU-like implementations for Java that avoid<br>
> excessive overhead using a spread-out array - each thread marks<br>
> entering/leaving the critical section by writing to its own counter,<br>
> preventing cache ping-pong (assuming no false sharing). The synchronizer<br>
> thread uses another flag to request synchronization; reading this flag in<br>
> each thread is not totally free but reasonably cheap, and when it is set<br>
> the worker threads can enter a blocking slow path. The simple<br>
> implementation assumes a fixed number of threads; if the list of threads<br>
> is dynamic the solution would probably be more complicated. It might<br>
> also make sense to implement this in native code with per-CPU<br>
> counters, rather than per-thread. A downside, besides some overhead in<br>
> terms of both cycles and memory usage, is that we'd need to modify the<br>
> code and explicitly mark the critical sections.<br>
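> Something along these lines (an untested sketch; fixed thread count, each worker<br>
> is assigned its own index, and the slow path blocks on a plain monitor):<br>
><br>
> import java.util.concurrent.atomic.AtomicLongArray;<br>
><br>
> final class ReadIndicator {<br>
>     private static final int STRIDE = 16;       // spread slots across cache lines<br>
>     private final AtomicLongArray slots;        // odd value = thread is inside the critical section<br>
>     private volatile boolean syncRequested;<br>
>     private final Object gate = new Object();<br>
><br>
>     ReadIndicator(int workerThreads) {<br>
>         slots = new AtomicLongArray(workerThreads * STRIDE);<br>
>     }<br>
><br>
>     void enter(int thread) {<br>
>         for (;;) {<br>
>             slots.incrementAndGet(thread * STRIDE);   // optimistically mark "inside"<br>
>             if (!syncRequested) return;               // fast path: one local RMW + one volatile read<br>
>             slots.incrementAndGet(thread * STRIDE);   // back out and take the slow path<br>
>             awaitRestore();<br>
>         }<br>
>     }<br>
><br>
>     void exit(int thread) {<br>
>         slots.incrementAndGet(thread * STRIDE);       // back to "outside"<br>
>     }<br>
><br>
>     void quiesce() {                                  // called from beforeCheckpoint<br>
>         syncRequested = true;<br>
>         for (int i = 0; i &lt; slots.length(); i += STRIDE) {<br>
>             while ((slots.get(i) &amp; 1L) == 1L) Thread.onSpinWait();   // wait until the reader leaves<br>
>         }<br>
>     }<br>
><br>
>     void resume() {                                   // called from afterRestore<br>
>         synchronized (gate) { syncRequested = false; gate.notifyAll(); }<br>
>     }<br>
><br>
>     private void awaitRestore() {<br>
>         boolean interrupted = false;<br>
>         synchronized (gate) {<br>
>             while (syncRequested) {<br>
>                 try { gate.wait(); } catch (InterruptedException e) { interrupted = true; }<br>
>             }<br>
>         }<br>
>         if (interrupted) Thread.currentThread().interrupt();<br>
>     }<br>
> }<br>
><br>
> (There is the usual subtlety that a reader may increment its slot just as the flag<br>
> is being set; the back-out in enter() handles that.)<br>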
><br>
> Another solution could try to leverage the existing JVM mechanisms for code<br>
> deoptimization, replacing the critical sections with a slower, blocking<br>
> stub, and reverting after restore. Or even independently requesting<br>
> a safepoint and inspecting the stacks of threads until the synchronization<br>
> is possible.<br>
><br>
> This will have a high risk of livelock. The OpenJ9 experience implementing single-threaded mode for CRIU indicates there are a lot of strange locking patterns in the world.<br>
<br>
There are weird patterns, but more fine-grained exclusivity should be <br>
more permissive than single-threaded mode, which works as an implicit Big <br>
Fat Lock. Yes, any problems are probably much easier to reproduce with <br>
that one.<br>
<br>
><br>
><br>
> So I probably can't offer a ready-to-use performant solution; pick your<br>
> poison. The future, though, offers a few possibilities, and I'd love to<br>
> hear others' opinions about which one looks the most feasible.<br>
> Unless we offer something that does not harm the no-CRaC use case,<br>
> I am afraid that adoption will be quite limited.<br>
><br>
> Successful solutions will push the costs into the checkpoint / restore paths as much as possible. Going back to the explicit lock mechanism you first mentioned, I wonder if there's a role for java.lang.invoke.SwitchPoint [1] here? SwitchPoint was added as a tool for language implementers that wanted to be able to speculate on a particular condition (i.e., CHA assumptions) and get the same kind of low-cost state change that existing JITted code gets. I'm not sure how well that vision worked out in practice or how well HotSpot optimizes it yet, but this might be a reason to push on its performance.<br>
><br>
> Roughly the idea would be to add a couple of SwitchPoints to jdk.crac.Core:<br>
><br>
> public SwitchPoint getBeforeSwitchPoint();<br>
> public SwitchPoint getAfterSwitchPoint();<br>
><br>
> and users could then write their code using MethodHandles to implement the branching logic:<br>
><br>
> MethodHandle normalPath = ...... // existing code<br>
> MethodHandle fallbackPath = ..... // before-checkpoint extra work<br>
> MethodHandle guarded = getBeforeSwitchPoint().guardWithTest(normalPath, fallbackPath);<br>
><br>
> and the jdk.crac.Core class would invalidate the "before" SwitchPoint prior to the checkpoint and the "after" one after the restore. Aside from the painful programming model, this might give us the tools we need to make it performant.<br>
><br>
> Needs more exploration and prototyping but would provide a potential path to reasonable performance by burying the extra locking in the fallback paths. And it would be a single pattern to optimize, rather than all the variations users could produce.<br>
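> For completeness, the checkpoint side would be roughly (a sketch; beforeSwitchPoint /<br>
> afterSwitchPoint are hypothetical fields, and since a SwitchPoint cannot be reset,<br>
> fresh instances would have to be published and the guarded handles rebuilt for the<br>
> next C/R cycle):<br>
><br>
> // prior to the checkpoint:<br>
> SwitchPoint.invalidateAll(new SwitchPoint[] { beforeSwitchPoint });  // guarded code now runs fallbackPath<br>
> // ... checkpoint, later restore ...<br>
> SwitchPoint.invalidateAll(new SwitchPoint[] { afterSwitchPoint });   // flip the "after"-guarded handles<br>
> beforeSwitchPoint = new SwitchPoint();   // publish new instances for the next cycle<br>
> afterSwitchPoint = new SwitchPoint();<br>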
<br>
Well noted, I admit that I haven't heard about SwitchPoint before; I'll <br>
need some time to digest it and maybe write a JMH test to see. However, <br>
from a first look it is not something that would be too convenient for <br>
users; I would consider some form of interception introduced through <br>
annotations (yes, I come from the EE world).<br></blockquote><div><br></div><div>Not a fan of annotations in the long term. They are hard to maintain as the underlying implementation changes / is refactored. Even something as straightforward as the @CallerSensitive annotation is tough to maintain. These would be a lot harder as they would apply in many more circumstances.</div><div><br></div><div>--Dan</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
Thanks!<br>
<br>
Radim<br>
<br>
<br>
><br>
> --Dan<br>
><br>
><br>
> [0] <a href="https://blog.openj9.org/2022/10/14/openj9-criu-support-a-look-under-the-hood/" rel="noreferrer" target="_blank">https://blog.openj9.org/2022/10/14/openj9-criu-support-a-look-under-the-hood/</a><br>
> [1] <a href="https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/invoke/SwitchPoint.html" rel="noreferrer" target="_blank">https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/invoke/SwitchPoint.html</a><br>
><br>
> Cheers,<br>
><br>
> Radim<br>
><br>
> [1] <a href="https://en.wikipedia.org/wiki/Read-copy-update" rel="noreferrer" target="_blank">https://en.wikipedia.org/wiki/Read-copy-update</a><br>
><br>
> On 03. 04. 23 22:30, Christian Tzolov wrote:<br>
>> Hi, I'm testing CRaC in the context of long-running applications (e.g. streaming, continuous processing ...) and I've stumbled on an issue related to the coordination of the restored threads.<br>
>><br>
>> For example, let's have a Processor that performs continuous computations. This processor depends on a ProcessorContext, and the latter must be fully initialized before the processor can process any data.<br>
>><br>
>> When the application is first started (i.e. not from a checkpoint) it ensures that the ProcessorContext is initialized before starting the Processor loop.<br>
>><br>
>> To leverage CRaC I've implemented a ProcessorContextResource that gracefully stops the context on beforeCheckpoint and then re-initializes it on afterRestore.<br>
>><br>
>> When the checkpoint is performed, CRaC calls ProcessorContextResource.beforeCheckpoint and also preserves the current Processor call stack. On restore, the processor's call stack is restored at the point where it was stopped, as expected, but unfortunately it doesn't wait for ProcessorContextResource.afterRestore to complete. This expectedly crashes the processor.<br>
>><br>
>> The <a href="https://github.com/tzolov/crac-demo" rel="noreferrer" target="_blank">https://github.com/tzolov/crac-demo</a> illustrates this issue. The README explains how to reproduce the issue. The OUTPUT.md (<a href="https://github.com/tzolov/crac-demo/blob/main/OUTPUT.md" rel="noreferrer" target="_blank">https://github.com/tzolov/crac-demo/blob/main/OUTPUT.md</a> ) offers terminal snapshots of the observed behavior.<br>
>><br>
>> I've used the latest JDK CRaC release:<br>
>> openjdk 17-crac 2021-09-14<br>
>> OpenJDK Runtime Environment (build 17-crac+5-19)<br>
>> OpenJDK 64-Bit Server VM (build 17-crac+5-19, mixed mode, sharing)<br>
>><br>
>> As I'm new to CRaC, I'd appreciate your thoughts on this issue.<br>
>><br>
>> Cheers,<br>
>> Christian<br>
>><br>
>><br>
>><br>
>><br>
><br>
<br>
</blockquote></div></div>