<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p>Hi, comments inline...<br>
    </p>
    <div class="moz-cite-prefix">On 06. 04. 23 15:59, Christian Tzolov
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:CAGNdOEXLFeKArno2md8-wDZvVgFzy45ExXn93rMWvgVkBxNWLg@mail.gmail.com">
      <br>
      <div>
        <div dir="ltr">
          <div dir="ltr"><span id="gmail-docs-internal-guid-74c79e01-7fff-dec0-f617-b997242791e1">
              <p dir="ltr">Hi Dan and Radim,<br>
                <br>
                <br>
                Thanks for the feedback and suggestions! </p>
              <p dir="ltr">It is the first time I’m facing the
                java.lang.invoke.* API and it might take some time to
                wrap my head around it. </p>
              <p dir="ltr">So please be prepared for lame questions, such as
                those inlined below.<br>
                <br>
                <br>
                On Wed, Apr 5, 2023 at 4:28 PM Dan Heidinga <<a href="mailto:heidinga@redhat.com" moz-do-not-send="true" class="moz-txt-link-freetext">heidinga@redhat.com</a>>
                wrote:<br>
              </p>
            </span></div>
          <div class="gmail_quote">
            <blockquote class="gmail_quote">
              <div dir="ltr">
                <div>Hi Radim,</div>
                <div><br>
                </div>
                <div>Thanks for the write up of the various options in
                  this space.</div>
                <br>
                <div class="gmail_quote">
                  <div dir="ltr" class="gmail_attr">On Tue, Apr 4, 2023
                    at 2:49 AM Radim Vansa <<a href="mailto:rvansa@azul.com" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">rvansa@azul.com</a>>
                    wrote:<br>
                  </div>
                  <blockquote class="gmail_quote">
                    Hi Christian,<br>
                    <br>
                    I believe this is a common problem when porting
                    existing architecture <br>
                    under CRaC; the obvious solution is to guard access
                    to the resource <br>
                    (ProcessorContext in this case) with a RW lock
                    that'd be read-acquired <br>
                    by 'regular' access and acquired for write in
                    beforeCheckpoint/released <br>
                    in afterRestore. However this introduces extra
                    synchronization (at least <br>
                    in form of volatile writes) even in case that C/R is
                    not used at all, <br>
                    especially if the support is added into libraries.<br>
                  </blockquote>
                  <div><br>
                  </div>
                  <div>I've seen variations of this approach go by in
                    code reviews but have we written up a good example
                    of how to do this well?  Having a canonical pattern
                    would help to highlight the best way to do it today
                    and make the tradeoffs explicit.</div>
                </div>
              </div>
            </blockquote>
            <div class="gmail_quote"><br>
            </div>
            @Radim, your “guard access” suggestion made me realise that
            perhaps I’ve oversimplified my  sample. </div>
          <div class="gmail_quote"><br>
          </div>
          <div class="gmail_quote">So I’ve modified it a bit: <a href="https://github.com/tzolov/crac-demo/blob/main/src/main/java/com/example/crac/CrackDemoExt.java" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/tzolov/crac-demo/blob/main/src/main/java/com/example/crac/CrackDemoExt.java</a> 
            by introducing a new ProcessorState used by the Processor
            for its computation. </div>
          <div class="gmail_quote">At the same time I’ve removed the
            direct Processor dependency on the ProcessorContext. Instead
            the ProcessorContext is responsible for managing the
            lifecycle of the ProcessorState before the Processor can use
            it.<br>
            Then, given your original suggestion, is it right to assume
            that the “guard access to the resource” should now guard the
            ProcessorState, not the ProcessorContext? </div>
          <div class="gmail_quote">And if this is true, then how would
            one identify all the possible “resources” to be
            guarded?
            <br>
          </div>
        </div>
      </div>
    </blockquote>
    <p><br>
    </p>
    <p>The separation between Context and State seems a bit
      artificial, but anyway... The Context here would hold a RW lock
      that is write-locked in the constructor. At the end of the start()
      method it would release the write lock, and at the beginning of
      stop() it would acquire it again. In your case the Processor uses
      the state directly rather than through the Context - that leaves
      you no place to put the read lock. Instead, access should be
      delegated through the Context, which would take the read lock
      before useState() and release it afterwards.</p>
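    <p>Roughly like this - a minimal sketch, not taken from the demo
      repository. ProcessorState.compute(), the StateAction callback
      and the constant 40 are made up for illustration, and the sketch
      assumes that start() runs on the thread that took the write lock
      (the constructor's thread at first, the thread that called stop()
      later), because ReentrantReadWriteLock only lets the owner
      release it.</p>
<pre>
```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for the state the Processor computes with. */
final class ProcessorState {
    private final int base;
    ProcessorState(int base) { this.base = base; }
    int compute(int x) { return base + x; }
}

/** Hypothetical callback type, so the Processor never holds the state itself. */
interface StateAction {
    int apply(ProcessorState state);
}

final class ProcessorContext {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private ProcessorState state; // guarded by lock

    ProcessorContext() {
        // Write-locked from construction: readers block until start() completes.
        lock.writeLock().lock();
    }

    /** Initial start, and again from afterRestore. Must run on the thread holding the write lock. */
    void start() {
        state = new ProcessorState(40);
        lock.writeLock().unlock();
    }

    /** From beforeCheckpoint: waits for in-flight readers, then tears the state down. */
    void stop() {
        lock.writeLock().lock();
        state = null;
    }

    /** The only way for the Processor to reach the state. */
    int useState(StateAction action) {
        lock.readLock().lock();
        try {
            return action.apply(state);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```
</pre>
    <p>The Processor loop would then call something like
      context.useState(s -> s.compute(input)) instead of keeping its
      own reference to the state; a call that arrives between stop()
      and the next start() simply blocks on the read lock.</p>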
    <p><br>
    </p>
    <blockquote type="cite" cite="mid:CAGNdOEXLFeKArno2md8-wDZvVgFzy45ExXn93rMWvgVkBxNWLg@mail.gmail.com">
      <div>
        <div dir="ltr">
          <div class="gmail_quote">
             </div>
          <div class="gmail_quote">
            <blockquote class="gmail_quote">
              <div dir="ltr">
                <div class="gmail_quote">
                  <blockquote class="gmail_quote">
                    Anton Kozlov proposed techniques like RCU [1] but at
                    this point there's <br>
                    no support for this in Java. Even the Linux
                    implementation might require <br>
                    some additional properties from the code in critical
                    (read) section like <br>
                    not calling any blocking code; this might be too
                    limiting.<br>
                    <br>
                    The situation is simpler if the application uses a
                    single-threaded <br>
                    event loop; beforeCheckpoint can enqueue a task that
                    would, upon its <br>
                    execution, block on a primitive and notify the C/R
                    notification thread <br>
                    that it may now deinit the resource; in afterRestore
                    the resource is <br>
                    initialized and the event loop is unblocked. This way
                    we don't impose any <br>
                    extra overhead unless C/R is actually happening.<br>
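                    <br>
                    A minimal sketch of this event-loop hand-off,
                    assuming a single-threaded executor as the event
                    loop. The class name, the String "resource" and the
                    plain beforeCheckpoint/afterRestore methods are
                    illustrative; in a CRaC application they would be
                    driven from a registered Resource.<br>
<pre>
```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class EventLoopGate {
    private final ExecutorService eventLoop = Executors.newSingleThreadExecutor();
    private volatile String resource = "initialized"; // hypothetical resource used by tasks
    private volatile CountDownLatch resume;

    void submit(Runnable task) {
        eventLoop.execute(task);
    }

    String resource() {
        return resource;
    }

    /** Would be called from Resource.beforeCheckpoint on the C/R notification thread. */
    void beforeCheckpoint() {
        CountDownLatch parked = new CountDownLatch(1);
        CountDownLatch gate = new CountDownLatch(1);
        resume = gate;
        eventLoop.execute(() -> {
            parked.countDown();   // tell the C/R thread that the loop is now idle
            awaitQuietly(gate);   // block the loop until afterRestore
        });
        awaitQuietly(parked);     // wait until the blocking task is actually running
        resource = null;          // safe to deinit: no other task can run
    }

    /** Would be called from Resource.afterRestore. */
    void afterRestore() {
        resource = "reinitialized";
        resume.countDown();       // unblock the event loop
    }

    void close() {
        eventLoop.shutdown();
        try {
            eventLoop.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```
</pre>
                    Tasks submitted between beforeCheckpoint and
                    afterRestore just queue up behind the parked task,
                    so the regular path needs no extra locking.<br>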
                  </blockquote>
                  <div><br>
                  </div>
                  <div>That's a nice idea!</div>
                  <div> </div>
                  <blockquote class="gmail_quote">
                    <br>
                    To avoid extra synchronization it could be
                    technically possible to <br>
                    modify CRaC implementation to keep all other threads
                    frozen during <br>
                    restore. There's a risk of some form of deadlock if
                    the thread <br>
                    performing C/R would require other threads to
                    progress, though, so any <br>
                    such solution would require extra thoughts. Besides,
                    this does not <br>
                    guarantee exclusivity so the afterRestore would need
                    to restore the <br>
                    resource to the *exactly* same state (as some of its
                    before-checkpoint <br>
                    state might have leaked to the thread in Processor).
                    In my opinion this <br>
                    is not the best way.<br>
                  </blockquote>
                  <div><br>
                  </div>
                  <div>This is the approach that OpenJ9 took to solve
                    the consistency problems introduced by updating
                    resources before / after checkpoints.  OpenJ9 enters
                    "single threaded mode" when creating the checkpoint
                    and executing the before-checkpoint fixups.  On
                    restore, it continues in single-threaded mode while
                    executing the after checkpoint fixups.  This makes
                    it easier to avoid additional runtime costs related
                    to per-resource locking for checkpoints, but
                    complicates locking and wait/notify in general.</div>
                  <div><br>
                  </div>
                  <div>This means a checkpoint hook operation can't wait
                    on another thread (would block indefinitely as other
                    threads are paused), can't wait on a lock being held
                    by another thread (again, would deadlock), and
                    sending notify may result in inconsistent behaviour
                    (wrong number of notifies received by other
                    threads).  See "The checkpointJVM() API" section of
                    their blog post on CRIU for more details [0].</div>
                </div>
              </div>
            </blockquote>
            <div><br>
            </div>
            <div>The "single thread mode", imo, corresponds to the
              "serializable isolation" approach in data processing and
              DB transactions. </div>
          </div>
        </div>
      </div>
    </blockquote>
    <p>Serialization into one thread is one way to achieve serializable
      isolation, but there are other strategies too. Beware, though,
      that no major database nowadays supports strict serializable
      isolation (even if it calls some mode serializable) - I can't find
      a proper link to show, and I am digressing anyway. <br>
    </p>
    <p><br>
    </p>
    <blockquote type="cite" cite="mid:CAGNdOEXLFeKArno2md8-wDZvVgFzy45ExXn93rMWvgVkBxNWLg@mail.gmail.com">
      <div>
        <div dir="ltr">
          <div class="gmail_quote">
            <div>The OpenJ9 blogs are very informative and, like the JDK
              invoke API, will need time to digest. </div>
            <div>But I have one conceptual question: what part of this
              should/could be implemented by CRaC itself, and what
              abstractions should be exposed to CRaC users? <br>
            </div>
          </div>
        </div>
      </div>
    </blockquote>
    <p><br>
    </p>
    <p>CRaC users need to be aware that cleaning up before a checkpoint
      is their job. Ideally, if they use a library, the library should
      do this transparently as far as is practical. Anything beyond
      that is just utilities we provide: you need to dig, so here's a
      (hopefully appropriate) spade.</p>
    <p><br>
    </p>
    <p>Radim<br>
    </p>
    <p><br>
    </p>
    <blockquote type="cite" cite="mid:CAGNdOEXLFeKArno2md8-wDZvVgFzy45ExXn93rMWvgVkBxNWLg@mail.gmail.com">
      <div>
        <div dir="ltr">
          <div class="gmail_quote">
            <div><br>
            </div>
            <blockquote class="gmail_quote">
              <div dir="ltr">
                <div class="gmail_quote">
                  <blockquote class="gmail_quote">
                    <br>
                    The problem with RCU is tracking which threads are
                    in the critical <br>
                    section. I've found RCU-like implementations for
                    Java that avoid <br>
                    excessive overhead using a spread out array - each
                    thread marks <br>
                    entering/leaving the critical section by writes to
                    its own counter, <br>
                    preventing cache ping-pong (assuming no false
                    sharing). Synchronizer <br>
                    thread uses another flag to request synchronization;
                    reading this by <br>
                    each thread is not totally without cost but
                    reasonably cheap, and in <br>
                    that case worker threads can enter a blocking slow
                    path. The simple <br>
                    implementation assumes a fixed number of threads; if
                    the list of threads <br>
                    is dynamic the solution would be probably more
                    complicated. It might <br>
                    also make sense to implement this in native code
                    with a per-CPU <br>
                    counters, rather than per-thread. A downside,
                    besides some overhead in <br>
                    terms of both cycles and memory usage, is that we'd
                    need to modify the <br>
                    code and explicitly mark the critical sections.<br>
                    <br>
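                    A minimal sketch of the per-thread counter scheme
                    described above, assuming a fixed number of worker
                    threads that each own one slot, a single
                    synchronizer thread, and non-reentrant critical
                    sections. The class and method names are made up
                    for illustration; this is not one of the existing
                    RCU-like implementations mentioned.<br>
<pre>
```java
import java.util.concurrent.atomic.AtomicLongArray;

final class CriticalSectionGate {
    private static final int STRIDE = 16;   // 16 longs = 128 bytes between slots, to limit false sharing
    private final int slots;
    private final AtomicLongArray active;   // active[slot * STRIDE] is 1 while that worker is inside
    private volatile boolean syncRequested;

    CriticalSectionGate(int slots) {
        this.slots = slots;
        this.active = new AtomicLongArray(slots * STRIDE);
    }

    /** Worker: enter the critical section; slot is this thread's fixed index. */
    void enter(int slot) {
        int i = slot * STRIDE;
        while (true) {
            active.set(i, 1);          // volatile write to our own slot only
            if (!syncRequested) {
                return;                // fast path: one write and one read
            }
            active.set(i, 0);          // back off and take the blocking slow path
            awaitRelease();
        }
    }

    /** Worker: leave the critical section. */
    void exit(int slot) {
        active.set(slot * STRIDE, 0);
    }

    /** Synchronizer: returns once no worker is inside a critical section. */
    void acquire() {
        syncRequested = true;
        for (int s = 0; s != slots; s++) {
            while (active.get(s * STRIDE) != 0) {
                Thread.onSpinWait();
            }
        }
    }

    /** Synchronizer: lets the workers in again. */
    synchronized void release() {
        syncRequested = false;
        notifyAll();
    }

    private synchronized void awaitRelease() {
        boolean interrupted = false;
        while (syncRequested) {
            try {
                wait();
            } catch (InterruptedException e) {
                interrupted = true;    // keep waiting, restore the flag afterwards
            }
        }
        if (interrupted) {
            Thread.currentThread().interrupt();
        }
    }
}
```
</pre>
                    Both the slot write and the flag read are volatile,
                    so either the worker sees the request and backs
                    off, or the synchronizer sees the worker's slot set
                    and waits for exit(). acquire() would be called
                    from beforeCheckpoint and release() from
                    afterRestore.<br>
                    <br>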
                    Another solution could try to leverage existing JVM
                    mechanics for code <br>
                    deoptimization, replacing the critical sections with
                    a slower, blocking <br>
                    stub, and reverting back after restore. Or even
                    independently requesting <br>
                    a safe-point and inspecting stack of threads until
                    the synchronization <br>
                    is possible.<br>
                  </blockquote>
                  <div><br>
                  </div>
                  <div>This will have a high risk of livelock.  The
                    OpenJ9 experience implementing single-threaded mode
                    for CRIU indicates there are a lot of strange
                    locking patterns in the world.</div>
                  <div> </div>
                  <blockquote class="gmail_quote">
                    <br>
                    So I probably can't offer a ready-to-use performant
                    solution; pick your <br>
                    poison. The future, though, offers a few
                    possibilities and I'd love to <br>
                    hear others' opinions about which one would look the
                    most feasible. <br>
                    Because unless we offer something that does not harm
                    a no-CRaC use-case <br>
                    I am afraid that the adoption will be quite limited.<br>
                  </blockquote>
                  <div><br>
                  </div>
                  <div>Successful solutions will push the costs into the
                    checkpoint / restore paths as much as possible. 
                    Going back to the explicit lock mechanism you first
                    mentioned, I wonder if there's a role for
                    java.lang.invoke.SwitchPoint [1] here?  SwitchPoint
                    was added as a tool for language implementers who
                    wanted to be able to speculate on a particular
                    condition (e.g. CHA assumptions) and get the same
                    kind of low-cost state change that existing
                    JIT-compiled code gets.  I'm not sure how well that
                    vision worked in practice or how well HotSpot
                    optimizes it yet, but this might be a reason to push
                    on its performance.</div>
                  <div><br>
                  </div>
                  <div>Roughly the idea would be to add a couple of
                    SwitchPoints to jdk.crac.Core:</div>
                  <div><br>
                  </div>
                  <div>&nbsp;&nbsp; public SwitchPoint getBeforeSwitchPoint();</div>
                  <div>&nbsp;&nbsp; public SwitchPoint getAfterSwitchPoint();</div>
                  <div><br>
                  </div>
                  <div>and users could then write their code using
                    MethodHandles to implement the branching logic:</div>
                  <div><br>
                  </div>
                  <div>&nbsp;&nbsp;&nbsp; MethodHandle normalPath = ...... // existing
                    code</div>
                  <div>&nbsp;&nbsp;&nbsp; MethodHandle fallbackPath = ..... // before
                    Checkpoint extra work</div>
                  <div>&nbsp;&nbsp;&nbsp; MethodHandle guardWithTest =
                    getBeforeSwitchPoint().guardWithTest(normalPath,
                    fallbackPath);</div>
                  <div><br>
                  </div>
                  <div>and the jdk.crac.Core class would invalidate the
                    "before" SwitchPoint prior to the checkpoint and
                    "after" one after the restore.  Aside from the
                    painful programming model, this might give us the
                    tools we need to make it performant.</div>
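                  <div><br>
                  </div>
                  <div>A minimal, self-contained sketch of the
                    guardWithTest pattern using only the existing
                    java.lang.invoke API. The accessors proposed for
                    jdk.crac.Core do not exist, so the sketch takes a
                    SwitchPoint as a parameter; the class name and the
                    two String-returning paths are made up for
                    illustration.</div>
<pre>
```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.SwitchPoint;

final class SwitchPointDemo {
    /** Existing code: no extra synchronization. */
    static String normalPath(String input) {
        return "fast:" + input;
    }

    /** Slow path: a real application would do its extra locking here. */
    static String fallbackPath(String input) {
        return "slow:" + input;
    }

    /** Returns a handle that runs normalPath until the SwitchPoint is invalidated, fallbackPath afterwards. */
    static MethodHandle guard(SwitchPoint beforeCheckpoint) {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            MethodType type = MethodType.methodType(String.class, String.class);
            MethodHandle normal = lookup.findStatic(SwitchPointDemo.class, "normalPath", type);
            MethodHandle fallback = lookup.findStatic(SwitchPointDemo.class, "fallbackPath", type);
            return beforeCheckpoint.guardWithTest(normal, fallback);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Convenience wrapper that hides the Throwable declared by invokeExact. */
    static String call(MethodHandle handle, String input) {
        try {
            return (String) handle.invokeExact(input);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }
}
```
</pre>
                  <div>The checkpoint side would call
                    SwitchPoint.invalidateAll(new SwitchPoint[] { sp })
                    before the checkpoint. An invalidated SwitchPoint
                    cannot be reset, so each checkpoint/restore cycle
                    would need a fresh SwitchPoint and freshly guarded
                    handles.</div>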
                </div>
              </div>
            </blockquote>
            <div><br>
            </div>
            <div>@Dan, this is very interesting! </div>
            <div>Could you please elaborate a bit further. Perhaps in
              the context of the CrackDemoExt.java sample? </div>
            <div> </div>
            <blockquote class="gmail_quote">
              <div dir="ltr">
                <div class="gmail_quote">
                  <div><br>
                  </div>
                  <div>Needs more exploration and prototyping but would
                    provide a potential path to reasonable performance
                    by burying the extra locking in the fallback paths. 
                    And it would be a single pattern to optimize, rather
                    than all the variations users could produce.</div>
                  <div>--Dan</div>
                  <div>[0] <a href="https://blog.openj9.org/2022/10/14/openj9-criu-support-a-look-under-the-hood/" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://blog.openj9.org/2022/10/14/openj9-criu-support-a-look-under-the-hood/</a><br>
                  </div>
                  <div>[1] <a href="https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/invoke/SwitchPoint.html" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/invoke/SwitchPoint.html</a></div>
                </div>
              </div>
            </blockquote>
            <div><br>
            </div>
            <div>Thank you,</div>
            <div> - Christian</div>
            <div><br>
            </div>
            <div> </div>
            <blockquote class="gmail_quote">
              <div dir="ltr">
                <div class="gmail_quote">
                  <blockquote class="gmail_quote">
                    <br>
                    Cheers,<br>
                    <br>
                    Radim<br>
                    <br>
                    [1] <a href="https://en.wikipedia.org/wiki/Read-copy-update" rel="noreferrer" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">
                      https://en.wikipedia.org/wiki/Read-copy-update</a><br>
                    <br>
                    On 03. 04. 23 22:30, Christian Tzolov wrote:<br>
                    > Hi, I'm testing CRaC in the context of
                    long-running applications (e.g. streaming,
                    continuous processing ...) and I've stumbled on an
                    issue related to the coordination of the restored
                    threads.<br>
                    ><br>
                    > For example, let's have a Processor that
                    performs continuous computations. This processor
                    depends on a ProcessorContext, and the latter must be
                    fully initialized before the processor can process
                    any data.<br>
                    ><br>
                    > When the application is first started (i.e. not
                    from a checkpoint) it ensures that the
                    ProcessorContext is initialized before starting the
                    Processor loop.<br>
                    ><br>
                    > To leverage CRaC I've implemented a
                    ProcessorContextResource that gracefully stops the
                    context on beforeCheckpoint and then re-initializes
                    it on afterRestore.<br>
                    ><br>
                    > When the checkpoint is performed, CRaC calls
                    ProcessorContextResource.beforeCheckpoint and
                    also preserves the current Processor call stack. On
                    restore, the processor's call stack is, as expected,
                    restored at the point where it was stopped, but
                    unfortunately it doesn't wait for
                    ProcessorContextResource.afterRestore to complete.
                    This predictably crashes the processor.<br>
                    ><br>
                    > The <a href="https://github.com/tzolov/crac-demo" rel="noreferrer" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">
                      https://github.com/tzolov/crac-demo</a>
                    illustrates this issue. The README explains how to
                    reproduce the issue. The OUTPUT.md (<a href="https://github.com/tzolov/crac-demo/blob/main/OUTPUT.md" rel="noreferrer" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/tzolov/crac-demo/blob/main/OUTPUT.md</a>
                    ) offers terminal snapshots of the observed
                    behavior.<br>
                    ><br>
                    > I've used latest JDK CRaC release:<br>
                    >    openjdk 17-crac 2021-09-14<br>
                    >    OpenJDK Runtime Environment (build
                    17-crac+5-19)<br>
                    >    OpenJDK 64-Bit Server VM (build
                    17-crac+5-19, mixed mode, sharing)<br>
                    ><br>
                    > As I'm new to CRaC, I'd appreciate your
                    thoughts on this issue.<br>
                    ><br>
                    > Cheers,<br>
                    > Christian<br>
                    ><br>
                    ><br>
                    ><br>
                    ><br>
                    <br>
                  </blockquote>
                </div>
              </div>
            </blockquote>
          </div>
        </div>
      </div>
    </blockquote>
  </body>
</html>