On restore the "main" thread is started before the Resource's afterRestore has completed

Christian Tzolov christian.tzolov at gmail.com
Mon Apr 3 20:30:16 UTC 2023


Hi, I'm testing CRaC in the context of long-running applications (e.g.
streaming, continuous processing ...) and I've stumbled on an issue related
to the coordination of the restored threads.

For example, let's have a *Processor* that performs continuous
computations. This processor depends on a *ProcessorContext*, and the latter
must be fully initialized before the processor can process any data.

When the application is first started (i.e. not from a checkpoint) it
ensures that the *ProcessorContext* is initialized before starting the
*Processor* loop.

To leverage CRaC I've implemented a *ProcessorContextResource* that gracefully
stops the context on *beforeCheckpoint* and then re-initializes it on
*afterRestore*.
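
A minimal sketch of such a resource, to make the description concrete. It is
not the code from the demo repository: the *CheckpointResource* interface is a
simplified, hypothetical stand-in for org.crac.Resource (whose real callbacks
also take a Context argument), and the *init*, *close* and *isInitialized*
methods on *ProcessorContext* are assumed names.

```java
class ContextResourceSketch {

    /** Simplified stand-in for org.crac.Resource (assumption: no Context argument). */
    interface CheckpointResource {
        void beforeCheckpoint() throws Exception;
        void afterRestore() throws Exception;
    }

    /** Hypothetical context that must be initialized before the processor uses it. */
    static class ProcessorContext {
        private volatile boolean initialized;

        void init() { initialized = true; }

        void close() { initialized = false; }

        boolean isInitialized() { return initialized; }
    }

    /** Stops the context before a checkpoint and re-initializes it after restore. */
    static class ProcessorContextResource implements CheckpointResource {
        private final ProcessorContext context;

        ProcessorContextResource(ProcessorContext context) {
            this.context = context;
        }

        @Override
        public void beforeCheckpoint() {
            context.close(); // release whatever must not survive the checkpoint
        }

        @Override
        public void afterRestore() {
            context.init(); // bring the context back before it is used again
        }
    }

    public static void main(String[] args) throws Exception {
        ProcessorContext context = new ProcessorContext();
        context.init();
        ProcessorContextResource resource = new ProcessorContextResource(context);

        resource.beforeCheckpoint();
        System.out.println("after beforeCheckpoint: " + context.isInitialized());

        resource.afterRestore();
        System.out.println("after afterRestore: " + context.isInitialized());
    }
}
```

With the real API the resource would additionally have to be registered with
the CRaC global context so that the callbacks are invoked.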

When the checkpoint is performed, CRaC calls
*ProcessorContextResource.beforeCheckpoint* and also preserves the current
*Processor* call stack. On restore, the processor's call stack is, as
expected, restored at the point where it was stopped, but unfortunately it
doesn't wait for *ProcessorContextResource.afterRestore* to complete.
Predictably, this crashes the processor.
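
The ordering problem can be replayed without CRaC at all. The sketch below is
a single-threaded simulation, not the demo code: it only mimics the sequence
"context stopped, processor resumes, context re-initialized". The
*processOnce* and *lookup* methods and the IllegalStateException are
assumptions made for illustration.

```java
class RestoreRaceSketch {

    /** Hypothetical context that rejects use while it is not initialized. */
    static class ProcessorContext {
        private volatile boolean initialized;

        void init() { initialized = true; }

        void close() { initialized = false; }

        String lookup(String key) {
            if (!initialized) {
                throw new IllegalStateException("ProcessorContext is not initialized");
            }
            return "value-for-" + key;
        }
    }

    /** Hypothetical processor; processOnce is one iteration of its loop. */
    static class Processor {
        private final ProcessorContext context;

        Processor(ProcessorContext context) {
            this.context = context;
        }

        String processOnce(String input) {
            return context.lookup(input);
        }
    }

    /** Replays the observed ordering; returns true if the resumed step failed. */
    static boolean failsBeforeAfterRestore() {
        ProcessorContext context = new ProcessorContext();
        context.init();
        Processor processor = new Processor(context);
        processor.processOnce("a"); // regular start: context is ready first

        context.close(); // what beforeCheckpoint does

        boolean failed = false;
        try {
            processor.processOnce("b"); // restored stack resumes too early
        } catch (IllegalStateException e) {
            failed = true;
        }

        context.init(); // afterRestore completes only now
        processor.processOnce("c"); // works again once the context is back
        return failed;
    }

    public static void main(String[] args) {
        System.out.println("failed before afterRestore: " + failsBeforeAfterRestore());
    }
}
```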

The https://github.com/tzolov/crac-demo project illustrates this issue. The
README explains how to reproduce it. The OUTPUT.md (
https://github.com/tzolov/crac-demo/blob/main/OUTPUT.md ) offers terminal
snapshots of the observed behavior.

I've used the latest JDK CRaC release:
  openjdk 17-crac 2021-09-14
  OpenJDK Runtime Environment (build 17-crac+5-19)
  OpenJDK 64-Bit Server VM (build 17-crac+5-19, mixed mode, sharing)

As I'm new to CRaC, I'd appreciate your thoughts on this issue.

Cheers,
Christian