<div dir="ltr"><div dir="ltr"><div>Hi Radim,</div><div><br></div><div>Thanks a lot for the detailed explanation! That completely cleared up my understanding of the design philosophy behind CRaC.</div><div><br></div><div>It makes perfect sense now that the goal isn't purely transparent restoration, but rather preserving the valuable internal JVM/application state while enabling robust adaptation to the new environment after restore – sacrificing some transparency for resilience by consciously managing external resources. </div><div><br></div><div>Great project, and I appreciate the insight. Hope to be able to contribute down the line!</div><div><br></div><div>Cheers,</div><div>Ma Zhen</div></div></div><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Fri, Apr 11, 2025 at 15:17, Radim Vansa &lt;<a href="mailto:rvansa@azul.com">rvansa@azul.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><u></u>
<div>
<p>Hi Ma Zhen,</p>
<p>you have correctly observed that closing file descriptors is
an architectural choice rather than a purely technical need. CRIU
is really capable of restoring the process as-is, as its main
motivation is migration of running containers. Containers already
define the filesystem, and the runtime is in control of external
connections - e.g. CRIU can checkpoint and later restore an open
socket connection, and the container runtime restores the 'second
half' of the socket so that the pause is transparent to the
running process.</p>
<p>If this is what you want, there's nothing preventing you from
using CRIU on a Java process manually - at the risk of breaking
the internal logic of the application. However, the point of CRaC
is not such a transparent restore: we want to preserve the
valuable state of the JVM and application but adapt it to the new
environment. We want to make a conscious decision about any resource
external to the process. Being forced to gracefully adapt to the
restore is a feature.</p>
<p>Yes, we have File Descriptor policies, but they are not a solution
- they provide a workaround for proofs of concept, until some code
that you can't easily fix gets updated to support CRaC properly.
Ideas meet practicality, and you are responsible for working out
what should be done with each particular external resource.</p>
<p>You're right that at the moment we don't handle JDK Platform Logging
(nor JUL) configured to write to a file, and since that is JDK
code outside the user's control, it is a bug. We try to fix those one
by one (PRs are welcome!).<br>
</p>
<p>I hope I have provided some insight into these choices - and yes, I
understand the pain, as we still have many places to fix.</p>
<p>Cheers, </p>
<p>Radim<br>
</p>
<div>On 10. 04. 25 11:30, ma zhen wrote:<br>
</div>
<blockquote type="cite">
<div>
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<p>
<span>Hi CRaC developers,</span></p>
<p>
<span><span>I'm currently
exploring the integration of CRaC support into our
company's middleware products. I'm also very
interested in the underlying implementation details
of CRaC and have been doing some research into its
mechanics.</span></span></p>
<p>
<span><span>As I
understand it, CRaC leverages CRIU under the hood
for checkpointing and restoring running processes.
My research indicates that CRIU itself is capable of
handling open file descriptors and established
network connections during the checkpoint/restore
cycle.</span></span></p>
<p>
<span><span>However, the
CRaC API requires developers to explicitly manage
these resources, typically by closing them in </span><span>beforeCheckpoint()</span><span> and re-establishing
them in </span><span>afterRestore()</span><span>.</span></span></p>
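<p>A minimal sketch of that close-and-reopen pattern, for illustration only. To stay runnable without the CRaC library it uses a hypothetical stand-in interface, CheckpointAware; in a real application the class would implement org.crac.Resource (whose callbacks also take a Context argument) and be registered with Core.getGlobalContext().register(...). The ReopeningLog class and its methods are likewise assumptions made for this example, not part of any API.</p>
<pre>
```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical stand-in for org.crac.Resource so the sketch compiles without
// the CRaC library; the real callbacks also receive a Context argument.
interface CheckpointAware {
    void beforeCheckpoint() throws Exception;
    void afterRestore() throws Exception;
}

// A log sink that owns a file descriptor and gives it up around a checkpoint.
class ReopeningLog implements CheckpointAware {
    private final Path path;
    private Writer out;

    ReopeningLog(Path path) throws IOException {
        this.path = path;
        open();
    }

    private void open() throws IOException {
        // Append so that lines written before the checkpoint are preserved.
        out = Files.newBufferedWriter(path, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    synchronized void log(String line) throws IOException {
        if (out == null) {
            throw new IllegalStateException("log is closed for checkpoint");
        }
        out.write(line);
        out.write(System.lineSeparator());
        out.flush();
    }

    synchronized boolean isOpen() {
        return out != null;
    }

    @Override
    public synchronized void beforeCheckpoint() throws IOException {
        // Closing flushes pending output and releases the file descriptor,
        // so nothing external is left open when the process is checkpointed.
        out.close();
        out = null;
    }

    @Override
    public synchronized void afterRestore() throws IOException {
        // Reopen in the restored environment, which may differ from the original.
        open();
    }
}
```
</pre>
<p>The point of the sketch is that the application, not CRIU, decides what happens to the file: it could just as well reopen a different path after restore.</p>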
<p>
<span><span>To understand
the rationale behind this design choice, I looked
into the initial CRaC prototype, specifically the
first PR (<a href="https://github.com/openjdk/crac/pull/1" target="_blank">https://github.com/openjdk/crac/pull/1</a></span><span>). It appears that
even in this early version, the implementation
iterated through all process file descriptors during
checkpoint. It ignored certain FDs (like those
related to classpath files, </span><span>/dev/random</span><span>, </span><span>/dev/urandom</span><span>, and files marked </span><span>M_PERSISTENT</span><span> - though I'm unclear
on the exact meaning of </span><span>M_PERSISTENT</span><span> in this context). If
any other application-opened files remained, the
checkpoint process would fail. This suggests the
requirement for manual resource management was
present from the outset.</span></span></p>
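<p>The check described above could be pictured roughly as follows. This is a hypothetical Java rendering for illustration, not the JVM's actual code from that PR: the class name FdCheckSketch, the allow list, and the way open paths and classpath entries are passed in are all assumptions, and M_PERSISTENT is left out because its meaning is unclear here.</p>
<pre>
```java
// Hypothetical sketch of a checkpoint-time check on open files; not the JVM's code.
class FdCheckSketch {
    // Paths tolerated when still open at checkpoint, per the description above.
    private static final String[] ALLOWED_EXACT = { "/dev/random", "/dev/urandom" };

    // True if the path is one of the tolerated devices or a classpath entry.
    static boolean isAllowed(String path, String[] classpathEntries) {
        for (String exact : ALLOWED_EXACT) {
            if (path.equals(exact)) {
                return true;
            }
        }
        for (String entry : classpathEntries) {
            if (path.equals(entry)) {
                return true;
            }
        }
        return false;
    }

    // Throws if any open path is not allowed, mirroring "the checkpoint fails".
    static void verify(String[] openPaths, String[] classpathEntries) {
        StringBuilder offending = new StringBuilder();
        for (String path : openPaths) {
            if (!isAllowed(path, classpathEntries)) {
                if (offending.length() != 0) {
                    offending.append(", ");
                }
                offending.append(path);
            }
        }
        if (offending.length() != 0) {
            throw new IllegalStateException("checkpoint refused, still open: " + offending);
        }
    }
}
```
</pre>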
<p>
<span><span>As I'm not
deeply familiar with JVM internals, I'm struggling
to fully grasp the reasoning. Was this restriction
primarily introduced to simplify the initial design
and implementation of CRaC within the JVM?</span></span></p>
<p>
<span><span>I also
noticed that current versions of CRaC include File
Descriptor Policies. These allow configuring an </span><span>action:
ignore</span><span> for
specific file descriptors, effectively delegating
their handling to CRIU. This seems to demonstrate
that letting CRIU manage certain open files </span><span>is</span><span> feasible within the
CRaC framework.</span></span></p>
<p>
<span><span>This leads me
to wonder: if delegation to CRIU is possible and
works (at least for some cases via policies), why
isn't relying on CRIU for resource handling the
default or more broadly encouraged approach? Why the
strict requirement for manual closure and reopening
in the general case?</span></span></p>
<p>
<span><span>For instance,
consider using </span><span>System.getLogger()</span><span> from the JDK
Platform Logging API. As application developers, we
don't typically manage the underlying file
descriptor for the log file directly. To make this
work with CRaC, we currently need to identify and
configure a File Descriptor Policy for it, which can
feel somewhat cumbersome. Wouldn't a smoother
experience involve CRaC (perhaps optionally)
defaulting to letting CRIU handle such internally
managed resources, like those opened by standard JDK
libraries?</span></span></p>
<p>
<span><span>I would
appreciate any insights or clarification you could
offer on the design philosophy behind CRaC's
approach to managing external resources like files
and sockets, especially in contrast to CRIU's
capabilities.</span></span></p>
<p>
<span><span>Thanks for
your time and any insights you can share.</span></span></p>
<p>
<span><span>Best regards,</span></span></p>
<p>
mazhen</p>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote></div>