injecting an AOT cache into a JRE+app "deployment artifact"
Severin Gehwolf
sgehwolf at redhat.com
Wed Aug 20 10:09:51 UTC 2025
On Thu, 2025-08-14 at 23:11 -0700, John Rose wrote:
> In today’s meeting we discussed a tricky chicken-and-egg problem,
> which is adding an AOT cache into a deployment artifact, where
> the AOT cache came from a training run from (almost) the same
> deployment artifact. (Almost, but not quite.)
>
> By “deployment artifact” I mean any organization of a JRE plus
> an application classpath (including JARs) plus any other
> dependencies. It could be a JRE plus some command line
> arguments plus an assurance that the JAR files mentioned
> on the command line will always be available and won’t
> change.
>
> The Hermetic Java project aims at making a single executable
> file that contains the full “deployment artifact”, so it
> can serve as a crisp visualization of what I mean by
> “deployment artifact”.
>
> There are many ways to specify such an artifact, it seems
> to me, including many deployment and packaging facilities
> built to enable cloud computing. I’ll let others add
> more details about that, if they wish.
>
> In a basic view, it is a JRE plus some app JARs
> (and maybe other libraries) plus some configuration
> information (often viewable as command line options),
> plus any other dependencies, including an optional
> AOT cache (or CDS archive).
>
> Now we add a Leyden principle about training runs,
> that a training run should be as similar as possible
> to the ultimate production run, in order to get an
> AOT cache that is tuned for application behavior
> that is typical during final production.
>
> This takes us to the following puzzle: To make
> a training run, I need to put together a “deployment
> artifact” that represents, as accurately as possible,
> the actual app (with its JRE and configuration) that
> I intend to deploy “for real”. But it must lack
> one thing, the AOT cache. I’m making the training
> run to get that AOT cache. But when I get it,
> I need to (somehow) retroactively inject the
> AOT cache back into my “deployment artifact”.
>
> The Leyden JEPs make this look pretty simple:
> Just add some more command line options to pull
> in the AOT cache. And don’t change anything else!
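For reference, the command-line shape the JEPs describe looks roughly
like this (flag names as in JEP 483; app.jar and com.example.Main are
just placeholders):

  # training run: record an AOT configuration
  $ java -XX:AOTMode=record -XX:AOTConfiguration=app.aotconf \
        -cp app.jar com.example.Main ...

  # assembly: turn the recorded configuration into an AOT cache
  $ java -XX:AOTMode=create -XX:AOTConfiguration=app.aotconf \
        -XX:AOTCache=app.aot -cp app.jar

  # production run: same JRE, same class path, plus the cache
  $ java -XX:AOTCache=app.aot -cp app.jar com.example.Main ...

Everything except the AOT flags has to stay the same between those
runs, which is exactly where the packaging question below comes in.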
>
> But if there is a complicated pipeline for
> application deployment, and/or a special
> “bundle format” (or even a unified executable)
> needed for deployment, then it seems harder
> to say, in a robust manner, how to tweak the
> “deployment artifact” one way (A) to get the AOT
> cache, and then how to tweak it the opposite
> way (B) by injecting the resulting AOT cache
> into the artifact itself.
>
> I’m making a fuss about this because, depending
> on the details of how much processing and packaging
> is required, it could turn out that those tweaks
> (A) and (B) might perturb the JVM version, JARs
> and/or configuration options enough so that,
> after all that work, the AOT cache does not
> “fit” into the resulting execution. Instead,
> it detects a configuration mismatch, tragically
> due to its own injection into the final
> “deployment artifact”, and it “falls out”
> of the deployment run. My concern is to
> make sure this doesn’t happen due to errors
> in packaging.
>
> Here’s an example of what might go wrong.
> Suppose we have a jlink-like command that
> builds a JVM (from sources) to match some
> configuration parameter. Suppose that JVM
> has an internal version number which is a UID.
> Suppose we build a “deployment artifact” which
> contains such an ad hoc build of the JVM,
> and perform a training run, obtaining an
> AOT cache. Now, we re-run our packaging
> workflow, this time with the AOT cache.
> Suppose we rebuild the VM (same sources)
> but we get a new ad hoc UID. Now the
> AOT cache won’t match. (…Unless we do
> some bug fixing, but I’m concerned about
> robustly getting the right answer without
> bug fixing.)
>
> Beyond the simple command line examples shown
> in the JEPs, I have one suggestion for how to make
> these sorts of things work in a reliable manner,
> and that is to bake Leyden-like workflows into
> jlink.
>
> The jlink command builds a jre, and it can also
> fold in application JARs and various configuration
> settings (AFAIK).
Provided the application JARs are modular JARs, which not all are.
Plain class-path usage doesn't work with jlink as of yet (there is no
way to fold app classes into the resulting modules image).
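To illustrate, a jlink invocation that folds an application module
into the runtime image looks roughly like this (module, package and
path names are placeholders):

  $ jlink --module-path $JAVA_HOME/jmods:mods \
          --add-modules com.example.app \
          --launcher app=com.example.app/com.example.Main \
          --output image
  $ image/bin/app        # bundled JRE + app module in one image

A class-path-only application has no module name to hand to
--add-modules, which is why it cannot be folded in this way today.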
> So we can focus on jlink as
> a venue for building compatible “deployment
> artifacts”, compatible for both the training
> and production, even though the production
> version has an AOT cache in it, and the
> training one does not (or has a little one).
>
> Adding jlink allows a workflow like this:
>
> (a) I run jlink to build a JRE with app JARs
> (b) it gives me DA0 “deployment artifact zero”
> (c) I make a training run using DA0 and get an AOT cache
> (d) I rerun jlink as in (a), except I add the AOT cache
> (e) it gives me DA1 (maybe I handed it DA0 also to edit)
> (f) I make many production runs using DA1
The plus side of such a workflow is that the JVM and the app are both
included in your "deployment artifact". The issue of version mismatch
largely goes away.
A concern that comes to mind with such a workflow is that jlink runs
aren't recursive: step (d) links from a JDK again rather than from the
image produced in (a). So the JVM bundled in (a) isn't necessarily the
same as the one in use after (f). What if (a) runs on JDK (i) and (f)
runs on JDK (i')? Why would that happen? Because a third-party
container image was used when (a) was performed, and that container
image got updated by the time the training run was done (with the user
oblivious to this). That is pretty common in the cloud world.
It would be nice to allow step (d) to operate on the result of (a) to
avoid this.
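Concretely, with today's tools steps (a) through (f) might look
roughly like the sketch below, assuming a plain class-path app kept at
a fixed location (paths, module list and names are placeholders; jlink
has no AOT-cache option yet, so step (d) degrades to a manual copy
that is only safe if the image really is the one from (a)):

  # (a)+(b): DA0 = jlink'ed runtime; the app stays at a fixed path
  $ jlink --add-modules java.base,java.logging --output da0
  # app lives at /opt/app/app.jar for training and production alike

  # (c): training run against DA0, then assemble the AOT cache
  $ da0/bin/java -XX:AOTMode=record -XX:AOTConfiguration=app.aotconf \
        -cp /opt/app/app.jar com.example.Main ...
  $ da0/bin/java -XX:AOTMode=create -XX:AOTConfiguration=app.aotconf \
        -XX:AOTCache=app.aot -cp /opt/app/app.jar

  # (d)+(e): DA1 = DA0 plus the cache; today a manual step, valid only
  # if this is the very image from (a), not a fresh jlink run on a
  # possibly updated JDK
  $ cp -r da0 da1 && cp app.aot da1/lib/app.aot

  # (f): production runs with DA1
  $ da1/bin/java -XX:AOTCache=da1/lib/app.aot \
        -cp /opt/app/app.jar com.example.Main ...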
> By using jlink twice, in a coordinated manner, I am
> assured that, if my AOT cache ever fails to apply
> to a production run, that there is a bug in jlink.
> (Or I have made a production run with an incompatible
> configuration of hardware or GC or whatever, which
> is under my control.)
>
> Does this help? It partially depends if users are
> willing to deploy with the help of jlink. If not,
> then they are “on the hook” to make sure that the
> AOT cache does not “fall down” when they deploy
> for production.
>
> What about one-file formats, as with Hermetic?
> I think it’s trickier, because DA0 and DA1 are
> two distinct files. If the AOT cache (in DA1)
> runs a checksum test expecting to see DA0, it
> might fall down when it sees the details of DA1.
> It all depends on how the checksum is organized.
That's an interesting problem. Would a training run against the
non-hermetic "deployment artifact" be sufficient as input for the
hermetic final deployment?
> Anyway, this is as far as I’ve gotten today
> with this interesting chicken-and-egg problem.
> (Or maybe it’s a Heisenbug, if the presence
> of the observing AOT cache disrupts the expected
> observation?)
>
> I’d love to hear that I’m over-thinking things,
> and that deployment workflows are really not
> that tricky, and that adding training runs
> and AOT caches is straightforward.
My experience with AppCDS, and moving that to the cloud, tells me this
is going to be tricky. While that experience is clouded (pun not
intended) by the app not being modularized, there are many variables
to juggle in such an environment, variables that seem a no-brainer
when you are in full control of the JDKs at development time. JDK
container images are rather fluid and application deployments happen
constantly.
So making it hard to make mistakes between training runs and final
deployments seems a worthy goal, whichever workflow it ends up being.
That includes built-in tooling to "detect" proper usage of the AOT
cache at runtime.
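The closest thing to that today is to make the production launch
strict about the cache, for example (AOTMode values as in JEP 483; log
tags may differ between releases):

  # fail fast instead of silently falling back to a cold start
  $ java -XX:AOTCache=app.aot -XX:AOTMode=on -cp app.jar com.example.Main

  # or keep the default "auto" behaviour but log whether the cache
  # was actually mapped and used
  $ java -XX:AOTCache=app.aot -Xlog:cds -cp app.jar com.example.Main

Something along those lines, but built into the deployment tooling
rather than left to the user, is what I have in mind.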
> (If we try to micro-customize JREs, that adds
> a significant potential cause of AOT cache failure.
> I also worry about re-spinning a one-file artifact.
> Are those the only grounds for me to worry? Maybe.)
>
> On the positive side, I think we want to make our
> deployment tools (jlink!) more Leyden-aware, so that
> users don’t have to get too creative in managing
> AOT caches.
+1
My $0.02
Thanks,
Severin