AppCDS / AOT thoughts based on CLI app experience
Hi,

It feels like most of the interest in static Java comes from the microservices / functions-as-a-service community. My new company spent the last year creating a developer tool that runs on the JVM (it will actually be useful for Java developers, but what it does is irrelevant here). Internally it's a kind of build system and is thus a large(ish) CLI app in which startup time and throughput matter most. We also have a separate internal tool that uses Kotlin scripting to implement a bash-like scripting language, and which is sensitive in the same ways.

Today the JVM is often overlooked for writing CLI apps due to startup time, 'lightness' and packaging issues. I figured I'd write down some notes based on our experiences. They cover workflow, performance, implementation costs and security issues. Hopefully they're helpful.

1.

I really like AppCDS because:

a. It can't break the app, so switching it on/off is a no-brainer. Unlike native-image/static Java, it creates no additional testing overhead.

b. It's effective even without heap snapshotting. We see a ~40% speedup when executing --help.

c. It's pay-as-you-go. We can use a small archive that's fast to create to accelerate just the most latency-sensitive startup paths, or we can use it for the whole app; either way, the costs are controllable.

d. Archives are deterministic. Modern client-side packaging systems support delta updates, and CDS plays nicely with them. GraalVM native images are non-deterministic, so every update replaces the entire app, which isn't much fun from an update-speed or bandwidth-consumption perspective.

Startup time is dominated by PicoCLI, which is a common problem for Java CLI apps. Supposedly the slowest part is building the model of the CLI interface using reflection, so it's a perfect candidate for AppCDS heap snapshotting. I say supposedly because I haven't seen concrete evidence that this is actually where the time goes, but it seems like a plausible belief.
There's a long-standing bug filed to replace the reflection with code generation, but it's a big job and so nobody has done it.

Unfortunately the app will ship without using AppCDS: some workflow issues remain. These could be solved in the app itself, but it'd be nice if the JVM did it.

The obvious way to use CDS is to ship an archive with the app. We might do this as a first iteration, but longer term we don't want to, for two reasons:

a. The archive can get huge.

b. Signature verification penalties on macOS (see below).

For just making --help and similar short commands faster, the size isn't so bad (~6-10 MB for us), but if it's used for a whole execution, the archive for a standard run is nearly as large as the app's total bytecode. As more stuff gets cached this will get worse. Download size might not matter much for this particular app, but as a general principle it does. So a nice improvement would be to generate the archive client side.

CDS files are caches, and different platforms have different conventions for where caches go. The JVM doesn't know about those conventions but our app does, so we'd need our custom native code launcher (which exists anyway for other reasons) to set the right paths for CDS.

Then you have to pick the right flags depending on whether the CDS file exists or not. I follow CDS-related changes and believe this is fixed in the latest Java versions, but maybe (?) not released yet.

Even once that's fixed, it's not obvious that we'd use it. The JVM runs much slower when dumping a dynamic CDS archive, and the first run is when first impressions are made. For cloud workloads this is a matter of (artificially?) expensive resources; for CLI apps it's about more subjective things like feeling snappy. One idea is to delay dumping a CDS archive until the first run is exiting, so it doesn't get in the way.
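For illustration, the launcher-side logic described above (pick a platform cache directory, then dump or use the archive depending on whether it exists) can be sketched in plain Java. The class and app names here are hypothetical; the flags are the standard dynamic-CDS ones (`-XX:ArchiveClassesAtExit` to dump on exit, `-XX:SharedArchiveFile` to use an existing archive):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch of launcher-side CDS wiring: resolve the
// platform-conventional cache directory, then pick dump vs. use flags
// depending on whether the archive already exists on disk.
public class CdsLauncherSketch {
    // Per-platform cache directory conventions: XDG on Linux,
    // ~/Library/Caches on macOS, %LOCALAPPDATA% on Windows.
    static Path cacheDir(String osName, String home, String appName) {
        if (osName.contains("mac")) {
            return Path.of(home, "Library", "Caches", appName);
        }
        if (osName.contains("win")) {
            String base = System.getenv().getOrDefault("LOCALAPPDATA", home);
            return Path.of(base, appName, "cache");
        }
        String xdg = System.getenv("XDG_CACHE_HOME");
        Path base = (xdg != null) ? Path.of(xdg) : Path.of(home, ".cache");
        return base.resolve(appName);
    }

    // If the archive exists, load it; otherwise ask the JVM to dump one
    // when this run exits (dynamic dumping, JDK 13+).
    static List<String> cdsFlags(Path archive) {
        return Files.exists(archive)
                ? List.of("-XX:SharedArchiveFile=" + archive)
                : List.of("-XX:ArchiveClassesAtExit=" + archive);
    }

    public static void main(String[] args) {
        Path archive = cacheDir(System.getProperty("os.name").toLowerCase(),
                System.getProperty("user.home"), "mytool").resolve("app.jsa");
        System.out.println(String.join(" ", cdsFlags(archive)));
    }
}
```

A real launcher would also have to handle a stale or corrupt archive (the JVM tolerates a mismatched archive by falling back to normal loading), but the existence check above is the core of the flag-selection problem.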
The first run wouldn't benefit from the archive, which is a pity (except on Linux, where package managers make it easy to run code post-install), but at least it wouldn't be slowed down by creating it either. The native launcher can schedule this. Alternatively there could be a brief pause on first run, with the user told explicitly that the app is optimizing itself, but how feasible that is depends very much on dump speed. Finally, we could ship a small archive that only covers startup, and then make a dump of a full run in the background.

Speaking of which, there's a need for some protocol to drive an app through a representative 'trial run'. Whether it's generating the class list or the archive itself, it could be as simple as an alternative static method that sits next to main. If it were standardized, the rest of the infrastructure would become more reusable: build systems could take care of generating class lists, or the end-user packaging could take care of dynamic dumping.

CDS has two modes and it's not clear which is better. I'm unusually obsessive about this stuff, to the extent of reading the CDS source code, but despite that I have absolutely no idea whether I should be using static or dynamic archives. There used to be a performance difference between them, but maybe that's fixed now? There's a lack of end-to-end guidance on how best to exploit this feature.

The ideal would obviously be to lose the dump/exec split and make dynamic dumping continuous, incremental, and free of performance penalty. Then we could just supply a path for the CDS file and things would magically warm up across executions. I have no idea how feasible that is.

Once AppCDS archives are in place and being created at the right times, a @Snapshotted annotation for fields (or similar) should be an easy win to eliminate the bulk of the remaining PicoCLI time.
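To make the @Snapshotted idea concrete, here's a minimal sketch of what such a (purely hypothetical, non-existent) annotation could look like. The intent is that the fully-initialized value of an annotated field would be written into the CDS heap archive at dump time and patched back in on later runs, skipping the expensive initialization:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

public class CliEntryPoint {
    // Hypothetical marker annotation: a @Snapshotted field's value would be
    // archived at dump time instead of recomputed on every startup.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    public @interface Snapshotted {}

    // E.g. PicoCLI's reflectively-built command model could be archived once
    // rather than rebuilt from annotations on each run.
    @Snapshotted
    static final Object COMMAND_MODEL = buildModelExpensively();

    static Object buildModelExpensively() {
        // Stand-in for the reflective model construction that dominates startup.
        return new Object();
    }

    public static void main(String[] args) {
        System.out.println("model ready: " + (COMMAND_MODEL != null));
    }
}
```

Nothing like this exists today; the sketch only shows how small the API surface for the feature could be from the application's point of view.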
Dynamically loaded heaps would also be useful to eliminate the overhead of loading configs and instantiating the (build) task graph without a Gradle-style daemon.

2.

AppCDS archives can open a subtle security issue when distributing code to desktop platforms. Because they're full of vtables, anyone who can write to them can (we assume) take over any JVM that loads the archive and gain whatever privileges have been granted to that app. The archive file is fully trusted.

On Windows and Linux this doesn't matter. On Linux, sensitive files can be packaged or created in postinst scripts. On Windows, either an app comes with a legacy installer/MSI file, and thus doesn't have any recognized package identity that could be granted extra permissions, or it uses the current-gen MSIX system. In the latter case Windows has a notion of app identity, so you can request permissions to access e.g. keychain entries or the user's calendar, but then Windows also gives you a private directory, protected from other apps, where sensitive files can be stashed. AppCDS archives can go there and we're done.

macOS is a problem child. There are two situations that matter.

In the first, archives are shipped as data files with the app. Security is not an issue here, but there's a subtle performance footgun. On most platforms the signatures of files shipped with an app are checked at install time, but on macOS they aren't. Thanks to its NeXT roots it doesn't really have an installation concept, so the kernel checks file signatures on first use and then caches the result in the kernel vnode. By default the entire file is hashed in order to link it back to the root signature, which for large files can impose a small but noticeable delay before the app can open them. This first-run penalty is unfortunate given that AppCDS exists partly to improve startup time.
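One way to see why the whole-file scheme hurts: a single hash over the entire file means every byte must be read before any page can be trusted, whereas a per-page scheme can verify each page lazily as it's faulted in. A simplified model of the difference (4 KiB pages, SHA-256; this is just the underlying idea, not Apple's actual code-signing format):

```java
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

// Simplified contrast between whole-file hashing (forces the entire archive
// to be read before first use) and per-page hashing (each page verifiable
// independently at page-fault time). NOT Apple's on-disk format.
public class PageHashSketch {
    static final int PAGE = 4096;

    private static MessageDigest sha256() {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (Exception e) {
            throw new RuntimeException(e); // SHA-256 is always available
        }
    }

    // One digest over everything: verification cost is O(file size) up front.
    static byte[] wholeFileHash(byte[] data) {
        return sha256().digest(data);
    }

    // One digest per page: startup only pays for pages it actually touches,
    // and a tampered page is detected the moment it's faulted in.
    static List<byte[]> perPageHashes(byte[] data) {
        MessageDigest md = sha256();
        List<byte[]> out = new ArrayList<>();
        for (int off = 0; off < data.length; off += PAGE) {
            md.reset();
            md.update(data, off, Math.min(PAGE, data.length - off));
            out.add(md.digest());
        }
        return out;
    }
}
```

The per-page table itself still has to be covered by something trusted (a signature over the table, or a tamper-proof copy of it), which is exactly the role the structures discussed next play.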
You can argue it doesn't matter much due to the caching, but it's worth being aware of: very large AppCDS archives would get fully paged in and hashed before the app even gets to do anything. That means people might enable AppCDS with a big class list expecting a speedup, not noticing that for Mac users it slowed things down instead. There are ways to fix this using supported Apple APIs. One is to supply a CodeDirectory structure stored in extended attributes: you should then get incremental hashing and normal page-fault behaviour (untested!). Another is to wrap the data in a Mach-O file.

In the second situation, the CDS archive is generated client side. Mac apps don't have anywhere they can create tamper-proof data, except for very small amounts in the keychain. Thus if a Mac app opens a malicious cache file that can take control of it, that's a security bug, because it'd allow one program to grab any special privileges the user granted to another. The fact that the grabbing program has passed Gatekeeper and notarization doesn't necessarily matter (Apple's guidance here is unclear, but it seems plausible that this is their stance). In this case the keychain can be used as a root of trust, by storing a hash of the CDS archive in it and checking that hash after mmap and before use. Alternatively, again, Apple provides an API that lets you associate an on-disk (xattr) CodeDirectory structure with a file, which will then be checked incrementally at page-fault time. Extreme care must be taken to avoid race conditions, but in theory a CodeDirectory structure can be computed at dump time, written to disk as an xattr, and also stored in the keychain (e.g. by pretending it's a "key" or "password"). After the security API is instructed to associate the CodeDirectory with the file, it can be checked against the tamper-proofed version stored in the keychain; if they match, the archive can be mmapped and used as normal.
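The keychain-as-root-of-trust flow above is easy to sketch: at dump time, hash the archive and put the digest somewhere tamper-proof; at startup, recompute and compare before trusting the mapping. In this sketch the keychain is stubbed out as an in-memory trusted store (the real thing would go through Apple's Security framework, which isn't modelled here):

```java
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Sketch of the verify-before-use flow. The "keychain" is modelled as a
// trusted key/value store; a real implementation would use the macOS
// Security framework. Only the hashing/comparison logic is shown.
public class ArchiveTrustSketch {
    private final Map<String, byte[]> trustedStore = new HashMap<>(); // stand-in for the keychain

    static byte[] sha256(byte[] archiveBytes) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(archiveBytes);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Called at dump time, after the JVM finishes writing the archive.
    public void recordTrustedHash(String archiveId, byte[] archiveBytes) {
        trustedStore.put(archiveId, sha256(archiveBytes));
    }

    // Called at startup: hash the bytes of the mapping that will actually be
    // used (re-reading the file separately would reopen the race window).
    public boolean verify(String archiveId, byte[] archiveBytes) {
        byte[] expected = trustedStore.get(archiveId);
        return expected != null && MessageDigest.isEqual(expected, sha256(archiveBytes));
    }
}
```

The awkward part, as noted above, isn't the hashing but the races: the bytes that get verified must be the same bytes the JVM later executes from, which is why hashing the live mapping (or letting the kernel check pages at fault time) matters.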
Native images don't have these issues because the state snapshot is stored inside the Mach-O file and thus gets covered by the normal mechanisms. However, once native-image adds support for persisted heaps, the same issue may arise.

Whether it's worth the extra work to solve this is unclear. Macs are guaranteed to come with very fast NVMe disks and CPUs. Still, it's worth being aware of the issue.

3.

Why not just use a native image then? Maybe we will, because the performance wins are really compelling, but again, v1 will ship without it, for the following reasons:

a. Static minification can break things. Our integration tests currently invoke the entry point of the app directly, but that could be fixed by running the tool in an external process. For unit tests the situation is far murkier. It's unclear how to run JUnit tests against the statically compiled version, and it may not even make sense (the tests would pin a bunch of code that might get stripped in the real app, so what are you really testing?).

b. It'd break delta updates. Not the end of the world, but a factor.

c. I have no idea whether we're using any libraries that spin bytecode dynamically. Even if we're not today, what if tomorrow we want to use such a library? Do we avoid it and increase the cost of feature development, or roll back the native image and give our users a nasty performance downgrade? Neither option is attractive. Ideally SubstrateVM would contain a bytecode interpreter and use it when necessary. There are lots of issues there, but it'd probably be OK if it weren't a general classloader and the code dependencies had to be known AOT.

d. Similar to (c), fully AOT compilation can explode code size and thus download size, even though many code paths are cold and execute only once. It'd be nice if a native image could include a mix of bytecode and AOT-compiled hotspots.

e. Once past the initial interactive stage, the program is throughput sensitive.
How much of a perf downgrade versus HotSpot would we get, if any? With GraalVM EE we could use PGO and perhaps lose nothing, but the ISV pricing is opaque. At any rate, to answer this we'd have to fix the compatibility issues first. The prospect of improving startup time only to discover we'd slowed down the actual builds isn't appealing (though I suspect in our case AOT wouldn't really hurt much).

f. What if we want to support in-process plugins? Maybe we could use Espresso, but this is a road less travelled (lack of tutorials, well-documented examples, etc).

An interesting possibility is using a mix of approaches. For the bash competitor I mentioned earlier, dynamic code loading is needed because the script bytecode is loaded into the host JVM, but the Kotlin compiler itself could in theory be statically compiled to a JNI- or Panama-accessible library. We tried this before and hit compatibility errors, but didn't make any effort to resolve them.

4.

What about CRaC? It's Linux-only, so it isn't interesting to us given that most devs are on Windows/macOS. The benefits for Linux servers are clear, though. Obvious question: can you make a snapshot on one machine/Linux distro and resume it on a totally different one, or does it require homogeneous infrastructure?

5.

A big reason AppCDS is nice is that we get to keep the open world. This isn't only about compatibility; open worlds are just better. The most popular way to get software onto desktop machines is Chrome, and the web is totally open world. Apps are downloaded incrementally as the user navigates around, and companies exploit this aggressively. Large web sites can be far bigger than would be considered practical to distribute to end-user machines, and can easily update 50 times a day. Web developers have to think about latency on specific interactions, but they don't have to think about the size of the entire app, and that lets them scale up feature sets as fast as funding allows.
In contrast, the closed-world mobile versions of their sites are a parade of horror stories, in which firms have to e.g. hotpatch Dalvik to work around method-count limits (Facebook), or in which code size issues nearly wrecked the entire company (Uber):

https://twitter.com/StanTwinB/status/1336914412708405248

Right now code size isn't a particularly serious problem for us, but the ease of including open source libraries means the footprint grows all the time. Especially for our shell scripting tool, there are tons of cool features that could be added, but if we did all of them we'd probably end up with 500 MB of bytecode. With an open world, features can be downloaded on the fly as they get used, and you can build a plugin ecosystem.

The new, more incremental direction of Leyden is thus welcomed and appreciated, because it feels like a lot of ground can be covered by "small" changes like upgrading AppCDS and caching compiled hotspots. Even if the results aren't as impressive as with native-image, the benefits of keeping an open world can probably make up for it, at least for our use cases.
Hi Mike,

Thanks very much for that extremely valuable input, in particular the very clear breakdown of the swings and roundabouts you have noted when it comes to using CDS/AppCDS or native Java vs the vanilla dynamic JVM. It is very important that project Leyden considers the whole development and deployment cycle, not just the size and startup time/footprint of the delivered static Java executable (indeed, Dan Heidinga and I just published an article about this topic on InfoQ that you might find relevant).

Your comment about AppCDS being "pay-as-you-go" resonated most strongly. I hope that one of the pay-offs of the incremental approach Mark has recommended for the project will be the ability to provide "pay-as-you-go" improvements in startup time and footprint, where a user can balance development benefits and costs against those arising at deployment time.

regards,

Andrew Dinn
-----------
Red Hat Distinguished Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill

On 01/06/2022 14:03, Mike Hearn wrote:
Thanks Andrew. Yes, I saw the InfoQ article, it's excellent. Actually it was reading that which prompted me to sign up and write out these notes.
Thank you for the excellent write-up! Although many of the problems you've mentioned are not solved (and are sometimes made worse) by CRaC, I can't resist mentioning a CRaC change for CLI apps [1]. But this is off-topic, so I'm BCCing leyden-dev and CCing crac-dev.

On 6/1/22 16:03, Mike Hearn wrote:
What about CRaC? It's Linux-only, so it isn't interesting to us given that most devs are on Windows/macOS. The benefits for Linux servers are clear, though. Obvious question: can you make a snapshot on one machine/Linux distro and resume it on a totally different one, or does it require homogeneous infrastructure?
In the current implementation we've not started working on this. By the model, CRaC prevents file dependencies at the checkpoint and allows the VM to coordinate the restore, so eventually we should be able to deliver images that do not depend on a particular CPU or distribution.

The feasibility of a full implementation for macOS and Windows is unclear. But I think a reasonable effort would suffice to provide an implementation for testing and developing programs on those OSes which matches the behavior of the Linux CRaC implementation.

Thanks,
Anton
Hi Mike,

I am thrilled to hear that you're happy with CDS. Please see my responses below. If you have other questions or requests for CDS, please let me know :-)

On 6/1/2022 6:03 AM, Mike Hearn wrote:
Hi,
It feels like most of the interest in static Java comes from the microservices / functions-as-a-service community. My new company spent the last year creating a developer tool that runs on the JVM (which will be useful for Java developers actually, but what it does is irrelevant here). Internally it's a kind of build system and is thus a large(ish) CLI app in which startup time and throughput are what matter most. We also have a separate internal tool that uses Kotlin scripting to implement a bash-like scripting language, and which is sensitive in the same ways.
Today the JVM is often overlooked for writing CLI apps due to startup time, 'lightness' and packaging issues. I figured I'd write down some notes based on our experiences. They cover workflow, performance, implementation costs and security issues. Hopefully they're helpful.
1.
I really like AppCDS because:
a. It can't break the app, so switching it on/off is a no-brainer. Unlike native-image/static Java, it creates no additional testing overhead.
b. It's effective even without heap snapshotting. We see a ~40% speedup when executing --help.
c. It's pay-as-you-go. We can use a small archive that's fast to create to accelerate just the most latency-sensitive startup paths, or we can use it for the whole app; either way, the costs are controllable.
d. Archives are deterministic. Modern client-side packaging systems support delta updates, and CDS plays nicely with them. GraalVM native images are non-deterministic so every update is going to replace the entire app, which isn't much fun from an update speed or bandwidth consumption perspective.
Startup time is dominated by PicoCLI, which is a common problem for Java CLI apps. Supposedly the slowest part is building the model of the CLI interface using reflection, making it a perfect candidate for AppCDS heap snapshotting. I say supposedly because I haven't seen concrete evidence that this is actually where the time goes, but it seems like a plausible belief. There's a long-standing bug filed to replace reflection with code generation, but it's a big job and so nobody has done it.
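To make the cost concrete, here's a minimal sketch of reflection-driven option discovery, the kind of work a CLI framework does on every startup. This is not PicoCLI's actual implementation; the annotation and names are illustrative only:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of reflection-based CLI model building.
public class CliModelSketch {
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    public @interface Option { String name(); }

    public static class HelpCommand {
        @Option(name = "--verbose") boolean verbose;
        @Option(name = "--output")  String output;
    }

    // Walks the class with reflection and builds a name -> field model.
    // This traversal plus annotation parsing is exactly the kind of startup
    // work a CDS heap snapshot could capture once and reuse on later runs.
    public static Map<String, Field> buildModel(Class<?> spec) {
        Map<String, Field> model = new LinkedHashMap<>();
        for (Field f : spec.getDeclaredFields()) {
            Option opt = f.getAnnotation(Option.class);
            if (opt != null) model.put(opt.name(), f);
        }
        return model;
    }

    public static void main(String[] args) {
        System.out.println(buildModel(HelpCommand.class).keySet());
    }
}
```

For a real app the model covers hundreds of options across many subcommands, which is why the reflective walk adds up.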
Unfortunately the app will ship without using AppCDS, because some workflow issues remain. These can be solved in the app itself, but it'd be nice if the JVM did it.
The obvious way to use CDS is to ship an archive with the app. We might do this as a first iteration, but longer term don't want to for two reasons:
a. The archive can get huge.
b. Signature verification penalties on macOS (see below).
For just making --help and similar short commands faster, the size isn't so bad (~6-10 MB for us), but if it's used for a whole execution, the archive size for a standard run is nearly the same as the total bytecode size of the app. As more stuff gets cached this will get worse. Download size might not matter much for this particular app, but as a general principle it does. So a nice improvement would be to generate it client-side.
CDS files are caches and different platforms have different conventions for where those go. The JVM doesn't know about those conventions but our app does, so we'd need our custom native code launcher (which exists anyway for other reasons) to set the right paths for CDS.
Then you have to pick the right flags depending on whether the CDS file exists or not. I follow CDS-related changes and believe this is fixed in the latest Java versions, but maybe (?) not released yet.
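As a sketch of what such a launcher might do (the directory conventions and the "myapp" name are assumptions about typical platform practice, not anything the JVM mandates):

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical launcher logic: pick the platform's cache directory
// convention, then pass the CDS archive path to the JVM.
public class CdsLauncherSketch {
    public static Path cacheDir(String os, String home, String xdgCache, String localAppData) {
        if (os.contains("mac")) return Paths.get(home, "Library", "Caches", "myapp");
        if (os.contains("win")) return Paths.get(localAppData != null ? localAppData : home, "myapp", "cache");
        // Linux: honor XDG_CACHE_HOME, falling back to ~/.cache.
        String base = xdgCache != null ? xdgCache : home + "/.cache";
        return Paths.get(base, "myapp");
    }

    // Since JDK 11, -Xshare:auto is the default, so passing the archive path
    // unconditionally is safe: a missing file is silently ignored by the VM.
    public static List<String> jvmFlags(Path archive) {
        List<String> flags = new ArrayList<>();
        flags.add("-XX:SharedArchiveFile=" + archive);
        return flags;
    }

    public static void main(String[] args) {
        Path dir = cacheDir(System.getProperty("os.name").toLowerCase(),
                System.getProperty("user.home"),
                System.getenv("XDG_CACHE_HOME"),
                System.getenv("LOCALAPPDATA"));
        System.out.println(jvmFlags(dir.resolve("app.jsa")));
    }
}
```

The -Xshare:auto default (mentioned in the reply below) means the existence check can be skipped entirely, which simplifies the launcher.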
Which version of Java are you using? Since JDK 11, the default value of -Xshare is -Xshare:auto, so you can always do this:

$ java -XX:SharedArchiveFile=nosuch.jsa -version
java version "11" 2018-09-25
Java(TM) SE Runtime Environment 18.9 (build 11+28)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11+28, mixed mode)

If the file exists, it will be used automatically. Otherwise the VM will silently ignore the archive.

Since JDK 17, a default CDS archive is shipped with the JDK, so you will at least get some of the performance benefits of CDS for the built-in classes.

With the upcoming JDK 19, we have implemented a new feature (see JDK-8261455) to automatically create the CDS archive. Here's an example (I am using javac because it's convenient, but you need to prefix the JVM parameters with -J):

$ javac -J-XX:+AutoCreateSharedArchive -J-XX:SharedArchiveFile=javac.jsa HelloWorld.java

javac.jsa will be automatically created if it doesn't exist, or if it's not compatible with the JVM (e.g., if you have upgraded to a newer JDK). In this case, the total elapsed time improves from about 522ms (with the default CDS archive) to 330ms (auto-generated archive).
Even once that's fixed it's not quite obvious that we'd use it. The JVM runs much slower when dumping a dynamic CDS archive and the first run is when first impressions are made. Whilst for cloud stuff this is a matter of (artificially?) expensive resources, for CLI apps it's about more subjective things like feeling snappy. One idea is to delay dumping a CDS archive until after the first run is exiting, so it doesn't get in the way. The first run wouldn't benefit from the archive which is a pity (except on Linux where the package managers make it easy to run code post-install), but it at least wouldn't be slowed down by creating it either. The native launcher can schedule this. Alternatively there could be a brief pause on first run when the user is told explicitly that the app is optimizing itself, but how feasible that is depends very much on dump speed. Finally we could ship a small archive that only covers startup, and then in parallel make a dump of a full run in the background.
The dynamic CDS dumping happens when the JVM exits. We could ... (just throwing half-baked ideas) spawn a new daemon subprocess to do the dumping, while the main JVM process exits. So to the user there's no penalty.
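Sketching that subprocess idea (the -XX:ArchiveClassesAtExit flag is the real dynamic-dump flag, available since JDK 13; the --trial-run argument is a hypothetical option the app would have to implement):

```java
import java.util.ArrayList;
import java.util.List;

// Half-baked sketch, per the idea above: after the useful work is done, the
// launcher re-invokes the app in a detached subprocess whose only job is to
// exercise the startup path and dump a dynamic archive on exit.
public class BackgroundDumpSketch {
    public static List<String> dumpCommand(String javaBin, String archive, String jar) {
        List<String> cmd = new ArrayList<>();
        cmd.add(javaBin);
        cmd.add("-XX:ArchiveClassesAtExit=" + archive); // dynamic dump when the subprocess exits
        cmd.add("-jar");
        cmd.add(jar);
        cmd.add("--trial-run"); // hypothetical: run a representative workload, then exit
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = dumpCommand("java", "/tmp/app.jsa", "app.jar");
        // Fire-and-forget at shutdown; the user-facing process exits immediately:
        // new ProcessBuilder(cmd).start();
        System.out.println(String.join(" ", cmd));
    }
}
```

Because the dump happens in a process the user never waits on, the first-run snappiness concern goes away, at the cost of the archive only being available from the second run onward.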
Speaking of which, there's a need for some protocol to drive an app through a representative 'trial run'. Whether it's generating the class list or the archive itself, it could be as simple as an alternative static method that sits next to main. If it were to be standardized the rest of the infrastructure becomes more re-usable, for instance build systems can take care of generating classlists, or the end-user packaging can take care of dynamic dumping.
Maybe we could have some sort of daemon that collects profiling data in the background and updates the archives once the application's behavior is better understood.
CDS has two modes and it's not clear which is better. I'm unusually obsessive about this stuff to the extent of reading the CDS source code, but despite that I have absolutely no idea if I should be trying to use static or dynamic archives. There used to be a performance difference between them but maybe it's fixed now? There's a lack of end-to-end guidance on how to exploit this feature best.
I agree our documentation is kind of lacking. We'll try to improve it.

Static and dynamic archives will be roughly the same speed (~10 ms faster with a static dump for the javac example above). The dynamic archive will be smaller, because it doesn't need to duplicate the built-in classes that are already in the static archive. Here's a size comparison for javac.jsa:

static: 20,217,856 bytes
dynamic: 10,153,984 bytes
The ideal would obviously be to lose the dump/exec split and make dynamic dumping continuous, incremental, and free of any performance penalty. Then we could just supply a path to where the CDS file should go and things would magically warm up across executions. I have no idea how feasible that is.
Once AppCDS archives are in place and being created at the right times, a @Snapshotted annotation for fields (or similar) should be an easy win to eliminate the bulk of the rest of the PicoCLI time. Dynamically loaded heaps would also be useful to eliminate the overhead of loading configs and instantiating the (build) task graph without a Gradle-style daemon.
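To be clear, @Snapshotted is purely hypothetical; no such annotation exists in any JDK. A sketch of what the API could look like, where the marked field's initialized value would be captured into the CDS heap archive at dump time and restored on later runs:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical API sketch only: the annotation and its semantics are invented
// here to illustrate the proposal, not anything HotSpot implements today.
public class SnapshotSketch {
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    public @interface Snapshotted {}

    public static class Cli {
        // The expensive reflective model: under the proposal, built once at
        // dump time, then loaded from the heap archive on every normal run.
        @Snapshotted static final Object COMMAND_MODEL = buildModel();
        static Object buildModel() { return "model"; } // stand-in for the real work
    }

    public static void main(String[] args) throws Exception {
        boolean marked = Cli.class.getDeclaredField("COMMAND_MODEL")
                .isAnnotationPresent(Snapshotted.class);
        System.out.println("snapshotted: " + marked);
    }
}
```

The appeal is that, like AppCDS itself, this would be pay-as-you-go: annotate only the fields whose initialization dominates startup.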
2.
AppCDS archives can open a subtle security issue when distributing code to desktop platforms. Because they're full of vtables anyone who can write to them can (we assume) take over any JVM that loads the archive and gain whatever privileges have been granted to that app. The archive file is fully trusted.
Will you have a similar problem if the JAR file of the application is maliciously modified?

Actually the vtables inside the CDS archive file contain all zeros, and are filled in by the VM after the archive is mapped. What could be modified is the vtptr of archived Metadata objects. They usually point to somewhere near 0x800000000 (where the vtables are), but an attacker could modify them to point to arbitrary locations. I am not sure whether this type of attack is easier than modifying the JAR files or not.

Thanks - Ioi
On Windows and Linux this doesn't matter. On Linux sensitive files can be packaged or created in postinst scripts. On Windows either an app comes with a legacy installer/MSI file and thus doesn't have any recognized package identity that can be granted extra permissions, or it uses the current gen MSIX system. In the latter case Windows has a notion of app identity and so you can request permissions to access e.g. keychain entries, the user's calendar etc, but in that case Windows also gives you a private directory that's protected from other apps where sensitive files can be stashed. AppCDS archives can go there and we're done.
macOS is a problem child. There are two situations that matter.
In the first case archives are shipped as data files with the app. Security is not an issue here, but there's a subtle performance footgun. On most platforms signatures of files shipped with an app are checked at install time but on macOS they aren't. Thanks to its NeXT roots it doesn't really have an installation concept, and thus the kernel checks signatures of files on first use then caches the signature check in the kernel vnode. By default the entire file is hashed in order to link it back to the root signature, which for large files can impose a small but noticeable delay before the app can open them. This first run penalty is unfortunate given that AppCDS exists partly to improve startup time. You can argue it doesn't matter much due to the caching, but it's worth being aware of - very large AppCDS archives would get fully paged in and hashed before the app even gets to do anything. In turn that means people might enable AppCDS with a big classlist expecting it to speed things up, not noticing that for Mac users only it slowed things down instead. There are ways to fix this using supported Apple APIs. One is to supply a CodeDirectory structure stored in extended attributes: you should get incremental hashing and normal page fault behaviour (untested!). Another is to wrap the data in a Mach-O file.
In the second case the CDS archive is generated client-side. Mac apps don't have anywhere they can create tamperproof data, except for very small amounts in the keychain. Thus if a Mac app opens a malicious cache file that can take control of it, that's a security bug, because it'd allow one program to grab any special privileges the user granted to another. The fact that the grabbing program has passed Gatekeeper and notarization doesn't necessarily matter (Apple's guidance on this is unclear, but it seems plausible that this is their stance). In this case the keychain can be used as a root of trust, by storing a hash of the CDS archive in it and checking that after mmap/before use. Alternatively, again, Apple provides an API that lets you associate an on-disk (xattr) CodeDirectory structure with a file, which will then be checked incrementally at page fault time. Extreme care must be taken to avoid race conditions, but in theory a CodeDirectory structure can be computed at dump time, written to disk as an xattr, and then stored again in the keychain (e.g. by pretending it's a "key" or "password"). After the security API is instructed to associate a CD with the file, it can be checked against the tamperproofed version stored in the keychain, and if they match, the archive can then be mmapped and used as normal.
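The hash-in-keychain idea reduces to computing a digest at dump time and verifying it before use. A minimal sketch in plain Java (actual keychain access needs native code and is omitted; a plain string stands in for the trusted entry, and HexFormat requires JDK 17+):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

// Sketch of archive verification against a tamperproof stored digest.
public class ArchiveVerifySketch {
    public static String sha256Hex(byte[] data) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        return HexFormat.of().formatHex(md.digest(data));
    }

    // Compare the archive's digest to the trusted value. MessageDigest.isEqual
    // does a time-constant comparison, avoiding a timing side channel.
    public static boolean verify(byte[] archiveBytes, String trustedHex) throws Exception {
        return MessageDigest.isEqual(
                sha256Hex(archiveBytes).getBytes(StandardCharsets.UTF_8),
                trustedHex.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        byte[] fake = "archive-bytes".getBytes(StandardCharsets.UTF_8);
        String trusted = sha256Hex(fake); // stand-in for the keychain entry
        System.out.println(verify(fake, trusted));
    }
}
```

Note the race condition mentioned above still applies: the check must happen on the same bytes the JVM ends up mapping, so in practice the file would need to be opened once and held, or verified after mmap as the text suggests.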
Native images don't have these issues because the state snapshot is stored inside the Mach-O file and thus gets covered by the normal mechanisms. However, once native images add support for persisted heaps, the same issue may arise there.
Whether it's worth doing the extra work to solve this is unclear. Macs are guaranteed to come with very fast NVMe disks and CPUs. Still, it's worth being aware of the issue.
3.
Why not just use a native image then? Maybe we'll do that because the performance wins are really compelling, but again, v1 will ship without this for the following reasons:
a. Static minification can break things. Our integration tests currently invoke the entry point of the app directly, but that could be fixed to run the tool in an external process. For unit tests the situation is far murkier. It's a bit unclear how to run JUnit tests against the statically compiled version and it may not even make sense (because the tests would pin a bunch of code that might get stripped in the real app so what are you really testing?).
b. It'd break delta updates. Not the end of the world, but a factor.
c. I have no idea if we're using any libraries that spin bytecode dynamically. Even if we're not today, what if tomorrow we want to use such a library? Do we have to avoid using it and increase the cost of feature development, or roll back the native image and give our users a nasty performance downgrade? Neither option is attractive. Ideally SubstrateVM would contain a bytecode interpreter and use it when necessary. Lots of issues there but e.g. it'd probably be OK if it's not a general classloader and the code dependencies have to be known AOT.
d. Similar to (c), fully AOT compilation can explode code and thus download size even though many codepaths are cold and only execute once. It'd be nice if a native image could include a mix of bytecode and AOT compiled hotspots.
e. Once you're past the initial interactive stage the program is throughput sensitive. How much of a perf downgrade over HotSpot would we get, if any? With GraalVM EE we could use PGO and not lose any, but the ISV pricing is opaque. At any rate to answer this we have to fix the compatibility issues first. The prospect of improving startup time and then discovering we slowed down the actual builds isn't really appealing (though I suspect in our case AOT wouldn't really hurt much).
f. What if we want to support in-process plugins? Maybe we can use Espresso, but this is a road less travelled (lack of tutorials, well documented examples etc).
An interesting possibility is using a mix of approaches. For the bash competitor I mentioned earlier dynamic code loading is needed because the script bytecode is loaded into the host JVM, but the Kotlin compiler itself could theoretically be statically compiled to a JNI or Panama-accessible library. We tried this before and hit compatibility errors, but didn't make any effort to resolve them.
4.
What about CRaC? It's Linux only so isn't interesting to us, given that most devs are on Windows/macOS. The benefits for Linux servers are clear though. Obvious question - can you make a snapshot on one machine/Linux distro, and resume it on a totally different one, or does it require a homogeneous infrastructure?
5.
A big reason AppCDS is nice is that we get to keep the open world. This isn't only about compatibility; open worlds are just better. The most popular way to get software onto desktop machines is the web browser (Chrome), and the web is totally open world. Apps are downloaded incrementally as the user navigates around, and companies exploit this fact aggressively. Large web sites can be far larger than would be considered practical to distribute to end-user machines, and can easily update 50 times a day. Web developers have to think about latency on specific interactions, but they don't have to think about the size of the entire app, and that allows them to scale up feature sets as fast as funding allows. In contrast, the closed-world mobile versions of their sites are a parade of horror stories in which firms have to e.g. hotpatch Dalvik to work around method count limits (Facebook), or in which code size issues nearly wrecked the entire company (Uber):
https://twitter.com/StanTwinB/status/1336914412708405248
Right now code size isn't a particularly serious problem for us, but the ease of including open source libraries means our footprint grows all the time. Especially for our shell scripting tool, there are tons of cool features that could be added, but if we did all of them we'd probably end up with 500 MB of bytecode. With an open world, features can be downloaded on the fly as they get used, and you can build a plugin ecosystem.
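A sketch of that on-the-fly loading with plain URLClassLoader (the plugin directory layout is an assumption; how jars get downloaded there is out of scope):

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Open-world sketch: features shipped as plain jars, fetched on first use
// and loaded into the running JVM via a child class loader.
public class PluginLoaderSketch {
    public static URLClassLoader loaderFor(Path pluginDir) throws Exception {
        List<URL> urls = new ArrayList<>();
        if (Files.isDirectory(pluginDir)) {
            try (DirectoryStream<Path> jars = Files.newDirectoryStream(pluginDir, "*.jar")) {
                for (Path jar : jars) urls.add(jar.toUri().toURL());
            }
        }
        // Parent delegation keeps the core app's classes authoritative, so a
        // plugin can't shadow them.
        return new URLClassLoader(urls.toArray(new URL[0]),
                PluginLoaderSketch.class.getClassLoader());
    }

    public static void main(String[] args) throws Exception {
        try (URLClassLoader loader = loaderFor(Path.of("plugins"))) {
            System.out.println("plugin classpath entries: " + loader.getURLs().length);
        }
    }
}
```

This is exactly the kind of dynamic class loading that a closed-world native image forbids, which is the point of the section.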
The new more incremental direction of Leyden is thus welcomed and appreciated, because it feels like a lot of ground can be covered by "small" changes like upgrading AppCDS and caching compiled hotspots. Even if the results aren't as impressive as with native-image, the benefits of keeping an open world can probably make up for it, at least for our use cases.
On 6/2/2022 5:15 PM, Ioi Lam wrote:
Hi Mike,
I am thrilled to hear that you're happy with CDS. Please see my responses below.
If you have other questions or requests for CDS, please let me know :-)
On 6/1/2022 6:03 AM, Mike Hearn wrote:
Hi,
It feels like most of the interest in static Java comes from the microservices / functions-as-a-service community. My new company spent the last year creating a developer tool that runs on the JVM (which will be useful for Java developers actually, but what it does is irrelevant here). Internally it's a kind of build system and is thus a large(ish) CLI app in which startup time and throughput are what matter most. We also have a separate internal tool that uses Kotlin scripting to implement a bash-like scripting language, and which is sensitive in the same ways.
Today the JVM is often overlooked for writing CLI apps due to startup time, 'lightness' and packaging issues. I figured I'd write down some notes based on our experiences. They cover workflow, performance, implementation costs and security issues. Hopefully it's helpful.
1.
I really like AppCDS because:
a. It can't break the app so switching it on/off a no-brainer. Unlike native-image/static java, no additional testing overhead is created by it.
b. It's effective even without heap snapshotting. We see a ~40% speedup for executing --help
c. It's pay-as-you-go. We can use a small archive that's fast to create to accelerate just the most latency sensitive startup paths, or we can use it for the whole app, but ultimately costs are controllable.
d. Archives are deterministic. Modern client-side packaging systems support delta updates, and CDS plays nicely with them. GraalVM native images are non-deterministic so every update is going to replace the entire app, which isn't much fun from an update speed or bandwidth consumption perspective.
Startup time is dominated by PicoCLI which is a common problem for Java CLI apps. Supposedly the slowest part is building the model of the CLI interface using reflection, so it's a perfect candidate for AppCDS heap snapshotting. I say supposedly, because I haven't seen concrete evidence that this is actually where the time goes, but it seems like a plausible belief. There's a long standing bug filed to replace reflection with code generation but it's a big job and so nobody did it.
Unfortunately the app will ship without using AppCDS. Some workflow issues remain. These can be solved in the app itself, but it'd be nice if the JVM does it.
The obvious way to use CDS is to ship an archive with the app. We might do this as a first iteration, but longer term don't want to for two reasons:
a. The archive can get huge. b. Signature verification penalties on macOS (see below).
For just making --help and similar short commands faster size isn't so bad (~6-10mb for us), but if it's used for a whole execution the archive size for a standard run is nearly the same as total bytecode size of the app. As more stuff gets cached this will get worse. Download size might not matter much for this particular app, but as a general principle it does. So a nice improvement would be to generate it client side.
CDS files are caches and different platforms have different conventions for where those go. The JVM doesn't know about those conventions but our app does, so we'd need our custom native code launcher (which exists anyway for other reasons) to set the right paths for CDS.
Then you have to pick the right flags depending on whether the CDS file exists or not. I follow CDS related changes and believe this is fixed in latest Java versions but maybe (?) not released yet.
Which version of Java are you using?
Since JDK 11, the default value of -Xshare is set to -Xshare:auto, so you can always do this:
$ java -XX:SharedArchiveFile=nosuch.jsa -version java version "11" 2018-09-25 Java(TM) SE Runtime Environment 18.9 (build 11+28) Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11+28, mixed mode)
If the file exists, it will be used automatically. Otherwise the VM will silently ignore the archive.
Since JDK 17, a default CDS archive is shipped with the JDK. So you will at least get some performance benefits of CDS for the built-in classes.
With the upcoming JDK 19, we have implemented a new feature (See JDK-8261455) to automatically create the CDS archive. Here's an example (I am using Javac because it's convenient, but you need to quote the JVM parameters with -J):
$ javac -J-XX:+AutoCreateSharedArchive -J-XX:SharedArchiveFile=javac.jsa HelloWorld.java
javac.jsa will be automatically created if it doesn't exist, or if it's not compatible with the JVM (e.g., if you have upgraded to a newer JDK).
In this case, the total elapsed time is improved from about 522ms (with default CDS archive) to 330ms (auto-generated archive).
Even once that's fixed it's not quite obvious that we'd use it. The JVM runs much slower when dumping a dynamic CDS archive and the first run is when first impressions are made. Whilst for cloud stuff this is a matter of (artificially?) expensive resources, for CLI apps it's about more subjective things like feeling snappy. One idea is to delay dumping a CDS archive until after the first run is exiting, so it doesn't get in the way. The first run wouldn't benefit from the archive which is a pity (except on Linux where the package managers make it easy to run code post-install), but it at least wouldn't be slowed down by creating it either. The native launcher can schedule this. Alternatively there could be a brief pause on first run when the user is told explicitly that the app is optimizing itself, but how feasible that is depends very much on dump speed. Finally we could ship a small archive that only covers startup, and then in parallel make a dump of a full run in the background.
The dynamic CDS dumping happens when the JVM exits. We could ... (just throwing half-baked ideas) spawn a new daemon subprocess to do the dumping, while the main JVM process exits. So to the user there's no penalty.
Speaking of which, there's a need for some protocol to drive an app through a representative 'trial run'. Whether it's generating the class list or the archive itself, it could be as simple as an alternative static method that sits next to main.
One thing you *could* do with JDK 19 on Linux is:

java -XX:+AutoCreateSharedArchive -XX:SharedArchiveFile=app.jsa -jar MyApp

In your main method, check the /proc/self/maps file to see if app.jsa is mapped. If not, the VM is dumping the dynamic CDS archive. In this case, your app can run in a special "trial run" mode that exercises different functionalities. To make this easier to use, we could add a special system property, something like "jdk.cds.is.dumping", that can be queried by the application.

Thanks - Ioi
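A sketch of that check (the parsing helper is split out so it runs anywhere; only the /proc/self/maps read is Linux-specific, and app.jsa is the archive name from the command above):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Detect whether this run is the archive-producing "trial run" by looking
// for the .jsa file among the process's own memory mappings.
public class TrialRunDetect {
    // Each /proc/self/maps line ends with the mapped file's path, if any.
    public static boolean isArchiveMapped(String mapsContent, String archiveName) {
        for (String line : mapsContent.split("\n")) {
            if (line.endsWith(archiveName)) return true;
        }
        return false;
    }

    public static boolean isDumpingRun(String archiveName) throws IOException {
        Path maps = Paths.get("/proc/self/maps");
        if (!Files.exists(maps)) return false; // not Linux; can't tell this way
        return !isArchiveMapped(Files.readString(maps), archiveName);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(isDumpingRun("app.jsa") ? "trial-run (dumping)" : "normal run");
    }
}
```

A "jdk.cds.is.dumping" system property, as suggested, would make the /proc parsing unnecessary and would work on all platforms.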
If it were to be standardized the rest of the infrastructure becomes more re-usable, for instance build systems can take care of generating classlists, or the end-user packaging can take care of dynamic dumping.
Maybe we could have some sort of daemon that collects profiling data in the background, and update the archives when the application behavior is more understood.
CDS has two modes and it's not clear which is better. I'm unusually obsessive about this stuff to the extent of reading the CDS source code, but despite that I have absolutely no idea if I should be trying to use static or dynamic archives. There used to be a performance difference between them but maybe it's fixed now? There's a lack of end-to-end guidance on how to exploit this feature best.
I agree our documentation is kind of lacking. We'll try to improve it.
Static and dynamic archives will be roughly the same speed (~10 ms faster with static dump for the javac example above).
The dynamic archive will be smaller, because it doesn't need to duplicate the built-in classes that are already in the static archive. Here's a size comparison for javac.jsa
static: 20,217,856 bytes dynamic: 10,153,984 bytes
The ideal would obviously be losing the dump/exec split and make dynamic dumping continuous, incremental and imposing no performance penalty. Then we could just supply a path to where the CDS file should go and things magically warm up across executions. I have no idea how feasible that is.
Once AppCDS archives are in place and being created at the right times, a @Snapshotted annotation for fields (or similar) should be an easy win to eliminate the bulk of the rest of the PicoCLI time. Dynamically loaded heaps would also be useful to eliminate the overhead of loading configs and instantiating the (build) task graph without a Gradle-style daemon.
2.
AppCDS archives can open a subtle security issue when distributing code to desktop platforms. Because they're full of vtables anyone who can write to them can (we assume) take over any JVM that loads the archive and gain whatever privileges have been granted to that app. The archive file is fully trusted.
Will you have a similar problem if the JAR file of the application is maliciously modified?
Actually the vtables inside the CDS archive file contain all zeros, and are filled in by the VM after the archive is mapped.
What could be modified is the vtptr of archived MetaData objects. They usually point to somewhere near 0x800000000 (where the vtables are) but the attacker could modify them to point to arbitrary locations. I am not sure if this type of attack is easier than modifying the JAR files, or not.
Thanks - Ioi
On Windows and Linux this doesn't matter. On Linux sensitive files can be packaged or created in postinst scripts. On Windows either an app comes with a legacy installer/MSI file and thus doesn't have any recognized package identity that can be granted extra permissions, or it uses the current gen MSIX system. In the latter case Windows has a notion of app identity and so you can request permissions to access e.g. keychain entries, the user's calendar etc, but in that case Windows also gives you a private directory that's protected from other apps where sensitive files can be stashed. AppCDS archives can go there and we're done.
MacOS is a problem child. There are two situations that matter.
In the first case archives are shipped as data files with the app. Security is not an issue here, but there's a subtle performance footgun. On most platforms signatures of files shipped with an app are checked at install time but on macOS they aren't. Thanks to its NeXT roots it doesn't really have an installation concept, and thus the kernel checks signatures of files on first use then caches the signature check in the kernel vnode. By default the entire file is hashed in order to link it back to the root signature, which for large files can impose a small but noticeable delay before the app can open them. This first run penalty is unfortunate given that AppCDS exists partly to improve startup time. You can argue it doesn't matter much due to the caching, but it's worth being aware of - very large AppCDS archives would get fully paged in and hashed before the app even gets to do anything. In turn that means people might enable AppCDS with a big classlist expecting it to speed things up, not noticing that for Mac users only it slowed things down instead. There are ways to fix this using supported Apple APIs. One is to supply a CodeDirectory structure stored in extended attributes: you should get incremental hashing and normal page fault behaviour (untested!). Another is to wrap the data in a Mach-O file.
In the second case the CDS archive is being generated client side. Mac apps don't have anywhere they can create tamperproof data, except for very small amounts in the keychain. Thus if a Mac app opens a malicious cache file that can take control of it that's a security bug, because it'd allow one program to grab any special privileges the user granted to another. The fact that the grabbing program has passed GateKeeper and notarization doesn't necessarily matter (Apple's guidance on this is unclear, but it seems plausible that this is their stance). In this case the key chain can be used as a root of trust by storing a hash of the CDS archive in it and checking that after mmap/before use. Alternatively, again, Apple provides an API that lets you associate an on-disk (xattr) CodeDirectory structure with a file which will then be checked incrementally at page fault time. Extreme care must be taken to avoid race conditions, but in theory, a CodeDirectory structure can be computed at dump time, written to disk as an xattr, and then stored again in the key chain (e.g. by pretending it's a "key" or "password"). After the security API is instructed to associate a CD with the file, it can be checked against the tamperproofed version stored in the key chain and if they match, the archive can then be mmapped and used as normal.
Native images don't have these issues because the state snapshot is stored inside the Mach-O file and thus gets covered by the normal mechanisms. However once it adds support for persisted heaps, the same issue may arise.
Whether it's worth doing the extra work to solve this is unclear. Macs are guaranteed to come with very fast NVMe disks and CPUs. Still, it's worth being aware of the issue.
3.
Why not just use a native image then? Maybe we'll do that because the performance wins are really compelling, but again, v1 will ship without this for the following reasons:
a. Static minification can break things. Our integration tests currently invoke the entry point of the app directly, but that could be fixed to run the tool in an external process. For unit tests the situation is far murkier. It's a bit unclear how to run JUnit tests against the statically compiled version and it may not even make sense (because the tests would pin a bunch of code that might get stripped in the real app so what are you really testing?).
b. It'd break delta updates. Not the end of the world, but a factor.
c. I have no idea if we're using any libraries that spin bytecode dynamically. Even if we're not today, what if tomorrow we want to use such a library? Do we have to avoid using it and increase the cost of feature development, or roll back the native image and give our users a nasty performance downgrade? Neither option is attractive. Ideally SubstrateVM would contain a bytecode interpreter and use it when necessary. Lots of issues there but e.g. it'd probably be OK if it's not a general classloader and the code dependencies have to be known AOT.
d. Similar to (c), fully AOT compilation can explode code and thus download size even though many codepaths are cold and only execute once. It'd be nice if a native image could include a mix of bytecode and AOT compiled hotspots.
e. Once you're past the initial interactive stage the program is throughput sensitive. How much of a perf downgrade over HotSpot would we get, if any? With GraalVM EE we could use PGO and not lose any, but the ISV pricing is opaque. At any rate to answer this we have to fix the compatibility issues first. The prospect of improving startup time and then discovering we slowed down the actual builds isn't really appealing (though I suspect in our case AOT wouldn't really hurt much).
f. What if we want to support in-process plugins? Maybe we can use Espresso, but this is a road less travelled (lack of tutorials, well documented examples etc).
An interesting possibility is using a mix of approaches. For the bash competitor I mentioned earlier dynamic code loading is needed because the script bytecode is loaded into the host JVM, but the Kotlin compiler itself could theoretically be statically compiled to a JNI or Panama-accessible library. We tried this before and hit compatibility errors, but didn't make any effort to resolve them.
4. What about CRaC? It's Linux-only so isn't interesting to us, given that most devs are on Windows/macOS. The benefits for Linux servers are clear, though. Obvious question: can you make a snapshot on one machine/Linux distro and resume it on a totally different one, or does it require homogeneous infrastructure?
5. A big reason AppCDS is nice is that we get to keep the open world. This isn't only about compatibility; open worlds are just better. The most popular way to get software onto desktop machines is Chrome, and the web is totally open world. Apps are downloaded incrementally as the user navigates around, and companies exploit this fact aggressively. Large web sites can be far larger than would be considered practical to distribute to end user machines, and can easily update 50 times a day. Web developers have to think about latency on specific interactions, but they don't have to think about the size of the entire app, and that allows them to scale up feature sets as fast as funding allows. In contrast, the closed-world mobile versions of their sites are a parade of horror stories in which firms have to e.g. hotpatch Dalvik to work around method count limits (Facebook), or in which code size issues nearly wrecked the entire company (Uber):
https://twitter.com/StanTwinB/status/1336914412708405248
Right now code size isn't a particularly serious problem for us, but the ease of including open source libraries means footprint grows all the time. Especially for our shell scripting tool, there are tons of cool features that could be added but if we did all of them we'd probably end up with 500mb of bytecode. With an open world features can be downloaded on the fly as they get used and you can build a plugin ecosystem.
The new more incremental direction of Leyden is thus welcomed and appreciated, because it feels like a lot of ground can be covered by "small" changes like upgrading AppCDS and caching compiled hotspots. Even if the results aren't as impressive as with native-image, the benefits of keeping an open world can probably make up for it, at least for our use cases.
Hi Ioi, We're using a JDK 17 build with a few backports. Unfortunately the default CDS archive goes missing during jlinking. It's an easy fix. Actually, the product in question is a packaging tool; it's not only for the JVM but it supports JVM apps quite well, and re-creating the CDS archive post-jlink is on the list of features to add. It's packaged with itself, so that'll fix it for our apps too.
$ javac -J-XX:+AutoCreateSharedArchive -J-XX:SharedArchiveFile=javac.jsa HelloWorld.java
javac.jsa will be automatically created if it doesn't exist, or if it's not compatible with the JVM (e.g., if you have upgraded to a newer JDK).
Yes, that's a nice improvement in usability. By the way, don't forget -Xlog:cds=off because otherwise CDS likes to write lots of warnings to the terminal (not a great look for a CLI app).
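Put together, a typical quiet CDS-enabled launch might look like this (the archive and jar names are placeholders, not our actual product):

```shell
# Hypothetical launcher invocation for a CLI app (mytool.jsa and
# mytool.jar are placeholders). Use the shipped archive and silence
# CDS logging so stray warnings don't end up in the terminal output
# seen by users.
java -XX:SharedArchiveFile=mytool.jsa \
     -Xlog:cds=off \
     -jar mytool.jar --help
```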
The dynamic CDS dumping happens when the JVM exits.
Yes, but it seems to slow down execution before that time as well. Here are some timings for our app to parse CLI options, read the build config, compute the task graph, print the available tasks, and reach the end of main():

- With CDS off: ~0.8 seconds
- With CDS dumping active: ~1.25 seconds
- With CDS active: ~0.6 seconds

So the app appears to run ~50% slower when dynamic dumping is active, and that's not including the dump time itself. That's why I'm suggesting doing it in the background as a totally separate post-install step (with background forking required on platforms that don't support or strongly discourage install scripts). I get the impression this may not be expected? Is the JVM genuinely doing extra work at runtime when dynamic dumping is active?
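For reference, the three configurations being compared roughly correspond to invocations like these (mytool.jar and the "tasks" command are placeholders; I'm not claiming this exact harness produced the numbers above):

```shell
# CDS fully disabled
time java -Xshare:off -jar mytool.jar tasks

# Dynamic dumping active: the archive is written when the JVM exits
time java -XX:ArchiveClassesAtExit=dump.jsa -jar mytool.jar tasks

# CDS active: subsequent runs load the previously dumped archive
time java -XX:SharedArchiveFile=dump.jsa -jar mytool.jar tasks
```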
Maybe we could have some sort of daemon that collects profiling data in the background, and update the archives when the application behavior is more understood.
Sure, the ideal would be something like "always dumping" mode in which there's no slowdown. So you just give the JVM a directory (or >1 directory) and it caches internal structures, JITd code and persistent heap snapshots there. Fire and forget. Then if you want to trade off bandwidth vs first run time you can pre-populate the first directory in the list with the results of a short run, like just getting to first pixels for a desktop app or flag handling for a CLI app, and any additional data generated goes into the second directory. Bonus points if you find a way to share those directories over an NFS mount - then you have a JIT server 'for free' in cloud deployments.
The dynamic archive will be smaller, because it doesn't need to duplicate the built-in classes that are already in the static archive.
Right. That's true. I'd forgotten that you can combine them like that. So we could ship a small static archive in the download that just accelerates time-to-first-interaction, and generate a larger dump client-side in the background that covers the whole execution.
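Concretely, the layering could look something like this (archive names are placeholders):

```shell
# Ship a small static archive (base.jsa) with the app, then generate a
# dynamic archive on top of it client-side. Classes already present in
# base.jsa are not duplicated in the dynamic archive, so it stays small.
java -XX:SharedArchiveFile=base.jsa \
     -XX:ArchiveClassesAtExit=full.jsa \
     -jar mytool.jar build

# Later runs load both: the static base first, then the dynamic top.
# (Use ';' instead of ':' as the path separator on Windows.)
java -XX:SharedArchiveFile=base.jsa:full.jsa -jar mytool.jar build
```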
Will you have a similar problem if the JAR file of the application is maliciously modified?
If they're downloaded and stored in the home directory, yes, but JARs support code signing with per-file hashing, so there's a way to fix that built into the platform. If they're just shipped as data files in the app then it doesn't matter, because they're signed and tamperproofed using OS-specific mechanisms. All this is a bit theoretical. IntelliJ downloads unsigned JARs as plugins and nobody seems to care. It's possible that's because it doesn't request any special privileges so there's nothing to attack, but on macOS things as basic as access to ~/Downloads are a permission these days. Also, JetBrains are moving to code signing their JARs anyway. So ... yeah. Like I said, hard to know how much to really care about this. It might be one of those things that doesn't matter until the day it does.
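For the downloaded-JAR case, the built-in verification is just this (plugin.jar is a placeholder name):

```shell
# Verify a signed plugin JAR before loading it. With -strict, warnings
# such as unsigned entries cause a non-zero exit code instead of merely
# being printed, so the result can gate loading.
jarsigner -verify -strict plugin.jar
```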
What could be modified is the vtptr of archived MetaData objects. They usually point to somewhere near 0x800000000 (where the vtables are) but the attacker could modify them to point to arbitrary locations. I am not sure if this type of attack is easier than modifying the JAR files, or not.
Well, the issue here is a combination of where the files are generated and performance. Again it's all a bit theoretical, because the performance discussion is rooted in the "disk access is slow" world, which isn't really true anymore. I've done some casual tests on my laptop and did appear to see a real slowdown from this "hash whole file on open" effect, but it was a while ago and it wasn't rigorous at all. It's also a total PITA to reproduce because there's no explicit way to flush the cache, so you have to constantly re-copy signed binaries over and over to force kernel cache misses. If I explained how I measured this, Alexey Shipilev would yell at me :) so I'll just leave it here as food for thought instead.

And yeah, it's also not clear how much the uncached times matter these days. Years ago it mattered a lot because people rebooted their machines often, but Macs hibernate all the time and reboot only rarely, so the caches will remain warm.

I don't think treating AppCDS archives as hostile in the JVM itself would be worth it. This is a Mac-specific issue and that would be a major constraint, e.g. it'd mean you can't cache JITd native code in the archives. Doesn't make sense. Better to tamperproof unbundled archives in other ways, like computing ad-hoc signatures and stashing the CodeDirectory in an xattr (if it ever matters).
participants (4)
- Andrew Dinn
- Anton Kozlov
- Ioi Lam
- Mike Hearn