AppCDS / AOT thoughts based on CLI app experience

Ioi Lam ioi.lam at oracle.com
Fri Jun 3 00:15:36 UTC 2022


Hi Mike,

I am thrilled to hear that you're happy with CDS. Please see my 
responses below.

If you have other questions or requests for CDS, please let me know :-)

On 6/1/2022 6:03 AM, Mike Hearn wrote:
> Hi,
>
> It feels like most of the interest in static Java comes from the
> microservices / functions-as-a-service community. My new company spent
> the last year creating a developer tool that runs on the JVM (which
> will be useful for Java developers actually, but what it does is
> irrelevant here). Internally it's a kind of build system and is thus a
> large(ish) CLI app in which startup time and throughput are what
> matter most. We also have a separate internal tool that uses Kotlin
> scripting to implement a bash-like scripting language, and which is
> sensitive in the same ways.
>
> Today the JVM is often overlooked for writing CLI apps due to startup
> time, 'lightness' and packaging issues. I figured I'd write down some
> notes based on our experiences. They cover workflow, performance,
> implementation costs and security issues. Hopefully it's helpful.
>
> 1.
>
> I really like AppCDS because:
>
> a. It can't break the app, so switching it on/off is a no-brainer.
> Unlike native-image/static Java, no additional testing overhead is
> created by it.
>
> b. It's effective even without heap snapshotting. We see a ~40%
> speedup when executing --help.
>
> c. It's pay-as-you-go. We can use a small archive that's fast to
> create to accelerate just the most latency sensitive startup paths, or
> we can use it for the whole app, but ultimately costs are
> controllable.
>
> d. Archives are deterministic. Modern client-side packaging systems
> support delta updates, and CDS plays nicely with them. GraalVM native
> images are non-deterministic so every update is going to replace the
> entire app, which isn't much fun from an update speed or bandwidth
> consumption perspective.
>
> Startup time is dominated by PicoCLI, which is a common problem for
> Java CLI apps. Supposedly the slowest part is building the model of
> the CLI interface using reflection, so it's a perfect candidate for
> AppCDS heap snapshotting. I say supposedly because I haven't seen
> concrete evidence that this is actually where the time goes, but it
> seems like a plausible belief. There's a long-standing bug filed to
> replace reflection with code generation, but it's a big job and so
> nobody has done it.
>
> Unfortunately the app will ship without using AppCDS. Some workflow
> issues remain. These can be solved in the app itself, but it'd be nice
> if the JVM did it.
>
> The obvious way to use CDS is to ship an archive with the app. We
> might do this as a first iteration, but longer term don't want to for
> two reasons:
>
> a. The archive can get huge.
> b. Signature verification penalties on macOS (see below).
>
> For just making --help and similar short commands faster, the size isn't
> so bad (~6-10 MB for us), but if it's used for a whole execution the
> archive size for a standard run is nearly the same as the total bytecode
> size of the app. As more stuff gets cached this will get worse.
> Download size might not matter much for this particular app, but as a
> general principle it does. So a nice improvement would be to generate
> it client side.
>
> CDS files are caches and different platforms have different
> conventions for where those go. The JVM doesn't know about those
> conventions but our app does, so we'd need our custom native code
> launcher (which exists anyway for other reasons) to set the right
> paths for CDS.
>
> Then you have to pick the right flags depending on whether the CDS
> file exists or not. I follow CDS-related changes and believe this is
> fixed in the latest Java versions, but maybe (?) not released yet.

Which version of Java are you using?

Since JDK 11, the default value of -Xshare has been -Xshare:auto, so 
you can always do this:

$ java -XX:SharedArchiveFile=nosuch.jsa -version
java version "11" 2018-09-25
Java(TM) SE Runtime Environment 18.9 (build 11+28)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11+28, mixed mode)

If the file exists, it will be used automatically. Otherwise the VM will 
silently ignore the archive.
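
On the cache-directory question: the conventions you describe can live 
entirely in your native launcher. Here's a sketch of what I mean (the 
app name "mytool" and the directory layout are assumptions of this 
example, not anything the JVM requires):

```shell
# Compute a per-OS cache directory for the CDS archive (hypothetical
# app name "mytool"; paths follow the usual per-platform conventions).
cds_cache_dir() {
  case "$1" in
    Darwin) echo "$HOME/Library/Caches/mytool" ;;
    Linux)  echo "${XDG_CACHE_HOME:-$HOME/.cache}/mytool" ;;
    *)      echo "${LOCALAPPDATA:-$HOME}/mytool" ;;  # e.g. Windows
  esac
}

# With -Xshare:auto (the default) a missing archive is ignored, so the
# launcher can pass this unconditionally:
# java -XX:SharedArchiveFile="$(cds_cache_dir "$(uname -s)")/app.jsa" ...
```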

Since JDK 17, a default CDS archive is shipped with the JDK. So you will 
at least get some performance benefits of CDS for the built-in classes.

With the upcoming JDK 19, we have implemented a new feature (see 
JDK-8261455) to automatically create the CDS archive. Here's an example 
(I am using javac because it's convenient, but note that you need to 
prefix the JVM parameters with -J):

$ javac -J-XX:+AutoCreateSharedArchive -J-XX:SharedArchiveFile=javac.jsa 
HelloWorld.java

javac.jsa will be automatically created if it doesn't exist, or if it's 
not compatible with the JVM (e.g., if you have upgraded to a newer JDK).

In this case, the total elapsed time improves from about 522 ms (with 
the default CDS archive) to 330 ms (with the auto-generated archive).
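
On JDKs that don't have JDK-8261455 yet, your launcher can get most of 
the way there with a simple existence check. A sketch (this is 
hypothetical launcher logic, not a JDK API; -XX:ArchiveClassesAtExit is 
the dynamic-dump flag available since JDK 13):

```shell
# Choose CDS flags based on whether the archive exists yet: load it if
# present, otherwise ask the JVM to create it when this run exits.
jvm_cds_flags() {
  archive="$1"
  if [ -f "$archive" ]; then
    echo "-XX:SharedArchiveFile=$archive"
  else
    echo "-XX:ArchiveClassesAtExit=$archive"
  fi
}

# In the launcher (paths/names hypothetical):
# exec java $(jvm_cds_flags "$CACHE_DIR/app.jsa") -jar app.jar "$@"
```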


> Even once that's fixed it's not quite obvious that we'd use it. The
> JVM runs much slower when dumping a dynamic CDS archive and the first
> run is when first impressions are made. Whilst for cloud stuff this is
> a matter of (artificially?) expensive resources, for CLI apps it's
> about more subjective things like feeling snappy. One idea is to delay
> dumping a CDS archive until the first run is exiting, so it
> doesn't get in the way. The first run wouldn't benefit from the
> archive which is a pity (except on Linux where the package managers
> make it easy to run code post-install), but it at least wouldn't be
> slowed down by creating it either. The native launcher can schedule
> this. Alternatively there could be a brief pause on first run when the
> user is told explicitly that the app is optimizing itself, but how
> feasible that is depends very much on dump speed. Finally we could
> ship a small archive that only covers startup, and then in parallel
> make a dump of a full run in the background.

The dynamic CDS dumping happens when the JVM exits. We could (just 
throwing out a half-baked idea) spawn a new daemon subprocess to do the 
dumping while the main JVM process exits, so there's no penalty for the 
user.
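
In launcher terms, that idea could look roughly like this (everything 
below is hypothetical launcher logic; only -XX:ArchiveClassesAtExit is 
a real JVM flag):

```shell
# Run the app with no dump overhead, then schedule a detached
# background run whose only job is to produce the archive.
should_schedule_dump() {
  [ ! -f "$1" ]  # only worth doing if no archive exists yet
}

# In the launcher, after the real run returns:
# java -jar mytool.jar "$@"; status=$?
# if should_schedule_dump "$CACHE_DIR/mytool.jsa"; then
#   nohup java -XX:ArchiveClassesAtExit="$CACHE_DIR/mytool.jsa" \
#         -jar mytool.jar --help >/dev/null 2>&1 &
# fi
# exit $status
```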


> Speaking of which, there's a need for some protocol to drive an app
> through a representative 'trial run'. Whether it's generating the
> class list or the archive itself, it could be as simple as an
> alternative static method that sits next to main. If it were to be
> standardized the rest of the infrastructure becomes more re-usable,
> for instance build systems can take care of generating classlists, or
> the end-user packaging can take care of dynamic dumping.

Maybe we could have some sort of daemon that collects profiling data in 
the background and updates the archives once the application's behavior 
is better understood.

> CDS has two modes and it's not clear which is better. I'm unusually
> obsessive about this stuff to the extent of reading the CDS source
> code, but despite that I have absolutely no idea if I should be trying
> to use static or dynamic archives. There used to be a performance
> difference between them but maybe it's fixed now? There's a lack of
> end-to-end guidance on how to exploit this feature best.

I agree our documentation is kind of lacking. We'll try to improve it.

Static and dynamic archives are roughly the same speed (the static dump 
is ~10 ms faster for the javac example above).

The dynamic archive will be smaller, because it doesn't need to 
duplicate the built-in classes that are already in the static archive. 
Here's a size comparison for javac.jsa:

static:  20,217,856 bytes
dynamic: 10,153,984 bytes


> The ideal would obviously be losing the dump/exec split and making
> dynamic dumping continuous, incremental, and free of any performance
> penalty. Then we could just supply a path to where the CDS file should
> go and things magically warm up across executions. I have no idea how
> feasible that is.
>
> Once AppCDS archives are in place and being created at the right
> times, a @Snapshotted annotation for fields (or similar) should be an
> easy win to eliminate the bulk of the rest of the PicoCLI time.
> Dynamically loaded heaps would also be useful to eliminate the
> overhead of loading configs and instantiating the (build) task graph
> without a Gradle-style daemon.
>
> 2.
>
> AppCDS archives can open a subtle security issue when distributing
> code to desktop platforms. Because they're full of vtables anyone who
> can write to them can (we assume) take over any JVM that loads the
> archive and gain whatever privileges have been granted to that app.
> The archive file is fully trusted.

Will you have a similar problem if the JAR file of the application is 
maliciously modified?

Actually the vtables inside the CDS archive file contain all zeros, and 
are filled in by the VM after the archive is mapped.

What could be modified is the vtptr of archived Metadata objects. They 
usually point to somewhere near 0x800000000 (where the vtables are), but 
an attacker could modify them to point to arbitrary locations. I am not 
sure whether this type of attack is easier than modifying the JAR files.

Thanks
- Ioi


>
> On Windows and Linux this doesn't matter. On Linux sensitive files can
> be packaged or created in postinst scripts. On Windows either an app
> comes with a legacy installer/MSI file and thus doesn't have any
> recognized package identity that can be granted extra permissions, or
> it uses the current-gen MSIX system. In the latter case Windows has a
> notion of app identity and so you can request permissions to access
> e.g. keychain entries, the user's calendar etc, but in that case
> Windows also gives you a private directory that's protected from other
> apps where sensitive files can be stashed. AppCDS archives can go
> there and we're done.
>
> MacOS is a problem child. There are two situations that matter.
>
> In the first case archives are shipped as data files with the app.
> Security is not an issue here, but there's a subtle performance
> footgun. On most platforms signatures of files shipped with an app are
> checked at install time but on macOS they aren't. Thanks to its NeXT
> roots it doesn't really have an installation concept, and thus the
> kernel checks signatures of files on first use then caches the
> signature check in the kernel vnode. By default the entire file is
> hashed in order to link it back to the root signature, which for large
> files can impose a small but noticeable delay before the app can open
> them. This first run penalty is unfortunate given that AppCDS exists
> partly to improve startup time. You can argue it doesn't matter much
> due to the caching, but it's worth being aware of - very large AppCDS
> archives would get fully paged in and hashed before the app even gets
> to do anything. In turn that means people might enable AppCDS with a
> big classlist expecting it to speed things up, not noticing that for
> Mac users only it slowed things down instead. There are ways to fix
> this using supported Apple APIs. One is to supply a CodeDirectory
> structure stored in extended attributes: you should get incremental
> hashing and normal page fault behaviour (untested!). Another is to
> wrap the data in a Mach-O file.
>
> In the second case the CDS archive is being generated client side. Mac
> apps don't have anywhere they can create tamperproof data, except for
> very small amounts in the keychain. Thus if a Mac app opens a
> malicious cache file that can take control of it that's a security
> bug, because it'd allow one program to grab any special privileges the
> user granted to another. The fact that the grabbing program has passed
> GateKeeper and notarization doesn't necessarily matter (Apple's
> guidance on this is unclear, but it seems plausible that this is their
> stance). In this case the keychain can be used as a root of trust by
> storing a hash of the CDS archive in it and checking that after
> mmap/before use. Alternatively, again, Apple provides an API that lets
> you associate an on-disk (xattr) CodeDirectory structure with a file
> which will then be checked incrementally at page fault time. Extreme
> care must be taken to avoid race conditions, but in theory, a
> CodeDirectory structure can be computed at dump time, written to disk
> as an xattr, and then stored again in the keychain (e.g. by
> pretending it's a "key" or "password"). After the security API is
> instructed to associate a CD with the file, it can be checked against
> the tamperproofed version stored in the keychain and, if they match,
> the archive can then be mmapped and used as normal.
>
> Native images don't have these issues because the state snapshot is
> stored inside the Mach-O file and thus gets covered by the normal
> mechanisms. However, once native-image adds support for persisted
> heaps, the same issue may arise.
>
> Whether it's worth doing the extra work to solve this is unclear. Macs
> are guaranteed to come with very fast NVMe disks and CPUs. Still, it's
> worth being aware of the issue.
>
> 3.
>
> Why not just use a native image then? Maybe we'll do that because the
> performance wins are really compelling, but again, v1 will ship
> without this for the following reasons:
>
> a. Static minification can break things. Our integration tests
> currently invoke the entry point of the app directly, but that could
> be fixed to run the tool in an external process. For unit tests the
> situation is far murkier. It's a bit unclear how to run JUnit tests
> against the statically compiled version and it may not even make sense
> (because the tests would pin a bunch of code that might get stripped
> in the real app so what are you really testing?).
>
> b. It'd break delta updates. Not the end of the world, but a factor.
>
> c. I have no idea if we're using any libraries that spin bytecode
> dynamically. Even if we're not today, what if tomorrow we want to use
> such a library? Do we have to avoid using it and increase the cost of
> feature development, or roll back the native image and give our users
> a nasty performance downgrade? Neither option is attractive. Ideally
> SubstrateVM would contain a bytecode interpreter and use it when
> necessary. Lots of issues there but e.g. it'd probably be OK if it's
> not a general classloader and the code dependencies have to be known
> AOT.
>
> d. Similar to (c), fully AOT compilation can explode code and thus
> download size even though many codepaths are cold and only execute
> once. It'd be nice if a native image could include a mix of bytecode
> and AOT compiled hotspots.
>
> e. Once you're past the initial interactive stage the program is
> throughput sensitive. How much of a perf downgrade over HotSpot would
> we get, if any? With GraalVM EE we could use PGO and not lose any, but
> the ISV pricing is opaque. At any rate to answer this we have to fix
> the compatibility issues first. The prospect of improving startup time
> and then discovering we slowed down the actual builds isn't really
> appealing (though I suspect in our case AOT wouldn't really hurt
> much).
>
> f. What if we want to support in-process plugins? Maybe we can use
> Espresso, but this is a road less travelled (lack of tutorials,
> well-documented examples, etc).
>
> An interesting possibility is using a mix of approaches. For the bash
> competitor I mentioned earlier dynamic code loading is needed because
> the script bytecode is loaded into the host JVM, but the Kotlin
> compiler itself could theoretically be statically compiled to a JNI or
> Panama-accessible library. We tried this before and hit compatibility
> errors, but didn't make any effort to resolve them.
>
> 4.
>
> What about CRaC? It's Linux only so isn't interesting to us, given
> that most devs are on Windows/macOS. The benefits for Linux servers
> are clear though. Obvious question - can you make a snapshot on one
> machine/Linux distro and resume it on a totally different one, or
> does it require a homogeneous infrastructure?
>
> 5.
>
> A big reason AppCDS is nice is we get to keep the open world. This
> isn't only about compatibility, open worlds are just better. The most
> popular way to get software to desktop machines is Chrome and the web
> is totally open world. Apps are downloaded incrementally as the user
> navigates around, and companies exploit this fact aggressively. Large
> web sites can be far larger than would be considered practical to
> distribute to end user machines, and can easily update 50 times a day.
> Web developers have to think about latency on specific interactions,
> but they don't have to think about the size of the entire app and that
> allows them to scale up feature sets as fast as funding allows. In
> contrast the closed world mobile versions of their sites are a parade
> of horror stories in which firms have to e.g. hotpatch Dalvik to work
> around method count limits (Facebook), or in which code size issues
> nearly wrecked the entire company (Uber):
>
> https://twitter.com/StanTwinB/status/1336914412708405248
>
> Right now code size isn't a particularly serious problem for us, but
> the ease of including open source libraries means footprint grows all
> the time. Especially for our shell scripting tool, there are tons of
> cool features that could be added but if we did all of them we'd
> probably end up with 500 MB of bytecode. With an open world features
> can be downloaded on the fly as they get used and you can build a
> plugin ecosystem.
>
> The new more incremental direction of Leyden is thus welcomed and
> appreciated, because it feels like a lot of ground can be covered by
> "small" changes like upgrading AppCDS and caching compiled hotspots.
> Even if the results aren't as impressive as with native-image, the
> benefits of keeping an open world can probably make up for it, at
> least for our use cases.



More information about the leyden-dev mailing list