AppCDS / AOT thoughts based on CLI app experience

Ioi Lam ioi.lam at oracle.com
Fri Jun 3 00:30:39 UTC 2022



On 6/2/2022 5:15 PM, Ioi Lam wrote:
> Hi Mike,
>
> I am thrilled to hear that you're happy with CDS. Please see my 
> responses below.
>
> If you have other questions or requests for CDS, please let me know :-)
>
> On 6/1/2022 6:03 AM, Mike Hearn wrote:
>> Hi,
>>
>> It feels like most of the interest in static Java comes from the
>> microservices / functions-as-a-service community. My new company spent
>> the last year creating a developer tool that runs on the JVM (which
>> will be useful for Java developers actually, but what it does is
>> irrelevant here). Internally it's a kind of build system and is thus a
>> large(ish) CLI app in which startup time and throughput are what
>> matter most. We also have a separate internal tool that uses Kotlin
>> scripting to implement a bash-like scripting language, and which is
>> sensitive in the same ways.
>>
>> Today the JVM is often overlooked for writing CLI apps due to startup
>> time, 'lightness' and packaging issues. I figured I'd write down some
>> notes based on our experiences. They cover workflow, performance,
>> implementation costs and security issues. Hopefully it's helpful.
>>
>> 1.
>>
>> I really like AppCDS because:
>>
>> a. It can't break the app, so switching it on/off is a no-brainer.
>> Unlike native-image/static Java, no additional testing overhead is
>> created by it.
>>
>> b. It's effective even without heap snapshotting. We see a ~40%
>> speedup when executing --help.
>>
>> c. It's pay-as-you-go. We can use a small archive that's fast to
>> create to accelerate just the most latency sensitive startup paths, or
>> we can use it for the whole app, but ultimately costs are
>> controllable.
>>
>> d. Archives are deterministic. Modern client-side packaging systems
>> support delta updates, and CDS plays nicely with them. GraalVM native
>> images are non-deterministic so every update is going to replace the
>> entire app, which isn't much fun from an update speed or bandwidth
>> consumption perspective.
>>
>> Startup time is dominated by PicoCLI which is a common problem for
>> Java CLI apps. Supposedly the slowest part is building the model of
>> the CLI interface using reflection, so it's a perfect candidate for
>> AppCDS heap snapshotting.  I say supposedly, because I haven't seen
>> concrete evidence that this is actually where the time goes, but it
>> seems like a plausible belief. There's a long-standing bug filed to
>> replace reflection with code generation, but it's a big job and so
>> nobody has done it.
>>
>> Unfortunately, the app will ship without using AppCDS for now, as some
>> workflow issues remain. These can be solved in the app itself, but it'd
>> be nice if the JVM handled them.
>>
>> The obvious way to use CDS is to ship an archive with the app. We
>> might do this as a first iteration, but longer term don't want to for
>> two reasons:
>>
>> a. The archive can get huge.
>> b. Signature verification penalties on macOS (see below).
>>
>> For just making --help and similar short commands faster, size isn't so
>> bad (~6-10 MB for us), but if it's used for a whole execution, the
>> archive size for a standard run is nearly the same as the total bytecode
>> size of the app. As more stuff gets cached this will get worse.
>> Download size might not matter much for this particular app, but as a
>> general principle it does. So a nice improvement would be to generate
>> it client side.
>>
>> CDS files are caches and different platforms have different
>> conventions for where those go. The JVM doesn't know about those
>> conventions but our app does, so we'd need our custom native code
>> launcher (which exists anyway for other reasons) to set the right
>> paths for CDS.
>>
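[Editorial aside: the per-platform cache-path lookup described above might be sketched like this in a Java-based wrapper launcher. The "myapp" directory name and the environment-variable fallbacks are illustrative, not from the original post.]

```java
import java.nio.file.Path;

public class CdsCachePath {
    // Resolve a per-user cache directory following each platform's
    // convention; "myapp" is an illustrative application name.
    static Path cacheDir(String os, String home) {
        os = os.toLowerCase();
        if (os.contains("mac")) {
            return Path.of(home, "Library", "Caches", "myapp");
        } else if (os.contains("win")) {
            String base = System.getenv("LOCALAPPDATA");
            if (base == null) base = home + "\\AppData\\Local";
            return Path.of(base, "myapp", "Cache");
        } else {
            // Linux and friends: honor XDG_CACHE_HOME if set.
            String base = System.getenv("XDG_CACHE_HOME");
            if (base == null) base = home + "/.cache";
            return Path.of(base, "myapp");
        }
    }

    public static void main(String[] args) {
        // The launcher would pass this path to -XX:SharedArchiveFile=...
        Path archive = cacheDir(System.getProperty("os.name"),
                System.getProperty("user.home")).resolve("app.jsa");
        System.out.println(archive);
    }
}
```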
>> Then you have to pick the right flags depending on whether the CDS
>> file exists or not. I follow CDS-related changes and believe this is
>> fixed in the latest Java versions, but maybe (?) not released yet.
>
> Which version of Java are you using?
>
> Since JDK 11, the default value of -Xshare is set to -Xshare:auto, so 
> you can always do this:
>
> $ java -XX:SharedArchiveFile=nosuch.jsa -version
> java version "11" 2018-09-25
> Java(TM) SE Runtime Environment 18.9 (build 11+28)
> Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11+28, mixed mode)
>
> If the file exists, it will be used automatically. Otherwise the VM 
> will silently ignore the archive.
>
> Since JDK 17, a default CDS archive is shipped with the JDK. So you 
> will at least get some performance benefits of CDS for the built-in 
> classes.
>
> With the upcoming JDK 19, we have implemented a new feature (See 
> JDK-8261455) to automatically create the CDS archive. Here's an 
> example (I am using javac because it's convenient, but you need to 
> prefix the JVM parameters with -J):
>
> $ javac -J-XX:+AutoCreateSharedArchive 
> -J-XX:SharedArchiveFile=javac.jsa HelloWorld.java
>
> javac.jsa will be automatically created if it doesn't exist, or if 
> it's not compatible with the JVM (e.g., if you have upgraded to a 
> newer JDK).
>
> In this case, the total elapsed time is improved from about 522ms 
> (with default CDS archive) to 330ms (auto-generated archive).
>
>
>> Even once that's fixed it's not quite obvious that we'd use it. The
>> JVM runs much slower when dumping a dynamic CDS archive and the first
>> run is when first impressions are made. Whilst for cloud stuff this is
>> a matter of (artificially?) expensive resources, for CLI apps it's
>> about more subjective things like feeling snappy. One idea is to delay
>> dumping a CDS archive until after the first run is exiting, so it
>> doesn't get in the way. The first run wouldn't benefit from the
>> archive which is a pity (except on Linux where the package managers
>> make it easy to run code post-install), but it at least wouldn't be
>> slowed down by creating it either. The native launcher can schedule
>> this. Alternatively there could be a brief pause on first run when the
>> user is told explicitly that the app is optimizing itself, but how
>> feasible that is depends very much on dump speed. Finally we could
>> ship a small archive that only covers startup, and then in parallel
>> make a dump of a full run in the background.
>
> The dynamic CDS dumping happens when the JVM exits. We could ... (just 
> throwing half-baked ideas) spawn a new daemon subprocess to do the 
> dumping, while the main JVM process exits. So to the user there's no 
> penalty.
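[Editorial aside: the deferred-dump idea above might look something like this in a Java wrapper launcher. This is only a sketch: -XX:ArchiveClassesAtExit is a real JDK 13+ flag, but the jar name, main class and archive path are hypothetical.]

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class DeferredDumpLauncher {
    // Command line for a background "trial run" that records a dynamic
    // CDS archive when it exits (-XX:ArchiveClassesAtExit, JDK 13+).
    static List<String> dumpCommand(File archive, String... trialArgs) {
        List<String> cmd = new ArrayList<>(List.of(
                "java",
                "-XX:ArchiveClassesAtExit=" + archive.getPath(),
                "-cp", "app.jar", "com.example.Main"));  // hypothetical app
        cmd.addAll(List.of(trialArgs));
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        File archive = new File("app.jsa");
        if (!archive.exists()) {
            // Fire and forget after the user-visible work is done, so the
            // first run is never slowed down by archive creation.
            new ProcessBuilder(dumpCommand(archive, "--help"))
                    .inheritIO().start();
        }
    }
}
```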
>
>
>> Speaking of which, there's a need for some protocol to drive an app
>> through a representative 'trial run'. Whether it's generating the
>> class list or the archive itself, it could be as simple as an
>> alternative static method that sits next to main.

One thing you *could* do with JDK 19 on Linux is:

java -XX:+AutoCreateSharedArchive -XX:SharedArchiveFile=app.jsa -jar MyApp.jar

In your main method, check the /proc/self/maps file to see if app.jsa is 
mapped. If not, the VM is dumping the dynamic CDS archive. In this case, 
your app can run in a special "trial run" mode that exercises different 
functionalities.

To make this easier to use, we could add a special system property, 
something like "jdk.cds.is.dumping", that can be queried by the application.
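[Editorial aside: a sketch of the maps check described above, Linux-only; the archive name app.jsa is taken from the example command.]

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class CdsDumpDetector {
    // True if any /proc/self/maps-style line ends with the archive file
    // name, i.e. the archive is currently mapped into this process.
    static boolean isArchiveMapped(List<String> mapLines, String archiveName) {
        return mapLines.stream().anyMatch(l -> l.endsWith(archiveName));
    }

    // With -XX:+AutoCreateSharedArchive, an unmapped archive implies the
    // VM is recording one on this run, so treat it as the trial run.
    static boolean isTrialRun(String archiveName) throws IOException {
        List<String> maps = Files.readAllLines(Path.of("/proc/self/maps"));
        return !isArchiveMapped(maps, archiveName);
    }
}
```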

Thanks
- Ioi


>> If it were to be
>> standardized the rest of the infrastructure becomes more re-usable,
>> for instance build systems can take care of generating classlists, or
>> the end-user packaging can take care of dynamic dumping.
>
> Maybe we could have some sort of daemon that collects profiling data 
> in the background and updates the archives once the application's 
> behavior is better understood.
>
>> CDS has two modes and it's not clear which is better. I'm unusually
>> obsessive about this stuff to the extent of reading the CDS source
>> code, but despite that I have absolutely no idea if I should be trying
>> to use static or dynamic archives. There used to be a performance
>> difference between them but maybe it's fixed now? There's a lack of
>> end-to-end guidance on how to exploit this feature best.
>
> I agree our documentation is kind of lacking. We'll try to improve it.
>
> Static and dynamic archives will be roughly the same speed (~10 ms 
> faster with static dump for the javac example above).
>
> The dynamic archive will be smaller, because it doesn't need to 
> duplicate the built-in classes that are already in the static archive. 
> Here's a size comparison for javac.jsa
>
> static:  20,217,856 bytes
> dynamic: 10,153,984 bytes
>
>
>> The ideal would obviously be losing the dump/exec split and making
>> dynamic dumping continuous, incremental, and free of any performance
>> penalty. Then we could just supply a path to where the CDS file should
>> go and things magically warm up across executions. I have no idea how
>> feasible that is.
>>
>> Once AppCDS archives are in place and being created at the right
>> times, a @Snapshotted annotation for fields (or similar) should be an
>> easy win to eliminate the bulk of the rest of the PicoCLI time.
>> Dynamically loaded heaps would also be useful to eliminate the
>> overhead of loading configs and instantiating the (build) task graph
>> without a Gradle-style daemon.
>>
>> 2.
>>
>> AppCDS archives can open a subtle security issue when distributing
>> code to desktop platforms. Because they're full of vtables anyone who
>> can write to them can (we assume) take over any JVM that loads the
>> archive and gain whatever privileges have been granted to that app.
>> The archive file is fully trusted.
>
> Will you have a similar problem if the JAR file of the application is 
> maliciously modified?
>
> Actually the vtables inside the CDS archive file contain all zeros, 
> and are filled in by the VM after the archive is mapped.
>
> What could be modified is the vtptr of archived MetaData objects. They 
> usually point to somewhere near 0x800000000 (where the vtables are) 
> but the attacker could modify them to point to arbitrary locations. I 
> am not sure if this type of attack is easier than modifying the JAR 
> files, or not.
>
> Thanks
> - Ioi
>
>
>>
>> On Windows and Linux this doesn't matter. On Linux sensitive files can
>> be packaged or created in postinst scripts. On Windows either an app
>> comes with a legacy installer/MSI file and thus doesn't have any
>> recognized package identity that can be granted extra permissions, or
>> it uses the current gen MSIX system. In the latter case Windows has a
>> notion of app identity and so you can request permissions to access
>> e.g. keychain entries, the user's calendar etc, but in that case
>> Windows also gives you a private directory that's protected from other
>> apps where sensitive files can be stashed. AppCDS archives can go
>> there and we're done.
>>
>> macOS is a problem child. There are two situations that matter.
>>
>> In the first case archives are shipped as data files with the app.
>> Security is not an issue here, but there's a subtle performance
>> footgun. On most platforms signatures of files shipped with an app are
>> checked at install time but on macOS they aren't. Thanks to its NeXT
>> roots it doesn't really have an installation concept, and thus the
>> kernel checks signatures of files on first use then caches the
>> signature check in the kernel vnode. By default the entire file is
>> hashed in order to link it back to the root signature, which for large
>> files can impose a small but noticeable delay before the app can open
>> them. This first run penalty is unfortunate given that AppCDS exists
>> partly to improve startup time. You can argue it doesn't matter much
>> due to the caching, but it's worth being aware of - very large AppCDS
>> archives would get fully paged in and hashed before the app even gets
>> to do anything. In turn that means people might enable AppCDS with a
>> big classlist expecting it to speed things up, not noticing that for
>> Mac users only it slowed things down instead. There are ways to fix
>> this using supported Apple APIs. One is to supply a CodeDirectory
>> structure stored in extended attributes: you should get incremental
>> hashing and normal page fault behaviour (untested!). Another is to
>> wrap the data in a Mach-O file.
>>
>> In the second case the CDS archive is being generated client side. Mac
>> apps don't have anywhere they can create tamperproof data, except for
>> very small amounts in the keychain. Thus if a Mac app opens a
>> malicious cache file that can take control of it that's a security
>> bug, because it'd allow one program to grab any special privileges the
>> user granted to another. The fact that the grabbing program has passed
>> GateKeeper and notarization doesn't necessarily matter (Apple's
>> guidance on this is unclear, but it seems plausible that this is their
>> stance). In this case the keychain can be used as a root of trust by
>> storing a hash of the CDS archive in it and checking that after
>> mmap/before use. Alternatively, again, Apple provides an API that lets
>> you associate an on-disk (xattr) CodeDirectory structure with a file
>> which will then be checked incrementally at page fault time. Extreme
>> care must be taken to avoid race conditions, but in theory, a
>> CodeDirectory structure can be computed at dump time, written to disk
>> as an xattr, and then stored again in the keychain (e.g. by
>> pretending it's a "key" or "password"). After the security API is
>> instructed to associate a CD with the file, it can be checked against
>> the tamperproofed version stored in the keychain and if they match,
>> the archive can then be mmapped and used as normal.
>>
>> Native images don't have these issues because the state snapshot is
>> stored inside the Mach-O file and thus gets covered by the normal
>> mechanisms. However, once native-image adds support for persisted
>> heaps, the same issue may arise.
>>
>> Whether it's worth doing the extra work to solve this is unclear. Macs
>> are guaranteed to come with very fast NVMe disks and CPUs. Still, it's
>> worth being aware of the issue.
>>
>> 3.
>>
>> Why not just use a native image then? Maybe we'll do that because the
>> performance wins are really compelling, but again, v1 will ship
>> without this for the following reasons:
>>
>> a. Static minification can break things. Our integration tests
>> currently invoke the entry point of the app directly, but that could
>> be fixed to run the tool in an external process. For unit tests the
>> situation is far murkier. It's a bit unclear how to run JUnit tests
>> against the statically compiled version and it may not even make sense
>> (because the tests would pin a bunch of code that might get stripped
>> in the real app so what are you really testing?).
>>
>> b. It'd break delta updates. Not the end of the world, but a factor.
>>
>> c. I have no idea if we're using any libraries that spin bytecode
>> dynamically. Even if we're not today, what if tomorrow we want to use
>> such a library? Do we have to avoid using it and increase the cost of
>> feature development, or roll back the native image and give our users
>> a nasty performance downgrade? Neither option is attractive. Ideally
>> SubstrateVM would contain a bytecode interpreter and use it when
>> necessary. Lots of issues there but e.g. it'd probably be OK if it's
>> not a general classloader and the code dependencies have to be known
>> AOT.
>>
>> d. Similar to (c), fully AOT compilation can explode code and thus
>> download size even though many codepaths are cold and only execute
>> once. It'd be nice if a native image could include a mix of bytecode
>> and AOT compiled hotspots.
>>
>> e. Once you're past the initial interactive stage the program is
>> throughput sensitive. How much of a perf downgrade over HotSpot would
>> we get, if any? With GraalVM EE we could use PGO and not lose any, but
>> the ISV pricing is opaque. At any rate to answer this we have to fix
>> the compatibility issues first. The prospect of improving startup time
>> and then discovering we slowed down the actual builds isn't really
>> appealing (though I suspect in our case AOT wouldn't really hurt
>> much).
>>
>> f. What if we want to support in-process plugins? Maybe we can use
>> Espresso, but this is a road less travelled (lack of tutorials, well
>> documented examples etc).
>>
>> An interesting possibility is using a mix of approaches. For the bash
>> competitor I mentioned earlier dynamic code loading is needed because
>> the script bytecode is loaded into the host JVM, but the Kotlin
>> compiler itself could theoretically be statically compiled to a JNI or
>> Panama-accessible library. We tried this before and hit compatibility
>> errors, but didn't make any effort to resolve them.
>>
>> 4.
>>
>> What about CRaC? It's Linux-only, so it isn't interesting to us, given
>> that most devs are on Windows/macOS. The benefits for Linux servers
>> are clear though. Obvious question - can you make a snapshot on one
>> machine/Linux distro, and resume them on a totally different one, or
>> does it require a homogenous infrastructure?
>>
>> 5.
>>
>> A big reason AppCDS is nice is we get to keep the open world. This
>> isn't only about compatibility; open worlds are just better. The most
>> popular way to get software to desktop machines is Chrome and the web
>> is totally open world. Apps are downloaded incrementally as the user
>> navigates around, and companies exploit this fact aggressively. Large
>> web sites can be far larger than would be considered practical to
>> distribute to end user machines, and can easily update 50 times a day.
>> Web developers have to think about latency on specific interactions,
>> but they don't have to think about the size of the entire app and that
>> allows them to scale up feature sets as fast as funding allows. In
>> contrast the closed world mobile versions of their sites are a parade
>> of horror stories in which firms have to e.g. hotpatch Dalvik to work
>> around method count limits (Facebook), or in which code size issues
>> nearly wrecked the entire company (Uber):
>>
>> https://twitter.com/StanTwinB/status/1336914412708405248
>>
>> Right now code size isn't a particularly serious problem for us, but
>> the ease of including open source libraries means footprint grows all
>> the time. Especially for our shell scripting tool, there are tons of
>> cool features that could be added but if we did all of them we'd
>> probably end up with 500 MB of bytecode. With an open world, features
>> can be downloaded on the fly as they get used and you can build a
>> plugin ecosystem.
>>
>> The new more incremental direction of Leyden is thus welcomed and
>> appreciated, because it feels like a lot of ground can be covered by
>> "small" changes like upgrading AppCDS and caching compiled hotspots.
>> Even if the results aren't as impressive as with native-image, the
>> benefits of keeping an open world can probably make up for it, at
>> least for our use cases.
>



More information about the leyden-dev mailing list