AppCDS / AOT thoughts based on CLI app experience
Hi,

It feels like most of the interest in static Java comes from the microservices / functions-as-a-service community. My new company spent the last year creating a developer tool that runs on the JVM (it will actually be useful for Java developers, but what it does is irrelevant here). Internally it's a kind of build system and is thus a large(ish) CLI app in which startup time and throughput matter most. We also have a separate internal tool that uses Kotlin scripting to implement a bash-like scripting language, and which is sensitive in the same ways.

Today the JVM is often overlooked for writing CLI apps due to startup time, 'lightness' and packaging issues. I figured I'd write down some notes based on our experiences. They cover workflow, performance, implementation costs and security issues. Hopefully they're helpful.

1.

I really like AppCDS because:

a. It can't break the app, so switching it on/off is a no-brainer. Unlike native-image/static Java, it creates no additional testing overhead.

b. It's effective even without heap snapshotting. We see a ~40% speedup when executing --help.

c. It's pay-as-you-go. We can use a small archive that's fast to create to accelerate just the most latency-sensitive startup paths, or we can use it for the whole app; either way, the costs are controllable.

d. Archives are deterministic. Modern client-side packaging systems support delta updates, and CDS plays nicely with them. GraalVM native images are non-deterministic, so every update replaces the entire app, which isn't much fun from an update-speed or bandwidth-consumption perspective.

Startup time is dominated by PicoCLI, which is a common problem for Java CLI apps. Supposedly the slowest part is building the model of the CLI interface using reflection, so it's a perfect candidate for AppCDS heap snapshotting. I say supposedly because I haven't seen concrete evidence that this is actually where the time goes, but it seems like a plausible belief.
There's a long-standing bug filed to replace the reflection with code generation, but it's a big job and so nobody has done it.

Unfortunately the app will ship without using AppCDS: some workflow issues remain. These could be solved in the app itself, but it'd be nice if the JVM did it.

The obvious way to use CDS is to ship an archive with the app. We might do this as a first iteration, but longer term we don't want to, for two reasons:

a. The archive can get huge.

b. Signature verification penalties on macOS (see below).

For just making --help and similar short commands faster, the size isn't so bad (~6-10 MB for us), but if it's used for a whole execution, the archive for a standard run is nearly as large as the app's total bytecode. As more stuff gets cached this will get worse. Download size might not matter much for this particular app, but as a general principle it does. So a nice improvement would be to generate the archive client side.

CDS files are caches, and different platforms have different conventions for where caches go. The JVM doesn't know about those conventions but our app does, so we'd need our custom native code launcher (which exists anyway for other reasons) to set the right paths for CDS.

Then you have to pick the right flags depending on whether the CDS file exists or not. I follow CDS-related changes and believe this is fixed in the latest Java versions, but maybe (?) not released yet.

Even once that's fixed, it's not obvious that we'd use it. The JVM runs much slower when dumping a dynamic CDS archive, and the first run is when first impressions are made. For cloud workloads this is a matter of (artificially?) expensive resources; for CLI apps it's about more subjective things like feeling snappy. One idea is to delay dumping a CDS archive until the first run is exiting, so it doesn't get in the way.
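For illustration, the launcher-side logic described above (pick a platform cache directory, then dump or use the archive depending on whether it exists) can be sketched in plain Java. The class and app names here are hypothetical; the flags are the standard dynamic-CDS ones (`-XX:ArchiveClassesAtExit` to dump on exit, `-XX:SharedArchiveFile` to use an existing archive):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch of launcher-side CDS wiring: resolve the
// platform-conventional cache directory, then pick dump vs. use flags
// depending on whether the archive already exists on disk.
public class CdsLauncherSketch {
    // Per-platform cache directory conventions: XDG on Linux,
    // ~/Library/Caches on macOS, %LOCALAPPDATA% on Windows.
    static Path cacheDir(String osName, String home, String appName) {
        if (osName.contains("mac")) {
            return Path.of(home, "Library", "Caches", appName);
        }
        if (osName.contains("win")) {
            String base = System.getenv().getOrDefault("LOCALAPPDATA", home);
            return Path.of(base, appName, "cache");
        }
        String xdg = System.getenv("XDG_CACHE_HOME");
        Path base = (xdg != null) ? Path.of(xdg) : Path.of(home, ".cache");
        return base.resolve(appName);
    }

    // If the archive exists, load it; otherwise ask the JVM to dump one
    // when this run exits (dynamic dumping, JDK 13+).
    static List<String> cdsFlags(Path archive) {
        return Files.exists(archive)
                ? List.of("-XX:SharedArchiveFile=" + archive)
                : List.of("-XX:ArchiveClassesAtExit=" + archive);
    }

    public static void main(String[] args) {
        Path archive = cacheDir(System.getProperty("os.name").toLowerCase(),
                System.getProperty("user.home"), "mytool").resolve("app.jsa");
        System.out.println(String.join(" ", cdsFlags(archive)));
    }
}
```

A real launcher would also have to handle a stale or corrupt archive (the JVM tolerates a mismatched archive by falling back to normal loading), but the existence check above is the core of the flag-selection problem.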
The first run wouldn't benefit from the archive, which is a pity (except on Linux, where package managers make it easy to run code post-install), but at least it wouldn't be slowed down by creating it either. The native launcher can schedule this. Alternatively there could be a brief pause on first run, with the user told explicitly that the app is optimizing itself, but how feasible that is depends very much on dump speed. Finally, we could ship a small archive that only covers startup, and then make a dump of a full run in the background.

Speaking of which, there's a need for some protocol to drive an app through a representative 'trial run'. Whether it's generating the class list or the archive itself, it could be as simple as an alternative static method that sits next to main. If it were standardized, the rest of the infrastructure would become more reusable: build systems could take care of generating class lists, or the end-user packaging could take care of dynamic dumping.

CDS has two modes and it's not clear which is better. I'm unusually obsessive about this stuff, to the extent of reading the CDS source code, but despite that I have absolutely no idea whether I should be using static or dynamic archives. There used to be a performance difference between them, but maybe that's fixed now? There's a lack of end-to-end guidance on how best to exploit this feature.

The ideal would obviously be to lose the dump/exec split and make dynamic dumping continuous, incremental, and free of performance penalty. Then we could just supply a path for the CDS file and things would magically warm up across executions. I have no idea how feasible that is.

Once AppCDS archives are in place and being created at the right times, a @Snapshotted annotation for fields (or similar) should be an easy win to eliminate the bulk of the remaining PicoCLI time.
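To make the @Snapshotted idea concrete, here's a minimal sketch of what such a (purely hypothetical, non-existent) annotation could look like. The intent is that the fully-initialized value of an annotated field would be written into the CDS heap archive at dump time and patched back in on later runs, skipping the expensive initialization:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

public class CliEntryPoint {
    // Hypothetical marker annotation: a @Snapshotted field's value would be
    // archived at dump time instead of recomputed on every startup.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    public @interface Snapshotted {}

    // E.g. PicoCLI's reflectively-built command model could be archived once
    // rather than rebuilt from annotations on each run.
    @Snapshotted
    static final Object COMMAND_MODEL = buildModelExpensively();

    static Object buildModelExpensively() {
        // Stand-in for the reflective model construction that dominates startup.
        return new Object();
    }

    public static void main(String[] args) {
        System.out.println("model ready: " + (COMMAND_MODEL != null));
    }
}
```

Nothing like this exists today; the sketch only shows how small the API surface for the feature could be from the application's point of view.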
Dynamically loaded heaps would also be useful to eliminate the overhead of loading configs and instantiating the (build) task graph without a Gradle-style daemon.

2.

AppCDS archives can open a subtle security issue when distributing code to desktop platforms. Because they're full of vtables, anyone who can write to them can (we assume) take over any JVM that loads the archive and gain whatever privileges have been granted to that app. The archive file is fully trusted.

On Windows and Linux this doesn't matter. On Linux, sensitive files can be packaged or created in postinst scripts. On Windows, either an app comes with a legacy installer/MSI file, and thus doesn't have any recognized package identity that could be granted extra permissions, or it uses the current-gen MSIX system. In the latter case Windows has a notion of app identity, so you can request permissions to access e.g. keychain entries or the user's calendar, but then Windows also gives you a private directory, protected from other apps, where sensitive files can be stashed. AppCDS archives can go there and we're done.

macOS is a problem child. There are two situations that matter.

In the first, archives are shipped as data files with the app. Security is not an issue here, but there's a subtle performance footgun. On most platforms the signatures of files shipped with an app are checked at install time, but on macOS they aren't. Thanks to its NeXT roots it doesn't really have an installation concept, so the kernel checks file signatures on first use and then caches the result in the kernel vnode. By default the entire file is hashed in order to link it back to the root signature, which for large files can impose a small but noticeable delay before the app can open them. This first-run penalty is unfortunate given that AppCDS exists partly to improve startup time.
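One way to see why the whole-file scheme hurts: a single hash over the entire file means every byte must be read before any page can be trusted, whereas a per-page scheme can verify each page lazily as it's faulted in. A simplified model of the difference (4 KiB pages, SHA-256; this is just the underlying idea, not Apple's actual code-signing format):

```java
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

// Simplified contrast between whole-file hashing (forces the entire archive
// to be read before first use) and per-page hashing (each page verifiable
// independently at page-fault time). NOT Apple's on-disk format.
public class PageHashSketch {
    static final int PAGE = 4096;

    private static MessageDigest sha256() {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (Exception e) {
            throw new RuntimeException(e); // SHA-256 is always available
        }
    }

    // One digest over everything: verification cost is O(file size) up front.
    static byte[] wholeFileHash(byte[] data) {
        return sha256().digest(data);
    }

    // One digest per page: startup only pays for pages it actually touches,
    // and a tampered page is detected the moment it's faulted in.
    static List<byte[]> perPageHashes(byte[] data) {
        MessageDigest md = sha256();
        List<byte[]> out = new ArrayList<>();
        for (int off = 0; off < data.length; off += PAGE) {
            md.reset();
            md.update(data, off, Math.min(PAGE, data.length - off));
            out.add(md.digest());
        }
        return out;
    }
}
```

The per-page table itself still has to be covered by something trusted (a signature over the table, or a tamper-proof copy of it), which is exactly the role the structures discussed next play.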
You can argue it doesn't matter much due to the caching, but it's worth being aware of: very large AppCDS archives would get fully paged in and hashed before the app even gets to do anything. That means people might enable AppCDS with a big class list expecting a speedup, not noticing that for Mac users it slowed things down instead. There are ways to fix this using supported Apple APIs. One is to supply a CodeDirectory structure stored in extended attributes: you should then get incremental hashing and normal page-fault behaviour (untested!). Another is to wrap the data in a Mach-O file.

In the second situation, the CDS archive is generated client side. Mac apps don't have anywhere they can create tamper-proof data, except for very small amounts in the keychain. Thus if a Mac app opens a malicious cache file that can take control of it, that's a security bug, because it'd allow one program to grab any special privileges the user granted to another. The fact that the grabbing program has passed Gatekeeper and notarization doesn't necessarily matter (Apple's guidance here is unclear, but it seems plausible that this is their stance). In this case the keychain can be used as a root of trust, by storing a hash of the CDS archive in it and checking that hash after mmap and before use. Alternatively, again, Apple provides an API that lets you associate an on-disk (xattr) CodeDirectory structure with a file, which will then be checked incrementally at page-fault time. Extreme care must be taken to avoid race conditions, but in theory a CodeDirectory structure can be computed at dump time, written to disk as an xattr, and also stored in the keychain (e.g. by pretending it's a "key" or "password"). After the security API is instructed to associate the CodeDirectory with the file, it can be checked against the tamper-proofed version stored in the keychain; if they match, the archive can be mmapped and used as normal.
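The keychain-as-root-of-trust flow above is easy to sketch: at dump time, hash the archive and put the digest somewhere tamper-proof; at startup, recompute and compare before trusting the mapping. In this sketch the keychain is stubbed out as an in-memory trusted store (the real thing would go through Apple's Security framework, which isn't modelled here):

```java
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Sketch of the verify-before-use flow. The "keychain" is modelled as a
// trusted key/value store; a real implementation would use the macOS
// Security framework. Only the hashing/comparison logic is shown.
public class ArchiveTrustSketch {
    private final Map<String, byte[]> trustedStore = new HashMap<>(); // stand-in for the keychain

    static byte[] sha256(byte[] archiveBytes) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(archiveBytes);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Called at dump time, after the JVM finishes writing the archive.
    public void recordTrustedHash(String archiveId, byte[] archiveBytes) {
        trustedStore.put(archiveId, sha256(archiveBytes));
    }

    // Called at startup: hash the bytes of the mapping that will actually be
    // used (re-reading the file separately would reopen the race window).
    public boolean verify(String archiveId, byte[] archiveBytes) {
        byte[] expected = trustedStore.get(archiveId);
        return expected != null && MessageDigest.isEqual(expected, sha256(archiveBytes));
    }
}
```

The awkward part, as noted above, isn't the hashing but the races: the bytes that get verified must be the same bytes the JVM later executes from, which is why hashing the live mapping (or letting the kernel check pages at fault time) matters.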
Native images don't have these issues because the state snapshot is stored inside the Mach-O file and thus gets covered by the normal mechanisms. However, once native-image adds support for persisted heaps, the same issue may arise.

Whether it's worth the extra work to solve this is unclear. Macs are guaranteed to come with very fast NVMe disks and CPUs. Still, it's worth being aware of the issue.

3.

Why not just use a native image then? Maybe we will, because the performance wins are really compelling, but again, v1 will ship without it, for the following reasons:

a. Static minification can break things. Our integration tests currently invoke the entry point of the app directly, but that could be fixed by running the tool in an external process. For unit tests the situation is far murkier. It's unclear how to run JUnit tests against the statically compiled version, and it may not even make sense (the tests would pin a bunch of code that might get stripped in the real app, so what are you really testing?).

b. It'd break delta updates. Not the end of the world, but a factor.

c. I have no idea whether we're using any libraries that spin bytecode dynamically. Even if we're not today, what if tomorrow we want to use such a library? Do we avoid it and increase the cost of feature development, or roll back the native image and give our users a nasty performance downgrade? Neither option is attractive. Ideally SubstrateVM would contain a bytecode interpreter and use it when necessary. There are lots of issues there, but it'd probably be OK if it weren't a general classloader and the code dependencies had to be known AOT.

d. Similar to (c), fully AOT compilation can explode code size and thus download size, even though many code paths are cold and execute only once. It'd be nice if a native image could include a mix of bytecode and AOT-compiled hotspots.

e. Once past the initial interactive stage, the program is throughput sensitive.
How much of a perf downgrade versus HotSpot would we get, if any? With GraalVM EE we could use PGO and perhaps lose nothing, but the ISV pricing is opaque. At any rate, to answer this we'd have to fix the compatibility issues first. The prospect of improving startup time only to discover we'd slowed down the actual builds isn't appealing (though I suspect in our case AOT wouldn't really hurt much).

f. What if we want to support in-process plugins? Maybe we could use Espresso, but this is a road less travelled (lack of tutorials, well-documented examples, etc).

An interesting possibility is using a mix of approaches. For the bash competitor I mentioned earlier, dynamic code loading is needed because the script bytecode is loaded into the host JVM, but the Kotlin compiler itself could in theory be statically compiled to a JNI- or Panama-accessible library. We tried this before and hit compatibility errors, but didn't make any effort to resolve them.

4.

What about CRaC? It's Linux-only, so it isn't interesting to us given that most devs are on Windows/macOS. The benefits for Linux servers are clear, though. Obvious question: can you make a snapshot on one machine/Linux distro and resume it on a totally different one, or does it require homogeneous infrastructure?

5.

A big reason AppCDS is nice is that we get to keep the open world. This isn't only about compatibility; open worlds are just better. The most popular way to get software onto desktop machines is Chrome, and the web is totally open world. Apps are downloaded incrementally as the user navigates around, and companies exploit this aggressively. Large web sites can be far bigger than would be considered practical to distribute to end-user machines, and can easily update 50 times a day. Web developers have to think about latency on specific interactions, but they don't have to think about the size of the entire app, and that lets them scale up feature sets as fast as funding allows.
In contrast, the closed-world mobile versions of their sites are a parade of horror stories, in which firms have to e.g. hotpatch Dalvik to work around method-count limits (Facebook), or in which code size issues nearly wrecked the entire company (Uber):

https://twitter.com/StanTwinB/status/1336914412708405248

Right now code size isn't a particularly serious problem for us, but the ease of including open source libraries means the footprint grows all the time. Especially for our shell scripting tool, there are tons of cool features that could be added, but if we did all of them we'd probably end up with 500 MB of bytecode. With an open world, features can be downloaded on the fly as they get used, and you can build a plugin ecosystem.

The new, more incremental direction of Leyden is thus welcomed and appreciated, because it feels like a lot of ground can be covered by "small" changes like upgrading AppCDS and caching compiled hotspots. Even if the results aren't as impressive as with native-image, the benefits of keeping an open world can probably make up for it, at least for our use cases.
Hi Mike,

Thanks very much for that extremely valuable input, in particular the very clear breakdown of the swings and roundabouts you have noted when it comes to using CDS/AppCDS or native Java vs the vanilla dynamic JVM. It is very important that project Leyden considers the whole development and deployment cycle, not just the size and startup time/footprint of the delivered static Java executable (indeed, Dan Heidinga and I just published an article about this topic on InfoQ that you might find relevant).

Your comment about AppCDS being "pay-as-you-go" resonated most strongly. I hope that one of the pay-offs of the incremental approach Mark has recommended for the project will be the ability to provide "pay-as-you-go" improvements in startup time and footprint, where a user can balance development benefits and costs against those arising at deployment time.

regards,

Andrew Dinn
-----------
Red Hat Distinguished Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill

On 01/06/2022 14:03, Mike Hearn wrote:
Thanks Andrew. Yes, I saw the InfoQ article, it's excellent. Actually it was reading that which prompted me to sign up and write out these notes.
Thank you for the excellent write-up! Although many of the problems you've mentioned are not solved (and are sometimes made worse) by CRaC, I can't resist mentioning a CRaC change for CLI apps [1]. But this is off-topic, so I'm BCCing leyden-dev and CCing crac-dev.

On 6/1/22 16:03, Mike Hearn wrote:
What about CRaC? It's Linux-only, so it isn't interesting to us given that most devs are on Windows/macOS. The benefits for Linux servers are clear, though. Obvious question: can you make a snapshot on one machine/Linux distro and resume it on a totally different one, or does it require homogeneous infrastructure?
In the current implementation we've not started working on this. By the model, CRaC prevents file dependencies at the checkpoint and allows the VM to coordinate the restore, so eventually we should be able to deliver images that do not depend on a particular CPU or distribution.

The feasibility of a full implementation for macOS and Windows is unclear. But I think a reasonable effort would suffice to provide an implementation for testing and developing programs on those OSes which matches the behavior of the Linux CRaC implementation.

Thanks,
Anton
Hi Mike,

I am thrilled to hear that you're happy with CDS. Please see my responses below. If you have other questions or requests for CDS, please let me know :-)

On 6/1/2022 6:03 AM, Mike Hearn wrote:
Hi,
It feels like most of the interest in static Java comes from the microservices / functions-as-a-service community. My new company spent the last year creating a developer tool that runs on the JVM (which will be useful for Java developers actually, but what it does is irrelevant here). Internally it's a kind of build system and is thus a large(ish) CLI app in which startup time and throughput are what matter most. We also have a separate internal tool that uses Kotlin scripting to implement a bash-like scripting language, and which is sensitive in the same ways.
Today the JVM is often overlooked for writing CLI apps due to startup time, 'lightness' and packaging issues. I figured I'd write down some notes based on our experiences. They cover workflow, performance, implementation costs and security issues. Hopefully they're helpful.
1.
I really like AppCDS because:
a. It can't break the app, so switching it on/off is a no-brainer. Unlike native-image/static Java, it creates no additional testing overhead.
b. It's effective even without heap snapshotting. We see a ~40% speedup when executing --help.
c. It's pay-as-you-go. We can use a small archive that's fast to create to accelerate just the most latency-sensitive startup paths, or we can use it for the whole app; either way, the costs are controllable.
d. Archives are deterministic. Modern client-side packaging systems support delta updates, and CDS plays nicely with them. GraalVM native images are non-deterministic so every update is going to replace the entire app, which isn't much fun from an update speed or bandwidth consumption perspective.
Startup time is dominated by PicoCLI, which is a common problem for Java CLI apps. Supposedly the slowest part is building the model of the CLI interface using reflection, making it a perfect candidate for AppCDS heap snapshotting. I say supposedly because I haven't seen concrete evidence that this is actually where the time goes, but it seems like a plausible belief. There's a long-standing bug filed to replace reflection with code generation, but it's a big job and so nobody has done it.
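To make the cost concrete, here's a minimal sketch of reflection-driven option discovery, the kind of work a CLI framework does on every startup. This is not PicoCLI's actual implementation; the annotation and names are illustrative only:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of reflection-based CLI model building.
public class CliModelSketch {
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    public @interface Option { String name(); }

    public static class HelpCommand {
        @Option(name = "--verbose") boolean verbose;
        @Option(name = "--output")  String output;
    }

    // Walks the class with reflection and builds a name -> field model.
    // This traversal plus annotation parsing is exactly the kind of startup
    // work a CDS heap snapshot could capture once and reuse on later runs.
    public static Map<String, Field> buildModel(Class<?> spec) {
        Map<String, Field> model = new LinkedHashMap<>();
        for (Field f : spec.getDeclaredFields()) {
            Option opt = f.getAnnotation(Option.class);
            if (opt != null) model.put(opt.name(), f);
        }
        return model;
    }

    public static void main(String[] args) {
        System.out.println(buildModel(HelpCommand.class).keySet());
    }
}
```

For a real app the model covers hundreds of options across many subcommands, which is why the reflective walk adds up.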
Unfortunately the app will ship without using AppCDS, because some workflow issues remain. These can be solved in the app itself, but it'd be nice if the JVM did it.
The obvious way to use CDS is to ship an archive with the app. We might do this as a first iteration, but longer term don't want to for two reasons:
a. The archive can get huge.
b. Signature verification penalties on macOS (see below).
For just making --help and similar short commands faster, the size isn't so bad (~6-10 MB for us), but if it's used for a whole execution, the archive size for a standard run is nearly the same as the total bytecode size of the app. As more stuff gets cached this will get worse. Download size might not matter much for this particular app, but as a general principle it does. So a nice improvement would be to generate it client-side.
CDS files are caches and different platforms have different conventions for where those go. The JVM doesn't know about those conventions but our app does, so we'd need our custom native code launcher (which exists anyway for other reasons) to set the right paths for CDS.
Then you have to pick the right flags depending on whether the CDS file exists or not. I follow CDS-related changes and believe this is fixed in the latest Java versions, but maybe (?) not released yet.
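As a sketch of what such a launcher might do (the directory conventions and the "myapp" name are assumptions about typical platform practice, not anything the JVM mandates):

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical launcher logic: pick the platform's cache directory
// convention, then pass the CDS archive path to the JVM.
public class CdsLauncherSketch {
    public static Path cacheDir(String os, String home, String xdgCache, String localAppData) {
        if (os.contains("mac")) return Paths.get(home, "Library", "Caches", "myapp");
        if (os.contains("win")) return Paths.get(localAppData != null ? localAppData : home, "myapp", "cache");
        // Linux: honor XDG_CACHE_HOME, falling back to ~/.cache.
        String base = xdgCache != null ? xdgCache : home + "/.cache";
        return Paths.get(base, "myapp");
    }

    // Since JDK 11, -Xshare:auto is the default, so passing the archive path
    // unconditionally is safe: a missing file is silently ignored by the VM.
    public static List<String> jvmFlags(Path archive) {
        List<String> flags = new ArrayList<>();
        flags.add("-XX:SharedArchiveFile=" + archive);
        return flags;
    }

    public static void main(String[] args) {
        Path dir = cacheDir(System.getProperty("os.name").toLowerCase(),
                System.getProperty("user.home"),
                System.getenv("XDG_CACHE_HOME"),
                System.getenv("LOCALAPPDATA"));
        System.out.println(jvmFlags(dir.resolve("app.jsa")));
    }
}
```

The -Xshare:auto default (mentioned in the reply below) means the existence check can be skipped entirely, which simplifies the launcher.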
Which version of Java are you using? Since JDK 11, the default value of -Xshare is -Xshare:auto, so you can always do this:

$ java -XX:SharedArchiveFile=nosuch.jsa -version
java version "11" 2018-09-25
Java(TM) SE Runtime Environment 18.9 (build 11+28)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11+28, mixed mode)

If the file exists, it will be used automatically. Otherwise the VM will silently ignore the archive.

Since JDK 17, a default CDS archive is shipped with the JDK, so you will at least get some of the performance benefits of CDS for the built-in classes.

With the upcoming JDK 19, we have implemented a new feature (see JDK-8261455) to automatically create the CDS archive. Here's an example (I am using javac because it's convenient, but you need to prefix the JVM parameters with -J):

$ javac -J-XX:+AutoCreateSharedArchive -J-XX:SharedArchiveFile=javac.jsa HelloWorld.java

javac.jsa will be automatically created if it doesn't exist, or if it's not compatible with the JVM (e.g., if you have upgraded to a newer JDK). In this case, the total elapsed time improves from about 522ms (with the default CDS archive) to 330ms (auto-generated archive).
Even once that's fixed it's not quite obvious that we'd use it. The JVM runs much slower when dumping a dynamic CDS archive and the first run is when first impressions are made. Whilst for cloud stuff this is a matter of (artificially?) expensive resources, for CLI apps it's about more subjective things like feeling snappy. One idea is to delay dumping a CDS archive until after the first run is exiting, so it doesn't get in the way. The first run wouldn't benefit from the archive which is a pity (except on Linux where the package managers make it easy to run code post-install), but it at least wouldn't be slowed down by creating it either. The native launcher can schedule this. Alternatively there could be a brief pause on first run when the user is told explicitly that the app is optimizing itself, but how feasible that is depends very much on dump speed. Finally we could ship a small archive that only covers startup, and then in parallel make a dump of a full run in the background.
The dynamic CDS dumping happens when the JVM exits. We could ... (just throwing half-baked ideas) spawn a new daemon subprocess to do the dumping, while the main JVM process exits. So to the user there's no penalty.
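Sketching that subprocess idea (the -XX:ArchiveClassesAtExit flag is the real dynamic-dump flag, available since JDK 13; the --trial-run argument is a hypothetical option the app would have to implement):

```java
import java.util.ArrayList;
import java.util.List;

// Half-baked sketch, per the idea above: after the useful work is done, the
// launcher re-invokes the app in a detached subprocess whose only job is to
// exercise the startup path and dump a dynamic archive on exit.
public class BackgroundDumpSketch {
    public static List<String> dumpCommand(String javaBin, String archive, String jar) {
        List<String> cmd = new ArrayList<>();
        cmd.add(javaBin);
        cmd.add("-XX:ArchiveClassesAtExit=" + archive); // dynamic dump when the subprocess exits
        cmd.add("-jar");
        cmd.add(jar);
        cmd.add("--trial-run"); // hypothetical: run a representative workload, then exit
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = dumpCommand("java", "/tmp/app.jsa", "app.jar");
        // Fire-and-forget at shutdown; the user-facing process exits immediately:
        // new ProcessBuilder(cmd).start();
        System.out.println(String.join(" ", cmd));
    }
}
```

Because the dump happens in a process the user never waits on, the first-run snappiness concern goes away, at the cost of the archive only being available from the second run onward.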
Speaking of which, there's a need for some protocol to drive an app through a representative 'trial run'. Whether it's generating the class list or the archive itself, it could be as simple as an alternative static method that sits next to main. If it were to be standardized the rest of the infrastructure becomes more re-usable, for instance build systems can take care of generating classlists, or the end-user packaging can take care of dynamic dumping.
Maybe we could have some sort of daemon that collects profiling data in the background and updates the archives once the application's behavior is better understood.
CDS has two modes and it's not clear which is better. I'm unusually obsessive about this stuff to the extent of reading the CDS source code, but despite that I have absolutely no idea if I should be trying to use static or dynamic archives. There used to be a performance difference between them but maybe it's fixed now? There's a lack of end-to-end guidance on how to exploit this feature best.
I agree our documentation is kind of lacking. We'll try to improve it.

Static and dynamic archives will be roughly the same speed (~10 ms faster with a static dump for the javac example above). The dynamic archive will be smaller, because it doesn't need to duplicate the built-in classes that are already in the static archive. Here's a size comparison for javac.jsa:

static: 20,217,856 bytes
dynamic: 10,153,984 bytes
The ideal would obviously be to lose the dump/exec split and make dynamic dumping continuous, incremental, and free of any performance penalty. Then we could just supply a path to where the CDS file should go and things would magically warm up across executions. I have no idea how feasible that is.
Once AppCDS archives are in place and being created at the right times, a @Snapshotted annotation for fields (or similar) should be an easy win to eliminate the bulk of the rest of the PicoCLI time. Dynamically loaded heaps would also be useful to eliminate the overhead of loading configs and instantiating the (build) task graph without a Gradle-style daemon.
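To be clear, @Snapshotted is purely hypothetical; no such annotation exists in any JDK. A sketch of what the API could look like, where the marked field's initialized value would be captured into the CDS heap archive at dump time and restored on later runs:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical API sketch only: the annotation and its semantics are invented
// here to illustrate the proposal, not anything HotSpot implements today.
public class SnapshotSketch {
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    public @interface Snapshotted {}

    public static class Cli {
        // The expensive reflective model: under the proposal, built once at
        // dump time, then loaded from the heap archive on every normal run.
        @Snapshotted static final Object COMMAND_MODEL = buildModel();
        static Object buildModel() { return "model"; } // stand-in for the real work
    }

    public static void main(String[] args) throws Exception {
        boolean marked = Cli.class.getDeclaredField("COMMAND_MODEL")
                .isAnnotationPresent(Snapshotted.class);
        System.out.println("snapshotted: " + marked);
    }
}
```

The appeal is that, like AppCDS itself, this would be pay-as-you-go: annotate only the fields whose initialization dominates startup.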
2.
AppCDS archives can open a subtle security issue when distributing code to desktop platforms. Because they're full of vtables anyone who can write to them can (we assume) take over any JVM that loads the archive and gain whatever privileges have been granted to that app. The archive file is fully trusted.
Will you have a similar problem if the JAR file of the application is maliciously modified?

Actually the vtables inside the CDS archive file contain all zeros, and are filled in by the VM after the archive is mapped. What could be modified is the vtptr of archived Metadata objects. They usually point to somewhere near 0x800000000 (where the vtables are), but an attacker could modify them to point to arbitrary locations. I am not sure whether this type of attack is easier than modifying the JAR files or not.

Thanks - Ioi
On Windows and Linux this doesn't matter. On Linux sensitive files can be packaged or created in postinst scripts. On Windows either an app comes with a legacy installer/MSI file and thus doesn't have any recognized package identity that can be granted extra permissions, or it uses the current gen MSIX system. In the latter case Windows has a notion of app identity and so you can request permissions to access e.g. keychain entries, the user's calendar etc, but in that case Windows also gives you a private directory that's protected from other apps where sensitive files can be stashed. AppCDS archives can go there and we're done.
macOS is a problem child. There are two situations that matter.
In the first case archives are shipped as data files with the app. Security is not an issue here, but there's a subtle performance footgun. On most platforms signatures of files shipped with an app are checked at install time but on macOS they aren't. Thanks to its NeXT roots it doesn't really have an installation concept, and thus the kernel checks signatures of files on first use then caches the signature check in the kernel vnode. By default the entire file is hashed in order to link it back to the root signature, which for large files can impose a small but noticeable delay before the app can open them. This first run penalty is unfortunate given that AppCDS exists partly to improve startup time. You can argue it doesn't matter much due to the caching, but it's worth being aware of - very large AppCDS archives would get fully paged in and hashed before the app even gets to do anything. In turn that means people might enable AppCDS with a big classlist expecting it to speed things up, not noticing that for Mac users only it slowed things down instead. There are ways to fix this using supported Apple APIs. One is to supply a CodeDirectory structure stored in extended attributes: you should get incremental hashing and normal page fault behaviour (untested!). Another is to wrap the data in a Mach-O file.
In the second case the CDS archive is generated client-side. Mac apps don't have anywhere they can create tamperproof data, except for very small amounts in the keychain. Thus if a Mac app opens a malicious cache file that can take control of it, that's a security bug, because it'd allow one program to grab any special privileges the user granted to another. The fact that the grabbing program has passed Gatekeeper and notarization doesn't necessarily matter (Apple's guidance on this is unclear, but it seems plausible that this is their stance). In this case the keychain can be used as a root of trust, by storing a hash of the CDS archive in it and checking that after mmap/before use. Alternatively, again, Apple provides an API that lets you associate an on-disk (xattr) CodeDirectory structure with a file, which will then be checked incrementally at page fault time. Extreme care must be taken to avoid race conditions, but in theory a CodeDirectory structure can be computed at dump time, written to disk as an xattr, and then stored again in the keychain (e.g. by pretending it's a "key" or "password"). After the security API is instructed to associate a CD with the file, it can be checked against the tamperproofed version stored in the keychain, and if they match, the archive can then be mmapped and used as normal.
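The hash-in-keychain idea reduces to computing a digest at dump time and verifying it before use. A minimal sketch in plain Java (actual keychain access needs native code and is omitted; a plain string stands in for the trusted entry, and HexFormat requires JDK 17+):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

// Sketch of archive verification against a tamperproof stored digest.
public class ArchiveVerifySketch {
    public static String sha256Hex(byte[] data) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        return HexFormat.of().formatHex(md.digest(data));
    }

    // Compare the archive's digest to the trusted value. MessageDigest.isEqual
    // does a time-constant comparison, avoiding a timing side channel.
    public static boolean verify(byte[] archiveBytes, String trustedHex) throws Exception {
        return MessageDigest.isEqual(
                sha256Hex(archiveBytes).getBytes(StandardCharsets.UTF_8),
                trustedHex.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        byte[] fake = "archive-bytes".getBytes(StandardCharsets.UTF_8);
        String trusted = sha256Hex(fake); // stand-in for the keychain entry
        System.out.println(verify(fake, trusted));
    }
}
```

Note the race condition mentioned above still applies: the check must happen on the same bytes the JVM ends up mapping, so in practice the file would need to be opened once and held, or verified after mmap as the text suggests.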
Native images don't have these issues because the state snapshot is stored inside the Mach-O file and thus gets covered by the normal mechanisms. However, once native images add support for persisted heaps, the same issue may arise there.
Whether it's worth doing the extra work to solve this is unclear. Macs are guaranteed to come with very fast NVMe disks and CPUs. Still, it's worth being aware of the issue.
3.
Why not just use a native image then? Maybe we'll do that because the performance wins are really compelling, but again, v1 will ship without this for the following reasons:
a. Static minification can break things. Our integration tests currently invoke the entry point of the app directly, but that could be fixed to run the tool in an external process. For unit tests the situation is far murkier. It's a bit unclear how to run JUnit tests against the statically compiled version and it may not even make sense (because the tests would pin a bunch of code that might get stripped in the real app so what are you really testing?).
b. It'd break delta updates. Not the end of the world, but a factor.
c. I have no idea if we're using any libraries that spin bytecode dynamically. Even if we're not today, what if tomorrow we want to use such a library? Do we have to avoid using it and increase the cost of feature development, or roll back the native image and give our users a nasty performance downgrade? Neither option is attractive. Ideally SubstrateVM would contain a bytecode interpreter and use it when necessary. Lots of issues there but e.g. it'd probably be OK if it's not a general classloader and the code dependencies have to be known AOT.
d. Similar to (c), fully AOT compilation can explode code and thus download size even though many codepaths are cold and only execute once. It'd be nice if a native image could include a mix of bytecode and AOT compiled hotspots.
e. Once you're past the initial interactive stage the program is throughput sensitive. How much of a perf downgrade over HotSpot would we get, if any? With GraalVM EE we could use PGO and not lose any, but the ISV pricing is opaque. At any rate to answer this we have to fix the compatibility issues first. The prospect of improving startup time and then discovering we slowed down the actual builds isn't really appealing (though I suspect in our case AOT wouldn't really hurt much).
f. What if we want to support in-process plugins? Maybe we can use Espresso, but this is a road less travelled (lack of tutorials, well documented examples etc).
An interesting possibility is using a mix of approaches. For the bash competitor I mentioned earlier dynamic code loading is needed because the script bytecode is loaded into the host JVM, but the Kotlin compiler itself could theoretically be statically compiled to a JNI or Panama-accessible library. We tried this before and hit compatibility errors, but didn't make any effort to resolve them.
4.
What about CRaC? It's Linux only so isn't interesting to us, given that most devs are on Windows/macOS. The benefits for Linux servers are clear though. Obvious question - can you make a snapshot on one machine/Linux distro, and resume it on a totally different one, or does it require a homogeneous infrastructure?
5.
A big reason AppCDS is nice is that we get to keep the open world. This isn't only about compatibility; open worlds are just better. The most popular way to get software onto desktop machines is the web browser (Chrome), and the web is totally open world. Apps are downloaded incrementally as the user navigates around, and companies exploit this fact aggressively. Large web sites can be far larger than would be considered practical to distribute to end-user machines, and can easily update 50 times a day. Web developers have to think about latency on specific interactions, but they don't have to think about the size of the entire app, and that allows them to scale up feature sets as fast as funding allows. In contrast, the closed-world mobile versions of their sites are a parade of horror stories in which firms have to e.g. hotpatch Dalvik to work around method count limits (Facebook), or in which code size issues nearly wrecked the entire company (Uber):
https://twitter.com/StanTwinB/status/1336914412708405248
Right now code size isn't a particularly serious problem for us, but the ease of including open source libraries means our footprint grows all the time. Especially for our shell scripting tool, there are tons of cool features that could be added, but if we did all of them we'd probably end up with 500 MB of bytecode. With an open world, features can be downloaded on the fly as they get used, and you can build a plugin ecosystem.
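A sketch of that on-the-fly loading with plain URLClassLoader (the plugin directory layout is an assumption; how jars get downloaded there is out of scope):

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Open-world sketch: features shipped as plain jars, fetched on first use
// and loaded into the running JVM via a child class loader.
public class PluginLoaderSketch {
    public static URLClassLoader loaderFor(Path pluginDir) throws Exception {
        List<URL> urls = new ArrayList<>();
        if (Files.isDirectory(pluginDir)) {
            try (DirectoryStream<Path> jars = Files.newDirectoryStream(pluginDir, "*.jar")) {
                for (Path jar : jars) urls.add(jar.toUri().toURL());
            }
        }
        // Parent delegation keeps the core app's classes authoritative, so a
        // plugin can't shadow them.
        return new URLClassLoader(urls.toArray(new URL[0]),
                PluginLoaderSketch.class.getClassLoader());
    }

    public static void main(String[] args) throws Exception {
        try (URLClassLoader loader = loaderFor(Path.of("plugins"))) {
            System.out.println("plugin classpath entries: " + loader.getURLs().length);
        }
    }
}
```

This is exactly the kind of dynamic class loading that a closed-world native image forbids, which is the point of the section.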
The new more incremental direction of Leyden is thus welcomed and appreciated, because it feels like a lot of ground can be covered by "small" changes like upgrading AppCDS and caching compiled hotspots. Even if the results aren't as impressive as with native-image, the benefits of keeping an open world can probably make up for it, at least for our use cases.
On 6/2/2022 5:15 PM, Ioi Lam wrote:
Hi Mike,
I am thrilled to hear that you're happy with CDS. Please see my responses below.
If you have other questions or requests for CDS, please let me know :-)
On 6/1/2022 6:03 AM, Mike Hearn wrote:
Hi,
It feels like most of the interest in static Java comes from the microservices / functions-as-a-service community. My new company spent the last year creating a developer tool that runs on the JVM (which will be useful for Java developers actually, but what it does is irrelevant here). Internally it's a kind of build system and is thus a large(ish) CLI app in which startup time and throughput are what matter most. We also have a separate internal tool that uses Kotlin scripting to implement a bash-like scripting language, and which is sensitive in the same ways.
Today the JVM is often overlooked for writing CLI apps due to startup time, 'lightness' and packaging issues. I figured I'd write down some notes based on our experiences. They cover workflow, performance, implementation costs and security issues. Hopefully it's helpful.
1.
I really like AppCDS because:
a. It can't break the app so switching it on/off a no-brainer. Unlike native-image/static java, no additional testing overhead is created by it.
b. It's effective even without heap snapshotting. We see a ~40% speedup for executing --help
c. It's pay-as-you-go. We can use a small archive that's fast to create to accelerate just the most latency sensitive startup paths, or we can use it for the whole app, but ultimately costs are controllable.
d. Archives are deterministic. Modern client-side packaging systems support delta updates, and CDS plays nicely with them. GraalVM native images are non-deterministic so every update is going to replace the entire app, which isn't much fun from an update speed or bandwidth consumption perspective.
Startup time is dominated by PicoCLI which is a common problem for Java CLI apps. Supposedly the slowest part is building the model of the CLI interface using reflection, so it's a perfect candidate for AppCDS heap snapshotting. I say supposedly, because I haven't seen concrete evidence that this is actually where the time goes, but it seems like a plausible belief. There's a long standing bug filed to replace reflection with code generation but it's a big job and so nobody did it.
Unfortunately the app will ship without using AppCDS. Some workflow issues remain. These can be solved in the app itself, but it'd be nice if the JVM does it.
The obvious way to use CDS is to ship an archive with the app. We might do this as a first iteration, but longer term don't want to for two reasons:
a. The archive can get huge. b. Signature verification penalties on macOS (see below).
For just making --help and similar short commands faster size isn't so bad (~6-10mb for us), but if it's used for a whole execution the archive size for a standard run is nearly the same as total bytecode size of the app. As more stuff gets cached this will get worse. Download size might not matter much for this particular app, but as a general principle it does. So a nice improvement would be to generate it client side.
CDS files are caches and different platforms have different conventions for where those go. The JVM doesn't know about those conventions but our app does, so we'd need our custom native code launcher (which exists anyway for other reasons) to set the right paths for CDS.
Then you have to pick the right flags depending on whether the CDS file exists or not. I follow CDS related changes and believe this is fixed in latest Java versions but maybe (?) not released yet.
Which version of Java are you using?
Since JDK 11, the default value of -Xshare is set to -Xshare:auto, so you can always do this:
$ java -XX:SharedArchiveFile=nosuch.jsa -version java version "11" 2018-09-25 Java(TM) SE Runtime Environment 18.9 (build 11+28) Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11+28, mixed mode)
If the file exists, it will be used automatically. Otherwise the VM will silently ignore the archive.
Since JDK 17, a default CDS archive is shipped with the JDK. So you will at least get some performance benefits of CDS for the built-in classes.
With the upcoming JDK 19, we have implemented a new feature (See JDK-8261455) to automatically create the CDS archive. Here's an example (I am using Javac because it's convenient, but you need to quote the JVM parameters with -J):
$ javac -J-XX:+AutoCreateSharedArchive -J-XX:SharedArchiveFile=javac.jsa HelloWorld.java
javac.jsa will be automatically created if it doesn't exist, or if it's not compatible with the JVM (e.g., if you have upgraded to a newer JDK).
In this case, the total elapsed time is improved from about 522ms (with default CDS archive) to 330ms (auto-generated archive).
Even once that's fixed it's not quite obvious that we'd use it. The JVM runs much slower when dumping a dynamic CDS archive and the first run is when first impressions are made. Whilst for cloud stuff this is a matter of (artificially?) expensive resources, for CLI apps it's about more subjective things like feeling snappy. One idea is to delay dumping a CDS archive until after the first run is exiting, so it doesn't get in the way. The first run wouldn't benefit from the archive which is a pity (except on Linux where the package managers make it easy to run code post-install), but it at least wouldn't be slowed down by creating it either. The native launcher can schedule this. Alternatively there could be a brief pause on first run when the user is told explicitly that the app is optimizing itself, but how feasible that is depends very much on dump speed. Finally we could ship a small archive that only covers startup, and then in parallel make a dump of a full run in the background.
The dynamic CDS dumping happens when the JVM exits. We could ... (just throwing half-baked ideas) spawn a new daemon subprocess to do the dumping, while the main JVM process exits. So to the user there's no penalty.
Speaking of which, there's a need for some protocol to drive an app through a representative 'trial run'. Whether it's generating the class list or the archive itself, it could be as simple as an alternative static method that sits next to main.
One thing you *could* do with JDK 19 on Linux is:

java -XX:+AutoCreateSharedArchive -XX:SharedArchiveFile=app.jsa -jar MyApp

In your main method, check the /proc/self/maps file to see if app.jsa is mapped. If not, the VM is dumping the dynamic CDS archive. In this case, your app can run in a special "trial run" mode that exercises different functionalities. To make this easier to use, we could add a special system property, something like "jdk.cds.is.dumping", that can be queried by the application.

Thanks - Ioi
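A sketch of that check (the parsing helper is split out so it runs anywhere; only the /proc/self/maps read is Linux-specific, and app.jsa is the archive name from the command above):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Detect whether this run is the archive-producing "trial run" by looking
// for the .jsa file among the process's own memory mappings.
public class TrialRunDetect {
    // Each /proc/self/maps line ends with the mapped file's path, if any.
    public static boolean isArchiveMapped(String mapsContent, String archiveName) {
        for (String line : mapsContent.split("\n")) {
            if (line.endsWith(archiveName)) return true;
        }
        return false;
    }

    public static boolean isDumpingRun(String archiveName) throws IOException {
        Path maps = Paths.get("/proc/self/maps");
        if (!Files.exists(maps)) return false; // not Linux; can't tell this way
        return !isArchiveMapped(Files.readString(maps), archiveName);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(isDumpingRun("app.jsa") ? "trial-run (dumping)" : "normal run");
    }
}
```

A "jdk.cds.is.dumping" system property, as suggested, would make the /proc parsing unnecessary and would work on all platforms.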
If it were to be standardized the rest of the infrastructure becomes more re-usable, for instance build systems can take care of generating classlists, or the end-user packaging can take care of dynamic dumping.
Maybe we could have some sort of daemon that collects profiling data in the background, and update the archives when the application behavior is more understood.
CDS has two modes and it's not clear which is better. I'm unusually obsessive about this stuff to the extent of reading the CDS source code, but despite that I have absolutely no idea if I should be trying to use static or dynamic archives. There used to be a performance difference between them but maybe it's fixed now? There's a lack of end-to-end guidance on how to exploit this feature best.
I agree our documentation is kind of lacking. We'll try to improve it.
Static and dynamic archives will be roughly the same speed (~10 ms faster with static dump for the javac example above).
The dynamic archive will be smaller, because it doesn't need to duplicate the built-in classes that are already in the static archive. Here's a size comparison for javac.jsa
static: 20,217,856 bytes dynamic: 10,153,984 bytes
The ideal would obviously be losing the dump/exec split and make dynamic dumping continuous, incremental and imposing no performance penalty. Then we could just supply a path to where the CDS file should go and things magically warm up across executions. I have no idea how feasible that is.
Once AppCDS archives are in place and being created at the right times, a @Snapshotted annotation for fields (or similar) should be an easy win to eliminate the bulk of the rest of the PicoCLI time. Dynamically loaded heaps would also be useful to eliminate the overhead of loading configs and instantiating the (build) task graph without a Gradle-style daemon.
2.
AppCDS archives can open a subtle security issue when distributing code to desktop platforms. Because they're full of vtables anyone who can write to them can (we assume) take over any JVM that loads the archive and gain whatever privileges have been granted to that app. The archive file is fully trusted.
Will you have a similar problem if the JAR file of the application is maliciously modified?
Actually the vtables inside the CDS archive file contain all zeros, and are filled in by the VM after the archive is mapped.
What could be modified is the vtptr of archived MetaData objects. They usually point to somewhere near 0x800000000 (where the vtables are) but the attacker could modify them to point to arbitrary locations. I am not sure if this type of attack is easier than modifying the JAR files, or not.
Thanks - Ioi
On Windows and Linux this doesn't matter. On Linux sensitive files can be packaged or created in postinst scripts. On Windows either an app comes with a legacy installer/MSI file and thus doesn't have any recognized package identity that can be granted extra permissions, or it uses the current gen MSIX system. In the latter case Windows has a notion of app identity and so you can request permissions to access e.g. keychain entries, the user's calendar etc, but in that case Windows also gives you a private directory that's protected from other apps where sensitive files can be stashed. AppCDS archives can go there and we're done.
MacOS is a problem child. There are two situations that matter.
In the first case archives are shipped as data files with the app. Security is not an issue here, but there's a subtle performance footgun. On most platforms signatures of files shipped with an app are checked at install time but on macOS they aren't. Thanks to its NeXT roots it doesn't really have an installation concept, and thus the kernel checks signatures of files on first use then caches the signature check in the kernel vnode. By default the entire file is hashed in order to link it back to the root signature, which for large files can impose a small but noticeable delay before the app can open them. This first run penalty is unfortunate given that AppCDS exists partly to improve startup time. You can argue it doesn't matter much due to the caching, but it's worth being aware of - very large AppCDS archives would get fully paged in and hashed before the app even gets to do anything. In turn that means people might enable AppCDS with a big classlist expecting it to speed things up, not noticing that for Mac users only it slowed things down instead. There are ways to fix this using supported Apple APIs. One is to supply a CodeDirectory structure stored in extended attributes: you should get incremental hashing and normal page fault behaviour (untested!). Another is to wrap the data in a Mach-O file.
In the second case the CDS archive is being generated client side. Mac apps don't have anywhere they can create tamperproof data, except for very small amounts in the keychain. Thus if a Mac app opens a malicious cache file that can take control of it that's a security bug, because it'd allow one program to grab any special privileges the user granted to another. The fact that the grabbing program has passed GateKeeper and notarization doesn't necessarily matter (Apple's guidance on this is unclear, but it seems plausible that this is their stance). In this case the key chain can be used as a root of trust by storing a hash of the CDS archive in it and checking that after mmap/before use. Alternatively, again, Apple provides an API that lets you associate an on-disk (xattr) CodeDirectory structure with a file which will then be checked incrementally at page fault time. Extreme care must be taken to avoid race conditions, but in theory, a CodeDirectory structure can be computed at dump time, written to disk as an xattr, and then stored again in the key chain (e.g. by pretending it's a "key" or "password"). After the security API is instructed to associate a CD with the file, it can be checked against the tamperproofed version stored in the key chain and if they match, the archive can then be mmapped and used as normal.
Native images don't have these issues because the state snapshot is stored inside the Mach-O file and thus gets covered by the normal mechanisms. However once it adds support for persisted heaps, the same issue may arise.
Whether it's worth doing the extra work to solve this is unclear. Macs are guaranteed to come with very fast NVMe disks and CPUs. Still, it's worth being aware of the issue.
3.
Why not just use a native image then? Maybe we'll do that because the performance wins are really compelling, but again, v1 will ship without this for the following reasons:
a. Static minification can break things. Our integration tests currently invoke the entry point of the app directly, but that could be fixed to run the tool in an external process. For unit tests the situation is far murkier. It's a bit unclear how to run JUnit tests against the statically compiled version and it may not even make sense (because the tests would pin a bunch of code that might get stripped in the real app so what are you really testing?).
b. It'd break delta updates. Not the end of the world, but a factor.
c. I have no idea if we're using any libraries that spin bytecode dynamically. Even if we're not today, what if tomorrow we want to use such a library? Do we have to avoid using it and increase the cost of feature development, or roll back the native image and give our users a nasty performance downgrade? Neither option is attractive. Ideally SubstrateVM would contain a bytecode interpreter and use it when necessary. Lots of issues there but e.g. it'd probably be OK if it's not a general classloader and the code dependencies have to be known AOT.
d. Similar to (c), fully AOT compilation can explode code and thus download size even though many codepaths are cold and only execute once. It'd be nice if a native image could include a mix of bytecode and AOT compiled hotspots.
e. Once you're past the initial interactive stage the program is throughput sensitive. How much of a perf downgrade over HotSpot would we get, if any? With GraalVM EE we could use PGO and not lose any, but the ISV pricing is opaque. At any rate to answer this we have to fix the compatibility issues first. The prospect of improving startup time and then discovering we slowed down the actual builds isn't really appealing (though I suspect in our case AOT wouldn't really hurt much).
f. What if we want to support in-process plugins? Maybe we can use Espresso, but this is a road less travelled (lack of tutorials, well documented examples etc).
An interesting possibility is using a mix of approaches. For the bash competitor I mentioned earlier dynamic code loading is needed because the script bytecode is loaded into the host JVM, but the Kotlin compiler itself could theoretically be statically compiled to a JNI or Panama-accessible library. We tried this before and hit compatibility errors, but didn't make any effort to resolve them.
4. What about CRaC? It's Linux-only so isn't interesting to us, given that most devs are on Windows/macOS. The benefits for Linux servers are clear, though. Obvious question: can you make a snapshot on one machine/Linux distro and resume it on a totally different one, or does it require homogeneous infrastructure?
5. A big reason AppCDS is nice is that we get to keep the open world. This isn't only about compatibility; open worlds are just better. The most popular way to get software onto desktop machines is Chrome, and the web is totally open world. Apps are downloaded incrementally as the user navigates around, and companies exploit this fact aggressively. Large web sites can be far larger than would be considered practical to distribute to end user machines, and can easily update 50 times a day. Web developers have to think about latency on specific interactions, but they don't have to think about the size of the entire app, and that allows them to scale up feature sets as fast as funding allows. In contrast, the closed-world mobile versions of their sites are a parade of horror stories in which firms have to e.g. hotpatch Dalvik to work around method count limits (Facebook), or in which code size issues nearly wrecked the entire company (Uber):
https://twitter.com/StanTwinB/status/1336914412708405248
Right now code size isn't a particularly serious problem for us, but the ease of including open source libraries means footprint grows all the time. Especially for our shell scripting tool, there are tons of cool features that could be added but if we did all of them we'd probably end up with 500mb of bytecode. With an open world features can be downloaded on the fly as they get used and you can build a plugin ecosystem.
The new more incremental direction of Leyden is thus welcomed and appreciated, because it feels like a lot of ground can be covered by "small" changes like upgrading AppCDS and caching compiled hotspots. Even if the results aren't as impressive as with native-image, the benefits of keeping an open world can probably make up for it, at least for our use cases.
Hi Ioi, We're using a JDK 17 build with a few backports. Unfortunately the default CDS archive goes missing during jlinking. It's an easy fix. Actually, the product in question is a packaging tool; it's not only for the JVM but it supports JVM apps quite well, and re-creating the CDS archive post-jlink is on the list of features to add. It's packaged with itself, so that'll fix it for our apps too.
$ javac -J-XX:+AutoCreateSharedArchive -J-XX:SharedArchiveFile=javac.jsa HelloWorld.java
javac.jsa will be automatically created if it doesn't exist, or if it's not compatible with the JVM (e.g., if you have upgraded to a newer JDK).
Yes, that's a nice improvement in usability. By the way, don't forget -Xlog:cds=off because otherwise CDS likes to write lots of warnings to the terminal (not a great look for a CLI app).
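Put together, a typical quiet CDS-enabled launch might look like this (the archive and jar names are placeholders, not our actual product):

```shell
# Hypothetical launcher invocation for a CLI app (mytool.jsa and
# mytool.jar are placeholders). Use the shipped archive and silence
# CDS logging so stray warnings don't end up in the terminal output
# seen by users.
java -XX:SharedArchiveFile=mytool.jsa \
     -Xlog:cds=off \
     -jar mytool.jar --help
```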
The dynamic CDS dumping happens when the JVM exits.
Yes, but it seems to slow down execution before that time as well. Here are some timings for our app to parse CLI options, read the build config, compute the task graph, print the available tasks, and reach the end of main():

- With CDS off: ~0.8 seconds
- With CDS dumping active: ~1.25 seconds
- With CDS active: ~0.6 seconds

So the app appears to run ~50% slower when dynamic dumping is active, and that's not including the dump time itself. That's why I'm suggesting doing it in the background as a totally separate post-install step (with background forking required on platforms that don't support or strongly discourage install scripts). I get the impression this may not be expected? Is the JVM genuinely doing extra work at runtime when dynamic dumping is active?
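For reference, the three configurations being compared roughly correspond to invocations like these (mytool.jar and the "tasks" command are placeholders; I'm not claiming this exact harness produced the numbers above):

```shell
# CDS fully disabled
time java -Xshare:off -jar mytool.jar tasks

# Dynamic dumping active: the archive is written when the JVM exits
time java -XX:ArchiveClassesAtExit=dump.jsa -jar mytool.jar tasks

# CDS active: subsequent runs load the previously dumped archive
time java -XX:SharedArchiveFile=dump.jsa -jar mytool.jar tasks
```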
Maybe we could have some sort of daemon that collects profiling data in the background, and update the archives when the application behavior is more understood.
Sure, the ideal would be something like "always dumping" mode in which there's no slowdown. So you just give the JVM a directory (or >1 directory) and it caches internal structures, JITd code and persistent heap snapshots there. Fire and forget. Then if you want to trade off bandwidth vs first run time you can pre-populate the first directory in the list with the results of a short run, like just getting to first pixels for a desktop app or flag handling for a CLI app, and any additional data generated goes into the second directory. Bonus points if you find a way to share those directories over an NFS mount - then you have a JIT server 'for free' in cloud deployments.
The dynamic archive will be smaller, because it doesn't need to duplicate the built-in classes that are already in the static archive.
Right. That's true. I'd forgotten that you can combine them like that. So we could ship a small static archive in the download that just accelerates time-to-first-interaction, and generate a larger dump client-side in the background that covers the whole execution.
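Concretely, the layering could look something like this (archive names are placeholders):

```shell
# Ship a small static archive (base.jsa) with the app, then generate a
# dynamic archive on top of it client-side. Classes already present in
# base.jsa are not duplicated in the dynamic archive, so it stays small.
java -XX:SharedArchiveFile=base.jsa \
     -XX:ArchiveClassesAtExit=full.jsa \
     -jar mytool.jar build

# Later runs load both: the static base first, then the dynamic top.
# (Use ';' instead of ':' as the path separator on Windows.)
java -XX:SharedArchiveFile=base.jsa:full.jsa -jar mytool.jar build
```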
Will you have a similar problem if the JAR file of the application is maliciously modified?
If they're downloaded and stored in the home directory, yes, but JARs support code signing with per-file hashing, so there's a way to fix that built into the platform. If they're just shipped as data files in the app then it doesn't matter, because they're signed and tamperproofed using OS-specific mechanisms. All this is a bit theoretical. IntelliJ downloads unsigned JARs as plugins and nobody seems to care. It's possible that's because it doesn't request any special privileges so there's nothing to attack, but on macOS things as basic as access to ~/Downloads are a permission these days. Also, JetBrains are moving to code signing their JARs anyway. So ... yeah. Like I said, hard to know how much to really care about this. It might be one of those things that doesn't matter until the day it does.
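For the downloaded-JAR case, the built-in verification is just this (plugin.jar is a placeholder name):

```shell
# Verify a signed plugin JAR before loading it. With -strict, warnings
# such as unsigned entries cause a non-zero exit code instead of merely
# being printed, so the result can gate loading.
jarsigner -verify -strict plugin.jar
```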
What could be modified is the vtptr of archived MetaData objects. They usually point to somewhere near 0x800000000 (where the vtables are) but the attacker could modify them to point to arbitrary locations. I am not sure if this type of attack is easier than modifying the JAR files, or not.
Well, the issue here is a combination of where the files are generated and performance. Again it's all a bit theoretical, because the performance discussion is rooted in the "disk access is slow" world, which isn't really true anymore. I've done some casual tests on my laptop and did appear to see a real slowdown from this "hash whole file on open" effect, but it was a while ago and it wasn't rigorous at all. It's also a total PITA to reproduce because there's no explicit way to flush the cache, so you have to constantly re-copy signed binaries over and over to force kernel cache misses. If I explained how I measured this, Alexey Shipilev would yell at me :) so I'll just leave it here as food for thought instead.

And yeah, it's also not clear how much the uncached times matter these days. Years ago it mattered a lot because people rebooted their machines often, but Macs hibernate all the time and reboot only rarely, so the caches will remain warm.

I don't think treating AppCDS archives as hostile in the JVM itself would be worth it. This is a Mac-specific issue and that would be a major constraint, e.g. it'd mean you can't cache JITd native code in the archives. Doesn't make sense. Better to tamperproof unbundled archives in other ways, like computing ad-hoc signatures and stashing the CodeDirectory in an xattr (if it ever matters).
participants (4)
- Andrew Dinn
- Anton Kozlov
- Ioi Lam
- Mike Hearn