premain: negative lookup cache for class loaders

Vladimir Ivanov vladimir.x.ivanov at oracle.com
Fri Jan 12 00:39:06 UTC 2024


> We know that /successful/ lookups go fast the second time because the VM 
> caches the result in a central system dictionary. And, CDS technology 
> makes successful lookups go fast the /first time/, if the lookup was 
> performed in a training run and the resulting state stored in a CDS 
> archive. (Those who watch our premain branch will see that there is lots 
> of low-hanging fruit in CDS, that we are only beginning to enjoy.)

Even though repeated successful lookups are already fast, it is still 
beneficial to optimize them. For example, class pre-loading and CP entry 
pre-resolution are implemented in premain and give noticeable startup 
improvements.

And repeated successful lookups are common when it comes to 
Class.forName(). For example, a PetClinic deployment run makes 10k 
calls into JVM_FindClassFromCaller, costing ~20ms (measured on an M1 Pro).

So, while a negative lookup cache looks like the lowest-hanging fruit, 
it's worth considering the positive lookup caching scenario as well.
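
A minimal sketch of what a Java-side positive lookup cache could look 
like (the class name PositiveLookupCache and its shape are hypothetical, 
not premain's actual implementation):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch (not the premain implementation): memoize successful
// Class.forName results so a repeated lookup is served from a Java-side map
// instead of re-entering the JVM via JVM_FindClassFromCaller.
public class PositiveLookupCache {
    private final Map<String, Class<?>> hits = new ConcurrentHashMap<>();

    public Class<?> forName(String name) throws ClassNotFoundException {
        Class<?> c = hits.get(name);
        if (c != null) {
            return c;                  // repeated successful lookup: cache hit
        }
        c = Class.forName(name);       // first lookup: full JVM path
        hits.put(name, c);
        return c;
    }
}
```

The real fix would of course live inside the VM/JDK lookup paths rather 
than wrapping Class.forName() like this; the sketch only models where 
the saved work is.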

Best regards,
Vladimir Ivanov

> But, a /failed/ lookup is not recorded anywhere. So every distinct 
> lookup must start again from first principles and fail all over again. 
> For some workloads this costs a small but measurable percentage of 
> startup time.
> 
> The story is different for the local |CONSTANT_Class| entries in any 
> given classfile: The JVMS mandates that both successful and failed 
> lookups are recorded on the first attempt (per CP entry per se, not 
> globally and not per class). Global usage includes both use of 
> |Class.forName| and the “back end” logic for CP entry resolution. CP 
> resolution is performed at most once per CP entry, and (win or lose) is 
> made sticky on the CP itself, locally.
> 
> To summarize, we can say that, for class lookup, both success and 
> failure are “sticky” locally, and success is “sticky” globally, but 
> failure is “not sticky” globally.
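
The global non-stickiness of failure is easy to observe (the name 
no.such.Klass below is just a placeholder): each failing Class.forName 
call repeats the full search and constructs a fresh exception.

```java
// Demonstrates that a failed Class.forName lookup is not cached globally:
// each attempt redoes the (failing) search and builds a new exception
// object. "no.such.Klass" is a placeholder name that does not exist.
public class FailedLookupDemo {
    static Throwable tryLoad(String name) {
        try {
            Class.forName(name);
            return null;
        } catch (ClassNotFoundException e) {
            return e;
        }
    }

    public static void main(String[] args) {
        Throwable first = tryLoad("no.such.Klass");
        Throwable second = tryLoad("no.such.Klass");
        // Two distinct exception instances: nothing was made sticky for us.
        System.out.println(first != null && second != null && first != second);
    }
}
```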
> 
> The global behavior can be thought of either specific to a class loader 
> (i.e., coded in JDK code) or as something in the VM or JNI code that 
> works with the JDK code. In reality it is an emergent property of a 
> number of small details in both.
> 
> A /negative lookup cache/ is a collection of class names (for a given 
> loader) which have already failed to load. “Sticky failure” could be 
> implemented with a negative lookup cache, either on a class loader (my 
> preferred solution, I think) or else somewhere in the VM internals that 
> participate in class loading paths.
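
As an illustration only (the class and its names are hypothetical, and a 
production version would live in the JDK's built-in loaders), a 
loader-side negative cache might be sketched like this:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a class loader with a negative lookup cache:
// a name that has already failed to load throws again immediately,
// without re-searching the class path.
class NegativeCachingLoader extends ClassLoader {
    // Names that have already failed to load under this loader.
    private final Set<String> negativeCache = ConcurrentHashMap.newKeySet();

    @Override
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        if (negativeCache.contains(name)) {
            throw new ClassNotFoundException(name);  // sticky failure
        }
        try {
            return super.loadClass(name, resolve);
        } catch (ClassNotFoundException e) {
            negativeCache.add(name);                 // remember the miss
            throw e;
        }
    }

    // Exposed for testing: has this name been recorded as a known miss?
    boolean isKnownMiss(String name) {
        return negativeCache.contains(name);
    }
}
```

This is only safe under the "well behaved" preconditions discussed 
below; any class path mutation would require invalidating negativeCache.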
> 
> The benefits are obvious: Startup could be shorter by tens of 
> milliseconds. The eliminated operations include re-creating exceptions, 
> and throwing and catching them, and (maybe) uselessly re-probing the 
> file system.
> 
> The risks include at least two cases. First, a user might somehow 
> contrive to extend the class path after a failure has been made sticky, 
> and then the user could be disappointed when a class appears on the new 
> class path components that satisfies the load. Second, a user might 
> somehow contrive to mutate an existing class path component (by writing 
> a file into a directory, say), and have the same disappointment of not 
> seeing the classfile get picked up on the next request.
> 
> But it seems to me that a negative lookup cache is a legitimate 
> optimization /for well behaved class loaders/. (Please check my work 
> here!) The precondition is that the well behaved class loader takes its 
> inputs only from sources that cannot be updated after the VM has started 
> running. Or, 
> if and when those inputs are updated somehow, the negative cache must be 
> invalidated, at least for classes that could possibly be loaded from the 
> updated parts. You can sometimes reason from the package prefix and from 
> the class path updates that some name cannot be read from some class 
> path element, just because of a missing directory.
> 
> A CDS archive records its class path, and can detect whether that class 
> path reads only from an immutable backing store. (This is a sweet spot 
> for Leyden.) If that is the case, then the CDS archive could also store 
> a negative lookup cache (for each eligible class loader). I think this 
> should be done in Java code and the relevant field and its data 
> special-cased to be retained via CDS.
> 
> (I mean “special-cased” the way we already special-case some other 
> selected data, like the module graph and integer box cache. As with 
> framework-defined class loaders, we may have a conversation in the 
> future about letting user code into this little game as well. But it has 
> to be done in a way that does not violate any specification, which makes 
> it challenging. One step at a time.)
> 
> For immediate prototyping and testing of the concept, we don’t need to 
> bring CDS into the picture. We can just have a global flag that says “it 
> is safe to use a negative lookup cache”. But to roll out this 
> optimization in a product, the flag needs to be automatically set to a 
> safe value, probably by CDS at startup, based on an inspection of the 
> class path settings in both training and deployment runs. And of course 
> (as a separate step) we can pre-populate the caches at CDS dump time 
> (that is, after a training run), so that the deployed application can 
> immediately benefit from the cache, and spend zero time exploring the 
> class path for classes that are known to be missing.
> 
> BTW, I think it is just fine to throw a pre-constructed exception when 
> the negative lookup cache hits, even though some users will complain 
> that such exceptions are lacking meaningful messages and backtraces. 
> It’s within spec. HotSpot does this for certain “hot throws” of built-in 
> exceptions; see |GraphKit::builtin_throw|, and see also the tricky logic 
> that makes failures sticky in CP entries (which edits down the exception 
> information). As a compromise, the negative lookup cache could store an 
> exception object whose message is the class name (but with no backtrace).
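
That compromise can be sketched with the standard fillInStackTrace() 
override idiom (the subclass name here is hypothetical): the cached 
exception carries the class name as its message but no backtrace.

```java
// Hypothetical sketch: an exception suitable for storing in a negative
// lookup cache. Overriding fillInStackTrace() to do nothing means the
// object carries a message (the class name) but an empty stack trace,
// making it cheap to construct once and safe to re-throw repeatedly.
public class CachedClassNotFoundException extends ClassNotFoundException {
    public CachedClassNotFoundException(String className) {
        super(className);
    }

    @Override
    public synchronized Throwable fillInStackTrace() {
        return this;  // skip the expensive native backtrace capture
    }

    public static void main(String[] args) {
        ClassNotFoundException cached =
                new CachedClassNotFoundException("no.such.Klass");
        System.out.println(cached.getMessage());            // no.such.Klass
        System.out.println(cached.getStackTrace().length);  // 0
    }
}
```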
> 
> There’s another way to approach this issue, which is to index the 
> class path in such a way that class loaders can respond to arbitrary 
> load requests but do little or no work on failing requests. A Bloom 
> filter is sometimes used in such cases to avoid many (not all) of the 
> searches. But I think that’s overkill for the use cases we actually 
> observe, namely a large number of failed lookups on a small number of 
> class names. A per-loader table mapping a name to an exception seems to 
> be a good tradeoff. And as I noted, CDS can pre-populate these things 
> eventually.
> 
> Ashutosh, maybe you are interested in working on some of this? :-)
> 
> — John
> 
> P.S. If the negative lookup cache has the right “stability” properties, 
> we can even ask the JIT to think about optimizing failing 
> |Class.forName| calls, by consulting the cache at compile time. In the 
> Leyden setting, some |Class.forName| calls (not all) can be 
> constant-folded. Perhaps the argument is semi-constant and can be 
> profiled and speculated. Maybe some of that pays off, or maybe not; 
> probably not since the |forName| call is probably buried in a stack of 
> middleware. These are ideas for the JIT team to put on their very long list.
> 
> P.P.S. Regarding the two side issues mentioned above…
> 
> We are not at all forgetting about framework-defined class loaders. But 
> for the next few months it is enough to assume that we will optimize 
> only class loaders which are defined by the VM+JDK substrate. In the 
> future we will want to investigate how to make framework-defined loaders 
> compatible with whatever optimizations we create for the well behaved 
> JDK class loaders. It is not yet time to discuss that in detail; it is 
> time to learn the elements of our craft by working with the well behaved 
> class loaders only.
> 
> The same comment applies to the observation that we might try to 
> “auto-train” applications. That is, get rid of the CDS archive, 
> generated by a separate training run, and just automagically run the 
> same application faster the second time, by capturing CDS-like states 
> from the first run, treating it “secretly” as a training run. We know 
> this can work well on some Java workloads. But we also like the 
> predictability and simplicity of CDS. For HotSpot, it is not yet time to 
> work on applying our learnings with CDS to the problem of auto-training. 
> I hope that time will come after we have mined out more of the basic 
> potential of CDS. For now we are working on the “one-step workflow”, 
> where there is an explicit training phase that generates CDS. The 
> “zero-step workflow” will come in time.
> 


More information about the leyden-dev mailing list