premain: negative lookup cache for class loaders
ioi.lam at oracle.com
Tue Jan 16 04:44:15 UTC 2024
I think it's worth experimenting to see how much saving can be achieved.
Although the current estimate may be small (11 ms out of 750 ms), it may
be a code path that's difficult to optimize down the road.
I think we can significantly improve over the current performance by
shifting more Java computation to build time (e.g., making a heap
snapshot of computed constants, running <clinit> at build time, etc). We
should understand where the frameworks are doing such negative class
lookups, and see if they can be time shifted (or otherwise avoided).
If the answer is no, then I think it's worthwhile to implement a
*runtime* negative lookup cache. As the overall start-up time goes down,
the relative cost of negative class lookup will increase. For example,
if it becomes 9 ms out of 250 ms (3.6%), then it will be more significant.
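For concreteness, a runtime negative lookup cache on a well behaved
class loader could look roughly like the sketch below. This is plain
Java with illustrative names -- none of it is premain code -- and the
invalidation hook is the price of correctness if the class path can
ever change:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: a loader that makes failed lookups "sticky" by
// remembering the names that have already failed to load.
class NegativeCachingLoader extends ClassLoader {
    // Maps a class name that failed to load to its cached exception.
    private final Map<String, ClassNotFoundException> negativeCache =
            new ConcurrentHashMap<>();

    NegativeCachingLoader(ClassLoader parent) {
        super(parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        ClassNotFoundException cached = negativeCache.get(name);
        if (cached != null) {
            throw cached; // sticky failure: no re-probing, no new exception
        }
        try {
            return super.loadClass(name, resolve);
        } catch (ClassNotFoundException e) {
            negativeCache.put(name, e);
            throw e;
        }
    }

    // Must be called if the class path is ever extended or mutated, so
    // that newly visible classes are not masked by stale failures.
    void invalidateNegativeCache() {
        negativeCache.clear();
    }
}
```

Successful lookups are untouched (they are already cached by the system
dictionary); only the failure path changes.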
Also, are you testing with "AOT" mode for Spring Petclinic -- it's a
special packaging mode where a lot of the symbolic information is
resolved at build time, so perhaps it makes much less use of
negative class lookup?
https://github.com/openjdk/leyden/tree/premain/test/hotspot/jtreg/premain/spring-petclinic
Thanks
- Ioi
On 1/12/24 12:32 PM, Ashutosh Mehra wrote:
>
> Ashutosh, you and your team have mentioned that there are tens of
> milliseconds (several percentage points of time) consumed during
> startup of some workloads by /failed/ lookups
>
>
> While working on Vladimir's suggestion to
> profile JVM_FindClassFromCaller, I realized I had made a mistake in my
> earlier attempt to profile the Class.forName method.
> Sadly, once I fixed that bug, the time spent in failed lookups is no
> longer that significant.
>
> This is the patch
> <https://github.com/ashu-mehra/leyden/commit/0bd59831b387358b60b9f38080ff09081512679a>
> I have for profiling Class.forName at Java level. It shows the time
> spent in Class.forName for negative lookups for app class loader.
> For Quarkus app that I am using, the patch reports 11ms which is 1.4%
> of the startup time (of 750 ms).
> For Springboot-petclinic app the patch reports 36ms which is 1.1% of
> the startup time (of 3250ms).
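The idea behind that Java-level measurement can be sketched as a wrapper
that accumulates time only on the failure path of Class.forName. This
is just an illustration of the approach, not the linked patch:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only -- NOT the linked patch. Accumulates the time spent
// in Class.forName calls that end in ClassNotFoundException, i.e. the
// negative-lookup path.
final class ForNameProfiler {
    static final AtomicLong negativeNanos = new AtomicLong();

    static Class<?> timedForName(String name) throws ClassNotFoundException {
        long start = System.nanoTime();
        try {
            return Class.forName(name);
        } catch (ClassNotFoundException e) {
            // Charge only failed lookups to the counter.
            negativeNanos.addAndGet(System.nanoTime() - start);
            throw e;
        }
    }
}
```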
>
> The other patch
> <https://github.com/ashu-mehra/leyden/commit/3923adbb2a3e3291965dd5b85cb7a918db555117>
> I have is for profiling JVM_FindClassFromCaller when it throws an
> exception for the app classloader.
> For Quarkus app the patch reports 5ms.
> For Springboot-petclinic app the patch reports 25ms.
>
> Given these numbers, @JohnR do you think it is still worth spending
> time on the negative cache for the class loaders?
> And sorry for reporting incorrect numbers earlier.
>
> Thanks,
> - Ashutosh Mehra
>
>
> On Thu, Jan 11, 2024 at 7:39 PM Vladimir Ivanov
> <vladimir.x.ivanov at oracle.com> wrote:
>
>
> > We know that /successful/ lookups go fast the second time
> because the VM
> > caches the result in a central system dictionary. And, CDS
> technology
> > makes successful lookups go fast the /first time/, if the lookup
> was
> > performed in a training run and the resulting state stored in a CDS
> > archive. (Those who watch our premain branch will see that there
> is lots
> > of low-hanging fruit in CDS, that we are only beginning to enjoy.)
>
> Even though repeated successful lookups are already fast, it is still
> beneficial to optimize them. For example, class pre-loading and CP
> entry
> pre-resolution are implemented in premain and do give noticeable
> startup
> improvements.
>
> And repeated successful lookups are common when it comes to
> Class.forName(). For example, PetClinic deployment run experiences
> 10k
> calls into JVM_FindClassFromCaller which cost ~20ms (measured on
> M1 Pro).
>
> So, while the negative lookup cache looks like the lowest hanging fruit,
> it's worth considering the positive lookup caching scenario as well.
>
> Best regards,
> Vladimir Ivanov
>
> > But, a /failed/ lookup is not recorded anywhere. So every distinct
> > lookup must start again from first principles and fail all over
> again.
> > For some workloads this costs a small but measurable percentage of
> > startup time.
> >
> > The story is different for the local |CONSTANT_Class| entries in
> any
> > given classfile: The JVMS mandates that both successful and failed
> > lookups are recorded on the first attempt (per CP entry per se, not
> > globally and not per class). Global usage includes both use of
> > |Class.forName| and the “back end” logic for CP entry
> resolution. CP
> > resolution is performed at most once per CP entry, and (win or
> lose) is
> > made sticky on the CP itself, locally.
> >
> > To summarize, we can say that, for class lookup, both success and
> > failure are “sticky” locally, and success is “sticky” globally, but
> > failure is “not sticky” globally.
> >
> > The global behavior can be thought of either specific to a class
> loader
> > (i.e., coded in JDK code) or as something in the VM or JNI code
> that
> > works with the JDK code. In reality it is an emergent property of a
> > number of small details in both.
> >
> > A /negative lookup cache/ is a collection of class names (for a
> given
> > loader) which have already failed to load. “Sticky failure”
> could be
> > implemented with a negative lookup cache, either on a class
> loader (my
> > preferred solution, I think) or else somewhere in the VM
> internals that
> > participate in class loading paths.
> >
> > The benefits are obvious: Startup could be shorter by tens of
> > milliseconds. The eliminated operations include re-creating
> exceptions,
> > and throwing and catching them, and (maybe) uselessly re-probing
> the
> > file system.
> >
> > The risks include at least two cases. First, a user might somehow
> > contrive to extend the class path after a failure has been made
> sticky,
> > and then the user could be disappointed when a class appears on
> the new
> > class path components that satisfies the load. Second, a user might
> > somehow contrive to mutate an existing class path component (by
> writing
> > a file into a directory, say), and have the same disappointment
> of not
> > seeing the classfile get picked up on the next request.
> >
> > But it seems to me that a negative lookup cache is a legitimate
> > optimization /for well behaved class loaders/. (Please check my
> work
> > here!) The precondition is that the well behaved class loader takes
> > its input only from sources that cannot be updated after the VM has
> > started running. Or,
> > if and when those inputs are updated somehow, the negative cache
> must be
> > invalidated, at least for classes that could possibly be loaded
> from the
> > updated parts. You can sometimes reason from the package prefix
> and from
> > the class path updates that some name cannot be read from some
> class
> > path element, just because of a missing directory.
> >
> > A CDS archive records its class path, and can detect whether
> that class
> > path reads only from an immutable backing store. (This is a
> sweet spot
> > for Leyden.) If that is the case, then the CDS archive could
> also store
> > a negative lookup cache (for each eligible class loader). I
> think this
> > should be done in Java code and the relevant field and its data
> > special-cased to be retained via CDS.
> >
> > (I mean “special-cased” the way we already special-case some other
> > selected data, like the module graph and integer box cache. As with
> > framework-defined class loaders, we may have a conversation in the
> > future about letting user code into this little game as well.
> But it has
> > to be done in a way that does not violate any specification,
> which makes
> > it challenging. One step at a time.)
> >
> > For immediate prototyping and testing of the concept, we don’t
> need to
> > bring CDS into the picture. We can just have a global flag that
> says “it
> > is safe to use a negative lookup cache”. But to roll out this
> > optimization in a product, the flag needs to be automatically
> set to a
> > safe value, probably by CDS at startup, based on an inspection
> > of the
> > class path settings in both training and deployment runs. And of
> course
> > (as a separate step) we can pre-populate the caches at CDS dump
> time
> > (that is, after a training run), so that the deployed
> application can
> > immediately benefit from the cache, and spend zero time
> exploring the
> > class path for classes that are known to be missing.
> >
> > BTW, I think it is just fine to throw a pre-constructed
> exception when
> > the negative lookup cache hits, even though some users will
> complain
> > that such exceptions are lacking meaningful messages and
> backtraces.
> > It’s within spec. HotSpot does this for certain “hot throws” of
> built-in
> > exceptions; see |GraphKit::builtin_throw|, and see also the
> tricky logic
> > that makes failures sticky in CP entries (which edits down the
> exception
> > information). As a compromise, the negative lookup cache could
> store an
> > exception object whose message is the class name (but with no
> backtrace).
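That compromise can be sketched with an illustrative helper (not HotSpot
or JDK code): a pre-constructed ClassNotFoundException that carries the
class name as its message, with the stack walk suppressed by overriding
fillInStackTrace:

```java
// Illustrative helper, not HotSpot or JDK code: a ClassNotFoundException
// with a meaningful message but no backtrace, cheap enough to cache and
// rethrow on every negative-cache hit.
final class StacklessThrows {
    static ClassNotFoundException stackless(String className) {
        return new ClassNotFoundException(className) {
            @Override
            public synchronized Throwable fillInStackTrace() {
                return this; // skip the expensive stack walk entirely
            }
        };
    }
}
```

Because fillInStackTrace is overridden before Throwable's constructor
would capture the frames, the cached exception has an empty backtrace
but still names the missing class.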
> >
> > There’s another way to approach this issue, which is to index the
> > class path in such a way that class loaders can respond to
> arbitrary
> > load requests but do little or no work on failing requests. A Bloom
> > filter is sometimes used in such cases to avoid many (not all)
> of the
> > searches. But I think that’s overkill for the use cases we actually
> > observe, which is a large number of failed lookups on a small
> number of
> > class names. A per-loader table mapping a name to an exception
> seems to
> > be a good tradeoff. And as I noted, CDS can pre-populate these
> things
> > eventually.
> >
> > Ashutosh, maybe you are interested in working on some of this? :-)
> >
> > — John
> >
> > P.S. If the negative lookup cache has the right “stability”
> properties,
> > we can even ask the JIT to think about optimizing failing
> > |Class.forName| calls, by consulting the cache at compile time.
> In the
> > Leyden setting, some |Class.forName| calls (not all) can be
> > constant-folded. Perhaps the argument is semi-constant and can be
> > profiled and speculated. Maybe some of that pays off, or maybe not;
> > probably not since the |forName| call is probably buried in a
> stack of
> > middleware. These are ideas for the JIT team to put on their
> very long list.
> >
> > P.P.S. Regarding the two side issues mentioned above…
> >
> > We are not at all forgetting about framework-defined class
> loaders. But
> > for the next few months it is enough to assume that we will
> optimize
> > only class loaders which are defined by the VM+JDK substrate. In
> the
> > future we will want to investigate how to make framework-defined
> loaders
> > compatible with whatever optimizations we create for the well
> behaved
> > JDK class loaders. It is not yet time to discuss that in detail;
> it is
> > time to learn the elements of our craft by working with the well
> behaved
> > class loaders only.
> >
> > The same comment applies to the observation that we might try to
> > “auto-train” applications. That is, get rid of the CDS archive,
> > generated by a separate training run, and just automagically run
> the
> > same application faster the second time, by capturing CDS-like
> states
> > from the first run, treating it “secretly” as a training run. We
> know
> > this can work well on some Java workloads. But we also like the
> > predictability and simplicity of CDS. For HotSpot, it is not yet
> time to
> > work on applying our learnings with CDS to the problem of
> auto-training.
> > I hope that time will come after we have mined out more of the
> basic
> > potential of CDS. For now we are working on the “one-step
> workflow”,
> > where there is an explicit training phase that generates CDS. The
> > “zero-step workflow” will come in time.
> >
>