<!DOCTYPE html><html><head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body>

    <p>I think it's worth experimenting to see how much saving can be

      achieved.</p>

    <p>Although the current estimate may be small (11 ms out of 750 ms),

      it may be a code path that's difficult to optimize down the road.</p>

    <p>I think we can significantly improve over the current performance

      by shifting more Java computation to build time (e.g., making a

      heap snapshot of computed constants, running &lt;clinit&gt; at

      build time, etc). We should understand where the frameworks are

      doing such negative class lookups, and see if they can be time

      shifted (or otherwise avoided).</p>

    <p>If the answer is no, then I think it's worthwhile to implement a

      *runtime* negative lookup cache. As the overall start-up time goes

      down, the relative cost of negative class lookup will increase. For

      example, if it becomes 9 ms out of 250 ms (3.6%), then it will be more

      significant.</p>
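    <p>To make the idea concrete, here is a minimal sketch of what a runtime negative lookup cache could look like at the Java level. It is only an illustration, not code from the JDK or the premain branch: the names <code>NegativeCachingLoader</code>, <code>CachedClassNotFoundException</code>, <code>invalidate</code> and <code>cachedFailures</code> are hypothetical, and it assumes the parent loader reads from a class path that cannot change while the VM is running.</p>

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch: a loader that remembers failed lookups.
 * Assumption: the parent's inputs cannot change while the VM runs.
 */
class NegativeCachingLoader extends ClassLoader {

    /** Stackless exception, so a cache hit does not pay for a backtrace. */
    static final class CachedClassNotFoundException extends ClassNotFoundException {
        CachedClassNotFoundException(String className) {
            super(className);
        }

        @Override
        public synchronized Throwable fillInStackTrace() {
            return this; // skip the stack walk
        }
    }

    // Per-loader table: class name -> exception to rethrow on later lookups.
    private final Map<String, ClassNotFoundException> failed = new ConcurrentHashMap<>();

    NegativeCachingLoader(ClassLoader parent) {
        super(parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        ClassNotFoundException cached = failed.get(name);
        if (cached != null) {
            throw cached; // sticky failure: no delegation, no class path probing
        }
        try {
            return super.loadClass(name, resolve);
        } catch (ClassNotFoundException e) {
            failed.putIfAbsent(name, new CachedClassNotFoundException(name));
            throw e; // the first failure keeps its full message and backtrace
        }
    }

    /** Must be called if the underlying class path is ever extended or mutated. */
    void invalidate() {
        failed.clear();
    }

    /** Number of distinct names currently cached as missing. */
    int cachedFailures() {
        return failed.size();
    }
}
```

    <p>In this sketch the first failed lookup behaves as it does today, and only repeated lookups of the same name take the short path. The cached exception carries the class name as its message but no backtrace, which is the trade-off discussed in the quoted message below.</p>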

    <p>Also, are you testing with "AOT" mode for Spring Petclinic --

      it's a special packaging mode where a lot of the symbolic

      information is resolved at build time, so perhaps it will have a

      much lower use of negative class lookup?<br>

    </p>

    <p><a class="moz-txt-link-freetext" href="https://github.com/openjdk/leyden/tree/premain/test/hotspot/jtreg/premain/spring-petclinic">https://github.com/openjdk/leyden/tree/premain/test/hotspot/jtreg/premain/spring-petclinic</a><br>

    </p>

    <p>Thanks</p>

    <p>- Ioi<br>

    </p>

    <p><br>

    </p>

    On 1/12/24 12:32 PM, Ashutosh Mehra wrote:<br>

    <blockquote type="cite" cite="mid:CAKt0pyRreF2JkWvRbe75OKg_jbk+5YU_4tzAPHcWN3mXkz5-xg@mail.gmail.com">

      

      <div dir="ltr">

        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span style="font-family:sans-serif">Ashutosh, you and your team

            have mentioned that there are tens of milliseconds (several

            percentage points of time) consumed during startup of some

            workloads by </span><em style="font-family:sans-serif">failed</em><span style="font-family:sans-serif"> lookups</span> </blockquote>

        <div><br>

        </div>

        <div>While working on Vladimir's suggestion to

          profile JVM_FindClassFromCaller, I realized I had made a

          mistake in my earlier attempt to profile the Class.forName

          method.</div>

        <div>Sadly once I fixed that bug, the time spent in failed

          lookups is not that significant any more.</div>

        <div><br>

        </div>

        <div>This is the <a href="https://github.com/ashu-mehra/leyden/commit/0bd59831b387358b60b9f38080ff09081512679a" moz-do-not-send="true">patch</a> I have for profiling

          Class.forName at the Java level. It shows the time spent in

          Class.forName for negative lookups for the app class loader. </div>

        <div>For the Quarkus app that I am using, the patch reports 11 ms,

          which is 1.4% of the startup time (of 750 ms).</div>

        <div>For the Springboot-petclinic app the patch reports 36 ms, which

          is 1.1% of the startup time (of 3250 ms).</div>
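        <div>For illustration only, the same kind of measurement can be approximated at application level. The sketch below is not the linked patch (which instruments the JDK itself); <code>FailedLookupTimer</code> and its methods are hypothetical names, and it simply accumulates the wall-clock time of Class.forName calls that end in ClassNotFoundException.</div>

```java
/**
 * Hypothetical helper: accumulates the time spent in failed Class.forName calls.
 * Not thread-safe; intended for a single-threaded experiment.
 */
final class FailedLookupTimer {
    private long failedNanos; // total time spent in lookups that threw
    private int failedCount;  // number of lookups that threw

    /** Looks up a class; returns null (and records the cost) when it is missing. */
    Class<?> tryLoad(String name, ClassLoader loader) {
        long start = System.nanoTime();
        try {
            return Class.forName(name, false, loader);
        } catch (ClassNotFoundException e) {
            failedNanos += System.nanoTime() - start;
            failedCount++;
            return null;
        }
    }

    int failedCount() {
        return failedCount;
    }

    double failedMillis() {
        return failedNanos / 1_000_000.0;
    }

    /** Share of a total, in percent (e.g. 36 ms of 3250 ms is about 1.1). */
    static double percentOf(double partMillis, double totalMillis) {
        return 100.0 * partMillis / totalMillis;
    }
}
```

        <div>A wrapper like this only sees calls that are routed through it, so it would under-report compared with instrumenting Class.forName or JVM_FindClassFromCaller directly.</div>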

        <div><br>

        </div>

        <div>The other <a href="https://github.com/ashu-mehra/leyden/commit/3923adbb2a3e3291965dd5b85cb7a918db555117" moz-do-not-send="true">patch</a> I have is for profiling

          JVM_FindClassFromCaller when it throws an exception for the

          app classloader.</div>

        <div>

          <div>For the Quarkus app the patch reports 5 ms.</div>

          <div>For the Springboot-petclinic app the patch reports 25 ms.</div>

        </div>

        <div><br>

        </div>

        <div>Given these numbers, @JohnR do you think it is still worth

          spending time on the negative cache for the class loaders?<br>

        </div>

        <div>And sorry for reporting incorrect numbers earlier.</div>

        <div><br>

        </div>

        <div>Thanks,</div>

        <div>

          <div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature">

            <div dir="ltr">- Ashutosh Mehra</div>

          </div>

        </div>

        <br>

      </div>

      <br>

      <div class="gmail_quote">

        <div dir="ltr" class="gmail_attr">On Thu, Jan 11, 2024 at

          7:39 PM Vladimir Ivanov &lt;<a href="mailto:vladimir.x.ivanov@oracle.com" moz-do-not-send="true" class="moz-txt-link-freetext">vladimir.x.ivanov@oracle.com</a>&gt;

          wrote:<br>

        </div>

        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

          > We know that /successful/ lookups go fast the second time

          because the VM <br>

          > caches the result in a central system dictionary. And,

          CDS technology <br>

          > makes successful lookups go fast the /first time/, if the

          lookup was <br>

          > performed in a training run and the resulting state

          stored in a CDS <br>

          > archive. (Those who watch our premain branch will see

          that there is lots <br>

          > of low-hanging fruit in CDS, that we are only beginning

          to enjoy.)<br>

          <br>

          Even though repeated successful lookups are already fast it is

          still <br>

          beneficial to optimize them. For example, class pre-loading

          and CP entry <br>

          pre-resolution are implemented in premain and do give

          noticeable startup <br>

          improvements.<br>

          <br>

          And repeated successful lookups are common when it comes to <br>

          Class.forName(). For example, a PetClinic deployment run

          experiences 10k <br>

          calls into JVM_FindClassFromCaller which cost ~20ms (measured

          on M1 Pro).<br>

          <br>

          So, while a negative lookup cache looks like the lowest-hanging

          fruit, <br>

          it's worth considering the positive lookup caching scenario as

          well.<br>

          <br>

          Best regards,<br>

          Vladimir Ivanov<br>

          <br>

          > But, a /failed/ lookup is not recorded anywhere. So every

          distinct <br>

          > lookup must start again from first principles and fail

          all over again. <br>

          > For some workloads this costs a small but measurable

          percentage of <br>

          > startup time.<br>

          > <br>

          > The story is different for the local |CONSTANT_Class|

          entries in any <br>

          > given classfile: The JVMS mandates that both successful

          and failed <br>

          > lookups are recorded on the first attempt (per CP entry

          per se, not <br>

          > globally and not per class). Global usage includes both

          use of <br>

          > |Class.forName| and the “back end” logic for CP entry

          resolution. CP <br>

          > resolution is performed at most once per CP entry, and

          (win or lose) is <br>

          > made sticky on the CP itself, locally.<br>

          > <br>

          > To summarize, we can say that, for class lookup, both

          success and <br>

          > failure are “sticky” locally, and success is “sticky”

          globally, but <br>

          > failure is “not sticky” globally.<br>

          > <br>

          > The global behavior can be thought of either specific to

          a class loader <br>

          > (i.e., coded in JDK code) or as something in the VM or

          JNI code that <br>

          > works with the JDK code. In reality it is an emergent

          property of a <br>

          > number of small details in both.<br>

          > <br>

          > A /negative lookup cache/ is a collection of class names

          (for a given <br>

          > loader) which have already failed to load. “Sticky

          failure” could be <br>

          > implemented with a negative lookup cache, either on a

          class loader (my <br>

          > preferred solution, I think) or else somewhere in the VM

          internals that <br>

          > participate in class loading paths.<br>

          > <br>

          > The benefits are obvious: Startup could be shorter by

          tens of <br>

          > milliseconds. The eliminated operations include

          re-creating exceptions, <br>

          > and throwing and catching them, and (maybe) uselessly

          re-probing the <br>

          > file system.<br>

          > <br>

          > The risks include at least two cases. First, a user might

          somehow <br>

          > contrive to extend the class path after a failure has

          been made sticky, <br>

          > and then the user could be disappointed when a class

          appears on the new <br>

          > class path components that satisfies the load. Second, a

          user might <br>

          > somehow contrive to mutate an existing class path

          component (by writing <br>

          > a file into a directory, say), and have the same

          disappointment of not <br>

          > seeing the classfile get picked up on the next request.<br>

          > <br>

          > But it seems to me that a negative lookup cache is a

          legitimate <br>

          > optimization /for well behaved class loaders/. (Please

          check my work <br>

          > here!) The precondition is that the well behaved class loader

          takes its input <br>

          > from sources that cannot be updated after the VM has

          started running. Or, <br>

          > if and when those inputs are updated somehow, the

          negative cache must be <br>

          > invalidated, at least for classes that could possibly be

          loaded from the <br>

          > updated parts. You can sometimes reason from the package

          prefix and from <br>

          > the class path updates that some name cannot be read from

          some class <br>

          > path element, just because of a missing directory.<br>

          > <br>

          > A CDS archive records its class path, and can detect

          whether that class <br>

          > path reads only from an immutable backing store. (This is

          a sweet spot <br>

          > for Leyden.) If that is the case, then the CDS archive

          could also store <br>

          > a negative lookup cache (for each eligible class loader).

          I think this <br>

          > should be done in Java code and the relevant field and

          its data <br>

          > special-cased to be retained via CDS.<br>

          > <br>

          > (I mean “special-cased” the way we already special-case

          some other <br>

          > selected data, like the module graph and integer box

          cache. As with <br>

          > framework-defined class loaders, we may have a

          conversation in the <br>

          > future about letting user code into this little game as

          well. But it has <br>

          > to be done in a way that does not violate any

          specification, which makes <br>

          > it challenging. One step at a time.)<br>

          > <br>

          > For immediate prototyping and testing of the concept, we

          don’t need to <br>

          > bring CDS into the picture. We can just have a global

          flag that says “it <br>

          > is safe to use a negative lookup cache”. But to roll out

          this <br>

          > optimization in a product, the flag needs to be

          automatically set to a <br>

          > safe value, probably by CDS at startup, based on

          inspection of the <br>

          > class path settings in both training and deployment runs.

          And of course <br>

          > (as a separate step) we can pre-populate the caches at

          CDS dump time <br>

          > (that is, after a training run), so that the deployed

          application can <br>

          > immediately benefit from the cache, and spend zero time

          exploring the <br>

          > class path for classes that are known to be missing.<br>

          > <br>

          > BTW, I think it is just fine to throw a pre-constructed

          exception when <br>

          > the negative lookup cache hits, even though some users

          will complain <br>

          > that such exceptions are lacking meaningful messages and

          backtraces. <br>

          > It’s within spec. HotSpot does this for certain “hot

          throws” of built-in <br>

          > exceptions; see |GraphKit::builtin_throw|, and see also

          the tricky logic <br>

          > that makes failures sticky in CP entries (which edits

          down the exception <br>

          > information). As a compromise, the negative lookup cache

          could store an <br>

          > exception object whose message is the class name (but

          with no backtrace).<br>

          > <br>

          > There’s another way to approach this issue, which is to

          index the <br>

          > class path in such a way that class loaders can respond

          to arbitrary <br>

          > load requests but do little or no work on failing

          requests. A Bloom <br>

          > filter is sometimes used in such cases to avoid many (not

          all) of the <br>

          > searches. But I think that’s overkill for the use cases

          we actually <br>

          > observe, which is a large number of failed lookups on a

          small number of <br>

          > class names. A per-loader table mapping a name to an

          exception seems to <br>

          > be a good tradeoff. And as I noted, CDS can pre-populate

          these things <br>

          > eventually.<br>

          > <br>

          > Ashutosh, maybe you are interested in working on some of

          this? :-)<br>

          > <br>

          > — John<br>

          > <br>

          > P.S. If the negative lookup cache has the right

          “stability” properties, <br>

          > we can even ask the JIT to think about optimizing failing

          <br>

          > |Class.forName| calls, by consulting the cache at compile

          time. In the <br>

          > Leyden setting, some |Class.forName| calls (not all) can

          be <br>

          > constant-folded. Perhaps the argument is semi-constant

          and can be <br>

          > profiled and speculated. Maybe some of that pays off, or

          maybe not; <br>

          > probably not since the |forName| call is probably buried

          in a stack of <br>

          > middleware. These are ideas for the JIT team to put on

          their very long list.<br>

          > <br>

          > P.P.S. Regarding the two side issues mentioned above…<br>

          > <br>

          > We are not at all forgetting about framework-defined

          class loaders. But <br>

          > for the next few months it is enough to assume that we

          will optimize <br>

          > only class loaders which are defined by the VM+JDK

          substrate. In the <br>

          > future we will want to investigate how to make

          framework-defined loaders <br>

          > compatible with whatever optimizations we create for the

          well behaved <br>

          > JDK class loaders. It is not yet time to discuss that in

          detail; it is <br>

          > time to learn the elements of our craft by working with

          the well behaved <br>

          > class loaders only.<br>

          > <br>

          > The same comment applies to the observation that we might

          try to <br>

          > “auto-train” applications. That is, get rid of the CDS

          archive, <br>

          > generated by a separate training run, and just

          automagically run the <br>

          > same application faster the second time, by capturing

          CDS-like states <br>

          > from the first run, treating it “secretly” as a training

          run. We know <br>

          > this can work well on some Java workloads. But we also

          like the <br>

          > predictability and simplicity of CDS. For HotSpot, it is

          not yet time to <br>

          > work on applying our learnings with CDS to the problem of

          auto-training. <br>

          > I hope that time will come after we have mined out more

          of the basic <br>

          > potential of CDS. For now we are working on the “one-step

          workflow”, <br>

          > where there is an explicit training phase that generates

          CDS. The <br>

          > “zero-step workflow” will come in time.<br>

          > <br>

          <br>

        </blockquote>

      </div>

    </blockquote>

  </body>

</html>