premain: performance sniff tests
Ashutosh Mehra
asmehra at redhat.com
Tue Sep 12 15:11:41 UTC 2023
Thanks Vladimir and Ioi for the explanation on custom classloaders.

If I understand correctly, the problem with the order of loading classes is
not limited to classloaders with side effects; it also applies to <clinit>
methods with side effects. If the program output depends on the order in
which classes are loaded, then that order needs to be preserved regardless
of which classloader is involved, right?

If so, why do custom classloaders pose a problem while the built-in loaders
do not? If class initialization barriers are sufficient for the built-in
loaders, why aren't they sufficient for custom loaders? What am I missing?
Thanks,
- Ashutosh Mehra
On Thu, Sep 7, 2023 at 10:08 PM <ioi.lam at oracle.com> wrote:
> On 9/6/23 7:45 PM, Vladimir Ivanov wrote:
> >
> >> There were some experiments with PetClinic on our side before and
> >> it was
> >> noticed that the application relies on custom loaders which
> >> aren't fully
> >> supported yet.
> >>
> >>
> >> Can you please elaborate on the support required for handling custom
> >> classloaders? Do they have an impact on AOT code quality or the
> >> training data?
> >
> > As of now, there are a number of implementation-specific constraints
> > imposed to simplify prototyping and experiments. I'm not sure about
> > training data, but code caching is conservatively disabled for all
> > classes loaded by custom loaders. Also, some recent CDS-specific
> > enhancements in Leyden repository (like class preloading, constant
> > pool entries pre-resolution) are disabled for custom loaders. So, as
> > of today, code loaded by custom loaders doesn't benefit much from the
> > work in Leyden.
> >
> >> Until proper support for custom loaders is there, I suggest to
> >> modify
> >> the benchmark so it relies only on existing system loaders.
> >>
> >>
> >> Is there ongoing work to improve the support for custom loaders?
> >
> > I'll let Ioi comment on the plans for custom loaders, but I wouldn't
> > expect much progress on that front in the short term.
> >
> TL;DR: in the near term, we can probably support only a limited set of
> custom class loader types, if at all.
>
> The main problem with custom class loaders is that they can have side
> effects that are observable at the Java level, so strictly speaking we
> can't even change the order of invocations of ClassLoader.loadClass();
> otherwise we may change the meaning of the program. E.g.,
>
> public static class MyLoader extends URLClassLoader {
>     static int x;   // mutated on every load: observable Java-level state
>
>     public MyLoader(URL[] urls, ClassLoader parent) {
>         super(urls, parent);
>     }
>
>     @Override
>     protected Class<?> loadClass(String name, boolean resolve)
>             throws ClassNotFoundException {
>         x <<= 1;
>         x += name.hashCode();   // value depends on the order of loads
>         System.out.println(x);  // side effect visible on stdout
>         return super.loadClass(name, resolve);
>     }
> }
>
> If our optimizations change the order of class loading, you will have
> unexpected output in stdout.
>
> So if we want to time shift some computation that uses code loaded by
> MyLoader, how can we authentically preserve all the side effects and
> replay them in the production run?
>
> Even if we remove the I/O from the example above, the value of MyLoader.x
> is still observable by other Java code. So how do we make sure that
> MyLoader.x has the correct value at every observation point in the
> production run?
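The order-sensitivity of MyLoader.x can be seen without doing any class loading at all. The sketch below (a hypothetical helper, not part of the original mail) replays the update rule from loadClass over two different load orders of the same two class names and gets two different values:

```java
import java.util.List;

public class LoaderStateDemo {
    // Same update rule as MyLoader.loadClass above:
    // x = (x << 1) + name.hashCode() for each class name loaded.
    static int replay(List<String> loadOrder) {
        int x = 0;
        for (String name : loadOrder) {
            x <<= 1;
            x += name.hashCode();
        }
        return x;
    }

    public static void main(String[] args) {
        // "A".hashCode() == 65, "B".hashCode() == 66
        System.out.println(replay(List.of("A", "B"))); // (0<<1)+65=65, (65<<1)+66=196
        System.out.println(replay(List.of("B", "A"))); // (0<<1)+66=66, (66<<1)+65=197
    }
}
```

So any replay mechanism would have to reproduce the exact load order, not just the set of classes loaded.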
>
> For example, assume we have a "constant" expression that we want to fold:
>
> class A { static int m() { return 1; } }
>
> class B { static int n() { return 2; } }
>
> class C {
>     static final int x = A.m() + B.n();
> }
>
> We not only need to remember that "C.x is constant-folded to 3", but also
> that C.<clinit> will first load A and then B.
>
> This means we have to keep C.<clinit> around (even though it has an
> empty body as we removed the computation itself). We must trigger calls
> to C.<clinit> at all static references to C, in order to trigger the
> loading of A and B in the correct order.
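The A/B/C example above can be made concrete with a small self-contained sketch (the nested classes and the LOG list are illustrative additions, not from the original mail): each <clinit> records when it starts, so the required initialization order is directly observable at the Java level.

```java
import java.util.ArrayList;
import java.util.List;

public class InitOrderDemo {
    static final List<String> LOG = new ArrayList<>();

    static class A {
        static { LOG.add("A"); }
        static int m() { return 1; }
    }

    static class B {
        static { LOG.add("B"); }
        static int n() { return 2; }
    }

    static class C {
        static { LOG.add("C"); }             // initializers run in textual order
        static final int x = A.m() + B.n();  // triggers A.<clinit>, then B.<clinit>
    }

    static List<String> run() {
        int ignored = C.x;  // first static reference to C starts C.<clinit>
        return LOG;
    }

    public static void main(String[] args) {
        System.out.println(run() + ", C.x = " + C.x); // prints "[C, A, B], C.x = 3"
    }
}
```

Constant-folding C.x to 3 without keeping C.<clinit> (and the barrier that triggers it) would silently drop the A-then-B loading sequence.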
>
> As a result, all the AOT code that makes a static reference to C must
> take a class initialization barrier.
>
> So you can see how complexity can quickly get out of hand. For example,
> we can't constant fold C.x in the AOT code because we need to preserve
> the class init barrier.
>
> If you pre-compute a set of objects that include instances of C, the
> situation becomes even worse. You not only need to remember the values
> of these objects, but also the sequence of how they were constructed so
> you can replay the correct sequence of class loading ....
>
> ==============
>
> For loaders that are side-effect free at the Java level
> (URLClassLoader??), maybe the story is simpler.
>
> Thanks
>
> - Ioi
>
>
> >> Another thing that I want to check is the portability of the AOT
> >> code. Do we do anything to ensure the AOT code is portable across
> >> microarchitectures, that is, that it is not tied to the CPU features
> >> of the system where the code is generated? If we bundle the cached
> >> code archive in containers, which I expect would be one of the common
> >> ways to deploy these archives, then portability comes into the
> >> picture.
> >
> > We plan to address that in the future but, so far, the JVM does
> > nothing special in that respect. Moreover, the CDS archive itself
> > imposes constraints on supported JVM modes (e.g., the compressed oops
> > and compressed class pointer modes have to match). But users are free
> > to specify additional constraints during the training process. For
> > example, if the shared code archive is generated with -XX:UseAVX=2,
> > there won't be any AVX-512 instructions in the archived code, which
> > makes it safe to run on any AVX2-capable hardware.
> >
> > Best regards,
> > Vladimir Ivanov
> >
> >> On Tue, Sep 5, 2023 at 8:41 PM Vladimir Ivanov
> >> <vladimir.x.ivanov at oracle.com <mailto:vladimir.x.ivanov at oracle.com>>
> >> wrote:
> >>
> >> Hi Ashutosh,
> >>
> >> Thanks for giving it a try!
> >>
> >> There were some experiments with PetClinic on our side before, and it
> >> was noticed that the application relies on custom loaders, which
> >> aren't fully supported yet. That was the main limiting factor for the
> >> new optimizations.
> >> Until proper support for custom loaders is there, I suggest modifying
> >> the benchmark so that it relies only on the existing system loaders.
> >>
> >> Speaking of peak performance, some loss of performance is expected.
> >> Cached code is compiled conservatively (e.g., no constant folding for
> >> static final fields) so it can be reused in deployment runs. For now,
> >> the intended solution is to eventually recompile cached code online
> >> with all the optimizations enabled (this has to be explicitly enabled
> >> with -XX:+UseRecompilation). It's a work in progress, and our
> >> experience with it has been mixed: recompilation doesn't always fully
> >> restore peak performance.
> >>
> >> But assuming that both the CDS and cached code archives are
> >> underutilized (due to the aforementioned reliance on custom loaders),
> >> 10% sounds like way too big a difference. I suggest experimenting
> >> with different flag combinations (e.g., turning ReplayTraining and
> >> LoadCachedCode on and off independently).
> >>
> >> The JVM produces additional diagnostic output which may help to
> >> observe the effects of the new optimizations during both training and
> >> deployment runs:
> >>
> >> * -XX:+PrintCompilation: compilations satisfied from the cached code
> >> archive are marked with "R";
> >>
> >> * -XX:+CITime: prints information about cached code archive usage;
> >>
> >> * -Xlog:init=info: produces additional information about some startup
> >> activities;
> >>
> >> * -XX:+PrintSharedArchiveAndExit: additionally dumps training data
> >> and cached code archive info;
> >>
> >> * -Xlog:scc*=info and -Xlog:cds*=info: print lots of additional
> >> information during both training and deployment.
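As a convenience, the diagnostic flags above can be combined in a single deployment run. This is only a command-line sketch: "app.jar" and "deploy.log" are placeholder names, and any premain-specific archive flags your build requires (for CDS and the cached code archive) would need to be added.

```shell
# Deployment run with the diagnostics listed above enabled.
# NOTE: add the CDS / cached-code-archive flags your premain build uses.
java -XX:+PrintCompilation \
     -XX:+CITime \
     -Xlog:init=info \
     '-Xlog:scc*=info' '-Xlog:cds*=info' \
     -jar app.jar > deploy.log 2>&1

# Compilations satisfied from the cached code archive are marked with "R":
grep ' R ' deploy.log | head
```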
> >>
> >> Hope it helps.
> >>
> >> Best regards,
> >> Vladimir Ivanov
> >>
> >> On 9/5/23 13:52, Ashutosh Mehra wrote:
> >> > Hi,
> >> >
> >> > We have been interested in persisting the profiling data in the
> >> > CDS archive with the intention of improving the application's
> >> > warmup time. Now that the premain branch is here and saves profile
> >> > data along with AOT code, we started playing with it to understand
> >> > its impact on performance.
> >> >
> >> > Our setup uses the Spring Boot Petclinic [0] application, and the
> >> > CDS and shared code archives are generated in a manner similar to
> >> > this script [1]. Our training run only covers the application
> >> > startup phase; that means at each step we start the application and
> >> > shut it down without putting any load on it.
> >> >
> >> > Using the archives thus generated, I have done a few experiments
> >> > on my local system. In these experiments the application is bound
> >> > to two CPUs. The baseline for comparing the results is the case
> >> > where the CDS archive does not have any profiling data and there is
> >> > no shared code archive. The "premain" configuration refers to using
> >> > a shared code archive and a CDS archive with training data.
> >> >
> >> > Here are some initial results:
> >> >
> >> > 1. Startup: It is heartening to see start-up time improve by
> >> > almost 11%.
> >> >
> >> > baseline 10.2s
> >> > premain 9.1s
> >> >
> >> > 2. Warmup:
> >> > This test measures the warmup time by applying load using 1 jmeter
> >> > thread, to get an idea of the ramp-up time to reach peak
> >> > throughput. The load is applied for a duration of 300 seconds. The
> >> > graph [2] for the aot+profiling configuration shows interesting
> >> > behavior: in the initial period premain ramps up faster than the
> >> > baseline, then the slope of the premain curve drops significantly
> >> > and a couple of dips appear, and finally the throughput stabilizes.
> >> > It shows a drastic difference in the warmup time of the application
> >> > when running with the "premain" config.
> >> >
> >> > 3. Peak throughput: The last experiment measures peak throughput.
> >> > It starts with a warm-up phase of 180 seconds using 1 jmeter
> >> > thread. After the warmup phase, the load is applied with 10 jmeter
> >> > threads for a duration of 5 minutes. The last two minutes of
> >> > throughput are considered for the measurement. The graph [3] for
> >> > this test shows almost a 10% drop in throughput compared to the
> >> > baseline.
> >> >
> >> >
> >> > I am sure others would have done similar testing. My questions are:
> >> >
> >> > 1. Are these results along the expected lines?
> >> > 2. Are these tests using the CDS and the shared code (or cached
> >> > code) archives in the expected manner?
> >> > 3. Warmup time with the premain branch looks pretty bad, which is
> >> > surprising. Is there any trick I missed in my tests? Is there
> >> > anything else that needs to be done to get a better warmup time?
> >> > 4. What is the point of creating a new static archive? Shouldn't
> >> > applications just create the dynamic archive?
> >> > 5. I am also wondering if there is any design doc that can be
> >> > shared that explains the AOT compilation strategy adopted in the
> >> > premain branch?
> >> >
> >> > I have placed my scripts here [4] in case anyone wants to use them
> >> > to run these tests (you need to build the Petclinic app before
> >> > using these scripts).
> >> >
> >> > Please feel free to share your thoughts.
> >> >
> >> > [0] https://github.com/spring-projects/spring-petclinic
> >> > [1] https://github.com/openjdk/leyden/blob/d960fb15258cc99a1bf7f0b1e94bd8be06605aad/test/hotspot/jtreg/premain/lib/premain-run.sh#L70-L101
> >> > [2] https://github.com/ashu-mehra/leyden-perf/blob/main/spring/fd82682/tput-t1.svg
> >> > [3] https://github.com/ashu-mehra/leyden-perf/blob/main/spring/fd82682/tput-t10.svg
> >> > [4] https://github.com/ashu-mehra/leyden-perf
> >> >
> >> > Thanks,
> >> > - Ashutosh Mehra
> >>
>
>