premain: performance sniff tests

ioi.lam at oracle.com
Wed Sep 13 15:19:19 UTC 2023


By "class loading", I was referring to the fact that an InstanceKlasss 
is parsed from Java bytecodes and made available in the system 
dictionary. At this point, <clinit> of this class is not yet executed.
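
For illustration (a minimal sketch; the class name, loader variable and 
helper method below are made up), the distinction is visible from the 
Java side with the three-argument Class.forName, which loads a class 
without running its <clinit>:

    static Class<?> loadOnly(ClassLoader loader) throws ClassNotFoundException {
        // Loads com.example.Foo through 'loader' (its InstanceKlass gets
        // created and registered in the system dictionary), but does NOT
        // run Foo.<clinit> yet. Foo.<clinit> runs only later, e.g. on the
        // first static field access or the first instantiation of Foo.
        return Class.forName("com.example.Foo", false, loader);
    }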

In the current premain branch, we do not call <clinit> on arbitrary 
classes (only a very limited set of classes are initialized at dump 
time). We are exploring ways to extend the set of dump-time initialized 
classes. For example, we could run the <clinit> of classes that are 
proven to have no side effects.
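
As a rough illustration (the classes below are made up, and this is not 
the actual analysis the VM performs), a <clinit> like the first one only 
writes the class's own statics and could plausibly be executed at dump 
time, while the second one has externally observable behavior and could 
not:

    class PureConfig {
        // No side effects: only this class's own statics are written,
        // from compile-time constants.
        static final int[] TABLE = { 1, 2, 3 };
    }

    class ImpureConfig {
        static final String HOME;
        static {
            // Side effects / environment dependence: reads the environment
            // and writes to stdout, so it must run in the production JVM.
            HOME = System.getenv("HOME");
            System.out.println("ImpureConfig initialized");
        }
    }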

The init barriers we have in the AOT code today are there to ensure that 
<clinit> is called properly at runtime.

Thanks

- Ioi

On 9/12/23 8:11 AM, Ashutosh Mehra wrote:
> Thanks Vladimir and Ioi for the explanation on custom classloaders.
>
> If I understand correctly, the problem with the order of class loading 
> is not limited to classloaders with side effects; it also applies to 
> <clinit> methods with side effects.
>
> If the program output depends on the order in which classes are loaded,
> irrespective of the classloader involved, that order would need to be 
> preserved, right?
> If so, then why do custom classloaders pose a problem but not the 
> built-in loaders?
> If having init barriers is sufficient for built-in loaders, why isn't 
> that the case for custom loaders?
> What am I missing?
>
> Thanks,
> - Ashutosh Mehra
>
>
> On Thu, Sep 7, 2023 at 10:08 PM <ioi.lam at oracle.com> wrote:
>
>     On 9/6/23 7:45 PM, Vladimir Ivanov wrote:
>     >
>     >>     There were some experiments with PetClinic on our side before,
>     >>     and it was noticed that the application relies on custom loaders
>     >>     which aren't fully supported yet.
>     >>
>     >>
>     >> Can you please elaborate more on the support required for handling
>     >> custom classloaders?
>     >> Do they have an impact on AOT code quality or the training data?
>     >
>     > As of now, there are a number of implementation-specific constraints
>     > imposed to simplify prototyping and experiments. I'm not sure about
>     > training data, but code caching is conservatively disabled for all
>     > classes loaded by custom loaders. Also, some recent CDS-specific
>     > enhancements in the Leyden repository (like class preloading and
>     > constant pool entry pre-resolution) are disabled for custom loaders.
>     > So, as of today, code loaded by custom loaders doesn't benefit much
>     > from the work in Leyden.
>     >
>     >>     Until proper support for custom loaders is there, I suggest
>     >>     modifying the benchmark so it relies only on existing system
>     >>     loaders.
>     >>
>     >>
>     >> Is there ongoing work to improve the support for custom loaders?
>     >
>     > I'll let Ioi comment on the plans for custom loaders. But I
>     > wouldn't expect much progress on that front in the short term.
>     >
>     TL;DR: in the near term, we can probably support only a limited set
>     of custom class loader types, if at all.
>
>     The main problem with custom class loaders is that they can have side
>     effects that are observable at the Java level, so strictly speaking we
>     can't even change the order of invocations to ClassLoader.loadClass().
>     Otherwise we may change the meaning of the program. E.g.,
>
>          public static class MyLoader extends URLClassLoader {
>              static int x;
>
>              public MyLoader(URL[] urls, ClassLoader parent) {
>                  super(urls, parent);
>              }
>
>              @Override
>              protected Class<?> loadClass(String name, boolean resolve)
>                      throws ClassNotFoundException {
>                  // Observable side effect: x (and the printed output)
>                  // depend on the exact order in which classes are loaded.
>                  x <<= 1;
>                  x += name.hashCode();
>                  System.out.println(x);
>                  return super.loadClass(name, resolve);
>              }
>          }
>
>     If our optimizations change the order of class loading, you will have
>     unexpected output in stdout.
>
>     So if we want to time shift some computation that uses code loaded by
>     MyLoader, how can we authentically preserve all the side effects and
>     replay them in the production run?
>
>     Even if we remove the I/O from the example above, the value
>     MyLoader.x
>     is still observable by other Java code. So how do we make sure
>     MyLoader.x has the correct value at every observation point in the
>     production run?
>
>     For example, assume we have a "constant" expression that we want
>     to fold:
>
>          class A { static int m() { return 1; } }
>
>          class B { static int n() { return 2; } }
>
>          class C {
>
>              static final int x = A.m() + B.n();
>
>          }
>
>     We not only need to remember "C.x is constant-folded to 3", but also
>     that C.<clinit> will first load A and then B.
>
>     This means we have to keep C.<clinit> around (even though it has an
>     empty body as we removed the computation itself). We must trigger
>     calls
>     to C.<clinit> at all static references to C, in order to trigger the
>     loading of A and B in the correct order.
>
>     As a result, all the AOT code that makes a static reference to C must
>     take a class initialization barrier.
>
>     So you can see how complexity can quickly get out of hand. For
>     example,
>     we can't constant fold C.x in the AOT code because we need to
>     preserve
>     the class init barrier.
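>
>     As a rough sketch (not the actual HotSpot mechanism; ensureInitialized
>     below is just a stand-in for the VM-internal check), the AOT code that
>     reads C.x conceptually has to look like this:
>
>          static int readCx() {
>              // Init barrier: if C is not yet initialized, run C.<clinit>
>              // now, which loads A and then B in the original order.
>              ensureInitialized(C.class);  // stand-in for the VM-internal check
>              // The read cannot be folded to the constant 3 ahead of time,
>              // because folding it would drop the barrier above.
>              return C.x;
>          }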
>
>     If you pre-compute a set of objects that includes instances of C, the
>     situation becomes even worse. You not only need to remember the values
>     of these objects, but also the sequence in which they were constructed,
>     so you can replay the correct sequence of class loading ...
>
>     ==============
>
>     For loaders that are side-effect free at the Java level
>     (URLClassLoader??), maybe the story is simpler.
>
>     Thanks
>
>     - Ioi
>
>
>     >> Another thing that I want to check is the portability of the AOT
>     >> code.
>     >> Do we do anything to ensure the AOT code is portable across
>     >> microarchitectures, that is, not tied to the CPU features of the
>     >> system where the code is being generated?
>     >> If we bundle the cached code archive in containers, which I expect
>     >> would be one of the ways to deploy these archives, then portability
>     >> would come into the picture.
>     >
>     > We plan to address that in the future, but, so far, the JVM does
>     > nothing special in that respect. Moreover, the CDS archive itself
>     > imposes constraints on supported JVM modes (e.g., compressed oops and
>     > compressed class pointer modes should match). But users are free to
>     > specify any additional constraints during the training process. For
>     > example, if the shared code archive is generated with -XX:UseAVX=2,
>     > there won't be any AVX512 instructions present in the archived code,
>     > which makes it safe to run on any AVX2-capable hardware.
>     >
>     > Best regards,
>     > Vladimir Ivanov
>     >
>     >> On Tue, Sep 5, 2023 at 8:41 PM Vladimir Ivanov
>     >> <vladimir.x.ivanov at oracle.com> wrote:
>     >>
>     >>     Hi Ashutosh,
>     >>
>     >>     Thanks for giving it a try!
>     >>
>     >>     There were some experiments with PetClinic on our side before,
>     >>     and it was noticed that the application relies on custom loaders
>     >>     which aren't fully supported yet. It was the main limiting
>     >>     factor for the new optimizations.
>     >>     Until proper support for custom loaders is there, I suggest
>     >>     modifying the benchmark so it relies only on existing system
>     >>     loaders.
>     >>
>     >>     Speaking of peak performance, some loss of performance is
>     >>     expected. Cached code is compiled conservatively (e.g., no
>     >>     constant folding for static final fields) so it can be reused in
>     >>     deployment runs. For now, the intended solution is to eventually
>     >>     recompile cached code online with all the optimizations enabled
>     >>     (this has to be explicitly enabled with -XX:+UseRecompilation).
>     >>     It's a work in progress, and our experience using it has been
>     >>     mixed: recompilation doesn't always fully restore peak
>     >>     performance.
>     >>
>     >>     But assuming that both the CDS and cached code archives are
>     >>     underutilized (due to the aforementioned reliance on custom
>     >>     loaders), 10% sounds like way too big of a difference. I suggest
>     >>     experimenting with different flag combinations (e.g., turning
>     >>     ReplayTraining and LoadCachedCode on and off independently).
>     >>
>     >>     There's additional diagnostic output the JVM produces which may
>     >>     help observe the effects of the new optimizations during both
>     >>     training and deployment runs:
>     >>
>     >>        * -XX:+PrintCompilation: compilations satisfied from the
>     >>          cached code archive are marked with "R";
>     >>
>     >>        * -XX:+CITime: prints information about cached code archive
>     >>          usage;
>     >>
>     >>        * -Xlog:init=info: produces additional information about some
>     >>          startup activities;
>     >>
>     >>        * -XX:+PrintSharedArchiveAndExit: additionally dumps training
>     >>          data and cached code archive info;
>     >>
>     >>        * -Xlog:scc*=info and -Xlog:cds*=info: print lots of
>     >>          additional information during both training and deployment.
>     >>     Hope it helps.
>     >>
>     >>     Best regards,
>     >>     Vladimir Ivanov
>     >>
>     >>     On 9/5/23 13:52, Ashutosh Mehra wrote:
>     >>      > Hi,
>     >>      >
>     >>      > We have been interested in persisting the profiling data in
>     >>      > the CDS archive with the intention of improving the
>     >>      > application's warmup time. And now that the premain branch is
>     >>      > here, which saves profile data along with AOT code, we
>     >>      > started playing with it to understand its impact on
>     >>      > performance.
>     >>      >
>     >>      > Our setup uses the Spring Boot PetClinic [0] application, and
>     >>      > the CDS and shared code archives are generated in a manner
>     >>      > similar to this script [1].
>     >>      > Our training run only covers the application startup phase.
>     >>      > That means at each step we start the application and shut it
>     >>      > down without putting any load on it.
>     >>      >
>     >>      > Using the archives thus generated, I have done a few
>     >>      > experiments on my local system. In these experiments the
>     >>      > application is bound to two CPUs.
>     >>      > The baseline for comparing the results is the case where the
>     >>      > CDS archive does not have any profiling data and there is no
>     >>      > shared code archive.
>     >>      > The "premain" configuration refers to using a shared code
>     >>      > archive and a CDS archive with training data.
>     >>      >
>     >>      > Here are some initial results:
>     >>      >
>     >>      > 1. Startup: It is heartening to see start-up time improve by
>     >>     almost 11%.
>     >>      >
>     >>      > baseline       10.2s
>     >>      > premain         9.1s
>     >>      >
>     >>      > 2. Warmup:
>     >>      > This test measures the warmup time by applying load using 1
>     >>      > jmeter thread to get an idea of the ramp-up time to reach the
>     >>      > peak throughput.
>     >>      > The load is applied for a duration of 300 seconds. The graph
>     >>      > [2] for the aot+profiling configuration shows interesting
>     >>      > behavior.
>     >>      > In the initial period premain is ramping up faster than the
>     >>      > baseline. Then the slope of the curve for premain decreases
>     >>      > significantly, and a couple of dips are also seen. Finally the
>     >>      > throughput stabilizes.
>     >>      > It shows a drastic difference in the warmup time of the
>     >>      > application when running with the "premain" config.
>     >>      >
>     >>      > 3. Peak throughput: The last experiment measures peak
>     >>      > throughput. It starts with a warm-up phase of 180 seconds
>     >>      > using 1 jmeter thread. After the warmup phase the load is
>     >>      > applied with 10 jmeter threads for a duration of 5 minutes.
>     >>      > The last two minutes of throughput are considered for the
>     >>      > measurement. The graph [3] for this test shows almost a 10%
>     >>      > drop in throughput compared to the baseline.
>     >>      >
>     >>      >
>     >>      > I am sure others would have done similar testing. My
>     >>      > questions are:
>     >>      >
>     >>      > 1. Are these results along the expected lines?
>     >>      > 2. Are these tests using the CDS and the shared code (or
>     >>      > cached code) archives in the expected manner?
>     >>      > 3. Warmup time with the premain branch looks pretty bad, which
>     >>      > is surprising. Is there any trick I missed in my tests? Is
>     >>      > there anything else that needs to be done to get better warmup
>     >>      > time?
>     >>      > 4. What is the point of creating a new static archive?
>     >>      > Shouldn't applications just create the dynamic archive?
>     >>      > 5. I am also wondering if there is any design doc that can be
>     >>      > shared that explains the AOT compilation strategy adopted in
>     >>      > the premain branch?
>     >>      >
>     >>      > I have placed my scripts here [4] in case anyone wants
>     to use
>     >>     them to
>     >>      > run these tests (you need to build the Petclinic app
>     before using
>     >>     these
>     >>      > scripts).
>     >>      >
>     >>      > Please feel free to share your thoughts.
>     >>      >
>     >>      > [0] https://github.com/spring-projects/spring-petclinic
>     >>      > [1]
>     >>      > https://github.com/openjdk/leyden/blob/d960fb15258cc99a1bf7f0b1e94bd8be06605aad/test/hotspot/jtreg/premain/lib/premain-run.sh#L70-L101
>     >>      > [2]
>     >>      > https://github.com/ashu-mehra/leyden-perf/blob/main/spring/fd82682/tput-t1.svg
>     >>      > [3]
>     >>      > https://github.com/ashu-mehra/leyden-perf/blob/main/spring/fd82682/tput-t10.svg
>     >>      > [4] https://github.com/ashu-mehra/leyden-perf
>     >>      >
>     >>      > Thanks,
>     >>      > - Ashutosh Mehra
>     >>
>