premain: performance sniff tests
ioi.lam at oracle.com
Wed Sep 13 15:19:19 UTC 2023
By "class loading", I was referring to the fact that an InstanceKlasss
is parsed from Java bytecodes and made available in the system
dictionary. At this point, <clinit> of this class is not yet executed.
In the current premain branch, we do not call <clinit> on arbitrary
classes (only a very limited set of classes are initialized at dump
time). We are exploring ways to extend the set of dump-time initialized
classes. For example, we can run those that are proven to have no side
effect.
The init-barriers we have in the AOT code today is to ensure that
<clinit> is called properly at runtime.
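
For illustration, here is a hypothetical pair of classes (a sketch only;
PureTable and LoggingConfig are made-up names, not classes from the
premain sources):

    // Sketch only: hypothetical classes, not from the premain branch.
    class PureTable {
        // <clinit> just fills a private table; running it at dump time
        // would not be observable by the rest of the program.
        static final int[] SQUARES = new int[16];
        static {
            for (int i = 0; i < SQUARES.length; i++) {
                SQUARES[i] = i * i;
            }
        }
    }

    class LoggingConfig {
        // <clinit> reads the environment and writes to stdout; it must run
        // at the point the program would naturally initialize the class,
        // which is what the init barriers in the AOT code guarantee.
        static final String MODE = System.getenv().getOrDefault("APP_MODE", "dev");
        static {
            System.out.println("LoggingConfig initialized in mode " + MODE);
        }
    }
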
Thanks
- Ioi
On 9/12/23 8:11 AM, Ashutosh Mehra wrote:
> Thanks Vladimir and Ioi for the explanation on custom classloaders.
>
> If I understand correctly, the problem with the order of loading
> classes is not just limited to classloaders with side effects, but
> also applies to <clinit> methods with side effects.
>
> If the program output depends on the order the classes are loaded,
> irrespective of the classloader involved, the order would need to be
> preserved, right?
> If it is so, then why do custom classloaders pose a problem and not
> built-in loaders?
> If having init barriers for built-in loaders is sufficient, why isn't
> that the case for custom loaders?
> What am I missing?
>
> Thanks,
> - Ashutosh Mehra
>
>
> On Thu, Sep 7, 2023 at 10:08 PM <ioi.lam at oracle.com> wrote:
>
> On 9/6/23 7:45 PM, Vladimir Ivanov wrote:
> >
> >> There were some experiments with PetClinic on our side before and
> >> it was noticed that the application relies on custom loaders which
> >> aren't fully supported yet.
> >>
> >>
> >> Can you please elaborate more about the support required for
> >> handling custom classloaders?
> >> Do they have an impact on AOT code quality or the training data?
> >
> > As of now, there are a number of implementation-specific constraints
> > imposed to simplify prototyping and experiments. I'm not sure about
> > training data, but code caching is conservatively disabled for all
> > classes loaded by custom loaders. Also, some recent CDS-specific
> > enhancements in the Leyden repository (like class preloading and
> > constant pool entry pre-resolution) are disabled for custom loaders.
> > So, as of today, code loaded by custom loaders doesn't benefit much
> > from the work in Leyden.
> >
> >> Until proper support for custom loaders is there, I suggest
> >> modifying the benchmark so it relies only on existing system loaders.
> >>
> >>
> >> Is there ongoing work to improve the support for custom loaders?
> >
> > I'll let Ioi comment on the plans about custom loaders. But I
> > wouldn't expect much progress on that front in the short term.
> >
> TL;DR: in the near term, we can probably support only a limited set
> of custom class loader types, if at all.
>
> The main problem with custom class loaders is that they can have side
> effects that are observable at the Java level, so strictly speaking we
> can't even change the order of invocations to ClassLoader.loadClass().
> Otherwise we may change the meaning of the program. E.g.,
>
>     public static class MyLoader extends URLClassLoader {
>         static int x;
>
>         public MyLoader(URL[] urls, ClassLoader parent) {
>             super(urls, parent);
>         }
>
>         @Override
>         protected Class<?> loadClass(String name, boolean resolve)
>                 throws ClassNotFoundException {
>             x <<= 1;
>             x += name.hashCode();
>             System.out.println(x);
>             return super.loadClass(name, resolve);
>         }
>     }
>
> If our optimizations change the order of class loading, you will have
> unexpected output in stdout.
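>
> For instance, a hypothetical driver like OrderDemo below (not part of
> the original example; it assumes MyLoader above is visible to it) makes
> the order dependence explicit:
>
>     public class OrderDemo {
>         public static void main(String[] args) throws Exception {
>             // Assumes MyLoader from the example above is accessible here.
>             MyLoader loader = new MyLoader(new java.net.URL[0],
>                                            OrderDemo.class.getClassLoader());
>             // Swapping these two calls changes every intermediate and final
>             // value of MyLoader.x, and therefore the output on stdout.
>             loader.loadClass("java.util.ArrayList");
>             loader.loadClass("java.util.HashMap");
>             System.out.println("final x = " + MyLoader.x);
>         }
>     }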
>
> So if we want to time shift some computation that uses code loaded by
> MyLoader, how can we authentically preserve all the side effects and
> replay them in the production run?
>
> Even if we remove the I/O from the example above, the value
> MyLoader.x
> is still observable by other Java code. So how do we make sure
> MyLoader.x has the correct value at every observation point in the
> production run?
>
> For example, assume we have a "constant" expression that we want
> to fold:
>
> class A { static int m() { return 1; } }
>
> class B { static int n() { return 2; } }
>
> class C {
>     static final int x = A.m() + B.n();
> }
>
> We not only need to remember that "C.x is constant-folded to 3", but
> also that C.<clinit> will first load A and then B.
>
> This means we have to keep C.<clinit> around (even though it has an
> empty body as we removed the computation itself). We must trigger
> calls
> to C.<clinit> at all static references to C, in order to trigger the
> loading of A and B in the correct order.
>
> As a result, all the AOT code that makes a static reference to C must
> take a class initialization barrier.
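>
> To make that concrete, consider a hypothetical caller (Client is a
> made-up class, not part of the example above):
>
>     class Client {
>         static int firstUse() {
>             // Even if the compiler knows C.x evaluates to 3, this access
>             // must keep a class init barrier: it has to trigger C.<clinit>,
>             // which loads A and then B, so any loader side effects are
>             // replayed in the same order as in an uncompiled run.
>             return C.x;
>         }
>     }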
>
> So you can see how complexity can quickly get out of hand. For
> example,
> we can't constant fold C.x in the AOT code because we need to
> preserve
> the class init barrier.
>
> If you pre-compute a set of objects that include instances of C, the
> situation becomes even worse. You not only need to remember the
> values
> of these objects, but also the sequence of how they were
> constructed so
> you can replay the correct sequence of class loading ....
>
> ==============
>
> For loaders that are side-effect free at the Java level
> (URLClassLoader??), maybe the story is simpler.
>
> Thanks
>
> - Ioi
>
>
> >> Another thing that I want to check is the portability of the AOT
> >> code. Do we do anything to ensure the AOT code is portable across
> >> microarchitectures, that is, not tied to the CPU features of the
> >> system where the code is being generated?
> >> If we bundle the cached code archive in containers, which I expect
> >> would be one of the ways to deploy these archives, then portability
> >> would come into the picture.
> >
> > We plan to address that in the future, but, so far, the JVM does
> > nothing special in that respect. Moreover, the CDS archive itself
> > imposes constraints on supported JVM modes (e.g., compressed oops and
> > compressed class pointer modes should match). But users are free to
> > specify any additional constraints during the training process. For
> > example, if the shared code archive is generated with -XX:UseAVX=2,
> > there won't be any AVX-512 instructions present in the archived code,
> > which makes it safe to run on any AVX2-capable hardware.
> >
> > Best regards,
> > Vladimir Ivanov
> >
> >> On Tue, Sep 5, 2023 at 8:41 PM Vladimir Ivanov
> >> <vladimir.x.ivanov at oracle.com> wrote:
> >>
> >> Hi Ashutosh,
> >>
> >> Thanks for giving it a try!
> >>
> >> There were some experiments with PetClinic on our side before and
> >> it was noticed that the application relies on custom loaders which
> >> aren't fully supported yet. It was the main limiting factor for new
> >> optimizations. Until proper support for custom loaders is there, I
> >> suggest modifying the benchmark so it relies only on existing
> >> system loaders.
> >>
> >> Speaking of peak performance, some loss of performance is expected.
> >> Cached code is compiled conservatively (e.g., no constant folding
> >> for static final fields) so it can be reused in deployment runs.
> >> For now, the intended solution is to eventually recompile cached
> >> code online with all the optimizations enabled (recompilation has
> >> to be explicitly enabled with -XX:+UseRecompilation). It's a
> >> work-in-progress and our experience using it was mixed:
> >> recompilation doesn't always fully restore peak performance.
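> >>
> >> As a rough illustration of the kind of folding that is skipped
> >> (Limits is a hypothetical class, not something from the Leyden
> >> sources):
> >>
> >>     class Limits {
> >>         // Initialized at class-init time, so not a javac-level constant.
> >>         static final int LIMIT = Integer.getInteger("app.limit", 100);
> >>
> >>         static int scaled(int n) {
> >>             // A normal JIT compile can fold LIMIT into the generated code
> >>             // once Limits is initialized; conservatively compiled cached
> >>             // code keeps the field load instead, so the compiled method
> >>             // stays reusable across deployment runs.
> >>             return n * LIMIT;
> >>         }
> >>     }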
> >>
> >> But assuming that both the CDS and cached code archives are
> >> underutilized (due to the aforementioned reliance on custom
> >> loaders), 10% sounds like way too big of a difference. I suggest
> >> experimenting with different flag combinations (e.g., turning
> >> ReplayTraining and LoadCachedCode on and off independently).
> >>
> >> There's additional diagnostic output the JVM produces which may help
> >> to observe effects from new optimizations during both training and
> >> deployment runs:
> >>
> >> * -XX:+PrintCompilation: compilations satisfied from the cached code
> >>   archive are marked w/ "R";
> >>
> >> * -XX:+CITime: prints information about cached code archive usage;
> >>
> >> * -Xlog:init=info: produces additional information about some
> >>   startup activities;
> >>
> >> * -XX:+PrintSharedArchiveAndExit: additionally dumps training data
> >>   and cached code archive info;
> >>
> >> * -Xlog:scc*=info and -Xlog:cds*=info: print lots of additional
> >>   information both during training and deployment.
> >> Hope it helps.
> >>
> >> Best regards,
> >> Vladimir Ivanov
> >>
> >> On 9/5/23 13:52, Ashutosh Mehra wrote:
> >> > Hi,
> >> >
> >> > We have been interested in persisting the profiling data in
> >> the CDS
> >> > archive with the intention of improving the
> application's warmup
> >> time.
> >> > And now that the premain branch is here that does save
> profile
> >> data
> >> > along with AOT, we started playing with the premain
> branch to
> >> understand
> >> > its impact on the performance.
> >> >
> >> > Our setup uses Springboot Petclinic [0] application and the
> >> CDS and
> >> > shared code archives are generated in a manner similar
> to this
> >> script [1].
> >> > Our training run only covers the application startup
> phase. That
> >> means
> >> > at each step we start the application and shut it down
> without
> >> putting
> >> > any load on it.
> >> >
> >> > Using the archives thus generated I have done a few experiments
> >> > on my local system. In these experiments the application is bound
> >> > to two CPUs.
> >> > The baseline for comparing the results is the case where
> the CDS
> >> archive
> >> > does not have any profiling data and there is no shared
> code
> >> archive.
> >> > The "premain" configuration refers to using a shared
> code archive
> >> and a
> >> > CDS archive with training data.
> >> >
> >> > Here are some initial results:
> >> >
> >> > 1. Startup: It is heartening to see start-up time improve by
> >> almost 11%.
> >> >
> >> > baseline 10.2s
> >> > premain 9.1s
> >> >
> >> > 2. Warmup:
> >> > This test measures the warmup time by applying load using 1
> >> jmeter
> >> > thread to get an idea of the ramp-up time to reach the peak
> >> throughput.
> >> > The load is applied for the duration of 300 seconds. The
> graph
> >> [2] for
> >> > aot+profiling configuration shows interesting behavior.
> >> > In the initial period premain is ramping up faster than the
> >> baseline.
> >> > Then the slope of the curve for premain reduces
> significantly
> >> and a
> >> > couple of dips are also seen. Finally the throughput
> stabilizes.
> >> > It shows a drastic difference in the warmup time of the
> >> application when
> >> > running with the "premain" config.
> >> >
> >> > 3. Peak throughput: The last experiment is to measure peak
> >> > throughput. It starts with a warm-up phase of 180 seconds using 1
> >> > jmeter thread. After the warmup phase the load is applied with 10
> >> > jmeter threads for a duration of 5 mins.
> >> > The last two minutes of throughput are considered for measurement.
> >> > The graph [3] for this test shows almost a 10% drop in the
> >> > throughput compared to the baseline.
> >> >
> >> >
> >> > I am sure others would have done similar testing. My
> >> questions are:
> >> >
> >> > 1. Are these results on the expected lines?
> >> > 2. Are these tests using the CDS and the shared code (or cached
> >> > code) archives in the expected manner?
> >> > 3. Warmup time with the premain branch looks pretty bad, which is
> >> > surprising. Is there any trick I missed in my tests? Is there
> >> > anything else that needs to be done to get better warmup time?
> >> > 4. What is the point of creating a new static archive? Shouldn't
> >> > the applications just create the dynamic archive?
> >> > 5. I am also wondering if there is any design doc that can be
> >> > shared that explains the AOT compilation strategy adopted in the
> >> > premain branch?
> >> >
> >> > I have placed my scripts here [4] in case anyone wants
> to use
> >> them to
> >> > run these tests (you need to build the Petclinic app
> before using
> >> these
> >> > scripts).
> >> >
> >> > Please feel free to share your thoughts.
> >> >
> >> > [0] https://github.com/spring-projects/spring-petclinic
> >> > [1] https://github.com/openjdk/leyden/blob/d960fb15258cc99a1bf7f0b1e94bd8be06605aad/test/hotspot/jtreg/premain/lib/premain-run.sh#L70-L101
> >> > [2] https://github.com/ashu-mehra/leyden-perf/blob/main/spring/fd82682/tput-t1.svg
> >> > [3] https://github.com/ashu-mehra/leyden-perf/blob/main/spring/fd82682/tput-t10.svg
> >> > [4] https://github.com/ashu-mehra/leyden-perf
> >> >
> >> > Thanks,
> >> > - Ashutosh Mehra
> >>
>