premain: performance sniff tests
Andrew Dinn
adinn at redhat.com
Thu Sep 14 10:48:53 UTC 2023
Just to note: One of the key things a custom classloader can decide at
runtime is to which loader it wants to delegate responsibility for
loading class bytes. This is arguably the most *pertinent* run-time
dependent side-effect that you might bake in at build time, thereby
invalidating expected behaviour. It's not the most significant run-time
dependent side-effect, of course, because they are all equally significant.
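As a sketch of that point (all names here are hypothetical, not taken from any real application): a loader whose delegation target depends on state that only exists at run time cannot have that choice baked in at build time without freezing one branch of the decision.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical loader whose delegation target is decided at run time,
// here by a system property read on every load request. Snapshotting
// the outcome at build time would freeze one of the two branches.
class SwitchingLoader extends URLClassLoader {
    private final ClassLoader alternate;

    SwitchingLoader(URL[] urls, ClassLoader parent, ClassLoader alternate) {
        super(urls, parent);
        this.alternate = alternate;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        if (Boolean.getBoolean("app.use.alternate")) {  // run-time choice
            return alternate.loadClass(name);
        }
        return super.loadClass(name, resolve);  // normal parent-first path
    }
}
```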
regards,
Andrew Dinn
-----------
On 08/09/2023 03:08, ioi.lam at oracle.com wrote:
> On 9/6/23 7:45 PM, Vladimir Ivanov wrote:
>>
>>> There were some experiments with PetClinic on our side before and
>>> it was
>>> noticed that the application relies on custom loaders which
>>> aren't fully
>>> supported yet.
>>>
>>>
>>> Can you please elaborate more about the support required for handling
>>> custom classloaders.
>>> Do they have an impact on AOT code quality or the training data?
>>
>> As of now, there are a number of implementation-specific constraints
>> imposed to simplify prototyping and experiments. I'm not sure about
>> training data, but code caching is conservatively disabled for all
>> classes loaded by custom loaders. Also, some recent CDS-specific
>> enhancements in Leyden repository (like class preloading, constant
>> pool entries pre-resolution) are disabled for custom loaders. So, as
>> of today, code loaded by custom loaders doesn't benefit much from the
>> work in Leyden.
>>
>>> Until proper support for custom loaders is in place, I suggest
>>> modifying the benchmark so it relies only on existing system loaders.
>>>
>>>
>>> Is there ongoing work to improve the support for custom loaders?
>>
>> I'll let Ioi comment on the plans for custom loaders. But I
>> wouldn't expect much progress on that front in the short term.
>>
> TL;DR: in the near term, we can probably support only a limited set of
> custom class loader types, if at all.
>
> The main problem with custom class loaders is that they can have side
> effects observable at the Java level, so strictly speaking we can't even
> change the order of invocations of ClassLoader.loadClass(). Otherwise we
> may change the meaning of the program. E.g.,
>
> public static class MyLoader extends URLClassLoader {
>     static int x;
>
>     public MyLoader(URL[] urls, ClassLoader parent) {
>         super(urls, parent);
>     }
>
>     @Override
>     protected Class<?> loadClass(String name, boolean resolve)
>             throws ClassNotFoundException {
>         x <<= 1;
>         x += name.hashCode();
>         System.out.println(x);
>         return super.loadClass(name, resolve);
>     }
> }
>
> If our optimizations change the order of class loading, you will have
> unexpected output in stdout.
>
> So if we want to time shift some computation that uses code loaded by
> MyLoader, how can we authentically preserve all the side effects and
> replay them in the production run?
>
> Even if we remove the I/O from the example above, the value MyLoader.x
> is still observable by other Java code. So how do we make sure
> MyLoader.x has the correct value at every observation point in the
> production run?
>
> For example, assume we have a "constant" expression that we want to fold:
>
> class A { static int m() { return 1; } }
>
> class B { static int n() { return 2; } }
>
> class C {
>     static final int x = A.m() + B.n();
> }
>
> We not only need to remember that "C.x is constant-folded to 3", but
> also that C.<clinit> will first load A and then B.
>
> This means we have to keep C.<clinit> around (even though it has an
> empty body as we removed the computation itself). We must trigger calls
> to C.<clinit> at all static references to C, in order to trigger the
> loading of A and B in the correct order.
>
> As a result, all the AOT code that makes a static reference to C must
> take a class initialization barrier.
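To make that concrete, here is a conceptual sketch (hypothetical shape, not the actual premain implementation) of what the archived version of C must preserve: the folded value is remembered, but the initializer still references A and then B so the training run's class-loading order is replayed.

```java
// Hypothetical reconstruction of the archived classes. The original
// computation A.m() + B.n() is replaced by the remembered constant, but
// A and B must still be loaded in the initializer, in this order.
class A { static int m() { return 1; } }
class B { static int n() { return 2; } }

class C {
    static final int x;
    static {
        Class<?> a = A.class;  // forces loading of A first
        Class<?> b = B.class;  // then B
        x = 3;                 // the constant-folded value
    }
}
```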
>
> So you can see how complexity can quickly get out of hand. For example,
> we can't constant fold C.x in the AOT code because we need to preserve
> the class init barrier.
>
> If you pre-compute a set of objects that include instances of C, the
> situation becomes even worse. You not only need to remember the values
> of these objects, but also the sequence of how they were constructed so
> you can replay the correct sequence of class loading ....
>
> ==============
>
> For loaders that are side-effect free at the Java level
> (URLClassLoader??), maybe the story is simpler.
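For instance, a delegating loader along these lines (a hypothetical sketch) adds no Java-observable state of its own, so reordering or time-shifting its loads changes nothing a program can detect:

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical example of a loader that is side-effect free at the Java
// level: no static state, no I/O, no loadClass override, so the only
// Java-observable effect of a load is the class definition itself.
class PlainLoader extends URLClassLoader {
    PlainLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);  // keep the default parent-first delegation
    }
}
```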
>
> Thanks
>
> - Ioi
>
>
>>> Another thing that I want to check is the portability of the AOT code.
>>> Do we do anything to ensure the AOT code is portable across
>>> microarchitectures,
>>> that is, it is not tied to the CPU features of the system where the
>>> code is being generated.
>>> If we bundle the cached code archive in containers, which I expect
>>> would be one of the ways to deploy these archives,
>>> then the portability would come into picture.
>>
>> We plan to address that in the future but, so far, the JVM does
>> nothing special in that respect. Moreover, the CDS archive itself
>> imposes constraints on supported JVM modes (e.g., compressed oops and
>> compressed class pointer modes should match). But users are free to
>> specify additional constraints during the training process. For
>> example, if the shared code archive is generated with -XX:UseAVX=2,
>> there won't be any AVX512 instructions present in the archived code,
>> which makes it safe to run on any AVX2-capable hardware.
>>
>> Best regards,
>> Vladimir Ivanov
>>
>>> On Tue, Sep 5, 2023 at 8:41 PM Vladimir Ivanov
>>> <vladimir.x.ivanov at oracle.com <mailto:vladimir.x.ivanov at oracle.com>>
>>> wrote:
>>>
>>> Hi Ashutosh,
>>>
>>> Thanks for giving it a try!
>>>
>>> There were some experiments with PetClinic on our side before, and
>>> it was noticed that the application relies on custom loaders which
>>> aren't fully supported yet. That was the main limiting factor for
>>> the new optimizations.
>>> Until proper support for custom loaders is in place, I suggest
>>> modifying the benchmark so it relies only on existing system loaders.
>>>
>>> Speaking of peak performance, some loss of performance is expected.
>>> Cached code is compiled conservatively (e.g., no constant folding
>>> for static final fields) so it can be reused in deployment runs. For
>>> now, the intended solution is to eventually recompile cached code
>>> online with all optimizations enabled (this has to be explicitly
>>> enabled with -XX:+UseRecompilation). It's a work in progress, and
>>> our experience using it has been mixed: recompilation doesn't always
>>> fully restore peak performance.
>>>
>>> But assuming that both the CDS and cached code archives are
>>> underutilized (due to the aforementioned reliance on custom
>>> loaders), a 10% difference sounds far too large. I suggest
>>> experimenting with different flag combinations (e.g., turning
>>> ReplayTraining and LoadCachedCode on and off independently).
>>>
>>> The JVM produces additional diagnostic output which may help you
>>> observe the effects of the new optimizations during both training
>>> and deployment runs:
>>>
>>> * -XX:+PrintCompilation: compilations satisfied from cached code
>>> archive are marked w/ "R";
>>>
>>> * -XX:+CITime: prints information about cached code archive
>>> usage;
>>>
>>> * -Xlog:init=info: produces additional information about some
>>> startup
>>> activities
>>>
>>> * -XX:+PrintSharedArchiveAndExit additionally dumps training data
>>> and
>>> cached code archive info
>>>
>>> * -Xlog:scc*=info and -Xlog:cds*=info print lots of additional
>>> information both during training and deployment
>>>
>>> Hope it helps.
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>> On 9/5/23 13:52, Ashutosh Mehra wrote:
>>> > Hi,
>>> >
>>> > We have been interested in persisting the profiling data in
>>> the CDS
>>> > archive with the intention of improving the application's warmup
>>> time.
>>> > And now that the premain branch is here that does save profile
>>> data
>>> > along with AOT, we started playing with the premain branch to
>>> understand
>>> > its impact on the performance.
>>> >
>>> > Our setup uses Springboot Petclinic [0] application and the
>>> CDS and
>>> > shared code archives are generated in a manner similar to this
>>> script [1].
>>> > Our training run only covers the application startup phase. That
>>> means
>>> > at each step we start the application and shut it down without
>>> putting
>>> > any load on it.
>>> >
>>> > Using the archives thus generated I have done a few experiments
>>> on my
>>> > local system. In these experiments the application is bound to
>>> two cpus.
>>> > The baseline for comparing the results is the case where the CDS
>>> archive
>>> > does not have any profiling data and there is no shared code
>>> archive.
>>> > The "premain" configuration refers to using a shared code archive
>>> and a
>>> > CDS archive with training data.
>>> >
>>> > Here are some initial results:
>>> >
>>> > 1. Startup: It is heartening to see start-up time improve by
>>> almost 11%.
>>> >
>>> > baseline 10.2s
>>> > premain 9.1s
>>> >
>>> > 2. Warmup:
>>> > This test measures the warmup time by applying load using 1
>>> jmeter
>>> > thread to get an idea of the ramp-up time to reach the peak
>>> throughput.
>>> > The load is applied for the duration of 300 seconds. The graph
>>> [2] for
>>> > aot+profiling configuration shows interesting behavior.
>>> > In the initial period premain is ramping up faster than the
>>> baseline.
>>> > Then the slope of the curve for premain reduces significantly
>>> and a
>>> > couple of dips are also seen. Finally the throughput stabilizes.
>>> > It shows a drastic difference in the warmup time of the
>>> application when
>>> > running with the "premain" config.
>>> >
>>> > 3. Peak throughput: Last experiment is to measure peak
>>> throughput. It
>>> > starts with a warm-up phase of 180 seconds using 1 jmeter thread.
>>> After
>>> > the warmup phase the load is applied with 10 jmeter threads for a
>>> > duration of 5 mins.
>>> > The last two minutes of throughput are considered for measurement. The
>>> graph
>>> > [3] for this test shows almost a 10% drop in the throughput
>>> compared to
>>> > the baseline.
>>> >
>>> >
>>> > I am sure others would have done similar testing. My
>>> questions are:
>>> >
>>> > 1. Are these results on the expected lines?
>>> > 2. Are these tests using the CDS and the shared code (or cached
>>> code)
>>> > archives in the expected manner?
>>> > 3. Warmup time with the premain branch looks pretty bad, which is
>>> > surprising. Is there any trick I missed in my tests? Is there
>>> anything
>>> > else that needs to be done to get better warmup time?
>>> > 4. What is the point of creating a new static archive?
>>> Shouldn't the
>>> > applications just create the dynamic archive?
>>> > 5. I am also wondering if there is any design doc that can be
>>> shared
>>> > that explains the AOT compilation strategy adopted in the premain
>>> branch?
>>> >
>>> > I have placed my scripts here [4] in case anyone wants to use
>>> them to
>>> > run these tests (you need to build the Petclinic app before using
>>> these
>>> > scripts).
>>> >
>>> > Please feel free to share your thoughts.
>>> >
>>> > [0] https://github.com/spring-projects/spring-petclinic
>>> > [1]
>>> >
>>> https://github.com/openjdk/leyden/blob/d960fb15258cc99a1bf7f0b1e94bd8be06605aad/test/hotspot/jtreg/premain/lib/premain-run.sh#L70-L101
>>> > [2]
>>> >
>>> https://github.com/ashu-mehra/leyden-perf/blob/main/spring/fd82682/tput-t1.svg
>>> > [3]
>>> >
>>> https://github.com/ashu-mehra/leyden-perf/blob/main/spring/fd82682/tput-t10.svg
>>> > [4] https://github.com/ashu-mehra/leyden-perf
>>> >
>>> > Thanks,
>>> > - Ashutosh Mehra
>>>
>
--
regards,
Andrew Dinn
-----------
Red Hat Distinguished Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill