premain: performance sniff tests

Andrew Dinn adinn at redhat.com
Thu Sep 14 10:48:53 UTC 2023


Just to note: one of the key things a custom classloader can decide at 
runtime is which loader it wants to delegate responsibility for 
loading class bytes to. This is arguably the most *pertinent* run-time 
dependent side-effect that you might bake in at build time, thereby 
invalidating expected behaviour. It's not the most significant run-time 
dependent side-effect, of course, because they are all equally significant.
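For illustration (a hypothetical sketch, not Leyden or application code; the class name, package prefix, and delegate choice are all invented), a loader might pick its delegation target per request, from state known only at run time:

```java
// Hypothetical sketch: a loader whose delegation target is chosen per
// request. Baking this choice in at build time would change observable
// behaviour whenever the run-time state differs.
public class SwitchingLoader extends ClassLoader {
    private final ClassLoader pluginLoader;

    public SwitchingLoader(ClassLoader parent, ClassLoader pluginLoader) {
        super(parent);
        this.pluginLoader = pluginLoader;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        // "com.example.plugin." is an invented prefix standing in for any
        // run-time-dependent routing decision.
        if (name.startsWith("com.example.plugin.")) {
            return pluginLoader.loadClass(name);
        }
        return super.loadClass(name, resolve);
    }

    public static void main(String[] args) throws Exception {
        SwitchingLoader l = new SwitchingLoader(
                ClassLoader.getSystemClassLoader(),
                ClassLoader.getPlatformClassLoader());
        // java.util.ArrayList doesn't match the plugin prefix, so it takes
        // the ordinary parent-delegation path.
        System.out.println(l.loadClass("java.util.ArrayList").getName());
    }
}
```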

regards,


Andrew Dinn
-----------

On 08/09/2023 03:08, ioi.lam at oracle.com wrote:
> On 9/6/23 7:45 PM, Vladimir Ivanov wrote:
>>
>>>     There were some experiments with PetClinic on our side before and 
>>> it was
>>>     noticed that the application relies on custom loaders which 
>>> aren't fully
>>>     supported yet.
>>>
>>>
>>> Can you please elaborate more about the support required for handling 
>>> custom classloaders.
>>> Do they have an impact on AOT code quality or the training data?
>>
>> As of now, there are a number of implementation-specific constraints 
>> imposed to simplify prototyping and experiments. I'm not sure about 
>> training data, but code caching is conservatively disabled for all 
>> classes loaded by custom loaders. Also, some recent CDS-specific 
>> enhancements in Leyden repository (like class preloading, constant 
>> pool entries pre-resolution) are disabled for custom loaders. So, as 
>> of today, code loaded by custom loaders doesn't benefit much from the 
>> work in Leyden.
>>
>>>     Until proper support for custom loaders is there, I suggest modifying
>>>     the benchmark so it relies only on existing system loaders.
>>>
>>>
>>> Is there ongoing work to improve the support for custom loaders?
>>
>> I'll let Ioi comment on the plans for custom loaders. But I 
>> wouldn't expect much progress on that front in the short term.
>>
> TL;DR: in the near term, we can probably support only a limited set of 
> custom class loader types, if at all.
> 
> The main problem with custom class loaders is that they can have side 
> effects that are observable at the Java level, so, strictly speaking, we 
> can't even change the order of invocations of ClassLoader.loadClass(). 
> Otherwise we may change the meaning of the program. E.g.,
> 
>      public static class MyLoader extends URLClassLoader {
>          static int x;
> 
>          public MyLoader(URL[] urls, ClassLoader parent) {
>              super(urls, parent);
>          }
> 
>          @Override
>          protected Class<?> loadClass(String name, boolean resolve)
>                  throws ClassNotFoundException {
>              x <<= 1;
>              x += name.hashCode();
>              System.out.println(x);
>              return super.loadClass(name, resolve);
>          }
>      }
> 
> If our optimizations change the order of class loading, you will have 
> unexpected output in stdout.
> 
> So if we want to time shift some computation that uses code loaded by 
> MyLoader, how can we authentically preserve all the side effects and 
> replay them in the production run?
> 
> Even if we remove the I/O from the example above, the value MyLoader.x 
> is still observable by other Java code. So how do we make sure 
> MyLoader.x has the correct value at every observation point in the 
> production run?
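> To make the point concrete (a runnable sketch with an invented class name, 
> following the MyLoader example above), even with the println removed, the 
> accumulated loader state is visible to any code that can read the field:

```java
// Sketch: same idea as MyLoader above, minus the I/O. The loader state
// is still observable, so the load order still matters.
public class LoaderStateDemo extends ClassLoader {
    static int x; // updated on every loadClass call, like MyLoader.x

    @Override
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        x = (x << 1) + name.hashCode();
        return super.loadClass(name, resolve);
    }

    public static void main(String[] args) throws Exception {
        LoaderStateDemo l = new LoaderStateDemo();
        l.loadClass("java.lang.String");
        int afterFirst = x;
        l.loadClass("java.lang.Integer");
        // Any reordering or elision of these loads silently changes the
        // value an observer reads here.
        System.out.println(x != afterFirst);
    }
}
```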
> 
> For example, assume we have a "constant" expression that we want to fold:
> 
>      class A { static int m() { return 1; } }
> 
>      class B { static int n() { return 2; } }
> 
>      class C {
>          static final int x = A.m() + B.n();
>      }
> 
> We not only need to remember that "C.x is constant-folded to 3", but also 
> that C.<clinit> will first load A and then B.
> 
> This means we have to keep C.<clinit> around (even though it has an 
> empty body as we removed the computation itself). We must trigger calls 
> to C.<clinit> at all static references to C, in order to trigger the 
> loading of A and B in the correct order.
> 
> As a result, all the AOT code that makes a static reference to C must 
> take a class initialization barrier.
> 
> So you can see how complexity can quickly get out of hand. For example, 
> we can't constant fold C.x in the AOT code because we need to preserve 
> the class init barrier.
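> To spell out the ordering constraint as something runnable (A, B, and C are 
> the classes from the example above; only the main class name is invented):

```java
// The first static reference to C triggers C.<clinit>, which initializes
// A before B; any time-shifted replay must preserve that order even if
// C.x itself has been constant-folded to 3.
class A { static int m() { return 1; } }
class B { static int n() { return 2; } }
class C { static final int x = A.m() + B.n(); }

public class InitOrderDemo {
    public static void main(String[] args) {
        System.out.println(C.x); // prints 3; also forces A-then-B loading
    }
}
```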
> 
> If you pre-compute a set of objects that includes instances of C, the 
> situation becomes even worse. You not only need to remember the values 
> of these objects, but also the sequence in which they were constructed, 
> so you can replay the correct sequence of class loading.
> 
> ==============
> 
> For loaders that are side-effect free at the Java level 
> (URLClassLoader??), maybe the story is simpler.
> 
> Thanks
> 
> - Ioi
> 
> 
>>> Another thing that I want to check is the portability of the AOT code.
>>> Do we do anything to ensure the AOT code is portable across 
>>> microarchitectures,
>>> that is, it is not tied to the CPU features of the system where the 
>>> code is being generated.
>>> If we bundle the cached code archive in containers, which I expect 
>>> would be one of the ways to deploy these archives,
>>> then the portability would come into picture.
>>
>> We plan to address that in the future but, so far, the JVM does nothing 
>> special in that respect. Moreover, the CDS archive itself imposes 
>> constraints on supported JVM modes (e.g., compressed oops and 
>> compressed class pointer modes must match). But users are free to 
>> specify additional constraints during the training process. For 
>> example, if the shared code archive is generated with -XX:UseAVX=2, 
>> there won't be any AVX512 instructions present in the archived code, 
>> which makes it safe to run on any AVX2-capable hardware.
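>> As a sketch of that idea (the jar name and the rest of the command line 
>> are placeholders; the only real flag being illustrated is -XX:UseAVX=2):

```shell
# Cap code generation at AVX2 during the training run; the archived code
# then contains no AVX512 instructions and runs on any AVX2-capable CPU.
# "app.jar" is a placeholder for the actual application.
java -XX:UseAVX=2 -jar app.jar
```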
>>
>> Best regards,
>> Vladimir Ivanov
>>
>>> On Tue, Sep 5, 2023 at 8:41 PM Vladimir Ivanov 
>>> <vladimir.x.ivanov at oracle.com> wrote:
>>>
>>>     Hi Ashutosh,
>>>
>>>     Thanks for giving it a try!
>>>
>>>     There were some experiments with PetClinic on our side before and it
>>>     was
>>>     noticed that the application relies on custom loaders which aren't
>>>     fully
>>>     supported yet. It was the main limiting factor for the new
>>>     optimizations.
>>>     Until proper support for custom loaders is there, I suggest modifying
>>>     the benchmark so it relies only on existing system loaders.
>>>
>>>     Speaking of peak performance, some loss of performance is expected.
>>>     Cached code is compiled conservatively (e.g., no constant folding 
>>> for
>>>     static final fields) so it can be reused in deployment runs. For 
>>> now,
>>>     the intended solution is to eventually recompile cached code online
>>>     with all the optimizations enabled (this has to be explicitly enabled
>>>     with -XX:+UseRecompilation). It's a work in progress and our
>>>     experience with it has been mixed: recompilation doesn't always fully
>>>     restore peak performance.
>>>
>>>     But assuming that both the CDS and cached code archives are
>>>     underutilized (due to the aforementioned reliance on custom loaders),
>>>     10% sounds like way too big a difference. I suggest experimenting
>>>     with different flag combinations (e.g., turning ReplayTraining and
>>>     LoadCachedCode on and off independently).
>>>
>>>     There's additional diagnostic output JVM produces which may help to
>>>     observe effects from new optimizations during both training and
>>>     deployment runs:
>>>
>>>        * -XX:+PrintCompilation: compilations satisfied from the cached
>>>     code archive are marked with "R";
>>>
>>>        * -XX:+CITime: prints information about cached code archive usage;
>>>
>>>        * -Xlog:init=info: produces additional information about some
>>>     startup activities;
>>>
>>>        * -XX:+PrintSharedArchiveAndExit: additionally dumps training data
>>>     and cached code archive info;
>>>
>>>        * -Xlog:scc*=info and -Xlog:cds*=info: print lots of additional
>>>     information during both training and deployment.
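>>>     For example, a deployment run combining several of these might look
>>>     like this (a sketch; the jar name is a placeholder):

```shell
# Diagnostic flags from the list above; "app.jar" stands in for the
# actual deployment command line.
java -XX:+PrintCompilation \
     -XX:+CITime \
     -Xlog:init=info \
     -Xlog:scc*=info -Xlog:cds*=info \
     -jar app.jar
```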
>>>
>>>     Hope it helps.
>>>
>>>     Best regards,
>>>     Vladimir Ivanov
>>>
>>>     On 9/5/23 13:52, Ashutosh Mehra wrote:
>>>      > Hi,
>>>      >
>>>      > We have been interested in persisting the profiling data in 
>>> the CDS
>>>      > archive with the intention of improving the application's warmup
>>>     time.
>>>      > And now that the premain branch is here that does save profile 
>>> data
>>>      > along with AOT, we started playing with the premain branch to
>>>     understand
>>>      > its impact on the performance.
>>>      >
>>>      > Our setup uses Springboot Petclinic [0] application and the 
>>> CDS and
>>>      > shared code archives are generated in a manner similar to this
>>>     script [1].
>>>      > Our training run only covers the application startup phase. That
>>>     means
>>>      > at each step we start the application and shut it down without
>>>     putting
>>>      > any load on it.
>>>      >
>>>      > Using the archives thus generated, I have done a few experiments
>>>      > on my local system. In these experiments the application is bound
>>>      > to two cpus.
>>>      > The baseline for comparing the results is the case where the CDS
>>>     archive
>>>      > does not have any profiling data and there is no shared code 
>>> archive.
>>>      > The "premain" configuration refers to using a shared code archive
>>>     and a
>>>      > CDS archive with training data.
>>>      >
>>>      > Here are some initial results:
>>>      >
>>>      > 1. Startup: It is heartening to see start-up time improve by
>>>     almost 11%.
>>>      >
>>>      > baseline       10.2s
>>>      > premain         9.1s
>>>      >
>>>      > 2. Warmup:
>>>      > This test measures the warmup time by applying load using 1 
>>> jmeter
>>>      > thread to get an idea of the ramp-up time to reach the peak
>>>     throughput.
>>>      > The load is applied for a duration of 300 seconds. The graph [2]
>>>      > for the aot+profiling configuration shows interesting behavior.
>>>      > In the initial period premain is ramping up faster than the
>>>     baseline.
>>>      > Then the slope of the curve for premain reduces significantly 
>>> and a
>>>      > couple of dips are also seen. Finally the throughput stabilizes.
>>>      > It shows a drastic difference in the warmup time of the
>>>     application when
>>>      > running with the "premain" config.
>>>      >
>>>      > 3. Peak throughput: Last experiment is to measure peak
>>>     throughput. It
>>>      > starts with a warm-up phase of 180 seconds using 1 jmeter thread.
>>>     After
>>>      > the warmup phase the load is applied with 10 jmeter threads for a
>>>      > duration of 5 mins.
>>>      > Last two minutes of throughput is considered for measurement. The
>>>     graph
>>>      > [3] for this test shows almost a 10% drop in the throughput
>>>     compared to
>>>      > the baseline.
>>>      >
>>>      >
>>>      > I am sure others would have done similar testing.  My 
>>> questions are:
>>>      >
>>>      > 1. Are these results on the expected lines?
>>>      > 2. Are these tests using the CDS and the shared code (or cached
>>>      > code) archives in the expected manner?
>>>      > 3. Warmup time with the premain branch looks pretty bad, which is
>>>      > surprising. Is there any trick I missed in my tests? Is there
>>>      > anything else that needs to be done to get better warmup time?
>>>      > 4. What is the point of creating a new static archive? 
>>> Shouldn't the
>>>      > applications just create the dynamic archive?
>>>      > 5. I am also wondering if there is any design doc that can be 
>>> shared
>>>      > that explains the AOT compilation strategy adopted in the premain
>>>     branch?
>>>      >
>>>      > I have placed my scripts here [4] in case anyone wants to use
>>>     them to
>>>      > run these tests (you need to build the Petclinic app before using
>>>     these
>>>      > scripts).
>>>      >
>>>      > Please feel free to share your thoughts.
>>>      >
>>>      > [0] https://github.com/spring-projects/spring-petclinic
>>>      > [1] https://github.com/openjdk/leyden/blob/d960fb15258cc99a1bf7f0b1e94bd8be06605aad/test/hotspot/jtreg/premain/lib/premain-run.sh#L70-L101
>>>      > [2] https://github.com/ashu-mehra/leyden-perf/blob/main/spring/fd82682/tput-t1.svg
>>>      > [3] https://github.com/ashu-mehra/leyden-perf/blob/main/spring/fd82682/tput-t10.svg
>>>      > [4] https://github.com/ashu-mehra/leyden-perf
>>>      >
>>>      > Thanks,
>>>      > - Ashutosh Mehra
>>>
> 

-- 
regards,


Andrew Dinn
-----------
Red Hat Distinguished Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill



More information about the leyden-dev mailing list