<div dir="ltr">Hi Vladimir,<div><br><div>Thanks for the explanation of peak performance and the diagnostic options.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">We ran some experiments with PetClinic on our side earlier and<br>noticed that the application relies on custom class loaders, which<br>aren't fully supported yet.</blockquote><div><br></div><div>Could you please elaborate on what support is required for handling custom class loaders?</div><div>Do they have an impact on AOT code quality or on the training data?</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Until proper support for custom loaders is in place, I suggest<br>modifying the benchmark so that it relies only on the existing system loaders.</blockquote><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><br></div><div>Is there ongoing work to improve support for custom loaders?</div><div><br></div><div>Another thing I want to check is the portability of the AOT code.</div><div>Do we do anything to ensure the AOT code is portable across microarchitectures,</div><div>that is, not tied to the CPU features of the system where the code is generated?</div><div>If we bundle the cached code archive in containers, which I expect will be one of the common ways to deploy these archives,</div><div>then portability comes into the picture.<br></div><div><br></div><div>Thanks,</div><div>- Ashutosh Mehra</div></div></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Sep 5, 2023 at 8:41 PM Vladimir Ivanov <<a href="mailto:vladimir.x.ivanov@oracle.com">vladimir.x.ivanov@oracle.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi Ashutosh,<br>
<br>
Thanks for giving it a try!<br>
<br>
We ran some experiments with PetClinic on our side earlier and <br>
noticed that the application relies on custom class loaders, which <br>
aren't fully supported yet. That was the main limiting factor for the new optimizations.<br>
Until proper support for custom loaders is in place, I suggest <br>
modifying the benchmark so that it relies only on the existing system loaders.<br>
<br>
Speaking of peak performance, some loss of performance is expected. <br>
Cached code is compiled conservatively (e.g., no constant folding for <br>
static final fields) so that it can be reused in deployment runs. For now, <br>
the intended solution is to eventually recompile cached code online with <br>
all optimizations enabled (this has to be explicitly enabled with <br>
-XX:+UseRecompilation). It's a work in progress, and our experience with <br>
it has been mixed: recompilation doesn't always fully restore peak <br>
performance.<br>
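In a deployment run, opting in would look something like this (a sketch; $ARCHIVE_FLAGS stands in for whatever archive-related options your run script already passes, and app.jar is a placeholder):<br>

```shell
# Opt in to online recompilation of the conservatively compiled
# cached code; the flag is off by default.
java $ARCHIVE_FLAGS -XX:+UseRecompilation -jar app.jar
```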
<br>
But assuming that both the CDS and cached code archives are underutilized <br>
(due to the aforementioned reliance on custom loaders), 10% sounds like <br>
way too big a difference. I suggest experimenting with different flag <br>
combinations (e.g., turning ReplayTraining and LoadCachedCode on and off <br>
independently).<br>
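For example, a small flag matrix along these lines might help isolate which feature the regression follows (a sketch; $ARCHIVE_FLAGS and app.jar are placeholders for the options and application your run script already uses):<br>

```shell
# Full "premain" configuration: both features on.
java $ARCHIVE_FLAGS -XX:+ReplayTraining -XX:+LoadCachedCode -jar app.jar

# Training replay only: does the drop disappear without cached code?
java $ARCHIVE_FLAGS -XX:+ReplayTraining -XX:-LoadCachedCode -jar app.jar

# Cached code only: does the drop disappear without replayed profiles?
java $ARCHIVE_FLAGS -XX:-ReplayTraining -XX:+LoadCachedCode -jar app.jar

# Both off: should behave roughly like the baseline.
java $ARCHIVE_FLAGS -XX:-ReplayTraining -XX:-LoadCachedCode -jar app.jar
```

Comparing the four throughput curves should show which feature the 10% drop follows.<br>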
<br>
The JVM also produces additional diagnostic output which may help to <br>
observe the effects of the new optimizations during both training and <br>
deployment runs:<br>
<br>
* -XX:+PrintCompilation: compilations satisfied from the cached code <br>
archive are marked with "R";<br>
<br>
* -XX:+CITime: prints information about cached code archive usage;<br>
<br>
* -Xlog:init=info: produces additional information about some startup <br>
activities;<br>
<br>
* -XX:+PrintSharedArchiveAndExit: additionally dumps training data and <br>
cached code archive info;<br>
<br>
* -Xlog:scc*=info and -Xlog:cds*=info: print lots of additional <br>
information during both training and deployment.<br>
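Putting those together, a deployment run with the diagnostics enabled could look like this (a sketch; $ARCHIVE_FLAGS and app.jar are placeholders for your existing archive options and application):<br>

```shell
# Capture compilation and archive diagnostics to a log for inspection;
# compilations marked "R" in the output came from the cached code archive.
java $ARCHIVE_FLAGS \
    -XX:+PrintCompilation -XX:+CITime \
    -Xlog:init=info -Xlog:scc*=info -Xlog:cds*=info \
    -jar app.jar 2>&1 | tee deployment.log
```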
<br>
Hope it helps.<br>
<br>
Best regards,<br>
Vladimir Ivanov<br>
<br>
On 9/5/23 13:52, Ashutosh Mehra wrote:<br>
> Hi,<br>
> <br>
> We have been interested in persisting the profiling data in the CDS <br>
> archive with the intention of improving the application's warmup time.<br>
> Now that the premain branch, which saves profile data along with <br>
> AOT-compiled code, is available, we have started playing with it to understand <br>
> its impact on performance.<br>
> <br>
> Our setup uses the Spring Boot PetClinic [0] application, and the CDS and <br>
> shared code archives are generated in a manner similar to this script [1].<br>
> Our training run only covers the application startup phase. That means <br>
> at each step we start the application and shut it down without putting <br>
> any load on it.<br>
> <br>
> Using the archives thus generated, I have done a few experiments on my <br>
> local system. In these experiments the application is bound to two CPUs.<br>
> The baseline for comparing the results is the case where the CDS archive <br>
> does not have any profiling data and there is no shared code archive.<br>
> The "premain" configuration refers to using a shared code archive and a <br>
> CDS archive with training data.<br>
> <br>
> Here are some initial results:<br>
> <br>
> 1. Startup: It is heartening to see start-up time improve by almost 11%.<br>
> <br>
> baseline 10.2s<br>
> premain 9.1s<br>
> <br>
> 2. Warmup:<br>
> This test measures the warmup time by applying load with 1 jmeter <br>
> thread, to get an idea of the ramp-up time to reach peak throughput.<br>
> The load is applied for a duration of 300 seconds. The graph [2] for <br>
> the aot+profiling configuration shows interesting behavior.<br>
> In the initial period premain ramps up faster than the baseline. <br>
> Then the slope of the premain curve drops significantly, and a <br>
> couple of dips are also seen. Finally the throughput stabilizes.<br>
> Overall it shows a drastic difference in the application's warmup time when <br>
> running with the "premain" config.<br>
> <br>
> 3. Peak throughput: The last experiment measures peak throughput. It <br>
> starts with a warmup phase of 180 seconds using 1 jmeter thread. After <br>
> the warmup phase, load is applied with 10 jmeter threads for a <br>
> duration of 5 minutes.<br>
> The last two minutes of throughput are considered for measurement. The graph <br>
> [3] for this test shows almost a 10% drop in throughput compared to <br>
> the baseline.<br>
> <br>
> <br>
> I am sure others would have done similar testing. My questions are:<br>
> <br>
> 1. Are these results on the expected lines?<br>
> 2. Are these tests using the CDS and the shared code (or cached code) <br>
> archives in the expected manner?<br>
> 3. Warmup time with the premain branch looks pretty bad, which is <br>
> surprising. Is there any trick I missed in my tests? Is there anything <br>
> else that needs to be done to get better warmup time?<br>
> 4. What is the point of creating a new static archive? Shouldn't the <br>
> applications just create the dynamic archive?<br>
> 5. I am also wondering if there is a design doc that can be shared <br>
> that explains the AOT compilation strategy adopted in the premain branch.<br>
> <br>
> I have placed my scripts here [4] in case anyone wants to use them to <br>
> run these tests (you need to build the Petclinic app before using these <br>
> scripts).<br>
> <br>
> Please feel free to share your thoughts.<br>
> <br>
> [0] <a href="https://github.com/spring-projects/spring-petclinic" rel="noreferrer" target="_blank">https://github.com/spring-projects/spring-petclinic</a> <br>
> [1] <br>
> <a href="https://github.com/openjdk/leyden/blob/d960fb15258cc99a1bf7f0b1e94bd8be06605aad/test/hotspot/jtreg/premain/lib/premain-run.sh#L70-L101" rel="noreferrer" target="_blank">https://github.com/openjdk/leyden/blob/d960fb15258cc99a1bf7f0b1e94bd8be06605aad/test/hotspot/jtreg/premain/lib/premain-run.sh#L70-L101</a><br>
> [2] <br>
> <a href="https://github.com/ashu-mehra/leyden-perf/blob/main/spring/fd82682/tput-t1.svg" rel="noreferrer" target="_blank">https://github.com/ashu-mehra/leyden-perf/blob/main/spring/fd82682/tput-t1.svg</a><br>
> [3] <br>
> <a href="https://github.com/ashu-mehra/leyden-perf/blob/main/spring/fd82682/tput-t10.svg" rel="noreferrer" target="_blank">https://github.com/ashu-mehra/leyden-perf/blob/main/spring/fd82682/tput-t10.svg</a><br>
> [4] <a href="https://github.com/ashu-mehra/leyden-perf" rel="noreferrer" target="_blank">https://github.com/ashu-mehra/leyden-perf</a> <br>
> <br>
> Thanks,<br>
> - Ashutosh Mehra<br>
<br>
</blockquote></div>