Hi,

We have been interested in persisting profiling data in the CDS archive with the intention of improving the application's warmup time. Now that the premain branch is here, which saves profile data along with AOT code, we started experimenting with it to understand its impact on performance.

Our setup uses the Spring Boot Petclinic [0] application, and the CDS and shared code archives are generated in a manner similar to this script [1]. Our training run covers only the application startup phase; that is, at each step we start the application and shut it down without putting any load on it.

Using the archives thus generated, I have done a few experiments on my local system. In these experiments the application is bound to two CPUs. The baseline for comparing the results is the case where the CDS archive does not have any profiling data and there is no shared code archive. The "premain" configuration refers to using a shared code archive and a CDS archive with training data.

Here are some initial results:

1. Startup: It is heartening to see start-up time improve by almost 11%.

   baseline 10.2s
   premain   9.1s

2. Warmup: This test measures the warmup time by applying load with 1 jmeter thread to get an idea of the ramp-up time to reach peak throughput. The load is applied for 300 seconds. The graph [2] for the aot+profiling configuration shows interesting behavior. In the initial period premain ramps up faster than the baseline. Then the slope of the premain curve drops significantly and a couple of dips appear before the throughput finally stabilizes. It shows a drastic difference in the warmup time of the application when running with the "premain" config.

3. Peak throughput: The last experiment measures peak throughput. It starts with a warm-up phase of 180 seconds using 1 jmeter thread. After the warmup phase the load is applied with 10 jmeter threads for 5 minutes, and the last two minutes of throughput are used for the measurement. The graph [3] for this test shows almost a 10% drop in throughput compared to the baseline.

I am sure others have done similar testing. My questions are:

1. Are these results along expected lines?
2. Are these tests using the CDS and shared code (or cached code) archives in the expected manner?
3. Warmup time with the premain branch looks pretty bad, which is surprising. Is there any trick I missed in my tests? Is there anything else that needs to be done to get better warmup time?
4. What is the point of creating a new static archive? Shouldn't applications just create the dynamic archive?
5. Is there a design doc that can be shared explaining the AOT compilation strategy adopted in the premain branch?

I have placed my scripts here [4] in case anyone wants to use them to run these tests (you need to build the Petclinic app before using these scripts).

Please feel free to share your thoughts.

[0] https://github.com/spring-projects/spring-petclinic
[1] https://github.com/openjdk/leyden/blob/d960fb15258cc99a1bf7f0b1e94bd8be06605aad/test/hotspot/jtreg/premain/lib/premain-run.sh#L70-L101
[2] https://github.com/ashu-mehra/leyden-perf/blob/main/spring/fd82682/tput-t1.svg
[3] https://github.com/ashu-mehra/leyden-perf/blob/main/spring/fd82682/tput-t10.svg
[4] https://github.com/ashu-mehra/leyden-perf

Thanks,
- Ashutosh Mehra
Hi Ashutosh,

Thanks for giving it a try!

There were some experiments with PetClinic on our side before, and it was noticed that the application relies on custom loaders, which aren't fully supported yet. That was the main limiting factor for the new optimizations. Until proper support for custom loaders is in place, I suggest modifying the benchmark so it relies only on the existing system loaders.

Speaking of peak performance, some loss of performance is expected. Cached code is compiled conservatively (e.g., no constant folding for static final fields) so it can be reused in deployment runs. For now, the intended solution is to eventually recompile cached code online with all optimizations enabled (this has to be explicitly requested with -XX:+UseRecompilation). It's a work in progress, and our experience with it has been mixed: recompilation doesn't always fully restore peak performance.

But assuming that both the CDS and cached code archives are underutilized (due to the aforementioned reliance on custom loaders), 10% sounds like way too big a difference. I suggest experimenting with different flag combinations (e.g., turning ReplayTraining and LoadCachedCode on and off independently).

There is additional diagnostic output the JVM produces which may help to observe the effects of the new optimizations during both training and deployment runs:

* -XX:+PrintCompilation: compilations satisfied from the cached code archive are marked with "R";
* -XX:+CITime: prints information about cached code archive usage;
* -Xlog:init=info: produces additional information about some startup activities;
* -XX:+PrintSharedArchiveAndExit: additionally dumps training data and cached code archive info;
* -Xlog:scc*=info and -Xlog:cds*=info: print lots of additional information during both training and deployment.

Hope it helps.

Best regards,
Vladimir Ivanov
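[Editorial aside: a quick way to check the custom-loader point above is to print the defining class loaders of a few classes. Classes loaded from the classpath are defined by the built-in application loader, while classes inside a Spring Boot executable jar are typically defined by Spring Boot's own loader and would therefore miss out on the optimizations discussed here. This is an illustrative sketch, not part of the premain tooling; the class name is made up.]

```java
// LoaderCheck.java - illustrative sketch for diagnosing which loader defines a class.
public class LoaderCheck {
    public static void main(String[] args) {
        // Bootstrap-loaded classes report a null loader.
        System.out.println("String loaded by: " + String.class.getClassLoader());
        // A class launched from the regular classpath is defined by the built-in
        // app loader (jdk.internal.loader.ClassLoaders$AppClassLoader); a class
        // inside a Spring Boot fat jar is typically defined by a custom loader.
        System.out.println("LoaderCheck loaded by: " + LoaderCheck.class.getClassLoader());
    }
}
```

Running this inside the benchmark's own process (or printing the loader of a representative application class) shows whether the workload actually stays on the system loaders.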
We see about a 300ms -> 90ms improvement for "javac HelloWorld.java". You can see the numbers at the end of:

https://github.com/openjdk/leyden/blob/premain/test/hotspot/jtreg/premain/ja...

Wall clock time - geomean over 10 runs of 'perf stat -r 16 javac HelloWorld.java'

Mainline JDK (CDS disabled)     302.86 ms
Mainline JDK (CDS enabled)      161.34 ms
Premain Prototype (CDS only)    131.71 ms
Premain Prototype (CDS + AOT)    92.84 ms

... with the typical disclaimer that the code is at a very early stage, so your mileage will vary ...

Thanks
- Ioi
Hi Vladimir,

Thanks for providing the explanation on peak performance and the diagnostic options.

> There were some experiments with PetClinic on our side before and it was noticed that the application relies on custom loaders which aren't fully supported yet.

Can you please elaborate on the support required for handling custom class loaders? Do they have an impact on the AOT code quality or the training data?

> Until proper support for custom loaders is there, I suggest to modify the benchmark so it relies only on existing system loaders.

Is there ongoing work to improve the support for custom loaders?

Another thing that I want to check is the portability of the AOT code. Do we do anything to ensure the AOT code is portable across microarchitectures, that is, that it is not tied to the CPU features of the system where the code is generated? If we bundle the cached code archive in containers, which I expect will be one of the common ways to deploy these archives, then portability comes into the picture.

Thanks,
- Ashutosh Mehra
> > There were some experiments with PetClinic on our side before and it was noticed that the application relies on custom loaders which aren't fully supported yet.
>
> Can you please elaborate more about the support required for handling custom class loaders. Do they have an impact on AOT code quality or the training data?

As of now, there are a number of implementation-specific constraints imposed to simplify prototyping and experiments. I'm not sure about the training data, but code caching is conservatively disabled for all classes loaded by custom loaders. Also, some recent CDS-specific enhancements in the Leyden repository (like class preloading and pre-resolution of constant pool entries) are disabled for custom loaders. So, as of today, code loaded by custom loaders doesn't benefit much from the work in Leyden.

> Is there ongoing work to improve the support for custom loaders?

I'll let Ioi comment on the plans for custom loaders. But I wouldn't expect much progress on that front in the short term.

> Another thing that I want to check is the portability of the AOT code. Do we do anything to ensure the AOT code is portable across microarchitectures [...]?

We plan to address that in the future, but so far the JVM does nothing special in that respect. Moreover, the CDS archive itself imposes constraints on the supported JVM modes (e.g., the compressed oops and compressed class pointer modes have to match). But users are free to impose additional constraints during the training process. For example, if the shared code archive is generated with -XX:UseAVX=2, there won't be any AVX512 instructions in the archived code, which makes it safe to run on any AVX2-capable hardware.

Best regards,
Vladimir Ivanov
On 9/6/23 7:45 PM, Vladimir Ivanov wrote:
There were some experiments with PetClinic on our side before and it was noticed that the application relies on custom loaders which aren't fully supported yet.
Can you please elaborate more on the support required for handling custom classloaders? Do they have an impact on AOT code quality or the training data?
As of now, there are a number of implementation-specific constraints imposed to simplify prototyping and experiments. I'm not sure about training data, but code caching is conservatively disabled for all classes loaded by custom loaders. Also, some recent CDS-specific enhancements in Leyden repository (like class preloading, constant pool entries pre-resolution) are disabled for custom loaders. So, as of today, code loaded by custom loaders doesn't benefit much from the work in Leyden.
Until proper support for custom loaders is there, I suggest modifying the benchmark so it relies only on the existing system loaders.
Is there ongoing work to improve the support for custom loaders?
I'll let Ioi comment on the plans for custom loaders. But I wouldn't expect much progress on that front in the short term.
TL/DR; in the near term, we can probably support only a limited set of custom class loader types, if at all.

The main problem with custom class loaders is that they can have side effects that are observable at the Java level, so strictly speaking we can't even change the order of invocations of ClassLoader.loadClass(). Otherwise we may change the meaning of the program. E.g.,

public static class MyLoader extends URLClassLoader {
    static int x;
    public MyLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        x <<= 1;
        x += name.hashCode();
        System.out.println(x);
        return super.loadClass(name, resolve);
    }
}

If our optimizations change the order of class loading, you will get unexpected output on stdout. So if we want to time-shift some computation that uses code loaded by MyLoader, how can we authentically preserve all the side effects and replay them in the production run? Even if we remove the I/O from the example above, the value of MyLoader.x is still observable by other Java code. So how do we make sure MyLoader.x has the correct value at every observation point in the production run?

For example, assume we have a "constant" expression that we want to fold:

class A { static int m() { return 1; } }
class B { static int n() { return 2; } }
class C {
    static final int x = A.m() + B.n();
}

We not only need to remember "C.x is constant-folded to 3", but also that C.<clinit> will first load A and then B. This means we have to keep C.<clinit> around (even though it has an empty body, as we removed the computation itself). We must trigger calls to C.<clinit> at all static references to C, in order to trigger the loading of A and B in the correct order. As a result, all the AOT code that makes a static reference to C must take a class initialization barrier.

So you can see how complexity can quickly get out of hand. For example, we can't constant-fold C.x in the AOT code because we need to preserve the class init barrier. If you pre-compute a set of objects that includes instances of C, the situation becomes even worse. You not only need to remember the values of these objects, but also the sequence in which they were constructed, so you can replay the correct sequence of class loading ....

==============

For loaders that are side-effect free at the Java level (URLClassLoader??), maybe the story is simpler.

Thanks
- Ioi
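[Editorial note: the point that load order is observable can be made concrete with a small self-contained sketch. This is an illustration added for clarity, not code from the thread; the class names are invented.]

```java
import java.util.ArrayList;
import java.util.List;

// Editorial sketch: the order of loadClass() calls is observable by other
// Java code, so an optimization that reorders class loading changes
// program behavior -- exactly the side effect described above.
public class OrderDemo {
    static final List<String> ORDER = new ArrayList<>();

    static class RecordingLoader extends ClassLoader {
        RecordingLoader(ClassLoader parent) { super(parent); }
        @Override
        protected Class<?> loadClass(String name, boolean resolve)
                throws ClassNotFoundException {
            ORDER.add(name);                  // observable side effect
            return super.loadClass(name, resolve);
        }
    }

    public static void main(String[] args) throws Exception {
        RecordingLoader loader =
            new RecordingLoader(OrderDemo.class.getClassLoader());
        loader.loadClass("java.lang.String");
        loader.loadClass("java.lang.Integer");
        // Swapping the two loads above would change ORDER, which any
        // Java code can read.
        System.out.println(ORDER);   // [java.lang.String, java.lang.Integer]
    }
}
```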
Another thing that I want to check is the portability of the AOT code. Do we do anything to ensure the AOT code is portable across microarchitectures, that is, that it is not tied to the CPU features of the system where the code is generated? If we bundle the cached code archive in containers, which I expect will be one of the common ways to deploy these archives, then portability comes into the picture.
We plan to address that in the future, but so far the JVM does nothing special in that respect. Moreover, the CDS archive itself imposes constraints on supported JVM modes (e.g., compressed oops and compressed class pointer modes should match). But users are free to specify additional constraints during the training process. For example, if the shared code archive is generated with -XX:UseAVX=2, there won't be any AVX-512 instructions in the archived code, which makes it safe to run on any AVX2-capable hardware.
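[Editorial note: as a sketch of that suggestion, the training run would look something like the following. Only -XX:UseAVX=2 comes from the thread; the remaining archive flags and the jar name are placeholders for whatever premain-run.sh [1] actually passes.]

```shell
# Sketch: cap the instruction set during the training/assembly run so the
# cached code archive contains no AVX-512 instructions, making it safe on
# any AVX2-capable machine. Everything except -XX:UseAVX=2 is a placeholder;
# see premain-run.sh [1] for the real invocation.
java -XX:UseAVX=2 <archive-creation flags from premain-run.sh [1]> \
     -jar spring-petclinic.jar
```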
Best regards, Vladimir Ivanov
On Tue, Sep 5, 2023 at 8:41 PM Vladimir Ivanov <vladimir.x.ivanov@oracle.com <mailto:vladimir.x.ivanov@oracle.com>> wrote:
Hi Ashutosh,
Thanks for giving it a try!
There were some experiments with PetClinic on our side before, and it was noticed that the application relies on custom loaders, which aren't fully supported yet. It was the main limiting factor for the new optimizations. Until proper support for custom loaders is there, I suggest modifying the benchmark so it relies only on the existing system loaders.
Speaking of peak performance, some loss of performance is expected. Cached code is compiled conservatively (e.g., no constant folding of static final fields) so it can be reused in deployment runs. For now, the intended solution is to eventually recompile cached code online with all the optimizations enabled (this has to be explicitly enabled with -XX:+UseRecompilation). It's a work in progress, and our experience using it was mixed: recompilation doesn't always fully restore peak performance.
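[Editorial note: a small illustration of why folding static finals at training time would be unsafe. This is an invented example, not code from the premain branch: a static final whose value depends on the runtime environment can differ between the training and deployment machines.]

```java
// Editorial sketch: why code cached from a training run must be compiled
// without constant-folding static final fields, unlike ordinary JIT code.
class Config {
    // Depends on the machine the JVM happens to run on, so the value
    // observed during a training run may be wrong in a deployment run.
    static final int CPUS = Runtime.getRuntime().availableProcessors();
}

public class FoldingDemo {
    public static void main(String[] args) {
        // A JIT compiling in *this* run may safely treat Config.CPUS as a
        // constant; cached AOT code reused on another machine may not.
        System.out.println("CPUS = " + Config.CPUS);
    }
}
```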
But even assuming that both the CDS and cached code archives are underutilized (due to the aforementioned reliance on custom loaders), 10% sounds like way too big a difference. I suggest experimenting with different flag combinations (e.g., turning ReplayTraining and LoadCachedCode on and off independently).
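[Editorial note: a sketch of such an experiment in the deployment run. Only the ReplayTraining and LoadCachedCode flag names come from the thread; the jar name and any additional flags the premain build may require are placeholders.]

```shell
# Sketch: toggle the premain features independently in the deployment run
# to isolate which one causes the throughput regression.
java -XX:+ReplayTraining -XX:-LoadCachedCode -jar app.jar   # training data only
java -XX:-ReplayTraining -XX:+LoadCachedCode -jar app.jar   # cached code only
```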
There's additional diagnostic output JVM produces which may help to observe effects from new optimizations during both training and deployment runs:
* -XX:+PrintCompilation: compilations satisfied from the cached code archive are marked with "R";
* -XX:+CITime: prints information about cached code archive usage;
* -Xlog:init=info: produces additional information about some startup activities;
* -XX:+PrintSharedArchiveAndExit: additionally dumps training data and cached code archive info;
* -Xlog:scc*=info and -Xlog:cds*=info: print lots of additional information during both training and deployment.
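[Editorial note: put together, a deployment run with this diagnostic output enabled might look like the following. The flags are exactly the ones listed above; the jar name is a placeholder.]

```shell
# Sketch: enable the diagnostics listed above in a single deployment run.
java -XX:+PrintCompilation -XX:+CITime \
     -Xlog:init=info -Xlog:scc*=info -Xlog:cds*=info \
     -jar app.jar
```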
Hope it helps.
Best regards, Vladimir Ivanov
On 9/5/23 13:52, Ashutosh Mehra wrote:
> [...]
Thanks Vladimir and Ioi for the explanation on custom classloaders.

If I understand correctly, the problem with the order of loading classes is not limited to classloaders with side effects; it also applies to <clinit>s with side effects. If the program output depends on the order in which classes are loaded, irrespective of the classloader involved, that order would need to be preserved, right? If so, then why do custom classloaders pose a problem and not the built-in loaders? If having init barriers is sufficient for the built-in loaders, why isn't that the case for custom loaders? What am I missing?

Thanks, - Ashutosh Mehra

On Thu, Sep 7, 2023 at 10:08 PM <ioi.lam@oracle.com> wrote:
> [...]
By "class loading", I was referring to the fact that an InstanceKlass is parsed from Java bytecodes and made available in the system dictionary. At this point, the <clinit> of this class is not yet executed.

In the current premain branch, we do not call <clinit> on arbitrary classes (only a very limited set of classes are initialized at dump time). We are exploring ways to extend the set of dump-time initialized classes. For example, we can run those that are proven to have no side effects. The init barriers we have in the AOT code today are there to ensure that <clinit> is called properly at runtime.

Thanks
- Ioi

On 9/12/23 8:11 AM, Ashutosh Mehra wrote:
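[Editorial note: the loading-vs-initialization distinction described here can be seen directly from plain Java. This is an invented example, unrelated to the premain implementation.]

```java
// Editorial sketch: loading a class (making it available) does not run its
// <clinit>; initialization happens on first use of a static member.
public class LoadVsInit {
    static boolean targetInitialized = false;

    static class Target {
        static { targetInitialized = true; }
        static int x = 42;
    }

    public static void main(String[] args) throws Exception {
        // initialize=false: the class is loaded but <clinit> does not run.
        Class<?> c = Class.forName(LoadVsInit.class.getName() + "$Target",
                                   /* initialize = */ false,
                                   LoadVsInit.class.getClassLoader());
        System.out.println("loaded; <clinit> ran? " + targetInitialized); // false
        // First access to a static member triggers initialization:
        System.out.println("x = " + Target.x);                            // 42
        System.out.println("<clinit> ran? " + targetInitialized);         // true
    }
}
```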
> [...]
>> <https://github.com/openjdk/leyden/blob/d960fb15258cc99a1bf7f0b1e94bd8be06605... <https://urldefense.com/v3/__https://github.com/openjdk/leyden/blob/d960fb15258cc99a1bf7f0b1e94bd8be06605aad/test/hotspot/jtreg/premain/lib/premain-run.sh*L70-L101__;Iw!!ACWV5N9M2RV99hQ!LCs0zk6I2i6OOgGDoN6LQHirxXt-TPPwzPWAPUx5EpibFxCpaNTitfV7uVyKGn4gZqlwP5XzhNGQpA$>
>> >> > [2] >> > >> https://github.com/ashu-mehra/leyden-perf/blob/main/spring/fd82682/tput-t1.s... <https://urldefense.com/v3/__https://github.com/ashu-mehra/leyden-perf/blob/main/spring/fd82682/tput-t1.svg__;!!ACWV5N9M2RV99hQ!LCs0zk6I2i6OOgGDoN6LQHirxXt-TPPwzPWAPUx5EpibFxCpaNTitfV7uVyKGn4gZqlwP5V9byR7rA$>
>> <https://github.com/ashu-mehra/leyden-perf/blob/main/spring/fd82682/tput-t1.s... <https://urldefense.com/v3/__https://github.com/ashu-mehra/leyden-perf/blob/main/spring/fd82682/tput-t1.svg__;!!ACWV5N9M2RV99hQ!LCs0zk6I2i6OOgGDoN6LQHirxXt-TPPwzPWAPUx5EpibFxCpaNTitfV7uVyKGn4gZqlwP5V9byR7rA$>
>> >> > [3] >> > >> https://github.com/ashu-mehra/leyden-perf/blob/main/spring/fd82682/tput-t10.... <https://urldefense.com/v3/__https://github.com/ashu-mehra/leyden-perf/blob/main/spring/fd82682/tput-t10.svg__;!!ACWV5N9M2RV99hQ!LCs0zk6I2i6OOgGDoN6LQHirxXt-TPPwzPWAPUx5EpibFxCpaNTitfV7uVyKGn4gZqlwP5UHm1Xhgg$>
>> <https://github.com/ashu-mehra/leyden-perf/blob/main/spring/fd82682/tput-t10.... <https://urldefense.com/v3/__https://github.com/ashu-mehra/leyden-perf/blob/main/spring/fd82682/tput-t10.svg__;!!ACWV5N9M2RV99hQ!LCs0zk6I2i6OOgGDoN6LQHirxXt-TPPwzPWAPUx5EpibFxCpaNTitfV7uVyKGn4gZqlwP5UHm1Xhgg$>
>> >> > [4] https://github.com/ashu-mehra/leyden-perf <https://urldefense.com/v3/__https://github.com/ashu-mehra/leyden-perf__;!!ACWV5N9M2RV99hQ!LCs0zk6I2i6OOgGDoN6LQHirxXt-TPPwzPWAPUx5EpibFxCpaNTitfV7uVyKGn4gZqlwP5WezIkxBw$> >> <https://urldefense.com/v3/__https://github.com/ashu-mehra/leyden-perf__;!!ACWV5N9M2RV99hQ!PpMTtwDD2_k-drLo0lLtZ1pybI_zZkMM7RH-TfRvfAEBwceCBkjYqfi7baTqI_r0e5f-qUZenTYkntGaY_ZcbT4$> >> > <https://github.com/ashu-mehra/leyden-perf <https://urldefense.com/v3/__https://github.com/ashu-mehra/leyden-perf__;!!ACWV5N9M2RV99hQ!LCs0zk6I2i6OOgGDoN6LQHirxXt-TPPwzPWAPUx5EpibFxCpaNTitfV7uVyKGn4gZqlwP5WezIkxBw$> >> <https://urldefense.com/v3/__https://github.com/ashu-mehra/leyden-perf__;!!ACWV5N9M2RV99hQ!PpMTtwDD2_k-drLo0lLtZ1pybI_zZkMM7RH-TfRvfAEBwceCBkjYqfi7baTqI_r0e5f-qUZenTYkntGaY_ZcbT4$>> >> > >> > Thanks, >> > - Ashutosh Mehra >>
Just to note: One of the key things a custom classloader can decide at runtime is to which loader it wants to delegate responsibility for loading class bytes. This is arguably the most *pertinent* run-time dependent side-effect that you might bake in at build time, thereby invalidating expected behaviour. It's not the most significant run-time dependent side-effect, of course, because they are all equally significant.

regards,

Andrew Dinn
-----------

On 08/09/2023 03:08, ioi.lam@oracle.com wrote:
On 9/6/23 7:45 PM, Vladimir Ivanov wrote:
There were some experiments with PetClinic on our side before and it was noticed that the application relies on custom loaders which aren't fully supported yet.
Can you please elaborate on the support required for handling custom class loaders? Do they have an impact on AOT code quality or the training data?
As of now, there are a number of implementation-specific constraints imposed to simplify prototyping and experiments. I'm not sure about training data, but code caching is conservatively disabled for all classes loaded by custom loaders. Also, some recent CDS-specific enhancements in the Leyden repository (like class preloading and constant pool entry pre-resolution) are disabled for custom loaders. So, as of today, code loaded by custom loaders doesn't benefit much from the work in Leyden.
Until proper support for custom loaders is there, I suggest modifying the benchmark so that it relies only on existing system loaders.
Is there ongoing work to improve the support for custom loaders?
I'll let Ioi comment on the plans about custom loaders. But I wouldn't expect much progress on that front in the short term.
TL;DR: in the near term, we can probably support only a limited set of custom class loader types, if at all.
The main problem with custom class loaders is that they can have side effects that are observable at the Java level, so strictly speaking we can't even change the order of invocations of ClassLoader.loadClass(). Otherwise we may change the meaning of the program. E.g.,
    public static class MyLoader extends URLClassLoader {
        static int x;

        public MyLoader(URL[] urls, ClassLoader parent) {
            super(urls, parent);
        }

        protected Class<?> loadClass(String name, boolean resolve)
                throws ClassNotFoundException {
            x <<= 1;
            x += name.hashCode();
            System.out.println(x);
            return super.loadClass(name, resolve);
        }
    }
If our optimizations change the order of class loading, you will have unexpected output in stdout.
So if we want to time shift some computation that uses code loaded by MyLoader, how can we authentically preserve all the side effects and replay them in the production run?
Even if we remove the I/O from the example above, the value MyLoader.x is still observable by other Java code. So how do we make sure MyLoader.x has the correct value at every observation point in the production run?
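To make that concrete, here is a hedged, standalone sketch (not premain code) of the same order-sensitive update that MyLoader.x performs; it shows that replaying the same two class names in a different order generally leaves a different value behind:

```java
// Standalone model of MyLoader.x's update rule: x <<= 1; x += name.hashCode().
// Applying it to the same names in a different order yields a different x,
// so any reordering of loadClass() calls is observable at the Java level.
public class OrderSensitive {
    static int x;

    static void record(String name) {
        x <<= 1;
        x += name.hashCode();
    }

    public static void main(String[] args) {
        x = 0;
        record("Foo");
        record("Bar");
        int fooThenBar = x;

        x = 0;
        record("Bar");
        record("Foo");
        int barThenFoo = x;

        // Differs whenever the two hash codes differ.
        System.out.println(fooThenBar == barThenFoo);
    }
}
```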
For example, assume we have a "constant" expression that we want to fold:
    class A { static int m() { return 1; } }

    class B { static int n() { return 2; } }

    class C {
        static final int x = A.m() + B.n();
    }
We not only need to remember that "C.x is constant-folded to 3", but also that C.<clinit> will first load A and then B.
This means we have to keep C.<clinit> around (even though it has an empty body as we removed the computation itself). We must trigger calls to C.<clinit> at all static references to C, in order to trigger the loading of A and B in the correct order.
As a result, all the AOT code that makes a static reference to C must take a class initialization barrier.
So you can see how complexity can quickly get out of hand. For example, we can't constant fold C.x in the AOT code because we need to preserve the class init barrier.
If you pre-compute a set of objects that include instances of C, the situation becomes even worse. You not only need to remember the values of these objects, but also the sequence of how they were constructed so you can replay the correct sequence of class loading ....
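To see the ordering constraint in action, here is a hedged, self-contained variant of the A/B/C example above (illustrative only) in which the initialization side effects are recorded; evaluating C.x must initialize A and then B, in that order:

```java
import java.util.ArrayList;
import java.util.List;

public class InitOrder {
    static final List<String> loadOrder = new ArrayList<>();

    static class A { static { loadOrder.add("A"); } static int m() { return 1; } }
    static class B { static { loadOrder.add("B"); } static int n() { return 2; } }
    static class C { static final int x = A.m() + B.n(); }

    public static void main(String[] args) {
        int v = C.x;                    // class init barrier: runs C.<clinit>,
                                        // which initializes A, then B, in order
        System.out.println(loadOrder);  // prints [A, B]
        System.out.println(v);          // prints 3
    }
}
```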
==============
For loaders that are side-effect free at the Java level (URLClassLoader??), maybe the story is simpler.
Thanks
- Ioi
Another thing that I want to check is the portability of the AOT code. Do we do anything to ensure the AOT code is portable across microarchitectures, that is, that it is not tied to the CPU features of the system where the code is being generated? If we bundle the cached code archive in containers, which I expect would be one of the ways to deploy these archives, then the portability would come into the picture.
We plan to address that in the future, but, so far, the JVM does nothing special in that respect. Moreover, the CDS archive itself imposes constraints on supported JVM modes (e.g., compressed oops and compressed class pointer modes should match). But users are free to specify any additional constraints during the training process. For example, if the shared code archive is generated with -XX:UseAVX=2, there won't be any AVX512 instructions present in the archived code, which makes it safe to run on any AVX2-capable hardware.
Best regards, Vladimir Ivanov
On Tue, Sep 5, 2023 at 8:41 PM Vladimir Ivanov <vladimir.x.ivanov@oracle.com <mailto:vladimir.x.ivanov@oracle.com>> wrote:
Hi Ashutosh,
Thanks for giving it a try!
There were some experiments with PetClinic on our side before and it was noticed that the application relies on custom loaders which aren't fully supported yet. It was the main limiting factor for new optimizations. Until proper support for custom loaders is there, I suggest modifying the benchmark so that it relies only on existing system loaders.
Speaking of peak performance, some loss of performance is expected. Cached code is compiled conservatively (e.g., no constant folding for static final fields) so it can be reused in deployment runs. For now, the intended solution is to eventually recompile cached code online with all the optimizations enabled (this has to be explicitly enabled with -XX:+UseRecompilation). It's a work in progress and our experience using it was mixed: recompilation doesn't always fully restore peak performance.
But assuming that both CDS and cached code archives are underutilized (due to the aforementioned reliance on custom loaders), 10% sounds way too big of a difference. I suggest experimenting with different flag combinations (e.g., turning ReplayTraining and LoadCachedCode on and off independently).
There's additional diagnostic output the JVM produces which may help to observe effects from new optimizations during both training and deployment runs:
* -XX:+PrintCompilation: compilations satisfied from cached code archive are marked w/ "R";
* -XX:+CITime: prints information about cached code archive usage;
* -Xlog:init=info: produces additional information about some startup activities
* -XX:+PrintSharedArchiveAndExit additionally dumps training data and cached code archive info
* -Xlog:scc*=info and -Xlog:cds*=info print lots of additional information both during training and deployment
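For example, a deployment run could combine several of these flags; this is a hedged sketch of such an invocation, where the jar name is a placeholder for your application:

```shell
# Diagnostic flags from the list above, applied to a deployment run.
# Replace petclinic.jar with your actual application jar.
# The -Xlog patterns are quoted so the shell does not glob the '*'.
java -XX:+PrintCompilation -XX:+CITime \
     -Xlog:init=info '-Xlog:scc*=info' '-Xlog:cds*=info' \
     -jar petclinic.jar
```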
Hope it helps.
Best regards, Vladimir Ivanov
On 9/5/23 13:52, Ashutosh Mehra wrote:
> [...]
-- regards, Andrew Dinn ----------- Red Hat Distinguished Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill
(I’m putting in email something we discussed on a zoom call.)

On the premain project, we are learning to perform special optimizations that depend on “well behaved” class loaders, because they are simply front-ends to declarative information like the class path. These optimizations shift work into a training run, storing resulting states into a CDS archive, and then adopting those states into a deployed application, quickly, as it starts up. (“But what about user-defined loaders? But why do we have to use CDS?” — See below for comments on these two side issues.)

Ashutosh, you and your team have mentioned that there are tens of milliseconds (several percentage points of time) consumed during startup of some workloads by *failed* lookups. A logging framework may be querying for code resources and falling back somehow if they fail to load. The code probably has a try/catch that processes `ClassNotFoundException` or the like.

We know that *successful* lookups go fast the second time because the VM caches the result in a central system dictionary. And, CDS technology makes successful lookups go fast the *first time*, if the lookup was performed in a training run and the resulting state stored in a CDS archive. (Those who watch our premain branch will see that there is lots of low-hanging fruit in CDS, that we are only beginning to enjoy.)

But, a *failed* lookup is not recorded anywhere. So every distinct lookup must start again from first principles and fail all over again. For some workloads this costs a small but measurable percentage of startup time.

The story is different for the local `CONSTANT_Class` entries in any given classfile: The JVMS mandates that both successful and failed lookups are recorded on the first attempt (per CP entry per se, not globally and not per class). Global usage includes both use of `Class.forName` and the “back end” logic for CP entry resolution.
CP resolution is performed at most once per CP entry, and (win or lose) is made sticky on the CP itself, locally.

To summarize, we can say that, for class lookup, both success and failure are “sticky” locally, and success is “sticky” globally, but failure is “not sticky” globally.

The global behavior can be thought of either as specific to a class loader (i.e., coded in JDK code) or as something in the VM or JNI code that works with the JDK code. In reality it is an emergent property of a number of small details in both.

A *negative lookup cache* is a collection of class names (for a given loader) which have already failed to load. “Sticky failure” could be implemented with a negative lookup cache, either on a class loader (my preferred solution, I think) or else somewhere in the VM internals that participate in class loading paths.

The benefits are obvious: Startup could be shorter by tens of milliseconds. The eliminated operations include re-creating exceptions, throwing and catching them, and (maybe) uselessly re-probing the file system.

The risks include at least two cases. First, a user might somehow contrive to extend the class path after a failure has been made sticky, and then be disappointed when a class that satisfies the load appears on the new class path components. Second, a user might somehow contrive to mutate an existing class path component (by writing a file into a directory, say), and have the same disappointment of not seeing the classfile get picked up on the next request.

But it seems to me that a negative lookup cache is a legitimate optimization *for well behaved class loaders*. (Please check my work here!) The precondition is that the well behaved class loader takes its input from sources that cannot be updated after the VM has started running. Or, if and when those inputs are updated somehow, the negative cache must be invalidated, at least for classes that could possibly be loaded from the updated parts.
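As a thought experiment, the Java-level flavor of this idea might look something like the following sketch (hypothetical code, not the premain implementation; the class and field names are made up). It assumes the class path cannot change while the VM runs, per the preconditions above:

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a per-loader negative lookup cache. A failed
// lookup is made "sticky": the name maps to a pre-constructed exception
// (message only, no meaningful backtrace), so repeat failures skip the
// class path probe entirely.
public class NegativeCachingLoader extends URLClassLoader {
    private final Map<String, ClassNotFoundException> negative = new ConcurrentHashMap<>();

    public NegativeCachingLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        ClassNotFoundException sticky = negative.get(name);
        if (sticky != null) {
            throw sticky;  // cache hit: fail fast, no re-probing
        }
        try {
            return super.loadClass(name, resolve);
        } catch (ClassNotFoundException e) {
            // Record the failure; keep only the class name as the message.
            negative.putIfAbsent(name, new ClassNotFoundException(name));
            throw e;
        }
    }
}
```

Invalidation (on any class path mutation) and CDS pre-population are exactly the parts this sketch leaves out.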
You can sometimes reason from the package prefix and from the class path updates that some name cannot be read from some class path element, just because of a missing directory.

A CDS archive records its class path, and can detect whether that class path reads only from an immutable backing store. (This is a sweet spot for Leyden.) If that is the case, then the CDS archive could also store a negative lookup cache (for each eligible class loader). I think this should be done in Java code, with the relevant field and its data special-cased to be retained via CDS.

(I mean “special-cased” the way we already special-case some other selected data, like the module graph and integer box cache. As with framework-defined class loaders, we may have a conversation in the future about letting user code into this little game as well. But it has to be done in a way that does not violate any specification, which makes it challenging. One step at a time.)

For immediate prototyping and testing of the concept, we don’t need to bring CDS into the picture. We can just have a global flag that says “it is safe to use a negative lookup cache”. But to roll out this optimization in a product, the flag needs to be automatically set to a safe value, probably by CDS at startup, based on an inspection of the class path settings in both training and deployment runs. And of course (as a separate step) we can pre-populate the caches at CDS dump time (that is, after a training run), so that the deployed application can immediately benefit from the cache, and spend zero time exploring the class path for classes that are known to be missing.

BTW, I think it is just fine to throw a pre-constructed exception when the negative lookup cache hits, even though some users will complain that such exceptions lack meaningful messages and backtraces. It’s within spec.
HotSpot does this for certain “hot throws” of built-in exceptions; see `GraphKit::builtin_throw`, and see also the tricky logic that makes failures sticky in CP entries (which edits down the exception information). As a compromise, the negative lookup cache could store an exception object whose message is the class name (but with no backtrace).

There’s another way to approach this issue, which is to index the class path in such a way that class loaders can respond to arbitrary load requests but do little or no work on failing requests. A Bloom filter is sometimes used in such cases to avoid many (not all) of the searches. But I think that’s overkill for the use cases we actually observe, which is a large number of failed lookups on a small number of class names. A per-loader table mapping a name to an exception seems to be a good tradeoff. And as I noted, CDS can pre-populate these things eventually.

Ashutosh, maybe you are interested in working on some of this? :-)

— John

P.S. If the negative lookup cache has the right “stability” properties, we can even ask the JIT to think about optimizing failing `Class.forName` calls, by consulting the cache at compile time. In the Leyden setting, some `Class.forName` calls (not all) can be constant-folded. Perhaps the argument is semi-constant and can be profiled and speculated. Maybe some of that pays off, or maybe not; probably not, since the `forName` call is probably buried in a stack of middleware. These are ideas for the JIT team to put on their very long list.

P.P.S. Regarding the two side issues mentioned above…

We are not at all forgetting about framework-defined class loaders. But for the next few months it is enough to assume that we will optimize only class loaders which are defined by the VM+JDK substrate. In the future we will want to investigate how to make framework-defined loaders compatible with whatever optimizations we create for the well behaved JDK class loaders.
It is not yet time to discuss that in detail; it is time to learn the elements of our craft by working with the well behaved class loaders only.

The same comment applies to the observation that we might try to “auto-train” applications. That is, get rid of the CDS archive generated by a separate training run, and just automagically run the same application faster the second time, by capturing CDS-like states from the first run, treating it “secretly” as a training run. We know this can work well on some Java workloads. But we also like the predictability and simplicity of CDS. For HotSpot, it is not yet time to work on applying our learnings with CDS to the problem of auto-training. I hope that time will come after we have mined out more of the basic potential of CDS. For now we are working on the “one-step workflow”, where there is an explicit training phase that generates CDS. The “zero-step workflow” will come in time.
We know that /successful/ lookups go fast the second time because the VM caches the result in a central system dictionary. And, CDS technology makes successful lookups go fast the /first time/, if the lookup was performed in a training run and the resulting state stored in a CDS archive. (Those who watch our premain branch will see that there is lots of low-hanging fruit in CDS, that we are only beginning to enjoy.)
Even though repeated successful lookups are already fast, it is still beneficial to optimize them. For example, class pre-loading and CP entry pre-resolution are implemented in premain and do give noticeable startup improvements. And repeated successful lookups are common when it comes to Class.forName(). For example, the PetClinic deployment run experiences 10k calls into JVM_FindClassFromCaller which cost ~20ms (measured on M1 Pro). So, while the negative lookup cache looks like the lowest-hanging fruit, it's worth considering the positive lookup caching scenario as well.

Best regards,
Vladimir Ivanov
But, a /failed/ lookup is not recorded anywhere. So every distinct lookup must start again from first principles and fail all over again. For some workloads this costs a small but measurable percentage of startup time.
The story is different for the local |CONSTANT_Class| entries in any given classfile: The JVMS mandates that both successful and failed lookups are recorded on the first attempt (per CP entry per se, not globally and not per class). Global usage includes both use of |Class.forName| and the “back end” logic for CP entry resolution. CP resolution is performed at most once per CP entry, and (win or lose) is made sticky on the CP itself, locally.
To summarize, we can say that, for class lookup, both success and failure are “sticky” locally, and success is “sticky” globally, but failure is “not sticky” globally.
The global behavior can be thought of either specific to a class loader (i.e., coded in JDK code) or as something in the VM or JNI code that works with the JDK code. In reality it is an emergent property of a number of small details in both.
A /negative lookup cache/ is a collection of class names (for a given loader) which have already failed to load. “Sticky failure” could be implemented with a negative lookup cache, either on a class loader (my preferred solution, I think) or else somewhere in the VM internals that participate in class loading paths.
The benefits are obvious: Startup could be shorter by tens of milliseconds. The eliminated operations include re-creating exceptions, and throwing and catching them, and (maybe) uselessly re-probing the file system.
The risks include at least two cases. First, a user might somehow contrive to extend the class path after a failure has been made sticky, and then the user could be disappointed when a class appears on the new class path components that satisfies the load. Second, a user might somehow contrive to mutate an existing class path component (by writing a file into a directory, say), and have the same disappointment of not seeing the classfile get picked up on the next request.
But it seems to me that a negative lookup cache is a legitimate optimization /for well behaved class loaders/. (Please check my work here!) The preconditions are that the well behaved class takes its input from inputs that cannot be updated after the VM has started running. Or, if and when those inputs are updated somehow, the negative cache must be invalidated, at least for classes that could possibly be loaded from the updated parts. You can sometimes reason from the package prefix and from the class path updates that some name cannot be read from some class path element, just because of a missing directory.
A CDS archive records its class path, and can detect whether that class path reads only from an immutable backing store. (This is a sweet spot for Leyden.) If that is the case, then the CDS archive could also store a negative lookup cache (for each eligible class loader). I think this should be done in Java code and the relevant field and its data special-cased to be retained via CDS.
(I mean “special-cased” the way we already special-case some other selected data, like the module graph and integer box cache. As with framework-defined class loaders, we may have a conversation in the future about letting user code into this little game as well. But it has to be done in a way that does not violate any specification, which makes it challenging. One step at a time.)
For immediate prototyping and testing of the concept, we don’t need to bring CDS into the picture. We can just have a global flag that says “it is safe to use a negative lookup cache”. But to roll out this optimization in a product, the flag needs to be automatically set to a safe value, probably by CDS at startup, based on an inspection of the class path settings in both training and deployment runs. And of course (as a separate step) we can pre-populate the caches at CDS dump time (that is, after a training run), so that the deployed application can immediately benefit from the cache, and spend zero time exploring the class path for classes that are known to be missing.
BTW, I think it is just fine to throw a pre-constructed exception when the negative lookup cache hits, even though some users will complain that such exceptions are lacking meaningful messages and backtraces. It’s within spec. HotSpot does this for certain “hot throws” of built-in exceptions; see |GraphKit::builtin_throw|, and see also the tricky logic that makes failures sticky in CP entries (which edits down the exception information). As a compromise, the negative lookup cache could store an exception object whose message is the class name (but with no backtrace).
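That compromise might look like the following (an illustrative subclass, not existing JDK code): the message carries the class name, but |fillInStackTrace| is overridden so no backtrace is ever captured, which also makes the object cheap enough to preallocate and rethrow.

```java
// Sketch: a ClassNotFoundException whose message is the class name but which
// never records a stack trace, in the spirit of HotSpot's preallocated
// "hot throw" exceptions.
class StickyClassNotFoundException extends ClassNotFoundException {
    StickyClassNotFoundException(String className) {
        super(className);
    }

    @Override
    public synchronized Throwable fillInStackTrace() {
        return this;  // suppress backtrace capture; the object may be cached and rethrown
    }
}
```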
There’s another way to approach this issue, which is to index the class path in such a way that class loaders can respond to arbitrary load requests but do little or no work on failing requests. A Bloom filter is sometimes used in such cases to avoid many (not all) of the searches. But I think that’s overkill for the use cases we actually observe, which is a large number of failed lookups on a small number of class names. A per-loader table mapping a name to an exception seems to be a good tradeoff. And as I noted, CDS can pre-populate these things eventually.
Ashutosh, maybe you are interested in working on some of this? :-)
— John
P.S. If the negative lookup cache has the right “stability” properties, we can even ask the JIT to think about optimizing failing |Class.forName| calls, by consulting the cache at compile time. In the Leyden setting, some |Class.forName| calls (not all) can be constant-folded. Perhaps the argument is semi-constant and can be profiled and speculated. Maybe some of that pays off, or maybe not; probably not since the |forName| call is probably buried in a stack of middleware. These are ideas for the JIT team to put on their very long list.
P.P.S. Regarding the two side issues mentioned above…
We are not at all forgetting about framework-defined class loaders. But for the next few months it is enough to assume that we will optimize only class loaders which are defined by the VM+JDK substrate. In the future we will want to investigate how to make framework-defined loaders compatible with whatever optimizations we create for the well behaved JDK class loaders. It is not yet time to discuss that in detail; it is time to learn the elements of our craft by working with the well behaved class loaders only.
The same comment applies to the observation that we might try to “auto-train” applications. That is, get rid of the CDS archive, generated by a separate training run, and just automagically run the same application faster the second time, by capturing CDS-like states from the first run, treating it “secretly” as a training run. We know this can work well on some Java workloads. But we also like the predictability and simplicity of CDS. For HotSpot, it is not yet time to work on applying our learnings with CDS to the problem of auto-training. I hope that time will come after we have mined out more of the basic potential of CDS. For now we are working on the “one-step workflow”, where there is an explicit training phase that generates CDS. The “zero-step workflow” will come in time.
Ashutosh, you and your team have mentioned that there are tens of milliseconds (several percentage points of time) consumed during startup of some workloads by /failed/ lookups

While working on Vladimir's suggestion to profile JVM_FindClassFromCaller, I realized I had made a mistake in my earlier attempt to profile the Class.forName method. Sadly, once I fixed that bug, the time spent in failed lookups is not that significant any more.

This is the patch <https://github.com/ashu-mehra/leyden/commit/0bd59831b387358b60b9f38080ff09081512679a> I have for profiling Class.forName at the Java level. It shows the time spent in Class.forName for negative lookups for the app class loader. For the Quarkus app that I am using, the patch reports 11ms, which is 1.4% of the startup time (of 750ms). For the Springboot-petclinic app the patch reports 36ms, which is 1.1% of the startup time (of 3250ms).

The other patch <https://github.com/ashu-mehra/leyden/commit/3923adbb2a3e3291965dd5b85cb7a918db555117> I have is for profiling JVM_FindClassFromCaller when it throws an exception for the app classloader. For the Quarkus app the patch reports 5ms. For the Springboot-petclinic app the patch reports 25ms.

Given these numbers, @JohnR do you think it is still worth spending time on the negative cache for the class loaders? And sorry for reporting incorrect numbers earlier.

Thanks,
- Ashutosh Mehra

On Thu, Jan 11, 2024 at 7:39 PM Vladimir Ivanov <vladimir.x.ivanov@oracle.com> wrote:
We know that /successful/ lookups go fast the second time because the VM caches the result in a central system dictionary. And, CDS technology makes successful lookups go fast the /first time/, if the lookup was performed in a training run and the resulting state stored in a CDS archive. (Those who watch our premain branch will see that there is lots of low-hanging fruit in CDS, that we are only beginning to enjoy.)
Even though repeated successful lookups are already fast, it is still beneficial to optimize them. For example, class pre-loading and CP entry pre-resolution are implemented in premain and do give noticeable startup improvements.
And repeated successful lookups are common when it comes to Class.forName(). For example, PetClinic deployment run experiences 10k calls into JVM_FindClassFromCaller which cost ~20ms (measured on M1 Pro).
So, while the negative lookup cache looks like the lowest hanging fruit, it's worth considering the positive lookup caching scenario as well.
Best regards, Vladimir Ivanov
But, a /failed/ lookup is not recorded anywhere. So every distinct lookup must start again from first principles and fail all over again. For some workloads this costs a small but measurable percentage of startup time.
The story is different for the local |CONSTANT_Class| entries in any given classfile: The JVMS mandates that both successful and failed lookups are recorded on the first attempt (per CP entry per se, not globally and not per class). Global usage includes both use of |Class.forName| and the “back end” logic for CP entry resolution. CP resolution is performed at most once per CP entry, and (win or lose) is made sticky on the CP itself, locally.
To summarize, we can say that, for class lookup, both success and failure are “sticky” locally, and success is “sticky” globally, but failure is “not sticky” globally.
The global behavior can be thought of either specific to a class loader (i.e., coded in JDK code) or as something in the VM or JNI code that works with the JDK code. In reality it is an emergent property of a number of small details in both.
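The asymmetry is easy to observe from plain Java. In this small demo (not tied to Leyden at all), repeated successful |Class.forName| calls return the same cached |Class| object from the system dictionary, while repeated failing calls each pay for a fresh lookup and a fresh exception:

```java
public class StickinessDemo {
    // Returns null on success, or the exception thrown by the failed lookup.
    static Throwable tryLoad(String name) {
        try {
            Class.forName(name);
            return null;
        } catch (ClassNotFoundException e) {
            return e;
        }
    }

    public static void main(String[] args) throws Exception {
        // Success is sticky globally: the same Class instance both times.
        Class<?> a = Class.forName("java.util.ArrayList");
        Class<?> b = Class.forName("java.util.ArrayList");
        System.out.println("same Class instance: " + (a == b));

        // Failure is not sticky: each attempt redoes the work and
        // constructs a brand-new exception object.
        Throwable first = tryLoad("no.such.pkg.Missing");
        Throwable second = tryLoad("no.such.pkg.Missing");
        System.out.println("fresh exception each time: " + (first != second));
    }
}
```

Both lines print true. (Distinct exception objects only show that the failure path re-runs from scratch; the re-probing of the class path is the part that actually costs startup time.)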
I think it's worth experimenting to see how much saving can be achieved. Although the current estimate may be small (11 ms out of 750 ms), it may be a code path that's difficult to optimize down the road.

I think we can significantly improve over the current performance by shifting more Java computation to build time (e.g., making a heap snapshot of computed constants, running <clinit> at build time, etc). We should understand where the frameworks are doing such negative class lookups, and see if they can be time shifted (or otherwise avoided). If the answer is no, then I think it's worthwhile to implement a *runtime* negative lookup cache.

As the overall start-up time goes down, the cost of negative class lookup will increase. For example, if it becomes 9 ms out of 250 ms, then it will be more significant.

Also, are you testing with "AOT" mode for Spring Petclinic -- it's a special packaging mode where a lot of the symbolic information is resolved at build time, so perhaps it will have a much lower use of negative class lookup? https://github.com/openjdk/leyden/tree/premain/test/hotspot/jtreg/premain/sp...

Thanks
- Ioi

On 1/12/24 12:32 PM, Ashutosh Mehra wrote:
[...]
We have implemented a negative lookup cache in the past in certain of our middleware products' custom class loaders, and while they *can* help, I seem to recall that we had some trouble with cache explosion in some scenarios involving certain frameworks. We abandoned that approach quite a long time ago though, in favor of a more optimized index-by-package scheme which ensures that all lookups, success or failure, run in (more or less) constant time, which had essentially eliminated the issue for us for that use case.

FWIW I do think that it could be potentially nifty to constant-fold negative lookup cases where that is possible.

--
- DML • he/him
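For readers unfamiliar with the index-by-package idea mentioned above, a hypothetical sketch follows (the names are invented for illustration, not taken from any middleware product): each class path root registers the packages it can serve at scan time, so a lookup for a class in an unindexed package fails after a single hash probe, with no cache that can grow per failed name.

```java
import java.util.*;

// Toy package index: maps a package name to the class path roots that contain it.
class PackageIndex {
    private final Map<String, List<String>> packageToRoots = new HashMap<>();

    // Called while scanning each class path element at startup (or build time).
    void addPackage(String root, String packageName) {
        packageToRoots.computeIfAbsent(packageName, k -> new ArrayList<>()).add(root);
    }

    // Roots that might contain the class; an empty list means certain failure,
    // decided in (more or less) constant time without touching the file system.
    List<String> candidateRoots(String className) {
        int dot = className.lastIndexOf('.');
        String pkg = (dot < 0) ? "" : className.substring(0, dot);
        return packageToRoots.getOrDefault(pkg, List.of());
    }
}
```

Unlike a per-name negative cache, the index size is bounded by the number of packages on the class path, which sidesteps the cache-explosion problem.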
Hi,

The startup of a Spring Boot application involves common auto-configuration checks which would definitely benefit from negative lookup caching at the ClassLoader level, quickly indicating which parts of the infrastructure are not present at runtime. Spring AOT optimizations can precompute those checks, but they involve side effects and constraints that probably mean we won't enable them by default for the foreseeable future. The Spring codebase also contains various class presence checks, like the ones in WebMvcConfigurationSupport [1], that will be performed regardless of Spring AOT optimizations, with typically more negative lookups than positive ones.

Best regards,
Sébastien Deleuze

[1] https://github.com/spring-projects/spring-framework/blob/7e511931b305edc84eb...

On Tue, Jan 16, 2024 at 5:57 PM David Lloyd <david.lloyd@redhat.com> wrote:
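For context, the presence-check pattern referenced here boils down to something like the following simplified form of |ClassUtils.isPresent| (Spring's real method handles a few more cases, such as array and primitive names): a negative answer costs a full failed lookup plus a thrown exception on every call, unless the loader (or a future negative cache) makes the failure sticky.

```java
// Simplified sketch of a Spring-style class presence check.
class ClassPresence {
    static boolean isPresent(String className, ClassLoader classLoader) {
        try {
            Class.forName(className, false, classLoader);  // resolve but do not initialize
            return true;
        } catch (ClassNotFoundException | LinkageError ex) {
            return false;
        }
    }
}
```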
[...]
Good point, Sebastien. I wasn't aware lookup caches are part of the AOT preprocessing Spring does.

I made a quick experiment with PetClinic and observed the following behavior for Class.forName [1]:

* -Dspring.aot.enabled=false: 792 exceptions thrown (293 unique names)
* -Dspring.aot.enabled=true: 512 exceptions thrown (141 unique names)

Speaking of requests originating in org/springframework/util/ClassUtils.isPresent():

* -Dspring.aot.enabled=false: 251 exceptions thrown (151 unique names)
* -Dspring.aot.enabled=true: 63 exceptions thrown (41 unique names)

Still, the accumulated time spent in Class.forName() hints that positive lookups dominate negative ones:

Baseline (23-ea):
* -Dspring.aot.enabled=false: 127ms (15873 events)
* -Dspring.aot.enabled=true: 78ms (10185 events)

Leyden/premain deployment run:
* -Dspring.aot.enabled=false: 37ms (15859 events)
* -Dspring.aot.enabled=true: 18ms (10171 events)

Best regards,
Vladimir Ivanov

[1] https://gist.github.com/iwanowww/395f1be2f4890641fbeacb35ff503b49

On 1/19/24 03:20, Sebastien Deleuze wrote:
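The counts above come from Vladimir's own instrumentation (linked in [1]). A rough standalone analogue, with invented names and wrapping the call site rather than the JDK internals, could tally failed lookups and distinct names like this:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Counts total failed Class.forName calls and the distinct names involved,
// mirroring the "N exceptions thrown (M unique names)" figures above.
class FailureTally {
    static final AtomicLong failures = new AtomicLong();
    static final Set<String> uniqueNames = ConcurrentHashMap.newKeySet();

    static boolean isPresentCounted(String name, ClassLoader cl) {
        try {
            Class.forName(name, false, cl);
            return true;
        } catch (ClassNotFoundException e) {
            failures.incrementAndGet();
            uniqueNames.add(name);
            return false;
        }
    }
}
```

A large gap between the two counters (as in the PetClinic numbers) is exactly the case where a small per-loader name-to-exception table pays off.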
Hi,
The startup of a Spring Boot applications involves common auto-configuration checks which would definitely benefit from negative lookup caching at the ClassLoader level, quickly indicating which parts of the infrastructure are not present at runtime. Spring AOT optimizations can precompute those checks, but involve side effects and constraints that probably mean we won't enable them by default for the foreseeable future. The Spring codebase also contains various class presence checks like the ones in WebMvcConfigurationSupport [1] that will be performed regardless of Spring AOT optimizations, with typically more negative lookups than positive ones.
Best regards, Sébastien Deleuze
[1] https://github.com/spring-projects/spring-framework/blob/7e511931b305edc84eb... <https://github.com/spring-projects/spring-framework/blob/7e511931b305edc84eb04a219c277a6c8fdcba59/spring-webmvc/src/main/java/org/springframework/web/servlet/config/annotation/WebMvcConfigurationSupport.java#L213-L227>
On Tue, Jan 16, 2024 at 5:57 PM David Lloyd <david.lloyd@redhat.com> wrote:
On Thu, Jan 11, 2024 at 4:15 PM John Rose <john.r.rose@oracle.com> wrote:
[...] But, a /failed/ lookup is not recorded anywhere. So every distinct lookup must start again from first principles and fail all over again. For some workloads this costs a small but measurable percentage of startup time.
The story is different for the local CONSTANT_Class entries in any given classfile: the JVMS mandates that both successful and failed lookups are recorded on the first attempt (per CP entry per se, not globally and not per class). Global usage includes both use of Class.forName and the “back end” logic for CP entry resolution. CP resolution is performed at most once per CP entry, and (win or lose) is made sticky on the CP itself, locally.
To summarize, we can say that, for class lookup, both success and failure are “sticky” locally, and success is “sticky” globally, but failure is “not sticky” globally.
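[Editorial note: a minimal probe of the global behavior summarized above. Repeated Class.forName calls for a missing name fail in full each time (failure is not sticky globally), while a successful lookup is served from the loaded-classes cache on repeat calls. The class names are illustrative.]

```java
public class StickinessProbe {
    // Count how many ClassNotFoundExceptions `reps` lookups of `name` produce.
    static int failures(String name, int reps) {
        int failed = 0;
        for (int i = 0; i < reps; i++) {
            try {
                Class.forName(name, false, StickinessProbe.class.getClassLoader());
            } catch (ClassNotFoundException e) {
                failed++; // thrown afresh on every attempt
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        System.out.println(failures("java.util.List", 1000));           // 0: success is sticky
        System.out.println(failures("com.example.DoesNotExist", 1000)); // 1000: failure repeats
    }
}
```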
We have implemented a negative lookup cache in the past in some of our middleware products' custom class loaders, and while such caches *can* help, I seem to recall that we had some trouble with cache explosion in scenarios involving certain frameworks. We abandoned that approach quite a long time ago in favor of a more optimized index-by-package scheme, which ensures that all lookups, success or failure, run in (more or less) constant time; that essentially eliminated the issue for us for that use case.
FWIW I do think that it could be potentially nifty to constant-fold negative lookup cases where that is possible.
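[Editorial note: a hypothetical sketch of the negative-lookup cache David describes, not code from any Red Hat product. The unbounded set below is exactly where the cache-explosion risk he mentions lives, e.g. when frameworks probe many generated names.]

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class NegativeCachingLoader extends ClassLoader {
    // Names that already failed once; repeated probes for the same
    // missing class then fail in O(1) without re-searching the classpath.
    private final Set<String> knownMissing = ConcurrentHashMap.newKeySet();

    NegativeCachingLoader(ClassLoader parent) {
        super(parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        if (knownMissing.contains(name)) {
            throw new ClassNotFoundException(name + " (cached negative result)");
        }
        try {
            return super.loadClass(name, resolve);
        } catch (ClassNotFoundException e) {
            knownMissing.add(name); // remember the failure for next time
            throw e;
        }
    }
}
```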
--
DML • he/him
participants (7): Andrew Dinn, Ashutosh Mehra, David Lloyd, ioi.lam@oracle.com, John Rose, Sebastien Deleuze, Vladimir Ivanov