RFR: 8366681: [leyden] Precompile more C1 code
Looking at how code goes through the AOT+JIT pipeline, I believe we have several issues in the way we include methods for precompilation.

1. AP4 code gets replaced by more efficient A4 code, which can then deopt. Once it does, we go back to the fully normal JIT pipeline, with C1 compiling, C2 compiling, etc. The training run currently produces A2 versions only when tier 2/3 training data is present. We can pessimistically assume that every A4/AP4 method should also have an A2 method generated, for the sake of quicker recovery after deopt.
2. I suspect a similar thing, but rarer, happens with the A4 -> ... -> T1 transition when compiler queues are overloaded. We can generate an A1 method for this case.
3. When training is done with the default configuration, but at runtime we enable only C1, we summarily miss almost *all* AOT methods, because A1 methods are rarely generated with a normal tiered policy. Always generating A1 methods would be convenient for hybrid C2 AOT + C1 JIT modes as well.

Overall, I think generating more C1 methods even when C2 methods are present in training is beneficial, as we prepare the ground for whatever corner case happens at runtime. Benchmarks show this improves the performance model quite a bit.

Since we now look at methods at all tiers when deciding what to precompile, compile IDs are not working all that well. I have rewritten that to use counters and method sizes. This seems to work well in practice.

Additional testing:
 - [x] `javac` performance tests (see comments)
 - [x] Linux x86_64 server fastdebug, `runtime/cds`

-------------

Commit messages:
 - Fix

Changes: https://git.openjdk.org/leyden/pull/93/files
 Webrev: https://webrevs.openjdk.org/?repo=leyden&pr=93&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8366681
  Stats: 121 lines in 4 files changed: 64 ins; 27 del; 30 mod
  Patch: https://git.openjdk.org/leyden/pull/93.diff
  Fetch: git fetch https://git.openjdk.org/leyden.git pull/93/head:pull/93

PR: https://git.openjdk.org/leyden/pull/93
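The tier-selection change in the description above can be sketched roughly as follows. This is only an illustration of the idea, not the HotSpot code: `TrainedMethod`, `old_tiers`, `new_tiers`, and `precompile_order` are hypothetical names, AP4 and A4 are folded into a single "tier 4" entry, and the exact sort key (hottest first, smaller first on ties) is an assumption, since the description only says that counters and method sizes are used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainedMethod:
    """Hypothetical summary of what the training run saw for one method."""
    name: str
    tiers_seen: frozenset  # JIT tiers that have training data, e.g. {3, 4}
    invocations: int       # hotness counter (assumed metric)
    size: int              # method size (assumed metric)


def old_tiers(m):
    """Rough model of the previous policy: A2 only with tier 2/3 data."""
    tiers = set()
    if 4 in m.tiers_seen:
        tiers.add(4)
    if m.tiers_seen & {2, 3}:
        tiers.add(2)
    if 1 in m.tiers_seen:
        tiers.add(1)
    return tiers


def new_tiers(m):
    """Rough model of the proposal: always A1, and A2 whenever A4 exists."""
    tiers = old_tiers(m)
    tiers.add(1)
    if 4 in tiers:
        tiers.add(2)
    return tiers


def precompile_order(methods):
    """Assumed ordering: hottest first, smaller methods first on ties."""
    return sorted(methods, key=lambda m: (-m.invocations, m.size))


methods = [
    TrainedMethod("Parser.parse", frozenset({4}), 90_000, 400),
    TrainedMethod("Util.hash", frozenset({3, 4}), 90_000, 40),
    TrainedMethod("Main.init", frozenset({3}), 12, 900),
]
for m in precompile_order(methods):
    print(m.name, sorted(old_tiers(m)), "->", sorted(new_tiers(m)))
```

In this toy model, a method that only ever reached tier 4 in training goes from having a single cached version to three, which is where both the faster recovery and the larger cache come from.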
On Tue, 2 Sep 2025 10:29:33 GMT, Aleksey Shipilev <shade@openjdk.org> wrote:
Test results:

Baselines: default and C1 only:

    Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -Xms64m -Xmx1g -XX:+UseSerialGC \
        -cp JavacBenchApp.jar JavacBenchApp 50
      Time (mean ± σ):      1.052 s ±  0.014 s    [User: 3.694 s, System: 0.144 s]
      Range (min … max):    1.029 s …  1.091 s    30 runs

    Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -Xms64m -Xmx1g -XX:+UseSerialGC \
        -XX:TieredStopAtLevel=1 \
        -cp JavacBenchApp.jar JavacBenchApp 50
      Time (mean ± σ):     795.4 ms ±   5.4 ms    [User: 1469.7 ms, System: 109.0 ms]
      Range (min … max):   787.3 ms … 809.4 ms    30 runs

Premain baseline:

    Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -Xms64m -Xmx1g -XX:+UseSerialGC \
        -XX:AOTCache=app.aot \
        -cp JavacBenchApp.jar JavacBenchApp 50
      Time (mean ± σ):     584.4 ms ±  23.1 ms    [User: 2177.1 ms, System: 156.4 ms]
      Range (min … max):   540.6 ms … 640.7 ms    30 runs

    Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -Xms64m -Xmx1g -XX:+UseSerialGC \
        -XX:AOTCache=app.aot -XX:TieredStopAtLevel=1 \
        -cp JavacBenchApp.jar JavacBenchApp 50
      Time (mean ± σ):     893.6 ms ±   7.5 ms    [User: 1670.2 ms, System: 70.2 ms]
      Range (min … max):   881.5 ms … 921.4 ms    30 runs

    Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -Xms64m -Xmx1g -XX:+UseSerialGC \
        -XX:AOTCache=app.aot -XX:+UnlockExperimentalVMOptions -XX:+PreloadOnly \
        -cp JavacBenchApp.jar JavacBenchApp 50
      Time (mean ± σ):     474.7 ms ±   3.9 ms    [User: 470.8 ms, System: 57.2 ms]
      Range (min … max):   468.0 ms … 482.5 ms    30 runs

Premain fixed:

    Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -Xms64m -Xmx1g -XX:+UseSerialGC \
        -XX:AOTCache=app.aot \
        -cp JavacBenchApp.jar JavacBenchApp 50
      Time (mean ± σ):     483.8 ms ±   8.7 ms    [User: 1454.7 ms, System: 167.3 ms]
      Range (min … max):   471.2 ms … 510.4 ms    30 runs

    Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -Xms64m -Xmx1g -XX:+UseSerialGC \
        -XX:AOTCache=app.aot -XX:TieredStopAtLevel=1 \
        -cp JavacBenchApp.jar JavacBenchApp 50
      Time (mean ± σ):     440.0 ms ±   3.6 ms    [User: 469.2 ms, System: 69.5 ms]
      Range (min … max):   434.4 ms … 447.7 ms    30 runs

    Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -Xms64m -Xmx1g -XX:+UseSerialGC \
        -XX:AOTCache=app.aot -XX:+UnlockExperimentalVMOptions -XX:+PreloadOnly \
        -cp JavacBenchApp.jar JavacBenchApp 50
      Time (mean ± σ):     435.8 ms ±   3.3 ms    [User: 440.5 ms, System: 63.1 ms]
      Range (min … max):   429.0 ms … 441.7 ms    30 runs

Note the massive improvements in both out-of-the-box and C1-only modes.

Out-of-the-box mode improves significantly, because there are unfortunate deopts from A4, which force the compilers to generate new methods. It looks like precompiling A2 code allows this process to reach T4 code with less overhead.

C1-only improves significantly, because now we have A1 methods in the AOT cache. It even gets very close to C2-AOT-only (`+PreloadOnly`) mode! There are more things to do on that path: I have a patch that builds on this and implements the hybrid C2 AOT + C1 JIT mode, beating all these configs by a further 25%.

I suspect the modest improvement in `+PreloadOnly` is due to the new method sorting code, which sorts by hotness.

The downside is the size of the AOTCache. For the Javac test, the AOTCache grows from `54M` to `78M`. I would think this is a fair price for a much flatter performance model, _and_ I think we need to deal with generated code density more thoroughly anyway.

-------------

PR Comment: https://git.openjdk.org/leyden/pull/93#issuecomment-3244753665
PR Comment: https://git.openjdk.org/leyden/pull/93#issuecomment-3244762639
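To put the Javac numbers side by side, here is a small sketch that computes the relative change in mean wall time between "Premain baseline" and "Premain fixed", plus the AOT cache growth. The figures are copied from the results above; the helper name `reduction_pct` and the choice to express changes relative to the baseline are mine.

```python
def reduction_pct(before, after):
    """Percentage reduction of `after` relative to `before`."""
    return (before - after) / before * 100.0


# (baseline mean ms, fixed mean ms) per configuration, from the runs above.
javac_means_ms = {
    "AOTCache (default tiers)": (584.4, 483.8),
    "AOTCache + TieredStopAtLevel=1": (893.6, 440.0),
    "AOTCache + PreloadOnly": (474.7, 435.8),
}

for config, (before, after) in javac_means_ms.items():
    print(f"{config}: {reduction_pct(before, after):.1f}% less wall time")

# AOT cache size for the Javac test, in MB as reported (54M -> 78M).
cache_growth_pct = (78 - 54) / 54 * 100.0
print(f"AOT cache growth: {cache_growth_pct:.1f}%")
```

Under these definitions the three configurations come out at roughly 17%, 51%, and 8% less wall time, against a cache that is about 44% larger.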
Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision:

 - Merge branch 'premain' into JDK-8366681-precompile-more-c1
 - Fix

-------------

Changes:
  - all: https://git.openjdk.org/leyden/pull/93/files
  - new: https://git.openjdk.org/leyden/pull/93/files/0bdea338..fe24290f

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=leyden&pr=93&range=01
 - incr: https://webrevs.openjdk.org/?repo=leyden&pr=93&range=00-01

  Stats: 114 lines in 9 files changed: 36 ins; 30 del; 48 mod
  Patch: https://git.openjdk.org/leyden/pull/93.diff
  Fetch: git fetch https://git.openjdk.org/leyden.git pull/93/head:pull/93

PR: https://git.openjdk.org/leyden/pull/93
Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:

 - Merge branch 'premain' into JDK-8366681-precompile-more-c1
 - Merge branch 'premain' into JDK-8366681-precompile-more-c1
 - Fix

-------------

Changes:
  - all: https://git.openjdk.org/leyden/pull/93/files
  - new: https://git.openjdk.org/leyden/pull/93/files/fe24290f..ee1e5672

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=leyden&pr=93&range=02
 - incr: https://webrevs.openjdk.org/?repo=leyden&pr=93&range=01-02

  Stats: 3 lines in 1 file changed: 3 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/leyden/pull/93.diff
  Fetch: git fetch https://git.openjdk.org/leyden.git pull/93/head:pull/93

PR: https://git.openjdk.org/leyden/pull/93
On Wed, 3 Sep 2025 16:59:46 GMT, Aleksey Shipilev <shade@openjdk.org> wrote:
Running other benchmarks shows improvement as well. It is mostly a wash when you have lots of CPUs available to absorb the compilation overhead. But when you constrain the resources, the JIT compilers start to compete with the application quite hard. I am seeing 5..10% improvements on the benchmarks with this change.

spring-petclinic:

    $ ls -lah *.aot
    -rw-rw-r-- 1 shade shade 157M Sep  4 10:07 spring-petclinic.new.aot
    -rw-rw-r-- 1 shade shade 138M Sep  4 10:06 spring-petclinic.old.aot

    $ taskset -c 0-3 make ... compare_premain_builds
    Run,Old CDS + AOT,New CDS + AOT
    1,1535,1442
    2,1524,1437
    3,1514,1434
    4,1526,1434
    5,1543,1434
    6,1522,1442
    7,1527,1431
    8,1543,1432
    9,1526,1441
    10,1524,1441
    Geomean,1528.37,1436.79 (-6.4%)
    Stdev,8.78,4.12

    $ taskset -c 0-1 make ... compare_premain_builds
    Run,Old CDS + AOT,New CDS + AOT
    1,1901,1847
    2,1896,1838
    3,1943,1809
    4,1886,1743
    5,1868,1787
    6,1882,1847
    7,1868,1777
    8,1874,1781
    9,1896,1758
    10,1869,1803
    Geomean,1888.18,1798.67 (-5.0%)
    Stdev,21.72,34.66

    $ taskset -c 0-0 make ... compare_premain_builds
    Run,Old CDS + AOT,New CDS + AOT
    1,3706,3418
    2,3581,3361
    3,3596,3368
    4,3597,3312
    5,3581,3361
    6,3608,3373
    7,3668,3391
    8,3576,3432
    9,3701,3443
    10,3686,3343
    Geomean,3629.65,3379.98 (-7.4%)
    Stdev,50.84,38.92

quarkus-getting-started:

    $ ls -lah *.aot
    -rw-rw-r-- 1 shade shade 47M Sep  4 10:02 quarkus-getting-started.new.aot
    -rw-rw-r-- 1 shade shade 42M Sep  4 10:02 quarkus-getting-started.old.aot

    $ taskset -c 0-3 make ... compare_premain_builds
    Run,Old CDS + AOT,New CDS + AOT
    1,171,165
    2,174,167
    3,176,166
    4,176,169
    5,170,165
    6,176,165
    7,173,164
    8,176,169
    9,178,166
    10,172,166
    Geomean,174.18,166.19 (-4.8%)
    Stdev,2.48,1.60

    $ taskset -c 0-1 make ... compare_premain_builds
    Run,Old CDS + AOT,New CDS + AOT
    1,221,211
    2,227,214
    3,223,217
    4,242,206
    5,224,202
    6,218,200
    7,225,217
    8,235,208
    9,235,205
    10,235,226
    Geomean,228.38,210.46 (-8.6%)
    Stdev,7.35,7.59

    $ taskset -c 0-0 make ... compare_premain_builds
    Run,Old CDS + AOT,New CDS + AOT
    1,418,379
    2,424,367
    3,415,384
    4,421,367
    5,418,388
    6,421,378
    7,411,376
    8,420,371
    9,404,375
    10,429,380
    Geomean,418.05,376.44 (-11.2%)
    Stdev,6.58,6.50

helidon-quickstart-se:

    $ ls -lah *.aot
    -rw-rw-r-- 1 shade shade 37M Sep  4 10:11 helidon-quickstart-se.new.aot
    -rw-rw-r-- 1 shade shade 32M Sep  4 10:11 helidon-quickstart-se.old.aot

    $ taskset -c 0-3 make ... compare_premain_builds
    Run,Old CDS + AOT,New CDS + AOT
    1,118,116
    2,118,117
    3,117,116
    4,118,115
    5,117,116
    6,117,117
    7,119,116
    8,117,116
    9,118,117
    10,119,118
    Geomean,117.80,116.40 (-1.2%)
    Stdev,0.75,0.80

    $ taskset -c 0-1 make ... compare_premain_builds
    Run,Old CDS + AOT,New CDS + AOT
    1,151,155
    2,150,146
    3,154,145
    4,155,144
    5,153,134
    6,160,159
    7,153,150
    8,159,155
    9,162,156
    10,162,150
    Geomean,155.84,149.23 (-4.3%)
    Stdev,4.25,7.05

    $ taskset -c 0-0 make ... compare_premain_builds
    Run,Old CDS + AOT,New CDS + AOT
    1,286,257
    2,282,268
    3,286,267
    4,284,248
    5,290,260
    6,305,266
    7,282,271
    8,290,263
    9,286,257
    10,291,265
    Geomean,288.13,262.12 (-9.9%)
    Stdev,6.37,6.46

-------------

PR Comment: https://git.openjdk.org/leyden/pull/93#issuecomment-3252479967
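For readers who want to recheck the summary rows, the sketch below recomputes them for the spring-petclinic `taskset -c 0-3` run using Python's `statistics` module. Treating the `Geomean` row as a geometric mean and the `Stdev` row as a population standard deviation is my inference from the numbers, not something the thread states, and the sketch does not try to reproduce the percentage printed next to the geomean.

```python
from statistics import geometric_mean, pstdev

# spring-petclinic, taskset -c 0-3, copied from the table above.
old_runs = [1535, 1524, 1514, 1526, 1543, 1522, 1527, 1543, 1526, 1524]
new_runs = [1442, 1437, 1434, 1434, 1434, 1442, 1431, 1432, 1441, 1441]


def summarize(samples):
    """Return (geometric mean, population standard deviation)."""
    return geometric_mean(samples), pstdev(samples)


old_geo, old_sd = summarize(old_runs)
new_geo, new_sd = summarize(new_runs)

print(f"old: geomean={old_geo:.2f} stdev={old_sd:.2f}")
print(f"new: geomean={new_geo:.2f} stdev={new_sd:.2f}")
```

With the sample (n-1) standard deviation, the old-build spread would come out noticeably higher than the reported 8.78, which is why the population form is assumed here.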
Interesting idea.

2 observations:

* There are already AP4 versions present in the AOT cache. Why can't they be used instead of A2 until T3 arrives?
* There were some rare discrepancies in compilation behavior between training and production runs, which lead to AOT-cache mismatches and trigger unnecessary JIT compilations. While the proposed change alleviates the symptoms, it would be beneficial to fix the root cause.

-------------

PR Comment: https://git.openjdk.org/leyden/pull/93#issuecomment-3255259379
On Thu, 4 Sep 2025 19:20:41 GMT, Vladimir Ivanov <vlivanov@openjdk.org> wrote:
Interesting idea.
It is very profitable on all workloads I tried, so I would like to get this in.
* There are already AP4 versions present in the AOT cache. Why can't they be used instead of A2 until T3 arrives?
Maybe? I suspect that switching to A2 gives us a natural way to hook up to the "normal" tiered policy: it has counters inside, and will notify the tiered policy when it is time to progress to T3 and T4. I am not sure how switching to AP4 would trigger T3 and T4 compiles in this case. Plus, what if AP4 still, despite our best efforts, traps? Would it construct a loop like AP4 --trap--> A2 --upgrade--> A4 --trap--> AP4 --trap--> ...? That is also not clear to me. All I am saying is that storing A1/A2 code keeps the model simple enough to reason about, while having substantial performance benefits.
* There were some rare discrepancies in compilation behavior between training and production runs, which lead to AOT-cache mismatches and trigger unnecessary JIT compilations. While the proposed change alleviates the symptoms, it would be beneficial to fix the root cause.
That is true. However, at this point, I think we should try to make the best of the training done, even if that training is not 100% accurate. For that, IMO, it is better to have a faster recovery path, e.g. by carrying A1 and A2 code.

-------------

PR Comment: https://git.openjdk.org/leyden/pull/93#issuecomment-3302426491
On Wed, 17 Sep 2025 10:50:47 GMT, Aleksey Shipilev <shade@openjdk.org> wrote:
It is very profitable on all workloads I tried, so I would like to get this in.
It's a good justification for a stopgap fix. But from a design perspective it doesn't look that attractive. Basically, you introduce a third version of hot code, which is generated unconditionally but covers a rare case (at least, it's intended to be rare): A4 code being invalidated during execution. So, you trade footprint for startup. The improvements you observe with the patch may be a signal that there are inefficiencies in our current implementation. Fixing those would improve startup without sacrificing footprint.

Some more thoughts:

* C1 and C2 compilations aren't equivalent, especially when it comes to inlining decisions; so, when the same method is compiled with C1, it won't necessarily cover the same amount of application code.
* Why cache A2 and not A3? Invalidation of A4 code signals that training data is not representative and, most likely, reprofiling is needed anyway. Or maybe just a T4 recompilation is enough?
* AP4 is intended to eventually become a baseline version which can be used in a wide range of circumstances; switching back to the AP4 version while waiting for the T3/T4 recompilation task to finish fits that goal well.
Plus, what if AP4 still, despite our best efforts, traps?
Is it enough to justify one more cached version for every hot method? And, theoretically, C1-generated code can trap as well.
Overall, the proposed patch looks good as a stopgap measure, but IMO, longer term, it doesn't fit the current design well.

-------------

PR Comment: https://git.openjdk.org/leyden/pull/93#issuecomment-3304797955
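The footprint-for-startup trade discussed in this exchange can be made concrete with the numbers already posted in the thread. The sketch below pairs each workload's AOT cache size (old and new, in MB as listed by `ls`) with its single-CPU (`taskset -c 0-0`) geomean. Expressing both changes relative to the old build is my choice and may differ from the convention behind the percentages in the script output. The timing units are not stated in the thread.

```python
def pct_change(old, new):
    """Signed percentage change from `old` to `new`."""
    return (new - old) / old * 100.0


# workload: (old cache MB, new cache MB, old geomean, new geomean)
workloads = {
    "spring-petclinic": (138, 157, 3629.65, 3379.98),
    "quarkus-getting-started": (42, 47, 418.05, 376.44),
    "helidon-quickstart-se": (32, 37, 288.13, 262.12),
}

for name, (old_mb, new_mb, old_t, new_t) in workloads.items():
    print(
        f"{name}: cache {pct_change(old_mb, new_mb):+.1f}%, "
        f"time {pct_change(old_t, new_t):+.1f}%"
    )
```

On these three workloads the cache grows by roughly 12 to 16 percent while the single-CPU time drops by roughly 7 to 10 percent, which is the trade both sides of the discussion are weighing.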
Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision:

 - Merge branch 'premain' into JDK-8366681-precompile-more-c1
 - Merge branch 'premain' into JDK-8366681-precompile-more-c1
 - Merge branch 'premain' into JDK-8366681-precompile-more-c1
 - Fix

-------------

Changes:
  - all: https://git.openjdk.org/leyden/pull/93/files
  - new: https://git.openjdk.org/leyden/pull/93/files/ee1e5672..f17e5a73

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=leyden&pr=93&range=03
 - incr: https://webrevs.openjdk.org/?repo=leyden&pr=93&range=02-03

  Stats: 10794 lines in 859 files changed: 6082 ins; 1947 del; 2765 mod
  Patch: https://git.openjdk.org/leyden/pull/93.diff
  Fetch: git fetch https://git.openjdk.org/leyden.git pull/93/head:pull/93

PR: https://git.openjdk.org/leyden/pull/93
On Wed, 17 Sep 2025 10:37:37 GMT, Aleksey Shipilev <shade@openjdk.org> wrote:
Try setting `SkipTier2IfPossible=true`. I think we mismerged the mainline at some point. It probably should be `true` if we're running with AOT code.

-------------

PR Comment: https://git.openjdk.org/leyden/pull/93#issuecomment-3309091798
I was finally able to find the root cause for this behavior: https://github.com/openjdk/leyden/pull/97

-------------

PR Comment: https://git.openjdk.org/leyden/pull/93#issuecomment-3312164243
Re-measured after https://github.com/openjdk/leyden/pull/97 integration: some performance improvement is _still there_, but I attribute it to better precompilation policy code. I will PR those improvements separately.

-------------

PR Comment: https://git.openjdk.org/leyden/pull/93#issuecomment-3323662591
On Wed, 17 Sep 2025 10:37:37 GMT, Aleksey Shipilev <shade@openjdk.org> wrote:
[...]
In my experiment with the AOT A[P]4 code entry counter and deoptimization, I see that we go to the interpreter and then compile T2, which then triggers T4 compilation (per the "persistent profiling" policy). I think we may indeed need A2 for methods compiled for A4. I am running experiments to see whether we really need these A2 methods to get more up-to-date profiling data for T4 compilation during the production run, in order to reach peak performance.

-------------

PR Comment: https://git.openjdk.org/leyden/pull/93#issuecomment-3837092158
On Wed, 17 Sep 2025 10:37:37 GMT, Aleksey Shipilev <shade@openjdk.org> wrote:
[...]
OK, tell me if you want me to re-open the PR and do more A2 code.

-------------

PR Comment: https://git.openjdk.org/leyden/pull/93#issuecomment-3840349801
On Tue, 3 Feb 2026 10:08:06 GMT, Aleksey Shipilev <shade@openjdk.org> wrote:
OK, tell me if you want me to re-open the PR and do more A2 code.
No need to reopen it. My experiments yesterday show that directly requesting T4 compilation (instead of deoptimizing) after we hit the entry counter limit in AOT code is better than using A2 to profile and trigger T4 after that.

-------------

PR Comment: https://git.openjdk.org/leyden/pull/93#issuecomment-3843288516
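The two behaviors discussed in this exchange can be contrasted with a toy model. This is a sketch under stated assumptions, not HotSpot code: the `AotMethod` class, the `entry_limit` parameter, and the strategy names are hypothetical, and transitions are modeled as instantaneous, whereas a real T4 compilation is requested asynchronously while the AOT code keeps running.

```python
# State names follow the thread's notation (A4 = AOT tier 4, T2/T4 = JIT tiers).
TRANSITIONS = {
    # Path observed in the experiment described earlier in the thread.
    "deopt": ["interpreter", "T2", "T4"],
    # Alternative: request T4 directly, without deoptimizing.
    "direct_t4": ["T4"],
}


class AotMethod:
    """Toy model of an AOT-compiled method with an entry counter."""

    def __init__(self, entry_limit, strategy):
        self.entry_limit = entry_limit
        self.strategy = strategy
        self.entries = 0
        self.history = ["A4"]

    def enter(self):
        """Count one entry; apply the strategy once the limit is reached."""
        self.entries += 1
        if self.entries == self.entry_limit:
            self.history.extend(TRANSITIONS[self.strategy])
        return self.history[-1]


for strategy in ("deopt", "direct_t4"):
    m = AotMethod(entry_limit=3, strategy=strategy)
    for _ in range(5):
        m.enter()
    print(strategy, " -> ".join(m.history))
```

In this model the direct request skips the interpreter and T2 stages entirely, which is the difference the experiments above are comparing.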
On Tue, 2 Sep 2025 10:29:33 GMT, Aleksey Shipilev <shade@openjdk.org> wrote:
[...]
This pull request has been closed without being integrated.

-------------

PR: https://git.openjdk.org/leyden/pull/93
participants (4)
- Aleksey Shipilev
- Igor Veresov
- Vladimir Ivanov
- Vladimir Kozlov