RFR: 8366681: [leyden] Precompile more C1 code

Tue Sep 2 10:45:25 UTC 2025

On Tue, 2 Sep 2025 10:29:33 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:

> Looking at how code goes through AOT+JIT pipeline, I believe we have several issues in the way we include the methods for precompilation.
> 
> 1. AP4 code gets replaced by more efficient A4 code, which can then deopt. Once it does, we go back to the fully normal JIT pipeline, with C1 compiling, C2 compiling, etc. The training run currently generates A2 versions only when tier2/3 training data is present. We can pessimistically assume that every A4/AP4 method should also have an A2 method generated for the sake of quicker deopt.
> 
> 2. I suspect a similar thing, but rarer, happens with the A4 -> ... -> T1 transition when compiler queues are overloaded. We can generate an A1 method for this case.
> 
> 3. When training is done with default configuration, but at runtime we enable only C1, we summarily miss almost *all* AOT methods, because A1 methods are rarely generated with a normal tiered policy. Generating A1 methods always would be convenient for hybrid C2 AOT + C1 JIT modes as well.
> 
> Overall, I think generating more C1 methods even when C2 methods are present in training is beneficial, as we prepare the ground for whatever corner case happens at runtime. Benchmarks show this improves the performance model quite a bit.
> 
> Since we now look at methods at all different tiers when deciding to precompile, compile IDs are not working all that well. I have rewritten that to use counters and method sizes. This seems to work well in practice.
> 
> Additional testing:
>  - [x] `javac` performance tests (see comments)
>  - [x] Linux x86_64 server fastdebug, `runtime/cds`

Test results:

Baselines: default and C1 only:

Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -Xms64m -Xmx1g -XX:+UseSerialGC \
 -cp JavacBenchApp.jar JavacBenchApp 50
  Time (mean ± σ):      1.052 s ±  0.014 s    [User: 3.694 s, System: 0.144 s]
  Range (min … max):    1.029 s …  1.091 s    30 runs

Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -Xms64m -Xmx1g -XX:+UseSerialGC \
 -XX:TieredStopAtLevel=1 \
 -cp JavacBenchApp.jar JavacBenchApp 50
  Time (mean ± σ):     795.4 ms ±   5.4 ms    [User: 1469.7 ms, System: 109.0 ms]
  Range (min … max):   787.3 ms … 809.4 ms    30 runs

Premain baseline:

Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -Xms64m -Xmx1g -XX:+UseSerialGC \
 -XX:AOTCache=app.aot \
 -cp JavacBenchApp.jar JavacBenchApp 50
  Time (mean ± σ):     584.4 ms ±  23.1 ms    [User: 2177.1 ms, System: 156.4 ms]
  Range (min … max):   540.6 ms … 640.7 ms    30 runs

Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -Xms64m -Xmx1g -XX:+UseSerialGC \
 -XX:AOTCache=app.aot -XX:TieredStopAtLevel=1 \
 -cp JavacBenchApp.jar JavacBenchApp 50
  Time (mean ± σ):     893.6 ms ±   7.5 ms    [User: 1670.2 ms, System: 70.2 ms]
  Range (min … max):   881.5 ms … 921.4 ms    30 runs

Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -Xms64m -Xmx1g -XX:+UseSerialGC \
 -XX:AOTCache=app.aot -XX:+UnlockExperimentalVMOptions -XX:+PreloadOnly \
 -cp JavacBenchApp.jar JavacBenchApp 50
  Time (mean ± σ):     474.7 ms ±   3.9 ms    [User: 470.8 ms, System: 57.2 ms]
  Range (min … max):   468.0 ms … 482.5 ms    30 runs

Premain fixed:

Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -Xms64m -Xmx1g -XX:+UseSerialGC \
 -XX:AOTCache=app.aot \
 -cp JavacBenchApp.jar JavacBenchApp 50
  Time (mean ± σ):     483.8 ms ±   8.7 ms    [User: 1454.7 ms, System: 167.3 ms]
  Range (min … max):   471.2 ms … 510.4 ms    30 runs

Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -Xms64m -Xmx1g -XX:+UseSerialGC \
 -XX:AOTCache=app.aot -XX:TieredStopAtLevel=1 \
 -cp JavacBenchApp.jar JavacBenchApp 50
  Time (mean ± σ):     440.0 ms ±   3.6 ms    [User: 469.2 ms, System: 69.5 ms]
  Range (min … max):   434.4 ms … 447.7 ms    30 runs

Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -Xms64m -Xmx1g -XX:+UseSerialGC \
 -XX:AOTCache=app.aot -XX:+UnlockExperimentalVMOptions -XX:+PreloadOnly \
 -cp JavacBenchApp.jar JavacBenchApp 50
  Time (mean ± σ):     435.8 ms ±   3.3 ms    [User: 440.5 ms, System: 63.1 ms]
  Range (min … max):   429.0 ms … 441.7 ms    30 runs

Note the massive improvements in both out-of-the-box and C1-only modes.

Out-of-the-box mode improves significantly because there are unfortunate deopts from A4, which require the compilers to generate new methods. It looks like precompiling A2 code allows this process to reach T4 code with less overhead.

C1-only improves significantly because now we have A1 methods in the AOT cache. It even gets very close to C2-AOT-only (`+PreloadOnly`) mode! There are more things to do on that path: I have a patch that builds on this and implements the hybrid C2 AOT + C1 JIT mode, beating all these configs by another 25%.

I suspect the modest improvement to `+PreloadOnly` is due to the new method sorting code, which sorts by hotness.
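To make "sorting by hotness" concrete, here is a minimal sketch of one way such an ordering could look. It is not the code from the patch: the `Candidate` record, its fields, and the tie-break on code size are all assumptions made for illustration, based only on the statement that the new ordering uses counters and method sizes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class HotnessSort {
    // Hypothetical stand-in for a precompilation candidate; not a HotSpot type.
    record Candidate(String name, long invocationCount, int codeSize) {}

    // Assumed policy: hotter methods first; among equally hot methods,
    // smaller code first. The real patch may combine these differently.
    static final Comparator<Candidate> BY_HOTNESS =
        Comparator.comparingLong(Candidate::invocationCount).reversed()
                  .thenComparingInt(Candidate::codeSize);

    // Returns a sorted copy and leaves the input list untouched.
    static List<Candidate> sortByHotness(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(BY_HOTNESS);
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up method names and counts, purely to exercise the comparator.
        List<Candidate> candidates = List.of(
            new Candidate("coldLarge", 10, 900),
            new Candidate("hotLarge", 5000, 700),
            new Candidate("hotSmall", 5000, 120),
            new Candidate("warm", 800, 300));
        for (Candidate c : sortByHotness(candidates)) {
            System.out.println(c.name());
        }
    }
}
```

Unlike compile IDs, an ordering of this kind does not depend on which tier a method happened to be compiled at first, which is why it fits a scheme that looks at methods across all tiers.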

The downside is the size of the AOTCache. For the Javac test, the AOTCache grows from `54M` to `78M`. I would think this is a fair price for a much flatter performance model, _and_ I think we need to deal with generated code density more thoroughly anyway.
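For a quick sense of the trade-off, the sketch below recomputes the headline numbers from the mean times reported above and the two cache sizes. The class and method names are made up for this illustration, and "flatness" is approximated here as the ratio of the slowest to the fastest mode, which is just one possible reading of the term.

```java
import java.util.Locale;

public class PerfModelSummary {
    // Mean wall-clock times in ms, copied from the hyperfine output above.
    // Order: default, -XX:TieredStopAtLevel=1, -XX:+PreloadOnly.
    static final double[] BASELINE_MS = {584.4, 893.6, 474.7};
    static final double[] FIXED_MS = {483.8, 440.0, 435.8};

    // Percentage by which the mean time shrinks.
    static double reductionPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    // Slowest mean divided by fastest mean; 1.0 would be perfectly flat.
    static double spreadRatio(double[] means) {
        double min = Double.MAX_VALUE;
        double max = 0.0;
        for (double m : means) {
            min = Math.min(min, m);
            max = Math.max(max, m);
        }
        return max / min;
    }

    // Percentage by which a size grows.
    static double growthPercent(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        String[] modes = {"default", "C1 only", "PreloadOnly"};
        for (int i = 0; i < modes.length; i++) {
            System.out.printf(Locale.ROOT, "%-12s %.1f ms -> %.1f ms (%.1f%% less)%n",
                modes[i], BASELINE_MS[i], FIXED_MS[i],
                reductionPercent(BASELINE_MS[i], FIXED_MS[i]));
        }
        System.out.printf(Locale.ROOT, "spread (max/min): %.2fx -> %.2fx%n",
            spreadRatio(BASELINE_MS), spreadRatio(FIXED_MS));
        System.out.printf(Locale.ROOT, "AOTCache growth: %.1f%%%n",
            growthPercent(54.0, 78.0));
    }
}
```

With these inputs the mean times drop by roughly 17% (default), 51% (C1 only) and 8% (`+PreloadOnly`), the gap between the slowest and fastest mode narrows from about 1.88x to about 1.11x, and the cache grows by about 44%.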

-------------

PR Comment: https://git.openjdk.org/leyden/pull/93#issuecomment-3244753665
PR Comment: https://git.openjdk.org/leyden/pull/93#issuecomment-3244762639