From shade at openjdk.org Tue Dec 2 11:40:23 2025
From: shade at openjdk.org (Aleksey Shipilev)
Date: Tue, 2 Dec 2025 11:40:23 GMT
Subject: RFR: 8368465: [leyden] Improve precompiler method selection code [v5]
In-Reply-To:
References:
Message-ID:

> Forked from [JDK-8366681](https://bugs.openjdk.org/browse/JDK-8366681): there are still some cleanups/performance improvements possible. The current selection code is a bit hairy, and it turns out the changes I made for the previous patch improve performance.
>
> Notable improvements:
> 1. Push the compilation level filters downwards. This allows compiling A2 from T2/T3 code more easily, and allows implementing policies for compiling on any A* level based on observing top-compiled T* levels.
> 2. Sort methods by hotness and code size. This looks to have a positive effect on shorter workloads, I suspect because we are avoiding a lot of C1 compilations by preloading the hottest code first.
>
> Additional testing:
> - [x] Performance tests (see comments)
> - [x] Linux x86_64 server fastdebug, `runtime/cds`

Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains seven additional commits since the last revision:

 - Merge branch 'premain' into JDK-8368465-precompiler-method-select
 - Drop the mention of MDO
 - Merge branch 'premain' into JDK-8368465-precompiler-method-select
 - Merge branch 'premain' into JDK-8368465-precompiler-method-select
 - Touchup
 - Touchups
 - Fix

-------------

Changes:
 - all: https://git.openjdk.org/leyden/pull/99/files
 - new: https://git.openjdk.org/leyden/pull/99/files/49f95523..3d298056

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=leyden&pr=99&range=04
 - incr: https://webrevs.openjdk.org/?repo=leyden&pr=99&range=03-04

Stats: 158648 lines in 2753 files changed: 91357 ins; 50434 del; 16857 mod
Patch: https://git.openjdk.org/leyden/pull/99.diff
Fetch: git fetch https://git.openjdk.org/leyden.git pull/99/head:pull/99

PR: https://git.openjdk.org/leyden/pull/99

From shade at openjdk.org Tue Dec 2 11:40:24 2025
From: shade at openjdk.org (Aleksey Shipilev)
Date: Tue, 2 Dec 2025 11:40:24 GMT
Subject: RFR: 8368465: [leyden] Improve precompiler method selection code [v4]
In-Reply-To:
References:
Message-ID:

On Fri, 17 Oct 2025 21:35:04 GMT, Aleksey Shipilev wrote:

>> Forked from [JDK-8366681](https://bugs.openjdk.org/browse/JDK-8366681): there are still some cleanups/performance improvements possible. The current selection code is a bit hairy, and it turns out the changes I made for the previous patch improve performance.
>>
>> Notable improvements:
>> 1. Push the compilation level filters downwards. This allows compiling A2 from T2/T3 code more easily, and allows implementing policies for compiling on any A* level based on observing top-compiled T* levels.
>> 2. Sort methods by hotness and code size. This looks to have a positive effect on shorter workloads, I suspect because we are avoiding a lot of C1 compilations by preloading the hottest code first.
>>
>> Additional testing:
>> - [x] Performance tests (see comments)
>> - [x] Linux x86_64 server fastdebug, `runtime/cds`
>
> Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision:
>
> Drop the mention of MDO

Getting back to this...

-------------

PR Comment: https://git.openjdk.org/leyden/pull/99#issuecomment-3601598908

From shade at openjdk.org Tue Dec 2 11:40:25 2025
From: shade at openjdk.org (Aleksey Shipilev)
Date: Tue, 2 Dec 2025 11:40:25 GMT
Subject: RFR: 8368465: [leyden] Improve precompiler method selection code [v3]
In-Reply-To:
References: <6e95ti55Z9-fcZpSIirJolx6vMcmN01T40eYSD8nIbU=.95a7423d-685a-4918-bf65-73a0358b588d@github.com>
Message-ID:

On Fri, 17 Oct 2025 23:50:02 GMT, Vladimir Kozlov wrote:

> We bulk load all "Preload" AOT code - ordering does not matter for it. Even if we load in selected order. It is one thread which does the loading and it is blocking (I am actually playing with spreading this preload across all compiler threads - did not see much effect on startup).

Um, I don't think that's the case? Preloading is asynchronous. See `CompileBroker::compile_method`:

    bool is_blocking = ReplayCompiles ||
                       !directive->BackgroundCompilationOption ||
                       (PreloadBlocking && (compile_reason == CompileTask::Reason_Preload));
    compile_method_base(method, osr_bci, comp_level, hot_count, compile_reason, requires_online_compilation, is_blocking, THREAD);

We have the option to _make_ preload blocking (`PreloadBlocking`), but it is turned off by default. We know enabling `+PreloadBlocking` is counter-productive, because it could easily take hundreds of milliseconds.

So while compilers are working through preloading the code, the application runs and can trigger compilations. Sorting preloading methods allows loading the hottest code before that code transitions to normal compilation. This is the problem I am trying to mitigate.

Maybe the compilation policy should actually participate in preloading: i.e. if there is hot code that transitions from T0 to any other level, attempt the preload first, in case normal preloading is lagging behind. That would be more intrusive, though, so as the conservative approach I would like to prioritize more profitable preload code first.

-------------

PR Review Comment: https://git.openjdk.org/leyden/pull/99#discussion_r2580796305

From shade at openjdk.org Tue Dec 2 16:41:52 2025
From: shade at openjdk.org (Aleksey Shipilev)
Date: Tue, 2 Dec 2025 16:41:52 GMT
Subject: RFR: 8368465: [leyden] Improve precompiler method selection code [v5]
In-Reply-To:
References:
Message-ID:

On Tue, 2 Dec 2025 11:40:23 GMT, Aleksey Shipilev wrote:

>> Forked from [JDK-8366681](https://bugs.openjdk.org/browse/JDK-8366681): there are still some cleanups/performance improvements possible. The current selection code is a bit hairy, and it turns out the changes I made for the previous patch improve performance.
>>
>> Notable improvements:
>> 1. Push the compilation level filters downwards. This allows compiling A2 from T2/T3 code more easily, and allows implementing policies for compiling on any A* level based on observing top-compiled T* levels.
>> 2. Sort methods by hotness and code size. This looks to have a positive effect on shorter workloads, I suspect because we are avoiding a lot of C1 compilations by preloading the hottest code first.
>>
>> Additional testing:
>> - [x] Performance tests (see comments)
>> - [x] Linux x86_64 server fastdebug, `runtime/cds`
>
> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
> The pull request contains seven additional commits since the last revision:
>
> - Merge branch 'premain' into JDK-8368465-precompiler-method-select
> - Drop the mention of MDO
> - Merge branch 'premain' into JDK-8368465-precompiler-method-select
> - Merge branch 'premain' into JDK-8368465-precompiler-method-select
> - Touchup
> - Touchups
> - Fix

I re-merged with current `premain`, re-measured some light benchmarks, and the performance improvements are still there. I still believe this is a useful thing to do for infrastructural reasons (it gives me access to more advanced selection policies), and the performance boost comes as a nice bonus. There are other possibilities in optimizing interaction with preload code, and those can and should be done separately, IMO.

Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -XX:AOTCache=app.aot -Xms64m -Xmx1g -XX:+UseSerialGC -cp JavacBenchApp.jar JavacBenchApp 50

### 2 cores

# Baseline
  Time (mean ± σ):     425.7 ms ± 17.6 ms    [User: 667.3 ms, System: 96.3 ms]
  Range (min … max):   404.8 ms … 458.1 ms    10 runs

  Time (mean ± σ):     427.5 ms ± 18.3 ms    [User: 668.7 ms, System: 99.6 ms]
  Range (min … max):   399.2 ms … 451.0 ms    10 runs

  Time (mean ± σ):     418.6 ms ± 11.6 ms    [User: 657.2 ms, System: 96.1 ms]
  Range (min … max):   402.5 ms … 436.7 ms    10 runs

# Patched
  Time (mean ± σ):     373.4 ms ± 11.7 ms    [User: 547.1 ms, System: 89.7 ms]
  Range (min … max):   359.3 ms … 397.5 ms    10 runs

  Time (mean ± σ):     363.4 ms ± 8.5 ms    [User: 511.6 ms, System: 92.6 ms]
  Range (min … max):   346.2 ms … 373.7 ms    10 runs

  Time (mean ± σ):     370.4 ms ± 11.9 ms    [User: 520.3 ms, System: 93.4 ms]
  Range (min … max):   353.4 ms … 384.3 ms    10 runs

-------------

PR Comment: https://git.openjdk.org/leyden/pull/99#issuecomment-3602972698

From kvn at openjdk.org Tue Dec 2 22:02:19 2025
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Tue, 2 Dec 2025 22:02:19 GMT
Subject: RFR: 8368465: [leyden] Improve precompiler method selection code [v3]
In-Reply-To:
References: <6e95ti55Z9-fcZpSIirJolx6vMcmN01T40eYSD8nIbU=.95a7423d-685a-4918-bf65-73a0358b588d@github.com>
Message-ID:

On Tue, 2 Dec 2025 11:35:43 GMT, Aleksey Shipilev wrote:

> So while compilers are working through preloading the code, the application runs and can trigger compilations. Sorting preloading methods allows loading the hottest code before that code transitions to normal compilation. This is the problem I am trying to mitigate.

Okay, I think I can understand when the compile ID may screw us up. If T4 compilation during the training run happens several times for the same method due to deoptimization, we will cache only the last corresponding AP4:

    if (entry->for_preload()) {
      if (entry->not_entrant()) {
        // Skip not entrant preload code:

such an entry will have a high compile ID. We can keep the early ID and use it for the cached AP4 to avoid this.

Which leads to another issue. In the initial AOT code implementation I kept a deoptimization counter for A4 and used it when searching for the A4 to load in the production run. We removed that counter but kept all versions of A4, but `find_entry()` will return the first A4 it finds, which may have a lot more uncommon traps than the latest A4 version. Maybe we should filter A4 the same way we do for AP4.
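
One possible reading of the filtering idea above, as a minimal sketch rather than HotSpot code: among several cached versions of the same method, skip the not-entrant ones and take the version with the highest compile ID. `CachedEntry`, its fields, and `find_latest_entrant` are hypothetical names invented for illustration, and treating "highest compile ID" as "latest version" is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified model of a cached code entry (not a HotSpot type).
struct CachedEntry {
  uint32_t method_id;    // identifies the method this code belongs to
  uint32_t compile_id;   // compile ID recorded during the training run
  bool     not_entrant;  // true if the code was invalidated, e.g. after deoptimization
};

// Returns the still-entrant entry with the highest compile ID for `method_id`,
// or nullptr if the cache holds no usable entry for that method.
inline const CachedEntry* find_latest_entrant(const std::vector<CachedEntry>& entries,
                                              uint32_t method_id) {
  const CachedEntry* best = nullptr;
  for (const CachedEntry& e : entries) {
    if (e.method_id != method_id || e.not_entrant) {
      continue;  // wrong method, or invalidated version
    }
    if (best == nullptr || e.compile_id > best->compile_id) {
      best = &e;  // newer version wins (assumption: higher ID means later compile)
    }
  }
  return best;
}
```

A first-match lookup would return whichever version happens to come first in the cache; the sketch scans all versions so that an early, trap-heavy version does not shadow a later one.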
-------------

PR Review Comment: https://git.openjdk.org/leyden/pull/99#discussion_r2582832469

From kvn at openjdk.org Tue Dec 2 22:02:26 2025
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Tue, 2 Dec 2025 22:02:26 GMT
Subject: RFR: 8368465: [leyden] Improve precompiler method selection code [v3]
In-Reply-To:
References: <6e95ti55Z9-fcZpSIirJolx6vMcmN01T40eYSD8nIbU=.95a7423d-685a-4918-bf65-73a0358b588d@github.com>
Message-ID:

On Tue, 2 Dec 2025 21:28:53 GMT, Vladimir Kozlov wrote:

>>> We bulk load all "Preload" AOT code - ordering does not matter for it. Even if we load in selected order. It is one thread which does the loading and it is blocking (I am actually playing with spreading this preload across all compiler threads - did not see much effect on startup).
>>
>> Um, I don't think that's the case? Preloading is asynchronous. See `CompileBroker::compile_method`:
>>
>>     bool is_blocking = ReplayCompiles ||
>>                        !directive->BackgroundCompilationOption ||
>>                        (PreloadBlocking && (compile_reason == CompileTask::Reason_Preload));
>>     compile_method_base(method, osr_bci, comp_level, hot_count, compile_reason, requires_online_compilation, is_blocking, THREAD);
>>
>> We have the option to _make_ preload blocking (`PreloadBlocking`), but it is turned off by default. We know enabling `+PreloadBlocking` is counter-productive, because it could easily take hundreds of milliseconds.
>>
>> So while compilers are working through preloading the code, the application runs and can trigger compilations. Sorting preloading methods allows loading the hottest code before that code transitions to normal compilation. This is the problem I am trying to mitigate.
>>
>> Maybe the compilation policy should actually participate in preloading: i.e. if there is hot code that transitions from T0 to any other level, attempt the preload first, in case normal preloading is lagging behind.
>> That would be more intrusive, though, so as the conservative approach I would like to prioritize more profitable preload code first.
>
>> So while compilers are working through preloading the code, the application runs and can trigger compilations. Sorting preloading methods allows loading the hottest code before that code transitions to normal compilation. This is the problem I am trying to mitigate.
>
> Okay, I think I can understand when the compile ID may screw us up. If T4 compilation during the training run happens several times for the same method due to deoptimization, we will cache only the last corresponding AP4:
>
>     if (entry->for_preload()) {
>       if (entry->not_entrant()) {
>         // Skip not entrant preload code:
>
> such an entry will have a high compile ID. We can keep the early ID and use it for the cached AP4 to avoid this.
>
> Which leads to another issue. In the initial AOT code implementation I kept a deoptimization counter for A4 and used it when searching for the A4 to load in the production run. We removed that counter but kept all versions of A4, but `find_entry()` will return the first A4 it finds, which may have a lot more uncommon traps than the latest A4 version. Maybe we should filter A4 the same way we do for AP4.

By "blocking" I mean that we have only one AOT compiler thread to load AP4.

-------------

PR Review Comment: https://git.openjdk.org/leyden/pull/99#discussion_r2582836251

From kvn at openjdk.org Tue Dec 2 22:02:33 2025
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Tue, 2 Dec 2025 22:02:33 GMT
Subject: RFR: 8368465: [leyden] Improve precompiler method selection code [v3]
In-Reply-To:
References: <6e95ti55Z9-fcZpSIirJolx6vMcmN01T40eYSD8nIbU=.95a7423d-685a-4918-bf65-73a0358b588d@github.com>
Message-ID:

On Tue, 2 Dec 2025 21:30:18 GMT, Vladimir Kozlov wrote:

>>> So while compilers are working through preloading the code, the application runs and can trigger compilations. Sorting preloading methods allows loading the hottest code before that code transitions to normal compilation.
>>> This is the problem I am trying to mitigate.
>>
>> Okay, I think I can understand when the compile ID may screw us up. If T4 compilation during the training run happens several times for the same method due to deoptimization, we will cache only the last corresponding AP4:
>>
>>     if (entry->for_preload()) {
>>       if (entry->not_entrant()) {
>>         // Skip not entrant preload code:
>>
>> such an entry will have a high compile ID. We can keep the early ID and use it for the cached AP4 to avoid this.
>>
>> Which leads to another issue. In the initial AOT code implementation I kept a deoptimization counter for A4 and used it when searching for the A4 to load in the production run. We removed that counter but kept all versions of A4, but `find_entry()` will return the first A4 it finds, which may have a lot more uncommon traps than the latest A4 version. Maybe we should filter A4 the same way we do for AP4.
>
> By "blocking" I mean that we have only one AOT compiler thread to load AP4.

Maybe your change reduced the number of AOT-compiled nmethods in the cache, which allows faster processing. Please run with `-Xlog:aot+codecache+init=debug -XX:+CITime` for the production run to see how many AOT nmethods are in the AOT cache and how many were loaded/used.

-------------

PR Review Comment: https://git.openjdk.org/leyden/pull/99#discussion_r2582859309
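
The ordering this pull request describes ("sort methods by hotness and code size") can be illustrated with a small comparator. This is a sketch under stated assumptions, not the code in the patch: `Candidate`, `preload_before`, and `preload_order` are hypothetical names, hotness is modelled as a single counter, and placing smaller code first among equally hot methods is an assumed tie-break direction that the thread does not specify.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a precompilation candidate (not a HotSpot type).
struct Candidate {
  std::string name;
  uint64_t    hotness;    // assumed: one counter summarizing training-run activity
  uint32_t    code_size;  // size of the compiled code in bytes
};

// Strict weak ordering: hotter methods first; among equally hot methods,
// smaller code first (assumed direction); name breaks remaining ties so the
// resulting order is deterministic.
inline bool preload_before(const Candidate& a, const Candidate& b) {
  if (a.hotness != b.hotness) return a.hotness > b.hotness;
  if (a.code_size != b.code_size) return a.code_size < b.code_size;
  return a.name < b.name;
}

// Returns the candidates in the order they would be queued for preloading.
inline std::vector<Candidate> preload_order(std::vector<Candidate> methods) {
  std::sort(methods.begin(), methods.end(), preload_before);
  assert(std::is_sorted(methods.begin(), methods.end(), preload_before));
  return methods;
}
```

Since preloading runs asynchronously by default, queuing the hottest methods first gives them the best chance of being installed before the running application requests a normal compilation for them.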