From shade at openjdk.org Tue Dec 2 11:40:23 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 2 Dec 2025 11:40:23 GMT Subject: RFR: 8368465: [leyden] Improve precompiler method selection code [v5] In-Reply-To: References: Message-ID: > Forked from [JDK-8366681](https://bugs.openjdk.org/browse/JDK-8366681): there are still some cleanups/performance improvements possible. The current selection code is a bit hairy, and it turns out the changes I made for the previous patch improve performance. > > Notable improvements: > 1. Push the compilation level filters downwards. This allows compiling A2 from T2/T3 code more easily, and allows implementing policies for compiling on any A* level based on observing top-compiled T* levels. > 2. Sort methods by hotness and code size. This appears to have a positive effect on shorter workloads, I suspect because we avoid a lot of C1 compilations by preloading the hottest code first. > > Additional testing: > - [x] Performance tests (see comments) > - [x] Linux x86_64 server fastdebug, `runtime/cds` Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains seven additional commits since the last revision: - Merge branch 'premain' into JDK-8368465-precompiler-method-select - Drop the mention of MDO - Merge branch 'premain' into JDK-8368465-precompiler-method-select - Merge branch 'premain' into JDK-8368465-precompiler-method-select - Touchup - Touchups - Fix ------------- Changes: - all: https://git.openjdk.org/leyden/pull/99/files - new: https://git.openjdk.org/leyden/pull/99/files/49f95523..3d298056 Webrevs: - full: https://webrevs.openjdk.org/?repo=leyden&pr=99&range=04 - incr: https://webrevs.openjdk.org/?repo=leyden&pr=99&range=03-04 Stats: 158648 lines in 2753 files changed: 91357 ins; 50434 del; 16857 mod Patch: https://git.openjdk.org/leyden/pull/99.diff Fetch: git fetch https://git.openjdk.org/leyden.git pull/99/head:pull/99 PR: https://git.openjdk.org/leyden/pull/99 From shade at openjdk.org Tue Dec 2 11:40:24 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 2 Dec 2025 11:40:24 GMT Subject: RFR: 8368465: [leyden] Improve precompiler method selection code [v4] In-Reply-To: References: Message-ID: On Fri, 17 Oct 2025 21:35:04 GMT, Aleksey Shipilev wrote: >> Forked from [JDK-8366681](https://bugs.openjdk.org/browse/JDK-8366681): there are still some cleanups/performance improvements possible. Current selection code is a bit hairy, and turns out the changes I made for previous patch improve performance. >> >> Notable improvements: >> 1. Push the compilation level filters downwards. This allows compiling A2 from T2/T3 code more easily, and allows to implement policies for compiling on any A* level based on observing top-compiled T* levels. >> 2. Sort methods by hotness and code size. This looks to have a positive effect on shorter workloads, I suspect because we are avoiding a lot of C1 compilations by preloading hottest code first. 
>> >> Additional testing: >> - [x] Performance tests (see comments) >> - [x] Linux x86_64 server fastdebug, `runtime/cds` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Drop the mention of MDO Getting back to this... ------------- PR Comment: https://git.openjdk.org/leyden/pull/99#issuecomment-3601598908 From shade at openjdk.org Tue Dec 2 11:40:25 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 2 Dec 2025 11:40:25 GMT Subject: RFR: 8368465: [leyden] Improve precompiler method selection code [v3] In-Reply-To: References: <6e95ti55Z9-fcZpSIirJolx6vMcmN01T40eYSD8nIbU=.95a7423d-685a-4918-bf65-73a0358b588d@github.com> Message-ID: On Fri, 17 Oct 2025 23:50:02 GMT, Vladimir Kozlov wrote: > We bulk load all "Preload" AOT code - ordering does not matter for it. Even if we load in selected order. It is one thread which does the loading, and it is blocking (I am actually playing with spreading this preload across all compiler threads - did not see much effect on startup). Um, I don't think that's the case? Preloading is asynchronous. See `CompileBroker::compile_method`: bool is_blocking = ReplayCompiles || !directive->BackgroundCompilationOption || (PreloadBlocking && (compile_reason == CompileTask::Reason_Preload)); compile_method_base(method, osr_bci, comp_level, hot_count, compile_reason, requires_online_compilation, is_blocking, THREAD); We have the option to _make_ preload blocking (`PreloadBlocking`), but it is turned off by default. We know enabling `+PreloadBlocking` is counter-productive, because it could easily take hundreds of milliseconds. So while the compilers are working through preloading the code, the application runs and can trigger compilations. Sorting the preloaded methods allows loading the hottest code before that code transitions to normal compilation. This is the problem I am trying to mitigate. Maybe the compilation policy should actually participate in preloading: i.e. 
if there is hot code that transitions from T0 to any other level, attempt the preload first, in case normal preloading is lagging behind. That would be more intrusive, though, so as a conservative approach I would like to prioritize the more profitable preload code first. ------------- PR Review Comment: https://git.openjdk.org/leyden/pull/99#discussion_r2580796305 From shade at openjdk.org Tue Dec 2 16:41:52 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 2 Dec 2025 16:41:52 GMT Subject: RFR: 8368465: [leyden] Improve precompiler method selection code [v5] In-Reply-To: References: Message-ID: On Tue, 2 Dec 2025 11:40:23 GMT, Aleksey Shipilev wrote: >> Forked from [JDK-8366681](https://bugs.openjdk.org/browse/JDK-8366681): there are still some cleanups/performance improvements possible. The current selection code is a bit hairy, and it turns out the changes I made for the previous patch improve performance. >> >> Notable improvements: >> 1. Push the compilation level filters downwards. This allows compiling A2 from T2/T3 code more easily, and allows implementing policies for compiling on any A* level based on observing top-compiled T* levels. >> 2. Sort methods by hotness and code size. This appears to have a positive effect on shorter workloads, I suspect because we avoid a lot of C1 compilations by preloading the hottest code first. 
The pull request contains seven additional commits since the last revision: > > - Merge branch 'premain' into JDK-8368465-precompiler-method-select > - Drop the mention of MDO > - Merge branch 'premain' into JDK-8368465-precompiler-method-select > - Merge branch 'premain' into JDK-8368465-precompiler-method-select > - Touchup > - Touchups > - Fix I re-merged with the current `premain`, re-measured some light benchmarks, and the performance improvements are still there. I still believe this is a useful thing to do for infrastructural reasons (it gives me access to more advanced selection policies), and the performance boost comes as a nice bonus. There are other possibilities for optimizing the interaction with preload code, and those can and should be done separately, IMO. Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -XX:AOTCache=app.aot -Xms64m -Xmx1g -XX:+UseSerialGC -cp JavacBenchApp.jar JavacBenchApp 50 ### 2 cores # Baseline Time (mean ± σ): 425.7 ms ± 17.6 ms [User: 667.3 ms, System: 96.3 ms] Range (min … max): 404.8 ms … 458.1 ms 10 runs Time (mean ± σ): 427.5 ms ± 18.3 ms [User: 668.7 ms, System: 99.6 ms] Range (min … max): 399.2 ms … 451.0 ms 10 runs Time (mean ± σ): 418.6 ms ± 11.6 ms [User: 657.2 ms, System: 96.1 ms] Range (min … max): 402.5 ms … 436.7 ms 10 runs # Patched Time (mean ± σ): 373.4 ms ± 11.7 ms [User: 547.1 ms, System: 89.7 ms] Range (min … max): 359.3 ms … 397.5 ms 10 runs Time (mean ± σ): 363.4 ms ± 8.5 ms [User: 511.6 ms, System: 92.6 ms] Range (min … max): 346.2 ms … 373.7 ms 10 runs Time (mean ± σ): 370.4 ms ± 11.9 ms [User: 520.3 ms, System: 93.4 ms] Range (min … max): 353.4 ms … 
384.3 ms 10 runs ------------- PR Comment: https://git.openjdk.org/leyden/pull/99#issuecomment-3602972698 From kvn at openjdk.org Tue Dec 2 22:02:19 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 2 Dec 2025 22:02:19 GMT Subject: RFR: 8368465: [leyden] Improve precompiler method selection code [v3] In-Reply-To: References: <6e95ti55Z9-fcZpSIirJolx6vMcmN01T40eYSD8nIbU=.95a7423d-685a-4918-bf65-73a0358b588d@github.com> Message-ID: On Tue, 2 Dec 2025 11:35:43 GMT, Aleksey Shipilev wrote: > So while the compilers are working through preloading the code, the application runs and can trigger compilations. Sorting the preloaded methods allows loading the hottest code before that code transitions to normal compilation. This is the problem I am trying to mitigate. Okay, I think I understand how the compile ID may trip us up. If a T4 compilation happens several times for the same method during the training run due to deoptimization, we will cache only the last corresponding AP4: if (entry->for_preload()) { if (entry->not_entrant()) { // Skip not entrant preload code: such an entry will have a high compile ID. We could keep the early ID and use it for the cached AP4 to avoid this. Which leads to another issue. In the initial AOT code implementation I kept a deoptimization counter for A4 and used it when searching for the A4 to load in the production run. We removed that counter but kept all versions of A4; `find_entry()` will return the first A4 it finds, which may have a lot more uncommon traps than the latest A4 version. Maybe we should filter A4 the same way we do AP4. 
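The filtering described above can be sketched in miniature. This is a hypothetical model, not HotSpot's actual types: `CodeEntry` and `findEntry` are illustrative names for a lookup that skips not-entrant cached code versions, so a stale, much-deoptimized version is never handed back.

```java
import java.util.List;

// Toy model of skipping not-entrant cached code entries during lookup.
// All names here are illustrative, not the real AOT code cache API.
public class EntrySelection {
    record CodeEntry(int compileId, boolean notEntrant) {}

    // Return the first entrant entry in cache order, like a filtered
    // find_entry(); null if every cached version was made not entrant.
    static CodeEntry findEntry(List<CodeEntry> versions) {
        for (CodeEntry e : versions) {
            if (!e.notEntrant()) {
                return e;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Two versions deoptimized during the training run precede the
        // surviving one, which carries the highest compile ID.
        List<CodeEntry> versions = List.of(
            new CodeEntry(120, true),
            new CodeEntry(340, true),
            new CodeEntry(512, false));
        System.out.println(findEntry(versions).compileId()); // prints 512
    }
}
```

Without the `notEntrant` filter, the same cache-order walk would return the first (stale) version, which mirrors the `find_entry()` concern above.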
------------- PR Review Comment: https://git.openjdk.org/leyden/pull/99#discussion_r2582832469 From kvn at openjdk.org Tue Dec 2 22:02:26 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 2 Dec 2025 22:02:26 GMT Subject: RFR: 8368465: [leyden] Improve precompiler method selection code [v3] In-Reply-To: References: <6e95ti55Z9-fcZpSIirJolx6vMcmN01T40eYSD8nIbU=.95a7423d-685a-4918-bf65-73a0358b588d@github.com> Message-ID: On Tue, 2 Dec 2025 21:28:53 GMT, Vladimir Kozlov wrote: >>> We bulk load all "Preload" AOT code - ordering does not matter for it. Even if we load in selected order. It is one thread which do loading and it is blocking (I actually playing with spreading this preload on all compiler threads - did not see much affect on startup). >> >> Um, I don't think that's the case? Preloading is asynchronous. See `CompileBroker::compile_method`: >> >> >> bool is_blocking = ReplayCompiles || >> !directive->BackgroundCompilationOption || >> (PreloadBlocking && (compile_reason == CompileTask::Reason_Preload)); >> compile_method_base(method, osr_bci, comp_level, hot_count, compile_reason, requires_online_compilation, is_blocking, THREAD); >> >> >> We have the option to _make_ preload blocking (`PreloadBlocking`), but it is turned off by default. We know enabling `+PreloadBlocking` is counter-productive, because it could easily take hundreds of milliseconds. >> >> So while compilers are working through preloading the code, the application runs and can trigger compilations. Sorting preloading methods allows loading hottest code before that code transits to normal compilation. This is the problem I am trying to mitigate. >> >> Maybe the compilation policy should actually participate in preloading: i.e. if there is a hot code that transitions from T0 to any other level, attempt the preload first, in case normal preloading is lagging behind. 
That would be more intrusive, though, so as the conservative approach I would like to prioritize more profitable preload code first. > >> So while compilers are working through preloading the code, the application runs and can trigger compilations. Sorting preloading methods allows loading hottest code before that code transits to normal compilation. This is the problem I am trying to mitigate. > > Okay, I think I can understand when compile ID may screw up us. If T4 compilation during training run happens several times for the same method due to deoptimization we will cache only last corresponding AP4: > > if (entry->for_preload()) { > if (entry->not_entrant()) { > // Skip not entrant preload code: > > such entry will have high compile ID. We can keep early ID and used it for cached AP4 to avoid this. > > Which leads to an other issue. In initial AOT code implementation I kept deoptimization counter for A4 and use it when search A4 to load in production run. We removed that counter but kept all versions of A4 but `find_entry()` will return first A4 it found which may have a lot more uncommon traps when latest A4 version. May be we should filter A4 the same way we do for AP4. By "blocking" I mean that we have only one AOT compiler thread to load AP4. ------------- PR Review Comment: https://git.openjdk.org/leyden/pull/99#discussion_r2582836251 From kvn at openjdk.org Tue Dec 2 22:02:33 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 2 Dec 2025 22:02:33 GMT Subject: RFR: 8368465: [leyden] Improve precompiler method selection code [v3] In-Reply-To: References: <6e95ti55Z9-fcZpSIirJolx6vMcmN01T40eYSD8nIbU=.95a7423d-685a-4918-bf65-73a0358b588d@github.com> Message-ID: On Tue, 2 Dec 2025 21:30:18 GMT, Vladimir Kozlov wrote: >>> So while compilers are working through preloading the code, the application runs and can trigger compilations. Sorting preloading methods allows loading hottest code before that code transits to normal compilation. 
This is the problem I am trying to mitigate. >> >> Okay, I think I understand how the compile ID may trip us up. If a T4 compilation happens several times for the same method during the training run due to deoptimization, we will cache only the last corresponding AP4: >> >> if (entry->for_preload()) { >> if (entry->not_entrant()) { >> // Skip not entrant preload code: >> >> such an entry will have a high compile ID. We could keep the early ID and use it for the cached AP4 to avoid this. >> >> Which leads to another issue. In the initial AOT code implementation I kept a deoptimization counter for A4 and used it when searching for the A4 to load in the production run. We removed that counter but kept all versions of A4; `find_entry()` will return the first A4 it finds, which may have a lot more uncommon traps than the latest A4 version. Maybe we should filter A4 the same way we do AP4. > > By "blocking" I mean that we have only one AOT compiler thread to load AP4. Maybe your change reduced the number of AOT-compiled nmethods in the cache, which allows faster processing. 
> Please run with `-Xlog:aot+codecache+init=debug -XX:+CITime` for production run to see how many AOT nmethods in AOT cache and how many were loaded/used. Actually... Now I see we generate, and thus use substantially A2 code! This also aligns with performance data: we have way fewer C1 compilations with this patch. # ==== Baseline Create: [4.593s][info][precompile] Precompilation for level 1 finished (94 successful out of 94 total) [4.604s][info][precompile] Precompilation for level 2 finished (131 successful out of 131 total) [4.814s][info][precompile] Precompilation for level 2 finished (1852 successful out of 1852 total) [6.035s][info][precompile] Precompilation for level 4 finished (1660 successful out of 1660 total) [4.589s][info][precompile] Precompilation for level 5 finished (1660 successful out of 1660 total) Use: Tier1 {speed: 42838.159 bytes/s; standard: 0.014 s, 582 bytes, 135 methods; ...} Tier2 {speed: 210303.802 bytes/s; standard: 0.311 s, 63857 bytes, 817 methods; ...} Tier3 {speed: 134013.414 bytes/s; standard: 0.035 s, 4685 bytes, 245 methods; ...} Tier4 {speed: 69205.374 bytes/s; standard: 0.051 s, 3225 bytes, 13 methods; ...} AOT Code T1 {speed: 297580.645 bytes/s; standard: 0.001 s, 369 bytes, 94 methods; ...} AOT Code T2 {speed: 5654043.587 bytes/s; standard: 0.042 s, 237861 bytes, 1969 methods; ...} AOT Code T4 {speed: 25219362.296 bytes/s; standard: 0.029 s, 737408 bytes, 927 methods; ...} AOT Code T5 {speed: 30793594.418 bytes/s; standard: 0.048 s, 1474270 bytes, 1658 methods; ...} # ==== Patched Create: [3.984s][info][precompile] Precompilation for level 1 finished (311 successful out of 311 total) [4.382s][info][precompile] Precompilation for level 2 finished (2752 successful out of 2752 total) [4.383s][info][precompile] Precompilation for level 3 finished (0 successful out of 0 total) [5.392s][info][precompile] Precompilation for level 4 finished (1641 successful out of 1641 total) [3.972s][info][precompile] Precompilation for level 5 
finished (1641 successful out of 1641 total) Use: Tier1 {speed: 0.000 bytes/s; standard: 0.000 s, 0 bytes, 0 methods; ... Tier2 {speed: 579987.470 bytes/s; standard: 0.026 s, 15526 bytes, 44 methods; ... Tier3 {speed: 181499.273 bytes/s; standard: 0.026 s, 4761 bytes, 254 methods; ... Tier4 {speed: 77265.133 bytes/s; standard: 0.027 s, 2087 bytes, 12 methods; ... AOT Code T1 {speed: 432360.583 bytes/s; standard: 0.002 s, 942 bytes, 228 methods; ... AOT Code T2 {speed: 6664604.248 bytes/s; standard: 0.042 s, 281287 bytes, 2735 methods; ... AOT Code T4 {speed: 26296881.658 bytes/s; standard: 0.026 s, 682331 bytes, 924 methods; ... AOT Code T5 {speed: 33814172.284 bytes/s; standard: 0.042 s, 1430045 bytes, 1632 methods; ... So I have changed something in selection code that takes on more A2 compiles, profitably. Have not yet confirmed if preload order has any effect on top of that. Investigating... ------------- PR Review Comment: https://git.openjdk.org/leyden/pull/99#discussion_r2584978192 From shade at openjdk.org Wed Dec 3 13:32:33 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 3 Dec 2025 13:32:33 GMT Subject: RFR: 8368465: [leyden] Improve precompiler method selection code [v3] In-Reply-To: References: <6e95ti55Z9-fcZpSIirJolx6vMcmN01T40eYSD8nIbU=.95a7423d-685a-4918-bf65-73a0358b588d@github.com> Message-ID: On Wed, 3 Dec 2025 12:46:31 GMT, Aleksey Shipilev wrote: >> May be your change reduced number of AOT compiled nmethod in cache which allow faster processing. >> Please run with `-Xlog:aot+codecache+init=debug -XX:+CITime` for production run to see how many AOT nmethods in AOT cache and how many were loaded/used. > > Actually... Now I see we generate, and thus use substantially A2 code! This also aligns with performance data: we have way fewer C1 compilations with this patch. 
> > > # ==== Baseline > > Create: > [4.593s][info][precompile] Precompilation for level 1 finished (94 successful out of 94 total) > [4.604s][info][precompile] Precompilation for level 2 finished (131 successful out of 131 total) > [4.814s][info][precompile] Precompilation for level 2 finished (1852 successful out of 1852 total) > [6.035s][info][precompile] Precompilation for level 4 finished (1660 successful out of 1660 total) > [4.589s][info][precompile] Precompilation for level 5 finished (1660 successful out of 1660 total) > > Use: > Tier1 {speed: 42838.159 bytes/s; standard: 0.014 s, 582 bytes, 135 methods; ...} > Tier2 {speed: 210303.802 bytes/s; standard: 0.311 s, 63857 bytes, 817 methods; ...} > Tier3 {speed: 134013.414 bytes/s; standard: 0.035 s, 4685 bytes, 245 methods; ...} > Tier4 {speed: 69205.374 bytes/s; standard: 0.051 s, 3225 bytes, 13 methods; ...} > AOT Code T1 {speed: 297580.645 bytes/s; standard: 0.001 s, 369 bytes, 94 methods; ...} > AOT Code T2 {speed: 5654043.587 bytes/s; standard: 0.042 s, 237861 bytes, 1969 methods; ...} > AOT Code T4 {speed: 25219362.296 bytes/s; standard: 0.029 s, 737408 bytes, 927 methods; ...} > AOT Code T5 {speed: 30793594.418 bytes/s; standard: 0.048 s, 1474270 bytes, 1658 methods; ...} > > > # ==== Patched > > Create: > [3.984s][info][precompile] Precompilation for level 1 finished (311 successful out of 311 total) > [4.382s][info][precompile] Precompilation for level 2 finished (2752 successful out of 2752 total) > [4.383s][info][precompile] Precompilation for level 3 finished (0 successful out of 0 total) > [5.392s][info][precompile] Precompilation for level 4 finished (1641 successful out of 1641 total) > [3.972s][info][precompile] Precompilation for level 5 finished (1641 successful out of 1641 total) > > Use: > Tier1 {speed: 0.000 bytes/s; standard: 0.000 s, 0 bytes, 0 methods; ... > Tier2 {speed: 579987.470 bytes/s; standard: 0.026 s, 15526 bytes, 44 methods; ... 
> Tier3 {speed: 181499.273 bytes/s; standard: 0.026 s, 4761 bytes, 254 methods; ... > Tier4 {speed: 77265.133 bytes/s; standard: 0.027 s, 2087 bytes, 12 methods; ... > AOT Code T1 {speed: 432360.583 bytes/s; standard: 0.002 s, 942 bytes, 228 methods; ... > AOT Code T2 {speed: 6664604.248 bytes/s; standard: 0.042 s, 281287 bytes, 2735 methods; ... > AOT Code T4 {speed: 26296881.658 bytes/s... LOL, I think I found the performance bug in the original code that I fixed by accident. `MTD::highest_level()` includes the cases when the method is inlined. For T3 code, it would return `4` if we ended up inlining that method into T4 code, which would then fail the inclusion check for A2 compilation even though we did have a legit top-level T3 compile. I would say the fact that we have inlined T3 into _some_ T4 should _not_ disqualify A2 compilation. This change alone gives the same kind of performance boost as my patch: diff --git a/src/hotspot/share/compiler/precompiler.cpp b/src/hotspot/share/compiler/precompiler.cpp index 04f95857a63..8a5da803b04 100644 --- a/src/hotspot/share/compiler/precompiler.cpp +++ b/src/hotspot/share/compiler/precompiler.cpp @@ -84,7 +84,7 @@ class PrecompileIterator : StackObj { static int compile_id(Method* m, int level) { MethodTrainingData* mtd = m->method_holder()->is_loaded() ? MethodTrainingData::find(methodHandle(Thread::current(), m)) : nullptr; - if (mtd != nullptr && mtd->highest_level() == level) { + if (mtd != nullptr && mtd->highest_top_level() == level) { CompileTrainingData* ctd = mtd->last_toplevel_compile(level); if (ctd != nullptr) { return ctd->compile_id(); This thing is confusing, and yet another reason why I think this PR is more understandable: it very explicitly checks `MTD::highest_top_level()` when deciding whether to accept the method. It does not have this bug, and the acceptance check also does not implicitly participate in the filtering by `compile_id() < INT_MAX`. 
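The distinction behind this bug can be sketched with a toy model. The names below are illustrative, not the real `MethodTrainingData` API: one function counts every level a method's code reached, including levels where it was merely inlined into someone else's compile, while the other counts only the method's own top-level compilations.

```java
import java.util.List;

// Toy model of highest_level() vs highest_top_level(): a method
// top-compiled at T3 but inlined into T4 code reports 4 from the
// former and 3 from the latter. Names are illustrative only.
public class LevelCheck {
    // Max over both the method's own compiles and compiles it was inlined into.
    static int highestLevel(List<Integer> topLevel, List<Integer> inlinedAt) {
        int max = 0;
        for (int l : topLevel)  max = Math.max(max, l);
        for (int l : inlinedAt) max = Math.max(max, l);
        return max;
    }

    // Max over the method's own top-level compiles only.
    static int highestTopLevel(List<Integer> topLevel) {
        int max = 0;
        for (int l : topLevel) max = Math.max(max, l);
        return max;
    }

    public static void main(String[] args) {
        // Method top-compiled at T2 and T3, also inlined into some T4 code.
        List<Integer> topLevel = List.of(2, 3);
        List<Integer> inlinedAt = List.of(4);

        // Eligibility check "== 3" for an A2 compile from top-level T3 code:
        System.out.println(highestLevel(topLevel, inlinedAt) == 3); // false: inlining disqualifies it
        System.out.println(highestTopLevel(topLevel) == 3);         // true: the fixed check accepts it
    }
}
```

In this sketch, the buggy check rejects a method whose only T4 involvement was being inlined, exactly the A2-disqualification effect described above.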
------------- PR Review Comment: https://git.openjdk.org/leyden/pull/99#discussion_r2585116344 From shade at openjdk.org Wed Dec 3 13:58:45 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 3 Dec 2025 13:58:45 GMT Subject: RFR: 8368465: [leyden] Improve precompiler method selection code [v6] In-Reply-To: References: Message-ID: > Forked from [JDK-8366681](https://bugs.openjdk.org/browse/JDK-8366681): there are still some cleanups/performance improvements possible. Current selection code is a bit hairy, and turns out the changes I made for previous patch improve performance. > > Notable improvements: > 1. Push the compilation level filters downwards. This allows compiling A2 from T2/T3 code more easily, and allows to implement policies for compiling on any A* level based on observing top-compiled T* levels. > 2. Sort methods by hotness and code size. This looks to have a positive effect on shorter workloads, I suspect because we are avoiding a lot of C1 compilations by preloading hottest code first. 
> > Additional testing: > - [x] Performance tests (see comments) > - [x] Linux x86_64 server fastdebug, `runtime/cds` Aleksey Shipilev has updated the pull request incrementally with three additional commits since the last revision: - More cosmetics - Improve compile ID sorting - Revert sorting by method count ------------- Changes: - all: https://git.openjdk.org/leyden/pull/99/files - new: https://git.openjdk.org/leyden/pull/99/files/3d298056..fc30a139 Webrevs: - full: https://webrevs.openjdk.org/?repo=leyden&pr=99&range=05 - incr: https://webrevs.openjdk.org/?repo=leyden&pr=99&range=04-05 Stats: 60 lines in 3 files changed: 14 ins; 35 del; 11 mod Patch: https://git.openjdk.org/leyden/pull/99.diff Fetch: git fetch https://git.openjdk.org/leyden.git pull/99/head:pull/99 PR: https://git.openjdk.org/leyden/pull/99 From shade at openjdk.org Wed Dec 3 13:58:47 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 3 Dec 2025 13:58:47 GMT Subject: RFR: 8368465: [leyden] Improve precompiler method selection code [v3] In-Reply-To: References: <6e95ti55Z9-fcZpSIirJolx6vMcmN01T40eYSD8nIbU=.95a7423d-685a-4918-bf65-73a0358b588d@github.com> Message-ID: On Wed, 3 Dec 2025 13:27:39 GMT, Aleksey Shipilev wrote: >> Actually... Now I see we generate, and thus use substantially A2 code! This also aligns with performance data: we have way fewer C1 compilations with this patch. 
>> >> >> # ==== Baseline >> >> Create: >> [4.593s][info][precompile] Precompilation for level 1 finished (94 successful out of 94 total) >> [4.604s][info][precompile] Precompilation for level 2 finished (131 successful out of 131 total) >> [4.814s][info][precompile] Precompilation for level 2 finished (1852 successful out of 1852 total) >> [6.035s][info][precompile] Precompilation for level 4 finished (1660 successful out of 1660 total) >> [4.589s][info][precompile] Precompilation for level 5 finished (1660 successful out of 1660 total) >> >> Use: >> Tier1 {speed: 42838.159 bytes/s; standard: 0.014 s, 582 bytes, 135 methods; ...} >> Tier2 {speed: 210303.802 bytes/s; standard: 0.311 s, 63857 bytes, 817 methods; ...} >> Tier3 {speed: 134013.414 bytes/s; standard: 0.035 s, 4685 bytes, 245 methods; ...} >> Tier4 {speed: 69205.374 bytes/s; standard: 0.051 s, 3225 bytes, 13 methods; ...} >> AOT Code T1 {speed: 297580.645 bytes/s; standard: 0.001 s, 369 bytes, 94 methods; ...} >> AOT Code T2 {speed: 5654043.587 bytes/s; standard: 0.042 s, 237861 bytes, 1969 methods; ...} >> AOT Code T4 {speed: 25219362.296 bytes/s; standard: 0.029 s, 737408 bytes, 927 methods; ...} >> AOT Code T5 {speed: 30793594.418 bytes/s; standard: 0.048 s, 1474270 bytes, 1658 methods; ...} >> >> >> # ==== Patched >> >> Create: >> [3.984s][info][precompile] Precompilation for level 1 finished (311 successful out of 311 total) >> [4.382s][info][precompile] Precompilation for level 2 finished (2752 successful out of 2752 total) >> [4.383s][info][precompile] Precompilation for level 3 finished (0 successful out of 0 total) >> [5.392s][info][precompile] Precompilation for level 4 finished (1641 successful out of 1641 total) >> [3.972s][info][precompile] Precompilation for level 5 finished (1641 successful out of 1641 total) >> >> Use: >> Tier1 {speed: 0.000 bytes/s; standard: 0.000 s, 0 bytes, 0 methods; ... >> Tier2 {speed: 579987.470 bytes/s; standard: 0.026 s, 15526 bytes, 44 methods; ... 
>> Tier3 {speed: 181499.273 bytes/s; standard: 0.026 s, 4761 bytes, 254 methods; ... >> Tier4 {speed: 77265.133 bytes/s; standard: 0.027 s, 2087 bytes, 12 methods; ... >> AOT Code T1 {speed: 432360.583 bytes/s; standard: 0.002 s, 942 bytes, 228 methods; ... >> AOT Code T2 {speed: 6664604.248 bytes/s; standard: 0.042... > > LOL, I think I found the performance bug in the original code that I fixed by accident. `MTD::highest_level()` includes the cases when method is inlined. For T2/T3 code, it would return `4` if we ended up inlining that method into T4 code. Which would fail the inclusion check for A2 compilation, even though we did have a legit top-level T2/T3 compile. I would say the fact we have inlined T2/T3 in _some_ T4 should _not_ disqualify A2 compilation. > > This change alone gives the same kind of performance boost as my patch: > > > diff --git a/src/hotspot/share/compiler/precompiler.cpp b/src/hotspot/share/compiler/precompiler.cpp > index 04f95857a63..8a5da803b04 100644 > --- a/src/hotspot/share/compiler/precompiler.cpp > +++ b/src/hotspot/share/compiler/precompiler.cpp > @@ -84,7 +84,7 @@ class PrecompileIterator : StackObj { > > static int compile_id(Method* m, int level) { > MethodTrainingData* mtd = m->method_holder()->is_loaded() ? MethodTrainingData::find(methodHandle(Thread::current(), m)) : nullptr; > - if (mtd != nullptr && mtd->highest_level() == level) { > + if (mtd != nullptr && mtd->highest_top_level() == level) { > CompileTrainingData* ctd = mtd->last_toplevel_compile(level); > if (ctd != nullptr) { > return ctd->compile_id(); > > > This thing is confusing, and yet another reason why I this PR looks more understandable: it very explicitly checks `MTD::highest_top_level()` when deciding whether to accept the method. It does not do this with this bug, and also does not implicitly participate in filtering by `compile_id() < INT_MAX`. Reverted the sorting back to compile IDs instead of counters/size, as it does not affect performance, really. 
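The ordering that was reverted to can be sketched as a plain sort by training-run compile ID, so methods that got compiled earlier in the training run come out first. The `Candidate` record and method names are hypothetical, for illustration only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Minimal sketch of ordering the precompile list by training-run
// compile ID. The types here are illustrative, not HotSpot's.
public class PreloadOrder {
    record Candidate(String name, int compileId) {}

    // Return a copy of the list sorted by ascending compile ID.
    static List<Candidate> sortByCompileId(List<Candidate> in) {
        List<Candidate> out = new ArrayList<>(in);
        out.sort(Comparator.comparingInt(Candidate::compileId));
        return out;
    }

    public static void main(String[] args) {
        List<Candidate> methods = List.of(
            new Candidate("C.slowPath", 900),
            new Candidate("A.hotLoop", 17),
            new Candidate("B.helper", 254));
        for (Candidate c : sortByCompileId(methods)) {
            System.out.println(c.name()); // A.hotLoop, B.helper, C.slowPath
        }
    }
}
```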
------------- PR Review Comment: https://git.openjdk.org/leyden/pull/99#discussion_r2585220837 From kvn at openjdk.org Wed Dec 3 20:29:00 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 3 Dec 2025 20:29:00 GMT Subject: RFR: 8368465: [leyden] Improve precompiler method selection code [v6] In-Reply-To: References: Message-ID: <6cfyj5vijT_yyyjhMynhBXqE6hT2Wt0GUoKqYbFwBIc=.820c4dda-b18d-452b-a10e-e4c8ed32753b@github.com> On Wed, 3 Dec 2025 13:58:45 GMT, Aleksey Shipilev wrote: >> Forked from [JDK-8366681](https://bugs.openjdk.org/browse/JDK-8366681): there are still some cleanups/performance improvements possible. Current selection code is a bit hairy, and turns out the changes I made for previous patch improve performance. >> >> Notable improvements: >> 1. Push the compilation level filters downwards. This allows compiling A2 from T2/T3 code more easily, and allows to implement policies for compiling on any A* level based on observing top-compiled T* levels. >> 2. Sort methods by hotness and code size. This looks to have a positive effect on shorter workloads, I suspect because we are avoiding a lot of C1 compilations by preloading hottest code first. >> >> Additional testing: >> - [x] Performance tests (see comments) >> - [x] Linux x86_64 server fastdebug, `runtime/cds` > > Aleksey Shipilev has updated the pull request incrementally with three additional commits since the last revision: > > - More cosmetics > - Improve compile ID sorting > - Revert sorting by method count Good. Let me test it. 
------------- PR Review: https://git.openjdk.org/leyden/pull/99#pullrequestreview-3536795120 From asmehra at redhat.com Thu Dec 4 03:54:03 2025 From: asmehra at redhat.com (Ashutosh Mehra) Date: Wed, 3 Dec 2025 22:54:03 -0500 Subject: Redundant MethodCounters in preimage Message-ID: Hi, While working on the leyden-analyzer tool Maria mentioned that MethodCounters for a particular method is present in the preimage generated by the training run but not in the final AOT Cache generated by the assembly phase. In fact the number of MethodCounters in the preimage is way more than that in the final image. For an app that Maria was using, map file for training phase shows: $ grep "@@ MethodCounters " t.map |wc -l 22495 and the map file for assembly phase shows: $ grep "@@ MethodCounters " a.map |wc -l 8701 I checked the code and realized there are two ways MethodCounters can get into the cache: 1. through Method (see Method::metaspace_pointers_do) 2. through MethodTrainingData (see MethodTrainingData::metaspace_pointers_do) The first point explains the presence of so many MethodCounters, because any method that gets executed would have its MethodCounters added to the cache. Interestingly the link between Method and MethodCounters is then severed in Method::unlink_method(). That means the Methods adopted from the preimage in the assembly phase do not have MethodCounters. And since we don't execute the application, we don't create MethodCounters for the Methods loaded from the preimage. Only the MethodCounters discoverable via MethodTrainingData seep into the AOTCache. From the above explanation it looks like not every MethodCounters added to the preimage is useful. Probably we should only be adding MethodCounters through the MethodTrainingData and not through Method. This would also reduce the size of the preimage a bit. In this particular case MethodCounters size was 1437824 bytes in preimage and 551552 bytes in the AOT Cache. Does this make sense? 
Thanks, - Ashutosh Mehra -------------- next part -------------- An HTML attachment was scrubbed... URL: From mariasde at redhat.com Thu Dec 4 09:00:35 2025 From: mariasde at redhat.com (María Arias de Reyna Dominguez) Date: Thu, 4 Dec 2025 10:00:35 +0100 Subject: Redundant MethodCounters in preimage In-Reply-To: References: Message-ID: Hi! Note that there are MethodCounters added on the assembly run that don't have a MethodTrainingData associated. Maybe because those MethodCounters were created during the assembly? And then they don't reflect behaviour during training run but during assembly run? Or they are added via some other link? I don't know. But in any case, if we don't want MethodCounters with a low count value/with no MethodTrainingData, and we are excluding them from the AOT, we should be consistent with that. Example: $ grep -r "boolean java.lang.Class.isHidden()" aot.map* target/aot.map:0x00000008014c6340: @@ Method 104 boolean java.lang.Class.isHidden() target/aot.map:0x00000008014c63a8: @@ MethodCounters 64 boolean java.lang.Class.isHidden() target/aot.map:0x0000000803149908: @@ ConstMethod 64 boolean java.lang.Class.isHidden() target/aot.map.0:0x00000008015b23f0: @@ Method 104 boolean java.lang.Class.isHidden() target/aot.map.0:0x00000008015b2458: @@ MethodCounters 64 boolean java.lang.Class.isHidden() target/aot.map.0:0x00000008031d5430: @@ ConstMethod 64 boolean java.lang.Class.isHidden() It appears in both aot.map files (outputs of training and assembly). Cheers! María. On Thu, Dec 4, 2025 at 4:55 AM Ashutosh Mehra wrote: > Hi, > While working on the leyden-analyzer tool Maria mentioned that > MethodCounters for a particular method is present in the preimage generated > by the training run but not in the final AOT Cache generated by the > assembly phase. In fact the number of MethodCounters in the preimage is way > more than that in the final image.
For an app that Maria was using, map > file for training phase shows: > > $ grep "@@ MethodCounters " t.map |wc -l 22495 > > and the map file for assembly phase shows: > > $ grep "@@ MethodCounters " a.map |wc -l 8701 > > I checked the code and realized there are two ways MethodCounters can get > into the cache: > 1. through Method (see Method::metaspace_pointers_do) > 2. through MethodTrainingData > (see MethodTrainingData::metaspace_pointers_do) > > The first point explains the presence of so many MethodCounters, because > any method that gets executed would have its MethodCounters added to the > cache. > Interestingly the link between Method and MethodCounters is then severed > in Method::unlink_method(). That means the Methods adopted from the > preimage in the assembly phase do not have MethodCounters. And since we > don't execute the application, we don't create MethodCounters for the > Methods loaded from the preimage. Only the MethodCounters discoverable via > MethodTrainingData seep into the AOTCache. > > From the above explanation it looks like not every MethodCounters added to > the preimage is useful. Probably we should only be adding MethodCounters > through the MethodTrainingData and not through Method. This would also > reduce the size of the preimage a bit. In this particular case > MethodCounters size was 1437824 bytes in preimage and 551552 bytes in the > AOT Cache. > > Does this make sense? > > Thanks, > - Ashutosh Mehra > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From shade at openjdk.org Thu Dec 4 17:30:01 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 4 Dec 2025 17:30:01 GMT Subject: RFR: 8368465: [leyden] Improve precompiler method selection code [v6] In-Reply-To: <6cfyj5vijT_yyyjhMynhBXqE6hT2Wt0GUoKqYbFwBIc=.820c4dda-b18d-452b-a10e-e4c8ed32753b@github.com> References: <6cfyj5vijT_yyyjhMynhBXqE6hT2Wt0GUoKqYbFwBIc=.820c4dda-b18d-452b-a10e-e4c8ed32753b@github.com> Message-ID: On Wed, 3 Dec 2025 20:25:38 GMT, Vladimir Kozlov wrote: > Good. Let me test it. Thanks! I hope we can integrate it this year :) ------------- PR Comment: https://git.openjdk.org/leyden/pull/99#issuecomment-3613417063 From asmehra at redhat.com Thu Dec 4 22:15:27 2025 From: asmehra at redhat.com (Ashutosh Mehra) Date: Thu, 4 Dec 2025 17:15:27 -0500 Subject: Redundant MethodCounters in preimage In-Reply-To: References: Message-ID: FYI I have opened https://bugs.openjdk.org/browse/JDK-8373114 to fix it. Thanks, - Ashutosh Mehra On Thu, Dec 4, 2025 at 4:01 AM María Arias de Reyna Dominguez < mariasde at redhat.com> wrote: > Hi! > > Note that there are MethodCounters added on the assembly run that don't > have a MethodTrainingData associated. Maybe because those MethodCounters > were created during the assembly? And then they don't reflect behaviour > during training run but during assembly run? Or they are added via some > other link? I don't know. But in any case, if we don't want MethodCounters > with a low count value/with no MethodTrainingData, and we are excluding > them from the AOT, we should be consistent with that.
> > Example: > > $ grep -r "boolean java.lang.Class.isHidden()" aot.map* > target/aot.map:0x00000008014c6340: @@ Method 104 boolean > java.lang.Class.isHidden() > target/aot.map:0x00000008014c63a8: @@ MethodCounters 64 boolean > java.lang.Class.isHidden() > target/aot.map:0x0000000803149908: @@ ConstMethod 64 boolean > java.lang.Class.isHidden() > target/aot.map.0:0x00000008015b23f0: @@ Method 104 boolean > java.lang.Class.isHidden() > target/aot.map.0:0x00000008015b2458: @@ MethodCounters 64 boolean > java.lang.Class.isHidden() > target/aot.map.0:0x00000008031d5430: @@ ConstMethod 64 boolean > java.lang.Class.isHidden() > > It appears in both aot.map files (outputs of training and assembly). > > Cheers! > María. > > > On Thu, Dec 4, 2025 at 4:55 AM Ashutosh Mehra wrote: > >> Hi, >> While working on the leyden-analyzer tool Maria mentioned that >> MethodCounters for a particular method is present in the preimage generated >> by the training run but not in the final AOT Cache generated by the >> assembly phase. In fact the number of MethodCounters in the preimage is way >> more than that in the final image. For an app that Maria was using, map >> file for training phase shows: >> >> $ grep "@@ MethodCounters " t.map |wc -l 22495 >> >> and the map file for assembly phase shows: >> >> $ grep "@@ MethodCounters " a.map |wc -l 8701 >> >> I checked the code and realized there are two ways MethodCounters can get >> into the cache: >> 1. through Method (see Method::metaspace_pointers_do) >> 2. through MethodTrainingData >> (see MethodTrainingData::metaspace_pointers_do) >> >> The first point explains the presence of so many MethodCounters, because >> any method that gets executed would have its MethodCounters added to the >> cache. >> Interestingly the link between Method and MethodCounters is then severed >> in Method::unlink_method(). That means the Methods adopted from the >> preimage in the assembly phase do not have MethodCounters.
And since we >> don't execute the application, we don't create MethodCounters for the >> Methods loaded from the preimage. Only the MethodCounters discoverable via >> MethodTrainingData seep into the AOTCache. >> >> From the above explanation it looks like not every MethodCounters added >> to the preimage is useful. Probably we should only be adding MethodCounters >> through the MethodTrainingData and not through Method. This would also >> reduce the size of the preimage a bit. In this particular case >> MethodCounters size was 1437824 bytes in preimage and 551552 bytes in the >> AOT Cache. >> >> Does this make sense? >> >> Thanks, >> - Ashutosh Mehra >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shade at openjdk.org Fri Dec 5 05:40:38 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 5 Dec 2025 05:40:38 GMT Subject: RFR: 8368465: [leyden] Improve precompiler method selection code [v4] In-Reply-To: References: Message-ID: On Sat, 18 Oct 2025 00:16:00 GMT, Vladimir Kozlov wrote: >> Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: >> >> Drop the mention of MDO > > An other suggestion for this concurrent preloading would be to split A4 preload code. One set is the current which needs to wait `compute_java_loaders()`. And new one (much smaller) is for simple methods for classes which are loaded first (String, for example) which we can preload much sooner. Any news on testing, @vnkozlov? 
------------- PR Comment: https://git.openjdk.org/leyden/pull/99#issuecomment-3615382305 From kvn at openjdk.org Fri Dec 5 06:18:24 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 5 Dec 2025 06:18:24 GMT Subject: RFR: 8368465: [leyden] Improve precompiler method selection code [v4] In-Reply-To: References: Message-ID: On Sat, 18 Oct 2025 00:16:00 GMT, Vladimir Kozlov wrote: >> Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: >> >> Drop the mention of MDO > > An other suggestion for this concurrent preloading would be to split A4 preload code. One set is the current which needs to wait `compute_java_loaders()`. And new one (much smaller) is for simple methods for classes which are loaded first (String, for example) which we can preload much sooner. > Any news on testing, @vnkozlov? It is still running. There was big backlog of testing jobs. There are several failures. I need to run control testing without these changes to see if failures are new. ------------- PR Comment: https://git.openjdk.org/leyden/pull/99#issuecomment-3615466321 From kvn at openjdk.org Fri Dec 5 18:12:56 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 5 Dec 2025 18:12:56 GMT Subject: RFR: 8368465: [leyden] Improve precompiler method selection code [v6] In-Reply-To: References: Message-ID: <9u-kUcbYXH9m3CtW9oPzyMgzlO9s32ky8zGffNQVh1A=.206e7254-bc83-40f1-b4e3-cd4223d42d60@github.com> On Wed, 3 Dec 2025 13:58:45 GMT, Aleksey Shipilev wrote: >> Forked from [JDK-8366681](https://bugs.openjdk.org/browse/JDK-8366681): there are still some cleanups/performance improvements possible. Current selection code is a bit hairy, and turns out the changes I made for previous patch improve performance. >> >> Notable improvements: >> 1. Push the compilation level filters downwards. 
This allows compiling A2 from T2/T3 code more easily, and allows to implement policies for compiling on any A* level based on observing top-compiled T* levels. >> 2. Sort methods by hotness and code size. This looks to have a positive effect on shorter workloads, I suspect because we are avoiding a lot of C1 compilations by preloading hottest code first. >> >> Additional testing: >> - [x] Performance tests (see comments) >> - [x] Linux x86_64 server fastdebug, `runtime/cds` > > Aleksey Shipilev has updated the pull request incrementally with three additional commits since the last revision: > > - More cosmetics > - Improve compile ID sorting > - Revert sorting by method count Testing results are mess :( for both, these changes and control. But I don't see anything alarming. I approve changes. ------------- Marked as reviewed by kvn (Committer). PR Review: https://git.openjdk.org/leyden/pull/99#pullrequestreview-3545836697 From shade at openjdk.org Fri Dec 5 18:36:35 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 5 Dec 2025 18:36:35 GMT Subject: RFR: 8368465: [leyden] Improve precompiler method selection code [v6] In-Reply-To: References: Message-ID: <2iqCGGcLLi5Bdamod54rVSUri0VGMbpaYyEkXbo0430=.2b2218ac-52f4-4d50-808a-58228f1cd271@github.com> On Wed, 3 Dec 2025 13:58:45 GMT, Aleksey Shipilev wrote: >> Forked from [JDK-8366681](https://bugs.openjdk.org/browse/JDK-8366681): there are still some cleanups/performance improvements possible. Current selection code is a bit hairy, and turns out the changes I made for previous patch improve performance. >> >> Notable improvements: >> 1. Push the compilation level filters downwards. This allows compiling A2 from T2/T3 code more easily, and allows to implement policies for compiling on any A* level based on observing top-compiled T* levels. >> 2. Sort methods by hotness and code size. 
This looks to have a positive effect on shorter workloads, I suspect because we are avoiding a lot of C1 compilations by preloading hottest code first. >> >> Additional testing: >> - [x] Performance tests (see comments) >> - [x] Linux x86_64 server fastdebug, `runtime/cds` > > Aleksey Shipilev has updated the pull request incrementally with three additional commits since the last revision: > > - More cosmetics > - Improve compile ID sorting > - Revert sorting by method count Thanks! Here goes. ------------- PR Comment: https://git.openjdk.org/leyden/pull/99#issuecomment-3618053352 From shade at openjdk.org Fri Dec 5 18:36:35 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 5 Dec 2025 18:36:35 GMT Subject: Integrated: 8368465: [leyden] Improve precompiler method selection code In-Reply-To: References: Message-ID: <_YV8jD_1A4nSDJP9DH7jo5h432xIn55wXQzdnxRuVh4=.ba36eafc-cbb6-4a77-8d9b-cee5384d5195@github.com> On Tue, 23 Sep 2025 12:33:23 GMT, Aleksey Shipilev wrote: > Forked from [JDK-8366681](https://bugs.openjdk.org/browse/JDK-8366681): there are still some cleanups/performance improvements possible. Current selection code is a bit hairy, and turns out the changes I made for previous patch improve performance. > > Notable improvements: > 1. Push the compilation level filters downwards. This allows compiling A2 from T2/T3 code more easily, and allows to implement policies for compiling on any A* level based on observing top-compiled T* levels. > 2. Sort methods by hotness and code size. This looks to have a positive effect on shorter workloads, I suspect because we are avoiding a lot of C1 compilations by preloading hottest code first. > > Additional testing: > - [x] Performance tests (see comments) > - [x] Linux x86_64 server fastdebug, `runtime/cds` This pull request has now been integrated. 
Changeset: 9c83531e Author: Aleksey Shipilev URL: https://git.openjdk.org/leyden/commit/9c83531e88020f5762116020bcf472511058cffd Stats: 99 lines in 2 files changed: 39 ins; 27 del; 33 mod 8368465: [leyden] Improve precompiler method selection code Reviewed-by: kvn ------------- PR: https://git.openjdk.org/leyden/pull/99 From shade at openjdk.org Fri Dec 5 18:38:07 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 5 Dec 2025 18:38:07 GMT Subject: git: openjdk/leyden: premain: 8368465: [leyden] Improve precompiler method selection code Message-ID: Changeset: 9c83531e Branch: premain Author: Aleksey Shipilev Date: 2025-12-05 18:33:47 +0000 URL: https://git.openjdk.org/leyden/commit/9c83531e88020f5762116020bcf472511058cffd 8368465: [leyden] Improve precompiler method selection code Reviewed-by: kvn ! src/hotspot/share/compiler/precompiler.cpp ! src/hotspot/share/compiler/precompiler.hpp From shade at openjdk.org Mon Dec 8 13:32:42 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 8 Dec 2025 13:32:42 GMT Subject: RFR: 8373256: [leyden] Pack DataKind more densely in archive Message-ID: Spotting a little inefficiency when looking at related code. `DataKind` is always stored as `int`, but its values are actually comfortably fitting in byte. Going to `int8_t` saves about 1% of AOT cache size. If this ever becomes a problem, we can always revert back to `int32_t`. I looked around other uses of `write_bytes`, and I believe `DataKind` is the most obvious opportunity. 
Additional testing: - [x] Linux x86_64 server fastdebug, `runtime/cds` ------------- Commit messages: - Fix Changes: https://git.openjdk.org/leyden/pull/105/files Webrev: https://webrevs.openjdk.org/?repo=leyden&pr=105&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8373256 Stats: 23 lines in 2 files changed: 0 ins; 0 del; 23 mod Patch: https://git.openjdk.org/leyden/pull/105.diff Fetch: git fetch https://git.openjdk.org/leyden.git pull/105/head:pull/105 PR: https://git.openjdk.org/leyden/pull/105 From kvn at openjdk.org Mon Dec 8 15:36:01 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 8 Dec 2025 15:36:01 GMT Subject: RFR: 8373256: [leyden] Pack DataKind more densely in archive In-Reply-To: References: Message-ID: On Mon, 8 Dec 2025 13:25:29 GMT, Aleksey Shipilev wrote: > Spotting a little inefficiency when looking at related code. `DataKind` is always stored as `int`, but its values are actually comfortably fitting in byte. Going to `int8_t` saves about 1% of AOT cache size. If this ever becomes a problem, we can always revert back to `int32_t`. I looked around other uses of `write_bytes`, and I believe `DataKind` is the most obvious opportunity. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `runtime/cds` I don't think it is good optimization. The following data is assumed 4-bytes aligned when we read it. See `read_method()` and `read_klass()`. ------------- PR Review: https://git.openjdk.org/leyden/pull/105#pullrequestreview-3552662068 From shade at openjdk.org Mon Dec 8 15:53:22 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 8 Dec 2025 15:53:22 GMT Subject: RFR: 8373256: [leyden] Enforce aligned reads/writes in code cache archive In-Reply-To: References: Message-ID: On Mon, 8 Dec 2025 15:33:31 GMT, Vladimir Kozlov wrote: > `read_method` Oh. Yes, that makes sense. 
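A toy model of the alignment concern, assuming a simple stream of [kind][int32 payload] records (an illustration only, not the actual archive layout): with a 4-byte kind tag every payload stays 4-byte aligned, while a 1-byte tag pushes every subsequent int onto a misaligned offset.

```python
# Toy model: back-to-back [kind][4-byte payload] records. Shrinking the kind
# tag from 4 bytes to 1 leaves every payload int at offset 1 mod 4, which
# some platforms penalize (or reject) for direct loads.

def record_offsets(kind_size, n_records, payload_size=4):
    """Offsets of each payload int for consecutive [kind][payload] records."""
    offsets = []
    pos = 0
    for _ in range(n_records):
        pos += kind_size     # skip the kind tag
        offsets.append(pos)  # payload int starts here
        pos += payload_size
    return offsets

aligned   = record_offsets(4, 3)  # [4, 12, 20] -> all offsets % 4 == 0
unaligned = record_offsets(1, 3)  # [1, 6, 11]  -> misaligned payload reads

assert all(o % 4 == 0 for o in aligned)
assert not all(o % 4 == 0 for o in unaligned)
```

This is why an aligned-read assertion (as suggested below in the thread) is a safer outcome here than silently packing the tag tighter.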
So we should actually go the other way around: assert that we never do unaligned reads, otherwise some platforms would give us performance surprises; if not UB. Let me go that way then. ------------- PR Comment: https://git.openjdk.org/leyden/pull/105#issuecomment-3627637306 From asmehra at openjdk.org Tue Dec 9 21:49:44 2025 From: asmehra at openjdk.org (Ashutosh Mehra) Date: Tue, 9 Dec 2025 21:49:44 GMT Subject: RFR: Experiment with storing target method for static and opt-virtual callsites in reloc info Message-ID: This work aims to reduce the time taken to perform call resolution by caching the result of direct calls (static and opt-virtual) in the reloc info during compilation of a method. Relocations for static and opt-virtual calls already have a field `method_index` which is used to store the "real" method to be invoked by the method handle. It is currently only used during c2 compilations. This patch re-uses the `method_index` field for static and opt-virtual calls to store the target method. The runtime call (`SharedRuntime::resolve_helper`) used by the compiled code to perform the call site resolution can then optimize the resolution process by getting the target method from the reloc info and patches the callsite through CompiledDirectCall. No special handling is needed for AOT code. On a 4-cpu system there is around 3% improvement in `spring-boot-getting-started`. Numbers for JavacBench range between 0-3% improvement. `spring-boot-getting-started`: Run,Old CDS + AOT,New CDS + AOT 1,199,192 2,199,196 3,202,197 4,203,198 5,198,196 6,201,194 7,203,197 8,200,193 9,204,193 10,199,201 Geomean,200.79,195.68 (1.03x improvement) Stdev,1.99,2.61 `-Xlog:init` shows the numbers for time spent in call resolution from the compiled code. 
For `spring-boot-getting-started` before this patch: [0.357s][info][init] SharedRuntime: [0.357s][info][init] resolve_opt_virtual_call: 8260us / 2249 events [0.357s][info][init] resolve_virtual_call: 6899us / 1297 events [0.357s][info][init] resolve_static_call: 4646us / 1723 events [0.357s][info][init] handle_wrong_method: 680us / 145 events [0.357s][info][init] ic_miss: 2109us / 488 events [0.357s][info][init] Total: 22596us [0.357s][info][init] perf_resolve_static_cache_hit_ctr: 0 [0.357s][info][init] perf_resolve_opt_virtual_cache_hit_ctr: 0 For `spring-boot-getting-started` after this patch: [0.348s][info][init] SharedRuntime: [0.348s][info][init] resolve_opt_virtual_call: 2774us / 2251 events [0.348s][info][init] resolve_virtual_call: 5577us / 1294 events [0.348s][info][init] resolve_static_call: 1901us / 1728 events [0.348s][info][init] handle_wrong_method: 719us / 146 events [0.348s][info][init] ic_miss: 2109us / 474 events [0.348s][info][init] Total: 13082us [0.348s][info][init] perf_resolve_static_cache_hit_ctr: 1704 [0.348s][info][init] perf_resolve_opt_virtual_cache_hit_ctr: 2202 For JavacBench before this patch: [0.406s][info][init] SharedRuntime: [0.406s][info][init] resolve_opt_virtual_call: 7146us / 2354 events [0.406s][info][init] resolve_virtual_call: 7160us / 2207 events [0.406s][info][init] resolve_static_call: 2992us / 1264 events [0.406s][info][init] handle_wrong_method: 728us / 186 events [0.406s][info][init] ic_miss: 2389us / 675 events [0.406s][info][init] Total: 20416us [0.406s][info][init] perf_resolve_static_cache_hit_ctr: 0 [0.406s][info][init] perf_resolve_opt_virtual_cache_hit_ctr: 0 For JavacBench after this patch: [0.399s][info][init] SharedRuntime: [0.399s][info][init] resolve_opt_virtual_call: 2321us / 2346 events [0.399s][info][init] resolve_virtual_call: 7452us / 2213 events [0.399s][info][init] resolve_static_call: 1264us / 1258 events [0.399s][info][init] handle_wrong_method: 747us / 177 events [0.399s][info][init] ic_miss: 
2395us / 665 events [0.399s][info][init] Total: 14180us [0.399s][info][init] perf_resolve_static_cache_hit_ctr: 1212 [0.399s][info][init] perf_resolve_opt_virtual_cache_hit_ctr: 2236 ------------- Commit messages: - Experiment with storing target method for static and opt-virtual call sites in reloc info Changes: https://git.openjdk.org/leyden/pull/106/files Webrev: https://webrevs.openjdk.org/?repo=leyden&pr=106&range=00 Stats: 231 lines in 22 files changed: 156 ins; 7 del; 68 mod Patch: https://git.openjdk.org/leyden/pull/106.diff Fetch: git fetch https://git.openjdk.org/leyden.git pull/106/head:pull/106 PR: https://git.openjdk.org/leyden/pull/106 From asmehra at openjdk.org Wed Dec 10 02:08:45 2025 From: asmehra at openjdk.org (Ashutosh Mehra) Date: Wed, 10 Dec 2025 02:08:45 GMT Subject: RFR: Experiment with storing target method for static and opt-virtual callsites in reloc info In-Reply-To: References: Message-ID: <6A9jxjGZak5oYPnfoSBY-SoHSin3BxImeHkR-9BJFjk=.e7aa26cf-44d7-4aed-a83e-25128bdd858b@github.com> On Tue, 9 Dec 2025 21:15:23 GMT, Ashutosh Mehra wrote: > This work aims to reduce the time taken to perform call resolution by caching the result of direct calls (static and opt-virtual) in the reloc info during compilation of a method. > Relocations for static and opt-virtual calls already have a field `method_index` which is used to store the "real" method to be invoked by the method handle. It is currently only used during c2 compilations. > This patch re-uses the `method_index` field for static and opt-virtual calls to store the target method. The runtime call (`SharedRuntime::resolve_helper`) used by the compiled code to perform the call site resolution can then optimize the resolution process by getting the target method from the reloc info and patches the callsite through CompiledDirectCall. > No special handling is needed for AOT code. > > On a 4-cpu system there is around 3% improvement in `spring-boot-getting-started`. 
Numbers for JavacBench range between 0-3% improvement. > > `spring-boot-getting-started`: > > Run,Old CDS + AOT,New CDS + AOT > 1,199,192 > 2,199,196 > 3,202,197 > 4,203,198 > 5,198,196 > 6,201,194 > 7,203,197 > 8,200,193 > 9,204,193 > 10,199,201 > Geomean,200.79,195.68 (1.03x improvement) > Stdev,1.99,2.61 > > > `-Xlog:init` shows the numbers for time spent in call resolution from the compiled code. > For `spring-boot-getting-started` before this patch: > > [0.357s][info][init] SharedRuntime: > [0.357s][info][init] resolve_opt_virtual_call: 8260us / 2249 events > [0.357s][info][init] resolve_virtual_call: 6899us / 1297 events > [0.357s][info][init] resolve_static_call: 4646us / 1723 events > [0.357s][info][init] handle_wrong_method: 680us / 145 events > [0.357s][info][init] ic_miss: 2109us / 488 events > [0.357s][info][init] Total: 22596us > [0.357s][info][init] perf_resolve_static_cache_hit_ctr: 0 > [0.357s][info][init] perf_resolve_opt_virtual_cache_hit_ctr: 0 > > > For `spring-boot-getting-started` after this patch: > > [0.348s][info][init] SharedRuntime: > [0.348s][info][init] resolve_opt_virtual_call: 2774us / 2251 events > [0.348s][info][init] resolve_virtual_call: 5577us / 1294 events > [0.348s][info][init] resolve_static_call: 1901us / 1728 events > [0.348s][info][init] handle_wrong_method: 719us / 146 events > [0.348s][info][init] ic_miss: 2109us / 474 events > [0.348s][info][init] Total: 13082us > [0.348s][info][init] perf_resolve_static_cache_hit_ctr: 1704 > ... To make it convenient to measure perf impact the change in `SharedRuntime::resolve_helper` is protected by `UseNewCode2` flag. If these changes make sense I will remove this flag before integrating. 
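As a back-of-envelope check, the quoted `-Xlog:init` numbers for `spring-boot-getting-started` can be plugged into a few lines of arithmetic (values copied from the log output above):

```python
# Sanity-check the spring-boot-getting-started counters posted in this thread:
# (microseconds, events) before and after the patch, plus cache-hit counters.
before = {"opt_virtual": (8260, 2249), "static": (4646, 1723)}
after  = {"opt_virtual": (2774, 2251), "static": (1901, 1728)}
hits   = {"opt_virtual": 2202, "static": 1704}

for kind in ("opt_virtual", "static"):
    us_b, ev_b = before[kind]
    us_a, ev_a = after[kind]
    print(f"{kind}: {us_b / ev_b:.2f} -> {us_a / ev_a:.2f} us/event, "
          f"cache hit rate {hits[kind] / ev_a:.0%}")

# Direct-call resolution drops from 8260 + 4646 = 12906us
# to 2774 + 1901 = 4675us, i.e. about 8.2ms less time resolving.
```

The ~8.2ms cut in resolution time is an upper bound on the wall-clock win, which is consistent with the ~5ms geomean startup delta in the table above.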
------------- PR Comment: https://git.openjdk.org/leyden/pull/106#issuecomment-3635039448 From asmehra at openjdk.org Wed Dec 10 03:27:00 2025 From: asmehra at openjdk.org (Ashutosh Mehra) Date: Wed, 10 Dec 2025 03:27:00 GMT Subject: RFR: Experiment with storing target method for static and opt-virtual callsites in reloc info In-Reply-To: References: Message-ID: On Tue, 9 Dec 2025 21:15:23 GMT, Ashutosh Mehra wrote: > This work aims to reduce the time taken to perform call resolution by caching the result of direct calls (static and opt-virtual) in the reloc info during compilation of a method. > Relocations for static and opt-virtual calls already have a field `method_index` which is used to store the "real" method to be invoked by the method handle. It is currently only used during c2 compilations. > This patch re-uses the `method_index` field for static and opt-virtual calls to store the target method. The runtime call (`SharedRuntime::resolve_helper`) used by the compiled code to perform the call site resolution can then optimize the resolution process by getting the target method from the reloc info and patches the callsite through CompiledDirectCall. > No special handling is needed for AOT code. > > On a 4-cpu system there is around 3% improvement in `spring-boot-getting-started`. Numbers for JavacBench range between 0-3% improvement. > > `spring-boot-getting-started`: > > Run,Old CDS + AOT,New CDS + AOT > 1,199,192 > 2,199,196 > 3,202,197 > 4,203,198 > 5,198,196 > 6,201,194 > 7,203,197 > 8,200,193 > 9,204,193 > 10,199,201 > Geomean,200.79,195.68 (1.03x improvement) > Stdev,1.99,2.61 > > > `-Xlog:init` shows the numbers for time spent in call resolution from the compiled code. 
> For `spring-boot-getting-started` before this patch: > > [0.357s][info][init] SharedRuntime: > [0.357s][info][init] resolve_opt_virtual_call: 8260us / 2249 events > [0.357s][info][init] resolve_virtual_call: 6899us / 1297 events > [0.357s][info][init] resolve_static_call: 4646us / 1723 events > [0.357s][info][init] handle_wrong_method: 680us / 145 events > [0.357s][info][init] ic_miss: 2109us / 488 events > [0.357s][info][init] Total: 22596us > [0.357s][info][init] perf_resolve_static_cache_hit_ctr: 0 > [0.357s][info][init] perf_resolve_opt_virtual_cache_hit_ctr: 0 > > > For `spring-boot-getting-started` after this patch: > > [0.348s][info][init] SharedRuntime: > [0.348s][info][init] resolve_opt_virtual_call: 2774us / 2251 events > [0.348s][info][init] resolve_virtual_call: 5577us / 1294 events > [0.348s][info][init] resolve_static_call: 1901us / 1728 events > [0.348s][info][init] handle_wrong_method: 719us / 146 events > [0.348s][info][init] ic_miss: 2109us / 474 events > [0.348s][info][init] Total: 13082us > [0.348s][info][init] perf_resolve_static_cache_hit_ctr: 1704 > ... @vnkozlov @adinn @iwanowww fyi ------------- PR Comment: https://git.openjdk.org/leyden/pull/106#issuecomment-3635194139 From vlivanov at openjdk.org Wed Dec 10 21:16:38 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 10 Dec 2025 21:16:38 GMT Subject: RFR: Experiment with storing target method for static and opt-virtual callsites in reloc info In-Reply-To: References: Message-ID: <_7yoKO9r6-ML5sPBL3o8q16KPIAAbtVh0wYcPMYhCMc=.f01e36a2-01df-418a-b9ab-b56e5c9b2d7b@github.com> On Tue, 9 Dec 2025 21:15:23 GMT, Ashutosh Mehra wrote: > This work aims to reduce the time taken to perform call resolution by caching the result of direct calls (static and opt-virtual) in the reloc info during compilation of a method. > Relocations for static and opt-virtual calls already have a field `method_index` which is used to store the "real" method to be invoked by the method handle. 
It is currently only used during c2 compilations. > This patch re-uses the `method_index` field for static and opt-virtual calls to store the target method. The runtime call (`SharedRuntime::resolve_helper`) used by the compiled code to perform the call site resolution can then optimize the resolution process by getting the target method from the reloc info and patches the callsite through CompiledDirectCall. > No special handling is needed for AOT code. > > On a 4-cpu system there is around 3% improvement in `spring-boot-getting-started`. Numbers for JavacBench range between 0-3% improvement. > > `spring-boot-getting-started`: > > Run,Old CDS + AOT,New CDS + AOT > 1,199,192 > 2,199,196 > 3,202,197 > 4,203,198 > 5,198,196 > 6,201,194 > 7,203,197 > 8,200,193 > 9,204,193 > 10,199,201 > Geomean,200.79,195.68 (1.03x improvement) > Stdev,1.99,2.61 > > > `-Xlog:init` shows the numbers for time spent in call resolution from the compiled code. > For `spring-boot-getting-started` before this patch: > > [0.357s][info][init] SharedRuntime: > [0.357s][info][init] resolve_opt_virtual_call: 8260us / 2249 events > [0.357s][info][init] resolve_virtual_call: 6899us / 1297 events > [0.357s][info][init] resolve_static_call: 4646us / 1723 events > [0.357s][info][init] handle_wrong_method: 680us / 145 events > [0.357s][info][init] ic_miss: 2109us / 488 events > [0.357s][info][init] Total: 22596us > [0.357s][info][init] perf_resolve_static_cache_hit_ctr: 0 > [0.357s][info][init] perf_resolve_opt_virtual_cache_hit_ctr: 0 > > > For `spring-boot-getting-started` after this patch: > > [0.348s][info][init] SharedRuntime: > [0.348s][info][init] resolve_opt_virtual_call: 2774us / 2251 events > [0.348s][info][init] resolve_virtual_call: 5577us / 1294 events > [0.348s][info][init] resolve_static_call: 1901us / 1728 events > [0.348s][info][init] handle_wrong_method: 719us / 146 events > [0.348s][info][init] ic_miss: 2109us / 474 events > [0.348s][info][init] Total: 13082us > [0.348s][info][init] 
perf_resolve_static_cache_hit_ctr: 1704 > ... Nice work! src/hotspot/share/runtime/sharedRuntime.cpp line 114: > 112: PerfTickCounters* SharedRuntime::_perf_ic_miss_total_time = nullptr; > 113: > 114: uint SharedRuntime::_perf_resolve_static_cache_hit_ctr = 0; PerfCounters are usually more convenient to use than raw counters. For example, they can be sampled on-the-fly from a live process. src/hotspot/share/runtime/sharedRuntime.cpp line 1528: > 1526: > 1527: if (UseNewCode2) { > 1528: bool is_mhi; I believe disabling inlining through MH linkers when generating archived code should simplify things. Then, there should be no attached methods for MH linkers in archived code and vise-versa. ------------- PR Review: https://git.openjdk.org/leyden/pull/106#pullrequestreview-3564491705 PR Review Comment: https://git.openjdk.org/leyden/pull/106#discussion_r2608217693 PR Review Comment: https://git.openjdk.org/leyden/pull/106#discussion_r2608224195 From asmehra at openjdk.org Wed Dec 10 22:42:50 2025 From: asmehra at openjdk.org (Ashutosh Mehra) Date: Wed, 10 Dec 2025 22:42:50 GMT Subject: RFR: Experiment with storing target method for static and opt-virtual callsites in reloc info In-Reply-To: <_7yoKO9r6-ML5sPBL3o8q16KPIAAbtVh0wYcPMYhCMc=.f01e36a2-01df-418a-b9ab-b56e5c9b2d7b@github.com> References: <_7yoKO9r6-ML5sPBL3o8q16KPIAAbtVh0wYcPMYhCMc=.f01e36a2-01df-418a-b9ab-b56e5c9b2d7b@github.com> Message-ID: On Wed, 10 Dec 2025 21:09:34 GMT, Vladimir Ivanov wrote: >> This work aims to reduce the time taken to perform call resolution by caching the result of direct calls (static and opt-virtual) in the reloc info during compilation of a method. >> Relocations for static and opt-virtual calls already have a field `method_index` which is used to store the "real" method to be invoked by the method handle. It is currently only used during c2 compilations. >> This patch re-uses the `method_index` field for static and opt-virtual calls to store the target method. 
The runtime call (`SharedRuntime::resolve_helper`) used by the compiled code to perform the call site resolution can then optimize the resolution process by getting the target method from the reloc info and patches the callsite through CompiledDirectCall. >> No special handling is needed for AOT code. >> >> On a 4-cpu system there is around 3% improvement in `spring-boot-getting-started`. Numbers for JavacBench range between 0-3% improvement. >> >> `spring-boot-getting-started`: >> >> Run,Old CDS + AOT,New CDS + AOT >> 1,199,192 >> 2,199,196 >> 3,202,197 >> 4,203,198 >> 5,198,196 >> 6,201,194 >> 7,203,197 >> 8,200,193 >> 9,204,193 >> 10,199,201 >> Geomean,200.79,195.68 (1.03x improvement) >> Stdev,1.99,2.61 >> >> >> `-Xlog:init` shows the numbers for time spent in call resolution from the compiled code. >> For `spring-boot-getting-started` before this patch: >> >> [0.357s][info][init] SharedRuntime: >> [0.357s][info][init] resolve_opt_virtual_call: 8260us / 2249 events >> [0.357s][info][init] resolve_virtual_call: 6899us / 1297 events >> [0.357s][info][init] resolve_static_call: 4646us / 1723 events >> [0.357s][info][init] handle_wrong_method: 680us / 145 events >> [0.357s][info][init] ic_miss: 2109us / 488 events >> [0.357s][info][init] Total: 22596us >> [0.357s][info][init] perf_resolve_static_cache_hit_ctr: 0 >> [0.357s][info][init] perf_resolve_opt_virtual_cache_hit_ctr: 0 >> >> >> For `spring-boot-getting-started` after this patch: >> >> [0.348s][info][init] SharedRuntime: >> [0.348s][info][init] resolve_opt_virtual_call: 2774us / 2251 events >> [0.348s][info][init] resolve_virtual_call: 5577us / 1294 events >> [0.348s][info][init] resolve_static_call: 1901us / 1728 events >> [0.348s][info][init] handle_wrong_method: 719us / 146 events >> [0.348s][info][init] ic_miss: 2109us / 474 events >> [0.348s][info][init] Total:... 
> > src/hotspot/share/runtime/sharedRuntime.cpp line 114: > >> 112: PerfTickCounters* SharedRuntime::_perf_ic_miss_total_time = nullptr; >> 113: >> 114: uint SharedRuntime::_perf_resolve_static_cache_hit_ctr = 0; > > PerfCounters are usually more convenient to use than raw counters. For example, they can be sampled on-the-fly from a live process. Okay. I used these counters just to do a quick check how much static call resolution can be optimized this way. If we go with this approach I will try to replace them with PerfCounters or even get rid of these counters if they are not needed. ------------- PR Review Comment: https://git.openjdk.org/leyden/pull/106#discussion_r2608459368 From asmehra at openjdk.org Wed Dec 10 22:38:20 2025 From: asmehra at openjdk.org (Ashutosh Mehra) Date: Wed, 10 Dec 2025 22:38:20 GMT Subject: RFR: Experiment with storing target method for static and opt-virtual callsites in reloc info In-Reply-To: <_7yoKO9r6-ML5sPBL3o8q16KPIAAbtVh0wYcPMYhCMc=.f01e36a2-01df-418a-b9ab-b56e5c9b2d7b@github.com> References: <_7yoKO9r6-ML5sPBL3o8q16KPIAAbtVh0wYcPMYhCMc=.f01e36a2-01df-418a-b9ab-b56e5c9b2d7b@github.com> Message-ID: On Wed, 10 Dec 2025 21:12:24 GMT, Vladimir Ivanov wrote: >> This work aims to reduce the time taken to perform call resolution by caching the result of direct calls (static and opt-virtual) in the reloc info during compilation of a method. >> Relocations for static and opt-virtual calls already have a field `method_index` which is used to store the "real" method to be invoked by the method handle. It is currently only used during c2 compilations. >> This patch re-uses the `method_index` field for static and opt-virtual calls to store the target method. The runtime call (`SharedRuntime::resolve_helper`) used by the compiled code to perform the call site resolution can then optimize the resolution process by getting the target method from the reloc info and patches the callsite through CompiledDirectCall. 
>> No special handling is needed for AOT code. >> >> On a 4-cpu system there is around 3% improvement in `spring-boot-getting-started`. Numbers for JavacBench range between 0-3% improvement. >> >> `spring-boot-getting-started`: >> >> Run,Old CDS + AOT,New CDS + AOT >> 1,199,192 >> 2,199,196 >> 3,202,197 >> 4,203,198 >> 5,198,196 >> 6,201,194 >> 7,203,197 >> 8,200,193 >> 9,204,193 >> 10,199,201 >> Geomean,200.79,195.68 (1.03x improvement) >> Stdev,1.99,2.61 >> >> >> `-Xlog:init` shows the numbers for time spent in call resolution from the compiled code. >> For `spring-boot-getting-started` before this patch: >> >> [0.357s][info][init] SharedRuntime: >> [0.357s][info][init] resolve_opt_virtual_call: 8260us / 2249 events >> [0.357s][info][init] resolve_virtual_call: 6899us / 1297 events >> [0.357s][info][init] resolve_static_call: 4646us / 1723 events >> [0.357s][info][init] handle_wrong_method: 680us / 145 events >> [0.357s][info][init] ic_miss: 2109us / 488 events >> [0.357s][info][init] Total: 22596us >> [0.357s][info][init] perf_resolve_static_cache_hit_ctr: 0 >> [0.357s][info][init] perf_resolve_opt_virtual_cache_hit_ctr: 0 >> >> >> For `spring-boot-getting-started` after this patch: >> >> [0.348s][info][init] SharedRuntime: >> [0.348s][info][init] resolve_opt_virtual_call: 2774us / 2251 events >> [0.348s][info][init] resolve_virtual_call: 5577us / 1294 events >> [0.348s][info][init] resolve_static_call: 1901us / 1728 events >> [0.348s][info][init] handle_wrong_method: 719us / 146 events >> [0.348s][info][init] ic_miss: 2109us / 474 events >> [0.348s][info][init] Total:... > > src/hotspot/share/runtime/sharedRuntime.cpp line 1528: > >> 1526: >> 1527: if (UseNewCode2) { >> 1528: bool is_mhi; > > I believe disabling inlining through MH linkers when generating archived code should simplify things. Then, there should be no attached methods for MH linkers in archived code and vise-versa. 
Are you suggesting disabling inlining through MH linkers for both aot and jit code, or only for the aot code? If we do it only for the aot code, it wouldn't help unless we decided to do this optimization only for the aot code. As it stands, it benefits both jit and aot code. ------------- PR Review Comment: https://git.openjdk.org/leyden/pull/106#discussion_r2608451163 From vlivanov at openjdk.org Wed Dec 10 22:56:18 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Wed, 10 Dec 2025 22:56:18 GMT Subject: RFR: Experiment with storing target method for static and opt-virtual callsites in reloc info In-Reply-To: References: <_7yoKO9r6-ML5sPBL3o8q16KPIAAbtVh0wYcPMYhCMc=.f01e36a2-01df-418a-b9ab-b56e5c9b2d7b@github.com> Message-ID: On Wed, 10 Dec 2025 22:35:59 GMT, Ashutosh Mehra wrote: >> src/hotspot/share/runtime/sharedRuntime.cpp line 1528: >> >>> 1526: >>> 1527: if (UseNewCode2) { >>> 1528: bool is_mhi; >> >> I believe disabling inlining through MH linkers when generating archived code should simplify things. Then, there should be no attached methods for MH linkers in archived code and vice versa. > > Are you suggesting disabling inlining through MH linkers for both aot and jit code, or only for the aot code? If we do it only for the aot code, it wouldn't help unless we decided to do this optimization only for the aot code. As it stands, it benefits both jit and aot code. I'd assume it is less important for JITed code. The problem is so acute for AOTed code because it's so cheap to retrieve and install, so we have plenty of AOT code published in a short period during application startup. ------------- PR Review Comment: https://git.openjdk.org/leyden/pull/106#discussion_r2608489290
While searching for good documentation to link to when explaining Leyden, I found this (rather old) page: https://www.ibm.com/docs/en/sdk-java-technology/8?topic=options-xjit-xnojit with compilation options I didn't know existed. And I was wondering: does it make sense to force some things during the training run to make sure we get the best training? I'm thinking, for example, of forcing a high level of compilation on some methods, so we arrive at production with more things optimized. Or excluding some "testing framework" methods from compilation. Or is it better not to touch anything and let Java run normally because this may become too unpredictable? Or... should I play with this and see what happens because we don't really know? :) Kind regards, María Arias de Reyna Domínguez Senior Software Engineer She / Her / Hers ariasdereyna at redhat.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From dan.heidinga at oracle.com Mon Dec 15 15:56:19 2025 From: dan.heidinga at oracle.com (Dan Heidinga) Date: Mon, 15 Dec 2025 15:56:19 +0000 Subject: Does it make sense to tweak compilation for training? In-Reply-To: References: Message-ID: That's a great question - restating my understanding of it: "Is it useful to hardcode specific JIT options that compile early, force methods to be compiled, etc while training?" We've always said that training should match production as closely as possible. The more closely they match (ie: training in production using canary deployments) the better, as the training produces training data that exactly matches what the JVM is doing during startup and warmup (and will likely continue to do in future deployments). If we tweak the JIT settings from the side for training, we may distort the training run in ways that are less useful and result in us missing training data (like profiles) that would be useful if we need to deopt+recompile in the production run.
That's my long way of saying if you wouldn't deploy those options on your production runs, you probably don't want them on your training runs either. --Dan From: leyden-dev on behalf of María Arias de Reyna Dominguez Date: Monday, December 15, 2025 at 4:36 AM To: leyden-dev Subject: Does it make sense to tweak compilation for training? Hi! While searching for good documentation to link to when explaining Leyden, I found this (rather old) page: https://www.ibm.com/docs/en/sdk-java-technology/8?topic=options-xjit-xnojit with compilation options I didn't know existed. And I was wondering: does it make sense to force some things during the training run to make sure we get the best training? I'm thinking, for example, of forcing a high level of compilation on some methods, so we arrive at production with more things optimized. Or excluding some "testing framework" methods from compilation. Or is it better not to touch anything and let Java run normally because this may become too unpredictable? Or... should I play with this and see what happens because we don't really know? :) Kind regards, María Arias de Reyna Domínguez Senior Software Engineer She / Her / Hers ariasdereyna at redhat.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmehra at redhat.com Mon Dec 15 16:51:22 2025 From: asmehra at redhat.com (Ashutosh Mehra) Date: Mon, 15 Dec 2025 11:51:22 -0500 Subject: Does it make sense to tweak compilation for training? In-Reply-To: References: Message-ID: I agree with Dan. We don't want users to force compilation of methods; rather we should let that happen organically. As for excluding the test framework from compilation, we already have options to do that in the premain branch. For instance there is DontPrecompile to exclude a method from AOT compilation in the assembly phase. There is IgnorePrecompiled to ignore the AOT code for a method in the production run.
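A hypothetical sketch of how those two options might be passed on the command line (an assumption on my part that they are exposed as -XX:CompileCommand directives in the premain branch; the cache file name, method pattern, and class names below are made up for illustration):

```shell
# Assembly phase: exclude a (made-up) test-framework class from AOT compilation.
# DontPrecompile and the CompileCommand spelling are assumed from the premain branch.
java -XX:CacheDataStore=app.cds \
     -XX:CompileCommand=DontPrecompile,org.example.MockFramework::* \
     -cp app.jar org.example.Main

# Production run: keep the cache, but ignore any AOT code for those methods.
java -XX:CacheDataStore=app.cds \
     -XX:CompileCommand=IgnorePrecompiled,org.example.MockFramework::* \
     -cp app.jar org.example.Main
```

This is a sketch, not a verified invocation; the exact flag names may differ between premain builds.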
Apart from the canary deployments, there is another scenario where the users may have an offline training run and drive it with synthetic load. In such a scenario, it may make sense to lower the execution count threshold that triggers compilation to reduce the execution time of the training run required to get a good quality cache. For example, if the user knows beforehand a hot code path of the application, then the compilation of this code path can be triggered early by reducing the threshold. Theoretically this sounds useful, but in practice it can be difficult to do this right. @Maria, btw the link you referred to is for the IBM SDK, which is likely based on OpenJ9. I don't think these options exist for Hotspot. But there may be options in Hotspot with similar behavior. My understanding is most of these options are for diagnostic purposes, not for end users to tune their application. Thanks, - Ashutosh Mehra On Mon, Dec 15, 2025 at 10:58 AM Dan Heidinga wrote: > That's a great question - restating my understanding of it: "Is it useful > to hardcode specific JIT options that compile early, force > methods to be compiled, etc while training?" > > We've always said that training should match production as closely > as possible. The more closely they match (ie: training in > production using canary deployments) the better, as the training produces > training data that exactly matches what the JVM is doing > during startup and warmup (and will > likely continue to do in future deployments). > > If we tweak the JIT settings from the side for training, we may distort > the training run in ways that are less useful and result in > us missing training data (like profiles) that would be useful if we > need to deopt+recompile in the production run. > > That's my long way of saying if you wouldn't deploy those options on your > production runs, you probably don't want them on your training > runs either.
> > --Dan > > *From: *leyden-dev on behalf of María Arias > de Reyna Dominguez > *Date: *Monday, December 15, 2025 at 4:36 AM > *To: *leyden-dev > *Subject: *Does it make sense to tweak compilation for training? > > Hi! > > While searching for good documentation to link to when explaining Leyden, > I found this (rather old) page: > https://www.ibm.com/docs/en/sdk-java-technology/8?topic=options-xjit-xnojit with > compilation options I didn't know existed. > > And I was wondering: does it make sense to force some things during > the training run to make sure we get the best training? I'm thinking, for > example, of forcing a high level of compilation on some methods, so we > arrive at production with more things optimized. Or excluding some "testing > framework" methods from compilation. > > Or is it better not to touch anything and let Java run normally because > this may become too unpredictable? > > Or... should I play with this and see what happens because we don't really > know? :) > > Kind regards, > María Arias de Reyna Domínguez > Senior Software Engineer > She / Her / Hers > ariasdereyna at redhat.com > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vladimir.kozlov at oracle.com Tue Dec 16 19:26:10 2025 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 16 Dec 2025 11:26:10 -0800 Subject: Does it make sense to tweak compilation for training? In-Reply-To: References: Message-ID: I also agree with Dan that the training run should match production runs as much as possible regarding VM options. There are a LOT of JIT compiler options in HotSpot VM. But the only option we are planning to use is the one supporting AOT portability, which will define which set of CPU instructions could be used by JIT compilers during training and assembly phase. I don't think lowering the threshold for compilation will give you better code. But maybe we can consider lowering the threshold when we create MDOs.
Currently MDO creation targets stable peak performance and happens later. Maybe for AOT code we need a separate MDO to collect profiling data for startup. Thanks, Vladimir K On 12/15/25 8:51 AM, Ashutosh Mehra wrote: > I agree with Dan. We don't want users to force compilation of methods; > rather we should let that happen organically. > > As for excluding the test framework from compilation, we already have > options to do that in the premain branch. > For instance there is DontPrecompile to exclude a method from AOT > compilation in the assembly phase. > There is IgnorePrecompiled to ignore the AOT code for a method in the > production run. > > Apart from the canary deployments, there is another scenario where the > users may have an offline training run and drive it with > synthetic load. > In such a scenario, it may make sense to lower the execution count > threshold that triggers compilation to reduce the execution time of the > training run required to get a good quality cache. > For example, if the user knows beforehand a hot code path of the > application, then the compilation of this code path can be triggered > early by reducing the threshold. > Theoretically this sounds useful, but in practice it can be difficult to > do this right. > > @Maria, btw the link you referred to is for the IBM SDK, which is likely > based on OpenJ9. I don't think these options exist for Hotspot. > But there may be options in Hotspot with similar behavior. My > understanding is most of these options are for diagnostic purposes, not > for end users to tune their application. > > Thanks, > - Ashutosh Mehra > > > On Mon, Dec 15, 2025 at 10:58 AM Dan Heidinga > wrote: > > That's a great question - restating my understanding of it: "Is it > useful to hardcode specific JIT options that compile early, force > methods to be compiled, etc while training?" > > We've always said that training should match production as closely > as possible.
The more closely they match (ie: training in > production using canary deployments) the better, as the training > produces training data that exactly matches what the JVM is doing > during startup and warmup (and will likely continue to do in future > deployments). > > If we tweak the JIT settings from the side for training, we may > distort the training run in ways that are less useful and result in > us missing training data (like profiles) that would be useful if we > need to deopt+recompile in the production run. > > That's my long way of saying if you wouldn't deploy those options on > your production runs, you probably don't want them on your training > runs either. > > --Dan > > *From: *leyden-dev on behalf of María Arias de Reyna Dominguez > > *Date: *Monday, December 15, 2025 at 4:36 AM > *To: *leyden-dev > *Subject: *Does it make sense to tweak compilation for training? > > Hi! > > While searching for good documentation to link to when explaining > Leyden, I found this (rather old) page: https://www.ibm.com/docs/en/sdk-java-technology/8?topic=options-xjit-xnojit with compilation options I didn't know existed. > > And I was wondering: does it make sense to force some things during > the training run to make sure we get the best training? I'm thinking, for > example, of forcing a high level of compilation on some methods, so > we arrive at production with more things optimized. Or excluding > some "testing framework" methods from compilation. > > Or is it better not to touch anything and let Java run normally > because this may become too unpredictable? > > Or... should I play with this and see what happens because we don't > really know?
> :) > > Kind regards, > María Arias de Reyna Domínguez > Senior Software Engineer > She / Her / Hers > ariasdereyna at redhat.com > From mariasde at redhat.com Tue Dec 30 09:12:58 2025 From: mariasde at redhat.com (=?UTF-8?Q?Mar=C3=ADa_Arias_de_Reyna_Dominguez?=) Date: Tue, 30 Dec 2025 10:12:58 +0100 Subject: Initialization code that never got trained Message-ID: Happy New Year! I have been doing some experiments with Leyden and realized something: there is some code at startup/initialization that never gets optimized but still impacts startup and warmup time. I realized this while comparing against native/graalvm images of the same code. For example: a REST API. It has some initialization, port opening, reading configurations, etc... that run only once. So the code will never be trained. But it always runs at startup, impacting the time to first response. Compared to a native image, the native image may not have it optimized, but at least it is already compiled, not interpreted. Therefore, the native image starts faster. So, how can I tell Leyden to please compile and cache those functions, even if they are going to be run just once, even if they are not optimized at all, even if those compilations can get discarded after a couple of seconds? Or are we just going to assume that code which impacts startup time doesn't need to be pre-compiled because we are focusing only on optimizations made by the JVM at runtime? Kind regards, María Arias de Reyna Domínguez Senior Software Engineer She / Her / Hers ariasdereyna at redhat.com -------------- next part -------------- An HTML attachment was scrubbed... URL: