From sparasa at openjdk.org Tue Jul 1 00:01:59 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Tue, 1 Jul 2025 00:01:59 GMT Subject: RFR: 8358179: Performance regression in Math.cbrt [v2] In-Reply-To: References: <45l5EvxoRINI1_Ep2_snJzKNMPo4-dPXADalLN1fq1Y=.9f697a35-ee7b-4e7a-9e5e-ff33911b3b21@github.com> Message-ID: <27l1noh4qLvBGFOqhDNxmv-Ikyuc8AOQNRgIT4RtbZM=.5c199ba5-a2a7-4e98-9459-68ed4c55b73f@github.com> On Fri, 27 Jun 2025 01:43:16 GMT, Mohamed Issa wrote: >> The changes described below are meant to resolve the performance regression introduced by the **x86_64 cbrt** double precision floating point scalar intrinsic in #24470. >> >> 1. Check for +0, -0, +INF, -INF, and NaN before any other input values. >> 2. If these special values are found, return immediately with minimal modifications to the result register. >> 3. Performance testing shows the modified intrinsic improves throughput by 65.1% over the original intrinsic on average for the special values while throughput drops by 5.5% for the normal value range (-INF, -2^(-1022)], [2^(-1022), INF). >> >> The commands to run all relevant micro-benchmarks are posted below. >> >> `make test TEST="micro:CbrtPerf.CbrtPerfRanges"` >> `make test TEST="micro:CbrtPerf.CbrtPerfSpecialValues"` >> >> The results of all tests posted below were captured with an [Intel? Xeon 8488C](https://www.intel.com/content/www/us/en/products/sku/231730/intel-xeon-platinum-8480c-processor-105m-cache-2-00-ghz/specifications.html) using [OpenJDK v26-b1](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B1) as the baseline version. The term _baseline1_ refers to runs with the intrinsic enabled and _baseline2_ refers to runs with the intrinsic disabled. >> >> Each result is the mean of 8 individual runs, and the input ranges used match those from the original Java implementation. Overall, the changes provide a significant uplift over _baseline1_ except for a mild regression in the (**2^(-1022) <= |x| < INF**) input range, which is expected due to the extra checks. When comparing against _baseline2_, the modified intrinsic significantly still outperforms for the inputs (**-INF < x < INF**) that require heavy compute. However, the special value inputs that trigger fast path returns still perform better with _baseline2_. >> >> | Input range(s) | Baseline1 (ops/ms) | Change (ops/ms) | Change vs baseline1 (%) | >> | :-------------------------------------: | :-------------------: | :------------------: | :--------------------------: | >> | [-2^(-1022), 2^(-1022)] | 18470 | 20847 | +12.87 | >> | (-INF, -2^(-1022)], [2^(-1022), INF) | 210538 | 198925 | -5.52 | >> | [0] | 344990 | 627561 | +81.91 | >> | [-0] ... > > Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: > > Ensure ABS_MASK is a 128-bit memory sized location and only use equal enum for UCOMISD checks I did independent testing by running the correctness tests and performance benchmarks. The change looks good to me. Thanks, Vamsi ------------- Marked as reviewed by sparasa (Author). PR Review: https://git.openjdk.org/jdk/pull/25962#pullrequestreview-2973095482 From haosun at openjdk.org Tue Jul 1 02:54:47 2025 From: haosun at openjdk.org (Hao Sun) Date: Tue, 1 Jul 2025 02:54:47 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v4] In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 13:25:09 GMT, Mikhail Ablakatov wrote: >> Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. 
It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. >> >> Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still. >> >> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reducion micro-benchmarks. >> >> Benchmarks results: >> >> Neoverse-V1 (SVE 256-bit) >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms >> >> >> Fujitsu A64FX (SVE 512-bit): >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms > > Mikhail Ablakatov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: > > - cleanup: address nits, rename several symbols > - cleanup: remove unreferenced definitions > - Address review comments. > > - fixup: disable FP mul reduction auto-vectorization for all targets > - fixup: add a tmp vReg to reduce_mul_integral_gt128b and > reduce_non_strict_order_mul_fp_gt128bto keep vsrc unmodified > - cleanup: replace a complex lambda in the above methods with a loop > - cleanup: rename symbols to follow the existing naming convention > - cleanup: add asserts to SVE only instructions > - split mul FP reduction instructions into strictly-ordered (default) > and explicitly non strictly-ordered > - remove redundant conditions in TestVectorFPReduction.java > > Benchmarks results: > > Neoverse-V1 (SVE 256-bit) > > | Benchmark | Before | After | Units | Diff | > |---------------------------|----------|----------|--------|-------| > | ByteMaxVector.MULLanes | 619.156 | 9884.578 | ops/ms | 1496% | > | DoubleMaxVector.MULLanes | 184.693 | 2712.051 | ops/ms | 1368% | > | FloatMaxVector.MULLanes | 277.818 | 3388.038 | ops/ms | 1119% | > | IntMaxVector.MULLanes | 371.225 | 4765.434 | ops/ms | 1183% | > | LongMaxVector.MULLanes | 205.149 | 2672.975 | ops/ms | 1203% | > | ShortMaxVector.MULLanes | 472.804 | 5122.917 | ops/ms | 984% | > - Merge branch 'master' into 8343689-rebase > - fixup: don't modify the value in vsrc > > Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this > change, the result of recursive folding is held in vtmp1. To be able to > pass this intermediate result to reduce_mul_integral_le128b(), we would > have to use another temporary FloatRegister, as vtmp1 would essentially > act as vsrc. It's possible to get around this however: > reduce_mul_integral_le128b() is modified so it's possible to pass > matching vsrc and vtmp2 arguments. 
By doing this, we save ourselves a > temporary register in rules that match to reduce_mul_integral_gt128b(). > - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formating > - Use EXT instead of COMPACT to split a vector into two halves > > Benchmarks results: > > Neoverse-V1 (SVE 256-bit) > > Benchmark (size) Mode master PR Units > ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms > Short... src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3729: > 3727: #undef INSN > 3728: > 3729: // SVE aliases In the inital commit, asm test for `sve_(mov|movs|not|nots)` is added into `test/hotspot/gtest/aarch64/aarch64-asmtest.py`. Since the definition is removed in this commit, the corresponding asm test should be removed as well. Otherwise, JDK build failed on AArch64. See the error log in GHA test. https://github.com/mikabl-arm/jdk/actions/runs/15974069085/job/45051902618 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2176310497 From xgong at openjdk.org Tue Jul 1 06:04:29 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 1 Jul 2025 06:04:29 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors Message-ID: ### Background On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. ### Impact Analysis #### 1. Vector types Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. #### 2. Vector API No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. #### 3. Auto-vectorization Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. #### 4. Codegen of vector nodes NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. Details: - Lanewise vector operations are unaffected as explained above. - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. 
Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, adding an explicit vector size check in `match_rule_supported_vector()` would be beneficial. - Missing codegen support for type conversions with 32-bit input or output vector size should be added. ### Main changes: - Support 2 shorts vector types. The supported min vector element count for each basic type is: - `T_BOOLEAN`: 2 - `T_BYTE`: 4 - `T_CHAR`: 4 - `T_SHORT`: 2 (new supported) - `T_INT`/`T_FLOAT`/`T_LONG`/`T_DOUBLE`: 2 - Add codegen support for `Vector[U]Cast` with 32-bit input or output vector size. `VectorReinterpret` has already considered the 32-bit vector size cases. - Unsupport reductions with less than 8 bytes vector size explicitly. - Add additional IR tests for Vector API type conversions. - Add JMH benchmark for auto-vectorization with two 16-bit lanes. ### Test Tested hotspot/jdk/langtools - all tests passed. ### Performance Following shows the performance improvement of relative VectorAPI JMHs on a NVIDIA Grace (128-bit SVE2) machine: Benchmark SIZE Mode Unit Before After Gain VectorFPtoIntCastOperations.microDouble128ToShort128 512 thrpt ops/ms 731.529 26278.599 35.92 VectorFPtoIntCastOperations.microDouble128ToShort128 1024 thrpt ops/ms 366.461 10595.767 28.91 VectorFPtoIntCastOperations.microFloat64ToShort64 512 thrpt ops/ms 315.791 14327.682 45.37 VectorFPtoIntCastOperations.microFloat64ToShort64 1024 thrpt ops/ms 158.485 7261.847 45.82 VectorZeroExtend.short2Long 128 thrpt ops/ms 1447.243 898666.972 620.95 And here is the performance improvement of the added JMH on Grace: Benchmark LEN Mode Unit Before After Gain VectorTwoShorts.addVec2S 64 avgt ns/op 20.948 12.683 1.65 VectorTwoShorts.addVec2S 128 avgt ns/op 40.073 22.703 1.76 VectorTwoShorts.addVec2S 512 avgt ns/op 157.447 83.691 1.88 VectorTwoShorts.addVec2S 1024 avgt ns/op 313.022 165.085 1.89 VectorTwoShorts.mulVec2S 64 avgt ns/op 20.981 12.647 1.65 VectorTwoShorts.mulVec2S 128 avgt ns/op 40.279 22.637 1.77 VectorTwoShorts.mulVec2S 512 avgt ns/op 158.642 83.371 1.90 VectorTwoShorts.mulVec2S 1024 avgt ns/op 314.788 165.205 1.90 VectorTwoShorts.reverseBytesVec2S 64 avgt ns/op 17.739 9.106 1.94 VectorTwoShorts.reverseBytesVec2S 128 avgt ns/op 32.591 15.632 2.08 VectorTwoShorts.reverseBytesVec2S 512 avgt ns/op 126.154 55.284 2.28 VectorTwoShorts.reverseBytesVec2S 1024 avgt ns/op 254.592 107.457 2.36 We can observe the similar uplift on an AArch64 N1 (NEON) machine. ------------- Commit messages: - 8359419: AArch64: Relax min vector length to 32-bit for short vectors Changes: https://git.openjdk.org/jdk/pull/26057/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26057&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8359419 Stats: 306 lines in 8 files changed: 196 ins; 9 del; 101 mod Patch: https://git.openjdk.org/jdk/pull/26057.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26057/head:pull/26057 PR: https://git.openjdk.org/jdk/pull/26057 From xgong at openjdk.org Tue Jul 1 06:09:44 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 1 Jul 2025 06:09:44 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: References: Message-ID: On Wed, 25 Jun 2025 09:16:48 GMT, Xiaohong Gong wrote: >> JDK-8318650 introduced hotspot intrinsification of subword gather load APIs for X86 platforms [1]. 
However, the current implementation is not optimal for AArch64 SVE platform, which natively supports vector instructions for subword gather load operations using an int vector for indices (see [2][3]). >> >> Two key areas require improvement: >> 1. At the Java level, vector indices generated for range validation could be reused for the subsequent gather load operation on architectures with native vector instructions like AArch64 SVE. However, the current implementation prevents compiler reuse of these index vectors due to divergent control flow, potentially impacting performance. >> 2. At the compiler IR level, the additional `offset` input for `LoadVectorGather`/`LoadVectorGatherMasked` with subword types increases IR complexity and complicates backend implementation. Furthermore, generating `add` instructions before each memory access negatively impacts performance. >> >> This patch refactors the implementation at both the Java level and compiler mid-end to improve efficiency and maintainability across different architectures. >> >> Main changes: >> 1. Java-side API refactoring: >> - Explicitly passes generated index vectors to hotspot, eliminating duplicate index vectors for gather load instructions on >> architectures like AArch64. >> 2. C2 compiler IR refactoring: >> - Refactors `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword types by removing the memory offset input and incorporating it into the memory base `addr` at the IR level. This simplifies backend implementation, reduces add operations, and unifies the IR across all types. >> 3. Backend changes: >> - Streamlines X86 implementation of subword gather operations following the removal of the offset input from the IR level. >> >> Performance: >> The performance of the relative JMH improves up to 27% on a X86 AVX512 system. Please see the data below: >> >> Benchmark Mode Cnt Unit SIZE Before After Gain >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 64 53682.012 52650.325 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 256 14484.252 14255.156 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 1024 3664.900 3595.615 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 4096 908.31... > > Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: > > - Address review comments > - Merge 'jdk:master' into JDK-8355563 > - 8355563: VectorAPI: Refactor current implementation of subword gather load API Ping again! Thanks in advance! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3021961883 From dfenacci at openjdk.org Tue Jul 1 06:25:42 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Tue, 1 Jul 2025 06:25:42 GMT Subject: RFR: 8360641: TestCompilerCounts fails after 8354727 [v3] In-Reply-To: References: <3mMrDF_446r7HudbsHIpdoWByBlnUpjFo7YzIty0KG8=.facc058f-3975-44c4-b2d4-93b8c64db185@github.com> Message-ID: On Mon, 30 Jun 2025 08:58:07 GMT, Manuel H?ssig wrote: >> After integrating #25872 the calculation of the`CICompilerCount` ergonomic became dependent on the size of `NonNMethodCodeHeapSize`, which itself is an ergonomic based on the available memory. Thus, depending on the system, the test `compiler/arguments/TestCompilerCounts.java` failed, i.e. locally this failed, but not on CI servers. >> >> This PR changes the test to reflect the changes introduced in #25872. 
>> >> Testing: >> - [ ] [Github Actions](https://github.com/mhaessig/jdk/actions/runs/15932906313) >> - [x] tier1,tier2 plus Oracle internal testing > > Manuel H?ssig has updated the pull request incrementally with two additional commits since the last revision: > > - Remove superfluous newline > - Add copyright Looks good to me. Thanks! ------------- Marked as reviewed by dfenacci (Committer). PR Review: https://git.openjdk.org/jdk/pull/26024#pullrequestreview-2973682287 From xgong at openjdk.org Tue Jul 1 06:27:43 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 1 Jul 2025 06:27:43 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v3] In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 12:05:08 GMT, Mikhail Ablakatov wrote: >> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2087: >> >>> 2085: assert(vector_length_in_bytes > FloatRegister::neon_vl, "ASIMD impl should be used instead"); >>> 2086: assert(vector_length_in_bytes <= FloatRegister::sve_vl_max, "unsupported vector length"); >>> 2087: assert(is_power_of_2(vector_length_in_bytes), "unsupported vector length"); >> >> Better to compare with `MaxVectorSize`. >> >> I suggest using `assert(length_in_bytes == MaxVectorSize, "invalid vector length");` and putting this assertion in `aarch64_vector.ad` file, i.e. inside the matching rule. > > Why is it better that way? Currently the assertions check that we end up here if there computations that can be done only using SVE (length > neon && length <= sve). What would happen if a user operates 256b VectorAPI vectors on a 512b SVE platform? That would be the operations with partial vector size valid. For such cases, we will generate a mask in IR level, and a `VectorBlend` will be generated for this reduction case. Otherwise the result will be incorrect. So the vector size should be equal to MaxVectorSize theoretically. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2176524365 From xgong at openjdk.org Tue Jul 1 06:27:44 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 1 Jul 2025 06:27:44 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v4] In-Reply-To: References: <2jvFY4hq9FPdk9e4Zg6LRPdRVhDTGgxofL-we8c-mns=.4e6ce509-67a4-4e46-a661-2b0951f88731@github.com> Message-ID: On Mon, 30 Jun 2025 12:20:19 GMT, Mikhail Ablakatov wrote: >> I have the same concern about the order issue with @eme64. >> Should we only enable this only for VectorAPI case, which doesn't require strict-order? > > FP reductions have been disabled for auto-vectorization, please see the following comment: https://github.com/openjdk/jdk/pull/23181/files#diff-edf6d70f65d81dc12a483088e0610f4e059bd40697f242aedfed5c2da7475f1aR130 . You can also check https://github.com/openjdk/jdk/pull/23181#issuecomment-3018988067 to see how the patch affects auto-vectorization performance. The only benchmarks that saw a performance uplift on a 256b SVE platform is `VectorReduction2.WithSuperword.intMulBig` (which is fine since it's an integer benchmark). Yes, these operations are disabled for SLP. But maybe we could add an assertion to check the restrict flag in the match rules. 
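To make the ordering point above concrete, here is a minimal Java sketch (my own illustration, not code from the PR; the class and method names are made up) of the Vector API reduction that is allowed to take the relaxed-order path, in contrast to an auto-vectorized scalar loop, which must keep the source order of the floating-point multiplications. It assumes the incubating jdk.incubator.vector module is enabled.

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class MulReductionExample {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float mulLanes(float[] a) {
        float r = 1.0f;
        int i = 0;
        for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
            // reduceLanes(MUL) on a float vector does not promise a strict
            // left-to-right order, so a relaxed-order folding of the lanes is legal here.
            r *= FloatVector.fromArray(SPECIES, a, i).reduceLanes(VectorOperators.MUL);
        }
        for (; i < a.length; i++) {
            r *= a[i];  // scalar tail
        }
        return r;
    }
}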
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2176528442 From epeter at openjdk.org Tue Jul 1 06:30:44 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 1 Jul 2025 06:30:44 GMT Subject: RFR: 8358179: Performance regression in Math.cbrt [v2] In-Reply-To: References: <45l5EvxoRINI1_Ep2_snJzKNMPo4-dPXADalLN1fq1Y=.9f697a35-ee7b-4e7a-9e5e-ff33911b3b21@github.com> Message-ID: On Fri, 27 Jun 2025 01:43:16 GMT, Mohamed Issa wrote: >> The changes described below are meant to resolve the performance regression introduced by the **x86_64 cbrt** double precision floating point scalar intrinsic in #24470. >> >> 1. Check for +0, -0, +INF, -INF, and NaN before any other input values. >> 2. If these special values are found, return immediately with minimal modifications to the result register. >> 3. Performance testing shows the modified intrinsic improves throughput by 65.1% over the original intrinsic on average for the special values while throughput drops by 5.5% for the normal value range (-INF, -2^(-1022)], [2^(-1022), INF). >> >> The commands to run all relevant micro-benchmarks are posted below. >> >> `make test TEST="micro:CbrtPerf.CbrtPerfRanges"` >> `make test TEST="micro:CbrtPerf.CbrtPerfSpecialValues"` >> >> The results of all tests posted below were captured with an [Intel? Xeon 8488C](https://www.intel.com/content/www/us/en/products/sku/231730/intel-xeon-platinum-8480c-processor-105m-cache-2-00-ghz/specifications.html) using [OpenJDK v26-b1](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B1) as the baseline version. The term _baseline1_ refers to runs with the intrinsic enabled and _baseline2_ refers to runs with the intrinsic disabled. >> >> Each result is the mean of 8 individual runs, and the input ranges used match those from the original Java implementation. Overall, the changes provide a significant uplift over _baseline1_ except for a mild regression in the (**2^(-1022) <= |x| < INF**) input range, which is expected due to the extra checks. When comparing against _baseline2_, the modified intrinsic significantly still outperforms for the inputs (**-INF < x < INF**) that require heavy compute. However, the special value inputs that trigger fast path returns still perform better with _baseline2_. >> >> | Input range(s) | Baseline1 (ops/ms) | Change (ops/ms) | Change vs baseline1 (%) | >> | :-------------------------------------: | :-------------------: | :------------------: | :--------------------------: | >> | [-2^(-1022), 2^(-1022)] | 18470 | 20847 | +12.87 | >> | (-INF, -2^(-1022)], [2^(-1022), INF) | 210538 | 198925 | -5.52 | >> | [0] | 344990 | 627561 | +81.91 | >> | [-0] ... > > Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: > > Ensure ABS_MASK is a 128-bit memory sized location and only use equal enum for UCOMISD checks Did not review the patch in detail, but looks reasonable. Tests are passing on my end with commit 3 / v01. @missa-prime Thanks for taking care of this! ------------- Marked as reviewed by epeter (Reviewer). 
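For anyone re-checking the fast path locally: the special-value behavior it must preserve is fixed by the Math.cbrt specification (NaN maps to NaN, infinities and signed zeros keep their sign). A tiny stand-alone check along those lines (a hypothetical snippet, not one of the PR's micro-benchmarks or jtreg tests) could look like this:

public class CbrtSpecialValuesCheck {
    public static void main(String[] args) {
        // NaN -> NaN
        assertTrue(Double.isNaN(Math.cbrt(Double.NaN)));
        // +/-Infinity keep their sign
        assertTrue(Math.cbrt(Double.POSITIVE_INFINITY) == Double.POSITIVE_INFINITY);
        assertTrue(Math.cbrt(Double.NEGATIVE_INFINITY) == Double.NEGATIVE_INFINITY);
        // +/-0.0 keep their sign; compare raw bits so -0.0 is not confused with +0.0
        assertTrue(Double.doubleToRawLongBits(Math.cbrt(0.0)) == Double.doubleToRawLongBits(0.0));
        assertTrue(Double.doubleToRawLongBits(Math.cbrt(-0.0)) == Double.doubleToRawLongBits(-0.0));
        System.out.println("Math.cbrt special values look correct");
    }

    private static void assertTrue(boolean condition) {
        if (!condition) {
            throw new AssertionError("unexpected Math.cbrt result");
        }
    }
}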
PR Review: https://git.openjdk.org/jdk/pull/25962#pullrequestreview-2973696349 From epeter at openjdk.org Tue Jul 1 06:38:45 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 1 Jul 2025 06:38:45 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 06:07:03 GMT, Xiaohong Gong wrote: >> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: >> >> - Address review comments >> - Merge 'jdk:master' into JDK-8355563 >> - 8355563: VectorAPI: Refactor current implementation of subword gather load API > > Ping again! Thanks in advance! @XiaohongGong I'm a little busy at the moment, and soon going on a summer vacation, so I cannot promise a full review soon. Feel free to ask someone else to have a look. I quickly looked through your new benchmark results you published after integration of https://github.com/openjdk/jdk/pull/25539. There seem to still be a few cases where `Gain < 1`. Especially: GatherOperationsBenchmark.microShortGather512_MASK 256 thrpt 30 ops/ms 11587.465 10674.598 0.92 GatherOperationsBenchmark.microShortGather512_MASK 1024 thrpt 30 ops/ms 2902.731 2629.739 0.90 GatherOperationsBenchmark.microShortGather512_MASK 4096 thrpt 30 ops/ms 741.546 671.124 0.90 and GatherOperationsBenchmark.microShortGather256_MASK 256 thrpt 30 ops/ms 11339.217 10951.141 0.96 GatherOperationsBenchmark.microShortGather256_MASK 1024 thrpt 30 ops/ms 2840.081 2718.823 0.95 GatherOperationsBenchmark.microShortGather256_MASK 4096 thrpt 30 ops/ms 725.334 696.343 0.96 and GatherOperationsBenchmark.microByteGather512_MASK 64 thrpt 30 ops/ms 50588.210 48220.741 0.95 Do you know what happens in those cases? That said: https://github.com/openjdk/jdk/pull/25539 seems to have been quite the sucess, there are way fewer regressions now than before ? ------------- PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3022057434 From xgong at openjdk.org Tue Jul 1 06:43:44 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 1 Jul 2025 06:43:44 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: References: Message-ID: <7-WqNSzjPLOsHJ4DHogxqbiInl8TIz5sxIEXbIfo2OQ=.912568b8-830d-47cc-a837-46af6be618f3@github.com> On Tue, 1 Jul 2025 06:07:03 GMT, Xiaohong Gong wrote: >> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: >> >> - Address review comments >> - Merge 'jdk:master' into JDK-8355563 >> - 8355563: VectorAPI: Refactor current implementation of subword gather load API > > Ping again! Thanks in advance! > @XiaohongGong I'm a little busy at the moment, and soon going on a summer vacation, so I cannot promise a full review soon. Feel free to ask someone else to have a look. > > I quickly looked through your new benchmark results you published after integration of #25539. There seem to still be a few cases where `Gain < 1`. 
Especially: > > ``` > GatherOperationsBenchmark.microShortGather512_MASK 256 thrpt 30 ops/ms 11587.465 10674.598 0.92 > GatherOperationsBenchmark.microShortGather512_MASK 1024 thrpt 30 ops/ms 2902.731 2629.739 0.90 > GatherOperationsBenchmark.microShortGather512_MASK 4096 thrpt 30 ops/ms 741.546 671.124 0.90 > ``` > > and > > ``` > GatherOperationsBenchmark.microShortGather256_MASK 256 thrpt 30 ops/ms 11339.217 10951.141 0.96 > GatherOperationsBenchmark.microShortGather256_MASK 1024 thrpt 30 ops/ms 2840.081 2718.823 0.95 > GatherOperationsBenchmark.microShortGather256_MASK 4096 thrpt 30 ops/ms 725.334 696.343 0.96 > ``` > > and > > ``` > GatherOperationsBenchmark.microByteGather512_MASK 64 thrpt 30 ops/ms 50588.210 48220.741 0.95 > ``` > > Do you know what happens in those cases? Thanks for your input! Yes, I spent some time making an analysis on these little regressions. Seems there are the architecture HW influences like the cache miss or code alignment. I tried with a larger loop alignment like 32, and the performance will be improved and regressions are gone. Since I'm not quite familiar with X86 architectures, I'm not sure of the exact point. Any suggestions on that? ------------- PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3022088710 From jbhateja at openjdk.org Tue Jul 1 06:45:42 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 1 Jul 2025 06:45:42 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs In-Reply-To: References: Message-ID: On Thu, 19 Jun 2025 05:15:40 GMT, Srinivas Vamsi Parasa wrote: > The goal of this PR is to enhance the existing x86 assembly stubs using PUSH and POP instructions with paired PUSHP/POPP instructions which are part of Intel APX technology. > > In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0?R31). Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs. > > Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures. src/hotspot/cpu/x86/macroAssembler_x86.cpp line 800: > 798: void MacroAssembler::push(Register src, bool is_pair) { > 799: if (is_pair && VM_Version::supports_apx_f()) { > 800: pushp(src); What does is_pair signify here ? You are just pushing one register. Do you intend to use has_matching_pop ? src/hotspot/cpu/x86/macroAssembler_x86.cpp line 807: > 805: > 806: void MacroAssembler::pop(Register dst, bool is_pair) { > 807: if (is_pair && VM_Version::supports_apx_f()) { Same as above, new argument suggestion: please use has_matching_push. I understand your purpose here is to delegate the responsibility of balancing of PPX pair to the user. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2176508727 PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2176511119 From jbhateja at openjdk.org Tue Jul 1 06:45:42 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 1 Jul 2025 06:45:42 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 06:11:29 GMT, Jatin Bhateja wrote: >> The goal of this PR is to enhance the existing x86 assembly stubs using PUSH and POP instructions with paired PUSHP/POPP instructions which are part of Intel APX technology. >> >> In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0?R31). Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs. >> >> Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures. > > src/hotspot/cpu/x86/macroAssembler_x86.cpp line 807: > >> 805: >> 806: void MacroAssembler::pop(Register dst, bool is_pair) { >> 807: if (is_pair && VM_Version::supports_apx_f()) { > > Same as above, new argument suggestion: please use has_matching_push. > I understand your purpose here is to delegate the responsibility of balancing of PPX pair to the user. For a cleaner interface, I think we can also maintain a RAII style APXPushPopPairTracker in the stub snippets using push/pop instruction sequence and wrap the actual assembler call underneath. The idea here is to catch the balancing error upfront as PPX is purely a performance hint. Instructions with this hint have the same functional semantics as those without. PPX hints set by the compiler that violate the balancing rule may turn off the PPX optimization, but they will not affect program semantics.. 
class APXPushPopPairTracker {
 private:
  int _counter;

 public:
  APXPushPopPairTracker() : _counter(0) { }
  ~APXPushPopPairTracker() {
    // Every PPX-hinted push must have been matched by a PPX-hinted pop.
    assert(_counter == 0, "Push/pop pair mismatch");
  }

  void push(Register reg, bool has_matching_pop) {
    if (has_matching_pop && VM_Version::supports_apx_f()) {
      Assembler::pushp(reg);
      incrementCounter();
    } else {
      Assembler::push(reg);
    }
  }

  void pop(Register reg, bool has_matching_push) {
    if (has_matching_push && VM_Version::supports_apx_f()) {
      Assembler::popp(reg);
      decrementCounter();
    } else {
      Assembler::pop(reg);
    }
  }

  void incrementCounter() { _counter++; }
  void decrementCounter() { _counter--; }
};

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2176549150 From jbhateja at openjdk.org Tue Jul 1 06:48:42 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 1 Jul 2025 06:48:42 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 06:11:29 GMT, Jatin Bhateja wrote: >> The goal of this PR is to enhance the existing x86 assembly stubs using PUSH and POP instructions with paired PUSHP/POPP instructions which are part of Intel APX technology. >> >> In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0–R31). Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs. >> >> Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures. > > src/hotspot/cpu/x86/macroAssembler_x86.cpp line 807: > >> 805: >> 806: void MacroAssembler::pop(Register dst, bool is_pair) { >> 807: if (is_pair && VM_Version::supports_apx_f()) { > > Same as above, new argument suggestion: please use has_matching_push. > I understand your purpose here is to delegate the responsibility of balancing of PPX pair to the user. For a cleaner interface, I think we can also maintain a RAII style APXPushPopPairTracker in the stub snippets using push/pop instruction sequence and wrap the actual assembler call underneath. The idea here is to catch the balancing error upfront as PPX is purely a performance hint. Instructions with this hint have the same functional semantics as those without. PPX hints set by the compiler that violate the balancing rule may turn off the PPX optimization, but they will not affect program semantics.
class APXPushPopPairTracker {
 private:
  int _counter;

 public:
  APXPushPopPairTracker() : _counter(0) { }
  ~APXPushPopPairTracker() {
    // Every PPX-hinted push must have been matched by a PPX-hinted pop.
    assert(_counter == 0, "Push/pop pair mismatch");
  }

  void push(Register reg, bool has_matching_pop) {
    if (has_matching_pop && VM_Version::supports_apx_f()) {
      Assembler::pushp(reg);
      incrementCounter();
    } else {
      Assembler::push(reg);
    }
  }

  void pop(Register reg, bool has_matching_push) {
    if (has_matching_push && VM_Version::supports_apx_f()) {
      Assembler::popp(reg);
      decrementCounter();
    } else {
      Assembler::pop(reg);
    }
  }

  void incrementCounter() { _counter++; }
  void decrementCounter() { _counter--; }
};

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2176564840 From mhaessig at openjdk.org Tue Jul 1 06:50:46 2025 From: mhaessig at openjdk.org (Manuel Hässig) Date: Tue, 1 Jul 2025 06:50:46 GMT Subject: RFR: 8361092: Remove trailing spaces in x86 ad files In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 15:34:18 GMT, Manuel Hässig wrote: > This PR fixes some trailing spaces in `x86_64.ad`. > > Testing: > - [ ] Github Actions Thank you for your reviews. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26048#issuecomment-3022106129 From mhaessig at openjdk.org Tue Jul 1 06:50:47 2025 From: mhaessig at openjdk.org (Manuel Hässig) Date: Tue, 1 Jul 2025 06:50:47 GMT Subject: Integrated: 8361092: Remove trailing spaces in x86 ad files In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 15:34:18 GMT, Manuel Hässig wrote: > This PR fixes some trailing spaces in `x86_64.ad`. > > Testing: > - [ ] Github Actions This pull request has now been integrated. Changeset: b32ccf2c Author: Manuel Hässig URL: https://git.openjdk.org/jdk/commit/b32ccf2cb23e0180187f4238140583a923fc27c4 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod 8361092: Remove trailing spaces in x86 ad files Reviewed-by: kvn, sviswanathan ------------- PR: https://git.openjdk.org/jdk/pull/26048 From mhaessig at openjdk.org Tue Jul 1 06:52:32 2025 From: mhaessig at openjdk.org (Manuel Hässig) Date: Tue, 1 Jul 2025 06:52:32 GMT Subject: RFR: 8360641: TestCompilerCounts fails after 8354727 [v4] In-Reply-To: <3mMrDF_446r7HudbsHIpdoWByBlnUpjFo7YzIty0KG8=.facc058f-3975-44c4-b2d4-93b8c64db185@github.com> References: <3mMrDF_446r7HudbsHIpdoWByBlnUpjFo7YzIty0KG8=.facc058f-3975-44c4-b2d4-93b8c64db185@github.com> Message-ID: > After integrating #25872 the calculation of the`CICompilerCount` ergonomic became dependent on the size of `NonNMethodCodeHeapSize`, which itself is an ergonomic based on the available memory. Thus, depending on the system, the test `compiler/arguments/TestCompilerCounts.java` failed, i.e. locally this failed, but not on CI servers. > > This PR changes the test to reflect the changes introduced in #25872.
> > Testing: > - [ ] [Github Actions](https://github.com/mhaessig/jdk/actions/runs/15932906313) > - [x] tier1,tier2 plus Oracle internal testing Manuel H?ssig has updated the pull request incrementally with one additional commit since the last revision: Fix whitespace Co-authored-by: Andrey Turbanov ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26024/files - new: https://git.openjdk.org/jdk/pull/26024/files/8beb5898..71767802 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26024&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26024&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26024.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26024/head:pull/26024 PR: https://git.openjdk.org/jdk/pull/26024 From mhaessig at openjdk.org Tue Jul 1 06:52:33 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 1 Jul 2025 06:52:33 GMT Subject: RFR: 8360641: TestCompilerCounts fails after 8354727 [v3] In-Reply-To: References: <3mMrDF_446r7HudbsHIpdoWByBlnUpjFo7YzIty0KG8=.facc058f-3975-44c4-b2d4-93b8c64db185@github.com> Message-ID: On Mon, 30 Jun 2025 19:48:44 GMT, Andrey Turbanov wrote: >> Manuel H?ssig has updated the pull request incrementally with two additional commits since the last revision: >> >> - Remove superfluous newline >> - Add copyright > > test/hotspot/jtreg/compiler/arguments/TestCompilerCounts.java line 159: > >> 157: // Tiered modes >> 158: int tieredCount = heuristicCount(cpus, Compilation.Tiered, debug); >> 159: pass(tieredCount, opt, "-XX:NonNMethodCodeHeapSize=" + NonNMethodCodeHeapSize); > > Suggestion: > > pass(tieredCount, opt, "-XX:NonNMethodCodeHeapSize=" + NonNMethodCodeHeapSize); Good catch, thank you. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26024#discussion_r2176568786 From epeter at openjdk.org Tue Jul 1 06:55:41 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 1 Jul 2025 06:55:41 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: <7-WqNSzjPLOsHJ4DHogxqbiInl8TIz5sxIEXbIfo2OQ=.912568b8-830d-47cc-a837-46af6be618f3@github.com> References: <7-WqNSzjPLOsHJ4DHogxqbiInl8TIz5sxIEXbIfo2OQ=.912568b8-830d-47cc-a837-46af6be618f3@github.com> Message-ID: On Tue, 1 Jul 2025 06:41:32 GMT, Xiaohong Gong wrote: >> Ping again! Thanks in advance! > >> @XiaohongGong I'm a little busy at the moment, and soon going on a summer vacation, so I cannot promise a full review soon. Feel free to ask someone else to have a look. >> >> I quickly looked through your new benchmark results you published after integration of #25539. There seem to still be a few cases where `Gain < 1`. Especially: >> >> ``` >> GatherOperationsBenchmark.microShortGather512_MASK 256 thrpt 30 ops/ms 11587.465 10674.598 0.92 >> GatherOperationsBenchmark.microShortGather512_MASK 1024 thrpt 30 ops/ms 2902.731 2629.739 0.90 >> GatherOperationsBenchmark.microShortGather512_MASK 4096 thrpt 30 ops/ms 741.546 671.124 0.90 >> ``` >> >> and >> >> ``` >> GatherOperationsBenchmark.microShortGather256_MASK 256 thrpt 30 ops/ms 11339.217 10951.141 0.96 >> GatherOperationsBenchmark.microShortGather256_MASK 1024 thrpt 30 ops/ms 2840.081 2718.823 0.95 >> GatherOperationsBenchmark.microShortGather256_MASK 4096 thrpt 30 ops/ms 725.334 696.343 0.96 >> ``` >> >> and >> >> ``` >> GatherOperationsBenchmark.microByteGather512_MASK 64 thrpt 30 ops/ms 50588.210 48220.741 0.95 >> ``` >> >> Do you know what happens in those cases? 
> > Thanks for your input! Yes, I spent some time making an analysis on these little regressions. Seems there are the architecture HW influences like the cache miss or code alignment. I tried with a larger loop alignment like 32, and the performance will be improved and regressions are gone. Since I'm not quite familiar with X86 architectures, I'm not sure of the exact point. Any suggestions on that? @XiaohongGong Maybe someone from Intel (@jatin-bhateja @sviswa7) can help you with the x86 specific issues. You could always use hardware counters to measure cache misses. Also if the vectors are not cache-line aligned, there may be split loads or stores. Also that can be measured with hardware counters. Maybe the benchmark needs to be improved somehow, to account for issues with alignment. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3022132271 From dfenacci at openjdk.org Tue Jul 1 06:58:40 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Tue, 1 Jul 2025 06:58:40 GMT Subject: RFR: 8360783: CTW: Skip deoptimization between tiers [v2] In-Reply-To: References: Message-ID: On Fri, 27 Jun 2025 08:38:31 GMT, Aleksey Shipilev wrote: >> When profiling CTW runs, I noticed we spend a lot of time dealing with deoptimization. We do this excessively, deoptimizing before compilation on every tier. This is excessive: Hotspot honors compilation requests on subsequent levels without the need for explicit deoptimization. Not doing deopt between tiers greatly improves CTW performance. >> >> A taste of improvements, about 15% less CPU spent: >> >> >> $ time make test TEST=applications/ctw/modules >> >> # Current >> real 5m1.616s >> user 79m41.398s >> sys 14m39.607s >> >> # Patched >> real 3m55.411s >> user 69m19.227s >> sys 5m24.323s >> >> >> The compilation still works as expected, progressing through tiers 1..4: >> >> >> $ JAVA_OPTIONS="-XX:+PrintCompilation -XX:CICompilerCount=2" ./ctw.sh modules:jdk.compiler | tee out >> ... >> $ grep sun.tools.serialver.resources.serialver_de::getContents out >> 101783 55033 b 1 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101785 55036 b 2 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101786 55033 1 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used >> 101786 55038 b 3 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101787 55036 2 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used >> 101792 55040 b 4 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101797 55038 3 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used >> 101798 55040 4 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: marked for deoptimization > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Update test/hotspot/jtreg/testlibrary/ctw/src/sun/hotspot/tools/ctw/Compiler.java > > Co-authored-by: Tobias Hartmann Thanks @shipilev! I really welcome any change that makes CTW a bit faster ? Looks good to me. ------------- Marked as reviewed by dfenacci (Committer). 
PR Review: https://git.openjdk.org/jdk/pull/26013#pullrequestreview-2973784592 From xgong at openjdk.org Tue Jul 1 07:02:48 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 1 Jul 2025 07:02:48 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v4] In-Reply-To: References: Message-ID: <3sWLk_sAMLtcvRUjXk9hYe-K2MBQl9fH2Qg0MF7lwDk=.b8867d51-e822-43c0-93ab-58228c6eb1d5@github.com> On Mon, 30 Jun 2025 13:25:09 GMT, Mikhail Ablakatov wrote: >> Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. >> >> Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still. >> >> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reducion micro-benchmarks. >> >> Benchmarks results: >> >> Neoverse-V1 (SVE 256-bit) >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms >> >> >> Fujitsu A64FX (SVE 512-bit): >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms > > Mikhail Ablakatov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: > > - cleanup: address nits, rename several symbols > - cleanup: remove unreferenced definitions > - Address review comments. 
> > - fixup: disable FP mul reduction auto-vectorization for all targets > - fixup: add a tmp vReg to reduce_mul_integral_gt128b and > reduce_non_strict_order_mul_fp_gt128bto keep vsrc unmodified > - cleanup: replace a complex lambda in the above methods with a loop > - cleanup: rename symbols to follow the existing naming convention > - cleanup: add asserts to SVE only instructions > - split mul FP reduction instructions into strictly-ordered (default) > and explicitly non strictly-ordered > - remove redundant conditions in TestVectorFPReduction.java > > Benchmarks results: > > Neoverse-V1 (SVE 256-bit) > > | Benchmark | Before | After | Units | Diff | > |---------------------------|----------|----------|--------|-------| > | ByteMaxVector.MULLanes | 619.156 | 9884.578 | ops/ms | 1496% | > | DoubleMaxVector.MULLanes | 184.693 | 2712.051 | ops/ms | 1368% | > | FloatMaxVector.MULLanes | 277.818 | 3388.038 | ops/ms | 1119% | > | IntMaxVector.MULLanes | 371.225 | 4765.434 | ops/ms | 1183% | > | LongMaxVector.MULLanes | 205.149 | 2672.975 | ops/ms | 1203% | > | ShortMaxVector.MULLanes | 472.804 | 5122.917 | ops/ms | 984% | > - Merge branch 'master' into 8343689-rebase > - fixup: don't modify the value in vsrc > > Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this > change, the result of recursive folding is held in vtmp1. To be able to > pass this intermediate result to reduce_mul_integral_le128b(), we would > have to use another temporary FloatRegister, as vtmp1 would essentially > act as vsrc. It's possible to get around this however: > reduce_mul_integral_le128b() is modified so it's possible to pass > matching vsrc and vtmp2 arguments. By doing this, we save ourselves a > temporary register in rules that match to reduce_mul_integral_gt128b(). > - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formating > - Use EXT instead of COMPACT to split a vector into two halves > > Benchmarks results: > > Neoverse-V1 (SVE 256-bit) > > Benchmark (size) Mode master PR Units > ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms > Short... src/hotspot/cpu/aarch64/aarch64_vector.ad line 3536: > 3534: > 3535: instruct reduce_mulF_gt128b(vRegF dst, vRegF fsrc, vReg vsrc, vReg tmp) %{ > 3536: predicate(Matcher::vector_length_in_bytes(n->in(2)) > 16 && n->as_Reduction()->requires_strict_order()); Are there the cases that can match with this rule? src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2097: > 2095: sve_movprfx(vtmp1, vsrc); // copy > 2096: sve_ext(vtmp1, vtmp1, vector_length_in_bytes / 2); // swap halves > 2097: sve_mul(vtmp1, elemType_to_regVariant(bt), pgtmp, vsrc); // multiply halves > sve_mul(vtmp1, elemType_to_regVariant(bt), pgtmp, vsrc); Can we use `ptrue` instread of `pgtmp` here? The higher bits can be computed, but they have not influences to the final results, right? src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2106: > 2104: sve_mul(vtmp1, elemType_to_regVariant(bt), pgtmp, vtmp2); // multiply halves > 2105: vector_length_in_bytes = vector_length_in_bytes / 2; > 2106: vector_length = vector_length / 2; I guess you want to update the `pgtmp` with new `vector_length`? But seems the code is missing. Anyway, maybe the it's not necessary to generate a predicate as I commented above. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2176590314 PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2176584327 PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2176587011 From xgong at openjdk.org Tue Jul 1 07:10:41 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 1 Jul 2025 07:10:41 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: <7-WqNSzjPLOsHJ4DHogxqbiInl8TIz5sxIEXbIfo2OQ=.912568b8-830d-47cc-a837-46af6be618f3@github.com> References: <7-WqNSzjPLOsHJ4DHogxqbiInl8TIz5sxIEXbIfo2OQ=.912568b8-830d-47cc-a837-46af6be618f3@github.com> Message-ID: On Tue, 1 Jul 2025 06:41:32 GMT, Xiaohong Gong wrote: >> Ping again! Thanks in advance! > >> @XiaohongGong I'm a little busy at the moment, and soon going on a summer vacation, so I cannot promise a full review soon. Feel free to ask someone else to have a look. >> >> I quickly looked through your new benchmark results you published after integration of #25539. There seem to still be a few cases where `Gain < 1`. Especially: >> >> ``` >> GatherOperationsBenchmark.microShortGather512_MASK 256 thrpt 30 ops/ms 11587.465 10674.598 0.92 >> GatherOperationsBenchmark.microShortGather512_MASK 1024 thrpt 30 ops/ms 2902.731 2629.739 0.90 >> GatherOperationsBenchmark.microShortGather512_MASK 4096 thrpt 30 ops/ms 741.546 671.124 0.90 >> ``` >> >> and >> >> ``` >> GatherOperationsBenchmark.microShortGather256_MASK 256 thrpt 30 ops/ms 11339.217 10951.141 0.96 >> GatherOperationsBenchmark.microShortGather256_MASK 1024 thrpt 30 ops/ms 2840.081 2718.823 0.95 >> GatherOperationsBenchmark.microShortGather256_MASK 4096 thrpt 30 ops/ms 725.334 696.343 0.96 >> ``` >> >> and >> >> ``` >> GatherOperationsBenchmark.microByteGather512_MASK 64 thrpt 30 ops/ms 50588.210 48220.741 0.95 >> ``` >> >> Do you know what happens in those cases? > > Thanks for your input! Yes, I spent some time making an analysis on these little regressions. Seems there are the architecture HW influences like the cache miss or code alignment. I tried with a larger loop alignment like 32, and the performance will be improved and regressions are gone. Since I'm not quite familiar with X86 architectures, I'm not sure of the exact point. Any suggestions on that? > @XiaohongGong Maybe someone from Intel (@jatin-bhateja @sviswa7) can help you with the x86 specific issues. You could always use hardware counters to measure cache misses. Also if the vectors are not cache-line aligned, there may be split loads or stores. Also that can be measured with hardware counters. Maybe the benchmark needs to be improved somehow, to account for issues with alignment. I also tried to measure cache misses with perf on my x86 machine, and I noticed the cache miss is increased. The generated code layout of the test/benchmark is changed with my changes in Java side, so I guess maybe the alignment is different with before. To verify my thought, I used the vm option `-XX:OptoLoopAlignment=32`, and the performance can be improved a lot compared with the version without my change. So I think the patch itself maybe acceptable even we noticed minor regressions. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3022195040 From bmaillard at openjdk.org Tue Jul 1 07:11:42 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Tue, 1 Jul 2025 07:11:42 GMT Subject: RFR: 8359602: Ideal optimizations depending on input type are missed because of missing notification mechanism from CCP [v2] In-Reply-To: <0MJe_8nA-ILWqoVG-9rzuq5Pe9xX-FG2LN3k9Cy8nqU=.d724c6cf-cb02-45c4-95a4-5bd1fef7462b@github.com> References: <-_MqCH6QmE-o_d7c9-aet-Cq-ptZJ6CZV6rodpDNWq0=.173e6f7a-3cfe-4791-8253-36e06d892069@github.com> <3cLLB7fms3S4WgqOVeb7D_ZDRFsJ_-ca3qfALlmzFeU=.1002ac91-1e35-4499-9d88-6d1f76c955d0@github.com> <0MJe_8nA-ILWqoVG-9rzuq5Pe9xX-FG2LN3k9Cy8nqU=.d724c6cf-cb02-45c4-95a4-5bd1fef7462b@github.com> Message-ID: On Mon, 30 Jun 2025 13:52:01 GMT, Emanuel Peter wrote: > @benoitmaillard Very nice work, and great description :) Thank you! > > Did you check if this allows enabling any of the other disabled verifications from [JDK-8347273](https://bugs.openjdk.org/browse/JDK-8347273)? > > That may be a lot of work. Not sure if it is worth checking all of them now. @TobiHartmann how much should he invest in this now? An alternative is just tackling all the other cases later. What do you think? I have started to take a look at this and it seems that there are a lot of cases to check indeed. > @benoitmaillard One more open question for me: `raise_bottom_type` only sets the node internal `_type`. But in IGVN, we do not read from `_type` but `phase->type(in(2))`. Do you know when the `phase->type(in(2))` value changes? Is that also during CCP? Before or after the `_type` is modified? Yes, good point, I should I have mentioned this somewhere. The `phase->type(in(2))` call uses the type array from `PhaseValues`. The type array entry is actually modified earlier, in `PhaseCCP::analyze`, right after the `Value` call. You can see the `set_type` call [here](https://github.com/benoitmaillard/jdk/blob/75de51dff6d9cc3e9764737b29b9358992b488b7/src/hotspot/share/opto/phaseX.cpp#L2765). When this happens, users are added to the (local) worklist but again it does not change our issue as only value optimizations occur in that context. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26017#issuecomment-3022192988 From shade at openjdk.org Tue Jul 1 07:41:40 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 1 Jul 2025 07:41:40 GMT Subject: RFR: 8360783: CTW: Skip deoptimization between tiers [v2] In-Reply-To: References: Message-ID: On Fri, 27 Jun 2025 08:38:31 GMT, Aleksey Shipilev wrote: >> When profiling CTW runs, I noticed we spend a lot of time dealing with deoptimization. We do this excessively, deoptimizing before compilation on every tier. This is excessive: Hotspot honors compilation requests on subsequent levels without the need for explicit deoptimization. Not doing deopt between tiers greatly improves CTW performance. >> >> A taste of improvements, about 15% less CPU spent: >> >> >> $ time make test TEST=applications/ctw/modules >> >> # Current >> real 5m1.616s >> user 79m41.398s >> sys 14m39.607s >> >> # Patched >> real 3m55.411s >> user 69m19.227s >> sys 5m24.323s >> >> >> The compilation still works as expected, progressing through tiers 1..4: >> >> >> $ JAVA_OPTIONS="-XX:+PrintCompilation -XX:CICompilerCount=2" ./ctw.sh modules:jdk.compiler | tee out >> ... 
>> $ grep sun.tools.serialver.resources.serialver_de::getContents out >> 101783 55033 b 1 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101785 55036 b 2 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101786 55033 1 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used >> 101786 55038 b 3 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101787 55036 2 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used >> 101792 55040 b 4 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101797 55038 3 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used >> 101798 55040 4 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: marked for deoptimization > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Update test/hotspot/jtreg/testlibrary/ctw/src/sun/hotspot/tools/ctw/Compiler.java > > Co-authored-by: Tobias Hartmann Thanks! Here goes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26013#issuecomment-3022327111 From thartmann at openjdk.org Tue Jul 1 07:47:40 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 1 Jul 2025 07:47:40 GMT Subject: RFR: 8360783: CTW: Skip deoptimization between tiers [v2] In-Reply-To: References: Message-ID: On Fri, 27 Jun 2025 08:38:31 GMT, Aleksey Shipilev wrote: >> When profiling CTW runs, I noticed we spend a lot of time dealing with deoptimization. We do this excessively, deoptimizing before compilation on every tier. This is excessive: Hotspot honors compilation requests on subsequent levels without the need for explicit deoptimization. Not doing deopt between tiers greatly improves CTW performance. >> >> A taste of improvements, about 15% less CPU spent: >> >> >> $ time make test TEST=applications/ctw/modules >> >> # Current >> real 5m1.616s >> user 79m41.398s >> sys 14m39.607s >> >> # Patched >> real 3m55.411s >> user 69m19.227s >> sys 5m24.323s >> >> >> The compilation still works as expected, progressing through tiers 1..4: >> >> >> $ JAVA_OPTIONS="-XX:+PrintCompilation -XX:CICompilerCount=2" ./ctw.sh modules:jdk.compiler | tee out >> ... >> $ grep sun.tools.serialver.resources.serialver_de::getContents out >> 101783 55033 b 1 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101785 55036 b 2 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101786 55033 1 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used >> 101786 55038 b 3 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101787 55036 2 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used >> 101792 55040 b 4 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101797 55038 3 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used >> 101798 55040 4 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: marked for deoptimization > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Update test/hotspot/jtreg/testlibrary/ctw/src/sun/hotspot/tools/ctw/Compiler.java > > Co-authored-by: Tobias Hartmann Marked as reviewed by thartmann (Reviewer). 
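To make the reviewed change above concrete: the CTW harness drives compilation through the WhiteBox API, and the idea is to request the next tier directly instead of deoptimizing first. A rough sketch of that pattern follows; it is illustrative only and not the actual `Compiler.java` code:

```java
import java.lang.reflect.Executable;
import jdk.test.whitebox.WhiteBox;

// Illustrative sketch of requesting successive compilation tiers without
// deoptimizing in between. Running it needs the WhiteBox test library on the
// boot class path plus -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI.
public class TieredEnqueueSketch {
    private static final WhiteBox WB = WhiteBox.getWhiteBox();

    static void compileThroughTiers(Executable method) {
        for (int level = 1; level <= 4; level++) {
            // No WB.deoptimizeMethod(method) between tiers: HotSpot honors a
            // request at a higher tier without the previous code being thrown away.
            WB.enqueueMethodForCompilation(method, level);
        }
    }
}
```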
------------- PR Review: https://git.openjdk.org/jdk/pull/26013#pullrequestreview-2973981844 From shade at openjdk.org Tue Jul 1 08:02:45 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 1 Jul 2025 08:02:45 GMT Subject: RFR: 8360783: CTW: Skip deoptimization between tiers [v2] In-Reply-To: References: Message-ID: On Fri, 27 Jun 2025 08:38:31 GMT, Aleksey Shipilev wrote: >> When profiling CTW runs, I noticed we spend a lot of time dealing with deoptimization. We do this excessively, deoptimizing before compilation on every tier. This is excessive: Hotspot honors compilation requests on subsequent levels without the need for explicit deoptimization. Not doing deopt between tiers greatly improves CTW performance. >> >> A taste of improvements, about 15% less CPU spent: >> >> >> $ time make test TEST=applications/ctw/modules >> >> # Current >> real 5m1.616s >> user 79m41.398s >> sys 14m39.607s >> >> # Patched >> real 3m55.411s >> user 69m19.227s >> sys 5m24.323s >> >> >> The compilation still works as expected, progressing through tiers 1..4: >> >> >> $ JAVA_OPTIONS="-XX:+PrintCompilation -XX:CICompilerCount=2" ./ctw.sh modules:jdk.compiler | tee out >> ... >> $ grep sun.tools.serialver.resources.serialver_de::getContents out >> 101783 55033 b 1 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101785 55036 b 2 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101786 55033 1 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used >> 101786 55038 b 3 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101787 55036 2 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used >> 101792 55040 b 4 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101797 55038 3 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used >> 101798 55040 4 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: marked for deoptimization > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Update test/hotspot/jtreg/testlibrary/ctw/src/sun/hotspot/tools/ctw/Compiler.java > > Co-authored-by: Tobias Hartmann Aw. Thanks! Here goes again. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26013#issuecomment-3022410574 From shade at openjdk.org Tue Jul 1 08:02:45 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 1 Jul 2025 08:02:45 GMT Subject: Integrated: 8360783: CTW: Skip deoptimization between tiers In-Reply-To: References: Message-ID: On Fri, 27 Jun 2025 08:19:34 GMT, Aleksey Shipilev wrote: > When profiling CTW runs, I noticed we spend a lot of time dealing with deoptimization. We do this excessively, deoptimizing before compilation on every tier. This is excessive: Hotspot honors compilation requests on subsequent levels without the need for explicit deoptimization. Not doing deopt between tiers greatly improves CTW performance. > > A taste of improvements, about 15% less CPU spent: > > > $ time make test TEST=applications/ctw/modules > > # Current > real 5m1.616s > user 79m41.398s > sys 14m39.607s > > # Patched > real 3m55.411s > user 69m19.227s > sys 5m24.323s > > > The compilation still works as expected, progressing through tiers 1..4: > > > $ JAVA_OPTIONS="-XX:+PrintCompilation -XX:CICompilerCount=2" ./ctw.sh modules:jdk.compiler | tee out > ... 
> $ grep sun.tools.serialver.resources.serialver_de::getContents out > 101783 55033 b 1 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) > 101785 55036 b 2 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) > 101786 55033 1 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used > 101786 55038 b 3 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) > 101787 55036 2 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used > 101792 55040 b 4 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) > 101797 55038 3 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used > 101798 55040 4 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: marked for deoptimization This pull request has now been integrated. Changeset: cd6caedd Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/cd6caedd0a3c9ebd4c8c57e64f62b60161c5cd7c Stats: 8 lines in 1 file changed: 6 ins; 1 del; 1 mod 8360783: CTW: Skip deoptimization between tiers Reviewed-by: thartmann, mhaessig, dfenacci ------------- PR: https://git.openjdk.org/jdk/pull/26013 From eastigeevich at openjdk.org Tue Jul 1 08:08:49 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 1 Jul 2025 08:08:49 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: Message-ID: <0TjXtL5ABEBUwmu1VlJ9kNDs95zi8HGA-S2A0BU9GeY=.2fa893f4-96c4-4761-91b9-3b6250212c7a@github.com> On Thu, 26 Jun 2025 16:20:44 GMT, Chad Rakoczy wrote: >> src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp line 90: >> >>> 88: // Patch the constant in the call's trampoline stub. >>> 89: address trampoline_stub_addr = get_trampoline(); >>> 90: if (trampoline_stub_addr != nullptr && dest != trampoline_stub_addr) { >> >> I think you will not need the checks if you rewrite the code as follows: >> ```c++ >> address addr_call = ...; >> assert(); >> >> if (!Assembler::reachable_from_branch_at(addr_call, dest)) { >> address trampoline_stub_addr = get_trampoline(); >> assert (trampoline_stub_addr != nullptr, "we need a trampoline"); >> assert (! is_NativeCallTrampolineStub_at(dest), "chained trampolines"); >> nativeCallTrampolineStub_at(trampoline_stub_addr)->set_destination(dest); >> dest = trampoline_stub_addr; >> } >> set_destination(dest); >> ICache::invalidate_range(addr_call, instruction_size); >> >> >> If `dest` is a trampoline in the current nmethod, it is always reachable. So you will not go into setting trampoline's target to itself. Also we will call `get_trampoline`, which involves `CodeCache::find_blob` and ` a traversal of relocations, only if we need a trampoline. > > I would need to check the assumptions that other callers make about this function. In the current state it updates the trampoline regardless if the branch is reachable or not. With your change it would require the caller to also update the trampoline to make sure it is not stale. @theRealAph When we don't need a trampoline (a call site is a direct call), we update the trampoline to have the same destination as the call site. I have not found places in Hotspot relying on this. Do you remember why we are doing this? Is it Ok not to update trampolines in the case of reachable destinations? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2176748370 From aph at openjdk.org Tue Jul 1 08:13:41 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 1 Jul 2025 08:13:41 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 05:59:15 GMT, Xiaohong Gong wrote: > ### Background > On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. > > For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. > > To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. > > ### Impact Analysis > #### 1. Vector types > Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. > > #### 2. Vector API > No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. > > #### 3. Auto-vectorization > Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. > > #### 4. Codegen of vector nodes > NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. > > Details: > - Lanewise vector operations are unaffected as explained above. > - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). > - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, adding an explicit vector size check in `match_rule_s... src/hotspot/cpu/aarch64/aarch64.ad line 2371: > 2369: switch(bt) { > 2370: case T_BOOLEAN: > 2371: // It needs to load/store a vector mask with only 2 elements Suggestion: // Load/store a vector mask with only 2 elements Same with the other cases. src/hotspot/cpu/aarch64/aarch64.ad line 2386: > 2384: break; > 2385: default: > 2386: // Limit the min vector length to 64-bit normally. Suggestion: // Limit the min vector length to 64-bit. src/hotspot/cpu/aarch64/aarch64_vector.ad line 199: > 197: case Op_MaxReductionV: > 198: // Reductions with less than 8 bytes vector length are > 199: // not supported for now. 
Suggestion: // not supported. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26057#discussion_r2176759967 PR Review Comment: https://git.openjdk.org/jdk/pull/26057#discussion_r2176761846 PR Review Comment: https://git.openjdk.org/jdk/pull/26057#discussion_r2176762709 From aph at openjdk.org Tue Jul 1 08:30:48 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 1 Jul 2025 08:30:48 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: <0TjXtL5ABEBUwmu1VlJ9kNDs95zi8HGA-S2A0BU9GeY=.2fa893f4-96c4-4761-91b9-3b6250212c7a@github.com> References: <0TjXtL5ABEBUwmu1VlJ9kNDs95zi8HGA-S2A0BU9GeY=.2fa893f4-96c4-4761-91b9-3b6250212c7a@github.com> Message-ID: On Tue, 1 Jul 2025 08:05:50 GMT, Evgeny Astigeevich wrote: > @theRealAph When we don't need a trampoline (a call site is a direct call), we update the trampoline to have the same destination as the call site. Yes, that's fundamental to the design. > I have not found places in Hotspot relying on this. Do you remember why we are doing this? Is it Ok not to update trampolines in the case of reachable destinations? No. We always keep the trampoline up to date so that we don't have to deal with a race condition when patching trampoline calls. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2176812614 From aph at openjdk.org Tue Jul 1 08:34:48 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 1 Jul 2025 08:34:48 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: <0TjXtL5ABEBUwmu1VlJ9kNDs95zi8HGA-S2A0BU9GeY=.2fa893f4-96c4-4761-91b9-3b6250212c7a@github.com> Message-ID: On Tue, 1 Jul 2025 08:28:00 GMT, Andrew Haley wrote: >> @theRealAph When we don't need a trampoline (a call site is a direct call), we update the trampoline to have the same destination as the call site. I have not found places in Hotspot relying on this. >> Do you remember why we are doing this? Is it Ok not to update trampolines in the case of reachable destinations? > >> @theRealAph When we don't need a trampoline (a call site is a direct call), we update the trampoline to have the same destination as the call site. > > Yes, that's fundamental to the design. > >> I have not found places in Hotspot relying on this. Do you remember why we are doing this? Is it Ok not to update trampolines in the case of reachable destinations? > > No. We always keep the trampoline up to date so that we don't have to deal with a race condition when patching trampoline calls. Please read the comments which begin: `AArch64 OpenJDK uses four different types of calls:` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2176825935 From xgong at openjdk.org Tue Jul 1 08:35:42 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 1 Jul 2025 08:35:42 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 08:10:16 GMT, Andrew Haley wrote: >> ### Background >> On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. 
>> >> For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. >> >> To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. >> >> ### Impact Analysis >> #### 1. Vector types >> Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. >> >> #### 2. Vector API >> No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. >> >> #### 3. Auto-vectorization >> Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. >> >> #### 4. Codegen of vector nodes >> NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. >> >> Details: >> - Lanewise vector operations are unaffected as explained above. >> - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). >> - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, addin... > > src/hotspot/cpu/aarch64/aarch64.ad line 2371: > >> 2369: switch(bt) { >> 2370: case T_BOOLEAN: >> 2371: // It needs to load/store a vector mask with only 2 elements > > Suggestion: > > // Load/store a vector mask with only 2 elements > > Same with the other cases. Thanks so much for your comment. I will fix them soon. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26057#discussion_r2176831961 From aph at openjdk.org Tue Jul 1 08:40:55 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 1 Jul 2025 08:40:55 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: Message-ID: On Wed, 25 Jun 2025 22:32:24 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. 
New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 90 commits: > > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Update how call sites are fixed > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Fix pointer printing > - Use set_destination_mt_safe > - Print address as pointer > - Use new _metadata_size instead of _jvmci_data_size > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Only check branch distance for aarch64 and riscv > - Move far branch fix to fix_relocation_after_move > - ... and 80 more: https://git.openjdk.org/jdk/compare/f799cf18...70e4164e src/hotspot/cpu/aarch64/relocInfo_aarch64.cpp line 84: > 82: if (NativeCall::is_call_at(addr())) { > 83: NativeCall* call = nativeCall_at(addr()); > 84: if (be_safe) { Why is this change necessary? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2176847208 From aph at openjdk.org Tue Jul 1 08:44:51 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 1 Jul 2025 08:44:51 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: Message-ID: On Wed, 25 Jun 2025 22:32:24 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 90 commits: > > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Update how call sites are fixed > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Fix pointer printing > - Use set_destination_mt_safe > - Print address as pointer > - Use new _metadata_size instead of _jvmci_data_size > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Only check branch distance for aarch64 and riscv > - Move far branch fix to fix_relocation_after_move > - ... 
and 80 more: https://git.openjdk.org/jdk/compare/f799cf18...70e4164e src/hotspot/cpu/aarch64/relocInfo_aarch64.cpp line 117: > 115: } > 116: > 117: void poll_Relocation::fix_relocation_after_move(const CodeBuffer* src, CodeBuffer* dest, bool is_nmethod_relocation) { Suggestion: void poll_Relocation::fix_relocation_after_move(const CodeBuffer* src, CodeBuffer* dest, bool) { ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2176861287 From mhaessig at openjdk.org Tue Jul 1 09:11:32 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 1 Jul 2025 09:11:32 GMT Subject: RFR: 8308094: Add a compilation timeout flag to catch long running compilations [v2] In-Reply-To: References: Message-ID: <_Ye19u_7PlqlsoRSuR0dNeAGbeuHyN_oqD1ZS4q9Nvk=.b94fd29d-d43e-4561-9926-7f5a46434d8e@github.com> > This PR adds `-XX:CompileTaskTimeout` on Linux to limit the amount of time a compilation task can run. The goal of this is initially to be able to find and investigate long-running compilations. > > The timeout is implemented using a POSIX timer that sends a `SIGALRM` to the compiler thread the compile task is running on. Each compiler thread registers a signal handler that triggers an assert upon receiving `SIGALRM`. This is currently only implemented for Linux, because it relies on `SIGEV_THREAD_ID` to get the signal delivered to the same thread that timed out. > > Since `SIGALRM` is now used, the test `runtime/signal/TestSigalrm.java` now requires `vm.flagless` so it will not interfere with the compiler thread signal handlers. > > Testing: > - [ ] Github Actions > - [x] tier1, tier2 on all platforms > - [x] tier3, tier4 and Oracle internal testing on Linux fastdebug > - [x] tier1 through tier4 with `-XX:CompileTaskTimeout=60000` (one minute timeout) to see what fails (`compiler/codegen/TestAntiDependenciesHighMemUsage2.java`, `compiler/loopopts/TestMaxLoopOptsCountReached.java`, and `compiler/c2/TestScalarReplacementMaxLiveNodes.java` fail) Manuel H?ssig has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - Merge branch 'master' into JDK-8308094-timeout - Fix SIGALRM test - Add timeout functionality to compiler threads ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26023/files - new: https://git.openjdk.org/jdk/pull/26023/files/09e0e58c..5840cc2e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26023&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26023&range=00-01 Stats: 4936 lines in 244 files changed: 2913 ins; 773 del; 1250 mod Patch: https://git.openjdk.org/jdk/pull/26023.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26023/head:pull/26023 PR: https://git.openjdk.org/jdk/pull/26023 From duke at openjdk.org Tue Jul 1 09:20:43 2025 From: duke at openjdk.org (duke) Date: Tue, 1 Jul 2025 09:20:43 GMT Subject: RFR: 8358179: Performance regression in Math.cbrt [v2] In-Reply-To: References: <45l5EvxoRINI1_Ep2_snJzKNMPo4-dPXADalLN1fq1Y=.9f697a35-ee7b-4e7a-9e5e-ff33911b3b21@github.com> Message-ID: On Fri, 27 Jun 2025 01:43:16 GMT, Mohamed Issa wrote: >> The changes described below are meant to resolve the performance regression introduced by the **x86_64 cbrt** double precision floating point scalar intrinsic in #24470. >> >> 1. Check for +0, -0, +INF, -INF, and NaN before any other input values. >> 2. 
If these special values are found, return immediately with minimal modifications to the result register. >> 3. Performance testing shows the modified intrinsic improves throughput by 65.1% over the original intrinsic on average for the special values while throughput drops by 5.5% for the normal value range (-INF, -2^(-1022)], [2^(-1022), INF). >> >> The commands to run all relevant micro-benchmarks are posted below. >> >> `make test TEST="micro:CbrtPerf.CbrtPerfRanges"` >> `make test TEST="micro:CbrtPerf.CbrtPerfSpecialValues"` >> >> The results of all tests posted below were captured with an [Intel? Xeon 8488C](https://www.intel.com/content/www/us/en/products/sku/231730/intel-xeon-platinum-8480c-processor-105m-cache-2-00-ghz/specifications.html) using [OpenJDK v26-b1](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B1) as the baseline version. The term _baseline1_ refers to runs with the intrinsic enabled and _baseline2_ refers to runs with the intrinsic disabled. >> >> Each result is the mean of 8 individual runs, and the input ranges used match those from the original Java implementation. Overall, the changes provide a significant uplift over _baseline1_ except for a mild regression in the (**2^(-1022) <= |x| < INF**) input range, which is expected due to the extra checks. When comparing against _baseline2_, the modified intrinsic significantly still outperforms for the inputs (**-INF < x < INF**) that require heavy compute. However, the special value inputs that trigger fast path returns still perform better with _baseline2_. >> >> | Input range(s) | Baseline1 (ops/ms) | Change (ops/ms) | Change vs baseline1 (%) | >> | :-------------------------------------: | :-------------------: | :------------------: | :--------------------------: | >> | [-2^(-1022), 2^(-1022)] | 18470 | 20847 | +12.87 | >> | (-INF, -2^(-1022)], [2^(-1022), INF) | 210538 | 198925 | -5.52 | >> | [0] | 344990 | 627561 | +81.91 | >> | [-0] ... > > Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: > > Ensure ABS_MASK is a 128-bit memory sized location and only use equal enum for UCOMISD checks @missa-prime Your change (at version 615169d8aa679c665ac4c5ad30ea011505e503b7) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25962#issuecomment-3022902863 From mhaessig at openjdk.org Tue Jul 1 09:34:50 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 1 Jul 2025 09:34:50 GMT Subject: RFR: 8361086: JVMCIGlobals::check_jvmci_flags_are_consistent has incorrect format string In-Reply-To: References: Message-ID: <5kdHAQ86j5eDq6OgIb6Bn7HFWxgc24W8ywubudeGa-Q=.5d8b392a-de5c-49d7-a3f2-3ade541c6643@github.com> On Mon, 30 Jun 2025 16:14:08 GMT, Kim Barrett wrote: > Please review this trivial fix of a format string. The value being printed is > TieredStopAtLevel, which is of type intx, so "%zd" should be used instead of "%d". > > Testing: mach5 tier1 Looks good and trivial to me. ------------- Marked as reviewed by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/26051#pullrequestreview-2974517185 From yzheng at openjdk.org Tue Jul 1 09:38:48 2025 From: yzheng at openjdk.org (Yudi Zheng) Date: Tue, 1 Jul 2025 09:38:48 GMT Subject: RFR: 8361086: JVMCIGlobals::check_jvmci_flags_are_consistent has incorrect format string In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 16:14:08 GMT, Kim Barrett wrote: > Please review this trivial fix of a format string. 
The value being printed is > TieredStopAtLevel, which is of type intx, so "%zd" should be used instead of "%d". > > Testing: mach5 tier1 LGTM ------------- Marked as reviewed by yzheng (Committer). PR Review: https://git.openjdk.org/jdk/pull/26051#pullrequestreview-2974540441 From eastigeevich at openjdk.org Tue Jul 1 09:51:55 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 1 Jul 2025 09:51:55 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: Message-ID: On Wed, 25 Jun 2025 22:32:24 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 90 commits: > > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Update how call sites are fixed > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Fix pointer printing > - Use set_destination_mt_safe > - Print address as pointer > - Use new _metadata_size instead of _jvmci_data_size > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Only check branch distance for aarch64 and riscv > - Move far branch fix to fix_relocation_after_move > - ... and 80 more: https://git.openjdk.org/jdk/compare/f799cf18...70e4164e src/hotspot/share/code/nmethod.cpp line 1547: > 1545: CodeBuffer dst(nm_copy); > 1546: while (iter.next()) { > 1547: iter.reloc()->fix_relocation_after_move(&src, &dst, true); What if, instead of a bool parameter we introduce a function `fix_relocation_after_copy`: ```c++ virtual void Relocation::fix_relocation_after_copy(const CodeBuffer* src, CodeBuffer* dest) { fix_relocation_after_move(src, dest); } void CallRelocation::fix_relocation_after_copy(const CodeBuffer* src, CodeBuffer* dest) { address orig_addr = old_addr_for(addr(), src, dest); address callee = pd_call_destination(orig_addr); if (src->contains(callee)) { // If the original call is to an address in the src CodeBuffer (such as a stub call) // the updated call should be to the corresponding address in dest CodeBuffer ptrdiff_t offset = callee - orig_addr; callee = addr() + offset; } pd_set_call_destination(callee); } With this change we don't need to modify `relocInfo_*.cpp` files. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2177056209 From jbhateja at openjdk.org Tue Jul 1 10:13:22 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 1 Jul 2025 10:13:22 GMT Subject: RFR: 8361037: [ubsan] compiler/c2/irTests/TestFloat16ScalarOperations division by 0 Message-ID: Floating point division by zero is undefined per the C and C++ standards, but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing either an infinity or NaN value. While Java semantics defined in section 15.17.2 "Division Operator" of JLS-24 are well-defined for these constant-folding scenarios This bug fix patch fixes division by 0 error reported after integration of [JDK-8352635.](https://bugs.openjdk.org/browse/JDK-8352635) Kindly review and share your feedback. Best Regards, Jatin ------------- Commit messages: - 8361037: [ubsan] compiler/c2/irTests/TestFloat16ScalarOperations division by 0 Changes: https://git.openjdk.org/jdk/pull/26062/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26062&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361037 Stats: 11 lines in 1 file changed: 11 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26062.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26062/head:pull/26062 PR: https://git.openjdk.org/jdk/pull/26062 From eastigeevich at openjdk.org Tue Jul 1 10:19:57 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 1 Jul 2025 10:19:57 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 09:49:08 GMT, Evgeny Astigeevich wrote: >> Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 90 commits: >> >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Update how call sites are fixed >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Fix pointer printing >> - Use set_destination_mt_safe >> - Print address as pointer >> - Use new _metadata_size instead of _jvmci_data_size >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Only check branch distance for aarch64 and riscv >> - Move far branch fix to fix_relocation_after_move >> - ... and 80 more: https://git.openjdk.org/jdk/compare/f799cf18...70e4164e > > src/hotspot/share/code/nmethod.cpp line 1547: > >> 1545: CodeBuffer dst(nm_copy); >> 1546: while (iter.next()) { >> 1547: iter.reloc()->fix_relocation_after_move(&src, &dst, true); > > What if, instead of a bool parameter we introduce a function `fix_relocation_after_copy`: > ```c++ > virtual void Relocation::fix_relocation_after_copy(const CodeBuffer* src, CodeBuffer* dest) { > fix_relocation_after_move(src, dest); > } > > void CallRelocation::fix_relocation_after_copy(const CodeBuffer* src, CodeBuffer* dest) { > address orig_addr = old_addr_for(addr(), src, dest); > address callee = pd_call_destination(orig_addr); > > if (src->contains(callee)) { > // If the original call is to an address in the src CodeBuffer (such as a stub call) > // the updated call should be to the corresponding address in dest CodeBuffer > ptrdiff_t offset = callee - orig_addr; > callee = addr() + offset; > } > > pd_set_call_destination(callee); > } > > > With this change we don't need to modify `relocInfo_*.cpp` files. IMO, we might consider moving `pd_set_call_destination` to `CallRelocation` because only CallRelocation uses it. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2177119955 From shade at openjdk.org Tue Jul 1 10:53:25 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 1 Jul 2025 10:53:25 GMT Subject: RFR: 8361180: Disable CompiledDirectCall verification with -VerifyInlineCaches Message-ID: Missed the spot when doing [JDK-8360867](https://bugs.openjdk.org/browse/JDK-8360867). There is a path from GC that calls into IC verification when cleaning the caches. See `nmethod::cleanup_inline_caches_impl`. It does verification per callsite, and does the whole thing during parallel GC cleanup, which is STW at least in G1. This gets expensive for CTW scenarios. We should wrap that under the same flag introduced by [JDK-8360867](https://bugs.openjdk.org/browse/JDK-8360867). Motivational improvements: $ time CONF=linux-x86_64-server-fastdebug make test TEST=applications/ctw/modules/ # Current mainline real 3m59.274s user 68m9.663s sys 5m19.026s # This PR real 3m49.118s user 65m37.962s sys 5m15.441s ------------- Commit messages: - Fix Changes: https://git.openjdk.org/jdk/pull/26063/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26063&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361180 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26063.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26063/head:pull/26063 PR: https://git.openjdk.org/jdk/pull/26063 From mhaessig at openjdk.org Tue Jul 1 11:23:42 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 1 Jul 2025 11:23:42 GMT Subject: RFR: 8361037: [ubsan] compiler/c2/irTests/TestFloat16ScalarOperations division by 0 In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 10:08:20 GMT, Jatin Bhateja wrote: > Floating point division by zero is undefined per the C and C++ standards, but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing either an infinity or NaN value. > > While Java semantics defined in section 15.17.2 "Division Operator" of JLS-24 are well-defined for these constant-folding scenarios > > This bug fix patch fixes division by 0 error reported after integration of [JDK-8352635.](https://bugs.openjdk.org/browse/JDK-8352635) > Kindly review and share your feedback. > > Best Regards, > Jatin Hi, @jatin-bhateja. Thank you for providing this fix. I took a look at it and have a question. Otherwise, this looks good. src/hotspot/share/opto/divnode.cpp line 833: > 831: } > 832: > 833: if (g_isfinite(t1->getf()) && t2->getf() == 0.0) { Is the `g_isfinite` for `t1` really needed? If the dividend is infinite then the result is also an infinity with the appropriate sign. Does this not result in `INF / 0.0` being calculated below? This would also be undefined by the C++ standard, would it not? Since as far as I know not all s390 models implement IEEE754, perhaps it would be better to remove the `g_isfinite` to prevent the native `INF / 0.0` below. ------------- Changes requested by mhaessig (Committer). 
PR Review: https://git.openjdk.org/jdk/pull/26062#pullrequestreview-2974972341 PR Review Comment: https://git.openjdk.org/jdk/pull/26062#discussion_r2177311121 From eastigeevich at openjdk.org Tue Jul 1 11:26:52 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 1 Jul 2025 11:26:52 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: Message-ID: <73AnlXOv0T8K25DgsNdH1PkBjcBXz0f3bBYZx44LpAw=.439f5383-ffd1-44e8-9e11-4b5af9b6a278@github.com> On Wed, 25 Jun 2025 22:32:24 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 90 commits: > > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Update how call sites are fixed > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Fix pointer printing > - Use set_destination_mt_safe > - Print address as pointer > - Use new _metadata_size instead of _jvmci_data_size > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Only check branch distance for aarch64 and riscv > - Move far branch fix to fix_relocation_after_move > - ... and 80 more: https://git.openjdk.org/jdk/compare/f799cf18...70e4164e src/hotspot/share/code/nmethod.cpp line 1653: > 1651: } > 1652: } > 1653: } Do we need this code? Shouldn't missing trampolined be caught during fixing call sites? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2177325220 From eastigeevich at openjdk.org Tue Jul 1 11:40:54 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 1 Jul 2025 11:40:54 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: Message-ID: On Wed, 25 Jun 2025 22:32:24 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... 
> > Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 90 commits: > > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Update how call sites are fixed > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Fix pointer printing > - Use set_destination_mt_safe > - Print address as pointer > - Use new _metadata_size instead of _jvmci_data_size > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Only check branch distance for aarch64 and riscv > - Move far branch fix to fix_relocation_after_move > - ... and 80 more: https://git.openjdk.org/jdk/compare/f799cf18...70e4164e src/hotspot/share/code/nmethod.hpp line 172: > 170: friend class DeoptimizationScope; > 171: > 172: #define ImmutableDataReferencesCounterSize (int)sizeof(int) Macros defining an expression need to be enclosed in parentheses. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2177369434 From epeter at openjdk.org Tue Jul 1 11:56:43 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 1 Jul 2025 11:56:43 GMT Subject: RFR: 8359602: Ideal optimizations depending on input type are missed because of missing notification mechanism from CCP [v3] In-Reply-To: References: <-_MqCH6QmE-o_d7c9-aet-Cq-ptZJ6CZV6rodpDNWq0=.173e6f7a-3cfe-4791-8253-36e06d892069@github.com> Message-ID: On Mon, 30 Jun 2025 15:42:01 GMT, Beno?t Maillard wrote: >> This PR prevents some missed ideal optimizations in IGVN by notifying users of type refinements made during CCP, addressing a missed optimization that caused a verification failure with `-XX:VerifyIterativeGVN=1110`. >> >> ### Context >> During the compilation of the input program (obtained from the fuzzer, then simplified and added as a test) by C2, we end up with node `591 ModI` that takes `138 Phi` as its divisor input. An existing `Ideal` optimization is to get rid of the control input of a `ModINode` when we can prove that the divisor is never `0`. >> >> In this specific case, the type of the `PhiNode` gets refined during CCP, but the refinement fails to propagate to its users for the IGVN phase and the ideal optimization for the `ModINode` never happens. This results in a missed optimization and hits an assert in the verification phase of IGVN (when using `-XX:VerifyIterativeGVN=1110`). >> >> ![IGV screenshot](https://github.com/user-attachments/assets/5dee1ae6-9146-4115-922d-df33b7ccbd37) >> >> ### Detailed Analysis >> >> In `PhaseCCP::analyze`, we call `Value` for the `PhiNode`, which >> results in a type refinement: the range gets restricted to `int:-13957..-1191`. >> >> ```c++ >> // Pull from worklist; compute new value; push changes out. >> // This loop is the meat of CCP. >> while (worklist.size() != 0) { >> Node* n = fetch_next_node(worklist); >> DEBUG_ONLY(worklist_verify.push(n);) >> if (n->is_SafePoint()) { >> // Make sure safepoints are processed by PhaseCCP::transform even if they are >> // not reachable from the bottom. Otherwise, infinite loops would be removed. >> _root_and_safepoints.push(n); >> } >> const Type* new_type = n->Value(this); >> if (new_type != type(n)) { >> DEBUG_ONLY(verify_type(n, new_type, type(n));) >> dump_type_and_node(n, new_type); >> set_type(n, new_type); >> push_child_nodes_to_worklist(worklist, n); >> } >> if (KillPathsReachableByDeadTypeNode && n->is_Type() && new_type == Type::TOP) { >> // Keep track of Type nodes to kill CFG paths that use Type >> // nodes that become dead. 
>> _maybe_top_type_nodes.push(n); >> } >> } >> DEBUG_ONLY(verify_analyze(worklist_verify);) >> >> >> At the end of `PhaseCCP::analyze`, we obtain the following types in the side table: >> - `int` for node `591` (`ModINode`) >> - `int:-13957..-1191` for node `138` (`PhiNode`) >> >> If we call `find_node(138)->bottom_type()`, we get: >> - `int` for both nodes >> >> The... > > Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: > > Fix bad test class name Nice work @benoitmaillard ! src/hotspot/share/opto/phaseX.cpp line 3124: > 3122: n->raise_bottom_type(t); > 3123: _worklist.push(n); // n re-enters the hash table via the worklist > 3124: add_users_to_worklist(n); // if ideal or identity optimizations depend on the input type, users need to be notified Suggestion: add_users_to_worklist(n); // if Ideal or Identity optimizations depend on the input type, users need to be notified I would make them upper-case, just like the method names. ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26017#pullrequestreview-2975094882 PR Review Comment: https://git.openjdk.org/jdk/pull/26017#discussion_r2177396474 From epeter at openjdk.org Tue Jul 1 11:56:44 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 1 Jul 2025 11:56:44 GMT Subject: RFR: 8359602: Ideal optimizations depending on input type are missed because of missing notification mechanism from CCP [v2] In-Reply-To: References: <-_MqCH6QmE-o_d7c9-aet-Cq-ptZJ6CZV6rodpDNWq0=.173e6f7a-3cfe-4791-8253-36e06d892069@github.com> <3cLLB7fms3S4WgqOVeb7D_ZDRFsJ_-ca3qfALlmzFeU=.1002ac91-1e35-4499-9d88-6d1f76c955d0@github.com> <0MJe_8nA-ILWqoVG-9rzuq5Pe9xX-FG2LN3k9Cy8nqU=.d724c6cf-cb02-45c4-95a4-5bd1fef7462b@github.com> Message-ID: On Tue, 1 Jul 2025 07:07:40 GMT, Beno?t Maillard wrote: > > @benoitmaillard One more open question for me: `raise_bottom_type` only sets the node internal `_type`. But in IGVN, we do not read from `_type` but `phase->type(in(2))`. Do you know when the `phase->type(in(2))` value changes? Is that also during CCP? Before or after the `_type` is modified? > > Yes, good point, I should I have mentioned this somewhere. The `phase->type(in(2))` call uses the type array from `PhaseValues`. The type array entry is actually modified earlier, in `PhaseCCP::analyze`, right after the `Value` call. You can see the `set_type` call [here](https://github.com/benoitmaillard/jdk/blob/75de51dff6d9cc3e9764737b29b9358992b488b7/src/hotspot/share/opto/phaseX.cpp#L2765). When this happens, users are added to the (local) worklist but again it does not change our issue as only value optimizations occur in that context. Thanks for the explanation! So it seems that `CCP` and `IGVN` share the type array, right? Ah yes, it is the `Compile::_types`: 461 // Shared type array for GVN, IGVN and CCP. It maps node idx -> Type*. 462 Type_Array* _types; If the value behind `phase->type(in(2))` (the type array entry) is modified in `PhaseCCP::analyze`, right after the `Value` call, then why not do the notification there? If we did that, we would do more notification than what you now proposed (to do the notification in `PhaseCCP::transform_once` on the nodes that have a type that is different than the `bottom_type`). Are we possibly missing any important case with your approach now? Probably not, I would argue: with your approach we still notify for all live nodes that have a modified type, or are replaced with a constant. 
If we notified after every type update in `PhaseCCP::analyze`, we might notify for nodes multiple times, and we would also notify for nodes that are dead after CCP - both are unnecessary overheads. Alright, I just wanted to think this through - but it seems your approach is good :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/26017#issuecomment-3023637471 From bmaillard at openjdk.org Tue Jul 1 12:00:18 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Tue, 1 Jul 2025 12:00:18 GMT Subject: RFR: 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal Message-ID: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> This PR adds a node hash comparison after calling `Ideal` in `PhaseIterGVN::verify_Ideal_for` to introduce an additional layer of verification for missed optimizations. Previously, we relied on the return value of `Ideal`, which is expected to be `nullptr` if no transformation was done. By also checking the node's hash before and after `Ideal`, we could catch inconsistencies in the implementation or unintended modifications to the graph. Both of these may indicate missed or incomplete optimizations. ### Testing - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361144) - [x] tier1-3, plus some internal testing Thank you for reviewing! ------------- Commit messages: - 8361144: remove unintentional line break - 8361144: move hash check after return value check and use same format as unique counter check - 8361144: add check for node hash after verifying ideal Changes: https://git.openjdk.org/jdk/pull/26064/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26064&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361144 Stats: 11 lines in 1 file changed: 10 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26064.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26064/head:pull/26064 PR: https://git.openjdk.org/jdk/pull/26064 From epeter at openjdk.org Tue Jul 1 12:02:16 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 1 Jul 2025 12:02:16 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v28] In-Reply-To: References: <7r3C8BAViyHKVVJjv4w0YxfIUkfk9PmY0OEt73V_aRI=.baf51fc4-d996-44d0-a1f5-10cf6dc4de8d@github.com> Message-ID: On Thu, 12 Jun 2025 15:40:49 GMT, Roland Westrelin wrote: >> @rwestrel Let me know if you want us to run some extra testing. Christian said that you might be planning to wait until the JDK26 fork, and merge then, and then we can run testing. Up to you :) > > @eme64 in case you forgot about that one, it's ready for another round of reviews. @rwestrel I'm quite busy right now. I will soon go on vacation and travel, and I have a presentation to prepare in the next weeks. I hope I can come back to this in early August though. Feel free to ask someone else for a review, I don't want to hold this up. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/21630#issuecomment-3023679612 From eastigeevich at openjdk.org Tue Jul 1 12:07:58 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 1 Jul 2025 12:07:58 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: Message-ID: On Wed, 25 Jun 2025 22:32:24 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 90 commits: > > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Update how call sites are fixed > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Fix pointer printing > - Use set_destination_mt_safe > - Print address as pointer > - Use new _metadata_size instead of _jvmci_data_size > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Only check branch distance for aarch64 and riscv > - Move far branch fix to fix_relocation_after_move > - ... and 80 more: https://git.openjdk.org/jdk/compare/f799cf18...70e4164e test/hotspot/jtreg/vmTestbase/nsk/jvmti/NMethodRelocation/nmethodrelocation.java line 37: > 35: import jdk.test.whitebox.code.BlobType; > 36: > 37: public class nmethodrelocation extends DebugeeClass { Why is the class name not following the Java code conventions? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2177424604 From mbaesken at openjdk.org Tue Jul 1 12:28:39 2025 From: mbaesken at openjdk.org (Matthias Baesken) Date: Tue, 1 Jul 2025 12:28:39 GMT Subject: RFR: 8361037: [ubsan] compiler/c2/irTests/TestFloat16ScalarOperations division by 0 In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 10:08:20 GMT, Jatin Bhateja wrote: > Floating point division by zero is undefined per the C and C++ standards, but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing either an infinity or NaN value. > > While Java semantics defined in section 15.17.2 "Division Operator" of JLS-24 are well-defined for these constant-folding scenarios > > This bug fix patch fixes division by 0 error reported after integration of [JDK-8352635.](https://bugs.openjdk.org/browse/JDK-8352635) > Kindly review and share your feedback. > > Best Regards, > Jatin With your patch included, the test compiler/c2/irTests/TestFloat16ScalarOperations.java now passes on macOS aarch64 with ubsan enabled. 
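For reference, the Java-level results that this constant folding has to reproduce are fixed by JLS 15.17.2. The snippet below simply prints them; it is illustrative and not part of the patch or its tests:

```java
// Ordinary Java semantics for floating-point division by zero (JLS 15.17.2).
public class FpDivByZero {
    public static void main(String[] args) {
        System.out.println( 1.0f /  0.0f);                   // Infinity
        System.out.println(-1.0f /  0.0f);                   // -Infinity
        System.out.println( 1.0f / -0.0f);                   // -Infinity
        System.out.println( 0.0f /  0.0f);                   // NaN
        System.out.println(Float.NaN / 0.0f);                // NaN
        System.out.println(Float.POSITIVE_INFINITY / 0.0f);  // Infinity
    }
}
```

The compiler must produce exactly these values when folding a constant division, while the C++ code doing the folding should avoid evaluating the division natively, since that is the undefined behaviour ubsan flags.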
------------- PR Comment: https://git.openjdk.org/jdk/pull/26062#issuecomment-3023799985 From shade at openjdk.org Tue Jul 1 12:33:51 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 1 Jul 2025 12:33:51 GMT Subject: RFR: 8360557: CTW: Inline cold methods to reach more code Message-ID: We use CTW testing for making sure compilers behave well. But we compile the code that is not executed at all, and since our inlining heuristics often looks back at profiles, we end up not actually inlining all too much! This means CTW testing likely misses lots of bugs that normal code is exposed to, especially e.g. in loop optimizations. There is an intrinsic tradeoff with accepting more inilned methods in CTW: the compilation time gets significantly worse. With just accepting the cold methods we have reasonable CTW times, eating the improvements we have committed in mainline recently. And it still finds bugs. See the RFE for sample data. After this lands and CTW starts to compile cold methods, one can greatly expand the scope of the CTW testing by overriding the static inlining limits. Doing e.g. `TEST_VM_OPTS="-XX:MaxInlineSize=70 -XX:MaxInlineSize=70"` finds even more bugs. Unfortunately, the compilation times suffer so much, they are impractical to run in standard configurations, see data in RFE. We will enable some of that testing in special testing pipelines. Additional testing: - [ ] GHA - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` - [ ] Linux x86_64 server fastdebug, large CTW corpus (now failing in interesting ways) ------------- Commit messages: - Revert separate patch - Final - Proper option name and bump the limits - Fix Changes: https://git.openjdk.org/jdk/pull/26068/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26068&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8360557 Stats: 15 lines in 3 files changed: 15 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26068.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26068/head:pull/26068 PR: https://git.openjdk.org/jdk/pull/26068 From shade at openjdk.org Tue Jul 1 12:46:38 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 1 Jul 2025 12:46:38 GMT Subject: RFR: 8360557: CTW: Inline cold methods to reach more code In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 12:26:44 GMT, Aleksey Shipilev wrote: > We use CTW testing for making sure compilers behave well. But we compile the code that is not executed at all, and since our inlining heuristics often looks back at profiles, we end up not actually inlining all too much! This means CTW testing likely misses lots of bugs that normal code is exposed to, especially e.g. in loop optimizations. > > There is an intrinsic tradeoff with accepting more inilned methods in CTW: the compilation time gets significantly worse. With just accepting the cold methods we have reasonable CTW times, eating the improvements we have committed in mainline recently. And it still finds bugs. See the RFE for sample data. > > After this lands and CTW starts to compile cold methods, one can greatly expand the scope of the CTW testing by overriding the static inlining limits. Doing e.g. `TEST_VM_OPTS="-XX:MaxInlineSize=70 -XX:C1MaxInlineSize=70"` finds even more bugs. Unfortunately, the compilation times suffer so much, they are impractical to run in standard configurations, see data in RFE. We will enable some of that testing in special testing pipelines. > > Pre-empting the question: "Well, why not use -Xcomp then, and make sure it inlines well?" 
The answer is in RFE as well: Xcomp causes _a lot_ of stray compilations for JDK and CTW infra itself. For small JARs in large corpus this eats precious testing time that we would instead like to spend on deeper inlining in the actual JAR code. This also does not force us to look into how CTW works in Xcomp at all; I expect some surprises there. Feather-touching the inlining heuristic paths to just accept methods without looking at profiles looks better. > > Tobias had an idea to implement the stress randomized inlining that would expand the scope of inlining. This improvement stacks well with it. This improvement provides the base case of inlining most reasonable methods, and then allow stress infra to inline some more on top of that. > > Additional testing: > - [ ] GHA > - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > - [x] Linux x86_64 server fastdebug, large CTW corpus (now failing in interesting ways) We are on par for CTW testing time, comparing to the state a week back: # Before CTW perf improvements real 5m0.528s user 79m5.193s sys 14m16.678s # Current mainline real 3m59.274s user 68m9.663s sys 5m19.026s # This PR real 4m56.248s user 89m48.364s sys 5m24.091s ------------- PR Comment: https://git.openjdk.org/jdk/pull/26068#issuecomment-3023863192 From mbaesken at openjdk.org Tue Jul 1 12:48:19 2025 From: mbaesken at openjdk.org (Matthias Baesken) Date: Tue, 1 Jul 2025 12:48:19 GMT Subject: RFR: 8361040: compiler/codegen/TestRedundantLea.java#StringInflate fails with failed IR rules In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 14:44:03 GMT, Manuel H?ssig wrote: > `TestRedundantLea.java#StringInflate` failed on Alpine Linux because fewer `DecodeHeapOop_not_null`s than expected are generated even though the expected reduction is still present. This PR fixes this. > > Unfortunately, this fix makes the test less precise. I filed [JDK-8361045](https://bugs.openjdk.org/browse/JDK-8361045) to fix this when the IR-framework allows for it. > > Testing: > - [x] Github Actions > - [x] tier1, tier2 plus Oracle internal testing > - [x] `TestRedundantLea.java` on Alpine Linux With your patch included, the issue is gone on our Linux Alpine test machine. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26046#issuecomment-3023856713 From mhaessig at openjdk.org Tue Jul 1 12:48:19 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 1 Jul 2025 12:48:19 GMT Subject: RFR: 8361040: compiler/codegen/TestRedundantLea.java#StringInflate fails with failed IR rules In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 12:42:05 GMT, Matthias Baesken wrote: >> `TestRedundantLea.java#StringInflate` failed on Alpine Linux because fewer `DecodeHeapOop_not_null`s than expected are generated even though the expected reduction is still present. This PR fixes this. >> >> Unfortunately, this fix makes the test less precise. I filed [JDK-8361045](https://bugs.openjdk.org/browse/JDK-8361045) to fix this when the IR-framework allows for it. >> >> Testing: >> - [x] Github Actions >> - [x] tier1, tier2 plus Oracle internal testing >> - [x] `TestRedundantLea.java` on Alpine Linux > > With your patch included, the issue is gone on our Linux Alpine test machine. @MBaesken, thank you for testing. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26046#issuecomment-3023862806 From mhaessig at openjdk.org Tue Jul 1 12:48:19 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 1 Jul 2025 12:48:19 GMT Subject: RFR: 8361040: compiler/codegen/TestRedundantLea.java#StringInflate fails with failed IR rules Message-ID: `TestRedundantLea.java#StringInflate` failed on Alpine Linux because fewer `DecodeHeapOop_not_null`s than expected are generated even though the expected reduction is still present. This PR fixes this. Unfortunately, this fix makes the test less precise. I filed [JDK-8361045](https://bugs.openjdk.org/browse/JDK-8361045) to fix this when the IR-framework allows for it. Testing: - [x] Github Actions - [x] tier1, tier2 plus Oracle internal testing - [x] `TestRedundantLea.java` on Alpine Linux ------------- Commit messages: - Fix test Changes: https://git.openjdk.org/jdk/pull/26046/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26046&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361040 Stats: 12 lines in 1 file changed: 2 ins; 6 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/26046.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26046/head:pull/26046 PR: https://git.openjdk.org/jdk/pull/26046 From bmaillard at openjdk.org Tue Jul 1 12:58:29 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Tue, 1 Jul 2025 12:58:29 GMT Subject: RFR: 8359602: Ideal optimizations depending on input type are missed because of missing notification mechanism from CCP [v4] In-Reply-To: <-_MqCH6QmE-o_d7c9-aet-Cq-ptZJ6CZV6rodpDNWq0=.173e6f7a-3cfe-4791-8253-36e06d892069@github.com> References: <-_MqCH6QmE-o_d7c9-aet-Cq-ptZJ6CZV6rodpDNWq0=.173e6f7a-3cfe-4791-8253-36e06d892069@github.com> Message-ID: > This PR prevents some missed ideal optimizations in IGVN by notifying users of type refinements made during CCP, addressing a missed optimization that caused a verification failure with `-XX:VerifyIterativeGVN=1110`. > > ### Context > During the compilation of the input program (obtained from the fuzzer, then simplified and added as a test) by C2, we end up with node `591 ModI` that takes `138 Phi` as its divisor input. An existing `Ideal` optimization is to get rid of the control input of a `ModINode` when we can prove that the divisor is never `0`. > > In this specific case, the type of the `PhiNode` gets refined during CCP, but the refinement fails to propagate to its users for the IGVN phase and the ideal optimization for the `ModINode` never happens. This results in a missed optimization and hits an assert in the verification phase of IGVN (when using `-XX:VerifyIterativeGVN=1110`). > > ![IGV screenshot](https://github.com/user-attachments/assets/5dee1ae6-9146-4115-922d-df33b7ccbd37) > > ### Detailed Analysis > > In `PhaseCCP::analyze`, we call `Value` for the `PhiNode`, which > results in a type refinement: the range gets restricted to `int:-13957..-1191`. > > ```c++ > // Pull from worklist; compute new value; push changes out. > // This loop is the meat of CCP. > while (worklist.size() != 0) { > Node* n = fetch_next_node(worklist); > DEBUG_ONLY(worklist_verify.push(n);) > if (n->is_SafePoint()) { > // Make sure safepoints are processed by PhaseCCP::transform even if they are > // not reachable from the bottom. Otherwise, infinite loops would be removed. 
> _root_and_safepoints.push(n); > } > const Type* new_type = n->Value(this); > if (new_type != type(n)) { > DEBUG_ONLY(verify_type(n, new_type, type(n));) > dump_type_and_node(n, new_type); > set_type(n, new_type); > push_child_nodes_to_worklist(worklist, n); > } > if (KillPathsReachableByDeadTypeNode && n->is_Type() && new_type == Type::TOP) { > // Keep track of Type nodes to kill CFG paths that use Type > // nodes that become dead. > _maybe_top_type_nodes.push(n); > } > } > DEBUG_ONLY(verify_analyze(worklist_verify);) > > > At the end of `PhaseCCP::analyze`, we obtain the following types in the side table: > - `int` for node `591` (`ModINode`) > - `int:-13957..-1191` for node `138` (`PhiNode`) > > If we call `find_node(138)->bottom_type()`, we get: > - `int` for both nodes > > There is no progress on the type of `ModINode` during CCP, because `ModINode::Value` > is not able to... Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: 8359602: update case for consistency Co-authored-by: Emanuel Peter ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26017/files - new: https://git.openjdk.org/jdk/pull/26017/files/75de51df..005b2825 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26017&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26017&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26017.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26017/head:pull/26017 PR: https://git.openjdk.org/jdk/pull/26017 From bmaillard at openjdk.org Tue Jul 1 13:18:40 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Tue, 1 Jul 2025 13:18:40 GMT Subject: RFR: 8359602: Ideal optimizations depending on input type are missed because of missing notification mechanism from CCP [v2] In-Reply-To: References: <-_MqCH6QmE-o_d7c9-aet-Cq-ptZJ6CZV6rodpDNWq0=.173e6f7a-3cfe-4791-8253-36e06d892069@github.com> <3cLLB7fms3S4WgqOVeb7D_ZDRFsJ_-ca3qfALlmzFeU=.1002ac91-1e35-4499-9d88-6d1f76c955d0@github.com> <0MJe_8nA-ILWqoVG-9rzuq5Pe9xX-FG2LN3k9Cy8nqU=.d724c6cf-cb02-45c4-95a4-5bd1fef7462b@github.com> Message-ID: On Tue, 1 Jul 2025 07:07:40 GMT, Beno?t Maillard wrote: >> @benoitmaillard Very nice work, and great description :) >> >>>Did you check if this allows enabling any of the other disabled verifications from [JDK-8347273](https://bugs.openjdk.org/browse/JDK-8347273)? >> >> That may be a lot of work. Not sure if it is worth checking all of them now. @TobiHartmann how much should he invest in this now? An alternative is just tackling all the other cases later. What do you think? >> >> @benoitmaillard One more open question for me: `raise_bottom_type` only sets the node internal `_type`. But in IGVN, we do not read from `_type` but `phase->type(in(2))`. Do you know when the `phase->type(in(2))` value changes? Is that also during CCP? Before or after the `_type` is modified? > >> @benoitmaillard Very nice work, and great description :) > > Thank you! @eme64 > >> > Did you check if this allows enabling any of the other disabled verifications from [JDK-8347273](https://bugs.openjdk.org/browse/JDK-8347273)? >> >> That may be a lot of work. Not sure if it is worth checking all of them now. @TobiHartmann how much should he invest in this now? An alternative is just tackling all the other cases later. What do you think? > > I have started to take a look at this and it seems that there are a lot of cases to check indeed. 
> >> @benoitmaillard One more open question for me: `raise_bottom_type` only sets the node internal `_type`. But in IGVN, we do not read from `_type` but `phase->type(in(2))`. Do you know when the `phase->type(in(2))` value changes? Is that also during CCP? Before or after the `_type` is modified? > > Yes, good point, I should I have mentioned this somewhere. The `phase->type(in(2))` call uses the type array from `PhaseValues`. The type array entry is actually modified earlier, in `PhaseCCP::analyze`, right after the `Value` call. You can see the `set_type` call [here](https://github.com/benoitmaillard/jdk/blob/75de51dff6d9cc3e9764737b29b9358992b488b7/src/hotspot/share/opto/phaseX.cpp#L2765). When this happens, users are added to the (local) worklist but again it does not change our issue as only value optimizations occur in that context. > > > @benoitmaillard One more open question for me: `raise_bottom_type` only sets the node internal `_type`. But in IGVN, we do not read from `_type` but `phase->type(in(2))`. Do you know when the `phase->type(in(2))` value changes? Is that also during CCP? Before or after the `_type` is modified? > > > > > > Yes, good point, I should I have mentioned this somewhere. The `phase->type(in(2))` call uses the type array from `PhaseValues`. The type array entry is actually modified earlier, in `PhaseCCP::analyze`, right after the `Value` call. You can see the `set_type` call [here](https://github.com/benoitmaillard/jdk/blob/75de51dff6d9cc3e9764737b29b9358992b488b7/src/hotspot/share/opto/phaseX.cpp#L2765). When this happens, users are added to the (local) worklist but again it does not change our issue as only value optimizations occur in that context. > > Thanks for the explanation! So it seems that `CCP` and `IGVN` share the type array, right? Ah yes, it is the `Compile::_types`: > > ``` > 461 // Shared type array for GVN, IGVN and CCP. It maps node idx -> Type*. > 462 Type_Array* _types; > ``` > > If the value behind `phase->type(in(2))` (the type array entry) is modified in `PhaseCCP::analyze`, right after the `Value` call, then why not do the notification there? If we did that, we would do more notification than what you now proposed (to do the notification in `PhaseCCP::transform_once` on the nodes that have a type that is different than the `bottom_type`). Are we possibly missing any important case with your approach now? Probably not, I would argue: with your approach we still notify for all live nodes that have a modified type, or are replaced with a constant. If we notified after every type update in `PhaseCCP::analyze`, we might notify for nodes multiple times, and we would also notify for nodes that are dead after CCP - both are unnecessary overheads. Alright, I just wanted to think this through - but it seems your approach is good :) I also considered doing it there in `PhaseCCP::analyze`, but I reached the same conclusion. Thanks for your help! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26017#issuecomment-3023978823 From snatarajan at openjdk.org Tue Jul 1 13:27:47 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Tue, 1 Jul 2025 13:27:47 GMT Subject: RFR: 8325478: Restructure the macro expansion compiler phase to not include macro elimination [v7] In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 16:24:03 GMT, Vladimir Kozlov wrote: >> Saranya Natarajan has refreshed the contents of this pull request, and previous commits have been removed. 
The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: >> >> merge with master >> Merge branch 'master' of https://git.openjdk.org/jdk into JDK-8325478 > > src/hotspot/share/opto/compile.cpp line 2533: > >> 2531: { >> 2532: TracePhase tp(_t_macroExpand); >> 2533: print_method(PHASE_BEFORE_MACRO_EXPANSION, 3); > > Should we move it before `mex.expand_macro_nodes()` call? Moving this would break the assumption of needing a `BEFORE_MACRO_ELIMINATION` as explained in the above reply. One way to go about this would be to include a `BEFORE_MACRO_ELIMINATION` phase and remove the `PHASE_BEFORE_MACRO_EXPANSION` phase as this is only place where it is used. Would this be a reasonable fix ? > src/hotspot/share/opto/phasetype.hpp line 94: > >> 92: flags(AFTER_LOOP_OPTS, "After Loop Optimizations") \ >> 93: flags(AFTER_MERGE_STORES, "After Merge Stores") \ >> 94: flags(AFTER_MACRO_ELIMINATION_STEP, "After Macro Elimination Step") \ > > What is the reason to not have `BEFORE_MACRO_ELIMINATION`? The two main reasons for not having a `BEFORE_MACRO_ELIMINATION` are as follows: - There is a dump in line 2426 (`print_method(PHASE_ITER_GVN_AFTER_EA, 2)`) before we call `mexp.eliminate_macro_nodes` which performs the functionality of having a `BEFORE_MACRO_ELIMINATION` for phase dump. - There is dump in line 2533 (`print_method(PHASE_BEFORE_MACRO_EXPANSION, 3)`) before eliminating macro nodes which performs the similar function. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25682#discussion_r2177603003 PR Review Comment: https://git.openjdk.org/jdk/pull/25682#discussion_r2177602894 From jbhateja at openjdk.org Tue Jul 1 13:28:21 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 1 Jul 2025 13:28:21 GMT Subject: RFR: 8361037: [ubsan] compiler/c2/irTests/TestFloat16ScalarOperations division by 0 [v2] In-Reply-To: References: Message-ID: > Floating point division by zero is undefined per the C and C++ standards, but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing either an infinity or NaN value. > > While Java semantics defined in section 15.17.2 "Division Operator" of JLS-24 are well-defined for these constant-folding scenarios > > This bug fix patch fixes division by 0 error reported after integration of [JDK-8352635.](https://bugs.openjdk.org/browse/JDK-8352635) > Kindly review and share your feedback. 
> > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Review comments resolution ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26062/files - new: https://git.openjdk.org/jdk/pull/26062/files/bf78fbe6..d39c76f4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26062&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26062&range=00-01 Stats: 8 lines in 1 file changed: 4 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/26062.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26062/head:pull/26062 PR: https://git.openjdk.org/jdk/pull/26062 From jbhateja at openjdk.org Tue Jul 1 13:28:22 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 1 Jul 2025 13:28:22 GMT Subject: RFR: 8361037: [ubsan] compiler/c2/irTests/TestFloat16ScalarOperations division by 0 [v2] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 11:19:04 GMT, Manuel H?ssig wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Review comments resolution > > src/hotspot/share/opto/divnode.cpp line 833: > >> 831: } >> 832: >> 833: if (g_isfinite(t1->getf()) && t2->getf() == 0.0) { > > Is the `g_isfinite` for `t1` really needed? If the dividend is infinite then the result is also an infinity with the appropriate sign. Does this not result in `INF / 0.0` being calculated below? This would also be undefined by the C++ standard, would it not? Since as far as I know not all s390 models implement IEEE754, perhaps it would be better to remove the `g_isfinite` to prevent the native `INF / 0.0` below. As per C++ standard section 7.6.5 (expr.mul), behavior is undefined only if the second operand is 0.0. In all other situations, we can expect a standard-compliant C++ compiler to generate code following IEEE 754 semantics, irrespective of target floating point model, but Java semantics expect to return a NaN value if either of the operands is a NaN. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26062#discussion_r2177604366 From jbhateja at openjdk.org Tue Jul 1 13:36:20 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 1 Jul 2025 13:36:20 GMT Subject: RFR: 8361037: [ubsan] compiler/c2/irTests/TestFloat16ScalarOperations division by 0 [v3] In-Reply-To: References: Message-ID: <6cWhCvx8g-Gx4VoBHW1wA7atsa_Eq5wBhkDolUbP_X0=.31f8e688-7401-4f81-9b50-46b1997e96b5@github.com> > Floating point division by zero is undefined per the C and C++ standards, but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing either an infinity or NaN value. > > While Java semantics defined in section 15.17.2 "Division Operator" of JLS-24 are well-defined for these constant-folding scenarios > > This bug fix patch fixes division by 0 error reported after integration of [JDK-8352635.](https://bugs.openjdk.org/browse/JDK-8352635) > Kindly review and share your feedback. 
> > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Adding comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26062/files - new: https://git.openjdk.org/jdk/pull/26062/files/d39c76f4..0038654e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26062&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26062&range=01-02 Stats: 8 lines in 1 file changed: 8 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26062.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26062/head:pull/26062 PR: https://git.openjdk.org/jdk/pull/26062 From galder at openjdk.org Tue Jul 1 13:47:38 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Tue, 1 Jul 2025 13:47:38 GMT Subject: RFR: 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal In-Reply-To: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> References: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> Message-ID: On Tue, 1 Jul 2025 11:35:06 GMT, Beno?t Maillard wrote: > This PR adds a node hash comparison after calling `Ideal` in `PhaseIterGVN::verify_Ideal_for` to introduce an additional layer of verification for missed optimizations. Previously, we relied on the return value of `Ideal`, which is expected to be `nullptr` if no transformation was done. > > By also checking the node's hash before and after `Ideal`, we could catch inconsistencies in the implementation or unintended modifications to the graph. Both of these may indicate missed or incomplete optimizations. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361144) > - [x] tier1-3, plus some internal testing > > Thank you for reviewing! Have you considered adding a test for this? Is that feasible? ------------- PR Review: https://git.openjdk.org/jdk/pull/26064#pullrequestreview-2975520753 From eastigeevich at openjdk.org Tue Jul 1 15:33:53 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 1 Jul 2025 15:33:53 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 Message-ID: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. Release builds might not generate needed debug info. This PR adds a requirement for the test to be run on debug builds only. Tested: - Fastdebug: test passed - Slowdebug: test passed. - Release: test skipped. 
------------- Commit messages: - 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 Changes: https://git.openjdk.org/jdk/pull/26072/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8360936 Stats: 3 lines in 2 files changed: 1 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26072.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26072/head:pull/26072 PR: https://git.openjdk.org/jdk/pull/26072 From missa at openjdk.org Tue Jul 1 15:37:47 2025 From: missa at openjdk.org (Mohamed Issa) Date: Tue, 1 Jul 2025 15:37:47 GMT Subject: Integrated: 8358179: Performance regression in Math.cbrt In-Reply-To: <45l5EvxoRINI1_Ep2_snJzKNMPo4-dPXADalLN1fq1Y=.9f697a35-ee7b-4e7a-9e5e-ff33911b3b21@github.com> References: <45l5EvxoRINI1_Ep2_snJzKNMPo4-dPXADalLN1fq1Y=.9f697a35-ee7b-4e7a-9e5e-ff33911b3b21@github.com> Message-ID: <12bHfivFgRF2s-Sr0SZY6DIywI30LQ63uedYzsncO0A=.ba272456-15df-493b-8247-e38a67796968@github.com> On Tue, 24 Jun 2025 22:33:56 GMT, Mohamed Issa wrote: > The changes described below are meant to resolve the performance regression introduced by the **x86_64 cbrt** double precision floating point scalar intrinsic in #24470. > > 1. Check for +0, -0, +INF, -INF, and NaN before any other input values. > 2. If these special values are found, return immediately with minimal modifications to the result register. > 3. Performance testing shows the modified intrinsic improves throughput by 65.1% over the original intrinsic on average for the special values while throughput drops by 5.5% for the normal value range (-INF, -2^(-1022)], [2^(-1022), INF). > > The commands to run all relevant micro-benchmarks are posted below. > > `make test TEST="micro:CbrtPerf.CbrtPerfRanges"` > `make test TEST="micro:CbrtPerf.CbrtPerfSpecialValues"` > > The results of all tests posted below were captured with an [Intel? Xeon 8488C](https://www.intel.com/content/www/us/en/products/sku/231730/intel-xeon-platinum-8480c-processor-105m-cache-2-00-ghz/specifications.html) using [OpenJDK v26-b1](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B1) as the baseline version. The term _baseline1_ refers to runs with the intrinsic enabled and _baseline2_ refers to runs with the intrinsic disabled. > > Each result is the mean of 8 individual runs, and the input ranges used match those from the original Java implementation. Overall, the changes provide a significant uplift over _baseline1_ except for a mild regression in the (**2^(-1022) <= |x| < INF**) input range, which is expected due to the extra checks. When comparing against _baseline2_, the modified intrinsic significantly still outperforms for the inputs (**-INF < x < INF**) that require heavy compute. However, the special value inputs that trigger fast path returns still perform better with _baseline2_. > > | Input range(s) | Baseline1 (ops/ms) | Change (ops/ms) | Change vs baseline1 (%) | > | :-------------------------------------: | :-------------------: | :------------------: | :--------------------------: | > | [-2^(-1022), 2^(-1022)] | 18470 | 20847 | +12.87 | > | (-INF, -2^(-1022)], [2^(-1022), INF) | 210538 | 198925 | -5.52 | > | [0] | 344990 | 627561 | +81.91 | > | [-0] | 291... This pull request has now been integrated. 
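To make the special-value fast path concrete, here is a small illustrative sketch in plain C++, with std::cbrt standing in for the heavy polynomial path (the real change is x86_64 macro-assembler code, not this function):

```c++
#include <cmath>

// For +/-0.0, +/-Inf and NaN, cbrt(x) == x, so the special values can be
// returned with minimal changes to the result register, skipping the
// expensive polynomial evaluation entirely.
static double cbrt_with_special_value_fast_path(double x) {
  if (x == 0.0 || std::isinf(x) || std::isnan(x)) {
    return x;            // +0 -> +0, -0 -> -0, +Inf -> +Inf, -Inf -> -Inf, NaN -> NaN
  }
  return std::cbrt(x);   // normal-range inputs take the heavy compute path
}
```

This is why the special-value inputs gain so much while the normal range only pays for the extra compares up front.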
Changeset: 38f59f84 Author: Mohamed Issa Committer: Sandhya Viswanathan URL: https://git.openjdk.org/jdk/commit/38f59f84c98dfd974eec0c05541b2138b149def7 Stats: 50 lines in 1 file changed: 11 ins; 36 del; 3 mod 8358179: Performance regression in Math.cbrt Reviewed-by: sviswanathan, sparasa, epeter ------------- PR: https://git.openjdk.org/jdk/pull/25962 From sviswanathan at openjdk.org Tue Jul 1 15:37:46 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 1 Jul 2025 15:37:46 GMT Subject: RFR: 8358179: Performance regression in Math.cbrt In-Reply-To: References: <45l5EvxoRINI1_Ep2_snJzKNMPo4-dPXADalLN1fq1Y=.9f697a35-ee7b-4e7a-9e5e-ff33911b3b21@github.com> Message-ID: On Mon, 30 Jun 2025 05:51:58 GMT, Emanuel Peter wrote: >>> I'll hold off with approval until someone else who is more knowledgeable has reviewed first. But feel free to ping me for a second review. >> >> @eme64 Second review with the latest changes? > > @missa-prime The patch still looks good, though I ran testing again because of the new changes. Should complete in about 24h. Thanks a lot @eme64. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25962#issuecomment-3024541704 From shade at openjdk.org Tue Jul 1 15:39:43 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 1 Jul 2025 15:39:43 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 15:29:10 GMT, Evgeny Astigeevich wrote: > Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. Release builds might not generate needed debug info. > > This PR adds a requirement for the test to be run on debug builds only. > > Tested: > - Fastdebug: test passed > - Slowdebug: test passed. > - Release: test skipped. Looks okay, but I am confused why the test did not fail before JDK-8359435? test/hotspot/jtreg/compiler/onSpinWait/TestOnSpinWaitAArch64.java line 32: > 30: * @requires vm.flagless > 31: * @requires os.arch=="aarch64" > 32: * @requires vm.debug==true Can be just `@requires vm.debug`. ------------- PR Review: https://git.openjdk.org/jdk/pull/26072#pullrequestreview-2975983374 PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2177921439 From phh at openjdk.org Tue Jul 1 15:41:41 2025 From: phh at openjdk.org (Paul Hohensee) Date: Tue, 1 Jul 2025 15:41:41 GMT Subject: RFR: 8358183: [JVMCI] crash accessing nmethod::jvmci_name in CodeCache::aggregate In-Reply-To: References: Message-ID: On Tue, 3 Jun 2025 06:39:18 GMT, Boris Ulasevich wrote: > This change addresses an intermittent crash in CompileBroker::print_heapinfo() when accessing JVMCI metadata after a CodeBlob::purge(). > > The issue is a regression after: > - JDK-8343789: JVMCI metadata was moved from nmethod into a separate blob. > - JDK-8352112: CodeBlob::purge() was updated to set _mutable_data to blob_end(). > > The change zeroes out _mutable_data_size, _relocation_size, and _metadata_size in purge() so that after purge jvmci_data_size() returns 0 and CompileBroker::print_heapinfo() won?t touch an invalid _metadata. Marked as reviewed by phh (Reviewer). 
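As an aside, a toy model of the purge() idea described above, with hypothetical names that follow the description rather than the real CodeBlob class:

```c++
#include <cstdint>

// Toy stand-in, for illustration only.
struct ToyCodeBlob {
  uint8_t* _blob_end;          // end of the immutable part of the blob
  uint8_t* _mutable_data;
  int      _mutable_data_size;
  int      _relocation_size;
  int      _metadata_size;

  void purge() {
    _mutable_data      = _blob_end;  // sentinel value, per JDK-8352112
    _mutable_data_size = 0;          // size queries now report "nothing here"
    _relocation_size   = 0;
    _metadata_size     = 0;          // so jvmci_data_size() below returns 0
  }

  // Simplified: readers such as print_heapinfo() consult this before touching
  // the (already freed) mutable data.
  int jvmci_data_size() const { return _metadata_size; }
};
```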
------------- PR Review: https://git.openjdk.org/jdk/pull/25608#pullrequestreview-2975990062 From mablakatov at openjdk.org Tue Jul 1 15:48:00 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Tue, 1 Jul 2025 15:48:00 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v5] In-Reply-To: References: Message-ID: > Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. > > Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still. > > The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reducion micro-benchmarks. > > Benchmarks results: > > Neoverse-V1 (SVE 256-bit) > > Benchmark (size) Mode master PR Units > ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms > ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms > IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms > LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms > FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms > DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms > > > Fujitsu A64FX (SVE 512-bit): > > Benchmark (size) Mode master PR Units > ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms > ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms > IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms > LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms > FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms > DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: fixup: remove undefined insts from aarch64-asmtest.py ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23181/files - new: https://git.openjdk.org/jdk/pull/23181/files/025d5166..df09ab65 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=03-04 Stats: 30 lines in 2 files changed: 0 ins; 9 del; 21 mod Patch: https://git.openjdk.org/jdk/pull/23181.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23181/head:pull/23181 PR: https://git.openjdk.org/jdk/pull/23181 From kvn at openjdk.org Tue Jul 1 15:48:42 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 1 Jul 2025 15:48:42 GMT Subject: RFR: 8361101: AOTCodeAddressTable::_stubs_addr not initialized/freed properly In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 19:45:49 GMT, Ashutosh Mehra wrote: > Please reivew this patch to fix initialization and freeing of `AOTCodeAddressTable::_stubs_addr`. Changes are trivial I missed that this is for mainline. Approved. ------------- Marked as reviewed by kvn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26053#pullrequestreview-2976010588 From kvn at openjdk.org Tue Jul 1 15:52:37 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 1 Jul 2025 15:52:37 GMT Subject: RFR: 8361101: AOTCodeAddressTable::_stubs_addr not initialized/freed properly In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 19:45:49 GMT, Ashutosh Mehra wrote: > Please reivew this patch to fix initialization and freeing of `AOTCodeAddressTable::_stubs_addr`. Changes are trivial Yes, it is trivial. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26053#issuecomment-3024597730 From kvn at openjdk.org Tue Jul 1 15:59:38 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 1 Jul 2025 15:59:38 GMT Subject: RFR: 8361180: Disable CompiledDirectCall verification with -VerifyInlineCaches In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 10:47:40 GMT, Aleksey Shipilev wrote: > Missed the spot when doing [JDK-8360867](https://bugs.openjdk.org/browse/JDK-8360867). There is a path from GC that calls into IC verification when cleaning the caches. See `nmethod::cleanup_inline_caches_impl`. It does verification per callsite, and does the whole thing during parallel GC cleanup, which is STW at least in G1. This gets expensive for CTW scenarios. We should wrap that under the same flag introduced by [JDK-8360867](https://bugs.openjdk.org/browse/JDK-8360867). > > Motivational improvements: > > > $ time CONF=linux-x86_64-server-fastdebug make test TEST=applications/ctw/modules/ > > # Current mainline > real 3m59.274s > user 68m9.663s > sys 5m19.026s > > # This PR > real 3m49.118s > user 65m37.962s > sys 5m15.441s Trivial. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26063#pullrequestreview-2976063372 From shade at openjdk.org Tue Jul 1 15:59:39 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 1 Jul 2025 15:59:39 GMT Subject: RFR: 8361101: AOTCodeAddressTable::_stubs_addr not initialized/freed properly In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 19:45:49 GMT, Ashutosh Mehra wrote: > Please reivew this patch to fix initialization and freeing of `AOTCodeAddressTable::_stubs_addr`. Changes are trivial Marked as reviewed by shade (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26053#pullrequestreview-2976066699 From eastigeevich at openjdk.org Tue Jul 1 16:05:07 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 1 Jul 2025 16:05:07 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v2] In-Reply-To: References: Message-ID: > Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. Release builds might not generate needed debug info. > > This PR adds a requirement for the test to be run on debug builds only. > > Tested: > - Fastdebug: test passed > - Slowdebug: test passed. > - Release: test skipped. 
Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: Simplify requirement for debug build ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26072/files - new: https://git.openjdk.org/jdk/pull/26072/files/b2ba0a92..e91036bc Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26072.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26072/head:pull/26072 PR: https://git.openjdk.org/jdk/pull/26072 From kvn at openjdk.org Tue Jul 1 16:06:39 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 1 Jul 2025 16:06:39 GMT Subject: RFR: 8360557: CTW: Inline cold methods to reach more code In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 12:26:44 GMT, Aleksey Shipilev wrote: > We use CTW testing for making sure compilers behave well. But we compile the code that is not executed at all, and since our inlining heuristics often looks back at profiles, we end up not actually inlining all too much! This means CTW testing likely misses lots of bugs that normal code is exposed to, especially e.g. in loop optimizations. > > There is an intrinsic tradeoff with accepting more inilned methods in CTW: the compilation time gets significantly worse. With just accepting the cold methods we have reasonable CTW times, eating the improvements we have committed in mainline recently. And it still finds bugs. See the RFE for sample data. > > After this lands and CTW starts to compile cold methods, one can greatly expand the scope of the CTW testing by overriding the static inlining limits. Doing e.g. `TEST_VM_OPTS="-XX:MaxInlineSize=70 -XX:C1MaxInlineSize=70"` finds even more bugs. Unfortunately, the compilation times suffer so much, they are impractical to run in standard configurations, see data in RFE. We will enable some of that testing in special testing pipelines. > > Pre-empting the question: "Well, why not use -Xcomp then, and make sure it inlines well?" The answer is in RFE as well: Xcomp causes _a lot_ of stray compilations for JDK and CTW infra itself. For small JARs in large corpus this eats precious testing time that we would instead like to spend on deeper inlining in the actual JAR code. This also does not force us to look into how CTW works in Xcomp at all; I expect some surprises there. Feather-touching the inlining heuristic paths to just accept methods without looking at profiles looks better. > > Tobias had an idea to implement the stress randomized inlining that would expand the scope of inlining. This improvement stacks well with it. This improvement provides the base case of inlining most reasonable methods, and then allow stress infra to inline some more on top of that. > > Additional testing: > - [x] GHA > - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > - [x] Linux x86_64 server fastdebug, large CTW corpus (now failing in interesting ways) This has to be tested by us to make sure we clean up all issues this change find. 
------------- PR Review: https://git.openjdk.org/jdk/pull/26068#pullrequestreview-2976094320 From mablakatov at openjdk.org Tue Jul 1 16:10:49 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Tue, 1 Jul 2025 16:10:49 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v4] In-Reply-To: <3sWLk_sAMLtcvRUjXk9hYe-K2MBQl9fH2Qg0MF7lwDk=.b8867d51-e822-43c0-93ab-58228c6eb1d5@github.com> References: <3sWLk_sAMLtcvRUjXk9hYe-K2MBQl9fH2Qg0MF7lwDk=.b8867d51-e822-43c0-93ab-58228c6eb1d5@github.com> Message-ID: <19rf4A0bxc4BstRmLivGkoCOm7Qa7YD6z1VJHJivCtg=.4a643c7b-4e79-4f37-b230-7231df3c68a8@github.com> On Tue, 1 Jul 2025 06:57:10 GMT, Xiaohong Gong wrote: >> Mikhail Ablakatov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: >> >> - cleanup: address nits, rename several symbols >> - cleanup: remove unreferenced definitions >> - Address review comments. >> >> - fixup: disable FP mul reduction auto-vectorization for all targets >> - fixup: add a tmp vReg to reduce_mul_integral_gt128b and >> reduce_non_strict_order_mul_fp_gt128bto keep vsrc unmodified >> - cleanup: replace a complex lambda in the above methods with a loop >> - cleanup: rename symbols to follow the existing naming convention >> - cleanup: add asserts to SVE only instructions >> - split mul FP reduction instructions into strictly-ordered (default) >> and explicitly non strictly-ordered >> - remove redundant conditions in TestVectorFPReduction.java >> >> Benchmarks results: >> >> Neoverse-V1 (SVE 256-bit) >> >> | Benchmark | Before | After | Units | Diff | >> |---------------------------|----------|----------|--------|-------| >> | ByteMaxVector.MULLanes | 619.156 | 9884.578 | ops/ms | 1496% | >> | DoubleMaxVector.MULLanes | 184.693 | 2712.051 | ops/ms | 1368% | >> | FloatMaxVector.MULLanes | 277.818 | 3388.038 | ops/ms | 1119% | >> | IntMaxVector.MULLanes | 371.225 | 4765.434 | ops/ms | 1183% | >> | LongMaxVector.MULLanes | 205.149 | 2672.975 | ops/ms | 1203% | >> | ShortMaxVector.MULLanes | 472.804 | 5122.917 | ops/ms | 984% | >> - Merge branch 'master' into 8343689-rebase >> - fixup: don't modify the value in vsrc >> >> Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this >> change, the result of recursive folding is held in vtmp1. To be able to >> pass this intermediate result to reduce_mul_integral_le128b(), we would >> have to use another temporary FloatRegister, as vtmp1 would essentially >> act as vsrc. It's possible to get around this however: >> reduce_mul_integral_le128b() is modified so it's possible to pass >> matching vsrc and vtmp2 arguments. By doing this, we save ourselves a >> temporary register in rules that match to reduce_mul_integral_gt128b(). >> - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formating >> - Use EXT instead of COMPACT to split a vector into two halves >> >> Benchmarks results: >> >> Neoverse-V1 (SVE 256-bit) >> >> Benchmark (size) Mode master ... > > src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2097: > >> 2095: sve_movprfx(vtmp1, vsrc); // copy >> 2096: sve_ext(vtmp1, vtmp1, vector_length_in_bytes / 2); // swap halves >> 2097: sve_mul(vtmp1, elemType_to_regVariant(bt), pgtmp, vsrc); // multiply halves > >> sve_mul(vtmp1, elemType_to_regVariant(bt), pgtmp, vsrc); > > Can we use `ptrue` instread of `pgtmp` here? The higher bits can be computed, but they have not influences to the final results, right? Thanks! 
For some reason I thought that we don't have a dedicated predicate register for that. > src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2106: > >> 2104: sve_mul(vtmp1, elemType_to_regVariant(bt), pgtmp, vtmp2); // multiply halves >> 2105: vector_length_in_bytes = vector_length_in_bytes / 2; >> 2106: vector_length = vector_length / 2; > > I guess you want to update the `pgtmp` with new `vector_length`? But seems the code is missing. Anyway, maybe the it's not necessary to generate a predicate as I commented above. It isn't exactly necessary similarly to how we can always use `ptrue` here. But yeah, I'll just remove it following the suggestion above. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2178009839 PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2178007165 From mchevalier at openjdk.org Tue Jul 1 16:14:00 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 1 Jul 2025 16:14:00 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v2] In-Reply-To: References: Message-ID: <1cFRkcs5JmgnbWEIaEoT8I9RiUtNxgKieAdkSB2Fgmc=.1d97b5c4-b6ef-43c6-b721-1e52eee19d3a@github.com> > When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 > > This is enforced by restoring the old state, like in > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 > > That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: > > ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) > > > Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. > > Another situation is somewhat worse, when happening during parsing. It can lead to such cases: > > ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) > > The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? > > This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. 
But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 > > And here there is the challenge: > - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) > - we can't really change the pointer, just the content > -... Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: Remove useless loop ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25936/files - new: https://git.openjdk.org/jdk/pull/25936/files/54b07e94..d51853ca Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25936&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25936&range=00-01 Stats: 24 lines in 1 file changed: 0 ins; 2 del; 22 mod Patch: https://git.openjdk.org/jdk/pull/25936.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25936/head:pull/25936 PR: https://git.openjdk.org/jdk/pull/25936 From mablakatov at openjdk.org Tue Jul 1 16:14:47 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Tue, 1 Jul 2025 16:14:47 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v4] In-Reply-To: <3sWLk_sAMLtcvRUjXk9hYe-K2MBQl9fH2Qg0MF7lwDk=.b8867d51-e822-43c0-93ab-58228c6eb1d5@github.com> References: <3sWLk_sAMLtcvRUjXk9hYe-K2MBQl9fH2Qg0MF7lwDk=.b8867d51-e822-43c0-93ab-58228c6eb1d5@github.com> Message-ID: On Tue, 1 Jul 2025 07:00:08 GMT, Xiaohong Gong wrote: >> Mikhail Ablakatov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: >> >> - cleanup: address nits, rename several symbols >> - cleanup: remove unreferenced definitions >> - Address review comments. >> >> - fixup: disable FP mul reduction auto-vectorization for all targets >> - fixup: add a tmp vReg to reduce_mul_integral_gt128b and >> reduce_non_strict_order_mul_fp_gt128bto keep vsrc unmodified >> - cleanup: replace a complex lambda in the above methods with a loop >> - cleanup: rename symbols to follow the existing naming convention >> - cleanup: add asserts to SVE only instructions >> - split mul FP reduction instructions into strictly-ordered (default) >> and explicitly non strictly-ordered >> - remove redundant conditions in TestVectorFPReduction.java >> >> Benchmarks results: >> >> Neoverse-V1 (SVE 256-bit) >> >> | Benchmark | Before | After | Units | Diff | >> |---------------------------|----------|----------|--------|-------| >> | ByteMaxVector.MULLanes | 619.156 | 9884.578 | ops/ms | 1496% | >> | DoubleMaxVector.MULLanes | 184.693 | 2712.051 | ops/ms | 1368% | >> | FloatMaxVector.MULLanes | 277.818 | 3388.038 | ops/ms | 1119% | >> | IntMaxVector.MULLanes | 371.225 | 4765.434 | ops/ms | 1183% | >> | LongMaxVector.MULLanes | 205.149 | 2672.975 | ops/ms | 1203% | >> | ShortMaxVector.MULLanes | 472.804 | 5122.917 | ops/ms | 984% | >> - Merge branch 'master' into 8343689-rebase >> - fixup: don't modify the value in vsrc >> >> Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this >> change, the result of recursive folding is held in vtmp1. To be able to >> pass this intermediate result to reduce_mul_integral_le128b(), we would >> have to use another temporary FloatRegister, as vtmp1 would essentially >> act as vsrc. 
It's possible to get around this however: >> reduce_mul_integral_le128b() is modified so it's possible to pass >> matching vsrc and vtmp2 arguments. By doing this, we save ourselves a >> temporary register in rules that match to reduce_mul_integral_gt128b(). >> - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formating >> - Use EXT instead of COMPACT to split a vector into two halves >> >> Benchmarks results: >> >> Neoverse-V1 (SVE 256-bit) >> >> Benchmark (size) Mode master ... > > src/hotspot/cpu/aarch64/aarch64_vector.ad line 3536: > >> 3534: >> 3535: instruct reduce_mulF_gt128b(vRegF dst, vRegF fsrc, vReg vsrc, vReg tmp) %{ >> 3536: predicate(Matcher::vector_length_in_bytes(n->in(2)) > 16 && n->as_Reduction()->requires_strict_order()); > > Are there the cases that can match with this rule? Well, we don't match it right now for auto-vectorization as it doesn't worth it performance-wise. This might change for future implementations of SVE(2). I'd still prefer to keep it so the set of instructions is complete. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2178014966 From mablakatov at openjdk.org Tue Jul 1 16:14:49 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Tue, 1 Jul 2025 16:14:49 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v5] In-Reply-To: References: Message-ID: <4XhaHrk4r0mgFmgfVUFvy0mktRz25oXfbln2Nhjcxg4=.a7e60853-979f-48de-9fa0-b8530a3b2ba5@github.com> On Tue, 1 Jul 2025 02:51:56 GMT, Hao Sun wrote: >> Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: >> >> fixup: remove undefined insts from aarch64-asmtest.py > > src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3729: > >> 3727: #undef INSN >> 3728: >> 3729: // SVE aliases > > In the inital commit, asm test for `sve_(mov|movs|not|nots)` is added into `test/hotspot/gtest/aarch64/aarch64-asmtest.py`. Since the definition is removed in this commit, the corresponding asm test should be removed as well. Otherwise, JDK build failed on AArch64. > See the error log in GHA test. https://github.com/mikabl-arm/jdk/actions/runs/15974069085/job/45051902618 Thanks, fixed by https://github.com/openjdk/jdk/pull/23181/commits/df09ab65f75c7b6f99e0088b3871d7df7a8c4d1b ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2178016339 From mablakatov at openjdk.org Tue Jul 1 16:25:49 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Tue, 1 Jul 2025 16:25:49 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v3] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 06:21:43 GMT, Xiaohong Gong wrote: >> Why is it better that way? Currently the assertions check that we end up here if there computations that can be done only using SVE (length > neon && length <= sve). What would happen if a user operates 256b VectorAPI vectors on a 512b SVE platform? > > That would be the operations with partial vector size valid. For such cases, we will generate a mask in IR level, and a `VectorBlend` will be generated for this reduction case. Otherwise the result will be incorrect. So the vector size should be equal to MaxVectorSize theoretically. 
Thank you for elaborating on this :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2178035000 From shade at openjdk.org Tue Jul 1 16:27:42 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 1 Jul 2025 16:27:42 GMT Subject: RFR: 8360557: CTW: Inline cold methods to reach more code In-Reply-To: References: Message-ID: <3T_kZY0tk0WcS4kkuGcoifEHjo1TlLbLBcjLxb4sD-I=.42bd833a-7fa2-4173-a165-f05e05e6e124@github.com> On Tue, 1 Jul 2025 16:04:12 GMT, Vladimir Kozlov wrote: > This has to be tested by us to make sure we clean up all issues this change find. Sure thing. There is a chicken-and-egg kind of problem that some bugs reproduce only with this PR, and maybe with extra inline tuning :) I am following up on failures that we are seeing. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26068#issuecomment-3024727152 From snatarajan at openjdk.org Tue Jul 1 16:28:27 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Tue, 1 Jul 2025 16:28:27 GMT Subject: RFR: 8325478: Restructure the macro expansion compiler phase to not include macro elimination [v8] In-Reply-To: References: Message-ID: <4W6QHi3F3RN-JYfYAKUATR_xCUnOiUR0vT73ndqNZtk=.0e193c07-cad0-4cbd-86f2-1758a8c8bac9@github.com> > This changeset restructures the macro expansion phase to not include macro elimination and also adds a flag StressMacroElimination which randomizes macro elimination ordering for stress testing purposes. > > Changes: > - Implemented a method `eliminate_opaque_looplimit_macro_nodes` that removes the functionality for eliminating Opaque and LoopLimit nodes from the `expand_macro_nodes ` method. > - Introduced compiler phases` PHASE_AFTER_MACRO_ELIMINATION` > - Added a new Ideal phase for individual macro elimination steps. > - Implemented the flag `StressMacroElimination`. Added functionality tests for `StressMacroElimination`, similar to previous stress flag `StressMacroExpansion` ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)). > > Below is a sample screenshot (IGV print level 4 ) mainly showing the new phase . > ![image](https://github.com/user-attachments/assets/16013cd4-6ec6-4939-ac66-33bb03d59cd6) > > Questions to reviewers: > - Is the new macro elimination phase OK, or should we change anything? > - In `compile.cpp `, `PHASE_ITER_GVN_AFTER_ELIMINATION` follows `PHASE_AFTER_MACRO_ELIMINATION` in the current fix. Should `PHASE_ITER_GVN_AFTER_ELIMINATION` be removed ? > > Testing: > GitHub Actions > tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. 
> Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: review comments fix part 1 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25682/files - new: https://git.openjdk.org/jdk/pull/25682/files/939be78b..791b6a0c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25682&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25682&range=06-07 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/25682.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25682/head:pull/25682 PR: https://git.openjdk.org/jdk/pull/25682 From eastigeevich at openjdk.org Tue Jul 1 16:43:38 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 1 Jul 2025 16:43:38 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v2] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 15:37:04 GMT, Aleksey Shipilev wrote: > Looks okay, but I am confused why the test did not fail before JDK-8359435? Just checked. It's not because of JDK-8359435. There were some changes which disabled printing debug info in release build. > test/hotspot/jtreg/compiler/onSpinWait/TestOnSpinWaitAArch64.java line 32: > >> 30: * @requires vm.flagless >> 31: * @requires os.arch=="aarch64" >> 32: * @requires vm.debug==true > > Can be just `@requires vm.debug`. Done ------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3024772962 PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2178059281 From eastigeevich at openjdk.org Tue Jul 1 16:43:39 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 1 Jul 2025 16:43:39 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v2] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 16:05:07 GMT, Evgeny Astigeevich wrote: >> Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. Release builds might not generate needed debug info. >> >> This PR adds a requirement for the test to be run on debug builds only. >> >> Tested: >> - Fastdebug: test passed >> - Slowdebug: test passed. >> - Release: test skipped. > > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Simplify requirement for debug build The test started failing after I had updated my branch. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3024774351 From kvn at openjdk.org Tue Jul 1 16:54:42 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 1 Jul 2025 16:54:42 GMT Subject: RFR: 8360641: TestCompilerCounts fails after 8354727 [v4] In-Reply-To: References: <3mMrDF_446r7HudbsHIpdoWByBlnUpjFo7YzIty0KG8=.facc058f-3975-44c4-b2d4-93b8c64db185@github.com> Message-ID: On Tue, 1 Jul 2025 06:52:32 GMT, Manuel H?ssig wrote: >> After integrating #25872 the calculation of the`CICompilerCount` ergonomic became dependent on the size of `NonNMethodCodeHeapSize`, which itself is an ergonomic based on the available memory. Thus, depending on the system, the test `compiler/arguments/TestCompilerCounts.java` failed, i.e. locally this failed, but not on CI servers. 
>> >> This PR changes the test to reflect the changes introduced in #25872. >> >> Testing: >> - [ ] [Github Actions](https://github.com/mhaessig/jdk/actions/runs/15932906313) >> - [x] tier1,tier2 plus Oracle internal testing > > Manuel H?ssig has updated the pull request incrementally with one additional commit since the last revision: > > Fix whitespace > > Co-authored-by: Andrey Turbanov Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26024#pullrequestreview-2976217571 From kvn at openjdk.org Tue Jul 1 17:08:43 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 1 Jul 2025 17:08:43 GMT Subject: RFR: 8325478: Restructure the macro expansion compiler phase to not include macro elimination [v7] In-Reply-To: References: Message-ID: <0KyjZgLy8vVqV3du6Y1LIKmGTnDYxEPlYgTrVVd_ey4=.b2d40e6c-ff0e-4e88-bc70-e06219a15608@github.com> On Tue, 1 Jul 2025 13:24:49 GMT, Saranya Natarajan wrote: >> src/hotspot/share/opto/compile.cpp line 2533: >> >>> 2531: { >>> 2532: TracePhase tp(_t_macroExpand); >>> 2533: print_method(PHASE_BEFORE_MACRO_EXPANSION, 3); >> >> Should we move it before `mex.expand_macro_nodes()` call? > > Moving this would break the assumption of needing a `BEFORE_MACRO_ELIMINATION` as explained in the above reply. One way to go about this would be to include a `BEFORE_MACRO_ELIMINATION` phase and remove the `PHASE_BEFORE_MACRO_EXPANSION` phase as this is only place where it is used. Would this be a reasonable fix ? So `MACRO_ELIMINATION` is subset of `MACRO_EXPANSION` >> src/hotspot/share/opto/phasetype.hpp line 94: >> >>> 92: flags(AFTER_LOOP_OPTS, "After Loop Optimizations") \ >>> 93: flags(AFTER_MERGE_STORES, "After Merge Stores") \ >>> 94: flags(AFTER_MACRO_ELIMINATION_STEP, "After Macro Elimination Step") \ >> >> What is the reason to not have `BEFORE_MACRO_ELIMINATION`? > > The two main reasons for not having a `BEFORE_MACRO_ELIMINATION` are as follows: > - There is a dump in line 2426 (`print_method(PHASE_ITER_GVN_AFTER_EA, 2)`) before we call `mexp.eliminate_macro_nodes` which performs the functionality of having a `BEFORE_MACRO_ELIMINATION` for phase dump. > - There is dump in line 2533 (`print_method(PHASE_BEFORE_MACRO_EXPANSION, 3)`) before eliminating macro nodes which performs the similar function. ok ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25682#discussion_r2178120635 PR Review Comment: https://git.openjdk.org/jdk/pull/25682#discussion_r2178120168 From kvn at openjdk.org Tue Jul 1 17:08:41 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 1 Jul 2025 17:08:41 GMT Subject: RFR: 8325478: Restructure the macro expansion compiler phase to not include macro elimination [v8] In-Reply-To: <4W6QHi3F3RN-JYfYAKUATR_xCUnOiUR0vT73ndqNZtk=.0e193c07-cad0-4cbd-86f2-1758a8c8bac9@github.com> References: <4W6QHi3F3RN-JYfYAKUATR_xCUnOiUR0vT73ndqNZtk=.0e193c07-cad0-4cbd-86f2-1758a8c8bac9@github.com> Message-ID: On Tue, 1 Jul 2025 16:28:27 GMT, Saranya Natarajan wrote: >> This changeset restructures the macro expansion phase to not include macro elimination and also adds a flag StressMacroElimination which randomizes macro elimination ordering for stress testing purposes. >> >> Changes: >> - Implemented a method `eliminate_opaque_looplimit_macro_nodes` that removes the functionality for eliminating Opaque and LoopLimit nodes from the `expand_macro_nodes ` method. >> - Introduced compiler phases` PHASE_AFTER_MACRO_ELIMINATION` >> - Added a new Ideal phase for individual macro elimination steps. 
>> - Implemented the flag `StressMacroElimination`. Added functionality tests for `StressMacroElimination`, similar to previous stress flag `StressMacroExpansion` ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)). >> >> Below is a sample screenshot (IGV print level 4 ) mainly showing the new phase . >> ![image](https://github.com/user-attachments/assets/16013cd4-6ec6-4939-ac66-33bb03d59cd6) >> >> Questions to reviewers: >> - Is the new macro elimination phase OK, or should we change anything? >> - In `compile.cpp `, `PHASE_ITER_GVN_AFTER_ELIMINATION` follows `PHASE_AFTER_MACRO_ELIMINATION` in the current fix. Should `PHASE_ITER_GVN_AFTER_ELIMINATION` be removed ? >> >> Testing: >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. >> Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > review comments fix part 1 Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25682#pullrequestreview-2976289575 From shade at openjdk.org Tue Jul 1 17:13:43 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 1 Jul 2025 17:13:43 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v2] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 16:05:07 GMT, Evgeny Astigeevich wrote: >> Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. Release builds might not generate needed debug info. >> >> This PR adds a requirement for the test to be run on debug builds only. >> >> Tested: >> - Fastdebug: test passed >> - Slowdebug: test passed. >> - Release: test skipped. > > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Simplify requirement for debug build OK, are you able to bisect which change? This fix to only do debug VM needs to be correctly linked to the actual cause, IMO. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3024882710 From psandoz at openjdk.org Tue Jul 1 18:06:45 2025 From: psandoz at openjdk.org (Paul Sandoz) Date: Tue, 1 Jul 2025 18:06:45 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: References: Message-ID: On Wed, 25 Jun 2025 09:16:48 GMT, Xiaohong Gong wrote: >> JDK-8318650 introduced hotspot intrinsification of subword gather load APIs for X86 platforms [1]. However, the current implementation is not optimal for AArch64 SVE platform, which natively supports vector instructions for subword gather load operations using an int vector for indices (see [2][3]). >> >> Two key areas require improvement: >> 1. At the Java level, vector indices generated for range validation could be reused for the subsequent gather load operation on architectures with native vector instructions like AArch64 SVE. However, the current implementation prevents compiler reuse of these index vectors due to divergent control flow, potentially impacting performance. >> 2. At the compiler IR level, the additional `offset` input for `LoadVectorGather`/`LoadVectorGatherMasked` with subword types increases IR complexity and complicates backend implementation. 
Furthermore, generating `add` instructions before each memory access negatively impacts performance. >> >> This patch refactors the implementation at both the Java level and compiler mid-end to improve efficiency and maintainability across different architectures. >> >> Main changes: >> 1. Java-side API refactoring: >> - Explicitly passes generated index vectors to hotspot, eliminating duplicate index vectors for gather load instructions on >> architectures like AArch64. >> 2. C2 compiler IR refactoring: >> - Refactors `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword types by removing the memory offset input and incorporating it into the memory base `addr` at the IR level. This simplifies backend implementation, reduces add operations, and unifies the IR across all types. >> 3. Backend changes: >> - Streamlines X86 implementation of subword gather operations following the removal of the offset input from the IR level. >> >> Performance: >> The performance of the relative JMH improves up to 27% on a X86 AVX512 system. Please see the data below: >> >> Benchmark Mode Cnt Unit SIZE Before After Gain >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 64 53682.012 52650.325 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 256 14484.252 14255.156 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 1024 3664.900 3595.615 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 4096 908.31... > > Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: > > - Address review comments > - Merge 'jdk:master' into JDK-8355563 > - 8355563: VectorAPI: Refactor current implementation of subword gather load API Marked as reviewed by psandoz (Reviewer). This is a nice simplification, Java changes look good. I'll let the Intel folks sign-off related to regressions. IMO minor regressions like this are acceptable if the generated code quality is good, and if the benchmark reports higher variance and averaging results from multiple forks close the gap. (In this case i don't understand how the Java changes impacts alignment). ------------- PR Review: https://git.openjdk.org/jdk/pull/25138#pullrequestreview-2976493924 PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3025029477 From dlunden at openjdk.org Tue Jul 1 18:08:40 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Tue, 1 Jul 2025 18:08:40 GMT Subject: RFR: 8325478: Restructure the macro expansion compiler phase to not include macro elimination [v8] In-Reply-To: <4W6QHi3F3RN-JYfYAKUATR_xCUnOiUR0vT73ndqNZtk=.0e193c07-cad0-4cbd-86f2-1758a8c8bac9@github.com> References: <4W6QHi3F3RN-JYfYAKUATR_xCUnOiUR0vT73ndqNZtk=.0e193c07-cad0-4cbd-86f2-1758a8c8bac9@github.com> Message-ID: On Tue, 1 Jul 2025 16:28:27 GMT, Saranya Natarajan wrote: >> This changeset restructures the macro expansion phase to not include macro elimination and also adds a flag StressMacroElimination which randomizes macro elimination ordering for stress testing purposes. >> >> Changes: >> - Implemented a method `eliminate_opaque_looplimit_macro_nodes` that removes the functionality for eliminating Opaque and LoopLimit nodes from the `expand_macro_nodes ` method. >> - Introduced compiler phases` PHASE_AFTER_MACRO_ELIMINATION` >> - Added a new Ideal phase for individual macro elimination steps. >> - Implemented the flag `StressMacroElimination`. 
Added functionality tests for `StressMacroElimination`, similar to previous stress flag `StressMacroExpansion` ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)). >> >> Below is a sample screenshot (IGV print level 4 ) mainly showing the new phase . >> ![image](https://github.com/user-attachments/assets/16013cd4-6ec6-4939-ac66-33bb03d59cd6) >> >> Questions to reviewers: >> - Is the new macro elimination phase OK, or should we change anything? >> - In `compile.cpp `, `PHASE_ITER_GVN_AFTER_ELIMINATION` follows `PHASE_AFTER_MACRO_ELIMINATION` in the current fix. Should `PHASE_ITER_GVN_AFTER_ELIMINATION` be removed ? >> >> Testing: >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. >> Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > review comments fix part 1 Marked as reviewed by dlunden (Committer). ------------- PR Review: https://git.openjdk.org/jdk/pull/25682#pullrequestreview-2976500092 From sviswanathan at openjdk.org Tue Jul 1 21:33:44 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 1 Jul 2025 21:33:44 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: References: Message-ID: On Wed, 25 Jun 2025 09:16:48 GMT, Xiaohong Gong wrote: >> JDK-8318650 introduced hotspot intrinsification of subword gather load APIs for X86 platforms [1]. However, the current implementation is not optimal for AArch64 SVE platform, which natively supports vector instructions for subword gather load operations using an int vector for indices (see [2][3]). >> >> Two key areas require improvement: >> 1. At the Java level, vector indices generated for range validation could be reused for the subsequent gather load operation on architectures with native vector instructions like AArch64 SVE. However, the current implementation prevents compiler reuse of these index vectors due to divergent control flow, potentially impacting performance. >> 2. At the compiler IR level, the additional `offset` input for `LoadVectorGather`/`LoadVectorGatherMasked` with subword types increases IR complexity and complicates backend implementation. Furthermore, generating `add` instructions before each memory access negatively impacts performance. >> >> This patch refactors the implementation at both the Java level and compiler mid-end to improve efficiency and maintainability across different architectures. >> >> Main changes: >> 1. Java-side API refactoring: >> - Explicitly passes generated index vectors to hotspot, eliminating duplicate index vectors for gather load instructions on >> architectures like AArch64. >> 2. C2 compiler IR refactoring: >> - Refactors `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword types by removing the memory offset input and incorporating it into the memory base `addr` at the IR level. This simplifies backend implementation, reduces add operations, and unifies the IR across all types. >> 3. Backend changes: >> - Streamlines X86 implementation of subword gather operations following the removal of the offset input from the IR level. >> >> Performance: >> The performance of the relative JMH improves up to 27% on a X86 AVX512 system. 
Please see the data below: >> >> Benchmark Mode Cnt Unit SIZE Before After Gain >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 64 53682.012 52650.325 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 256 14484.252 14255.156 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 1024 3664.900 3595.615 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 4096 908.31... > > Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: > > - Address review comments > - Merge 'jdk:master' into JDK-8355563 > - 8355563: VectorAPI: Refactor current implementation of subword gather load API Marked as reviewed by sviswanathan (Reviewer). Agree with Paul, these are minor regressions. Let us proceed with this patch. ------------- PR Review: https://git.openjdk.org/jdk/pull/25138#pullrequestreview-2977019367 PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3025596784 From sviswanathan at openjdk.org Wed Jul 2 00:04:39 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 2 Jul 2025 00:04:39 GMT Subject: RFR: 8360116: Add support for AVX10 floating point minmax instruction [v5] In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 08:38:27 GMT, Jatin Bhateja wrote: >> Intel@ AVX10 ISA [1] extensions added new floating point MIN/MAX instructions which comply with definitions in IEEE-754-2019 standard section 9.6 and can directly emulate Math.min/max semantics without the need for any special handling for NaN, +0.0 or -0.0 detection. >> >> **The following pseudo-code describes the existing algorithm for min/max[FD]:** >> >> Move the non-negative value to the second operand; this will ensure that we correctly handle 0.0 and -0.0 values, if values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is returned. Existing MINPS and MAXPS semantics only check for NaN as the second operand; hence, we need special handling to check for NaN at the first operand. >> >> btmp = (b < +0.0) ? a : b >> atmp = (b < +0.0) ? b : a >> Tmp = Max_Float(atmp , btmp) >> Res = (atmp == NaN) ? atmp : Tmp >> >> For min[FD] we need a small tweak in the above algorithm, i.e., move the non-negative value to the first operand, this will ensure that we correctly select -0.0 if both the operands being compared are 0.0 or -0.0. >> >> btmp = (b < +0.0) ? b : a >> atmp = (b < +0.0) ? a : b >> Tmp = Max_Float(atmp , btmp) >> Res = (atmp == NaN) ? atmp : Tmp >> >> Thus, we need additional special handling for NaNs and +/-0.0 to compute floating-point min/max values to comply with the semantics of Math.max/min APIs using existing MINPS / MAXPS instructions. AVX10.2 added a new instruction, VPMINMAX[SH,SS,SD]/[PH,PS,PD], which comprehensively handles special cases, thereby eliminating the need for special handling. >> >> Patch emits new instructions for reduction and non-reduction operations for single, double, and Float16 type. >> >> Kindly review and share your feedback. 
>> >> Best Regards, >> Jatin >> >> [1] https://www.intel.com/content/www/us/en/content-details/856721/intel-advanced-vector-extensions-10-2-intel-avx10-2-architecture-specification.html?wapkw=AVX10 > > Jatin Bhateja has updated the pull request incrementally with two additional commits since the last revision: > > - Update src/hotspot/cpu/x86/x86_64.ad > > Co-authored-by: Manuel H?ssig > - Update src/hotspot/cpu/x86/x86_64.ad > > Co-authored-by: Manuel H?ssig src/hotspot/cpu/x86/assembler_x86.cpp line 8800: > 8798: attributes.set_is_evex_instruction(); > 8799: attributes.set_embedded_opmask_register_specifier(mask); > 8800: attributes.set_address_attributes(/* tuple_type */ EVEX_FVM, /* input_size_in_bits */ EVEX_NObit); It looks to me that the tuple_type should be EVEX_FV for all of evminmax ps, pd, ph. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25914#discussion_r2178735442 From kbarrett at openjdk.org Wed Jul 2 00:30:44 2025 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 2 Jul 2025 00:30:44 GMT Subject: RFR: 8361086: JVMCIGlobals::check_jvmci_flags_are_consistent has incorrect format string In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 23:16:20 GMT, Vladimir Kozlov wrote: >> Please review this trivial fix of a format string. The value being printed is >> TieredStopAtLevel, which is of type intx, so "%zd" should be used instead of "%d". >> >> Testing: mach5 tier1 > > Thank you for checking other solutions. > > Current fix is good. Thanks for reviews @vnkozlov , @mhaessig , and @mur47x111 ------------- PR Comment: https://git.openjdk.org/jdk/pull/26051#issuecomment-3025913681 From kbarrett at openjdk.org Wed Jul 2 00:30:44 2025 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 2 Jul 2025 00:30:44 GMT Subject: Integrated: 8361086: JVMCIGlobals::check_jvmci_flags_are_consistent has incorrect format string In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 16:14:08 GMT, Kim Barrett wrote: > Please review this trivial fix of a format string. The value being printed is > TieredStopAtLevel, which is of type intx, so "%zd" should be used instead of "%d". > > Testing: mach5 tier1 This pull request has now been integrated. Changeset: c6448dc3 Author: Kim Barrett URL: https://git.openjdk.org/jdk/commit/c6448dc3afb1da9d93bb94804aa1971a650b91b7 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8361086: JVMCIGlobals::check_jvmci_flags_are_consistent has incorrect format string Reviewed-by: kvn, mhaessig, yzheng ------------- PR: https://git.openjdk.org/jdk/pull/26051 From sviswanathan at openjdk.org Wed Jul 2 00:31:41 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 2 Jul 2025 00:31:41 GMT Subject: RFR: 8360116: Add support for AVX10 floating point minmax instruction [v5] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 23:49:30 GMT, Sandhya Viswanathan wrote: >> Jatin Bhateja has updated the pull request incrementally with two additional commits since the last revision: >> >> - Update src/hotspot/cpu/x86/x86_64.ad >> >> Co-authored-by: Manuel H?ssig >> - Update src/hotspot/cpu/x86/x86_64.ad >> >> Co-authored-by: Manuel H?ssig > > src/hotspot/cpu/x86/assembler_x86.cpp line 8800: > >> 8798: attributes.set_is_evex_instruction(); >> 8799: attributes.set_embedded_opmask_register_specifier(mask); >> 8800: attributes.set_address_attributes(/* tuple_type */ EVEX_FVM, /* input_size_in_bits */ EVEX_NObit); > > It looks to me that the tuple_type should be EVEX_FV for all of evminmax ps, pd, ph. 
Other than that the rest of the PR looks good to me. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25914#discussion_r2178762877 From xgong at openjdk.org Wed Jul 2 01:45:46 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 2 Jul 2025 01:45:46 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v4] In-Reply-To: <19rf4A0bxc4BstRmLivGkoCOm7Qa7YD6z1VJHJivCtg=.4a643c7b-4e79-4f37-b230-7231df3c68a8@github.com> References: <3sWLk_sAMLtcvRUjXk9hYe-K2MBQl9fH2Qg0MF7lwDk=.b8867d51-e822-43c0-93ab-58228c6eb1d5@github.com> <19rf4A0bxc4BstRmLivGkoCOm7Qa7YD6z1VJHJivCtg=.4a643c7b-4e79-4f37-b230-7231df3c68a8@github.com> Message-ID: On Tue, 1 Jul 2025 16:07:59 GMT, Mikhail Ablakatov wrote: >> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2097: >> >>> 2095: sve_movprfx(vtmp1, vsrc); // copy >>> 2096: sve_ext(vtmp1, vtmp1, vector_length_in_bytes / 2); // swap halves >>> 2097: sve_mul(vtmp1, elemType_to_regVariant(bt), pgtmp, vsrc); // multiply halves >> >>> sve_mul(vtmp1, elemType_to_regVariant(bt), pgtmp, vsrc); >> >> Can we use `ptrue` instread of `pgtmp` here? The higher bits can be computed, but they have not influences to the final results, right? > > Thanks! For some reason I thought that we don't have a dedicated predicate register for that. We can directly use `ptrue` here which maps to `p7` and has been preserved and initialized as all true. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2178816427 From xgong at openjdk.org Wed Jul 2 01:48:50 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 2 Jul 2025 01:48:50 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v4] In-Reply-To: References: <3sWLk_sAMLtcvRUjXk9hYe-K2MBQl9fH2Qg0MF7lwDk=.b8867d51-e822-43c0-93ab-58228c6eb1d5@github.com> Message-ID: On Tue, 1 Jul 2025 16:10:58 GMT, Mikhail Ablakatov wrote: >> src/hotspot/cpu/aarch64/aarch64_vector.ad line 3536: >> >>> 3534: >>> 3535: instruct reduce_mulF_gt128b(vRegF dst, vRegF fsrc, vReg vsrc, vReg tmp) %{ >>> 3536: predicate(Matcher::vector_length_in_bytes(n->in(2)) > 16 && n->as_Reduction()->requires_strict_order()); >> >> Are there the cases that can match with this rule? > > Well, we don't match it right now for auto-vectorization as it doesn't worth it performance-wise. This might change for future implementations of SVE(2). I'd still prefer to keep it so the set of instructions is complete. Removing is fine to me, as actually we do not have the case to test the correctness. Or maybe you could just do some changes locally (e.g. removing the `requires_strict_order` predication and the un-strict-order rule), and test it with VectorAPI cases? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2178819064 From xgong at openjdk.org Wed Jul 2 01:54:50 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 2 Jul 2025 01:54:50 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 18:03:33 GMT, Paul Sandoz wrote: > This is a nice simplification, Java changes look good. I'll let the Intel folks sign-off related to regressions. IMO minor regressions like this are acceptable if the generated code quality is good, and if the benchmark reports higher variance and averaging results from multiple forks close the gap. (In this case i don't understand how the Java changes impacts alignment). Thanks for your review and comments @PaulSandoz ! 
The Java changes in this patch mean the outer loop in the test is no longer peeled as before, since all the range checks and branches are hoisted outside of the loop, whereas previously one iteration of loop peeling was needed to eliminate the branches. I think this changes the layout of the generated code quite a lot. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3026080127 From xgong at openjdk.org Wed Jul 2 01:54:51 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 2 Jul 2025 01:54:51 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: References: Message-ID: <7GrGfBF_v8F0v02sRHC78ofMZwpMdzQZaHeYlNvi_N0=.93defb9e-ca9b-41b4-8722-1746692e2316@github.com> On Tue, 1 Jul 2025 21:30:20 GMT, Sandhya Viswanathan wrote: > Agree with Paul, these are minor regressions. Let us proceed with this patch. Thanks so much for your review @sviswa7 ! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3026080679 From jbhateja at openjdk.org Wed Jul 2 01:57:46 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 2 Jul 2025 01:57:46 GMT Subject: RFR: 8360116: Add support for AVX10 floating point minmax instruction [v6] In-Reply-To: References: Message-ID: > Intel@ AVX10 ISA [1] extensions added new floating point MIN/MAX instructions which comply with definitions in IEEE-754-2019 standard section 9.6 and can directly emulate Math.min/max semantics without the need for any special handling for NaN, +0.0 or -0.0 detection. > > **The following pseudo-code describes the existing algorithm for min/max[FD]:** > > Move the non-negative value to the second operand; this will ensure that we correctly handle 0.0 and -0.0 values, if values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is returned. Existing MINPS and MAXPS semantics only check for NaN as the second operand; hence, we need special handling to check for NaN at the first operand. > > btmp = (b < +0.0) ? a : b > atmp = (b < +0.0) ? b : a > Tmp = Max_Float(atmp , btmp) > Res = (atmp == NaN) ? atmp : Tmp > > For min[FD] we need a small tweak in the above algorithm, i.e., move the non-negative value to the first operand, this will ensure that we correctly select -0.0 if both the operands being compared are 0.0 or -0.0. > > btmp = (b < +0.0) ? b : a > atmp = (b < +0.0) ? a : b > Tmp = Max_Float(atmp , btmp) > Res = (atmp == NaN) ? atmp : Tmp > > Thus, we need additional special handling for NaNs and +/-0.0 to compute floating-point min/max values to comply with the semantics of Math.max/min APIs using existing MINPS / MAXPS instructions. AVX10.2 added a new instruction, VPMINMAX[SH,SS,SD]/[PH,PS,PD], which comprehensively handles special cases, thereby eliminating the need for special handling. > > Patch emits new instructions for reduction and non-reduction operations for single, double, and Float16 type. > > Kindly review and share your feedback.
> > Best Regards, > Jatin > > [1] https://www.intel.com/content/www/us/en/content-details/856721/intel-advanced-vector-extensions-10-2-intel-avx10-2-architecture-specification.html?wapkw=AVX10 Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Sandhya's review comments resolution ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25914/files - new: https://git.openjdk.org/jdk/pull/25914/files/5597b615..3854a871 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25914&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25914&range=04-05 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/25914.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25914/head:pull/25914 PR: https://git.openjdk.org/jdk/pull/25914 From jbhateja at openjdk.org Wed Jul 2 02:04:41 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 2 Jul 2025 02:04:41 GMT Subject: RFR: 8360116: Add support for AVX10 floating point minmax instruction [v5] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 00:29:02 GMT, Sandhya Viswanathan wrote: >> src/hotspot/cpu/x86/assembler_x86.cpp line 8800: >> >>> 8798: attributes.set_is_evex_instruction(); >>> 8799: attributes.set_embedded_opmask_register_specifier(mask); >>> 8800: attributes.set_address_attributes(/* tuple_type */ EVEX_FVM, /* input_size_in_bits */ EVEX_NObit); >> >> It looks to me that the tuple_type should be EVEX_FV for all of evminmax ps, pd, ph. > > Other than that the rest of the PR looks good to me. > It looks to me that the tuple_type should be EVEX_FV for all of evminmax ps, pd, ph. Yes, all these new vector instructions do have embedded broadcast variants. We don't use them currently, in the absence of embedded broadcasting, the scalar factor (N) selection for compressed disp8 displacement is the same for both EVEX_FV and EVEX_FVM tuple types. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25914#discussion_r2178831749 From xgong at openjdk.org Wed Jul 2 02:39:33 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 2 Jul 2025 02:39:33 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v2] In-Reply-To: References: Message-ID: > ### Background > On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. > > For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. > > To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. > > ### Impact Analysis > #### 1. Vector types > Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. > > #### 2. Vector API > No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. 
Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. > > #### 3. Auto-vectorization > Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. > > #### 4. Codegen of vector nodes > NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. > > Details: > - Lanewise vector operations are unaffected as explained above. > - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). > - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, adding an explicit vector size check in `match_rule_s... Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: Refine comments based on review suggestion ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26057/files - new: https://git.openjdk.org/jdk/pull/26057/files/5af5bd49..4e15e588 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26057&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26057&range=00-01 Stats: 9 lines in 3 files changed: 0 ins; 0 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/26057.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26057/head:pull/26057 PR: https://git.openjdk.org/jdk/pull/26057 From xgong at openjdk.org Wed Jul 2 02:39:34 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 2 Jul 2025 02:39:34 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors In-Reply-To: References: Message-ID: <0PdYt-pCobM5mAb4q3nDcR9PKz89QVFCsZF-jnMAv4Q=.6a5d9f1f-8b68-448c-ab72-2f7f4a12322e@github.com> On Tue, 1 Jul 2025 05:59:15 GMT, Xiaohong Gong wrote: > ### Background > On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. > > For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. > > To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. > > ### Impact Analysis > #### 1. Vector types > Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. > > #### 2. Vector API > No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. 
It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. > > #### 3. Auto-vectorization > Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. > > #### 4. Codegen of vector nodes > NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. > > Details: > - Lanewise vector operations are unaffected as explained above. > - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). > - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, adding an explicit vector size check in `match_rule_s... Hi @theRealAph , I'v updated the patch by fixing the comment issues. Could you please take a look at it again? Thanks a lot! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3026147575 From thartmann at openjdk.org Wed Jul 2 05:22:38 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Jul 2025 05:22:38 GMT Subject: RFR: 8361180: Disable CompiledDirectCall verification with -VerifyInlineCaches In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 10:47:40 GMT, Aleksey Shipilev wrote: > Missed the spot when doing [JDK-8360867](https://bugs.openjdk.org/browse/JDK-8360867). There is a path from GC that calls into IC verification when cleaning the caches. See `nmethod::cleanup_inline_caches_impl`. It does verification per callsite, and does the whole thing during parallel GC cleanup, which is STW at least in G1. This gets expensive for CTW scenarios. We should wrap that under the same flag introduced by [JDK-8360867](https://bugs.openjdk.org/browse/JDK-8360867). > > Motivational improvements: > > > $ time CONF=linux-x86_64-server-fastdebug make test TEST=applications/ctw/modules/ > > # Current mainline > real 3m59.274s > user 68m9.663s > sys 5m19.026s > > # This PR > real 3m49.118s > user 65m37.962s > sys 5m15.441s Looks good to me too. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26063#pullrequestreview-2977769711 From thartmann at openjdk.org Wed Jul 2 05:36:24 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Jul 2025 05:36:24 GMT Subject: [jdk25] RFR: 8358179: Performance regression in Math.cbrt Message-ID: Hi all, This pull request contains a backport of commit [38f59f84](https://github.com/openjdk/jdk/commit/38f59f84c98dfd974eec0c05541b2138b149def7) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. The commit being backported was authored by Mohamed Issa on 1 Jul 2025 and was reviewed by Sandhya Viswanathan, Srinivas Vamsi Parasa and Emanuel Peter. Thanks! 
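(As an aside on the 8359419 "Relax min vector length to 32-bit for short vectors" thread above: the ShortVector-to-LongVector conversion it describes looks roughly like the sketch below. This is only an illustration, not code from that patch; the class and method names are invented, and it needs `--add-modules jdk.incubator.vector` to compile and run.)

```java
import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.ShortVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class ShortToLongConvertSketch {
    static final VectorSpecies<Short> S_128 = ShortVector.SPECIES_128;
    static final VectorSpecies<Long>  L_128 = LongVector.SPECIES_128;

    // Widens the two lowest short lanes of 'a' into a 128-bit long vector and adds them to 'b'.
    // The intermediate two-element short vector is the 32-bit shape discussed in the thread.
    static void addWidened(short[] a, long[] b) {
        ShortVector sv = ShortVector.fromArray(S_128, a, 0);                          // 8 short lanes
        LongVector lv = (LongVector) sv.convertShape(VectorOperators.S2L, L_128, 0);  // lanes 0..1 widened
        lv.add(LongVector.fromArray(L_128, b, 0)).intoArray(b, 0);
    }
}
```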
------------- Commit messages: - Backport 38f59f84c98dfd974eec0c05541b2138b149def7 Changes: https://git.openjdk.org/jdk/pull/26085/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26085&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8358179 Stats: 50 lines in 1 file changed: 11 ins; 36 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/26085.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26085/head:pull/26085 PR: https://git.openjdk.org/jdk/pull/26085 From shade at openjdk.org Wed Jul 2 05:40:42 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 2 Jul 2025 05:40:42 GMT Subject: RFR: 8361180: Disable CompiledDirectCall verification with -VerifyInlineCaches In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 10:47:40 GMT, Aleksey Shipilev wrote: > Missed the spot when doing [JDK-8360867](https://bugs.openjdk.org/browse/JDK-8360867). There is a path from GC that calls into IC verification when cleaning the caches. See `nmethod::cleanup_inline_caches_impl`. It does verification per callsite, and does the whole thing during parallel GC cleanup, which is STW at least in G1. This gets expensive for CTW scenarios. We should wrap that under the same flag introduced by [JDK-8360867](https://bugs.openjdk.org/browse/JDK-8360867). > > Motivational improvements: > > > $ time CONF=linux-x86_64-server-fastdebug make test TEST=applications/ctw/modules/ > > # Current mainline > real 3m59.274s > user 68m9.663s > sys 5m19.026s > > # This PR > real 3m49.118s > user 65m37.962s > sys 5m15.441s Thanks! Here goes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26063#issuecomment-3026519823 From shade at openjdk.org Wed Jul 2 05:40:43 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 2 Jul 2025 05:40:43 GMT Subject: Integrated: 8361180: Disable CompiledDirectCall verification with -VerifyInlineCaches In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 10:47:40 GMT, Aleksey Shipilev wrote: > Missed the spot when doing [JDK-8360867](https://bugs.openjdk.org/browse/JDK-8360867). There is a path from GC that calls into IC verification when cleaning the caches. See `nmethod::cleanup_inline_caches_impl`. It does verification per callsite, and does the whole thing during parallel GC cleanup, which is STW at least in G1. This gets expensive for CTW scenarios. We should wrap that under the same flag introduced by [JDK-8360867](https://bugs.openjdk.org/browse/JDK-8360867). > > Motivational improvements: > > > $ time CONF=linux-x86_64-server-fastdebug make test TEST=applications/ctw/modules/ > > # Current mainline > real 3m59.274s > user 68m9.663s > sys 5m19.026s > > # This PR > real 3m49.118s > user 65m37.962s > sys 5m15.441s This pull request has now been integrated. Changeset: 1ac74898 Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/1ac74898745ce9b109db5571d9dcbd907dd05831 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8361180: Disable CompiledDirectCall verification with -VerifyInlineCaches Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/26063 From yongheng_hgq at 126.com Wed Jul 2 05:49:19 2025 From: yongheng_hgq at 126.com (h) Date: Wed, 2 Jul 2025 13:49:19 +0800 (CST) Subject: =?GBK?Q?RFR:_8358568=A3=BAC2_compilation_hits_"must_have_a_mon?= =?GBK?Q?itor"_assert_with_-XX:-GenerateSynchronizationCode?= Message-ID: <5f3eb53a.5267.197c9aeb416.Coremail.yongheng_hgq@126.com> Hi all, Please review this fix for JDK-8358568. 
It addresses a crash caused by accessing monitor info when -XX:-GenerateSynchronizationCode is set. The fix adds a guard in Parse::do_monitor_exit() to avoid the crash. Thank you in advance. Changes: https://github.com/openjdk/jdk8u-dev/pull/664/files webrev: https://openjdk.github.io/cr/?repo=jdk8u-dev&pr=664&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8358568 Patch: https://git.openjdk.org/jdk8u-dev/pull/664.diff PR: https://github.com/openjdk/jdk8u-dev/pull/664 BR -------------- next part -------------- An HTML attachment was scrubbed... URL: From haosun at openjdk.org Wed Jul 2 06:45:46 2025 From: haosun at openjdk.org (Hao Sun) Date: Wed, 2 Jul 2025 06:45:46 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v3] In-Reply-To: References: Message-ID: <8L1C1JR9H-GIASZlUG7Gk5Jf9rjVEVuBn-Sf9r8STYA=.843085aa-efb3-436e-acb3-ab4d1f52a9d8@github.com> On Wed, 18 Jun 2025 12:12:16 GMT, Mikhail Ablakatov wrote: >> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2002: >> >>> 2000: assert(vector_length_in_bytes == 8 || vector_length_in_bytes == 16, "unsupported"); >>> 2001: assert_different_registers(vtmp1, vsrc); >>> 2002: assert_different_registers(vtmp1, vtmp2); >> >> nit: would be neat to use >> Suggestion: >> >> assert_different_registers(vsrc, vtmp1, vtmp2); > > `vsrc` and `vtmp2` are allowed to match. I see your point. IIUC, we should not modify `vsrc` as it's the source operand. If we allow `vsrc` and `vtmp2` to match, then `vsrc` is modified then. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2179185158 From haosun at openjdk.org Wed Jul 2 06:45:48 2025 From: haosun at openjdk.org (Hao Sun) Date: Wed, 2 Jul 2025 06:45:48 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v5] In-Reply-To: References: Message-ID: <2zMCHzKXQ1kBfjcU5Fc8s6fa2W6TTCKpSSjhB0dMdLw=.3c43071b-3982-4e0e-a300-e0547f4fbbec@github.com> On Tue, 1 Jul 2025 15:48:00 GMT, Mikhail Ablakatov wrote: >> Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. >> >> Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still. >> >> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reducion micro-benchmarks.
>> >> Benchmarks results: >> >> Neoverse-V1 (SVE 256-bit) >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms >> >> >> Fujitsu A64FX (SVE 512-bit): >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms > > Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: > > fixup: remove undefined insts from aarch64-asmtest.py test/hotspot/jtreg/compiler/loopopts/superword/TestVectorFPReduction.java line 2: > 1: /* > 2: * Copyright (c) 2025, Arm Limited. All rights reserved. `XX, YY,` means this file was created at XX year and the latest update was made at YY year. If `XX=YY`, then use `XX,`. Suggestion: * Copyright (c) 2024, 2025, Arm Limited. All rights reserved. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2178924210 From dfenacci at openjdk.org Wed Jul 2 07:05:40 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Wed, 2 Jul 2025 07:05:40 GMT Subject: RFR: 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal In-Reply-To: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> References: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> Message-ID: On Tue, 1 Jul 2025 11:35:06 GMT, Beno?t Maillard wrote: > This PR adds a node hash comparison after calling `Ideal` in `PhaseIterGVN::verify_Ideal_for` to introduce an additional layer of verification for missed optimizations. Previously, we relied on the return value of `Ideal`, which is expected to be `nullptr` if no transformation was done. > > By also checking the node's hash before and after `Ideal`, we could catch inconsistencies in the implementation or unintended modifications to the graph. Both of these may indicate missed or incomplete optimizations. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361144) > - [x] tier1-3, plus some internal testing > > Thank you for reviewing! Thanks @benoitmaillard! Definitely an additional check worth doing. I left a couple of inline comments. src/hotspot/share/opto/phaseX.cpp line 1821: > 1819: // The number of nodes shoud not increase. > 1820: uint old_unique = C->unique(); > 1821: uint old_hash = n->hash(); Just to be consistent with `old_unique` we could add a small comment (here or below for both). What do you think? src/hotspot/share/opto/phaseX.cpp line 1838: > 1836: stringStream ss; // Print as a block without tty lock. 
> 1837: ss.cr(); > 1838: ss.print_cr("Ideal optimization did not make progress but hash node changed."); Suggestion: ss.print_cr("Ideal optimization did not make progress but node hash changed."); ------------- PR Review: https://git.openjdk.org/jdk/pull/26064#pullrequestreview-2977964471 PR Review Comment: https://git.openjdk.org/jdk/pull/26064#discussion_r2179270798 PR Review Comment: https://git.openjdk.org/jdk/pull/26064#discussion_r2179279429 From bmaillard at openjdk.org Wed Jul 2 07:19:30 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Wed, 2 Jul 2025 07:19:30 GMT Subject: RFR: 8359602: Ideal optimizations depending on input type are missed because of missing notification mechanism from CCP [v5] In-Reply-To: <-_MqCH6QmE-o_d7c9-aet-Cq-ptZJ6CZV6rodpDNWq0=.173e6f7a-3cfe-4791-8253-36e06d892069@github.com> References: <-_MqCH6QmE-o_d7c9-aet-Cq-ptZJ6CZV6rodpDNWq0=.173e6f7a-3cfe-4791-8253-36e06d892069@github.com> Message-ID: > This PR prevents some missed ideal optimizations in IGVN by notifying users of type refinements made during CCP, addressing a missed optimization that caused a verification failure with `-XX:VerifyIterativeGVN=1110`. > > ### Context > During the compilation of the input program (obtained from the fuzzer, then simplified and added as a test) by C2, we end up with node `591 ModI` that takes `138 Phi` as its divisor input. An existing `Ideal` optimization is to get rid of the control input of a `ModINode` when we can prove that the divisor is never `0`. > > In this specific case, the type of the `PhiNode` gets refined during CCP, but the refinement fails to propagate to its users for the IGVN phase and the ideal optimization for the `ModINode` never happens. This results in a missed optimization and hits an assert in the verification phase of IGVN (when using `-XX:VerifyIterativeGVN=1110`). > > ![IGV screenshot](https://github.com/user-attachments/assets/5dee1ae6-9146-4115-922d-df33b7ccbd37) > > ### Detailed Analysis > > In `PhaseCCP::analyze`, we call `Value` for the `PhiNode`, which > results in a type refinement: the range gets restricted to `int:-13957..-1191`. > > ```c++ > // Pull from worklist; compute new value; push changes out. > // This loop is the meat of CCP. > while (worklist.size() != 0) { > Node* n = fetch_next_node(worklist); > DEBUG_ONLY(worklist_verify.push(n);) > if (n->is_SafePoint()) { > // Make sure safepoints are processed by PhaseCCP::transform even if they are > // not reachable from the bottom. Otherwise, infinite loops would be removed. > _root_and_safepoints.push(n); > } > const Type* new_type = n->Value(this); > if (new_type != type(n)) { > DEBUG_ONLY(verify_type(n, new_type, type(n));) > dump_type_and_node(n, new_type); > set_type(n, new_type); > push_child_nodes_to_worklist(worklist, n); > } > if (KillPathsReachableByDeadTypeNode && n->is_Type() && new_type == Type::TOP) { > // Keep track of Type nodes to kill CFG paths that use Type > // nodes that become dead. > _maybe_top_type_nodes.push(n); > } > } > DEBUG_ONLY(verify_analyze(worklist_verify);) > > > At the end of `PhaseCCP::analyze`, we obtain the following types in the side table: > - `int` for node `591` (`ModINode`) > - `int:-13957..-1191` for node `138` (`PhiNode`) > > If we call `find_node(138)->bottom_type()`, we get: > - `int` for both nodes > > There is no progress on the type of `ModINode` during CCP, because `ModINode::Value` > is not able to... Beno?t Maillard has updated the pull request with a new target base due to a merge or a rebase. 
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains eight additional commits since the last revision: - Fix bad test class name - 8359602: rename test - 8359602: remove requires.debug=true and add -XX:+IgnoreUnrecognizedVMOptions flag - 8359602: add comment - 8359602: add test summary and comments - 8359602: tag requires vm.debug == true - 8359602: Add test from fuzzer - 8359602: Add users to IGVN worklist when type is refined in CCP ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26017/files - new: https://git.openjdk.org/jdk/pull/26017/files/005b2825..a66d3fb4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26017&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26017&range=03-04 Stats: 18268 lines in 747 files changed: 7677 ins; 6510 del; 4081 mod Patch: https://git.openjdk.org/jdk/pull/26017.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26017/head:pull/26017 PR: https://git.openjdk.org/jdk/pull/26017 From thartmann at openjdk.org Wed Jul 2 07:19:31 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Jul 2025 07:19:31 GMT Subject: RFR: 8359602: Ideal optimizations depending on input type are missed because of missing notification mechanism from CCP [v4] In-Reply-To: References: <-_MqCH6QmE-o_d7c9-aet-Cq-ptZJ6CZV6rodpDNWq0=.173e6f7a-3cfe-4791-8253-36e06d892069@github.com> Message-ID: On Tue, 1 Jul 2025 12:58:29 GMT, Beno?t Maillard wrote: >> This PR prevents some missed ideal optimizations in IGVN by notifying users of type refinements made during CCP, addressing a missed optimization that caused a verification failure with `-XX:VerifyIterativeGVN=1110`. >> >> ### Context >> During the compilation of the input program (obtained from the fuzzer, then simplified and added as a test) by C2, we end up with node `591 ModI` that takes `138 Phi` as its divisor input. An existing `Ideal` optimization is to get rid of the control input of a `ModINode` when we can prove that the divisor is never `0`. >> >> In this specific case, the type of the `PhiNode` gets refined during CCP, but the refinement fails to propagate to its users for the IGVN phase and the ideal optimization for the `ModINode` never happens. This results in a missed optimization and hits an assert in the verification phase of IGVN (when using `-XX:VerifyIterativeGVN=1110`). >> >> ![IGV screenshot](https://github.com/user-attachments/assets/5dee1ae6-9146-4115-922d-df33b7ccbd37) >> >> ### Detailed Analysis >> >> In `PhaseCCP::analyze`, we call `Value` for the `PhiNode`, which >> results in a type refinement: the range gets restricted to `int:-13957..-1191`. >> >> ```c++ >> // Pull from worklist; compute new value; push changes out. >> // This loop is the meat of CCP. >> while (worklist.size() != 0) { >> Node* n = fetch_next_node(worklist); >> DEBUG_ONLY(worklist_verify.push(n);) >> if (n->is_SafePoint()) { >> // Make sure safepoints are processed by PhaseCCP::transform even if they are >> // not reachable from the bottom. Otherwise, infinite loops would be removed. >> _root_and_safepoints.push(n); >> } >> const Type* new_type = n->Value(this); >> if (new_type != type(n)) { >> DEBUG_ONLY(verify_type(n, new_type, type(n));) >> dump_type_and_node(n, new_type); >> set_type(n, new_type); >> push_child_nodes_to_worklist(worklist, n); >> } >> if (KillPathsReachableByDeadTypeNode && n->is_Type() && new_type == Type::TOP) { >> // Keep track of Type nodes to kill CFG paths that use Type >> // nodes that become dead. 
>> _maybe_top_type_nodes.push(n); >> } >> } >> DEBUG_ONLY(verify_analyze(worklist_verify);) >> >> >> At the end of `PhaseCCP::analyze`, we obtain the following types in the side table: >> - `int` for node `591` (`ModINode`) >> - `int:-13957..-1191` for node `138` (`PhiNode`) >> >> If we call `find_node(138)->bottom_type()`, we get: >> - `int` for both nodes >> >> The... > > Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: > > 8359602: update case for consistency > > Co-authored-by: Emanuel Peter Still good, thanks for making these changes. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26017#pullrequestreview-2978014920 From thartmann at openjdk.org Wed Jul 2 07:20:41 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Jul 2025 07:20:41 GMT Subject: RFR: 8360557: CTW: Inline cold methods to reach more code In-Reply-To: References: Message-ID: <-7cfzVghCWnUCfB1F3dcyG2fvJGnqREUW98qiVJEvQQ=.db06fb1e-e96e-4e00-bac0-098b4e1de54c@github.com> On Tue, 1 Jul 2025 12:26:44 GMT, Aleksey Shipilev wrote: > We use CTW testing for making sure compilers behave well. But we compile the code that is not executed at all, and since our inlining heuristics often looks back at profiles, we end up not actually inlining all too much! This means CTW testing likely misses lots of bugs that normal code is exposed to, especially e.g. in loop optimizations. > > There is an intrinsic tradeoff with accepting more inilned methods in CTW: the compilation time gets significantly worse. With just accepting the cold methods we have reasonable CTW times, eating the improvements we have committed in mainline recently. And it still finds bugs. See the RFE for sample data. > > After this lands and CTW starts to compile cold methods, one can greatly expand the scope of the CTW testing by overriding the static inlining limits. Doing e.g. `TEST_VM_OPTS="-XX:MaxInlineSize=70 -XX:C1MaxInlineSize=70"` finds even more bugs. Unfortunately, the compilation times suffer so much, they are impractical to run in standard configurations, see data in RFE. We will enable some of that testing in special testing pipelines. > > Pre-empting the question: "Well, why not use -Xcomp then, and make sure it inlines well?" The answer is in RFE as well: Xcomp causes _a lot_ of stray compilations for JDK and CTW infra itself. For small JARs in large corpus this eats precious testing time that we would instead like to spend on deeper inlining in the actual JAR code. This also does not force us to look into how CTW works in Xcomp at all; I expect some surprises there. Feather-touching the inlining heuristic paths to just accept methods without looking at profiles looks better. > > Tobias had an idea to implement the stress randomized inlining that would expand the scope of inlining. This improvement stacks well with it. This improvement provides the base case of inlining most reasonable methods, and then allow stress infra to inline some more on top of that. > > Additional testing: > - [x] GHA > - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > - [x] Linux x86_64 server fastdebug, large CTW corpus (now failing in interesting ways) I submitted some testing to make sure that CTW is clean in our CI. 
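A usage note from my side (a sketch, not part of the patch): because the new switch is declared as a DIAGNOSTIC flag (see the compiler_globals.hpp hunk quoted below), exercising it in a CTW run would presumably look something like `make test TEST="applications/ctw/modules" TEST_VM_OPTS="-XX:+UnlockDiagnosticVMOptions -XX:+InlineColdMethods"`, optionally combined with the `MaxInlineSize`/`C1MaxInlineSize` overrides mentioned in the description.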
src/hotspot/share/compiler/compiler_globals.hpp line 400: > 398: product(bool, InlineColdMethods, false, DIAGNOSTIC, \ > 399: "Inline methods cold methods that would otherwise rejected " \ > 400: "based on profile information. Only useful for compiler testing.")\ Suggestion: "Inline cold methods that would otherwise be rejected based" \ "on profile information. Only useful for compiler testing.") \ ------------- PR Comment: https://git.openjdk.org/jdk/pull/26068#issuecomment-3026732006 PR Review Comment: https://git.openjdk.org/jdk/pull/26068#discussion_r2179310625 From eastigeevich at openjdk.org Wed Jul 2 07:40:40 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Wed, 2 Jul 2025 07:40:40 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v2] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 16:05:07 GMT, Evgeny Astigeevich wrote: >> Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. Release builds might not generate needed debug info. >> >> This PR adds a requirement for the test to be run on debug builds only. >> >> Tested: >> - Fastdebug: test passed >> - Slowdebug: test passed. >> - Release: test skipped. > > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Simplify requirement for debug build I finished bisecting. This is my changes in the test which made it failing: @@ -56,7 +58,6 @@ public static void main(String[] args) throws Exception { command.add("-showversion"); command.add("-XX:-BackgroundCompilation"); command.add("-XX:+UnlockDiagnosticVMOptions"); - command.add("-XX:+PrintAssembly"); if (compiler.equals("c2")) { command.add("-XX:-TieredCompilation"); } else if (compiler.equals("c1")) { @@ -69,13 +70,17 @@ public static void main(String[] args) throws Exception { command.add("-XX:OnSpinWaitInst=" + spinWaitInst); command.add("-XX:OnSpinWaitInstCount=" + spinWaitInstCount); command.add("-XX:CompileCommand=compileonly," + Launcher.class.getName() + "::" + "test"); + command.add("-XX:CompileCommand=print," + Launcher.class.getName() + "::" + "test"); command.add(Launcher.class.getName()); It looks like `XX:+PrintAssembly` prints out debug info in release builds but `XX:CompileCommand=print` does not. I am switching back to `XX:+PrintAssembly`. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3026790161 From thartmann at openjdk.org Wed Jul 2 07:43:47 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Jul 2025 07:43:47 GMT Subject: RFR: 8359602: Ideal optimizations depending on input type are missed because of missing notification mechanism from CCP [v4] In-Reply-To: References: <-_MqCH6QmE-o_d7c9-aet-Cq-ptZJ6CZV6rodpDNWq0=.173e6f7a-3cfe-4791-8253-36e06d892069@github.com> Message-ID: <6XXAMA5_Jq8NxpK0TOTAJWkYhDXIo4Wrnz_0X32SkqQ=.b9e29a9c-fe36-4c04-88bc-d276a66fd711@github.com> On Wed, 2 Jul 2025 07:14:34 GMT, Tobias Hartmann wrote: >> Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: >> >> 8359602: update case for consistency >> >> Co-authored-by: Emanuel Peter > > Still good, thanks for making these changes. > @TobiHartmann how much should he invest in this now? An alternative is just tackling all the other cases later. What do you think? Yes, agreed. Let's handle this later. 
(Sorry, somehow I thought I had replied to this already - must have missed pressing the Comment button..) ------------- PR Comment: https://git.openjdk.org/jdk/pull/26017#issuecomment-3026800468 From thartmann at openjdk.org Wed Jul 2 07:50:50 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Jul 2025 07:50:50 GMT Subject: RFR: 8359120: Improve warning message when fail to load hsdis library [v2] In-Reply-To: <-i6UPk-bhy9RnnCus_JbJ1nQ63nMX9djubON9WBbHQ8=.a2305566-563e-4171-b526-bcd645de51a3@github.com> References: <-i6UPk-bhy9RnnCus_JbJ1nQ63nMX9djubON9WBbHQ8=.a2305566-563e-4171-b526-bcd645de51a3@github.com> Message-ID: On Mon, 23 Jun 2025 08:56:12 GMT, Taizo Kurashige wrote: >> This PR is improvement of warning message when fail to load hsdis library. >> >> [JDK-8287001](https://bugs.openjdk.org/browse/JDK-8287001) introduced a warning on hsdis library load failure. This is useful when the user executes -XX:+PrintAssembly, etc. >> >> However, I think that when hs_err occurs, users might be confused by this warning printed by Xlog. Because users are not likely to know that hsdis is loaded for the [MachCode] section of the hs_err report, they may wonder, for example, "Why do I get warnings about hsdis load errors when -XX:+PrintAssembly is not specified?." >> >> To clear up this confusion, I suggest printing a warning just before [MachCode]. >> >>
>> >> sample output >> >> If hs_err occurs and hsdis load fails without the option to specify where the hs_err report should be output, the following is output to the hs_err_pir log file: >> >> . >> . >> native method entry point (kind = native) [0x000001ae8753cec0, 0x000001ae8753dac0] 3072 bytes >> >> Loading hsdis library failed, so undisassembled code is printed in the below [MachCode] section >> [MachCode] >> 0x000001ae8753cec0: 488b 4b08 | 0fb7 492e | 584c 8d74 | ccf8 6800 | 0000 0068 | 0000 0000 | 5055 488b | ec41 5548 >> 0x000001ae8753cee0: 8b43 084c | 8d68 3848 | 8b40 0868 | 0000 0000 | 5348 8b50 | 18 >> . >> . >> >> >> If -XX:+PrintAssembly is specified and hsdis load fails, the following is output to the stdout >> >> $ java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -version >> OpenJDK 64-Bit Server VM warning: PrintAssembly is enabled; turning on DebugNonSafepoints to gain additional output >> >> ============================= C1-compiled nmethod ============================== >> ----------------------------------- Assembly ----------------------------------- >> >> Compiled method (c1) 57 2 3 java.lang.Object:: (1 bytes) >> total in heap [0x0000024a08a00008,0x0000024a08a00208] = 512 >> . >> . >> >> [Constant Pool (empty)] >> >> >> Loading hsdis library failed, so undisassembled code is printed in the below [MachCode] section >> [MachCode] >> [Instructions begin] >> 0x0000024a08a00100: 6666 660f | 1f84 0000 | 0000 0066 | 6666 9066 | 6690 448b | 5208 443b >> . >> . >> [Constant Pool (empty)] >> >> >> Loading hsdis library failed, so undisassembled code is printed in the below [MachCode] section >> [MachCode] >> [Verified Entry Point] >> # {method} {0x00000000251a1898} 'toUnsignedInt' '(B)I' in 'java/lang/Byte >> . >> . >> >> >>
>> >> Since... > > Taizo Kurashige has updated the pull request incrementally with one additional commit since the last revision: > > Fix message and revert lines for Xlog Marked as reviewed by thartmann (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/25726#pullrequestreview-2978107826 From thartmann at openjdk.org Wed Jul 2 07:50:52 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Jul 2025 07:50:52 GMT Subject: RFR: 8359120: Improve warning message when fail to load hsdis library [v2] In-Reply-To: References: <-i6UPk-bhy9RnnCus_JbJ1nQ63nMX9djubON9WBbHQ8=.a2305566-563e-4171-b526-bcd645de51a3@github.com> <7UkSbnceEz4PY3UDwyR9iOseuvS4sD8FBBGl96mG_lk=.e94b4126-9df5-406b-a3f3-b21439d848e6@github.com> Message-ID: On Mon, 30 Jun 2025 11:08:45 GMT, Taizo Kurashige wrote: > but since nullptr is passed at [src/hotspot/share/compiler/disassembler.hpp#L66](https://github.com/openjdk/jdk/blob/c2d76f9844aadf77a0b213a9169a7c5c8c8f1ffb/src/hotspot/share/compiler/disassembler.hpp#L66), that reporting doesn't actually work. Right, it will be set to `tty` when Verbose is true: https://github.com/openjdk/jdk/blob/c2d76f9844aadf77a0b213a9169a7c5c8c8f1ffb/src/hotspot/share/compiler/disassembler.cpp#L780 Thanks for the additional details of why you decided to not use that code. I'm fine with these changes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25726#issuecomment-3026818978 From aph at openjdk.org Wed Jul 2 08:05:44 2025 From: aph at openjdk.org (Andrew Haley) Date: Wed, 2 Jul 2025 08:05:44 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v2] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 07:37:40 GMT, Evgeny Astigeevich wrote: > > It looks like `XX:+PrintAssembly` prints out debug info in release builds but `XX:CompileCommand=print` does not. I am switching back to `XX:+PrintAssembly`. That's not great. What info do you need, exactly? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3026870108 From bkilambi at openjdk.org Wed Jul 2 08:10:24 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 2 Jul 2025 08:10:24 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v9] In-Reply-To: References: Message-ID: > This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. > > It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. > > For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. > > For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. > > This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. 
> > Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - > > > Benchmark (size) Mode Cnt Gain > SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 > SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 > SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 > SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 > SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 > SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 > SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 > SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 > SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 > SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 > > > Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. Bhavana Kilambi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains nine commits: - Merge master - code style issues fixed - Addressed review comments - Addressed review comments - Revert a small change in c2_MacroAssembler.hpp - Addressed review comments - Addressed review comments and added a JTREG test - Merge master - 8348868: AArch64: Add backend support for SelectFromTwoVector This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. For 64-bit vector length : Neon tbl instruction is generated for T_SHORT and T_BYTE types only. For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - Benchmark (size) Mode Cnt Gain SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. 
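For readers less familiar with the API being intrinsified here, a minimal Java sketch of a two-vector selectFrom call, i.e. the operation that should now map onto the tbl-based path on 128-bit NEON/SVE2. The class name and sample values are my own, and it assumes the two-vector selectFrom overload from the incubating jdk.incubator.vector module (compile and run with --add-modules jdk.incubator.vector):

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

public class SelectFromTwoVectorExample {
    static final VectorSpecies<Integer> S = IntVector.SPECIES_128;

    public static void main(String[] args) {
        IntVector a   = IntVector.fromArray(S, new int[] {10, 11, 12, 13}, 0);
        IntVector b   = IntVector.fromArray(S, new int[] {20, 21, 22, 23}, 0);
        // Index lanes in [0, VLENGTH) pick from a; lanes in [VLENGTH, 2*VLENGTH) pick from b.
        IntVector idx = IntVector.fromArray(S, new int[] {0, 5, 2, 7}, 0);
        IntVector r   = idx.selectFrom(a, b);
        System.out.println(r); // [10, 21, 12, 23]
    }
}
```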
------------- Changes: https://git.openjdk.org/jdk/pull/23570/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=08 Stats: 987 lines in 11 files changed: 952 ins; 0 del; 35 mod Patch: https://git.openjdk.org/jdk/pull/23570.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23570/head:pull/23570 PR: https://git.openjdk.org/jdk/pull/23570 From epeter at openjdk.org Wed Jul 2 08:11:44 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 2 Jul 2025 08:11:44 GMT Subject: [jdk25] RFR: 8358179: Performance regression in Math.cbrt In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 05:30:39 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [38f59f84](https://github.com/openjdk/jdk/commit/38f59f84c98dfd974eec0c05541b2138b149def7) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Mohamed Issa on 1 Jul 2025 and was reviewed by Sandhya Viswanathan, Srinivas Vamsi Parasa and Emanuel Peter. > > Thanks! LGTM Marked as reviewed by epeter (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26085#pullrequestreview-2978172863 PR Review: https://git.openjdk.org/jdk/pull/26085#pullrequestreview-2978173371 From aph at openjdk.org Wed Jul 2 08:18:44 2025 From: aph at openjdk.org (Andrew Haley) Date: Wed, 2 Jul 2025 08:18:44 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v2] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 02:39:33 GMT, Xiaohong Gong wrote: >> ### Background >> On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. >> >> For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. >> >> To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. >> >> ### Impact Analysis >> #### 1. Vector types >> Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. >> >> #### 2. Vector API >> No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. >> >> #### 3. Auto-vectorization >> Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. >> >> #### 4. Codegen of vector nodes >> NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. >> >> Details: >> - Lanewise vector operations are unaffected as explained above. 
>> - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). >> - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, addin... > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Refine comments based on review suggestion src/hotspot/cpu/aarch64/aarch64.ad line 2367: > 2365: // Theoretically, the minimal vector length supported by AArch64 > 2366: // ISA and Vector API species is 64-bit. However, 32-bit or 16-bit > 2367: // vector length is also allowed for special Vector API usages. Suggestion: // Usually, the shortest vector length supported by AArch64 // ISA and Vector API species is 64 bits. However, we allow // 32-bit or 16-bit vectors in a few special cases. Reason for change: it wasn't clear what "supported" meant. Supported by the hardware, or by HotSpot. And why do we only support it in a few special cases? This comment raises more questions than it answers. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26057#discussion_r2179423549 From thartmann at openjdk.org Wed Jul 2 08:25:45 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Jul 2025 08:25:45 GMT Subject: [jdk25] RFR: 8358179: Performance regression in Math.cbrt In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 05:30:39 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [38f59f84](https://github.com/openjdk/jdk/commit/38f59f84c98dfd974eec0c05541b2138b149def7) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Mohamed Issa on 1 Jul 2025 and was reviewed by Sandhya Viswanathan, Srinivas Vamsi Parasa and Emanuel Peter. > > Thanks! Thanks for the review Emanuel! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26085#issuecomment-3026926520 From thartmann at openjdk.org Wed Jul 2 08:25:46 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Jul 2025 08:25:46 GMT Subject: [jdk25] Integrated: 8358179: Performance regression in Math.cbrt In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 05:30:39 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [38f59f84](https://github.com/openjdk/jdk/commit/38f59f84c98dfd974eec0c05541b2138b149def7) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Mohamed Issa on 1 Jul 2025 and was reviewed by Sandhya Viswanathan, Srinivas Vamsi Parasa and Emanuel Peter. > > Thanks! This pull request has now been integrated. 
Changeset: 0a151c68 Author: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/0a151c68d6529f3a1d3a44fbccc42b67a60b25d9 Stats: 50 lines in 1 file changed: 11 ins; 36 del; 3 mod 8358179: Performance regression in Math.cbrt Reviewed-by: epeter Backport-of: 38f59f84c98dfd974eec0c05541b2138b149def7 ------------- PR: https://git.openjdk.org/jdk/pull/26085 From bkilambi at openjdk.org Wed Jul 2 08:26:00 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 2 Jul 2025 08:26:00 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v10] In-Reply-To: References: Message-ID: > This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. > > It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. > > For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. > > For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. > > This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. > > Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - > > > Benchmark (size) Mode Cnt Gain > SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 > SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 > SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 > SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 > SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 > SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 > SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 > SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 > SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 > SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 > > > Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. 
Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: Addressed review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23570/files - new: https://git.openjdk.org/jdk/pull/23570/files/80a1f67f..e86d55df Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=08-09 Stats: 36 lines in 6 files changed: 13 ins; 0 del; 23 mod Patch: https://git.openjdk.org/jdk/pull/23570.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23570/head:pull/23570 PR: https://git.openjdk.org/jdk/pull/23570 From bkilambi at openjdk.org Wed Jul 2 08:26:01 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 2 Jul 2025 08:26:01 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v8] In-Reply-To: References: Message-ID: On Fri, 27 Jun 2025 15:21:28 GMT, Andrew Haley wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> code style issues fixed > > src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 4231: > >> 4229: >> 4230: // SVE/SVE2 Programmable table lookup in one or two vector table (zeroing) >> 4231: void sve_tbl(FloatRegister Zd, SIMD_RegVariant T, FloatRegister Zn, unsigned reg_count, FloatRegister Zm) { > > [Edited] > > This would be better: > > private: > void _sve_tbl(FloatRegister Zd, SIMD_RegVariant T, FloatRegister Zn, unsigned reg_count, FloatRegister Zm) { > > > ... then 2 patterns ... > > > public: > void sve_tbl(FloatRegister Zd, SIMD_RegVariant T, FloatRegister Zn1, FloatRegister Zn2, FloatRegister Zm); > void sve_tbl(FloatRegister Zd, SIMD_RegVariant T, FloatRegister Zn, FloatRegister Zm); > > > ... and make sure that `Zn1+ 1 == Zn2` Done. Please review the latest patch. Thanks! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2179438846 From epeter at openjdk.org Wed Jul 2 08:26:46 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 2 Jul 2025 08:26:46 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: <7GrGfBF_v8F0v02sRHC78ofMZwpMdzQZaHeYlNvi_N0=.93defb9e-ca9b-41b4-8722-1746692e2316@github.com> References: <7GrGfBF_v8F0v02sRHC78ofMZwpMdzQZaHeYlNvi_N0=.93defb9e-ca9b-41b4-8722-1746692e2316@github.com> Message-ID: On Wed, 2 Jul 2025 01:52:19 GMT, Xiaohong Gong wrote: >> Agree with Paul, these are minor regressions. Let us proceed with this patch. > >> Agree with Paul, these are minor regressions. Let us proceed with this patch. > > Thanks so much for your review @sviswa7 ! @XiaohongGong I quickly scanned the patch, it looks good to me too. I'm submitting some internal testing now, to make sure our extended testing does not break on integration. Should take about 24h. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3026931008 From shade at openjdk.org Wed Jul 2 08:27:24 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 2 Jul 2025 08:27:24 GMT Subject: RFR: 8360557: CTW: Inline cold methods to reach more code [v2] In-Reply-To: References: Message-ID: <5znMFGgSuss2iAJ3cUBnmIKrfniGHx5W6CpY3TpNO_8=.0148fb6b-206a-4b57-8886-db80d606b18f@github.com> > We use CTW testing for making sure compilers behave well. But we compile the code that is not executed at all, and since our inlining heuristics often looks back at profiles, we end up not actually inlining all too much! 
This means CTW testing likely misses lots of bugs that normal code is exposed to, especially e.g. in loop optimizations. > > There is an intrinsic tradeoff with accepting more inilned methods in CTW: the compilation time gets significantly worse. With just accepting the cold methods we have reasonable CTW times, eating the improvements we have committed in mainline recently. And it still finds bugs. See the RFE for sample data. > > After this lands and CTW starts to compile cold methods, one can greatly expand the scope of the CTW testing by overriding the static inlining limits. Doing e.g. `TEST_VM_OPTS="-XX:MaxInlineSize=70 -XX:C1MaxInlineSize=70"` finds even more bugs. Unfortunately, the compilation times suffer so much, they are impractical to run in standard configurations, see data in RFE. We will enable some of that testing in special testing pipelines. > > Pre-empting the question: "Well, why not use -Xcomp then, and make sure it inlines well?" The answer is in RFE as well: Xcomp causes _a lot_ of stray compilations for JDK and CTW infra itself. For small JARs in large corpus this eats precious testing time that we would instead like to spend on deeper inlining in the actual JAR code. This also does not force us to look into how CTW works in Xcomp at all; I expect some surprises there. Feather-touching the inlining heuristic paths to just accept methods without looking at profiles looks better. > > Tobias had an idea to implement the stress randomized inlining that would expand the scope of inlining. This improvement stacks well with it. This improvement provides the base case of inlining most reasonable methods, and then allow stress infra to inline some more on top of that. > > Additional testing: > - [x] GHA > - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > - [x] Linux x86_64 server fastdebug, large CTW corpus (now failing in interesting ways) Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/share/compiler/compiler_globals.hpp Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26068/files - new: https://git.openjdk.org/jdk/pull/26068/files/b16cbabb..dedbcfed Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26068&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26068&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26068.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26068/head:pull/26068 PR: https://git.openjdk.org/jdk/pull/26068 From mhaessig at openjdk.org Wed Jul 2 08:35:51 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Wed, 2 Jul 2025 08:35:51 GMT Subject: RFR: 8360116: Add support for AVX10 floating point minmax instruction [v6] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 01:57:46 GMT, Jatin Bhateja wrote: >> Intel@ AVX10 ISA [1] extensions added new floating point MIN/MAX instructions which comply with definitions in IEEE-754-2019 standard section 9.6 and can directly emulate Math.min/max semantics without the need for any special handling for NaN, +0.0 or -0.0 detection. >> >> **The following pseudo-code describes the existing algorithm for min/max[FD]:** >> >> Move the non-negative value to the second operand; this will ensure that we correctly handle 0.0 and -0.0 values, if values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is returned. 
Existing MINPS and MAXPS semantics only check for NaN as the second operand; hence, we need special handling to check for NaN at the first operand. >> >> btmp = (b < +0.0) ? a : b >> atmp = (b < +0.0) ? b : a >> Tmp = Max_Float(atmp , btmp) >> Res = (atmp == NaN) ? atmp : Tmp >> >> For min[FD] we need a small tweak in the above algorithm, i.e., move the non-negative value to the first operand, this will ensure that we correctly select -0.0 if both the operands being compared are 0.0 or -0.0. >> >> btmp = (b < +0.0) ? b : a >> atmp = (b < +0.0) ? a : b >> Tmp = Max_Float(atmp , btmp) >> Res = (atmp == NaN) ? atmp : Tmp >> >> Thus, we need additional special handling for NaNs and +/-0.0 to compute floating-point min/max values to comply with the semantics of Math.max/min APIs using existing MINPS / MAXPS instructions. AVX10.2 added a new instruction, VPMINMAX[SH,SS,SD]/[PH,PS,PD], which comprehensively handles special cases, thereby eliminating the need for special handling. >> >> Patch emits new instructions for reduction and non-reduction operations for single, double, and Float16 type. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin >> >> [1] https://www.intel.com/content/www/us/en/content-details/856721/intel-advanced-vector-extensions-10-2-intel-avx10-2-architecture-specification.html?wapkw=AVX10 > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Sandhya's review comments resolution Marked as reviewed by mhaessig (Committer). ------------- PR Review: https://git.openjdk.org/jdk/pull/25914#pullrequestreview-2978248768 From mhaessig at openjdk.org Wed Jul 2 08:38:47 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Wed, 2 Jul 2025 08:38:47 GMT Subject: RFR: 8360641: TestCompilerCounts fails after 8354727 [v4] In-Reply-To: References: <3mMrDF_446r7HudbsHIpdoWByBlnUpjFo7YzIty0KG8=.facc058f-3975-44c4-b2d4-93b8c64db185@github.com> Message-ID: <2Er7Cp5ry6llaeyDvSv7Tg0hIOvS9AOzrJM0zfIW1JM=.edce3d10-ad95-4c03-80e0-0e985ba692ab@github.com> On Tue, 1 Jul 2025 06:52:32 GMT, Manuel H?ssig wrote: >> After integrating #25872 the calculation of the`CICompilerCount` ergonomic became dependent on the size of `NonNMethodCodeHeapSize`, which itself is an ergonomic based on the available memory. Thus, depending on the system, the test `compiler/arguments/TestCompilerCounts.java` failed, i.e. locally this failed, but not on CI servers. >> >> This PR changes the test to reflect the changes introduced in #25872. >> >> Testing: >> - [ ] [Github Actions](https://github.com/mhaessig/jdk/actions/runs/15932906313) >> - [x] tier1,tier2 plus Oracle internal testing > > Manuel H?ssig has updated the pull request incrementally with one additional commit since the last revision: > > Fix whitespace > > Co-authored-by: Andrey Turbanov Thank you for your reviews! 
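Stepping back to the AVX10 min/max thread above: the special values the intrinsic has to preserve are simply the ordinary Math.min/max semantics, which a plain Java snippet (my own illustration, not from the patch) makes concrete:

```java
public class MinMaxSpecialValues {
    public static void main(String[] args) {
        // -0.0 is ordered below +0.0, and NaN propagates through both min and max.
        System.out.println(Math.max(0.0, -0.0));        // 0.0
        System.out.println(Math.min(0.0, -0.0));        // -0.0
        System.out.println(Math.max(Double.NaN, 1.0));  // NaN
        System.out.println(Math.min(1.0, Double.NaN));  // NaN
    }
}
```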
------------- PR Comment: https://git.openjdk.org/jdk/pull/26024#issuecomment-3026962659 From mhaessig at openjdk.org Wed Jul 2 08:38:47 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Wed, 2 Jul 2025 08:38:47 GMT Subject: Integrated: 8360641: TestCompilerCounts fails after 8354727 In-Reply-To: <3mMrDF_446r7HudbsHIpdoWByBlnUpjFo7YzIty0KG8=.facc058f-3975-44c4-b2d4-93b8c64db185@github.com> References: <3mMrDF_446r7HudbsHIpdoWByBlnUpjFo7YzIty0KG8=.facc058f-3975-44c4-b2d4-93b8c64db185@github.com> Message-ID: On Fri, 27 Jun 2025 18:09:23 GMT, Manuel H?ssig wrote: > After integrating #25872 the calculation of the`CICompilerCount` ergonomic became dependent on the size of `NonNMethodCodeHeapSize`, which itself is an ergonomic based on the available memory. Thus, depending on the system, the test `compiler/arguments/TestCompilerCounts.java` failed, i.e. locally this failed, but not on CI servers. > > This PR changes the test to reflect the changes introduced in #25872. > > Testing: > - [ ] [Github Actions](https://github.com/mhaessig/jdk/actions/runs/15932906313) > - [x] tier1,tier2 plus Oracle internal testing This pull request has now been integrated. Changeset: 2304044a Author: Manuel H?ssig URL: https://git.openjdk.org/jdk/commit/2304044ab2f228fe2fe4adb5975291e733b12d5c Stats: 49 lines in 1 file changed: 34 ins; 1 del; 14 mod 8360641: TestCompilerCounts fails after 8354727 Reviewed-by: kvn, dfenacci, mdoerr ------------- PR: https://git.openjdk.org/jdk/pull/26024 From snatarajan at openjdk.org Wed Jul 2 08:40:56 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Wed, 2 Jul 2025 08:40:56 GMT Subject: RFR: 8325478: Restructure the macro expansion compiler phase to not include macro elimination [v8] In-Reply-To: <4W6QHi3F3RN-JYfYAKUATR_xCUnOiUR0vT73ndqNZtk=.0e193c07-cad0-4cbd-86f2-1758a8c8bac9@github.com> References: <4W6QHi3F3RN-JYfYAKUATR_xCUnOiUR0vT73ndqNZtk=.0e193c07-cad0-4cbd-86f2-1758a8c8bac9@github.com> Message-ID: On Tue, 1 Jul 2025 16:28:27 GMT, Saranya Natarajan wrote: >> This changeset restructures the macro expansion phase to not include macro elimination and also adds a flag StressMacroElimination which randomizes macro elimination ordering for stress testing purposes. >> >> Changes: >> - Implemented a method `eliminate_opaque_looplimit_macro_nodes` that removes the functionality for eliminating Opaque and LoopLimit nodes from the `expand_macro_nodes ` method. >> - Introduced compiler phases` PHASE_AFTER_MACRO_ELIMINATION` >> - Added a new Ideal phase for individual macro elimination steps. >> - Implemented the flag `StressMacroElimination`. Added functionality tests for `StressMacroElimination`, similar to previous stress flag `StressMacroExpansion` ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)). >> >> Below is a sample screenshot (IGV print level 4 ) mainly showing the new phase . >> ![image](https://github.com/user-attachments/assets/16013cd4-6ec6-4939-ac66-33bb03d59cd6) >> >> Questions to reviewers: >> - Is the new macro elimination phase OK, or should we change anything? >> - In `compile.cpp `, `PHASE_ITER_GVN_AFTER_ELIMINATION` follows `PHASE_AFTER_MACRO_ELIMINATION` in the current fix. Should `PHASE_ITER_GVN_AFTER_ELIMINATION` be removed ? >> >> Testing: >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. 
>> Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > review comments fix part 1 Thanks for the reviews everyone. Please sponsor ------------- PR Comment: https://git.openjdk.org/jdk/pull/25682#issuecomment-3026967089 From snatarajan at openjdk.org Wed Jul 2 08:40:57 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Wed, 2 Jul 2025 08:40:57 GMT Subject: Integrated: 8325478: Restructure the macro expansion compiler phase to not include macro elimination In-Reply-To: References: Message-ID: On Fri, 6 Jun 2025 22:40:34 GMT, Saranya Natarajan wrote: > This changeset restructures the macro expansion phase to not include macro elimination and also adds a flag StressMacroElimination which randomizes macro elimination ordering for stress testing purposes. > > Changes: > - Implemented a method `eliminate_opaque_looplimit_macro_nodes` that removes the functionality for eliminating Opaque and LoopLimit nodes from the `expand_macro_nodes ` method. > - Introduced compiler phases` PHASE_AFTER_MACRO_ELIMINATION` > - Added a new Ideal phase for individual macro elimination steps. > - Implemented the flag `StressMacroElimination`. Added functionality tests for `StressMacroElimination`, similar to previous stress flag `StressMacroExpansion` ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)). > > Below is a sample screenshot (IGV print level 4 ) mainly showing the new phase . > ![image](https://github.com/user-attachments/assets/16013cd4-6ec6-4939-ac66-33bb03d59cd6) > > Questions to reviewers: > - Is the new macro elimination phase OK, or should we change anything? > - In `compile.cpp `, `PHASE_ITER_GVN_AFTER_ELIMINATION` follows `PHASE_AFTER_MACRO_ELIMINATION` in the current fix. Should `PHASE_ITER_GVN_AFTER_ELIMINATION` be removed ? > > Testing: > GitHub Actions > tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. > Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) This pull request has now been integrated. Changeset: eac8f5d2 Author: Saranya Natarajan Committer: Daniel Lund?n URL: https://git.openjdk.org/jdk/commit/eac8f5d2c99e1bcc526da0f6a05af76e815c2db9 Stats: 77 lines in 11 files changed: 54 ins; 8 del; 15 mod 8325478: Restructure the macro expansion compiler phase to not include macro elimination Reviewed-by: kvn, dlunden ------------- PR: https://git.openjdk.org/jdk/pull/25682 From eastigeevich at openjdk.org Wed Jul 2 08:47:44 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Wed, 2 Jul 2025 08:47:44 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v2] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 16:05:07 GMT, Evgeny Astigeevich wrote: >> Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. Release builds might not generate needed debug info. >> >> This PR adds a requirement for the test to be run on debug builds only. >> >> Tested: >> - Fastdebug: test passed >> - Slowdebug: test passed. >> - Release: test skipped. 
> > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Simplify requirement for debug build > OK, are you able to bisect which change? This fix to only do debug VM needs to be correctly linked to the actual cause, IMO. > > It looks like `XX:+PrintAssembly` prints out debug info in release builds but `XX:CompileCommand=print` does not. I am switching back to `XX:+PrintAssembly`. > > That's not great. What info do you need, exactly? # {method} {0x0000ffff50400378} 'test' '()V' in 'compiler/onSpinWait/TestOnSpinWaitAArch64$Launcher' # [sp+0x20] (sp of caller) 0x0000ffff985731c0: ff83 00d1 | fd7b 01a9 | 2803 0018 | 8923 40b9 | 1f01 09eb 0x0000ffff985731d4: ;*synchronization entry ; - compiler.onSpinWait.TestOnSpinWaitAArch64$Launcher::test at -1 (line 224) 0x0000ffff985731d4: 2102 0054 | 1f20 03d5 | 1f20 03d5 | 1f20 03d5 | 1f20 03d5 | 1f20 03d5 | 1f20 03d5 0x0000ffff985731f0: ;*invokestatic onSpinWait {reexecute=0 rethrow=0 return_oop=0} ; - compiler.onSpinWait.TestOnSpinWaitAArch64$Launcher::test at 0 (line 224) 0x0000ffff985731f0: 1f20 03d5 | fd7b 41a9 | ff83 0091 0x0000ffff985731fc: ; {poll_return} 0x0000ffff985731fc: 8817 40f9 | ff63 28eb | 4800 0054 | c003 5fd6 0x0000ffff9857320c: ; {internal_word} 0x0000ffff9857320c: 88ff ff10 | 88a3 02f9 0x0000ffff98573214: ; {runtime_call SafepointBlob} 0x0000ffff98573214: 5bc3 fe17 0x0000ffff98573218: ; {runtime_call Stub::method_entry_barrier} 0x0000ffff98573218: 0850 96d2 | 480a b3f2 | e8ff dff2 | 0001 3fd6 | ecff ff17 The test searches for `- compiler.onSpinWait.TestOnSpinWaitAArch64$Launcher::test at 0` and `invokestatic onSpinWait`. They identify the place where to search instructions. Assembly from all builds always has `{poll_return}`. I can use it as a search point. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3026996074 From mablakatov at openjdk.org Wed Jul 2 08:48:59 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Wed, 2 Jul 2025 08:48:59 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v6] In-Reply-To: References: Message-ID: > Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. > > Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still. > > The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reducion micro-benchmarks. 
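For context, the MULLanes benchmarks cited below boil down to a multiply-reduction over a vector species. A minimal Java sketch of that shape (my own approximation, not the actual panama-vector benchmark code; it assumes the incubating jdk.incubator.vector module):

```java
import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class MulReduceExample {
    static final VectorSpecies<Long> S = LongVector.SPECIES_MAX;

    // Folds each vector to a scalar with reduceLanes(MUL); this is the step the
    // SVE specialization accelerates for vectors longer than 128 bits.
    static long mulReduce(long[] a) {
        long r = 1;
        int i = 0;
        for (; i <= a.length - S.length(); i += S.length()) {
            r *= LongVector.fromArray(S, a, i).reduceLanes(VectorOperators.MUL);
        }
        for (; i < a.length; i++) {
            r *= a[i];
        }
        return r;
    }

    public static void main(String[] args) {
        long[] a = new long[1024];
        java.util.Arrays.fill(a, 1L);
        a[0] = 2; a[1] = 3;
        System.out.println(mulReduce(a)); // 6
    }
}
```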
> > Benchmarks results: > > Neoverse-V1 (SVE 256-bit) > > Benchmark (size) Mode master PR Units > ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms > ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms > IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms > LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms > FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms > DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms > > > Fujitsu A64FX (SVE 512-bit): > > Benchmark (size) Mode master PR Units > ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms > ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms > IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms > LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms > FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms > DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: cleanup: update a copyright notice Co-authored-by: Hao Sun ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23181/files - new: https://git.openjdk.org/jdk/pull/23181/files/df09ab65..ebad6dd3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=04-05 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23181.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23181/head:pull/23181 PR: https://git.openjdk.org/jdk/pull/23181 From mablakatov at openjdk.org Wed Jul 2 08:48:59 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Wed, 2 Jul 2025 08:48:59 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v5] In-Reply-To: <2zMCHzKXQ1kBfjcU5Fc8s6fa2W6TTCKpSSjhB0dMdLw=.3c43071b-3982-4e0e-a300-e0547f4fbbec@github.com> References: <2zMCHzKXQ1kBfjcU5Fc8s6fa2W6TTCKpSSjhB0dMdLw=.3c43071b-3982-4e0e-a300-e0547f4fbbec@github.com> Message-ID: On Wed, 2 Jul 2025 03:28:10 GMT, Hao Sun wrote: >> Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: >> >> fixup: remove undefined insts from aarch64-asmtest.py > > test/hotspot/jtreg/compiler/loopopts/superword/TestVectorFPReduction.java line 2: > >> 1: /* >> 2: * Copyright (c) 2025, Arm Limited. All rights reserved. > > `XX, YY,` means this file was created at XX year and the latest update was made at YY year. If `XX=YY`, then use `XX,`. > > Suggestion: > > * Copyright (c) 2024, 2025, Arm Limited. All rights reserved. Thank you for catching this! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2179486265 From aph at openjdk.org Wed Jul 2 08:52:46 2025 From: aph at openjdk.org (Andrew Haley) Date: Wed, 2 Jul 2025 08:52:46 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v2] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 08:45:23 GMT, Evgeny Astigeevich wrote: > ``` > > ``` > > > > > > The test searches for `- compiler.onSpinWait.TestOnSpinWaitAArch64$Launcher::test at 0` and `invokestatic onSpinWait`. They identify the place where to search instructions. That's not great. C2 is free to move stuff around, so it's not certain this test will keep working. If you just want to make sure that the pattern is used, a block_comment() would be more reliable. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3027010064 From xgong at openjdk.org Wed Jul 2 08:59:47 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 2 Jul 2025 08:59:47 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: <7GrGfBF_v8F0v02sRHC78ofMZwpMdzQZaHeYlNvi_N0=.93defb9e-ca9b-41b4-8722-1746692e2316@github.com> References: <7GrGfBF_v8F0v02sRHC78ofMZwpMdzQZaHeYlNvi_N0=.93defb9e-ca9b-41b4-8722-1746692e2316@github.com> Message-ID: On Wed, 2 Jul 2025 01:52:19 GMT, Xiaohong Gong wrote: >> Agree with Paul, these are minor regressions. Let us proceed with this patch. > >> Agree with Paul, these are minor regressions. Let us proceed with this patch. > > Thanks so much for your review @sviswa7 ! > @XiaohongGong I quickly scanned the patch, it looks good to me too. I'm submitting some internal testing now, to make sure our extended testing does not break on integration. Should take about 24h. Good to know that. Thanks so much for your testing @eme64 ! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3027032342 From roland at openjdk.org Wed Jul 2 09:00:30 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 2 Jul 2025 09:00:30 GMT Subject: RFR: 8275202: C2: optimize out more redundant conditions [v6] In-Reply-To: <978cgwy3Nb_x7yU6jZz0f6zhTBZfphstisAkBf1Vktc=.283d06eb-4f79-40cf-b8dd-a9c230e59902@github.com> References: <978cgwy3Nb_x7yU6jZz0f6zhTBZfphstisAkBf1Vktc=.283d06eb-4f79-40cf-b8dd-a9c230e59902@github.com> Message-ID: > This change adds a new loop opts pass to optimize redundant conditions > such as the second one in: > > > if (i < 10) { > if (i < 42) { > > > In the branch of the first if, the type of i can be narrowed down to > [min_jint, 9] which can then be used to constant fold the second > condition. > > The compiler already keeps track of type[n] for every node in the > current compilation unit. That's not sufficient to optimize the > snippet above though because the type of i can only be narrowed in > some sections of the control flow (that is a subset of all > controls). The solution is to build a new table that tracks the type > of n at every control c > > > type'[n, root] = type[n] // initialized from igvn's type table > type'[n, c] = type[n, idom(c)] > > > This pass iterates over the CFG looking for conditions such as: > > > if (i < 10) { > > > that allows narrowing the type of i and updates the type' table > accordingly. > > At a region r: > > > type'[n, r] = meet(type'[n, r->in(1)], type'[n, r->in(2)]...) > > > For a Phi phi at a region r: > > > type'[phi, r] = meet(type'[phi->in(1), r->in(1)], type'[phi->in(2), r->in(2)]...) > > > Once a type is narrowed, uses are enqueued and their types are > computed by calling the Value() methods. If a use's type is narrowed, > it's recorded at c in the type' table. Value() methods retrieve types > from the type table, not the type' table. To address that issue while > leaving Value() methods unchanged, before calling Value() at c, the > type table is updated so: > > > type[n] = type'[n, c] > > > An exception is for Phi::Value which needs to retrieve the type of > nodes are various controls: there, a new type(Node* n, Node* c) > method is used. > > For most n and c, type'[n, c] is likely the same as type[n], the type > recorded in the global igvn table (that is there shouldn't be many > nodes at only a few control for which we can narrow the type down). 
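To make the motivating pattern concrete, here is a tiny Java example of the redundant condition the new pass is meant to fold (my own illustration, not taken from the patch):

```java
public class RedundantCondition {
    // Inside the outer branch the type of i narrows to [min_jint, 9], so the
    // inner "i < 42" test is provably true and can be constant-folded.
    static int test(int i) {
        if (i < 10) {
            if (i < 42) {
                return 1;
            }
            return 2; // dead once the inner condition folds to true
        }
        return 3;
    }

    public static void main(String[] args) {
        System.out.println(test(5) + " " + test(50)); // prints "1 3"
    }
}
```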
As > a consequence, the types'[n, c] table is implemented with: > > - At c, narrowed down types are stored in a GrowableArray. Each entry > records the previous type at idom(c) and the narrowed down type at > c. > > - The GrowableArray of type updates is recorded in a hash table > indexed by c. If there's no update at c, there's no entry in the > hash table. > > This pass operates in 2 steps: > > - it first iterates over the graph looking for conditions that narrow > the types of some nodes and propagate type updates to uses until a > fix point. > > - it transforms the graph so newly found constant nodes are folded. > > > The new pass is run on every loop opts. There are a couple rea... Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 14 commits: - more - more - more - Merge branch 'master' into JDK-8275202 - more - more - more - Merge branch 'master' into JDK-8275202 - review - Update src/hotspot/share/opto/loopConditionalPropagation.cpp Co-authored-by: Roberto Casta?eda Lozano - ... and 4 more: https://git.openjdk.org/jdk/compare/c220b135...9d093971 ------------- Changes: https://git.openjdk.org/jdk/pull/14586/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=14586&range=05 Stats: 4588 lines in 34 files changed: 4483 ins; 40 del; 65 mod Patch: https://git.openjdk.org/jdk/pull/14586.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14586/head:pull/14586 PR: https://git.openjdk.org/jdk/pull/14586 From roland at openjdk.org Wed Jul 2 09:02:44 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 2 Jul 2025 09:02:44 GMT Subject: RFR: 8275202: C2: optimize out more redundant conditions [v3] In-Reply-To: References: <978cgwy3Nb_x7yU6jZz0f6zhTBZfphstisAkBf1Vktc=.283d06eb-4f79-40cf-b8dd-a9c230e59902@github.com> Message-ID: On Mon, 9 Jun 2025 07:35:10 GMT, Roberto Casta?eda Lozano wrote: > I tested this changeset applied on top of jdk-25+26 (Oracle CI tier1-5) and found the following issues (besides the trivial `NULL` occurrence reported above): I pushed new commits that should address those failures. I added a test case for that one (a tricky issue): > * `assert(c->_idx >= _unique || _type_table->find_type_between(c, c, _phase->C->root()) != Type::TOP) failed: for If we don't follow dead projections` in multiple tests, e.g. `compiler/predicates/TestHoistedPredicateForNonRangeCheck.java` and `compiler/predicates/assertion/TestOpaqueInitializedAssertionPredicateNode.java`. New commits also include some tweaks and cleanup. @robcasloz would you mind running tests again? ------------- PR Comment: https://git.openjdk.org/jdk/pull/14586#issuecomment-3027043159 From xgong at openjdk.org Wed Jul 2 09:02:46 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 2 Jul 2025 09:02:46 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v2] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 08:15:34 GMT, Andrew Haley wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: >> >> Refine comments based on review suggestion > > src/hotspot/cpu/aarch64/aarch64.ad line 2367: > >> 2365: // Theoretically, the minimal vector length supported by AArch64 >> 2366: // ISA and Vector API species is 64-bit. However, 32-bit or 16-bit >> 2367: // vector length is also allowed for special Vector API usages. 
> > Suggestion: > > // Usually, the shortest vector length supported by AArch64 > // ISA and Vector API species is 64 bits. However, we allow > // 32-bit or 16-bit vectors in a few special cases. > > > Reason for change: it wasn't clear what "supported" meant. Supported by the hardware, or by HotSpot. And why do we only support it in a few special cases? This comment raises more questions than it answers. Thanks so much for your suggestion! Looks better to me. I will update soon. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26057#discussion_r2179517582 From tkurashige at openjdk.org Wed Jul 2 09:08:43 2025 From: tkurashige at openjdk.org (Taizo Kurashige) Date: Wed, 2 Jul 2025 09:08:43 GMT Subject: RFR: 8359120: Improve warning message when fail to load hsdis library [v2] In-Reply-To: <-i6UPk-bhy9RnnCus_JbJ1nQ63nMX9djubON9WBbHQ8=.a2305566-563e-4171-b526-bcd645de51a3@github.com> References: <-i6UPk-bhy9RnnCus_JbJ1nQ63nMX9djubON9WBbHQ8=.a2305566-563e-4171-b526-bcd645de51a3@github.com> Message-ID: On Mon, 23 Jun 2025 08:56:12 GMT, Taizo Kurashige wrote: >> This PR is improvement of warning message when fail to load hsdis library. >> >> [JDK-8287001](https://bugs.openjdk.org/browse/JDK-8287001) introduced a warning on hsdis library load failure. This is useful when the user executes -XX:+PrintAssembly, etc. >> >> However, I think that when hs_err occurs, users might be confused by this warning printed by Xlog. Because users are not likely to know that hsdis is loaded for the [MachCode] section of the hs_err report, they may wonder, for example, "Why do I get warnings about hsdis load errors when -XX:+PrintAssembly is not specified?." >> >> To clear up this confusion, I suggest printing a warning just before [MachCode]. >> >>
>> >> sample output >> >> If hs_err occurs and hsdis load fails without the option to specify where the hs_err report should be output, the following is output to the hs_err_pir log file: >> >> . >> . >> native method entry point (kind = native) [0x000001ae8753cec0, 0x000001ae8753dac0] 3072 bytes >> >> Loading hsdis library failed, so undisassembled code is printed in the below [MachCode] section >> [MachCode] >> 0x000001ae8753cec0: 488b 4b08 | 0fb7 492e | 584c 8d74 | ccf8 6800 | 0000 0068 | 0000 0000 | 5055 488b | ec41 5548 >> 0x000001ae8753cee0: 8b43 084c | 8d68 3848 | 8b40 0868 | 0000 0000 | 5348 8b50 | 18 >> . >> . >> >> >> If -XX:+PrintAssembly is specified and hsdis load fails, the following is output to the stdout >> >> $ java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -version >> OpenJDK 64-Bit Server VM warning: PrintAssembly is enabled; turning on DebugNonSafepoints to gain additional output >> >> ============================= C1-compiled nmethod ============================== >> ----------------------------------- Assembly ----------------------------------- >> >> Compiled method (c1) 57 2 3 java.lang.Object:: (1 bytes) >> total in heap [0x0000024a08a00008,0x0000024a08a00208] = 512 >> . >> . >> >> [Constant Pool (empty)] >> >> >> Loading hsdis library failed, so undisassembled code is printed in the below [MachCode] section >> [MachCode] >> [Instructions begin] >> 0x0000024a08a00100: 6666 660f | 1f84 0000 | 0000 0066 | 6666 9066 | 6690 448b | 5208 443b >> . >> . >> [Constant Pool (empty)] >> >> >> Loading hsdis library failed, so undisassembled code is printed in the below [MachCode] section >> [MachCode] >> [Verified Entry Point] >> # {method} {0x00000000251a1898} 'toUnsignedInt' '(B)I' in 'java/lang/Byte >> . >> . >> >> >>
>> >> Since... > > Taizo Kurashige has updated the pull request incrementally with one additional commit since the last revision: > > Fix message and revert lines for Xlog Thank you for your review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25726#issuecomment-3027058301 From duke at openjdk.org Wed Jul 2 09:08:43 2025 From: duke at openjdk.org (duke) Date: Wed, 2 Jul 2025 09:08:43 GMT Subject: RFR: 8359120: Improve warning message when fail to load hsdis library [v2] In-Reply-To: <-i6UPk-bhy9RnnCus_JbJ1nQ63nMX9djubON9WBbHQ8=.a2305566-563e-4171-b526-bcd645de51a3@github.com> References: <-i6UPk-bhy9RnnCus_JbJ1nQ63nMX9djubON9WBbHQ8=.a2305566-563e-4171-b526-bcd645de51a3@github.com> Message-ID: On Mon, 23 Jun 2025 08:56:12 GMT, Taizo Kurashige wrote: >> This PR is improvement of warning message when fail to load hsdis library. >> >> [JDK-8287001](https://bugs.openjdk.org/browse/JDK-8287001) introduced a warning on hsdis library load failure. This is useful when the user executes -XX:+PrintAssembly, etc. >> >> However, I think that when hs_err occurs, users might be confused by this warning printed by Xlog. Because users are not likely to know that hsdis is loaded for the [MachCode] section of the hs_err report, they may wonder, for example, "Why do I get warnings about hsdis load errors when -XX:+PrintAssembly is not specified?." >> >> To clear up this confusion, I suggest printing a warning just before [MachCode]. >> >>
>> >> sample output >> >> If hs_err occurs and hsdis load fails without the option to specify where the hs_err report should be output, the following is output to the hs_err_pir log file: >> >> . >> . >> native method entry point (kind = native) [0x000001ae8753cec0, 0x000001ae8753dac0] 3072 bytes >> >> Loading hsdis library failed, so undisassembled code is printed in the below [MachCode] section >> [MachCode] >> 0x000001ae8753cec0: 488b 4b08 | 0fb7 492e | 584c 8d74 | ccf8 6800 | 0000 0068 | 0000 0000 | 5055 488b | ec41 5548 >> 0x000001ae8753cee0: 8b43 084c | 8d68 3848 | 8b40 0868 | 0000 0000 | 5348 8b50 | 18 >> . >> . >> >> >> If -XX:+PrintAssembly is specified and hsdis load fails, the following is output to the stdout >> >> $ java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -version >> OpenJDK 64-Bit Server VM warning: PrintAssembly is enabled; turning on DebugNonSafepoints to gain additional output >> >> ============================= C1-compiled nmethod ============================== >> ----------------------------------- Assembly ----------------------------------- >> >> Compiled method (c1) 57 2 3 java.lang.Object:: (1 bytes) >> total in heap [0x0000024a08a00008,0x0000024a08a00208] = 512 >> . >> . >> >> [Constant Pool (empty)] >> >> >> Loading hsdis library failed, so undisassembled code is printed in the below [MachCode] section >> [MachCode] >> [Instructions begin] >> 0x0000024a08a00100: 6666 660f | 1f84 0000 | 0000 0066 | 6666 9066 | 6690 448b | 5208 443b >> . >> . >> [Constant Pool (empty)] >> >> >> Loading hsdis library failed, so undisassembled code is printed in the below [MachCode] section >> [MachCode] >> [Verified Entry Point] >> # {method} {0x00000000251a1898} 'toUnsignedInt' '(B)I' in 'java/lang/Byte >> . >> . >> >> >>
>> >> Since... > > Taizo Kurashige has updated the pull request incrementally with one additional commit since the last revision: > > Fix message and revert lines for Xlog @kurashige23 Your change (at version 6ff4f9b5a3f6302ae4605ee985755fbccd3e24fb) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25726#issuecomment-3027062091 From tkurashige at openjdk.org Wed Jul 2 09:24:44 2025 From: tkurashige at openjdk.org (Taizo Kurashige) Date: Wed, 2 Jul 2025 09:24:44 GMT Subject: Integrated: 8359120: Improve warning message when fail to load hsdis library In-Reply-To: References: Message-ID: On Tue, 10 Jun 2025 13:38:03 GMT, Taizo Kurashige wrote: > This PR is improvement of warning message when fail to load hsdis library. > > [JDK-8287001](https://bugs.openjdk.org/browse/JDK-8287001) introduced a warning on hsdis library load failure. This is useful when the user executes -XX:+PrintAssembly, etc. > > However, I think that when hs_err occurs, users might be confused by this warning printed by Xlog. Because users are not likely to know that hsdis is loaded for the [MachCode] section of the hs_err report, they may wonder, for example, "Why do I get warnings about hsdis load errors when -XX:+PrintAssembly is not specified?." > > To clear up this confusion, I suggest printing a warning just before [MachCode]. > >
> > sample output > > If hs_err occurs and hsdis load fails without the option to specify where the hs_err report should be output, the following is output to the hs_err_pir log file: > > . > . > native method entry point (kind = native) [0x000001ae8753cec0, 0x000001ae8753dac0] 3072 bytes > > Loading hsdis library failed, so undisassembled code is printed in the below [MachCode] section > [MachCode] > 0x000001ae8753cec0: 488b 4b08 | 0fb7 492e | 584c 8d74 | ccf8 6800 | 0000 0068 | 0000 0000 | 5055 488b | ec41 5548 > 0x000001ae8753cee0: 8b43 084c | 8d68 3848 | 8b40 0868 | 0000 0000 | 5348 8b50 | 18 > . > . > > > If -XX:+PrintAssembly is specified and hsdis load fails, the following is output to the stdout > > $ java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -version > OpenJDK 64-Bit Server VM warning: PrintAssembly is enabled; turning on DebugNonSafepoints to gain additional output > > ============================= C1-compiled nmethod ============================== > ----------------------------------- Assembly ----------------------------------- > > Compiled method (c1) 57 2 3 java.lang.Object:: (1 bytes) > total in heap [0x0000024a08a00008,0x0000024a08a00208] = 512 > . > . > > [Constant Pool (empty)] > > > Loading hsdis library failed, so undisassembled code is printed in the below [MachCode] section > [MachCode] > [Instructions begin] > 0x0000024a08a00100: 6666 660f | 1f84 0000 | 0000 0066 | 6666 9066 | 6690 448b | 5208 443b > . > . > [Constant Pool (empty)] > > > Loading hsdis library failed, so undisassembled code is printed in the below [MachCode] section > [MachCode] > [Verified Entry Point] > # {method} {0x00000000251a1898} 'toUnsignedInt' '(B)I' in 'java/lang/Byte > . > . > > >
> > Since the warning added in this fix cover the role of warning introduced in [JDK-8287001](https://bugs.openjdk.org/browse/JDK-828... This pull request has now been integrated. Changeset: ce998699 Author: Taizo Kurashige Committer: Manuel H?ssig URL: https://git.openjdk.org/jdk/commit/ce9986991d60e116ac6680a1b6a4b3ee5384d105 Stats: 9 lines in 2 files changed: 9 ins; 0 del; 0 mod 8359120: Improve warning message when fail to load hsdis library Reviewed-by: mhaessig, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/25726 From rrich at openjdk.org Wed Jul 2 09:36:15 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 2 Jul 2025 09:36:15 GMT Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining Message-ID: This PR adds CompileCommands to the test DumpThreadsWithEliminatedLock.java to force inlining of java/lang/String*.* methods. This will make inlining more stable to allow for the expected lock elimination based on c2 escape analysis. Forcing inlining of java/lang/StringBuffer.* wasn't sufficient on x86_64. With that the test still failed with TieredCompilation disabled. Testing: x86_64, ppc64 Failed inlining on x86_64 with TieredCompilation disabled: make test TEST=com/sun/management/HotSpotDiagnosticMXBean/DumpThreadsWithEliminatedLock.java TEST_VM_OPTS="-XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=PrintInlining,DumpThreadsWithEliminatedLock.*" JTREG=TIMEOUT_FACTOR=0.1 [...] STDOUT: CompileCommand: PrintInlining DumpThreadsWithEliminatedLock.* bool PrintInlining = true @ 1 java.util.concurrent.atomic.AtomicBoolean::get (13 bytes) inline (hot) @ 11 java.lang.StringBuffer:: (7 bytes) inline (hot) late inline succeeded (string method) @ 3 java.lang.AbstractStringBuilder:: (39 bytes) inline (hot) @ 1 java.lang.Object:: (1 bytes) inline (hot) @ 16 java.lang.System::currentTimeMillis (0 bytes) (intrinsic) s @ 19 java.lang.StringBuffer::append (13 bytes) failed to inline: already compiled into a big method s @ 24 java.lang.StringBuffer::toString (44 bytes) inline (hot) late inline succeeded (string method) s @ 1 java.lang.StringBuffer::length (5 bytes) accessor @ 24 java.lang.String:: (98 bytes) failed to inline: already compiled into a big method @ 30 java.util.concurrent.atomic.AtomicReference::set (6 bytes) accessor 2025-07-02T09:25:53.396634900Z Attempt 1, found: false 2025-07-02T09:25:53.415673072Z Attempt 2, found: false 2025-07-02T09:25:53.418876867Z Attempt 3, found: false [...] ------------- Commit messages: - Force inlining of String*.* methods - Force inlining of StringBuffer methods Changes: https://git.openjdk.org/jdk/pull/26033/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26033&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8360599 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/26033.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26033/head:pull/26033 PR: https://git.openjdk.org/jdk/pull/26033 From rrich at openjdk.org Wed Jul 2 09:36:15 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 2 Jul 2025 09:36:15 GMT Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining In-Reply-To: References: Message-ID: On Sun, 29 Jun 2025 15:26:14 GMT, Richard Reingruber wrote: > This PR adds CompileCommands to the test DumpThreadsWithEliminatedLock.java to force inlining of java/lang/String*.* methods. 
This will make inlining more stable to allow for the expected lock elimination based on c2 escape analysis. > > Forcing inlining of java/lang/StringBuffer.* wasn't sufficient on x86_64. With that the test still failed with TieredCompilation disabled. > > Testing: x86_64, ppc64 > > Failed inlining on x86_64 with TieredCompilation disabled: > > > make test TEST=com/sun/management/HotSpotDiagnosticMXBean/DumpThreadsWithEliminatedLock.java TEST_VM_OPTS="-XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=PrintInlining,DumpThreadsWithEliminatedLock.*" JTREG=TIMEOUT_FACTOR=0.1 > > [...] > > STDOUT: > CompileCommand: PrintInlining DumpThreadsWithEliminatedLock.* bool PrintInlining = true > @ 1 java.util.concurrent.atomic.AtomicBoolean::get (13 bytes) inline (hot) > @ 11 java.lang.StringBuffer::<init> (7 bytes) inline (hot) late inline succeeded (string method) > @ 3 java.lang.AbstractStringBuilder::<init> (39 bytes) inline (hot) > @ 1 java.lang.Object::<init> (1 bytes) inline (hot) > @ 16 java.lang.System::currentTimeMillis (0 bytes) (intrinsic) > s @ 19 java.lang.StringBuffer::append (13 bytes) failed to inline: already compiled into a big method > s @ 24 java.lang.StringBuffer::toString (44 bytes) inline (hot) late inline succeeded (string method) > s @ 1 java.lang.StringBuffer::length (5 bytes) accessor > @ 24 java.lang.String::<init> (98 bytes) failed to inline: already compiled into a big method > @ 30 java.util.concurrent.atomic.AtomicReference::set (6 bytes) accessor > 2025-07-02T09:25:53.396634900Z Attempt 1, found: false > 2025-07-02T09:25:53.415673072Z Attempt 2, found: false > 2025-07-02T09:25:53.418876867Z Attempt 3, found: false > > [...] JIT compiler folks might want to have a look at this PR. Maybe there's a better way to have the StringBuilder locks eliminated deterministically. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26033#issuecomment-3027054192 From mdoerr at openjdk.org Wed Jul 2 09:54:41 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 2 Jul 2025 09:54:41 GMT Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining In-Reply-To: References: Message-ID: On Sun, 29 Jun 2025 15:26:14 GMT, Richard Reingruber wrote: > This PR adds CompileCommands to the test DumpThreadsWithEliminatedLock.java to force inlining of java/lang/String*.* methods. This will make inlining more stable to allow for the expected lock elimination based on c2 escape analysis. > > Forcing inlining of java/lang/StringBuffer.* wasn't sufficient on x86_64. With that the test still failed with TieredCompilation disabled. > > Testing: x86_64, ppc64 > > Failed inlining on x86_64 with TieredCompilation disabled: > > > make test TEST=com/sun/management/HotSpotDiagnosticMXBean/DumpThreadsWithEliminatedLock.java TEST_VM_OPTS="-XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=PrintInlining,DumpThreadsWithEliminatedLock.*" JTREG=TIMEOUT_FACTOR=0.1 > > [...]
> > STDOUT: > CompileCommand: PrintInlining DumpThreadsWithEliminatedLock.* bool PrintInlining = true > @ 1 java.util.concurrent.atomic.AtomicBoolean::get (13 bytes) inline (hot) > @ 11 java.lang.StringBuffer::<init> (7 bytes) inline (hot) late inline succeeded (string method) > @ 3 java.lang.AbstractStringBuilder::<init> (39 bytes) inline (hot) > @ 1 java.lang.Object::<init> (1 bytes) inline (hot) > @ 16 java.lang.System::currentTimeMillis (0 bytes) (intrinsic) > s @ 19 java.lang.StringBuffer::append (13 bytes) failed to inline: already compiled into a big method > s @ 24 java.lang.StringBuffer::toString (44 bytes) inline (hot) late inline succeeded (string method) > s @ 1 java.lang.StringBuffer::length (5 bytes) accessor > @ 24 java.lang.String::<init> (98 bytes) failed to inline: already compiled into a big method > @ 30 java.util.concurrent.atomic.AtomicReference::set (6 bytes) accessor > 2025-07-02T09:25:53.396634900Z Attempt 1, found: false > 2025-07-02T09:25:53.415673072Z Attempt 2, found: false > 2025-07-02T09:25:53.418876867Z Attempt 3, found: false > > [...] LGTM. ------------- Marked as reviewed by mdoerr (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26033#pullrequestreview-2978510981 From bmaillard at openjdk.org Wed Jul 2 10:03:23 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Wed, 2 Jul 2025 10:03:23 GMT Subject: RFR: 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal [v2] In-Reply-To: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> References: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> Message-ID: <3wDbLni8c6Up8_W56fFOv_meffgHHjzch0e3QESao1A=.03a7c7a7-787d-4fb0-b081-64865636bf14@github.com> > This PR adds a node hash comparison after calling `Ideal` in `PhaseIterGVN::verify_Ideal_for` to introduce an additional layer of verification for missed optimizations. Previously, we relied on the return value of `Ideal`, which is expected to be `nullptr` if no transformation was done. > > By also checking the node's hash before and after `Ideal`, we could catch inconsistencies in the implementation or unintended modifications to the graph. Both of these may indicate missed or incomplete optimizations. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361144) > - [x] tier1-3, plus some internal testing > > Thank you for reviewing!
Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: 8361144: update comment Co-authored-by: Damon Fenacci ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26064/files - new: https://git.openjdk.org/jdk/pull/26064/files/e06b4d53..28851936 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26064&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26064&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26064.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26064/head:pull/26064 PR: https://git.openjdk.org/jdk/pull/26064 From bmaillard at openjdk.org Wed Jul 2 10:19:59 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Wed, 2 Jul 2025 10:19:59 GMT Subject: RFR: 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal [v3] In-Reply-To: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> References: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> Message-ID: > This PR adds a node hash comparison after calling `Ideal` in `PhaseIterGVN::verify_Ideal_for` to introduce an additional layer of verification for missed optimizations. Previously, we relied on the return value of `Ideal`, which is expected to be `nullptr` if no transformation was done. > > By also checking the node's hash before and after `Ideal`, we could catch inconsistencies in the implementation or unintended modifications to the graph. Both of these may indicate missed or incomplete optimizations. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361144) > - [x] tier1-3, plus some internal testing > > Thank you for reviewing! Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: 8361144: add comment for consistency with node count ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26064/files - new: https://git.openjdk.org/jdk/pull/26064/files/28851936..75f81296 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26064&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26064&range=01-02 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26064.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26064/head:pull/26064 PR: https://git.openjdk.org/jdk/pull/26064 From bmaillard at openjdk.org Wed Jul 2 10:20:00 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Wed, 2 Jul 2025 10:20:00 GMT Subject: RFR: 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal [v3] In-Reply-To: References: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> Message-ID: On Wed, 2 Jul 2025 06:55:34 GMT, Damon Fenacci wrote: >> Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: >> >> 8361144: add comment for consistency with node count > > src/hotspot/share/opto/phaseX.cpp line 1821: > >> 1819: // The number of nodes shoud not increase. >> 1820: uint old_unique = C->unique(); >> 1821: uint old_hash = n->hash(); > > Just to be consistent with `old_unique` we could add a small comment (here or below for both). What do you think? 
Sounds reasonable! Made the update ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26064#discussion_r2179682863 From shade at openjdk.org Wed Jul 2 10:20:48 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 2 Jul 2025 10:20:48 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems Message-ID: We routinely CTW 3rd party JARs to make sure our compilers work. By the nature of the JARs, they have dependencies on other JARs, and CTW runner frequently warns out with NCDFE. It does so very crudely, missing opportunities to compile the methods that _do not_ trigger NCDFEs. CTW should be made more tolerant to this. I think the normal "modules" CTW runs into the similar problem, but on a lesser scale, as we do not have a very hairy dependency graph within JDK. The CTW logs are also fairly noisy with full exception traces when NCDFE is semi-expected. This PR does _not_ print exception stack traces in these cases, only "NOTE"-s about it. This makes the log fairly clean and more understandable. Motivational scope improvement compiling a sample 3rd party JAR (cassandra-2.1.4.0.jar): Before: Done (2487 classes, 9866 methods, 24584 ms) After: Done (2487 classes, 10074 methods, 24150 ms) ; +2% more methods Additional testing: - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` ------------- Commit messages: - Move clinit compile back - Initial - Fix Changes: https://git.openjdk.org/jdk/pull/26090/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26090&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361255 Stats: 41 lines in 2 files changed: 35 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/26090.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26090/head:pull/26090 PR: https://git.openjdk.org/jdk/pull/26090 From bmaillard at openjdk.org Wed Jul 2 10:24:38 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Wed, 2 Jul 2025 10:24:38 GMT Subject: RFR: 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal [v3] In-Reply-To: References: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> Message-ID: On Tue, 1 Jul 2025 13:45:04 GMT, Galder Zamarre?o wrote: > Have you considered adding a test for this? Is that feasible? @galderz I have considered doing it, but there is no known case that triggers the condition. This change was suggested by @eme64 when discussing the related [JDK-8359602](https://bugs.openjdk.org/browse/JDK-8359602). ------------- PR Comment: https://git.openjdk.org/jdk/pull/26064#issuecomment-3027305541 From galder at openjdk.org Wed Jul 2 10:56:40 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Wed, 2 Jul 2025 10:56:40 GMT Subject: RFR: 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal [v3] In-Reply-To: References: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> Message-ID: On Wed, 2 Jul 2025 10:19:59 GMT, Beno?t Maillard wrote: >> This PR adds a node hash comparison after calling `Ideal` in `PhaseIterGVN::verify_Ideal_for` to introduce an additional layer of verification for missed optimizations. Previously, we relied on the return value of `Ideal`, which is expected to be `nullptr` if no transformation was done. 
>> >> By also checking the node's hash before and after `Ideal`, we could catch inconsistencies in the implementation or unintended modifications to the graph. Both of these may indicate missed or incomplete optimizations. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361144) >> - [x] tier1-3, plus some internal testing >> >> Thank you for reviewing! > > Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: > > 8361144: add comment for consistency with node count Marked as reviewed by galder (Author). ------------- PR Review: https://git.openjdk.org/jdk/pull/26064#pullrequestreview-2978685354 From mdoerr at openjdk.org Wed Jul 2 11:01:49 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 2 Jul 2025 11:01:49 GMT Subject: [jdk25] RFR: 8361259: JDK25: Backout JDK-8258229 Message-ID: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> This is a backout of [JDK-8258229](https://bugs.openjdk.org/browse/JDK-8258229) for JDK25 only. The problematic code has already been removed by [JDK-8358821](https://bugs.openjdk.org/browse/JDK-8358821) in JDK26. The backout is clean for the C++ code, but the test backout includes the backout of the follow-up change [JDK-8356310](https://bugs.openjdk.org/browse/JDK-8356310). Rationale: Minimize risk for JDK25. We should use the better fix JDK-8358821 in the long term. However, that one should get some more stabilization time before backporting it. Also see JBS issue. Proposed long term solution: Backport JDK-8358821 to jdk25u and revert this change again after an appropriate time. Short term: The issue solved by JDK-8258229 is not critical. It should be ok to postpone the fix to jdk25u. ------------- Commit messages: - 8361259: JDK25: Backout JDK-8258229 Changes: https://git.openjdk.org/jdk/pull/26091/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26091&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361259 Stats: 93 lines in 2 files changed: 0 ins; 93 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26091.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26091/head:pull/26091 PR: https://git.openjdk.org/jdk/pull/26091 From yzheng at openjdk.org Wed Jul 2 11:28:47 2025 From: yzheng at openjdk.org (Yudi Zheng) Date: Wed, 2 Jul 2025 11:28:47 GMT Subject: RFR: 8358179: Performance regression in Math.cbrt [v2] In-Reply-To: References: <45l5EvxoRINI1_Ep2_snJzKNMPo4-dPXADalLN1fq1Y=.9f697a35-ee7b-4e7a-9e5e-ff33911b3b21@github.com> Message-ID: On Fri, 27 Jun 2025 01:43:16 GMT, Mohamed Issa wrote: >> The changes described below are meant to resolve the performance regression introduced by the **x86_64 cbrt** double precision floating point scalar intrinsic in #24470. >> >> 1. Check for +0, -0, +INF, -INF, and NaN before any other input values. >> 2. If these special values are found, return immediately with minimal modifications to the result register. >> 3. Performance testing shows the modified intrinsic improves throughput by 65.1% over the original intrinsic on average for the special values while throughput drops by 5.5% for the normal value range (-INF, -2^(-1022)], [2^(-1022), INF). >> >> The commands to run all relevant micro-benchmarks are posted below. >> >> `make test TEST="micro:CbrtPerf.CbrtPerfRanges"` >> `make test TEST="micro:CbrtPerf.CbrtPerfSpecialValues"` >> >> The results of all tests posted below were captured with an [Intel? 
Xeon 8488C](https://www.intel.com/content/www/us/en/products/sku/231730/intel-xeon-platinum-8480c-processor-105m-cache-2-00-ghz/specifications.html) using [OpenJDK v26-b1](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B1) as the baseline version. The term _baseline1_ refers to runs with the intrinsic enabled and _baseline2_ refers to runs with the intrinsic disabled. >> >> Each result is the mean of 8 individual runs, and the input ranges used match those from the original Java implementation. Overall, the changes provide a significant uplift over _baseline1_ except for a mild regression in the (**2^(-1022) <= |x| < INF**) input range, which is expected due to the extra checks. When comparing against _baseline2_, the modified intrinsic significantly still outperforms for the inputs (**-INF < x < INF**) that require heavy compute. However, the special value inputs that trigger fast path returns still perform better with _baseline2_. >> >> | Input range(s) | Baseline1 (ops/ms) | Change (ops/ms) | Change vs baseline1 (%) | >> | :-------------------------------------: | :-------------------: | :------------------: | :--------------------------: | >> | [-2^(-1022), 2^(-1022)] | 18470 | 20847 | +12.87 | >> | (-INF, -2^(-1022)], [2^(-1022), INF) | 210538 | 198925 | -5.52 | >> | [0] | 344990 | 627561 | +81.91 | >> | [-0] ... > > Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: > > Ensure ABS_MASK is a 128-bit memory sized location and only use equal enum for UCOMISD checks src/hotspot/cpu/x86/stubGenerator_x86_64_cbrt.cpp line 350: > 348: > 349: __ bind(L_2TAG_PACKET_6_0_1); > 350: __ movsd(xmm0, ExternalAddress(NEG_INF), r11 /*rscratch*/); note that `NEG_INF` is now unused ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25962#discussion_r2179808403 From mhaessig at openjdk.org Wed Jul 2 11:39:43 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Wed, 2 Jul 2025 11:39:43 GMT Subject: [jdk25] RFR: 8361259: JDK25: Backout JDK-8258229 In-Reply-To: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> References: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> Message-ID: <3IMUQfwLLDneX5SFYKzLTLk_queN_r2Q7VPC7B31vow=.d6f7600f-acf3-482f-88da-5e260cb16aa1@github.com> On Wed, 2 Jul 2025 10:54:13 GMT, Martin Doerr wrote: > This is a backout of [JDK-8258229](https://bugs.openjdk.org/browse/JDK-8258229) for JDK25 only. The problematic code has already been removed by [JDK-8358821](https://bugs.openjdk.org/browse/JDK-8358821) in JDK26. > > The backout is clean for the C++ code, but the test backout includes the backout of the follow-up change [JDK-8356310](https://bugs.openjdk.org/browse/JDK-8356310). > > Rationale: Minimize risk for JDK25. We should use the better fix JDK-8358821 in the long term. However, that one should get some more stabilization time before backporting it. Also see JBS issue. > > Proposed long term solution: Backport JDK-8358821 to jdk25u and revert this change again after an appropriate time. > > Short term: The issue solved by JDK-8258229 is not critical. It should be ok to postpone the fix to jdk25u. The change and the proposed plans look good to me. Apologies for all the troubles I have caused. ------------- Marked as reviewed by mhaessig (Committer). 
PR Review: https://git.openjdk.org/jdk/pull/26091#pullrequestreview-2978804263 From shade at openjdk.org Wed Jul 2 12:02:07 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 2 Jul 2025 12:02:07 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v2] In-Reply-To: References: Message-ID: > We routinely CTW 3rd party JARs to make sure our compilers work. By the nature of the JARs, they have dependencies on other JARs, and CTW runner frequently warns out with NCDFE. It does so very crudely, missing opportunities to compile the methods that _do not_ trigger NCDFEs. CTW should be made more tolerant to this. I think the normal "modules" CTW runs into the similar problem, but on a lesser scale, as we do not have a very hairy dependency graph within JDK. > > The CTW logs are also fairly noisy with full exception traces when NCDFE is semi-expected. This PR does _not_ print exception stack traces in these cases, only "NOTE"-s about it. This makes the log fairly clean and more understandable. > > Motivational scope improvement compiling a sample 3rd party JAR (cassandra-2.1.4.0.jar): > > > Before: Done (2487 classes, 9866 methods, 24584 ms) > After: Done (2487 classes, 10074 methods, 24150 ms) ; +2% more methods > > > Additional testing: > - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: - Merge branch 'master' into JDK-8361255-ctw-ncdfe - Move clinit compile back - Initial - Fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26090/files - new: https://git.openjdk.org/jdk/pull/26090/files/ba0cc87b..9d41f80a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26090&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26090&range=00-01 Stats: 1189 lines in 72 files changed: 623 ins; 239 del; 327 mod Patch: https://git.openjdk.org/jdk/pull/26090.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26090/head:pull/26090 PR: https://git.openjdk.org/jdk/pull/26090 From thartmann at openjdk.org Wed Jul 2 12:00:39 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Jul 2025 12:00:39 GMT Subject: [jdk25] RFR: 8361259: JDK25: Backout JDK-8258229 In-Reply-To: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> References: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> Message-ID: On Wed, 2 Jul 2025 10:54:13 GMT, Martin Doerr wrote: > This is a backout of [JDK-8258229](https://bugs.openjdk.org/browse/JDK-8258229) for JDK25 only. The problematic code has already been removed by [JDK-8358821](https://bugs.openjdk.org/browse/JDK-8358821) in JDK26. > > The backout is clean for the C++ code, but the test backout includes the backout of the follow-up change [JDK-8356310](https://bugs.openjdk.org/browse/JDK-8356310). > > Rationale: Minimize risk for JDK25. We should use the better fix JDK-8358821 in the long term. However, that one should get some more stabilization time before backporting it. Also see JBS issue. > > Proposed long term solution: Backport JDK-8358821 to jdk25u and revert this change again after an appropriate time. > > Short term: The issue solved by JDK-8258229 is not critical. It should be ok to postpone the fix to jdk25u. Looks good to me too. 
------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26091#pullrequestreview-2978863065 From shade at openjdk.org Wed Jul 2 12:10:40 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 2 Jul 2025 12:10:40 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v2] In-Reply-To: References: Message-ID: <_U8Ws402jgYrpmU1GxnfiHkfein2Rsl1Rh4RKJFwvRQ=.5b74a2c1-5d67-4221-bce8-d00adeb63207@github.com> On Wed, 2 Jul 2025 12:02:07 GMT, Aleksey Shipilev wrote: >> We routinely CTW 3rd party JARs to make sure our compilers work. By the nature of the JARs, they have dependencies on other JARs, and CTW runner frequently warns out with NCDFE. It does so very crudely, missing opportunities to compile the methods that _do not_ trigger NCDFEs. CTW should be made more tolerant to this. I think the normal "modules" CTW runs into the similar problem, but on a lesser scale, as we do not have a very hairy dependency graph within JDK. >> >> The CTW logs are also fairly noisy with full exception traces when NCDFE is semi-expected. This PR does _not_ print exception stack traces in these cases, only "NOTE"-s about it. This makes the log fairly clean and more understandable. >> >> Motivational scope improvement compiling a sample 3rd party JAR (cassandra-2.1.4.0.jar): >> >> >> Before: Done (2487 classes, 9866 methods, 24584 ms) >> After: Done (2487 classes, 10074 methods, 24150 ms) ; +2% more methods >> >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Merge branch 'master' into JDK-8361255-ctw-ncdfe > - Move clinit compile back > - Initial > - Fix Sanity-checking CTW times: $ time CONF=linux-x86_64-server-fastdebug make test TEST=applications/ctw/modules/ # Base real 3m49.952s user 67m50.313s sys 5m24.288s # This PR real 3m53.800s user 67m26.925s sys 5m22.429s ------------- PR Comment: https://git.openjdk.org/jdk/pull/26090#issuecomment-3027631058 From mdoerr at openjdk.org Wed Jul 2 13:03:38 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 2 Jul 2025 13:03:38 GMT Subject: [jdk25] RFR: 8361259: JDK25: Backout JDK-8258229 In-Reply-To: <3IMUQfwLLDneX5SFYKzLTLk_queN_r2Q7VPC7B31vow=.d6f7600f-acf3-482f-88da-5e260cb16aa1@github.com> References: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> <3IMUQfwLLDneX5SFYKzLTLk_queN_r2Q7VPC7B31vow=.d6f7600f-acf3-482f-88da-5e260cb16aa1@github.com> Message-ID: On Wed, 2 Jul 2025 11:36:38 GMT, Manuel H?ssig wrote: > Apologies for all the troubles I have caused. Never mind. The related code is quite tricky. And your problem analysis was good. Thanks for the 2 reviews! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26091#issuecomment-3027802039 From asmehra at openjdk.org Wed Jul 2 13:27:45 2025 From: asmehra at openjdk.org (Ashutosh Mehra) Date: Wed, 2 Jul 2025 13:27:45 GMT Subject: RFR: 8361101: AOTCodeAddressTable::_stubs_addr not initialized/freed properly In-Reply-To: References: Message-ID: <-pMpqoug81IwYPE7M1In40E0z5SHdeRM0Dianb9yzsM=.ad435e03-ba15-45ab-89c3-e5331b709735@github.com> On Tue, 1 Jul 2025 15:50:29 GMT, Vladimir Kozlov wrote: >> Please reivew this patch to fix initialization and freeing of `AOTCodeAddressTable::_stubs_addr`. 
Changes are trivial > > Yes, it is trivial. @vnkozlov @shipilev thanks for the review. Integrating it now. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26053#issuecomment-3027875790 From asmehra at openjdk.org Wed Jul 2 13:27:46 2025 From: asmehra at openjdk.org (Ashutosh Mehra) Date: Wed, 2 Jul 2025 13:27:46 GMT Subject: Integrated: 8361101: AOTCodeAddressTable::_stubs_addr not initialized/freed properly In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 19:45:49 GMT, Ashutosh Mehra wrote: > Please review this patch to fix initialization and freeing of `AOTCodeAddressTable::_stubs_addr`. Changes are trivial. This pull request has now been integrated. Changeset: 3066a67e Author: Ashutosh Mehra URL: https://git.openjdk.org/jdk/commit/3066a67e6279f7e3896ab545bc6c291d279d2b03 Stats: 4 lines in 2 files changed: 4 ins; 0 del; 0 mod 8361101: AOTCodeAddressTable::_stubs_addr not initialized/freed properly Reviewed-by: kvn, shade ------------- PR: https://git.openjdk.org/jdk/pull/26053 From missa at openjdk.org Wed Jul 2 14:58:50 2025 From: missa at openjdk.org (Mohamed Issa) Date: Wed, 2 Jul 2025 14:58:50 GMT Subject: RFR: 8358179: Performance regression in Math.cbrt [v2] In-Reply-To: References: <45l5EvxoRINI1_Ep2_snJzKNMPo4-dPXADalLN1fq1Y=.9f697a35-ee7b-4e7a-9e5e-ff33911b3b21@github.com> Message-ID: On Wed, 2 Jul 2025 11:25:33 GMT, Yudi Zheng wrote: >> Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: >> >> Ensure ABS_MASK is a 128-bit memory sized location and only use equal enum for UCOMISD checks > > src/hotspot/cpu/x86/stubGenerator_x86_64_cbrt.cpp line 350: > >> 348: >> 349: __ bind(L_2TAG_PACKET_6_0_1); >> 350: __ movsd(xmm0, ExternalAddress(NEG_INF), r11 /*rscratch*/); > > note that `NEG_INF` is now unused Got it - thanks. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25962#discussion_r2180280115 From asmehra at openjdk.org Wed Jul 2 15:06:17 2025 From: asmehra at openjdk.org (Ashutosh Mehra) Date: Wed, 2 Jul 2025 15:06:17 GMT Subject: [jdk25] RFR: 8361101: AOTCodeAddressTable::_stubs_addr not initialized/freed properly Message-ID: Backporting the fix to jdk25 ------------- Commit messages: - Backport 3066a67e6279f7e3896ab545bc6c291d279d2b03 Changes: https://git.openjdk.org/jdk/pull/26095/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26095&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361101 Stats: 4 lines in 2 files changed: 4 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26095.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26095/head:pull/26095 PR: https://git.openjdk.org/jdk/pull/26095 From shade at openjdk.org Wed Jul 2 16:00:44 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 2 Jul 2025 16:00:44 GMT Subject: [jdk25] RFR: 8361101: AOTCodeAddressTable::_stubs_addr not initialized/freed properly In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 14:56:23 GMT, Ashutosh Mehra wrote: > Backporting the fix to jdk25 Marked as reviewed by shade (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26095#pullrequestreview-2979750627 From lmesnik at openjdk.org Wed Jul 2 16:29:39 2025 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Wed, 2 Jul 2025 16:29:39 GMT Subject: RFR: 8357739: [jittester] disable the hashCode method In-Reply-To: References: Message-ID: On Tue, 17 Jun 2025 19:49:34 GMT, Evgeny Nikitin wrote: > JITTester often uses the `hashCode` method (in fact, in almost every generated test).
Given that the method can be unstable between runs or in interpreted vs compiled runs, it can create false-positives. > > This PR fixes the issue by adding support for method templates similar to the ones used in CompilerCommands). All of those exclude templates match (and exclude) `String.indexOf(String)`, for example: > > java/lang/::*(Ljava/lang/String;I) > *String::indexOf(*) > java/lang/*::indexOf > > > Additionally, the PR adds support for comments (starting from '#') and empty lines in the excludes file. Marked as reviewed by lmesnik (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/25859#pullrequestreview-2979848692 From vpaprotski at openjdk.org Wed Jul 2 17:30:51 2025 From: vpaprotski at openjdk.org (Volodymyr Paprotski) Date: Wed, 2 Jul 2025 17:30:51 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 06:45:58 GMT, Jatin Bhateja wrote: > For a cleaner interface, I think we can also maintain a RAII style APXPushPopPairTracker ... Using the suggested code as a base, Vamsi and I tinkered with the idea some more! Here is what we came up with. This also tracks the correct order of registers being pushed/poped.. (haven't compiled it, so might have some syntax bugs). @dholmes-ora would you mind sharing your opinion? We seem to be making things more complicated, but hopefully in a good way? Also included a sample usage in a stub. #define __ _masm-> class PushPopTracker { private: int _counter; MacroAssembler *_masm; const int REGS = 32; // Increase as needed int regs[REGS]; public: PushPopTracker(MacroAssembler *_masm) : _counter(0), _masm(_masm) {} ~PushPopTracker() { assert(_counter == 0, "Push/pop pair mismatch"); } void push(Register reg) { assert(_counter0, "Push/pop underflow"); assert(regs[_counter] == reg.encoding(), "Push/pop pair mismatch: %d != %d", regs[_counter], reg.encoding()); _counter--; if (VM_Version::supports_apx_f()) { __ popp(reg); } else { __ pop(reg); } } } address StubGenerator::generate_intpoly_montgomeryMult_P256() { __ align(CodeEntryAlignment); /*...*/ address start = __ pc(); __ enter(); PushPopTracker s(_masm); s.push(r12); //1 s.push(r13); //2 s.push(r14); //3 #ifdef _WIN64 s.push(rsi); //4 s.push(rdi); //5 #endif s.push(rbp); //6 __ movq(rbp, rsp); __ andq(rsp, -32); __ subptr(rsp, 32); // Register Map const Register aLimbs = c_rarg0; // c_rarg0: rdi | rcx const Register bLimbs = rsi; // c_rarg1: rsi | rdx const Register rLimbs = r8; // c_rarg2: rdx | r8 const Register tmp1 = r9; const Register tmp2 = r10; /*...*/ __ movq(rsp, rbp); s.pop(rbp); //5 #ifdef _WIN64 s.pop(rdi); //4 s.pop(rsi); //3 #endif s.pop(r14); //2 s.pop(r13); //1 s.pop(r12); //0 __ leave(); __ ret(0); return start; } ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2180606586 From jbhateja at openjdk.org Wed Jul 2 17:47:41 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 2 Jul 2025 17:47:41 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 17:27:41 GMT, Volodymyr Paprotski wrote: >> For a cleaner interface, I think we can also maintain a RAII style APXPushPopPairTracker in the stub snippets using push/pop instruction sequence and wrap the actual assembler call underneath. The idea here is to catch the balancing error upfront as PPX is purely a performance hint. 
Instructions with this hint have the same functional semantics as those without. PPX hints set by the compiler that violate the balancing rule may turn off the PPX >> optimization, but they will not affect program semantics.. >> >> >> class APXPushPopPairTracker { >> private: >> int _counter; >> >> public: >> APXPushPopPairTracker() _counter(0) { >> } >> >> ~APXPushPopPairTracker() { >> assert(_counter == 0, "Push/pop pair mismatch"); >> } >> >> void push(Register reg, bool has_matching_pop) { >> if (has_matching_pop && VM_Version::supports_apx_f()) { >> Assembler::pushp(reg); >> incrementCounter(); >> } else { >> Assembler::push(reg); >> } >> } >> void pop(Register reg, bool has_matching_push) { >> if (has_matching_push && VM_Version::supports_apx_f()) { >> Assembler::popp(reg); >> decrementCounter(); >> } else { >> Assembler::pop(reg); >> } >> } >> void incrementCounter() { >> _counter++; >> } >> void decrementCounter() { >> _counter--; >> } >> } > >> For a cleaner interface, I think we can also maintain a RAII style APXPushPopPairTracker ... > > Using the suggested code as a base, Vamsi and I tinkered with the idea some more! Here is what we came up with. This also tracks the correct order of registers being pushed/poped.. (haven't compiled it, so might have some syntax bugs). > > @dholmes-ora would you mind sharing your opinion? We seem to be making things more complicated, but hopefully in a good way? > > Also included a sample usage in a stub. > > > #define __ _masm-> > > class PushPopTracker { > private: > int _counter; > MacroAssembler *_masm; > const int REGS = 32; // Increase as needed > int regs[REGS]; > public: > PushPopTracker(MacroAssembler *_masm) : _counter(0), _masm(_masm) {} > ~PushPopTracker() { > assert(_counter == 0, "Push/pop pair mismatch"); > } > > void push(Register reg) { > assert(_counter regs[_counter++] = reg.encoding(); > if (VM_Version::supports_apx_f()) { > __ pushp(reg); > } else { > __ push(reg); > } > } > void pop(Register reg) { > assert(_counter>0, "Push/pop underflow"); > assert(regs[_counter] == reg.encoding(), "Push/pop pair mismatch: %d != %d", regs[_counter], reg.encoding()); > _counter--; > if (VM_Version::supports_apx_f()) { > __ popp(reg); > } else { > __ pop(reg); > } > } > } > > address StubGenerator::generate_intpoly_montgomeryMult_P256() { > __ align(CodeEntryAlignment); > /*...*/ > address start = __ pc(); > __ enter(); > PushPopTracker s(_masm); > s.push(r12); //1 > s.push(r13); //2 > s.push(r14); //3 > #ifdef _WIN64 > s.push(rsi); //4 > s.push(rdi); //5 > #endif > s.push(rbp); //6 > __ movq(rbp, rsp); > __ andq(rsp, -32); > __ subptr(rsp, 32); > // Register Map > const Register aLimbs = c_rarg0; // c_rarg0: rdi | rcx > const Register bLimbs = rsi; // c_rarg1: rsi | rdx > const Register rLimbs = r8; // c_rarg2: rdx | r8 > const Register tmp1 = r9; > const Register tmp2 = r10; > /*...*/ > __ movq(rsp, rbp); > s.pop(rbp); //5 > #ifdef _WIN64 > s.pop(rdi); //4 > s.pop(rsi); //3 > #endif > s.pop(r14); //2 > s.pop(r13); //1 > s.pop(r12); //0 > __ leave(); > __ ret(0); > return start; > } @vamsi-parasa, It's better to make this as a subclass of MacroAssembler in src/hotspot/cpu/x86/macroAssembler_x86.hpp and pass Tracker as an argument to push / pop for a cleaner interface. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2180636365 From sviswanathan at openjdk.org Wed Jul 2 17:49:49 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 2 Jul 2025 17:49:49 GMT Subject: RFR: 8360116: Add support for AVX10 floating point minmax instruction [v6] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 01:57:46 GMT, Jatin Bhateja wrote: >> Intel@ AVX10 ISA [1] extensions added new floating point MIN/MAX instructions which comply with definitions in IEEE-754-2019 standard section 9.6 and can directly emulate Math.min/max semantics without the need for any special handling for NaN, +0.0 or -0.0 detection. >> >> **The following pseudo-code describes the existing algorithm for min/max[FD]:** >> >> Move the non-negative value to the second operand; this will ensure that we correctly handle 0.0 and -0.0 values, if values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is returned. Existing MINPS and MAXPS semantics only check for NaN as the second operand; hence, we need special handling to check for NaN at the first operand. >> >> btmp = (b < +0.0) ? a : b >> atmp = (b < +0.0) ? b : a >> Tmp = Max_Float(atmp , btmp) >> Res = (atmp == NaN) ? atmp : Tmp >> >> For min[FD] we need a small tweak in the above algorithm, i.e., move the non-negative value to the first operand, this will ensure that we correctly select -0.0 if both the operands being compared are 0.0 or -0.0. >> >> btmp = (b < +0.0) ? b : a >> atmp = (b < +0.0) ? a : b >> Tmp = Max_Float(atmp , btmp) >> Res = (atmp == NaN) ? atmp : Tmp >> >> Thus, we need additional special handling for NaNs and +/-0.0 to compute floating-point min/max values to comply with the semantics of Math.max/min APIs using existing MINPS / MAXPS instructions. AVX10.2 added a new instruction, VPMINMAX[SH,SS,SD]/[PH,PS,PD], which comprehensively handles special cases, thereby eliminating the need for special handling. >> >> Patch emits new instructions for reduction and non-reduction operations for single, double, and Float16 type. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin >> >> [1] https://www.intel.com/content/www/us/en/content-details/856721/intel-advanced-vector-extensions-10-2-intel-avx10-2-architecture-specification.html?wapkw=AVX10 > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Sandhya's review comments resolution Looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25914#pullrequestreview-2980096085 From jbhateja at openjdk.org Wed Jul 2 17:49:49 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 2 Jul 2025 17:49:49 GMT Subject: RFR: 8360116: Add support for AVX10 floating point minmax instruction [v6] In-Reply-To: References: Message-ID: <69bq-sgmNdZBGkcLyGo1dccJoCcC04FacUZW4CPHqkE=.ab942681-27ab-4ed6-b425-66b8487b9ab8@github.com> On Wed, 2 Jul 2025 17:45:29 GMT, Sandhya Viswanathan wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Sandhya's review comments resolution > > Looks good to me. Thanks @sviswa7 and @mhaessig for approvals. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/25914#issuecomment-3028779791 From jbhateja at openjdk.org Wed Jul 2 17:49:50 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 2 Jul 2025 17:49:50 GMT Subject: Integrated: 8360116: Add support for AVX10 floating point minmax instruction In-Reply-To: References: Message-ID: On Fri, 20 Jun 2025 11:08:54 GMT, Jatin Bhateja wrote: > Intel@ AVX10 ISA [1] extensions added new floating point MIN/MAX instructions which comply with definitions in IEEE-754-2019 standard section 9.6 and can directly emulate Math.min/max semantics without the need for any special handling for NaN, +0.0 or -0.0 detection. > > **The following pseudo-code describes the existing algorithm for min/max[FD]:** > > Move the non-negative value to the second operand; this will ensure that we correctly handle 0.0 and -0.0 values, if values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is returned. Existing MINPS and MAXPS semantics only check for NaN as the second operand; hence, we need special handling to check for NaN at the first operand. > > btmp = (b < +0.0) ? a : b > atmp = (b < +0.0) ? b : a > Tmp = Max_Float(atmp , btmp) > Res = (atmp == NaN) ? atmp : Tmp > > For min[FD] we need a small tweak in the above algorithm, i.e., move the non-negative value to the first operand, this will ensure that we correctly select -0.0 if both the operands being compared are 0.0 or -0.0. > > btmp = (b < +0.0) ? b : a > atmp = (b < +0.0) ? a : b > Tmp = Max_Float(atmp , btmp) > Res = (atmp == NaN) ? atmp : Tmp > > Thus, we need additional special handling for NaNs and +/-0.0 to compute floating-point min/max values to comply with the semantics of Math.max/min APIs using existing MINPS / MAXPS instructions. AVX10.2 added a new instruction, VPMINMAX[SH,SS,SD]/[PH,PS,PD], which comprehensively handles special cases, thereby eliminating the need for special handling. > > Patch emits new instructions for reduction and non-reduction operations for single, double, and Float16 type. > > Kindly review and share your feedback. > > Best Regards, > Jatin > > [1] https://www.intel.com/content/www/us/en/content-details/856721/intel-advanced-vector-extensions-10-2-intel-avx10-2-architecture-specification.html?wapkw=AVX10 This pull request has now been integrated. Changeset: 5e30bf68 Author: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/5e30bf68353d989aadc2d8176181226b2debd283 Stats: 465 lines in 7 files changed: 423 ins; 4 del; 38 mod 8360116: Add support for AVX10 floating point minmax instruction Reviewed-by: mhaessig, sviswanathan ------------- PR: https://git.openjdk.org/jdk/pull/25914 From asmehra at openjdk.org Wed Jul 2 17:52:48 2025 From: asmehra at openjdk.org (Ashutosh Mehra) Date: Wed, 2 Jul 2025 17:52:48 GMT Subject: [jdk25] Integrated: 8361101: AOTCodeAddressTable::_stubs_addr not initialized/freed properly In-Reply-To: References: Message-ID: <5B6SRjnVrt014BK6iJT8kEIv_qoyJ74xh0bE5VCVoOg=.8fabdd60-a675-4741-b741-08b6c5f44b99@github.com> On Wed, 2 Jul 2025 14:56:23 GMT, Ashutosh Mehra wrote: > Backporting the fix to jdk25 This pull request has now been integrated. 
Changeset: ab013962 Author: Ashutosh Mehra URL: https://git.openjdk.org/jdk/commit/ab013962093a427ae0f2acac82748d0c9f86ab3f Stats: 4 lines in 2 files changed: 4 ins; 0 del; 0 mod 8361101: AOTCodeAddressTable::_stubs_addr not initialized/freed properly Reviewed-by: shade Backport-of: 3066a67e6279f7e3896ab545bc6c291d279d2b03 ------------- PR: https://git.openjdk.org/jdk/pull/26095 From asmehra at openjdk.org Wed Jul 2 18:01:44 2025 From: asmehra at openjdk.org (Ashutosh Mehra) Date: Wed, 2 Jul 2025 18:01:44 GMT Subject: [jdk25] RFR: 8361101: AOTCodeAddressTable::_stubs_addr not initialized/freed properly In-Reply-To: References: Message-ID: <_2YlrJyouZjttbLFcWchpFVh-fRdt6p6crYJEND1kH8=.b14e5332-225c-49b0-b50a-22a9163cdd73@github.com> On Wed, 2 Jul 2025 15:57:43 GMT, Aleksey Shipilev wrote: >> Backporting the fix to jdk25 > > Marked as reviewed by shade (Reviewer). Thanks @shipilev for the review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26095#issuecomment-3028825791 From vpaprotski at openjdk.org Wed Jul 2 18:35:40 2025 From: vpaprotski at openjdk.org (Volodymyr Paprotski) Date: Wed, 2 Jul 2025 18:35:40 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 17:44:34 GMT, Jatin Bhateja wrote: > @vamsi-parasa, It's better to make this as a subclass of MacroAssembler in src/hotspot/cpu/x86/macroAssembler_x86.hpp and pass Tracker as an argument to push / pop for a cleaner interface. I don't think its possible? Unless I am missing something.. - Subclass has an instance of the base class (i.e. the memory allocation of `PushPopTracker` would have the `MacroAssembler` base class with extra fields appended); and `MacroAssembler` has already been allocated (i.e. you can't tack on more fields onto the end of the underlying memory of existing `MacroAssembler`..) - If its a subclass, there is no reason to pass it as a parameter, because it already would have the parent's instance? Also, the extra parameter to push/pop (flag) was what I had originally objected to? (i.e. would like for push/pop to still just take one register as a parameter..) - This class is sort of a stripped-down implementation of reference counting; we want the stack-allocated variable (i.e. explicit constructor call) and the implicit destructor calls (i.e. inserted by g++ on all function exits). That is, we must have a stack allocated variable for it to be deallocated (and destructor called for assert check) Here is an attempt to make it a subclass? And sample usage... class PushPopTracker : public MacroAssembler { private: int _counter; const int REGS = 32; // Increase as needed int regs[REGS]; public: // MacroAssembler(CodeBuffer* code) is the only constructor? PushPopTracker() : _counter(0), MacroAssembler(???) {} //FIXME??? ~PushPopTracker() { assert(_counter == 0, "Push/pop pair mismatch"); } void push(Register reg) { assert(_counter References: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> Message-ID: On Wed, 2 Jul 2025 10:54:13 GMT, Martin Doerr wrote: > This is a backout of [JDK-8258229](https://bugs.openjdk.org/browse/JDK-8258229) for JDK25 only. The problematic code has already been removed by [JDK-8358821](https://bugs.openjdk.org/browse/JDK-8358821) in JDK26. > > The backout is clean for the C++ code, but the test backout includes the backout of the follow-up change [JDK-8356310](https://bugs.openjdk.org/browse/JDK-8356310). 
> > Rationale: Minimize risk for JDK25. We should use the better fix JDK-8358821 in the long term. However, that one should get some more stabilization time before backporting it. Also see JBS issue. > > Proposed long term solution: Backport JDK-8358821 to jdk25u and revert this change again after an appropriate time. > > Short term: The issue solved by JDK-8258229 is not critical. It should be ok to postpone the fix to jdk25u. LGTM. I'm running Oracle testing now. I'm not sure how to handle JDK-8357017 now in JBS. Close it as a duplicate of the backout? According to https://openjdk.org/guide, it sounds like it might have been more correct to use JDK-8357017 for the backout, and make it a subtask of JDK-8258229. @TobiHartmann @JesperIRL , what do you think? ------------- Marked as reviewed by dlong (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26091#pullrequestreview-2980343622 From mdoerr at openjdk.org Wed Jul 2 19:29:39 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 2 Jul 2025 19:29:39 GMT Subject: [jdk25] RFR: 8361259: JDK25: Backout JDK-8258229 In-Reply-To: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> References: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> Message-ID: <0t_ct-w4lOpvbe4c8DJD9jgU-VRgbMRSVG_ibd8lpkU=.4ebc56ab-9d1c-4531-98fc-4bca442434b9@github.com> On Wed, 2 Jul 2025 10:54:13 GMT, Martin Doerr wrote: > This is a backout of [JDK-8258229](https://bugs.openjdk.org/browse/JDK-8258229) for JDK25 only. The problematic code has already been removed by [JDK-8358821](https://bugs.openjdk.org/browse/JDK-8358821) in JDK26. > > The backout is clean for the C++ code, but the test backout includes the backout of the follow-up change [JDK-8356310](https://bugs.openjdk.org/browse/JDK-8356310). > > Rationale: Minimize risk for JDK25. We should use the better fix JDK-8358821 in the long term. However, that one should get some more stabilization time before backporting it. Also see JBS issue. > > Proposed long term solution: Backport JDK-8358821 to jdk25u and revert this change again after an appropriate time. > > Short term: The issue solved by JDK-8258229 is not critical. It should be ok to postpone the fix to jdk25u. Thanks for the review! [JDK-8357017](https://bugs.openjdk.org/browse/JDK-8357017) will be fixed by [JDK-8361259](https://bugs.openjdk.org/browse/JDK-8361259) in JDK25 and it is fixed by [JDK-8358821](https://bugs.openjdk.org/browse/JDK-8358821) in JDK26. So, my plan is to close JDK-8357017 as fixed referring to the other 2 issues. Does that make sense? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26091#issuecomment-3029079372 From dlong at openjdk.org Wed Jul 2 20:13:41 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 2 Jul 2025 20:13:41 GMT Subject: [jdk25] RFR: 8361259: JDK25: Backout JDK-8258229 In-Reply-To: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> References: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> Message-ID: <9UlIKIQGn3vum3P71THXlyJwJ1efJmJNlImCpYErex8=.e794eff7-516d-4c5f-8e02-f15e5b34cba6@github.com> On Wed, 2 Jul 2025 10:54:13 GMT, Martin Doerr wrote: > This is a backout of [JDK-8258229](https://bugs.openjdk.org/browse/JDK-8258229) for JDK25 only. The problematic code has already been removed by [JDK-8358821](https://bugs.openjdk.org/browse/JDK-8358821) in JDK26. 
> > The backout is clean for the C++ code, but the test backout includes the backout of the follow-up change [JDK-8356310](https://bugs.openjdk.org/browse/JDK-8356310). > > Rationale: Minimize risk for JDK25. We should use the better fix JDK-8358821 in the long term. However, that one should get some more stabilization time before backporting it. Also see JBS issue. > > Proposed long term solution: Backport JDK-8358821 to jdk25u and revert this change again after an appropriate time. > > Short term: The issue solved by JDK-8258229 is not critical. It should be ok to postpone the fix to jdk25u. Makes sense, but according to the Developers' Guide, we can't do that because "A Bug or Enhancement with resolution Fixed is required to have a corresponding changeset in one of the OpenJDK repositories." ------------- PR Comment: https://git.openjdk.org/jdk/pull/26091#issuecomment-3029184612 From duke at openjdk.org Wed Jul 2 20:50:51 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Wed, 2 Jul 2025 20:50:51 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: <73AnlXOv0T8K25DgsNdH1PkBjcBXz0f3bBYZx44LpAw=.439f5383-ffd1-44e8-9e11-4b5af9b6a278@github.com> References: <73AnlXOv0T8K25DgsNdH1PkBjcBXz0f3bBYZx44LpAw=.439f5383-ffd1-44e8-9e11-4b5af9b6a278@github.com> Message-ID: <3f1UnDuYp2iYVcciKF-BqdChOOY2PJJG5R0QuyfblVM=.37a92dfd-5de4-4924-83c5-f9c2e5d7548c@github.com> On Tue, 1 Jul 2025 11:24:09 GMT, Evgeny Astigeevich wrote: >> Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 90 commits: >> >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Update how call sites are fixed >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Fix pointer printing >> - Use set_destination_mt_safe >> - Print address as pointer >> - Use new _metadata_size instead of _jvmci_data_size >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Only check branch distance for aarch64 and riscv >> - Move far branch fix to fix_relocation_after_move >> - ... and 80 more: https://git.openjdk.org/jdk/compare/f799cf18...70e4164e > > src/hotspot/share/code/nmethod.cpp line 1653: > >> 1651: } >> 1652: } >> 1653: } > > Do we need this code? Shouldn't missing trampolined be caught during fixing call sites? If fixing call sites fails (like in the event of a missing trampoline) an assert will fail and the JVM will crash. I suppose it could be updated to abandon the relocation if that happens but that would require `fix_relocation_after_move` to return if it succeeded and proper handling by the caller. > test/hotspot/jtreg/vmTestbase/nsk/jvmti/NMethodRelocation/nmethodrelocation.java line 37: > >> 35: import jdk.test.whitebox.code.BlobType; >> 36: >> 37: public class nmethodrelocation extends DebugeeClass { > > Why is the class name not following the Java code conventions? I was following the naming conventions of other JVMTI tests. 
https://github.com/openjdk/jdk/tree/master/test/hotspot/jtreg/vmTestbase/nsk/jvmti ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2180937766 PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2180943465 From duke at openjdk.org Wed Jul 2 20:50:52 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Wed, 2 Jul 2025 20:50:52 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: <3f1UnDuYp2iYVcciKF-BqdChOOY2PJJG5R0QuyfblVM=.37a92dfd-5de4-4924-83c5-f9c2e5d7548c@github.com> References: <73AnlXOv0T8K25DgsNdH1PkBjcBXz0f3bBYZx44LpAw=.439f5383-ffd1-44e8-9e11-4b5af9b6a278@github.com> <3f1UnDuYp2iYVcciKF-BqdChOOY2PJJG5R0QuyfblVM=.37a92dfd-5de4-4924-83c5-f9c2e5d7548c@github.com> Message-ID: On Wed, 2 Jul 2025 20:43:35 GMT, Chad Rakoczy wrote: >> src/hotspot/share/code/nmethod.cpp line 1653: >> >>> 1651: } >>> 1652: } >>> 1653: } >> >> Do we need this code? Shouldn't missing trampolined be caught during fixing call sites? > > If fixing call sites fails (like in the event of a missing trampoline) an assert will fail and the JVM will crash. I suppose it could be updated to abandon the relocation if that happens but that would require `fix_relocation_after_move` to return if it succeeded and proper handling by the caller. This is only an issue because Hotspot reduces the branch range for debug builds on aarch64 and Graal doesn't. If we're going to handle this case I think we should fail fast but it does raise the question of what should actually be done in this situation ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2180940888 From bulasevich at openjdk.org Wed Jul 2 21:18:46 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Wed, 2 Jul 2025 21:18:46 GMT Subject: Integrated: 8358183: [JVMCI] crash accessing nmethod::jvmci_name in CodeCache::aggregate In-Reply-To: References: Message-ID: On Tue, 3 Jun 2025 06:39:18 GMT, Boris Ulasevich wrote: > This change addresses an intermittent crash in CompileBroker::print_heapinfo() when accessing JVMCI metadata after a CodeBlob::purge(). > > The issue is a regression after: > - JDK-8343789: JVMCI metadata was moved from nmethod into a separate blob. > - JDK-8352112: CodeBlob::purge() was updated to set _mutable_data to blob_end(). > > The change zeroes out _mutable_data_size, _relocation_size, and _metadata_size in purge() so that after purge jvmci_data_size() returns 0 and CompileBroker::print_heapinfo() won?t touch an invalid _metadata. This pull request has now been integrated. 
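The pattern described in that change can be reduced to a small standalone sketch (class and field names below are illustrative only, not the actual HotSpot code): once purge() clears the size fields, any reader that checks the size before following the pointer can no longer touch freed metadata.

```java
// Hypothetical sketch of the "zero the sizes on purge" idea from the description above.
class BlobSketch {
    private byte[] metadata = new byte[16];
    private int metadataSize = metadata.length;

    void purge() {
        metadata = null;     // backing storage is gone
        metadataSize = 0;    // record that there is nothing left to read
    }

    int metadataSize() { return metadataSize; }

    void printHeapInfo() {
        if (metadataSize() == 0) {
            return;          // skip aggregation instead of dereferencing a stale pointer
        }
        System.out.println("metadata bytes: " + metadata.length);
    }
}
```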
Changeset: 74822ce1 Author: Boris Ulasevich URL: https://git.openjdk.org/jdk/commit/74822ce12acaf9816aa49b75ab5817ced3710776 Stats: 3 lines in 2 files changed: 3 ins; 0 del; 0 mod 8358183: [JVMCI] crash accessing nmethod::jvmci_name in CodeCache::aggregate Reviewed-by: eastigeevich, phh ------------- PR: https://git.openjdk.org/jdk/pull/25608 From mdoerr at openjdk.org Wed Jul 2 21:43:39 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 2 Jul 2025 21:43:39 GMT Subject: [jdk25] RFR: 8361259: JDK25: Backout JDK-8258229 In-Reply-To: <9UlIKIQGn3vum3P71THXlyJwJ1efJmJNlImCpYErex8=.e794eff7-516d-4c5f-8e02-f15e5b34cba6@github.com> References: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> <9UlIKIQGn3vum3P71THXlyJwJ1efJmJNlImCpYErex8=.e794eff7-516d-4c5f-8e02-f15e5b34cba6@github.com> Message-ID: On Wed, 2 Jul 2025 20:11:16 GMT, Dean Long wrote: > Makes sense, but according to the Developers' Guide, we can't do that because "A Bug or Enhancement with resolution Fixed is required to have a corresponding changeset in one of the OpenJDK repositories." https://github.com/openjdk/jdk/commit/cf75f1f9c6d2bc70c7133cb81c73a0ce0946dff9 is a corresponding changset. We can link it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26091#issuecomment-3029400997 From duke at openjdk.org Wed Jul 2 22:11:41 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Wed, 2 Jul 2025 22:11:41 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v33] In-Reply-To: References: Message-ID: > This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). > > When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. > > This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality > > Additional Testing: > - [ ] Linux x64 fastdebug all > - [ ] Linux aarch64 fastdebug all > - [ ] ... Chad Rakoczy has updated the pull request incrementally with two additional commits since the last revision: - Enclose ImmutableDataReferencesCounterSize in parentheses - Let trampolines fix their owners ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23573/files - new: https://git.openjdk.org/jdk/pull/23573/files/70e4164e..c3245fb7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=32 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=31-32 Stats: 62 lines in 13 files changed: 11 ins; 19 del; 32 mod Patch: https://git.openjdk.org/jdk/pull/23573.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23573/head:pull/23573 PR: https://git.openjdk.org/jdk/pull/23573 From duke at openjdk.org Wed Jul 2 22:24:07 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Wed, 2 Jul 2025 22:24:07 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v34] In-Reply-To: References: Message-ID: > This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). 
It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). > > When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. > > This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality > > Additional Testing: > - [ ] Linux x64 fastdebug all > - [ ] Linux aarch64 fastdebug all > - [ ] ... Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: Update justification for skipping CallRelocation ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23573/files - new: https://git.openjdk.org/jdk/pull/23573/files/c3245fb7..0f4ff964 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=33 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=32-33 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23573.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23573/head:pull/23573 PR: https://git.openjdk.org/jdk/pull/23573 From duke at openjdk.org Wed Jul 2 22:24:09 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Wed, 2 Jul 2025 22:24:09 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 08:37:32 GMT, Andrew Haley wrote: >> Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 90 commits: >> >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Update how call sites are fixed >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Fix pointer printing >> - Use set_destination_mt_safe >> - Print address as pointer >> - Use new _metadata_size instead of _jvmci_data_size >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Only check branch distance for aarch64 and riscv >> - Move far branch fix to fix_relocation_after_move >> - ... and 80 more: https://git.openjdk.org/jdk/compare/f799cf18...70e4164e > > src/hotspot/cpu/aarch64/relocInfo_aarch64.cpp line 84: > >> 82: if (NativeCall::is_call_at(addr())) { >> 83: NativeCall* call = nativeCall_at(addr()); >> 84: if (be_safe) { > > Why is this change necessary? The original motivation was to address far call sites. After relocation, some calls that previously didn't require a trampoline might now need one, hence the introduction of the `be_safe` parameter. However, upon further review, this change is unnecessary. The method `trampoline_stub_Relocation::fix_relocation_after_move` already updates the owner and contains the logic to determine whether a direct call can be performed. Therefore, we can skip invoking `CallRelocation::fix_relocation_after_move` for calls that use trampolines, as all required adjustments will be handled correctly by the trampoline relocations. 
([Reference](https://github.com/chadrako/jdk/blob/0f4ff9646d1f7f43214c5ccd4bbe572fffd08d16/src/hotspot/share/code/nmethod.cpp#L1547-L1556)) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2181076900 From sviswanathan at openjdk.org Wed Jul 2 23:05:42 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 2 Jul 2025 23:05:42 GMT Subject: RFR: 8361037: [ubsan] compiler/c2/irTests/TestFloat16ScalarOperations division by 0 [v3] In-Reply-To: <6cWhCvx8g-Gx4VoBHW1wA7atsa_Eq5wBhkDolUbP_X0=.31f8e688-7401-4f81-9b50-46b1997e96b5@github.com> References: <6cWhCvx8g-Gx4VoBHW1wA7atsa_Eq5wBhkDolUbP_X0=.31f8e688-7401-4f81-9b50-46b1997e96b5@github.com> Message-ID: On Tue, 1 Jul 2025 13:36:20 GMT, Jatin Bhateja wrote: >> Floating point division by zero is undefined per the C and C++ standards, but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing either an infinity or NaN value. >> >> While Java semantics defined in section 15.17.2 "Division Operator" of JLS-24 are well-defined for these constant-folding scenarios >> >> This bug fix patch fixes division by 0 error reported after integration of [JDK-8352635.](https://bugs.openjdk.org/browse/JDK-8352635) >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Adding comments Looks good to me. It will be good to get second review. ------------- Marked as reviewed by sviswanathan (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26062#pullrequestreview-2980870863 From sparasa at openjdk.org Wed Jul 2 23:32:41 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Wed, 2 Jul 2025 23:32:41 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 17:44:34 GMT, Jatin Bhateja wrote: >>> For a cleaner interface, I think we can also maintain a RAII style APXPushPopPairTracker ... >> >> Using the suggested code as a base, Vamsi and I tinkered with the idea some more! Here is what we came up with. This also tracks the correct order of registers being pushed/poped.. (haven't compiled it, so might have some syntax bugs). >> >> @dholmes-ora would you mind sharing your opinion? We seem to be making things more complicated, but hopefully in a good way? >> >> Also included a sample usage in a stub. 
>> >> >> #define __ _masm-> >> >> class PushPopTracker { >> private: >> int _counter; >> MacroAssembler *_masm; >> const int REGS = 32; // Increase as needed >> int regs[REGS]; >> public: >> PushPopTracker(MacroAssembler *_masm) : _counter(0), _masm(_masm) {} >> ~PushPopTracker() { >> assert(_counter == 0, "Push/pop pair mismatch"); >> } >> >> void push(Register reg) { >> assert(_counter < REGS, "Push/pop overflow"); >> regs[_counter++] = reg.encoding(); >> if (VM_Version::supports_apx_f()) { >> __ pushp(reg); >> } else { >> __ push(reg); >> } >> } >> void pop(Register reg) { >> assert(_counter>0, "Push/pop underflow"); >> assert(regs[_counter] == reg.encoding(), "Push/pop pair mismatch: %d != %d", regs[_counter], reg.encoding()); >> _counter--; >> if (VM_Version::supports_apx_f()) { >> __ popp(reg); >> } else { >> __ pop(reg); >> } >> } >> } >> >> address StubGenerator::generate_intpoly_montgomeryMult_P256() { >> __ align(CodeEntryAlignment); >> /*...*/ >> address start = __ pc(); >> __ enter(); >> PushPopTracker s(_masm); >> s.push(r12); //1 >> s.push(r13); //2 >> s.push(r14); //3 >> #ifdef _WIN64 >> s.push(rsi); //4 >> s.push(rdi); //5 >> #endif >> s.push(rbp); //6 >> __ movq(rbp, rsp); >> __ andq(rsp, -32); >> __ subptr(rsp, 32); >> // Register Map >> const Register aLimbs = c_rarg0; // c_rarg0: rdi | rcx >> const Register bLimbs = rsi; // c_rarg1: rsi | rdx >> const Register rLimbs = r8; // c_rarg2: rdx | r8 >> const Register tmp1 = r9; >> const Register tmp2 = r10; >> /*...*/ >> __ movq(rsp, rbp); >> s.pop(rbp); //5 >> #ifdef _WIN64 >> s.pop(rdi); //4 >> s.pop(rsi); //3 >> #endif >> s.pop(r14); //2 >> s.pop(r13); //1 >> s.pop(r12); //0 >> __ leave(); >> __ ret(0); >> return start; >> } > > @vamsi-parasa, It's better to make this as a subclass of MacroAssembler in src/hotspot/cpu/x86/macroAssembler_x86.hpp and pass Tracker as an argument to push / pop for a cleaner interface. Hi Jatin (@jatin-bhateja) and Vlad (@vpaprotsk), There's one more issue to be considered. The C++ PushPopTracker code will be run during the stub generation time. There are code blocks which do a single push onto the stack but due to multiple exit paths, there will be multiple pops as illustrated below. Will this reference counting approach not fail in such a scenario as the stub code is generated all at once during the stub generation phase? #begin stack frame push(r21) #exit condition 1 pop(r21) # exit condition 2 pop(r21) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2181146890 From dlong at openjdk.org Wed Jul 2 23:53:39 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 2 Jul 2025 23:53:39 GMT Subject: [jdk25] RFR: 8361259: JDK25: Backout JDK-8258229 In-Reply-To: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> References: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> Message-ID: <1LxESKwrZ2cxtTlNTIKruyyebF-hportTvFYoYc4htY=.207724e0-418f-4289-8190-2545c74fc191@github.com> On Wed, 2 Jul 2025 10:54:13 GMT, Martin Doerr wrote: > This is a backout of [JDK-8258229](https://bugs.openjdk.org/browse/JDK-8258229) for JDK25 only. The problematic code has already been removed by [JDK-8358821](https://bugs.openjdk.org/browse/JDK-8358821) in JDK26. > > The backout is clean for the C++ code, but the test backout includes the backout of the follow-up change [JDK-8356310](https://bugs.openjdk.org/browse/JDK-8356310). > > Rationale: Minimize risk for JDK25. We should use the better fix JDK-8358821 in the long term.
However, that one should get some more stabilization time before backporting it. Also see JBS issue. > > Proposed long term solution: Backport JDK-8358821 to jdk25u and revert this change again after an appropriate time. > > Short term: The issue solved by JDK-8258229 is not critical. It should be ok to postpone the fix to jdk25u. Testing results look good. There was one timeout in a jshell test, but it seems unrelated. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26091#issuecomment-3029712997 From duke at openjdk.org Thu Jul 3 01:52:52 2025 From: duke at openjdk.org (erifan) Date: Thu, 3 Jul 2025 01:52:52 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v2] In-Reply-To: References: Message-ID: > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is relative smaller than that of `fromLong`. This patch does the conversion for these cases if `l` is a compile time constant. > > And this conversion also enables further optimizations that recognize maskAll patterns, see [1]. > > Some JTReg test cases are added to ensure the optimization is effective. > > I tried many different ways to write a JMH benchmark, but failed. Since the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific compile-time constant, the statement will be hoisted out of the loop. If we don't use a loop, the hotspot will become other instructions, and no obvious performance change was observed. However, combined with the optimization of [1], we can observe a performance improvement of about 7% on both aarch64 and x64. > > The patch was tested on both aarch64 and x64, all of tier1 tier2 and tier3 tests passed. > > [1] https://github.com/openjdk/jdk/pull/24674 erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - Address some review comments Add support for the following patterns: toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) toLong(maskAll(false)) => 0 And add more test cases. - Merge branch 'master' into JDK-8356760 - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is relative smaller than that of `fromLong`. This patch does the conversion for these cases if `l` is a compile time constant. And this conversion also enables further optimizations that recognize maskAll patterns, see [1]. Some JTReg test cases are added to ensure the optimization is effective. I tried many different ways to write a JMH benchmark, but failed. Since the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific compile-time constant, the statement will be hoisted out of the loop. If we don't use a loop, the hotspot will become other instructions, and no obvious performance change was observed. However, combined with the optimization of [1], we can observe a performance improvement of about 7% on both aarch64 and x64. The patch was tested on both aarch64 and x64, all of tier1 tier2 and tier3 tests passed. 
[1] https://github.com/openjdk/jdk/pull/24674 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25793/files - new: https://git.openjdk.org/jdk/pull/25793/files/38664b06..791e0ab7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=00-01 Stats: 24487 lines in 940 files changed: 11237 ins; 8323 del; 4927 mod Patch: https://git.openjdk.org/jdk/pull/25793.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25793/head:pull/25793 PR: https://git.openjdk.org/jdk/pull/25793 From duke at openjdk.org Thu Jul 3 02:00:49 2025 From: duke at openjdk.org (erifan) Date: Thu, 3 Jul 2025 02:00:49 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v2] In-Reply-To: References: Message-ID: <9NNhM-s8jWMJnb_DcTeEzeVBxpIYODi611mDQ-so7DQ=.a238b776-fb3b-43fe-b4ac-782d41c8d9aa@github.com> On Thu, 3 Jul 2025 01:52:52 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is relative smaller than that of `fromLong`. This patch does the conversion for these cases if `l` is a compile time constant. >> >> And this conversion also enables further optimizations that recognize maskAll patterns, see [1]. >> >> Some JTReg test cases are added to ensure the optimization is effective. >> >> I tried many different ways to write a JMH benchmark, but failed. Since the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific compile-time constant, the statement will be hoisted out of the loop. If we don't use a loop, the hotspot will become other instructions, and no obvious performance change was observed. However, combined with the optimization of [1], we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> The patch was tested on both aarch64 and x64, all of tier1 tier2 and tier3 tests passed. >> >> [1] https://github.com/openjdk/jdk/pull/24674 > > erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Address some review comments > > Add support for the following patterns: > toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) > toLong(maskAll(false)) => 0 > > And add more test cases. > - Merge branch 'master' into JDK-8356760 > - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases > > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would > set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent > to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is > relative smaller than that of `fromLong`. This patch does the conversion > for these cases if `l` is a compile time constant. > > And this conversion also enables further optimizations that recognize > maskAll patterns, see [1]. > > Some JTReg test cases are added to ensure the optimization is effective. > > I tried many different ways to write a JMH benchmark, but failed. Since > the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific > compile-time constant, the statement will be hoisted out of the loop. > If we don't use a loop, the hotspot will become other instructions, and > no obvious performance change was observed. 
However, combined with the > optimization of [1], we can observe a performance improvement of about > 7% on both aarch64 and x64. > > The patch was tested on both aarch64 and x64, all of tier1 tier2 and > tier3 tests passed. > > [1] https://github.com/openjdk/jdk/pull/24674 Thanks for your review! Would you mind taking another look, thanks! ------------- PR Review: https://git.openjdk.org/jdk/pull/25793#pullrequestreview-2981231350 From duke at openjdk.org Thu Jul 3 02:00:50 2025 From: duke at openjdk.org (erifan) Date: Thu, 3 Jul 2025 02:00:50 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v2] In-Reply-To: <34p1DHverqucTroSmERaeSx94Knl2FMfVWxedlij0JA=.a4ab7090-8a1c-421a-bc4b-7e1c17f03246@github.com> References: <34p1DHverqucTroSmERaeSx94Knl2FMfVWxedlij0JA=.a4ab7090-8a1c-421a-bc4b-7e1c17f03246@github.com> Message-ID: On Fri, 27 Jun 2025 06:04:54 GMT, Xiaohong Gong wrote: >> src/hotspot/share/opto/vectorIntrinsics.cpp line 80: >> >>> 78: return false; >>> 79: } >>> 80: long mask = (0xFFFFFFFFFFFFFFFFULL >> (64 - vlen)); >> >> The higher bits of long input should be cleared. So we should generate an unsigned right shift instead of the signed one? > > I noticed that you used `ULL` suffix. So it should be fine. Please ignore above comment. Thanks! Yeah, thanks~ ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2181359573 From duke at openjdk.org Thu Jul 3 02:00:51 2025 From: duke at openjdk.org (erifan) Date: Thu, 3 Jul 2025 02:00:51 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v2] In-Reply-To: <1mmwSiX2OCyFw8bKOj6U1yabINpsZiNblYbvAF8l6dM=.00a75235-c87a-4f04-b863-1f6dc046e4e4@github.com> References: <1mmwSiX2OCyFw8bKOj6U1yabINpsZiNblYbvAF8l6dM=.00a75235-c87a-4f04-b863-1f6dc046e4e4@github.com> Message-ID: On Thu, 26 Jun 2025 07:49:28 GMT, Xiaohong Gong wrote: >> erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - Address some review comments >> >> Add support for the following patterns: >> toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) >> toLong(maskAll(false)) => 0 >> >> And add more test cases. >> - Merge branch 'master' into JDK-8356760 >> - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases >> >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would >> set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent >> to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is >> relative smaller than that of `fromLong`. This patch does the conversion >> for these cases if `l` is a compile time constant. >> >> And this conversion also enables further optimizations that recognize >> maskAll patterns, see [1]. >> >> Some JTReg test cases are added to ensure the optimization is effective. >> >> I tried many different ways to write a JMH benchmark, but failed. Since >> the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific >> compile-time constant, the statement will be hoisted out of the loop. >> If we don't use a loop, the hotspot will become other instructions, and >> no obvious performance change was observed. However, combined with the >> optimization of [1], we can observe a performance improvement of about >> 7% on both aarch64 and x64. 
>> >> The patch was tested on both aarch64 and x64, all of tier1 tier2 and >> tier3 tests passed. >> >> [1] https://github.com/openjdk/jdk/pull/24674 > src/hotspot/share/opto/vectorIntrinsics.cpp line 706: > >> 704: opc = Op_Replicate; >> 705: elem_bt = converted_elem_bt; >> 706: bits = gvn().longcon(bits_type->get_con() == 0L ? 0L : -1L); > > Code style. Suggest: > > if (opc == Op_VectorLongToMask && > is_maskall_type(bits_type, num_elem) && > arch_supports_vector(Op_Replicate, num_elem, converted_elem_bt, checkFlags, true /*has_scalar_args*/)) { > opc = Op_Replicate; > elem_bt = converted_elem_bt; > bits = gvn().longcon(bits_type->get_con() == 0L ? 0L : -1L); > } else if ( Done > So if bits = 0xf0, and the vlen = 4, what is the expected mask? This is not possible because the input value has been processed in `VectorMask::fromLong`. See https://github.com/openjdk/jdk/blob/74822ce12acaf9816aa49b75ab5817ced3710776/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorMask.java#L242 But for safety, double checked the lowest bit. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2181360080 PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2181377150 From duke at openjdk.org Thu Jul 3 02:05:21 2025 From: duke at openjdk.org (hanguanqiang) Date: Thu, 3 Jul 2025 02:05:21 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode Message-ID: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode Problem: When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler's do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist. This causes an assertion failure and JVM crash. Root Cause: Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally, but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. Fix: Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. ------------- Commit messages: - 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode Changes: https://git.openjdk.org/jdk/pull/26108/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26108&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8358568 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26108.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26108/head:pull/26108 PR: https://git.openjdk.org/jdk/pull/26108 From dlong at openjdk.org Thu Jul 3 02:17:38 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 3 Jul 2025 02:17:38 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode In-Reply-To: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> Message-ID: On Thu, 3 Jul 2025 01:59:55 GMT, hanguanqiang wrote: > This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode > > Problem:
> When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. > > Root Cause? > Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. > > Fix > Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. I don't see the point of trying to support this flag. Can we just get rid of it? I don't think it is ever tested, because testing would surely crash unless the JVM ran as single-threaded somehow, which it doesn't. Maybe at some point this flag was useful for getting a new port limping along, but I think stubbing sync code would work just as well. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26108#issuecomment-3030292214 From haosun at openjdk.org Thu Jul 3 02:19:45 2025 From: haosun at openjdk.org (Hao Sun) Date: Thu, 3 Jul 2025 02:19:45 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v10] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 08:26:00 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. 
> > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments Overall, looks good to me except several nits. src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 5159: > 5157: // consecutive. The match rules for SelectFromTwoVector reserve two consecutive vector registers > 5158: // for src1 and src2. > 5159: // Four combinations of vector registers each for vselect_from_two_vectors_HS_Neon and I suppose the function names are changed now. Should use `select_from_two_vectors_Neon` and `select_from_two_vectors_SVE` instead. src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 5199: > 5197: __ select_from_two_vectors_SVE($dst$$FloatRegister, $src1$$FloatRegister, > 5198: $src2$$FloatRegister, $index$$FloatRegister, > 5199: $tmp$$FloatRegister, bt, length_in_bytes); nit: Inside `select_from_two_vectors_SVE()`, `bt` is only used to compute `elemType_to_regVariant(bt)`. I suggest using `get_reg_variant(this)` here directly. src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2886: > 2884: bool is_byte = (bt == T_BYTE); > 2885: > 2886: if (is_byte) { Suggestion: if (bt == T_BYTE) { src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2901: > 2899: } > 2900: } else { > 2901: int elemSize = (bt == T_SHORT) ? 2 : 4; nit: use `elem_size` src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2902: > 2900: } else { > 2901: int elemSize = (bt == T_SHORT) ? 2 : 4; > 2902: uint64_t tblOffset = (bt == T_SHORT) ? 0x0100u : 0x03020100u; nit: use `tbl_offset` src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.hpp line 197: > 195: > 196: // Select from a table of two vectors > 197: void select_from_two_vectors_Neon(FloatRegister dst, FloatRegister src1, FloatRegister src2, As for the function name, I suggest using `select_from_two_vectors_(neon|sve)`. E.g., `vector_signum_(neon|sve)` or `vector_round_(neon|sve)` as defined in this file. ------------- Marked as reviewed by haosun (Committer). PR Review: https://git.openjdk.org/jdk/pull/23570#pullrequestreview-2978225584 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2179445324 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2181370525 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2181383791 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2181384078 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2181384185 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2179465592 From xgong at openjdk.org Thu Jul 3 02:24:47 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 3 Jul 2025 02:24:47 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v2] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 01:52:52 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is relative smaller than that of `fromLong`. This patch does the conversion for these cases if `l` is a compile time constant. >> >> And this conversion also enables further optimizations that recognize maskAll patterns, see [1]. >> >> Some JTReg test cases are added to ensure the optimization is effective. >> >> I tried many different ways to write a JMH benchmark, but failed. 
Since the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific compile-time constant, the statement will be hoisted out of the loop. If we don't use a loop, the hotspot will become other instructions, and no obvious performance change was observed. However, combined with the optimization of [1], we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> The patch was tested on both aarch64 and x64, all of tier1 tier2 and tier3 tests passed. >> >> [1] https://github.com/openjdk/jdk/pull/24674 > > erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Address some review comments > > Add support for the following patterns: > toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) > toLong(maskAll(false)) => 0 > > And add more test cases. > - Merge branch 'master' into JDK-8356760 > - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases > > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would > set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent > to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is > relative smaller than that of `fromLong`. This patch does the conversion > for these cases if `l` is a compile time constant. > > And this conversion also enables further optimizations that recognize > maskAll patterns, see [1]. > > Some JTReg test cases are added to ensure the optimization is effective. > > I tried many different ways to write a JMH benchmark, but failed. Since > the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific > compile-time constant, the statement will be hoisted out of the loop. > If we don't use a loop, the hotspot will become other instructions, and > no obvious performance change was observed. However, combined with the > optimization of [1], we can observe a performance improvement of about > 7% on both aarch64 and x64. > > The patch was tested on both aarch64 and x64, all of tier1 tier2 and > tier3 tests passed. > > [1] https://github.com/openjdk/jdk/pull/24674 Looks much better to me. Thanks for your updating! ------------- Marked as reviewed by xgong (Committer). PR Review: https://git.openjdk.org/jdk/pull/25793#pullrequestreview-2981322138 From dlong at openjdk.org Thu Jul 3 02:35:45 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 3 Jul 2025 02:35:45 GMT Subject: [jdk25] RFR: 8361259: JDK25: Backout JDK-8258229 In-Reply-To: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> References: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> Message-ID: On Wed, 2 Jul 2025 10:54:13 GMT, Martin Doerr wrote: > This is a backout of [JDK-8258229](https://bugs.openjdk.org/browse/JDK-8258229) for JDK25 only. The problematic code has already been removed by [JDK-8358821](https://bugs.openjdk.org/browse/JDK-8358821) in JDK26. > > The backout is clean for the C++ code, but the test backout includes the backout of the follow-up change [JDK-8356310](https://bugs.openjdk.org/browse/JDK-8356310). > > Rationale: Minimize risk for JDK25. We should use the better fix JDK-8358821 in the long term. However, that one should get some more stabilization time before backporting it. Also see JBS issue. 
> > Proposed long term solution: Backport JDK-8358821 to jdk25u and revert this change again after an appropriate time. > > Short term: The issue solved by JDK-8258229 is not critical. It should be ok to postpone the fix to jdk25u. > > Makes sense, but according to the Developers' Guide, we can't do that because "A Bug or Enhancement with resolution Fixed is required to have a corresponding changeset in one of the OpenJDK repositories." > > [cf75f1f](https://github.com/openjdk/jdk/commit/cf75f1f9c6d2bc70c7133cb81c73a0ce0946dff9) is a corresponding changeset. We can link it. So two bugs would reference the same changeset, but the changeset only names 8358821? It might be better to close 8357017 as a duplicate instead of as Fixed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26091#issuecomment-3030333773 From duke at openjdk.org Thu Jul 3 03:16:43 2025 From: duke at openjdk.org (hanguanqiang) Date: Thu, 3 Jul 2025 03:16:43 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 01:59:55 GMT, hanguanqiang wrote: > This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode > > Problem: > When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler's do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist. This causes an assertion failure and JVM crash. > > Root Cause: > Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally, but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. > > Fix > Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. I've investigated some of the earliest versions of the source code, including JDK 6, but was unable to identify the original author of this flag or its intended purpose. In any case, if someone with the authority agrees that this flag is no longer relevant, I'd be glad to take on the task of removing it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26108#issuecomment-3030446171 From jkarthikeyan at openjdk.org Thu Jul 3 03:27:32 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Thu, 3 Jul 2025 03:27:32 GMT Subject: RFR: 8349563: Improve AbsNode::Value() for integer types [v4] In-Reply-To: References: Message-ID: <_sSUlLFhpG8Ton-bIB3u6Nf7YSxb8LQNzngDDLqrwcA=.5c456420-a5bd-406b-8cea-e6d2ac8d74c9@github.com> > Hi all, > This is a small patch that improves the implementation of Value() for `AbsINode` and `AbsLNode` by returning the absolute value of the input range. Most of the logic is trivial except for the special case where `_lo == jint_min/jlong_min` which must return the entire type range when encountered, for which I've added a small proof in the comments. I've also added some unit tests and updated the file to limit IR check platforms with more granularity. > > Thoughts and reviews would be appreciated! Jasmine Karthikeyan has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains six commits: - Code review and constant folding test - Merge - Replace uabs usage with ABS - Merge branch 'master' into abs-value - Merge - Improve AbsNode::Value ------------- Changes: https://git.openjdk.org/jdk/pull/23685/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23685&range=03 Stats: 299 lines in 2 files changed: 284 ins; 4 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/23685.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23685/head:pull/23685 PR: https://git.openjdk.org/jdk/pull/23685 From jkarthikeyan at openjdk.org Thu Jul 3 03:34:44 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Thu, 3 Jul 2025 03:34:44 GMT Subject: RFR: 8349563: Improve AbsNode::Value() for integer types [v3] In-Reply-To: References: Message-ID: On Wed, 28 May 2025 11:56:02 GMT, Emanuel Peter wrote: >> Jasmine Karthikeyan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: >> >> - Replace uabs usage with ABS >> - Merge branch 'master' into abs-value >> - Merge >> - Improve AbsNode::Value > > test/hotspot/jtreg/compiler/c2/irTests/TestIRAbs.java line 333: > >> 331: // [-9, -2] => [2, 9] >> 332: return Math.abs(-((in & 7) + 2)) > 9; >> 333: } > > Could we have some randomized cases here too? Or do we already have them somewhere? I've added support for randomized ranges and if statement folding as suggested in the review comment. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23685#discussion_r2181563680 From jkarthikeyan at openjdk.org Thu Jul 3 03:40:43 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Thu, 3 Jul 2025 03:40:43 GMT Subject: RFR: 8349563: Improve AbsNode::Value() for integer types [v3] In-Reply-To: References: Message-ID: On Wed, 28 May 2025 11:57:51 GMT, Emanuel Peter wrote: >> Jasmine Karthikeyan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: >> >> - Replace uabs usage with ABS >> - Merge branch 'master' into abs-value >> - Merge >> - Improve AbsNode::Value > > @jaskarth Nice work! I have a few comments below. > > One is about more randomized tests. I'm thinking about something like this: > > - compute `res = Math.abs(x)` > - Truncate `x` with randomly produced bounds from Generators, like this: `x = Math.max(lo, Math.min(hi, x))`. > - Below, add all sorts of comparisons with random constants, like this: `if (res < CON) { sum += 1; }`. If the output range is wrong, this could wrongly constant fold, and allow us to catch that. > - Then fuzz the generated method a few times with random inputs for `x`, and check that the sum and res value are the same for compiled and interpreted code. > > I hope that makes sense :) > This is currently my best method to check if ranges are correct, and I think it is quite important because often tests are only written with constants in mind, but less so with ranges, and then we mess up the ranges because it is just too tricky. > > This is an example, where I asked someone to try this out as well: > https://github.com/openjdk/jdk/pull/23089/files#diff-12bebea175a260a6ab62c22a3681ccae0c3d9027900d2fdbd8c5e856ae7d1123R404-R422 @eme64 Thanks for the review and comments! The method of checking for constant folding with if statements and range filtering you mentioned is pretty clever. I've adapted it to the test and added it to the PR. Let me know what you think! 
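For reference, the shape of such a randomized check is roughly the sketch below; the bounds LO/HI, the constant CON and the iteration count are made-up placeholders, not the values used in the actual TestIRAbs code.

```java
import java.util.Random;

// Illustrative sketch of the randomized range check described above; not the real test.
public class AbsRangeFuzz {
    static final int LO = -100, HI = 250, CON = 42;

    static long checkedAbs(int x) {
        x = Math.max(LO, Math.min(HI, x));  // truncate x to a known range
        int res = Math.abs(x);              // abs of a value in [LO, HI]
        long sum = res;
        if (res < CON) {                    // folds incorrectly if the computed range of res is wrong
            sum += 1;
        }
        return sum;
    }

    public static void main(String[] args) {
        Random r = new Random(42);
        long checksum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            checksum += checkedAbs(r.nextInt());
        }
        // A harness would compare this checksum between interpreted and compiled runs;
        // a bad AbsNode::Value() range shows up as a divergence.
        System.out.println(checksum);
    }
}
```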
> src/hotspot/share/opto/subnode.cpp line 1947: > >> 1945: >> 1946: return IntegerType::make(ABS(t->get_con())); >> 1947: } > > We used `uabs` before, what prevents you from doing that now? I guess you would need a templated version, hmm. Could be worth looking into creating one. There was an earlier discussion in the review: https://github.com/openjdk/jdk/pull/23685#discussion_r1972735806 Essentially, the implementation of `uabs` relies on converting ints/longs from signed to unsigned which is implementation defined until C++20. I believe the implementation works as expected on most platforms, but to be cautious I thought it would be better to just handle it manually to avoid any potential problems. We should revisit when we're at C++20 ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23685#issuecomment-3030523657 PR Review Comment: https://git.openjdk.org/jdk/pull/23685#discussion_r2181570566 From dholmes at openjdk.org Thu Jul 3 04:43:39 2025 From: dholmes at openjdk.org (David Holmes) Date: Thu, 3 Jul 2025 04:43:39 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode In-Reply-To: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> Message-ID: On Thu, 3 Jul 2025 01:59:55 GMT, hanguanqiang wrote: > This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode > > Problem? > When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. > > Root Cause? > Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. > > Fix > Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. The patch seems reasonable from a backporting perspective. Though it does beg the question as to why `do_monitor_enter` does not need the same fix. I suspect this is a very old flag and the code has bit-rotted somewhat. A question for the compiler folk: does `GenerateSynchronizationCode` still have any use or should it be scrapped? Thanks ------------- PR Review: https://git.openjdk.org/jdk/pull/26108#pullrequestreview-2981633438 From haosun at openjdk.org Thu Jul 3 04:47:41 2025 From: haosun at openjdk.org (Hao Sun) Date: Thu, 3 Jul 2025 04:47:41 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v6] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 08:48:59 GMT, Mikhail Ablakatov wrote: >> Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. >> >> Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still. >> >> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). 
To the best of my knowledge, openjdk/jdk is missing VectorAPI reducion micro-benchmarks. >> >> Benchmarks results: >> >> Neoverse-V1 (SVE 256-bit) >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms >> >> >> Fujitsu A64FX (SVE 512-bit): >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms > > Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: > > cleanup: update a copyright notice > > Co-authored-by: Hao Sun Hi. This PR involves the change to {Int Mul Reduction, FP Mul Reduction} X { auto-vectorization, VectorAPI}. After the offiline discussion with @XiaohongGong , we have one question about the impact of this PR on **FP Mul Reduction + auto-vectorization**. Here lists the change before and after this PR in whether **FP Mul Reduction + auto-vectorization** is on or off. | | Check | before | after| | :-------- | :-------: | --------: | --------: | | case-1 | UseSVE=0 | off | off | | case-2 | UseSVE>0 and length_in_bytes=8or16 | on | off | | case-3 | UseSVE>0 and length_in_bytes>16 | off | off | ## case-1 and case-2 Background: case-1 was set off after @fg1417 's patch [8275275: AArch64: Fix performance regression after auto-vectorization on NEON](https://github.com/openjdk/jdk/pull/10175). But case-2 was not touched. We are not sure about the reason. There was no 128b SVE machine then? Or there was some limitation of SLP on **reduction**? **Limitation** of SLP as mentioned in @fg1417 's patch > Because superword doesn't vectorize reductions unconnected with other vector packs, Performance data in this PR on case-2: From your provided [test data](https://github.com/openjdk/jdk/pull/23181#issuecomment-3018988067) on `Neoverse V2 (SVE 128-bit). Auto-vectorization section`, there is no obvious performance change on FP Mul Reduction benchmarks `(float|double)Mul(Big|Simple)`. As we checked the generated code of `floatMul(Big|Simple)` on Nvidia Grace machine(128b SVE2), we found that before this PR: - `floatMulBig` is vectorized. - `floatMulSimple` is not vectorized because SLP determines that there is no profit. Discussion: should we enable case-1 and case-2? - if the SLP limitation on reductions is fixed? - If there is no such limitation, we may consider enable case-1 and case-2 because a) there is perf regression at least based on current performance results and b) it may provide more auto-vectorization opportunities for other packs inside the loop. It would be appreciated if @eme64 or @fg1417 could provide more inputs. ## case-3 Status: this PR adds rules `reduce_mulF_gt128b` and `reduce_mulD_gt128b` but these two rules are **not** selected. See the [comment from Xiaohong](https://github.com/openjdk/jdk/pull/23181/files#r2176590314). Our suggestion: we're not sure if it's profitable to enable case-3. 
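For context, the loops in question are strictly-ordered floating-point multiply reductions of roughly the following shape (a minimal sketch, not the actual floatMulBig/floatMulSimple benchmark source):

```java
// Minimal sketch of an FP mul reduction candidate for SLP; names are illustrative.
static float mulReduce(float[] a) {
    float acc = 1.0f;
    for (int i = 0; i < a.length; i++) {
        acc *= a[i];  // FP multiply reduction: rounding makes the order observable,
                      // so the auto-vectorizer has to preserve the sequential order
    }
    return acc;
}
```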
Could you help do more test on `Neoverse V1 (SVE 256-bit)`? Note that local change should be made to enable case-3, e.g. removing [these lines](https://github.com/openjdk/jdk/pull/23181/files#diff-edf6d70f65d81dc12a483088e0610f4e059bd40697f242aedfed5c2da7475f1aR130-R136). Expected result: - If there is performance gain, we may consider enabling case-3 for auto-vectorization. - If there is no performance gain, we suggest removing these two match rules because they are dead code. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-3030705608 From dholmes at openjdk.org Thu Jul 3 04:55:39 2025 From: dholmes at openjdk.org (David Holmes) Date: Thu, 3 Jul 2025 04:55:39 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs In-Reply-To: References: Message-ID: On Thu, 19 Jun 2025 06:39:52 GMT, David Holmes wrote: >> The goal of this PR is to enhance the existing x86 assembly stubs using PUSH and POP instructions with paired PUSHP/POPP instructions which are part of Intel APX technology. >> >> In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0?R31). Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs. >> >> Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures. > > Just a drive-by comment as this isn't code I normally have much to do with but to me it would look a lot cleaner to define `push_paired`/`pop_paired` (maybe abbreviating directly to `pushp`/`popp`?) rather than passing the boolean. > @dholmes-ora would you mind sharing your opinion? We seem to be making things more complicated, but hopefully in a good way? Seems very complicated to me. Really this is for compiler folk to discuss. And as noted above this "tracker" class only helps where the push/pop are paired in the same scope. Personally I think a "pushp" that is defined to be a "push-paired" when available, else a regular "push", would suffice in terms of API design. But again this is for compiler folk to determine. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25889#issuecomment-3030744652 From epeter at openjdk.org Thu Jul 3 05:02:40 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 3 Jul 2025 05:02:40 GMT Subject: RFR: 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal [v3] In-Reply-To: References: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> Message-ID: On Wed, 2 Jul 2025 10:19:59 GMT, Beno?t Maillard wrote: >> This PR adds a node hash comparison after calling `Ideal` in `PhaseIterGVN::verify_Ideal_for` to introduce an additional layer of verification for missed optimizations. 
Previously, we relied on the return value of `Ideal`, which is expected to be `nullptr` if no transformation was done.
>> 
>> By also checking the node's hash before and after `Ideal`, we could catch inconsistencies in the implementation or unintended modifications to the graph. Both of these may indicate missed or incomplete optimizations.
>> 
>> ### Testing
>> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361144)
>> - [x] tier1-3, plus some internal testing
>> 
>> Thank you for reviewing!
>
> Benoît Maillard has updated the pull request incrementally with one additional commit since the last revision:
> 
>   8361144: add comment for consistency with node count

Thanks for adding this @benoitmaillard !
I don't think you need to add a regression test here. What you should do though: run tier1-3 + additional testing, once with the verification enabled and once without. Just to see if there are any cases that currently fail with this verification.

-------------

Marked as reviewed by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/26064#pullrequestreview-2981695058

From epeter at openjdk.org  Thu Jul 3 05:26:46 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Thu, 3 Jul 2025 05:26:46 GMT
Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v3]
In-Reply-To: 
References: 
Message-ID: 

On Mon, 30 Jun 2025 12:35:47 GMT, Mikhail Ablakatov wrote:

>> Mikhail Ablakatov has updated the pull request incrementally with two additional commits since the last revision:
>> 
>>  - fixup: don't modify the value in vsrc
>>    
>>    Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this
>>    change, the result of recursive folding is held in vtmp1. To be able to
>>    pass this intermediate result to reduce_mul_integral_le128b(), we would
>>    have to use another temporary FloatRegister, as vtmp1 would essentially
>>    act as vsrc. It's possible to get around this however:
>>    reduce_mul_integral_le128b() is modified so it's possible to pass
>>    matching vsrc and vtmp2 arguments. By doing this, we save ourselves a
>>    temporary register in rules that match to reduce_mul_integral_gt128b().
>>  - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formatting
>
> This patch improves mul reduction VectorAPIs on SVE targets with 256b or wider vectors. This comment also provides performance numbers for NEON / SVE 128b platforms that aren't expected to benefit from these implementations and for auto-vectorization benchmarks.
> 
> ### Neoverse N1 (NEON)
> 
> > Auto-vectorization > > | Benchmark | Before | After | Units | Diff | > |---------------------------|----------|----------|-------|------| > | mulRedD | 739.699 | 740.884 | ns/op | ~ | > | byteAddBig | 2670.248 | 2670.562 | ns/op | ~ | > | byteAddSimple | 1639.796 | 1639.940 | ns/op | ~ | > | byteMulBig | 2707.900 | 2708.063 | ns/op | ~ | > | byteMulSimple | 2452.939 | 2452.906 | ns/op | ~ | > | charAddBig | 2772.363 | 2772.269 | ns/op | ~ | > | charAddSimple | 1639.867 | 1639.751 | ns/op | ~ | > | charMulBig | 2796.533 | 2796.375 | ns/op | ~ | > | charMulSimple | 2453.034 | 2453.004 | ns/op | ~ | > | doubleAddBig | 2943.613 | 2936.897 | ns/op | ~ | > | doubleAddSimple | 1635.031 | 1634.797 | ns/op | ~ | > | doubleMulBig | 3001.937 | 3003.240 | ns/op | ~ | > | doubleMulSimple | 2448.154 | 2448.117 | ns/op | ~ | > | floatAddBig | 2963.086 | 2962.215 | ns/op | ~ | > | floatAddSimple | 1634.987 | 1634.798 | ns/op | ~ | > | floatMulBig | 3022.442 | 3021.356 | ns/op | ~ | > | floatMulSimple | 2447.976 | 2448.091 | ns/op | ~ | > | intAddBig | 832.346 | 832.382 | ns/op | ~ | > | intAddSimple | 841.276 | 841.287 | ns/op | ~ | > | intMulBig | 1245.155 | 1245.095 | ns/op | ~ | > | intMulSimple | 1638.762 | 1638.826 | ns/op | ~ | > | longAddBig | 4924.541 | 4924.328 | ns/op | ~ | > | longAddSimple | 841.623 | 841.625 | ns/op | ~ | > | longMulBig | 9848.954 | 9848.807 | ns/op | ~ | > | longMulSimple | 3427.169 | 3427.279 | ns/op | ~ | > | shortAddBig | 2670.027 | 2670.345 | ns/op | ~ | > | shortAddSimple | 1639.869 | 1639.876 | ns/op | ~ | > | shortMulBig | 2750.812 | 2750.562 | ns/op | ~ | > | shortMulSimple | 2453.030 | 2452.937 | ns/op | ~ | > >
>
> > VectorAPI > > | Benchmark ... @mikabl-arm @XiaohongGong I'm a little busy these weeks before going on vacation, so I won't have time to look into this more deeply. However, I do plan to remove the auto-vectorization restrictions for simple reductions. https://bugs.openjdk.org/browse/JDK-8307516 You can already now disable the (bad) reduction heuristic, using `AutoVectorizationOverrideProfitability`. https://bugs.openjdk.org/browse/JDK-8357530 I published benchmark results there: https://github.com/openjdk/jdk/pull/25387 You can see that enabling simple reductions is in most cases actually profitable now. But float/double add and mul have strict reduction order, and that usually prevents vectorization from being profitable. The strict-order vector reduction is quite expensive, and it only becomes beneficial if there is a lot of other code in the loop that can be vectorized. Soon, I plan to add a cost-model, so that we can predict if vectorization is profitable. It would also be nice to actually find a benchmark where float add/mul reductions lead to a speedup with vectorization. So far I have not seen any example in my benchmarks: https://github.com/openjdk/jdk/pull/25387 If you find any such example, please let me know ;) I don't have access to any SVE machines, so I cannot help you there, unfortunately. Is this helpful to you? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-3030798159 From epeter at openjdk.org Thu Jul 3 05:30:40 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 3 Jul 2025 05:30:40 GMT Subject: RFR: 8349563: Improve AbsNode::Value() for integer types [v3] In-Reply-To: References: Message-ID: <-iWCtGzKfoilC1bFXj726ZTS8glyDlqRdY76ddUdgb0=.2b303302-ee5b-4982-a72d-a56be53a5101@github.com> On Thu, 3 Jul 2025 03:38:13 GMT, Jasmine Karthikeyan wrote: >> @jaskarth Nice work! I have a few comments below. >> >> One is about more randomized tests. I'm thinking about something like this: >> >> - compute `res = Math.abs(x)` >> - Truncate `x` with randomly produced bounds from Generators, like this: `x = Math.max(lo, Math.min(hi, x))`. >> - Below, add all sorts of comparisons with random constants, like this: `if (res < CON) { sum += 1; }`. If the output range is wrong, this could wrongly constant fold, and allow us to catch that. >> - Then fuzz the generated method a few times with random inputs for `x`, and check that the sum and res value are the same for compiled and interpreted code. >> >> I hope that makes sense :) >> This is currently my best method to check if ranges are correct, and I think it is quite important because often tests are only written with constants in mind, but less so with ranges, and then we mess up the ranges because it is just too tricky. >> >> This is an example, where I asked someone to try this out as well: >> https://github.com/openjdk/jdk/pull/23089/files#diff-12bebea175a260a6ab62c22a3681ccae0c3d9027900d2fdbd8c5e856ae7d1123R404-R422 > > @eme64 Thanks for the review and comments! The method of checking for constant folding with if statements and range filtering you mentioned is pretty clever. I've adapted it to the test and added it to the PR. Let me know what you think! @jaskarth Nice, thanks for adding the range tests! Unfortunately, I'm quite busy before going on vacation. I hope someone else can review this. Otherwise I can come back to it in August. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23685#issuecomment-3030813369 From xgong at openjdk.org Thu Jul 3 05:56:41 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 3 Jul 2025 05:56:41 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 05:23:38 GMT, Emanuel Peter wrote: > You can see that enabling simple reductions is in most cases actually profitable now. But float/double add and mul have strict reduction order, and that usually prevents vectorization from being profitable. The strict-order vector reduction is quite expensive, and it only becomes beneficial if there is a lot of other code in the loop that can be vectorized. Soon, I plan to add a cost-model, so that we can predict if vectorization is profitable. > > It would also be nice to actually find a benchmark where float add/mul reductions lead to a speedup with vectorization. So far I have not seen any example in my benchmarks: https://github.com/openjdk/jdk/pull/25387 If you find any such example, please let me know ;) > > I don't have access to any SVE machines, so I cannot help you there, unfortunately. > >Is this helpful to you? Thanks for your input @eme64 ! It's really helpful to me. And it would be the right direction that using the cost model to guide whether vectorizing FP mul reduction is profitable or not. With this, I think the backend check of auto-vectorization for such operations can be removed safely. We can relay on the SLP's analysis. BTW, the current profitability heuristics can provide help on disabling auto-vectorization for the simple cases while enabling the complex ones. This is also helpful to us. I tested the performance of `VectorReduction2` with/without auto-vectorization for FP mul reductions on my SVE 128-bit machine. The performance difference is not very significant for both `floatMulSimple` and `floatMulBig`. But I guess the performance change would be different with auto-vectorization on HWs with larger vector size. As we do not have the SVE machines with larger vector size as well, we may need help from @mikabl-arm ! If the performance of `floatMulBig` is improved with auto-vectorization, I think we can remove the limitation of such reductions for auto-vectorization on AArch64. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-3030931690 From xgong at openjdk.org Thu Jul 3 06:10:28 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 3 Jul 2025 06:10:28 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: References: Message-ID: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> > ### Background > On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. > > For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. 
> > To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. > > ### Impact Analysis > #### 1. Vector types > Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. > > #### 2. Vector API > No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. > > #### 3. Auto-vectorization > Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. > > #### 4. Codegen of vector nodes > NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. > > Details: > - Lanewise vector operations are unaffected as explained above. > - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). > - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, adding an explicit vector size check in `match_rule_s... Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: Refine the comment in ad file ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26057/files - new: https://git.openjdk.org/jdk/pull/26057/files/4e15e588..dfda42a3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26057&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26057&range=01-02 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/26057.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26057/head:pull/26057 PR: https://git.openjdk.org/jdk/pull/26057 From mhaessig at openjdk.org Thu Jul 3 06:13:40 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Thu, 3 Jul 2025 06:13:40 GMT Subject: RFR: 8361037: [ubsan] compiler/c2/irTests/TestFloat16ScalarOperations division by 0 [v3] In-Reply-To: <6cWhCvx8g-Gx4VoBHW1wA7atsa_Eq5wBhkDolUbP_X0=.31f8e688-7401-4f81-9b50-46b1997e96b5@github.com> References: <6cWhCvx8g-Gx4VoBHW1wA7atsa_Eq5wBhkDolUbP_X0=.31f8e688-7401-4f81-9b50-46b1997e96b5@github.com> Message-ID: On Tue, 1 Jul 2025 13:36:20 GMT, Jatin Bhateja wrote: >> Floating point division by zero is undefined per the C and C++ standards, but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing either an infinity or NaN value. >> >> While Java semantics defined in section 15.17.2 "Division Operator" of JLS-24 are well-defined for these constant-folding scenarios >> >> This bug fix patch fixes division by 0 error reported after integration of [JDK-8352635.](https://bugs.openjdk.org/browse/JDK-8352635) >> Kindly review and share your feedback. 
>> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Adding comments Thank you for addressing my comments @jatin-bhateja. Looks good to me. ------------- Marked as reviewed by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/26062#pullrequestreview-2981858755 From dfenacci at openjdk.org Thu Jul 3 06:19:38 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Thu, 3 Jul 2025 06:19:38 GMT Subject: RFR: 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal [v3] In-Reply-To: References: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> Message-ID: On Wed, 2 Jul 2025 10:19:59 GMT, Beno?t Maillard wrote: >> This PR adds a node hash comparison after calling `Ideal` in `PhaseIterGVN::verify_Ideal_for` to introduce an additional layer of verification for missed optimizations. Previously, we relied on the return value of `Ideal`, which is expected to be `nullptr` if no transformation was done. >> >> By also checking the node's hash before and after `Ideal`, we could catch inconsistencies in the implementation or unintended modifications to the graph. Both of these may indicate missed or incomplete optimizations. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361144) >> - [x] tier1-3, plus some internal testing >> >> Thank you for reviewing! > > Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: > > 8361144: add comment for consistency with node count Looks good to me. Thanks @benoitmaillard! ------------- Marked as reviewed by dfenacci (Committer). PR Review: https://git.openjdk.org/jdk/pull/26064#pullrequestreview-2981870146 From duke at openjdk.org Thu Jul 3 07:03:49 2025 From: duke at openjdk.org (erifan) Date: Thu, 3 Jul 2025 07:03:49 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v7] In-Reply-To: <15TW6hiffz65NhHevPefL_6swSC07UD-GwiJ4tPDtFs=.b83081df-8abd-4756-b4e0-1d969678a0d2@github.com> References: <15TW6hiffz65NhHevPefL_6swSC07UD-GwiJ4tPDtFs=.b83081df-8abd-4756-b4e0-1d969678a0d2@github.com> Message-ID: On Thu, 5 Jun 2025 11:05:48 GMT, Emanuel Peter wrote: >>> > FYI: `BoolTest::negate` already does what you want: `mask negate( ) const { return mask(_test^4); }` I think you should use that instead :) >>> >>> Indeed, I hadn't noticed that, thank you. >> >> Oh I think we still cannot use `BoolTest::negate`, because we cannot instantiate a `BoolTest` object with **unsigned** comparison. `BoolTest::negate` is a non-static function. > >> Oh I think we still cannot use `BoolTest::negate`, because we cannot instantiate a `BoolTest` object with **unsigned** comparison. `BoolTest::negate` is a non-static function. > > I see. Ok. Hmm. I still think that the logic should be in `BoolTest`, because that is where the exact implementation of the enum values is. In that context it is easier to see why `^4` does the negation. And imagine we were ever to change the enum values, then it would be harder to find your code and fix it. > > Maybe it could be called `BoolTest::negate_mask(mast btm)` and explain in a comment that both signed and unsigned is supported. 
Hi @eme64 @jatin-bhateja , would you mind taking another look at this PR, thanks~

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24674#issuecomment-3031109432

From duke at openjdk.org  Thu Jul 3 07:10:22 2025
From: duke at openjdk.org (erifan)
Date: Thu, 3 Jul 2025 07:10:22 GMT
Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3]
In-Reply-To: 
References: 
Message-ID: 

> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is relatively lower than that of `fromLong`. This patch does the conversion for these cases if `l` is a compile time constant.
> 
> And this conversion also enables further optimizations that recognize maskAll patterns, see [1].
> 
> Some JTReg test cases are added to ensure the optimization is effective.
> 
> I tried many different ways to write a JMH benchmark, but failed. Since the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific compile-time constant, the statement will be hoisted out of the loop. If we don't use a loop, the hot spot moves to other instructions, and no obvious performance change was observed. However, combined with the optimization of [1], we can observe a performance improvement of about 7% on both aarch64 and x64.
> 
> The patch was tested on both aarch64 and x64, all of tier1 tier2 and tier3 tests passed.
> 
> [1] https://github.com/openjdk/jdk/pull/24674

erifan has updated the pull request incrementally with one additional commit since the last revision:

  Simplify the test code

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/25793/files
  - new: https://git.openjdk.org/jdk/pull/25793/files/791e0ab7..9f07d5c7

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=01-02

Stats: 233 lines in 3 files changed: 40 ins; 180 del; 13 mod
Patch: https://git.openjdk.org/jdk/pull/25793.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/25793/head:pull/25793

PR: https://git.openjdk.org/jdk/pull/25793

From duke at openjdk.org  Thu Jul 3 07:10:23 2025
From: duke at openjdk.org (erifan)
Date: Thu, 3 Jul 2025 07:10:23 GMT
Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v2]
In-Reply-To: 
References: 
Message-ID: 

On Thu, 3 Jul 2025 01:52:52 GMT, erifan wrote:

>> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is relatively lower than that of `fromLong`. This patch does the conversion for these cases if `l` is a compile time constant.
>> 
>> And this conversion also enables further optimizations that recognize maskAll patterns, see [1].
>> 
>> Some JTReg test cases are added to ensure the optimization is effective.
>> 
>> I tried many different ways to write a JMH benchmark, but failed. Since the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific compile-time constant, the statement will be hoisted out of the loop. If we don't use a loop, the hot spot moves to other instructions, and no obvious performance change was observed. However, combined with the optimization of [1], we can observe a performance improvement of about 7% on both aarch64 and x64.
>> >> The patch was tested on both aarch64 and x64, all of tier1 tier2 and tier3 tests passed. >> >> [1] https://github.com/openjdk/jdk/pull/24674 > > erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Address some review comments > > Add support for the following patterns: > toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) > toLong(maskAll(false)) => 0 > > And add more test cases. > - Merge branch 'master' into JDK-8356760 > - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases > > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would > set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent > to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is > relative smaller than that of `fromLong`. This patch does the conversion > for these cases if `l` is a compile time constant. > > And this conversion also enables further optimizations that recognize > maskAll patterns, see [1]. > > Some JTReg test cases are added to ensure the optimization is effective. > > I tried many different ways to write a JMH benchmark, but failed. Since > the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific > compile-time constant, the statement will be hoisted out of the loop. > If we don't use a loop, the hotspot will become other instructions, and > no obvious performance change was observed. However, combined with the > optimization of [1], we can observe a performance improvement of about > 7% on both aarch64 and x64. > > The patch was tested on both aarch64 and x64, all of tier1 tier2 and > tier3 tests passed. > > [1] https://github.com/openjdk/jdk/pull/24674 Hi @eme64 @jatin-bhateja , could you help review this PR? Thanks~ ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3031127486 From duke at openjdk.org Thu Jul 3 07:10:42 2025 From: duke at openjdk.org (duke) Date: Thu, 3 Jul 2025 07:10:42 GMT Subject: RFR: 8359602: Ideal optimizations depending on input type are missed because of missing notification mechanism from CCP [v5] In-Reply-To: References: <-_MqCH6QmE-o_d7c9-aet-Cq-ptZJ6CZV6rodpDNWq0=.173e6f7a-3cfe-4791-8253-36e06d892069@github.com> Message-ID: On Wed, 2 Jul 2025 07:19:30 GMT, Beno?t Maillard wrote: >> This PR prevents some missed ideal optimizations in IGVN by notifying users of type refinements made during CCP, addressing a missed optimization that caused a verification failure with `-XX:VerifyIterativeGVN=1110`. >> >> ### Context >> During the compilation of the input program (obtained from the fuzzer, then simplified and added as a test) by C2, we end up with node `591 ModI` that takes `138 Phi` as its divisor input. An existing `Ideal` optimization is to get rid of the control input of a `ModINode` when we can prove that the divisor is never `0`. >> >> In this specific case, the type of the `PhiNode` gets refined during CCP, but the refinement fails to propagate to its users for the IGVN phase and the ideal optimization for the `ModINode` never happens. This results in a missed optimization and hits an assert in the verification phase of IGVN (when using `-XX:VerifyIterativeGVN=1110`). 
>> >> ![IGV screenshot](https://github.com/user-attachments/assets/5dee1ae6-9146-4115-922d-df33b7ccbd37) >> >> ### Detailed Analysis >> >> In `PhaseCCP::analyze`, we call `Value` for the `PhiNode`, which >> results in a type refinement: the range gets restricted to `int:-13957..-1191`. >> >> ```c++ >> // Pull from worklist; compute new value; push changes out. >> // This loop is the meat of CCP. >> while (worklist.size() != 0) { >> Node* n = fetch_next_node(worklist); >> DEBUG_ONLY(worklist_verify.push(n);) >> if (n->is_SafePoint()) { >> // Make sure safepoints are processed by PhaseCCP::transform even if they are >> // not reachable from the bottom. Otherwise, infinite loops would be removed. >> _root_and_safepoints.push(n); >> } >> const Type* new_type = n->Value(this); >> if (new_type != type(n)) { >> DEBUG_ONLY(verify_type(n, new_type, type(n));) >> dump_type_and_node(n, new_type); >> set_type(n, new_type); >> push_child_nodes_to_worklist(worklist, n); >> } >> if (KillPathsReachableByDeadTypeNode && n->is_Type() && new_type == Type::TOP) { >> // Keep track of Type nodes to kill CFG paths that use Type >> // nodes that become dead. >> _maybe_top_type_nodes.push(n); >> } >> } >> DEBUG_ONLY(verify_analyze(worklist_verify);) >> >> >> At the end of `PhaseCCP::analyze`, we obtain the following types in the side table: >> - `int` for node `591` (`ModINode`) >> - `int:-13957..-1191` for node `138` (`PhiNode`) >> >> If we call `find_node(138)->bottom_type()`, we get: >> - `int` for both nodes >> >> The... > > Beno?t Maillard has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: > > - Fix bad test class name > - 8359602: rename test > - 8359602: remove requires.debug=true and add -XX:+IgnoreUnrecognizedVMOptions flag > - 8359602: add comment > - 8359602: add test summary and comments > - 8359602: tag requires vm.debug == true > - 8359602: Add test from fuzzer > - 8359602: Add users to IGVN worklist when type is refined in CCP @benoitmaillard Your change (at version a66d3fb492541a17e28b3e0fe0f60080c14bdc2c) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26017#issuecomment-3031130840 From duke at openjdk.org Thu Jul 3 07:17:43 2025 From: duke at openjdk.org (duke) Date: Thu, 3 Jul 2025 07:17:43 GMT Subject: RFR: 8357739: [jittester] disable the hashCode method In-Reply-To: References: Message-ID: On Tue, 17 Jun 2025 19:49:34 GMT, Evgeny Nikitin wrote: > JITTester often uses the `hasCode` method (in fact, in almost every generated test). Given that the method can be unstable between runs or in interpreted vs compiled runs, it can create false-positives. > > This PR fixes the issue by adding support for method templates similar to the ones used in CompilerCommands). All of those exclude templates match (and exclude) `String.indexOf(String)`, for example: > > java/lang/::*(Ljava/lang/String;I) > *String::indexOf(*) > java/lang/*::indexOf > > > Additionally, the PR adds support for comments (starting from '#') and empty lines in the excludes file. @lepestock Your change (at version 5c9a71b9c5b6f418a97e6b0557431aafc73addc6) is now ready to be sponsored by a Committer. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/25859#issuecomment-3031148211 From thartmann at openjdk.org Thu Jul 3 07:22:39 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 3 Jul 2025 07:22:39 GMT Subject: RFR: 8360557: CTW: Inline cold methods to reach more code In-Reply-To: <-7cfzVghCWnUCfB1F3dcyG2fvJGnqREUW98qiVJEvQQ=.db06fb1e-e96e-4e00-bac0-098b4e1de54c@github.com> References: <-7cfzVghCWnUCfB1F3dcyG2fvJGnqREUW98qiVJEvQQ=.db06fb1e-e96e-4e00-bac0-098b4e1de54c@github.com> Message-ID: On Wed, 2 Jul 2025 07:16:44 GMT, Tobias Hartmann wrote: > I submitted some testing to make sure that CTW is clean in our CI. I see the following crashes that would need to be fixed before this is integrated: # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/workspace/open/src/hotspot/share/opto/phaseX.cpp:2790), pid=3196445, tid=3196462 # assert(!failure) failed: PhaseCCP not at fixpoint: analysis result may be unsound. # # JRE version: Java(TM) SE Runtime Environment (26.0) (fastdebug build 26-internal-2025-07-02-0711056.tobias.hartmann.jdk4) # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 26-internal-2025-07-02-0711056.tobias.hartmann.jdk4, mixed mode, sharing, compressed oops, compressed class ptrs, g1 gc, linux-amd64) # Problematic frame: # V [libjvm.so+0x180cff8] PhaseCCP::verify_analyze(Unique_Node_List&) [clone .part.0]+0x28 Current CompileTask: C2:13166 2238 b com.ibm.icu.impl.LocaleUtility::fallback (78 bytes) Stack: [0x00007f20eca0c000,0x00007f20ecb0c000], sp=0x00007f20ecb07050, free space=1004k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x180cff8] PhaseCCP::verify_analyze(Unique_Node_List&) [clone .part.0]+0x28 (phaseX.cpp:2790) V [libjvm.so+0x181e8aa] PhaseCCP::analyze()+0x7ca (phaseX.cpp:2790) V [libjvm.so+0xb44c94] Compile::Optimize()+0x964 (compile.cpp:2479) V [libjvm.so+0xb480d3] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1ec3 (compile.cpp:858) V [libjvm.so+0x96d157] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x467 (c2compiler.cpp:141) V [libjvm.so+0xb574f8] CompileBroker::invoke_compiler_on_method(CompileTask*)+0xb58 (compileBroker.cpp:2323) V [libjvm.so+0xb586c8] CompileBroker::compiler_thread_loop()+0x578 (compileBroker.cpp:1967) V [libjvm.so+0x10abd0b] JavaThread::thread_main_inner()+0x13b (javaThread.cpp:773) V [libjvm.so+0x1b11f26] Thread::call_run()+0xb6 (thread.cpp:243) V [libjvm.so+0x178c718] thread_native_entry(Thread*)+0x128 (os_linux.cpp:868) # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/workspace/open/src/hotspot/share/opto/phaseX.cpp:784), pid=2175071, tid=2175089 # assert(no_dead_loop) failed: dead loop detected # # JRE version: Java(TM) SE Runtime Environment (26.0) (fastdebug build 26-internal-2025-07-02-0711056.tobias.hartmann.jdk4) # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 26-internal-2025-07-02-0711056.tobias.hartmann.jdk4, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64) # Problematic frame: # V [libjvm.so+0x180d285] PhaseGVN::dead_loop_check(Node*) [clone .part.0]+0x1d5 Current CompileTask: C2:4914 2051 !b 4 com.sun.beans.introspect.MethodInfo::get (273 bytes) Stack: [0x00007fe603f00000,0x00007fe604000000], sp=0x00007fe603ffaef0, free space=1003k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x180d285] PhaseGVN::dead_loop_check(Node*) [clone 
.part.0]+0x1d5 (phaseX.cpp:784) V [libjvm.so+0x181c309] PhaseIterGVN::transform_old(Node*)+0x529 (phaseX.cpp:767) V [libjvm.so+0x1820505] PhaseIterGVN::optimize()+0xc5 (phaseX.cpp:1054) V [libjvm.so+0xb414ba] Compile::inline_incrementally_cleanup(PhaseIterGVN&)+0x2ca (compile.cpp:2151) V [libjvm.so+0xb41ed6] Compile::inline_incrementally(PhaseIterGVN&)+0x416 (compile.cpp:2201) V [libjvm.so+0xb447ae] Compile::Optimize()+0x47e (compile.cpp:2329) V [libjvm.so+0xb480d3] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1ec3 (compile.cpp:858) V [libjvm.so+0x96d157] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x467 (c2compiler.cpp:141) V [libjvm.so+0xb574f8] CompileBroker::invoke_compiler_on_method(CompileTask*)+0xb58 (compileBroker.cpp:2323) V [libjvm.so+0xb586c8] CompileBroker::compiler_thread_loop()+0x578 (compileBroker.cpp:1967) V [libjvm.so+0x10abd0b] JavaThread::thread_main_inner()+0x13b (javaThread.cpp:773) V [libjvm.so+0x1b11f26] Thread::call_run()+0xb6 (thread.cpp:243) V [libjvm.so+0x178c718] thread_native_entry(Thread*)+0x128 (os_linux.cpp:868) ------------- PR Comment: https://git.openjdk.org/jdk/pull/26068#issuecomment-3031160931 From bmaillard at openjdk.org Thu Jul 3 07:30:48 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Thu, 3 Jul 2025 07:30:48 GMT Subject: Integrated: 8359602: Ideal optimizations depending on input type are missed because of missing notification mechanism from CCP In-Reply-To: <-_MqCH6QmE-o_d7c9-aet-Cq-ptZJ6CZV6rodpDNWq0=.173e6f7a-3cfe-4791-8253-36e06d892069@github.com> References: <-_MqCH6QmE-o_d7c9-aet-Cq-ptZJ6CZV6rodpDNWq0=.173e6f7a-3cfe-4791-8253-36e06d892069@github.com> Message-ID: On Fri, 27 Jun 2025 10:59:57 GMT, Beno?t Maillard wrote: > This PR prevents some missed ideal optimizations in IGVN by notifying users of type refinements made during CCP, addressing a missed optimization that caused a verification failure with `-XX:VerifyIterativeGVN=1110`. > > ### Context > During the compilation of the input program (obtained from the fuzzer, then simplified and added as a test) by C2, we end up with node `591 ModI` that takes `138 Phi` as its divisor input. An existing `Ideal` optimization is to get rid of the control input of a `ModINode` when we can prove that the divisor is never `0`. > > In this specific case, the type of the `PhiNode` gets refined during CCP, but the refinement fails to propagate to its users for the IGVN phase and the ideal optimization for the `ModINode` never happens. This results in a missed optimization and hits an assert in the verification phase of IGVN (when using `-XX:VerifyIterativeGVN=1110`). > > ![IGV screenshot](https://github.com/user-attachments/assets/5dee1ae6-9146-4115-922d-df33b7ccbd37) > > ### Detailed Analysis > > In `PhaseCCP::analyze`, we call `Value` for the `PhiNode`, which > results in a type refinement: the range gets restricted to `int:-13957..-1191`. > > ```c++ > // Pull from worklist; compute new value; push changes out. > // This loop is the meat of CCP. > while (worklist.size() != 0) { > Node* n = fetch_next_node(worklist); > DEBUG_ONLY(worklist_verify.push(n);) > if (n->is_SafePoint()) { > // Make sure safepoints are processed by PhaseCCP::transform even if they are > // not reachable from the bottom. Otherwise, infinite loops would be removed. 
> _root_and_safepoints.push(n); > } > const Type* new_type = n->Value(this); > if (new_type != type(n)) { > DEBUG_ONLY(verify_type(n, new_type, type(n));) > dump_type_and_node(n, new_type); > set_type(n, new_type); > push_child_nodes_to_worklist(worklist, n); > } > if (KillPathsReachableByDeadTypeNode && n->is_Type() && new_type == Type::TOP) { > // Keep track of Type nodes to kill CFG paths that use Type > // nodes that become dead. > _maybe_top_type_nodes.push(n); > } > } > DEBUG_ONLY(verify_analyze(worklist_verify);) > > > At the end of `PhaseCCP::analyze`, we obtain the following types in the side table: > - `int` for node `591` (`ModINode`) > - `int:-13957..-1191` for node `138` (`PhiNode`) > > If we call `find_node(138)->bottom_type()`, we get: > - `int` for both nodes > > There is no progress on the type of `ModINode` during CCP, because `ModINode::Value` > is not able to... This pull request has now been integrated. Changeset: c75df634 Author: Beno?t Maillard Committer: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/c75df634be9a0073fa246d42e5c362a09f1734f3 Stats: 61 lines in 2 files changed: 61 ins; 0 del; 0 mod 8359602: Ideal optimizations depending on input type are missed because of missing notification mechanism from CCP Reviewed-by: epeter, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/26017 From jbhateja at openjdk.org Thu Jul 3 08:06:43 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 3 Jul 2025 08:06:43 GMT Subject: RFR: 8361037: [ubsan] compiler/c2/irTests/TestFloat16ScalarOperations division by 0 [v3] In-Reply-To: References: <6cWhCvx8g-Gx4VoBHW1wA7atsa_Eq5wBhkDolUbP_X0=.31f8e688-7401-4f81-9b50-46b1997e96b5@github.com> Message-ID: On Wed, 2 Jul 2025 23:02:37 GMT, Sandhya Viswanathan wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Adding comments > > Looks good to me. It will be good to get second review. Thanks @sviswa7 and @mhaessig ------------- PR Comment: https://git.openjdk.org/jdk/pull/26062#issuecomment-3031285321 From jbhateja at openjdk.org Thu Jul 3 08:06:44 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 3 Jul 2025 08:06:44 GMT Subject: Integrated: 8361037: [ubsan] compiler/c2/irTests/TestFloat16ScalarOperations division by 0 In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 10:08:20 GMT, Jatin Bhateja wrote: > Floating point division by zero is undefined per the C and C++ standards, but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing either an infinity or NaN value. > > While Java semantics defined in section 15.17.2 "Division Operator" of JLS-24 are well-defined for these constant-folding scenarios > > This bug fix patch fixes division by 0 error reported after integration of [JDK-8352635.](https://bugs.openjdk.org/browse/JDK-8352635) > Kindly review and share your feedback. > > Best Regards, > Jatin This pull request has now been integrated. 
Changeset: 2f683fdc Author: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/2f683fdc4a8f9c227e878b0d7fca645fc8abe1b6 Stats: 23 lines in 1 file changed: 23 ins; 0 del; 0 mod 8361037: [ubsan] compiler/c2/irTests/TestFloat16ScalarOperations division by 0 Reviewed-by: mhaessig, sviswanathan ------------- PR: https://git.openjdk.org/jdk/pull/26062 From bmaillard at openjdk.org Thu Jul 3 08:09:41 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Thu, 3 Jul 2025 08:09:41 GMT Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v3] In-Reply-To: References: Message-ID: On Mon, 23 Jun 2025 12:39:23 GMT, Marc Chevalier wrote: >> A first part toward a better support of pure functions, but this time, with guidance from @iwanowww. >> >> ## Pure Functions >> >> Pure functions (considered here) are functions that have no side effects, no effect on the control flow (no exception or such), cannot deopt etc.. It's really a function that you can execute anywhere, with whichever arguments without effect other than wasting time. Integer division is not pure as dividing by zero is throwing. But many floating point functions will just return `NaN` or `+/-infinity` in problematic cases. >> >> ## Scope >> >> We are not going all powerful for now! It's mostly about identifying some pure functions and being able to remove them if the result is unused. Some other things are not part of this PR, on purpose. Especially, this PR doesn't propose a way to move pure calls around. The reason is that pure calls are later expanded into regular calls, which require a control input. To be able to do the expansion, we just keep the control in the pure call as well. >> >> ## Implementation Overview >> >> We created here some new node kind for pure calls, inheriting leaf calls, that are expanded into regular leaf calls during final graph reshaping. The possibility to support pure call directly in AD file is left open. >> >> This PR also introduces `TupleNode` (largely based on an original idea/implem of @iwanowww), that just tie multiple input together and play well with `ProjNode`: the n-th projection of a `TupleNode` is the n-th input of the tuple. This is a convenient way to skip and remove nodes from the graph while delegating the difficulty of the surgery to the trusted IGVN's implementation. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > mostly comments src/hotspot/share/opto/parse2.cpp line 1100: > 1098: Node* Parse::floating_point_mod(Node* a, Node* b, BasicType type) { > 1099: assert(type == BasicType::T_FLOAT || type == BasicType::T_DOUBLE, "only float and double are floating points"); > 1100: CallLeafPureNode* mod = type == BasicType::T_DOUBLE ? static_cast(new ModDNode(C, a, b)) : new ModFNode(C, a, b); May I ask why we only need the `static_cast` for the `ModDNode` here? 
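For context, a minimal standalone sketch of what I assume the cast is working around (hypothetical `Base`/`DerivedA`/`DerivedB` types, not the actual node classes):

```c++
struct Base            { virtual ~Base() = default; };
struct DerivedA : Base { };
struct DerivedB : Base { };

Base* make(bool flag) {
    // Ill-formed: the two branches have unrelated pointer types (DerivedA* vs.
    // DerivedB*), so the conditional expression has no common type.
    //   return flag ? new DerivedA() : new DerivedB();

    // OK: casting one branch to the common base gives the ternary a result
    // type, and the other branch then converts to Base* implicitly.
    return flag ? static_cast<Base*>(new DerivedA()) : new DerivedB();
}
```

Is the common-type requirement of the conditional operator the only reason, or is there something more subtle going on?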
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25760#discussion_r2180229177 From eastigeevich at openjdk.org Thu Jul 3 08:18:56 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Thu, 3 Jul 2025 08:18:56 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v2] In-Reply-To: References: Message-ID: <6CyXvRWJLHBSZxw6E0TJPva7X2RoqBZjE5b0q4oqVas=.b9a1e93d-5209-4cb0-b9b0-b1fac2e696e1@github.com> On Tue, 1 Jul 2025 16:05:07 GMT, Evgeny Astigeevich wrote: >> Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. >> >> This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. >> >> Tested on Linux and MacOS with and without hsdis: >> - Fastdebug: test passed >> - Slowdebug: test passed. >> - Release: test passed. > > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Simplify requirement for debug build I have rewritten the test not to use debug info at all. The test works with instructions instead. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3031319871 From eastigeevich at openjdk.org Thu Jul 3 08:18:56 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Thu, 3 Jul 2025 08:18:56 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v3] In-Reply-To: References: Message-ID: > Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. > > This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. > > Tested on Linux and MacOS with and without hsdis: > - Fastdebug: test passed > - Slowdebug: test passed. > - Release: test passed. 
Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision:

  Reimplement checking algo without using debug info

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/26072/files
  - new: https://git.openjdk.org/jdk/pull/26072/files/e91036bc..0b3320e6

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=01-02

Stats: 139 lines in 1 file changed: 49 ins; 66 del; 24 mod
Patch: https://git.openjdk.org/jdk/pull/26072.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/26072/head:pull/26072

PR: https://git.openjdk.org/jdk/pull/26072

From mchevalier at openjdk.org  Thu Jul 3 08:21:41 2025
From: mchevalier at openjdk.org (Marc Chevalier)
Date: Thu, 3 Jul 2025 08:21:41 GMT
Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v3]
In-Reply-To: 
References: 
Message-ID: 

On Wed, 2 Jul 2025 14:35:26 GMT, Benoît Maillard wrote:

>> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   mostly comments
>
> src/hotspot/share/opto/parse2.cpp line 1100:
> 
>> 1098: Node* Parse::floating_point_mod(Node* a, Node* b, BasicType type) {
>> 1099:   assert(type == BasicType::T_FLOAT || type == BasicType::T_DOUBLE, "only float and double are floating points");
>> 1100:   CallLeafPureNode* mod = type == BasicType::T_DOUBLE ? static_cast<CallLeafPureNode*>(new ModDNode(C, a, b)) : new ModFNode(C, a, b);
> 
> May I ask why we only need the `static_cast` for the `ModDNode` here?

It's C/C++ being annoying here: both branches of the ternary must have the same type, or something compatible. If I remove the cast:

    error: conditional expression between distinct pointer types 'ModDNode*' and 'ModFNode*' lacks a cast

With the cast, C++ can convert the `ModFNode*` into a `CallLeafPureNode*` just fine. I didn't invent the cast, it was here before, but good to question it.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/25760#discussion_r2182166387

From alanb at openjdk.org  Thu Jul 3 08:39:40 2025
From: alanb at openjdk.org (Alan Bateman)
Date: Thu, 3 Jul 2025 08:39:40 GMT
Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining
In-Reply-To: 
References: 
Message-ID: 

On Sun, 29 Jun 2025 15:26:14 GMT, Richard Reingruber wrote:

> This PR adds CompileCommands to the test DumpThreadsWithEliminatedLock.java to force inlining of java/lang/String*.* methods. This will make inlining more stable to allow for the expected lock elimination based on c2 escape analysis.
> 
> Forcing inlining of java/lang/StringBuffer.* wasn't sufficient on x86_64. With that the test still failed with TieredCompilation disabled.
> 
> Testing: x86_64, ppc64 manually. Other major platforms as part of our CI testing.
> 
> Failed inlining on x86_64 with TieredCompilation disabled:
> 
> 
> make test TEST=com/sun/management/HotSpotDiagnosticMXBean/DumpThreadsWithEliminatedLock.java TEST_VM_OPTS="-XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=PrintInlining,DumpThreadsWithEliminatedLock.*" JTREG=TIMEOUT_FACTOR=0.1
> 
> [...]
> > STDOUT: > CompileCommand: PrintInlining DumpThreadsWithEliminatedLock.* bool PrintInlining = true > @ 1 java.util.concurrent.atomic.AtomicBoolean::get (13 bytes) inline (hot) > @ 11 java.lang.StringBuffer:: (7 bytes) inline (hot) late inline succeeded (string method) > @ 3 java.lang.AbstractStringBuilder:: (39 bytes) inline (hot) > @ 1 java.lang.Object:: (1 bytes) inline (hot) > @ 16 java.lang.System::currentTimeMillis (0 bytes) (intrinsic) > s @ 19 java.lang.StringBuffer::append (13 bytes) failed to inline: already compiled into a big method > s @ 24 java.lang.StringBuffer::toString (44 bytes) inline (hot) late inline succeeded (string method) > s @ 1 java.lang.StringBuffer::length (5 bytes) accessor > @ 24 java.lang.String:: (98 bytes) failed to inline: already compiled into a big method > @ 30 java.util.concurrent.atomic.AtomicReference::set (6 bytes) accessor > 2025-07-02T09:25:53.396634900Z Attempt 1, found: false > 2025-07-02T09:25:53.415673072Z Attempt 2, found: false > 2025-07-02T09:25:53.418876867Z Attempt 3, found: false > > [...] Thanks for improving this, this test was intended unstable. It might be that it could be updated to work with debug or -Xcomp too, execution times would need to be checked out. ------------- Marked as reviewed by alanb (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26033#pullrequestreview-2982264097 From mdoerr at openjdk.org Thu Jul 3 08:55:47 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 3 Jul 2025 08:55:47 GMT Subject: [jdk25] RFR: 8361259: JDK25: Backout JDK-8258229 In-Reply-To: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> References: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> Message-ID: <9gtw9iF8JY7RV3rnUau07YX5UfBJD5phY9yq_q16imE=.08ef8dc9-3ee5-4ea0-a5ea-661b5f12f9ed@github.com> On Wed, 2 Jul 2025 10:54:13 GMT, Martin Doerr wrote: > This is a backout of [JDK-8258229](https://bugs.openjdk.org/browse/JDK-8258229) for JDK25 only. The problematic code has already been removed by [JDK-8358821](https://bugs.openjdk.org/browse/JDK-8358821) in JDK26. > > The backout is clean for the C++ code, but the test backout includes the backout of the follow-up change [JDK-8356310](https://bugs.openjdk.org/browse/JDK-8356310). > > Rationale: Minimize risk for JDK25. We should use the better fix JDK-8358821 in the long term. However, that one should get some more stabilization time before backporting it. Also see JBS issue. > > Proposed long term solution: Backport JDK-8358821 to jdk25u and revert this change again after an appropriate time. > > Short term: The issue solved by JDK-8258229 is not critical. It should be ok to postpone the fix to jdk25u. Tests are also green on our side. Let's ship it! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26091#issuecomment-3031435423 From mdoerr at openjdk.org Thu Jul 3 08:55:47 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 3 Jul 2025 08:55:47 GMT Subject: [jdk25] Integrated: 8361259: JDK25: Backout JDK-8258229 In-Reply-To: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> References: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> Message-ID: On Wed, 2 Jul 2025 10:54:13 GMT, Martin Doerr wrote: > This is a backout of [JDK-8258229](https://bugs.openjdk.org/browse/JDK-8258229) for JDK25 only. 
The problematic code has already been removed by [JDK-8358821](https://bugs.openjdk.org/browse/JDK-8358821) in JDK26. > > The backout is clean for the C++ code, but the test backout includes the backout of the follow-up change [JDK-8356310](https://bugs.openjdk.org/browse/JDK-8356310). > > Rationale: Minimize risk for JDK25. We should use the better fix JDK-8358821 in the long term. However, that one should get some more stabilization time before backporting it. Also see JBS issue. > > Proposed long term solution: Backport JDK-8358821 to jdk25u and revert this change again after an appropriate time. > > Short term: The issue solved by JDK-8258229 is not critical. It should be ok to postpone the fix to jdk25u. This pull request has now been integrated. Changeset: 993215f3 Author: Martin Doerr URL: https://git.openjdk.org/jdk/commit/993215f3dd7aba221da8c901117a8ff3f0ccb675 Stats: 93 lines in 2 files changed: 0 ins; 93 del; 0 mod 8361259: JDK25: Backout JDK-8258229 Reviewed-by: mhaessig, thartmann, dlong ------------- PR: https://git.openjdk.org/jdk/pull/26091 From mdoerr at openjdk.org Thu Jul 3 09:58:47 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 3 Jul 2025 09:58:47 GMT Subject: [jdk25] RFR: 8361259: JDK25: Backout JDK-8258229 In-Reply-To: References: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> Message-ID: <7x65TpFJJJ2dTdjZq__12fVXAlY2Ta7HYOUc17Oe0zQ=.8ed717d7-c89e-4a1b-ad12-08cabceadf28@github.com> On Thu, 3 Jul 2025 02:33:23 GMT, Dean Long wrote: > > > Makes sense, but according to the Developers' Guide, we can't do that because "A Bug or Enhancement with resolution Fixed is required to have a corresponding changeset in one of the OpenJDK repositories." > > > > > > [cf75f1f](https://github.com/openjdk/jdk/commit/cf75f1f9c6d2bc70c7133cb81c73a0ce0946dff9) is a corresponding changset. We can link it. > > So two bugs would reference the same changeset, but the changeset only names 8358821? It might be better to close 8357017 as a duplicate instead of as Fixed. I've closed it as duplicate and added comments to the issues. Do we need anything else like a reminder that we want to consider [JDK-8358821](https://bugs.openjdk.org/browse/JDK-8358821) backport? Is there a label for that? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26091#issuecomment-3031647737 From mablakatov at openjdk.org Thu Jul 3 10:01:36 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Thu, 3 Jul 2025 10:01:36 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v7] In-Reply-To: References: Message-ID: > Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. > > Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still. > > The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reducion micro-benchmarks. 
> > Benchmarks results: > > Neoverse-V1 (SVE 256-bit) > > Benchmark (size) Mode master PR Units > ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms > ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms > IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms > LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms > FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms > DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms > > > Fujitsu A64FX (SVE 512-bit): > > Benchmark (size) Mode master PR Units > ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms > ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms > IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms > LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms > FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms > DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms Mikhail Ablakatov has updated the pull request incrementally with two additional commits since the last revision: - Compare VL against MaxVectorSize instead of FloatRegister::sve_vl_max - Use a dedicated ptrue predicate register This shifts MulReduction performance on Neoverse V1 a bit. Here Before if before this specific commit (ebad6dd37e332da44222c50cd17c69f3ff3f0635) and After is this commit. | Benchmark | Before (ops/ms) | After (ops/ms) | Diff (%) | | ------------------------ | --------------- | -------------- | -------- | | ByteMaxVector.MULLanes | 9883.151 | 9093.557 | -7.99% | | DoubleMaxVector.MULLanes | 2712.674 | 2607.367 | -3.89% | | FloatMaxVector.MULLanes | 3388.811 | 3291.429 | -2.88% | | IntMaxVector.MULLanes | 4765.554 | 5031.741 | +5.58% | | LongMaxVector.MULLanes | 2685.228 | 2896.445 | +7.88% | | ShortMaxVector.MULLanes | 5128.185 | 5197.656 | +1.35% | ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23181/files - new: https://git.openjdk.org/jdk/pull/23181/files/ebad6dd3..d35f1089 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=05-06 Stats: 69 lines in 4 files changed: 12 ins; 17 del; 40 mod Patch: https://git.openjdk.org/jdk/pull/23181.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23181/head:pull/23181 PR: https://git.openjdk.org/jdk/pull/23181 From jbhateja at openjdk.org Thu Jul 3 10:06:44 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 3 Jul 2025 10:06:44 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v10] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 08:26:00 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. 
Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments test/hotspot/jtreg/compiler/vectorapi/TestSelectFromTwoVectorOp.java line 234: > 232: > 233: @Test > 234: @IR(counts = {IRNode.SELECT_FROM_TWO_VECTOR_VS, IRNode.VECTOR_SIZE_8, ">0"}, Hi @Bhavana-Kilambi , Kindly also include x86-specific feature checks in IR rule for this test. You can directly integrate attached patch. [select_from_ir_feature.txt](https://github.com/user-attachments/files/21034639/select_from_ir_feature.txt) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2182389060 From mablakatov at openjdk.org Thu Jul 3 10:26:45 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Thu, 3 Jul 2025 10:26:45 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v4] In-Reply-To: References: <3sWLk_sAMLtcvRUjXk9hYe-K2MBQl9fH2Qg0MF7lwDk=.b8867d51-e822-43c0-93ab-58228c6eb1d5@github.com> <19rf4A0bxc4BstRmLivGkoCOm7Qa7YD6z1VJHJivCtg=.4a643c7b-4e79-4f37-b230-7231df3c68a8@github.com> Message-ID: On Wed, 2 Jul 2025 01:42:36 GMT, Xiaohong Gong wrote: >> Thanks! For some reason I thought that we don't have a dedicated predicate register for that. > > We can directly use `ptrue` here which maps to `p7` and has been preserved and initialized as all true. Done, although this has shifter the performance a bit: | Benchmark | Before (ops/ms) | After (ops/ms) | Diff (%) | | ------------------------ | --------------- | -------------- | -------- | | ByteMaxVector.MULLanes | 9883.151 | 9093.557 | -7.99% | | DoubleMaxVector.MULLanes | 2712.674 | 2607.367 | -3.89% | | FloatMaxVector.MULLanes | 3388.811 | 3291.429 | -2.88% | | IntMaxVector.MULLanes | 4765.554 | 5031.741 | +5.58% | | LongMaxVector.MULLanes | 2685.228 | 2896.445 | +7.88% | | ShortMaxVector.MULLanes | 5128.185 | 5197.656 | +1.35% | On average, the results didn't get worse. I suggest to merge the updated version as is as the shift seem to be related to micro-architectural effects not directly related to this PR and overall the PR still improves the performance by an order of magnitude (please reference https://github.com/openjdk/jdk/pull/23181#issuecomment-3018988067 for performance numbers before the PR) . I intent to closer investigate the reasons behind this later. 
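For readers who want to connect these MULLanes figures to source code, here is a minimal, self-contained sketch of the kind of multiply-reduction kernel such Vector API micro-benchmarks exercise. This is an illustrative example, not the panama-vector benchmark itself; the species and array size are assumptions chosen to mirror the MaxVector/1024 configuration above.

```
import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Run with: java --add-modules=jdk.incubator.vector MulReduceSketch.java
public class MulReduceSketch {
    static final VectorSpecies<Long> SPECIES = LongVector.SPECIES_MAX;

    // Multiply-reduce an array; reduceLanes(MUL) is roughly the operation
    // that the MulReduction backend work above targets on SVE.
    static long mulReduce(long[] a) {
        long acc = 1L;
        int i = 0;
        for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
            acc *= LongVector.fromArray(SPECIES, a, i)
                             .reduceLanes(VectorOperators.MUL);
        }
        for (; i < a.length; i++) {   // scalar tail
            acc *= a[i];
        }
        return acc;
    }

    public static void main(String[] args) {
        long[] data = new long[1024];
        java.util.Arrays.fill(data, 3L);
        System.out.println(mulReduce(data));
    }
}
```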
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2182426692 From galder at openjdk.org Thu Jul 3 11:17:41 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Thu, 3 Jul 2025 11:17:41 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v2] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 12:02:07 GMT, Aleksey Shipilev wrote: >> We routinely CTW 3rd party JARs to make sure our compilers work. By the nature of the JARs, they have dependencies on other JARs, and CTW runner frequently warns out with NCDFE. It does so very crudely, missing opportunities to compile the methods that _do not_ trigger NCDFEs. CTW should be made more tolerant to this. I think the normal "modules" CTW runs into the similar problem, but on a lesser scale, as we do not have a very hairy dependency graph within JDK. >> >> The CTW logs are also fairly noisy with full exception traces when NCDFE is semi-expected. This PR does _not_ print exception stack traces in these cases, only "NOTE"-s about it. This makes the log fairly clean and more understandable. >> >> Motivational scope improvement compiling a sample 3rd party JAR (cassandra-2.1.4.0.jar): >> >> >> Before: Done (2487 classes, 9866 methods, 24584 ms) >> After: Done (2487 classes, 10074 methods, 24150 ms) ; +2% more methods >> >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Merge branch 'master' into JDK-8361255-ctw-ncdfe > - Move clinit compile back > - Initial > - Fix Changes requested by galder (Author). test/hotspot/jtreg/testlibrary/ctw/src/sun/hotspot/tools/ctw/Compiler.java line 104: > 102: constructors = aClass.getDeclaredConstructors(); > 103: } catch (NoClassDefFoundError e) { > 104: CompileTheWorld.OUT.println(String.format("[%d]\t%s\tNOTE unable to get constructors : %s", Nitpick really but why not call `CompileTheWorld.OUT.printf(...` instead of `CompileTheWorld.OUT.println(String.format(...`? ------------- PR Review: https://git.openjdk.org/jdk/pull/26090#pullrequestreview-2982769212 PR Review Comment: https://git.openjdk.org/jdk/pull/26090#discussion_r2182520478 From mablakatov at openjdk.org Thu Jul 3 11:47:44 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Thu, 3 Jul 2025 11:47:44 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v3] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 16:22:42 GMT, Mikhail Ablakatov wrote: >> That would be the operations with partial vector size valid. For such cases, we will generate a mask in IR level, and a `VectorBlend` will be generated for this reduction case. Otherwise the result will be incorrect. So the vector size should be equal to MaxVectorSize theoretically. > > Thank you for elaborating on this :) Done, thanks! 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2182576732 From dbriemann at openjdk.org Thu Jul 3 12:35:55 2025 From: dbriemann at openjdk.org (David Briemann) Date: Thu, 3 Jul 2025 12:35:55 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI Message-ID: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Implement more nodes for ppc that exist on other platforms. ------------- Commit messages: - 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI Changes: https://git.openjdk.org/jdk/pull/26115/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26115&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361353 Stats: 87 lines in 4 files changed: 86 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26115.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26115/head:pull/26115 PR: https://git.openjdk.org/jdk/pull/26115 From dnsimon at openjdk.org Thu Jul 3 13:04:19 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 3 Jul 2025 13:04:19 GMT Subject: RFR: 8361355: Negative cases of Annotated.getAnnotationData implementations are broken Message-ID: This PR fixes bugs in the implementation of `jdk.vm.ci.meta.Annotated.getAnnotationData`: 1. Calling `getAnnotatedData(annotationType)` fails with an ArrayIndexOutOfBoundsException instead of returning null when the receiver type is not annotated by `annotationType`. 2. Calling either of the `getAnnotatedData` methods with an `annotationType` value that does not represent an annotation interface silently succeeds when the receiver type does not (or can not) have any annotations (e.g. array and primitive types). ------------- Commit messages: - fixed negative cases in getAnnotationData Changes: https://git.openjdk.org/jdk/pull/26116/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26116&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361355 Stats: 100 lines in 7 files changed: 89 ins; 1 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/26116.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26116/head:pull/26116 PR: https://git.openjdk.org/jdk/pull/26116 From shade at openjdk.org Thu Jul 3 13:33:23 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 3 Jul 2025 13:33:23 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v3] In-Reply-To: References: Message-ID: > We routinely CTW 3rd party JARs to make sure our compilers work. By the nature of the JARs, they have dependencies on other JARs, and CTW runner frequently warns out with NCDFE. It does so very crudely, missing opportunities to compile the methods that _do not_ trigger NCDFEs. CTW should be made more tolerant to this. I think the normal "modules" CTW runs into the similar problem, but on a lesser scale, as we do not have a very hairy dependency graph within JDK. > > The CTW logs are also fairly noisy with full exception traces when NCDFE is semi-expected. This PR does _not_ print exception stack traces in these cases, only "NOTE"-s about it. This makes the log fairly clean and more understandable. 
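A rough, self-contained sketch of the tolerant pattern being described (illustrative names only; the actual change lives in the CTW test library's Compiler.java):

```
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;

class TolerantLookup {
    // A missing transitive dependency should only skip what it affects,
    // not abort compilation of the whole class.
    static Method[] methodsOrEmpty(Class<?> aClass) {
        try {
            return aClass.getDeclaredMethods();
        } catch (NoClassDefFoundError e) {
            System.out.printf("NOTE unable to get methods : %s%n", e);
            return new Method[0];
        }
    }

    static Constructor<?>[] constructorsOrEmpty(Class<?> aClass) {
        try {
            return aClass.getDeclaredConstructors();
        } catch (NoClassDefFoundError e) {
            System.out.printf("NOTE unable to get constructors : %s%n", e);
            return new Constructor<?>[0];
        }
    }
}
```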
> > Motivational scope improvement compiling a sample 3rd party JAR (cassandra-2.1.4.0.jar): > > > Before: Done (2487 classes, 9866 methods, 24584 ms) > After: Done (2487 classes, 10074 methods, 24150 ms) ; +2% more methods > > > Additional testing: > - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: Just use printf directly ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26090/files - new: https://git.openjdk.org/jdk/pull/26090/files/9d41f80a..04fd5e50 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26090&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26090&range=01-02 Stats: 14 lines in 2 files changed: 0 ins; 0 del; 14 mod Patch: https://git.openjdk.org/jdk/pull/26090.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26090/head:pull/26090 PR: https://git.openjdk.org/jdk/pull/26090 From shade at openjdk.org Thu Jul 3 13:33:24 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 3 Jul 2025 13:33:24 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v2] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 11:14:41 GMT, Galder Zamarre?o wrote: >> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: >> >> - Merge branch 'master' into JDK-8361255-ctw-ncdfe >> - Move clinit compile back >> - Initial >> - Fix > > test/hotspot/jtreg/testlibrary/ctw/src/sun/hotspot/tools/ctw/Compiler.java line 104: > >> 102: constructors = aClass.getDeclaredConstructors(); >> 103: } catch (NoClassDefFoundError e) { >> 104: CompileTheWorld.OUT.println(String.format("[%d]\t%s\tNOTE unable to get constructors : %s", > > Nitpick really but why not call `CompileTheWorld.OUT.printf(...` instead of `CompileTheWorld.OUT.println(String.format(...`? Mostly because it was the style of the surrounding code. But I don't see why not use `printf` directly indeed, done in new commit. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26090#discussion_r2182798874 From dnsimon at openjdk.org Thu Jul 3 14:13:23 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 3 Jul 2025 14:13:23 GMT Subject: RFR: 8361355: Negative cases of Annotated.getAnnotationData implementations are broken [v2] In-Reply-To: References: Message-ID: > This PR fixes bugs in the implementation of `jdk.vm.ci.meta.Annotated.getAnnotationData`: > 1. Calling `getAnnotatedData(annotationType)` fails with an ArrayIndexOutOfBoundsException instead of returning null when the receiver type is not annotated by `annotationType`. > 2. Calling either of the `getAnnotatedData` methods with an `annotationType` value that does not represent an annotation interface silently succeeds when the receiver type does not (or can not) have any annotations (e.g. array and primitive types). Doug Simon has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. 
The pull request contains one new commit since the last revision: fixed negative cases in getAnnotationData ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26116/files - new: https://git.openjdk.org/jdk/pull/26116/files/86b41636..b25684f7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26116&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26116&range=00-01 Stats: 11 lines in 1 file changed: 5 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/26116.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26116/head:pull/26116 PR: https://git.openjdk.org/jdk/pull/26116 From mdoerr at openjdk.org Thu Jul 3 14:27:42 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 3 Jul 2025 14:27:42 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI In-Reply-To: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: <0lfxjYCRp6xFM8c_RDhbLEtbwM5J3huFxjcOqcWVykU=.908af2c8-2a2d-4719-b598-45b716ab8658@github.com> On Thu, 3 Jul 2025 12:30:51 GMT, David Briemann wrote: > Implement more nodes for ppc that exist on other platforms. Thanks for implementing these nodes! The new instruction needs a Power9 check. Otherwise, LGTM. src/hotspot/cpu/ppc/assembler_ppc.hpp line 2376: > 2374: inline void vctzw( VectorRegister d, VectorRegister b); > 2375: inline void vctzd( VectorRegister d, VectorRegister b); > 2376: inline void vnegw( VectorRegister d, VectorRegister b); A Power9 comment would be helpful to prevent wrong usage. src/hotspot/cpu/ppc/ppc.ad line 2196: > 2194: case Op_AbsVF: > 2195: case Op_AbsVD: > 2196: case Op_NegVI: vnegw requires Power9 (`PowerArchitecturePPC64 >= 9`). src/hotspot/cpu/ppc/ppc.ad line 13583: > 13581: > 13582: instruct vnegI_reg(vecX dst, vecX src) %{ > 13583: match(Set dst (NegVI src)); Should use a predicate for Power9. ------------- PR Review: https://git.openjdk.org/jdk/pull/26115#pullrequestreview-2983369466 PR Review Comment: https://git.openjdk.org/jdk/pull/26115#discussion_r2182917169 PR Review Comment: https://git.openjdk.org/jdk/pull/26115#discussion_r2182910035 PR Review Comment: https://git.openjdk.org/jdk/pull/26115#discussion_r2182926525 From rrich at openjdk.org Thu Jul 3 14:38:40 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Thu, 3 Jul 2025 14:38:40 GMT Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining In-Reply-To: References: Message-ID: <0p61J0DPfyHsen3r__V82eEZSPYaT9rZleHBtanKaRc=.c5f6992f-a7fe-4c95-bdcb-2887c3dbde21@github.com> On Thu, 3 Jul 2025 08:36:53 GMT, Alan Bateman wrote: > It might be that it could be updated to work with debug or -Xcomp too, execution times would need to be checked out. I found that the runtime of each test is ~300ms with a release build and ~11s with a fastdebug build on x86_64 and ppc64. If you like I can remove the requirement within this pr and do some more testing. -Xcomp doesn't seem to work. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26033#issuecomment-3032511575 From hgreule at openjdk.org Thu Jul 3 14:55:44 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Thu, 3 Jul 2025 14:55:44 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v2] In-Reply-To: References: Message-ID: On Thu, 26 Jun 2025 07:55:23 GMT, Hannes Greule wrote: >> Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. >> >> Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. >> >> I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. >> >> Please review. Thanks. > > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > remove classfile version @iwanowww @eme64 as you reviewed the original change, could you have a look at this? Thank you very much. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25988#issuecomment-3032565530 From never at openjdk.org Thu Jul 3 14:59:39 2025 From: never at openjdk.org (Tom Rodriguez) Date: Thu, 3 Jul 2025 14:59:39 GMT Subject: RFR: 8361355: Negative cases of Annotated.getAnnotationData implementations are broken [v2] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 14:13:23 GMT, Doug Simon wrote: >> This PR fixes bugs in the implementation of `jdk.vm.ci.meta.Annotated.getAnnotationData`: >> 1. Calling `getAnnotatedData(annotationType)` fails with an ArrayIndexOutOfBoundsException instead of returning null when the receiver type is not annotated by `annotationType`. >> 2. Calling either of the `getAnnotatedData` methods with an `annotationType` value that does not represent an annotation interface silently succeeds when the receiver type does not (or can not) have any annotations (e.g. array and primitive types). > > Doug Simon has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > fixed negative cases in getAnnotationData Looks good. ------------- Marked as reviewed by never (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26116#pullrequestreview-2983525519 From vpaprotski at openjdk.org Thu Jul 3 15:14:42 2025 From: vpaprotski at openjdk.org (Volodymyr Paprotski) Date: Thu, 3 Jul 2025 15:14:42 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs In-Reply-To: References: Message-ID: <3R2flcCvwCbIMgCJqOVnrUXgAZJsi9Ja2r4is2tCnLg=.cab9d74a-7998-466c-9d24-8672f3f8883b@github.com> On Wed, 2 Jul 2025 23:28:42 GMT, Srinivas Vamsi Parasa wrote: >> @vamsi-parasa, It's better to make this as a subclass of MacroAssembler in src/hotspot/cpu/x86/macroAssembler_x86.hpp and pass Tracker as an argument to push / pop for a cleaner interface. > > Hi Jatin (@jatin-bhateja) and Vlad (@vpaprotsk), > > There's one more issue to be considered. The C++ PushPopTracker code will be run during the stub generation time. There are code bocks which do a single push onto the stack but due to multiple exit paths, there will be multiple pops as illustrated below. 
Will this reference counting approach not fail in such a scenario as the stub code is generated all at once during the stub generation phase? > > > #begin stack frame > push(r21) > > #exit condition 1 > pop(r21) > > # exit condition 2 > pop(r21) Now that I had my fun writing an array-backed stack.. (and with David's comment too..) I can admit that the point of the entire C++ Tracker class is to 'just' add an assert; doesn't actually functionally add to the original code, but does add better JIT/stub compile-time checking. @vamsi-parasa you are right.. if there are ifs and multiple exit paths in the assembler itself.. the Tracker wont be able to catch it (multiple exits paths in the generator are just fine though); I was thinking about this problem too last night... a hack/'solution' would be to disable such checking with a default flag in the constructor... 'fairly trivial' but just adds to the complexity even more. And the assert was the point of the class to begin with... I do think such stubs are rare? There is some value in improved checking, but enough? Writing stubs is already an 'you should know assembler very well' thing so those checks only improve things marginally overall? As David says, its for the compiler folks to decide :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2183043350 From shade at openjdk.org Thu Jul 3 16:23:39 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 3 Jul 2025 16:23:39 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode In-Reply-To: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> Message-ID: On Thu, 3 Jul 2025 01:59:55 GMT, hanguanqiang wrote: > This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode > > Problem? > When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. > > Root Cause? > Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. > > Fix > Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. IMO, there is no point in fixing `-GenerateSynchronizationCode`, and instead we should just remove the flag. I propose we do this under the umbrella of this bug, just rename it to something like `Purge GenerateSynchronizationCode flag`. It is `develop`, so we don't even need a compatibility review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26108#issuecomment-3032852570 From lmesnik at openjdk.org Thu Jul 3 16:59:40 2025 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Thu, 3 Jul 2025 16:59:40 GMT Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining In-Reply-To: References: Message-ID: On Sun, 29 Jun 2025 15:26:14 GMT, Richard Reingruber wrote: > This PR adds CompileCommands to the test DumpThreadsWithEliminatedLock.java to force inlining of java/lang/String*.* methods. 
This will make inlining more stable to allow for the expected lock elimination based on c2 escape analysis. > > Forcing inlining of java/lang/StringBuffer.* wasn't sufficient on x86_64. With that the test still failed with TieredCompilation disabled. > > Testing: x86_64, ppc64 manually. Other major platforms as part of our CI testing. > > Failed inlining on x86_64 with TieredCompilation disabled: > > > make test TEST=com/sun/management/HotSpotDiagnosticMXBean/DumpThreadsWithEliminatedLock.java TEST_VM_OPTS="-XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=PrintInlining,DumpThreadsWithEliminatedLock.*" JTREG=TIMEOUT_FACTOR=0.1 > > [...] > > STDOUT: > CompileCommand: PrintInlining DumpThreadsWithEliminatedLock.* bool PrintInlining = true > @ 1 java.util.concurrent.atomic.AtomicBoolean::get (13 bytes) inline (hot) > @ 11 java.lang.StringBuffer:: (7 bytes) inline (hot) late inline succeeded (string method) > @ 3 java.lang.AbstractStringBuilder:: (39 bytes) inline (hot) > @ 1 java.lang.Object:: (1 bytes) inline (hot) > @ 16 java.lang.System::currentTimeMillis (0 bytes) (intrinsic) > s @ 19 java.lang.StringBuffer::append (13 bytes) failed to inline: already compiled into a big method > s @ 24 java.lang.StringBuffer::toString (44 bytes) inline (hot) late inline succeeded (string method) > s @ 1 java.lang.StringBuffer::length (5 bytes) accessor > @ 24 java.lang.String:: (98 bytes) failed to inline: already compiled into a big method > @ 30 java.util.concurrent.atomic.AtomicReference::set (6 bytes) accessor > 2025-07-02T09:25:53.396634900Z Attempt 1, found: false > 2025-07-02T09:25:53.415673072Z Attempt 2, found: false > 2025-07-02T09:25:53.418876867Z Attempt 3, found: false > > [...] Marked as reviewed by lmesnik (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26033#pullrequestreview-2983913131 From enikitin at openjdk.org Thu Jul 3 17:01:48 2025 From: enikitin at openjdk.org (Evgeny Nikitin) Date: Thu, 3 Jul 2025 17:01:48 GMT Subject: Integrated: 8357739: [jittester] disable the hashCode method In-Reply-To: References: Message-ID: On Tue, 17 Jun 2025 19:49:34 GMT, Evgeny Nikitin wrote: > JITTester often uses the `hasCode` method (in fact, in almost every generated test). Given that the method can be unstable between runs or in interpreted vs compiled runs, it can create false-positives. > > This PR fixes the issue by adding support for method templates similar to the ones used in CompilerCommands). All of those exclude templates match (and exclude) `String.indexOf(String)`, for example: > > java/lang/::*(Ljava/lang/String;I) > *String::indexOf(*) > java/lang/*::indexOf > > > Additionally, the PR adds support for comments (starting from '#') and empty lines in the excludes file. This pull request has now been integrated. 
Changeset: a2315ddd Author: Evgeny Nikitin Committer: Leonid Mesnik URL: https://git.openjdk.org/jdk/commit/a2315ddd2a343ed594dd1b0b3d0dc5b3a71f509b Stats: 556 lines in 4 files changed: 402 ins; 121 del; 33 mod 8357739: [jittester] disable the hashCode method Reviewed-by: lmesnik ------------- PR: https://git.openjdk.org/jdk/pull/25859 From dnsimon at openjdk.org Thu Jul 3 17:30:56 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 3 Jul 2025 17:30:56 GMT Subject: RFR: 8361355: Negative cases of Annotated.getAnnotationData implementations are broken [v3] In-Reply-To: References: Message-ID: <83aGkzmp5J7JllBsWK5ZzwZAa4GVsNk5VjmkH0O3FjE=.2507d7ce-65df-4121-acdf-35125d530d39@github.com> > This PR fixes bugs in the implementation of `jdk.vm.ci.meta.Annotated.getAnnotationData`: > 1. Calling `getAnnotatedData(annotationType)` fails with an ArrayIndexOutOfBoundsException instead of returning null when the receiver type is not annotated by `annotationType`. > 2. Calling either of the `getAnnotatedData` methods with an `annotationType` value that does not represent an annotation interface silently succeeds when the receiver type does not (or can not) have any annotations (e.g. array and primitive types). Doug Simon has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge tag 'jdk-26+4' into JDK-8361355 Added tag jdk-26+4 for changeset 1ca008fd - fixed negative cases in getAnnotationData ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26116/files - new: https://git.openjdk.org/jdk/pull/26116/files/b25684f7..ec161d59 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26116&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26116&range=01-02 Stats: 6616 lines in 362 files changed: 3437 ins; 1484 del; 1695 mod Patch: https://git.openjdk.org/jdk/pull/26116.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26116/head:pull/26116 PR: https://git.openjdk.org/jdk/pull/26116 From alanb at openjdk.org Thu Jul 3 17:59:44 2025 From: alanb at openjdk.org (Alan Bateman) Date: Thu, 3 Jul 2025 17:59:44 GMT Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining In-Reply-To: <0p61J0DPfyHsen3r__V82eEZSPYaT9rZleHBtanKaRc=.c5f6992f-a7fe-4c95-bdcb-2887c3dbde21@github.com> References: <0p61J0DPfyHsen3r__V82eEZSPYaT9rZleHBtanKaRc=.c5f6992f-a7fe-4c95-bdcb-2887c3dbde21@github.com> Message-ID: On Thu, 3 Jul 2025 14:36:15 GMT, Richard Reingruber wrote: > I found that the runtime of each test is ~300ms with a release build and ~11s with a fastdebug build on x86_64 and ppc64. If you like I can remove the requirement within this pr and do some more testing. -Xcomp doesn't seem to work. I think that would be useful, thank you. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26033#issuecomment-3033099720 From dlunden at openjdk.org Thu Jul 3 18:18:49 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Thu, 3 Jul 2025 18:18:49 GMT Subject: RFR: 8360701: Add bailout when the register allocator interference graph grows unreasonably large Message-ID: The changeset for JDK-8325467 (https://git.openjdk.org/jdk/pull/20404) enables compilation of methods with many parameters, which C2 previously bailed out on. 
As a side effect, the tests `BigArityTest.java`, `TestCatchExceptionWithVarargs.java`, and `VarargsArrayTest.java` compile more methods than before, and additionally these methods are designed, for stress testing purposes, to have a large number of parameters (at or close to the maximum of 255 parameters allowed by the JVM spec). Compiling such methods takes a very long time and >99% of the time is spent in the C2 phase Coalesce 2 (part of register allocation). The problem is that the interference graph becomes huge after the initial round of spilling (just before Coalesce 2), and that we do not check for this and bail out if necessary. We do already bail out if the number of IR nodes grows too large, but the interference graph can become huge even if we have a small number of nodes. In fact, the interference graph may (in the worst case) have a size that is quadratic in the number of nodes. In the problematic tests, we have interference graphs with approximately 100 000 nodes and over 55 000 000 (!) IFG edges. For comparison, the IFG edge count in worst-case realistic scenarios caps out at around 40 000 nodes and 800 000 edges. For example, see the scatter matrix below from running the DaCapo benchmark. It displays, for each time an IFG was built, the number of current IR nodes, the number of live ranges (the actual nodes in the IFG), and the number of IFG edges. ![dacapo](https://github.com/user-attachments/assets/7a070768-50da-42e4-b5ed-9958e1362673)

### Changeset
- Add a new diagnostic flag `IFGEdgesLimit` and bail out whenever we reach the number of edges specified by the flag during IFG construction. The default is a very generous 10 000 000 edges, which still filters out the most degenerate compilations we have seen.
- Add tracking of edges in `PhaseIFG` to permit the new flag.

It is worth noting that it is perhaps preferable to use a lower default than 10 000 000 edges. For example, in standard benchmarks such as DaCapo (see the scatter matrix above), Renaissance, SPECjvm, and SPECjbb, we never go over 1 000 000 edges (I verified this). The reason I went with the generous 10 000 000 limit is that I saw a fair amount of bailouts in testing with the flag set at 1 000 000 edges. Such bailouts are likely motivated, but I do not want to take any chances. Even at 10 000 000 edges, a few tests still hit the limit with certain JVM flag combinations:
- `applications/ctw/modules/java_base.java`
- `compiler/codegen/TestAntiDependenciesHighMemUsage2.java`
- `compiler/loopopts/superword/TestAlignVectorFuzzer.java`

### Testing
- [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/16047279249)
- `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64.
- C2 compilation speed benchmarking on DaCapo. Compilation speed is unaffected.
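Assuming `IFGEdgesLimit` follows the usual conventions for diagnostic flags, experimenting with a lower limit on a product build would presumably look like `java -XX:+UnlockDiagnosticVMOptions -XX:IFGEdgesLimit=1000000 ...`; when the limit is hit, C2 gives up on the method in the same way as the existing node-count bailout, rather than spending unbounded time in register allocation.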
------------- Commit messages: - Bail out if too many IFG edges Changes: https://git.openjdk.org/jdk/pull/26118/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26118&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8360701 Stats: 38 lines in 4 files changed: 37 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26118.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26118/head:pull/26118 PR: https://git.openjdk.org/jdk/pull/26118 From dlong at openjdk.org Thu Jul 3 20:43:44 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 3 Jul 2025 20:43:44 GMT Subject: [jdk25] RFR: 8361259: JDK25: Backout JDK-8258229 In-Reply-To: <7x65TpFJJJ2dTdjZq__12fVXAlY2Ta7HYOUc17Oe0zQ=.8ed717d7-c89e-4a1b-ad12-08cabceadf28@github.com> References: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> <7x65TpFJJJ2dTdjZq__12fVXAlY2Ta7HYOUc17Oe0zQ=.8ed717d7-c89e-4a1b-ad12-08cabceadf28@github.com> Message-ID: On Thu, 3 Jul 2025 09:55:51 GMT, Martin Doerr wrote: > I've closed it as duplicate and added comments to the issues. Thanks! > Do we need anything else like a reminder that we want to consider [JDK-8358821](https://bugs.openjdk.org/browse/ JDK-8358821) backport? Is there a label for that? The Developers' Guide says you can add a (Rel)-bp label to suggest a backport, so that would be "25-bp" for jdk25. If we definitely want to backport to particular release then we could create the Backport issue now as a placeholder. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26091#issuecomment-3033572554 From kvn at openjdk.org Thu Jul 3 22:53:40 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 3 Jul 2025 22:53:40 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 13:33:23 GMT, Aleksey Shipilev wrote: >> We routinely CTW 3rd party JARs to make sure our compilers work. By the nature of the JARs, they have dependencies on other JARs, and CTW runner frequently warns out with NCDFE. It does so very crudely, missing opportunities to compile the methods that _do not_ trigger NCDFEs. CTW should be made more tolerant to this. I think the normal "modules" CTW runs into the similar problem, but on a lesser scale, as we do not have a very hairy dependency graph within JDK. >> >> The CTW logs are also fairly noisy with full exception traces when NCDFE is semi-expected. This PR does _not_ print exception stack traces in these cases, only "NOTE"-s about it. This makes the log fairly clean and more understandable. >> >> Motivational scope improvement compiling a sample 3rd party JAR (cassandra-2.1.4.0.jar): >> >> >> Before: Done (2487 classes, 9866 methods, 24584 ms) >> After: Done (2487 classes, 10074 methods, 24150 ms) ; +2% more methods >> >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Just use printf directly test/hotspot/jtreg/testlibrary/ctw/src/sun/hotspot/tools/ctw/Compiler.java line 89: > 87: UNSAFE.ensureClassInitialized(aClass); > 88: } catch (NoClassDefFoundError e) { > 89: CompileTheWorld.OUT.printf("[%d]\t%s\tNOTE unable to init class : %s%n", Do you mean `\n` here and in all other outputs? `%n` needs local variable to store size of output. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26090#discussion_r2183886728 From kvn at openjdk.org Thu Jul 3 23:16:39 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 3 Jul 2025 23:16:39 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode In-Reply-To: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> Message-ID: <9wyR3KHZTWl-cf7rOq7ryEiP4e2AsxCyrylrfcWnKfM=.adb77f9b-213d-4b07-8362-aa8e5601f527@github.com> On Thu, 3 Jul 2025 01:59:55 GMT, hanguanqiang wrote: > This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode > > Problem? > When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. > > Root Cause? > Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. > > Fix > Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. I agree with removal of this flag. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26108#issuecomment-3033917992 From duke at openjdk.org Fri Jul 4 00:27:39 2025 From: duke at openjdk.org (hanguanqiang) Date: Fri, 4 Jul 2025 00:27:39 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode In-Reply-To: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> Message-ID: On Thu, 3 Jul 2025 01:59:55 GMT, hanguanqiang wrote: > This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode > > Problem? > When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. > > Root Cause? > Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. > > Fix > Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. Thank you all for the helpful feedback! I also think the GenerateSynchronizationCode flag is not particularly useful and can be removed. I will update this patch accordingly to eliminate the flag and simplify the related code. 
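For context, the code path under discussion is C2's parsing of a monitorexit bytecode. A minimal sketch of the kind of Java source that reaches Parse::do_monitor_exit() is below; nothing about it is specific to this PR, it is simply any synchronized region that gets hot enough to be C2-compiled, and since `GenerateSynchronizationCode` is a develop flag, actually flipping it requires a debug build.

```
public class SyncSketch {
    private static final Object LOCK = new Object();
    private static long counter;

    // javac emits monitorenter/monitorexit for this block; C2 handles the
    // monitorexit in Parse::do_monitor_exit() while parsing the method.
    static void bump() {
        synchronized (LOCK) {
            counter++;
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 200_000; i++) {
            bump();               // warm up so the method gets JIT-compiled
        }
        System.out.println(counter);
    }
}
```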
------------- PR Comment: https://git.openjdk.org/jdk/pull/26108#issuecomment-3034005241 From duke at openjdk.org Fri Jul 4 01:15:13 2025 From: duke at openjdk.org (hanguanqiang) Date: Fri, 4 Jul 2025 01:15:13 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode [v2] In-Reply-To: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> Message-ID: > This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode > > Problem? > When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. > > Root Cause? > Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. > > Fix > Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. hanguanqiang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: - remove the unused flag(GenerateSynchronizationCode) - 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode Problem? When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. Root Cause? Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. Fix Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. ------------- Changes: https://git.openjdk.org/jdk/pull/26108/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26108&range=01 Stats: 34 lines in 7 files changed: 10 ins; 16 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/26108.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26108/head:pull/26108 PR: https://git.openjdk.org/jdk/pull/26108 From duke at openjdk.org Fri Jul 4 01:26:42 2025 From: duke at openjdk.org (hanguanqiang) Date: Fri, 4 Jul 2025 01:26:42 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode [v3] In-Reply-To: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> Message-ID: > This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode > > Problem? > When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. > > Root Cause? 
> Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. > > Fix > Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. hanguanqiang has updated the pull request incrementally with one additional commit since the last revision: remove trailing whitespace remove trailing whitespace ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26108/files - new: https://git.openjdk.org/jdk/pull/26108/files/972f324b..d01533e1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26108&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26108&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26108.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26108/head:pull/26108 PR: https://git.openjdk.org/jdk/pull/26108 From duke at openjdk.org Fri Jul 4 01:30:22 2025 From: duke at openjdk.org (hanguanqiang) Date: Fri, 4 Jul 2025 01:30:22 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode [v4] In-Reply-To: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> Message-ID: > This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode > > Problem? > When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. > > Root Cause? > Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. > > Fix > Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. hanguanqiang has updated the pull request incrementally with one additional commit since the last revision: Delete .gitpod.yml ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26108/files - new: https://git.openjdk.org/jdk/pull/26108/files/d01533e1..1d6e8f5c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26108&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26108&range=02-03 Stats: 10 lines in 1 file changed: 0 ins; 10 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26108.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26108/head:pull/26108 PR: https://git.openjdk.org/jdk/pull/26108 From duke at openjdk.org Fri Jul 4 01:34:45 2025 From: duke at openjdk.org (hanguanqiang) Date: Fri, 4 Jul 2025 01:34:45 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode [v4] In-Reply-To: References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> Message-ID: On Thu, 3 Jul 2025 04:40:33 GMT, David Holmes wrote: >> hanguanqiang has updated the pull request incrementally with one additional commit since the last revision: >> >> Delete .gitpod.yml > > The patch seems reasonable from a backporting perspective. 
Though it does beg the question as to why `do_monitor_enter` does not need the same fix. I suspect this is a very old flag and the code has bit-rotted somewhat. A question for the compiler folk: does `GenerateSynchronizationCode` still have any use or should it be scrapped? > > Thanks @dholmes-ora @dean-long @shipilev @vnkozlov Thanks for the previous reviews! I?ve updated the patch according to the suggestions. When you have a moment, could you please take another look? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26108#issuecomment-3034151673 From xgong at openjdk.org Fri Jul 4 01:37:42 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 4 Jul 2025 01:37:42 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> Message-ID: On Thu, 3 Jul 2025 06:10:28 GMT, Xiaohong Gong wrote: >> ### Background >> On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. >> >> For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. >> >> To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. >> >> ### Impact Analysis >> #### 1. Vector types >> Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. >> >> #### 2. Vector API >> No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. >> >> #### 3. Auto-vectorization >> Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. >> >> #### 4. Codegen of vector nodes >> NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. >> >> Details: >> - Lanewise vector operations are unaffected as explained above. >> - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). >> - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. 
Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, addin... > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Refine the comment in ad file Hi @theRealAph , the review comments have been addressed. Would you mind taking another look please? Thank you so much! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3034155586 From xgong at openjdk.org Fri Jul 4 02:03:43 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 4 Jul 2025 02:03:43 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v4] In-Reply-To: References: <3sWLk_sAMLtcvRUjXk9hYe-K2MBQl9fH2Qg0MF7lwDk=.b8867d51-e822-43c0-93ab-58228c6eb1d5@github.com> <19rf4A0bxc4BstRmLivGkoCOm7Qa7YD6z1VJHJivCtg=.4a643c7b-4e79-4f37-b230-7231df3c68a8@github.com> Message-ID: On Thu, 3 Jul 2025 10:24:02 GMT, Mikhail Ablakatov wrote: >> We can directly use `ptrue` here which maps to `p7` and has been preserved and initialized as all true. > > Done, although this has shifter the performance a bit: > > > | Benchmark | Before (ops/ms) | After (ops/ms) | Diff (%) | > | ------------------------ | --------------- | -------------- | -------- | > | ByteMaxVector.MULLanes | 9883.151 | 9093.557 | -7.99% | > | DoubleMaxVector.MULLanes | 2712.674 | 2607.367 | -3.89% | > | FloatMaxVector.MULLanes | 3388.811 | 3291.429 | -2.88% | > | IntMaxVector.MULLanes | 4765.554 | 5031.741 | +5.58% | > | LongMaxVector.MULLanes | 2685.228 | 2896.445 | +7.88% | > | ShortMaxVector.MULLanes | 5128.185 | 5197.656 | +1.35% | > > > On average, the results didn't get worse. I suggest to merge the updated version as is as the shift seem to be related to micro-architectural effects not directly related to this PR and overall the PR still improves the performance by an order of magnitude (please reference https://github.com/openjdk/jdk/pull/23181#issuecomment-3018988067 for performance numbers before the PR) . I intent to closer investigate the reasons behind this later. I'm fine with the latest version because it saves the mask generation and a predicate temp register. The minor regressions are fine to me. BTW, Not sure whether the masked operation with partial lanes is more efficient compared with all lane computations. This maybe the HW micro-architecture implementation related issues. I didn't have an investigation for this before. Additionally, currently all the lanewise operations (e.g. `MulV/AddV/...`) with partial vector size are all implemented with `ptrue`. I agree with keeping it as it is, and taking an investigation for this later. Thanks for your updating! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2184132213 From duke at openjdk.org Fri Jul 4 02:55:27 2025 From: duke at openjdk.org (hanguanqiang) Date: Fri, 4 Jul 2025 02:55:27 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp Message-ID: When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. 
Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this flag is disabled. ------------- Commit messages: - 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp Changes: https://git.openjdk.org/jdk/pull/26125/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26125&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361140 Stats: 3 lines in 1 file changed: 3 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26125.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26125/head:pull/26125 PR: https://git.openjdk.org/jdk/pull/26125 From fyang at openjdk.org Fri Jul 4 05:27:39 2025 From: fyang at openjdk.org (Fei Yang) Date: Fri, 4 Jul 2025 05:27:39 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v2] In-Reply-To: References: Message-ID: On Thu, 26 Jun 2025 14:27:21 GMT, Feilong Jiang wrote: >> Hi, please consider. >> [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. >> The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. >> If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. >> This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. >> We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. >> >> This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. >> The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. >> >> Test on linux-riscv64: >> - [x] Tier1-3 >> >> JMH data on P550 SBC for reference (w/o and w/ the patch): >> >> Before: >> >> Without COH: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op >> ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op >> ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op >> ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op >> ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op >> ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op >> ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op >> ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op >> ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op >> ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op >> ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op >> ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op >> >> ------------------------------------------------------------------------- >> With COH: >> >> Benchmark (size) Mode Cnt Score Error Un... > > Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: > > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses > - riscv: fix c1 primitive array clone intrinsic regression src/hotspot/share/c1/c1_Compiler.cpp line 240: > 238: #endif > 239: case vmIntrinsics::_getObjectSize: > 240: #if defined(X86) || defined(AARCH64) || defined(S390) || defined(RISCV64) || defined(PPC64) PS: The change of macro `RISCV` seems unrelated to this PR? Seem better to go with another PR. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2184446825 From shade at openjdk.org Fri Jul 4 06:04:38 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 4 Jul 2025 06:04:38 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 22:51:02 GMT, Vladimir Kozlov wrote: >> Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: >> >> Just use printf directly > > test/hotspot/jtreg/testlibrary/ctw/src/sun/hotspot/tools/ctw/Compiler.java line 89: > >> 87: UNSAFE.ensureClassInitialized(aClass); >> 88: } catch (NoClassDefFoundError e) { >> 89: CompileTheWorld.OUT.printf("[%d]\t%s\tNOTE unable to init class : %s%n", > > Do you mean `\n` here and in all other outputs? `%n` needs local variable to store size of output. I meant `%n` :) You are probably thinking about C printf? In Java [formatters](https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html), `%n` is the "platform-specific line separator". It is more compatible than just `\n`, which runs into platform-specific `CR` vs `LF` vs `CRLF` line separator mess. See: jshell> System.out.printf("Hello\nthere,\nVladimir!\n") Hello there, Vladimir! $6 ==> java.io.PrintStream at 34c45dca jshell> System.out.printf("Hello%nthere,%nVladimir!%n") Hello there, Vladimir! $7 ==> java.io.PrintStream at 34c45dca ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26090#discussion_r2184484564 From jbhateja at openjdk.org Fri Jul 4 06:05:41 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 4 Jul 2025 06:05:41 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: References: Message-ID: <_RERljqu_FG7ZyneAk7Thd-9TwED18pQpEBz_i105fY=.b8948a23-273a-49f6-b9cb-6b611a5eedc6@github.com> On Thu, 3 Jul 2025 07:10:22 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is relative smaller than that of `fromLong`. This patch does the conversion for these cases if `l` is a compile time constant. >> >> And this conversion also enables further optimizations that recognize maskAll patterns, see [1]. >> >> Some JTReg test cases are added to ensure the optimization is effective. >> >> I tried many different ways to write a JMH benchmark, but failed. Since the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific compile-time constant, the statement will be hoisted out of the loop. If we don't use a loop, the hotspot will become other instructions, and no obvious performance change was observed. However, combined with the optimization of [1], we can observe a performance improvement of about 7% on both aarch64 and x64. 
>> >> The patch was tested on both aarch64 and x64, all of tier1 tier2 and tier3 tests passed. >> >> [1] https://github.com/openjdk/jdk/pull/24674 > > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Simplify the test code src/hotspot/share/opto/vectorIntrinsics.cpp line 707: > 705: elem_bt = converted_elem_bt; > 706: bits = gvn().longcon((bits_type->get_con() & 1L) == 0L ? 0L : -1L); > 707: } else if (!arch_supports_vector(opc, num_elem, elem_bt, checkFlags, true /*has_scalar_args*/)) { I think it's appropriate to make this change as part of VectorLongToMaskNode::Ideal routine to give the opportunity for this transformation during the Iterative GVN pass. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2184478552 From epeter at openjdk.org Fri Jul 4 06:13:45 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 4 Jul 2025 06:13:45 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v2] In-Reply-To: References: Message-ID: On Thu, 26 Jun 2025 07:55:23 GMT, Hannes Greule wrote: >> Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. >> >> Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. >> >> I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. >> >> Please review. Thanks. > > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > remove classfile version Just a drive-by comment. Won't have time for a full review for a few weeks. test/hotspot/jtreg/compiler/c2/gvn/ReverseBytesConstantsTests.java line 23: > 21: * questions. > 22: */ > 23: package compiler.c2.gvn; Why did you remove the package? You can add the `jasm` file to the package too, I think that should work, no? ------------- PR Review: https://git.openjdk.org/jdk/pull/25988#pullrequestreview-2985704166 PR Review Comment: https://git.openjdk.org/jdk/pull/25988#discussion_r2184491532 From thartmann at openjdk.org Fri Jul 4 06:15:40 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 4 Jul 2025 06:15:40 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode [v4] In-Reply-To: References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> Message-ID: On Fri, 4 Jul 2025 01:30:22 GMT, hanguanqiang wrote: >> This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode >> >> Problem? >> When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. >> >> Root Cause? >> Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. >> >> Fix >> Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. 
> > hanguanqiang has updated the pull request incrementally with one additional commit since the last revision: > > Delete .gitpod.yml Right, my intention when filing this bug was to remove the flag: https://bugs.openjdk.org/browse/JDK-8358568?focusedId=14786499&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14786499 I should have made that more explicit. Removal of this flag looks good to me. Changes requested by thartmann (Reviewer). src/hotspot/share/opto/callnode.cpp line 1456: > 1454: Node* top = Compile::current()->top(); > 1455: ins_req(nextmon, top); > 1456: ins_req(nextmon, top); Wait, this is wrong. The monitor inputs should not be set to top. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26108#pullrequestreview-2985715983 PR Review: https://git.openjdk.org/jdk/pull/26108#pullrequestreview-2985718643 PR Review Comment: https://git.openjdk.org/jdk/pull/26108#discussion_r2184500795 From jbhateja at openjdk.org Fri Jul 4 06:21:40 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 4 Jul 2025 06:21:40 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 07:10:22 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is relative smaller than that of `fromLong`. This patch does the conversion for these cases if `l` is a compile time constant. >> >> And this conversion also enables further optimizations that recognize maskAll patterns, see [1]. >> >> Some JTReg test cases are added to ensure the optimization is effective. >> >> I tried many different ways to write a JMH benchmark, but failed. Since the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific compile-time constant, the statement will be hoisted out of the loop. If we don't use a loop, the hotspot will become other instructions, and no obvious performance change was observed. However, combined with the optimization of [1], we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> The patch was tested on both aarch64 and x64, all of tier1 tier2 and tier3 tests passed. >> >> [1] https://github.com/openjdk/jdk/pull/24674 > > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Simplify the test code Can you kindly include a micro with this patch? ``` public static final VectorSpecies FSP = FloatVector.SPECIES_512; public static long micro1(long a) { long mask = Math.min(-1, Math.max(-1, a)); return VectorMask.fromLong(FSP, mask).toLong(); } public static long micro2() { return FSP.maskAll(true).toLong(); } Your patch now removes L2M and M2L IR nodes. Baseline:- SPR2>java --add-modules=jdk.incubator.vector -Xbatch -XX:CompileCommand=PrintIdealPhase,test_mask_all::micro1,BEFORE_MATCHING -XX:-TieredCompilation -cp . 
test_mask_all 0 AFTER: BEFORE_MATCHING 65 ConL === 0 [[ 377 ]] #long:65535 369 Return === 5 6 7 8 9 returns 399 [[ 0 ]] 377 VectorLongToMask === _ 65 [[ 398 ]] #vectormask !jvms: VectorMask::fromLong @ bci:39 (line 243) test_mask_all::micro1 @ bci:18 (line 9) 398 VectorMaskCast === _ 377 [[ 399 ]] #vectormask !jvms: Float512Vector$Float512Mask::toLong @ bci:35 (line 765) test_mask_all::micro1 @ bci:21 (line 9) 399 VectorMaskToLong === _ 398 [[ 369 ]] #long !jvms: Float512Vector$Float512Mask::toLong @ bci:35 (line 765) test_mask_all::micro1 @ bci:21 (line 9) [time] 5 ms [res] 1310700000000 With patch:- XX:CompileCommand=PrintIdealPhase,test_mask_all::micro1,BEFORE_MATCHING -XX:-TieredCompilation -cp . test_mask_all 0 CompileCommand: PrintIdealPhase test_mask_all.micro1 const char* PrintIdealPhase = 'BEFORE_MATCHING' WARNING: Using incubator modules: jdk.incubator.vector AFTER: BEFORE_MATCHING 65 ConL === 0 [[ 369 ]] #long:65535 369 Return === 5 6 7 8 9 returns 65 [[ 0 ]] [time] 3 ms [res] 1310700000000 ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3034669174 From yongheng_hgq at 126.com Fri Jul 4 06:27:11 2025 From: yongheng_hgq at 126.com (h) Date: Fri, 4 Jul 2025 14:27:11 +0800 (CST) Subject: RFR: 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache Message-ID: <2dfbc1de.4cd5.197d41e17ac.Coremail.yongheng_hgq@126.com> Hi all, The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is confusing and does not reflect the current implementation. This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. Commit messages: - 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache Changes: https://github.com/openjdk/jdk/pull/26114/files Webrev: https://openjdk.github.io/cr/?repo=jdk&pr=26114&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8344548 Patch: https://git.openjdk.org/jdk/pull/26114.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26114/head:pull/26114 PR: https://github.com/openjdk/jdk/pull/26114 BR -------------- next part -------------- An HTML attachment was scrubbed... URL: From hgreule at openjdk.org Fri Jul 4 06:36:44 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Fri, 4 Jul 2025 06:36:44 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v2] In-Reply-To: References: Message-ID: On Fri, 4 Jul 2025 06:04:42 GMT, Emanuel Peter wrote: >> Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: >> >> remove classfile version > > test/hotspot/jtreg/compiler/c2/gvn/ReverseBytesConstantsTests.java line 23: > >> 21: * questions. >> 22: */ >> 23: package compiler.c2.gvn; > > Why did you remove the package? You can add the `jasm` file to the package too, I think that should work, no? It seems like most files in the gvn folder don't have a package declaration, that's why I thought adjusting this way is fine. But I can also add it back and put the jasm file in the package too. 
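For context on the jasm discussion above: plain Java source cannot pass an untruncated int to the char/short reverseBytes overloads, because javac inserts a narrowing conversion at the call site; that is why the test goes through jasm in the first place. A small illustrative snippet, not taken from the actual test:

```
// javac narrows the int before the call, so the upper bytes never reach
// Character.reverseBytes(); only hand-written bytecode (e.g. jasm) can feed
// the wide constant straight into the (char) parameter, which is the
// constant-folding situation the fix above is exercised with.
int wide = 0x11223344;
char c = (char) wide;                  // narrowed by javac: c == 0x3344
char swapped = Character.reverseBytes(c);
System.out.println(Integer.toHexString(swapped));   // prints "4433"
```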
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25988#discussion_r2184531892 From duke at openjdk.org Fri Jul 4 06:43:02 2025 From: duke at openjdk.org (hanguanqiang) Date: Fri, 4 Jul 2025 06:43:02 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode [v5] In-Reply-To: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> Message-ID: <8h5rro_3Sv9GLk3WSTstiTPXEj1dHPPHgCKowe-CIjk=.e15f5800-732f-4f25-81f8-ae86724bec2e@github.com> > This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode > > Problem? > When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. > > Root Cause? > Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. > > Fix > Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. hanguanqiang has updated the pull request incrementally with one additional commit since the last revision: correct an error correct an error ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26108/files - new: https://git.openjdk.org/jdk/pull/26108/files/1d6e8f5c..6ebc2ecb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26108&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26108&range=03-04 Stats: 3 lines in 1 file changed: 0 ins; 1 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26108.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26108/head:pull/26108 PR: https://git.openjdk.org/jdk/pull/26108 From duke at openjdk.org Fri Jul 4 06:47:39 2025 From: duke at openjdk.org (hanguanqiang) Date: Fri, 4 Jul 2025 06:47:39 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode [v4] In-Reply-To: References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> Message-ID: <2sjVycRIgOfB6aRtJMfVYVOB3iDnmD97Y-DbbjzupU8=.23cd6fc6-36c9-4d8b-8d15-f74a05c3cfe8@github.com> On Fri, 4 Jul 2025 06:12:44 GMT, Tobias Hartmann wrote: >> hanguanqiang has updated the pull request incrementally with one additional commit since the last revision: >> >> Delete .gitpod.yml > > src/hotspot/share/opto/callnode.cpp line 1456: > >> 1454: Node* top = Compile::current()->top(); >> 1455: ins_req(nextmon, top); >> 1456: ins_req(nextmon, top); > > Wait, this is wrong. The monitor inputs should not be set to top. @TobiHartmann Thank you for pointing out the issue ? I?ve made the correction as suggested. Could you please take another look when you have time? 
Thanks again for your review and feedback ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26108#discussion_r2184546780 From dnsimon at openjdk.org Fri Jul 4 07:39:45 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Fri, 4 Jul 2025 07:39:45 GMT Subject: RFR: 8361355: Negative cases of Annotated.getAnnotationData implementations are broken [v3] In-Reply-To: <83aGkzmp5J7JllBsWK5ZzwZAa4GVsNk5VjmkH0O3FjE=.2507d7ce-65df-4121-acdf-35125d530d39@github.com> References: <83aGkzmp5J7JllBsWK5ZzwZAa4GVsNk5VjmkH0O3FjE=.2507d7ce-65df-4121-acdf-35125d530d39@github.com> Message-ID: On Thu, 3 Jul 2025 17:30:56 GMT, Doug Simon wrote: >> This PR fixes bugs in the implementation of `jdk.vm.ci.meta.Annotated.getAnnotationData`: >> 1. Calling `getAnnotatedData(annotationType)` fails with an ArrayIndexOutOfBoundsException instead of returning null when the receiver type is not annotated by `annotationType`. >> 2. Calling either of the `getAnnotatedData` methods with an `annotationType` value that does not represent an annotation interface silently succeeds when the receiver type does not (or can not) have any annotations (e.g. array and primitive types). > > Doug Simon has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge tag 'jdk-26+4' into JDK-8361355 > > Added tag jdk-26+4 for changeset 1ca008fd > - fixed negative cases in getAnnotationData Thanks for the review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26116#issuecomment-3034837278 From dnsimon at openjdk.org Fri Jul 4 07:39:46 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Fri, 4 Jul 2025 07:39:46 GMT Subject: Integrated: 8361355: Negative cases of Annotated.getAnnotationData implementations are broken In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 12:52:10 GMT, Doug Simon wrote: > This PR fixes bugs in the implementation of `jdk.vm.ci.meta.Annotated.getAnnotationData`: > 1. Calling `getAnnotatedData(annotationType)` fails with an ArrayIndexOutOfBoundsException instead of returning null when the receiver type is not annotated by `annotationType`. > 2. Calling either of the `getAnnotatedData` methods with an `annotationType` value that does not represent an annotation interface silently succeeds when the receiver type does not (or can not) have any annotations (e.g. array and primitive types). This pull request has now been integrated. Changeset: 5cf349c3 Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/5cf349c3b08324e994a4143dcc34a59fd81323f9 Stats: 111 lines in 7 files changed: 94 ins; 1 del; 16 mod 8361355: Negative cases of Annotated.getAnnotationData implementations are broken Reviewed-by: never ------------- PR: https://git.openjdk.org/jdk/pull/26116 From shade at openjdk.org Fri Jul 4 08:09:53 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 4 Jul 2025 08:09:53 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 13:33:23 GMT, Aleksey Shipilev wrote: >> We routinely CTW 3rd party JARs to make sure our compilers work. By the nature of the JARs, they have dependencies on other JARs, and CTW runner frequently warns out with NCDFE. It does so very crudely, missing opportunities to compile the methods that _do not_ trigger NCDFEs. CTW should be made more tolerant to this. 
I think the normal "modules" CTW runs into the similar problem, but on a lesser scale, as we do not have a very hairy dependency graph within JDK. >> >> The CTW logs are also fairly noisy with full exception traces when NCDFE is semi-expected. This PR does _not_ print exception stack traces in these cases, only "NOTE"-s about it. This makes the log fairly clean and more understandable. >> >> Motivational scope improvement compiling a sample 3rd party JAR (cassandra-2.1.4.0.jar): >> >> >> Before: Done (2487 classes, 9866 methods, 24584 ms) >> After: Done (2487 classes, 10074 methods, 24150 ms) ; +2% more methods >> >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Just use printf directly @TobiHartmann -- do you want to run this through CTW testing as well, to see if there are any new failures? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26090#issuecomment-3034913103 From rrich at openjdk.org Fri Jul 4 08:14:19 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Fri, 4 Jul 2025 08:14:19 GMT Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining [v2] In-Reply-To: References: Message-ID: <5IVOmLIL__iwmhFaKEP80YNP8kIN94owhC3SIU8ZF4U=.ab4061f3-e62f-4586-9118-7f84f246078d@github.com> > This PR adds CompileCommands to the test DumpThreadsWithEliminatedLock.java to force inlining of java/lang/String*.* methods. This will make inlining more stable to allow for the expected lock elimination based on c2 escape analysis. > > Forcing inlining of java/lang/StringBuffer.* wasn't sufficient on x86_64. With that the test still failed with TieredCompilation disabled. > > Testing: x86_64, ppc64 manually. Other major platforms as part of our CI testing. > > Failed inlining on x86_64 with TieredCompilation disabled: > > > make test TEST=com/sun/management/HotSpotDiagnosticMXBean/DumpThreadsWithEliminatedLock.java TEST_VM_OPTS="-XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=PrintInlining,DumpThreadsWithEliminatedLock.*" JTREG=TIMEOUT_FACTOR=0.1 > > [...] > > STDOUT: > CompileCommand: PrintInlining DumpThreadsWithEliminatedLock.* bool PrintInlining = true > @ 1 java.util.concurrent.atomic.AtomicBoolean::get (13 bytes) inline (hot) > @ 11 java.lang.StringBuffer:: (7 bytes) inline (hot) late inline succeeded (string method) > @ 3 java.lang.AbstractStringBuilder:: (39 bytes) inline (hot) > @ 1 java.lang.Object:: (1 bytes) inline (hot) > @ 16 java.lang.System::currentTimeMillis (0 bytes) (intrinsic) > s @ 19 java.lang.StringBuffer::append (13 bytes) failed to inline: already compiled into a big method > s @ 24 java.lang.StringBuffer::toString (44 bytes) inline (hot) late inline succeeded (string method) > s @ 1 java.lang.StringBuffer::length (5 bytes) accessor > @ 24 java.lang.String:: (98 bytes) failed to inline: already compiled into a big method > @ 30 java.util.concurrent.atomic.AtomicReference::set (6 bytes) accessor > 2025-07-02T09:25:53.396634900Z Attempt 1, found: false > 2025-07-02T09:25:53.415673072Z Attempt 2, found: false > 2025-07-02T09:25:53.418876867Z Attempt 3, found: false > > [...] 
Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: Allow vm.debug ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26033/files - new: https://git.openjdk.org/jdk/pull/26033/files/8561d522..a43e54db Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26033&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26033&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26033.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26033/head:pull/26033 PR: https://git.openjdk.org/jdk/pull/26033 From rrich at openjdk.org Fri Jul 4 08:14:20 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Fri, 4 Jul 2025 08:14:20 GMT Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining [v2] In-Reply-To: References: <0p61J0DPfyHsen3r__V82eEZSPYaT9rZleHBtanKaRc=.c5f6992f-a7fe-4c95-bdcb-2887c3dbde21@github.com> Message-ID: On Thu, 3 Jul 2025 17:57:27 GMT, Alan Bateman wrote: > > I found that the runtime of each test is ~300ms with a release build and ~11s with a fastdebug build on x86_64 and ppc64. If you like I can remove the requirement within this pr and do some more testing. -Xcomp doesn't seem to work. > > I think that would be useful, thank you. I've removed the `!vm.debug` requirement. I'll await our local testing of the pr on a wider range of platforms. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26033#issuecomment-3034923279 From dbriemann at openjdk.org Fri Jul 4 08:16:59 2025 From: dbriemann at openjdk.org (David Briemann) Date: Fri, 4 Jul 2025 08:16:59 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v2] In-Reply-To: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: > Implement more nodes for ppc that exist on other platforms. 
David Briemann has updated the pull request incrementally with one additional commit since the last revision: add >= power9 check for NegVI ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26115/files - new: https://git.openjdk.org/jdk/pull/26115/files/d19e627d..00f37d7a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26115&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26115&range=00-01 Stats: 5 lines in 2 files changed: 4 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26115.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26115/head:pull/26115 PR: https://git.openjdk.org/jdk/pull/26115 From dbriemann at openjdk.org Fri Jul 4 08:16:59 2025 From: dbriemann at openjdk.org (David Briemann) Date: Fri, 4 Jul 2025 08:16:59 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v2] In-Reply-To: <0lfxjYCRp6xFM8c_RDhbLEtbwM5J3huFxjcOqcWVykU=.908af2c8-2a2d-4719-b598-45b716ab8658@github.com> References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> <0lfxjYCRp6xFM8c_RDhbLEtbwM5J3huFxjcOqcWVykU=.908af2c8-2a2d-4719-b598-45b716ab8658@github.com> Message-ID: On Thu, 3 Jul 2025 14:14:57 GMT, Martin Doerr wrote: >> David Briemann has updated the pull request incrementally with one additional commit since the last revision: >> >> add >= power9 check for NegVI > > src/hotspot/cpu/ppc/ppc.ad line 2196: > >> 2194: case Op_AbsVF: >> 2195: case Op_AbsVD: >> 2196: case Op_NegVI: > > vnegw requires Power9 (`PowerArchitecturePPC64 >= 9`). Thanks for catching that. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26115#discussion_r2184700391 From shade at openjdk.org Fri Jul 4 09:08:19 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 4 Jul 2025 09:08:19 GMT Subject: RFR: 8357473: Compilation spike leaves many CompileTasks in free list [v5] In-Reply-To: References: Message-ID: > See bug for more discussion. > > This PR implements the "all the way" solution by removing the free list completely. It complements https://github.com/openjdk/jdk/pull/25364, and can go either first, or second. We will remerge the other one once either integrates. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `compiler` > - [x] Linux AArch64 server fastdebug, `all` Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits: - Merge branch 'master' into JDK-8357473-compile-task-free-list - Merge branch 'master' into JDK-8357473-compile-task-free-list - Merge branch 'master' into JDK-8357473-compile-task-free-list - Merge branch 'master' into JDK-8357473-compile-task-free-list - Also free the lock! 
- Comments and indenting - Basic deletion ------------- Changes: https://git.openjdk.org/jdk/pull/25409/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=25409&range=04 Stats: 134 lines in 6 files changed: 27 ins; 68 del; 39 mod Patch: https://git.openjdk.org/jdk/pull/25409.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25409/head:pull/25409 PR: https://git.openjdk.org/jdk/pull/25409 From aph at openjdk.org Fri Jul 4 09:14:39 2025 From: aph at openjdk.org (Andrew Haley) Date: Fri, 4 Jul 2025 09:14:39 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> Message-ID: On Thu, 3 Jul 2025 06:10:28 GMT, Xiaohong Gong wrote: >> ### Background >> On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. >> >> For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. >> >> To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. >> >> ### Impact Analysis >> #### 1. Vector types >> Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. >> >> #### 2. Vector API >> No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. >> >> #### 3. Auto-vectorization >> Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. >> >> #### 4. Codegen of vector nodes >> NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. >> >> Details: >> - Lanewise vector operations are unaffected as explained above. >> - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). >> - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, addin... 
> > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Refine the comment in ad file This looks good. Thanks. ------------- Marked as reviewed by aph (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26057#pullrequestreview-2986219682 From xgong at openjdk.org Fri Jul 4 09:17:40 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 4 Jul 2025 09:17:40 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> Message-ID: On Fri, 4 Jul 2025 09:11:40 GMT, Andrew Haley wrote: > This looks good. Thanks. Thanks so much for your review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3035115512 From shade at openjdk.org Fri Jul 4 09:29:13 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 4 Jul 2025 09:29:13 GMT Subject: RFR: 8361397: Rework CompileLog list synchronization Message-ID: <12Yp6QmpXqG-1UXTS8VveJ4yDNnDEGFV2q3_vRc3lF0=.4ccf05e2-9249-4b55-b48f-4f7fc17bef65@github.com> I want to remove `CompileTaskAlloc_lock` completely with [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473), and for that we need to fix a stray use of that lock in CompileLog list linkage. We can rewrite that part to atomics. Additional testing: - [ ] Linux x86_64 server fastdebug, `compiler` ------------- Commit messages: - Fix Changes: https://git.openjdk.org/jdk/pull/26127/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26127&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361397 Stats: 11 lines in 2 files changed: 4 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/26127.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26127/head:pull/26127 PR: https://git.openjdk.org/jdk/pull/26127 From shade at openjdk.org Fri Jul 4 09:29:41 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 4 Jul 2025 09:29:41 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode [v5] In-Reply-To: <8h5rro_3Sv9GLk3WSTstiTPXEj1dHPPHgCKowe-CIjk=.e15f5800-732f-4f25-81f8-ae86724bec2e@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> <8h5rro_3Sv9GLk3WSTstiTPXEj1dHPPHgCKowe-CIjk=.e15f5800-732f-4f25-81f8-ae86724bec2e@github.com> Message-ID: On Fri, 4 Jul 2025 06:43:02 GMT, hanguanqiang wrote: >> This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode >> >> Problem? >> When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. >> >> Root Cause? >> Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. >> >> Fix >> Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. > > hanguanqiang has updated the pull request incrementally with one additional commit since the last revision: > > correct an error > > correct an error I renamed the JBS bug, match the PR title, please. Also, go to https://github.com/hgqxjj/jdk/actions -- and enable the workflows. 
We need to have a clean GHA run before we can integrate. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26108#issuecomment-3035170988 From mhaessig at openjdk.org Fri Jul 4 09:43:40 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Fri, 4 Jul 2025 09:43:40 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v2] In-Reply-To: References: Message-ID: On Thu, 26 Jun 2025 07:55:23 GMT, Hannes Greule wrote: >> Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. >> >> Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. >> >> I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. >> >> Please review. Thanks. > > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > remove classfile version @SirYwell, thank you for fixing this. It looks good overall, but it would be good to add the package. I think we do this for all new tests. I kicked off some testing and will let you know about the results. src/hotspot/share/opto/subnode.cpp line 2031: > 2029: case Op_ReverseBytesUS: return TypeInt::make(byteswap(static_cast(con->is_int()->get_con()))); > 2030: case Op_ReverseBytesI: return TypeInt::make(byteswap(con->is_int()->get_con())); > 2031: case Op_ReverseBytesL: return TypeLong::make(byteswap(con->is_long()->get_con())); Why are you dropping the `checked_cast` here? Were they just an abundance of caution before? ------------- Changes requested by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/25988#pullrequestreview-2986310035 PR Review Comment: https://git.openjdk.org/jdk/pull/25988#discussion_r2184863934 From mdoerr at openjdk.org Fri Jul 4 10:11:39 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Fri, 4 Jul 2025 10:11:39 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v2] In-Reply-To: References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: On Fri, 4 Jul 2025 08:16:59 GMT, David Briemann wrote: >> Implement more nodes for ppc that exist on other platforms. > > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > add >= power9 check for NegVI I suggest removing the NegVI again. test/hotspot/jtreg/compiler/intrinsics/TestCompareUnsigned.java line 34: > 32: * @bug 8283726 8287925 > 33: * @requires os.arch=="amd64" | os.arch=="x86_64" | os.arch=="aarch64" | os.arch=="riscv64" | os.arch=="ppc64" | os.arch=="ppc64le" > 34: The test expects "CmpU3" for integers to be available. Can you implement that, too, please? ------------- Changes requested by mdoerr (Reviewer). 
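For readers who want the Java-level view of the nodes requested here: CmpU3/CmpUL3 are the C2 nodes produced for the intrinsified three-way unsigned compares, and UMulHiL (as I understand it) backs the unsigned high-order multiply. A small stand-alone sketch of code that exercises them; class and method names are illustrative only:

```
public class UnsignedOpsExample {
    static int  cmpU3(int a, int b)     { return Integer.compareUnsigned(a, b); }   // CmpU3
    static int  cmpUL3(long a, long b)  { return Long.compareUnsigned(a, b); }      // CmpUL3
    static long umulHiL(long a, long b) { return Math.unsignedMultiplyHigh(a, b); } // UMulHiL

    public static void main(String[] args) {
        System.out.println(cmpU3(-1, 1) > 0);    // true: 0xFFFFFFFF is the largest unsigned int
        System.out.println(cmpUL3(1L, 2L) < 0);  // true
        System.out.println(umulHiL(-1L, -1L));   // -2: high 64 bits of (2^64 - 1)^2
    }
}
```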
PR Review: https://git.openjdk.org/jdk/pull/26115#pullrequestreview-2986428202 PR Review Comment: https://git.openjdk.org/jdk/pull/26115#discussion_r2184925416 From mdoerr at openjdk.org Fri Jul 4 10:11:40 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Fri, 4 Jul 2025 10:11:40 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v2] In-Reply-To: References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> <0lfxjYCRp6xFM8c_RDhbLEtbwM5J3huFxjcOqcWVykU=.908af2c8-2a2d-4719-b598-45b716ab8658@github.com> Message-ID: On Fri, 4 Jul 2025 08:14:11 GMT, David Briemann wrote: >> src/hotspot/cpu/ppc/ppc.ad line 2196: >> >>> 2194: case Op_AbsVF: >>> 2195: case Op_AbsVD: >>> 2196: case Op_NegVI: >> >> vnegw requires Power9 (`PowerArchitecturePPC64 >= 9`). > > Thanks for catching that. I think we'd need to check that here, too. Otherwise we'd get "bad AD file" errors. However, there's another problem: vnegw computes the one?s-complement for each element, but we'd need two?s-complement. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26115#discussion_r2184936783 From shade at openjdk.org Fri Jul 4 10:18:41 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 4 Jul 2025 10:18:41 GMT Subject: RFR: 8358568: Purge obsolete/broken GenerateSynchronizationCode flag [v5] In-Reply-To: <8h5rro_3Sv9GLk3WSTstiTPXEj1dHPPHgCKowe-CIjk=.e15f5800-732f-4f25-81f8-ae86724bec2e@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> <8h5rro_3Sv9GLk3WSTstiTPXEj1dHPPHgCKowe-CIjk=.e15f5800-732f-4f25-81f8-ae86724bec2e@github.com> Message-ID: On Fri, 4 Jul 2025 06:43:02 GMT, hanguanqiang wrote: >> This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode >> >> Problem? >> When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. >> >> Root Cause? >> Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. >> >> Fix >> Purge obsolete/broken GenerateSynchronizationCode flag > > hanguanqiang has updated the pull request incrementally with one additional commit since the last revision: > > correct an error > > correct an error Looks good to me. ------------- Marked as reviewed by shade (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26108#pullrequestreview-2986479330 From alanb at openjdk.org Fri Jul 4 10:27:39 2025 From: alanb at openjdk.org (Alan Bateman) Date: Fri, 4 Jul 2025 10:27:39 GMT Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining [v2] In-Reply-To: <5IVOmLIL__iwmhFaKEP80YNP8kIN94owhC3SIU8ZF4U=.ab4061f3-e62f-4586-9118-7f84f246078d@github.com> References: <5IVOmLIL__iwmhFaKEP80YNP8kIN94owhC3SIU8ZF4U=.ab4061f3-e62f-4586-9118-7f84f246078d@github.com> Message-ID: On Fri, 4 Jul 2025 08:14:19 GMT, Richard Reingruber wrote: >> This PR adds CompileCommands to the test DumpThreadsWithEliminatedLock.java to force inlining of java/lang/String*.* methods. This will make inlining more stable to allow for the expected lock elimination based on c2 escape analysis. >> >> Forcing inlining of java/lang/StringBuffer.* wasn't sufficient on x86_64. 
With that the test still failed with TieredCompilation disabled. >> >> Testing: x86_64, ppc64 manually. Other major platforms as part of our CI testing. >> >> Failed inlining on x86_64 with TieredCompilation disabled: >> >> >> make test TEST=com/sun/management/HotSpotDiagnosticMXBean/DumpThreadsWithEliminatedLock.java TEST_VM_OPTS="-XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=PrintInlining,DumpThreadsWithEliminatedLock.*" JTREG=TIMEOUT_FACTOR=0.1 >> >> [...] >> >> STDOUT: >> CompileCommand: PrintInlining DumpThreadsWithEliminatedLock.* bool PrintInlining = true >> @ 1 java.util.concurrent.atomic.AtomicBoolean::get (13 bytes) inline (hot) >> @ 11 java.lang.StringBuffer:: (7 bytes) inline (hot) late inline succeeded (string method) >> @ 3 java.lang.AbstractStringBuilder:: (39 bytes) inline (hot) >> @ 1 java.lang.Object:: (1 bytes) inline (hot) >> @ 16 java.lang.System::currentTimeMillis (0 bytes) (intrinsic) >> s @ 19 java.lang.StringBuffer::append (13 bytes) failed to inline: already compiled into a big method >> s @ 24 java.lang.StringBuffer::toString (44 bytes) inline (hot) late inline succeeded (string method) >> s @ 1 java.lang.StringBuffer::length (5 bytes) accessor >> @ 24 java.lang.String:: (98 bytes) failed to inline: already compiled into a big method >> @ 30 java.util.concurrent.atomic.AtomicReference::set (6 bytes) accessor >> 2025-07-02T09:25:53.396634900Z Attempt 1, found: false >> 2025-07-02T09:25:53.415673072Z Attempt 2, found: false >> 2025-07-02T09:25:53.418876867Z Attempt 3, found: false >> >> [...] > > Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: > > Allow vm.debug Marked as reviewed by alanb (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26033#pullrequestreview-2986518539 From duke at openjdk.org Fri Jul 4 10:56:42 2025 From: duke at openjdk.org (erifan) Date: Fri, 4 Jul 2025 10:56:42 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: <_RERljqu_FG7ZyneAk7Thd-9TwED18pQpEBz_i105fY=.b8948a23-273a-49f6-b9cb-6b611a5eedc6@github.com> References: <_RERljqu_FG7ZyneAk7Thd-9TwED18pQpEBz_i105fY=.b8948a23-273a-49f6-b9cb-6b611a5eedc6@github.com> Message-ID: <6SXA9ZrXBDhZLyXP3lXbkpl4dl3iocvDpzPrUpIQOl8=.9b025be2-848b-4b78-a5e4-929cb7e9f798@github.com> On Fri, 4 Jul 2025 05:53:41 GMT, Jatin Bhateja wrote: >> erifan has updated the pull request incrementally with one additional commit since the last revision: >> >> Simplify the test code > > src/hotspot/share/opto/vectorIntrinsics.cpp line 707: > >> 705: elem_bt = converted_elem_bt; >> 706: bits = gvn().longcon((bits_type->get_con() & 1L) == 0L ? 0L : -1L); >> 707: } else if (!arch_supports_vector(opc, num_elem, elem_bt, checkFlags, true /*has_scalar_args*/)) { > > I think it's appropriate to make this change as part of VectorLongToMaskNode::Ideal routine to give the opportunity for this transformation during the Iterative GVN pass. Originally I also tried to implement it in IGVN, but later changed it to Intrinsic. For two reasons: 1. Implementing in intrinsic is relatively simpler and has better performance because it saves the process of generating `VectorLongToMaskNode`. 2. Implementing in intrinsic can support more cases. Because some architectures (such as aarch64 `NEON`) currently do not support the generation of `VectorLongToMaskNode,` but support `MaskAll` or `Replicate` nodes, if implemented in IGVN, then this optimization doesn't work for NEON. 
But implementing in Intrinsic can cover such cases. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2185045860 From dbriemann at openjdk.org Fri Jul 4 10:58:56 2025 From: dbriemann at openjdk.org (David Briemann) Date: Fri, 4 Jul 2025 10:58:56 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v3] In-Reply-To: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: > Implement more nodes for ppc that exist on other platforms. David Briemann has updated the pull request incrementally with one additional commit since the last revision: add CmpU3, ppc9 check in match_rule_supported ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26115/files - new: https://git.openjdk.org/jdk/pull/26115/files/00f37d7a..6d05a728 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26115&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26115&range=01-02 Stats: 17 lines in 1 file changed: 17 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26115.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26115/head:pull/26115 PR: https://git.openjdk.org/jdk/pull/26115 From thartmann at openjdk.org Fri Jul 4 11:05:43 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 4 Jul 2025 11:05:43 GMT Subject: RFR: 8358568: Purge obsolete/broken GenerateSynchronizationCode flag [v5] In-Reply-To: <8h5rro_3Sv9GLk3WSTstiTPXEj1dHPPHgCKowe-CIjk=.e15f5800-732f-4f25-81f8-ae86724bec2e@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> <8h5rro_3Sv9GLk3WSTstiTPXEj1dHPPHgCKowe-CIjk=.e15f5800-732f-4f25-81f8-ae86724bec2e@github.com> Message-ID: On Fri, 4 Jul 2025 06:43:02 GMT, hanguanqiang wrote: >> This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode >> >> Problem? >> When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. >> >> Root Cause? >> Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. >> >> Fix >> Purge obsolete/broken GenerateSynchronizationCode flag > > hanguanqiang has updated the pull request incrementally with one additional commit since the last revision: > > correct an error > > correct an error Looks good. ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26108#pullrequestreview-2986682348 From duke at openjdk.org Fri Jul 4 11:10:40 2025 From: duke at openjdk.org (erifan) Date: Fri, 4 Jul 2025 11:10:40 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: References: Message-ID: On Fri, 4 Jul 2025 06:18:02 GMT, Jatin Bhateja wrote: > public static final VectorSpecies FSP = FloatVector.SPECIES_512; public static long micro1(long a) { long mask = Math.min(-1, Math.max(-1, a)); return VectorMask.fromLong(FSP, mask).toLong(); } public static long micro2() { return FSP.maskAll(true).toLong(); } With this JMH method we can not see obvious performance improvement, because the hot spots are other instructions. Adding a loop is better. @Benchmark public long micro_3() { long result = 0; for (int i = 0; i < ITERATION; i++) { long mask = Math.min(-1, Math.max(-1, result)); result += VectorMask.fromLong(FSP, mask).toLong(); } return result; } But if it is not a floating point type, there will be no obvious performance improvement. Because the pattern `VectorMaskToLong(VectorLongToMask (l))` for integer types has been implemented, and `VectorMaskToLong(VectorMaskCast (VectorLongToMask (l)))` for floating-point types is not implemented. So if we add JMH benchmarks for this optimization, we can only see good performance gain from floating point types. So do you think it is necessary? @jatin-bhateja Thanks for your review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3035646085 From dbriemann at openjdk.org Fri Jul 4 11:22:59 2025 From: dbriemann at openjdk.org (David Briemann) Date: Fri, 4 Jul 2025 11:22:59 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v4] In-Reply-To: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: > Implement more nodes for ppc that exist on other platforms. David Briemann has updated the pull request incrementally with one additional commit since the last revision: fix typo ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26115/files - new: https://git.openjdk.org/jdk/pull/26115/files/6d05a728..a7c9f6be Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26115&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26115&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26115.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26115/head:pull/26115 PR: https://git.openjdk.org/jdk/pull/26115 From dbriemann at openjdk.org Fri Jul 4 11:31:24 2025 From: dbriemann at openjdk.org (David Briemann) Date: Fri, 4 Jul 2025 11:31:24 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v5] In-Reply-To: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: > Implement more nodes for ppc that exist on other platforms. 
David Briemann has updated the pull request incrementally with one additional commit since the last revision: adjust parameter types ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26115/files - new: https://git.openjdk.org/jdk/pull/26115/files/a7c9f6be..ebb27c9c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26115&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26115&range=03-04 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26115.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26115/head:pull/26115 PR: https://git.openjdk.org/jdk/pull/26115 From dlunden at openjdk.org Fri Jul 4 11:52:52 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Fri, 4 Jul 2025 11:52:52 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v23] In-Reply-To: References: Message-ID: <-ZInU3fxIuRKYX9cUOJBCIq8gUHruo0qINmgjKWT_Dg=.aa7af726-2d75-4cbe-a113-d7ed396e19ed@github.com> On Mon, 23 Jun 2025 14:31:24 GMT, Daniel Lund?n wrote: >> If a method has a large number of parameters, we currently bail out from C2 compilation. >> >> ### Changeset >> >> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. >> >> Changes: >> - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. >> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. >> - Remove all `can_represent` checks and bailouts. >> - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. >> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) >> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. >> - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. >> - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). 
I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, no... > > Daniel Lund?n has updated the pull request incrementally with one additional commit since the last revision: > > Add clarifying comments at definitions of register mask sizes For reference, here is now the changeset adding an IFG bailout: https://github.com/openjdk/jdk/pull/26118 ------------- PR Comment: https://git.openjdk.org/jdk/pull/20404#issuecomment-3035850032 From mdoerr at openjdk.org Fri Jul 4 12:00:46 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Fri, 4 Jul 2025 12:00:46 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v5] In-Reply-To: References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: On Fri, 4 Jul 2025 11:31:24 GMT, David Briemann wrote: >> Implement more nodes for ppc that exist on other platforms. > > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > adjust parameter types This looks correct, now. I only have a minor suggestion. src/hotspot/cpu/ppc/ppc.ad line 13599: > 13597: %} > 13598: > 13599: instruct vnegI_reg(vecX dst, vecX src) %{ Maybe call it vneg4I? That would be more consistent with the other nodes. src/hotspot/cpu/ppc/ppc.ad line 13601: > 13599: instruct vnegI_reg(vecX dst, vecX src) %{ > 13600: match(Set dst (NegVI src)); > 13601: predicate(PowerArchitecturePPC64 >= 9); We could also for check n->as_Vector()->length() == 4 or type int. ------------- Marked as reviewed by mdoerr (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26115#pullrequestreview-2986896415 PR Review Comment: https://git.openjdk.org/jdk/pull/26115#discussion_r2185175993 PR Review Comment: https://git.openjdk.org/jdk/pull/26115#discussion_r2185177259 From jbhateja at openjdk.org Fri Jul 4 12:03:39 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 4 Jul 2025 12:03:39 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: <6SXA9ZrXBDhZLyXP3lXbkpl4dl3iocvDpzPrUpIQOl8=.9b025be2-848b-4b78-a5e4-929cb7e9f798@github.com> References: <_RERljqu_FG7ZyneAk7Thd-9TwED18pQpEBz_i105fY=.b8948a23-273a-49f6-b9cb-6b611a5eedc6@github.com> <6SXA9ZrXBDhZLyXP3lXbkpl4dl3iocvDpzPrUpIQOl8=.9b025be2-848b-4b78-a5e4-929cb7e9f798@github.com> Message-ID: On Fri, 4 Jul 2025 10:53:55 GMT, erifan wrote: >> src/hotspot/share/opto/vectorIntrinsics.cpp line 707: >> >>> 705: elem_bt = converted_elem_bt; >>> 706: bits = gvn().longcon((bits_type->get_con() & 1L) == 0L ? 0L : -1L); >>> 707: } else if (!arch_supports_vector(opc, num_elem, elem_bt, checkFlags, true /*has_scalar_args*/)) { >> >> I think it's appropriate to make this change as part of VectorLongToMaskNode::Ideal routine to give the opportunity for this transformation during the Iterative GVN pass. > > Originally I also tried to implement it in IGVN, but later changed it to Intrinsic. For two reasons: > > 1. Implementing in intrinsic is relatively simpler and has better performance because it saves the process of generating `VectorLongToMaskNode`. > 2. Implementing in intrinsic can support more cases. Because some architectures (such as aarch64 `NEON`) currently do not support the generation of `VectorLongToMaskNode,` but support `MaskAll` or `Replicate` nodes, if implemented in IGVN, then this optimization doesn't work for NEON. 
But implementing in Intrinsic can cover such cases. Hi @erifan , A few follow-up queries >> Implementing in intrinsic is relatively simpler and has better performance because it saves the process of generating VectorLongToMaskNode. What if during iterative GVN a constant -1 seeps through IR graph and gets connected to the input of VectorLongToMaskNode, you won't be able to create maskAll true in that case? >> Implementing intrinsic can support more cases. Because some architectures (such as aarch64 NEON) currently do not support the generation of VectorLongToMaskNode, but support MaskAll or Replicate nodes, if implemented in IGVN, then this optimization doesn't work for NEON. But implementing in Intrinsic can cover such cases. Do you see any advantage of doing this at intrinsic layer over entirely handling it in Java implimentation by simply modifying the opcode of fromBitsCoerced to MODE_BROADCAST from existing MODE_BITS_COERCED_LONG_TO_MASK for 0 or -1 input. https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorMask.java#L243 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2185179706 From jbhateja at openjdk.org Fri Jul 4 12:06:38 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 4 Jul 2025 12:06:38 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 07:10:22 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is relative smaller than that of `fromLong`. This patch does the conversion for these cases if `l` is a compile time constant. >> >> And this conversion also enables further optimizations that recognize maskAll patterns, see [1]. >> >> Some JTReg test cases are added to ensure the optimization is effective. >> >> I tried many different ways to write a JMH benchmark, but failed. Since the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific compile-time constant, the statement will be hoisted out of the loop. If we don't use a loop, the hotspot will become other instructions, and no obvious performance change was observed. However, combined with the optimization of [1], we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> The patch was tested on both aarch64 and x64, all of tier1 tier2 and tier3 tests passed. >> >> [1] https://github.com/openjdk/jdk/pull/24674 > > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Simplify the test code > > public static final VectorSpecies FSP = FloatVector.SPECIES_512; > > public static long micro1(long a) { > > long mask = Math.min(-1, Math.max(-1, a)); > > return VectorMask.fromLong(FSP, mask).toLong(); > > } > > public static long micro2() { > > return FSP.maskAll(true).toLong(); > > } > > With this JMH method we can not see obvious performance improvement, because the hot spots are other instructions. Adding a loop is better. There is no hard and fast rule for the inclusion of a loop in a JMH micro in that case? 
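To make the benchmark shape under discussion concrete, here is a self-contained sketch assembled from the fragments above. The loop body and the Math.min/Math.max trick for keeping the mask a compile-time constant are taken from the thread; the species choice, the ITERATION value and the JMH boilerplate are assumptions:

```
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@Fork(value = 1, jvmArgsAppend = {"--add-modules=jdk.incubator.vector"})
public class FromLongMaskAllBench {
    static final VectorSpecies<Float> FSP = FloatVector.SPECIES_PREFERRED;
    static final int ITERATION = 1024;

    @Benchmark
    public long fromLongAllSet() {
        long result = 0;
        for (int i = 0; i < ITERATION; i++) {
            // intended to fold to the constant -1, so fromLong() can be
            // strength-reduced to maskAll(true) by the optimization above
            long mask = Math.min(-1, Math.max(-1, result));
            result += VectorMask.fromLong(FSP, mask).toLong();
        }
        return result;
    }

    @Benchmark
    public long maskAllTrue() {
        long result = 0;
        for (int i = 0; i < ITERATION; i++) {
            result += FSP.maskAll(true).toLong();
        }
        return result;
    }
}
```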
------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3035920476 From kevinw at openjdk.org Fri Jul 4 12:18:42 2025 From: kevinw at openjdk.org (Kevin Walls) Date: Fri, 4 Jul 2025 12:18:42 GMT Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining [v2] In-Reply-To: <5IVOmLIL__iwmhFaKEP80YNP8kIN94owhC3SIU8ZF4U=.ab4061f3-e62f-4586-9118-7f84f246078d@github.com> References: <5IVOmLIL__iwmhFaKEP80YNP8kIN94owhC3SIU8ZF4U=.ab4061f3-e62f-4586-9118-7f84f246078d@github.com> Message-ID: On Fri, 4 Jul 2025 08:14:19 GMT, Richard Reingruber wrote: >> This PR adds CompileCommands to the test DumpThreadsWithEliminatedLock.java to force inlining of java/lang/String*.* methods. This will make inlining more stable to allow for the expected lock elimination based on c2 escape analysis. >> >> Forcing inlining of java/lang/StringBuffer.* wasn't sufficient on x86_64. With that the test still failed with TieredCompilation disabled. >> >> Testing: x86_64, ppc64 manually. Other major platforms as part of our CI testing. >> >> Failed inlining on x86_64 with TieredCompilation disabled: >> >> >> make test TEST=com/sun/management/HotSpotDiagnosticMXBean/DumpThreadsWithEliminatedLock.java TEST_VM_OPTS="-XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=PrintInlining,DumpThreadsWithEliminatedLock.*" JTREG=TIMEOUT_FACTOR=0.1 >> >> [...] >> >> STDOUT: >> CompileCommand: PrintInlining DumpThreadsWithEliminatedLock.* bool PrintInlining = true >> @ 1 java.util.concurrent.atomic.AtomicBoolean::get (13 bytes) inline (hot) >> @ 11 java.lang.StringBuffer:: (7 bytes) inline (hot) late inline succeeded (string method) >> @ 3 java.lang.AbstractStringBuilder:: (39 bytes) inline (hot) >> @ 1 java.lang.Object:: (1 bytes) inline (hot) >> @ 16 java.lang.System::currentTimeMillis (0 bytes) (intrinsic) >> s @ 19 java.lang.StringBuffer::append (13 bytes) failed to inline: already compiled into a big method >> s @ 24 java.lang.StringBuffer::toString (44 bytes) inline (hot) late inline succeeded (string method) >> s @ 1 java.lang.StringBuffer::length (5 bytes) accessor >> @ 24 java.lang.String:: (98 bytes) failed to inline: already compiled into a big method >> @ 30 java.util.concurrent.atomic.AtomicReference::set (6 bytes) accessor >> 2025-07-02T09:25:53.396634900Z Attempt 1, found: false >> 2025-07-02T09:25:53.415673072Z Attempt 2, found: false >> 2025-07-02T09:25:53.418876867Z Attempt 3, found: false >> >> [...] > > Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: > > Allow vm.debug About the test and debug mode, we had that kind of conversation in https://github.com/openjdk/jdk/pull/25958 Windows and Macosx were likely to timeout in debug builds, Linux was OK for me. Not sure if the inlining requests here affect that much, will be interesting to see. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26033#issuecomment-3035981114 From mhaessig at openjdk.org Fri Jul 4 13:14:39 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Fri, 4 Jul 2025 13:14:39 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v2] In-Reply-To: References: Message-ID: On Thu, 26 Jun 2025 07:55:23 GMT, Hannes Greule wrote: >> Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. 
By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. >> >> Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. >> >> I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. >> >> Please review. Thanks. > > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > remove classfile version You forgot to add the new tests to the array of tests in `@Run`: stderr: Exception in thread "main" compiler.lib.ir_framework.shared.TestRunException: Test Failures (1) ----------------- Custom Run Test: @Run: runMethod - @Tests: {testI1,testI2,testI3,testL1,testL2,testL3,testS1,testS2,testS3,testUS1,testUS2,testUS3}: compiler.lib.ir_framework.shared.TestRunException: There was an error while invoking @Run method public void ReverseBytesConstantsTests.runMethod() at compiler.lib.ir_framework.test.CustomRunTest.invokeTest(CustomRunTest.java:162) at compiler.lib.ir_framework.test.AbstractTest.run(AbstractTest.java:100) at compiler.lib.ir_framework.test.CustomRunTest.run(CustomRunTest.java:89) at compiler.lib.ir_framework.test.TestVM.runTests(TestVM.java:865) at compiler.lib.ir_framework.test.TestVM.start(TestVM.java:255) at compiler.lib.ir_framework.test.TestVM.main(TestVM.java:168) Caused by: java.lang.reflect.InvocationTargetException at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:119) at java.base/java.lang.reflect.Method.invoke(Method.java:565) at compiler.lib.ir_framework.test.CustomRunTest.invokeTest(CustomRunTest.java:159) ... 5 more Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -24674 out of bounds for length 128 at java.base/java.lang.Character.valueOf(Character.java:9284) at ReverseBytesConstantsTests.assertResultUS(ReverseBytesConstantsTests.java:102) at ReverseBytesConstantsTests.runMethod(ReverseBytesConstantsTests.java:66) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) ... 7 more at compiler.lib.ir_framework.test.TestVM.runTests(TestVM.java:901) at compiler.lib.ir_framework.test.TestVM.start(TestVM.java:255) at compiler.lib.ir_framework.test.TestVM.main(TestVM.java:168) ------------- PR Comment: https://git.openjdk.org/jdk/pull/25988#issuecomment-3036245144 From duke at openjdk.org Fri Jul 4 13:23:41 2025 From: duke at openjdk.org (hanguanqiang) Date: Fri, 4 Jul 2025 13:23:41 GMT Subject: RFR: 8358568: Purge obsolete/broken GenerateSynchronizationCode flag [v5] In-Reply-To: References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> <8h5rro_3Sv9GLk3WSTstiTPXEj1dHPPHgCKowe-CIjk=.e15f5800-732f-4f25-81f8-ae86724bec2e@github.com> Message-ID: On Fri, 4 Jul 2025 09:27:31 GMT, Aleksey Shipilev wrote: >> hanguanqiang has updated the pull request incrementally with one additional commit since the last revision: >> >> correct an error >> >> correct an error > > I renamed the JBS bug, match the PR title, please. Also, go to https://github.com/hgqxjj/jdk/actions -- and enable the workflows. We need to have a clean GHA run before we can integrate. 
@shipilev @TobiHartmann Many thanks to both of you for reviewing ------------- PR Comment: https://git.openjdk.org/jdk/pull/26108#issuecomment-3036267585 From shade at openjdk.org Fri Jul 4 13:35:41 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 4 Jul 2025 13:35:41 GMT Subject: RFR: 8357473: Compilation spike leaves many CompileTasks in free list [v5] In-Reply-To: References: Message-ID: <8tf_dPZ9hexTA0unaFgAzyRqMW42z1lSRasRxySLlMU=.5cf326d4-894a-4f67-a9eb-f0c76e1bc3a9@github.com> On Fri, 4 Jul 2025 09:08:19 GMT, Aleksey Shipilev wrote: >> See bug for more discussion. >> >> This PR implements the "all the way" solution by removing the free list completely. It complements https://github.com/openjdk/jdk/pull/25364, and can go either first, or second. We will remerge the other one once either integrates. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `compiler` >> - [x] Linux AArch64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits: > > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Also free the lock! > - Comments and indenting > - Basic deletion I would like to ditch the `CompileTaskAlloc_lock` completely, but that needs https://github.com/openjdk/jdk/pull/26127 to be done first. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25409#issuecomment-3036297988 From eastigeevich at openjdk.org Fri Jul 4 13:59:39 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Fri, 4 Jul 2025 13:59:39 GMT Subject: RFR: 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache In-Reply-To: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> References: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> Message-ID: On Thu, 3 Jul 2025 11:29:02 GMT, hanguanqiang wrote: > The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is confusing and does not reflect the current implementation. > > This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. lgtm ------------- Marked as reviewed by eastigeevich (Committer). PR Review: https://git.openjdk.org/jdk/pull/26114#pullrequestreview-2987356786 From duke at openjdk.org Fri Jul 4 14:13:27 2025 From: duke at openjdk.org (hanguanqiang) Date: Fri, 4 Jul 2025 14:13:27 GMT Subject: RFR: 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache [v2] In-Reply-To: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> References: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> Message-ID: > The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is confusing and does not reflect the current implementation. 
> > This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. hanguanqiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - correct a compile error - Merge remote-tracking branch 'upstream/master' into 8344548 - 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is confusing and does not reflect the current implementation. This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26114/files - new: https://git.openjdk.org/jdk/pull/26114/files/698a3f28..cb1b2c60 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26114&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26114&range=00-01 Stats: 3295 lines in 114 files changed: 2197 ins; 812 del; 286 mod Patch: https://git.openjdk.org/jdk/pull/26114.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26114/head:pull/26114 PR: https://git.openjdk.org/jdk/pull/26114 From duke at openjdk.org Fri Jul 4 14:21:38 2025 From: duke at openjdk.org (hanguanqiang) Date: Fri, 4 Jul 2025 14:21:38 GMT Subject: RFR: 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache [v2] In-Reply-To: References: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> Message-ID: <-bUAuPwNRRbf6d7qs2AJErsIlLJQbu9Hl0_ReKdUZ7A=.8473414f-b98b-4ca7-bf91-aca4ab0ccca5@github.com> On Fri, 4 Jul 2025 13:57:01 GMT, Evgeny Astigeevich wrote: >> hanguanqiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - correct a compile error >> - Merge remote-tracking branch 'upstream/master' into 8344548 >> - 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache >> >> The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is >> confusing and does not reflect the current implementation. >> >> This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. 
> > lgtm Many thanks to you @eastig for reviewing ------------- PR Comment: https://git.openjdk.org/jdk/pull/26114#issuecomment-3036461841 From duke at openjdk.org Fri Jul 4 16:28:38 2025 From: duke at openjdk.org (Samuel Chee) Date: Fri, 4 Jul 2025 16:28:38 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet In-Reply-To: References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: On Fri, 27 Jun 2025 12:54:07 GMT, Andrew Haley wrote: >> I can double check with the herd7 simulator, but since the `casal` will always produce an acquire, to me it seems impossible that a load can be moved before the `casal` due to the acquire within the `casal`. >> >> Clause 9 of before-barrier-ordering in the Arm Architecture reference manual also supports this. > >> Clause 9 of before-barrier-ordering in the Arm Architecture reference manual also supports this. > > Which clause is that? Hi @theRealAph, The clause can be found here, the last bullet point on this page - https://mozilla.github.io/pdf.js/web/viewer.html?file=https://documentation-service.arm.com/static/6839d7585475b403d943b4dc#page=255&pagemode=none Also, we have come up with two herd7 tests which should hopefully prove it to be alright. { x=0; y=0; 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; } P0 | P1 ; MOV W0,#1 | MOV W0, #1 ; MOV W2,#2 | MOV W2, #2 ; CASAL W0, W2, [X1] | ; LDR W4,[X3] | STR W0, [X3] ; | DMB ISH ; | STR W2, [X1] ; exists (0:X0=2 /\ 0:X4=0) Here, the stores by P1 are happening in order: y = 1; x = 2; and the reads in P0 are happening by CASAL first - from x and then by LDR - from y. The constraint checked is that CASAL can't read 2 from x if LDR read 0 from y - the constraint should be fulfilled unless the reads are reordered. And { x = 1; y = 1; 0: X1=x; 0:X3=y; 1: X3=x; 1:X1=y; } P0 | P1 ; MOV W0, #1 | MOV W0, #1; MOV W2, #2 | MOV W2, #2; CASAL W0, W2, [X1] | CASAL W0, W2, [X1]; LDR W4, [X3] | LDR W4, [X3]; exists (0:X4=1 /\ 1: X4=1) Here, both X4's being equal to 1 is disallowed, as that would indicate that one of the ldrs was reordered before the CASAL. As the CASALs will always succeed by default, meaning at least one of the LDRs will load a non-1 value into W4. Hence (0:X4=1 /\ 1: X4=1) can only ever occur if an ldr gets ordered before the CASAL. Hope this helps :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/26000#issuecomment-3036824254 From duke at openjdk.org Sat Jul 5 00:14:39 2025 From: duke at openjdk.org (hanguanqiang) Date: Sat, 5 Jul 2025 00:14:39 GMT Subject: RFR: 8358568: Purge obsolete/broken GenerateSynchronizationCode flag [v5] In-Reply-To: References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> <8h5rro_3Sv9GLk3WSTstiTPXEj1dHPPHgCKowe-CIjk=.e15f5800-732f-4f25-81f8-ae86724bec2e@github.com> Message-ID: On Fri, 4 Jul 2025 09:27:31 GMT, Aleksey Shipilev wrote: >> hanguanqiang has updated the pull request incrementally with one additional commit since the last revision: >> >> correct an error >> >> correct an error > > I renamed the JBS bug, match the PR title, please. Also, go to https://github.com/hgqxjj/jdk/actions -- and enable the workflows. We need to have a clean GHA run before we can integrate. @shipilev @TobiHartmann The PR is ready to be integrated, but I don't have the necessary permissions yet. Could you help with the integration? Thanks again!
------------- PR Comment: https://git.openjdk.org/jdk/pull/26108#issuecomment-3037460059 From shade at openjdk.org Sat Jul 5 05:47:51 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Sat, 5 Jul 2025 05:47:51 GMT Subject: RFR: 8358568: Purge obsolete/broken GenerateSynchronizationCode flag [v5] In-Reply-To: References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> <8h5rro_3Sv9GLk3WSTstiTPXEj1dHPPHgCKowe-CIjk=.e15f5800-732f-4f25-81f8-ae86724bec2e@github.com> Message-ID: On Fri, 4 Jul 2025 09:27:31 GMT, Aleksey Shipilev wrote: >> hanguanqiang has updated the pull request incrementally with one additional commit since the last revision: >> >> correct an error >> >> correct an error > > I renamed the JBS bug, match the PR title, please. Also, go to https://github.com/hgqxjj/jdk/actions -- and enable the workflows. We need to have a clean GHA run before we can integrate. > @shipilev @TobiHartmann The PR is ready to be integrated, but I don?t have the necessary permissions yet. Could you help with the integration? Thanks again ! See what bots say here: https://github.com/openjdk/jdk/pull/26108#issuecomment-3030230202 -- you need to issue `/integrate` command, and someone would sponsor. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26108#issuecomment-3038183835 From duke at openjdk.org Sat Jul 5 06:34:51 2025 From: duke at openjdk.org (duke) Date: Sat, 5 Jul 2025 06:34:51 GMT Subject: RFR: 8358568: Purge obsolete/broken GenerateSynchronizationCode flag [v5] In-Reply-To: <8h5rro_3Sv9GLk3WSTstiTPXEj1dHPPHgCKowe-CIjk=.e15f5800-732f-4f25-81f8-ae86724bec2e@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> <8h5rro_3Sv9GLk3WSTstiTPXEj1dHPPHgCKowe-CIjk=.e15f5800-732f-4f25-81f8-ae86724bec2e@github.com> Message-ID: <7KfCIDJiq0UA0EcFyAiEyqPtShKrl6N-295Bu0DEI7E=.ffa395b6-779e-4a98-b933-854861b2a6b5@github.com> On Fri, 4 Jul 2025 06:43:02 GMT, hanguanqiang wrote: >> This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode >> >> Problem? >> When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. >> >> Root Cause? >> Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. >> >> Fix >> Purge obsolete/broken GenerateSynchronizationCode flag > > hanguanqiang has updated the pull request incrementally with one additional commit since the last revision: > > correct an error > > correct an error @hgqxjj Your change (at version 6ebc2ecb7b41da558a26400461b2e8084e915c3d) is now ready to be sponsored by a Committer. 
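For readers following along, the scenario described in the quoted report can be pictured with a trivial synchronized method like the sketch below; the class and method names are invented for illustration, and the flag mentioned in the report is exactly the one this change removes:

    public class SyncSketch {
        private int counter;

        // A synchronized method compiles to a monitorenter/monitorexit pair;
        // parsing the monitorexit is where do_monitor_exit() is involved.
        public synchronized void increment() {
            counter++;
        }

        public static void main(String[] args) {
            SyncSketch s = new SyncSketch();
            for (int i = 0; i < 200_000; i++) {
                s.increment();   // enough iterations for the method to get JIT-compiled
            }
            System.out.println(s.counter);
        }
    }

Running a workload of this shape with the now-removed -XX:-GenerateSynchronizationCode option is the kind of setup that used to hit the assertion.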
------------- PR Comment: https://git.openjdk.org/jdk/pull/26108#issuecomment-3038269276 From duke at openjdk.org Sat Jul 5 06:40:39 2025 From: duke at openjdk.org (hanguanqiang) Date: Sat, 5 Jul 2025 06:40:39 GMT Subject: RFR: 8358568: Purge obsolete/broken GenerateSynchronizationCode flag [v5] In-Reply-To: References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> <8h5rro_3Sv9GLk3WSTstiTPXEj1dHPPHgCKowe-CIjk=.e15f5800-732f-4f25-81f8-ae86724bec2e@github.com> Message-ID: <_BqZWLEjPgBAb82HtarIENSKj5AuzFbILzidygCRm38=.7e18705d-aa5f-4e07-bcc1-496f75848441@github.com> On Sat, 5 Jul 2025 05:45:24 GMT, Aleksey Shipilev wrote: >> I renamed the JBS bug, match the PR title, please. Also, go to https://github.com/hgqxjj/jdk/actions -- and enable the workflows. We need to have a clean GHA run before we can integrate. > >> @shipilev @TobiHartmann The PR is ready to be integrated, but I don?t have the necessary permissions yet. Could you help with the integration? Thanks again ! > > See what bots say here: https://github.com/openjdk/jdk/pull/26108#issuecomment-3030230202 -- you need to issue `/integrate` command, and someone would sponsor. @shipilev Thanks for the reminder?i already issue /integrate , please help sponsor this change , really appreciate ! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26108#issuecomment-3038287916 From dnsimon at openjdk.org Sat Jul 5 10:31:37 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Sat, 5 Jul 2025 10:31:37 GMT Subject: RFR: 8361417: JVMCI getModifiers incorrect for inner classes Message-ID: The result of `ResolvedJavaType.getModifiers()` should always have been the same as `Class.getModifiers()`. This is currently not the case for inner classes. Instead, the value is derived from `Klass::_access_flags` where as it should be derived from the `InnerClasses` attribute (as it is for `Class`). This PR aligns `ResolvedJavaType.getModifiers()` with `Class.getModifiers()`. ------------- Commit messages: - fix getModifiers() for inner classes Changes: https://git.openjdk.org/jdk/pull/26135/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26135&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361417 Stats: 71 lines in 7 files changed: 36 ins; 20 del; 15 mod Patch: https://git.openjdk.org/jdk/pull/26135.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26135/head:pull/26135 PR: https://git.openjdk.org/jdk/pull/26135 From fgao at openjdk.org Sat Jul 5 14:04:39 2025 From: fgao at openjdk.org (Fei Gao) Date: Sat, 5 Jul 2025 14:04:39 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> Message-ID: On Fri, 4 Jul 2025 09:15:14 GMT, Xiaohong Gong wrote: >> This looks good. Thanks. > >> This looks good. Thanks. > > Thanks so much for your review! Hi @XiaohongGong, thanks for your work! Shall we also relax the IR check condition in the following cases for `aarch64` and `x86`? 
https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L254-L258 https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L376-L380 https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3038978749 From fgao at openjdk.org Sat Jul 5 15:11:40 2025 From: fgao at openjdk.org (Fei Gao) Date: Sat, 5 Jul 2025 15:11:40 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> Message-ID: <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> On Thu, 3 Jul 2025 06:10:28 GMT, Xiaohong Gong wrote: >> ### Background >> On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. >> >> For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. >> >> To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. >> >> ### Impact Analysis >> #### 1. Vector types >> Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. >> >> #### 2. Vector API >> No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. >> >> #### 3. Auto-vectorization >> Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. >> >> #### 4. Codegen of vector nodes >> NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. >> >> Details: >> - Lanewise vector operations are unaffected as explained above. >> - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). >> - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. 
Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, addin... > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Refine the comment in ad file Have you measured the performance of this micro-benchmark on NEON machine? https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java#L251-L256 We added an limitation only for `int` before: https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L131-L134 Perhaps we also need to impose a similar limitation on `short` if the same regression occurs. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3039090274 From fjiang at openjdk.org Sun Jul 6 13:22:47 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Sun, 6 Jul 2025 13:22:47 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v3] In-Reply-To: References: Message-ID: > Hi, please consider. > [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. > The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. > If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. > This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. > We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. > > This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. > The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. > > Test on linux-riscv64: > - [x] Tier1-3 > > JMH data on P550 SBC for reference (w/o and w/ the patch): > > Before: > > Without COH: > > Benchmark (size) Mode Cnt Score Error Units > ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op > ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op > ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op > ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op > ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op > ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op > ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op > ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op > ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op > ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op > ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op > ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op > ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op > ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op > ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op > ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op > > ------------------------------------------------------------------------- > With COH: > > Benchmark (size) Mode Cnt Score Error Units > ArrayClone.byteArraycopy 0 avgt 15 50.667 ? 0.622 ns/op > Arra... 
Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - Revert RISCV Macro modification - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses - riscv: fix c1 primitive array clone intrinsic regression ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25976/files - new: https://git.openjdk.org/jdk/pull/25976/files/be980424..3a502f84 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25976&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25976&range=01-02 Stats: 10623 lines in 438 files changed: 7013 ins; 1860 del; 1750 mod Patch: https://git.openjdk.org/jdk/pull/25976.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25976/head:pull/25976 PR: https://git.openjdk.org/jdk/pull/25976 From fjiang at openjdk.org Sun Jul 6 13:22:49 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Sun, 6 Jul 2025 13:22:49 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v2] In-Reply-To: References: Message-ID: <7KGEqkzMGveZ_lLtIcC0YwwHqmUri7L3_v7J6aVLmQM=.089fc97c-f09a-4220-87cc-a30d6dd10536@github.com> On Fri, 4 Jul 2025 05:25:08 GMT, Fei Yang wrote: >> Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses >> - riscv: fix c1 primitive array clone intrinsic regression > > src/hotspot/share/c1/c1_Compiler.cpp line 240: > >> 238: #endif >> 239: case vmIntrinsics::_getObjectSize: >> 240: #if defined(X86) || defined(AARCH64) || defined(S390) || defined(RISCV64) || defined(PPC64) > > PS: The change of macro `RISCV` seems unrelated to this PR? Seem better to go with another PR. Reverted. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2188269110 From xgong at openjdk.org Mon Jul 7 02:07:50 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 7 Jul 2025 02:07:50 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: References: <7GrGfBF_v8F0v02sRHC78ofMZwpMdzQZaHeYlNvi_N0=.93defb9e-ca9b-41b4-8722-1746692e2316@github.com> Message-ID: On Wed, 2 Jul 2025 08:24:22 GMT, Emanuel Peter wrote: >>> Agree with Paul, these are minor regressions. Let us proceed with this patch. >> >> Thanks so much for your review @sviswa7 ! > > @XiaohongGong I quickly scanned the patch, it looks good to me too. I'm submitting some internal testing now, to make sure our extended testing does not break on integration. Should take about 24h. Hi @eme64 , may I ask how the testing is going on? Can we move on and integrate this patch now? Thanks a lot! 
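For reference, a minimal use of the subword gather API that this refactoring targets looks roughly like the sketch below (illustrative only; the species, array size and index pattern are arbitrary choices, not taken from the tests, and it assumes the jdk.incubator.vector module):

    import jdk.incubator.vector.ByteVector;
    import jdk.incubator.vector.VectorSpecies;

    public class SubwordGatherSketch {
        static final VectorSpecies<Byte> BSP = ByteVector.SPECIES_128;

        public static void main(String[] args) {
            byte[] data = new byte[64];
            for (int i = 0; i < data.length; i++) {
                data[i] = (byte) i;
            }
            // One int index per lane, applied relative to the base offset.
            int[] indexMap = new int[BSP.length()];
            for (int i = 0; i < indexMap.length; i++) {
                indexMap[i] = indexMap.length - 1 - i;   // gather the lanes in reverse order
            }
            // Gather load: lane i reads data[offset + indexMap[mapOffset + i]].
            ByteVector v = ByteVector.fromArray(BSP, data, 0, indexMap, 0);
            System.out.println(v);
        }
    }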
------------- PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3043273111 From dzhang at openjdk.org Mon Jul 7 02:35:11 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 7 Jul 2025 02:35:11 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call Message-ID: Hi, please consider this code cleanup change for native call. This removes the address parameter for NativeCall::reloc_destination and NativeFarCall::reloc_destination. This also removes several unnecessary code blob related runtime checks turning them into assertions. ### Testing * [x] hs:tier1 - hs:tier3 tested with linux-riscv64 fastdebug build ------------- Commit messages: - 8361449: RISC-V: Code cleanup for native call Changes: https://git.openjdk.org/jdk/pull/26150/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26150&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361449 Stats: 39 lines in 3 files changed: 5 ins; 12 del; 22 mod Patch: https://git.openjdk.org/jdk/pull/26150.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26150/head:pull/26150 PR: https://git.openjdk.org/jdk/pull/26150 From dzhang at openjdk.org Mon Jul 7 03:05:25 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 7 Jul 2025 03:05:25 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v2] In-Reply-To: References: Message-ID: > Hi, please consider this code cleanup change for native call. > > This removes the address parameter for NativeCall::reloc_destination and NativeFarCall::reloc_destination. > This also removes several unnecessary code blob related runtime checks turning them into assertions. > > ### Testing > * [x] hs:tier1 - hs:tier3 tested with linux-riscv64 fastdebug build Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Remove outdated comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26150/files - new: https://git.openjdk.org/jdk/pull/26150/files/4054dac0..d7ff8e53 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26150&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26150&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26150.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26150/head:pull/26150 PR: https://git.openjdk.org/jdk/pull/26150 From duke at openjdk.org Mon Jul 7 03:46:38 2025 From: duke at openjdk.org (erifan) Date: Mon, 7 Jul 2025 03:46:38 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: References: Message-ID: <3cr8Njt2flaQXy5sjOVOlhI9XDkEesagnYLwzCmgkoI=.089494aa-d622-47db-8d23-c9637519028c@github.com> On Fri, 4 Jul 2025 12:04:06 GMT, Jatin Bhateja wrote: > > > public static final VectorSpecies FSP = FloatVector.SPECIES_512; > > > public static long micro1(long a) { > > > long mask = Math.min(-1, Math.max(-1, a)); > > > return VectorMask.fromLong(FSP, mask).toLong(); > > > } > > > public static long micro2() { > > > return FSP.maskAll(true).toLong(); > > > } > > > > > > With this JMH method we can not see obvious performance improvement, because the hot spots are other instructions. Adding a loop is better. > > There is no hard and fast rule for the inclusion of a loop in a JMH micro in that case? You mean adding a loop is not a block, right ? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3043388905 From duke at openjdk.org Mon Jul 7 03:46:39 2025 From: duke at openjdk.org (erifan) Date: Mon, 7 Jul 2025 03:46:39 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: References: <_RERljqu_FG7ZyneAk7Thd-9TwED18pQpEBz_i105fY=.b8948a23-273a-49f6-b9cb-6b611a5eedc6@github.com> <6SXA9ZrXBDhZLyXP3lXbkpl4dl3iocvDpzPrUpIQOl8=.9b025be2-848b-4b78-a5e4-929cb7e9f798@github.com> Message-ID: On Fri, 4 Jul 2025 11:59:23 GMT, Jatin Bhateja wrote: > What if during iterative GVN a constant -1 seeps through IR graph and gets connected to the input of VectorLongToMaskNode, you won't be able to create maskAll true in that case? Yes, this PR doesn't support this case. Maybe we should do this optimization in `ideal`. If `VectorLongToMask` is not supported, then try to convert it to `maskAll` or `Replicate` in intrinsic. > Do you see any advantage of doing this at intrinsic layer over entirely handling it in Java implimentation by simply modifying the opcode of fromBitsCoerced to MODE_BROADCAST from existing MODE_BITS_COERCED_LONG_TO_MASK for 0 or -1 input. I had tried this method and gave it up, because it has up to 34% performance regression for specific cases on x64. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2188903923 From duke at openjdk.org Mon Jul 7 05:25:45 2025 From: duke at openjdk.org (guanqiang han) Date: Mon, 7 Jul 2025 05:25:45 GMT Subject: Integrated: 8358568: Purge obsolete/broken GenerateSynchronizationCode flag In-Reply-To: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> Message-ID: On Thu, 3 Jul 2025 01:59:55 GMT, guanqiang han wrote: > This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode > > Problem? > When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. > > Root Cause? > Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. > > Fix > Purge obsolete/broken GenerateSynchronizationCode flag This pull request has now been integrated. Changeset: 45300dd1 Author: hanguanqiang Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/45300dd1234c9aa92d6b82f1ef2b05b949b1ea9f Stats: 23 lines in 6 files changed: 0 ins; 17 del; 6 mod 8358568: Purge obsolete/broken GenerateSynchronizationCode flag Reviewed-by: thartmann, shade ------------- PR: https://git.openjdk.org/jdk/pull/26108 From epeter at openjdk.org Mon Jul 7 06:04:45 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 7 Jul 2025 06:04:45 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: References: Message-ID: On Wed, 25 Jun 2025 09:16:48 GMT, Xiaohong Gong wrote: >> JDK-8318650 introduced hotspot intrinsification of subword gather load APIs for X86 platforms [1]. 
However, the current implementation is not optimal for AArch64 SVE platform, which natively supports vector instructions for subword gather load operations using an int vector for indices (see [2][3]). >> >> Two key areas require improvement: >> 1. At the Java level, vector indices generated for range validation could be reused for the subsequent gather load operation on architectures with native vector instructions like AArch64 SVE. However, the current implementation prevents compiler reuse of these index vectors due to divergent control flow, potentially impacting performance. >> 2. At the compiler IR level, the additional `offset` input for `LoadVectorGather`/`LoadVectorGatherMasked` with subword types increases IR complexity and complicates backend implementation. Furthermore, generating `add` instructions before each memory access negatively impacts performance. >> >> This patch refactors the implementation at both the Java level and compiler mid-end to improve efficiency and maintainability across different architectures. >> >> Main changes: >> 1. Java-side API refactoring: >> - Explicitly passes generated index vectors to hotspot, eliminating duplicate index vectors for gather load instructions on >> architectures like AArch64. >> 2. C2 compiler IR refactoring: >> - Refactors `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword types by removing the memory offset input and incorporating it into the memory base `addr` at the IR level. This simplifies backend implementation, reduces add operations, and unifies the IR across all types. >> 3. Backend changes: >> - Streamlines X86 implementation of subword gather operations following the removal of the offset input from the IR level. >> >> Performance: >> The performance of the relative JMH improves up to 27% on a X86 AVX512 system. Please see the data below: >> >> Benchmark Mode Cnt Unit SIZE Before After Gain >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 64 53682.012 52650.325 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 256 14484.252 14255.156 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 1024 3664.900 3595.615 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 4096 908.31... > > Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: > > - Address review comments > - Merge 'jdk:master' into JDK-8355563 > - 8355563: VectorAPI: Refactor current implementation of subword gather load API @XiaohongGong Thanks for putting in the work! Tests pass, and patch looks reasonable. ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25138#pullrequestreview-2992282121 From epeter at openjdk.org Mon Jul 7 06:21:48 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 7 Jul 2025 06:21:48 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v9] In-Reply-To: References: Message-ID: On Wed, 25 Jun 2025 10:08:23 GMT, erifan wrote: >> This patch optimizes the following patterns: >> For integer types: >> >> (XorV (VectorMaskCmp src1 src2 cond) (Replicate -1)) >> => (VectorMaskCmp src1 src2 ncond) >> (XorVMask (VectorMaskCmp src1 src2 cond) (MaskAll m1)) >> => (VectorMaskCmp src1 src2 ncond) >> >> cond can be eq, ne, le, ge, lt, gt, ule, uge, ult and ugt, ncond is the negative comparison of cond. 
>> >> For float and double types: >> >> (XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1)) >> => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) >> (XorVMask (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (MaskAll m1)) >> => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) >> >> cond can be eq or ne. >> >> Benchmarks on Nvidia Grace machine with 128-bit SVE2: With option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Score Error After Score Error Uplift >> testCompareEQMaskNotByte ops/s 7912127.225 2677.289518 10266136.26 8955.008548 1.29 >> testCompareEQMaskNotDouble ops/s 884737.6799 446.963779 1179760.772 448.031844 1.33 >> testCompareEQMaskNotFloat ops/s 1765045.787 682.332214 2359520.803 896.305743 1.33 >> testCompareEQMaskNotInt ops/s 1787221.411 977.743935 2353952.519 960.069976 1.31 >> testCompareEQMaskNotLong ops/s 895297.1974 673.44808 1178449.02 323.804205 1.31 >> testCompareEQMaskNotShort ops/s 3339987.002 3415.2226 4712761.965 2110.862053 1.41 >> testCompareGEMaskNotByte ops/s 7907615.16 4094.243652 10251646.9 9486.699831 1.29 >> testCompareGEMaskNotInt ops/s 1683738.958 4233.813092 2352855.205 1251.952546 1.39 >> testCompareGEMaskNotLong ops/s 854496.1561 8594.598885 1177811.493 521.1229 1.37 >> testCompareGEMaskNotShort ops/s 3341860.309 1578.975338 4714008.434 1681.10365 1.41 >> testCompareGTMaskNotByte ops/s 7910823.674 2993.367032 10245063.58 9774.75138 1.29 >> testCompareGTMaskNotInt ops/s 1673393.928 3153.099431 2353654.521 1190.848583 1.4 >> testCompareGTMaskNotLong ops/s 849405.9159 2432.858159 1177952.041 359.96413 1.38 >> testCompareGTMaskNotShort ops/s 3339509.141 3339.976585 4711442.496 2673.364893 1.41 >> testCompareLEMaskNotByte ops/s 7911340.004 3114.69191 10231626.5 27134.20035 1.29 >> testCompareLEMaskNotInt ops/s 1675812.113 1340.969885 2353255.341 1452.4522 1.4 >> testCompareLEMaskNotLong ops/s 848862.8036 6564.841731 1177763.623 539.290106 1.38 >> testCompareLEMaskNotShort ops/s 3324951.54 2380.29473 4712116.251 1544.559684 1.41 >> testCompareLTMaskNotByte ops/s 7910390.844 2630.861436 10239567.69 6487.441672 1.29 >> testCompareLTMaskNotInt ops/s 16721... > > erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 15 additional commits since the last revision: > > - Address more comments > > ATT. > - Merge branch 'master' into JDK-8354242 > - Support negating unsigned comparison for BoolTest::mask > > Added a static method `negate_mask(mask btm)` into BoolTest class to > negate both signed and unsigned comparison. > - Addressed some review comments > - Merge branch 'master' into JDK-8354242 > - Refactor the JTReg tests for compare.xor(maskAll) > > Also made a bit change to support pattern `VectorMask.fromLong()`. > - Merge branch 'master' into JDK-8354242 > - Refactor code > > Add a new function XorVNode::Ideal_XorV_VectorMaskCmp to do this > optimization, making the code more modular. > - Merge branch 'master' into JDK-8354242 > - Update the jtreg test > - ... and 5 more: https://git.openjdk.org/jdk/compare/78e42324...5ebdc572 src/hotspot/share/opto/vectornode.cpp line 2243: > 2241: !VectorNode::is_all_ones_vector(in2)) { > 2242: return nullptr; > 2243: } Suggestion: if (in1->Opcode() != Op_VectorMaskCmp || in1->outcnt() != 1 || !(in1->as_VectorMaskCmp())->predicate_can_be_negated() || !VectorNode::is_all_ones_vector(in2)) { return nullptr; } Indentation for clarity. 
Currently, you are exiting if one of these is the case: 1. Not `MaskCmp` 2. More than one use 3. predicate cannot be negated AND the vector is all ones. Can you explain this condition? Do you have tests for cases: - predicate negatable and vector not all ones - predicate not negatable and vector not all ones - predicate negatable and vector all ones - predicate not negatable and vector all ones Why do you guard against `VectorNode::is_all_ones_vector(in2)` at all? The condition for 3. is easy to get wrong, so good testing is important here :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2189075462 From xgong at openjdk.org Mon Jul 7 06:55:46 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 7 Jul 2025 06:55:46 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: References: <7GrGfBF_v8F0v02sRHC78ofMZwpMdzQZaHeYlNvi_N0=.93defb9e-ca9b-41b4-8722-1746692e2316@github.com> Message-ID: On Mon, 7 Jul 2025 02:05:06 GMT, Xiaohong Gong wrote: >> @XiaohongGong I quickly scanned the patch, it looks good to me too. I'm submitting some internal testing now, to make sure our extended testing does not break on integration. Should take about 24h. > > Hi @eme64, may I ask how the testing is going on? Can we move on and integrate this patch now? Thanks a lot! > @XiaohongGong Thanks for putting in the work! > > Tests pass, and patch looks reasonable. Thanks so much for your review and test! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3043679621 From xgong at openjdk.org Mon Jul 7 06:55:47 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 7 Jul 2025 06:55:47 GMT Subject: Integrated: 8355563: VectorAPI: Refactor current implementation of subword gather load API In-Reply-To: References: Message-ID: On Fri, 9 May 2025 07:35:41 GMT, Xiaohong Gong wrote: > JDK-8318650 introduced hotspot intrinsification of subword gather load APIs for X86 platforms [1]. However, the current implementation is not optimal for AArch64 SVE platform, which natively supports vector instructions for subword gather load operations using an int vector for indices (see [2][3]). > > Two key areas require improvement: > 1. At the Java level, vector indices generated for range validation could be reused for the subsequent gather load operation on architectures with native vector instructions like AArch64 SVE. However, the current implementation prevents compiler reuse of these index vectors due to divergent control flow, potentially impacting performance. > 2. At the compiler IR level, the additional `offset` input for `LoadVectorGather`/`LoadVectorGatherMasked` with subword types increases IR complexity and complicates backend implementation. Furthermore, generating `add` instructions before each memory access negatively impacts performance. > > This patch refactors the implementation at both the Java level and compiler mid-end to improve efficiency and maintainability across different architectures. > > Main changes: > 1. Java-side API refactoring: > - Explicitly passes generated index vectors to hotspot, eliminating duplicate index vectors for gather load instructions on > architectures like AArch64. > 2. C2 compiler IR refactoring: > - Refactors `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword types by removing the memory offset input and incorporating it into the memory base `addr` at the IR level.
This simplifies backend implementation, reduces add operations, and unifies the IR across all types. > 3. Backend changes: > - Streamlines X86 implementation of subword gather operations following the removal of the offset input from the IR level. > > Performance: > The performance of the relative JMH improves up to 27% on a X86 AVX512 system. Please see the data below: > > Benchmark Mode Cnt Unit SIZE Before After Gain > GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 64 53682.012 52650.325 0.98 > GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 256 14484.252 14255.156 0.98 > GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 1024 3664.900 3595.615 0.98 > GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 4096 908.312 935.269 1.02 > GatherOperationsBenchmark.micr... This pull request has now been integrated. Changeset: d75ea7e6 Author: Xiaohong Gong URL: https://git.openjdk.org/jdk/commit/d75ea7e67951275fe27f1e137c961f39d779a046 Stats: 450 lines in 15 files changed: 109 ins; 176 del; 165 mod 8355563: VectorAPI: Refactor current implementation of subword gather load API Reviewed-by: epeter, psandoz, sviswanathan, jbhateja ------------- PR: https://git.openjdk.org/jdk/pull/25138 From xgong at openjdk.org Mon Jul 7 07:01:39 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 7 Jul 2025 07:01:39 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> Message-ID: <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> On Sat, 5 Jul 2025 15:08:35 GMT, Fei Gao wrote: > Have you measured the performance of this micro-benchmark on NEON machine? > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java#L251-L256 > > We added an limitation only for `int` before: > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L131-L134 > > Perhaps we also need to impose a similar limitation on `short` if the same regression occurs. Good catch, and thanks so much for your input @fg1417 ! I will test the performance and disable auto-vectorization for double to short casting if the performance has regression. > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. 
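To make the conversion shape concrete (a plain sketch, not one of the jtreg cases linked above; class and method names are invented), the kind of loop that benefits from the relaxed 32-bit minimum for short vectors is a widening conversion such as:

    public class ShortWideningSketch {
        // With 128-bit vectors, each double lane pairs up with a 2-element
        // short sub-vector, i.e. a 32-bit short vector, which is why the
        // relaxed minimum vector length matters for this conversion.
        public static void convert(short[] src, double[] dst) {
            for (int i = 0; i < src.length; i++) {
                dst[i] = src[i];
            }
        }

        public static void main(String[] args) {
            short[] src = new short[1024];
            double[] dst = new double[1024];
            for (int i = 0; i < src.length; i++) {
                src[i] = (short) (i - 512);
            }
            convert(src, dst);
            System.out.println(dst[0] + " .. " + dst[1023]);
        }
    }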
------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3043711086 From dbriemann at openjdk.org Mon Jul 7 07:30:19 2025 From: dbriemann at openjdk.org (David Briemann) Date: Mon, 7 Jul 2025 07:30:19 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v6] In-Reply-To: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: > Implement more nodes for ppc that exist on other platforms. David Briemann has updated the pull request incrementally with one additional commit since the last revision: rename instruction, add extra predicate cond for type int ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26115/files - new: https://git.openjdk.org/jdk/pull/26115/files/ebb27c9c..b65400a9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26115&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26115&range=04-05 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26115.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26115/head:pull/26115 PR: https://git.openjdk.org/jdk/pull/26115 From rrich at openjdk.org Mon Jul 7 07:44:39 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Mon, 7 Jul 2025 07:44:39 GMT Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining [v2] In-Reply-To: <5IVOmLIL__iwmhFaKEP80YNP8kIN94owhC3SIU8ZF4U=.ab4061f3-e62f-4586-9118-7f84f246078d@github.com> References: <5IVOmLIL__iwmhFaKEP80YNP8kIN94owhC3SIU8ZF4U=.ab4061f3-e62f-4586-9118-7f84f246078d@github.com> Message-ID: On Fri, 4 Jul 2025 08:14:19 GMT, Richard Reingruber wrote: >> This PR adds CompileCommands to the test DumpThreadsWithEliminatedLock.java to force inlining of java/lang/String*.* methods. This will make inlining more stable to allow for the expected lock elimination based on c2 escape analysis. >> >> Forcing inlining of java/lang/StringBuffer.* wasn't sufficient on x86_64. With that the test still failed with TieredCompilation disabled. >> >> Testing: x86_64, ppc64 manually. Other major platforms as part of our CI testing. >> >> Failed inlining on x86_64 with TieredCompilation disabled: >> >> >> make test TEST=com/sun/management/HotSpotDiagnosticMXBean/DumpThreadsWithEliminatedLock.java TEST_VM_OPTS="-XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=PrintInlining,DumpThreadsWithEliminatedLock.*" JTREG=TIMEOUT_FACTOR=0.1 >> >> [...] 
>> >> STDOUT: >> CompileCommand: PrintInlining DumpThreadsWithEliminatedLock.* bool PrintInlining = true >> @ 1 java.util.concurrent.atomic.AtomicBoolean::get (13 bytes) inline (hot) >> @ 11 java.lang.StringBuffer:: (7 bytes) inline (hot) late inline succeeded (string method) >> @ 3 java.lang.AbstractStringBuilder:: (39 bytes) inline (hot) >> @ 1 java.lang.Object:: (1 bytes) inline (hot) >> @ 16 java.lang.System::currentTimeMillis (0 bytes) (intrinsic) >> s @ 19 java.lang.StringBuffer::append (13 bytes) failed to inline: already compiled into a big method >> s @ 24 java.lang.StringBuffer::toString (44 bytes) inline (hot) late inline succeeded (string method) >> s @ 1 java.lang.StringBuffer::length (5 bytes) accessor >> @ 24 java.lang.String:: (98 bytes) failed to inline: already compiled into a big method >> @ 30 java.util.concurrent.atomic.AtomicReference::set (6 bytes) accessor >> 2025-07-02T09:25:53.396634900Z Attempt 1, found: false >> 2025-07-02T09:25:53.415673072Z Attempt 2, found: false >> 2025-07-02T09:25:53.418876867Z Attempt 3, found: false >> >> [...] > > Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: > > Allow vm.debug > About the test and debug mode, we had that kind of conversation in #25958 Windows and Macosx were likely to timeout in debug builds, Linux was OK for me. Not sure if the inlining requests here affect that much, will be interesting to see. You got timeouts likely because the test searched for eliminated locking in the thread dumps in an endless loop but lock elimination has failed (very early) due to unfavorable inlining. Inlining depends on timing because jit compilation runs asynchronously in the background. It affects inlining if a call target is already compiled into a big nmethod (see `failed to inline: already compiled into a big method` above). Calls critical for lock elimination will still be inlined because of the `CompileCommand` inlining requests. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26033#issuecomment-3043813314 From rrich at openjdk.org Mon Jul 7 07:44:40 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Mon, 7 Jul 2025 07:44:40 GMT Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining [v2] In-Reply-To: References: <0p61J0DPfyHsen3r__V82eEZSPYaT9rZleHBtanKaRc=.c5f6992f-a7fe-4c95-bdcb-2887c3dbde21@github.com> Message-ID: On Fri, 4 Jul 2025 08:11:12 GMT, Richard Reingruber wrote: > I've removed the `!vm.debug` requirement. I'll await our local testing of the pr on a wider range of platforms. Local testing was good. 
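As a minimal picture of what the test is looking for (names invented, not the actual test code), a lock that C2 can eliminate once the relevant calls are inlined is synchronization on an object that never escapes the compiled method:

    public class EliminatedLockSketch {
        public static int build(long value) {
            // The StringBuffer stays local to this method. Once its append and
            // toString calls are inlined, escape analysis can prove it does not
            // escape, and C2 may elide the StringBuffer's internal locking,
            // which is the "eliminated lock" the thread dump is expected to show.
            StringBuffer sb = new StringBuffer();
            sb.append(value);
            return sb.toString().length();
        }

        public static void main(String[] args) {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += build(i);   // warm up so build() gets C2-compiled
            }
            System.out.println(sum);
        }
    }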
------------- PR Comment: https://git.openjdk.org/jdk/pull/26033#issuecomment-3043828120 From thartmann at openjdk.org Mon Jul 7 07:45:46 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 7 Jul 2025 07:45:46 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v2] In-Reply-To: <1cFRkcs5JmgnbWEIaEoT8I9RiUtNxgKieAdkSB2Fgmc=.1d97b5c4-b6ef-43c6-b721-1e52eee19d3a@github.com> References: <1cFRkcs5JmgnbWEIaEoT8I9RiUtNxgKieAdkSB2Fgmc=.1d97b5c4-b6ef-43c6-b721-1e52eee19d3a@github.com> Message-ID: On Tue, 1 Jul 2025 16:14:00 GMT, Marc Chevalier wrote: >> When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 >> >> This is enforced by restoring the old state, like in >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 >> >> That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: >> >> ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) >> >> >> Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. >> >> Another situation is somewhat worse, when happening during parsing. It can lead to such cases: >> >> ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) >> >> The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? >> >> This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 >> >> And here there is the challenge: >> - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) >> ... > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Remove useless loop Nice analysis! In general, the fix looks good to me. I added a few comments / suggestions. src/hotspot/share/opto/library_call.cpp line 1732: > 1730: return false; > 1731: } > 1732: destruct_map_clone(old_state.map); I think `destruct_map_clone` could be refactored to take a `SavedState`. 
src/hotspot/share/opto/library_call.cpp line 2376: > 2374: state.map = clone_map(); > 2375: for (DUIterator_Fast imax, i = control()->fast_outs(imax); i < imax; i++) { > 2376: Node* out = control()->fast_out(i); Could we have a similar issue with non-control users? For example, couldn't we also have stray memory users after bailout? src/hotspot/share/opto/library_call.cpp line 2393: > 2391: Node* out = control()->fast_out(i); > 2392: if (out->is_CFG() && out->in(0) == control() && out != map() && !state.ctrl_succ.member(out)) { > 2393: out->set_req(0, C->top()); Could `out` already be in the GVN hash ("remove node from hash table before modifying it")? src/hotspot/share/opto/library_call.hpp line 129: > 127: virtual int reexecute_sp() { return _reexecute_sp; } > 128: > 129: struct SavedState { Please add a comment describing what it's used for. test/hotspot/jtreg/compiler/intrinsics/VectorIntoArrayInvalidControlFlow.java line 2: > 1: /* > 2: * Copyright (c) 2021, Oracle and/or its affiliates. All rights reserved. Suggestion: * Copyright (c) 2025, Oracle and/or its affiliates. All rights reserved. test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java line 104: > 102: public static final String START = "(\\d+(\\s){2}("; > 103: public static final String MID = ".*)+(\\s){2}===.*"; > 104: public static final String END = ")"; I don't like exposing these outside the IR framework but then again I don't really have an idea on how to check the "graph should not have both nodes" invariant. Maybe we should extend the `counts` annotation to support something like `@IR(counts = {IRNode.CallStaticJava, IRNode.OpaqueNotNull, "<= 1"} [...]`? ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25936#pullrequestreview-2992473824 PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2189175998 PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2189211960 PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2189198041 PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2189172691 PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2189212910 PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2189244934 From chagedorn at openjdk.org Mon Jul 7 08:01:45 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 7 Jul 2025 08:01:45 GMT Subject: RFR: 8342941: IGV: Add new graph dumps for post loop, empty loop removal, and one iteration removal [v2] In-Reply-To: References: Message-ID: On Thu, 12 Jun 2025 22:47:43 GMT, Saranya Natarajan wrote: >> This changeset adds BEFORE/AFTER graph dumps for creating a post loop (`insert_post_loop()`), removing an empty loop (`do_remove_empty_loop()`), and removing a one iteration loop (`do_one_iteration_loop()`). >> >> Changes: >> - Added `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` for dumping graphs before and after `insert_post_loop()`. >> - Added `BEFORE_REMOVE_EMPTY_LOOP` and `AFTER_REMOVE_EMPTY_LOOP` for dumping graphs before and after `do_remove_empty_loop()`. >> - Added `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` for dumping graphs before and after `do_one_iteration_loop()`. >> >> Below are sample screenshots (IGV print level 4 ) mainly showing the new phase . >> 1. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` >> ![image](https://github.com/user-attachments/assets/1661cede-5d70-4e0d-abec-3d091c7675c8) >> 2. 
`BEFORE_POST_LOOP` and `AFTER_POST_LOOP` with SuperWordLoopUnrollAnalysis enabled >> ![image](https://github.com/user-attachments/assets/6a22e6f0-4e6c-4e9d-8b6b-2bf75fac783d) >> 3.` BEFORE_REMOVE_EMPTY_LOOP `and `AFTER_REMOVE_EMPTY_LOOP` >> ![image](https://github.com/user-attachments/assets/3281f00b-575e-4604-83dd-831037d8dd47) >> 4. `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` >> ![image](https://github.com/user-attachments/assets/efddbc9a-64f7-403d-acfe-330d75a00911) >> >> Question to reviewers: >> Are the new compiler phases OK, or should we change anything? >> >> Testing: >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. >> Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > Addressing review comments Otherwise, looks good, thanks! You should merge in latest master to resolve the conflicts. src/hotspot/share/opto/phasetype.hpp line 79: > 77: flags(AFTER_REMOVE_EMPTY_LOOP, "After Remove Empty Loops") \ > 78: flags(BEFORE_ONE_ITERATION_LOOP, "Before Replacing One Iteration Loops") \ > 79: flags(AFTER_ONE_INTERATION_LOOP, "After Replacing One Iteration Loops") \ There is a typo for `AFTER_ONE_INTERATION_LOOP` -> `ITERATION` Nit: We only apply it for one loop and thus you can remove the trailing `s`. Suggestion: flags(BEFORE_POST_LOOP, "Before Post Loop") \ flags(AFTER_POST_LOOP, "After Post Loop") \ flags(BEFORE_REMOVE_EMPTY_LOOP, "Before Remove Empty Loop") \ flags(AFTER_REMOVE_EMPTY_LOOP, "After Remove Empty Loop") \ flags(BEFORE_ONE_ITERATION_LOOP, "Before Replacing One Iteration Loop") \ flags(AFTER_ONE_ITERATION_LOOP, "After Replacing One Iteration Loop") \ ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25756#pullrequestreview-2992625351 PR Review Comment: https://git.openjdk.org/jdk/pull/25756#discussion_r2189277869 From duke at openjdk.org Mon Jul 7 08:19:42 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Mon, 7 Jul 2025 08:19:42 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v8] In-Reply-To: <5e1o1xtN0ZdQZGJi2aVmgCEApW625koeE9F53VhDi5E=.2390045d-844e-4800-8d4b-075a2a3a8793@github.com> References: <5e1o1xtN0ZdQZGJi2aVmgCEApW625koeE9F53VhDi5E=.2390045d-844e-4800-8d4b-075a2a3a8793@github.com> Message-ID: On Mon, 5 May 2025 10:17:27 GMT, Yuri Gaevsky wrote: >> The patch adds possibility to use RVV instructions for faster vectorizedHashCode calculations on RVV v1.0.0 capable hardware. >> >> Testing: hotspot/jtreg/compiler/ under QEMU-8.1 with RVV v1.0.0. > > Yuri Gaevsky has updated the pull request incrementally with one additional commit since the last revision: > > change slli+add sequence to shadd . ------------- PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-3043933851 From dzhang at openjdk.org Mon Jul 7 08:28:41 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 7 Jul 2025 08:28:41 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v2] In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 03:05:25 GMT, Dingli Zhang wrote: >> Hi, please consider this code cleanup change for native call. >> >> This removes the address parameter for NativeCall::reloc_destination and NativeFarCall::reloc_destination. 
>> This also removes several unnecessary code blob related runtime checks turning them into assertions. >> >> ### Testing >> * [x] hs:tier1 - hs:tier3 tested with linux-riscv64 fastdebug build > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Remove outdated comments Hi @robehn , could you help to review this patch? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26150#issuecomment-3043958394 From bkilambi at openjdk.org Mon Jul 7 08:31:48 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 7 Jul 2025 08:31:48 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v10] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 08:26:00 GMT, Hao Sun wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Addressed review comments > > src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 5159: > >> 5157: // consecutive. The match rules for SelectFromTwoVector reserve two consecutive vector registers >> 5158: // for src1 and src2. >> 5159: // Four combinations of vector registers each for vselect_from_two_vectors_HS_Neon and > > I suppose the function names are changed now. Should use `select_from_two_vectors_Neon` and `select_from_two_vectors_SVE` instead. Hi @shqking , the match rule names still begin with `vselect_from_two_vectors_Neon_*`. `select_from_two_vectors_Neon` and `select_from_two_vectors_SVE` are routines in the MacroAssembler. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2189356707 From thartmann at openjdk.org Mon Jul 7 08:35:50 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 7 Jul 2025 08:35:50 GMT Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v3] In-Reply-To: References: Message-ID: On Mon, 23 Jun 2025 12:39:23 GMT, Marc Chevalier wrote: >> A first part toward a better support of pure functions, but this time, with guidance from @iwanowww. >> >> ## Pure Functions >> >> Pure functions (considered here) are functions that have no side effects, no effect on the control flow (no exception or such), cannot deopt etc.. It's really a function that you can execute anywhere, with whichever arguments without effect other than wasting time. Integer division is not pure as dividing by zero is throwing. But many floating point functions will just return `NaN` or `+/-infinity` in problematic cases. >> >> ## Scope >> >> We are not going all powerful for now! It's mostly about identifying some pure functions and being able to remove them if the result is unused. Some other things are not part of this PR, on purpose. Especially, this PR doesn't propose a way to move pure calls around. The reason is that pure calls are later expanded into regular calls, which require a control input. To be able to do the expansion, we just keep the control in the pure call as well. >> >> ## Implementation Overview >> >> We created here some new node kind for pure calls, inheriting leaf calls, that are expanded into regular leaf calls during final graph reshaping. The possibility to support pure call directly in AD file is left open. >> >> This PR also introduces `TupleNode` (largely based on an original idea/implem of @iwanowww), that just tie multiple input together and play well with `ProjNode`: the n-th projection of a `TupleNode` is the n-th input of the tuple. 
This is a convenient way to skip and remove nodes from the graph while delegating the difficulty of the surgery to the trusted IGVN's implementation. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > mostly comments Thanks for digging into this Marc. The changes look good to me. I just added a few minor comments / questions. > this PR doesn't propose a way to move pure calls around Should we have a separate RFE for that? src/hotspot/share/opto/divnode.cpp line 1522: > 1520: Node* super = CallLeafPureNode::Ideal(phase, can_reshape); > 1521: if (super != nullptr) { > 1522: return super; Can't we just do `return CallLeafPureNode::Ideal(phase, can_reshape);` at the end of `ModFNode::Ideal` instead of `return nullptr`? That's what we usually do in C2, for example in `CallStaticJavaNode::Ideal` -> `CallNode::Ideal`. Feels more natural to me and would avoid the `super != nullptr` check. Also for the other `Ideal` methods that you modified. src/hotspot/share/opto/divnode.cpp line 1528: > 1526: bool not_dead = proj_out_or_null(TypeFunc::Control) != nullptr; > 1527: if (result_is_unused && not_dead) { > 1528: return replace_with_con(igvn, TypeF::make(0.)); Can we replace all the other usages of `ModFloatingNode::replace_with_con` by `TupleNode` and get rid of that method? src/hotspot/share/opto/graphKit.cpp line 1916: > 1914: if (call->is_CallLeafPure()) { > 1915: // Pure function have only control (for now) and data output, in particular > 1916: // the don't touch the memory, so we don't want a memory proj that is set after. Suggestion: // they don't touch the memory, so we don't want a memory proj that is set after. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25760#pullrequestreview-2992602888 PR Review Comment: https://git.openjdk.org/jdk/pull/25760#discussion_r2189268498 PR Review Comment: https://git.openjdk.org/jdk/pull/25760#discussion_r2189357774 PR Review Comment: https://git.openjdk.org/jdk/pull/25760#discussion_r2189261824 From mdoerr at openjdk.org Mon Jul 7 08:39:44 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Mon, 7 Jul 2025 08:39:44 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v6] In-Reply-To: References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: <0Y-EMD27kUQdvljb8SUfAX09BZmFrhUAEMpA6aHsiEI=.b21bf4dd-ca36-405c-a8ab-044c0cb35749@github.com> On Mon, 7 Jul 2025 07:30:19 GMT, David Briemann wrote: >> Implement more nodes for ppc that exist on other platforms. > > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > rename instruction, add extra predicate cond for type int Thanks! ------------- Marked as reviewed by mdoerr (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26115#pullrequestreview-2992762483 From bkilambi at openjdk.org Mon Jul 7 08:40:42 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 7 Jul 2025 08:40:42 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v10] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 10:04:26 GMT, Jatin Bhateja wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Addressed review comments > > test/hotspot/jtreg/compiler/vectorapi/TestSelectFromTwoVectorOp.java line 234: > >> 232: >> 233: @Test >> 234: @IR(counts = {IRNode.SELECT_FROM_TWO_VECTOR_VS, IRNode.VECTOR_SIZE_8, ">0"}, > > Hi @Bhavana-Kilambi , > Kindly also include x86-specific feature checks in IR rules for this test. > > You can directly integrate attached patch. > > [select_from_ir_feature.txt](https://github.com/user-attachments/files/21034639/select_from_ir_feature.txt) Thank you @jatin-bhateja . Will do that in my next patch with other changes. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2189374185 From shade at openjdk.org Mon Jul 7 09:07:32 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 7 Jul 2025 09:07:32 GMT Subject: RFR: 8361397: Rework CompileLog list synchronization [v2] In-Reply-To: <12Yp6QmpXqG-1UXTS8VveJ4yDNnDEGFV2q3_vRc3lF0=.4ccf05e2-9249-4b55-b48f-4f7fc17bef65@github.com> References: <12Yp6QmpXqG-1UXTS8VveJ4yDNnDEGFV2q3_vRc3lF0=.4ccf05e2-9249-4b55-b48f-4f7fc17bef65@github.com> Message-ID: > I want to remove `CompileTaskAlloc_lock` completely with [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473), and for that we need to fix a stray use of that lock in CompileLog list linkage. We can rewrite that part to atomics. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `compiler` Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains two additional commits since the last revision: - Merge branch 'master' into JDK-8361397-compilelog-list - Fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26127/files - new: https://git.openjdk.org/jdk/pull/26127/files/4df91936..3eec06a4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26127&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26127&range=00-01 Stats: 1617 lines in 64 files changed: 790 ins; 601 del; 226 mod Patch: https://git.openjdk.org/jdk/pull/26127.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26127/head:pull/26127 PR: https://git.openjdk.org/jdk/pull/26127 From jbhateja at openjdk.org Mon Jul 7 09:07:40 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 7 Jul 2025 09:07:40 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: <3cr8Njt2flaQXy5sjOVOlhI9XDkEesagnYLwzCmgkoI=.089494aa-d622-47db-8d23-c9637519028c@github.com> References: <3cr8Njt2flaQXy5sjOVOlhI9XDkEesagnYLwzCmgkoI=.089494aa-d622-47db-8d23-c9637519028c@github.com> Message-ID: On Mon, 7 Jul 2025 03:43:44 GMT, erifan wrote: > > > > public static final VectorSpecies FSP = FloatVector.SPECIES_512; > > > > public static long micro1(long a) { > > > > long mask = Math.min(-1, Math.max(-1, a)); > > > > return VectorMask.fromLong(FSP, mask).toLong(); > > > > } > > > > public static long micro2() { > > > > return FSP.maskAll(true).toLong(); > > > > } > > > > > > > > > With this JMH method we can not see obvious performance improvement, because the hot spots are other instructions. Adding a loop is better. > > > > > > There is no hard and fast rule for the inclusion of a loop in a JMH micro in that case? > > You mean adding a loop is not a block, right ? Yes. If you see gains without loop go for it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3044086773 From jbhateja at openjdk.org Mon Jul 7 09:11:40 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 7 Jul 2025 09:11:40 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: References: <_RERljqu_FG7ZyneAk7Thd-9TwED18pQpEBz_i105fY=.b8948a23-273a-49f6-b9cb-6b611a5eedc6@github.com> <6SXA9ZrXBDhZLyXP3lXbkpl4dl3iocvDpzPrUpIQOl8=.9b025be2-848b-4b78-a5e4-929cb7e9f798@github.com> Message-ID: On Mon, 7 Jul 2025 03:42:27 GMT, erifan wrote: > > What if during iterative GVN a constant -1 seeps through IR graph and gets connected to the input of VectorLongToMaskNode, you won't be able to create maskAll true in that case? > > Yes, this PR doesn't support this case. Maybe we should do this optimization in `ideal`. If `VectorLongToMask` is not supported, then try to convert it to `maskAll` or `Replicate` in intrinsic. > I would suggest extending VectorLongToMaskNode::Ideal for completeness of the solution. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2189445358 From duke at openjdk.org Mon Jul 7 09:35:47 2025 From: duke at openjdk.org (erifan) Date: Mon, 7 Jul 2025 09:35:47 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v9] In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 06:19:15 GMT, Emanuel Peter wrote: >> erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 15 additional commits since the last revision: >> >> - Address more comments >> >> ATT. >> - Merge branch 'master' into JDK-8354242 >> - Support negating unsigned comparison for BoolTest::mask >> >> Added a static method `negate_mask(mask btm)` into BoolTest class to >> negate both signed and unsigned comparison. >> - Addressed some review comments >> - Merge branch 'master' into JDK-8354242 >> - Refactor the JTReg tests for compare.xor(maskAll) >> >> Also made a bit change to support pattern `VectorMask.fromLong()`. >> - Merge branch 'master' into JDK-8354242 >> - Refactor code >> >> Add a new function XorVNode::Ideal_XorV_VectorMaskCmp to do this >> optimization, making the code more modular. >> - Merge branch 'master' into JDK-8354242 >> - Update the jtreg test >> - ... and 5 more: https://git.openjdk.org/jdk/compare/8e600a2f...5ebdc572 > > src/hotspot/share/opto/vectornode.cpp line 2243: > >> 2241: !VectorNode::is_all_ones_vector(in2)) { >> 2242: return nullptr; >> 2243: } > > Suggestion: > > if (in1->Opcode() != Op_VectorMaskCmp || > in1->outcnt() != 1 || > !(in1->as_VectorMaskCmp())->predicate_can_be_negated() || > !VectorNode::is_all_ones_vector(in2)) { > return nullptr; > } > > Indentation for clarity. > > Currently, you exiting if one of these is the case: > 1. Not `MaskCmp` > 2. More than one use > 3. predicate cannot be negated AND the vector is all ones. Can you explain this condition? Do you have tests for cases: > - predicate negatable and vector not all ones > - predircate not negatable and vector not all ones > - predicate negatable and vector all ones > - predicate not negatable and vectors all ones > > Why do you guard against `VectorNode::is_all_ones_vector(in2)` at all? > > The condition for 3. is easy to get wrong, so good testing is important here :) The current testing status for the conditions you listed: > 1. Not MaskCmp. **No test for it, tested locally**, Because I think this condition is too straightforward. > 2. More than one use. **Tested**, see `VectorMaskCompareNotTest.java line 1118`. > predicate negatable and vector not all ones. **Tested**, see `VectorMaskCompareNotTest.java line 1126`. > predicate not negatable and vector not all ones. **No test for it**, because we have tests for `predicate not negatable` or `vector not all ones`. If either is `false`, return nullptr. > predicate negatable and vector all ones. **A lot of tests for it**. For example `VectorMaskCompareNotTest.java line 1014`. > predicate not negatable and vectors all ones. **Tested**, see `VectorMaskCompareNotTest.java line 1222`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2189495935 From bkilambi at openjdk.org Mon Jul 7 10:27:39 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 7 Jul 2025 10:27:39 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> Message-ID: On Mon, 7 Jul 2025 06:59:20 GMT, Xiaohong Gong wrote: >> Have you measured the performance of this micro-benchmark on NEON machine? 
>> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java#L251-L256 >> >> We added an limitation only for `int` before: >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L131-L134 >> >> Perhaps we also need to impose a similar limitation on `short` if the same regression occurs. > >> Have you measured the performance of this micro-benchmark on NEON machine? >> >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java#L251-L256 >> >> We added an limitation only for `int` before: >> >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L131-L134 >> >> Perhaps we also need to impose a similar limitation on `short` if the same regression occurs. > > Good catch, and thanks so much for your input @fg1417 ! I will test the performance and disable auto-vectorization for double to short casting if the performance has regression. > >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 > > Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. Hi @XiaohongGong, is there any way we can implement 2HF -> 2S and 2S -> 2HF in these match rules ? https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4697 https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4679 The `fcvtn` and `fcvtl` instructions do not support these arrangements. I was wondering if there is any other way we can implement these by any chance? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3044358446 From mhaessig at openjdk.org Mon Jul 7 10:29:54 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 7 Jul 2025 10:29:54 GMT Subject: RFR: 8360175: C2 crash: assert(edge_from_to(prior_use,n)) failed: before block local scheduling Message-ID: The triggered assert is part of the schedule verification code that runs just before machine code is emitted. The debug output showed that a `leaPCompressedOopOffset` node was causing the assert, which suggested the peephole optimization introduced in #25471 as the cause. The failure proved quite difficult to reproduce. It failed more often on Windows and required `-XX:+UseKNLSetting` (forces code generation for Intel's Knights Landing platform), which forces `-XX:+OptoScheduling`. The root-cause is a subtle bug in the rewiring of the base edge of `leaP*` nodes in the `remove_redundant_lea` peephole. When the peephole removed a `decodeHeapOop_not_null` including a spill, it did not set the base edge of the `leaP*` node to the same node as the address edge, which is the intent of the peephole, but to the parent node of the spill. That is not catastrophic in most cases, but might reference another register slot, which causes this assert. 
Concretely, we see the following graph

    MemToRegSpillCopy
     |   |
     |   MemToRegSpillCopy
     |   |
     |   DefinitionSpillCopy
     |   |
     |   decodeHeapOop_not_null
     |   |
    leaPCompressedHeapOop

gets rewired to

    MemToRegSpillCopy
     |   |
     |   DefinitionSpillCopy
     |   |
    leaPCompressedHeapOop

instead of

    MemToRegSpillCopy
         |
    DefinitionSpillCopy
        /   \
    leaPCompressedHeapOop

This PR fixes this by always setting the base edge of the `leaP*` node to the same node as the address edge. Unfortunately, I was not able to construct a regression test because of the difficulty of reproducing the bug.

# Testing
- [ ] Github Actions
- [x] tier1,tier2 plus internal testing on all Oracle supported platforms
- [x] tier3,tier4,tier5 plus internal testing on Linux and Windows x64
- [ ] Runthese8H on `windows-x64-debug` (test that reliably produced the failure addressed in this PR)
PR Review: https://git.openjdk.org/jdk/pull/26046#pullrequestreview-2993334779 From chagedorn at openjdk.org Mon Jul 7 12:16:40 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 7 Jul 2025 12:16:40 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp In-Reply-To: References: Message-ID: On Fri, 4 Jul 2025 02:49:27 GMT, guanqiang han wrote: > When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. > > This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this flag is disabled. Changes requested by chagedorn (Reviewer). src/hotspot/share/opto/escape.cpp line 981: > 979: if (!OptimizePtrCompare) { > 980: return; > 981: } Thanks for working on this! IIUC, having the bailout here will fail to reduce the phi which could be unexpected. Shouldn't we just return `UNKNOWN` from within `ConnectionGraph::optimize_ptr_compare()` when we run without `OptimizePtrCompare`? On a separate note, can you also add a regression test? Maybe you can also just add a run with `-XX:-OptimizePtrCompare` - maybe together with `-XX:+VerifyReduceAllocationMerges` for more verification - to `compiler/c2/irTests/scalarReplacement/AllocationMergesTests.java`. @JohnTortugo you might also want to have a look at this. ------------- PR Review: https://git.openjdk.org/jdk/pull/26125#pullrequestreview-2993546963 PR Review Comment: https://git.openjdk.org/jdk/pull/26125#discussion_r2189878464 From duke at openjdk.org Mon Jul 7 13:07:55 2025 From: duke at openjdk.org (Andrej =?UTF-8?B?UGXEjWltw7p0aA==?=) Date: Mon, 7 Jul 2025 13:07:55 GMT Subject: RFR: 8357689: Refactor JVMCI to enable replay compilation in Graal Message-ID: This PR introduces a few minor JVMCI refactorings to make replay compilation possible in the Graal compiler. ------------- Commit messages: - Remove an unused import. - Use a pattern variable. - use list when applicable - JVMCI refactorings to enable replay compilation in Graal. Changes: https://git.openjdk.org/jdk/pull/25433/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=25433&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8357689 Stats: 434 lines in 21 files changed: 310 ins; 8 del; 116 mod Patch: https://git.openjdk.org/jdk/pull/25433.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25433/head:pull/25433 PR: https://git.openjdk.org/jdk/pull/25433 From duke at openjdk.org Mon Jul 7 13:07:56 2025 From: duke at openjdk.org (Andrej =?UTF-8?B?UGXEjWltw7p0aA==?=) Date: Mon, 7 Jul 2025 13:07:56 GMT Subject: RFR: 8357689: Refactor JVMCI to enable replay compilation in Graal In-Reply-To: References: Message-ID: On Tue, 27 May 2025 12:10:29 GMT, Doug Simon wrote: >> This PR introduces a few minor JVMCI refactorings to make replay compilation possible in the Graal compiler. 
> > src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/code/BytecodeFrame.java line 397: > >> 395: * @return a copy of the slot kinds array >> 396: */ >> 397: public JavaKind[] getSlotKinds() { > > Keep in mind that `slotKinds` is being [converted](https://github.com/openjdk/jdk/pull/25442/files#diff-43834727ed7dcd5128c10238ba56963c7d8feb66578b036c75dcf734bfa2ec92R80) to a List. In that context, can we return the list without making a copy? Or is the caller expected to be able to mutate the return value? Applied the changes provided by @mur47x111. > src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/code/VirtualObject.java line 325: > >> 323: * values have not been initialized. >> 324: */ >> 325: public JavaKind[] getSlotKinds() { > > Same comments as for BytecodeFrame.getSlotKinds. This applies to all other non-primitive array return values added by this PR. I left `Object[]` as the return value of `EncodedSpeculationReason#getReason` since the array could contain null elements, preventing the use of an immutable list like everywhere else. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25433#discussion_r2189993024 PR Review Comment: https://git.openjdk.org/jdk/pull/25433#discussion_r2190003945 From dnsimon at openjdk.org Mon Jul 7 13:07:56 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 7 Jul 2025 13:07:56 GMT Subject: RFR: 8357689: Refactor JVMCI to enable replay compilation in Graal In-Reply-To: References: Message-ID: On Sat, 24 May 2025 16:49:23 GMT, Andrej Pe?im?th wrote: > This PR introduces a few minor JVMCI refactorings to make replay compilation possible in the Graal compiler. src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/code/BytecodeFrame.java line 397: > 395: * @return a copy of the slot kinds array > 396: */ > 397: public JavaKind[] getSlotKinds() { Keep in mind that `slotKinds` is being [converted](https://github.com/openjdk/jdk/pull/25442/files#diff-43834727ed7dcd5128c10238ba56963c7d8feb66578b036c75dcf734bfa2ec92R80) to a List. In that context, can we return the list without making a copy? Or is the caller expected to be able to mutate the return value? src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/code/VirtualObject.java line 293: > 291: return true; > 292: } > 293: if (o instanceof VirtualObject) { Rename `l` to `that` and use pattern instanceof: Suggestion: if (o instanceof VirtualObject that) { src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/code/VirtualObject.java line 325: > 323: * values have not been initialized. > 324: */ > 325: public JavaKind[] getSlotKinds() { Same comments as for BytecodeFrame.getSlotKinds. This applies to all other non-primitive array return values added by this PR. src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotCompiledCode.java line 225: > 223: * Returns a copy of the array of {@link ResolvedJavaMethod} objects representing the methods > 224: * whose bytecodes were used as input to the compilation. If the compilation did not record > 225: * method dependencies, this method returns {@code null}. Otherwise, the first element of the null -> empty list? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25433#discussion_r2109017920 PR Review Comment: https://git.openjdk.org/jdk/pull/25433#discussion_r2109025913 PR Review Comment: https://git.openjdk.org/jdk/pull/25433#discussion_r2109034172 PR Review Comment: https://git.openjdk.org/jdk/pull/25433#discussion_r2109040017 From rrich at openjdk.org Mon Jul 7 13:23:46 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Mon, 7 Jul 2025 13:23:46 GMT Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining [v2] In-Reply-To: <5IVOmLIL__iwmhFaKEP80YNP8kIN94owhC3SIU8ZF4U=.ab4061f3-e62f-4586-9118-7f84f246078d@github.com> References: <5IVOmLIL__iwmhFaKEP80YNP8kIN94owhC3SIU8ZF4U=.ab4061f3-e62f-4586-9118-7f84f246078d@github.com> Message-ID: On Fri, 4 Jul 2025 08:14:19 GMT, Richard Reingruber wrote: >> This PR adds CompileCommands to the test DumpThreadsWithEliminatedLock.java to force inlining of java/lang/String*.* methods. This will make inlining more stable to allow for the expected lock elimination based on c2 escape analysis. >> >> Forcing inlining of java/lang/StringBuffer.* wasn't sufficient on x86_64. With that the test still failed with TieredCompilation disabled. >> >> Testing: x86_64, ppc64 manually. Other major platforms as part of our CI testing. >> >> Failed inlining on x86_64 with TieredCompilation disabled: >> >> >> make test TEST=com/sun/management/HotSpotDiagnosticMXBean/DumpThreadsWithEliminatedLock.java TEST_VM_OPTS="-XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=PrintInlining,DumpThreadsWithEliminatedLock.*" JTREG=TIMEOUT_FACTOR=0.1 >> >> [...] >> >> STDOUT: >> CompileCommand: PrintInlining DumpThreadsWithEliminatedLock.* bool PrintInlining = true >> @ 1 java.util.concurrent.atomic.AtomicBoolean::get (13 bytes) inline (hot) >> @ 11 java.lang.StringBuffer:: (7 bytes) inline (hot) late inline succeeded (string method) >> @ 3 java.lang.AbstractStringBuilder:: (39 bytes) inline (hot) >> @ 1 java.lang.Object:: (1 bytes) inline (hot) >> @ 16 java.lang.System::currentTimeMillis (0 bytes) (intrinsic) >> s @ 19 java.lang.StringBuffer::append (13 bytes) failed to inline: already compiled into a big method >> s @ 24 java.lang.StringBuffer::toString (44 bytes) inline (hot) late inline succeeded (string method) >> s @ 1 java.lang.StringBuffer::length (5 bytes) accessor >> @ 24 java.lang.String:: (98 bytes) failed to inline: already compiled into a big method >> @ 30 java.util.concurrent.atomic.AtomicReference::set (6 bytes) accessor >> 2025-07-02T09:25:53.396634900Z Attempt 1, found: false >> 2025-07-02T09:25:53.415673072Z Attempt 2, found: false >> 2025-07-02T09:25:53.418876867Z Attempt 3, found: false >> >> [...] > > Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: > > Allow vm.debug Thanks for the reviews and feedback. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26033#issuecomment-3045085386 From rrich at openjdk.org Mon Jul 7 13:23:47 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Mon, 7 Jul 2025 13:23:47 GMT Subject: Integrated: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining In-Reply-To: References: Message-ID: On Sun, 29 Jun 2025 15:26:14 GMT, Richard Reingruber wrote: > This PR adds CompileCommands to the test DumpThreadsWithEliminatedLock.java to force inlining of java/lang/String*.* methods. 
This will make inlining more stable to allow for the expected lock elimination based on c2 escape analysis. > > Forcing inlining of java/lang/StringBuffer.* wasn't sufficient on x86_64. With that the test still failed with TieredCompilation disabled. > > Testing: x86_64, ppc64 manually. Other major platforms as part of our CI testing. > > Failed inlining on x86_64 with TieredCompilation disabled: > > > make test TEST=com/sun/management/HotSpotDiagnosticMXBean/DumpThreadsWithEliminatedLock.java TEST_VM_OPTS="-XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=PrintInlining,DumpThreadsWithEliminatedLock.*" JTREG=TIMEOUT_FACTOR=0.1 > > [...] > > STDOUT: > CompileCommand: PrintInlining DumpThreadsWithEliminatedLock.* bool PrintInlining = true > @ 1 java.util.concurrent.atomic.AtomicBoolean::get (13 bytes) inline (hot) > @ 11 java.lang.StringBuffer:: (7 bytes) inline (hot) late inline succeeded (string method) > @ 3 java.lang.AbstractStringBuilder:: (39 bytes) inline (hot) > @ 1 java.lang.Object:: (1 bytes) inline (hot) > @ 16 java.lang.System::currentTimeMillis (0 bytes) (intrinsic) > s @ 19 java.lang.StringBuffer::append (13 bytes) failed to inline: already compiled into a big method > s @ 24 java.lang.StringBuffer::toString (44 bytes) inline (hot) late inline succeeded (string method) > s @ 1 java.lang.StringBuffer::length (5 bytes) accessor > @ 24 java.lang.String:: (98 bytes) failed to inline: already compiled into a big method > @ 30 java.util.concurrent.atomic.AtomicReference::set (6 bytes) accessor > 2025-07-02T09:25:53.396634900Z Attempt 1, found: false > 2025-07-02T09:25:53.415673072Z Attempt 2, found: false > 2025-07-02T09:25:53.418876867Z Attempt 3, found: false > > [...] This pull request has now been integrated. Changeset: fea73c1d Author: Richard Reingruber URL: https://git.openjdk.org/jdk/commit/fea73c1d40441561a246f2a09a739367cfc197ea Stats: 5 lines in 1 file changed: 0 ins; 0 del; 5 mod 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining Reviewed-by: alanb, mdoerr, lmesnik ------------- PR: https://git.openjdk.org/jdk/pull/26033 From fgao at openjdk.org Mon Jul 7 13:27:40 2025 From: fgao at openjdk.org (Fei Gao) Date: Mon, 7 Jul 2025 13:27:40 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> Message-ID: On Mon, 7 Jul 2025 06:59:20 GMT, Xiaohong Gong wrote: > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 > > Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. Hi @XiaohongGong thanks for asking. In many auto-vectorization cases involving `char`, the vector elements are represented using `T_SHORT` as the `BasicType`, rather than `T_CHAR`. 
This is because, in Java, operands of subword types are always promoted to `int` before any arithmetic operation. As a result, when handling a node like `ConvD2I`, we don?t initially know its actual subword type. Later, the SuperWord phase propagates a narrowed integer type backward to help determine the correct subword type. See: https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2551-L2558 Since SuperWord assigns `T_SHORT` to `StoreC` early on https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2646-L2650 the entire propagation chain tends to use `T_SHORT` as well. This applies to most operations, with the exception of a few like `RShiftI`, `Abs`, and `ReverseBytesI`, which are handled separately. So your change already benefits many char-related vectorization cases like `convertDoubleToChar` above. That?s why we can safely relax the IR condition mentioned earlier. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3045113900 From chagedorn at openjdk.org Mon Jul 7 13:38:55 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 7 Jul 2025 13:38:55 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v34] In-Reply-To: References: Message-ID: On Thu, 5 Jun 2025 08:27:47 GMT, Roland Westrelin wrote: >> To optimize a long counted loop and long range checks in a long or int >> counted loop, the loop is turned into a loop nest. When the loop has >> few iterations, the overhead of having an outer loop whose backedge is >> never taken, has a measurable cost. Furthermore, creating the loop >> nest usually causes one iteration of the loop to be peeled so >> predicates can be set up. If the loop is short running, then it's an >> extra iteration that's run with range checks (compared to an int >> counted loop with int range checks). >> >> This change doesn't create a loop nest when: >> >> 1- it can be determined statically at loop nest creation time that the >> loop runs for a short enough number of iterations >> >> 2- profiling reports that the loop runs for no more than ShortLoopIter >> iterations (1000 by default). >> >> For 2-, a guard is added which is implemented as yet another predicate. >> >> While this change is in principle simple, I ran into a few >> implementation issues: >> >> - while c2 has a way to compute the number of iterations of an int >> counted loop, it doesn't have that for long counted loop. The >> existing logic for int counted loops promotes values to long to >> avoid overflows. I reworked it so it now works for both long and int >> counted loops. >> >> - I added a new deoptimization reason (Reason_short_running_loop) for >> the new predicate. Given the number of iterations is narrowed down >> by the predicate, the limit of the loop after transformation is a >> cast node that's control dependent on the short running loop >> predicate. Because once the counted loop is transformed, it is >> likely that range check predicates will be inserted and they will >> depend on the limit, the short running loop predicate has to be the >> one that's further away from the loop entry. 
Now it is also possible >> that the limit before transformation depends on a predicate >> (TestShortRunningLongCountedLoopPredicatesClone is an example), we >> can have: new predicates inserted after the transformation that >> depend on the casted limit that itself depend on old predicates >> added before the transformation. To solve this cicular dependency, >> parse and assert predicates are cloned between the old predicates >> and the loop head. The cloned short running loop parse predicate is >> the one that's used to insert the short running loop predicate. >> >> - In the case of a long counted loop, the loop is transformed into a >> regular loop with a ... > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 94 commits: > > - small fix > - Merge branch 'master' into JDK-8342692 > - review > - review > - Update test/micro/org/openjdk/bench/java/lang/foreign/HeapMismatchManualLoopTest.java > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningLongCountedLoopScaleOverflow.java > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningLongCountedLoopPredicatesClone.java > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningLongCountedLoop.java > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningIntLoopWithLongChecksPredicates.java > > Co-authored-by: Christian Hagedorn > - Update src/hotspot/share/opto/loopnode.cpp > > Co-authored-by: Christian Hagedorn > - ... and 84 more: https://git.openjdk.org/jdk/compare/faf19abd...fd19ee84 I quickly ran through Emanuel's review comments. I think they all have been addressed. Added some follow-up suggestions on top but otherwise, it still looks good to me. I guess since Emanuel is short on time, it would be good to have another review from someone. src/hotspot/share/opto/c2_globals.hpp line 868: > 866: product(bool, ShortRunningLongLoop, true, DIAGNOSTIC, \ > 867: "long counted loop/long range checks: don't create loop nest if " \ > 868: "loop runs for small enough number of iterations. Long loop is" \ Suggestion: "loop runs for small enough number of iterations. Long loop is " \ src/hotspot/share/opto/loopnode.cpp line 1125: > 1123: } > 1124: > 1125: class NodeInSingleLoopBody : public NodeInLoopBody { After the suggested update by Emanuel to have a more general name, you could also consider moving it to `predicates.hpp` to the other implementing classes of `NodeInLoopBody`. src/hotspot/share/opto/loopnode.cpp line 1147: > 1145: CloneShortLoopPredicateVisitor(LoopNode* target_loop_head, > 1146: const NodeInSingleLoopBody& node_in_loop_body, > 1147: PhaseIdealLoop* phase) Suggestion: CloneShortLoopPredicateVisitor(LoopNode* target_loop_head, const NodeInSingleLoopBody& node_in_loop_body, PhaseIdealLoop* phase) test/hotspot/jtreg/compiler/longcountedloops/TestStressShortRunningLongCountedLoop.java line 35: > 33: * @build jdk.test.whitebox.WhiteBox > 34: * @run driver jdk.test.lib.helpers.ClassFileInstaller jdk.test.whitebox.WhiteBox > 35: * @run main/othervm -Xbootclasspath/a:. -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI compiler.longcountedloops.TestStressShortRunningLongCountedLoop Probably a copy-paste error: You do not seem to be using the WhiteBox API for this test. 
You can then just use * @run driver compiler.longcountedloops.TestStressShortRunningLongCountedLoop instead. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/21630#pullrequestreview-2993768522 PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2190022389 PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2190056593 PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2190094320 PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2190128934 From eastigeevich at openjdk.org Mon Jul 7 15:07:55 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Mon, 7 Jul 2025 15:07:55 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v2] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 17:11:19 GMT, Aleksey Shipilev wrote: >> Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: >> >> Simplify requirement for debug build > > OK, are you able to bisect which change? This fix to only do debug VM needs to be correctly linked to the actual cause, IMO. @shipilev, @theRealAph Any comments on the new version? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3045552156 From fjiang at openjdk.org Mon Jul 7 15:08:40 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Mon, 7 Jul 2025 15:08:40 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v2] In-Reply-To: <7KGEqkzMGveZ_lLtIcC0YwwHqmUri7L3_v7J6aVLmQM=.089fc97c-f09a-4220-87cc-a30d6dd10536@github.com> References: <7KGEqkzMGveZ_lLtIcC0YwwHqmUri7L3_v7J6aVLmQM=.089fc97c-f09a-4220-87cc-a30d6dd10536@github.com> Message-ID: On Sun, 6 Jul 2025 13:18:06 GMT, Feilong Jiang wrote: >> src/hotspot/share/c1/c1_Compiler.cpp line 240: >> >>> 238: #endif >>> 239: case vmIntrinsics::_getObjectSize: >>> 240: #if defined(X86) || defined(AARCH64) || defined(S390) || defined(RISCV64) || defined(PPC64) >> >> PS: The change of macro `RISCV` seems unrelated to this PR? Seem better to go with another PR. > > Reverted. Here is the seperate PR: https://github.com/openjdk/jdk/pull/26161 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2190358908 From fjiang at openjdk.org Mon Jul 7 15:09:18 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Mon, 7 Jul 2025 15:09:18 GMT Subject: RFR: 8361504: RISC-V: Make C1 clone intrinsic platform guard more specific Message-ID: <16DUz5Iytmw9i7wAxTx_oU4eeJBCsOI_15qzFP6M4GU=.8a5305a3-2b8d-4b9d-957a-430600bff4b4@github.com> Hi all. Please review this trivial patch, which changes the C1 primitive array clone intrinsic RISCV platform guard into RISCV64. As we only support RISCV64 for now. 
------------- Commit messages: - RISC-V: Make C1 clone intrinsic macro guard more accurate Changes: https://git.openjdk.org/jdk/pull/26161/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26161&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361504 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26161.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26161/head:pull/26161 PR: https://git.openjdk.org/jdk/pull/26161 From shade at openjdk.org Mon Jul 7 15:19:38 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 7 Jul 2025 15:19:38 GMT Subject: RFR: 8361397: Rework CompileLog list synchronization [v2] In-Reply-To: References: <12Yp6QmpXqG-1UXTS8VveJ4yDNnDEGFV2q3_vRc3lF0=.4ccf05e2-9249-4b55-b48f-4f7fc17bef65@github.com> Message-ID: On Mon, 7 Jul 2025 09:07:32 GMT, Aleksey Shipilev wrote: >> I want to remove `CompileTaskAlloc_lock` completely with [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473), and for that we need to fix a stray use of that lock in CompileLog list linkage. We can rewrite that part to atomics. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `compiler` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' into JDK-8361397-compilelog-list > - Fix Still looking for reviewers :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/26127#issuecomment-3045590870 From duke at openjdk.org Mon Jul 7 15:41:41 2025 From: duke at openjdk.org (guanqiang han) Date: Mon, 7 Jul 2025 15:41:41 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 12:11:46 GMT, Christian Hagedorn wrote: >> When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. >> >> This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this flag is disabled. > > src/hotspot/share/opto/escape.cpp line 981: > >> 979: if (!OptimizePtrCompare) { >> 980: return; >> 981: } > > Thanks for working on this! IIUC, having the bailout here will fail to reduce the phi which could be unexpected. Shouldn't we just return `UNKNOWN` from within `ConnectionGraph::optimize_ptr_compare()` when we run without `OptimizePtrCompare`? > > On a separate note, can you also add a regression test? Maybe you can also just add a run with `-XX:-OptimizePtrCompare` - maybe together with `-XX:+VerifyReduceAllocationMerges` for more verification - to `compiler/c2/irTests/scalarReplacement/AllocationMergesTests.java`. > > @JohnTortugo you might also want to have a look at this. Thanks a lot for your suggestion! I took a closer look at the code, and I now fully agree that your approach is the better one. 
Returning UNKNOWN from optimize_ptr_compare() when OptimizePtrCompare is disabled makes the behavior more consistent and avoids skipping reduce_phi_on_cmp() entirely, which could lead to unexpected results or missed optimization opportunities. I appreciate your feedback and will move forward with this approach. Thanks again! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26125#discussion_r2190428742 From shade at openjdk.org Mon Jul 7 15:43:41 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 7 Jul 2025 15:43:41 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 08:18:56 GMT, Evgeny Astigeevich wrote: >> Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. >> >> This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. >> >> Tested on Linux and MacOS with and without hsdis: >> - Fastdebug: test passed >> - Slowdebug: test passed. >> - Release: test passed. > > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Reimplement checking algo without using debug info test/hotspot/jtreg/compiler/onSpinWait/TestOnSpinWaitAArch64.java line 55: > 53: private static String retInst = "ret"; > 54: private static String neededAddInst = "addsp,sp,#0x20"; > 55: private static String neededLdpInst = "ldpx29,x30,[sp,#16]"; Move these default inits down to `analyzer.contains("[MachCode]")` block, since it looks like it selects between two options based on `[MachCode]` presence. Something like: boolean disassembly = analyzer.contains("[MachCode]"); retInst = disassembly ? "ret" : "c0035fd6"; ... test/hotspot/jtreg/compiler/onSpinWait/TestOnSpinWaitAArch64.java line 179: > 177: String s = instrReverseIter.previous(); > 178: instrReverseIter.next(); > 179: if (instrReverseIter.previous().startsWith(neededAddInst)) { Multiple issues here: - Confusing: what's the use of `s`, did you mean to use it for `startsWith`? - Indenting is off ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2190397094 PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2190415085 From shade at openjdk.org Mon Jul 7 15:43:42 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 7 Jul 2025 15:43:42 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v3] In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 15:31:49 GMT, Aleksey Shipilev wrote: >> Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: >> >> Reimplement checking algo without using debug info > > test/hotspot/jtreg/compiler/onSpinWait/TestOnSpinWaitAArch64.java line 179: > >> 177: String s = instrReverseIter.previous(); >> 178: instrReverseIter.next(); >> 179: if (instrReverseIter.previous().startsWith(neededAddInst)) { > > Multiple issues here: > - Confusing: what's the use of `s`, did you mean to use it for `startsWith`? 
> - Indenting is off Overall, I think searching for this multi-instruction stencil gets fairly hairy with iterators. Would a more straight-forward looping work? int found = 0; for (int c = 0; c < instrs.size() - 2; c++) { if (instrs.get(c).startsWith(spinWaitInst) && instrs.get(c+1).startsWith(neededLdpInst) && instrs.get(c+2).startsWith(neededAddInst)) { found++; } } ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2190431170 From shade at openjdk.org Mon Jul 7 15:47:44 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 7 Jul 2025 15:47:44 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v2] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 08:49:45 GMT, Andrew Haley wrote: >>> OK, are you able to bisect which change? This fix to only do debug VM needs to be correctly linked to the actual cause, IMO. >> >> >> >>> > It looks like `XX:+PrintAssembly` prints out debug info in release builds but `XX:CompileCommand=print` does not. I am switching back to `XX:+PrintAssembly`. >>> >>> That's not great. What info do you need, exactly? >> >> >> # {method} {0x0000ffff50400378} 'test' '()V' in 'compiler/onSpinWait/TestOnSpinWaitAArch64$Launcher' >> # [sp+0x20] (sp of caller) >> 0x0000ffff985731c0: ff83 00d1 | fd7b 01a9 | 2803 0018 | 8923 40b9 | 1f01 09eb >> >> 0x0000ffff985731d4: ;*synchronization entry >> ; - compiler.onSpinWait.TestOnSpinWaitAArch64$Launcher::test at -1 (line 224) >> 0x0000ffff985731d4: 2102 0054 | 1f20 03d5 | 1f20 03d5 | 1f20 03d5 | 1f20 03d5 | 1f20 03d5 | 1f20 03d5 >> >> 0x0000ffff985731f0: ;*invokestatic onSpinWait {reexecute=0 rethrow=0 return_oop=0} >> ; - compiler.onSpinWait.TestOnSpinWaitAArch64$Launcher::test at 0 (line 224) >> 0x0000ffff985731f0: 1f20 03d5 | fd7b 41a9 | ff83 0091 >> >> 0x0000ffff985731fc: ; {poll_return} >> 0x0000ffff985731fc: 8817 40f9 | ff63 28eb | 4800 0054 | c003 5fd6 >> >> 0x0000ffff9857320c: ; {internal_word} >> 0x0000ffff9857320c: 88ff ff10 | 88a3 02f9 >> >> 0x0000ffff98573214: ; {runtime_call SafepointBlob} >> 0x0000ffff98573214: 5bc3 fe17 >> >> 0x0000ffff98573218: ; {runtime_call Stub::method_entry_barrier} >> 0x0000ffff98573218: 0850 96d2 | 480a b3f2 | e8ff dff2 | 0001 3fd6 | ecff ff17 >> >> >> The test searches for `- compiler.onSpinWait.TestOnSpinWaitAArch64$Launcher::test at 0` and `invokestatic onSpinWait`. They identify the place where to search instructions. >> >> Assembly from all builds always has `{poll_return}`. I can use it as a search point. > >> ``` >> >> ``` >> >> >> >> >> >> The test searches for `- compiler.onSpinWait.TestOnSpinWaitAArch64$Launcher::test at 0` and `invokestatic onSpinWait`. They identify the place where to search instructions. > > That's not great. C2 is free to move stuff around, so it's not certain this test will keep working. If you just want to make sure that the pattern is used, a block_comment() would be more reliable. Then again, it does not address a central point: the test keeps relying on particular instruction scheduling, which is not reliable. Why do we even need to search for ldp+add anchor? Can we just blindly search for spin-wait-looking instructions? I would expect `block_comment` to be even more reliable, as @theRealAph suggested. 
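To make the block_comment idea concrete, here is a hypothetical sketch of what bracketing the spin-wait sequence could look like in the AArch64 back end. The method name and structure are assumptions for illustration only, not the actual change; block_comment markers show up as comment lines in the PrintAssembly output, which a test can anchor on:

```c++
// Emit markers around the spin-wait instructions so a test can anchor on them
// instead of on how C2 happens to schedule the surrounding ldp/add code.
void MacroAssembler::spin_wait() {
  block_comment("spin_wait {");
  // ... emit the configured isb/yield/nop sequence here ...
  block_comment("} spin_wait");
}
```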
------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3045677596 From cslucas at openjdk.org Mon Jul 7 16:30:39 2025 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Mon, 7 Jul 2025 16:30:39 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 12:11:46 GMT, Christian Hagedorn wrote: >> When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. >> >> This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this flag is disabled. > > src/hotspot/share/opto/escape.cpp line 981: > >> 979: if (!OptimizePtrCompare) { >> 980: return; >> 981: } > > Thanks for working on this! IIUC, having the bailout here will fail to reduce the phi which could be unexpected. Shouldn't we just return `UNKNOWN` from within `ConnectionGraph::optimize_ptr_compare()` when we run without `OptimizePtrCompare`? > > On a separate note, can you also add a regression test? Maybe you can also just add a run with `-XX:-OptimizePtrCompare` - maybe together with `-XX:+VerifyReduceAllocationMerges` for more verification - to `compiler/c2/irTests/scalarReplacement/AllocationMergesTests.java`. > > @JohnTortugo you might also want to have a look at this. Thanks for the ping @chhagedorn and I fully agree with your comment. Actually, that's the correct way to do this. Thank you for fixing this @hgqxjj . ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26125#discussion_r2190566267 From eastigeevich at openjdk.org Mon Jul 7 18:19:54 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Mon, 7 Jul 2025 18:19:54 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v4] In-Reply-To: References: Message-ID: > Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. > > This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. > > Tested on Linux and MacOS with and without hsdis: > - Fastdebug: test passed > - Slowdebug: test passed. > - Release: test passed. 
Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: Implement using block_comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26072/files - new: https://git.openjdk.org/jdk/pull/26072/files/0b3320e6..2a209213 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=02-03 Stats: 60 lines in 2 files changed: 9 ins; 21 del; 30 mod Patch: https://git.openjdk.org/jdk/pull/26072.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26072/head:pull/26072 PR: https://git.openjdk.org/jdk/pull/26072 From eastigeevich at openjdk.org Mon Jul 7 18:19:54 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Mon, 7 Jul 2025 18:19:54 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v2] In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 15:45:16 GMT, Aleksey Shipilev wrote: > I would expect `block_comment` to be even more reliable I have rewritten the test to use `block_comment`. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3046123161 From tschatzl at openjdk.org Mon Jul 7 18:57:41 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Mon, 7 Jul 2025 18:57:41 GMT Subject: RFR: 8350621: Code cache stops scheduling GC In-Reply-To: References: Message-ID: On Tue, 24 Jun 2025 08:52:12 GMT, Thomas Schatzl wrote: > > > > However, the current logic that a young-gc can cancel a full-gc (`_codecache_GC_aggressive` in this case) also seems surprising. > > That's a different issue. Actually most likely this is the issue for Parallel GC; that code is present only in older JDK versions before 25 (however other reasons like the `GCLocker` may also prevent these GCs), i.e. there should be no such issue in JDK 25 for Parallel GC. The situation for Parallel GC is different for earlier versions, i.e. for backporting: it would require the changes for [JDK-8192647](https://bugs.openjdk.org/browse/JDK-8192647) and at least one other fix. There needs to be a cost/benefit analysis as these are rather intrusive changes. @ajacob: > I considered a few different options before making this change: > > 1. Always call Universe::heap()->collect(...) without making any check (the GC impl should handle the situation) > 2. Fix all GCs implementation to ensure _unloading_threshold_gc_requested gets back to false at some point (probably what is supposed to happen today) > 3. Change CollectedHeap::collect to return a bool instead of void to indicate if GC was run or scheduled I had a spin at the (imo correct) fix for 2 - fix G1 `collect()` logic. Here's a diff: https://github.com/openjdk/jdk/compare/master...tschatzl:jdk:submit/8350621-code-cache-mgmt-hang?expand=1 What do you think? Thanks, Thomas ------------- PR Comment: https://git.openjdk.org/jdk/pull/23656#issuecomment-3035131232 From duke at openjdk.org Mon Jul 7 18:57:42 2025 From: duke at openjdk.org (Alexandre Jacob) Date: Mon, 7 Jul 2025 18:57:42 GMT Subject: RFR: 8350621: Code cache stops scheduling GC In-Reply-To: References: Message-ID: On Sun, 16 Feb 2025 18:39:29 GMT, Alexandre Jacob wrote: > The purpose of this PR is to fix a bug where we can end up in a situation where the GC is not scheduled anymore by `CodeCache`.
> > This situation is possible because the `_unloading_threshold_gc_requested` flag is set to `true` when triggering the GC and we expect the GC to call `CodeCache::on_gc_marking_cycle_finish` which in turn will call `CodeCache::update_cold_gc_count`, which will reset the flag `_unloading_threshold_gc_requested` allowing further GC scheduling. > > Unfortunately this can't work properly under certain circumstances. > For example, if using G1GC, calling `G1CollectedHeap::collect` does no give the guarantee that the GC will actually run as it can be already running (see [here](https://github.com/openjdk/jdk/blob/7d11418c820b46926a25907766d16083a4b349de/src/hotspot/share/gc/g1/g1CollectedHeap.cpp#L1763)). > > I have observed this behavior on JVM in version 21 that were migrated recently from java 17. > Those JVMs have some pressure on code cache and quite a large heap in comparison to allocation rate, which means that objects are mostly GC'd by young collections and full GC take a long time to happen. > > I have been able to reproduce this issue with ParallelGC and G1GC, and I imagine that other GC can be impacted as well. > > In order to reproduce this issue, I found a very simple and convenient way: > > > public class CodeCacheMain { > public static void main(String[] args) throws InterruptedException { > while (true) { > Thread.sleep(100); > } > } > } > > > Run this simple app with the following JVM flags: > > > -Xlog:gc*=info,codecache=info -Xmx512m -XX:ReservedCodeCacheSize=2496k -XX:StartAggressiveSweepingAt=15 > > > - 512m for the heap just to clarify the intent that we don't want to be bothered by a full GC > - low `ReservedCodeCacheSize` to put pressure on code cache quickly > - `StartAggressiveSweepingAt` can be set to 20 or 15 for faster bug reproduction > > Itself, the program will hardly get pressure on code cache, but the good news is that it is sufficient to attach a jconsole on it which will: > - allows us to monitor code cache > - indirectly generate activity on the code cache, just what we need to reproduce the bug > > Some logs related to code cache will show up at some point with GC activity: > > > [648.733s][info][codecache ] Triggering aggressive GC due to having only 14.970% free memory > > > And then it will stop and we'll end up with the following message: > > > [672.714s][info][codecache ] Code cache is full - disabling compilation > > > L... Hello, I'm sorry I didn't get back to you sooner on this PR. Indeed I considered the first option (do not try to prevent calls to `Universe::heap()->collect(...)`) but wanted to have something more elaborated instead. @tschatzl I like your proposal of fixing the GC implementation directly, as mentioned in my PR description it was my favorite option but because I found that this bug existed for at least Parallel GC and G1 I wanted to have something in CodeCache directly to ensure we never have an issue related to GC implementation. I had a look at your commit and feel like it is the good direction for G1. 
Thank you for having a look at it ------------- PR Comment: https://git.openjdk.org/jdk/pull/23656#issuecomment-3046216533 From eastigeevich at openjdk.org Mon Jul 7 20:57:55 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Mon, 7 Jul 2025 20:57:55 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v5] In-Reply-To: References: Message-ID: > Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. > > This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. > > Tested on Linux and MacOS with and without hsdis: > - Fastdebug: test passed > - Slowdebug: test passed. > - Release: test passed. Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: Fix whitespace error ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26072/files - new: https://git.openjdk.org/jdk/pull/26072/files/2a209213..e3163c9f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=03-04 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26072.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26072/head:pull/26072 PR: https://git.openjdk.org/jdk/pull/26072 From snatarajan at openjdk.org Mon Jul 7 23:04:51 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Mon, 7 Jul 2025 23:04:51 GMT Subject: RFR: 8342941: IGV: Add new graph dumps for post loop, empty loop removal, and one iteration removal [v3] In-Reply-To: References: Message-ID: <_p5Jj77u1VyyW0eVneXqeNjmngTvSvFi94_FALv6swk=.d4e5aec1-dd73-48ed-8d7f-3080207be763@github.com> > This changeset adds BEFORE/AFTER graph dumps for creating a post loop (`insert_post_loop()`), removing an empty loop (`do_remove_empty_loop()`), and removing a one iteration loop (`do_one_iteration_loop()`). > > Changes: > - Added `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` for dumping graphs before and after `insert_post_loop()`. > - Added `BEFORE_REMOVE_EMPTY_LOOP` and `AFTER_REMOVE_EMPTY_LOOP` for dumping graphs before and after `do_remove_empty_loop()`. > - Added `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` for dumping graphs before and after `do_one_iteration_loop()`. > > Below are sample screenshots (IGV print level 4 ) mainly showing the new phase . > 1. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` > ![image](https://github.com/user-attachments/assets/1661cede-5d70-4e0d-abec-3d091c7675c8) > 2. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` with SuperWordLoopUnrollAnalysis enabled > ![image](https://github.com/user-attachments/assets/6a22e6f0-4e6c-4e9d-8b6b-2bf75fac783d) > 3.` BEFORE_REMOVE_EMPTY_LOOP `and `AFTER_REMOVE_EMPTY_LOOP` > ![image](https://github.com/user-attachments/assets/3281f00b-575e-4604-83dd-831037d8dd47) > 4. `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` > ![image](https://github.com/user-attachments/assets/efddbc9a-64f7-403d-acfe-330d75a00911) > > Question to reviewers: > Are the new compiler phases OK, or should we change anything? 
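As a rough sketch of how such dump points are usually wired in (the enum constant names below are assumed from the phase names listed above, and the exact call sites live in the loop-optimization code, so treat this as an outline only):

```c++
// Bracket the transformation with IGV dump points, mirroring other loop-opts phases
// (argument names illustrative; print level 4 matches the screenshots above).
C->print_method(PHASE_BEFORE_REMOVE_EMPTY_LOOP, 4, cl);
// ... remove the empty loop ...
C->print_method(PHASE_AFTER_REMOVE_EMPTY_LOOP, 4, cl);
```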
> > Testing: > GitHub Actions > tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. > Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) Saranya Natarajan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: - fix 2 of review - Merge master - Addressing review comments - Initial Fix ------------- Changes: https://git.openjdk.org/jdk/pull/25756/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=25756&range=02 Stats: 19 lines in 3 files changed: 19 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/25756.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25756/head:pull/25756 PR: https://git.openjdk.org/jdk/pull/25756 From kvn at openjdk.org Mon Jul 7 23:31:42 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 7 Jul 2025 23:31:42 GMT Subject: RFR: 8361397: Rework CompileLog list synchronization [v2] In-Reply-To: References: <12Yp6QmpXqG-1UXTS8VveJ4yDNnDEGFV2q3_vRc3lF0=.4ccf05e2-9249-4b55-b48f-4f7fc17bef65@github.com> Message-ID: On Mon, 7 Jul 2025 09:07:32 GMT, Aleksey Shipilev wrote: >> I want to remove `CompileTaskAlloc_lock` completely with [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473), and for that we need to fix a stray use of that lock in CompileLog list linkage. We can rewrite that part to atomics. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `compiler` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' into JDK-8361397-compilelog-list > - Fix Okay ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26127#pullrequestreview-2995538169 From kvn at openjdk.org Mon Jul 7 23:37:40 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 7 Jul 2025 23:37:40 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 13:33:23 GMT, Aleksey Shipilev wrote: >> We routinely CTW 3rd party JARs to make sure our compilers work. By the nature of the JARs, they have dependencies on other JARs, and CTW runner frequently warns out with NCDFE. It does so very crudely, missing opportunities to compile the methods that _do not_ trigger NCDFEs. CTW should be made more tolerant to this. I think the normal "modules" CTW runs into the similar problem, but on a lesser scale, as we do not have a very hairy dependency graph within JDK. >> >> The CTW logs are also fairly noisy with full exception traces when NCDFE is semi-expected. This PR does _not_ print exception stack traces in these cases, only "NOTE"-s about it. This makes the log fairly clean and more understandable. >> >> Motivational scope improvement compiling a sample 3rd party JAR (cassandra-2.1.4.0.jar): >> >> >> Before: Done (2487 classes, 9866 methods, 24584 ms) >> After: Done (2487 classes, 10074 methods, 24150 ms) ; +2% more methods >> >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Just use printf directly Good. ------------- Marked as reviewed by kvn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26090#pullrequestreview-2995553854 From kvn at openjdk.org Mon Jul 7 23:37:41 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 7 Jul 2025 23:37:41 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v3] In-Reply-To: References: Message-ID: On Fri, 4 Jul 2025 05:58:51 GMT, Aleksey Shipilev wrote: >> test/hotspot/jtreg/testlibrary/ctw/src/sun/hotspot/tools/ctw/Compiler.java line 89: >> >>> 87: UNSAFE.ensureClassInitialized(aClass); >>> 88: } catch (NoClassDefFoundError e) { >>> 89: CompileTheWorld.OUT.printf("[%d]\t%s\tNOTE unable to init class : %s%n", >> >> Do you mean `\n` here and in all other outputs? `%n` needs local variable to store size of output. > > I meant `%n` :) > > You are probably thinking about C printf? In Java [formatters](https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html), `%n` is the "platform-specific line separator". It is more compatible than just `\n`, which runs into platform-specific `CR` vs `LF` vs `CRLF` line separator mess. > > See: > > > jshell> System.out.printf("Hello\nthere,\nVladimir!\n") > Hello > there, > Vladimir! > $6 ==> java.io.PrintStream at 34c45dca > > jshell> System.out.printf("Hello%nthere,%nVladimir!%n") > Hello > there, > Vladimir! > $7 ==> java.io.PrintStream at 34c45dca Now you know that I am not expert Java programmer :( ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26090#discussion_r2191193030 From kvn at openjdk.org Mon Jul 7 23:52:48 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 7 Jul 2025 23:52:48 GMT Subject: RFR: 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() Message-ID: `CodeBlob::relocation_size()` is calculated as `(_mutable_data + _relocation_size - _mutable_data)`. `CodeBlob::relocation_size()` is called during AOT code loading before we allocate space for mutable data (the size is used to find how big space should be allocated). The expression at that point is `(NULL + _relocation_size - NULL)` which returns correct result. But we should just return `_relocation_size` which is recorded anyway in AOT data. Added missing `_mutable_data = blob_end();` initialization when we restore AOT code blob. Fixed embarrassing typo in asserts. Tested: tier1-6,8,10,xcomp,stress ------------- Commit messages: - 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() Changes: https://git.openjdk.org/jdk/pull/26175/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26175&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8360942 Stats: 7 lines in 2 files changed: 4 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/26175.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26175/head:pull/26175 PR: https://git.openjdk.org/jdk/pull/26175 From kvn at openjdk.org Mon Jul 7 23:55:38 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 7 Jul 2025 23:55:38 GMT Subject: RFR: 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 23:48:03 GMT, Vladimir Kozlov wrote: > `CodeBlob::relocation_size()` is calculated as `(_mutable_data + _relocation_size - _mutable_data)`. `CodeBlob::relocation_size()` is called during AOT code loading before we allocate space for mutable data (the size is used to find how big space should be allocated). 
The expression at that point is `(NULL + _relocation_size - NULL)` which returns correct result. But we should just return `_relocation_size` which is recorded anyway in AOT data. > > Added missing `_mutable_data = blob_end();` initialization when we restore AOT code blob. > > Fixed embarrassing typo in asserts. > > Tested: tier1-6,8,10,xcomp,stress @mbaesken, please verify that is passing ubsan testing now. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26175#issuecomment-3046865545 From fyang at openjdk.org Tue Jul 8 00:42:38 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 8 Jul 2025 00:42:38 GMT Subject: RFR: 8361504: RISC-V: Make C1 clone intrinsic platform guard more specific In-Reply-To: <16DUz5Iytmw9i7wAxTx_oU4eeJBCsOI_15qzFP6M4GU=.8a5305a3-2b8d-4b9d-957a-430600bff4b4@github.com> References: <16DUz5Iytmw9i7wAxTx_oU4eeJBCsOI_15qzFP6M4GU=.8a5305a3-2b8d-4b9d-957a-430600bff4b4@github.com> Message-ID: On Mon, 7 Jul 2025 15:03:52 GMT, Feilong Jiang wrote: > Hi all. > Please review this trivial patch, which changes the C1 primitive array clone intrinsic RISCV platform guard into RISCV64. As we only support RISCV64 for now. Thanks. ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26161#pullrequestreview-2995642785 From kvn at openjdk.org Tue Jul 8 00:43:38 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 8 Jul 2025 00:43:38 GMT Subject: RFR: 8360175: C2 crash: assert(edge_from_to(prior_use, n)) failed: before block local scheduling In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 09:47:09 GMT, Manuel H?ssig wrote: > The triggered assert is part of the schedule verification code that runs just before machine code is emitted. The debug output showed that a `leaPCompressedOopOffset` node was causing the assert, which suggested the peephole optimization introduced in #25471 as the cause. The failure proved quite difficult to reproduce. It failed more often on Windows and required `-XX:+UseKNLSetting` (forces code generation for Intel's Knights Landing platform), which forces `-XX:+OptoScheduling`. > > The root-cause is a subtle bug in the rewiring of the base edge of `leaP*` nodes in the `remove_redundant_lea` peephole. When the peephole removed a `decodeHeapOop_not_null` including a spill, it did not set the base edge of the `leaP*` node to the same node as the address edge, which is the intent of the peephole, but to the parent node of the spill. That is not catastrophic in most cases, but might reference another register slot, which causes this assert. Concretely, we see the following graph > > MemToRegSpillCopy > | | > | MemToRegSpillCopy > | | > DefiniinoSpillCopy | > | | > | decodeHeapOop_not_null > | | > leaPCompressedHeapOop > > gets rewired to > > MemToRegSpillCopy > | | > DefinitionSpillCopy | > | | > leaPCompressedHeapOop > > instead of > > MemToRegSpillCopy > | > DefinitionSpillCopy > / \ > leaPCompressedHeapOop > > > This PR fixes this by always setting the base edge of the `leaP*` node to the same node as the address edge. Unfortunately, I was not able to construct a regression test because of the difficulty of reproducing the bug. > > # Testing > > - [ ] Github Actions > - [x] tier1,tier2 plus internal testing on all Oracle supported platforms > - [x] tier3,tier4,tier5 plus internal testing on Linux and Windows x64 > - [ ] Runthese8H on `windows-x64-debug` (test that reliably produced the failure addressed in this PR) Seems fine. ------------- Marked as reviewed by kvn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26157#pullrequestreview-2995643836 From gcao at openjdk.org Tue Jul 8 01:46:37 2025 From: gcao at openjdk.org (Gui Cao) Date: Tue, 8 Jul 2025 01:46:37 GMT Subject: RFR: 8361504: RISC-V: Make C1 clone intrinsic platform guard more specific In-Reply-To: <16DUz5Iytmw9i7wAxTx_oU4eeJBCsOI_15qzFP6M4GU=.8a5305a3-2b8d-4b9d-957a-430600bff4b4@github.com> References: <16DUz5Iytmw9i7wAxTx_oU4eeJBCsOI_15qzFP6M4GU=.8a5305a3-2b8d-4b9d-957a-430600bff4b4@github.com> Message-ID: On Mon, 7 Jul 2025 15:03:52 GMT, Feilong Jiang wrote: > Hi all. > Please review this trivial patch, which changes the C1 primitive array clone intrinsic RISCV platform guard into RISCV64. As we only support RISCV64 for now. Thanks, Looks good to me. ------------- Marked as reviewed by gcao (Author). PR Review: https://git.openjdk.org/jdk/pull/26161#pullrequestreview-2995725907 From xgong at openjdk.org Tue Jul 8 01:55:41 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 8 Jul 2025 01:55:41 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> Message-ID: On Mon, 7 Jul 2025 06:59:20 GMT, Xiaohong Gong wrote: >> Have you measured the performance of this micro-benchmark on NEON machine? >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java#L251-L256 >> >> We added an limitation only for `int` before: >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L131-L134 >> >> Perhaps we also need to impose a similar limitation on `short` if the same regression occurs. > >> Have you measured the performance of this micro-benchmark on NEON machine? >> >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java#L251-L256 >> >> We added an limitation only for `int` before: >> >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L131-L134 >> >> Perhaps we also need to impose a similar limitation on `short` if the same regression occurs. > > Good catch, and thanks so much for your input @fg1417 ! I will test the performance and disable auto-vectorization for double to short casting if the performance has regression. > >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 > > Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. > Hi @XiaohongGong, is there any way we can implement 2HF -> 2S and 2S -> 2HF in these match rules ? 
> > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4697 > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4679 > > The `fcvtn` and `fcvtl` instructions do not support these arrangements. I was wondering if there is any other way we can implement these by any chance? Do you mean `2HF -> 2F` and `2F -> 2HF` ? Yes, it does not support the 32-bit arrangements. Vector conversion is a kind of lanewise vector operation. For such cases, we usually use the same arrangements with 64-bit vector size for 32-bit ones. That means we can reuse the `T4H` and `T4S` to implement it. Hence, current match rules can cover the conversions between `2HF` and `2F`. Consider there is no such conversion cases in Vector API, I didn't change the comment in the match rules. I think this may benefit auto-vectorization. Currently, do we have cases that can match these rules with SLP? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3047091009 From xgong at openjdk.org Tue Jul 8 01:58:45 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 8 Jul 2025 01:58:45 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> Message-ID: <0vdCJFYxCI6hFnTL6rm3oKQcPuuIR2EbuyAOa0muqcw=.d5c249cb-9bf0-415d-ab22-de7387d8d8d1@github.com> On Mon, 7 Jul 2025 13:23:15 GMT, Fei Gao wrote: > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 > > > > > > Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. > > Hi @XiaohongGong thanks for asking. In many auto-vectorization cases involving `char`, the vector elements are represented using `T_SHORT` as the `BasicType`, rather than `T_CHAR`. > > This is because, in Java, operands of subword types are always promoted to `int` before any arithmetic operation. As a result, when handling a node like `ConvD2I`, we don?t initially know its actual subword type. Later, the SuperWord phase propagates a narrowed integer type backward to help determine the correct subword type. See: > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2551-L2558 > > Since SuperWord assigns `T_SHORT` to `StoreC` early on > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2646-L2650 > > > the entire propagation chain tends to use `T_SHORT` as well. > This applies to most operations, with the exception of a few like `RShiftI`, `Abs`, and `ReverseBytesI`, which are handled separately. > > So your change already benefits many char-related vectorization cases like `convertDoubleToChar` above. That?s why we can safely relax the IR condition mentioned earlier. Thanks for your input! It's really helpful to me. 
Does this mean it always use `T_SHORT` for char vectors in SLP? If so, it's safe that we do not need to consider `T_CHAR` in vector IRs in backend? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3047094924 From kvn at openjdk.org Tue Jul 8 02:06:39 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 8 Jul 2025 02:06:39 GMT Subject: RFR: 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache [v2] In-Reply-To: References: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> Message-ID: On Fri, 4 Jul 2025 14:13:27 GMT, guanqiang han wrote: >> The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is confusing and does not reflect the current implementation. >> >> This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. > > guanqiang han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - correct a compile error > - Merge remote-tracking branch 'upstream/master' into 8344548 > - 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache > > The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is > confusing and does not reflect the current implementation. > > This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. src/hotspot/share/runtime/globals.hpp line 1573: > 1571: \ > 1572: product(uintx, StartAggressiveSweepingAt, 10, \ > 1573: "Start aggressive sweeping if X[%] of the total code cache is free.")\ I suggest : "Start aggressive sweeping if less than X[%] of the total code cache is free.") ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26114#discussion_r2191326749 From dzhang at openjdk.org Tue Jul 8 02:35:28 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Tue, 8 Jul 2025 02:35:28 GMT Subject: RFR: 8361532: RISC-V: Several vector tests fail after JDK-8354383 Message-ID: <-DfrHsd_D9lqbcRmNvF67dHOBaoxGQWUwTeUIa1IvfA=.d3f8182e-c581-4b88-89dd-f5f0781e7b67@github.com> Hi all, Please take a look and review this PR, thanks! After [JDK-8354383](https://bugs.openjdk.org/browse/JDK-8354383) , several test cases fail when fastdebug with RVV. The reason for the error is that riscv lacks CastVV with dst as the mask register. This PR adds the corresponding matching rules. 
### Testing qemu-system with RVV: * [x] Run jdk_vector (fastdebug) * [x] Run compiler/vectorapi (fastdebug) ------------- Commit messages: - 8361532: RISC-V: Several vector tests fail after JDK-8354383 Changes: https://git.openjdk.org/jdk/pull/26178/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26178&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361532 Stats: 11 lines in 1 file changed: 11 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26178.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26178/head:pull/26178 PR: https://git.openjdk.org/jdk/pull/26178 From duke at openjdk.org Tue Jul 8 02:37:39 2025 From: duke at openjdk.org (guanqiang han) Date: Tue, 8 Jul 2025 02:37:39 GMT Subject: RFR: 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache [v2] In-Reply-To: References: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> Message-ID: On Tue, 8 Jul 2025 02:04:01 GMT, Vladimir Kozlov wrote: >> guanqiang han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - correct a compile error >> - Merge remote-tracking branch 'upstream/master' into 8344548 >> - 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache >> >> The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is >> confusing and does not reflect the current implementation. >> >> This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. > > src/hotspot/share/runtime/globals.hpp line 1573: > >> 1571: \ >> 1572: product(uintx, StartAggressiveSweepingAt, 10, \ >> 1573: "Start aggressive sweeping if X[%] of the total code cache is free.")\ > > I suggest : "Start aggressive sweeping if less than X[%] of the total code cache is free.") Thank you very much for your valuable feedback. Your description is much clearer and more precise. I will update my PR accordingly. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26114#discussion_r2191354527 From fyang at openjdk.org Tue Jul 8 02:51:37 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 8 Jul 2025 02:51:37 GMT Subject: RFR: 8361532: RISC-V: Several vector tests fail after JDK-8354383 In-Reply-To: <-DfrHsd_D9lqbcRmNvF67dHOBaoxGQWUwTeUIa1IvfA=.d3f8182e-c581-4b88-89dd-f5f0781e7b67@github.com> References: <-DfrHsd_D9lqbcRmNvF67dHOBaoxGQWUwTeUIa1IvfA=.d3f8182e-c581-4b88-89dd-f5f0781e7b67@github.com> Message-ID: On Tue, 8 Jul 2025 02:30:27 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > After [JDK-8354383](https://bugs.openjdk.org/browse/JDK-8354383) , several test cases fail when fastdebug with RVV. > The reason for the error is that riscv lacks CastVV with dst as the mask register. > This PR adds the corresponding matching rules. > > ### Testing > qemu-system with RVV: > * [x] Run jdk_vector (fastdebug) > * [x] Run compiler/vectorapi (fastdebug) Looks fine. Thanks. ------------- Marked as reviewed by fyang (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26178#pullrequestreview-2995814548 From fjiang at openjdk.org Tue Jul 8 03:16:38 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Tue, 8 Jul 2025 03:16:38 GMT Subject: RFR: 8361532: RISC-V: Several vector tests fail after JDK-8354383 In-Reply-To: <-DfrHsd_D9lqbcRmNvF67dHOBaoxGQWUwTeUIa1IvfA=.d3f8182e-c581-4b88-89dd-f5f0781e7b67@github.com> References: <-DfrHsd_D9lqbcRmNvF67dHOBaoxGQWUwTeUIa1IvfA=.d3f8182e-c581-4b88-89dd-f5f0781e7b67@github.com> Message-ID: <-9M65xCWu1KZZXcZDPhyRy3XkckmplWjlFuOTXmVHC8=.a4370b94-3ba5-499f-9eaf-f9ca66c18266@github.com> On Tue, 8 Jul 2025 02:30:27 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > After [JDK-8354383](https://bugs.openjdk.org/browse/JDK-8354383) , several test cases fail when fastdebug with RVV. > The reason for the error is that riscv lacks CastVV with dst as the mask register. > This PR adds the corresponding matching rules. > > ### Testing > qemu-system with RVV: > * [x] Run jdk_vector (fastdebug) > * [x] Run compiler/vectorapi (fastdebug) Looks good, thanks! ------------- Marked as reviewed by fjiang (Committer). PR Review: https://git.openjdk.org/jdk/pull/26178#pullrequestreview-2995842397 From duke at openjdk.org Tue Jul 8 03:26:18 2025 From: duke at openjdk.org (guanqiang han) Date: Tue, 8 Jul 2025 03:26:18 GMT Subject: RFR: 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache [v3] In-Reply-To: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> References: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> Message-ID: > The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is confusing and does not reflect the current implementation. > > This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. guanqiang han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: - make description more precise - Merge remote-tracking branch 'upstream/master' into 8344548 - correct a compile error - Merge remote-tracking branch 'upstream/master' into 8344548 - 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is confusing and does not reflect the current implementation. This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. 
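For reference, with the suggested wording applied, the definition in src/hotspot/share/runtime/globals.hpp would read roughly as follows (the continuation backslashes of the flag macro table are omitted here):

```c++
product(uintx, StartAggressiveSweepingAt, 10,
        "Start aggressive sweeping if less than X[%] of the total "
        "code cache is free.")
```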
------------- Changes: - all: https://git.openjdk.org/jdk/pull/26114/files - new: https://git.openjdk.org/jdk/pull/26114/files/cb1b2c60..6b82418e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26114&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26114&range=01-02 Stats: 2179 lines in 96 files changed: 973 ins; 663 del; 543 mod Patch: https://git.openjdk.org/jdk/pull/26114.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26114/head:pull/26114 PR: https://git.openjdk.org/jdk/pull/26114 From duke at openjdk.org Tue Jul 8 04:03:38 2025 From: duke at openjdk.org (guanqiang han) Date: Tue, 8 Jul 2025 04:03:38 GMT Subject: RFR: 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache [v2] In-Reply-To: References: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> Message-ID: On Tue, 8 Jul 2025 02:35:05 GMT, guanqiang han wrote: >> src/hotspot/share/runtime/globals.hpp line 1573: >> >>> 1571: \ >>> 1572: product(uintx, StartAggressiveSweepingAt, 10, \ >>> 1573: "Start aggressive sweeping if X[%] of the total code cache is free.")\ >> >> I suggest : "Start aggressive sweeping if less than X[%] of the total code cache is free.") > > Thank you very much for your valuable feedback. Your description is much clearer and more precise. I will update my PR accordingly. I've updated the PR based on your feedback. Please kindly take another look when convenient. Thanks a lot. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26114#discussion_r2191425052 From duke at openjdk.org Tue Jul 8 06:00:38 2025 From: duke at openjdk.org (erifan) Date: Tue, 8 Jul 2025 06:00:38 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v10] In-Reply-To: References: Message-ID: > This patch optimizes the following patterns: > For integer types: > > (XorV (VectorMaskCmp src1 src2 cond) (Replicate -1)) > => (VectorMaskCmp src1 src2 ncond) > (XorVMask (VectorMaskCmp src1 src2 cond) (MaskAll m1)) > => (VectorMaskCmp src1 src2 ncond) > > cond can be eq, ne, le, ge, lt, gt, ule, uge, ult and ugt, ncond is the negative comparison of cond. > > For float and double types: > > (XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1)) > => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) > (XorVMask (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (MaskAll m1)) > => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) > > cond can be eq or ne.
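A rough sketch of the rewrite in C2 terms follows; the helper name and BoolTest::negate_mask come from the commit list below, while the body here is only an outline of the idea, not the actual implementation:

```c++
// (XorV (VectorMaskCmp a b cond) (Replicate -1)) ==> (VectorMaskCmp a b ncond)
Node* XorVNode::Ideal_XorV_VectorMaskCmp(PhaseGVN* phase) {
  Node* in1 = in(1);
  Node* in2 = in(2);
  if (in1->Opcode() == Op_VectorMaskCmp && in1->outcnt() == 1 &&
      VectorNode::is_all_ones_vector(in2)) {
    // xor with an all-ones vector negates the mask, so flip the comparison
    // predicate via BoolTest::negate_mask(cond) and drop the xor node.
  }
  return nullptr;
}
```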
> > Benchmarks on Nvidia Grace machine with 128-bit SVE2: With option `-XX:UseSVE=2`: > > Benchmark Unit Before Score Error After Score Error Uplift > testCompareEQMaskNotByte ops/s 7912127.225 2677.289518 10266136.26 8955.008548 1.29 > testCompareEQMaskNotDouble ops/s 884737.6799 446.963779 1179760.772 448.031844 1.33 > testCompareEQMaskNotFloat ops/s 1765045.787 682.332214 2359520.803 896.305743 1.33 > testCompareEQMaskNotInt ops/s 1787221.411 977.743935 2353952.519 960.069976 1.31 > testCompareEQMaskNotLong ops/s 895297.1974 673.44808 1178449.02 323.804205 1.31 > testCompareEQMaskNotShort ops/s 3339987.002 3415.2226 4712761.965 2110.862053 1.41 > testCompareGEMaskNotByte ops/s 7907615.16 4094.243652 10251646.9 9486.699831 1.29 > testCompareGEMaskNotInt ops/s 1683738.958 4233.813092 2352855.205 1251.952546 1.39 > testCompareGEMaskNotLong ops/s 854496.1561 8594.598885 1177811.493 521.1229 1.37 > testCompareGEMaskNotShort ops/s 3341860.309 1578.975338 4714008.434 1681.10365 1.41 > testCompareGTMaskNotByte ops/s 7910823.674 2993.367032 10245063.58 9774.75138 1.29 > testCompareGTMaskNotInt ops/s 1673393.928 3153.099431 2353654.521 1190.848583 1.4 > testCompareGTMaskNotLong ops/s 849405.9159 2432.858159 1177952.041 359.96413 1.38 > testCompareGTMaskNotShort ops/s 3339509.141 3339.976585 4711442.496 2673.364893 1.41 > testCompareLEMaskNotByte ops/s 7911340.004 3114.69191 10231626.5 27134.20035 1.29 > testCompareLEMaskNotInt ops/s 1675812.113 1340.969885 2353255.341 1452.4522 1.4 > testCompareLEMaskNotLong ops/s 848862.8036 6564.841731 1177763.623 539.290106 1.38 > testCompareLEMaskNotShort ops/s 3324951.54 2380.29473 4712116.251 1544.559684 1.41 > testCompareLTMaskNotByte ops/s 7910390.844 2630.861436 10239567.69 6487.441672 1.29 > testCompareLTMaskNotInt ops/s 1672180.09 995.238142 2353757.863 853.774734 1.4 > testCompareLTMaskNotLong ops/s 856502.26... erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 17 additional commits since the last revision: - Align indentation - Merge branch 'master' into JDK-8354242 - Address more comments ATT. - Merge branch 'master' into JDK-8354242 - Support negating unsigned comparison for BoolTest::mask Added a static method `negate_mask(mask btm)` into BoolTest class to negate both signed and unsigned comparison. - Addressed some review comments - Merge branch 'master' into JDK-8354242 - Refactor the JTReg tests for compare.xor(maskAll) Also made a bit change to support pattern `VectorMask.fromLong()`. - Merge branch 'master' into JDK-8354242 - Refactor code Add a new function XorVNode::Ideal_XorV_VectorMaskCmp to do this optimization, making the code more modular. - ... 
and 7 more: https://git.openjdk.org/jdk/compare/c2a2adc8...db78dc43 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24674/files - new: https://git.openjdk.org/jdk/pull/24674/files/5ebdc572..db78dc43 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24674&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24674&range=08-09 Stats: 9269 lines in 462 files changed: 4528 ins; 2873 del; 1868 mod Patch: https://git.openjdk.org/jdk/pull/24674.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24674/head:pull/24674 PR: https://git.openjdk.org/jdk/pull/24674 From duke at openjdk.org Tue Jul 8 06:00:39 2025 From: duke at openjdk.org (erifan) Date: Tue, 8 Jul 2025 06:00:39 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v9] In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 09:32:42 GMT, erifan wrote: >> src/hotspot/share/opto/vectornode.cpp line 2243: >> >>> 2241: !VectorNode::is_all_ones_vector(in2)) { >>> 2242: return nullptr; >>> 2243: } >> >> Suggestion: >> >> if (in1->Opcode() != Op_VectorMaskCmp || >> in1->outcnt() != 1 || >> !(in1->as_VectorMaskCmp())->predicate_can_be_negated() || >> !VectorNode::is_all_ones_vector(in2)) { >> return nullptr; >> } >> >> Indentation for clarity. >> >> Currently, you exiting if one of these is the case: >> 1. Not `MaskCmp` >> 2. More than one use >> 3. predicate cannot be negated AND the vector is all ones. Can you explain this condition? Do you have tests for cases: >> - predicate negatable and vector not all ones >> - predircate not negatable and vector not all ones >> - predicate negatable and vector all ones >> - predicate not negatable and vectors all ones >> >> Why do you guard against `VectorNode::is_all_ones_vector(in2)` at all? >> >> The condition for 3. is easy to get wrong, so good testing is important here :) > > The current testing status for the conditions you listed: >> 1. Not MaskCmp. > > **No test for it, tested locally**, Because I think this condition is too straightforward. > >> 2. More than one use. > > **Tested**, see `VectorMaskCompareNotTest.java line 1118`. > >> predicate negatable and vector not all ones. > > **Tested**, see `VectorMaskCompareNotTest.java line 1126`. > >> predicate not negatable and vector not all ones. > > **No test for it**, because we have tests for `predicate not negatable` or `vector not all ones`. If either is `false`, return nullptr. > >> predicate negatable and vector all ones. > > **A lot of tests for it**. For example `VectorMaskCompareNotTest.java line 1014`. > >> predicate not negatable and vectors all ones. > > **Tested**, see `VectorMaskCompareNotTest.java line 1222`. > Indentation for clarity. Done. I think we have enough negative tests. 
Please take a look at this PR, thanks~ ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2191550171 From chagedorn at openjdk.org Tue Jul 8 06:15:42 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 8 Jul 2025 06:15:42 GMT Subject: RFR: 8342941: IGV: Add new graph dumps for post loop, empty loop removal, and one iteration removal [v3] In-Reply-To: <_p5Jj77u1VyyW0eVneXqeNjmngTvSvFi94_FALv6swk=.d4e5aec1-dd73-48ed-8d7f-3080207be763@github.com> References: <_p5Jj77u1VyyW0eVneXqeNjmngTvSvFi94_FALv6swk=.d4e5aec1-dd73-48ed-8d7f-3080207be763@github.com> Message-ID: On Mon, 7 Jul 2025 23:04:51 GMT, Saranya Natarajan wrote: >> This changeset adds BEFORE/AFTER graph dumps for creating a post loop (`insert_post_loop()`), removing an empty loop (`do_remove_empty_loop()`), and removing a one iteration loop (`do_one_iteration_loop()`). >> >> Changes: >> - Added `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` for dumping graphs before and after `insert_post_loop()`. >> - Added `BEFORE_REMOVE_EMPTY_LOOP` and `AFTER_REMOVE_EMPTY_LOOP` for dumping graphs before and after `do_remove_empty_loop()`. >> - Added `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` for dumping graphs before and after `do_one_iteration_loop()`. >> >> Below are sample screenshots (IGV print level 4 ) mainly showing the new phase . >> 1. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` >> ![image](https://github.com/user-attachments/assets/1661cede-5d70-4e0d-abec-3d091c7675c8) >> 2. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` with SuperWordLoopUnrollAnalysis enabled >> ![image](https://github.com/user-attachments/assets/6a22e6f0-4e6c-4e9d-8b6b-2bf75fac783d) >> 3.` BEFORE_REMOVE_EMPTY_LOOP `and `AFTER_REMOVE_EMPTY_LOOP` >> ![image](https://github.com/user-attachments/assets/3281f00b-575e-4604-83dd-831037d8dd47) >> 4. `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` >> ![image](https://github.com/user-attachments/assets/efddbc9a-64f7-403d-acfe-330d75a00911) >> >> Question to reviewers: >> Are the new compiler phases OK, or should we change anything? >> >> Testing: >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. >> Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) > > Saranya Natarajan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: > > - fix 2 of review > - Merge master > - Addressing review comments > - Initial Fix Update looks good, thanks! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25756#pullrequestreview-2996119957 From kvn at openjdk.org Tue Jul 8 07:20:40 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 8 Jul 2025 07:20:40 GMT Subject: RFR: 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache [v3] In-Reply-To: References: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> Message-ID: <3cbbO8fFBbiVaNQXLtDWg29AFsf-_CqFMSJulIz4QUw=.3a4318bd-ae20-4082-abbd-7828450c50d3@github.com> On Tue, 8 Jul 2025 03:26:18 GMT, guanqiang han wrote: >> The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. 
The previous description referenced segmented vs non-segmented code cache, which is confusing and does not reflect the current implementation. >> >> This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. > > guanqiang han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - make description more precise > - Merge remote-tracking branch 'upstream/master' into 8344548 > - correct a compile error > - Merge remote-tracking branch 'upstream/master' into 8344548 > - 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache > > The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is > confusing and does not reflect the current implementation. > > This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26114#pullrequestreview-2996300301 From duke at openjdk.org Tue Jul 8 07:25:44 2025 From: duke at openjdk.org (duke) Date: Tue, 8 Jul 2025 07:25:44 GMT Subject: RFR: 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache [v3] In-Reply-To: References: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> Message-ID: On Tue, 8 Jul 2025 03:26:18 GMT, guanqiang han wrote: >> The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is confusing and does not reflect the current implementation. >> >> This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. > > guanqiang han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - make description more precise > - Merge remote-tracking branch 'upstream/master' into 8344548 > - correct a compile error > - Merge remote-tracking branch 'upstream/master' into 8344548 > - 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache > > The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is > confusing and does not reflect the current implementation. > > This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. @hgqxjj Your change (at version 6b82418e4f4cdd40dd764d9657c27c6d08e5752e) is now ready to be sponsored by a Committer. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26114#issuecomment-3047687262 From chagedorn at openjdk.org Tue Jul 8 07:29:41 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 8 Jul 2025 07:29:41 GMT Subject: RFR: 8357473: Compilation spike leaves many CompileTasks in free list [v5] In-Reply-To: References: Message-ID: On Fri, 4 Jul 2025 09:08:19 GMT, Aleksey Shipilev wrote: >> See bug for more discussion. >> >> This PR implements the "all the way" solution by removing the free list completely. It complements https://github.com/openjdk/jdk/pull/25364, and can go either first, or second. We will remerge the other one once either integrates. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `compiler` >> - [x] Linux AArch64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits: > > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Also free the lock! > - Comments and indenting > - Basic deletion Nice catch! Small nit, otherwise, looks good to me, too. src/hotspot/share/compiler/compileTask.cpp line 84: > 82: > 83: CompileTask::~CompileTask() { > 84: if ((_method_holder != nullptr && JNIHandles::is_weak_global_handle(_method_holder))) { While moving the code, you can probably remove one pair of parentheses here: Suggestion: if (_method_holder != nullptr && JNIHandles::is_weak_global_handle(_method_holder)) { ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25409#pullrequestreview-2996298093 PR Review Comment: https://git.openjdk.org/jdk/pull/25409#discussion_r2191683566 From hgreule at openjdk.org Tue Jul 8 07:40:02 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Tue, 8 Jul 2025 07:40:02 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v3] In-Reply-To: References: Message-ID: > Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. > > Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. > > I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. > > Please review. Thanks. 
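To make the folding rule concrete, a small Java sketch of the semantics described above: truncate to the subword type first, then swap its bytes. The class name and the sample constant are made up for illustration; only `Short.reverseBytes` and the narrowing cast are standard Java.

```java
public class ReverseBytesFoldSketch {
    // What a constant fold of a short-typed reverseBytes has to compute:
    // drop the upper 16 bits of the (possibly wider) input, then swap the
    // remaining two bytes. A char-typed input works the same way.
    static short foldShort(int wideInput) {
        short truncated = (short) wideInput;   // ignore the upper bytes
        return Short.reverseBytes(truncated);  // swap the low two bytes
    }

    public static void main(String[] args) {
        // 0x0001_1234 truncates to 0x1234, whose byte swap is 0x3412.
        System.out.println(Integer.toHexString(foldShort(0x0001_1234) & 0xFFFF));
    }
}
```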
Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: re-add package, add methods to Run annotation ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25988/files - new: https://git.openjdk.org/jdk/pull/25988/files/6822cca0..f352726e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25988&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25988&range=01-02 Stats: 9 lines in 2 files changed: 3 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/25988.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25988/head:pull/25988 PR: https://git.openjdk.org/jdk/pull/25988 From hgreule at openjdk.org Tue Jul 8 07:40:02 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Tue, 8 Jul 2025 07:40:02 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v2] In-Reply-To: References: Message-ID: On Fri, 4 Jul 2025 09:34:18 GMT, Manuel H?ssig wrote: >> Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: >> >> remove classfile version > > src/hotspot/share/opto/subnode.cpp line 2031: > >> 2029: case Op_ReverseBytesUS: return TypeInt::make(byteswap(static_cast(con->is_int()->get_con()))); >> 2030: case Op_ReverseBytesI: return TypeInt::make(byteswap(con->is_int()->get_con())); >> 2031: case Op_ReverseBytesL: return TypeLong::make(byteswap(con->is_long()->get_con())); > > Why are you dropping the `checked_cast` here? Were they just an abundance of caution before? This was basically from copy-pasting, but the cast was from jint to jint and jlong to jlong respectively. With the other checked_casts removed, it looks confusing to keep it there I think. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25988#discussion_r2191721589 From chagedorn at openjdk.org Tue Jul 8 07:50:39 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 8 Jul 2025 07:50:39 GMT Subject: RFR: 8361397: Rework CompileLog list synchronization [v2] In-Reply-To: References: <12Yp6QmpXqG-1UXTS8VveJ4yDNnDEGFV2q3_vRc3lF0=.4ccf05e2-9249-4b55-b48f-4f7fc17bef65@github.com> Message-ID: On Mon, 7 Jul 2025 09:07:32 GMT, Aleksey Shipilev wrote: >> I want to remove `CompileTaskAlloc_lock` completely with [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473), and for that we need to fix a stray use of that lock in CompileLog list linkage. We can rewrite that part to atomics. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `compiler` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' into JDK-8361397-compilelog-list > - Fix Looks good to me, too! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26127#pullrequestreview-2996405511 From tschatzl at openjdk.org Tue Jul 8 07:50:45 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 8 Jul 2025 07:50:45 GMT Subject: RFR: 8350621: Code cache stops scheduling GC In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 18:54:55 GMT, Alexandre Jacob wrote: > Hello, I'm sorry I didn't get back to you sooner on this PR. > No worries, it should be rather me to not get to this earlier.... 
> Indeed I considered the first option (do not try to prevent calls to `Universe::heap()->collect(...)`) but wanted to have something more elaborated instead. > > @tschatzl I like your proposal of fixing the GC implementation directly, as mentioned in my PR description it was my favorite option but because I found that this bug existed for at least Parallel GC and G1 I wanted to have something in CodeCache directly to ensure we never have an issue related to GC implementation. I had a look at your commit and feel like it is the good direction for G1. Thank you for having a look at it First, I assume you verified my change ;) How do we proceed from here? Do you want to reuse this PR or should we (I, you?) open a new one for the new suggestion? What do you prefer? I am fine with either option. Thanks, Thomas ------------- PR Comment: https://git.openjdk.org/jdk/pull/23656#issuecomment-3047757426 From roland at openjdk.org Tue Jul 8 08:03:35 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 8 Jul 2025 08:03:35 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v35] In-Reply-To: References: Message-ID: > To optimize a long counted loop and long range checks in a long or int > counted loop, the loop is turned into a loop nest. When the loop has > few iterations, the overhead of having an outer loop whose backedge is > never taken, has a measurable cost. Furthermore, creating the loop > nest usually causes one iteration of the loop to be peeled so > predicates can be set up. If the loop is short running, then it's an > extra iteration that's run with range checks (compared to an int > counted loop with int range checks). > > This change doesn't create a loop nest when: > > 1- it can be determined statically at loop nest creation time that the > loop runs for a short enough number of iterations > > 2- profiling reports that the loop runs for no more than ShortLoopIter > iterations (1000 by default). > > For 2-, a guard is added which is implemented as yet another predicate. > > While this change is in principle simple, I ran into a few > implementation issues: > > - while c2 has a way to compute the number of iterations of an int > counted loop, it doesn't have that for long counted loop. The > existing logic for int counted loops promotes values to long to > avoid overflows. I reworked it so it now works for both long and int > counted loops. > > - I added a new deoptimization reason (Reason_short_running_loop) for > the new predicate. Given the number of iterations is narrowed down > by the predicate, the limit of the loop after transformation is a > cast node that's control dependent on the short running loop > predicate. Because once the counted loop is transformed, it is > likely that range check predicates will be inserted and they will > depend on the limit, the short running loop predicate has to be the > one that's further away from the loop entry. Now it is also possible > that the limit before transformation depends on a predicate > (TestShortRunningLongCountedLoopPredicatesClone is an example), we > can have: new predicates inserted after the transformation that > depend on the casted limit that itself depend on old predicates > added before the transformation. To solve this cicular dependency, > parse and assert predicates are cloned between the old predicates > and the loop head. The cloned short running loop parse predicate is > the one that's used to insert the short running loop predicate. 
> > - In the case of a long counted loop, the loop is transformed into a > regular loop with a new limit and transformed range checks that's > later turned into an in counted loop. The int ... Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/share/opto/c2_globals.hpp Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/21630/files - new: https://git.openjdk.org/jdk/pull/21630/files/fd19ee84..f3ca08de Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=21630&range=34 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=21630&range=33-34 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/21630.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/21630/head:pull/21630 PR: https://git.openjdk.org/jdk/pull/21630 From chagedorn at openjdk.org Tue Jul 8 08:05:42 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 8 Jul 2025 08:05:42 GMT Subject: RFR: 8360175: C2 crash: assert(edge_from_to(prior_use, n)) failed: before block local scheduling In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 09:47:09 GMT, Manuel H?ssig wrote: > The triggered assert is part of the schedule verification code that runs just before machine code is emitted. The debug output showed that a `leaPCompressedOopOffset` node was causing the assert, which suggested the peephole optimization introduced in #25471 as the cause. The failure proved quite difficult to reproduce. It failed more often on Windows and required `-XX:+UseKNLSetting` (forces code generation for Intel's Knights Landing platform), which forces `-XX:+OptoScheduling`. > > The root-cause is a subtle bug in the rewiring of the base edge of `leaP*` nodes in the `remove_redundant_lea` peephole. When the peephole removed a `decodeHeapOop_not_null` including a spill, it did not set the base edge of the `leaP*` node to the same node as the address edge, which is the intent of the peephole, but to the parent node of the spill. That is not catastrophic in most cases, but might reference another register slot, which causes this assert. Concretely, we see the following graph > > MemToRegSpillCopy > | | > | MemToRegSpillCopy > | | > DefiniinoSpillCopy | > | | > | decodeHeapOop_not_null > | | > leaPCompressedHeapOop > > gets rewired to > > MemToRegSpillCopy > | | > DefinitionSpillCopy | > | | > leaPCompressedHeapOop > > instead of > > MemToRegSpillCopy > | > DefinitionSpillCopy > / \ > leaPCompressedHeapOop > > > This PR fixes this by always setting the base edge of the `leaP*` node to the same node as the address edge. Unfortunately, I was not able to construct a regression test because of the difficulty of reproducing the bug. > > # Testing > > - [ ] Github Actions > - [x] tier1,tier2 plus internal testing on all Oracle supported platforms > - [x] tier3,tier4,tier5 plus internal testing on Linux and Windows x64 > - [ ] Runthese8H on `windows-x64-debug` (test that reliably produced the failure addressed in this PR) Marked as reviewed by chagedorn (Reviewer). src/hotspot/cpu/x86/peephole_x86_64.cpp line 349: > 347: Node* dependant_lea = decode->fast_out(i); > 348: if (dependant_lea->is_Mach() && dependant_lea->as_Mach()->ideal_Opcode() == Op_AddP) { > 349: dependant_lea->set_req(AddPNode::Base, lea_derived_oop->in(AddPNode::Address)); The fix looks reasonable to me, too. No worries about the regression test, thanks for trying! A small question: Why don't we use `lea_address`? 
Another thing I've noticed while browsing the code: `ra_` and `new_root` seem to be unused and could be removed (could probably also be squeezed into this PR here instead of creating a new issue just for that). ------------- PR Review: https://git.openjdk.org/jdk/pull/26157#pullrequestreview-2996452308 PR Review Comment: https://git.openjdk.org/jdk/pull/26157#discussion_r2191778511 From hgreule at openjdk.org Tue Jul 8 08:06:40 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Tue, 8 Jul 2025 08:06:40 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v2] In-Reply-To: References: Message-ID: On Fri, 4 Jul 2025 13:12:30 GMT, Manuel H?ssig wrote: >> Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: >> >> remove classfile version > > You forgot to add the new tests to the array of tests in `@Run`: > > stderr: Exception in thread "main" compiler.lib.ir_framework.shared.TestRunException: > > Test Failures (1) > ----------------- > Custom Run Test: @Run: runMethod - @Tests: {testI1,testI2,testI3,testL1,testL2,testL3,testS1,testS2,testS3,testUS1,testUS2,testUS3}: > compiler.lib.ir_framework.shared.TestRunException: There was an error while invoking @Run method public void ReverseBytesConstantsTests.runMethod() > at compiler.lib.ir_framework.test.CustomRunTest.invokeTest(CustomRunTest.java:162) > at compiler.lib.ir_framework.test.AbstractTest.run(AbstractTest.java:100) > at compiler.lib.ir_framework.test.CustomRunTest.run(CustomRunTest.java:89) > at compiler.lib.ir_framework.test.TestVM.runTests(TestVM.java:865) > at compiler.lib.ir_framework.test.TestVM.start(TestVM.java:255) > at compiler.lib.ir_framework.test.TestVM.main(TestVM.java:168) > Caused by: java.lang.reflect.InvocationTargetException > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:119) > at java.base/java.lang.reflect.Method.invoke(Method.java:565) > at compiler.lib.ir_framework.test.CustomRunTest.invokeTest(CustomRunTest.java:159) > ... 5 more > Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -24674 out of bounds for length 128 > at java.base/java.lang.Character.valueOf(Character.java:9284) > at ReverseBytesConstantsTests.assertResultUS(ReverseBytesConstantsTests.java:102) > at ReverseBytesConstantsTests.runMethod(ReverseBytesConstantsTests.java:66) > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > ... 7 more > at compiler.lib.ir_framework.test.TestVM.runTests(TestVM.java:901) > at compiler.lib.ir_framework.test.TestVM.start(TestVM.java:255) > at compiler.lib.ir_framework.test.TestVM.main(TestVM.java:168) Thanks @mhaessig. It seems like the methods don't need to be added there, as other methods were missing too. I updated the list nonetheless. Regarding the exception, I'm not sure what the expectations and guarantees are here. Does Java code have to expect inputs that are illegal in Java but legal in bytecode? I can work around this by casting the result to int explicitly (to compare `Integer` objects instead), but I feel like this is a deeper problem. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/25988#issuecomment-3047806943 From bkilambi at openjdk.org Tue Jul 8 08:16:39 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 8 Jul 2025 08:16:39 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: <0vdCJFYxCI6hFnTL6rm3oKQcPuuIR2EbuyAOa0muqcw=.d5c249cb-9bf0-415d-ab22-de7387d8d8d1@github.com> References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> <0vdCJFYxCI6hFnTL6rm3oKQcPuuIR2EbuyAOa0muqcw=.d5c249cb-9bf0-415d-ab22-de7387d8d8d1@github.com> Message-ID: On Tue, 8 Jul 2025 01:55:55 GMT, Xiaohong Gong wrote: >>> > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 >>> >>> Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. >> >> Hi @XiaohongGong thanks for asking. In many auto-vectorization cases involving `char`, the vector elements are represented using `T_SHORT` as the `BasicType`, rather than `T_CHAR`. >> >> This is because, in Java, operands of subword types are always promoted to `int` before any arithmetic operation. As a result, when handling a node like `ConvD2I`, we don?t initially know its actual subword type. Later, the SuperWord phase propagates a narrowed integer type backward to help determine the correct subword type. See: >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2551-L2558 >> >> Since SuperWord assigns `T_SHORT` to `StoreC` early on https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2646-L2650 >> the entire propagation chain tends to use `T_SHORT` as well. >> >> This applies to most operations, with the exception of a few like `RShiftI`, `Abs`, and `ReverseBytesI`, which are handled separately. >> >> So your change already benefits many char-related vectorization cases like `convertDoubleToChar` above. That?s why we can safely relax the IR condition mentioned earlier. > >> > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 >> > >> > >> > Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. >> >> Hi @XiaohongGong thanks for asking. In many auto-vectorization cases involving `char`, the vector elements are represented using `T_SHORT` as the `BasicType`, rather than `T_CHAR`. >> >> This is because, in Java, operands of subword types are always promoted to `int` before any arithmetic operation. As a result, when handling a node like `ConvD2I`, we don?t initially know its actual subword type. 
Later, the SuperWord phase propagates a narrowed integer type backward to help determine the correct subword type. See: >> >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2551-L2558 >> >> Since SuperWord assigns `T_SHORT` to `StoreC` early on >> >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2646-L2650 >> >> >> the entire propagation chain tends to use `T_SHORT` as well. >> This applies to most operations, with the exception of a few like `RShiftI`, `Abs`, and `ReverseBytesI`, which are handled separately. >> >> So your change already benefits many char-related vectorization cases like `convertDoubleToChar` above. That?s why we can safely relax the IR condition mentioned earlier. > > Thanks for your input! It's really helpful to me. Does this mean it always use `T_SHORT` for char vectors in SLP? If so, it's safe that we do not need to consider `T_CHAR` in vector IRs in backend? > > Hi @XiaohongGong, is there any way we can implement 2HF -> 2S and 2S -> 2HF in these match rules ? > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4697 > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4679 > > > > The `fcvtn` and `fcvtl` instructions do not support these arrangements. I was wondering if there is any other way we can implement these by any chance? > > Do you mean `2HF -> 2F` and `2F -> 2HF` ? > > Yes, it does not support the 32-bit arrangements. Vector conversion is a kind of lanewise vector operation. For such cases, we usually use the same arrangements with 64-bit vector size for 32-bit ones. That means we can reuse the `T4H` and `T4S` to implement it. Hence, current match rules can cover the conversions between `2HF` and `2F`. > > Consider there is no such conversion cases in Vector API, I didn't change the comment in the match rules. I think this may benefit auto-vectorization. Currently, do we have cases that can match these rules with SLP? Sorry yes I meant 2HF <-> 2F. Yes, currently there are no such cases in VectorAPI as we do not support Float16 Vectors yet but this will benefit autovectorization cases. I think in this case this may also benefit 2D <-> 2HF as well (eventually we might add support for D <-> HF as well). Yes we have some JTREG tests that match these rules currently like - `test/hotspot/jtreg/compiler/vectorization/TestFloat16VectorConvChain.java`, `test/hotspot/jtreg/compiler/vectorization/TestFloatConversionsVector.java`. 
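For context, a minimal Java loop of the kind those JTREG tests exercise; under auto-vectorization it lowers to the float-to-half and half-to-float vector casts discussed above. The class, method and array names are illustrative, `dst` is assumed to be at least as long as `src`, and only `Float.floatToFloat16`/`Float.float16ToFloat` are real JDK APIs.

```java
public class Float16ConvSketch {
    // float -> IEEE 754 binary16 bits; the 2-lane (32-bit) half-precision
    // shapes discussed above correspond to converting two floats at a time.
    static void floatToHalf(float[] src, short[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Float.floatToFloat16(src[i]);
        }
    }

    // binary16 bits -> float
    static void halfToFloat(short[] src, float[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Float.float16ToFloat(src[i]);
        }
    }
}
```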
------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3047838866 From fgao at openjdk.org Tue Jul 8 08:21:41 2025 From: fgao at openjdk.org (Fei Gao) Date: Tue, 8 Jul 2025 08:21:41 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: <0vdCJFYxCI6hFnTL6rm3oKQcPuuIR2EbuyAOa0muqcw=.d5c249cb-9bf0-415d-ab22-de7387d8d8d1@github.com> References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> <0vdCJFYxCI6hFnTL6rm3oKQcPuuIR2EbuyAOa0muqcw=.d5c249cb-9bf0-415d-ab22-de7387d8d8d1@github.com> Message-ID: On Tue, 8 Jul 2025 01:55:55 GMT, Xiaohong Gong wrote: > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 > > > > > > > > > Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. > > > > > > Hi @XiaohongGong thanks for asking. In many auto-vectorization cases involving `char`, the vector elements are represented using `T_SHORT` as the `BasicType`, rather than `T_CHAR`. > > This is because, in Java, operands of subword types are always promoted to `int` before any arithmetic operation. As a result, when handling a node like `ConvD2I`, we don?t initially know its actual subword type. Later, the SuperWord phase propagates a narrowed integer type backward to help determine the correct subword type. See: > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2551-L2558 > > > > Since SuperWord assigns `T_SHORT` to `StoreC` early on > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2646-L2650 > > > > the entire propagation chain tends to use `T_SHORT` as well. > > This applies to most operations, with the exception of a few like `RShiftI`, `Abs`, and `ReverseBytesI`, which are handled separately. > > So your change already benefits many char-related vectorization cases like `convertDoubleToChar` above. That?s why we can safely relax the IR condition mentioned earlier. > > Thanks for your input! It's really helpful to me. Does this mean it always use `T_SHORT` for char vectors in SLP? If so, it's safe that we do not need to consider `T_CHAR` in vector IRs in backend? No, we don't always use `T_SHORT` for char vectors. As mentioned earlier, for operations like `RShiftI`, `Abs`, and `ReverseBytesI`, the compiler needs to preserve the higher-order bits of the first operand. Therefore, SuperWord still needs to assign them precise subword types. 
See: https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2583-L2589 ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3047853525 From shade at openjdk.org Tue Jul 8 08:25:46 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 8 Jul 2025 08:25:46 GMT Subject: RFR: 8361397: Rework CompileLog list synchronization [v2] In-Reply-To: References: <12Yp6QmpXqG-1UXTS8VveJ4yDNnDEGFV2q3_vRc3lF0=.4ccf05e2-9249-4b55-b48f-4f7fc17bef65@github.com> Message-ID: <2YQBzDWuLDUUS78gQiO780XFkZ3Hy0zbD1SKKBBlJ48=.5f3a09e7-b667-4e6d-9ead-59eb4c7cb87c@github.com> On Mon, 7 Jul 2025 09:07:32 GMT, Aleksey Shipilev wrote: >> I want to remove `CompileTaskAlloc_lock` completely with [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473), and for that we need to fix a stray use of that lock in CompileLog list linkage. We can rewrite that part to atomics. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `compiler` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' into JDK-8361397-compilelog-list > - Fix Thank you! Here goes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26127#issuecomment-3047865696 From shade at openjdk.org Tue Jul 8 08:25:46 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 8 Jul 2025 08:25:46 GMT Subject: Integrated: 8361397: Rework CompileLog list synchronization In-Reply-To: <12Yp6QmpXqG-1UXTS8VveJ4yDNnDEGFV2q3_vRc3lF0=.4ccf05e2-9249-4b55-b48f-4f7fc17bef65@github.com> References: <12Yp6QmpXqG-1UXTS8VveJ4yDNnDEGFV2q3_vRc3lF0=.4ccf05e2-9249-4b55-b48f-4f7fc17bef65@github.com> Message-ID: On Fri, 4 Jul 2025 09:23:28 GMT, Aleksey Shipilev wrote: > I want to remove `CompileTaskAlloc_lock` completely with [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473), and for that we need to fix a stray use of that lock in CompileLog list linkage. We can rewrite that part to atomics. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `compiler` This pull request has now been integrated. Changeset: 7b255b8a Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/7b255b8a625ce1eda1ec6242b8e438691f6cc845 Stats: 11 lines in 2 files changed: 4 ins; 0 del; 7 mod 8361397: Rework CompileLog list synchronization Reviewed-by: kvn, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/26127 From aph at openjdk.org Tue Jul 8 08:31:46 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 8 Jul 2025 08:31:46 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v4] In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 18:19:54 GMT, Evgeny Astigeevich wrote: >> Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. >> >> This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. >> >> Tested on Linux and MacOS with and without hsdis: >> - Fastdebug: test passed >> - Slowdebug: test passed. 
>> - Release: test passed. > > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Implement using block_comment src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 6806: > 6804: > 6805: void MacroAssembler::spin_wait() { > 6806: block_comment("spin_wait"); You need something unique. e.g. Suggestion: block_comment("spin_wait_ohthoo8H"); ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2191835912 From duke at openjdk.org Tue Jul 8 08:41:59 2025 From: duke at openjdk.org (Andrej =?UTF-8?B?UGXEjWltw7p0aA==?=) Date: Tue, 8 Jul 2025 08:41:59 GMT Subject: RFR: 8357689: Refactor JVMCI to enable replay compilation in Graal [v2] In-Reply-To: References: Message-ID: > This PR introduces a few minor JVMCI refactorings to make replay compilation possible in the Graal compiler. Andrej Pe?im?th has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: JVMCI refactorings to enable replay compilation in Graal. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25433/files - new: https://git.openjdk.org/jdk/pull/25433/files/c12f507c..c6c3bb62 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25433&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25433&range=00-01 Stats: 204 lines in 10 files changed: 3 ins; 88 del; 113 mod Patch: https://git.openjdk.org/jdk/pull/25433.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25433/head:pull/25433 PR: https://git.openjdk.org/jdk/pull/25433 From duke at openjdk.org Tue Jul 8 08:41:59 2025 From: duke at openjdk.org (Andrej =?UTF-8?B?UGXEjWltw7p0aA==?=) Date: Tue, 8 Jul 2025 08:41:59 GMT Subject: RFR: 8357689: Refactor JVMCI to enable replay compilation in Graal [v2] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 08:37:53 GMT, Andrej Pe?im?th wrote: >> This PR introduces a few minor JVMCI refactorings to make replay compilation possible in the Graal compiler. > > Andrej Pe?im?th has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > JVMCI refactorings to enable replay compilation in Graal. src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/code/BytecodeFrame.java line 227: > 225: this.duringCall = duringCall; > 226: this.values = values; > 227: this.slotKinds = listFromTrustedArray(slotKinds); `ImmutableCollections#listFromTrustedArray` asserts that the array class is `Object[].class`, but the type of `slotKinds` is `JavaKind[]` - so these refactorings do not work. 
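A tiny, self-contained illustration of the point (plain Java, nothing JVMCI-specific; the class and variable names are made up): the runtime class of an array with a non-Object component type is not `Object[].class`, which is exactly what a trusted-array path asserts.

```java
public class ArrayClassSketch {
    public static void main(String[] args) {
        String[] typed = { "a", "b" };      // runtime class is String[]
        Object[] untyped = { "a", "b" };    // runtime class is Object[]

        System.out.println(typed.getClass() == Object[].class);   // false
        System.out.println(untyped.getClass() == Object[].class); // true

        // By the same token a JavaKind[] fails an Object[].class assertion
        // even though every element is an Object; only a genuine Object[]
        // (or a copy into one) passes such a check.
    }
}
```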
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25433#discussion_r2190153203 From shade at openjdk.org Tue Jul 8 08:42:48 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 8 Jul 2025 08:42:48 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v4] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 08:28:33 GMT, Andrew Haley wrote: >> Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: >> >> Implement using block_comment > > src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 6806: > >> 6804: >> 6805: void MacroAssembler::spin_wait() { >> 6806: block_comment("spin_wait"); > > You need something unique. e.g. > Suggestion: > > block_comment("spin_wait_ohthoo8H"); Erm, I don't see why? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2191861699 From roland at openjdk.org Tue Jul 8 08:43:31 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 8 Jul 2025 08:43:31 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v34] In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 13:36:31 GMT, Christian Hagedorn wrote: > I quickly ran through Emanuel's review comments. I think they all have been addressed. Added some follow-up suggestions on top but otherwise, it still looks good to me. Thanks for taking another look at this. I made the changes you recommended. ------------- PR Comment: https://git.openjdk.org/jdk/pull/21630#issuecomment-3047926590 From roland at openjdk.org Tue Jul 8 08:43:31 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 8 Jul 2025 08:43:31 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v36] In-Reply-To: References: Message-ID: > To optimize a long counted loop and long range checks in a long or int > counted loop, the loop is turned into a loop nest. When the loop has > few iterations, the overhead of having an outer loop whose backedge is > never taken, has a measurable cost. Furthermore, creating the loop > nest usually causes one iteration of the loop to be peeled so > predicates can be set up. If the loop is short running, then it's an > extra iteration that's run with range checks (compared to an int > counted loop with int range checks). > > This change doesn't create a loop nest when: > > 1- it can be determined statically at loop nest creation time that the > loop runs for a short enough number of iterations > > 2- profiling reports that the loop runs for no more than ShortLoopIter > iterations (1000 by default). > > For 2-, a guard is added which is implemented as yet another predicate. > > While this change is in principle simple, I ran into a few > implementation issues: > > - while c2 has a way to compute the number of iterations of an int > counted loop, it doesn't have that for long counted loop. The > existing logic for int counted loops promotes values to long to > avoid overflows. I reworked it so it now works for both long and int > counted loops. > > - I added a new deoptimization reason (Reason_short_running_loop) for > the new predicate. Given the number of iterations is narrowed down > by the predicate, the limit of the loop after transformation is a > cast node that's control dependent on the short running loop > predicate. 
Because once the counted loop is transformed, it is > likely that range check predicates will be inserted and they will > depend on the limit, the short running loop predicate has to be the > one that's further away from the loop entry. Now it is also possible > that the limit before transformation depends on a predicate > (TestShortRunningLongCountedLoopPredicatesClone is an example), we > can have: new predicates inserted after the transformation that > depend on the casted limit that itself depend on old predicates > added before the transformation. To solve this cicular dependency, > parse and assert predicates are cloned between the old predicates > and the loop head. The cloned short running loop parse predicate is > the one that's used to insert the short running loop predicate. > > - In the case of a long counted loop, the loop is transformed into a > regular loop with a new limit and transformed range checks that's > later turned into an in counted loop. The int ... Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 97 commits: - review - Merge branch 'master' into JDK-8342692 - Update src/hotspot/share/opto/c2_globals.hpp Co-authored-by: Christian Hagedorn - small fix - Merge branch 'master' into JDK-8342692 - review - review - Update test/micro/org/openjdk/bench/java/lang/foreign/HeapMismatchManualLoopTest.java Co-authored-by: Christian Hagedorn - Update test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningLongCountedLoopScaleOverflow.java Co-authored-by: Christian Hagedorn - Update test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningLongCountedLoopPredicatesClone.java Co-authored-by: Christian Hagedorn - ... and 87 more: https://git.openjdk.org/jdk/compare/310ef856...bb69cc02 ------------- Changes: https://git.openjdk.org/jdk/pull/21630/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=21630&range=35 Stats: 1619 lines in 26 files changed: 1541 ins; 22 del; 56 mod Patch: https://git.openjdk.org/jdk/pull/21630.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/21630/head:pull/21630 PR: https://git.openjdk.org/jdk/pull/21630 From aph at openjdk.org Tue Jul 8 08:48:39 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 8 Jul 2025 08:48:39 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v4] In-Reply-To: References: Message-ID: <3-UMhQGbb3psk7_pn0BC1SJNrKtOfZWphO_V9d9Bqz8=.09589b1a-fb0d-4150-95f4-7565d32ed7b1@github.com> On Tue, 8 Jul 2025 08:39:34 GMT, Aleksey Shipilev wrote: > Erm, I don't see why? To make it unique? I don't see why you don't see that we need to ensure that the string is unique. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2191876244 From epeter at openjdk.org Tue Jul 8 08:59:44 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 8 Jul 2025 08:59:44 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v9] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 05:56:58 GMT, erifan wrote: >> The current testing status for the conditions you listed: >>> 1. Not MaskCmp. >> >> **No test for it, tested locally**, Because I think this condition is too straightforward. >> >>> 2. More than one use. >> >> **Tested**, see `VectorMaskCompareNotTest.java line 1118`. >> >>> predicate negatable and vector not all ones. >> >> **Tested**, see `VectorMaskCompareNotTest.java line 1126`. >> >>> predicate not negatable and vector not all ones. 
>> >> **No test for it**, because we have tests for `predicate not negatable` or `vector not all ones`. If either is `false`, return nullptr. >> >>> predicate negatable and vector all ones. >> >> **A lot of tests for it**. For example `VectorMaskCompareNotTest.java line 1014`. >> >>> predicate not negatable and vectors all ones. >> >> **Tested**, see `VectorMaskCompareNotTest.java line 1222`. > >> Indentation for clarity. > > Done. > > I think we have enough negative tests. Please take a look at this PR, thanks~ Thanks for your answers @erifan ! Can you please answer these as well? > predicate cannot be negated AND the vector is all ones. Can you explain this condition? A code comment would be helpful for this case. I'm a little bit struggling to understand the bracket/negation here. > Why do you guard against VectorNode::is_all_ones_vector(in2) at all? Is this necessary? Why? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2191900155 From shade at openjdk.org Tue Jul 8 08:59:53 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 8 Jul 2025 08:59:53 GMT Subject: RFR: 8357473: Compilation spike leaves many CompileTasks in free list [v6] In-Reply-To: References: Message-ID: > See bug for more discussion. > > This PR implements the "all the way" solution by removing the free list completely. It complements https://github.com/openjdk/jdk/pull/25364, and can go either first, or second. We will remerge the other one once either integrates. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `compiler` > - [x] Linux AArch64 server fastdebug, `all` Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 11 commits: - Task count atomic can be relaxed - Minor touchup in ~CompileTask - Purge CompileTaskAlloc_lock completely - Merge branch 'master' into JDK-8357473-compile-task-free-list - Merge branch 'master' into JDK-8357473-compile-task-free-list - Merge branch 'master' into JDK-8357473-compile-task-free-list - Merge branch 'master' into JDK-8357473-compile-task-free-list - Merge branch 'master' into JDK-8357473-compile-task-free-list - Also free the lock! - Comments and indenting - ... and 1 more: https://git.openjdk.org/jdk/compare/7b255b8a...684f83b7 ------------- Changes: https://git.openjdk.org/jdk/pull/25409/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=25409&range=05 Stats: 134 lines in 6 files changed: 27 ins; 71 del; 36 mod Patch: https://git.openjdk.org/jdk/pull/25409.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25409/head:pull/25409 PR: https://git.openjdk.org/jdk/pull/25409 From shade at openjdk.org Tue Jul 8 08:59:54 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 8 Jul 2025 08:59:54 GMT Subject: RFR: 8357473: Compilation spike leaves many CompileTasks in free list [v5] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 07:17:20 GMT, Christian Hagedorn wrote: >> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits: >> >> - Merge branch 'master' into JDK-8357473-compile-task-free-list >> - Merge branch 'master' into JDK-8357473-compile-task-free-list >> - Merge branch 'master' into JDK-8357473-compile-task-free-list >> - Merge branch 'master' into JDK-8357473-compile-task-free-list >> - Also free the lock! 
>> - Comments and indenting >> - Basic deletion > > src/hotspot/share/compiler/compileTask.cpp line 84: > >> 82: >> 83: CompileTask::~CompileTask() { >> 84: if ((_method_holder != nullptr && JNIHandles::is_weak_global_handle(_method_holder))) { > > While moving the code, you can probably remove one pair of parentheses here: > Suggestion: > > if (_method_holder != nullptr && JNIHandles::is_weak_global_handle(_method_holder)) { Sure, done. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25409#discussion_r2191898167 From xgong at openjdk.org Tue Jul 8 09:03:42 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 8 Jul 2025 09:03:42 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> <0vdCJFYxCI6hFnTL6rm3oKQcPuuIR2EbuyAOa0muqcw=.d5c249cb-9bf0-415d-ab22-de7387d8d8d1@github.com> Message-ID: On Tue, 8 Jul 2025 08:18:57 GMT, Fei Gao wrote: > > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 > > > > > > > > > > > > Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. > > > > > > > > > Hi @XiaohongGong thanks for asking. In many auto-vectorization cases involving `char`, the vector elements are represented using `T_SHORT` as the `BasicType`, rather than `T_CHAR`. > > > This is because, in Java, operands of subword types are always promoted to `int` before any arithmetic operation. As a result, when handling a node like `ConvD2I`, we don?t initially know its actual subword type. Later, the SuperWord phase propagates a narrowed integer type backward to help determine the correct subword type. See: > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2551-L2558 > > > > > > Since SuperWord assigns `T_SHORT` to `StoreC` early on > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2646-L2650 > > > > > > the entire propagation chain tends to use `T_SHORT` as well. > > > This applies to most operations, with the exception of a few like `RShiftI`, `Abs`, and `ReverseBytesI`, which are handled separately. > > > So your change already benefits many char-related vectorization cases like `convertDoubleToChar` above. That?s why we can safely relax the IR condition mentioned earlier. > > > > > > Thanks for your input! It's really helpful to me. Does this mean it always use `T_SHORT` for char vectors in SLP? If so, it's safe that we do not need to consider `T_CHAR` in vector IRs in backend? > > No, we don't always use `T_SHORT` for char vectors. As mentioned earlier, for operations like `RShiftI`, `Abs`, and `ReverseBytesI`, the compiler needs to preserve the higher-order bits of the first operand. Therefore, SuperWord still needs to assign them precise subword types. 
See: > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2583-L2589 Yes, I see. Thanks! What I mean is for cases that SLP will use the subword types, it will be actually `T_SHORT` for `T_CHAR` then? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3047998283 From xgong at openjdk.org Tue Jul 8 09:09:43 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 8 Jul 2025 09:09:43 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> <0vdCJFYxCI6hFnTL6rm3oKQcPuuIR2EbuyAOa0muqcw=.d5c249cb-9bf0-415d-ab22-de7387d8d8d1@github.com> Message-ID: <1bwEx4HAqqmfw9DrslwrZH1cYfIKi-5p9AgelJrIB94=.f46dd942-2102-4fd3-adfd-7f7ec3c3dbc0@github.com> On Tue, 8 Jul 2025 09:00:53 GMT, Xiaohong Gong wrote: >>> > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 >>> > > >>> > > >>> > > Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. >>> > >>> > >>> > Hi @XiaohongGong thanks for asking. In many auto-vectorization cases involving `char`, the vector elements are represented using `T_SHORT` as the `BasicType`, rather than `T_CHAR`. >>> > This is because, in Java, operands of subword types are always promoted to `int` before any arithmetic operation. As a result, when handling a node like `ConvD2I`, we don?t initially know its actual subword type. Later, the SuperWord phase propagates a narrowed integer type backward to help determine the correct subword type. See: >>> > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2551-L2558 >>> > >>> > Since SuperWord assigns `T_SHORT` to `StoreC` early on >>> > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2646-L2650 >>> > >>> > the entire propagation chain tends to use `T_SHORT` as well. >>> > This applies to most operations, with the exception of a few like `RShiftI`, `Abs`, and `ReverseBytesI`, which are handled separately. >>> > So your change already benefits many char-related vectorization cases like `convertDoubleToChar` above. That?s why we can safely relax the IR condition mentioned earlier. >>> >>> Thanks for your input! It's really helpful to me. Does this mean it always use `T_SHORT` for char vectors in SLP? If so, it's safe that we do not need to consider `T_CHAR` in vector IRs in backend? >> >> No, we don't always use `T_SHORT` for char vectors. As mentioned earlier, for operations like `RShiftI`, `Abs`, and `ReverseBytesI`, the compiler needs to preserve the higher-order bits of the first operand. Therefore, SuperWord still needs to assign them precise subword types. 
See: >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2583-L2589 > >> > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 >> > > > >> > > > >> > > > Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. >> > > >> > > >> > > Hi @XiaohongGong thanks for asking. In many auto-vectorization cases involving `char`, the vector elements are represented using `T_SHORT` as the `BasicType`, rather than `T_CHAR`. >> > > This is because, in Java, operands of subword types are always promoted to `int` before any arithmetic operation. As a result, when handling a node like `ConvD2I`, we don?t initially know its actual subword type. Later, the SuperWord phase propagates a narrowed integer type backward to help determine the correct subword type. See: >> > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2551-L2558 >> > > >> > > Since SuperWord assigns `T_SHORT` to `StoreC` early on >> > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2646-L2650 >> > > >> > > the entire propagation chain tends to use `T_SHORT` as well. >> > > This applies to most operations, with the exception of a few like `RShiftI`, `Abs`, and `ReverseBytesI`, which are handled separately. >> > > So your change already benefits many char-related vectorization cases like `convertDoubleToChar` above. That?s why we can safely relax the IR condition mentioned earlier. >> > >> > >> > Thanks for your input! It's really helpful to me. Does this mean it always use `T_SHORT` for char vectors in SLP? If so, it's safe that we do not need to consider `T_CHAR` in vector IRs in backend? >> >> No, we don't always use `T_SHORT` for char vectors. As mentioned earlier, for operations like `RShiftI`, `Abs`, and `ReverseBytesI`, the compiler needs to preserve the higher-order bits of the first operand. Therefore, SuperWord still needs to assign them precise subword types. See: >> >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2583-L2589 > > Yes, I see. Thanks! What I mean is for cases that SLP will use the subword types, it will be actu... > > > Hi @XiaohongGong, is there any way we can implement 2HF -> 2S and 2S -> 2HF in these match rules ? > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4697 > > > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4679 > > > > > > The `fcvtn` and `fcvtl` instructions do not support these arrangements. I was wondering if there is any other way we can implement these by any chance? > > > > > > Do you mean `2HF -> 2F` and `2F -> 2HF` ? > > Yes, it does not support the 32-bit arrangements. Vector conversion is a kind of lanewise vector operation. For such cases, we usually use the same arrangements with 64-bit vector size for 32-bit ones. That means we can reuse the `T4H` and `T4S` to implement it. 
Hence, current match rules can cover the conversions between `2HF` and `2F`. > > Consider there is no such conversion cases in Vector API, I didn't change the comment in the match rules. I think this may benefit auto-vectorization. Currently, do we have cases that can match these rules with SLP? > > Sorry yes I meant 2HF <-> 2F. Yes, currently there are no such cases in VectorAPI as we do not support Float16 Vectors yet but this will benefit autovectorization cases. I think in this case this may also benefit 2D <-> 2HF as well (eventually we might add support for D <-> HF as well). Yes we have some JTREG tests that match these rules currently like - `test/hotspot/jtreg/compiler/vectorization/TestFloat16VectorConvChain.java`, `test/hotspot/jtreg/compiler/vectorization/TestFloatConversionsVector.java`. Thanks! So per my understanding, things that I just need is updating comment (e.g. `// 4HF to 4F`) of rules like `vcvtHFtoF`, right? For conversions between double and HF, we do not need any new rules as it will be actually `double -> float -> HF`, right? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3048019109 From shade at openjdk.org Tue Jul 8 09:11:50 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 8 Jul 2025 09:11:50 GMT Subject: RFR: 8357473: Compilation spike leaves many CompileTasks in free list [v6] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 08:59:53 GMT, Aleksey Shipilev wrote: >> See bug for more discussion. >> >> This PR implements the "all the way" solution by removing the free list completely. It complements https://github.com/openjdk/jdk/pull/25364, and can go either first, or second. We will remerge the other one once either integrates. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `compiler` >> - [x] Linux AArch64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 11 commits: > > - Task count atomic can be relaxed > - Minor touchup in ~CompileTask > - Purge CompileTaskAlloc_lock completely > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Also free the lock! > - Comments and indenting > - ... and 1 more: https://git.openjdk.org/jdk/compare/7b255b8a...684f83b7 Also realized the Atomic can be relaxed, since nothing rides on its memory consistency effects. I am retesting to make sure. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25409#issuecomment-3048024663 From eastigeevich at openjdk.org Tue Jul 8 09:18:39 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 8 Jul 2025 09:18:39 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v4] In-Reply-To: <3-UMhQGbb3psk7_pn0BC1SJNrKtOfZWphO_V9d9Bqz8=.09589b1a-fb0d-4150-95f4-7565d32ed7b1@github.com> References: <3-UMhQGbb3psk7_pn0BC1SJNrKtOfZWphO_V9d9Bqz8=.09589b1a-fb0d-4150-95f4-7565d32ed7b1@github.com> Message-ID: <0rQQG6eMxxtYRbrUUwSd6xABIwCuYg2Yf-5RJXb6U78=.794c8979-07ba-4a93-9166-9faa3827f77e@github.com> On Tue, 8 Jul 2025 08:46:26 GMT, Andrew Haley wrote: >> Erm, I don't see why? > >> Erm, I don't see why? > > To make it unique? 
I don't see why you don't see that we need to ensure that the string is unique. @theRealAph, what do you think to have it in the format: `spin_wait_%INST-COUNT%_%INST_NAME%`? For example: `spin_wait_1_isb` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2191937598 From shade at openjdk.org Tue Jul 8 09:18:40 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 8 Jul 2025 09:18:40 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v4] In-Reply-To: <0rQQG6eMxxtYRbrUUwSd6xABIwCuYg2Yf-5RJXb6U78=.794c8979-07ba-4a93-9166-9faa3827f77e@github.com> References: <3-UMhQGbb3psk7_pn0BC1SJNrKtOfZWphO_V9d9Bqz8=.09589b1a-fb0d-4150-95f4-7565d32ed7b1@github.com> <0rQQG6eMxxtYRbrUUwSd6xABIwCuYg2Yf-5RJXb6U78=.794c8979-07ba-4a93-9166-9faa3827f77e@github.com> Message-ID: <196sEl_APNWLhyHVGJTQ0yrudbjUxu-qCMOI9siz7FM=.7ebf304d-c4f3-4406-bfd1-85770397240b@github.com> On Tue, 8 Jul 2025 09:13:53 GMT, Evgeny Astigeevich wrote: >>> Erm, I don't see why? >> >> To make it unique? I don't see why you don't see that we need to ensure that the string is unique. > > @theRealAph, what do you think to have it in the format: `spin_wait_%INST-COUNT%_%INST_NAME%`? > For example: `spin_wait_1_isb` But this is a generic macroAssembler block comment. It does not make sense to me to have a block comment that looks like a memory corrupted string, just to satisfy a single test. There should be a middle-ground here, e.g. `spin_wait {`, which looks reasonably enough as the block comment, and not anything else. Would also match nicely when we emit the closing `}` at the end of the instruction block. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2191941126 From eastigeevich at openjdk.org Tue Jul 8 09:23:44 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 8 Jul 2025 09:23:44 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v4] In-Reply-To: <196sEl_APNWLhyHVGJTQ0yrudbjUxu-qCMOI9siz7FM=.7ebf304d-c4f3-4406-bfd1-85770397240b@github.com> References: <3-UMhQGbb3psk7_pn0BC1SJNrKtOfZWphO_V9d9Bqz8=.09589b1a-fb0d-4150-95f4-7565d32ed7b1@github.com> <0rQQG6eMxxtYRbrUUwSd6xABIwCuYg2Yf-5RJXb6U78=.794c8979-07ba-4a93-9166-9faa3827f77e@github.com> <196sEl_APNWLhyHVGJTQ0yrudbjUxu-qCMOI9siz7FM=.7ebf304d-c4f3-4406-bfd1-85770397240b@github.com> Message-ID: On Tue, 8 Jul 2025 09:15:32 GMT, Aleksey Shipilev wrote: >> @theRealAph, what do you think to have it in the format: `spin_wait_%INST-COUNT%_%INST_NAME%`? >> For example: `spin_wait_1_isb` > > But this is a generic macroAssembler block comment. It does not make sense to me to have a block comment that looks like a memory corrupted string, just to satisfy a single test. There should be a middle-ground here, e.g. `spin_wait {`, which looks reasonably enough as the block comment, and not anything else. Would also match nicely when we emit the closing `}` at the end of the instruction block. I like the idea of having `{`, `}`.
IMO this should be informative: ;; spin_wait_3_nop { nop nop nop ;; } ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2191957430 From adinn at openjdk.org Tue Jul 8 09:28:42 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 8 Jul 2025 09:28:42 GMT Subject: RFR: 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 23:48:03 GMT, Vladimir Kozlov wrote: > `CodeBlob::relocation_size()` is calculated as `(_mutable_data + _relocation_size - _mutable_data)`. `CodeBlob::relocation_size()` is called during AOT code loading before we allocate space for mutable data (the size is used to find how big space should be allocated). The expression at that point is `(NULL + _relocation_size - NULL)` which returns correct result. But we should just return `_relocation_size` which is recorded anyway in AOT data. > > Added missing `_mutable_data = blob_end();` initialization when we restore AOT code blob. > > Fixed embarrassing typo in asserts. > > Tested: tier1-6,8,10,xcomp,stress Looks good. ------------- Marked as reviewed by adinn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26175#pullrequestreview-2996756136 From shade at openjdk.org Tue Jul 8 09:38:42 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 8 Jul 2025 09:38:42 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v4] In-Reply-To: References: <3-UMhQGbb3psk7_pn0BC1SJNrKtOfZWphO_V9d9Bqz8=.09589b1a-fb0d-4150-95f4-7565d32ed7b1@github.com> <0rQQG6eMxxtYRbrUUwSd6xABIwCuYg2Yf-5RJXb6U78=.794c8979-07ba-4a93-9166-9faa3827f77e@github.com> <196sEl_APNWLhyHVGJTQ0yrudbjUxu-qCMOI9siz7FM=.7ebf304d-c4f3-4406-bfd1-85770397240b@github.com> Message-ID: On Tue, 8 Jul 2025 09:21:03 GMT, Evgeny Astigeevich wrote: >> But this is a generic macroAssembler block comment. It does not make sense to me to have a block comment that looks like a memory corrupted string, just to satisfy a single test. There should be a middle-ground here, e.g. `spin_wait {`, which looks reasonably enough as the block comment, and not anything else. Would also match nicely when we emit the closing `}` at the end of the instruction block. > > I like the idea of having `{`, `}`. > IMO this should be informative: > > ;; spin_wait_3_nop { > nop > nop > nop > ;; } Putting `_3_nop` might be convenient, but I think that would introduce too much hassle in macroAssembler? Just this looks okay to me: ;; spin_wait { nop nop nop ;; } ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2191990275 From duke at openjdk.org Tue Jul 8 09:51:45 2025 From: duke at openjdk.org (erifan) Date: Tue, 8 Jul 2025 09:51:45 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v9] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 08:56:44 GMT, Emanuel Peter wrote: > predicate cannot be negated AND the vector is all ones. Can you explain this condition? Ok, I'll add a comment for it. > Why do you guard against VectorNode::is_all_ones_vector(in2) at all? Because one of the nodes in the supported patterns by this PR needs to be `MaskAll` or `Replicate`. And this function `VectorNode::is_all_ones_vector` just meets our check for `MaskAll` and `Replicate`. Actually I don't quite understand your question. I have two understandings: 1. 
Not all nodes that `VectorNode::is_all_ones_vector` returns true are `MaskAll` or `Replicate`, but other nodes that do not meet the conditions. 2. Here, it does not need to be a vector with every bit 1, it only needs to be an `all true` mask. Which one do you mean? Or something else? Thanks~ ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2192019309 From duke at openjdk.org Tue Jul 8 10:23:43 2025 From: duke at openjdk.org (erifan) Date: Tue, 8 Jul 2025 10:23:43 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v9] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 09:49:30 GMT, erifan wrote: >> Thanks for your answers @erifan ! >> >> Can you please answer these as well? >> >>> predicate cannot be negated AND the vector is all ones. Can you explain this condition? >> >> A code comment would be helpful for this case. I'm a little bit struggling to understand the bracket/negation here. >> >>> Why do you guard against VectorNode::is_all_ones_vector(in2) at all? >> >> Is this necessary? Why? > >> predicate cannot be negated AND the vector is all ones. Can you explain this condition? > > Ok, I'll add a comment for it. > >> Why do you guard against VectorNode::is_all_ones_vector(in2) at all? > > Because one of the nodes in the supported patterns by this PR needs to be `MaskAll` or `Replicate`. And this function `VectorNode::is_all_ones_vector` just meets our check for `MaskAll` and `Replicate`. Actually I don't quite understand your question. I have two understandings: > 1. Not all nodes that `VectorNode::is_all_ones_vector` returns true are `MaskAll` or `Replicate`, but other nodes that do not meet the conditions. > 2. Here, it does not need to be a vector with every bit 1, it only needs to be an `all true` mask. > > Which one do you mean? Or something else? Thanks~ The purpose of this PR is optimizing the following kinds of patterns: XXXVector va, vb; va.compare(EQ, vb).not() And the generated IR of `va.compare(EQ, vb).not()` is `(XorVMask (VectorMaskCmp va vb EQ) (MaskAll -1))`. On platforms like aarch64 NEON, `MaskAll` is `Replicate`. And `MaskAll` and `Replicate` are both all ones vectors, so we do this check `VectorNode::is_all_ones_vector(in2)` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2192087590 From fgao at openjdk.org Tue Jul 8 10:36:50 2025 From: fgao at openjdk.org (Fei Gao) Date: Tue, 8 Jul 2025 10:36:50 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> <0vdCJFYxCI6hFnTL6rm3oKQcPuuIR2EbuyAOa0muqcw=.d5c249cb-9bf0-415d-ab22-de7387d8d8d1@github.com> Message-ID: On Tue, 8 Jul 2025 09:00:53 GMT, Xiaohong Gong wrote: > > > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 > > > > > > > > > > > > > > > Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. 
> > > > > > > > > > > > Hi @XiaohongGong thanks for asking. In many auto-vectorization cases involving `char`, the vector elements are represented using `T_SHORT` as the `BasicType`, rather than `T_CHAR`. > > > > This is because, in Java, operands of subword types are always promoted to `int` before any arithmetic operation. As a result, when handling a node like `ConvD2I`, we don't initially know its actual subword type. Later, the SuperWord phase propagates a narrowed integer type backward to help determine the correct subword type. See: > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2551-L2558 > > > > > > > > Since SuperWord assigns `T_SHORT` to `StoreC` early on > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2646-L2650 > > > > > > > > the entire propagation chain tends to use `T_SHORT` as well. > > > > This applies to most operations, with the exception of a few like `RShiftI`, `Abs`, and `ReverseBytesI`, which are handled separately. > > > > So your change already benefits many char-related vectorization cases like `convertDoubleToChar` above. That's why we can safely relax the IR condition mentioned earlier. > > > > > > > > > Thanks for your input! It's really helpful to me. Does this mean it always uses `T_SHORT` for char vectors in SLP? If so, it's safe that we do not need to consider `T_CHAR` in vector IRs in backend? > > > > > > No, we don't always use `T_SHORT` for char vectors. As mentioned earlier, for operations like `RShiftI`, `Abs`, and `ReverseBytesI`, the compiler needs to preserve the higher-order bits of the first operand. Therefore, SuperWord still needs to assign them precise subword types. See: > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2583-L2589 > > Yes, I see. Thanks! What I mean is for cases that SLP will use the subword types, it will be actually `T_SHORT` for `T_CHAR` then? From my side, the cases where SLP uses subword types can be roughly categorized into two groups: 1. Cases where the compiler doesn't need to preserve the higher-order bits; in these, SuperWord will use `T_SHORT` instead of `T_CHAR`. 2. Cases where the compiler does need to preserve the higher-order bits, like `RShiftI`, `Abs`, and `ReverseBytesI`; in these, `T_CHAR` is still used. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3048334299 From bkilambi at openjdk.org Tue Jul 8 10:57:40 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 8 Jul 2025 10:57:40 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: <1bwEx4HAqqmfw9DrslwrZH1cYfIKi-5p9AgelJrIB94=.f46dd942-2102-4fd3-adfd-7f7ec3c3dbc0@github.com> References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> <0vdCJFYxCI6hFnTL6rm3oKQcPuuIR2EbuyAOa0muqcw=.d5c249cb-9bf0-415d-ab22-de7387d8d8d1@github.com> <1bwEx4HAqqmfw9DrslwrZH1cYfIKi-5p9AgelJrIB94=.f46dd942-2102-4fd3-adfd-7f7ec3c3dbc0@github.com> Message-ID: On Tue, 8 Jul 2025 09:07:00 GMT, Xiaohong Gong wrote: > > > > Hi @XiaohongGong, is there any way we can implement 2HF -> 2S and 2S -> 2HF in these match rules ?
> > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4697 > > > > > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4679 > > > > > > > > The `fcvtn` and `fcvtl` instructions do not support these arrangements. I was wondering if there is any other way we can implement these by any chance? > > > > > > > > > Do you mean `2HF -> 2F` and `2F -> 2HF` ? > > > Yes, it does not support the 32-bit arrangements. Vector conversion is a kind of lanewise vector operation. For such cases, we usually use the same arrangements with 64-bit vector size for 32-bit ones. That means we can reuse the `T4H` and `T4S` to implement it. Hence, current match rules can cover the conversions between `2HF` and `2F`. > > > Consider there is no such conversion cases in Vector API, I didn't change the comment in the match rules. I think this may benefit auto-vectorization. Currently, do we have cases that can match these rules with SLP? > > > > > > Sorry yes I meant 2HF <-> 2F. Yes, currently there are no such cases in VectorAPI as we do not support Float16 Vectors yet but this will benefit autovectorization cases. I think in this case this may also benefit 2D <-> 2HF as well (eventually we might add support for D <-> HF as well). Yes we have some JTREG tests that match these rules currently like - `test/hotspot/jtreg/compiler/vectorization/TestFloat16VectorConvChain.java`, `test/hotspot/jtreg/compiler/vectorization/TestFloatConversionsVector.java`. > > Thanks! So per my understanding, things that I just need is updating comment (e.g. `// 4HF to 4F`) of rules like `vcvtHFtoF`, right? For conversions between double and HF, we do not need any new rules as it will be actually `double -> float -> HF`, right? Yes please and also for `4F to 4HF` case for `vcvtF2HF`. Thanks! As for the double to half float conversion - yes with the current infrastructure it would be ConvD2F -> ConvF2HF which will be autovectorized to generate corresponding vector nodes. Sooner or later, support for ConvD2HF (and its vectorized version) might be added upstream (support already available in `lworld+fp16` branch of Valhalla here - https://github.com/openjdk/valhalla/blob/0ed65b9a63405e950c411835120f0f36e326aaaa/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4535). You do not have to add any new rules now for this case. I was just hinting at possible D<->HF implementation in the future. As the max vector length was 64bits, I did not add any implementation for Neon vcvtD2HF or vcvtHF2D in Valhalla. Maybe we can do two `fcvtl/fcvtn` to convert D to F and then F to HF for this specific case but we can think about that later :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3048404345 From duke at openjdk.org Tue Jul 8 11:05:49 2025 From: duke at openjdk.org (duke) Date: Tue, 8 Jul 2025 11:05:49 GMT Subject: Withdrawn: 8354282: C2: more crashes in compiled code because of dependency on removed range check CastIIs In-Reply-To: References: Message-ID: On Thu, 10 Apr 2025 15:15:54 GMT, Roland Westrelin wrote: > This is a variant of 8332827. In 8332827, an array access becomes > dependent on a range check `CastII` for another array access. When, > after loop opts are over, that RC `CastII` was removed, the array > access could float and an out of bound access happened. With the fix > for 8332827, RC `CastII`s are no longer removed. 
> > With this one what happens is that some transformations applied after > loop opts are over widen the type of the RC `CastII`. As a result, the > type of the RC `CastII` is no longer narrower than that of its input, > the `CastII` is removed and the dependency is lost. > > There are 2 transformations that cause this to happen: > > - after loop opts are over, the type of the `CastII` nodes is widened > so nodes that have the same inputs but a slightly different type can > common. > > - When pushing a `CastII` through an `Add`, if the type of both inputs > of the `Add` is non constant, then we end up widening the type > (the resulting `Add` has a type that's wider than that of the > initial `CastII`). > > There are already 3 types of `Cast` nodes depending on the > optimizations that are allowed. Either the `Cast` is floating > (`depends_only_test()` returns `true`) or pinned. Either the `Cast` > can be removed if it no longer narrows the type of its input or > not. We already have variants of the `CastII`: > > - if the Cast can float and be removed when it doesn't narrow the type > of its input. > > - if the Cast is pinned and be removed when it doesn't narrow the type > of its input. > > - if the Cast is pinned and can't be removed when it doesn't narrow > the type of its input. > > What we need here, I think, is the 4th combination: > > - if the Cast can float and can't be removed when it doesn't narrow > the type of its input. > > Anyway, things are becoming confusing with all these different > variants named in ways that don't always help figure out what > constraints one of them operates under. So I refactored this and that's > the biggest part of this change. The fix consists in marking `Cast` > nodes when their type is widened in a way that prevents them from being > optimized out. > > Tobias ran performance testing with a slightly different version of > this change and there was no regression. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/24575 From rrich at openjdk.org Tue Jul 8 11:31:46 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 8 Jul 2025 11:31:46 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v6] In-Reply-To: References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: On Mon, 7 Jul 2025 07:30:19 GMT, David Briemann wrote: >> Implement more nodes for ppc that exist on other platforms. > > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > rename instruction, add extra predicate cond for type int src/hotspot/cpu/ppc/ppc.ad line 13601: > 13599: instruct vneg4I_reg(vecX dst, vecX src) %{ > 13600: match(Set dst (NegVI src)); > 13601: predicate(PowerArchitecturePPC64 >= 9 && Matcher::vector_element_basic_type(n) == T_INT); Why not also for `T_LONG` (using `vnegd`)?
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26115#discussion_r2192238645 From epeter at openjdk.org Tue Jul 8 11:45:51 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 8 Jul 2025 11:45:51 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v10] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 06:00:38 GMT, erifan wrote: >> This patch optimizes the following patterns: >> For integer types: >> >> (XorV (VectorMaskCmp src1 src2 cond) (Replicate -1)) >> => (VectorMaskCmp src1 src2 ncond) >> (XorVMask (VectorMaskCmp src1 src2 cond) (MaskAll m1)) >> => (VectorMaskCmp src1 src2 ncond) >> >> cond can be eq, ne, le, ge, lt, gt, ule, uge, ult and ugt, ncond is the negative comparison of cond. >> >> For float and double types: >> >> (XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1)) >> => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) >> (XorVMask (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (MaskAll m1)) >> => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) >> >> cond can be eq or ne. >> >> Benchmarks on Nvidia Grace machine with 128-bit SVE2: With option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Score Error After Score Error Uplift >> testCompareEQMaskNotByte ops/s 7912127.225 2677.289518 10266136.26 8955.008548 1.29 >> testCompareEQMaskNotDouble ops/s 884737.6799 446.963779 1179760.772 448.031844 1.33 >> testCompareEQMaskNotFloat ops/s 1765045.787 682.332214 2359520.803 896.305743 1.33 >> testCompareEQMaskNotInt ops/s 1787221.411 977.743935 2353952.519 960.069976 1.31 >> testCompareEQMaskNotLong ops/s 895297.1974 673.44808 1178449.02 323.804205 1.31 >> testCompareEQMaskNotShort ops/s 3339987.002 3415.2226 4712761.965 2110.862053 1.41 >> testCompareGEMaskNotByte ops/s 7907615.16 4094.243652 10251646.9 9486.699831 1.29 >> testCompareGEMaskNotInt ops/s 1683738.958 4233.813092 2352855.205 1251.952546 1.39 >> testCompareGEMaskNotLong ops/s 854496.1561 8594.598885 1177811.493 521.1229 1.37 >> testCompareGEMaskNotShort ops/s 3341860.309 1578.975338 4714008.434 1681.10365 1.41 >> testCompareGTMaskNotByte ops/s 7910823.674 2993.367032 10245063.58 9774.75138 1.29 >> testCompareGTMaskNotInt ops/s 1673393.928 3153.099431 2353654.521 1190.848583 1.4 >> testCompareGTMaskNotLong ops/s 849405.9159 2432.858159 1177952.041 359.96413 1.38 >> testCompareGTMaskNotShort ops/s 3339509.141 3339.976585 4711442.496 2673.364893 1.41 >> testCompareLEMaskNotByte ops/s 7911340.004 3114.69191 10231626.5 27134.20035 1.29 >> testCompareLEMaskNotInt ops/s 1675812.113 1340.969885 2353255.341 1452.4522 1.4 >> testCompareLEMaskNotLong ops/s 848862.8036 6564.841731 1177763.623 539.290106 1.38 >> testCompareLEMaskNotShort ops/s 3324951.54 2380.29473 4712116.251 1544.559684 1.41 >> testCompareLTMaskNotByte ops/s 7910390.844 2630.861436 10239567.69 6487.441672 1.29 >> testCompareLTMaskNotInt ops/s 16721... > > erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 17 additional commits since the last revision: > > - Align indentation > - Merge branch 'master' into JDK-8354242 > - Address more comments > > ATT. > - Merge branch 'master' into JDK-8354242 > - Support negating unsigned comparison for BoolTest::mask > > Added a static method `negate_mask(mask btm)` into BoolTest class to > negate both signed and unsigned comparison. 
> - Addressed some review comments > - Merge branch 'master' into JDK-8354242 > - Refactor the JTReg tests for compare.xor(maskAll) > > Also made a bit change to support pattern `VectorMask.fromLong()`. > - Merge branch 'master' into JDK-8354242 > - Refactor code > > Add a new function XorVNode::Ideal_XorV_VectorMaskCmp to do this > optimization, making the code more modular. > - ... and 7 more: https://git.openjdk.org/jdk/compare/f1740382...db78dc43 src/hotspot/share/opto/vectornode.cpp line 2241: > 2239: in1->outcnt() != 1 || > 2240: !(in1->as_VectorMaskCmp())->predicate_can_be_negated() || > 2241: !VectorNode::is_all_ones_vector(in2)) { Suggestion: !in1->as_VectorMaskCmp()->predicate_can_be_negated() || !VectorNode::is_all_ones_vector(in2)) { Remove the indentation again, and the superfluous brackets too ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2192263236 From epeter at openjdk.org Tue Jul 8 11:45:52 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 8 Jul 2025 11:45:52 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v9] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 10:21:03 GMT, erifan wrote: >>> predicate cannot be negated AND the vector is all ones. Can you explain this condition? >> >> Ok, I'll add a comment for it. >> >>> Why do you guard against VectorNode::is_all_ones_vector(in2) at all? >> >> Because one of the nodes in the supported patterns by this PR needs to be `MaskAll` or `Replicate`. And this function `VectorNode::is_all_ones_vector` just meets our check for `MaskAll` and `Replicate`. Actually I don't quite understand your question. I have two understandings: >> 1. Not all nodes that `VectorNode::is_all_ones_vector` returns true are `MaskAll` or `Replicate`, but other nodes that do not meet the conditions. >> 2. Here, it does not need to be a vector with every bit 1, it only needs to be an `all true` mask. >> >> Which one do you mean? Or something else? Thanks~ > > The purpose of this PR is optimizing the following kinds of patterns: > > XXXVector va, vb; > va.compare(EQ, vb).not() > > And the generated IR of `va.compare(EQ, vb).not()` is `(XorVMask (VectorMaskCmp va vb EQ) (MaskAll -1))`. On platforms like aarch64 NEON, `MaskAll` is `Replicate`. And `MaskAll` and `Replicate` are both all ones vectors, so we do this check `VectorNode::is_all_ones_vector(in2)` Oh wow, my bad. I misunderstood the brackets! Instead of: !(in1->as_VectorMaskCmp())->predicate_can_be_negated() || !VectorNode::is_all_ones_vector(in2)) { I read: !(in1->as_VectorMaskCmp()->predicate_can_be_negated() || !VectorNode::is_all_ones_vector(in2))) { That confused me a lot... absolutely my bad. Well actually then my indentation suggestion was terrible! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2192261081 From epeter at openjdk.org Tue Jul 8 11:45:52 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 8 Jul 2025 11:45:52 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v9] In-Reply-To: References: Message-ID: <2UzxnawLUtlwIr5aaEdTfn4OEMt_z1HTfAaDBHCeZFU=.a70d1360-574a-4ca9-adae-7dec030ed2b7@github.com> On Tue, 8 Jul 2025 11:41:01 GMT, Emanuel Peter wrote: >> The purpose of this PR is optimizing the following kinds of patterns: >> >> XXXVector va, vb; >> va.compare(EQ, vb).not() >> >> And the generated IR of `va.compare(EQ, vb).not()` is `(XorVMask (VectorMaskCmp va vb EQ) (MaskAll -1))`. 
On platforms like aarch64 NEON, `MaskAll` is `Replicate`. And `MaskAll` and `Replicate` are both all ones vectors, so we do this check `VectorNode::is_all_ones_vector(in2)` > > Oh wow, my bad. I misunderstood the brackets! > > Instead of: > > !(in1->as_VectorMaskCmp())->predicate_can_be_negated() || > !VectorNode::is_all_ones_vector(in2)) { > > I read: > > !(in1->as_VectorMaskCmp()->predicate_can_be_negated() || > !VectorNode::is_all_ones_vector(in2))) { > > That confused me a lot... absolutely my bad. > > Well actually then my indentation suggestion was terrible! I made a new suggestion below. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2192263852 From dbriemann at openjdk.org Tue Jul 8 11:49:39 2025 From: dbriemann at openjdk.org (David Briemann) Date: Tue, 8 Jul 2025 11:49:39 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v6] In-Reply-To: References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: On Tue, 8 Jul 2025 11:29:20 GMT, Richard Reingruber wrote: >> David Briemann has updated the pull request incrementally with one additional commit since the last revision: >> >> rename instruction, add extra predicate cond for type int > > src/hotspot/cpu/ppc/ppc.ad line 13601: > >> 13599: instruct vneg4I_reg(vecX dst, vecX src) %{ >> 13600: match(Set dst (NegVI src)); >> 13601: predicate(PowerArchitecturePPC64 >= 9 && Matcher::vector_element_basic_type(n) == T_INT); > > Why not also for `T_LONG` (using `vnegd`)? Because we have made the experience that the vector instructions for longs are very slow on PPC. So far all of them I tried or implemented were slower than the non-vectorized alternative. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26115#discussion_r2192276433 From duke at openjdk.org Tue Jul 8 11:52:49 2025 From: duke at openjdk.org (Guanqiang Han) Date: Tue, 8 Jul 2025 11:52:49 GMT Subject: Integrated: 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache In-Reply-To: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> References: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> Message-ID: On Thu, 3 Jul 2025 11:29:02 GMT, Guanqiang Han wrote: > The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is confusing and does not reflect the current implementation. > > This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. This pull request has now been integrated. 
Changeset: 27e6a4d2 Author: han gq Committer: Evgeny Astigeevich URL: https://git.openjdk.org/jdk/commit/27e6a4d2f7a4bdd12408e518e86aeb623f1c41bc Stats: 3 lines in 1 file changed: 0 ins; 2 del; 1 mod 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache Reviewed-by: kvn, eastigeevich ------------- PR: https://git.openjdk.org/jdk/pull/26114 From mbaesken at openjdk.org Tue Jul 8 11:53:41 2025 From: mbaesken at openjdk.org (Matthias Baesken) Date: Tue, 8 Jul 2025 11:53:41 GMT Subject: RFR: 8361040: compiler/codegen/TestRedundantLea.java#StringInflate fails with failed IR rules In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 14:44:03 GMT, Manuel H?ssig wrote: > `TestRedundantLea.java#StringInflate` failed on Alpine Linux because fewer `DecodeHeapOop_not_null`s than expected are generated even though the expected reduction is still present. This PR fixes this. > > Unfortunately, this fix makes the test less precise. I filed [JDK-8361045](https://bugs.openjdk.org/browse/JDK-8361045) to fix this when the IR-framework allows for it. > > Testing: > - [x] Github Actions > - [x] tier1, tier2 plus Oracle internal testing > - [x] `TestRedundantLea.java` on Alpine Linux Marked as reviewed by mbaesken (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26046#pullrequestreview-2997284362 From rrich at openjdk.org Tue Jul 8 11:54:44 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 8 Jul 2025 11:54:44 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v6] In-Reply-To: References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: On Tue, 8 Jul 2025 11:47:28 GMT, David Briemann wrote: >> src/hotspot/cpu/ppc/ppc.ad line 13601: >> >>> 13599: instruct vneg4I_reg(vecX dst, vecX src) %{ >>> 13600: match(Set dst (NegVI src)); >>> 13601: predicate(PowerArchitecturePPC64 >= 9 && Matcher::vector_element_basic_type(n) == T_INT); >> >> Why not also for `T_LONG` (using `vnegd`)? > > Because we have made the experience that the vector instructions for longs are very slow on PPC. So far all of them I tried or implemented were slower than the non-vectorized alternative. Ok I remember that. Why have you implemented `VMINU` and `VMAXU` for `T_LONG` then? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26115#discussion_r2192288024 From eastigeevich at openjdk.org Tue Jul 8 12:06:48 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 8 Jul 2025 12:06:48 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v4] In-Reply-To: <3-UMhQGbb3psk7_pn0BC1SJNrKtOfZWphO_V9d9Bqz8=.09589b1a-fb0d-4150-95f4-7565d32ed7b1@github.com> References: <3-UMhQGbb3psk7_pn0BC1SJNrKtOfZWphO_V9d9Bqz8=.09589b1a-fb0d-4150-95f4-7565d32ed7b1@github.com> Message-ID: On Tue, 8 Jul 2025 08:46:26 GMT, Andrew Haley wrote: >> Erm, I don't see why? > >> Erm, I don't see why? > > To make it unique? I don't see why you don't see that we need to ensure that the string is unique. @theRealAph, are you okay with the latest variant Aleksey proposes? 
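(Editorial aside, not part of the review thread above: a minimal Java sketch of how a checker could consume the bracketed ";; spin_wait {" ... ";; }" form shown earlier in this digest. The marker strings are taken from Aleksey's example; the method and its matching logic are assumptions for illustration only, not the actual TestOnSpinWaitAArch64.java code.)

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpinWaitBlockScan {
    // Counts the lines emitted between ";; spin_wait {" and ";; }" in a
    // PrintAssembly dump; returns 0 when no such block is present.
    static int countSpinWaitLines(String disassembly) {
        Matcher m = Pattern.compile(";; spin_wait \\{(.*?);; \\}", Pattern.DOTALL)
                           .matcher(disassembly);
        if (!m.find()) {
            return 0;
        }
        String body = m.group(1).trim();
        return body.isEmpty() ? 0 : body.split("\\R").length;
    }
}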
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2192323717 From rrich at openjdk.org Tue Jul 8 12:06:48 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 8 Jul 2025 12:06:48 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v6] In-Reply-To: References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: On Mon, 7 Jul 2025 07:30:19 GMT, David Briemann wrote: >> Implement more nodes for ppc that exist on other platforms. > > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > rename instruction, add extra predicate cond for type int Looks good. Cheers, Richard. ------------- Marked as reviewed by rrich (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26115#pullrequestreview-2997336475 From dbriemann at openjdk.org Tue Jul 8 12:06:49 2025 From: dbriemann at openjdk.org (David Briemann) Date: Tue, 8 Jul 2025 12:06:49 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v6] In-Reply-To: References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: On Mon, 7 Jul 2025 07:30:19 GMT, David Briemann wrote: >> Implement more nodes for ppc that exist on other platforms. > > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > rename instruction, add extra predicate cond for type int Thank you both for your reviews! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26115#issuecomment-3048646363 From duke at openjdk.org Tue Jul 8 12:06:49 2025 From: duke at openjdk.org (duke) Date: Tue, 8 Jul 2025 12:06:49 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v6] In-Reply-To: References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: <7MQio5zK35lI1RvCGMxNQ-Zl5z9v4IhHOklgMAijfr4=.4b056bf4-74a3-43e2-8c2d-21a69d0a980c@github.com> On Mon, 7 Jul 2025 07:30:19 GMT, David Briemann wrote: >> Implement more nodes for ppc that exist on other platforms. > > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > rename instruction, add extra predicate cond for type int @dbriemann Your change (at version b65400a92dee5db285065e415067c0a33885b1b7) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26115#issuecomment-3048649313 From dbriemann at openjdk.org Tue Jul 8 12:06:50 2025 From: dbriemann at openjdk.org (David Briemann) Date: Tue, 8 Jul 2025 12:06:50 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v6] In-Reply-To: References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: On Tue, 8 Jul 2025 11:51:36 GMT, Richard Reingruber wrote: >> Because we have made the experience that the vector instructions for longs are very slow on PPC. So far all of them I tried or implemented were slower than the non-vectorized alternative. > > Ok I remember that. Why have you implemented `VMINU` and `VMAXU` for `T_LONG` then? I tried several nodes/instructions also for longs but most of them performed worse and I removed them. `VMINU` and `VMAXU` did at least perform the same or slightly better than the non-vectorized variant. 
So @TheRealMDoerr and I decided to keep them. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26115#discussion_r2192314419 From rrich at openjdk.org Tue Jul 8 12:06:50 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 8 Jul 2025 12:06:50 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v6] In-Reply-To: References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: On Tue, 8 Jul 2025 12:00:18 GMT, David Briemann wrote: >> Ok I remember that. Why have you implemented `VMINU` and `VMAXU` for `T_LONG` then? > > I tried several nodes/instructions also for longs but most of them performed worse and I removed them. > `VMINU` and `VMAXU` did at least perform the same or slightly better than the non-vectorized variant. So @TheRealMDoerr and I decided to keep them. I see. Thanks! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26115#discussion_r2192320376 From mbaesken at openjdk.org Tue Jul 8 12:34:38 2025 From: mbaesken at openjdk.org (Matthias Baesken) Date: Tue, 8 Jul 2025 12:34:38 GMT Subject: RFR: 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() In-Reply-To: References: Message-ID: <7-NO8JQHTXeQkWZ3jeGmH99czVhlluiU5xvhJSyMve4=.5d6424e0-c2af-45fe-89d9-599c1bd4a2fe@github.com> On Mon, 7 Jul 2025 23:48:03 GMT, Vladimir Kozlov wrote: > `CodeBlob::relocation_size()` is calculated as `(_mutable_data + _relocation_size - _mutable_data)`. `CodeBlob::relocation_size()` is called during AOT code loading before we allocate space for mutable data (the size is used to find how big space should be allocated). The expression at that point is `(NULL + _relocation_size - NULL)` which returns correct result. But we should just return `_relocation_size` which is recorded anyway in AOT data. > > Added missing `_mutable_data = blob_end();` initialization when we restore AOT code blob. > > Fixed embarrassing typo in asserts. > > Tested: tier1-6,8,10,xcomp,stress Marked as reviewed by mbaesken (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26175#pullrequestreview-2997443455 From mbaesken at openjdk.org Tue Jul 8 12:34:38 2025 From: mbaesken at openjdk.org (Matthias Baesken) Date: Tue, 8 Jul 2025 12:34:38 GMT Subject: RFR: 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 23:53:00 GMT, Vladimir Kozlov wrote: > please verify that is passing ubsan testing now. Hi Vladimir, I do not see the mentioned ubsan error (codeBlob.hpp:235:97: runtime error: applying non-zero offset 16 to null pointer) after your patch was added! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26175#issuecomment-3048761849 From mhaessig at openjdk.org Tue Jul 8 12:47:45 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 8 Jul 2025 12:47:45 GMT Subject: RFR: 8361040: compiler/codegen/TestRedundantLea.java#StringInflate fails with failed IR rules In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 12:42:05 GMT, Matthias Baesken wrote: >> `TestRedundantLea.java#StringInflate` failed on Alpine Linux because fewer `DecodeHeapOop_not_null`s than expected are generated even though the expected reduction is still present. This PR fixes this. >> >> Unfortunately, this fix makes the test less precise. 
I filed [JDK-8361045](https://bugs.openjdk.org/browse/JDK-8361045) to fix this when the IR-framework allows for it. >> >> Testing: >> - [x] Github Actions >> - [x] tier1, tier2 plus Oracle internal testing >> - [x] `TestRedundantLea.java` on Alpine Linux > > With your patch included, the issue is gone on our Linux Alpine test machine. Thank you for your reviews @MBaesken and @chhagedorn! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26046#issuecomment-3048800987 From mhaessig at openjdk.org Tue Jul 8 12:47:46 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 8 Jul 2025 12:47:46 GMT Subject: Integrated: 8361040: compiler/codegen/TestRedundantLea.java#StringInflate fails with failed IR rules In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 14:44:03 GMT, Manuel H?ssig wrote: > `TestRedundantLea.java#StringInflate` failed on Alpine Linux because fewer `DecodeHeapOop_not_null`s than expected are generated even though the expected reduction is still present. This PR fixes this. > > Unfortunately, this fix makes the test less precise. I filed [JDK-8361045](https://bugs.openjdk.org/browse/JDK-8361045) to fix this when the IR-framework allows for it. > > Testing: > - [x] Github Actions > - [x] tier1, tier2 plus Oracle internal testing > - [x] `TestRedundantLea.java` on Alpine Linux This pull request has now been integrated. Changeset: 2349304b Author: Manuel H?ssig URL: https://git.openjdk.org/jdk/commit/2349304bb108adb0d5d095e8212d36d99132b6bb Stats: 12 lines in 1 file changed: 2 ins; 6 del; 4 mod 8361040: compiler/codegen/TestRedundantLea.java#StringInflate fails with failed IR rules Co-authored-by: Matthias Baesken Reviewed-by: chagedorn, mbaesken ------------- PR: https://git.openjdk.org/jdk/pull/26046 From dbriemann at openjdk.org Tue Jul 8 13:01:58 2025 From: dbriemann at openjdk.org (David Briemann) Date: Tue, 8 Jul 2025 13:01:58 GMT Subject: Integrated: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI In-Reply-To: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: <9OhN2EpxwEEHhgRs44pPmntXwtUBxipTH0CFbXOAX70=.4a712174-ec7f-4d1e-9a22-38260daf8950@github.com> On Thu, 3 Jul 2025 12:30:51 GMT, David Briemann wrote: > Implement more nodes for ppc that exist on other platforms. This pull request has now been integrated. Changeset: 5c67e3d6 Author: David Briemann Committer: Martin Doerr URL: https://git.openjdk.org/jdk/commit/5c67e3d6e573e5e1fc23f16b61e51fda7b3dd307 Stats: 107 lines in 4 files changed: 106 ins; 0 del; 1 mod 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI Reviewed-by: mdoerr, rrich ------------- PR: https://git.openjdk.org/jdk/pull/26115 From mhaessig at openjdk.org Tue Jul 8 13:03:39 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 8 Jul 2025 13:03:39 GMT Subject: RFR: 8360175: C2 crash: assert(edge_from_to(prior_use, n)) failed: before block local scheduling In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 08:02:14 GMT, Christian Hagedorn wrote: >> The triggered assert is part of the schedule verification code that runs just before machine code is emitted. The debug output showed that a `leaPCompressedOopOffset` node was causing the assert, which suggested the peephole optimization introduced in #25471 as the cause. The failure proved quite difficult to reproduce. 
It failed more often on Windows and required `-XX:+UseKNLSetting` (forces code generation for Intel's Knights Landing platform), which forces `-XX:+OptoScheduling`. >> >> The root cause is a subtle bug in the rewiring of the base edge of `leaP*` nodes in the `remove_redundant_lea` peephole. When the peephole removed a `decodeHeapOop_not_null` including a spill, it did not set the base edge of the `leaP*` node to the same node as the address edge, which is the intent of the peephole, but to the parent node of the spill. That is not catastrophic in most cases, but might reference another register slot, which causes this assert. Concretely, we see the following graph >> >> MemToRegSpillCopy >> | | >> | MemToRegSpillCopy >> | | >> DefinitionSpillCopy | >> | | >> | decodeHeapOop_not_null >> | | >> leaPCompressedHeapOop >> >> gets rewired to >> >> MemToRegSpillCopy >> | | >> DefinitionSpillCopy | >> | | >> leaPCompressedHeapOop >> >> instead of >> >> MemToRegSpillCopy >> | >> DefinitionSpillCopy >> / \ >> leaPCompressedHeapOop >> >> >> This PR fixes this by always setting the base edge of the `leaP*` node to the same node as the address edge. Unfortunately, I was not able to construct a regression test because of the difficulty of reproducing the bug. >> >> # Testing >> >> - [ ] Github Actions >> - [x] tier1,tier2 plus internal testing on all Oracle supported platforms >> - [x] tier3,tier4,tier5 plus internal testing on Linux and Windows x64 >> - [ ] Runthese8H on `windows-x64-debug` (test that reliably produced the failure addressed in this PR) > > src/hotspot/cpu/x86/peephole_x86_64.cpp line 349: > >> 347: Node* dependant_lea = decode->fast_out(i); >> 348: if (dependant_lea->is_Mach() && dependant_lea->as_Mach()->ideal_Opcode() == Op_AddP) { >> 349: dependant_lea->set_req(AddPNode::Base, lea_derived_oop->in(AddPNode::Address)); > > The fix looks reasonable to me, too. No worries about the regression test, thanks for trying! A small question: Why don't we use `lea_address`? > > Another thing I've noticed while browsing the code: `ra_` and `new_root` seem to be unused and could be removed (could probably also be squeezed into this PR here instead of creating a new issue just for that). We cannot use `lea_address` because in case of a spill that also gets moved up one node to check if the lea and the decode point to the same grandparent. > Another thing I've noticed while browsing the code: ra_ and new_root seem to be unused and could be removed. These arguments come from the machinery that calls this out of the matcher. I am not too familiar with it, so my working assumption so far has been to keep the signature the same as the other peepholes, which seems logical since it is called by generated code. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26157#discussion_r2192462595 From mchevalier at openjdk.org Tue Jul 8 13:38:26 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 8 Jul 2025 13:38:26 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v3] In-Reply-To: References: Message-ID: > When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 > > This is enforced by restoring the old state, like in > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 > > That is good, but not sufficient.
First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: > > ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) > > > Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. > > Another situation is somewhat worse, when happening during parsing. It can lead to such cases: > > ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) > > The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? > > This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 > > And here there is the challenge: > - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) > - we can't really change the pointer, just the content > -... Marc Chevalier has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains nine commits: - Merge remote-tracking branch 'origin/master' into fix/too-many-ctrl-successor-after-intrinsic - Address comments - Remove useless loop - whoops Forgot to remove a bit, and restore sp - Urgh - Adapt test - Re-try - Fix test - Trying something ------------- Changes: https://git.openjdk.org/jdk/pull/25936/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=25936&range=02 Stats: 349 lines in 7 files changed: 230 ins; 50 del; 69 mod Patch: https://git.openjdk.org/jdk/pull/25936.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25936/head:pull/25936 PR: https://git.openjdk.org/jdk/pull/25936 From mchevalier at openjdk.org Tue Jul 8 13:38:26 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 8 Jul 2025 13:38:26 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v2] In-Reply-To: References: <1cFRkcs5JmgnbWEIaEoT8I9RiUtNxgKieAdkSB2Fgmc=.1d97b5c4-b6ef-43c6-b721-1e52eee19d3a@github.com> Message-ID: On Mon, 7 Jul 2025 07:28:37 GMT, Tobias Hartmann wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove useless loop > > src/hotspot/share/opto/library_call.cpp line 2376: > >> 2374: state.map = clone_map(); >> 2375: for (DUIterator_Fast imax, i = control()->fast_outs(imax); i < imax; i++) { >> 2376: Node* out = control()->fast_out(i); > > Could we have a similar issue with non-control users? For example, couldn't we also have stray memory users after bailout? We could, but it should be relatively harmless. Control is more annoying to have more than one successor. > src/hotspot/share/opto/library_call.cpp line 2393: > >> 2391: Node* out = control()->fast_out(i); >> 2392: if (out->is_CFG() && out->in(0) == control() && out != map() && !state.ctrl_succ.member(out)) { >> 2393: out->set_req(0, C->top()); > > Could `out` already be in the GVN hash ("remove node from hash table before modifying it")? I've added it, since indeed, it could. As far as I understand, I was just lucky in the situation where it happens, but there is no reason it would always hold. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2192553330 PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2192556138 From mchevalier at openjdk.org Tue Jul 8 13:44:45 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 8 Jul 2025 13:44:45 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v2] In-Reply-To: References: <1cFRkcs5JmgnbWEIaEoT8I9RiUtNxgKieAdkSB2Fgmc=.1d97b5c4-b6ef-43c6-b721-1e52eee19d3a@github.com> Message-ID: <053aI6bIhjDWZgWQSx_KRGcDzAxg2DF86WNVBco2DkA=.7ccf19bf-3fa7-471d-929c-d20f01006235@github.com> On Mon, 7 Jul 2025 07:18:11 GMT, Tobias Hartmann wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove useless loop > > src/hotspot/share/opto/library_call.cpp line 1732: > >> 1730: return false; >> 1731: } >> 1732: destruct_map_clone(old_state.map); > > I think `destruct_map_clone` could be refactored to take a `SavedState`. I've made an override of destruct_map_clone taking a SavedState (and delegating to the existing one) rather than changing the existing one for the following reasons: - `destruct_map_clone` is in `GraphKit` so doesn't know about `SavedState`. Either I'd need to bring `SavedState` to the base class (useless visibility) (or something with forward declarations...) 
or move `destruct_map_clone` to the derived class `LibraryCallKit` - `destruct_map_clone` makes sense to have next to `clone_map`. But `clone_map` is used also in `GraphKit`, so not possible to move to the derived class - The existing `destruct_map_clone` doesn't need a `SavedState` and makes sense without. Requiring more information just make it less usable, but it's fine to have a thin adapter that one can by-pass if one has a `SafePointNode` and not a whole `SavedState`. > test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java line 104: > >> 102: public static final String START = "(\\d+(\\s){2}("; >> 103: public static final String MID = ".*)+(\\s){2}===.*"; >> 104: public static final String END = ")"; > > I don't like exposing these outside the IR framework but then again I don't really have an idea on how to check the "graph should not have both nodes" invariant. Maybe we should extend the `counts` annotation to support something like `@IR(counts = {IRNode.CallStaticJava, IRNode.OpaqueNotNull, "<= 1"} [...]`? As discussed, I now have a JBS issue about this, and I factored the duplicated regex into a single final String with name and comment. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2192570629 PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2192574207 From mchevalier at openjdk.org Tue Jul 8 13:53:25 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 8 Jul 2025 13:53:25 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v4] In-Reply-To: References: Message-ID: > When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 > > This is enforced by restoring the old state, like in > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 > > That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: > > ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) > > > Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. > > Another situation is somewhat worse, when happening during parsing. It can lead to such cases: > > ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) > > The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? > > This is because there is more than one control, or one map. 
In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 > > And here there is the challenge: > - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) > - we can't really change the pointer, just the content > -... Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: Somehow intellij doesn't remove empty indented line ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25936/files - new: https://git.openjdk.org/jdk/pull/25936/files/e57cf47f..09b24ec4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25936&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25936&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/25936.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25936/head:pull/25936 PR: https://git.openjdk.org/jdk/pull/25936 From mchevalier at openjdk.org Tue Jul 8 13:56:39 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 8 Jul 2025 13:56:39 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v4] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 13:53:25 GMT, Marc Chevalier wrote: >> When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 >> >> This is enforced by restoring the old state, like in >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 >> >> That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: >> >> ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) >> >> >> Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. >> >> Another situation is somewhat worse, when happening during parsing. It can lead to such cases: >> >> ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) >> >> The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? >> >> This is because there is more than one control, or one map. 
In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 >> >> And here there is the challenge: >> - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) >> ... > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Somehow intellij doesn't remove empty indented line I've addressed the comments, ready for a second pass! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25936#issuecomment-3049067672 From mchevalier at openjdk.org Tue Jul 8 14:26:28 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 8 Jul 2025 14:26:28 GMT Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v4] In-Reply-To: References: Message-ID: > A first part toward a better support of pure functions, but this time, with guidance from @iwanowww. > > ## Pure Functions > > Pure functions (considered here) are functions that have no side effects, no effect on the control flow (no exception or such), cannot deopt etc.. It's really a function that you can execute anywhere, with whichever arguments without effect other than wasting time. Integer division is not pure as dividing by zero is throwing. But many floating point functions will just return `NaN` or `+/-infinity` in problematic cases. > > ## Scope > > We are not going all powerful for now! It's mostly about identifying some pure functions and being able to remove them if the result is unused. Some other things are not part of this PR, on purpose. Especially, this PR doesn't propose a way to move pure calls around. The reason is that pure calls are later expanded into regular calls, which require a control input. To be able to do the expansion, we just keep the control in the pure call as well. > > ## Implementation Overview > > We created here some new node kind for pure calls, inheriting leaf calls, that are expanded into regular leaf calls during final graph reshaping. The possibility to support pure call directly in AD file is left open. > > This PR also introduces `TupleNode` (largely based on an original idea/implem of @iwanowww), that just tie multiple input together and play well with `ProjNode`: the n-th projection of a `TupleNode` is the n-th input of the tuple. This is a convenient way to skip and remove nodes from the graph while delegating the difficulty of the surgery to the trusted IGVN's implementation. 
> > Thanks, > Marc Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: typo ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25760/files - new: https://git.openjdk.org/jdk/pull/25760/files/7028b561..7f18c9f6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25760&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25760&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/25760.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25760/head:pull/25760 PR: https://git.openjdk.org/jdk/pull/25760 From mchevalier at openjdk.org Tue Jul 8 14:26:29 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 8 Jul 2025 14:26:29 GMT Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v3] In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 07:53:24 GMT, Tobias Hartmann wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> mostly comments > > src/hotspot/share/opto/divnode.cpp line 1522: > >> 1520: Node* super = CallLeafPureNode::Ideal(phase, can_reshape); >> 1521: if (super != nullptr) { >> 1522: return super; > > Can't we just do `return CallLeafPureNode::Ideal(phase, can_reshape);` at the end of `ModFNode::Ideal` instead of `return nullptr`? That's what we usually do in C2, for example in `CallStaticJavaNode::Ideal` -> `CallNode::Ideal`. Feels more natural to me and would avoid the `super != nullptr` check. Also for the other `Ideal` methods that you modified. We could but it's not that direct. `ModFNode::Ideal` has 6 `return`s (without mine): - 3 are `return replace_with_con(...);` which in their turn return `nullptr` but after making changes in the graph. - 2 are `return nullptr;` - 1 is actually returning a node. And especially the final one is return replace_with_con(igvn, TypeF::make(jfloat_cast(xr))); If we change `replace_with_con` to actually return a `TupleNode` to do the job, we still have 2 places where to call the base class' `Ideal`. So I'm not sure how much better it would be to duplicate the call. It also adds a maintenance burden: if one adds another case where we don't want to make changes, one needs to add another call to `CallLeafPureNode::Ideal`. I think it's because of the structure of this function: rather than selecting cases where we want to do something and reaching the end with only the leftover cases, we select the cases we don't want to continue with, and we return early, making more cases where we should call the super method. I'll try something. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25760#discussion_r2192664653 From mchevalier at openjdk.org Tue Jul 8 14:26:29 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 8 Jul 2025 14:26:29 GMT Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v4] In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 08:30:00 GMT, Tobias Hartmann wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> typo > > src/hotspot/share/opto/divnode.cpp line 1528: > >> 1526: bool not_dead = proj_out_or_null(TypeFunc::Control) != nullptr; >> 1527: if (result_is_unused && not_dead) { >> 1528: return replace_with_con(igvn, TypeF::make(0.)); > > Can we replace all the other usages of `ModFloatingNode::replace_with_con` by `TupleNode` and get rid of that method? 
Yes, that sounds like a good idea. One still need to build the right `TupleNode`, which takes a bit of code. So in detail, I'd rather replace `replace_with_con` with a function returning the right `TupleNode` to be as concise on the call-site. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25760#discussion_r2192669986 From thartmann at openjdk.org Tue Jul 8 15:13:40 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 8 Jul 2025 15:13:40 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v2] In-Reply-To: References: <1cFRkcs5JmgnbWEIaEoT8I9RiUtNxgKieAdkSB2Fgmc=.1d97b5c4-b6ef-43c6-b721-1e52eee19d3a@github.com> Message-ID: On Tue, 8 Jul 2025 13:35:27 GMT, Marc Chevalier wrote: >> src/hotspot/share/opto/library_call.cpp line 2393: >> >>> 2391: Node* out = control()->fast_out(i); >>> 2392: if (out->is_CFG() && out->in(0) == control() && out != map() && !state.ctrl_succ.member(out)) { >>> 2393: out->set_req(0, C->top()); >> >> Could `out` already be in the GVN hash ("remove node from hash table before modifying it")? > > I've added it, since indeed, it could. As far as I understand, I was just lucky in the situation where it happens, but there is no reason it would always hold. I think you also need to re-insert it via `hash_find_insert`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2192795199 From thartmann at openjdk.org Tue Jul 8 15:16:42 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 8 Jul 2025 15:16:42 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v4] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 13:53:25 GMT, Marc Chevalier wrote: >> When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 >> >> This is enforced by restoring the old state, like in >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 >> >> That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: >> >> ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) >> >> >> Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. >> >> Another situation is somewhat worse, when happening during parsing. It can lead to such cases: >> >> ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) >> >> The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? 
>> >> This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 >> >> And here there is the challenge: >> - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) >> ... > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Somehow intellij doesn't remove empty indented line src/hotspot/share/opto/library_call.cpp line 2507: > 2505: > 2506: if (alias_type->adr_type() == TypeInstPtr::KLASS || > 2507: alias_type->adr_type() == TypeAryPtr::RANGE) { The indentation is off here. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2192803477 From thartmann at openjdk.org Tue Jul 8 15:16:44 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 8 Jul 2025 15:16:44 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v2] In-Reply-To: <053aI6bIhjDWZgWQSx_KRGcDzAxg2DF86WNVBco2DkA=.7ccf19bf-3fa7-471d-929c-d20f01006235@github.com> References: <1cFRkcs5JmgnbWEIaEoT8I9RiUtNxgKieAdkSB2Fgmc=.1d97b5c4-b6ef-43c6-b721-1e52eee19d3a@github.com> <053aI6bIhjDWZgWQSx_KRGcDzAxg2DF86WNVBco2DkA=.7ccf19bf-3fa7-471d-929c-d20f01006235@github.com> Message-ID: <5ezM7iZKoLDPycAuRWgVXAIOQ2d9ukvK28kxE7dklDE=.1740e1e7-5482-4239-99bf-75edecd41e6a@github.com> On Tue, 8 Jul 2025 13:42:14 GMT, Marc Chevalier wrote: >> test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java line 104: >> >>> 102: public static final String START = "(\\d+(\\s){2}("; >>> 103: public static final String MID = ".*)+(\\s){2}===.*"; >>> 104: public static final String END = ")"; >> >> I don't like exposing these outside the IR framework but then again I don't really have an idea on how to check the "graph should not have both nodes" invariant. Maybe we should extend the `counts` annotation to support something like `@IR(counts = {IRNode.CallStaticJava, IRNode.OpaqueNotNull, "<= 1"} [...]`? > > As discussed, I now have a JBS issue about this, and I factored the duplicated regex into a single final String with name and comment. Thanks, that looks good! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2192804630 From thartmann at openjdk.org Tue Jul 8 15:19:43 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 8 Jul 2025 15:19:43 GMT Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v3] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 14:19:23 GMT, Marc Chevalier wrote: >> src/hotspot/share/opto/divnode.cpp line 1522: >> >>> 1520: Node* super = CallLeafPureNode::Ideal(phase, can_reshape); >>> 1521: if (super != nullptr) { >>> 1522: return super; >> >> Can't we just do `return CallLeafPureNode::Ideal(phase, can_reshape);` at the end of `ModFNode::Ideal` instead of `return nullptr`? That's what we usually do in C2, for example in `CallStaticJavaNode::Ideal` -> `CallNode::Ideal`. Feels more natural to me and would avoid the `super != nullptr` check. Also for the other `Ideal` methods that you modified. 
> > We could but it's not that direct. `ModFNode::Ideal` has 6 `return`s (without mine): > - 3 are `return replace_with_con(...);` which in their turn return `nullptr` but after making changes in the graph. > - 2 are `return nullptr;` > - 1 is actually returning a node. > And especially the final one is > > return replace_with_con(igvn, TypeF::make(jfloat_cast(xr))); > > If we change `replace_with_con` to actually return a `TupleNode` to do the job, we still have 2 places where to call the base class' `Ideal`. So I'm not sure how much better it would be to duplicate the call. It also adds a maintenance burden: if one adds another case where we don't want to make changes, one needs to add another call to `CallLeafPureNode::Ideal`. I think it's because of the structure of this function: rather than selecting cases where we want to do something and reaching the end with only the leftover cases, we select the cases we don't want to continue with, and we return early, making more cases where we should call the super method. > > I'll try something. Ah, makes sense. Feel free to leave as-is then. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25760#discussion_r2192809070 From thartmann at openjdk.org Tue Jul 8 15:19:44 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 8 Jul 2025 15:19:44 GMT Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v4] In-Reply-To: References: Message-ID: <7CvQqj41V2ysWMJACPif3jOCSF9xFypACLH9DiKYnc0=.aad8e7fe-56ba-44a3-bc31-187618af35d6@github.com> On Tue, 8 Jul 2025 14:21:20 GMT, Marc Chevalier wrote: >> src/hotspot/share/opto/divnode.cpp line 1528: >> >>> 1526: bool not_dead = proj_out_or_null(TypeFunc::Control) != nullptr; >>> 1527: if (result_is_unused && not_dead) { >>> 1528: return replace_with_con(igvn, TypeF::make(0.)); >> >> Can we replace all the other usages of `ModFloatingNode::replace_with_con` by `TupleNode` and get rid of that method? > > Yes, that sounds like a good idea. One still need to build the right `TupleNode`, which takes a bit of code. So in detail, I'd rather replace `replace_with_con` with a function returning the right `TupleNode` to be as concise on the call-site. Yes, that sounds good. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25760#discussion_r2192809839 From mchevalier at openjdk.org Tue Jul 8 15:28:39 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 8 Jul 2025 15:28:39 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v2] In-Reply-To: References: <1cFRkcs5JmgnbWEIaEoT8I9RiUtNxgKieAdkSB2Fgmc=.1d97b5c4-b6ef-43c6-b721-1e52eee19d3a@github.com> Message-ID: On Tue, 8 Jul 2025 15:11:09 GMT, Tobias Hartmann wrote: >> I've added it, since indeed, it could. As far as I understand, I was just lucky in the situation where it happens, but there is no reason it would always hold. > > I think you also need to re-insert it via `hash_find_insert`. // Some Ideal and other transforms delete --> modify --> insert values yes, indeed. 
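For readers following the delete -> modify -> re-insert point above: the same invariant can be shown with a small self-contained C++ toy (an unordered_map keyed by a node's opcode and inputs, not HotSpot's PhaseIterGVN). Mutating a node while it is still filed under its old structural hash makes later lookups and erasure unreliable, so the entry is removed first and re-inserted afterwards.

#include <cstdio>
#include <functional>
#include <unordered_map>
#include <vector>

// Toy "node": identified by an opcode and its inputs.
struct Node {
    int op;
    std::vector<Node*> in;
};

// Structural hash/equality over (op, inputs), like a value-numbering table.
struct NodeHash {
    size_t operator()(const Node* n) const {
        size_t h = std::hash<int>()(n->op);
        for (Node* i : n->in) h = h * 31 + std::hash<const void*>()(i);
        return h;
    }
};
struct NodeEq {
    bool operator()(const Node* a, const Node* b) const {
        return a->op == b->op && a->in == b->in;
    }
};

int main() {
    std::unordered_map<const Node*, int, NodeHash, NodeEq> vn_table;
    Node x{1, {}}, y{2, {}};
    Node add{10, {&x, &y}};
    vn_table[&add] = 42;      // the node is filed under its current (op, inputs)

    // Wrong: mutating 'add' here would leave it filed under a stale hash.
    // Right: delete -> modify -> re-insert, as in the review discussion.
    vn_table.erase(&add);
    add.in[1] = &x;
    vn_table[&add] = 42;

    std::printf("entries: %zu\n", vn_table.size());
    return 0;
}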
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2192831829 From eastigeevich at openjdk.org Tue Jul 8 15:40:53 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 8 Jul 2025 15:40:53 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v34] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 22:24:07 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: > > Update justification for skipping CallRelocation src/hotspot/share/code/nmethod.cpp line 1549: > 1547: #ifdef USE_TRAMPOLINE_STUB_FIX_OWNER > 1548: // Direct calls may no longer be in range and the use of a trampoline may now be required. > 1549: // Instead allow the trapoline relocations to update their owner and perform the necessary checks. > ... trapoline trampoline ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2192857245 From shade at openjdk.org Tue Jul 8 17:04:45 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 8 Jul 2025 17:04:45 GMT Subject: RFR: 8357473: Compilation spike leaves many CompileTasks in free list [v6] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 09:08:27 GMT, Aleksey Shipilev wrote: > Also realized the Atomic can be relaxed, since nothing rides on its memory consistency effects. I am retesting to make sure. Linux AArch64 server fastdebug `all` passes without new failures. Ready for re-review / more testing. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25409#issuecomment-3049680821 From dnsimon at openjdk.org Tue Jul 8 17:13:41 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 8 Jul 2025 17:13:41 GMT Subject: RFR: 8357689: Refactor JVMCI to enable replay compilation in Graal [v2] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 08:41:59 GMT, Andrej Pečimúth wrote: >> This PR introduces a few minor JVMCI refactorings to make replay compilation possible in the Graal compiler. > > Andrej Pečimúth has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > JVMCI refactorings to enable replay compilation in Graal. Looks good. ------------- Marked as reviewed by dnsimon (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/25433#pullrequestreview-2998482956 From shade at openjdk.org Tue Jul 8 17:17:41 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 8 Jul 2025 17:17:41 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 13:33:23 GMT, Aleksey Shipilev wrote: >> We routinely CTW 3rd party JARs to make sure our compilers work. By the nature of the JARs, they have dependencies on other JARs, and CTW runner frequently warns out with NCDFE. It does so very crudely, missing opportunities to compile the methods that _do not_ trigger NCDFEs. CTW should be made more tolerant to this. I think the normal "modules" CTW runs into the similar problem, but on a lesser scale, as we do not have a very hairy dependency graph within JDK. >> >> The CTW logs are also fairly noisy with full exception traces when NCDFE is semi-expected. This PR does _not_ print exception stack traces in these cases, only "NOTE"-s about it. This makes the log fairly clean and more understandable. >> >> Motivational scope improvement compiling a sample 3rd party JAR (cassandra-2.1.4.0.jar): >> >> >> Before: Done (2487 classes, 9866 methods, 24584 ms) >> After: Done (2487 classes, 10074 methods, 24150 ms) ; +2% more methods >> >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Just use printf directly Any other reviews needed/wanted? In particular, maybe @TobiHartmann wants to test it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26090#issuecomment-3049715940 From duke at openjdk.org Tue Jul 8 17:37:57 2025 From: duke at openjdk.org (Andrej Pecimuth) Date: Tue, 8 Jul 2025 17:37:57 GMT Subject: RFR: 8357689: Refactor JVMCI to enable replay compilation in Graal [v3] In-Reply-To: References: Message-ID: > This PR introduces a few minor JVMCI refactorings to make replay compilation possible in the Graal compiler. Andrej Pecimuth has updated the pull request incrementally with one additional commit since the last revision: Remove unnecessary public modifier. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25433/files - new: https://git.openjdk.org/jdk/pull/25433/files/c6c3bb62..1b845fa9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25433&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25433&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/25433.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25433/head:pull/25433 PR: https://git.openjdk.org/jdk/pull/25433 From dnsimon at openjdk.org Tue Jul 8 17:50:42 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 8 Jul 2025 17:50:42 GMT Subject: RFR: 8357689: Refactor JVMCI to enable replay compilation in Graal [v3] In-Reply-To: References: Message-ID: <0ZpzR9oc9eKbKHiDI6XTpi5tN28FhHq4ywHrF4nxSzI=.314d1794-a540-45db-8ab1-26eeb65b77f5@github.com> On Tue, 8 Jul 2025 17:37:57 GMT, Andrej Pecimuth wrote: >> This PR introduces a few minor JVMCI refactorings to make replay compilation possible in the Graal compiler. > > Andrej Pecimuth has updated the pull request incrementally with one additional commit since the last revision: > > Remove unnecessary public modifier. Still good. ------------- Marked as reviewed by dnsimon (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/25433#pullrequestreview-2998590011 From tschatzl at openjdk.org Tue Jul 8 19:23:47 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 8 Jul 2025 19:23:47 GMT Subject: RFR: 8350621: Code cache stops scheduling GC In-Reply-To: References: Message-ID: <6XGOkpJ4gQxTjKwvm4VRqo-oqdNdAO4_yMdf9t4U7Tg=.e47311a8-9db6-402e-849e-10e2b1664ad9@github.com> On Sun, 16 Feb 2025 18:39:29 GMT, Alexandre Jacob wrote: > The purpose of this PR is to fix a bug where we can end up in a situation where the GC is not scheduled anymore by `CodeCache`. > > This situation is possible because the `_unloading_threshold_gc_requested` flag is set to `true` when triggering the GC and we expect the GC to call `CodeCache::on_gc_marking_cycle_finish` which in turn will call `CodeCache::update_cold_gc_count`, which will reset the flag `_unloading_threshold_gc_requested` allowing further GC scheduling. > > Unfortunately this can't work properly under certain circumstances. > For example, if using G1GC, calling `G1CollectedHeap::collect` does no give the guarantee that the GC will actually run as it can be already running (see [here](https://github.com/openjdk/jdk/blob/7d11418c820b46926a25907766d16083a4b349de/src/hotspot/share/gc/g1/g1CollectedHeap.cpp#L1763)). > > I have observed this behavior on JVM in version 21 that were migrated recently from java 17. > Those JVMs have some pressure on code cache and quite a large heap in comparison to allocation rate, which means that objects are mostly GC'd by young collections and full GC take a long time to happen. > > I have been able to reproduce this issue with ParallelGC and G1GC, and I imagine that other GC can be impacted as well. > > In order to reproduce this issue, I found a very simple and convenient way: > > > public class CodeCacheMain { > public static void main(String[] args) throws InterruptedException { > while (true) { > Thread.sleep(100); > } > } > } > > > Run this simple app with the following JVM flags: > > > -Xlog:gc*=info,codecache=info -Xmx512m -XX:ReservedCodeCacheSize=2496k -XX:StartAggressiveSweepingAt=15 > > > - 512m for the heap just to clarify the intent that we don't want to be bothered by a full GC > - low `ReservedCodeCacheSize` to put pressure on code cache quickly > - `StartAggressiveSweepingAt` can be set to 20 or 15 for faster bug reproduction > > Itself, the program will hardly get pressure on code cache, but the good news is that it is sufficient to attach a jconsole on it which will: > - allows us to monitor code cache > - indirectly generate activity on the code cache, just what we need to reproduce the bug > > Some logs related to code cache will show up at some point with GC activity: > > > [648.733s][info][codecache ] Triggering aggressive GC due to having only 14.970% free memory > > > And then it will stop and we'll end up with the following message: > > > [672.714s][info][codecache ] Code cache is full - disabling compilation > > > L... We discussed this question internally bit, and the consensus has been to use a new PR to avoid confusion due to two different approaches being discussed in the same thread. Would you mind closing this one out and I'll create a new PR? 
Thanks, Thomas ------------- PR Comment: https://git.openjdk.org/jdk/pull/23656#issuecomment-3049289764 From duke at openjdk.org Tue Jul 8 19:23:48 2025 From: duke at openjdk.org (Alexandre Jacob) Date: Tue, 8 Jul 2025 19:23:48 GMT Subject: RFR: 8350621: Code cache stops scheduling GC In-Reply-To: References: Message-ID: On Sun, 16 Feb 2025 18:39:29 GMT, Alexandre Jacob wrote: > The purpose of this PR is to fix a bug where we can end up in a situation where the GC is not scheduled anymore by `CodeCache`. > > This situation is possible because the `_unloading_threshold_gc_requested` flag is set to `true` when triggering the GC and we expect the GC to call `CodeCache::on_gc_marking_cycle_finish` which in turn will call `CodeCache::update_cold_gc_count`, which will reset the flag `_unloading_threshold_gc_requested` allowing further GC scheduling. > > Unfortunately this can't work properly under certain circumstances. > For example, if using G1GC, calling `G1CollectedHeap::collect` does no give the guarantee that the GC will actually run as it can be already running (see [here](https://github.com/openjdk/jdk/blob/7d11418c820b46926a25907766d16083a4b349de/src/hotspot/share/gc/g1/g1CollectedHeap.cpp#L1763)). > > I have observed this behavior on JVM in version 21 that were migrated recently from java 17. > Those JVMs have some pressure on code cache and quite a large heap in comparison to allocation rate, which means that objects are mostly GC'd by young collections and full GC take a long time to happen. > > I have been able to reproduce this issue with ParallelGC and G1GC, and I imagine that other GC can be impacted as well. > > In order to reproduce this issue, I found a very simple and convenient way: > > > public class CodeCacheMain { > public static void main(String[] args) throws InterruptedException { > while (true) { > Thread.sleep(100); > } > } > } > > > Run this simple app with the following JVM flags: > > > -Xlog:gc*=info,codecache=info -Xmx512m -XX:ReservedCodeCacheSize=2496k -XX:StartAggressiveSweepingAt=15 > > > - 512m for the heap just to clarify the intent that we don't want to be bothered by a full GC > - low `ReservedCodeCacheSize` to put pressure on code cache quickly > - `StartAggressiveSweepingAt` can be set to 20 or 15 for faster bug reproduction > > Itself, the program will hardly get pressure on code cache, but the good news is that it is sufficient to attach a jconsole on it which will: > - allows us to monitor code cache > - indirectly generate activity on the code cache, just what we need to reproduce the bug > > Some logs related to code cache will show up at some point with GC activity: > > > [648.733s][info][codecache ] Triggering aggressive GC due to having only 14.970% free memory > > > And then it will stop and we'll end up with the following message: > > > [672.714s][info][codecache ] Code cache is full - disabling compilation > > > L... Sure I can close this PR, this makes things simpler for everybody I guess! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23656#issuecomment-3050053298 From duke at openjdk.org Tue Jul 8 19:23:48 2025 From: duke at openjdk.org (Alexandre Jacob) Date: Tue, 8 Jul 2025 19:23:48 GMT Subject: Withdrawn: 8350621: Code cache stops scheduling GC In-Reply-To: References: Message-ID: On Sun, 16 Feb 2025 18:39:29 GMT, Alexandre Jacob wrote: > The purpose of this PR is to fix a bug where we can end up in a situation where the GC is not scheduled anymore by `CodeCache`. 
> > This situation is possible because the `_unloading_threshold_gc_requested` flag is set to `true` when triggering the GC and we expect the GC to call `CodeCache::on_gc_marking_cycle_finish` which in turn will call `CodeCache::update_cold_gc_count`, which will reset the flag `_unloading_threshold_gc_requested` allowing further GC scheduling. > > Unfortunately this can't work properly under certain circumstances. > For example, if using G1GC, calling `G1CollectedHeap::collect` does no give the guarantee that the GC will actually run as it can be already running (see [here](https://github.com/openjdk/jdk/blob/7d11418c820b46926a25907766d16083a4b349de/src/hotspot/share/gc/g1/g1CollectedHeap.cpp#L1763)). > > I have observed this behavior on JVM in version 21 that were migrated recently from java 17. > Those JVMs have some pressure on code cache and quite a large heap in comparison to allocation rate, which means that objects are mostly GC'd by young collections and full GC take a long time to happen. > > I have been able to reproduce this issue with ParallelGC and G1GC, and I imagine that other GC can be impacted as well. > > In order to reproduce this issue, I found a very simple and convenient way: > > > public class CodeCacheMain { > public static void main(String[] args) throws InterruptedException { > while (true) { > Thread.sleep(100); > } > } > } > > > Run this simple app with the following JVM flags: > > > -Xlog:gc*=info,codecache=info -Xmx512m -XX:ReservedCodeCacheSize=2496k -XX:StartAggressiveSweepingAt=15 > > > - 512m for the heap just to clarify the intent that we don't want to be bothered by a full GC > - low `ReservedCodeCacheSize` to put pressure on code cache quickly > - `StartAggressiveSweepingAt` can be set to 20 or 15 for faster bug reproduction > > Itself, the program will hardly get pressure on code cache, but the good news is that it is sufficient to attach a jconsole on it which will: > - allows us to monitor code cache > - indirectly generate activity on the code cache, just what we need to reproduce the bug > > Some logs related to code cache will show up at some point with GC activity: > > > [648.733s][info][codecache ] Triggering aggressive GC due to having only 14.970% free memory > > > And then it will stop and we'll end up with the following message: > > > [672.714s][info][codecache ] Code cache is full - disabling compilation > > > L... This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/23656 From kvn at openjdk.org Tue Jul 8 19:37:46 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 8 Jul 2025 19:37:46 GMT Subject: RFR: 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 12:31:04 GMT, Matthias Baesken wrote: >> @mbaesken, please verify that is passing ubsan testing now. > >> please verify that is passing ubsan testing now. > > Hi Vladimir, I do not see the mentioned ubsan error (codeBlob.hpp:235:97: runtime error: applying non-zero offset 16 to null pointer) after your patch was added! Thank you, @MBaesken, for review and testing. Thank you, @adinn, for review. 
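For context on the ubsan message quoted above: adding a non-zero offset to a null pointer is undefined behaviour in C++ even if the offset is subtracted again, which is why computing a size via pointer arithmetic on `_mutable_data` before it is allocated trips the sanitizer. The sketch below is illustrative only (a made-up struct, not the real CodeBlob), showing the problematic pattern and the fix of returning the recorded size directly.

#include <cstddef>
#include <cstdio>

struct Blob {
  char* _mutable_data = nullptr;   // not yet allocated at this point
  int   _relocation_size = 16;

  // UB: with _mutable_data == nullptr, (_mutable_data + _relocation_size)
  // applies a non-zero offset to a null pointer, which is what ubsan flags,
  // even though the arithmetic later cancels out.
  std::ptrdiff_t relocation_size_via_pointers() const {
    return (_mutable_data + _relocation_size) - _mutable_data;
  }

  // Fix: just return the recorded size; no pointer arithmetic needed.
  int relocation_size() const { return _relocation_size; }
};

int main() {
  Blob b;
  std::printf("%d\n", b.relocation_size());
  return 0;
}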
------------- PR Comment: https://git.openjdk.org/jdk/pull/26175#issuecomment-3050088171 From kvn at openjdk.org Tue Jul 8 19:37:47 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 8 Jul 2025 19:37:47 GMT Subject: Integrated: 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 23:48:03 GMT, Vladimir Kozlov wrote: > `CodeBlob::relocation_size()` is calculated as `(_mutable_data + _relocation_size - _mutable_data)`. `CodeBlob::relocation_size()` is called during AOT code loading before we allocate space for mutable data (the size is used to find how big space should be allocated). The expression at that point is `(NULL + _relocation_size - NULL)` which returns the correct result. But we should just return `_relocation_size` which is recorded anyway in AOT data. > > Added missing `_mutable_data = blob_end();` initialization when we restore AOT code blob. > > Fixed embarrassing typo in asserts. > > Tested: tier1-6,8,10,xcomp,stress This pull request has now been integrated. Changeset: dedcce04 Author: Vladimir Kozlov URL: https://git.openjdk.org/jdk/commit/dedcce045013b3ff84f5ef8857e1a83f0c09f9ad Stats: 7 lines in 2 files changed: 4 ins; 0 del; 3 mod 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() Reviewed-by: adinn, mbaesken ------------- PR: https://git.openjdk.org/jdk/pull/26175 From chagedorn at openjdk.org Tue Jul 8 19:52:42 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 8 Jul 2025 19:52:42 GMT Subject: RFR: 8360175: C2 crash: assert(edge_from_to(prior_use, n)) failed: before block local scheduling In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 13:00:45 GMT, Manuel Hässig wrote: >> src/hotspot/cpu/x86/peephole_x86_64.cpp line 349: >> >>> 347: Node* dependant_lea = decode->fast_out(i); >>> 348: if (dependant_lea->is_Mach() && dependant_lea->as_Mach()->ideal_Opcode() == Op_AddP) { >>> 349: dependant_lea->set_req(AddPNode::Base, lea_derived_oop->in(AddPNode::Address)); >> >> The fix looks reasonable to me, too. No worries about the regression test, thanks for trying! A small question: Why don't we use `lea_address`? >> >> Another thing I've noticed while browsing the code: `ra_` and `new_root` seem to be unused and could be removed (could probably also be squeezed into this PR here instead of creating a new issue just for that). > > We cannot use `lea_address` because in case of a spill that also gets moved up one node to check if the lea and the decode point to the same grandparent. > >> Another thing I've noticed while browsing the code: ra_ and new_root seem to be unused and could be removed. > > These arguments come from the machinery that calls this out of the matcher. I am not too familiar with it, so my working assumption so far has been to keep the signature the same as the other peepholes, which seems logical since it is called by generated code. Thanks for the explanation, makes sense. > These arguments come from the machinery that calls this out of the matcher. I am not too familiar with it, so my working assumption so far has been to keep the signature the same as the other peepholes, which seems logical since it is called by generated code. Could be the case - I'm not too familiar with it, either. Anyway, the fix looks good!
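As a side note on the unused `ra_`/`new_root` arguments: keeping them is the usual cost of dispatching through a common, generated call path where every handler must share one signature. A tiny standalone C++ illustration of that constraint (nothing below is taken from the matcher or peephole code):

#include <cstdio>

// All handlers share one signature so they can sit in the same dispatch
// table, even if a particular handler ignores some arguments.
using Handler = int (*)(int block, int index);

static int uses_both(int block, int index) { return block + index; }
static int uses_none(int, int)             { return 0; }

static Handler table[] = { uses_both, uses_none };

int main() {
  for (Handler h : table) std::printf("%d\n", h(3, 4));
  return 0;
}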
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26157#discussion_r2193326173 From duke at openjdk.org Tue Jul 8 20:03:17 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Tue, 8 Jul 2025 20:03:17 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v35] In-Reply-To: References: Message-ID: > This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). > > When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. > > This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality > > Additional Testing: > - [ ] Linux x64 fastdebug all > - [ ] Linux aarch64 fastdebug all > - [ ] ... Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 95 commits: - Typo - Merge branch 'master' into JDK-8316694-Final - Update justification for skipping CallRelocation - Enclose ImmutableDataReferencesCounterSize in parentheses - Let trampolines fix their owners - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final - Update how call sites are fixed - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final - Fix pointer printing - Use set_destination_mt_safe - ... and 85 more: https://git.openjdk.org/jdk/compare/117f0b40...66d73c16 ------------- Changes: https://git.openjdk.org/jdk/pull/23573/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=34 Stats: 1617 lines in 25 files changed: 1579 ins; 2 del; 36 mod Patch: https://git.openjdk.org/jdk/pull/23573.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23573/head:pull/23573 PR: https://git.openjdk.org/jdk/pull/23573 From duke at openjdk.org Tue Jul 8 20:11:51 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Tue, 8 Jul 2025 20:11:51 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 08:41:37 GMT, Andrew Haley wrote: >> Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 90 commits: >> >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Update how call sites are fixed >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Fix pointer printing >> - Use set_destination_mt_safe >> - Print address as pointer >> - Use new _metadata_size instead of _jvmci_data_size >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Only check branch distance for aarch64 and riscv >> - Move far branch fix to fix_relocation_after_move >> - ... and 80 more: https://git.openjdk.org/jdk/compare/f799cf18...70e4164e > > src/hotspot/cpu/aarch64/relocInfo_aarch64.cpp line 117: > >> 115: } >> 116: >> 117: void poll_Relocation::fix_relocation_after_move(const CodeBuffer* src, CodeBuffer* dest, bool is_nmethod_relocation) { > > Suggestion: > > void poll_Relocation::fix_relocation_after_move(const CodeBuffer* src, CodeBuffer* dest, bool) { This change was removed. Same reason as above. 
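For readers new to the PR summarized above, the lifecycle it describes (deep-copy the blob, point callers at the copy, leave the old blob to be reclaimed later) can be sketched with a toy C++ model; all names below are made up and nothing reflects the real nmethod layout or GC interaction.

#include <atomic>
#include <cstdio>
#include <cstring>
#include <vector>

// Toy model: relocate by deep copy, publish the copy, retire the old blob.
struct Code {
  char body[16];
};

std::atomic<Code*> entry_point{nullptr};
std::vector<Code*> retired;              // reclaimed later, not freed in place

Code* relocate(Code* old_code) {
  Code* copy = new Code(*old_code);             // deep copy to the new location
  Code* prev = entry_point.exchange(copy);      // callers now resolve to the copy
  retired.push_back(prev);                      // old blob stays valid until cleanup
  return copy;
}

int main() {
  Code* original = new Code();
  std::strcpy(original->body, "stub");
  entry_point.store(original);
  relocate(original);
  std::printf("retired blobs: %zu\n", retired.size());
  for (Code* c : retired) delete c;
  delete entry_point.load();
  return 0;
}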
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2193354958 From kvn at openjdk.org Tue Jul 8 20:25:45 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 8 Jul 2025 20:25:45 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v4] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 13:53:25 GMT, Marc Chevalier wrote: >> When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 >> >> This is enforced by restoring the old state, like in >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 >> >> That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: >> >> ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) >> >> >> Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. >> >> Another situation is somewhat worse, when happening during parsing. It can lead to such cases: >> >> ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) >> >> The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? >> >> This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 >> >> And here there is the challenge: >> - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) >> ... > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Somehow intellij doesn't remove empty indented line src/hotspot/share/opto/library_call.hpp line 150: > 148: void restore_state(const SavedState&); > 149: void destruct_map_clone(const SavedState& sfp); > 150: Can this be a class instead of struct? These methods could be members. Initialization can be done through constructor. The destructor can do restoration by default unless `destruct_map_clone()` was called before. I don't like name `destruct_map_clone()` for this. How about `SavedState::remove()` or something. 
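A rough standalone sketch of the class-with-destructor idea suggested above: the constructor snapshots the state, the destructor restores it by default, and a caller that wants to keep the mutated state dismisses the guard. The `Kit` type and the method names are invented for illustration and are not taken from LibraryCallKit.

#include <cstdio>

struct Kit {
  int control = 0;
};

class SavedState {
  Kit& _kit;
  int  _saved_control;
  bool _armed = true;
public:
  explicit SavedState(Kit& kit) : _kit(kit), _saved_control(kit.control) {}
  void discard() { _armed = false; }          // caller decides to keep the new state
  ~SavedState() { if (_armed) _kit.control = _saved_control; }
};

int main() {
  Kit kit;
  {
    SavedState guard(kit);
    kit.control = 42;   // the intrinsic body mutates the kit...
  }                     // ...bailout path: the destructor restores control to 0
  std::printf("control = %d\n", kit.control);
  return 0;
}

The appeal of the RAII form is that the bailout path cannot forget to restore: leaving the scope without dismissing the guard puts the state back automatically.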
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2193378458 From kvn at openjdk.org Tue Jul 8 20:31:40 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 8 Jul 2025 20:31:40 GMT Subject: RFR: 8357473: Compilation spike leaves many CompileTasks in free list [v6] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 08:59:53 GMT, Aleksey Shipilev wrote: >> See bug for more discussion. >> >> This PR implements the "all the way" solution by removing the free list completely. It complements https://github.com/openjdk/jdk/pull/25364, and can go either first, or second. We will remerge the other one once either integrates. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `compiler` >> - [x] Linux AArch64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 11 commits: > > - Task count atomic can be relaxed > - Minor touchup in ~CompileTask > - Purge CompileTaskAlloc_lock completely > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Also free the lock! > - Comments and indenting > - ... and 1 more: https://git.openjdk.org/jdk/compare/7b255b8a...684f83b7 re-approved ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25409#pullrequestreview-2999020514 From sparasa at openjdk.org Tue Jul 8 22:44:55 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Tue, 8 Jul 2025 22:44:55 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: References: Message-ID: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> > The goal of this PR is to enhance the existing x86 assembly stubs using PUSH and POP instructions with paired PUSHP/POPP instructions which are part of Intel APX technology. > > In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0?R31). Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs. > > Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures. 
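A hypothetical sketch of how such paired instructions are typically exposed to call sites: one helper that emits the paired encoding when the CPU reports APX support and falls back to legacy PUSH/POP otherwise, so call sites stay unchanged. This is not the HotSpot MacroAssembler; the class, names, and output below are illustrative only.

#include <cstdio>

struct ToyAssembler {
  bool supports_apx;

  void paired_push(const char* reg) {
    std::printf("%s %s\n", supports_apx ? "pushp" : "push", reg);
  }
  void paired_pop(const char* reg) {
    std::printf("%s %s\n", supports_apx ? "popp" : "pop", reg);
  }
};

int main() {
  ToyAssembler masm{true};
  masm.paired_push("rbx");   // paired pushes...
  masm.paired_push("rbp");
  masm.paired_pop("rbp");    // ...popped in reverse order, keeping the pairing
  masm.paired_pop("rbx");
  return 0;
}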
Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: rename to paired_push and paired_pop ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25889/files - new: https://git.openjdk.org/jdk/pull/25889/files/8ea6d6ce..24e6da2c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25889&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25889&range=00-01 Stats: 370 lines in 23 files changed: 0 ins; 1 del; 369 mod Patch: https://git.openjdk.org/jdk/pull/25889.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25889/head:pull/25889 PR: https://git.openjdk.org/jdk/pull/25889 From sparasa at openjdk.org Tue Jul 8 22:44:55 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Tue, 8 Jul 2025 22:44:55 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: References: Message-ID: <1qCd44HFg1tS1jUP-FIuzjUX8ePGDR2ViSFmhHsc_yE=.82e01450-c341-4cfa-909e-91f7be1d70a3@github.com> On Thu, 3 Jul 2025 04:53:15 GMT, David Holmes wrote: >> Just a drive-by comment as this isn't code I normally have much to do with but to me it would look a lot cleaner to define `push_paired`/`pop_paired` (maybe abbreviating directly to `pushp`/`popp`?) rather than passing the boolean. > >> @dholmes-ora would you mind sharing your opinion? We seem to be making things more complicated, but hopefully in a good way? > > Seems very complicated to me. Really this is for compiler folk to discuss. And as noted above this "tracker" class only helps where the push/pop are paired in the same scope. Personally I think a "pushp" that is defined to be a "push-paired" when available, else a regular "push", would suffice in terms of API design. But again this is for compiler folk to determine. > Like @dholmes-ora, I also prefer a new function (in MacroAssembler) instead of flags. Though I like the names `paired_push`/`paired_pop`.. > Please see the updated code with `paired_push`/`paired_pop`. Thanks for the suggestions! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25889#issuecomment-3050476284 From duke at openjdk.org Wed Jul 9 00:32:48 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Wed, 9 Jul 2025 00:32:48 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 22:21:26 GMT, Chad Rakoczy wrote: >> src/hotspot/cpu/aarch64/relocInfo_aarch64.cpp line 84: >> >>> 82: if (NativeCall::is_call_at(addr())) { >>> 83: NativeCall* call = nativeCall_at(addr()); >>> 84: if (be_safe) { >> >> Why is this change necessary? > > The original motivation was to address far call sites. After relocation, some calls that previously didn't require a trampoline might now need one, hence the introduction of the `be_safe` parameter. However, upon further review, this change is unnecessary. The method `trampoline_stub_Relocation::fix_relocation_after_move` already updates the owner and contains the logic to determine whether a direct call can be performed. Therefore, we can skip invoking `CallRelocation::fix_relocation_after_move` for calls that use trampolines, as all required adjustments will be handled correctly by the trampoline relocations. ([Reference](https://github.com/chadrako/jdk/blob/0f4ff9646d1f7f43214c5ccd4bbe572fffd08d16/src/hotspot/share/code/nmethod.cpp#L1547-L1556)) @dean-long What are your thoughts on this solution? 
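The range concern behind "may now need a trampoline" boils down to a reachability check: an AArch64 direct branch immediate reaches roughly +/-128 MB from the call site, so a callee that was reachable before the blob moved may not be afterwards. A standalone C++ sketch of that check (the addresses are arbitrary and nothing here mirrors the relocation code):

#include <cstdint>
#include <cstdio>

constexpr int64_t kBranchRange = 128 * 1024 * 1024;   // +/-128 MB for B/BL

bool reachable_directly(uint64_t call_site, uint64_t target) {
  int64_t distance = (int64_t)(target - call_site);
  return distance < kBranchRange && distance >= -kBranchRange;
}

int main() {
  uint64_t callee = 0x10000000;
  std::printf("before move: %s\n", reachable_directly(0x11000000, callee) ? "direct" : "trampoline");
  std::printf("after move:  %s\n", reachable_directly(0x21000000, callee) ? "direct" : "trampoline");
  return 0;
}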
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2193667769 From fjiang at openjdk.org Wed Jul 9 01:22:53 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Wed, 9 Jul 2025 01:22:53 GMT Subject: RFR: 8361504: RISC-V: Make C1 clone intrinsic platform guard more specific In-Reply-To: References: <16DUz5Iytmw9i7wAxTx_oU4eeJBCsOI_15qzFP6M4GU=.8a5305a3-2b8d-4b9d-957a-430600bff4b4@github.com> Message-ID: On Tue, 8 Jul 2025 00:39:39 GMT, Fei Yang wrote: >> Hi all. >> Please review this trivial patch, which changes the C1 primitive array clone intrinsic RISCV platform guard into RISCV64. As we only support RISCV64 for now. > > Thanks. @RealFYang @zifeihan -- Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26161#issuecomment-3050732452 From fjiang at openjdk.org Wed Jul 9 01:22:53 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Wed, 9 Jul 2025 01:22:53 GMT Subject: Integrated: 8361504: RISC-V: Make C1 clone intrinsic platform guard more specific In-Reply-To: <16DUz5Iytmw9i7wAxTx_oU4eeJBCsOI_15qzFP6M4GU=.8a5305a3-2b8d-4b9d-957a-430600bff4b4@github.com> References: <16DUz5Iytmw9i7wAxTx_oU4eeJBCsOI_15qzFP6M4GU=.8a5305a3-2b8d-4b9d-957a-430600bff4b4@github.com> Message-ID: On Mon, 7 Jul 2025 15:03:52 GMT, Feilong Jiang wrote: > Hi all. > Please review this trivial patch, which changes the C1 primitive array clone intrinsic RISCV platform guard into RISCV64. As we only support RISCV64 for now. This pull request has now been integrated. Changeset: 54e37629 Author: Feilong Jiang URL: https://git.openjdk.org/jdk/commit/54e37629f63eae7800415fa22684e6b3df3648ec Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod 8361504: RISC-V: Make C1 clone intrinsic platform guard more specific Reviewed-by: fyang, gcao ------------- PR: https://git.openjdk.org/jdk/pull/26161 From xgong at openjdk.org Wed Jul 9 01:23:43 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 9 Jul 2025 01:23:43 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: > ### Background > On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. > > For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. > > To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. > > ### Impact Analysis > #### 1. Vector types > Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. > > #### 2. Vector API > No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. > > #### 3. 
Auto-vectorization > Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. > > #### 4. Codegen of vector nodes > NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. > > Details: > - Lanewise vector operations are unaffected as explained above. > - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). > - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, adding an explicit vector size check in `match_rule_s... Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: Disable auto-vectorization of double to short conversion for NEON and update tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26057/files - new: https://git.openjdk.org/jdk/pull/26057/files/dfda42a3..7fdc357a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26057&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26057&range=02-03 Stats: 53 lines in 4 files changed: 35 ins; 0 del; 18 mod Patch: https://git.openjdk.org/jdk/pull/26057.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26057/head:pull/26057 PR: https://git.openjdk.org/jdk/pull/26057 From xgong at openjdk.org Wed Jul 9 01:23:44 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 9 Jul 2025 01:23:44 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> <0vdCJFYxCI6hFnTL6rm3oKQcPuuIR2EbuyAOa0muqcw=.d5c249cb-9bf0-415d-ab22-de7387d8d8d1@github.com> Message-ID: On Tue, 8 Jul 2025 10:33:50 GMT, Fei Gao wrote: > > > > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 > > > > > > > > > > > > > > > > > > Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. > > > > > > > > > > > > > > > Hi @XiaohongGong thanks for asking. In many auto-vectorization cases involving `char`, the vector elements are represented using `T_SHORT` as the `BasicType`, rather than `T_CHAR`. > > > > > This is because, in Java, operands of subword types are always promoted to `int` before any arithmetic operation. As a result, when handling a node like `ConvD2I`, we don?t initially know its actual subword type. 
Later, the SuperWord phase propagates a narrowed integer type backward to help determine the correct subword type. See: > > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2551-L2558 > > > > > > > > > > Since SuperWord assigns `T_SHORT` to `StoreC` early on > > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2646-L2650 > > > > > > > > > > the entire propagation chain tends to use `T_SHORT` as well. > > > > > This applies to most operations, with the exception of a few like `RShiftI`, `Abs`, and `ReverseBytesI`, which are handled separately. > > > > > So your change already benefits many char-related vectorization cases like `convertDoubleToChar` above. That?s why we can safely relax the IR condition mentioned earlier. > > > > > > > > > > > > Thanks for your input! It's really helpful to me. Does this mean it always use `T_SHORT` for char vectors in SLP? If so, it's safe that we do not need to consider `T_CHAR` in vector IRs in backend? > > > > > > > > > No, we don't always use `T_SHORT` for char vectors. As mentioned earlier, for operations like `RShiftI`, `Abs`, and `ReverseBytesI`, the compiler needs to preserve the higher-order bits of the first operand. Therefore, SuperWord still needs to assign them precise subword types. See: > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2583-L2589 > > > > > > Yes, I see. Thanks! What I mean is for cases that SLP will use the subword types, it will be actually `T_SHORT` for `T_CHAR` then? > > From my side, the cases where SLP uses subword types can be roughly categorized into two groups: > > 1. Cases where the compiler doesn?t need to preserve the higher-order bits ? in these, SuperWord will use `T_SHORT` instead of `T_CHAR`. > 2. Cases where the compiler does need to preserve the higher-order bits, like `RShiftI`, `Abs`, and `ReverseBytesI` ? in these, `T_CHAR` is still used. Thanks for your explanation! I'v updated the ad file and jtreg tests in latest commit. Could you please take a look at again? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3050732618 From xgong at openjdk.org Wed Jul 9 01:23:43 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 9 Jul 2025 01:23:43 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: <1bwEx4HAqqmfw9DrslwrZH1cYfIKi-5p9AgelJrIB94=.f46dd942-2102-4fd3-adfd-7f7ec3c3dbc0@github.com> References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> <0vdCJFYxCI6hFnTL6rm3oKQcPuuIR2EbuyAOa0muqcw=.d5c249cb-9bf0-415d-ab22-de7387d8d8d1@github.com> <1bwEx4HAqqmfw9DrslwrZH1cYfIKi-5p9AgelJrIB94=.f46dd942-2102-4fd3-adfd-7f7ec3c3dbc0@github.com> Message-ID: On Tue, 8 Jul 2025 09:07:00 GMT, Xiaohong Gong wrote: >>> > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 >>> > > > >>> > > > >>> > > > Actually I didn't change the min vector size for `char` vectors in this patch. 
Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. >>> > > >>> > > >>> > > Hi @XiaohongGong thanks for asking. In many auto-vectorization cases involving `char`, the vector elements are represented using `T_SHORT` as the `BasicType`, rather than `T_CHAR`. >>> > > This is because, in Java, operands of subword types are always promoted to `int` before any arithmetic operation. As a result, when handling a node like `ConvD2I`, we don?t initially know its actual subword type. Later, the SuperWord phase propagates a narrowed integer type backward to help determine the correct subword type. See: >>> > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2551-L2558 >>> > > >>> > > Since SuperWord assigns `T_SHORT` to `StoreC` early on >>> > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2646-L2650 >>> > > >>> > > the entire propagation chain tends to use `T_SHORT` as well. >>> > > This applies to most operations, with the exception of a few like `RShiftI`, `Abs`, and `ReverseBytesI`, which are handled separately. >>> > > So your change already benefits many char-related vectorization cases like `convertDoubleToChar` above. That?s why we can safely relax the IR condition mentioned earlier. >>> > >>> > >>> > Thanks for your input! It's really helpful to me. Does this mean it always use `T_SHORT` for char vectors in SLP? If so, it's safe that we do not need to consider `T_CHAR` in vector IRs in backend? >>> >>> No, we don't always use `T_SHORT` for char vectors. As mentioned earlier, for operations like `RShiftI`, `Abs`, and `ReverseBytesI`, the compiler needs to preserve the higher-order bits of the first operand. Therefore, SuperWord still needs to assign them precise subword types. See: >>> >>> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2583-L2589 >> >> Yes, I see. Thanks! What I mean is for cases that SLP will use the sub... > >> > > Hi @XiaohongGong, is there any way we can implement 2HF -> 2S and 2S -> 2HF in these match rules ? >> > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4697 >> > > >> > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4679 >> > > >> > > The `fcvtn` and `fcvtl` instructions do not support these arrangements. I was wondering if there is any other way we can implement these by any chance? >> > >> > >> > Do you mean `2HF -> 2F` and `2F -> 2HF` ? >> > Yes, it does not support the 32-bit arrangements. Vector conversion is a kind of lanewise vector operation. For such cases, we usually use the same arrangements with 64-bit vector size for 32-bit ones. That means we can reuse the `T4H` and `T4S` to implement it. Hence, current match rules can cover the conversions between `2HF` and `2F`. >> > Consider there is no such conversion cases in Vector API, I didn't change the comment in the match rules. I think this may benefit auto-vectorization. Currently, do we have cases that can match these rules with SLP? >> >> Sorry yes I meant 2HF <-> 2F. 
Yes, currently there are no such cases in VectorAPI as we do not support Float16 Vectors yet but this will benefit autovectorization cases. I think in this case this may also benefit 2D <-> 2HF as well (eventually we might add support for D <-> HF as well). Yes we have some JTREG tests that match these rules currently like - `test/hotspot/jtreg/compiler/vectorization/TestFloat16VectorConvChain.java`, `test/hotspot/jtreg/compiler/vectorization/TestFloatConversionsVector.java`. > > Thanks! So per my understanding, things that I just need is updating comment (e.g. `// 4HF to 4F`) of rules like `vcvtHFtoF`, right? For conversions between double and HF, we do not need any new rules as it will be actually `double -> float -> HF`, right? > > > > > Hi @XiaohongGong, is there any way we can implement 2HF -> 2S and 2S -> 2HF in these match rules ? > > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4697 > > > > > > > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4679 > > > > > > > > > > The `fcvtn` and `fcvtl` instructions do not support these arrangements. I was wondering if there is any other way we can implement these by any chance? > > > > > > > > > > > > Do you mean `2HF -> 2F` and `2F -> 2HF` ? > > > > Yes, it does not support the 32-bit arrangements. Vector conversion is a kind of lanewise vector operation. For such cases, we usually use the same arrangements with 64-bit vector size for 32-bit ones. That means we can reuse the `T4H` and `T4S` to implement it. Hence, current match rules can cover the conversions between `2HF` and `2F`. > > > > Consider there is no such conversion cases in Vector API, I didn't change the comment in the match rules. I think this may benefit auto-vectorization. Currently, do we have cases that can match these rules with SLP? > > > > > > > > > Sorry yes I meant 2HF <-> 2F. Yes, currently there are no such cases in VectorAPI as we do not support Float16 Vectors yet but this will benefit autovectorization cases. I think in this case this may also benefit 2D <-> 2HF as well (eventually we might add support for D <-> HF as well). Yes we have some JTREG tests that match these rules currently like - `test/hotspot/jtreg/compiler/vectorization/TestFloat16VectorConvChain.java`, `test/hotspot/jtreg/compiler/vectorization/TestFloatConversionsVector.java`. > > > > > > Thanks! So per my understanding, things that I just need is updating comment (e.g. `// 4HF to 4F`) of rules like `vcvtHFtoF`, right? For conversions between double and HF, we do not need any new rules as it will be actually `double -> float -> HF`, right? > > Yes please and also for `4F to 4HF` case for `vcvtF2HF`. Thanks! > > As for the double to half float conversion - yes with the current infrastructure it would be ConvD2F -> ConvF2HF which will be autovectorized to generate corresponding vector nodes. Sooner or later, support for ConvD2HF (and its vectorized version) might be added upstream (support already available in `lworld+fp16` branch of Valhalla here - https://github.com/openjdk/valhalla/blob/0ed65b9a63405e950c411835120f0f36e326aaaa/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4535). You do not have to add any new rules now for this case. I was just hinting at possible D<->HF implementation in the future. As the max vector length was 64bits, I did not add any implementation for Neon vcvtD2HF or vcvtHF2D in Valhalla. 
Maybe we can do two `fcvtl/fcvtn` to convert D to F and then F to HF for this specific case but we can think about that later :) Make sense to me. The latest change has been updated together with the relative jtreg tests. Would you mind taking another look at it? Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3050730818 From xgong at openjdk.org Wed Jul 9 01:26:44 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 9 Jul 2025 01:26:44 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> <0vdCJFYxCI6hFnTL6rm3oKQcPuuIR2EbuyAOa0muqcw=.d5c249cb-9bf0-415d-ab22-de7387d8d8d1@github.com> Message-ID: <1Zh7gvEryldv1xZZWYETiDjUzK_i1ea4H6U1EFEoeZU=.ef8296ca-f14f-4f4a-98c0-3b6b6d0721ed@github.com> On Tue, 8 Jul 2025 10:33:50 GMT, Fei Gao wrote: >>> > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 >>> > > > >>> > > > >>> > > > Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. >>> > > >>> > > >>> > > Hi @XiaohongGong thanks for asking. In many auto-vectorization cases involving `char`, the vector elements are represented using `T_SHORT` as the `BasicType`, rather than `T_CHAR`. >>> > > This is because, in Java, operands of subword types are always promoted to `int` before any arithmetic operation. As a result, when handling a node like `ConvD2I`, we don?t initially know its actual subword type. Later, the SuperWord phase propagates a narrowed integer type backward to help determine the correct subword type. See: >>> > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2551-L2558 >>> > > >>> > > Since SuperWord assigns `T_SHORT` to `StoreC` early on >>> > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2646-L2650 >>> > > >>> > > the entire propagation chain tends to use `T_SHORT` as well. >>> > > This applies to most operations, with the exception of a few like `RShiftI`, `Abs`, and `ReverseBytesI`, which are handled separately. >>> > > So your change already benefits many char-related vectorization cases like `convertDoubleToChar` above. That?s why we can safely relax the IR condition mentioned earlier. >>> > >>> > >>> > Thanks for your input! It's really helpful to me. Does this mean it always use `T_SHORT` for char vectors in SLP? If so, it's safe that we do not need to consider `T_CHAR` in vector IRs in backend? >>> >>> No, we don't always use `T_SHORT` for char vectors. As mentioned earlier, for operations like `RShiftI`, `Abs`, and `ReverseBytesI`, the compiler needs to preserve the higher-order bits of the first operand. Therefore, SuperWord still needs to assign them precise subword types. 
See: >>> >>> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2583-L2589 >> >> Yes, I see. Thanks! What I mean is for cases that SLP will use the sub... > >> > > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 >> > > > > >> > > > > >> > > > > Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. >> > > > >> > > > >> > > > Hi @XiaohongGong thanks for asking. In many auto-vectorization cases involving `char`, the vector elements are represented using `T_SHORT` as the `BasicType`, rather than `T_CHAR`. >> > > > This is because, in Java, operands of subword types are always promoted to `int` before any arithmetic operation. As a result, when handling a node like `ConvD2I`, we don?t initially know its actual subword type. Later, the SuperWord phase propagates a narrowed integer type backward to help determine the correct subword type. See: >> > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2551-L2558 >> > > > >> > > > Since SuperWord assigns `T_SHORT` to `StoreC` early on >> > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2646-L2650 >> > > > >> > > > the entire propagation chain tends to use `T_SHORT` as well. >> > > > This applies to most operations, with the exception of a few like `RShiftI`, `Abs`, and `ReverseBytesI`, which are handled separately. >> > > > So your change already benefits many char-related vectorization cases like `convertDoubleToChar` above. That?s why we can safely relax the IR condition mentioned earlier. >> > > >> > > >> > > Thanks for your input! It's really helpful to me. Does this mean it always use `T_SHORT` for char vectors in SLP? If so, it's safe that we do not need to consider `T_CHAR` in vector IRs in backend? >> > >> > >> > No, we don't always use `T_SHORT` for char vectors. As mentioned earlier, for operations like `RShiftI`, `Abs`, and `ReverseBytesI`, the compiler needs to preserve the higher-order bits of the first operand. Therefore, SuperWord still needs to assign them precise subword types. See: >> > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2583-L2589 >> >> Yes, I see. Thanks! What I mean is for cases th... @fg1417 , there is performance regression of `D -> S` on NEON for SLP. I'v disabled the case in latest change. 
And here is the performance data of JMH `TypeVectorOperations` on Grace (the 128-bit SVE machine) and N1 (NEON) respectively: Grace: Benchmark COUNT Mode Unit Before After Ratio TypeVectorOperationsSuperWord.convertD2S 512 avgt ns/op 155.667433 123.222497 1.26 TypeVectorOperationsSuperWord.convertD2S 2048 avgt ns/op 622.262384 489.336020 1.27 TypeVectorOperationsSuperWord.convertL2S 512 avgt ns/op 93.173939 63.557134 1.46 TypeVectorOperationsSuperWord.convertL2S 2048 avgt ns/op 365.287938 239.726941 1.52 TypeVectorOperationsSuperWord.convertS2D 512 avgt ns/op 157.096344 147.560047 1.06 TypeVectorOperationsSuperWord.convertS2D 2048 avgt ns/op 627.039963 614.748559 1.01 TypeVectorOperationsSuperWord.convertS2L 512 avgt ns/op 111.752970 108.629240 1.02 TypeVectorOperationsSuperWord.convertS2L 2048 avgt ns/op 441.312737 441.088523 1.00 N1: Benchmark COUNT Mode Unit Before After Ratio TypeVectorOperationsSuperWord.convertD2S 512 avgt ns/op 215.353528 214.769884 1.00 TypeVectorOperationsSuperWord.convertD2S 2048 avgt ns/op 958.428871 952.922855 1.00 TypeVectorOperationsSuperWord.convertL2S 512 avgt ns/op 158.000190 142.647209 1.10 TypeVectorOperationsSuperWord.convertL2S 2048 avgt ns/op 612.525835 532.023419 1.15 TypeVectorOperationsSuperWord.convertS2D 512 avgt ns/op 209.993363 210.466401 0.99 TypeVectorOperationsSuperWord.convertS2D 2048 avgt ns/op 819.181052 803.601170 1.01 TypeVectorOperationsSuperWord.convertS2L 512 avgt ns/op 217.848273 182.680450 1.19 TypeVectorOperationsSuperWord.convertS2L 2048 avgt ns/op 858.031089 695.502377 1.23 ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3050738693 From dzhang at openjdk.org Wed Jul 9 02:47:40 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 9 Jul 2025 02:47:40 GMT Subject: RFR: 8361532: RISC-V: Several vector tests fail after JDK-8354383 In-Reply-To: <-DfrHsd_D9lqbcRmNvF67dHOBaoxGQWUwTeUIa1IvfA=.d3f8182e-c581-4b88-89dd-f5f0781e7b67@github.com> References: <-DfrHsd_D9lqbcRmNvF67dHOBaoxGQWUwTeUIa1IvfA=.d3f8182e-c581-4b88-89dd-f5f0781e7b67@github.com> Message-ID: On Tue, 8 Jul 2025 02:30:27 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > After [JDK-8354383](https://bugs.openjdk.org/browse/JDK-8354383) , several test cases fail when fastdebug with RVV. > The reason for the error is that riscv lacks CastVV with dst as the mask register. > This PR adds the corresponding matching rules. > > ### Testing > qemu-system with RVV: > * [x] Run jdk_vector (fastdebug) > * [x] Run compiler/vectorapi (fastdebug) Thanks all for the review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26178#issuecomment-3050924135 From duke at openjdk.org Wed Jul 9 02:47:40 2025 From: duke at openjdk.org (duke) Date: Wed, 9 Jul 2025 02:47:40 GMT Subject: RFR: 8361532: RISC-V: Several vector tests fail after JDK-8354383 In-Reply-To: <-DfrHsd_D9lqbcRmNvF67dHOBaoxGQWUwTeUIa1IvfA=.d3f8182e-c581-4b88-89dd-f5f0781e7b67@github.com> References: <-DfrHsd_D9lqbcRmNvF67dHOBaoxGQWUwTeUIa1IvfA=.d3f8182e-c581-4b88-89dd-f5f0781e7b67@github.com> Message-ID: On Tue, 8 Jul 2025 02:30:27 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > After [JDK-8354383](https://bugs.openjdk.org/browse/JDK-8354383) , several test cases fail when fastdebug with RVV. > The reason for the error is that riscv lacks CastVV with dst as the mask register. > This PR adds the corresponding matching rules. 
> > ### Testing > qemu-system with RVV: > * [x] Run jdk_vector (fastdebug) > * [x] Run compiler/vectorapi (fastdebug) @DingliZhang Your change (at version 58edbc93bd5a951c3863f2666827dd075b96dce5) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26178#issuecomment-3050924709 From gcao at openjdk.org Wed Jul 9 03:32:42 2025 From: gcao at openjdk.org (Gui Cao) Date: Wed, 9 Jul 2025 03:32:42 GMT Subject: RFR: 8361532: RISC-V: Several vector tests fail after JDK-8354383 In-Reply-To: <-DfrHsd_D9lqbcRmNvF67dHOBaoxGQWUwTeUIa1IvfA=.d3f8182e-c581-4b88-89dd-f5f0781e7b67@github.com> References: <-DfrHsd_D9lqbcRmNvF67dHOBaoxGQWUwTeUIa1IvfA=.d3f8182e-c581-4b88-89dd-f5f0781e7b67@github.com> Message-ID: On Tue, 8 Jul 2025 02:30:27 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > After [JDK-8354383](https://bugs.openjdk.org/browse/JDK-8354383) , several test cases fail when fastdebug with RVV. > The reason for the error is that riscv lacks CastVV with dst as the mask register. > This PR adds the corresponding matching rules. > > ### Testing > qemu-system with RVV: > * [x] Run jdk_vector (fastdebug) > * [x] Run compiler/vectorapi (fastdebug) Marked as reviewed by gcao (Author). ------------- PR Review: https://git.openjdk.org/jdk/pull/26178#pullrequestreview-2999838017 From dzhang at openjdk.org Wed Jul 9 06:00:48 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 9 Jul 2025 06:00:48 GMT Subject: Integrated: 8361532: RISC-V: Several vector tests fail after JDK-8354383 In-Reply-To: <-DfrHsd_D9lqbcRmNvF67dHOBaoxGQWUwTeUIa1IvfA=.d3f8182e-c581-4b88-89dd-f5f0781e7b67@github.com> References: <-DfrHsd_D9lqbcRmNvF67dHOBaoxGQWUwTeUIa1IvfA=.d3f8182e-c581-4b88-89dd-f5f0781e7b67@github.com> Message-ID: <1963-CTLhk27cAAMMn37-PBZBOTBxRyhhWLANhipOIM=.c4daa9ad-5c58-4ee4-8588-881e3655c4d8@github.com> On Tue, 8 Jul 2025 02:30:27 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > After [JDK-8354383](https://bugs.openjdk.org/browse/JDK-8354383) , several test cases fail when fastdebug with RVV. > The reason for the error is that riscv lacks CastVV with dst as the mask register. > This PR adds the corresponding matching rules. > > ### Testing > qemu-system with RVV: > * [x] Run jdk_vector (fastdebug) > * [x] Run compiler/vectorapi (fastdebug) This pull request has now been integrated. Changeset: e0245682 Author: Dingli Zhang Committer: Fei Yang URL: https://git.openjdk.org/jdk/commit/e0245682c8d5a0daae055045c81248c12fb23c09 Stats: 11 lines in 1 file changed: 11 ins; 0 del; 0 mod 8361532: RISC-V: Several vector tests fail after JDK-8354383 Reviewed-by: fyang, fjiang, gcao ------------- PR: https://git.openjdk.org/jdk/pull/26178 From duke at openjdk.org Wed Jul 9 06:08:33 2025 From: duke at openjdk.org (erifan) Date: Wed, 9 Jul 2025 06:08:33 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v11] In-Reply-To: References: Message-ID: > This patch optimizes the following patterns: > For integer types: > > (XorV (VectorMaskCmp src1 src2 cond) (Replicate -1)) > => (VectorMaskCmp src1 src2 ncond) > (XorVMask (VectorMaskCmp src1 src2 cond) (MaskAll m1)) > => (VectorMaskCmp src1 src2 ncond) > > cond can be eq, ne, le, ge, lt, gt, ule, uge, ult and ugt, ncond is the negative comparison of cond. 
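[Editorial illustration, not part of the original message: the integer-type pattern above typically comes from a Vector API compare followed by a mask negation; the float and double variant of the pattern continues below. A minimal sketch, assuming IntVector.SPECIES_128 and the incubating jdk.incubator.vector API; class and method names are invented for the example.]

import jdk.incubator.vector.*;

public class CompareNotSketch {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;

    // compare() produces a VectorMaskCmp; not() negates the mask, which is
    // commonly implemented as an XOR with an all-true mask, i.e. the
    // (XorV (VectorMaskCmp src1 src2 cond) (Replicate -1)) shape described above.
    static VectorMask<Integer> notLessThan(int[] a, int[] b, int i) {
        IntVector va = IntVector.fromArray(SPECIES, a, i);
        IntVector vb = IntVector.fromArray(SPECIES, b, i);
        return va.compare(VectorOperators.LT, vb).not();
    }
}

[With the rewrite described above, such code can be compiled to a single compare with the negated condition (here GE) instead of a compare plus an explicit mask inversion.]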
> > For float and double types: > > (XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1)) > => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) > (XorVMask (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (MaskAll m1)) > => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) > > cond can be eq or ne. > > Benchmarks on Nvidia Grace machine with 128-bit SVE2: With option `-XX:UseSVE=2`: > > Benchmark Unit Before Score Error After Score Error Uplift > testCompareEQMaskNotByte ops/s 7912127.225 2677.289518 10266136.26 8955.008548 1.29 > testCompareEQMaskNotDouble ops/s 884737.6799 446.963779 1179760.772 448.031844 1.33 > testCompareEQMaskNotFloat ops/s 1765045.787 682.332214 2359520.803 896.305743 1.33 > testCompareEQMaskNotInt ops/s 1787221.411 977.743935 2353952.519 960.069976 1.31 > testCompareEQMaskNotLong ops/s 895297.1974 673.44808 1178449.02 323.804205 1.31 > testCompareEQMaskNotShort ops/s 3339987.002 3415.2226 4712761.965 2110.862053 1.41 > testCompareGEMaskNotByte ops/s 7907615.16 4094.243652 10251646.9 9486.699831 1.29 > testCompareGEMaskNotInt ops/s 1683738.958 4233.813092 2352855.205 1251.952546 1.39 > testCompareGEMaskNotLong ops/s 854496.1561 8594.598885 1177811.493 521.1229 1.37 > testCompareGEMaskNotShort ops/s 3341860.309 1578.975338 4714008.434 1681.10365 1.41 > testCompareGTMaskNotByte ops/s 7910823.674 2993.367032 10245063.58 9774.75138 1.29 > testCompareGTMaskNotInt ops/s 1673393.928 3153.099431 2353654.521 1190.848583 1.4 > testCompareGTMaskNotLong ops/s 849405.9159 2432.858159 1177952.041 359.96413 1.38 > testCompareGTMaskNotShort ops/s 3339509.141 3339.976585 4711442.496 2673.364893 1.41 > testCompareLEMaskNotByte ops/s 7911340.004 3114.69191 10231626.5 27134.20035 1.29 > testCompareLEMaskNotInt ops/s 1675812.113 1340.969885 2353255.341 1452.4522 1.4 > testCompareLEMaskNotLong ops/s 848862.8036 6564.841731 1177763.623 539.290106 1.38 > testCompareLEMaskNotShort ops/s 3324951.54 2380.29473 4712116.251 1544.559684 1.41 > testCompareLTMaskNotByte ops/s 7910390.844 2630.861436 10239567.69 6487.441672 1.29 > testCompareLTMaskNotInt ops/s 1672180.09 995.238142 2353757.863 853.774734 1.4 > testCompareLTMaskNotLong ops/s 856502.26... erifan has updated the pull request incrementally with one additional commit since the last revision: Update the code comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24674/files - new: https://git.openjdk.org/jdk/pull/24674/files/db78dc43..04142a19 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24674&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24674&range=09-10 Stats: 6 lines in 1 file changed: 3 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/24674.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24674/head:pull/24674 PR: https://git.openjdk.org/jdk/pull/24674 From duke at openjdk.org Wed Jul 9 06:18:48 2025 From: duke at openjdk.org (erifan) Date: Wed, 9 Jul 2025 06:18:48 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v10] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 11:42:02 GMT, Emanuel Peter wrote: >> erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 17 additional commits since the last revision: >> >> - Align indentation >> - Merge branch 'master' into JDK-8354242 >> - Address more comments >> >> ATT. 
>> - Merge branch 'master' into JDK-8354242 >> - Support negating unsigned comparison for BoolTest::mask >> >> Added a static method `negate_mask(mask btm)` into BoolTest class to >> negate both signed and unsigned comparison. >> - Addressed some review comments >> - Merge branch 'master' into JDK-8354242 >> - Refactor the JTReg tests for compare.xor(maskAll) >> >> Also made a bit change to support pattern `VectorMask.fromLong()`. >> - Merge branch 'master' into JDK-8354242 >> - Refactor code >> >> Add a new function XorVNode::Ideal_XorV_VectorMaskCmp to do this >> optimization, making the code more modular. >> - ... and 7 more: https://git.openjdk.org/jdk/compare/04bd77d0...db78dc43 > > src/hotspot/share/opto/vectornode.cpp line 2241: > >> 2239: in1->outcnt() != 1 || >> 2240: !(in1->as_VectorMaskCmp())->predicate_can_be_negated() || >> 2241: !VectorNode::is_all_ones_vector(in2)) { > > Suggestion: > > !in1->as_VectorMaskCmp()->predicate_can_be_negated() || > !VectorNode::is_all_ones_vector(in2)) { > > Remove the indentation again, and the superfluous brackets too ;) Done ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2194130835 From duke at openjdk.org Wed Jul 9 06:18:48 2025 From: duke at openjdk.org (erifan) Date: Wed, 9 Jul 2025 06:18:48 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v9] In-Reply-To: <2UzxnawLUtlwIr5aaEdTfn4OEMt_z1HTfAaDBHCeZFU=.a70d1360-574a-4ca9-adae-7dec030ed2b7@github.com> References: <2UzxnawLUtlwIr5aaEdTfn4OEMt_z1HTfAaDBHCeZFU=.a70d1360-574a-4ca9-adae-7dec030ed2b7@github.com> Message-ID: On Tue, 8 Jul 2025 11:42:18 GMT, Emanuel Peter wrote: >> Oh wow, my bad. I misunderstood the brackets! >> >> Instead of: >> >> !(in1->as_VectorMaskCmp())->predicate_can_be_negated() || >> !VectorNode::is_all_ones_vector(in2)) { >> >> I read: >> >> !(in1->as_VectorMaskCmp()->predicate_can_be_negated() || >> !VectorNode::is_all_ones_vector(in2))) { >> >> That confused me a lot... absolutely my bad. >> >> Well actually then my indentation suggestion was terrible! > > I made a new suggestion below. > A code comment would be helpful for this case. I updated the comment above the code a bit. As for why predicate need to be negatable, it's straightforward, the key of this optimization is to change predicate condition into negative predicate condition. And in `predicate_can_be_negated`, there's a comment explaining when predicate can't be negated. > I made a new suggestion below. Done. > That confused me a lot... absolutely my bad. Well actually then my indentation suggestion was terrible! No problem. I'm a newbie in the JDK community, so generally I think your suggestions are valuable.Thanks for your review! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2194130234 From thartmann at openjdk.org Wed Jul 9 06:58:38 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 9 Jul 2025 06:58:38 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 13:33:23 GMT, Aleksey Shipilev wrote: >> We routinely CTW 3rd party JARs to make sure our compilers work. By the nature of the JARs, they have dependencies on other JARs, and CTW runner frequently warns out with NCDFE. It does so very crudely, missing opportunities to compile the methods that _do not_ trigger NCDFEs. CTW should be made more tolerant to this. 
I think the normal "modules" CTW runs into a similar problem, but on a lesser scale, as we do not have a very hairy dependency graph within JDK. >> >> The CTW logs are also fairly noisy with full exception traces when NCDFE is semi-expected. This PR does _not_ print exception stack traces in these cases, only "NOTE"-s about it. This makes the log fairly clean and more understandable. >> >> Motivational scope improvement compiling a sample 3rd party JAR (cassandra-2.1.4.0.jar): >> >> >> Before: Done (2487 classes, 9866 methods, 24584 ms) >> After: Done (2487 classes, 10074 methods, 24150 ms) ; +2% more methods >> >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Just use printf directly Sure, I'll run testing and report back. Sorry for the delay. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26090#issuecomment-3051401437 From duke at openjdk.org Wed Jul 9 07:29:39 2025 From: duke at openjdk.org (duke) Date: Wed, 9 Jul 2025 07:29:39 GMT Subject: RFR: 8357689: Refactor JVMCI to enable replay compilation in Graal [v3] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 17:37:57 GMT, Andrej Pecimuth wrote: >> This PR introduces a few minor JVMCI refactorings to make replay compilation possible in the Graal compiler. > > Andrej Pecimuth has updated the pull request incrementally with one additional commit since the last revision: > > Remove unnecessary public modifier. @pecimuth Your change (at version 1b845fa92383367026d8072ba5a9525ded15dccb) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25433#issuecomment-3051484038 From tschatzl at openjdk.org Wed Jul 9 07:46:53 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Wed, 9 Jul 2025 07:46:53 GMT Subject: RFR: 8350621: Code cache stops scheduling GC In-Reply-To: References: Message-ID: On Sun, 16 Feb 2025 18:39:29 GMT, Alexandre Jacob wrote: > The purpose of this PR is to fix a bug where we can end up in a situation where the GC is not scheduled anymore by `CodeCache`. > > This situation is possible because the `_unloading_threshold_gc_requested` flag is set to `true` when triggering the GC and we expect the GC to call `CodeCache::on_gc_marking_cycle_finish` which in turn will call `CodeCache::update_cold_gc_count`, which will reset the flag `_unloading_threshold_gc_requested` allowing further GC scheduling. > > Unfortunately this can't work properly under certain circumstances. > For example, if using G1GC, calling `G1CollectedHeap::collect` does not give the guarantee that the GC will actually run as it can be already running (see [here](https://github.com/openjdk/jdk/blob/7d11418c820b46926a25907766d16083a4b349de/src/hotspot/share/gc/g1/g1CollectedHeap.cpp#L1763)). > > I have observed this behavior on JVMs in version 21 that were migrated recently from Java 17. > Those JVMs have some pressure on code cache and quite a large heap in comparison to allocation rate, which means that objects are mostly GC'd by young collections and full GCs take a long time to happen. > > I have been able to reproduce this issue with ParallelGC and G1GC, and I imagine that other GCs can be impacted as well.
> > In order to reproduce this issue, I found a very simple and convenient way: > > > public class CodeCacheMain { > public static void main(String[] args) throws InterruptedException { > while (true) { > Thread.sleep(100); > } > } > } > > > Run this simple app with the following JVM flags: > > > -Xlog:gc*=info,codecache=info -Xmx512m -XX:ReservedCodeCacheSize=2496k -XX:StartAggressiveSweepingAt=15 > > > - 512m for the heap just to clarify the intent that we don't want to be bothered by a full GC > - low `ReservedCodeCacheSize` to put pressure on code cache quickly > - `StartAggressiveSweepingAt` can be set to 20 or 15 for faster bug reproduction > > Itself, the program will hardly get pressure on code cache, but the good news is that it is sufficient to attach a jconsole on it which will: > - allows us to monitor code cache > - indirectly generate activity on the code cache, just what we need to reproduce the bug > > Some logs related to code cache will show up at some point with GC activity: > > > [648.733s][info][codecache ] Triggering aggressive GC due to having only 14.970% free memory > > > And then it will stop and we'll end up with the following message: > > > [672.714s][info][codecache ] Code cache is full - disabling compilation > > > L... Thank you. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23656#issuecomment-3051539837 From duke at openjdk.org Wed Jul 9 08:22:44 2025 From: duke at openjdk.org (Andrej Pecimuth) Date: Wed, 9 Jul 2025 08:22:44 GMT Subject: Integrated: 8357689: Refactor JVMCI to enable replay compilation in Graal In-Reply-To: References: Message-ID: On Sat, 24 May 2025 16:49:23 GMT, Andrej Pecimuth wrote: > This PR introduces a few minor JVMCI refactorings to make replay compilation possible in the Graal compiler. This pull request has now been integrated. Changeset: 963b83fc Author: Andrej Pecimuth Committer: Doug Simon URL: https://git.openjdk.org/jdk/commit/963b83fcf158d273e9433b6845380184b3ad0de5 Stats: 258 lines in 16 files changed: 224 ins; 7 del; 27 mod 8357689: Refactor JVMCI to enable replay compilation in Graal Reviewed-by: dnsimon ------------- PR: https://git.openjdk.org/jdk/pull/25433 From aph at openjdk.org Wed Jul 9 08:30:42 2025 From: aph at openjdk.org (Andrew Haley) Date: Wed, 9 Jul 2025 08:30:42 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v4] In-Reply-To: References: <3-UMhQGbb3psk7_pn0BC1SJNrKtOfZWphO_V9d9Bqz8=.09589b1a-fb0d-4150-95f4-7565d32ed7b1@github.com> Message-ID: On Tue, 8 Jul 2025 12:03:33 GMT, Evgeny Astigeevich wrote: >>> Erm, I don't see why? >> >> To make it unique? I don't see why you don't see that we need to ensure that the string is unique. > > @theRealAph, are you okay with the latest variant Aleksey proposes? > But this is a generic macroAssembler block comment. It does not make sense to me to have a block comment that looks like a memory corrupted string, just to satisfy a single test. There should be a middle-ground here, e.g. `spin_wait {`, which looks reasonably enough as the block comment, and not anything else. Would also match nicely when we emit the closing `}` at the end of the instruction block. 
OK ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2194396280 From fgao at openjdk.org Wed Jul 9 09:01:49 2025 From: fgao at openjdk.org (Fei Gao) Date: Wed, 9 Jul 2025 09:01:49 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 01:23:43 GMT, Xiaohong Gong wrote: >> ### Background >> On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. >> >> For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. >> >> To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. >> >> ### Impact Analysis >> #### 1. Vector types >> Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. >> >> #### 2. Vector API >> No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. >> >> #### 3. Auto-vectorization >> Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. >> >> #### 4. Codegen of vector nodes >> NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. >> >> Details: >> - Lanewise vector operations are unaffected as explained above. >> - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). >> - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, addin... > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Disable auto-vectorization of double to short conversion for NEON and update tests @XiaohongGong Thanks for testing it, and also for your update ? much appreciated! ------------- Marked as reviewed by fgao (Committer). 
PR Review: https://git.openjdk.org/jdk/pull/26057#pullrequestreview-3000660649 From xgong at openjdk.org Wed Jul 9 09:09:41 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 9 Jul 2025 09:09:41 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 01:23:43 GMT, Xiaohong Gong wrote: >> ### Background >> On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. >> >> For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. >> >> To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. >> >> ### Impact Analysis >> #### 1. Vector types >> Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. >> >> #### 2. Vector API >> No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. >> >> #### 3. Auto-vectorization >> Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. >> >> #### 4. Codegen of vector nodes >> NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. >> >> Details: >> - Lanewise vector operations are unaffected as explained above. >> - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). >> - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, addin... > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Disable auto-vectorization of double to short conversion for NEON and update tests Hi @eme64 , could you please help take a look at this patch especially the test part since most of the tests are SLP related? It will be helpful if you could also help trigger a testing for it. Thanks for your time! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3051800402 From epeter at openjdk.org Wed Jul 9 09:16:44 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 9 Jul 2025 09:16:44 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 09:06:44 GMT, Xiaohong Gong wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: >> >> Disable auto-vectorization of double to short conversion for NEON and update tests > > Hi @eme64 , could you please help take a look at this patch especially the test part since most of the tests are SLP related? It will be helpful if you could also help trigger a testing for it. Thanks for your time! @XiaohongGong I would love to review and test, but I'm about to go on vacation and will only be back in August. I've pinged some others internally, and hope someone will pick this up! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3051824198 From chagedorn at openjdk.org Wed Jul 9 09:18:53 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 9 Jul 2025 09:18:53 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v36] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 08:43:31 GMT, Roland Westrelin wrote: >> To optimize a long counted loop and long range checks in a long or int >> counted loop, the loop is turned into a loop nest. When the loop has >> few iterations, the overhead of having an outer loop whose backedge is >> never taken, has a measurable cost. Furthermore, creating the loop >> nest usually causes one iteration of the loop to be peeled so >> predicates can be set up. If the loop is short running, then it's an >> extra iteration that's run with range checks (compared to an int >> counted loop with int range checks). >> >> This change doesn't create a loop nest when: >> >> 1- it can be determined statically at loop nest creation time that the >> loop runs for a short enough number of iterations >> >> 2- profiling reports that the loop runs for no more than ShortLoopIter >> iterations (1000 by default). >> >> For 2-, a guard is added which is implemented as yet another predicate. >> >> While this change is in principle simple, I ran into a few >> implementation issues: >> >> - while c2 has a way to compute the number of iterations of an int >> counted loop, it doesn't have that for long counted loop. The >> existing logic for int counted loops promotes values to long to >> avoid overflows. I reworked it so it now works for both long and int >> counted loops. >> >> - I added a new deoptimization reason (Reason_short_running_loop) for >> the new predicate. Given the number of iterations is narrowed down >> by the predicate, the limit of the loop after transformation is a >> cast node that's control dependent on the short running loop >> predicate. Because once the counted loop is transformed, it is >> likely that range check predicates will be inserted and they will >> depend on the limit, the short running loop predicate has to be the >> one that's further away from the loop entry. 
Now it is also possible >> that the limit before transformation depends on a predicate >> (TestShortRunningLongCountedLoopPredicatesClone is an example), we >> can have: new predicates inserted after the transformation that >> depend on the casted limit that itself depend on old predicates >> added before the transformation. To solve this cicular dependency, >> parse and assert predicates are cloned between the old predicates >> and the loop head. The cloned short running loop parse predicate is >> the one that's used to insert the short running loop predicate. >> >> - In the case of a long counted loop, the loop is transformed into a >> regular loop with a ... > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 97 commits: > > - review > - Merge branch 'master' into JDK-8342692 > - Update src/hotspot/share/opto/c2_globals.hpp > > Co-authored-by: Christian Hagedorn > - small fix > - Merge branch 'master' into JDK-8342692 > - review > - review > - Update test/micro/org/openjdk/bench/java/lang/foreign/HeapMismatchManualLoopTest.java > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningLongCountedLoopScaleOverflow.java > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningLongCountedLoopPredicatesClone.java > > Co-authored-by: Christian Hagedorn > - ... and 87 more: https://git.openjdk.org/jdk/compare/310ef856...bb69cc02 Thanks for the update, still good! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/21630#pullrequestreview-3000719709 From xgong at openjdk.org Wed Jul 9 09:19:42 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 9 Jul 2025 09:19:42 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 09:06:44 GMT, Xiaohong Gong wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: >> >> Disable auto-vectorization of double to short conversion for NEON and update tests > > Hi @eme64 , could you please help take a look at this patch especially the test part since most of the tests are SLP related? It will be helpful if you could also help trigger a testing for it. Thanks for your time! > @XiaohongGong I would love to review and test, but I'm about to go on vacation and will only be back in August. I've pinged some others internally, and hope someone will pick this up! Thanks a lot for the help! Sounds good to me and have a good holiday! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3051835903 From fjiang at openjdk.org Wed Jul 9 10:07:31 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Wed, 9 Jul 2025 10:07:31 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v4] In-Reply-To: References: Message-ID: <30vvzTU6W2p0YpB8Z9bSfO9ajO_fHh79q9cX1G3gz3k=.521b26d7-b606-4fdc-bdcf-41fd6c4891cc@github.com> > Hi, please consider. > [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. > The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. 
> If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. > This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. > We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. > > This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. > The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. > > Test on linux-riscv64: > - [x] Tier1-3 > > JMH data on P550 SBC for reference (w/o and w/ the patch): > > Before: > > Without COH: > > Benchmark (size) Mode Cnt Score Error Units > ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op > ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op > ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op > ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op > ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op > ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op > ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op > ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op > ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op > ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op > ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op > ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op > ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op > ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op > ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op > ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op > > ------------------------------------------------------------------------- > With COH: > > Benchmark (size) Mode Cnt Score Error Units > ArrayClone.byteArraycopy 0 avgt 15 50.667 ? 0.622 ns/op > Arra... Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains six additional commits since the last revision: - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - Revert RISCV Macro modification - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses - riscv: fix c1 primitive array clone intrinsic regression ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25976/files - new: https://git.openjdk.org/jdk/pull/25976/files/3a502f84..ca628e16 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25976&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25976&range=02-03 Stats: 8585 lines in 322 files changed: 4290 ins; 1434 del; 2861 mod Patch: https://git.openjdk.org/jdk/pull/25976.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25976/head:pull/25976 PR: https://git.openjdk.org/jdk/pull/25976 From chagedorn at openjdk.org Wed Jul 9 10:37:52 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 9 Jul 2025 10:37:52 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v36] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 08:43:31 GMT, Roland Westrelin wrote: >> To optimize a long counted loop and long range checks in a long or int >> counted loop, the loop is turned into a loop nest. When the loop has >> few iterations, the overhead of having an outer loop whose backedge is >> never taken, has a measurable cost. Furthermore, creating the loop >> nest usually causes one iteration of the loop to be peeled so >> predicates can be set up. If the loop is short running, then it's an >> extra iteration that's run with range checks (compared to an int >> counted loop with int range checks). >> >> This change doesn't create a loop nest when: >> >> 1- it can be determined statically at loop nest creation time that the >> loop runs for a short enough number of iterations >> >> 2- profiling reports that the loop runs for no more than ShortLoopIter >> iterations (1000 by default). >> >> For 2-, a guard is added which is implemented as yet another predicate. >> >> While this change is in principle simple, I ran into a few >> implementation issues: >> >> - while c2 has a way to compute the number of iterations of an int >> counted loop, it doesn't have that for long counted loop. The >> existing logic for int counted loops promotes values to long to >> avoid overflows. I reworked it so it now works for both long and int >> counted loops. >> >> - I added a new deoptimization reason (Reason_short_running_loop) for >> the new predicate. Given the number of iterations is narrowed down >> by the predicate, the limit of the loop after transformation is a >> cast node that's control dependent on the short running loop >> predicate. Because once the counted loop is transformed, it is >> likely that range check predicates will be inserted and they will >> depend on the limit, the short running loop predicate has to be the >> one that's further away from the loop entry. Now it is also possible >> that the limit before transformation depends on a predicate >> (TestShortRunningLongCountedLoopPredicatesClone is an example), we >> can have: new predicates inserted after the transformation that >> depend on the casted limit that itself depend on old predicates >> added before the transformation. 
To solve this cicular dependency, >> parse and assert predicates are cloned between the old predicates >> and the loop head. The cloned short running loop parse predicate is >> the one that's used to insert the short running loop predicate. >> >> - In the case of a long counted loop, the loop is transformed into a >> regular loop with a ... > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 97 commits: > > - review > - Merge branch 'master' into JDK-8342692 > - Update src/hotspot/share/opto/c2_globals.hpp > > Co-authored-by: Christian Hagedorn > - small fix > - Merge branch 'master' into JDK-8342692 > - review > - review > - Update test/micro/org/openjdk/bench/java/lang/foreign/HeapMismatchManualLoopTest.java > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningLongCountedLoopScaleOverflow.java > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningLongCountedLoopPredicatesClone.java > > Co-authored-by: Christian Hagedorn > - ... and 87 more: https://git.openjdk.org/jdk/compare/310ef856...bb69cc02 I gave your latest patch another spin in our testing. It's still running but it already found some issues: - SA tests (see separate comment) - `#include` order problem (see separate comment) - Various `jdk/incubator/vector/*` tests are failing, for example `Byte128VectorLoadStoreTests.java`: Additional VM flags: `-XX:UseAVX=2` (it also reproduces with 0 and 1 so far) # # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/opt/mach5/mesos/work_dir/slaves/d2398cde-9325-49c3-b030-8961a4f0a253-S650407/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/05605dc0-bf5e-434a-82b5-65af69c62ec6/runs/591d89b1-11c0-415e-b2ce-4c0a13ce80f8/workspace/open/src/hotspot/share/opto/vectorization.cpp:141), pid=704535, tid=704555 # assert(_cl->is_multiversion_fast_loop() == (_multiversioning_fast_proj != nullptr)) failed: must find the multiversion selector IFF loop is a multiversion fast loop Current CompileTask: C2:7789 1280 jdk.incubator.vector.ByteVector::ldLongOp (48 bytes) Stack: [0x00007f9ef7cfe000,0x00007f9ef7dfe000], sp=0x00007f9ef7df8560, free space=1001k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x1bcb7a4] VLoop::check_preconditions_helper() [clone .part.0]+0x824 (vectorization.cpp:141) V [libjvm.so+0x1bcba31] VLoop::check_preconditions()+0x41 (vectorization.cpp:41) V [libjvm.so+0x1573ea1] PhaseIdealLoop::auto_vectorize(IdealLoopTree*, VSharedData&)+0x241 (loopopts.cpp:4449) V [libjvm.so+0x155274d] PhaseIdealLoop::build_and_optimize()+0xfdd (loopnode.cpp:5270) [...] src/hotspot/share/opto/castnode.cpp line 35: > 33: #include "opto/subnode.hpp" > 34: #include "opto/type.hpp" > 35: #include "opto/loopnode.hpp" The new unsorted include now causes `sources/TestIncludesAreSorted.java` to fail. 
------------- PR Review: https://git.openjdk.org/jdk/pull/21630#pullrequestreview-3000973554 PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2194660384 From chagedorn at openjdk.org Wed Jul 9 10:37:52 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 9 Jul 2025 10:37:52 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v16] In-Reply-To: References: Message-ID: On Thu, 15 May 2025 10:30:13 GMT, Christian Hagedorn wrote: >> Otherwise, this assert: >> >> >> assert((1 << _reason_bits) >= Reason_LIMIT, "enough bits"); >> >> >> fails. Rather than tweak the allocation of bits to `_action_bits`, `_reason_bits`, `_debug_id_bits`, to extend `_reason_bits`, I thought it was simpler to have c2 and graal share the encoding of a reason given graal doesn't use the new `Reason_short_running_long_loop` and c2 doesn't use the jvmci specific `Reason_aliasing`. > > Makes sense, thanks for the explanation! I think this sharing is now causing problems with SA tests, for example with `serviceability/sa/TestPrintMdo.java`: stderr: [Exception in thread "main" java.lang.InternalError: duplicate reasons: aliasing short_running_long_loop at jdk.hotspot.agent/sun.jvm.hotspot.oops.MethodData.initialize(MethodData.java:181) at jdk.hotspot.agent/sun.jvm.hotspot.oops.MethodData$1.update(MethodData.java:128) at jdk.hotspot.agent/sun.jvm.hotspot.runtime.VM.registerVMInitializedObserver(VM.java:569) at jdk.hotspot.agent/sun.jvm.hotspot.oops.MethodData.(MethodData.java:126) [...] ] exitValue = 1 java.lang.RuntimeException: Test ERROR java.lang.RuntimeException: Expected to get exit value of [0], exit value is: [1] at TestPrintMdo.main(TestPrintMdo.java:66) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) at java.base/java.lang.reflect.Method.invoke(Method.java:565) at com.sun.javatest.regtest.agent.MainWrapper$MainTask.run(MainWrapper.java:138) at java.base/java.lang.Thread.run(Thread.java:1474) Caused by: java.lang.RuntimeException: Expected to get exit value of [0], exit value is: [1] at jdk.test.lib.process.OutputAnalyzer.shouldHaveExitValue(OutputAnalyzer.java:522) at ClhsdbLauncher.runCmd(ClhsdbLauncher.java:148) at ClhsdbLauncher.run(ClhsdbLauncher.java:212) at TestPrintMdo.main(TestPrintMdo.java:62) ... 4 more ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2194657689 From bkilambi at openjdk.org Wed Jul 9 10:45:46 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 9 Jul 2025 10:45:46 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 01:23:43 GMT, Xiaohong Gong wrote: >> ### Background >> On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. >> >> For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. 
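A small, self-contained example of that conversion shape, written against the public Vector API, is given below. It is illustrative only and not part of the patch; it needs --add-modules jdk.incubator.vector to compile and run.

    import jdk.incubator.vector.LongVector;
    import jdk.incubator.vector.ShortVector;
    import jdk.incubator.vector.VectorOperators;
    import jdk.incubator.vector.VectorSpecies;

    public class ShortToLong128 {
        static final VectorSpecies<Short> S128 = ShortVector.SPECIES_128;
        static final VectorSpecies<Long>  L128 = LongVector.SPECIES_128;

        public static void main(String[] args) {
            short[] src = {1, 2, 3, 4, 5, 6, 7, 8};
            long[] dst = new long[L128.length()];          // 2 lanes for a 128-bit long species

            ShortVector sv = ShortVector.fromArray(S128, src, 0);
            // Part 0 widens the lowest 2 short lanes into a 128-bit long vector;
            // this is where a "2 x short" (32-bit) intermediate vector shows up.
            LongVector lv = (LongVector) sv.convertShape(VectorOperators.S2L, L128, 0);
            lv.intoArray(dst, 0);

            System.out.println(java.util.Arrays.toString(dst)); // [1, 2]
        }
    }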
>> >> To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. >> >> ### Impact Analysis >> #### 1. Vector types >> Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. >> >> #### 2. Vector API >> No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. >> >> #### 3. Auto-vectorization >> Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. >> >> #### 4. Codegen of vector nodes >> NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. >> >> Details: >> - Lanewise vector operations are unaffected as explained above. >> - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). >> - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, addin... > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Disable auto-vectorization of double to short conversion for NEON and update tests Thanks for making the changes. Looks good to me. ------------- Marked as reviewed by bkilambi (Author). PR Review: https://git.openjdk.org/jdk/pull/26057#pullrequestreview-3001022247 From bkilambi at openjdk.org Wed Jul 9 11:08:59 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 9 Jul 2025 11:08:59 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v11] In-Reply-To: References: Message-ID: > This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. > > It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. > > For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. > > For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. > > This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. 
> > Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - > > > Benchmark (size) Mode Cnt Gain > SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 > SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 > SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 > SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 > SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 > SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 > SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 > SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 > SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 > SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 > > > Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: x86_64: JTREG test update for x86. The patch is contributed by @jatin-bhateja ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23570/files - new: https://git.openjdk.org/jdk/pull/23570/files/e86d55df..cec6f148 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=09-10 Stats: 46 lines in 3 files changed: 31 ins; 0 del; 15 mod Patch: https://git.openjdk.org/jdk/pull/23570.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23570/head:pull/23570 PR: https://git.openjdk.org/jdk/pull/23570 From dlunden at openjdk.org Wed Jul 9 11:11:51 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Wed, 9 Jul 2025 11:11:51 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 09:17:13 GMT, Xiaohong Gong wrote: >> Hi @eme64 , could you please help take a look at this patch especially the test part since most of the tests are SLP related? It will be helpful if you could also help trigger a testing for it. Thanks for your time! > >> @XiaohongGong I would love to review and test, but I'm about to go on vacation and will only be back in August. I've pinged some others internally, and hope someone will pick this up! > > Thanks a lot for the help! Sounds good to me and have a good holiday! @XiaohongGong I'll run some tests and have a look at the changes as well (@eme64 asked me). I'll get back to you shortly! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3052222395 From bkilambi at openjdk.org Wed Jul 9 11:22:58 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 9 Jul 2025 11:22:58 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v12] In-Reply-To: References: Message-ID: > This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. > > It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. > > For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. > > For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. 
For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. > > This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. > > Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - > > > Benchmark (size) Mode Cnt Gain > SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 > SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 > SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 > SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 > SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 > SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 > SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 > SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 > SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 > SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 > > > Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: Addressed review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23570/files - new: https://git.openjdk.org/jdk/pull/23570/files/cec6f148..8025db0c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=10-11 Stats: 61 lines in 5 files changed: 14 ins; 9 del; 38 mod Patch: https://git.openjdk.org/jdk/pull/23570.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23570/head:pull/23570 PR: https://git.openjdk.org/jdk/pull/23570 From bkilambi at openjdk.org Wed Jul 9 11:40:29 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 9 Jul 2025 11:40:29 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v13] In-Reply-To: References: Message-ID: > This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. > > It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. > > For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. > > For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. > > This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. 
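For context, here is a usage sketch of the two-vector selectFrom API that the intrinsic implements. It is illustrative only: it assumes a JDK whose jdk.incubator.vector module already exposes the two-vector selectFrom overload, and it needs --add-modules jdk.incubator.vector.

    import jdk.incubator.vector.ByteVector;
    import jdk.incubator.vector.VectorSpecies;

    public class SelectFromTwoVectors {
        static final VectorSpecies<Byte> B128 = ByteVector.SPECIES_128;

        public static void main(String[] args) {
            int vlen = B128.length();                       // 16 byte lanes at 128 bits
            byte[] lo = new byte[vlen];
            byte[] hi = new byte[vlen];
            byte[] idx = new byte[vlen];
            for (int i = 0; i < vlen; i++) {
                lo[i] = (byte) i;                           // 0..15
                hi[i] = (byte) (i + vlen);                  // 16..31
                idx[i] = (byte) ((i * 7) % (2 * vlen));     // indexes into the 32-lane pair
            }
            ByteVector v1 = ByteVector.fromArray(B128, lo, 0);
            ByteVector v2 = ByteVector.fromArray(B128, hi, 0);
            ByteVector sel = ByteVector.fromArray(B128, idx, 0);

            // Lanes of 'sel' in [0, vlen) pick from v1, lanes in [vlen, 2*vlen) pick from v2.
            ByteVector r = (ByteVector) sel.selectFrom(v1, v2);
            System.out.println(java.util.Arrays.toString(r.toArray()));
        }
    }

The index split between the two source vectors is what maps naturally onto the two-table tbl lookup described above.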
> > Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - > > > Benchmark (size) Mode Cnt Gain > SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 > SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 > SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 > SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 > SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 > SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 > SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 > SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 > SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 > SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 > > > Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: Change match rule names to lowercase ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23570/files - new: https://git.openjdk.org/jdk/pull/23570/files/8025db0c..34566e7d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=11-12 Stats: 14 lines in 2 files changed: 0 ins; 0 del; 14 mod Patch: https://git.openjdk.org/jdk/pull/23570.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23570/head:pull/23570 PR: https://git.openjdk.org/jdk/pull/23570 From thartmann at openjdk.org Wed Jul 9 12:21:41 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 9 Jul 2025 12:21:41 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 13:33:23 GMT, Aleksey Shipilev wrote: >> We routinely CTW 3rd party JARs to make sure our compilers work. By the nature of the JARs, they have dependencies on other JARs, and CTW runner frequently warns out with NCDFE. It does so very crudely, missing opportunities to compile the methods that _do not_ trigger NCDFEs. CTW should be made more tolerant to this. I think the normal "modules" CTW runs into the similar problem, but on a lesser scale, as we do not have a very hairy dependency graph within JDK. >> >> The CTW logs are also fairly noisy with full exception traces when NCDFE is semi-expected. This PR does _not_ print exception stack traces in these cases, only "NOTE"-s about it. This makes the log fairly clean and more understandable. >> >> Motivational scope improvement compiling a sample 3rd party JAR (cassandra-2.1.4.0.jar): >> >> >> Before: Done (2487 classes, 9866 methods, 24584 ms) >> After: Done (2487 classes, 10074 methods, 24150 ms) ; +2% more methods >> >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Just use printf directly Marked as reviewed by thartmann (Reviewer). All tests passed. 
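The per-method tolerance described in this change can be pictured with a short Java sketch. The names here are hypothetical and this is not the actual CTW runner code; it only shows the idea of catching NoClassDefFoundError around a single method instead of abandoning the whole class.

    import java.lang.reflect.Method;

    public class TolerantCompileSketch {
        // Stand-in for whatever actually triggers compilation (e.g. a WhiteBox hook).
        interface MethodCompiler {
            void compile(Method m);
        }

        static int compileAllMethods(Class<?> klass, MethodCompiler compiler) {
            Method[] methods;
            try {
                methods = klass.getDeclaredMethods();       // may itself throw NCDFE
            } catch (NoClassDefFoundError e) {
                System.out.printf("NOTE: skipping all of %s: %s%n", klass.getName(), e);
                return 0;
            }
            int compiled = 0;
            for (Method m : methods) {
                try {
                    compiler.compile(m);
                    compiled++;
                } catch (NoClassDefFoundError e) {
                    // Only this method depends on the missing class: note it and keep
                    // going, so the remaining methods still get compiled.
                    System.out.printf("NOTE: skipping %s::%s: %s%n",
                                      klass.getName(), m.getName(), e);
                }
            }
            return compiled;
        }
    }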
------------- PR Review: https://git.openjdk.org/jdk/pull/26090#pullrequestreview-3001324494 PR Comment: https://git.openjdk.org/jdk/pull/26090#issuecomment-3052446627 From mchevalier at openjdk.org Wed Jul 9 12:36:31 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Wed, 9 Jul 2025 12:36:31 GMT Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v5] In-Reply-To: References: Message-ID: > A first part toward a better support of pure functions, but this time, with guidance from @iwanowww. > > ## Pure Functions > > Pure functions (considered here) are functions that have no side effects, no effect on the control flow (no exception or such), cannot deopt etc.. It's really a function that you can execute anywhere, with whichever arguments without effect other than wasting time. Integer division is not pure as dividing by zero is throwing. But many floating point functions will just return `NaN` or `+/-infinity` in problematic cases. > > ## Scope > > We are not going all powerful for now! It's mostly about identifying some pure functions and being able to remove them if the result is unused. Some other things are not part of this PR, on purpose. Especially, this PR doesn't propose a way to move pure calls around. The reason is that pure calls are later expanded into regular calls, which require a control input. To be able to do the expansion, we just keep the control in the pure call as well. > > ## Implementation Overview > > We created here some new node kind for pure calls, inheriting leaf calls, that are expanded into regular leaf calls during final graph reshaping. The possibility to support pure call directly in AD file is left open. > > This PR also introduces `TupleNode` (largely based on an original idea/implem of @iwanowww), that just tie multiple input together and play well with `ProjNode`: the n-th projection of a `TupleNode` is the n-th input of the tuple. This is a convenient way to skip and remove nodes from the graph while delegating the difficulty of the surgery to the trusted IGVN's implementation. > > Thanks, > Marc Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: Tentative to address Tobias' comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25760/files - new: https://git.openjdk.org/jdk/pull/25760/files/7f18c9f6..7553e307 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25760&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25760&range=03-04 Stats: 118 lines in 2 files changed: 29 ins; 38 del; 51 mod Patch: https://git.openjdk.org/jdk/pull/25760.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25760/head:pull/25760 PR: https://git.openjdk.org/jdk/pull/25760 From mchevalier at openjdk.org Wed Jul 9 12:36:32 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Wed, 9 Jul 2025 12:36:32 GMT Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v3] In-Reply-To: References: Message-ID: <785hwTKUD5k99LD_yX2GOLlWWk0-6fsQ3piigA_FIhs=.ed24d7c7-9ea6-4815-a738-1f61bade24e4@github.com> On Tue, 8 Jul 2025 15:16:16 GMT, Tobias Hartmann wrote: >> We could but it's not that direct. `ModFNode::Ideal` has 6 `return`s (without mine): >> - 3 are `return replace_with_con(...);` which in their turn return `nullptr` but after making changes in the graph. >> - 2 are `return nullptr;` >> - 1 is actually returning a node. 
>> And especially the final one is >> >> return replace_with_con(igvn, TypeF::make(jfloat_cast(xr))); >> >> If we change `replace_with_con` to actually return a `TupleNode` to do the job, we still have 2 places where to call the base class' `Ideal`. So I'm not sure how much better it would be to duplicate the call. It also adds a maintenance burden: if one adds another case where we don't want to make changes, one needs to add another call to `CallLeafPureNode::Ideal`. I think it's because of the structure of this function: rather than selecting cases where we want to do something and reaching the end with only the leftover cases, we select the cases we don't want to continue with, and we return early, making more cases where we should call the super method. >> >> I'll try something. > > Ah, makes sense. Feel free to leave as-is then. There, you can see the thing I've tried. It changes a bit more code, but overall, I think it makes it clearer and address your comment. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25760#discussion_r2194912203 From mhaessig at openjdk.org Wed Jul 9 12:36:46 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Wed, 9 Jul 2025 12:36:46 GMT Subject: RFR: 8360175: C2 crash: assert(edge_from_to(prior_use, n)) failed: before block local scheduling In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 00:40:39 GMT, Vladimir Kozlov wrote: >> The triggered assert is part of the schedule verification code that runs just before machine code is emitted. The debug output showed that a `leaPCompressedOopOffset` node was causing the assert, which suggested the peephole optimization introduced in #25471 as the cause. The failure proved quite difficult to reproduce. It failed more often on Windows and required `-XX:+UseKNLSetting` (forces code generation for Intel's Knights Landing platform), which forces `-XX:+OptoScheduling`. >> >> The root-cause is a subtle bug in the rewiring of the base edge of `leaP*` nodes in the `remove_redundant_lea` peephole. When the peephole removed a `decodeHeapOop_not_null` including a spill, it did not set the base edge of the `leaP*` node to the same node as the address edge, which is the intent of the peephole, but to the parent node of the spill. That is not catastrophic in most cases, but might reference another register slot, which causes this assert. Concretely, we see the following graph >> >> MemToRegSpillCopy >> | | >> | MemToRegSpillCopy >> | | >> DefiniinoSpillCopy | >> | | >> | decodeHeapOop_not_null >> | | >> leaPCompressedHeapOop >> >> gets rewired to >> >> MemToRegSpillCopy >> | | >> DefinitionSpillCopy | >> | | >> leaPCompressedHeapOop >> >> instead of >> >> MemToRegSpillCopy >> | >> DefinitionSpillCopy >> / \ >> leaPCompressedHeapOop >> >> >> This PR fixes this by always setting the base edge of the `leaP*` node to the same node as the address edge. Unfortunately, I was not able to construct a regression test because of the difficulty of reproducing the bug. >> >> # Testing >> >> - [x] Github Actions >> - [x] tier1,tier2 plus internal testing on all Oracle supported platforms >> - [x] tier3,tier4,tier5 plus internal testing on Linux and Windows x64 >> - [x] Runthese8H on `windows-x64-debug` (test that reliably produced the failure addressed in this PR) > > Seems fine. Thank you for your reviews, @vnkozlov and @chhagedorn! 
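Returning to the pure-call discussion above, the case being optimized is easy to picture in Java. This is illustrative only: a floating-point remainder whose result is never used, the kind of side-effect-free computation whose backing runtime call the change allows C2 to drop.

    public class UnusedPureCall {
        static float f(float x, float y) {
            float unused = x % y;   // pure computation: no side effects, cannot trap
            return x + y;           // 'unused' is dead, so the call behind it can be removed
        }

        public static void main(String[] args) {
            System.out.println(f(10.5f, 3.0f));
        }
    }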
------------- PR Comment: https://git.openjdk.org/jdk/pull/26157#issuecomment-3052498669 From mhaessig at openjdk.org Wed Jul 9 12:36:47 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Wed, 9 Jul 2025 12:36:47 GMT Subject: Integrated: 8360175: C2 crash: assert(edge_from_to(prior_use,n)) failed: before block local scheduling In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 09:47:09 GMT, Manuel H?ssig wrote: > The triggered assert is part of the schedule verification code that runs just before machine code is emitted. The debug output showed that a `leaPCompressedOopOffset` node was causing the assert, which suggested the peephole optimization introduced in #25471 as the cause. The failure proved quite difficult to reproduce. It failed more often on Windows and required `-XX:+UseKNLSetting` (forces code generation for Intel's Knights Landing platform), which forces `-XX:+OptoScheduling`. > > The root-cause is a subtle bug in the rewiring of the base edge of `leaP*` nodes in the `remove_redundant_lea` peephole. When the peephole removed a `decodeHeapOop_not_null` including a spill, it did not set the base edge of the `leaP*` node to the same node as the address edge, which is the intent of the peephole, but to the parent node of the spill. That is not catastrophic in most cases, but might reference another register slot, which causes this assert. Concretely, we see the following graph > > MemToRegSpillCopy > | | > | MemToRegSpillCopy > | | > DefiniinoSpillCopy | > | | > | decodeHeapOop_not_null > | | > leaPCompressedHeapOop > > gets rewired to > > MemToRegSpillCopy > | | > DefinitionSpillCopy | > | | > leaPCompressedHeapOop > > instead of > > MemToRegSpillCopy > | > DefinitionSpillCopy > / \ > leaPCompressedHeapOop > > > This PR fixes this by always setting the base edge of the `leaP*` node to the same node as the address edge. Unfortunately, I was not able to construct a regression test because of the difficulty of reproducing the bug. > > # Testing > > - [x] Github Actions > - [x] tier1,tier2 plus internal testing on all Oracle supported platforms > - [x] tier3,tier4,tier5 plus internal testing on Linux and Windows x64 > - [x] Runthese8H on `windows-x64-debug` (test that reliably produced the failure addressed in this PR) This pull request has now been integrated. Changeset: db4b4a5b Author: Manuel H?ssig URL: https://git.openjdk.org/jdk/commit/db4b4a5b35a7664ddafed2817703ffd36a921fee Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8360175: C2 crash: assert(edge_from_to(prior_use,n)) failed: before block local scheduling Reviewed-by: kvn, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/26157 From mchevalier at openjdk.org Wed Jul 9 12:38:26 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Wed, 9 Jul 2025 12:38:26 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v5] In-Reply-To: References: Message-ID: > When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 > > This is enforced by restoring the old state, like in > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 > > That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. 
For instance, we can obtain something such as: > > ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) > > > Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. > > Another situation is somewhat worse, when happening during parsing. It can lead to such cases: > > ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) > > The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? > > This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 > > And here there is the challenge: > - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) > - we can't really change the pointer, just the content > -... Marc Chevalier has updated the pull request incrementally with two additional commits since the last revision: - Tentative addressing Vladimir's comments - Re-insert ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25936/files - new: https://git.openjdk.org/jdk/pull/25936/files/09b24ec4..59133778 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25936&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25936&range=03-04 Stats: 83 lines in 3 files changed: 12 ins; 29 del; 42 mod Patch: https://git.openjdk.org/jdk/pull/25936.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25936/head:pull/25936 PR: https://git.openjdk.org/jdk/pull/25936 From mchevalier at openjdk.org Wed Jul 9 12:40:42 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Wed, 9 Jul 2025 12:40:42 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v4] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 20:23:08 GMT, Vladimir Kozlov wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> Somehow intellij doesn't remove empty indented line > > src/hotspot/share/opto/library_call.hpp line 150: > >> 148: void restore_state(const SavedState&); >> 149: void destruct_map_clone(const SavedState& sfp); >> 150: > > Can this be a class instead of struct? These methods could be members. Initialization can be done through constructor. The destructor can do restoration by default unless `destruct_map_clone()` was called before. 
> I don't like name `destruct_map_clone()` for this. How about `SavedState::remove()` or something. I like it. Since intrinsic implementations have mostly bailing out returns, and few success paths, it's nice to say when we are good, rather than every path that ends with bailing out. I've called the member function `discard`. It gives as `old_state.discard()`, which reads well, I think. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2194921677 From shade at openjdk.org Wed Jul 9 12:45:47 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 9 Jul 2025 12:45:47 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 13:33:23 GMT, Aleksey Shipilev wrote: >> We routinely CTW 3rd party JARs to make sure our compilers work. By the nature of the JARs, they have dependencies on other JARs, and CTW runner frequently warns out with NCDFE. It does so very crudely, missing opportunities to compile the methods that _do not_ trigger NCDFEs. CTW should be made more tolerant to this. I think the normal "modules" CTW runs into the similar problem, but on a lesser scale, as we do not have a very hairy dependency graph within JDK. >> >> The CTW logs are also fairly noisy with full exception traces when NCDFE is semi-expected. This PR does _not_ print exception stack traces in these cases, only "NOTE"-s about it. This makes the log fairly clean and more understandable. >> >> Motivational scope improvement compiling a sample 3rd party JAR (cassandra-2.1.4.0.jar): >> >> >> Before: Done (2487 classes, 9866 methods, 24584 ms) >> After: Done (2487 classes, 10074 methods, 24150 ms) ; +2% more methods >> >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Just use printf directly Thank you! Here goes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26090#issuecomment-3052522900 From shade at openjdk.org Wed Jul 9 12:45:48 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 9 Jul 2025 12:45:48 GMT Subject: Integrated: 8361255: CTW: Tolerate more NCDFE problems In-Reply-To: References: Message-ID: <-DDewlGSacq_GgjqWaecDzvaMrI97_wOvRZXdlYAcTI=.2653dc31-b006-459e-a956-040517e1e040@github.com> On Wed, 2 Jul 2025 10:14:36 GMT, Aleksey Shipilev wrote: > We routinely CTW 3rd party JARs to make sure our compilers work. By the nature of the JARs, they have dependencies on other JARs, and CTW runner frequently warns out with NCDFE. It does so very crudely, missing opportunities to compile the methods that _do not_ trigger NCDFEs. CTW should be made more tolerant to this. I think the normal "modules" CTW runs into the similar problem, but on a lesser scale, as we do not have a very hairy dependency graph within JDK. > > The CTW logs are also fairly noisy with full exception traces when NCDFE is semi-expected. This PR does _not_ print exception stack traces in these cases, only "NOTE"-s about it. This makes the log fairly clean and more understandable. > > Motivational scope improvement compiling a sample 3rd party JAR (cassandra-2.1.4.0.jar): > > > Before: Done (2487 classes, 9866 methods, 24584 ms) > After: Done (2487 classes, 10074 methods, 24150 ms) ; +2% more methods > > > Additional testing: > - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` This pull request has now been integrated. 
Changeset: a201be85 Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/a201be8555c57f07b86f470df4699e1b9dd6bd3c Stats: 45 lines in 2 files changed: 35 ins; 0 del; 10 mod 8361255: CTW: Tolerate more NCDFE problems Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/26090 From thartmann at openjdk.org Wed Jul 9 12:47:41 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 9 Jul 2025 12:47:41 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v5] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 12:38:26 GMT, Marc Chevalier wrote: >> When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 >> >> This is enforced by restoring the old state, like in >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 >> >> That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: >> >> ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) >> >> >> Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. >> >> Another situation is somewhat worse, when happening during parsing. It can lead to such cases: >> >> ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) >> >> The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? >> >> This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 >> >> And here there is the challenge: >> - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) >> ... > > Marc Chevalier has updated the pull request incrementally with two additional commits since the last revision: > > - Tentative addressing Vladimir's comments > - Re-insert Nice! Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/25936#pullrequestreview-3001409938 From thartmann at openjdk.org Wed Jul 9 13:11:44 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 9 Jul 2025 13:11:44 GMT Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v5] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 12:36:31 GMT, Marc Chevalier wrote: >> A first part toward a better support of pure functions, but this time, with guidance from @iwanowww. >> >> ## Pure Functions >> >> Pure functions (considered here) are functions that have no side effects, no effect on the control flow (no exception or such), cannot deopt etc.. It's really a function that you can execute anywhere, with whichever arguments without effect other than wasting time. Integer division is not pure as dividing by zero is throwing. But many floating point functions will just return `NaN` or `+/-infinity` in problematic cases. >> >> ## Scope >> >> We are not going all powerful for now! It's mostly about identifying some pure functions and being able to remove them if the result is unused. Some other things are not part of this PR, on purpose. Especially, this PR doesn't propose a way to move pure calls around. The reason is that pure calls are later expanded into regular calls, which require a control input. To be able to do the expansion, we just keep the control in the pure call as well. >> >> ## Implementation Overview >> >> We created here some new node kind for pure calls, inheriting leaf calls, that are expanded into regular leaf calls during final graph reshaping. The possibility to support pure call directly in AD file is left open. >> >> This PR also introduces `TupleNode` (largely based on an original idea/implem of @iwanowww), that just tie multiple input together and play well with `ProjNode`: the n-th projection of a `TupleNode` is the n-th input of the tuple. This is a convenient way to skip and remove nodes from the graph while delegating the difficulty of the surgery to the trusted IGVN's implementation. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Tentative to address Tobias' comments Thanks for making these changes, I like that version more. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25760#pullrequestreview-3001503959 From eastigeevich at openjdk.org Wed Jul 9 13:24:59 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Wed, 9 Jul 2025 13:24:59 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v6] In-Reply-To: References: Message-ID: > Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. > > This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. > > Tested on Linux and MacOS with and without hsdis: > - Fastdebug: test passed > - Slowdebug: test passed. > - Release: test passed. 
Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: Update spin_wait block comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26072/files - new: https://git.openjdk.org/jdk/pull/26072/files/e3163c9f..6d60fb42 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=04-05 Stats: 11 lines in 2 files changed: 5 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/26072.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26072/head:pull/26072 PR: https://git.openjdk.org/jdk/pull/26072 From eastigeevich at openjdk.org Wed Jul 9 13:24:59 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Wed, 9 Jul 2025 13:24:59 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v4] In-Reply-To: References: <3-UMhQGbb3psk7_pn0BC1SJNrKtOfZWphO_V9d9Bqz8=.09589b1a-fb0d-4150-95f4-7565d32ed7b1@github.com> Message-ID: <_Skd1OxarrQAHIh5P0SK7IYYVAG5nRvbAxRSX8Uqldw=.7eee567a-f242-4987-b50b-f305ad001b32@github.com> On Wed, 9 Jul 2025 08:27:46 GMT, Andrew Haley wrote: >> @theRealAph, are you okay with the latest variant Aleksey proposes? > >> But this is a generic macroAssembler block comment. It does not make sense to me to have a block comment that looks like a memory corrupted string, just to satisfy a single test. There should be a middle-ground here, e.g. `spin_wait {`, which looks reasonably enough as the block comment, and not anything else. Would also match nicely when we emit the closing `}` at the end of the instruction block. > > OK I update the PR to use `spin_wait { ... }` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2195023640 From fyang at openjdk.org Wed Jul 9 13:51:41 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 9 Jul 2025 13:51:41 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v4] In-Reply-To: <30vvzTU6W2p0YpB8Z9bSfO9ajO_fHh79q9cX1G3gz3k=.521b26d7-b606-4fdc-bdcf-41fd6c4891cc@github.com> References: <30vvzTU6W2p0YpB8Z9bSfO9ajO_fHh79q9cX1G3gz3k=.521b26d7-b606-4fdc-bdcf-41fd6c4891cc@github.com> Message-ID: On Wed, 9 Jul 2025 10:07:31 GMT, Feilong Jiang wrote: >> Hi, please consider. >> [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. >> The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. >> If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. >> This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. >> We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. >> >> This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. >> The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. >> >> Test on linux-riscv64: >> - [x] Tier1-3 >> >> JMH data on P550 SBC for reference (w/o and w/ the patch): >> >> Before: >> >> Without COH: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 
0.379 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op >> ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op >> ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op >> ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op >> ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op >> ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op >> ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op >> ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op >> ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op >> ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op >> ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op >> ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op >> >> ------------------------------------------------------------------------- >> With COH: >> >> Benchmark (size) Mode Cnt Score Error Un... > > Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - Revert RISCV Macro modification > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses > - riscv: fix c1 primitive array clone intrinsic regression Thanks for the update. Still looks reasonable to me. ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25976#pullrequestreview-3001655252 From shade at openjdk.org Wed Jul 9 14:51:44 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 9 Jul 2025 14:51:44 GMT Subject: RFR: 8357473: Compilation spike leaves many CompileTasks in free list [v6] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 08:59:53 GMT, Aleksey Shipilev wrote: >> See bug for more discussion. >> >> This PR implements the "all the way" solution by removing the free list completely. It complements https://github.com/openjdk/jdk/pull/25364, and can go either first, or second. We will remerge the other one once either integrates. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `compiler` >> - [x] Linux AArch64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 11 commits: > > - Task count atomic can be relaxed > - Minor touchup in ~CompileTask > - Purge CompileTaskAlloc_lock completely > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Also free the lock! > - Comments and indenting > - ... and 1 more: https://git.openjdk.org/jdk/compare/7b255b8a...684f83b7 Thank you! Here we go. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/25409#issuecomment-3052951178 From shade at openjdk.org Wed Jul 9 14:51:45 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 9 Jul 2025 14:51:45 GMT Subject: Integrated: 8357473: Compilation spike leaves many CompileTasks in free list In-Reply-To: References: Message-ID: On Fri, 23 May 2025 09:42:17 GMT, Aleksey Shipilev wrote: > See bug for more discussion. > > This PR implements the "all the way" solution by removing the free list completely. It complements https://github.com/openjdk/jdk/pull/25364, and can go either first, or second. We will remerge the other one once either integrates. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `compiler` > - [x] Linux AArch64 server fastdebug, `all` This pull request has now been integrated. Changeset: a41d3507 Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/a41d35073ee6da0dde4dd731c1ab4c25245d075a Stats: 134 lines in 6 files changed: 27 ins; 71 del; 36 mod 8357473: Compilation spike leaves many CompileTasks in free list Reviewed-by: kvn, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/25409 From eastigeevich at openjdk.org Wed Jul 9 15:54:41 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Wed, 9 Jul 2025 15:54:41 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v6] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 13:24:59 GMT, Evgeny Astigeevich wrote: >> Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. >> >> This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. >> >> Tested on Linux and MacOS with and without hsdis: >> - Fastdebug: test passed >> - Slowdebug: test passed. >> - Release: test passed. > > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Update spin_wait block comment Two MacOS tests failed due to time out. This is not related to my change. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3053162368 From eastigeevich at openjdk.org Wed Jul 9 15:58:02 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Wed, 9 Jul 2025 15:58:02 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v7] In-Reply-To: References: Message-ID: <7dMf98HrEGETw9M_cdzOF8Mmc3hVC7kdm-oiHzXImok=.8b7aedf3-b581-4f0e-b636-b3409f3f5d33@github.com> > Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. > > This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. > > Tested on Linux and MacOS with and without hsdis: > - Fastdebug: test passed > - Slowdebug: test passed. > - Release: test passed. 
Evgeny Astigeevich has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision: - Merge branch 'master' into JDK-8360936 - Update spin_wait block comment - Fix whitespace error - Implement using block_comment - Reimplement checking algo without using debug info - Simplify requirement for debug build - 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26072/files - new: https://git.openjdk.org/jdk/pull/26072/files/6d60fb42..8554242b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=05-06 Stats: 14617 lines in 539 files changed: 8622 ins; 2314 del; 3681 mod Patch: https://git.openjdk.org/jdk/pull/26072.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26072/head:pull/26072 PR: https://git.openjdk.org/jdk/pull/26072 From shade at openjdk.org Wed Jul 9 15:59:03 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 9 Jul 2025 15:59:03 GMT Subject: RFR: 8231269: CompileTask::is_unloaded is slow due to JNIHandles type checks [v22] In-Reply-To: References: Message-ID: > [JDK-8163511](https://bugs.openjdk.org/browse/JDK-8163511) made the `CompileTask` improvement to avoid blocking class unloading if a relevant compile task is in queue. Current code does a sleight-of-hand to make sure the the `method*` in `CompileTask` are still valid before using them. Still a noble goal, so we keep trying to do this. > > The code tries to switch weak JNI handle with a strong one when it wants to capture the holder to block unloading. Since we are reusing the same field, we have to do type checks like `JNIHandles::is_weak_global_handle(_method_holder)`. Unfortunately, that type-check goes all the way to `OopStorage` allocation code to verify the handle is really allocated in the relevant `OopStorage`. This takes internal `OopStorage` locks, and thus is slow. > > This issue is clearly visible in Leyden, when there are lots of `CompileTask`-s in the queue, dumped by AOT code loader. It also does not help that `CompileTask::select_task` is effectively quadratic in number of methods in queue, so we end up calling `CompileTask::is_unloaded` very often. > > It is possible to mitigate this issue by splitting the related fields into weak and strong ones. But as Kim mentions in the bug, we should not be using JNI handles here at all, and instead go directly for relevant `OopStorage`-s. This is what this PR does, among other things that should hopefully make the whole mechanics clearer. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `compiler/classUnloading`, 100x still passes; these tests are sensitive to bugs in this code > - [x] Linux x86_64 server fastdebug, `all` > - [x] Linux AArch64 server fastdebug, `all` Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 37 commits: - Merge branch 'master' into JDK-8231269-compile-task-weaks - Merge branch 'master' into JDK-8231269-compile-task-weaks - Switch to mutable - Merge branch 'master' into JDK-8231269-compile-task-weaks - More touchups - Spin lock induces false sharing - Merge branch 'master' into JDK-8231269-compile-task-weaks - Merge branch 'master' into JDK-8231269-compile-task-weaks - Rename CompilerTask::is_unloaded back to avoid losing comment context - Simplify select_for_compilation - ... and 27 more: https://git.openjdk.org/jdk/compare/a41d3507...d5a8a27d ------------- Changes: https://git.openjdk.org/jdk/pull/24018/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24018&range=21 Stats: 430 lines in 13 files changed: 389 ins; 21 del; 20 mod Patch: https://git.openjdk.org/jdk/pull/24018.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24018/head:pull/24018 PR: https://git.openjdk.org/jdk/pull/24018 From kvn at openjdk.org Wed Jul 9 16:36:39 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 9 Jul 2025 16:36:39 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v5] In-Reply-To: References: Message-ID: <06SLl1SEu1oWp3YWh9xR3LbQR5rDGs6thA5UAS_kMtk=.5fb6e090-7ac6-4e99-9700-b767f2a08348@github.com> On Wed, 9 Jul 2025 12:38:26 GMT, Marc Chevalier wrote: >> When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 >> >> This is enforced by restoring the old state, like in >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 >> >> That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: >> >> ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) >> >> >> Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. >> >> Another situation is somewhat worse, when happening during parsing. It can lead to such cases: >> >> ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) >> >> The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? >> >> This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. 
But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 >> >> And here there is the challenge: >> - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) >> ... > > Marc Chevalier has updated the pull request incrementally with two additional commits since the last revision: > > - Tentative addressing Vladimir's comments > - Re-insert src/hotspot/share/opto/library_call.hpp line 147: > 145: SafePointNode* _map; > 146: Unique_Node_List _ctrl_succ; > 147: bool discarded = false; `discarded` is not static field. I suggest to initialize it in constructor. And use `_` prefix. Otherwise changes are good. Thank you for taking my suggestion. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2195470155 From shade at openjdk.org Wed Jul 9 17:07:44 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 9 Jul 2025 17:07:44 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v7] In-Reply-To: <7dMf98HrEGETw9M_cdzOF8Mmc3hVC7kdm-oiHzXImok=.8b7aedf3-b581-4f0e-b636-b3409f3f5d33@github.com> References: <7dMf98HrEGETw9M_cdzOF8Mmc3hVC7kdm-oiHzXImok=.8b7aedf3-b581-4f0e-b636-b3409f3f5d33@github.com> Message-ID: On Wed, 9 Jul 2025 15:58:02 GMT, Evgeny Astigeevich wrote: >> Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. >> >> This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. >> >> Tested on Linux and MacOS with and without hsdis: >> - Fastdebug: test passed >> - Slowdebug: test passed. >> - Release: test passed. > > Evgeny Astigeevich has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision: > > - Merge branch 'master' into JDK-8360936 > - Update spin_wait block comment > - Fix whitespace error > - Implement using block_comment > - Reimplement checking algo without using debug info > - Simplify requirement for debug build > - 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 Good patch. I propose a few cosmetics: [8360936-cosmetics-1.patch.txt](https://github.com/user-attachments/files/21147228/8360936-cosmetics-1.patch.txt) -- easier to express them as patch. Untested, see if it makes sense? 
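As an aside on the `SavedState` fields quoted above from library_call.hpp (lines 145-147): below is a minimal, self-contained sketch of the style being suggested -- an underscore-prefixed `_discarded` member initialized in the constructor instead of with an in-class default. The accessors and the overall class shape here are assumptions for illustration only; this is not the actual HotSpot code.

    class SafePointNode;            // stand-in forward declaration, for illustration only

    class SavedState {
     private:
      SafePointNode* _map;          // as quoted from library_call.hpp above
      bool           _discarded;    // was: "bool discarded = false;"
     public:
      explicit SavedState(SafePointNode* map)
        : _map(map), _discarded(false) {}   // initialized in the constructor, as suggested
      void discard()             { _discarded = true; }
      bool is_discarded() const  { return _discarded; }
      SafePointNode* map() const { return _map; }
    };

Initializing the flag in the constructor keeps all members in one initializer list, and the underscore prefix matches the neighbouring `_map` and `_ctrl_succ` fields.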
------------- PR Review: https://git.openjdk.org/jdk/pull/26072#pullrequestreview-3002334781 From kvn at openjdk.org Wed Jul 9 17:59:51 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 9 Jul 2025 17:59:51 GMT Subject: [jdk25] RFR: 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() Message-ID: Hi all, This pull request contains a backport of commit [dedcce04](https://github.com/openjdk/jdk/commit/dedcce045013b3ff84f5ef8857e1a83f0c09f9ad) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. The commit being backported was authored by Vladimir Kozlov on 8 Jul 2025 and was reviewed by Andrew Dinn and Matthias Baesken. Thanks! ------------- Commit messages: - Backport dedcce045013b3ff84f5ef8857e1a83f0c09f9ad Changes: https://git.openjdk.org/jdk/pull/26223/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26223&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8360942 Stats: 7 lines in 2 files changed: 4 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/26223.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26223/head:pull/26223 PR: https://git.openjdk.org/jdk/pull/26223 From shade at openjdk.org Wed Jul 9 18:03:39 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 9 Jul 2025 18:03:39 GMT Subject: [jdk25] RFR: 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() In-Reply-To: References: Message-ID: <_O_7iijXBZbsQIvNrGqgO5usw1YHeqaG4WfzIKXR5_c=.2f13a8ad-354c-4eb3-88e3-db6e1d660391@github.com> On Wed, 9 Jul 2025 17:55:31 GMT, Vladimir Kozlov wrote: > Hi all, > > This pull request contains a backport of commit [dedcce04](https://github.com/openjdk/jdk/commit/dedcce045013b3ff84f5ef8857e1a83f0c09f9ad) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Vladimir Kozlov on 8 Jul 2025 and was reviewed by Andrew Dinn and Matthias Baesken. > > Thanks! Marked as reviewed by shade (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26223#pullrequestreview-3002595613 From kbarrett at openjdk.org Wed Jul 9 19:15:44 2025 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 9 Jul 2025 19:15:44 GMT Subject: RFR: 8231269: CompileTask::is_unloaded is slow due to JNIHandles type checks [v19] In-Reply-To: References: Message-ID: On Mon, 26 May 2025 18:58:42 GMT, Aleksey Shipilev wrote: > > Not sure what our opinion is w.r.t. `mutable`, but how do we feel about typing the spin lock as `mutable` and keep `is_safe()` and `method*()` const. > > I like this a lot! Dropping `const` just to satisfy spin lock (an implementation detail) felt really awkward. New version uses `mutable`. Just a drive-by reply. `mutable` is a C++98 (and before, I think) feature, with many uses in HotSpot. Using it here seems fine to me. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24018#issuecomment-3053728810 From kbarrett at openjdk.org Wed Jul 9 19:29:45 2025 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 9 Jul 2025 19:29:45 GMT Subject: RFR: 8231269: CompileTask::is_unloaded is slow due to JNIHandles type checks [v22] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 15:59:03 GMT, Aleksey Shipilev wrote: >> [JDK-8163511](https://bugs.openjdk.org/browse/JDK-8163511) made the `CompileTask` improvement to avoid blocking class unloading if a relevant compile task is in queue. 
Current code does a sleight-of-hand to make sure the the `method*` in `CompileTask` are still valid before using them. Still a noble goal, so we keep trying to do this. >> >> The code tries to switch weak JNI handle with a strong one when it wants to capture the holder to block unloading. Since we are reusing the same field, we have to do type checks like `JNIHandles::is_weak_global_handle(_method_holder)`. Unfortunately, that type-check goes all the way to `OopStorage` allocation code to verify the handle is really allocated in the relevant `OopStorage`. This takes internal `OopStorage` locks, and thus is slow. >> >> This issue is clearly visible in Leyden, when there are lots of `CompileTask`-s in the queue, dumped by AOT code loader. It also does not help that `CompileTask::select_task` is effectively quadratic in number of methods in queue, so we end up calling `CompileTask::is_unloaded` very often. >> >> It is possible to mitigate this issue by splitting the related fields into weak and strong ones. But as Kim mentions in the bug, we should not be using JNI handles here at all, and instead go directly for relevant `OopStorage`-s. This is what this PR does, among other things that should hopefully make the whole mechanics clearer. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `compiler/classUnloading`, 100x still passes; these tests are sensitive to bugs in this code >> - [x] Linux x86_64 server fastdebug, `all` >> - [x] Linux AArch64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 37 commits: > > - Merge branch 'master' into JDK-8231269-compile-task-weaks > - Merge branch 'master' into JDK-8231269-compile-task-weaks > - Switch to mutable > - Merge branch 'master' into JDK-8231269-compile-task-weaks > - More touchups > - Spin lock induces false sharing > - Merge branch 'master' into JDK-8231269-compile-task-weaks > - Merge branch 'master' into JDK-8231269-compile-task-weaks > - Rename CompilerTask::is_unloaded back to avoid losing comment context > - Simplify select_for_compilation > - ... and 27 more: https://git.openjdk.org/jdk/compare/a41d3507...d5a8a27d src/hotspot/share/oops/unloadableMethodHandle.hpp line 81: > 79: friend class VMStructs; > 80: private: > 81: enum State { Not really a review, just a drive-by comment. I think the only argument against using an enum class here is the lack of C++20's "using enums" feature: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1099r5.html Personally I'd prefer to just make it an enum class and scope the references. YMMV. Also, someday we should try to come to some consensus about the naming of constants. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24018#discussion_r2195817179 From kbarrett at openjdk.org Wed Jul 9 19:32:45 2025 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 9 Jul 2025 19:32:45 GMT Subject: RFR: 8231269: CompileTask::is_unloaded is slow due to JNIHandles type checks [v11] In-Reply-To: References: Message-ID: On Wed, 7 May 2025 20:31:21 GMT, Coleen Phillimore wrote: > This is a cleaner way to do this. I believe it's what we discussed with Kim. He can confirm. Yes, I think this looks like the sort of thing I had in mind when we were discussing it back whenever that was. 
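To make two style points from this thread concrete -- the scoped-enum suggestion for `UnloadableMethodHandle::State` quoted above, and the earlier question of typing the spin lock as `mutable` so that `is_safe()` can stay const -- here is a rough, self-contained sketch. Apart from the class name, `State` and `is_safe()`, every name and member below (the enumerators, the lock type, `switch_to_strong()`) is invented for illustration and does not reflect the real class.

    #include <atomic>

    class UnloadableMethodHandle {
     private:
      // With a scoped enum, every reference must be qualified (State::kWeak),
      // which is the qualification that C++20's "using enum" would shorten.
      enum class State { kWeak, kStrong };   // enumerator names are assumptions

      State _state;
      // 'mutable' lets a const query take the spin lock without giving up
      // const on the accessor itself.
      mutable std::atomic<bool> _lock{false};

     public:
      UnloadableMethodHandle() : _state(State::kWeak) {}

      bool is_safe() const {
        while (_lock.exchange(true, std::memory_order_acquire)) { /* spin */ }
        bool safe = (_state == State::kStrong);
        _lock.store(false, std::memory_order_release);
        return safe;
      }

      void switch_to_strong() { _state = State::kStrong; }
    };

With the scoped enum, call sites have to spell out `State::kStrong` rather than a bare constant, and the `mutable` lock keeps the read-side API const without any casting.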
------------- PR Comment: https://git.openjdk.org/jdk/pull/24018#issuecomment-3053768403 From eastigeevich at openjdk.org Wed Jul 9 19:54:40 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Wed, 9 Jul 2025 19:54:40 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v7] In-Reply-To: References: <7dMf98HrEGETw9M_cdzOF8Mmc3hVC7kdm-oiHzXImok=.8b7aedf3-b581-4f0e-b636-b3409f3f5d33@github.com> Message-ID: <4xSNmbHsRlDFZHA09hEeaMqLooxhpblZzRSf2F0RTF8=.f1c45af3-5527-445e-9b91-4180c0c08fad@github.com> On Wed, 9 Jul 2025 17:05:05 GMT, Aleksey Shipilev wrote: > Good patch. I propose a few cosmetics: [8360936-cosmetics-1.patch.txt](https://github.com/user-attachments/files/21147228/8360936-cosmetics-1.patch.txt) -- easier to express them as patch. Untested, see if it makes sense? Thank you! It make sense. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3053816693 From eastigeevich at openjdk.org Wed Jul 9 20:13:57 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Wed, 9 Jul 2025 20:13:57 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v8] In-Reply-To: References: Message-ID: > Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. > > This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. > > Tested on Linux and MacOS with and without hsdis: > - Fastdebug: test passed > - Slowdebug: test passed. > - Release: test passed. Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: Apply 8360936-cosmetics-1.patch.txt from PR ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26072/files - new: https://git.openjdk.org/jdk/pull/26072/files/8554242b..1b4d81be Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=06-07 Stats: 75 lines in 1 file changed: 25 ins; 38 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/26072.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26072/head:pull/26072 PR: https://git.openjdk.org/jdk/pull/26072 From eastigeevich at openjdk.org Wed Jul 9 20:13:57 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Wed, 9 Jul 2025 20:13:57 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v7] In-Reply-To: <4xSNmbHsRlDFZHA09hEeaMqLooxhpblZzRSf2F0RTF8=.f1c45af3-5527-445e-9b91-4180c0c08fad@github.com> References: <7dMf98HrEGETw9M_cdzOF8Mmc3hVC7kdm-oiHzXImok=.8b7aedf3-b581-4f0e-b636-b3409f3f5d33@github.com> <4xSNmbHsRlDFZHA09hEeaMqLooxhpblZzRSf2F0RTF8=.f1c45af3-5527-445e-9b91-4180c0c08fad@github.com> Message-ID: On Wed, 9 Jul 2025 19:52:11 GMT, Evgeny Astigeevich wrote: > > Good patch. I propose a few cosmetics: [8360936-cosmetics-1.patch.txt](https://github.com/user-attachments/files/21147228/8360936-cosmetics-1.patch.txt) -- easier to express them as patch. Untested, see if it makes sense? > > Thank you! It make sense. I tested it. It works as expected. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3053865689 From dlong at openjdk.org Thu Jul 10 00:09:50 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 10 Jul 2025 00:09:50 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 00:29:58 GMT, Chad Rakoczy wrote: >> The original motivation was to address far call sites. After relocation, some calls that previously didn't require a trampoline might now need one, hence the introduction of the `be_safe` parameter. However, upon further review, this change is unnecessary. The method `trampoline_stub_Relocation::fix_relocation_after_move` already updates the owner and contains the logic to determine whether a direct call can be performed. Therefore, we can skip invoking `CallRelocation::fix_relocation_after_move` for calls that use trampolines, as all required adjustments will be handled correctly by the trampoline relocations. ([Reference](https://github.com/chadrako/jdk/blob/0f4ff9646d1f7f43214c5ccd4bbe572fffd08d16/src/hotspot/share/code/nmethod.cpp#L1547-L1556)) > > @dean-long What are your thoughts on this solution? The logic looks fine, but I don't think it belongs in shared code. Why not have a new fix_relocation_after_xxx() that is platform-specific? For most platforms it can just delegate to fix_relocation_after_move(). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2196224362 From xgong at openjdk.org Thu Jul 10 01:42:40 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 10 Jul 2025 01:42:40 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 09:17:13 GMT, Xiaohong Gong wrote: >> Hi @eme64 , could you please help take a look at this patch especially the test part since most of the tests are SLP related? It will be helpful if you could also help trigger a testing for it. Thanks for your time! > >> @XiaohongGong I would love to review and test, but I'm about to go on vacation and will only be back in August. I've pinged some others internally, and hope someone will pick this up! > > Thanks a lot for the help! Sounds good to me and have a good holiday! > @XiaohongGong I'll run some tests and have a look at the changes as well (@eme64 asked me). I'll get back to you shortly! Thanks so much for your help! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3054907351 From xgong at openjdk.org Thu Jul 10 01:42:41 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 10 Jul 2025 01:42:41 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 10:43:07 GMT, Bhavana Kilambi wrote: > Thanks for making the changes. Looks good to me. Thanks a lot for your review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3054908101 From never at openjdk.org Thu Jul 10 01:43:58 2025 From: never at openjdk.org (Tom Rodriguez) Date: Thu, 10 Jul 2025 01:43:58 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v35] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 20:03:17 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). 
It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 95 commits: > > - Typo > - Merge branch 'master' into JDK-8316694-Final > - Update justification for skipping CallRelocation > - Enclose ImmutableDataReferencesCounterSize in parentheses > - Let trampolines fix their owners > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Update how call sites are fixed > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Fix pointer printing > - Use set_destination_mt_safe > - ... and 85 more: https://git.openjdk.org/jdk/compare/117f0b40...66d73c16 Marked as reviewed by never (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23573#pullrequestreview-3003593036 From xgong at openjdk.org Thu Jul 10 02:04:54 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 10 Jul 2025 02:04:54 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v13] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 11:40:29 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. 
>> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Change match rule names to lowercase src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2886: > 2884: if (bt == T_BYTE) { > 2885: if (isQ) { > 2886: assert(UseSVE <= 1, "sve must be <= 1"); This assertion is not necessary as there is the same assertion in above line-2866? src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2919: > 2917: ins(tmp, D, src2, 1, 0); > 2918: tbl(dst, size1, tmp, 1, dst); > 2919: } Is it better than we wrap this part as a help function, because the code is much the same with line2885-2898? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2196316568 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2196340944 From duke at openjdk.org Thu Jul 10 02:11:21 2025 From: duke at openjdk.org (Guanqiang Han) Date: Thu, 10 Jul 2025 02:11:21 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp [v2] In-Reply-To: References: Message-ID: > When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. > > This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this flag is disabled. Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - update modification and add regression test - Merge remote-tracking branch 'upstream/master' into 8361140 - 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. 
Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this support is disabled. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26125/files - new: https://git.openjdk.org/jdk/pull/26125/files/f118400d..2feca6a8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26125&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26125&range=00-01 Stats: 10836 lines in 397 files changed: 5987 ins; 1650 del; 3199 mod Patch: https://git.openjdk.org/jdk/pull/26125.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26125/head:pull/26125 PR: https://git.openjdk.org/jdk/pull/26125 From amitkumar at openjdk.org Thu Jul 10 03:17:18 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Thu, 10 Jul 2025 03:17:18 GMT Subject: RFR: 8361536: [s390x] Saving return_pc at wrong offset Message-ID: Fixes the bug where return pc was stored at a wrong offset, which causes issue with java abi. Issue appeared in #26004, see the comment: https://github.com/openjdk/jdk/pull/26004#issuecomment-3017928879. ------------- Commit messages: - Revert "save another 8 bytes" - save another 8 bytes - fix Changes: https://git.openjdk.org/jdk/pull/26209/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26209&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361536 Stats: 20 lines in 1 file changed: 2 ins; 0 del; 18 mod Patch: https://git.openjdk.org/jdk/pull/26209.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26209/head:pull/26209 PR: https://git.openjdk.org/jdk/pull/26209 From lucy at openjdk.org Thu Jul 10 03:17:18 2025 From: lucy at openjdk.org (Lutz Schmidt) Date: Thu, 10 Jul 2025 03:17:18 GMT Subject: RFR: 8361536: [s390x] Saving return_pc at wrong offset In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 05:24:38 GMT, Amit Kumar wrote: > Fixes the bug where return pc was stored at a wrong offset, which causes issue with java abi. > > Issue appeared in #26004, see the comment: https://github.com/openjdk/jdk/pull/26004#issuecomment-3017928879. LGTM. Thanks for fixing. ------------- Marked as reviewed by lucy (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26209#pullrequestreview-3000979208 From mdoerr at openjdk.org Thu Jul 10 03:17:18 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 10 Jul 2025 03:17:18 GMT Subject: RFR: 8361536: [s390x] Saving return_pc at wrong offset In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 05:24:38 GMT, Amit Kumar wrote: > Fixes the bug where return pc was stored at a wrong offset, which causes issue with java abi. > > Issue appeared in #26004, see the comment: https://github.com/openjdk/jdk/pull/26004#issuecomment-3017928879. We're not really saving space. We just use less of the caller allocated stack space which is still as large as before. But the change looks good and should make the stack walking code happy, because it can find the return_pc where it is expected, now. ------------- Marked as reviewed by mdoerr (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26209#pullrequestreview-3001829975 From amitkumar at openjdk.org Thu Jul 10 03:17:18 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Thu, 10 Jul 2025 03:17:18 GMT Subject: RFR: 8361536: [s390x] Saving return_pc at wrong offset In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 05:24:38 GMT, Amit Kumar wrote: > Fixes the bug where return pc was stored at a wrong offset, which causes issue with java abi. 
> 
> Issue appeared in #26004, see the comment: https://github.com/openjdk/jdk/pull/26004#issuecomment-3017928879.

Fast debug build was fine, but release build crashed with this error:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000003fffd16b19e, pid=281849, tid=281855
#
# JRE version: OpenJDK Runtime Environment (26.0) (build 26-internal-adhoc.amit.jdk)
# Java VM: OpenJDK 64-Bit Server VM (26-internal-adhoc.amit.jdk, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-s390x)
# Problematic frame:
# V  [libjvm.so+0x66b19e]  HandleMark::~HandleMark()+0x1e
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -F%F -- %E" (or dumping to /home/amit/jdk/core.281849)
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#

stack trace:

Stack: [0x000003fffc900000,0x000003fffca00000],  sp=0x000003fffc9fca40,  free space=1010k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x66b19e]  HandleMark::~HandleMark()+0x1e  (handles.inline.hpp:88)
V  [libjvm.so+0xc1a038]  Threads::create_vm(JavaVMInitArgs*, bool*)+0x528  (threads.cpp:905)
V  [libjvm.so+0x799e9a]  JNI_CreateJavaVM+0x7a  (jni.cpp:3589)
C  [libjli.so+0x40e0]  JavaMain+0xa0  (java.c:1506)
C  [libjli.so+0x8170]  ThreadJavaMain+0x20  (java_md.c:646)

This commit (https://github.com/openjdk/jdk/pull/26209/commits/e945e0460832cf25dbbaba351b89c1cade4fefa1) seems to be faulty.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26209#issuecomment-3052219868

From xgong at openjdk.org Thu Jul 10 03:18:45 2025
From: xgong at openjdk.org (Xiaohong Gong)
Date: Thu, 10 Jul 2025 03:18:45 GMT
Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v13]
In-Reply-To: 
References: 
Message-ID: 

On Thu, 10 Jul 2025 01:58:23 GMT, Xiaohong Gong wrote:

>> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Change match rule names to lowercase
>
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2919:
>
>> 2917:     ins(tmp, D, src2, 1, 0);
>> 2918:     tbl(dst, size1, tmp, 1, dst);
>> 2919:   }
>
> Is it better than we wrap this part as a help function, because the code is much the same with line2885-2898?

These two functions can be refined more clearly. Following is my version:

void C2_MacroAssembler::select_from_two_vectors_neon(FloatRegister dst, FloatRegister src1,
                                                     FloatRegister src2, FloatRegister index,
                                                     FloatRegister tmp, bool isQ) {
  assert_different_registers(dst, src1, src2, tmp);
  assert(bt != T_DOUBLE && bt != T_LONG, "unsupported basic type");
  if (isQ) {
    assert(UseSVE <= 1, "sve must be <= 1");
    // If the vector length is 16B, then use the Neon "tbl" instruction with two vector table
    tbl(dst, size1, src1, 2, index);
  } else { // vector length == 8
    assert(UseSVE == 0, "must be Neon only");
    // We need to fit both the source vectors (src1, src2) in a 128-bit register because the
    // Neon "tbl" instruction supports only looking up 16B vectors. We then use the Neon "tbl"
    // instruction with one vector lookup
    ins(tmp, D, src1, 0, 0);
    ins(tmp, D, src2, 1, 0);
    tbl(dst, size1, tmp, 1, index);
  }
}

void C2_MacroAssembler::select_from_two_vectors_sve(FloatRegister dst, FloatRegister src1,
                                                    FloatRegister src2, FloatRegister index,
                                                    FloatRegister tmp, BasicType bt,
                                                    unsigned length_in_bytes) {
  assert_different_registers(dst, src1, src2, index, tmp);
  SIMD_RegVariant T = elemType_to_regVariant(bt);
  if (length_in_bytes == 8) {
    assert(UseSVE >= 1, "sve must be >= 1");
    ins(tmp, D, src1, 0, 0);
    ins(tmp, D, src2, 1, 0);
    sve_tbl(dst, T, tmp, index);
  } else { // UseSVE == 2 and vector_length_in_bytes > 8
    assert(UseSVE == 2, "must be sve2");
    sve_tbl(dst, T, src1, src2, index);
  }
}

void C2_MacroAssembler::select_from_two_vectors(FloatRegister dst, FloatRegister src1,
                                                FloatRegister src2, FloatRegister index,
                                                FloatRegister tmp, BasicType bt,
                                                unsigned length_in_bytes) {
  assert_different_registers(dst, src1, src2, index, tmp);
  if (UseSVE == 2 || (UseSVE == 1 && length_in_bytes == 8)) {
    select_from_two_vectors_sve(dst, src1, src2, index, tmp, bt, length_in_bytes);
    return;
  }

  // The only BasicTypes that can reach here are T_SHORT, T_BYTE, T_INT and T_FLOAT
  assert(bt != T_DOUBLE && bt != T_LONG, "unsupported basic type");
  assert(length_in_bytes <= 16, "length_in_bytes must be <= 16");

  SIMD_Arrangement size1 = isQ ? T16B : T8B;
  SIMD_Arrangement size2 = esize2arrangement((uint)type2aelembytes(bt), isQ);

  // Neon "tbl" instruction only supports byte tables, so we need to look at chunks of
  // 2B for selecting shorts or chunks of 4B for selecting ints/floats from the table.
  // The index values in "index" register are in the range of [0, 2 * NUM_ELEM) where NUM_ELEM
  // is the number of elements that can fit in a vector. For ex. for T_SHORT with 64-bit vector length,
  // the indices can range from [0, 8).
  // As an example with 64-bit vector length and T_SHORT type - let index = [2, 5, 1, 0]
  // Move a constant 0x02 in every byte of tmp - tmp = [0x0202, 0x0202, 0x0202, 0x0202]
  // Multiply index vector with tmp to yield - dst = [0x0404, 0x0a0a, 0x0202, 0x0000]
  // Move a constant 0x0100 in every 2B of tmp - tmp = [0x0100, 0x0100, 0x0100, 0x0100]
  // Add the multiplied result to the vector in tmp to obtain the byte level
  // offsets - dst = [0x0504, 0x0b0a, 0x0302, 0x0100]
  // Use these offsets in the "tbl" instruction to select chunks of 2B.
  if (bt == T_BYTE) {
    select_from_two_vectors_neon(dst, src1, src2, index, tmp, isQ);
  } else {
    int elem_size = (bt == T_SHORT) ? 2 : 4;
    uint64_t tbl_offset = (bt == T_SHORT) ? 0x0100u : 0x03020100u;
    mov(tmp, size1, elem_size);
    mulv(dst, size2, index, tmp);
    mov(tmp, size2, tbl_offset);
    addv(dst, size1, dst, tmp); // "dst" now contains the processed index elements
                                // to select a set of 2B/4B
    select_from_two_vectors_neon(dst, src1, src2, dst, tmp, isQ);
  }
}

1) Current match rules of `vselect_from_two_vectors_neon_..` and `vselect_from_two_vectors_sve_...` can be combined by calling the same function `select_from_two_vectors()`, as the registers are totally the same. This can save half of the newly added rules.

2) `select_from_two_vectors_sve` and `select_from_two_vectors_neon` can be two helper functions which should be `private` to `C2_MacroAssembler`.

3) There are some cases that do not need the `tmp` register:
   - UseSVE <= 1 && bt == T_BYTE && length_in_bytes == 16
   - UseSVE == 2 && length_in_bytes == MaxVectorSize

   For these cases, maybe we have to separate the rules from those that need the `tmp` register. This can save a float register.
If this will make the code more complex and unreadable, I'm also fine with noting spliting them. WDYT? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2196420133 From amitkumar at openjdk.org Thu Jul 10 03:38:40 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Thu, 10 Jul 2025 03:38:40 GMT Subject: RFR: 8361536: [s390x] Saving return_pc at wrong offset In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 14:35:30 GMT, Martin Doerr wrote: >> Fixes the bug where return pc was stored at a wrong offset, which causes issue with java abi. >> >> Issue appeared in #26004, see the comment: https://github.com/openjdk/jdk/pull/26004#issuecomment-3017928879. > > We're not really saving space. We just use less of the caller allocated stack space which is still as large as before. > But the change looks good and should make the stack walking code happy, because it can find the return_pc where it is expected, now. @TheRealMDoerr I need reapproval. Can you provide one ? Thanks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26209#issuecomment-3055233107 From amitkumar at openjdk.org Thu Jul 10 03:40:43 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Thu, 10 Jul 2025 03:40:43 GMT Subject: RFR: 8358756: [s390x] Test StartupOutput.java crash due to CodeCache size [v2] In-Reply-To: <5EregFPpep4Y8cL1v_GnvR6vq415jVK-u_6MuCPNfm4=.8b2ac0ec-47fe-4f50-9815-a61ea400f58f@github.com> References: <5EregFPpep4Y8cL1v_GnvR6vq415jVK-u_6MuCPNfm4=.8b2ac0ec-47fe-4f50-9815-a61ea400f58f@github.com> Message-ID: On Tue, 17 Jun 2025 15:09:42 GMT, Damon Fenacci wrote: >>> Thanks @offamitkumar. The idea behind the [PR](https://github.com/openjdk/jdk/pull/23630) that changed this is that it would check randomly around the amount of code cache that would be just enough for the compilers to start (or not). So, before that PR it would sometimes crash instead of terminating gently. Does adding `800k` to the initial code cache for s390 do that? Did you try before that [PR](https://github.com/openjdk/jdk/pull/23630) (or temporarily reverting it) to see if it crashes? >> >> >> Just for my understanding. Even if test passes we still want to see this warning: >> >> [warning][codecache] CodeCache is full. Compiler has been disabled. >> >> >> Before the PR, I don't test crashing or even producing this warning. Even with my changes same behaviour is going on. > >> Just for my understanding. Even if test passes we still want to see this warning: >> >> ``` >> [warning][codecache] CodeCache is full. Compiler has been disabled. >> ``` > > The test passes with and without that message. When the randomly chosen amount of code cache is not enough to start the compiler(s) it should print that message, when it is enough to start both compilers, you don't see that message. > The important thing is that there is no crash when compilers are trying to reserve code cache (they should be just shut down). @dafedafe any further comments on this one :-) ------------- PR Comment: https://git.openjdk.org/jdk/pull/25741#issuecomment-3055235851 From duke at openjdk.org Thu Jul 10 05:09:39 2025 From: duke at openjdk.org (Guanqiang Han) Date: Thu, 10 Jul 2025 05:09:39 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp [v2] In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 12:11:46 GMT, Christian Hagedorn wrote: >> Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. 
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - update modification and add regression test >> - Merge remote-tracking branch 'upstream/master' into 8361140 >> - 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp >> >> When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. >> >> This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this support is disabled. > > src/hotspot/share/opto/escape.cpp line 981: > >> 979: if (!OptimizePtrCompare) { >> 980: return; >> 981: } > > Thanks for working on this! IIUC, having the bailout here will fail to reduce the phi which could be unexpected. Shouldn't we just return `UNKNOWN` from within `ConnectionGraph::optimize_ptr_compare()` when we run without `OptimizePtrCompare`? > > On a separate note, can you also add a regression test? Maybe you can also just add a run with `-XX:-OptimizePtrCompare` - maybe together with `-XX:+VerifyReduceAllocationMerges` for more verification - to `compiler/c2/irTests/scalarReplacement/AllocationMergesTests.java`. > > @JohnTortugo you might also want to have a look at this. hi @chhagedorn , I already update PR and add regression test. Please take another look when you have time . Thanks a lot. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26125#discussion_r2196583002 From shade at openjdk.org Thu Jul 10 06:11:41 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 10 Jul 2025 06:11:41 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v8] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 20:13:57 GMT, Evgeny Astigeevich wrote: >> Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. >> >> This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. >> >> Tested on Linux and MacOS with and without hsdis: >> - Fastdebug: test passed >> - Slowdebug: test passed. >> - Release: test passed. > > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Apply 8360936-cosmetics-1.patch.txt from PR Looks good to me, thanks. ------------- Marked as reviewed by shade (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26072#pullrequestreview-3004074308 From mchevalier at openjdk.org Thu Jul 10 06:13:27 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 10 Jul 2025 06:13:27 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v6] In-Reply-To: References: Message-ID: <-rnlrm6PHRZeO1izbXh5nOrm368YKrsFft1u6SHXzWA=.9c8e6646-0b72-4705-895e-f795f74f3906@github.com> > When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 > > This is enforced by restoring the old state, like in > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 > > That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: > > ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) > > > Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. > > Another situation is somewhat worse, when happening during parsing. It can lead to such cases: > > ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) > > The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? > > This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 > > And here there is the challenge: > - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) > - we can't really change the pointer, just the content > -... 
Marc Chevalier has updated the pull request incrementally with two additional commits since the last revision: - Forgot to destruct_map_clone - +'_' and ctor init ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25936/files - new: https://git.openjdk.org/jdk/pull/25936/files/59133778..be5c0241 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25936&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25936&range=04-05 Stats: 6 lines in 2 files changed: 2 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/25936.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25936/head:pull/25936 PR: https://git.openjdk.org/jdk/pull/25936 From mchevalier at openjdk.org Thu Jul 10 06:13:27 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 10 Jul 2025 06:13:27 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v5] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 12:38:26 GMT, Marc Chevalier wrote: >> When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 >> >> This is enforced by restoring the old state, like in >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 >> >> That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: >> >> ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) >> >> >> Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. >> >> Another situation is somewhat worse, when happening during parsing. It can lead to such cases: >> >> ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) >> >> The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? >> >> This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 >> >> And here there is the challenge: >> - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) >> ... 
> > Marc Chevalier has updated the pull request incrementally with two additional commits since the last revision: > > - Tentative addressing Vladimir's comments > - Re-insert Turns out in this `SavedState` tiny refactoring, I removed the underlying call to `destruct_map_clone`. It's probably benign up to memory consumption, and it made no test fail. Nevertheless, it's back. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25936#issuecomment-3055733721 From mchevalier at openjdk.org Thu Jul 10 06:13:28 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 10 Jul 2025 06:13:28 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v5] In-Reply-To: <06SLl1SEu1oWp3YWh9xR3LbQR5rDGs6thA5UAS_kMtk=.5fb6e090-7ac6-4e99-9700-b767f2a08348@github.com> References: <06SLl1SEu1oWp3YWh9xR3LbQR5rDGs6thA5UAS_kMtk=.5fb6e090-7ac6-4e99-9700-b767f2a08348@github.com> Message-ID: <4cs83TE2oMQrKZJBJy72dLKotjXgyh2WLheJyTPvvSM=.e22e5aea-befe-4857-86b3-a86ef6e9a57a@github.com> On Wed, 9 Jul 2025 16:34:15 GMT, Vladimir Kozlov wrote: >> Marc Chevalier has updated the pull request incrementally with two additional commits since the last revision: >> >> - Tentative addressing Vladimir's comments >> - Re-insert > > src/hotspot/share/opto/library_call.hpp line 147: > >> 145: SafePointNode* _map; >> 146: Unique_Node_List _ctrl_succ; >> 147: bool discarded = false; > > `discarded` is not static field. I suggest to initialize it in constructor. And use `_` prefix. > > Otherwise changes are good. Thank you for taking my suggestion. Done! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2196677989 From dfenacci at openjdk.org Thu Jul 10 06:38:39 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Thu, 10 Jul 2025 06:38:39 GMT Subject: RFR: 8358756: [s390x] Test StartupOutput.java crash due to CodeCache size [v2] In-Reply-To: References: <5EregFPpep4Y8cL1v_GnvR6vq415jVK-u_6MuCPNfm4=.8b2ac0ec-47fe-4f50-9815-a61ea400f58f@github.com> Message-ID: On Thu, 10 Jul 2025 03:37:37 GMT, Amit Kumar wrote: >>> Just for my understanding. Even if test passes we still want to see this warning: >>> >>> ``` >>> [warning][codecache] CodeCache is full. Compiler has been disabled. >>> ``` >> >> The test passes with and without that message. When the randomly chosen amount of code cache is not enough to start the compiler(s) it should print that message, when it is enough to start both compilers, you don't see that message. >> The important thing is that there is no crash when compilers are trying to reserve code cache (they should be just shut down). > > @dafedafe any further comments on this one :-) Sorry for the delay @offamitkumar, I left it a bit on the side... > What I wanted to verify with above expected crash is that current number are not enough for the compilers and we saw the output is containing that "Codecache is full" message. I think that if you want to check for the "Codecache is full..." message you should probably add a new test (as part of this regression test class) instead of changing this one as the purpose here is really to check for no crashes. 
-------------

PR Comment: https://git.openjdk.org/jdk/pull/25741#issuecomment-3055825472

From xgong at openjdk.org Thu Jul 10 07:10:23 2025
From: xgong at openjdk.org (Xiaohong Gong)
Date: Thu, 10 Jul 2025 07:10:23 GMT
Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation
Message-ID: 

This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for the AArch64 SVE platform.

### Background

Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register.

### Implementation

#### Challenges

Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints.

For a 512-bit SVE machine, loading a `byte` vector with different vector species requires different approaches:
- SPECIES_64: Single operation with mask (8 elements, 256-bit)
- SPECIES_128: Single operation, full register (16 elements, 512-bit)
- SPECIES_256: Two operations + merge (32 elements, 1024-bit)
- SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit)

Use `ByteVector.SPECIES_512` as an example:
- It contains 64 elements. So the index vector size should be `64 * 32` bits, which is 4 times the SVE vector register size.
- It requires 4 vector gather-loads to finish the whole operation.

  byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...]
  int[] idx = [0, 1, 2, 3, ..., 63, ...]

  4 gather-load:
  idx_v1 = [15 14 13 ... 1 0]    gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa]
  idx_v2 = [31 30 29 ... 17 16]  gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb]
  idx_v3 = [47 46 45 ... 33 32]  gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc]
  idx_v4 = [63 62 61 ... 49 48]  gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd]

  merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa]

#### Solution

The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end. Here are the main changes:
- Enhanced IR generation with architecture-specific patterns based on the `gather_scatter_needs_vector_index()` matcher.
- Added `VectorSliceNode` for result merging.
- Added `VectorMaskWidenNode` for mask splitting and type conversion for masked gather-load.
- Implemented SVE match rules for subword gather operations.
- Added comprehensive IR tests for verification.

### Testing:
- Passed hotspot::tier1/2/3, jdk::tier1/2/3 tests
- No regressions found

### Performance:

The performance of the corresponding JMH benchmarks improves 3-11x on an NVIDIA GRACE CPU, which is a 128-bit SVE2 architecture.
Following is the performance data: Benchmark SIZE Mode Cnt Unit Before After Gain GatherOperationsBenchmark.microByteGather128 64 thrpt 30 ops/ms 13500.891 46721.307 3.46 GatherOperationsBenchmark.microByteGather128 256 thrpt 30 ops/ms 3378.186 12321.847 3.64 GatherOperationsBenchmark.microByteGather128 1024 thrpt 30 ops/ms 844.871 3144.217 3.72 GatherOperationsBenchmark.microByteGather128 4096 thrpt 30 ops/ms 211.386 783.337 3.70 GatherOperationsBenchmark.microByteGather128_MASK 64 thrpt 30 ops/ms 10605.664 46124.957 4.34 GatherOperationsBenchmark.microByteGather128_MASK 256 thrpt 30 ops/ms 2668.531 12292.350 4.60 GatherOperationsBenchmark.microByteGather128_MASK 1024 thrpt 30 ops/ms 676.218 3074.224 4.54 GatherOperationsBenchmark.microByteGather128_MASK 4096 thrpt 30 ops/ms 169.402 817.227 4.82 GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF 64 thrpt 30 ops/ms 10615.723 46122.380 4.34 GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF 256 thrpt 30 ops/ms 2671.931 12222.473 4.57 GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF 1024 thrpt 30 ops/ms 678.437 3091.970 4.55 GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF 4096 thrpt 30 ops/ms 170.310 813.967 4.77 GatherOperationsBenchmark.microByteGather128_NZ_OFF 64 thrpt 30 ops/ms 13524.671 47223.082 3.49 GatherOperationsBenchmark.microByteGather128_NZ_OFF 256 thrpt 30 ops/ms 3411.813 12343.308 3.61 GatherOperationsBenchmark.microByteGather128_NZ_OFF 1024 thrpt 30 ops/ms 847.919 3129.065 3.69 GatherOperationsBenchmark.microByteGather128_NZ_OFF 4096 thrpt 30 ops/ms 212.790 787.953 3.70 GatherOperationsBenchmark.microByteGather64 64 thrpt 30 ops/ms 8717.294 48176.937 5.52 GatherOperationsBenchmark.microByteGather64 256 thrpt 30 ops/ms 2184.345 12347.113 5.65 GatherOperationsBenchmark.microByteGather64 1024 thrpt 30 ops/ms 546.093 3070.851 5.62 GatherOperationsBenchmark.microByteGather64 4096 thrpt 30 ops/ms 136.724 767.656 5.61 GatherOperationsBenchmark.microByteGather64_MASK 64 thrpt 30 ops/ms 6576.504 48588.806 7.38 GatherOperationsBenchmark.microByteGather64_MASK 256 thrpt 30 ops/ms 1653.073 12341.291 7.46 GatherOperationsBenchmark.microByteGather64_MASK 1024 thrpt 30 ops/ms 416.590 3070.680 7.37 GatherOperationsBenchmark.microByteGather64_MASK 4096 thrpt 30 ops/ms 105.743 767.790 7.26 GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF 64 thrpt 30 ops/ms 6628.974 48628.463 7.33 GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF 256 thrpt 30 ops/ms 1676.767 12338.116 7.35 GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF 1024 thrpt 30 ops/ms 422.612 3070.987 7.26 GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF 4096 thrpt 30 ops/ms 105.033 767.563 7.30 GatherOperationsBenchmark.microByteGather64_NZ_OFF 64 thrpt 30 ops/ms 8754.635 48525.395 5.54 GatherOperationsBenchmark.microByteGather64_NZ_OFF 256 thrpt 30 ops/ms 2182.044 12338.096 5.65 GatherOperationsBenchmark.microByteGather64_NZ_OFF 1024 thrpt 30 ops/ms 547.353 3071.666 5.61 GatherOperationsBenchmark.microByteGather64_NZ_OFF 4096 thrpt 30 ops/ms 137.853 767.745 5.56 GatherOperationsBenchmark.microShortGather128 64 thrpt 30 ops/ms 8713.480 37696.121 4.32 GatherOperationsBenchmark.microShortGather128 256 thrpt 30 ops/ms 2189.636 9479.710 4.32 GatherOperationsBenchmark.microShortGather128 1024 thrpt 30 ops/ms 545.435 2378.492 4.36 GatherOperationsBenchmark.microShortGather128 4096 thrpt 30 ops/ms 136.213 595.504 4.37 GatherOperationsBenchmark.microShortGather128_MASK 64 thrpt 30 ops/ms 6665.844 37765.315 5.66 
GatherOperationsBenchmark.microShortGather128_MASK 256 thrpt 30 ops/ms 1673.950 9482.207 5.66 GatherOperationsBenchmark.microShortGather128_MASK 1024 thrpt 30 ops/ms 420.628 2378.813 5.65 GatherOperationsBenchmark.microShortGather128_MASK 4096 thrpt 30 ops/ms 105.128 595.412 5.66 GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 64 thrpt 30 ops/ms 6699.594 37698.398 5.62 GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 256 thrpt 30 ops/ms 1682.128 9480.355 5.63 GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 1024 thrpt 30 ops/ms 421.942 2380.449 5.64 GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 4096 thrpt 30 ops/ms 106.587 595.560 5.58 GatherOperationsBenchmark.microShortGather128_NZ_OFF 64 thrpt 30 ops/ms 8788.830 37709.493 4.29 GatherOperationsBenchmark.microShortGather128_NZ_OFF 256 thrpt 30 ops/ms 2199.706 9485.769 4.31 GatherOperationsBenchmark.microShortGather128_NZ_OFF 1024 thrpt 30 ops/ms 548.309 2380.494 4.34 GatherOperationsBenchmark.microShortGather128_NZ_OFF 4096 thrpt 30 ops/ms 137.434 595.448 4.33 GatherOperationsBenchmark.microShortGather64 64 thrpt 30 ops/ms 5296.860 37797.813 7.13 GatherOperationsBenchmark.microShortGather64 256 thrpt 30 ops/ms 1321.738 9602.510 7.26 GatherOperationsBenchmark.microShortGather64 1024 thrpt 30 ops/ms 330.520 2404.013 7.27 GatherOperationsBenchmark.microShortGather64 4096 thrpt 30 ops/ms 82.149 602.956 7.33 GatherOperationsBenchmark.microShortGather64_MASK 64 thrpt 30 ops/ms 3458.968 37851.452 10.94 GatherOperationsBenchmark.microShortGather64_MASK 256 thrpt 30 ops/ms 879.143 9616.554 10.93 GatherOperationsBenchmark.microShortGather64_MASK 1024 thrpt 30 ops/ms 220.256 2408.851 10.93 GatherOperationsBenchmark.microShortGather64_MASK 4096 thrpt 30 ops/ms 54.947 603.251 10.97 GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF 64 thrpt 30 ops/ms 3521.856 37736.119 10.71 GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF 256 thrpt 30 ops/ms 881.456 9602.649 10.89 GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF 1024 thrpt 30 ops/ms 220.122 2409.030 10.94 GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF 4096 thrpt 30 ops/ms 55.845 603.126 10.79 GatherOperationsBenchmark.microShortGather64_NZ_OFF 64 thrpt 30 ops/ms 5279.815 37698.023 7.14 GatherOperationsBenchmark.microShortGather64_NZ_OFF 256 thrpt 30 ops/ms 1307.935 9601.551 7.34 GatherOperationsBenchmark.microShortGather64_NZ_OFF 1024 thrpt 30 ops/ms 329.707 2409.962 7.30 GatherOperationsBenchmark.microShortGather64_NZ_OFF 4096 thrpt 30 ops/ms 82.092 603.380 7.35 [1] https://bugs.openjdk.org/browse/JDK-8355563 [2] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1B--scalar-plus-vector-Gather-load-unsigned-bytes-to-vector--vector-index--?lang=en [3] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1H--scalar-plus-vector---Gather-load-unsigned-halfwords-to-vector--vector-index--?lang=en ------------- Commit messages: - 8351623: VectorAPI: Add SVE implementation of subword gather load operation Changes: https://git.openjdk.org/jdk/pull/26236/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26236&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8351623 Stats: 972 lines in 22 files changed: 841 ins; 12 del; 119 mod Patch: https://git.openjdk.org/jdk/pull/26236.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26236/head:pull/26236 PR: https://git.openjdk.org/jdk/pull/26236 From thartmann at openjdk.org Thu Jul 10 07:17:43 2025 From: thartmann at openjdk.org (Tobias 
Hartmann) Date: Thu, 10 Jul 2025 07:17:43 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v6] In-Reply-To: <-rnlrm6PHRZeO1izbXh5nOrm368YKrsFft1u6SHXzWA=.9c8e6646-0b72-4705-895e-f795f74f3906@github.com> References: <-rnlrm6PHRZeO1izbXh5nOrm368YKrsFft1u6SHXzWA=.9c8e6646-0b72-4705-895e-f795f74f3906@github.com> Message-ID: On Thu, 10 Jul 2025 06:13:27 GMT, Marc Chevalier wrote: >> When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 >> >> This is enforced by restoring the old state, like in >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 >> >> That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: >> >> ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) >> >> >> Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. >> >> Another situation is somewhat worse, when happening during parsing. It can lead to such cases: >> >> ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) >> >> The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? >> >> This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 >> >> And here there is the challenge: >> - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) >> ... > > Marc Chevalier has updated the pull request incrementally with two additional commits since the last revision: > > - Forgot to destruct_map_clone > - +'_' and ctor init Marked as reviewed by thartmann (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/25936#pullrequestreview-3004286462 From duke at openjdk.org Thu Jul 10 08:08:43 2025 From: duke at openjdk.org (erifan) Date: Thu, 10 Jul 2025 08:08:43 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: References: <_RERljqu_FG7ZyneAk7Thd-9TwED18pQpEBz_i105fY=.b8948a23-273a-49f6-b9cb-6b611a5eedc6@github.com> <6SXA9ZrXBDhZLyXP3lXbkpl4dl3iocvDpzPrUpIQOl8=.9b025be2-848b-4b78-a5e4-929cb7e9f798@github.com> Message-ID: <7QVWVj5vpSB42THa2rx-oxMqhH76qMZ5MBJjindRiLo=.b825076a-aa9c-4b86-94b6-0a593f2240ac@github.com> On Mon, 7 Jul 2025 09:08:37 GMT, Jatin Bhateja wrote: >>> What if during iterative GVN a constant -1 seeps through IR graph and gets connected to the input of VectorLongToMaskNode, you won't be able to create maskAll true in that case? >> >> Yes, this PR doesn't support this case. Maybe we should do this optimization in `ideal`. If `VectorLongToMask` is not supported, then try to convert it to `maskAll` or `Replicate` in intrinsic. >> >>> Do you see any advantage of doing this at intrinsic layer over entirely handling it in Java implimentation by simply modifying the opcode of fromBitsCoerced to MODE_BROADCAST from existing MODE_BITS_COERCED_LONG_TO_MASK for 0 or -1 input. >> >> I had tried this method and gave it up, because it has up to 34% performance regression for specific cases on x64. > >> > What if during iterative GVN a constant -1 seeps through IR graph and gets connected to the input of VectorLongToMaskNode, you won't be able to create maskAll true in that case? >> >> Yes, this PR doesn't support this case. Maybe we should do this optimization in `ideal`. If `VectorLongToMask` is not supported, then try to convert it to `maskAll` or `Replicate` in intrinsic. >> > > I would suggest extending VectorLongToMaskNode::Ideal for completeness of the solution. OK. But in order to cover various cases, the implementation may be a bit troublesome. The solution I thought of is to **check whether the architecture supports VectorLongToMask, MaskAll and Replicate in `LibraryCallKit::inline_vector_frombits_coerced`. If it does, generate VectorLongToMask, and then convert it to MaskAll or Replicate in IGVN**. This is similar to the current implementation of vector rotate. At the same time, this conversion may affect some other optimizations, such as `VectorMaskToLong(VectorLongToMask (x)) => x` and `VectorStoreMask(VectorLoadMask (x)) => x`. So we also need to fix these effects. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2196930141 From amitkumar at openjdk.org Thu Jul 10 08:29:39 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Thu, 10 Jul 2025 08:29:39 GMT Subject: RFR: 8358756: [s390x] Test StartupOutput.java crash due to CodeCache size [v2] In-Reply-To: References: <5EregFPpep4Y8cL1v_GnvR6vq415jVK-u_6MuCPNfm4=.8b2ac0ec-47fe-4f50-9815-a61ea400f58f@github.com> Message-ID: On Thu, 10 Jul 2025 03:37:37 GMT, Amit Kumar wrote: >>> Just for my understanding. Even if test passes we still want to see this warning: >>> >>> ``` >>> [warning][codecache] CodeCache is full. Compiler has been disabled. >>> ``` >> >> The test passes with and without that message. When the randomly chosen amount of code cache is not enough to start the compiler(s) it should print that message, when it is enough to start both compilers, you don't see that message. 
>> The important thing is that there is no crash when compilers are trying to reserve code cache (they should be just shut down). > > @dafedafe any further comments on this one :-) > Sorry for the delay @offamitkumar, I left it a bit on the side... > > > What I wanted to verify with above expected crash is that current number are not enough for the compilers and we saw the output is containing that "Codecache is full" message. > > I think that if you want to check for the "Codecache is full..." message you should probably add a new test (as part of this regression test class) instead of changing this one as the purpose here is really to check for no crashes. I think current modification is enough for us as well. Those message tweaks were just for my own curiosity and not required any further. Are you fine with the current changes ? ------------- PR Comment: https://git.openjdk.org/jdk/pull/25741#issuecomment-3056293756 From dzhang at openjdk.org Thu Jul 10 08:36:12 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 10 Jul 2025 08:36:12 GMT Subject: RFR: 8361829: [TESTBUG] RISC-V: compiler/vectorization/runner/BasicIntOpTest.java fails when using RVV without using zvbb Message-ID: Hi all, Please take a look and review this PR, thanks! After JDK-8355293 , compiler/vectorization/runner/BasicIntOpTest.java fails when using RVV without using zvbb. The reason for the error is that PopCountVI on RISC-V requires zvbb, not justrvv. ### Test - [x] Run compiler/vectorization/runner/BasicIntOpTest.java on k1 - [x] Run compiler/vectorization/runner/BasicIntOpTest.java on qemu-system (enable RVV) w/ and w/o zvbb ------------- Commit messages: - 8361829: [TESTBUG] RISC-V: compiler/vectorization/runner/BasicIntOpTest.java fails when using RVV without using zvbb Changes: https://git.openjdk.org/jdk/pull/26238/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26238&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361829 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26238.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26238/head:pull/26238 PR: https://git.openjdk.org/jdk/pull/26238 From shade at openjdk.org Thu Jul 10 08:38:39 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 10 Jul 2025 08:38:39 GMT Subject: RFR: 8358580: Rethink how classes are kept alive in training data In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 02:37:05 GMT, Igor Veresov wrote: > Use OopStorage directly instead of JNI handles. Note that we never destroy TrainingData objects, so we don't need to concern ourselves with freeing the OopStorage entries. Also, keeping the klasses alive is only necessary during the training run. During the replay the klasses TD objects refer to are always alive. src/hotspot/share/oops/trainingData.cpp line 437: > 435: KlassTrainingData::KlassTrainingData(InstanceKlass* klass) : TrainingData(klass) { > 436: assert(klass != nullptr, ""); > 437: oop* handle = oop_storage()->allocate(); I don't think you are supposed to allocate from `OopStorage` directly, that's the job for various `Handle`-s. Also, capturing the `java_mirror` does not really block the unloading, see: // Loading the java_mirror does not keep its holder alive. See Klass::keep_alive(). 
inline oop Klass::java_mirror() const { return _java_mirror.resolve(); } So the idiomatic way would be: _holder_mirror = OopHandle(Universe::vm_global(), klass->klass_holder()); ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26233#discussion_r2196997605 From shade at openjdk.org Thu Jul 10 08:44:45 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 10 Jul 2025 08:44:45 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v8] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 20:13:57 GMT, Evgeny Astigeevich wrote: >> Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. >> >> This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. >> >> Tested on Linux and MacOS with and without hsdis: >> - Fastdebug: test passed >> - Slowdebug: test passed. >> - Release: test passed. > > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Apply 8360936-cosmetics-1.patch.txt from PR Marked as reviewed by shade (Reviewer). test/hotspot/jtreg/compiler/onSpinWait/TestOnSpinWaitAArch64.java line 151: > 149: while (iter.hasNext()) { > 150: String line = iter.next().trim(); > 151: if (line.startsWith(";;}")) { Oh, apologies, I made a little mistake here when playing around with trims. Should be: Suggestion: if (line.startsWith(";; }")) { The test probably does not fail because it meets no instructions beyond the spin_wait block. Cleaner to fix it anyway. ------------- PR Review: https://git.openjdk.org/jdk/pull/26072#pullrequestreview-3004580240 PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2197012939 From dzhang at openjdk.org Thu Jul 10 09:23:11 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 10 Jul 2025 09:23:11 GMT Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors Message-ID: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> Follow up [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419), RVV supports all vector type conversion APIs in the Vector API. So we only need to relax the length limit of the short type to achieve a significant improvement in JMH performance for converting between short and wider types. ### Test qemu-system UseRVV: * [x] Run jdk_vector (fastdebug) * [x] Run compiler/vectorapi (fastdebug) ### Performance Following shows the performance improvement of relative VectorAPI JMHs on k1 (256-bit RVV): Benchmark (SIZE) Mode Units Before After Gain VectorFPtoIntCastOperations.microDouble128ToShort128 512 thrpt ops/ms 52.280 840.112 16.07 VectorFPtoIntCastOperations.microDouble128ToShort128 1024 thrpt ops/ms 28.156 429.322 15.25 VectorFPtoIntCastOperations.microFloat64ToShort64 512 thrpt ops/ms 14.242 479.509 33.67 VectorFPtoIntCastOperations.microFloat64ToShort64 1024 thrpt ops/ms 6.906 242.690 35.14 PS: `VectorFPtoIntCastOperations.microFloat64ToShort64` is added by [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419). 
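For reference, below is a minimal sketch of the kind of float-to-short narrowing such benchmarks exercise with the Vector API. The species choices, class name and loop shape are illustrative assumptions, not the actual JMH benchmark code:

import jdk.incubator.vector.*;

public class FloatToShortConvertSketch {
    static final VectorSpecies<Float> F64 = FloatVector.SPECIES_64;
    static final VectorSpecies<Short> S64 = ShortVector.SPECIES_64;

    // Narrow float lanes to short lanes; this is the pattern that benefits once
    // the minimum short-vector length is relaxed on RVV.
    static void convert(float[] src, short[] dst) {
        int i = 0;
        int bound = F64.loopBound(src.length);
        VectorMask<Short> low = S64.indexInRange(0, F64.length()); // only the low lanes hold converted data
        for (; i < bound; i += F64.length()) {
            FloatVector fv = FloatVector.fromArray(F64, src, i);
            ShortVector sv = (ShortVector) fv.convertShape(VectorOperators.F2S, S64, 0);
            sv.intoArray(dst, i, low);
        }
        for (; i < src.length; i++) {
            dst[i] = (short) src[i]; // scalar tail
        }
    }
}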
------------- Commit messages: - 8361836: RISC-V: Relax min vector length to 32-bit for short vectors Changes: https://git.openjdk.org/jdk/pull/26239/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26239&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361836 Stats: 18 lines in 2 files changed: 18 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26239.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26239/head:pull/26239 PR: https://git.openjdk.org/jdk/pull/26239 From dzhang at openjdk.org Thu Jul 10 09:26:53 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 10 Jul 2025 09:26:53 GMT Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors [v2] In-Reply-To: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> Message-ID: <5PCmTwnensUBsUNqVlxUuK6L2nDHIOqek7KEH5r_h_M=.9a05eebc-f3ba-4b0e-b0e0-76e89661c89d@github.com> > Follow up [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419), RVV supports all vector type conversion APIs in the Vector API. > So we only need to relax the length limit of the short type to achieve a significant improvement in JMH performance for converting between short and wider types. > > ### Test > qemu-system UseRVV: > * [x] Run jdk_vector (fastdebug) > * [x] Run compiler/vectorapi (fastdebug) > > ### Performance > Following shows the performance improvement of relative VectorAPI JMHs on k1 (256-bit RVV): > > > Benchmark (SIZE) Mode Units Before After Gain > VectorFPtoIntCastOperations.microDouble128ToShort128 512 thrpt ops/ms 52.280 840.112 16.07 > VectorFPtoIntCastOperations.microDouble128ToShort128 1024 thrpt ops/ms 28.156 429.322 15.25 > VectorFPtoIntCastOperations.microFloat64ToShort64 512 thrpt ops/ms 14.242 479.509 33.67 > VectorFPtoIntCastOperations.microFloat64ToShort64 1024 thrpt ops/ms 6.906 242.690 35.14 > > PS: `VectorFPtoIntCastOperations.microFloat64ToShort64` is added by [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419). Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Adjust the position of comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26239/files - new: https://git.openjdk.org/jdk/pull/26239/files/06598543..0773a366 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26239&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26239&range=00-01 Stats: 2 lines in 1 file changed: 1 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26239.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26239/head:pull/26239 PR: https://git.openjdk.org/jdk/pull/26239 From bkilambi at openjdk.org Thu Jul 10 09:55:44 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Thu, 10 Jul 2025 09:55:44 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v13] In-Reply-To: References: Message-ID: <75qy49Hm0CDAirFRCqKVrLS_QKt6J-p4c1vryUBHCE8=.b787f304-6444-4e16-acaa-049ea4be2670@github.com> On Thu, 10 Jul 2025 03:15:24 GMT, Xiaohong Gong wrote: >> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2919: >> >>> 2917: ins(tmp, D, src2, 1, 0); >>> 2918: tbl(dst, size1, tmp, 1, dst); >>> 2919: } >> >> Is it better than we wrap this part as a help function, because the code is much the same with line2885-2898? > > These two functions can be refined more clearly. 
Following is my version: > > void C2_MacroAssembler::select_from_two_vectors_neon(FloatRegister dst, FloatRegister src1, > FloatRegister src2, FloatRegister index, > FloatRegister tmp, unsigned length_in_bytes) { > assert_different_registers(dst, src1, src2, tmp); > SIMD_Arrangement size = length_in_bytes == 16 ? T16B : T8B; > > if (length_in_bytes == 16) { > assert(UseSVE <= 1, "sve must be <= 1"); > // If the vector length is 16B, then use the Neon "tbl" instruction with two vector table > tbl(dst, size, src1, 2, index); > } else { // vector length == 8 > assert(UseSVE == 0, "must be Neon only"); > // We need to fit both the source vectors (src1, src2) in a 128-bit register because the > // Neon "tbl" instruction supports only looking up 16B vectors. We then use the Neon "tbl" > // instruction with one vector lookup > ins(tmp, D, src1, 0, 0); > ins(tmp, D, src2, 1, 0); > tbl(dst, size, tmp, 1, index); > } > } > > void C2_MacroAssembler::select_from_two_vectors_sve(FloatRegister dst, FloatRegister src1, > FloatRegister src2, FloatRegister index, > FloatRegister tmp, BasicType bt, > unsigned length_in_bytes) { > assert_different_registers(dst, src1, src2, index, tmp); > SIMD_RegVariant T = elemType_to_regVariant(bt); > if (length_in_bytes == 8) { > assert(UseSVE >= 1, "must be"); > ins(tmp, D, src1, 0, 0); > ins(tmp, D, src2, 1, 0); > sve_tbl(dst, T, tmp, index); > } else { > assert(UseSVE == 2 && length_in_bytes == MaxVectorSize, "must be"); > sve_tbl(dst, T, src1, src2, index); > } > } > > void C2_MacroAssembler::select_from_two_vectors(FloatRegister dst, FloatRegister src1, > FloatRegister src2, FloatRegister index, > FloatRegister tmp, BasicType bt, > unsigned length_in_bytes) { > > assert_different_registers(dst, src1, src2, index, tmp); > > if (UseSVE == 2 || (UseSVE == 1 && length_in_bytes == 8)) { > select_from_two_vectors_sve(dst, src1, src2, index, tmp, bt, length_in_bytes); > return; > } > > // The only BasicTypes that can reach here are T_SHORT, T_BYTE, T_INT and T_FLOAT > assert(bt != T_DOUBLE ... Thanks a lot for your suggestion @XiaohongGong . I will try this suggestion and see how it looks and get back. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2197182232 From mchevalier at openjdk.org Thu Jul 10 10:48:43 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 10 Jul 2025 10:48:43 GMT Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v5] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 12:36:31 GMT, Marc Chevalier wrote: >> A first part toward a better support of pure functions, but this time, with guidance from @iwanowww. >> >> ## Pure Functions >> >> Pure functions (considered here) are functions that have no side effects, no effect on the control flow (no exception or such), cannot deopt etc.. It's really a function that you can execute anywhere, with whichever arguments without effect other than wasting time. Integer division is not pure as dividing by zero is throwing. But many floating point functions will just return `NaN` or `+/-infinity` in problematic cases. >> >> ## Scope >> >> We are not going all powerful for now! It's mostly about identifying some pure functions and being able to remove them if the result is unused. Some other things are not part of this PR, on purpose. Especially, this PR doesn't propose a way to move pure calls around. The reason is that pure calls are later expanded into regular calls, which require a control input. 
To be able to do the expansion, we just keep the control in the pure call as well. >> >> ## Implementation Overview >> >> We created here some new node kind for pure calls, inheriting leaf calls, that are expanded into regular leaf calls during final graph reshaping. The possibility to support pure call directly in AD file is left open. >> >> This PR also introduces `TupleNode` (largely based on an original idea/implem of @iwanowww), that just tie multiple input together and play well with `ProjNode`: the n-th projection of a `TupleNode` is the n-th input of the tuple. This is a convenient way to skip and remove nodes from the graph while delegating the difficulty of the surgery to the trusted IGVN's implementation. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Tentative to address Tobias' comments @iwanowww would you like to take a look at it, since you have quite some context already? ------------- PR Comment: https://git.openjdk.org/jdk/pull/25760#issuecomment-3056888461 From eastigeevich at openjdk.org Thu Jul 10 11:21:25 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Thu, 10 Jul 2025 11:21:25 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v9] In-Reply-To: References: Message-ID: > Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. > > This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. > > Tested on Linux and MacOS with and without hsdis: > - Fastdebug: test passed > - Slowdebug: test passed. > - Release: test passed. 
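For illustration only, a rough sketch of the parsing approach: scan the CompileCommand=print output, find the spin-wait block, and count instruction lines inside it. The marker strings, class name and mnemonics below are assumptions, not the exact code of the test:

import java.util.List;

class SpinWaitInstructionCounter {
    // Count occurrences of a mnemonic inside the ";; spin_wait { ... ;; }" block
    // printed for the tested method.
    static int countInBlock(List<String> output, String mnemonic) {
        boolean inBlock = false;
        int count = 0;
        for (String raw : output) {
            String line = raw.trim();
            if (line.contains("spin_wait {")) {
                inBlock = true;
            } else if (inBlock && line.startsWith(";; }")) {
                break; // end of the spin-wait block
            } else if (inBlock && line.contains(mnemonic)) {
                count++; // e.g. "isb", "yield" or "nop"
            }
        }
        return count;
    }
}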
Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: Fix detection of block comment end; Use specilized lambda function to count instructions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26072/files - new: https://git.openjdk.org/jdk/pull/26072/files/1b4d81be..92a20a20 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=07-08 Stats: 26 lines in 1 file changed: 18 ins; 4 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/26072.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26072/head:pull/26072 PR: https://git.openjdk.org/jdk/pull/26072 From eastigeevich at openjdk.org Thu Jul 10 11:24:42 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Thu, 10 Jul 2025 11:24:42 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v8] In-Reply-To: References: Message-ID: <-7KCidOhy3DWOczvBPEZ195-QLvz451RC7QmAdiGAlQ=.ad0c0584-4b0b-4506-8944-3a4d1113fbbd@github.com> On Thu, 10 Jul 2025 08:41:58 GMT, Aleksey Shipilev wrote: >> Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: >> >> Apply 8360936-cosmetics-1.patch.txt from PR > > test/hotspot/jtreg/compiler/onSpinWait/TestOnSpinWaitAArch64.java line 151: > >> 149: while (iter.hasNext()) { >> 150: String line = iter.next().trim(); >> 151: if (line.startsWith(";;}")) { > > Oh, apologies, I made a little mistake here when playing around with trims. Should be: > Suggestion: > > if (line.startsWith(";; }")) { > > > The test probably does not fail because it meets no instructions beyond the spin_wait block. Cleaner to fix it anyway. I fixed this. I have also added a specialized lambda function to count expected instructions: - for disassembled code, just check a line. - for hex code, split and count. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2197435713 From fyang at openjdk.org Thu Jul 10 11:39:38 2025 From: fyang at openjdk.org (Fei Yang) Date: Thu, 10 Jul 2025 11:39:38 GMT Subject: RFR: [TESTBUG] RISC-V: compiler/vectorization/runner/BasicIntOpTest.java fails with RVV but not Zvbb In-Reply-To: References: Message-ID: <97uBEuOHsVxLaR0PVkzcvVvUFlhBpOcYzjC_e8xc77k=.47d7bcac-d228-409a-8b76-9c56b6e0d74c@github.com> On Thu, 10 Jul 2025 08:31:53 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > After JDK-8355293 , compiler/vectorization/runner/BasicIntOpTest.java failswith RVV but not Zvbb. > The reason for the error is that PopCountVI on RISC-V requires zvbb, not just rvv. > > ### Test > - [x] Run compiler/vectorization/runner/BasicIntOpTest.java on k1 > - [x] Run compiler/vectorization/runner/BasicIntOpTest.java on qemu-system (enable RVV) w/ and w/o zvbb Thanks. Looks reasonable. ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26238#pullrequestreview-3005287870 From coleenp at openjdk.org Thu Jul 10 11:54:39 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 10 Jul 2025 11:54:39 GMT Subject: RFR: 8358580: Rethink how classes are kept alive in training data In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 08:35:48 GMT, Aleksey Shipilev wrote: >> Use OopStorage directly instead of JNI handles. Note that we never destroy TrainingData objects, so we don't need to concern ourselves with freeing the OopStorage entries. 
Also, keeping the klasses alive is only necessary during the training run. During the replay the klasses TD objects refer to are always alive. > > src/hotspot/share/oops/trainingData.cpp line 437: > >> 435: KlassTrainingData::KlassTrainingData(InstanceKlass* klass) : TrainingData(klass) { >> 436: assert(klass != nullptr, ""); >> 437: oop* handle = oop_storage()->allocate(); > > I don't think you are supposed to allocate from `OopStorage` directly, that's the job for various `Handle`-s. Also, capturing the `java_mirror` does not really block the unloading, see: > > > // Loading the java_mirror does not keep its holder alive. See Klass::keep_alive(). > inline oop Klass::java_mirror() const { > return _java_mirror.resolve(); > } > > > So the idiomatic way would be: > > > _holder_mirror = OopHandle(Universe::vm_global(), klass->klass_holder()); What a confusing comment, but luckily it points to Klass::keep_alive() for context. Yes, please don't allocate an OopStorage handle directly. Then the OopHandle constructor will check for native oom. Otherwise this seems okay and better than using jni. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26233#discussion_r2197505170 From shade at openjdk.org Thu Jul 10 12:27:43 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 10 Jul 2025 12:27:43 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v8] In-Reply-To: <-7KCidOhy3DWOczvBPEZ195-QLvz451RC7QmAdiGAlQ=.ad0c0584-4b0b-4506-8944-3a4d1113fbbd@github.com> References: <-7KCidOhy3DWOczvBPEZ195-QLvz451RC7QmAdiGAlQ=.ad0c0584-4b0b-4506-8944-3a4d1113fbbd@github.com> Message-ID: On Thu, 10 Jul 2025 11:21:55 GMT, Evgeny Astigeevich wrote: >> test/hotspot/jtreg/compiler/onSpinWait/TestOnSpinWaitAArch64.java line 151: >> >>> 149: while (iter.hasNext()) { >>> 150: String line = iter.next().trim(); >>> 151: if (line.startsWith(";;}")) { >> >> Oh, apologies, I made a little mistake here when playing around with trims. Should be: >> Suggestion: >> >> if (line.startsWith(";; }")) { >> >> >> The test probably does not fail because it meets no instructions beyond the spin_wait block. Cleaner to fix it anyway. > > I fixed this. > I have also added a specialized lambda function to count expected instructions: > - for disassembled code, just check a line. > - for hex code, split and count. Not sure lambdas make this cleaner, TBH. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2197580478 From chagedorn at openjdk.org Thu Jul 10 12:53:45 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 10 Jul 2025 12:53:45 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp [v2] In-Reply-To: References: Message-ID: <12wwp9Vw7IZOUSXfONjmvyj3cr1YaX85XdJZvGboUUs=.0c17c79f-74c3-4e7d-98a4-a8f68bb37b8f@github.com> On Thu, 10 Jul 2025 02:11:21 GMT, Guanqiang Han wrote: >> When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. >> >> This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this flag is disabled. 
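As an illustration of the shape of Java code this path sees, here is a hypothetical sketch (not the regression test added in the PR) of a pointer compare on a Phi that merges two allocations:

class PtrCompareOnPhiSketch {
    static class Box {
        int value;
        Box(int value) { this.value = value; }
    }

    // 'merged' is a Phi of two allocations; 'merged == a' becomes a CmpP on that
    // Phi, which is what reduce_phi_on_cmp tries to split per allocation.
    static boolean test(boolean cond) {
        Box a = new Box(1);
        Box b = new Box(2);
        Box merged = cond ? a : b;
        return merged == a;
    }
}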
> > Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - update modification and add regression test > - Merge remote-tracking branch 'upstream/master' into 8361140 > - 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp > > When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. > > This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this support is disabled. Thanks for the update! I have some follow-up comments. src/hotspot/share/opto/escape.cpp line 3280: > 3278: const TypeInt* EQ = TypeInt::CC_EQ; // [0] == ZERO > 3279: const TypeInt* NE = TypeInt::CC_GT; // [1] == ONE > 3280: const TypeInt* UNKNOWN = TypeInt::CC; // [-1, 0,1] I suggest to move the `UNKNOWN` definition up and then use `UNKNOWN` as return value which also serves as documentation. test/hotspot/jtreg/compiler/c2/TestReducePhiOnCmpWithNoOptPtrCompare.java line 29: > 27: * @summary Test ConnectionGraph::reduce_phi_on_cmp when OptimizePtrCompare is disabled > 28: * @library /test/lib / > 29: * @requires vm.debug == true `OptimizePtrCompare` is a product flag. Thus, you do not need this `requires`. Suggestion: test/hotspot/jtreg/compiler/c2/TestReducePhiOnCmpWithNoOptPtrCompare.java line 30: > 28: * @library /test/lib / > 29: * @requires vm.debug == true > 30: * @requires vm.compiler2.enabled I suggest to also remove this line and additionally pass `-XX:+IgnoreUnrecognizedVMOptions` as flag to the IR framework. Suggestion: test/hotspot/jtreg/compiler/c2/TestReducePhiOnCmpWithNoOptPtrCompare.java line 46: > 44: TestFramework framework = new TestFramework(); > 45: Scenario scenario0 = new Scenario(0, "-XX:-OptimizePtrCompare"); > 46: framework.addScenarios(scenario0).start(); Since you only use one setting, you can directly use `TestFramework.runWithFlags()`. Can you also add `-XX:+VerifyReduceAllocationMerges` for additional verification? I also suggest to add a copy of scenario 0 at `AllocationMergesTests` and add `-XX:-OptimizePtrCompare` and `-XX:+VerifyReduceAllocationMerges` to the scenario. test/hotspot/jtreg/compiler/c2/TestReducePhiOnCmpWithNoOptPtrCompare.java line 53: > 51: invocations++; > 52: Random random = info.getRandom(); > 53: boolean cond = invocations % 2 == 0; Why don't you just use `random.nextBoolean()`? 
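Put together, the suggested test setup could look roughly like this; the exact flag set and driver structure are only a sketch of the review suggestions above:

import compiler.lib.ir_framework.TestFramework;

public class TestReducePhiOnCmpWithNoOptPtrCompare {
    public static void main(String[] args) {
        // A single flag combination, so no Scenario is needed; extra verification
        // comes from VerifyReduceAllocationMerges as suggested above.
        TestFramework.runWithFlags("-XX:+IgnoreUnrecognizedVMOptions",
                                   "-XX:-OptimizePtrCompare",
                                   "-XX:+VerifyReduceAllocationMerges");
    }
}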
------------- PR Review: https://git.openjdk.org/jdk/pull/26125#pullrequestreview-3004309866 PR Review Comment: https://git.openjdk.org/jdk/pull/26125#discussion_r2197647931 PR Review Comment: https://git.openjdk.org/jdk/pull/26125#discussion_r2196840228 PR Review Comment: https://git.openjdk.org/jdk/pull/26125#discussion_r2196840831 PR Review Comment: https://git.openjdk.org/jdk/pull/26125#discussion_r2197643267 PR Review Comment: https://git.openjdk.org/jdk/pull/26125#discussion_r2197645377 From eastigeevich at openjdk.org Thu Jul 10 12:58:23 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Thu, 10 Jul 2025 12:58:23 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v10] In-Reply-To: References: Message-ID: > Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. > > This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. > > Tested on Linux and MacOS with and without hsdis: > - Fastdebug: test passed > - Slowdebug: test passed. > - Release: test passed. Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: Remove lambda ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26072/files - new: https://git.openjdk.org/jdk/pull/26072/files/92a20a20..22644d5f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=08-09 Stats: 23 lines in 1 file changed: 4 ins; 17 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26072.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26072/head:pull/26072 PR: https://git.openjdk.org/jdk/pull/26072 From eastigeevich at openjdk.org Thu Jul 10 12:58:23 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Thu, 10 Jul 2025 12:58:23 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v8] In-Reply-To: References: <-7KCidOhy3DWOczvBPEZ195-QLvz451RC7QmAdiGAlQ=.ad0c0584-4b0b-4506-8944-3a4d1113fbbd@github.com> Message-ID: On Thu, 10 Jul 2025 12:24:51 GMT, Aleksey Shipilev wrote: >> I fixed this. >> I have also added a specialized lambda function to count expected instructions: >> - for disassembled code, just check a line. >> - for hex code, split and count. > > Not sure lambdas make this cleaner, TBH. Ok, the lambda is removed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2197659251 From shade at openjdk.org Thu Jul 10 13:02:07 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 10 Jul 2025 13:02:07 GMT Subject: RFR: 8231269: CompileTask::is_unloaded is slow due to JNIHandles type checks [v23] In-Reply-To: References: Message-ID: > [JDK-8163511](https://bugs.openjdk.org/browse/JDK-8163511) made the `CompileTask` improvement to avoid blocking class unloading if a relevant compile task is in queue. Current code does a sleight-of-hand to make sure the the `method*` in `CompileTask` are still valid before using them. Still a noble goal, so we keep trying to do this. 
> > The code tries to switch weak JNI handle with a strong one when it wants to capture the holder to block unloading. Since we are reusing the same field, we have to do type checks like `JNIHandles::is_weak_global_handle(_method_holder)`. Unfortunately, that type-check goes all the way to `OopStorage` allocation code to verify the handle is really allocated in the relevant `OopStorage`. This takes internal `OopStorage` locks, and thus is slow. > > This issue is clearly visible in Leyden, when there are lots of `CompileTask`-s in the queue, dumped by AOT code loader. It also does not help that `CompileTask::select_task` is effectively quadratic in number of methods in queue, so we end up calling `CompileTask::is_unloaded` very often. > > It is possible to mitigate this issue by splitting the related fields into weak and strong ones. But as Kim mentions in the bug, we should not be using JNI handles here at all, and instead go directly for relevant `OopStorage`-s. This is what this PR does, among other things that should hopefully make the whole mechanics clearer. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `compiler/classUnloading`, 100x still passes; these tests are sensitive to bugs in this code > - [x] Linux x86_64 server fastdebug, `all` > - [x] Linux AArch64 server fastdebug, `all` Aleksey Shipilev has updated the pull request incrementally with six additional commits since the last revision: - Docs touchup - Use enum class - Further simplify the API - Tune up for release builds - Move release() to destructor - Deal with things without spinlocks ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24018/files - new: https://git.openjdk.org/jdk/pull/24018/files/d5a8a27d..b27c0633 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24018&range=22 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24018&range=21-22 Stats: 162 lines in 3 files changed: 35 ins; 94 del; 33 mod Patch: https://git.openjdk.org/jdk/pull/24018.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24018/head:pull/24018 PR: https://git.openjdk.org/jdk/pull/24018 From shade at openjdk.org Thu Jul 10 13:02:09 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 10 Jul 2025 13:02:09 GMT Subject: RFR: 8231269: CompileTask::is_unloaded is slow due to JNIHandles type checks [v22] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 19:26:41 GMT, Kim Barrett wrote: >> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 37 commits: >> >> - Merge branch 'master' into JDK-8231269-compile-task-weaks >> - Merge branch 'master' into JDK-8231269-compile-task-weaks >> - Switch to mutable >> - Merge branch 'master' into JDK-8231269-compile-task-weaks >> - More touchups >> - Spin lock induces false sharing >> - Merge branch 'master' into JDK-8231269-compile-task-weaks >> - Merge branch 'master' into JDK-8231269-compile-task-weaks >> - Rename CompilerTask::is_unloaded back to avoid losing comment context >> - Simplify select_for_compilation >> - ... and 27 more: https://git.openjdk.org/jdk/compare/a41d3507...d5a8a27d > > src/hotspot/share/oops/unloadableMethodHandle.hpp line 81: > >> 79: friend class VMStructs; >> 80: private: >> 81: enum State { > > Not really a review, just a drive-by comment. 
> I think the only argument against using an enum class here is the lack of C++20's "using enums" > feature: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1099r5.html > Personally I'd prefer to just make it an enum class and scope the references. YMMV. > > Also, someday we should try to come to some consensus about the naming of constants. I don't mind converting this to enum class, done in new commit. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24018#discussion_r2197669086 From shade at openjdk.org Thu Jul 10 13:08:39 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 10 Jul 2025 13:08:39 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v10] In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 12:58:23 GMT, Evgeny Astigeevich wrote: >> Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. >> >> This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. >> >> Tested on Linux and MacOS with and without hsdis: >> - Fastdebug: test passed >> - Slowdebug: test passed. >> - Release: test passed. > > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Remove lambda Marked as reviewed by shade (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26072#pullrequestreview-3005616214 From mablakatov at openjdk.org Thu Jul 10 13:53:25 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Thu, 10 Jul 2025 13:53:25 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: References: Message-ID: > Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. > > Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still. > > The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reducion micro-benchmarks. 
> > Benchmarks results: > > Neoverse-V1 (SVE 256-bit) > > Benchmark (size) Mode master PR Units > ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms > ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms > IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms > LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms > FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms > DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms > > > Fujitsu A64FX (SVE 512-bit): > > Benchmark (size) Mode master PR Units > ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms > ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms > IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms > LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms > FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms > DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: remove the strictly-ordered FP implementation as unused ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23181/files - new: https://git.openjdk.org/jdk/pull/23181/files/d35f1089..4593a5d7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=06-07 Stats: 119 lines in 4 files changed: 8 ins; 105 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/23181.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23181/head:pull/23181 PR: https://git.openjdk.org/jdk/pull/23181 From mablakatov at openjdk.org Thu Jul 10 14:09:42 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Thu, 10 Jul 2025 14:09:42 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 05:53:44 GMT, Xiaohong Gong wrote: >> @mikabl-arm @XiaohongGong I'm a little busy these weeks before going on vacation, so I won't have time to look into this more deeply. >> >> However, I do plan to remove the auto-vectorization restrictions for simple reductions. >> https://bugs.openjdk.org/browse/JDK-8307516 >> >> You can already now disable the (bad) reduction heuristic, using `AutoVectorizationOverrideProfitability`. >> https://bugs.openjdk.org/browse/JDK-8357530 >> I published benchmark results there: >> https://github.com/openjdk/jdk/pull/25387 >> You can see that enabling simple reductions is in most cases actually profitable now. But float/double add and mul have strict reduction order, and that usually prevents vectorization from being profitable. The strict-order vector reduction is quite expensive, and it only becomes beneficial if there is a lot of other code in the loop that can be vectorized. Soon, I plan to add a cost-model, so that we can predict if vectorization is profitable. >> >> It would also be nice to actually find a benchmark where float add/mul reductions lead to a speedup with vectorization. So far I have not seen any example in my benchmarks: https://github.com/openjdk/jdk/pull/25387 If you find any such example, please let me know ;) >> >> I don't have access to any SVE machines, so I cannot help you there, unfortunately. >> >> Is this helpful to you? > >> You can see that enabling simple reductions is in most cases actually profitable now. But float/double add and mul have strict reduction order, and that usually prevents vectorization from being profitable. 
The strict-order vector reduction is quite expensive, and it only becomes beneficial if there is a lot of other code in the loop that can be vectorized. Soon, I plan to add a cost-model, so that we can predict if vectorization is profitable. >> >> It would also be nice to actually find a benchmark where float add/mul reductions lead to a speedup with vectorization. So far I have not seen any example in my benchmarks: https://github.com/openjdk/jdk/pull/25387 If you find any such example, please let me know ;) >> >> I don't have access to any SVE machines, so I cannot help you there, unfortunately. >> >>Is this helpful to you? > > Thanks for your input @eme64 ! It's really helpful to me. And it would be the right direction that using the cost model to guide whether vectorizing FP mul reduction is profitable or not. With this, I think the backend check of auto-vectorization for such operations can be removed safely. We can relay on the SLP's analysis. > > BTW, the current profitability heuristics can provide help on disabling auto-vectorization for the simple cases while enabling the complex ones. This is also helpful to us. > > I tested the performance of `VectorReduction2` with/without auto-vectorization for FP mul reductions on my SVE 128-bit machine. The performance difference is not very significant for both `floatMulSimple` and `floatMulBig`. But I guess the performance change would be different with auto-vectorization on HWs with larger vector size. As we do not have the SVE machines with larger vector size as well, we may need help from @mikabl-arm ! If the performance of `floatMulBig` is improved with auto-vectorization, I think we can remove the limitation of such reductions for auto-vectorization on AArch64. @XiaohongGong , @shqking , @eme64 , Thank you all for the insightful and detailed comments! I really appreciate the effort to explore the performance implications of auto-vectorization cases. I agree it would be helpful if @fg1417 could join this discussion. However, before diving deeper, I?d like to clarify the problem statement as we see it. I've also updated the JBS ticket accordingly, and I?m citing the key part here for visibility: > To clarify, the goal of this ticket is to improve the performance of mul reduction VectorAPI operations on SVE-capable platforms with vector lengths greater than 128 bits (e.g., Neoverse V1). The core issue is that these APIs are not being lowered to any AArch64 implementation at all on such platforms. Instead, the fallback Java implementation is used. This PR does **not** target improvements in auto-vectorization. In the context of auto-vectorization, the scope of this PR is limited to maintaining correctness and avoiding regressions. @shqking , regarding the case-2 that you highlighted - I believe this change is incidental. Prior to the patch, `Matcher::match_rule_supported_auto_vectorization()` returned false for NEON platforms (as expected) and true for 128-bit SVE. This behavior is misleading because HotSpot currently uses the **same scalar mul reduction implementation** for both NEON and SVE platforms. Since this implementation is unprofitable on both, it should have been disabled across the board. @fg1417, please correct me if I?m mistaken. This PR cannot leave `Matcher::match_rule_supported_auto_vectorization()` unchanged. If we do, HotSpot will select the strictly-ordered FP vector reduction implementation, which is not performant. A more efficient SVE-based implementation can't be used due to the strict ordering requirement. 
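To make the ordering point concrete, here is a small illustrative comparison (not one of the benchmarks quoted in this thread): the scalar loop fixes the multiplication order, which an auto-vectorizer must preserve, while the Vector API reduction is free to reassociate across lanes.

import jdk.incubator.vector.*;

class MulReductionOrderSketch {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Strictly ordered: a[0] * a[1] * ... * a[n-1], in exactly that order.
    static float scalarMul(float[] a) {
        float r = 1.0f;
        for (float v : a) {
            r *= v;
        }
        return r;
    }

    // Lane-wise accumulation plus reduceLanes(MUL): the cross-lane order is
    // unspecified, so the result may differ from the scalar loop in the last bits.
    static float vectorMul(float[] a) {
        FloatVector acc = FloatVector.broadcast(SPECIES, 1.0f);
        int i = 0;
        for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
            acc = acc.mul(FloatVector.fromArray(SPECIES, a, i));
        }
        float r = acc.reduceLanes(VectorOperators.MUL);
        for (; i < a.length; i++) {
            r *= a[i];
        }
        return r;
    }
}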
@XiaohongGong , > But I guess the performance change would be different with auto-vectorization on HWs with larger vector size. As we do not have the SVE machines with larger vector size as well, we may need help from @mikabl-arm ! Here are performance numbers for Neoverse V1 with the auto-vectorization restriction in `Matcher::match_rule_supported_auto_vectorization()` lifted (`After`). The linear strictly-ordered SVE implementation matched this way was later removed by https://github.com/openjdk/jdk/pull/23181/commits/4593a5d717024df01769625993c2b769d8dde311. | Benchmark | Before (ns/op) | After (ns/op) | Diff (%) | |:-----------------------------------------------|-----------------:|----------------:|:-----------| | VectorReduction.WithSuperword.mulRedD | 401.679 | 401.704 | ~ | | VectorReduction2.WithSuperword.doubleMulBig | 2365.554 | 7294.706 | +208.37% | | VectorReduction2.WithSuperword.doubleMulSimple | 2321.154 | 2321.207 | ~ | | VectorReduction2.WithSuperword.floatMulBig | 2356.006 | 2648.334 | +12.41% | | VectorReduction2.WithSuperword.floatMulSimple | 2321.018 | 2321.135 | ~ | Given that: - this PR focuses on VectorAPI and **not** on auto-vectorization, - and it does **not** introduce regressions in auto-vectorization performance, I suggest: - continuing the discussion on auto-vectorization separately on hotspot-dev, including @fg1417 in the loop; - moving forward with resolving the remaining VectorAPI issues and merging this PR. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-3057612901 From mchevalier at openjdk.org Thu Jul 10 14:24:21 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 10 Jul 2025 14:24:21 GMT Subject: RFR: 8361494: [IR Framework] Escape too much in replacement of placeholder Message-ID: In `RawIRNode::regex`, the call to `String::replaceAll` doesn't quote the replace string. Meaning that in the IR rule @IR(failOn = {IRNode.ALLOC_OF, "\\w+"}) The interpreted `\w` is interpreted as a group reference, and we get java.lang.IllegalArgumentException: Illegal group reference so we should write instead @IR(failOn = {IRNode.ALLOC_OF, "\\\\w+"}) To mean the interpreted string `\\w`, to mean an escaped single backslash. Same goes with `$` (used for nested classes). Since we don't want to refer to groups (and anyway, there are not in `IRNode.IS_REPLACED`), we just quote the replacement string with `java.util.regex.Matcher.quoteReplacement` to make it more usable. Note that you would still need to write `\$` since the `$` is the end of string regex, and needs to be escaped at the regex level (and not at the string, so it's not `$`, since `$` is not a special character). Before the fix, it should be `\\\$`. Phew! Regexes are bad enough, let's not escape them manually twice! In `test/hotspot/jtreg/compiler/c2/TestMergeStores.java`, that makes us save 1344 backslashes. 
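For illustration, a standalone example of the effect; the template string below is made up and is not the framework's actual regex:

import java.util.regex.Matcher;

class QuoteReplacementDemo {
    public static void main(String[] args) {
        String template = "allocation of IS_REPLACED";
        String userPostfix = "Outer$Inner";   // '$' shows up in nested class names

        // Unquoted: '$' in the replacement is parsed as a group reference and
        // replaceAll() throws IllegalArgumentException: Illegal group reference.
        // template.replaceAll("IS_REPLACED", userPostfix);

        // Quoted: '\' and '$' in the replacement are taken literally.
        String quoted = template.replaceAll("IS_REPLACED",
                                            Matcher.quoteReplacement(userPostfix));
        System.out.println(quoted);   // allocation of Outer$Inner
    }
}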
Thanks, Marc ------------- Commit messages: - Fix test/hotspot/jtreg/compiler/c2/TestMergeStores.java - quoteReplacement Changes: https://git.openjdk.org/jdk/pull/26243/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26243&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361494 Stats: 238 lines in 3 files changed: 13 ins; 0 del; 225 mod Patch: https://git.openjdk.org/jdk/pull/26243.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26243/head:pull/26243 PR: https://git.openjdk.org/jdk/pull/26243 From fjiang at openjdk.org Thu Jul 10 14:26:40 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Thu, 10 Jul 2025 14:26:40 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v2] In-Reply-To: References: Message-ID: On Fri, 27 Jun 2025 12:43:30 GMT, Galder Zamarre?o wrote: > I can't really review it since I'm not familiar with neither riscv, nor the flag nor the COH logic. Thank you! Hi @dean-long, @rwestrel, could you help to take a look at this C1 related change? TIA. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25976#issuecomment-3057674572 From duke at openjdk.org Thu Jul 10 14:32:57 2025 From: duke at openjdk.org (Guanqiang Han) Date: Thu, 10 Jul 2025 14:32:57 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp [v3] In-Reply-To: References: Message-ID: > When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. > > This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this flag is disabled. Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: - update regression test - Merge remote-tracking branch 'upstream/master' into 8361140 - update modification and add regression test - Merge remote-tracking branch 'upstream/master' into 8361140 - 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this support is disabled. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/26125/files - new: https://git.openjdk.org/jdk/pull/26125/files/2feca6a8..fd6f90f5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26125&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26125&range=01-02 Stats: 1158 lines in 25 files changed: 452 ins; 663 del; 43 mod Patch: https://git.openjdk.org/jdk/pull/26125.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26125/head:pull/26125 PR: https://git.openjdk.org/jdk/pull/26125 From duke at openjdk.org Thu Jul 10 14:40:40 2025 From: duke at openjdk.org (Guanqiang Han) Date: Thu, 10 Jul 2025 14:40:40 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp [v2] In-Reply-To: <12wwp9Vw7IZOUSXfONjmvyj3cr1YaX85XdJZvGboUUs=.0c17c79f-74c3-4e7d-98a4-a8f68bb37b8f@github.com> References: <12wwp9Vw7IZOUSXfONjmvyj3cr1YaX85XdJZvGboUUs=.0c17c79f-74c3-4e7d-98a4-a8f68bb37b8f@github.com> Message-ID: On Thu, 10 Jul 2025 12:50:03 GMT, Christian Hagedorn wrote: >> Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - update modification and add regression test >> - Merge remote-tracking branch 'upstream/master' into 8361140 >> - 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp >> >> When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. >> >> This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this support is disabled. > > Thanks for the update! I have some follow-up comments. hi @chhagedorn ?thanks a lot for your detailed and thoughtful review. I'm still learning more about this area, and your feedback has been a great help. I've updated the PR based on your suggestions . Please feel free to let me know if anything else needs improvement. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26125#issuecomment-3057725959 From iveresov at openjdk.org Thu Jul 10 14:45:38 2025 From: iveresov at openjdk.org (Igor Veresov) Date: Thu, 10 Jul 2025 14:45:38 GMT Subject: RFR: 8358580: Rethink how classes are kept alive in training data In-Reply-To: References: Message-ID: <5j7Yko4SxyZxJkDlO9itXMmbLK9W7Mz4b0IYQMplNKA=.3aace3f2-7233-4c5f-8b85-2f4d3b9458fa@github.com> On Thu, 10 Jul 2025 11:51:34 GMT, Coleen Phillimore wrote: >> src/hotspot/share/oops/trainingData.cpp line 437: >> >>> 435: KlassTrainingData::KlassTrainingData(InstanceKlass* klass) : TrainingData(klass) { >>> 436: assert(klass != nullptr, ""); >>> 437: oop* handle = oop_storage()->allocate(); >> >> I don't think you are supposed to allocate from `OopStorage` directly, that's the job for various `Handle`-s. Also, capturing the `java_mirror` does not really block the unloading, see: >> >> >> // Loading the java_mirror does not keep its holder alive. See Klass::keep_alive(). 
>> inline oop Klass::java_mirror() const {
>>   return _java_mirror.resolve();
>> }
>>
>> So the idiomatic way would be:
>>
>> _holder_mirror = OopHandle(Universe::vm_global(), klass->klass_holder());
>
> What a confusing comment, but luckily it points to Klass::keep_alive() for context. Yes, please don't allocate an OopStorage handle directly. Then the OopHandle constructor will check for native OOM.
>
> Otherwise this seems okay and better than using JNI.

@coleenp I kind of like the fact that I can get rid of a field in KTD that previously stored a handle. Perhaps that benefit justifies a direct allocation?

@shipilev I don't understand your comment about loading the mirror. I'm not merely loading it, I'm registering it as a root. It is also all happening in a safepoint-free context.

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26233#discussion_r2197939820

From mhaessig at openjdk.org Thu Jul 10 14:54:39 2025 From: mhaessig at openjdk.org (Manuel Hässig) Date: Thu, 10 Jul 2025 14:54:39 GMT Subject: RFR: 8361494: [IR Framework] Escape too much in replacement of placeholder In-Reply-To: References: Message-ID: <3BYjYX_c2LSt2wepKGpXhRC85bJfz2wAmJhCMY2CMb0=.ed132047-cf0a-40a6-8a3e-69973d0bbaf1@github.com>

On Thu, 10 Jul 2025 12:54:55 GMT, Marc Chevalier wrote:

> In `RawIRNode::regex`, the call to `String::replaceAll` doesn't quote the replace string.
>
> Meaning that in the IR rule
>
>     @IR(failOn = {IRNode.ALLOC_OF, "\\w+"})
>
> the `\w` is interpreted as a group reference, and we get
>
>     java.lang.IllegalArgumentException: Illegal group reference
>
> so we should write instead
>
>     @IR(failOn = {IRNode.ALLOC_OF, "\\\\w+"})
>
> to mean the interpreted string `\\w`, i.e. an escaped single backslash. Same goes with `$` (used for nested classes).
>
> Since we don't want to refer to groups (and anyway, there are none in `IRNode.IS_REPLACED`), we just quote the replacement string with `java.util.regex.Matcher.quoteReplacement` to make it more usable.
>
> Note that you would still need to write `\$` since `$` is the end-of-string regex and needs to be escaped at the regex level (not at the string level, where `$` is not a special character). Before the fix, it had to be `\\\$`. Phew! Regexes are bad enough, let's not escape them manually twice!
>
> In `test/hotspot/jtreg/compiler/c2/TestMergeStores.java`, that saves us 1344 backslashes.
>
> Thanks,
> Marc

Thank you for implementing this improvement @marc-chevalier! Nice to see this paper cut getting eliminated. I have a small suggestion, but this looks good to me regardless.

test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/irrule/checkattribute/parsing/RawIRNode.java line 65:

> 63:     nodeRegex = regexForVectorIRNode(nodeRegex, vmInfo, bound);
> 64: } else if (userPostfix.isValid()) {
> 65:     nodeRegex = nodeRegex.replaceAll(IRNode.IS_REPLACED, java.util.regex.Matcher.quoteReplacement(userPostfix.value()));

Perhaps you might want to `import java.util.regex.Matcher` to make it a bit more concise?

------------- Marked as reviewed by mhaessig (Committer).
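For readers unfamiliar with the replacement-string rules discussed above, here is a small, self-contained illustration (my own example, not code from the patch; the class name, the `template` string and the `Outer$Inner` postfix are made up):

    import java.util.regex.Matcher;

    public class QuoteReplacementDemo {
        public static void main(String[] args) {
            String template = "ALLOC_OF IS_REPLACED";   // stand-in for an IRNode regex template
            String userPostfix = "Outer$Inner";         // e.g. a nested-class name in an IR rule

            // Unquoted: '$I' in the replacement is parsed as a group reference, so
            // replaceAll() throws java.lang.IllegalArgumentException: Illegal group reference.
            try {
                template.replaceAll("IS_REPLACED", userPostfix);
            } catch (IllegalArgumentException e) {
                System.out.println("unquoted: " + e.getMessage());
            }

            // Quoted: the replacement is taken literally, no manual double-escaping needed.
            String result = template.replaceAll("IS_REPLACED", Matcher.quoteReplacement(userPostfix));
            System.out.println("quoted:   " + result);  // ALLOC_OF Outer$Inner
        }
    }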
PR Review: https://git.openjdk.org/jdk/pull/26243#pullrequestreview-3006077669 PR Review Comment: https://git.openjdk.org/jdk/pull/26243#discussion_r2197957431 From thartmann at openjdk.org Thu Jul 10 14:57:41 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 10 Jul 2025 14:57:41 GMT Subject: [jdk25] RFR: 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 17:55:31 GMT, Vladimir Kozlov wrote: > Hi all, > > This pull request contains a backport of commit [dedcce04](https://github.com/openjdk/jdk/commit/dedcce045013b3ff84f5ef8857e1a83f0c09f9ad) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Vladimir Kozlov on 8 Jul 2025 and was reviewed by Andrew Dinn and Matthias Baesken. > > Thanks! Marked as reviewed by thartmann (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26223#pullrequestreview-3006114062 From bkilambi at openjdk.org Thu Jul 10 15:22:44 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Thu, 10 Jul 2025 15:22:44 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v10] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 10:04:26 GMT, Jatin Bhateja wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Addressed review comments > > test/hotspot/jtreg/compiler/vectorapi/TestSelectFromTwoVectorOp.java line 234: > >> 232: >> 233: @Test >> 234: @IR(counts = {IRNode.SELECT_FROM_TWO_VECTOR_VS, IRNode.VECTOR_SIZE_8, ">0"}, > > Hi @Bhavana-Kilambi , > Kindly also include x86-specific feature checks in IR rules for this test. > > You can directly integrate attached patch. > > [select_from_ir_feature.txt](https://github.com/user-attachments/files/21034639/select_from_ir_feature.txt) Hi @jatin-bhateja , have you tested `jdk/incubator/vector` tests with your patch on x86? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2198044066 From fgao at openjdk.org Thu Jul 10 15:52:45 2025 From: fgao at openjdk.org (Fei Gao) Date: Thu, 10 Jul 2025 15:52:45 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v6] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 04:44:35 GMT, Hao Sun wrote: > Background: case-1 was set off after @fg1417 's patch [8275275: AArch64: Fix performance regression after auto-vectorization on NEON](https://github.com/openjdk/jdk/pull/10175). But case-2 was not touched. We are not sure about the reason. There was no 128b SVE machine then? Or there was some limitation of SLP on **reduction**? > > **Limitation** of SLP as mentioned in @fg1417 's patch > > > Because superword doesn't vectorize reductions unconnected with other vector packs, > > Performance data in this PR on case-2: From your provided [test data](https://github.com/openjdk/jdk/pull/23181#issuecomment-3018988067) on `Neoverse V2 (SVE 128-bit). Auto-vectorization section`, there is no obvious performance change on FP Mul Reduction benchmarks `(float|double)Mul(Big|Simple)`. As we checked the generated code of `floatMul(Big|Simple)` on Nvidia Grace machine(128b SVE2), we found that before this PR: > > * `floatMulBig` is vectorized. > * `floatMulSimple` is not vectorized because SLP determines that there is no profit. > > Discussion: should we enable case-1 and case-2? > > * if the SLP limitation on reductions is fixed? 
> * If there is no such limitation, we may consider enabling case-1 and case-2 because a) there is perf regression at least based on current performance results and b) it may provide more auto-vectorization opportunities for other packs inside the loop.
>
> It would be appreciated if @eme64 or @fg1417 could provide more inputs.

@shqking Sorry for joining the discussion a bit late. The patch [8275275: AArch64: Fix performance regression after auto-vectorization on NEON](https://github.com/openjdk/jdk/pull/10175) was intended to fix a regression on `NEON` machines while keeping the behaviour unchanged on `sve` machines, which may be a source of confusion now.

The reason I mentioned this SLP limitation in my previous patch was to clarify why the benchmark cases were written the way they were, and why I chose more complex cases instead of simpler reductions like `floatMulSimple`. The rationale was that if a case like `floatMulBig` doesn't show any performance gain, then a simpler case like `floatMulSimple` is even less likely to benefit. In general, more complex reduction cases are more likely to benefit from auto-vectorization.

@XiaohongGong thanks for testing on the `128-bit sve` machine. Since the performance difference is not significant for both `floatMulSimple` and `floatMulBig` with/without auto-vectorization, and there is a performance drop with auto-vectorization on the `256-bit sve` machine reported by @mikabl-arm, it seems reasonable that it should also be disabled on SVE. I'm looking forward to having a cost model in place, so we can safely remove these restrictions and enable SLP to handle these scenarios more flexibly.

------------- PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-3058018414

From mablakatov at openjdk.org Thu Jul 10 15:55:51 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Thu, 10 Jul 2025 15:55:51 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v4] In-Reply-To: References: <3sWLk_sAMLtcvRUjXk9hYe-K2MBQl9fH2Qg0MF7lwDk=.b8867d51-e822-43c0-93ab-58228c6eb1d5@github.com> Message-ID:

On Wed, 2 Jul 2025 01:46:18 GMT, Xiaohong Gong wrote:

>> Well, we don't match it right now for auto-vectorization as it isn't worth it performance-wise. This might change for future implementations of SVE(2). I'd still prefer to keep it so the set of instructions is complete.
>
> Removing is fine to me, as actually we do not have the case to test the correctness. Or maybe you could just do some changes locally (e.g. removing the `requires_strict_order` predication and the un-strict-order rule), and test it with VectorAPI cases?

Done: https://github.com/openjdk/jdk/commit/4593a5d717024df01769625993c2b769d8dde311

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2198125027

From duke at openjdk.org Thu Jul 10 15:59:51 2025 From: duke at openjdk.org (Samuel Chee) Date: Thu, 10 Jul 2025 15:59:51 GMT Subject: RFR: 8361890: AArch64: Removal of redundant dmb from C1 AtomicLong methods Message-ID: <60YMRP6cNslwEeVX2TWmnMYdO872xGaeShKMEj0dWGY=.2f4f504f-93d1-4bab-b721-e5c964f4c465@github.com>

The current C1 implementation of AtomicLong methods that either add or exchange (such as getAndAdd) emits a ldaddal or swpal respectively when using LSE, followed immediately by a dmb. Since ldaddal/swpal have both acquire and release semantics, this provides similar ordering guarantees to a dmb.full, so the dmb here is redundant and can be removed.
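An illustrative sketch of the redundant pairing (registers and operands below are invented for the example; this is not the emitted code quoted verbatim):

    ; Before: LSE atomic followed by a full barrier
    ldaddal x1, x2, [x0]    ; atomic add with acquire+release semantics
    dmb     ish             ; full barrier -- redundant, ldaddal already orders accesses
    ; After: the trailing dmb is simply no longer emitted
    ldaddal x1, x2, [x0]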
This is due to both clause 7 and clause 11 of the definition of Barrier-ordered-before in B2.3.7 of the DDI0487 L.a Arm Architecture Reference Manual for A-profile architecture being satisfied by the existence of a ldaddal/swpal which ensures such memory ordering guarantees. ------------- Commit messages: - 8361890: AArch64: Removal of redundant dmb from C1 AtomicLong methods Changes: https://git.openjdk.org/jdk/pull/26245/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26245&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361890 Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26245.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26245/head:pull/26245 PR: https://git.openjdk.org/jdk/pull/26245 From coleenp at openjdk.org Thu Jul 10 16:50:45 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 10 Jul 2025 16:50:45 GMT Subject: RFR: 8358580: Rethink how classes are kept alive in training data In-Reply-To: <5j7Yko4SxyZxJkDlO9itXMmbLK9W7Mz4b0IYQMplNKA=.3aace3f2-7233-4c5f-8b85-2f4d3b9458fa@github.com> References: <5j7Yko4SxyZxJkDlO9itXMmbLK9W7Mz4b0IYQMplNKA=.3aace3f2-7233-4c5f-8b85-2f4d3b9458fa@github.com> Message-ID: On Thu, 10 Jul 2025 14:42:47 GMT, Igor Veresov wrote: >> What a confusing comment, but luckily it points to Klass::keep_alive() for context. Yes, please don't allocate an OopStorage handle directly. Then the OopHandle constructor will check for native oom. >> >> Otherwise this seems okay and better than using jni. > > @coleenp I kind of like the fact that I can get rid of a field in KTD that previously stored a handle. Perhaps that benefit justifies a direct allocation? > > @shipilev I don't understand your comment about loading the mirror. I'm not merely loading it, I'm registering it as a root. It is also all happening in a safepoint-free context. Then just have OopHandle handle = OopHandle(Universe::vm_global(), klass->klass_holder()); or similar. You don't have to store it unless you're going to release it. But footprint seems unimportant in this case. Also, maybe OopStorage::allocate() should just be friends with OopHandle. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26233#discussion_r2198233387 From kvn at openjdk.org Thu Jul 10 17:04:40 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 10 Jul 2025 17:04:40 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v6] In-Reply-To: <-rnlrm6PHRZeO1izbXh5nOrm368YKrsFft1u6SHXzWA=.9c8e6646-0b72-4705-895e-f795f74f3906@github.com> References: <-rnlrm6PHRZeO1izbXh5nOrm368YKrsFft1u6SHXzWA=.9c8e6646-0b72-4705-895e-f795f74f3906@github.com> Message-ID: On Thu, 10 Jul 2025 06:13:27 GMT, Marc Chevalier wrote: >> When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 >> >> This is enforced by restoring the old state, like in >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 >> >> That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. 
For instance, we can obtain something such as: >> >> ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) >> >> >> Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. >> >> Another situation is somewhat worse, when happening during parsing. It can lead to such cases: >> >> ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) >> >> The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? >> >> This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 >> >> And here there is the challenge: >> - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) >> ... > > Marc Chevalier has updated the pull request incrementally with two additional commits since the last revision: > > - Forgot to destruct_map_clone > - +'_' and ctor init Looks good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25936#pullrequestreview-3006571921 From iveresov at openjdk.org Thu Jul 10 17:05:40 2025 From: iveresov at openjdk.org (Igor Veresov) Date: Thu, 10 Jul 2025 17:05:40 GMT Subject: RFR: 8358580: Rethink how classes are kept alive in training data In-Reply-To: References: <5j7Yko4SxyZxJkDlO9itXMmbLK9W7Mz4b0IYQMplNKA=.3aace3f2-7233-4c5f-8b85-2f4d3b9458fa@github.com> Message-ID: On Thu, 10 Jul 2025 16:48:17 GMT, Coleen Phillimore wrote: >> @coleenp I kind of like the fact that I can get rid of a field in KTD that previously stored a handle. Perhaps that benefit justifies a direct allocation? >> >> @shipilev I don't understand your comment about loading the mirror. I'm not merely loading it, I'm registering it as a root. It is also all happening in a safepoint-free context. > > Then just have OopHandle handle = OopHandle(Universe::vm_global(), klass->klass_holder()); > > or similar. You don't have to store it unless you're going to release it. But footprint seems unimportant in this case. Also, maybe OopStorage::allocate() should just be friends with OopHandle. > > Edit: looks like it's also used by string deduplication but this shouldn't call OopStorage::allocate directly. Ah, sorry, I was somehow under the impression that OopHandle would have a destructor that release the handle. 
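As a side note to the exchange above, a tiny sketch of the OopHandle lifecycle being discussed (illustration only; `obj` is a placeholder and error handling is omitted):

    // OopHandle is a thin wrapper around an OopStorage slot. It has no destructor
    // that frees the slot; releasing is an explicit call.
    OopHandle h(Universe::vm_global(), obj);   // allocates a slot in vm_global storage
    oop o = h.resolve();                       // load the object through the handle
    h.release(Universe::vm_global());          // must be called explicitly to free the slot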
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26233#discussion_r2198257057 From kvn at openjdk.org Thu Jul 10 17:07:46 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 10 Jul 2025 17:07:46 GMT Subject: [jdk25] RFR: 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 17:55:31 GMT, Vladimir Kozlov wrote: > Hi all, > > This pull request contains a backport of commit [dedcce04](https://github.com/openjdk/jdk/commit/dedcce045013b3ff84f5ef8857e1a83f0c09f9ad) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Vladimir Kozlov on 8 Jul 2025 and was reviewed by Andrew Dinn and Matthias Baesken. > > Thanks! Thank you, Aleksey and Tobias, for reviews. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26223#issuecomment-3058254842 From kvn at openjdk.org Thu Jul 10 17:07:47 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 10 Jul 2025 17:07:47 GMT Subject: [jdk25] Integrated: 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 17:55:31 GMT, Vladimir Kozlov wrote: > Hi all, > > This pull request contains a backport of commit [dedcce04](https://github.com/openjdk/jdk/commit/dedcce045013b3ff84f5ef8857e1a83f0c09f9ad) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Vladimir Kozlov on 8 Jul 2025 and was reviewed by Andrew Dinn and Matthias Baesken. > > Thanks! This pull request has now been integrated. Changeset: e92f387a Author: Vladimir Kozlov URL: https://git.openjdk.org/jdk/commit/e92f387ab5db8245778c19a35f08079dfa46453c Stats: 7 lines in 2 files changed: 4 ins; 0 del; 3 mod 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() Reviewed-by: shade, thartmann Backport-of: dedcce045013b3ff84f5ef8857e1a83f0c09f9ad ------------- PR: https://git.openjdk.org/jdk/pull/26223 From iveresov at openjdk.org Thu Jul 10 17:10:22 2025 From: iveresov at openjdk.org (Igor Veresov) Date: Thu, 10 Jul 2025 17:10:22 GMT Subject: RFR: 8358580: Rethink how classes are kept alive in training data [v2] In-Reply-To: References: Message-ID: > Use OopStorage directly instead of JNI handles. Note that we never destroy TrainingData objects, so we don't need to concern ourselves with freeing the OopStorage entries. Also, keeping the klasses alive is only necessary during the training run. During the replay the klasses TD objects refer to are always alive. 
Igor Veresov has updated the pull request incrementally with one additional commit since the last revision: Address review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26233/files - new: https://git.openjdk.org/jdk/pull/26233/files/e262230b..80d33dab Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26233&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26233&range=00-01 Stats: 10 lines in 2 files changed: 0 ins; 6 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/26233.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26233/head:pull/26233 PR: https://git.openjdk.org/jdk/pull/26233 From coleenp at openjdk.org Thu Jul 10 17:14:38 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 10 Jul 2025 17:14:38 GMT Subject: RFR: 8358580: Rethink how classes are kept alive in training data [v2] In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 17:10:22 GMT, Igor Veresov wrote: >> Use OopStorage directly instead of JNI handles. Note that we never destroy TrainingData objects, so we don't need to concern ourselves with freeing the OopStorage entries. Also, keeping the klasses alive is only necessary during the training run. During the replay the klasses TD objects refer to are always alive. > > Igor Veresov has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments Looks good. thanks for the comment. ------------- Marked as reviewed by coleenp (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26233#pullrequestreview-3006599873 From cslucas at openjdk.org Thu Jul 10 17:21:40 2025 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Thu, 10 Jul 2025 17:21:40 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp [v3] In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 14:32:57 GMT, Guanqiang Han wrote: >> When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. >> >> This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this flag is disabled. > > Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - update regression test > - Merge remote-tracking branch 'upstream/master' into 8361140 > - update modification and add regression test > - Merge remote-tracking branch 'upstream/master' into 8361140 > - 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp > > When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. > > This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. 
Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this support is disabled. Marked as reviewed by cslucas (Committer). test/hotspot/jtreg/compiler/c2/TestReducePhiOnCmpWithNoOptPtrCompare.java line 47: > 45: @Run(test = {"testReducePhiOnCmp_C2"}) > 46: public void runner(RunInfo info) { > 47: invocations++; I don't think you need this variable anymroe. ------------- PR Review: https://git.openjdk.org/jdk/pull/26125#pullrequestreview-3006614681 PR Review Comment: https://git.openjdk.org/jdk/pull/26125#discussion_r2198280733 From bulasevich at openjdk.org Thu Jul 10 17:46:20 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 10 Jul 2025 17:46:20 GMT Subject: [jdk25] RFR: 8358183: [JVMCI] crash accessing nmethod::jvmci_name in CodeCache::aggregate Message-ID: <2-vD19USVejKcQe9AuVa1tq9RCGdspEJ1JVbt5BVI_4=.8e45bdfc-8142-4408-b960-6b95aa338e53@github.com> This is the backport of the JVMCI metadata crash fix. Issue: When flushing nmethods via CodeBlob::purge(), the JVMCI metadata was freed (mutable_data) but its size fields remained non-zero. As a result, invoking heap analytics via jcmd Compiler.CodeHeap_Analytics still walks the purged metadata and calls jvmci_name() on arbitrary memory, leading to intermittent crashes Fix: Extend CodeBlob::purge() to zero out the _mutable_data_size, _relocation_size, and _metadata_size fields so that after a purge jvmci_data_size() returns 0 and CompileBroker::print_heapinfo() skips any JVMCI metadata ------------- Commit messages: - Backport 74822ce12acaf9816aa49b75ab5817ced3710776 Changes: https://git.openjdk.org/jdk/pull/26248/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26248&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8358183 Stats: 3 lines in 2 files changed: 3 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26248.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26248/head:pull/26248 PR: https://git.openjdk.org/jdk/pull/26248 From dlunden at openjdk.org Thu Jul 10 18:25:43 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Thu, 10 Jul 2025 18:25:43 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: <375qzJdzdKWky3-EHgnSiksFYbJIPvfR27xzUTF6vRA=.9eb63509-1ec6-4cc6-bff0-782290866a4d@github.com> On Wed, 9 Jul 2025 01:23:43 GMT, Xiaohong Gong wrote: >> ### Background >> On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. >> >> For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. >> >> To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. >> >> ### Impact Analysis >> #### 1. Vector types >> Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. >> >> #### 2. 
Vector API
>> No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length.
>>
>> #### 3. Auto-vectorization
>> Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks.
>>
>> #### 4. Codegen of vector nodes
>> NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored.
>>
>> Details:
>> - Lanewise vector operations are unaffected as explained above.
>> - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE).
>> - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, addin...
>
> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision:
>
>   Disable auto-vectorization of double to short conversion for NEON and update tests

@XiaohongGong The code changes look sane, although, for the record, I'm not that familiar with this part of HotSpot. Testing also looks good, details below.

### Testing

- [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/16165935815)
- `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64.
- Performance testing on DaCapo, Renaissance, SPECjbb, and SPECjvm on Linux x64 and macOS aarch64. No observable improvements nor regressions.

------------- Marked as reviewed by dlunden (Committer). PR Review: https://git.openjdk.org/jdk/pull/26057#pullrequestreview-3006812696

From yadongwang at openjdk.org Thu Jul 10 18:33:51 2025 From: yadongwang at openjdk.org (Yadong Wang) Date: Thu, 10 Jul 2025 18:33:51 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding Message-ID:

The bug is that the predicate rule of immByteMapBase causes a ConP node for an oop to incorrectly match as byte_map_base when the placeholder JNI handle happens to be allocated at the address of byte_map_base.

C2 uses JNI handles as placeholders to encode constant oops, and one such handle may be located at the address of byte_map_base, which is not memory reserved by CardTable. This is possible because JNIHandleBlocks are allocated by malloc.

    // The assembler store_check code will do an unsigned shift of the oop,
    // then add it to _byte_map_base, i.e.
    //
    //   _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift)
    _byte_map = (CardValue*) rs.base();
    _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift);

In the aarch64 port, C2 will incorrectly match a ConP for an oop to the ConP for byte_map_base via the immByteMapBase operand.
    // Card Table Byte Map Base
    operand immByteMapBase()
    %{
      // Get base of card map
      predicate((jbyte*)n->get_ptr() ==
                ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base);
      match(ConP);

      op_cost(0);
      format %{ %}
      interface(CONST_INTER);
    %}

    // Load Byte Map Base Constant
    instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con)
    %{
      match(Set dst con);

      ins_cost(INSN_COST);
      format %{ "adr $dst, $con\t# Byte Map Base" %}

      ins_encode(aarch64_enc_mov_byte_map_base(dst, con));

      ins_pipe(ialu_imm);
    %}

As shown below, typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address:

    0xffff25caf08c: ldaxr x8, [x11]
    0xffff25caf090: cmp x10, x8
    0xffff25caf094: b.ne 0xffff25caf0a0 // b.any
    0xffff25caf098: stlxr w8, x28, [x11]
    0xffff25caf09c: cbnz w8, 0xffff25caf08c
    0xffff25caf0a0: orr x11, xzr, #0x3
    0xffff25caf0a4: str x11, [x13]
    0xffff25caf0a8: b.eq 0xffff25caef80 // b.none
    0xffff25caf0ac: str x14, [sp]
    0xffff25caf0b0: add x2, sp, #0x20
    0xffff25caf0b4: adrp x1, 0xffff21730000
    0xffff25caf0b8: bl 0xffff256fffc0
    0xffff25caf0bc: ldr x14, [sp]
    0xffff25caf0c0: b 0xffff25caef80
    0xffff25caf0c4: add x13, sp, #0x20
    0xffff25caf0c8: adrp x12, 0xffff21730000
    0xffff25caf0cc: ldr x10, [x13]
    0xffff25caf0d0: cmp x10, xzr
    0xffff25caf0d4: b.eq 0xffff25caf130 // b.none
    0xffff25caf0d8: ldr x11, [x12]
    0xffff25caf0dc: tbnz w10, #1, 0xffff25caf0f8
    0xffff25caf0e0: ldxr x11, [x12]
    0xffff25caf0e4: cmp x13, x11
    0xffff25caf0e8: b.ne 0xffff25caf130 // b.any
    0xffff25caf0ec: stlxr w11, x10, [x12]
    0xffff25caf0f0: cbz w11, 0xffff25caf130
    0xffff25caf0f4: b 0xffff25caf0e0

For details see https://mail.openjdk.org/pipermail/aarch64-port-dev/2025-July/016021.html.

-------------

Commit messages:
 - 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding

Changes: https://git.openjdk.org/jdk/pull/26249/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26249&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8361892
Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod
Patch: https://git.openjdk.org/jdk/pull/26249.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/26249/head:pull/26249
PR: https://git.openjdk.org/jdk/pull/26249
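Purely as an illustration of the kind of one-line guard the PR describes (this is my own sketch, not the actual change in aarch64.ad, and the class names follow the older code quoted above): the predicate could refuse the match whenever the ConP carries a constant oop, so an oop whose JNI-handle address happens to equal byte_map_base can never be encoded as the card table base.

    operand immByteMapBase()
    %{
      // Sketch only: never treat a constant oop as the card table base,
      // even if the addresses coincide.
      predicate(n->bottom_type()->isa_oopptr() == nullptr &&
                (jbyte*)n->get_ptr() ==
                ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base);
      match(ConP);
      // ... op_cost, format, interface unchanged ...
    %}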
> > // Card Table Byte Map Base > operand immByteMapBase() > %{ > // Get base of card map > predicate((jbyte*)n->get_ptr() == > ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); > match(ConP); > > op_cost(0); > format %{ %} > interface(CONST_INTER); > %} > > // Load Byte Map Base Constant > instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) > %{ > match(Set dst con); > > ins_cost(INSN_COST); > format %{ "adr $dst, $con\t# Byte Map Base" %} > > ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); > > ins_pipe(ialu_imm); > %} > > As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: > 0xffff25caf08c: ldaxr x8, [x11] > 0xffff25caf090: cmp x10, x8 > 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any > 0xffff25caf098: stlxr w8, x28, [x11] > 0xffff25caf09c: cbnz w8, 0xffff25caf08c > 0xffff25caf0a0: orr x11, xzr, #0x3 > 0xffff25caf0a4: str x11, [x13] > 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none > 0xffff25caf0ac: str x14, [sp] > 0xffff25caf0b0: add x2, sp, #0x20 > 0xffff25caf0b4: adrp x1, 0xffff21730000 > 0xffff25caf0b8: bl 0xffff256fffc0 > 0xffff25caf0bc: ldr x14, [sp] > 0xffff25caf0c0: b 0xffff25caef80 > 0xffff25caf0c4: add x13, sp, #0x20 > 0xffff25caf0c8: adrp x12, 0xffff21730000 > 0xffff25caf0cc: ldr x10, [x13] > 0xffff25caf0d0: cmp x10, xzr > 0xffff25caf0d4: b.eq 0xffff25caf130 // b.none > 0xffff25caf0d8: ldr x11, [x12] > 0xffff25caf0dc: tbnz w10, #1, 0xffff25caf0f... @theRealAph @adinn Could you help review it? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3058497814 From dlunden at openjdk.org Thu Jul 10 19:10:43 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Thu, 10 Jul 2025 19:10:43 GMT Subject: RFR: 8342941: IGV: Add new graph dumps for post loop, empty loop removal, and one iteration removal [v3] In-Reply-To: <_p5Jj77u1VyyW0eVneXqeNjmngTvSvFi94_FALv6swk=.d4e5aec1-dd73-48ed-8d7f-3080207be763@github.com> References: <_p5Jj77u1VyyW0eVneXqeNjmngTvSvFi94_FALv6swk=.d4e5aec1-dd73-48ed-8d7f-3080207be763@github.com> Message-ID: <-qvrPep0_75olkxXj9BT74oMIHTfxwgshrHnqQC9BuU=.501e3840-2b5d-4c7c-b2fe-891a167c66d8@github.com> On Mon, 7 Jul 2025 23:04:51 GMT, Saranya Natarajan wrote: >> This changeset adds BEFORE/AFTER graph dumps for creating a post loop (`insert_post_loop()`), removing an empty loop (`do_remove_empty_loop()`), and removing a one iteration loop (`do_one_iteration_loop()`). >> >> Changes: >> - Added `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` for dumping graphs before and after `insert_post_loop()`. >> - Added `BEFORE_REMOVE_EMPTY_LOOP` and `AFTER_REMOVE_EMPTY_LOOP` for dumping graphs before and after `do_remove_empty_loop()`. >> - Added `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` for dumping graphs before and after `do_one_iteration_loop()`. >> >> Below are sample screenshots (IGV print level 4 ) mainly showing the new phase . >> 1. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` >> ![image](https://github.com/user-attachments/assets/1661cede-5d70-4e0d-abec-3d091c7675c8) >> 2. 
`BEFORE_POST_LOOP` and `AFTER_POST_LOOP` with SuperWordLoopUnrollAnalysis enabled >> ![image](https://github.com/user-attachments/assets/6a22e6f0-4e6c-4e9d-8b6b-2bf75fac783d) >> 3.` BEFORE_REMOVE_EMPTY_LOOP `and `AFTER_REMOVE_EMPTY_LOOP` >> ![image](https://github.com/user-attachments/assets/3281f00b-575e-4604-83dd-831037d8dd47) >> 4. `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` >> ![image](https://github.com/user-attachments/assets/efddbc9a-64f7-403d-acfe-330d75a00911) >> >> Question to reviewers: >> Are the new compiler phases OK, or should we change anything? >> >> Testing: >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. >> Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) > > Saranya Natarajan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: > > - fix 2 of review > - Merge master > - Addressing review comments > - Initial Fix Looks good! Just one (very) minor comment. src/hotspot/share/opto/phasetype.hpp line 83: > 81: flags(AFTER_REMOVE_EMPTY_LOOP, "After Remove Empty Loop") \ > 82: flags(BEFORE_ONE_ITERATION_LOOP, "Before Replacing One Iteration Loop") \ > 83: flags(AFTER_ONE_ITERATION_LOOP, "After Replacing One Iteration Loop") \ Very much a nit, but I think this should be "One-Iteration Loop". Or, is it in fact one _iteration loop_ (as it reads now)? Looking at the code, I think it is the former. @chhagedorn can maybe clarify? This is not specific to your changeset, but also appears in existing source code comments. Maybe a good opportunity to clean this up everywhere? Also, maybe "Replacing" should be "Replace"? Seems to better fit the style used for other phase names. ------------- Marked as reviewed by dlunden (Committer). PR Review: https://git.openjdk.org/jdk/pull/25756#pullrequestreview-3006984528 PR Review Comment: https://git.openjdk.org/jdk/pull/25756#discussion_r2198498500 From aph at openjdk.org Thu Jul 10 21:45:47 2025 From: aph at openjdk.org (Andrew Haley) Date: Thu, 10 Jul 2025 21:45:47 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 18:30:18 GMT, Yadong Wang wrote: > @theRealAph @adinn Could you help review it? Can't you just delete the special handling of byte_map_base? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3059199408 From iveresov at openjdk.org Thu Jul 10 22:41:39 2025 From: iveresov at openjdk.org (Igor Veresov) Date: Thu, 10 Jul 2025 22:41:39 GMT Subject: RFR: 8358580: Rethink how classes are kept alive in training data [v2] In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 17:10:22 GMT, Igor Veresov wrote: >> Use OopStorage directly instead of JNI handles. Note that we never destroy TrainingData objects, so we don't need to concern ourselves with freeing the OopStorage entries. Also, keeping the klasses alive is only necessary during the training run. During the replay the klasses TD objects refer to are always alive. > > Igor Veresov has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments Testing is good. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26233#issuecomment-3059360956 From dlong at openjdk.org Thu Jul 10 22:45:42 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 10 Jul 2025 22:45:42 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v4] In-Reply-To: <30vvzTU6W2p0YpB8Z9bSfO9ajO_fHh79q9cX1G3gz3k=.521b26d7-b606-4fdc-bdcf-41fd6c4891cc@github.com> References: <30vvzTU6W2p0YpB8Z9bSfO9ajO_fHh79q9cX1G3gz3k=.521b26d7-b606-4fdc-bdcf-41fd6c4891cc@github.com> Message-ID: On Wed, 9 Jul 2025 10:07:31 GMT, Feilong Jiang wrote: >> Hi, please consider. >> [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. >> The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. >> If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. >> This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. >> We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. >> >> This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. >> The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. >> >> Test on linux-riscv64: >> - [x] Tier1-3 >> >> JMH data on P550 SBC for reference (w/o and w/ the patch): >> >> Before: >> >> Without COH: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op >> ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op >> ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op >> ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op >> ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op >> ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op >> ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op >> ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op >> ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op >> ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op >> ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op >> ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op >> >> ------------------------------------------------------------------------- >> With COH: >> >> Benchmark (size) Mode Cnt Score Error Un... > > Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains six additional commits since the last revision: > > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - Revert RISCV Macro modification > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses > - riscv: fix c1 primitive array clone intrinsic regression src/hotspot/cpu/riscv/c1_LIRGenerator_riscv.cpp line 775: > 773: arraycopy_helper(x, &flags, &expected_type); > 774: if (x->check_flag(Instruction::OmitChecksFlag)) { > 775: flags = (flags & LIR_OpArrayCopy::unaligned); Should be LIR_OpArrayCopy::unaligned|LIR_OpArrayCopy::overlapping? See below. src/hotspot/share/c1/c1_LIR.cpp line 353: > 351: , _expected_type(expected_type) > 352: , _flags(flags) { > 353: #if defined(X86) || defined(AARCH64) || defined(S390) || defined(RISCV64) || defined(PPC64) Do we still need this #if? It would be nice if we can eventually remove it, but I guess arm32 support is missing. src/hotspot/share/c1/c1_LIR.cpp line 354: > 352: , _flags(flags) { > 353: #if defined(X86) || defined(AARCH64) || defined(S390) || defined(RISCV64) || defined(PPC64) > 354: if (expected_type != nullptr && ((flags & ~LIR_OpArrayCopy::unaligned) == 0)) { I was concerned that this is platform-specific, but I checked and all platforms can handle unaligned or overlapping w/o using the stub. So maybe this should be using LIR_OpArrayCopy::unaligned|LIR_OpArrayCopy::overlapping? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2198916477 PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2198915263 PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2198911343 From duke at openjdk.org Fri Jul 11 00:37:10 2025 From: duke at openjdk.org (Guanqiang Han) Date: Fri, 11 Jul 2025 00:37:10 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp [v4] In-Reply-To: References: Message-ID: > When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. > > This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this flag is disabled. Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision: - Remove the unused variable - Merge remote-tracking branch 'upstream/master' into 8361140 - update regression test - Merge remote-tracking branch 'upstream/master' into 8361140 - update modification and add regression test - Merge remote-tracking branch 'upstream/master' into 8361140 - 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. 
This violates the intended usage of the flag and leads to unexpected crashes. This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this support is disabled. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26125/files - new: https://git.openjdk.org/jdk/pull/26125/files/fd6f90f5..0e9aa956 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26125&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26125&range=02-03 Stats: 377 lines in 16 files changed: 113 ins; 179 del; 85 mod Patch: https://git.openjdk.org/jdk/pull/26125.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26125/head:pull/26125 PR: https://git.openjdk.org/jdk/pull/26125 From duke at openjdk.org Fri Jul 11 00:40:39 2025 From: duke at openjdk.org (Guanqiang Han) Date: Fri, 11 Jul 2025 00:40:39 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp [v3] In-Reply-To: References: Message-ID: <1o9SmpZlm3T6YV6qDof8tAyDgDY1_f1vsgu-9IeC3jU=.4dfbea5d-a66b-4332-8edc-e7a6a0d6871b@github.com> On Thu, 10 Jul 2025 17:17:04 GMT, Cesar Soares Lucas wrote: >> Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: >> >> - update regression test >> - Merge remote-tracking branch 'upstream/master' into 8361140 >> - update modification and add regression test >> - Merge remote-tracking branch 'upstream/master' into 8361140 >> - 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp >> >> When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. >> >> This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this support is disabled. > > test/hotspot/jtreg/compiler/c2/TestReducePhiOnCmpWithNoOptPtrCompare.java line 47: > >> 45: @Run(test = {"testReducePhiOnCmp_C2"}) >> 46: public void runner(RunInfo info) { >> 47: invocations++; > > I don't think you need this variable anymroe. hi @JohnTortugo , Thanks for the feedback! I overlooked a small detail, I've fixed it now. Let me know if there's anything else. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26125#discussion_r2199100535 From xgong at openjdk.org Fri Jul 11 01:29:44 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 11 Jul 2025 01:29:44 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 05:53:44 GMT, Xiaohong Gong wrote: >> @mikabl-arm @XiaohongGong I'm a little busy these weeks before going on vacation, so I won't have time to look into this more deeply. >> >> However, I do plan to remove the auto-vectorization restrictions for simple reductions. 
>> https://bugs.openjdk.org/browse/JDK-8307516 >> >> You can already now disable the (bad) reduction heuristic, using `AutoVectorizationOverrideProfitability`. >> https://bugs.openjdk.org/browse/JDK-8357530 >> I published benchmark results there: >> https://github.com/openjdk/jdk/pull/25387 >> You can see that enabling simple reductions is in most cases actually profitable now. But float/double add and mul have strict reduction order, and that usually prevents vectorization from being profitable. The strict-order vector reduction is quite expensive, and it only becomes beneficial if there is a lot of other code in the loop that can be vectorized. Soon, I plan to add a cost-model, so that we can predict if vectorization is profitable. >> >> It would also be nice to actually find a benchmark where float add/mul reductions lead to a speedup with vectorization. So far I have not seen any example in my benchmarks: https://github.com/openjdk/jdk/pull/25387 If you find any such example, please let me know ;) >> >> I don't have access to any SVE machines, so I cannot help you there, unfortunately. >> >> Is this helpful to you? > >> You can see that enabling simple reductions is in most cases actually profitable now. But float/double add and mul have strict reduction order, and that usually prevents vectorization from being profitable. The strict-order vector reduction is quite expensive, and it only becomes beneficial if there is a lot of other code in the loop that can be vectorized. Soon, I plan to add a cost-model, so that we can predict if vectorization is profitable. >> >> It would also be nice to actually find a benchmark where float add/mul reductions lead to a speedup with vectorization. So far I have not seen any example in my benchmarks: https://github.com/openjdk/jdk/pull/25387 If you find any such example, please let me know ;) >> >> I don't have access to any SVE machines, so I cannot help you there, unfortunately. >> >>Is this helpful to you? > > Thanks for your input @eme64 ! It's really helpful to me. And it would be the right direction that using the cost model to guide whether vectorizing FP mul reduction is profitable or not. With this, I think the backend check of auto-vectorization for such operations can be removed safely. We can relay on the SLP's analysis. > > BTW, the current profitability heuristics can provide help on disabling auto-vectorization for the simple cases while enabling the complex ones. This is also helpful to us. > > I tested the performance of `VectorReduction2` with/without auto-vectorization for FP mul reductions on my SVE 128-bit machine. The performance difference is not very significant for both `floatMulSimple` and `floatMulBig`. But I guess the performance change would be different with auto-vectorization on HWs with larger vector size. As we do not have the SVE machines with larger vector size as well, we may need help from @mikabl-arm ! If the performance of `floatMulBig` is improved with auto-vectorization, I think we can remove the limitation of such reductions for auto-vectorization on AArch64. > @XiaohongGong , @shqking , @eme64 , > > Thank you all for the insightful and detailed comments! I really appreciate the effort to explore the performance implications of auto-vectorization cases. I agree it would be helpful if @fg1417 could join this discussion. However, before diving deeper, I?d like to clarify the problem statement as we see it. 
I've also updated the JBS ticket accordingly, and I?m citing the key part here for visibility: > > > To clarify, the goal of this ticket is to improve the performance of mul reduction VectorAPI operations on SVE-capable platforms with vector lengths greater than 128 bits (e.g., Neoverse V1). The core issue is that these APIs are not being lowered to any AArch64 implementation at all on such platforms. Instead, the fallback Java implementation is used. > > This PR does **not** target improvements in auto-vectorization. In the context of auto-vectorization, the scope of this PR is limited to maintaining correctness and avoiding regressions. > > @shqking , regarding the case-2 that you highlighted - I believe this change is incidental. Prior to the patch, `Matcher::match_rule_supported_auto_vectorization()` returned false for NEON platforms (as expected) and true for 128-bit SVE. This behavior is misleading because HotSpot currently uses the **same scalar mul reduction implementation** for both NEON and SVE platforms. Since this implementation is unprofitable on both, it should have been disabled across the board. @fg1417, please correct me if I?m mistaken. > > This PR cannot leave `Matcher::match_rule_supported_auto_vectorization()` unchanged. If we do, HotSpot will select the strictly-ordered FP vector reduction implementation, which is not performant. A more efficient SVE-based implementation can't be used due to the strict ordering requirement. > > @XiaohongGong , > > > But I guess the performance change would be different with auto-vectorization on HWs with larger vector size. As we do not have the SVE machines with larger vector size as well, we may need help from @mikabl-arm ! > > Here are performance numbers for Neoverse V1 with the auto-vectorization restriction in `Matcher::match_rule_supported_auto_vectorization()` lifted (`After`). The linear strictly-ordered SVE implementation matched this way was later removed by [4593a5d](https://github.com/openjdk/jdk/commit/4593a5d717024df01769625993c2b769d8dde311). > > ``` > | Benchmark | Before (ns/op) | After (ns/op) | Diff (%) | > |:-----------------------------------------------|-----------------:|----------------:|:-----------| > | VectorReduction.WithSuperword.mulRedD | 401.679 | 401.704 | ~ | > | VectorReduction2.WithSuperword.doubleMulBig | 2365.554 | 7294.706 | +208.37% | > | VectorReduction2.WithSuperword.doubleMulSimple | 2321.154 | 2321.207 | ~ | > | VectorReduction2.WithSuperword.floatMulBig | 2356.006 | 2648.334 | +12.41% | > | VectorReduction2.WithSuperword.floatMulSimple | 2321.018 | 2321.135 | ~ | > ``` > > Given that: > > * this PR focuses on VectorAPI and **not** on auto-vectorization, > * and it does **not** introduce regressions in auto-vectorization performance, > > I suggest: > > * continuing the discussion on auto-vectorization separately on hotspot-dev, including @fg1417 in the loop; > * moving forward with resolving the remaining VectorAPI issues and merging this PR. I'm fine with removing the strict-ordered rules and disable these operations for SLP since it does not benefit performance. Thanks for your testing and updating! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-3059883936 From yadongwang at openjdk.org Fri Jul 11 01:34:38 2025 From: yadongwang at openjdk.org (Yadong Wang) Date: Fri, 11 Jul 2025 01:34:38 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 21:43:31 GMT, Andrew Haley wrote: > > @theRealAph @adinn Could you help review it? > > Can't you just delete the special handling of byte_map_base? It's fine to just delete immByteMapBase, and then ConP for byte_map_base will match immP in aarch64_enc_mov_p, and using adrp+add if valid address and using mov if unvalid. enc_class aarch64_enc_mov_p(iRegP dst, immP src) %{ Register dst_reg = as_Register($dst$$reg); address con = (address)$src$$constant; if (con == nullptr || con == (address)1) { ShouldNotReachHere(); } else { relocInfo::relocType rtype = $src->constant_reloc(); if (rtype == relocInfo::oop_type) { __ movoop(dst_reg, (jobject)con); } else if (rtype == relocInfo::metadata_type) { __ mov_metadata(dst_reg, (Metadata*)con); } else { assert(rtype == relocInfo::none, "unexpected reloc type"); if (! __ is_valid_AArch64_address(con) || con < (address)(uintptr_t)os::vm_page_size()) { __ mov(dst_reg, con); } else { uint64_t offset; __ adrp(dst_reg, con, offset); __ add(dst_reg, dst_reg, offset); } } } %} I can choose to delete it directly, but do you think the current changes in pr are safe? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3059891654 From xgong at openjdk.org Fri Jul 11 01:42:44 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 11 Jul 2025 01:42:44 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: References: Message-ID: <6H9X-NXKOGd9BZVhTDiKNf7OO2KQTciRKGnXY-5C9yA=.e25f9e69-44c2-48d1-b4e3-cb8f1af79546@github.com> On Thu, 10 Jul 2025 13:53:25 GMT, Mikhail Ablakatov wrote: >> Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. >> >> Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still. >> >> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reducion micro-benchmarks. 
>> >> Benchmarks results: >> >> Neoverse-V1 (SVE 256-bit) >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms >> >> >> Fujitsu A64FX (SVE 512-bit): >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms > > Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: > > remove the strictly-ordered FP implementation as unused src/hotspot/share/opto/loopopts.cpp line 4715: > 4713: Node* last_accumulator = phi->in(2); > 4714: Node* post_loop_reduction = ReductionNode::make(sopc, nullptr, init, last_accumulator, bt, > 4715: /* requires_strict_order */ false); Why do you change this? Before it requires strict order, but now it is false. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2199260581 From xgong at openjdk.org Fri Jul 11 01:45:57 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 11 Jul 2025 01:45:57 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: <5__MowepmHboqoXdzv0AEbzawJobhlsMoAHjAmYrCno=.b42d0d72-54bd-4ccc-a29b-98877bbedcbb@github.com> On Thu, 10 Jul 2025 01:40:06 GMT, Xiaohong Gong wrote: >> Thanks for making the changes. Looks good to me. > >> Thanks for making the changes. Looks good to me. > > Thanks a lot for your review! > @XiaohongGong The code changes look sane, although, for the record, I'm not that familiar with this part of HotSpot. Testing also looks good, details below. > > ### Testing > * [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/16165935815) > * `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > * Performance testing on DaCapo, Renaissance, SPECjbb, and SPECjvm on Linux x64 and macOS aarch64. No observable improvements nor regressions. Great! Thanks a lot for your testing and review~ ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3059921607 From xgong at openjdk.org Fri Jul 11 01:50:40 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 11 Jul 2025 01:50:40 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> Message-ID: On Fri, 4 Jul 2025 09:11:40 GMT, Andrew Haley wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: >> >> Refine the comment in ad file > > This looks good. Thanks. Hi @theRealAph , would you mind taking another look at the latest change? It needs an approval from a reviewer. Thanks a lot in advance! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3059929091 From dlong at openjdk.org Fri Jul 11 01:56:38 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 11 Jul 2025 01:56:38 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: <4cyCh_dwsH1VWmffOFcX-AYYvYu0kmrlb-OYzJYl9lw=.cd910298-6863-42fd-ba31-e9b6c60b7887@github.com> On Thu, 10 Jul 2025 18:29:14 GMT, Yadong Wang wrote: > The bug is that the predicate rule of immByteMapBase would cause a ConP Node for oop incorrect matching with byte_map_base when the placeholder jni handle address was just allocated to the address of byte_map_base. > > C2 uses JNI handles as placeholders to encoding constant oops, and one of some handle maybe locate at the address of byte_map_base, which is not memory reserved by CardTable. It's possible because JNIHandleBlocks are allocated by malloc. > > // The assembler store_check code will do an unsigned shift of the oop, > // then add it to _byte_map_base, i.e. > // > // _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift) > _byte_map = (CardValue*) rs.base(); > _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift); > > In aarch64 port, C2 will incorrectly match ConP for oop to ConP for byte_map_base by the immByteMapBase operand. > > // Card Table Byte Map Base > operand immByteMapBase() > %{ > // Get base of card map > predicate((jbyte*)n->get_ptr() == > ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); > match(ConP); > > op_cost(0); > format %{ %} > interface(CONST_INTER); > %} > > // Load Byte Map Base Constant > instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) > %{ > match(Set dst con); > > ins_cost(INSN_COST); > format %{ "adr $dst, $con\t# Byte Map Base" %} > > ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); > > ins_pipe(ialu_imm); > %} > > As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: > 0xffff25caf08c: ldaxr x8, [x11] > 0xffff25caf090: cmp x10, x8 > 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any > 0xffff25caf098: stlxr w8, x28, [x11] > 0xffff25caf09c: cbnz w8, 0xffff25caf08c > 0xffff25caf0a0: orr x11, xzr, #0x3 > 0xffff25caf0a4: str x11, [x13] > 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none > 0xffff25caf0ac: str x14, [sp] > 0xffff25caf0b0: add x2, sp, #0x20 > 0xffff25caf0b4: adrp x1, 0xffff21730000 > 0xffff25caf0b8: bl 0xffff256fffc0 > 0xffff25caf0bc: ldr x14, [sp] > 0xffff25caf0c0: b 0xffff25caef80 > 0xffff25caf0c4: add x13, sp, #0x20 > 0xffff25caf0c8: adrp x12, 0xffff21730000 > 0xffff25caf0cc: ldr x10, [x13] > 0xffff25caf0d0: cmp x10, xzr > 0xffff25caf0d4: b.eq 0xffff25caf130 // b.none > 0xffff25caf0d8: ldr x11, [x12] > 0xffff25caf0dc: tbnz w10, #1, 0xffff25caf0f... It looks like riscv.ad has the same problem. I think Andrew's suggestion is safe. We will emit the correct code based on relocType. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3059937033 From haosun at openjdk.org Fri Jul 11 02:04:42 2025 From: haosun at openjdk.org (Hao Sun) Date: Fri, 11 Jul 2025 02:04:42 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 13:53:25 GMT, Mikhail Ablakatov wrote: >> Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. >> >> Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still. >> >> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reducion micro-benchmarks. >> >> Benchmarks results: >> >> Neoverse-V1 (SVE 256-bit) >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms >> >> >> Fujitsu A64FX (SVE 512-bit): >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms > > Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: > > remove the strictly-ordered FP implementation as unused src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1995: > 1993: // Vector reduction multiply for integral type with ASIMD instructions. > 1994: // Note: temporary registers vtmp1 and vtmp2 are not used in some cases. > 1995: // Note: vsrc and vtmp2 may match. I left a comment in this "resolved comment thread" several days ago. See https://github.com/openjdk/jdk/pull/23181/files#r2179185158. It might be overlooked since the whole conversation was marked as resolved already. I personally think we should not allow `vsrc` and `vtmp2` to match. 
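As background for readers who are not following the whole PR: the Java-level operation whose AArch64 lowering is discussed in this thread is a Vector API mul reduction. The snippet below is only a usage sketch under the incubating jdk.incubator.vector API (run with --add-modules jdk.incubator.vector); it is not the MULLanes micro-benchmark itself, and the class and method names are made up for illustration.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Usage sketch only; names are illustrative, not taken from the benchmarks.
public class MulReductionSketch {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Multiplies all elements of a[] together; reduceLanes(MUL) is the operation
    // that ends up in the reduce_mul_* routines discussed above.
    static float mulAll(float[] a) {
        float acc = 1.0f;
        int i = 0;
        int bound = SPECIES.loopBound(a.length);
        for (; i < bound; i += SPECIES.length()) {
            FloatVector v = FloatVector.fromArray(SPECIES, a, i);
            acc *= v.reduceLanes(VectorOperators.MUL);
        }
        for (; i < a.length; i++) {  // scalar tail
            acc *= a[i];
        }
        return acc;
    }
}
```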
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2199290187 From haosun at openjdk.org Fri Jul 11 02:04:43 2025 From: haosun at openjdk.org (Hao Sun) Date: Fri, 11 Jul 2025 02:04:43 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: <6H9X-NXKOGd9BZVhTDiKNf7OO2KQTciRKGnXY-5C9yA=.e25f9e69-44c2-48d1-b4e3-cb8f1af79546@github.com> References: <6H9X-NXKOGd9BZVhTDiKNf7OO2KQTciRKGnXY-5C9yA=.e25f9e69-44c2-48d1-b4e3-cb8f1af79546@github.com> Message-ID: <_gHaFQTNq2bApeWAE88cWxcNULRDqndSSo3hrY31FgI=.132b7c24-7205-4877-9b95-3d9d13ac7ec8@github.com> On Fri, 11 Jul 2025 01:39:11 GMT, Xiaohong Gong wrote: >> Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: >> >> remove the strictly-ordered FP implementation as unused > > src/hotspot/share/opto/loopopts.cpp line 4715: > >> 4713: Node* last_accumulator = phi->in(2); >> 4714: Node* post_loop_reduction = ReductionNode::make(sopc, nullptr, init, last_accumulator, bt, >> 4715: /* requires_strict_order */ false); > > Why do you change this? Before it requires strict order, but now it is false. IIUC, it's a correction here. As noted by this function name `move_unordered_reduction_out_of_loop()` and the comment before this function, **unordered reduction** is expected to be generated. Hence, we should specify `/* requires_strict_order */ false` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2199274046 From haosun at openjdk.org Fri Jul 11 02:07:41 2025 From: haosun at openjdk.org (Hao Sun) Date: Fri, 11 Jul 2025 02:07:41 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 05:23:38 GMT, Emanuel Peter wrote: >> This patch improves of mul reduction VectorAPIs on SVE targets with 256b or wider vectors. This comment also provides performance numbers for NEON / SVE 128b platforms that aren't expected to benefit from these implementations and for auto-vectorization benchmarks. >> >> ### Neoverse N1 (NEON) >> >>
>> >> Auto-vectorization >> >> | Benchmark | Before | After | Units | Diff | >> |---------------------------|----------|----------|-------|------| >> | mulRedD | 739.699 | 740.884 | ns/op | ~ | >> | byteAddBig | 2670.248 | 2670.562 | ns/op | ~ | >> | byteAddSimple | 1639.796 | 1639.940 | ns/op | ~ | >> | byteMulBig | 2707.900 | 2708.063 | ns/op | ~ | >> | byteMulSimple | 2452.939 | 2452.906 | ns/op | ~ | >> | charAddBig | 2772.363 | 2772.269 | ns/op | ~ | >> | charAddSimple | 1639.867 | 1639.751 | ns/op | ~ | >> | charMulBig | 2796.533 | 2796.375 | ns/op | ~ | >> | charMulSimple | 2453.034 | 2453.004 | ns/op | ~ | >> | doubleAddBig | 2943.613 | 2936.897 | ns/op | ~ | >> | doubleAddSimple | 1635.031 | 1634.797 | ns/op | ~ | >> | doubleMulBig | 3001.937 | 3003.240 | ns/op | ~ | >> | doubleMulSimple | 2448.154 | 2448.117 | ns/op | ~ | >> | floatAddBig | 2963.086 | 2962.215 | ns/op | ~ | >> | floatAddSimple | 1634.987 | 1634.798 | ns/op | ~ | >> | floatMulBig | 3022.442 | 3021.356 | ns/op | ~ | >> | floatMulSimple | 2447.976 | 2448.091 | ns/op | ~ | >> | intAddBig | 832.346 | 832.382 | ns/op | ~ | >> | intAddSimple | 841.276 | 841.287 | ns/op | ~ | >> | intMulBig | 1245.155 | 1245.095 | ns/op | ~ | >> | intMulSimple | 1638.762 | 1638.826 | ns/op | ~ | >> | longAddBig | 4924.541 | 4924.328 | ns/op | ~ | >> | longAddSimple | 841.623 | 841.625 | ns/op | ~ | >> | longMulBig | 9848.954 | 9848.807 | ns/op | ~ | >> | longMulSimple | 3427.169 | 3427.279 | ns/op | ~ | >> | shortAddBig | 2670.027 | 2670.345 | ns/op | ~ | >> | shortAddSimple | 1639.869 | 1639.876 | ns/op | ~ | >> | shortMulBig | 2750.812 | 2750.562 | ns/op | ~ | >> | shortMulSimple | 2453.030 | 2452.937 | ns/op | ~ | >> >>... > > @mikabl-arm @XiaohongGong I'm a little busy these weeks before going on vacation, so I won't have time to look into this more deeply. > > However, I do plan to remove the auto-vectorization restrictions for simple reductions. > https://bugs.openjdk.org/browse/JDK-8307516 > > You can already now disable the (bad) reduction heuristic, using `AutoVectorizationOverrideProfitability`. > https://bugs.openjdk.org/browse/JDK-8357530 > I published benchmark results there: > https://github.com/openjdk/jdk/pull/25387 > You can see that enabling simple reductions is in most cases actually profitable now. But float/double add and mul have strict reduction order, and that usually prevents vectorization from being profitable. The strict-order vector reduction is quite expensive, and it only becomes beneficial if there is a lot of other code in the loop that can be vectorized. Soon, I plan to add a cost-model, so that we can predict if vectorization is profitable. > > It would also be nice to actually find a benchmark where float add/mul reductions lead to a speedup with vectorization. So far I have not seen any example in my benchmarks: https://github.com/openjdk/jdk/pull/25387 If you find any such example, please let me know ;) > > I don't have access to any SVE machines, so I cannot help you there, unfortunately. > > Is this helpful to you? @eme64 Thanks for your input. It's very helpful to us. @fg1417 Thanks for your clarification on `case-2` as I mentioned earlier. @mikabl-arm Thanks for your providing the performance data on Neoverse-V1 machine. 
> Given that: > > * this PR focuses on VectorAPI and **not** on auto-vectorization, > * and it does **not** introduce regressions in auto-vectorization performance, > > I suggest: > > * continuing the discussion on auto-vectorization separately on hotspot-dev, including @fg1417 in the loop; > * moving forward with resolving the remaining VectorAPI issues and merging this PR. I agree with your suggestion. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-3059975539 From yadongwang at openjdk.org Fri Jul 11 02:25:38 2025 From: yadongwang at openjdk.org (Yadong Wang) Date: Fri, 11 Jul 2025 02:25:38 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: <4cyCh_dwsH1VWmffOFcX-AYYvYu0kmrlb-OYzJYl9lw=.cd910298-6863-42fd-ba31-e9b6c60b7887@github.com> References: <4cyCh_dwsH1VWmffOFcX-AYYvYu0kmrlb-OYzJYl9lw=.cd910298-6863-42fd-ba31-e9b6c60b7887@github.com> Message-ID: On Fri, 11 Jul 2025 01:54:00 GMT, Dean Long wrote: > It looks like riscv.ad has the same problem. I think Andrew's suggestion is safe. We will emit the correct code based on relocType. Yes, it maybe a better solution for jdk main line, because immPollPage was remove in https://bugs.openjdk.org/browse/JDK-8220051. But how about jdk8u backport? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3060031013 From fjiang at openjdk.org Fri Jul 11 02:30:36 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Fri, 11 Jul 2025 02:30:36 GMT Subject: RFR: 8361829: [TESTBUG] RISC-V: compiler/vectorization/runner/BasicIntOpTest.java fails with RVV but not Zvbb In-Reply-To: References: Message-ID: <6TiCRs3q6fScF7OgINGrqs4T_fpvleuVZ76EJz9hYGQ=.13d9810c-0941-454e-888d-06ea10075ce7@github.com> On Thu, 10 Jul 2025 08:31:53 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > After JDK-8355293 , compiler/vectorization/runner/BasicIntOpTest.java failswith RVV but not Zvbb. > The reason for the error is that `PopCountVI` on RISC-V requires `Zvbb`, not just `RVV`. > > ### Test > - [x] Run compiler/vectorization/runner/BasicIntOpTest.java on k1 > - [x] Run compiler/vectorization/runner/BasicIntOpTest.java on qemu-system (enable RVV) w/ and w/o zvbb Marked as reviewed by fjiang (Committer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26238#pullrequestreview-3008247649 From haosun at openjdk.org Fri Jul 11 02:31:43 2025 From: haosun at openjdk.org (Hao Sun) Date: Fri, 11 Jul 2025 02:31:43 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v13] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 11:40:29 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. 
>> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Change match rule names to lowercase src/hotspot/cpu/aarch64/aarch64_vector.ad line 7189: > 7187: effect(TEMP_DEF dst, TEMP tmp); > 7188: match(Set dst (SelectFromTwoVector (Binary index src1) src2)); > 7189: format %{ "vselect_from_two_vectors_Neon_10_11 $dst, $src1, $src2, $index\t# vector (8B/16B/4S/8S/2I/4I/2F/4F). KILL $tmp" %} nit: here and several other sites. We also need use lower cases in the `format` clause. Suggestion: format %{ "vselect_from_two_vectors_neon_10_11 $dst, $src1, $src2, $index\t# vector (8B/16B/4S/8S/2I/4I/2F/4F). KILL $tmp" %} ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2199340543 From dzhang at openjdk.org Fri Jul 11 02:36:44 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Fri, 11 Jul 2025 02:36:44 GMT Subject: RFR: 8361829: [TESTBUG] RISC-V: compiler/vectorization/runner/BasicIntOpTest.java fails with RVV but not Zvbb In-Reply-To: References: Message-ID: <4CYPB69Qkgv9HGGCHVUKYq1EmU_ue5vN4ZugZfEVRic=.2b192ddc-c47f-442f-91c8-88fb7e804840@github.com> On Thu, 10 Jul 2025 08:31:53 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > After [JDK-8355293](https://bugs.openjdk.org/browse/JDK-8355293) , compiler/vectorization/runner/BasicIntOpTest.java fails with RVV but not Zvbb. > The reason for the error is that `PopCountVI` on RISC-V requires `Zvbb`, not just `RVV`. > > ### Test > - [x] Run compiler/vectorization/runner/BasicIntOpTest.java on k1 > - [x] Run compiler/vectorization/runner/BasicIntOpTest.java on qemu-system (enable RVV) w/ and w/o zvbb Thanks all for the review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26238#issuecomment-3060087390 From duke at openjdk.org Fri Jul 11 02:36:45 2025 From: duke at openjdk.org (duke) Date: Fri, 11 Jul 2025 02:36:45 GMT Subject: RFR: 8361829: [TESTBUG] RISC-V: compiler/vectorization/runner/BasicIntOpTest.java fails with RVV but not Zvbb In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 08:31:53 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! 
> > After [JDK-8355293](https://bugs.openjdk.org/browse/JDK-8355293) , compiler/vectorization/runner/BasicIntOpTest.java fails with RVV but not Zvbb. > The reason for the error is that `PopCountVI` on RISC-V requires `Zvbb`, not just `RVV`. > > ### Test > - [x] Run compiler/vectorization/runner/BasicIntOpTest.java on k1 > - [x] Run compiler/vectorization/runner/BasicIntOpTest.java on qemu-system (enable RVV) w/ and w/o zvbb @DingliZhang Your change (at version f33d319a7eea937a2f2baa1aef7a4b072c93eb2a) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26238#issuecomment-3060091029 From dzhang at openjdk.org Fri Jul 11 02:43:42 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Fri, 11 Jul 2025 02:43:42 GMT Subject: Integrated: 8361829: [TESTBUG] RISC-V: compiler/vectorization/runner/BasicIntOpTest.java fails with RVV but not Zvbb In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 08:31:53 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > After [JDK-8355293](https://bugs.openjdk.org/browse/JDK-8355293) , compiler/vectorization/runner/BasicIntOpTest.java fails with RVV but not Zvbb. > The reason for the error is that `PopCountVI` on RISC-V requires `Zvbb`, not just `RVV`. > > ### Test > - [x] Run compiler/vectorization/runner/BasicIntOpTest.java on k1 > - [x] Run compiler/vectorization/runner/BasicIntOpTest.java on qemu-system (enable RVV) w/ and w/o zvbb This pull request has now been integrated. Changeset: 2e7e272d Author: Dingli Zhang Committer: Feilong Jiang URL: https://git.openjdk.org/jdk/commit/2e7e272d7b5273bae8684095bcda2a9c8bd21dc8 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8361829: [TESTBUG] RISC-V: compiler/vectorization/runner/BasicIntOpTest.java fails with RVV but not Zvbb Reviewed-by: fyang, fjiang ------------- PR: https://git.openjdk.org/jdk/pull/26238 From dlong at openjdk.org Fri Jul 11 03:41:37 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 11 Jul 2025 03:41:37 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 18:29:14 GMT, Yadong Wang wrote: > The bug is that the predicate rule of immByteMapBase would cause a ConP Node for oop incorrect matching with byte_map_base when the placeholder jni handle address was just allocated to the address of byte_map_base. > > C2 uses JNI handles as placeholders to encoding constant oops, and one of some handle maybe locate at the address of byte_map_base, which is not memory reserved by CardTable. It's possible because JNIHandleBlocks are allocated by malloc. > > // The assembler store_check code will do an unsigned shift of the oop, > // then add it to _byte_map_base, i.e. > // > // _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift) > _byte_map = (CardValue*) rs.base(); > _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift); > > In aarch64 port, C2 will incorrectly match ConP for oop to ConP for byte_map_base by the immByteMapBase operand. 
> > // Card Table Byte Map Base > operand immByteMapBase() > %{ > // Get base of card map > predicate((jbyte*)n->get_ptr() == > ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); > match(ConP); > > op_cost(0); > format %{ %} > interface(CONST_INTER); > %} > > // Load Byte Map Base Constant > instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) > %{ > match(Set dst con); > > ins_cost(INSN_COST); > format %{ "adr $dst, $con\t# Byte Map Base" %} > > ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); > > ins_pipe(ialu_imm); > %} > > As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: > 0xffff25caf08c: ldaxr x8, [x11] > 0xffff25caf090: cmp x10, x8 > 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any > 0xffff25caf098: stlxr w8, x28, [x11] > 0xffff25caf09c: cbnz w8, 0xffff25caf08c > 0xffff25caf0a0: orr x11, xzr, #0x3 > 0xffff25caf0a4: str x11, [x13] > 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none > 0xffff25caf0ac: str x14, [sp] > 0xffff25caf0b0: add x2, sp, #0x20 > 0xffff25caf0b4: adrp x1, 0xffff21730000 > 0xffff25caf0b8: bl 0xffff256fffc0 > 0xffff25caf0bc: ldr x14, [sp] > 0xffff25caf0c0: b 0xffff25caef80 > 0xffff25caf0c4: add x13, sp, #0x20 > 0xffff25caf0c8: adrp x12, 0xffff21730000 > 0xffff25caf0cc: ldr x10, [x13] > 0xffff25caf0d0: cmp x10, xzr > 0xffff25caf0d4: b.eq 0xffff25caf130 // b.none > 0xffff25caf0d8: ldr x11, [x12] > 0xffff25caf0dc: tbnz w10, #1, 0xffff25caf0f... Let me see if I can unpack your question. Yes jdk8u still uses immPollPage, which is similar to immByteMapBase, but I don't think it has the same problem, because I don't see how an oop could have the same address as the polling page. If we fix riscv.ad now, there won't be a clean backport to jdk8u because the riscv port doesn't exist. For aarch64, either of the two proposed fixes could be backported to jdk8u, correct? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3060262075 From yadongwang at openjdk.org Fri Jul 11 04:53:37 2025 From: yadongwang at openjdk.org (Yadong Wang) Date: Fri, 11 Jul 2025 04:53:37 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 03:39:05 GMT, Dean Long wrote: > Let me see if I can unpack your question. Yes jdk8u still uses immPollPage, which is similar to immByteMapBase, but I don't think it has the same problem, because I don't see how an oop could have the same address as the polling page. If we fix riscv.ad now, there won't be a clean backport to jdk8u because the riscv port doesn't exist. For aarch64, either of the two proposed fixes could be backported to jdk8u, correct? Yes, but byte_map_base can be same address as the polling page in jdk8u, in both 2 proposed fixes. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3060506873 From haosun at openjdk.org Fri Jul 11 05:02:40 2025 From: haosun at openjdk.org (Hao Sun) Date: Fri, 11 Jul 2025 05:02:40 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 18:29:14 GMT, Yadong Wang wrote: > The bug is that the predicate rule of immByteMapBase would cause a ConP Node for oop incorrect matching with byte_map_base when the placeholder jni handle address was just allocated to the address of byte_map_base. > > C2 uses JNI handles as placeholders to encoding constant oops, and one of some handle maybe locate at the address of byte_map_base, which is not memory reserved by CardTable. It's possible because JNIHandleBlocks are allocated by malloc. > > // The assembler store_check code will do an unsigned shift of the oop, > // then add it to _byte_map_base, i.e. > // > // _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift) > _byte_map = (CardValue*) rs.base(); > _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift); > > In aarch64 port, C2 will incorrectly match ConP for oop to ConP for byte_map_base by the immByteMapBase operand. > > // Card Table Byte Map Base > operand immByteMapBase() > %{ > // Get base of card map > predicate((jbyte*)n->get_ptr() == > ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); > match(ConP); > > op_cost(0); > format %{ %} > interface(CONST_INTER); > %} > > // Load Byte Map Base Constant > instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) > %{ > match(Set dst con); > > ins_cost(INSN_COST); > format %{ "adr $dst, $con\t# Byte Map Base" %} > > ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); > > ins_pipe(ialu_imm); > %} > > As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: > 0xffff25caf08c: ldaxr x8, [x11] > 0xffff25caf090: cmp x10, x8 > 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any > 0xffff25caf098: stlxr w8, x28, [x11] > 0xffff25caf09c: cbnz w8, 0xffff25caf08c > 0xffff25caf0a0: orr x11, xzr, #0x3 > 0xffff25caf0a4: str x11, [x13] > 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none > 0xffff25caf0ac: str x14, [sp] > 0xffff25caf0b0: add x2, sp, #0x20 > 0xffff25caf0b4: adrp x1, 0xffff21730000 > 0xffff25caf0b8: bl 0xffff256fffc0 > 0xffff25caf0bc: ldr x14, [sp] > 0xffff25caf0c0: b 0xffff25caef80 > 0xffff25caf0c4: add x13, sp, #0x20 > 0xffff25caf0c8: adrp x12, 0xffff21730000 > 0xffff25caf0cc: ldr x10, [x13] > 0xffff25caf0d0: cmp x10, xzr > 0xffff25caf0d4: b.eq 0xffff25caf130 // b.none > 0xffff25caf0d8: ldr x11, [x12] > 0xffff25caf0dc: tbnz w10, #1, 0xffff25caf0f... Hi. Not a code review. Our internal CI reported a jtreg failure `test/hotspot/jtreg/sources/TestNoNULL.java` with this PR on Nvidia Grace machine. - How to reproduce the failure: build one fastdebug JDK and run `make test TEST="test/hotspot/jtreg/sources/TestNoNULL.java"` - The snippet of the error log STDERR: Error: 'NULL' found in /tmp/src/hotspot/cpu/aarch64/aarch64.ad at line 4563: n->get_ptr_type()->isa_rawptr() != NULL && java.lang.RuntimeException: Test found 1 usages of 'NULL' in source files. See errors above. 
at TestNoNULL.main(TestNoNULL.java:84) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) at java.base/java.lang.reflect.Method.invoke(Method.java:565) at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) at java.base/java.lang.Thread.run(Thread.java:1474) JavaTest Message: Test threw exception: java.lang.RuntimeException JavaTest Message: shutting down test TEST RESULT: Failed. Execution failed: `main' threw exception: java.lang.RuntimeException: Test found 1 usages of 'NULL' in source files. See errors above. It seems that the newly added assertion by this PR is hit in this jtreg case. Could you help take a look at this issue? Thanks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3060528852 From mablakatov at openjdk.org Fri Jul 11 05:50:53 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Fri, 11 Jul 2025 05:50:53 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: <_gHaFQTNq2bApeWAE88cWxcNULRDqndSSo3hrY31FgI=.132b7c24-7205-4877-9b95-3d9d13ac7ec8@github.com> References: <6H9X-NXKOGd9BZVhTDiKNf7OO2KQTciRKGnXY-5C9yA=.e25f9e69-44c2-48d1-b4e3-cb8f1af79546@github.com> <_gHaFQTNq2bApeWAE88cWxcNULRDqndSSo3hrY31FgI=.132b7c24-7205-4877-9b95-3d9d13ac7ec8@github.com> Message-ID: On Fri, 11 Jul 2025 01:53:48 GMT, Hao Sun wrote: >> src/hotspot/share/opto/loopopts.cpp line 4715: >> >>> 4713: Node* last_accumulator = phi->in(2); >>> 4714: Node* post_loop_reduction = ReductionNode::make(sopc, nullptr, init, last_accumulator, bt, >>> 4715: /* requires_strict_order */ false); >> >> Why do you change this? Before it requires strict order, but now it is false. > > IIUC, it's a correction here. > > As noted by this function name `move_unordered_reduction_out_of_loop()` and the comment before this function, **unordered reduction** is expected to be generated. Hence, we should specify `/* requires_strict_order */ false` Precisely that, @shqking , thank you. I found this while evaluating the effect the patch has on auto-vectorization. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2199671921 From thartmann at openjdk.org Fri Jul 11 06:12:39 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 11 Jul 2025 06:12:39 GMT Subject: [jdk25] RFR: 8358183: [JVMCI] crash accessing nmethod::jvmci_name in CodeCache::aggregate In-Reply-To: <2-vD19USVejKcQe9AuVa1tq9RCGdspEJ1JVbt5BVI_4=.8e45bdfc-8142-4408-b960-6b95aa338e53@github.com> References: <2-vD19USVejKcQe9AuVa1tq9RCGdspEJ1JVbt5BVI_4=.8e45bdfc-8142-4408-b960-6b95aa338e53@github.com> Message-ID: <2HGFSgxusQvt-rWKn6X-vrLEuabgg25mxfUc4Lh84h4=.80949534-2eec-4906-a894-08689990c138@github.com> On Thu, 10 Jul 2025 17:40:20 GMT, Boris Ulasevich wrote: > This is the backport of the JVMCI metadata crash fix. > > Issue: > When flushing nmethods via CodeBlob::purge(), the JVMCI metadata was freed (mutable_data) but its size fields remained non-zero. As a result, invoking heap analytics via jcmd Compiler.CodeHeap_Analytics still walks the purged metadata and calls jvmci_name() on arbitrary memory, leading to intermittent crashes > > Fix: > Extend CodeBlob::purge() to zero out the _mutable_data_size, _relocation_size, and _metadata_size fields so that after a purge jvmci_data_size() returns 0 and CompileBroker::print_heapinfo() skips any JVMCI metadata Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26248#pullrequestreview-3008753772 From xgong at openjdk.org Fri Jul 11 06:19:49 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 11 Jul 2025 06:19:49 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: References: <6H9X-NXKOGd9BZVhTDiKNf7OO2KQTciRKGnXY-5C9yA=.e25f9e69-44c2-48d1-b4e3-cb8f1af79546@github.com> <_gHaFQTNq2bApeWAE88cWxcNULRDqndSSo3hrY31FgI=.132b7c24-7205-4877-9b95-3d9d13ac7ec8@github.com> Message-ID: On Fri, 11 Jul 2025 05:48:03 GMT, Mikhail Ablakatov wrote: >> IIUC, it's a correction here. >> >> As noted by this function name `move_unordered_reduction_out_of_loop()` and the comment before this function, **unordered reduction** is expected to be generated. Hence, we should specify `/* requires_strict_order */ false` > > Precisely that, @shqking , thank you. I found this while evaluating the effect the patch has on auto-vectorization. I see. Thanks! So what is the type of bt? Is it an integer type of floating-point one? If it's an integer type, I think changing or not does not have difference. But if it is floating-point type, we do not support the non-strict-ordered anyway and they are not enabled in SLP on AArch64. I'm just curious whether this change has any relationship with this PR. If not, I suggest not touching it now. Seems there is the same change in this PR https://github.com/openjdk/jdk/pull/23181. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2199721512 From thartmann at openjdk.org Fri Jul 11 06:21:41 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 11 Jul 2025 06:21:41 GMT Subject: RFR: 8350177: C2 SuperWord: Integer.numberOfLeadingZeros, numberOfTrailingZeros, reverse and bitCount have input types wrongly turncated for byte and short [v9] In-Reply-To: References: Message-ID: <2_r-6BS4ZWqGUWmywg2ZfSkv9k-ZDehixrwoyQY3vrs=.fa426c0a-bbb6-4030-b649-4012381599a1@github.com> On Mon, 30 Jun 2025 12:59:24 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This patch fixes cases in SuperWord when compiling subword types where vectorized code would be given a narrower type than expected, leading to miscompilation due to truncation. This fix is a generalization of the same fix applied for `Integer.reverseBytes` in [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). The patch introduces a check for nodes that are known to tolerate truncation, so that any future cases of subword truncation will avoid creating miscompiled code. >> >> The patch reuses the existing logic to set the type of the vectors to int, which currently disables vectorization for the affected patterns entirely. Once [JDK-8342095](https://bugs.openjdk.org/browse/JDK-8342095) is merged and automatic casting support is added the autovectorizer should automatically insert casts to and from int, maintaining correctness. >> >> I've added an IR test that checks for correctly compiled outputs. Thoughts and reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Explicit nullptr checks I had a look at this since Emanuel is busy and this would need to be integrated until RDP 2 on Thursday next week. I'm not an expert in this code but the fix looks good to me. There's a little typo in the title `turncated` -> `truncated`. I fixed it in JBS, please update the PR title. I've submitted some testing again and will report back once it passed. 
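To make the truncation issue concrete for readers of the archive, the affected shape is roughly the following (an illustrative sketch, not one of the IR tests added by the patch). Because both the load and the store are subword, SuperWord picks short lanes, and before the fix the int input of the bit-count style intrinsic could be narrowed back to 16 bits, changing the result.

```java
// Illustrative sketch of the affected pattern; the patch's IR tests may differ.
static void bitCountShorts(short[] in, short[] out) {
    for (int i = 0; i < in.length; i++) {
        // in[i] is sign-extended to int before Integer.bitCount. For in[i] == -1
        // the correct result is 32; computing bitCount on a truncated 16-bit lane
        // would give 16 instead.
        out[i] = (short) Integer.bitCount(in[i]);
    }
}
```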
------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25440#pullrequestreview-3008778719 From mablakatov at openjdk.org Fri Jul 11 06:22:44 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Fri, 11 Jul 2025 06:22:44 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 02:01:06 GMT, Hao Sun wrote: >> Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: >> >> remove the strictly-ordered FP implementation as unused > > src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1995: > >> 1993: // Vector reduction multiply for integral type with ASIMD instructions. >> 1994: // Note: temporary registers vtmp1 and vtmp2 are not used in some cases. >> 1995: // Note: vsrc and vtmp2 may match. > > I left a comment in this "resolved comment thread" several days ago. See https://github.com/openjdk/jdk/pull/23181/files#r2179185158. It might be overlooked since the whole conversation was marked as resolved already. > > I personally think we should not allow `vsrc` and `vtmp2` to match. Apologies for overlooking the comment. A suggestion that started the thread was marked as a nit so I felt okay about resolving it myself at the time. If `vsrc` and `vtmp2` match it implies that `vsrc` is allowed to be modified. This is used so that `reduce_mul_integral_le128b` may be invoked either independently or to process a tail after `reduce_mul_integral_gt128b` here: https://github.com/openjdk/jdk/pull/23181/files#diff-75bfb44278df267ce4978393b9b6b6030a7e23065ca15436fb1a5009debc6e81R2106 . In the latter case a temporary register holding intermediate result value is passed to both `vsrc` and `vtmp2` parameters. I can see it being somewhat confusing though. I could add an explicit boolean flag parameter, e.g. `is_tail_processing`, and do assertion checks based on its value. And extend the function comment with the described above. I'm happy to consider other suggestions as well if any :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2199735442 From chagedorn at openjdk.org Fri Jul 11 06:34:38 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 11 Jul 2025 06:34:38 GMT Subject: RFR: 8361494: [IR Framework] Escape too much in replacement of placeholder In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 12:54:55 GMT, Marc Chevalier wrote: > In `RawIRNode::regex`, the call to `String::replaceAll` doesn't quote the replace string. > > Meaning that in the IR rule > > @IR(failOn = {IRNode.ALLOC_OF, "\\w+"}) > > The interpreted `\w` is interpreted as a group reference, and we get > > java.lang.IllegalArgumentException: Illegal group reference > > so we should write instead > > @IR(failOn = {IRNode.ALLOC_OF, "\\\\w+"}) > > To mean the interpreted string `\\w`, to mean an escaped single backslash. Same goes with `$` (used for nested classes). > > Since we don't want to refer to groups (and anyway, there are not in `IRNode.IS_REPLACED`), we just quote the replacement string with `java.util.regex.Matcher.quoteReplacement` to make it more usable. > > Note that you would still need to write `\$` since the `$` is the end of string regex, and needs to be escaped at the regex level (and not at the string, so it's not `$`, since `$` is not a special character). Before the fix, it should be `\\\$`. Phew! Regexes are bad enough, let's not escape them manually twice! 
> > In `test/hotspot/jtreg/compiler/c2/TestMergeStores.java`, that makes us save 1344 backslashes. > > Thanks, > Marc Good improvement! Thanks for fixing it. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26243#pullrequestreview-3008817657 From chagedorn at openjdk.org Fri Jul 11 06:34:39 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 11 Jul 2025 06:34:39 GMT Subject: RFR: 8361494: [IR Framework] Escape too much in replacement of placeholder In-Reply-To: <3BYjYX_c2LSt2wepKGpXhRC85bJfz2wAmJhCMY2CMb0=.ed132047-cf0a-40a6-8a3e-69973d0bbaf1@github.com> References: <3BYjYX_c2LSt2wepKGpXhRC85bJfz2wAmJhCMY2CMb0=.ed132047-cf0a-40a6-8a3e-69973d0bbaf1@github.com> Message-ID: On Thu, 10 Jul 2025 14:48:10 GMT, Manuel H?ssig wrote: >> In `RawIRNode::regex`, the call to `String::replaceAll` doesn't quote the replace string. >> >> Meaning that in the IR rule >> >> @IR(failOn = {IRNode.ALLOC_OF, "\\w+"}) >> >> The interpreted `\w` is interpreted as a group reference, and we get >> >> java.lang.IllegalArgumentException: Illegal group reference >> >> so we should write instead >> >> @IR(failOn = {IRNode.ALLOC_OF, "\\\\w+"}) >> >> To mean the interpreted string `\\w`, to mean an escaped single backslash. Same goes with `$` (used for nested classes). >> >> Since we don't want to refer to groups (and anyway, there are not in `IRNode.IS_REPLACED`), we just quote the replacement string with `java.util.regex.Matcher.quoteReplacement` to make it more usable. >> >> Note that you would still need to write `\$` since the `$` is the end of string regex, and needs to be escaped at the regex level (and not at the string, so it's not `$`, since `$` is not a special character). Before the fix, it should be `\\\$`. Phew! Regexes are bad enough, let's not escape them manually twice! >> >> In `test/hotspot/jtreg/compiler/c2/TestMergeStores.java`, that makes us save 1344 backslashes. >> >> Thanks, >> Marc > > test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/irrule/checkattribute/parsing/RawIRNode.java line 65: > >> 63: nodeRegex = regexForVectorIRNode(nodeRegex, vmInfo, bound); >> 64: } else if (userPostfix.isValid()) { >> 65: nodeRegex = nodeRegex.replaceAll(IRNode.IS_REPLACED, java.util.regex.Matcher.quoteReplacement(userPostfix.value())); > > Perhaps you might want to `import java.util.regex.Matcher` to make it a bit more concise? Yes, please import it instead. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26243#discussion_r2199760949 From yadongwang at openjdk.org Fri Jul 11 06:52:38 2025 From: yadongwang at openjdk.org (Yadong Wang) Date: Fri, 11 Jul 2025 06:52:38 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: <3J94H9P-F9-a8B96eqgDvCxYKywOB_7iSr_Ra4cwSk0=.f37e2ac6-f89b-4448-9acc-e0622712f786@github.com> On Fri, 11 Jul 2025 05:00:25 GMT, Hao Sun wrote: > Hi. Not a code review. Our internal CI reported a jtreg failure `test/hotspot/jtreg/sources/TestNoNULL.java` with this PR on Nvidia Grace machine. > > * How to reproduce the failure: build one fastdebug JDK and run `make test TEST="test/hotspot/jtreg/sources/TestNoNULL.java"` > * The snippet of the error log > > ``` > STDERR: > Error: 'NULL' found in /tmp/src/hotspot/cpu/aarch64/aarch64.ad at line 4563: > n->get_ptr_type()->isa_rawptr() != NULL && > java.lang.RuntimeException: Test found 1 usages of 'NULL' in source files. See errors above. 
> at TestNoNULL.main(TestNoNULL.java:84) > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > at java.base/java.lang.reflect.Method.invoke(Method.java:565) > at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) > at java.base/java.lang.Thread.run(Thread.java:1474) > > JavaTest Message: Test threw exception: java.lang.RuntimeException > JavaTest Message: shutting down test > > > TEST RESULT: Failed. Execution failed: `main' threw exception: java.lang.RuntimeException: Test found 1 usages of 'NULL' in source files. See errors above. > ``` > > It seems that the newly added assertion by this PR is hit in this jtreg case. Could you help take a look at this issue? Thanks. Sure, I will follow Andrew's suggestion, removing immByteMapBase. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3060852029 From jbhateja at openjdk.org Fri Jul 11 06:57:38 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 11 Jul 2025 06:57:38 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: <7QVWVj5vpSB42THa2rx-oxMqhH76qMZ5MBJjindRiLo=.b825076a-aa9c-4b86-94b6-0a593f2240ac@github.com> References: <_RERljqu_FG7ZyneAk7Thd-9TwED18pQpEBz_i105fY=.b8948a23-273a-49f6-b9cb-6b611a5eedc6@github.com> <6SXA9ZrXBDhZLyXP3lXbkpl4dl3iocvDpzPrUpIQOl8=.9b025be2-848b-4b78-a5e4-929cb7e9f798@github.com> <7QVWVj5vpSB42THa2rx-oxMqhH76qMZ5MBJjindRiLo=.b825076a-aa9c-4b86-94b6-0a593f2240ac@github.com> Message-ID: On Thu, 10 Jul 2025 08:06:18 GMT, erifan wrote: > OK. But in order to cover various cases, the implementation may be a bit troublesome. The solution I thought of is to **check whether the architecture supports VectorLongToMask, MaskAll and Replicate in `LibraryCallKit::inline_vector_frombits_coerced`. If it does, generate VectorLongToMask, and then convert it to MaskAll or Replicate in IGVN**. This is similar to the current implementation of vector rotate. > > At the same time, this conversion may affect some other optimizations, such as `VectorMaskToLong(VectorLongToMask (x)) => x` and `VectorStoreMask(VectorLoadMask (x)) => x`. So we also need to fix these effects. For completeness, we should handle maskAll in the Identity transform of VectorMaskToLongNode. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2199807660 From mchevalier at openjdk.org Fri Jul 11 07:09:29 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 11 Jul 2025 07:09:29 GMT Subject: RFR: 8361494: [IR Framework] Escape too much in replacement of placeholder [v2] In-Reply-To: References: Message-ID: > In `RawIRNode::regex`, the call to `String::replaceAll` doesn't quote the replace string. > > Meaning that in the IR rule > > @IR(failOn = {IRNode.ALLOC_OF, "\\w+"}) > > The interpreted `\w` is interpreted as a group reference, and we get > > java.lang.IllegalArgumentException: Illegal group reference > > so we should write instead > > @IR(failOn = {IRNode.ALLOC_OF, "\\\\w+"}) > > To mean the interpreted string `\\w`, to mean an escaped single backslash. Same goes with `$` (used for nested classes). > > Since we don't want to refer to groups (and anyway, there are not in `IRNode.IS_REPLACED`), we just quote the replacement string with `java.util.regex.Matcher.quoteReplacement` to make it more usable. 
> > Note that you would still need to write `\$` since the `$` is the end of string regex, and needs to be escaped at the regex level (and not at the string, so it's not `$`, since `$` is not a special character). Before the fix, it should be `\\\$`. Phew! Regexes are bad enough, let's not escape them manually twice! > > In `test/hotspot/jtreg/compiler/c2/TestMergeStores.java`, that makes us save 1344 backslashes. > > Thanks, > Marc Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: Import ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26243/files - new: https://git.openjdk.org/jdk/pull/26243/files/0a9d1093..da6d7b0d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26243&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26243&range=00-01 Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26243.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26243/head:pull/26243 PR: https://git.openjdk.org/jdk/pull/26243 From mchevalier at openjdk.org Fri Jul 11 07:09:30 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 11 Jul 2025 07:09:30 GMT Subject: RFR: 8361494: [IR Framework] Escape too much in replacement of placeholder In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 12:54:55 GMT, Marc Chevalier wrote: > In `RawIRNode::regex`, the call to `String::replaceAll` doesn't quote the replace string. > > Meaning that in the IR rule > > @IR(failOn = {IRNode.ALLOC_OF, "\\w+"}) > > The interpreted `\w` is interpreted as a group reference, and we get > > java.lang.IllegalArgumentException: Illegal group reference > > so we should write instead > > @IR(failOn = {IRNode.ALLOC_OF, "\\\\w+"}) > > To mean the interpreted string `\\w`, to mean an escaped single backslash. Same goes with `$` (used for nested classes). > > Since we don't want to refer to groups (and anyway, there are not in `IRNode.IS_REPLACED`), we just quote the replacement string with `java.util.regex.Matcher.quoteReplacement` to make it more usable. > > Note that you would still need to write `\$` since the `$` is the end of string regex, and needs to be escaped at the regex level (and not at the string, so it's not `$`, since `$` is not a special character). Before the fix, it should be `\\\$`. Phew! Regexes are bad enough, let's not escape them manually twice! > > In `test/hotspot/jtreg/compiler/c2/TestMergeStores.java`, that makes us save 1344 backslashes. > > Thanks, > Marc Imported. Could you take a look again? 
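For readers who want to see the failure mode outside the IR framework, here is a small standalone sketch. The `#POSTFIX#` placeholder is just a stand-in for `IRNode.IS_REPLACED` and the strings are made up; the behaviour of `String::replaceAll` and `Matcher.quoteReplacement` is the point.

```java
import java.util.regex.Matcher;

public class QuoteReplacementDemo {
    public static void main(String[] args) {
        // Stand-in for a node regex containing the IRNode.IS_REPLACED placeholder.
        String template = "alloc_of_#POSTFIX#";

        // '$' in an unquoted replacement is parsed as a group reference and throws.
        try {
            template.replaceAll("#POSTFIX#", "Outer$Inner");
        } catch (IllegalArgumentException e) {
            System.out.println(e);  // Illegal group reference
        }

        // A backslash in an unquoted replacement is consumed as an escape, so the
        // user-supplied "\\w+" (the string \w+) silently becomes plain "w+".
        System.out.println(template.replaceAll("#POSTFIX#", "\\w+"));  // alloc_of_w+

        // Quoting the replacement keeps both '$' and '\' literal.
        System.out.println(template.replaceAll("#POSTFIX#",
                Matcher.quoteReplacement("Outer$Inner")));             // alloc_of_Outer$Inner
        System.out.println(template.replaceAll("#POSTFIX#",
                Matcher.quoteReplacement("\\w+")));                    // alloc_of_\w+
    }
}
```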
------------- PR Comment: https://git.openjdk.org/jdk/pull/26243#issuecomment-3060903679 From mchevalier at openjdk.org Fri Jul 11 07:09:54 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 11 Jul 2025 07:09:54 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v6] In-Reply-To: <-rnlrm6PHRZeO1izbXh5nOrm368YKrsFft1u6SHXzWA=.9c8e6646-0b72-4705-895e-f795f74f3906@github.com> References: <-rnlrm6PHRZeO1izbXh5nOrm368YKrsFft1u6SHXzWA=.9c8e6646-0b72-4705-895e-f795f74f3906@github.com> Message-ID: On Thu, 10 Jul 2025 06:13:27 GMT, Marc Chevalier wrote: >> When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 >> >> This is enforced by restoring the old state, like in >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 >> >> That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: >> >> ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) >> >> >> Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. >> >> Another situation is somewhat worse, when happening during parsing. It can lead to such cases: >> >> ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) >> >> The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? >> >> This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 >> >> And here there is the challenge: >> - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) >> ... > > Marc Chevalier has updated the pull request incrementally with two additional commits since the last revision: > > - Forgot to destruct_map_clone > - +'_' and ctor init Thanks @vnkozlov and @TobiHartmann for reviews! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/25936#issuecomment-3060908007 From mchevalier at openjdk.org Fri Jul 11 07:09:55 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 11 Jul 2025 07:09:55 GMT Subject: Integrated: 8359344: C2: Malformed control flow after intrinsic bailout In-Reply-To: References: Message-ID: On Mon, 23 Jun 2025 13:09:43 GMT, Marc Chevalier wrote: > When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 > > This is enforced by restoring the old state, like in > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 > > That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: > > ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) > > > Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. > > Another situation is somewhat worse, when happening during parsing. It can lead to such cases: > > ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) > > The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? > > This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 > > And here there is the challenge: > - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) > - we can't really change the pointer, just the content > -... This pull request has now been integrated. 
Changeset: 3ffc5b9e Author: Marc Chevalier URL: https://git.openjdk.org/jdk/commit/3ffc5b9ef720a07143ef5728d2597afdf2f2c251 Stats: 358 lines in 7 files changed: 240 ins; 75 del; 43 mod 8359344: C2: Malformed control flow after intrinsic bailout Reviewed-by: thartmann, kvn ------------- PR: https://git.openjdk.org/jdk/pull/25936 From chagedorn at openjdk.org Fri Jul 11 07:13:38 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 11 Jul 2025 07:13:38 GMT Subject: RFR: 8361494: [IR Framework] Escape too much in replacement of placeholder [v2] In-Reply-To: References: Message-ID: <6HRJooFsxbb_yBtzJ97j-DdyRhE4ACIKs6NGCRX4Xgk=.a48cfb73-57bd-4d1f-86b5-01529b4ca550@github.com> On Fri, 11 Jul 2025 07:09:29 GMT, Marc Chevalier wrote: >> In `RawIRNode::regex`, the call to `String::replaceAll` doesn't quote the replace string. >> >> Meaning that in the IR rule >> >> @IR(failOn = {IRNode.ALLOC_OF, "\\w+"}) >> >> The interpreted `\w` is interpreted as a group reference, and we get >> >> java.lang.IllegalArgumentException: Illegal group reference >> >> so we should write instead >> >> @IR(failOn = {IRNode.ALLOC_OF, "\\\\w+"}) >> >> To mean the interpreted string `\\w`, to mean an escaped single backslash. Same goes with `$` (used for nested classes). >> >> Since we don't want to refer to groups (and anyway, there are not in `IRNode.IS_REPLACED`), we just quote the replacement string with `java.util.regex.Matcher.quoteReplacement` to make it more usable. >> >> Note that you would still need to write `\$` since the `$` is the end of string regex, and needs to be escaped at the regex level (and not at the string, so it's not `$`, since `$` is not a special character). Before the fix, it should be `\\\$`. Phew! Regexes are bad enough, let's not escape them manually twice! >> >> In `test/hotspot/jtreg/compiler/c2/TestMergeStores.java`, that makes us save 1344 backslashes. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Import Marked as reviewed by chagedorn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26243#pullrequestreview-3008936485 From chagedorn at openjdk.org Fri Jul 11 07:20:41 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 11 Jul 2025 07:20:41 GMT Subject: RFR: 8342941: IGV: Add new graph dumps for post loop, empty loop removal, and one iteration removal [v3] In-Reply-To: <-qvrPep0_75olkxXj9BT74oMIHTfxwgshrHnqQC9BuU=.501e3840-2b5d-4c7c-b2fe-891a167c66d8@github.com> References: <_p5Jj77u1VyyW0eVneXqeNjmngTvSvFi94_FALv6swk=.d4e5aec1-dd73-48ed-8d7f-3080207be763@github.com> <-qvrPep0_75olkxXj9BT74oMIHTfxwgshrHnqQC9BuU=.501e3840-2b5d-4c7c-b2fe-891a167c66d8@github.com> Message-ID: On Thu, 10 Jul 2025 19:07:30 GMT, Daniel Lund?n wrote: >> Saranya Natarajan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: >> >> - fix 2 of review >> - Merge master >> - Addressing review comments >> - Initial Fix > > src/hotspot/share/opto/phasetype.hpp line 83: > >> 81: flags(AFTER_REMOVE_EMPTY_LOOP, "After Remove Empty Loop") \ >> 82: flags(BEFORE_ONE_ITERATION_LOOP, "Before Replacing One Iteration Loop") \ >> 83: flags(AFTER_ONE_ITERATION_LOOP, "After Replacing One Iteration Loop") \ > > Very much a nit, but I think this should be "One-Iteration Loop". Or, is it in fact one _iteration loop_ (as it reads now)? Looking at the code, I think it is the former. @chhagedorn can maybe clarify? 
> > This is not specific to your changeset, but also appears in existing source code comments. Maybe a good opportunity to clean this up everywhere? > > Also, maybe "Replacing" should be "Replace"? Seems to better fit the style used for other phase names. One-Iteration loop sounds better indeed. I also agree with the other suggestions. Something else I've noticed is that we could also benefit when we add dumps for `duplicate_loop_backedge()` which creates a new loop node (i.e. could be seen as "major modification"). I just looked into recently and found myself adding dumps there manually for debugging. I guess since this is a dump adding RFE, we could also add that one. What do you think? But then we would need to update the PR title to something like "add various new graph dumps during loop opts". ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25756#discussion_r2199865344 From chagedorn at openjdk.org Fri Jul 11 07:24:45 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 11 Jul 2025 07:24:45 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp [v4] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 00:37:10 GMT, Guanqiang Han wrote: >> When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. >> >> This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this flag is disabled. > > Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision: > > - Remove the unused variable > - Merge remote-tracking branch 'upstream/master' into 8361140 > - update regression test > - Merge remote-tracking branch 'upstream/master' into 8361140 > - update modification and add regression test > - Merge remote-tracking branch 'upstream/master' into 8361140 > - 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp > > When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. > > This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this support is disabled. Looks good, thanks! I'll give this a spinning in our testing and report back again. ------------- Marked as reviewed by chagedorn (Reviewer). 
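As a rough, hypothetical illustration of the Java shape involved (this is not the regression test added in the PR, and whether this exact code reaches the assert depends on escape-analysis heuristics), a pointer comparison against a phi that merges two non-escaping allocations is the pattern reduce_phi_on_cmp operates on; running such code with the optimization disabled is what exposed the missing guard:

// Hypothetical sketch; class, method and iteration count are invented for illustration.
// Run with: java -XX:-OptimizePtrCompare PtrComparePhiSketch
public class PtrComparePhiSketch {
    static class Box { int x; }

    static boolean compare(boolean flag) {
        Box a = new Box();       // two allocations that do not escape
        Box b = new Box();
        Box p = flag ? a : b;    // phi merging the two allocations
        return p == a;           // pointer compare fed by the phi
    }

    public static void main(String[] args) {
        boolean sink = false;
        for (int i = 0; i < 100_000; i++) {  // warm up so C2 compiles compare()
            sink ^= compare((i & 1) == 0);
        }
        System.out.println(sink);
    }
}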
PR Review: https://git.openjdk.org/jdk/pull/26125#pullrequestreview-3008979770 From mhaessig at openjdk.org Fri Jul 11 07:26:40 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Fri, 11 Jul 2025 07:26:40 GMT Subject: RFR: 8361494: [IR Framework] Escape too much in replacement of placeholder [v2] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 07:09:29 GMT, Marc Chevalier wrote: >> In `RawIRNode::regex`, the call to `String::replaceAll` doesn't quote the replace string. >> >> Meaning that in the IR rule >> >> @IR(failOn = {IRNode.ALLOC_OF, "\\w+"}) >> >> The interpreted `\w` is interpreted as a group reference, and we get >> >> java.lang.IllegalArgumentException: Illegal group reference >> >> so we should write instead >> >> @IR(failOn = {IRNode.ALLOC_OF, "\\\\w+"}) >> >> To mean the interpreted string `\\w`, to mean an escaped single backslash. Same goes with `$` (used for nested classes). >> >> Since we don't want to refer to groups (and anyway, there are not in `IRNode.IS_REPLACED`), we just quote the replacement string with `java.util.regex.Matcher.quoteReplacement` to make it more usable. >> >> Note that you would still need to write `\$` since the `$` is the end of string regex, and needs to be escaped at the regex level (and not at the string, so it's not `$`, since `$` is not a special character). Before the fix, it should be `\\\$`. Phew! Regexes are bad enough, let's not escape them manually twice! >> >> In `test/hotspot/jtreg/compiler/c2/TestMergeStores.java`, that makes us save 1344 backslashes. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Import The copyright in `RawIRNode.java` and `TestMergeStores.java` needs updating. ------------- Changes requested by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/26243#pullrequestreview-3008985841 From mchevalier at openjdk.org Fri Jul 11 07:30:55 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 11 Jul 2025 07:30:55 GMT Subject: RFR: 8361494: [IR Framework] Escape too much in replacement of placeholder [v3] In-Reply-To: References: Message-ID: > In `RawIRNode::regex`, the call to `String::replaceAll` doesn't quote the replace string. > > Meaning that in the IR rule > > @IR(failOn = {IRNode.ALLOC_OF, "\\w+"}) > > The interpreted `\w` is interpreted as a group reference, and we get > > java.lang.IllegalArgumentException: Illegal group reference > > so we should write instead > > @IR(failOn = {IRNode.ALLOC_OF, "\\\\w+"}) > > To mean the interpreted string `\\w`, to mean an escaped single backslash. Same goes with `$` (used for nested classes). > > Since we don't want to refer to groups (and anyway, there are not in `IRNode.IS_REPLACED`), we just quote the replacement string with `java.util.regex.Matcher.quoteReplacement` to make it more usable. > > Note that you would still need to write `\$` since the `$` is the end of string regex, and needs to be escaped at the regex level (and not at the string, so it's not `$`, since `$` is not a special character). Before the fix, it should be `\\\$`. Phew! Regexes are bad enough, let's not escape them manually twice! > > In `test/hotspot/jtreg/compiler/c2/TestMergeStores.java`, that makes us save 1344 backslashes. 
> > Thanks, > Marc Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: i ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26243/files - new: https://git.openjdk.org/jdk/pull/26243/files/da6d7b0d..606e2242 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26243&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26243&range=01-02 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26243.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26243/head:pull/26243 PR: https://git.openjdk.org/jdk/pull/26243 From duke at openjdk.org Fri Jul 11 07:39:41 2025 From: duke at openjdk.org (Guanqiang Han) Date: Fri, 11 Jul 2025 07:39:41 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp [v4] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 07:22:23 GMT, Christian Hagedorn wrote: >> Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision: >> >> - Remove the unused variable >> - Merge remote-tracking branch 'upstream/master' into 8361140 >> - update regression test >> - Merge remote-tracking branch 'upstream/master' into 8361140 >> - update modification and add regression test >> - Merge remote-tracking branch 'upstream/master' into 8361140 >> - 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp >> >> When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. >> >> This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this support is disabled. > > Looks good, thanks! I'll give this a spinning in our testing and report back again. @chhagedorn @JohnTortugo Thanks again for all your feedback. It?s been very helpful! If possible, could you please run /sponsor? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26125#issuecomment-3061047171 From haosun at openjdk.org Fri Jul 11 07:39:45 2025 From: haosun at openjdk.org (Hao Sun) Date: Fri, 11 Jul 2025 07:39:45 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 06:20:06 GMT, Mikhail Ablakatov wrote: >> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1995: >> >>> 1993: // Vector reduction multiply for integral type with ASIMD instructions. >>> 1994: // Note: temporary registers vtmp1 and vtmp2 are not used in some cases. >>> 1995: // Note: vsrc and vtmp2 may match. >> >> I left a comment in this "resolved comment thread" several days ago. See https://github.com/openjdk/jdk/pull/23181/files#r2179185158. It might be overlooked since the whole conversation was marked as resolved already. >> >> I personally think we should not allow `vsrc` and `vtmp2` to match. > > Apologies for overlooking the comment. A suggestion that started the thread was marked as a nit so I felt okay about resolving it myself at the time. 
> > If `vsrc` and `vtmp2` match it implies that `vsrc` is allowed to be modified. This is used so that `reduce_mul_integral_le128b` may be invoked either independently or to process a tail after `reduce_mul_integral_gt128b` here: https://github.com/openjdk/jdk/pull/23181/files#diff-75bfb44278df267ce4978393b9b6b6030a7e23065ca15436fb1a5009debc6e81R2106 . In the latter case a temporary register holding intermediate result value is passed to both `vsrc` and `vtmp2` parameters. > > I can see it being somewhat confusing though. I could add an explicit boolean flag parameter, e.g. `is_tail_processing`, and do assertion checks based on its value. And extend the function comment with the described above. > > I'm happy to consider other suggestions as well if any :) I see. Thanks for your explanation. Current version is okay to me. Perhaps we may want to add more comments here. Suggestion: // Note: vsrc and vtmp2 may match when this function is invoked by `reduce_mul_integral_gt128b()` // as a tail call and vsrc holds the intermediate results. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2199918736 From mhaessig at openjdk.org Fri Jul 11 07:34:44 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Fri, 11 Jul 2025 07:34:44 GMT Subject: RFR: 8361494: [IR Framework] Escape too much in replacement of placeholder [v3] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 07:30:55 GMT, Marc Chevalier wrote: >> In `RawIRNode::regex`, the call to `String::replaceAll` doesn't quote the replace string. >> >> Meaning that in the IR rule >> >> @IR(failOn = {IRNode.ALLOC_OF, "\\w+"}) >> >> The interpreted `\w` is interpreted as a group reference, and we get >> >> java.lang.IllegalArgumentException: Illegal group reference >> >> so we should write instead >> >> @IR(failOn = {IRNode.ALLOC_OF, "\\\\w+"}) >> >> To mean the interpreted string `\\w`, to mean an escaped single backslash. Same goes with `$` (used for nested classes). >> >> Since we don't want to refer to groups (and anyway, there are not in `IRNode.IS_REPLACED`), we just quote the replacement string with `java.util.regex.Matcher.quoteReplacement` to make it more usable. >> >> Note that you would still need to write `\$` since the `$` is the end of string regex, and needs to be escaped at the regex level (and not at the string, so it's not `$`, since `$` is not a special character). Before the fix, it should be `\\\$`. Phew! Regexes are bad enough, let's not escape them manually twice! >> >> In `test/hotspot/jtreg/compiler/c2/TestMergeStores.java`, that makes us save 1344 backslashes. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > i Thank you for addressing the comments. Looks good. ------------- Marked as reviewed by mhaessig (Committer). 
PR Review: https://git.openjdk.org/jdk/pull/26243#pullrequestreview-3009016845 From mablakatov at openjdk.org Fri Jul 11 07:35:47 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Fri, 11 Jul 2025 07:35:47 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: References: <6H9X-NXKOGd9BZVhTDiKNf7OO2KQTciRKGnXY-5C9yA=.e25f9e69-44c2-48d1-b4e3-cb8f1af79546@github.com> <_gHaFQTNq2bApeWAE88cWxcNULRDqndSSo3hrY31FgI=.132b7c24-7205-4877-9b95-3d9d13ac7ec8@github.com> Message-ID: <-SwJHROQB4jO9nlICIWSwNGXZDIQUy8O54baR-Xe80o=.f7c4fd43-330d-4870-ae4b-316ab7507b06@github.com> On Fri, 11 Jul 2025 06:15:31 GMT, Xiaohong Gong wrote: >> Precisely that, @shqking , thank you. I found this while evaluating the effect the patch has on auto-vectorization. > > I see. Thanks! So what is the type of bt? Is it an integer type of floating-point one? If it's an integer type, I think changing or not does not have difference. But if it is floating-point type, we do not support the non-strict-ordered anyway and they are not enabled in SLP on AArch64. I'm just curious whether this change has any relationship with this PR. If not, I suggest not touching it now. Seems there is the same change in this PR https://github.com/openjdk/jdk/pull/23181. @XiaohongGong , JIC, you've referenced the PR you left this comment in. Did you intend to post it somewhere else? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2199910206 From mchevalier at openjdk.org Fri Jul 11 07:38:40 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 11 Jul 2025 07:38:40 GMT Subject: RFR: 8361494: [IR Framework] Escape too much in replacement of placeholder [v3] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 07:32:13 GMT, Manuel H?ssig wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> i > > Thank you for addressing the comments. Looks good. Thanks @mhaessig! @chhagedorn, I still need a the almighty powers of a reviewer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26243#issuecomment-3061042788 From duke at openjdk.org Fri Jul 11 07:35:46 2025 From: duke at openjdk.org (duke) Date: Fri, 11 Jul 2025 07:35:46 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp [v4] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 00:37:10 GMT, Guanqiang Han wrote: >> When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. >> >> This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this flag is disabled. > > Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains seven additional commits since the last revision: > > - Remove the unused variable > - Merge remote-tracking branch 'upstream/master' into 8361140 > - update regression test > - Merge remote-tracking branch 'upstream/master' into 8361140 > - update modification and add regression test > - Merge remote-tracking branch 'upstream/master' into 8361140 > - 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp > > When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. > > This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this support is disabled. @hgqxjj Your change (at version 0e9aa956d7966d81ac81c799ad054016bf78cbba) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26125#issuecomment-3061036381 From bkilambi at openjdk.org Fri Jul 11 08:05:46 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Fri, 11 Jul 2025 08:05:46 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v13] In-Reply-To: References: Message-ID: <_4Car7g0KZ6-4OYHTt0B2ftw3a2amCHt9kFjVUXdA2M=.6428d19d-189b-4707-8350-4cc42ac30d47@github.com> On Fri, 11 Jul 2025 02:27:12 GMT, Hao Sun wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Change match rule names to lowercase > > src/hotspot/cpu/aarch64/aarch64_vector.ad line 7189: > >> 7187: effect(TEMP_DEF dst, TEMP tmp); >> 7188: match(Set dst (SelectFromTwoVector (Binary index src1) src2)); >> 7189: format %{ "vselect_from_two_vectors_Neon_10_11 $dst, $src1, $src2, $index\t# vector (8B/16B/4S/8S/2I/4I/2F/4F). KILL $tmp" %} > > nit: here and several other sites. We also need use lower cases in the `format` clause. > > Suggestion: > > format %{ "vselect_from_two_vectors_neon_10_11 $dst, $src1, $src2, $index\t# vector (8B/16B/4S/8S/2I/4I/2F/4F). KILL $tmp" %} Sorry my bad, I missed it but the new patch (after @XiaohongGong's suggestion) doesnt have the separate neon/sve match rules anymore and I have made sure the match rule name in the format matches the match rule. Thanks! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2199971472 From aph at openjdk.org Fri Jul 11 08:25:54 2025 From: aph at openjdk.org (Andrew Haley) Date: Fri, 11 Jul 2025 08:25:54 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 13:53:25 GMT, Mikhail Ablakatov wrote: >> Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. >> >> Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still. 
>> >> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reducion micro-benchmarks. >> >> Benchmarks results: >> >> Neoverse-V1 (SVE 256-bit) >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms >> >> >> Fujitsu A64FX (SVE 512-bit): >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms > > Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: > > remove the strictly-ordered FP implementation as unused src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 4073: > 4071: f(0b101111, 15, 10), rf(Zn, 5), rf(Zd, 0); > 4072: } > 4073: This pattern should be in a section _SVE Integer Reduction_, C4.1.37. I'm not sure if any other instructions in that group are defined yet, but if not please start the section. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2199991598 From jbhateja at openjdk.org Fri Jul 11 08:51:17 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 11 Jul 2025 08:51:17 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v10] In-Reply-To: References: Message-ID: <7UVWuEfttp-9smTO095TsfR5wZ3pEwuOSp0M3rughnM=.8a6f46c6-a406-44ec-ac04-e70736b94ade@github.com> On Thu, 10 Jul 2025 15:19:50 GMT, Bhavana Kilambi wrote: >> test/hotspot/jtreg/compiler/vectorapi/TestSelectFromTwoVectorOp.java line 234: >> >>> 232: >>> 233: @Test >>> 234: @IR(counts = {IRNode.SELECT_FROM_TWO_VECTOR_VS, IRNode.VECTOR_SIZE_8, ">0"}, >> >> Hi @Bhavana-Kilambi , >> Kindly also include x86-specific feature checks in IR rules for this test. >> >> You can directly integrate attached patch. >> >> [select_from_ir_feature.txt](https://github.com/user-attachments/files/21034639/select_from_ir_feature.txt) > > Hi @jatin-bhateja , have you tested `jdk/incubator/vector` tests with your patch on x86? Hi @Bhavana-Kilambi , x86 implementation favors vector sizes greater than 64-bit, since AVX512 has a direct two-vector permute instruction. For legacy targets, I lower the IR through **_LowerSelectFromTwoVectorOperation_** to blend a pair of rearranges, such that an exceptional index (above vector lane count) selects a lane from the second source vector, else from the first one. Accidentally, my rough patch included an experimental change. Please update with the latest version. 
Best Regards [select_from_x86_patch.txt](https://github.com/user-attachments/files/21179211/select_from_x86_patch.txt) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2200094770 From aph at openjdk.org Fri Jul 11 08:55:46 2025 From: aph at openjdk.org (Andrew Haley) Date: Fri, 11 Jul 2025 08:55:46 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 08:11:30 GMT, Andrew Haley wrote: >> Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: >> >> remove the strictly-ordered FP implementation as unused > > src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 4073: > >> 4071: f(0b101111, 15, 10), rf(Zn, 5), rf(Zd, 0); >> 4072: } >> 4073: > > This pattern should be in a section _SVE Integer Reduction_, C4.1.37. I'm not sure if any other instructions in that group are defined yet, but if not please start the section. Sorry, the unpredicated version should be in the _SVE Integer Misc - Unpredicated_ section. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2200087303 From shade at openjdk.org Fri Jul 11 09:20:10 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 11 Jul 2025 09:20:10 GMT Subject: RFR: 8358580: Rethink how classes are kept alive in training data [v2] In-Reply-To: References: Message-ID: <6PnCrAx7F67U67Lqv1pvLvb4FBpJoUQ_OovolEUDMoA=.31a589ef-bc56-4764-bcf5-850c5de2c4e4@github.com> On Thu, 10 Jul 2025 17:10:22 GMT, Igor Veresov wrote: >> Use OopStorage directly instead of JNI handles. Note that we never destroy TrainingData objects, so we don't need to concern ourselves with freeing the OopStorage entries. Also, keeping the klasses alive is only necessary during the training run. During the replay the klasses TD objects refer to are always alive. > > Igor Veresov has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments I guess `klass->klass_holder()` is not needed to make sure we keep alive the class correctly, if we are running at init safepoint anyway. Given that code also works fine with storing Java mirror as root, and this change only moves that from JNI handles to OopStorage, this looks fine. ------------- Marked as reviewed by shade (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26233#pullrequestreview-3009424634 From xgong at openjdk.org Fri Jul 11 09:34:42 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 11 Jul 2025 09:34:42 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: <-SwJHROQB4jO9nlICIWSwNGXZDIQUy8O54baR-Xe80o=.f7c4fd43-330d-4870-ae4b-316ab7507b06@github.com> References: <6H9X-NXKOGd9BZVhTDiKNf7OO2KQTciRKGnXY-5C9yA=.e25f9e69-44c2-48d1-b4e3-cb8f1af79546@github.com> <_gHaFQTNq2bApeWAE88cWxcNULRDqndSSo3hrY31FgI=.132b7c24-7205-4877-9b95-3d9d13ac7ec8@github.com> <-SwJHROQB4jO9nlICIWSwNGXZDIQUy8O54baR-Xe80o=.f7c4fd43-330d-4870-ae4b-316ab7507b06@github.com> Message-ID: On Fri, 11 Jul 2025 07:33:19 GMT, Mikhail Ablakatov wrote: >> I see. Thanks! So what is the type of bt? Is it an integer type of floating-point one? If it's an integer type, I think changing or not does not have difference. But if it is floating-point type, we do not support the non-strict-ordered anyway and they are not enabled in SLP on AArch64. I'm just curious whether this change has any relationship with this PR. If not, I suggest not touching it now. 
Seems there is the same change in this PR https://github.com/openjdk/jdk/pull/23181. > > @XiaohongGong , JIC, you've referenced the PR you left this comment in. Did you intend to post it somewhere else? Oh, sorry, my bad. I intended to post this one: https://github.com/openjdk/jdk/pull/21895/files#diff-7b82624b78127158abbce6835eeba196bd062aee59512ec2d4e4c8c7d681573b ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2200220379 From thartmann at openjdk.org Fri Jul 11 10:08:41 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 11 Jul 2025 10:08:41 GMT Subject: RFR: 8350177: C2 SuperWord: Integer.numberOfLeadingZeros, numberOfTrailingZeros, reverse and bitCount have input types wrongly turncated for byte and short [v9] In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 12:59:24 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This patch fixes cases in SuperWord when compiling subword types where vectorized code would be given a narrower type than expected, leading to miscompilation due to truncation. This fix is a generalization of the same fix applied for `Integer.reverseBytes` in [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). The patch introduces a check for nodes that are known to tolerate truncation, so that any future cases of subword truncation will avoid creating miscompiled code. >> >> The patch reuses the existing logic to set the type of the vectors to int, which currently disables vectorization for the affected patterns entirely. Once [JDK-8342095](https://bugs.openjdk.org/browse/JDK-8342095) is merged and automatic casting support is added the autovectorizer should automatically insert casts to and from int, maintaining correctness. >> >> I've added an IR test that checks for correctly compiled outputs. Thoughts and reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Explicit nullptr checks All tests passed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25440#issuecomment-3061622691 From chagedorn at openjdk.org Fri Jul 11 10:21:44 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 11 Jul 2025 10:21:44 GMT Subject: RFR: 8361494: [IR Framework] Escape too much in replacement of placeholder [v3] In-Reply-To: References: Message-ID: <-bTfSUt8-boUIgH-411H7Di3_R3w8Ztj5FENzONdg24=.239c14ad-6e87-4682-9025-e5c54273bb6c@github.com> On Fri, 11 Jul 2025 07:30:55 GMT, Marc Chevalier wrote: >> In `RawIRNode::regex`, the call to `String::replaceAll` doesn't quote the replace string. >> >> Meaning that in the IR rule >> >> @IR(failOn = {IRNode.ALLOC_OF, "\\w+"}) >> >> The interpreted `\w` is interpreted as a group reference, and we get >> >> java.lang.IllegalArgumentException: Illegal group reference >> >> so we should write instead >> >> @IR(failOn = {IRNode.ALLOC_OF, "\\\\w+"}) >> >> To mean the interpreted string `\\w`, to mean an escaped single backslash. Same goes with `$` (used for nested classes). >> >> Since we don't want to refer to groups (and anyway, there are not in `IRNode.IS_REPLACED`), we just quote the replacement string with `java.util.regex.Matcher.quoteReplacement` to make it more usable. >> >> Note that you would still need to write `\$` since the `$` is the end of string regex, and needs to be escaped at the regex level (and not at the string, so it's not `$`, since `$` is not a special character). Before the fix, it should be `\\\$`. Phew! 
Regexes are bad enough, let's not escape them manually twice! >> >> In `test/hotspot/jtreg/compiler/c2/TestMergeStores.java`, that makes us save 1344 backslashes. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > i Marked as reviewed by chagedorn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26243#pullrequestreview-3009654293 From epeter at openjdk.org Fri Jul 11 10:23:42 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 11 Jul 2025 10:23:42 GMT Subject: RFR: 8350177: C2 SuperWord: Integer.numberOfLeadingZeros, numberOfTrailingZeros, reverse and bitCount have input types wrongly turncated for byte and short [v9] In-Reply-To: References: Message-ID: <_M-WU-cqWGnWSOxm8WOhkz7_EoHcx6YMrmtgPFGzG4U=.1dae8af4-6909-4ac5-bb7d-2cd9ca4b287c@github.com> On Mon, 30 Jun 2025 12:59:24 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This patch fixes cases in SuperWord when compiling subword types where vectorized code would be given a narrower type than expected, leading to miscompilation due to truncation. This fix is a generalization of the same fix applied for `Integer.reverseBytes` in [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). The patch introduces a check for nodes that are known to tolerate truncation, so that any future cases of subword truncation will avoid creating miscompiled code. >> >> The patch reuses the existing logic to set the type of the vectors to int, which currently disables vectorization for the affected patterns entirely. Once [JDK-8342095](https://bugs.openjdk.org/browse/JDK-8342095) is merged and automatic casting support is added the autovectorizer should automatically insert casts to and from int, maintaining correctness. >> >> I've added an IR test that checks for correctly compiled outputs. Thoughts and reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Explicit nullptr checks @jaskarth LGMT thanks for your work ? ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25440#pullrequestreview-3009660376 From epeter at openjdk.org Fri Jul 11 10:23:43 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 11 Jul 2025 10:23:43 GMT Subject: RFR: 8350177: C2 SuperWord: Integer.numberOfLeadingZeros, numberOfTrailingZeros, reverse and bitCount have input types wrongly turncated for byte and short In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 03:52:14 GMT, Jasmine Karthikeyan wrote: >> @jaskarth the tests look better now. I still saw this failure: >> >> `jdk/incubator/vector/Byte64VectorTests.java` >> >> Flags: `-XX:UseAVX=3 -XX:+UnlockDiagnosticVMOptions -XX:+UseKNLSetting` or `-XX:UseAVX=0` or `-XX:UseAVX=2` ... probably no flags are actually required. >> >> `# assert(false) failed: Unexpected node in SuperWord truncation: ExtractB` > > @eme64 Thanks for running it again! I've pushed a fix for the `ExtractB` assert, and a proactive fix marking any nodes with `TypeVect` as their base type as non-truncating. 
@jaskarth There is a title mismatch though ------------- PR Comment: https://git.openjdk.org/jdk/pull/25440#issuecomment-3061682166 From mchevalier at openjdk.org Fri Jul 11 10:34:25 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 11 Jul 2025 10:34:25 GMT Subject: RFR: 8361494: [IR Framework] Escape too much in replacement of placeholder [v3] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 07:30:55 GMT, Marc Chevalier wrote: >> In `RawIRNode::regex`, the call to `String::replaceAll` doesn't quote the replace string. >> >> Meaning that in the IR rule >> >> @IR(failOn = {IRNode.ALLOC_OF, "\\w+"}) >> >> The interpreted `\w` is interpreted as a group reference, and we get >> >> java.lang.IllegalArgumentException: Illegal group reference >> >> so we should write instead >> >> @IR(failOn = {IRNode.ALLOC_OF, "\\\\w+"}) >> >> To mean the interpreted string `\\w`, to mean an escaped single backslash. Same goes with `$` (used for nested classes). >> >> Since we don't want to refer to groups (and anyway, there are not in `IRNode.IS_REPLACED`), we just quote the replacement string with `java.util.regex.Matcher.quoteReplacement` to make it more usable. >> >> Note that you would still need to write `\$` since the `$` is the end of string regex, and needs to be escaped at the regex level (and not at the string, so it's not `$`, since `$` is not a special character). Before the fix, it should be `\\\$`. Phew! Regexes are bad enough, let's not escape them manually twice! >> >> In `test/hotspot/jtreg/compiler/c2/TestMergeStores.java`, that makes us save 1344 backslashes. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > i Thanks @mhaessig and @chhagedorn! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26243#issuecomment-3061701499 From mchevalier at openjdk.org Fri Jul 11 10:48:48 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 11 Jul 2025 10:48:48 GMT Subject: Integrated: 8361494: [IR Framework] Escape too much in replacement of placeholder In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 12:54:55 GMT, Marc Chevalier wrote: > In `RawIRNode::regex`, the call to `String::replaceAll` doesn't quote the replace string. > > Meaning that in the IR rule > > @IR(failOn = {IRNode.ALLOC_OF, "\\w+"}) > > The interpreted `\w` is interpreted as a group reference, and we get > > java.lang.IllegalArgumentException: Illegal group reference > > so we should write instead > > @IR(failOn = {IRNode.ALLOC_OF, "\\\\w+"}) > > To mean the interpreted string `\\w`, to mean an escaped single backslash. Same goes with `$` (used for nested classes). > > Since we don't want to refer to groups (and anyway, there are not in `IRNode.IS_REPLACED`), we just quote the replacement string with `java.util.regex.Matcher.quoteReplacement` to make it more usable. > > Note that you would still need to write `\$` since the `$` is the end of string regex, and needs to be escaped at the regex level (and not at the string, so it's not `$`, since `$` is not a special character). Before the fix, it should be `\\\$`. Phew! Regexes are bad enough, let's not escape them manually twice! > > In `test/hotspot/jtreg/compiler/c2/TestMergeStores.java`, that makes us save 1344 backslashes. > > Thanks, > Marc This pull request has now been integrated. 
Changeset: 76442f39 Author: Marc Chevalier URL: https://git.openjdk.org/jdk/commit/76442f39b9dd583f09a7adebb0fc5f37b6ef88ef Stats: 242 lines in 3 files changed: 15 ins; 0 del; 227 mod 8361494: [IR Framework] Escape too much in replacement of placeholder Reviewed-by: mhaessig, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/26243 From fjiang at openjdk.org Fri Jul 11 11:50:33 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Fri, 11 Jul 2025 11:50:33 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v5] In-Reply-To: References: Message-ID: > Hi, please consider. > [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. > The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. > If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. > This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. > We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. > > This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. > The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. > > Test on linux-riscv64: > - [x] Tier1-3 > > JMH data on P550 SBC for reference (w/o and w/ the patch): > > Before: > > Without COH: > > Benchmark (size) Mode Cnt Score Error Units > ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op > ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op > ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op > ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op > ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op > ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op > ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op > ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op > ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op > ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op > ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op > ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op > ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op > ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op > ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op > ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op > > ------------------------------------------------------------------------- > With COH: > > Benchmark (size) Mode Cnt Score Error Units > ArrayClone.byteArraycopy 0 avgt 15 50.667 ? 0.622 ns/op > Arra... Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains eight additional commits since the last revision: - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - also keep overlapping flag - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - Revert RISCV Macro modification - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses - riscv: fix c1 primitive array clone intrinsic regression ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25976/files - new: https://git.openjdk.org/jdk/pull/25976/files/ca628e16..26148c2d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25976&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25976&range=03-04 Stats: 3723 lines in 102 files changed: 1932 ins; 1091 del; 700 mod Patch: https://git.openjdk.org/jdk/pull/25976.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25976/head:pull/25976 PR: https://git.openjdk.org/jdk/pull/25976 From fjiang at openjdk.org Fri Jul 11 11:50:35 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Fri, 11 Jul 2025 11:50:35 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v4] In-Reply-To: <30vvzTU6W2p0YpB8Z9bSfO9ajO_fHh79q9cX1G3gz3k=.521b26d7-b606-4fdc-bdcf-41fd6c4891cc@github.com> References: <30vvzTU6W2p0YpB8Z9bSfO9ajO_fHh79q9cX1G3gz3k=.521b26d7-b606-4fdc-bdcf-41fd6c4891cc@github.com> Message-ID: <2CXVwx9MOtsQZgcNNyeAcjFrv8dyzuhRV4XOeFeygzY=.bf505d47-37e9-4a0c-bae6-ff0b8f109faf@github.com> On Wed, 9 Jul 2025 10:07:31 GMT, Feilong Jiang wrote: >> Hi, please consider. >> [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. >> The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. >> If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. >> This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. >> We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. >> >> This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. >> The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. >> >> Test on linux-riscv64: >> - [x] Tier1-3 >> >> JMH data on P550 SBC for reference (w/o and w/ the patch): >> >> Before: >> >> Without COH: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op >> ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op >> ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op >> ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op >> ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op >> ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op >> ArrayClone.intArraycopy 10 avgt 15 76.302 ? 
1.388 ns/op >> ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op >> ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op >> ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op >> ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op >> ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op >> >> ------------------------------------------------------------------------- >> With COH: >> >> Benchmark (size) Mode Cnt Score Error Un... > > Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - Revert RISCV Macro modification > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses > - riscv: fix c1 primitive array clone intrinsic regression @dean-long, thanks for the review. I have added `LIR_OpArrayCopy::overlapping`, could you please take another look? ------------- PR Comment: https://git.openjdk.org/jdk/pull/25976#issuecomment-3061976583 From fjiang at openjdk.org Fri Jul 11 11:50:36 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Fri, 11 Jul 2025 11:50:36 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v4] In-Reply-To: References: <30vvzTU6W2p0YpB8Z9bSfO9ajO_fHh79q9cX1G3gz3k=.521b26d7-b606-4fdc-bdcf-41fd6c4891cc@github.com> Message-ID: On Thu, 10 Jul 2025 22:42:44 GMT, Dean Long wrote: >> Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: >> >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - Revert RISCV Macro modification >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses >> - riscv: fix c1 primitive array clone intrinsic regression > > src/hotspot/share/c1/c1_LIR.cpp line 353: > >> 351: , _expected_type(expected_type) >> 352: , _flags(flags) { >> 353: #if defined(X86) || defined(AARCH64) || defined(S390) || defined(RISCV64) || defined(PPC64) > > Do we still need this #if? It would be nice if we can eventually remove it, but I guess arm32 support is missing. You are right, ARM32 support is not available, so we have to keep these platform guards for now. > src/hotspot/share/c1/c1_LIR.cpp line 354: > >> 352: , _flags(flags) { >> 353: #if defined(X86) || defined(AARCH64) || defined(S390) || defined(RISCV64) || defined(PPC64) >> 354: if (expected_type != nullptr && ((flags & ~LIR_OpArrayCopy::unaligned) == 0)) { > > I was concerned that this is platform-specific, but I checked and all platforms can handle unaligned or overlapping w/o using the stub. So maybe this should be using LIR_OpArrayCopy::unaligned|LIR_OpArrayCopy::overlapping? 
Yes, LIR_OpArrayCopy::overlapping was also reset if `OmitChecksFlag` is true. Added `LIR_OpArrayCopy::overlapping` too. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2200508522 PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2200507132 From bulasevich at openjdk.org Fri Jul 11 12:02:43 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Fri, 11 Jul 2025 12:02:43 GMT Subject: [jdk25] RFR: 8358183: [JVMCI] crash accessing nmethod::jvmci_name in CodeCache::aggregate In-Reply-To: <2-vD19USVejKcQe9AuVa1tq9RCGdspEJ1JVbt5BVI_4=.8e45bdfc-8142-4408-b960-6b95aa338e53@github.com> References: <2-vD19USVejKcQe9AuVa1tq9RCGdspEJ1JVbt5BVI_4=.8e45bdfc-8142-4408-b960-6b95aa338e53@github.com> Message-ID: On Thu, 10 Jul 2025 17:40:20 GMT, Boris Ulasevich wrote: > This is the backport of the JVMCI metadata crash fix. > > Issue: > When flushing nmethods via CodeBlob::purge(), the JVMCI metadata was freed (mutable_data) but its size fields remained non-zero. As a result, invoking heap analytics via jcmd Compiler.CodeHeap_Analytics still walks the purged metadata and calls jvmci_name() on arbitrary memory, leading to intermittent crashes > > Fix: > Extend CodeBlob::purge() to zero out the _mutable_data_size, _relocation_size, and _metadata_size fields so that after a purge jvmci_data_size() returns 0 and CompileBroker::print_heapinfo() skips any JVMCI metadata Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26248#issuecomment-3062013056 From bulasevich at openjdk.org Fri Jul 11 12:02:44 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Fri, 11 Jul 2025 12:02:44 GMT Subject: [jdk25] Integrated: 8358183: [JVMCI] crash accessing nmethod::jvmci_name in CodeCache::aggregate In-Reply-To: <2-vD19USVejKcQe9AuVa1tq9RCGdspEJ1JVbt5BVI_4=.8e45bdfc-8142-4408-b960-6b95aa338e53@github.com> References: <2-vD19USVejKcQe9AuVa1tq9RCGdspEJ1JVbt5BVI_4=.8e45bdfc-8142-4408-b960-6b95aa338e53@github.com> Message-ID: On Thu, 10 Jul 2025 17:40:20 GMT, Boris Ulasevich wrote: > This is the backport of the JVMCI metadata crash fix. > > Issue: > When flushing nmethods via CodeBlob::purge(), the JVMCI metadata was freed (mutable_data) but its size fields remained non-zero. As a result, invoking heap analytics via jcmd Compiler.CodeHeap_Analytics still walks the purged metadata and calls jvmci_name() on arbitrary memory, leading to intermittent crashes > > Fix: > Extend CodeBlob::purge() to zero out the _mutable_data_size, _relocation_size, and _metadata_size fields so that after a purge jvmci_data_size() returns 0 and CompileBroker::print_heapinfo() skips any JVMCI metadata This pull request has now been integrated. Changeset: 44f5dfef Author: Boris Ulasevich URL: https://git.openjdk.org/jdk/commit/44f5dfef976bbe81c4b76b8b432f29ca2ea223d4 Stats: 3 lines in 2 files changed: 3 ins; 0 del; 0 mod 8358183: [JVMCI] crash accessing nmethod::jvmci_name in CodeCache::aggregate Reviewed-by: thartmann Backport-of: 74822ce12acaf9816aa49b75ab5817ced3710776 ------------- PR: https://git.openjdk.org/jdk/pull/26248 From tschatzl at openjdk.org Fri Jul 11 12:18:38 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Fri, 11 Jul 2025 12:18:38 GMT Subject: RFR: 8361952: Installation of MethodData::extra_data_lock() misses synchronization on reader side In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 10:43:21 GMT, Aleksey Shipilev wrote: > Looks reasonable. 
> > Two nits: > > * This is not technically double-checked locking, this is just an atomic installation. So the RFE synopsis is not that accurate. Fixed. > > * Maybe `Atomic::cmpxchg` should use a more relaxed memory ordering as well, to micro-optimize and offset the costs of now-proper acquire a bit. There is no need for default `memory_order_conservative` here. I think `memory_order_seq_cst` would do: gives acquire of `old` and release of new in the same package. I will leave that suggestion to another RFE. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26262#issuecomment-3062079334 From shade at openjdk.org Fri Jul 11 12:25:39 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 11 Jul 2025 12:25:39 GMT Subject: RFR: 8361952: Installation of MethodData::extra_data_lock() misses synchronization on reader side In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 09:47:05 GMT, Thomas Schatzl wrote: > Hi all, > > please review this change that fixes some recently introduced atomic installation of a mutex, where the memory barrier (`load_acquire`) on the reader side. Without it the reader might get a valid pointer to the `Mutex` created on the fly, without it being initialized properly. > > Found during code inspection for https://bugs.openjdk.org/browse/JDK-8361706 ; due to some suspicious hangs in the `MutexLocker` while cleaning klasses during class unloading in parallel (multiple threads hanging in `MethodData::clean_method_data`), executing the `vmTestbase/nsk/jvmti/RedefineClasses/StressRedefine/TestDescription.java` test. > > Testing: gha > > Thanks, > Thomas OK, sure. ------------- Marked as reviewed by shade (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26262#pullrequestreview-3010074399 From aph at openjdk.org Fri Jul 11 12:26:54 2025 From: aph at openjdk.org (Andrew Haley) Date: Fri, 11 Jul 2025 12:26:54 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 07:36:20 GMT, Hao Sun wrote: > I see. Thanks for your explanation. Current version is okay to me. Perhaps we may want to add more comments here. The current code is just the sort of trap for the maintainer that leads to hard-to-find bugs. It'd be much better to remove the need for this comment by forcing everyone to provide two distinct scratch registers. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2200592617 From shade at openjdk.org Fri Jul 11 12:44:44 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 11 Jul 2025 12:44:44 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 18:29:14 GMT, Yadong Wang wrote: > The bug is that the predicate rule of immByteMapBase would cause a ConP Node for oop incorrect matching with byte_map_base when the placeholder jni handle address was just allocated to the address of byte_map_base. > > C2 uses JNI handles as placeholders to encoding constant oops, and one of some handle maybe locate at the address of byte_map_base, which is not memory reserved by CardTable. It's possible because JNIHandleBlocks are allocated by malloc. > > // The assembler store_check code will do an unsigned shift of the oop, > // then add it to _byte_map_base, i.e. 
> // > // _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift) > _byte_map = (CardValue*) rs.base(); > _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift); > > In aarch64 port, C2 will incorrectly match ConP for oop to ConP for byte_map_base by the immByteMapBase operand. > > // Card Table Byte Map Base > operand immByteMapBase() > %{ > // Get base of card map > predicate((jbyte*)n->get_ptr() == > ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); > match(ConP); > > op_cost(0); > format %{ %} > interface(CONST_INTER); > %} > > // Load Byte Map Base Constant > instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) > %{ > match(Set dst con); > > ins_cost(INSN_COST); > format %{ "adr $dst, $con\t# Byte Map Base" %} > > ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); > > ins_pipe(ialu_imm); > %} > > As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: > 0xffff25caf08c: ldaxr x8, [x11] > 0xffff25caf090: cmp x10, x8 > 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any > 0xffff25caf098: stlxr w8, x28, [x11] > 0xffff25caf09c: cbnz w8, 0xffff25caf08c > 0xffff25caf0a0: orr x11, xzr, #0x3 > 0xffff25caf0a4: str x11, [x13] > 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none > 0xffff25caf0ac: str x14, [sp] > 0xffff25caf0b0: add x2, sp, #0x20 > 0xffff25caf0b4: adrp x1, 0xffff21730000 > 0xffff25caf0b8: bl 0xffff256fffc0 > 0xffff25caf0bc: ldr x14, [sp] > 0xffff25caf0c0: b 0xffff25caef80 > 0xffff25caf0c4: add x13, sp, #0x20 > 0xffff25caf0c8: adrp x12, 0xffff21730000 > 0xffff25caf0cc: ldr x10, [x13] > 0xffff25caf0d0: cmp x10, xzr > 0xffff25caf0d4: b.eq 0xffff25caf130 // b.none > 0xffff25caf0d8: ldr x11, [x12] > 0xffff25caf0dc: tbnz w10, #1, 0xffff25caf0f... It is such a beautiful bug to read about on Friday. So the net effect of this mismatch is that we miss oop relocation/record when `ConP` accidentally mismatches to card table base, did I get that right? > Yes, it maybe a better solution for jdk main line, because immPollPage was remove in https://bugs.openjdk.org/browse/JDK-8220051. But how about jdk8u backport? I think we should do these things separately: 1. `immByteMapBase` rule removal in AArch64, this PR, then backport it to 25, 21, 17, maybe to 11, 8 2. `immByteMapBase` rule removal in RISC-V, separate PR, then backport it to 25, 21 3. `immPollPage` rule removal in AArch64, in 11u and 8u specific PRs The backports for (1) would not be clean, as Generational Shenandoah barrier checks would likely trigger technical conflicts in the code that is being removed. So there is doubly no point in going for clean backports, we should really slice them by the rule we are removing. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3062188862 From jbhateja at openjdk.org Fri Jul 11 13:01:32 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 11 Jul 2025 13:01:32 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v10] In-Reply-To: References: Message-ID: > Hi All, > > This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. > > Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. 
Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. > > New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with six additional commits since the last revision: - Update src/hotspot/share/opto/intrinsicnode.cpp Co-authored-by: Emanuel Peter - Update src/hotspot/share/opto/intrinsicnode.cpp Co-authored-by: Emanuel Peter - Update src/hotspot/share/opto/intrinsicnode.cpp Co-authored-by: Emanuel Peter - Update src/hotspot/share/opto/intrinsicnode.cpp Co-authored-by: Emanuel Peter - Update src/hotspot/share/opto/intrinsicnode.cpp Co-authored-by: Emanuel Peter - Update src/hotspot/share/opto/intrinsicnode.cpp Co-authored-by: Emanuel Peter ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23947/files - new: https://git.openjdk.org/jdk/pull/23947/files/96ecbac1..f00634a4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=08-09 Stats: 40 lines in 1 file changed: 1 ins; 14 del; 25 mod Patch: https://git.openjdk.org/jdk/pull/23947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23947/head:pull/23947 PR: https://git.openjdk.org/jdk/pull/23947 From aph at openjdk.org Fri Jul 11 13:22:40 2025 From: aph at openjdk.org (Andrew Haley) Date: Fri, 11 Jul 2025 13:22:40 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet In-Reply-To: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: On Thu, 26 Jun 2025 12:13:19 GMT, Samuel Chee wrote: > AtomicLong.CompareAndSet has the following assembly dump snippet which gets emitted from the intermediary LIRGenerator::atomic_cmpxchg: > > ;; cmpxchg { > 0x0000e708d144cf60: mov x8, x2 > 0x0000e708d144cf64: casal x8, x3, [x0] > 0x0000e708d144cf68: cmp x8, x2 > ;; 0x1F1F1F1F1F1F1F1F > 0x0000e708d144cf6c: mov x8, #0x1f1f1f1f1f1f1f1f > ;; } cmpxchg > 0x0000e708d144cf70: cset x8, ne // ne = any > 0x0000e708d144cf74: dmb ish > > > According to the Oracle Java Specification, AtomicLong.CompareAndSet [1] has the same memory effects as specified by VarHandle.compareAndSet which has the following effects: [2] > >> Atomically sets the value of a variable to the >> newValue with the memory semantics of setVolatile if >> the variable's current value, referred to as the witness >> value, == the expectedValue, as accessed with the memory >> semantics of getVolatile. > > > > Hence the release on the store due to setVolatile only occurs if the compare is successful. Since casal already satisfies these requirements, the dmb does not need to occur to ensure memory ordering in case the compare fails and a release does not happen. > > Hence we remove the dmb from both casl and casw (same logic applies to the non-long variant) > > This is also reflected by C2 not having a dmb for the same respective method. 
> > [1] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/util/concurrent/atomic/AtomicLong.html#compareAndSet(long,long) > [2] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/lang/invoke/VarHandle.html#compareAndSet(java.lang.Object...) On 04/07/2025 17:28, Samuel Chee wrote: > Hope this helps :) Thanks, this looks convincing. Please allow some time for me to do some more checking. This is a tricky area, and the the cost if we get it wrong is high. FYI, I'm still looking at this. It seems that the definition of barrier-ordered-before has been strengthened since this code was written. A test that I wrote a few years ago now passes on the online Herd7 simulator, where it used to fail. Back then I commented // At the time of writing we don't know of any AArch64 hardware that // reorders stores in this way, but the Reference Manual permits it. ... and confirmed my interpretation with the author of the Reference Manual. I'm guessing that older AArch64 implementations still in use never did such reorderings. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26000#issuecomment-3044020068 PR Comment: https://git.openjdk.org/jdk/pull/26000#issuecomment-3062325467 From jbhateja at openjdk.org Fri Jul 11 13:30:32 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 11 Jul 2025 13:30:32 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v11] In-Reply-To: References: Message-ID: <-TCK19ngXOwGp1EPss-clgnrwzy4-DWjjKFuuRJjB44=.68fe0cf1-fdd2-4c84-b5ab-dcffb3f705f0@github.com> > Hi All, > > This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. > > Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. > > New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Update test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23947/files - new: https://git.openjdk.org/jdk/pull/23947/files/f00634a4..c94779f5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=09-10 Stats: 364 lines in 1 file changed: 364 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23947/head:pull/23947 PR: https://git.openjdk.org/jdk/pull/23947 From jbhateja at openjdk.org Fri Jul 11 13:33:44 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 11 Jul 2025 13:33:44 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v8] In-Reply-To: References: Message-ID: On Tue, 3 Jun 2025 14:24:31 GMT, Emanuel Peter wrote: >> We can further constrain the value range bounds of bit compression and expansion once PR #17508 gets integrated. For now, I have developed the following draft demonstrates bound constraining with KnownBitLattice. 
>> >> >> // >> // Prototype of bit compress/expand value range computation >> // using KnownBits infrastructure. >> // >> >> #include >> #include >> #include >> #include >> >> template >> class KnownBitsLattice { >> private: >> U zeros; >> U ones; >> >> public: >> KnownBitsLattice(U lb, U ub); >> >> U getKnownZeros() { >> return zeros; >> } >> >> U getKnownOnes() { >> return ones; >> } >> >> long getKnownZerosCount() { >> uint64_t count = 0; >> asm volatile ("popcntq %1, %0 \n\t" : "=r"(count) : "r"(zeros) : "cc"); >> return count; >> } >> >> long getKnownOnesCount() { >> uint64_t count = 0; >> asm volatile ("popcntq %1, %0 \n\t" : "=r"(count) : "r"(ones) : "cc"); >> return count; >> } >> >> bool check_voilation() { >> // A given bit cannot be both zero or one. >> return (zeros & ones) != 0; >> } >> >> bool is_MSB_KnownOneBitsSet() { >> return (ones >> 63) == 1; >> } >> >> bool is_MSB_KnownZeroBitsSet() { >> return (zeros >> 63) == 1; >> } >> }; >> >> template >> KnownBitsLattice::KnownBitsLattice(U lb, U ub) { >> // To find KnownBitsLattice from a given value range >> // we first find the common prefix b/w upper and lower >> // bound, we then concertize known zeros and ones bit >> // based on common prefix. >> // e.g. >> // lb = 00110001 >> // ub = 00111111 >> // common prefix = 0011XXXX >> // knownbits.zeros = 11000000 >> // knownbits.ones = 00110000 >> // >> // conversely, for a give knownbits value we can find >> // lower and upper value ranges. >> // e.g. >> // knownbits.zeros = 0x00010001 >> // knownbits.ones = 0x10001100 >> // range.lo = knownbits.ones, this is because knownbits.ones are >> // guaranteed to be one. >> // range.hi = ~knownbits.zeros, this is an optimistic upper bound >> // which assumes all unset knownbits.zero >> // are ones. >> // Thus in above example, >> // range.lo = 0x8C >> // range.hi = 0xEE >> >> U lzcnt = 0; >> U common_prefix = lb ^ ub; >> asm volatile ("lzcntq %1, %0 \n\t" : "=r"(lzcnt) : "r"... > > @jatin-bhateja I think we are making progress, it seems to me now that the VM code is correct, at least as far as I can tell with visual inspection. > > We are still missing some additional tests, as I have asked for a few times already: > https://github.com/openjdk/jdk/pull/23947#issuecomment-2853896251 > > We should do something like this: > > public static test(int mask, int src) { > mask = Math.max(CON1, Math.min(CON2, mask)); > src = Math.max(CON2, Math.min(CON4, src)); > result = Integer.compress(src, mask); > int sum = 0; > if (sum > LIMIT_1) { sum += 1; } > if (sum > LIMIT_2) { sum += 2; } > if (sum > LIMIT_3) { sum += 4; } > if (sum > LIMIT_4) { sum += 8; } > if (sum > LIMIT_5) { sum += 16; } > if (sum > LIMIT_6) { sum += 32; } > if (sum > LIMIT_7) { sum += 64; } > if (sum > LIMIT_8) { sum += 128; } > return new int[] {sum, result}; > } > > > You could do the same pattern for `expand` too. > Then you pick random values using `Generators.java` for all the `CON` and `LIMIT`. > If we somehow produce a bad range, then the limit checks could constant fold wrongly, and then the `sum` would reflect this wrong result. Optimal would be to duplicate this pattern, and have one method that compiles, and one that runs in interpreter. That way, you can repeatedly call the methods with various `src` and `mask` values, and compare the output. Hi @eme64 , Updated the tests as per suggestion; however, for this bug fix patch, we are not doing aggressive value range optimization. 
I plan to extend value routines for compress/expand with the newly supported knownBits infrastructure in a subsequent RFE., Following is a prototype for the same. https://github.com/jatin-bhateja/external_staging/blob/main/Code/java/knownBits_DFA/bit_compress_expand_KnownBits.java Best Regards, Jatin ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3062355806 From epeter at openjdk.org Fri Jul 11 13:49:48 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 11 Jul 2025 13:49:48 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v11] In-Reply-To: <-TCK19ngXOwGp1EPss-clgnrwzy4-DWjjKFuuRJjB44=.68fe0cf1-fdd2-4c84-b5ab-dcffb3f705f0@github.com> References: <-TCK19ngXOwGp1EPss-clgnrwzy4-DWjjKFuuRJjB44=.68fe0cf1-fdd2-4c84-b5ab-dcffb3f705f0@github.com> Message-ID: <5XrDq3Z0bfinuBjxwRkt9vRknXPxSMXH8XH1ROz8YFQ=.13f50f28-01cc-4b05-98b5-c16c4544f3f0@github.com> On Fri, 11 Jul 2025 13:30:32 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. >> >> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. >> >> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Update test @jatin-bhateja Thanks for the updates! I'm going on vacation, so someone else will have to review this. I'll ping some people. test/hotspot/jtreg/compiler/c2/gvn/TestBitCompressValueTransform.java line 366: > 364: res += 800; > 365: } > 366: return res; Can you please use powers-of-2 instead? That way it cannot happen that one error masks out another. Imagine somehow we should only add `300` (I3), but instead add `100 + 200` (I1 and I2). ------------- PR Review: https://git.openjdk.org/jdk/pull/23947#pullrequestreview-3010398830 PR Review Comment: https://git.openjdk.org/jdk/pull/23947#discussion_r2200792513 From aph at openjdk.org Fri Jul 11 13:51:43 2025 From: aph at openjdk.org (Andrew Haley) Date: Fri, 11 Jul 2025 13:51:43 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v10] In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 12:58:23 GMT, Evgeny Astigeevich wrote: >> Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. >> >> This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. >> >> Tested on Linux and MacOS with and without hsdis: >> - Fastdebug: test passed >> - Slowdebug: test passed. >> - Release: test passed. 
> > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Remove lambda Marked as reviewed by aph (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26072#pullrequestreview-3010407313 From jbhateja at openjdk.org Fri Jul 11 14:16:03 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 11 Jul 2025 14:16:03 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v12] In-Reply-To: References: Message-ID: > Hi All, > > This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. > > Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. > > New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: Update test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23947/files - new: https://git.openjdk.org/jdk/pull/23947/files/c94779f5..c79efe09 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=10-11 Stats: 64 lines in 1 file changed: 0 ins; 0 del; 64 mod Patch: https://git.openjdk.org/jdk/pull/23947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23947/head:pull/23947 PR: https://git.openjdk.org/jdk/pull/23947 From jbhateja at openjdk.org Fri Jul 11 14:16:03 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 11 Jul 2025 14:16:03 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v11] In-Reply-To: <5XrDq3Z0bfinuBjxwRkt9vRknXPxSMXH8XH1ROz8YFQ=.13f50f28-01cc-4b05-98b5-c16c4544f3f0@github.com> References: <-TCK19ngXOwGp1EPss-clgnrwzy4-DWjjKFuuRJjB44=.68fe0cf1-fdd2-4c84-b5ab-dcffb3f705f0@github.com> <5XrDq3Z0bfinuBjxwRkt9vRknXPxSMXH8XH1ROz8YFQ=.13f50f28-01cc-4b05-98b5-c16c4544f3f0@github.com> Message-ID: On Fri, 11 Jul 2025 13:46:18 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Update test > > test/hotspot/jtreg/compiler/c2/gvn/TestBitCompressValueTransform.java line 366: > >> 364: res += 800; >> 365: } >> 366: return res; > > Can you please use powers-of-2 instead? That way it cannot happen that one error masks out another. > Imagine somehow we should only add `300` (I3), but instead add `100 + 200` (I1 and I2). Makes sense, thanks! 
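For reference, the shape being suggested looks roughly like this (a sketch only; the CON_* and LIMIT_* names stand for randomly generated constants, they are not the values used in the actual test):

    // Illustrative constants; in the real test these come from Generators.
    static final int CON_1 = -64, CON_2 = 1024, CON_3 = -128, CON_4 = 4096;
    static final int LIMIT_1 = 0, LIMIT_2 = 3, LIMIT_3 = 7, LIMIT_4 = 15;

    // Each branch adds a distinct power of two, so no combination of wrongly
    // taken branches can add up to the contribution of another branch
    // (e.g. 1 + 2 can never be mistaken for 4).
    public static int[] testCompress(int mask, int src) {
        mask = Math.max(CON_1, Math.min(CON_2, mask)); // clamp mask into [CON_1, CON_2]
        src  = Math.max(CON_3, Math.min(CON_4, src));  // clamp src into [CON_3, CON_4]
        int result = Integer.compress(src, mask);
        int flags = 0;
        if (result > LIMIT_1) { flags += 1; }
        if (result > LIMIT_2) { flags += 2; }
        if (result > LIMIT_3) { flags += 4; }
        if (result > LIMIT_4) { flags += 8; }
        return new int[] { flags, result };
    }

The same shape works for Integer/Long.expand, and the compiled result can be compared against an interpreter-only copy of the method for random src and mask inputs.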
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23947#discussion_r2200850492 From eastigeevich at openjdk.org Fri Jul 11 15:28:47 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Fri, 11 Jul 2025 15:28:47 GMT Subject: Integrated: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 15:29:10 GMT, Evgeny Astigeevich wrote: > Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. > > This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. > > Tested on Linux and MacOS with and without hsdis: > - Fastdebug: test passed > - Slowdebug: test passed. > - Release: test passed. This pull request has now been integrated. Changeset: a86dd56d Author: Evgeny Astigeevich URL: https://git.openjdk.org/jdk/commit/a86dd56de34f730b42593236f17118ef5ce4985a Stats: 128 lines in 3 files changed: 34 ins; 71 del; 23 mod 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 Reviewed-by: shade, aph ------------- PR: https://git.openjdk.org/jdk/pull/26072 From mchevalier at openjdk.org Fri Jul 11 17:01:53 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 11 Jul 2025 17:01:53 GMT Subject: RFR: 8361492: [IR Framework] Has too restrictive regex for load and store Message-ID: Improving store and load regexes + adding test. It's mostly an improve version of a fix I had to do in Valhalla where it was blocking (part of JDK-8361250, blocking JDK-8357785). The new regex takes into account that classes can implement interfaces, nested classes, and various labels after the `@`. It should be more robust. Thanks, Marc ------------- Commit messages: - Improve Load/Store regexes Changes: https://git.openjdk.org/jdk/pull/26269/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26269&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361492 Stats: 253 lines in 2 files changed: 249 ins; 2 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26269.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26269/head:pull/26269 PR: https://git.openjdk.org/jdk/pull/26269 From mchevalier at openjdk.org Fri Jul 11 17:01:53 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 11 Jul 2025 17:01:53 GMT Subject: RFR: 8361492: [IR Framework] Has too restrictive regex for load and store In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 16:56:16 GMT, Marc Chevalier wrote: > Improving store and load regexes + adding test. It's mostly an improve version of a fix I had to do in Valhalla where it was blocking (part of JDK-8361250, blocking JDK-8357785). > > The new regex takes into account that classes can implement interfaces, nested classes, and various labels after the `@`. It should be more robust. > > Thanks, > Marc test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java line 2980: > 2978: // @bla: bli:a/b/c$d$e (f/g,h/i/j):NotNull+24 * > 2979: private static final String LOAD_STORE_PREFIX = "@(\\w+: ?)*[\\w/\\$]*\\b"; > 2980: private static final String LOAD_STORE_SUFFIX = "( \\([^\\)]+\\))?(:|\\+)\\S* \\*"; I moved these definitions next to the only place they are used, and should be used. 
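As a quick standalone sanity check (not framework code), the concatenated prefix and suffix accept the example node dump from the comment above:

    import java.util.regex.Pattern;

    public class LoadStoreRegexCheck {
        static final String LOAD_STORE_PREFIX = "@(\\w+: ?)*[\\w/\\$]*\\b";
        static final String LOAD_STORE_SUFFIX = "( \\([^\\)]+\\))?(:|\\+)\\S* \\*";

        public static void main(String[] args) {
            String example = "@bla: bli:a/b/c$d$e (f/g,h/i/j):NotNull+24 *";
            Pattern p = Pattern.compile(LOAD_STORE_PREFIX + LOAD_STORE_SUFFIX);
            System.out.println(p.matcher(example).matches()); // expected: true
        }
    }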
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26269#discussion_r2201316286 From chagedorn at openjdk.org Fri Jul 11 17:17:39 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 11 Jul 2025 17:17:39 GMT Subject: RFR: 8361492: [IR Framework] Has too restrictive regex for load and store In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 16:56:16 GMT, Marc Chevalier wrote: > Improving store and load regexes + adding test. It's mostly an improve version of a fix I had to do in Valhalla where it was blocking (part of JDK-8361250, blocking JDK-8357785). > > The new regex takes into account that classes can implement interfaces, nested classes, and various labels after the `@`. It should be more robust. > > Thanks, > Marc Nice addition and good tests! Some code style nits in the test code but otherwise, it looks good to me. test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPhaseIRMatching.java line 562: > 560: int i; > 561: } > 562: interface I2{} Suggestion: interface I1 {} static class Base implements I1 { int i; } interface I2 {} test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPhaseIRMatching.java line 567: > 565: } > 566: Base Lb = new Base(); > 567: Derived Ld = new Derived(); Maybe give them a more descriptive name and make them lower case. Same below for `Ldn`. test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPhaseIRMatching.java line 569: > 567: Derived Ld = new Derived(); > 568: > 569: static class SingleNest{ Suggestion: static class SingleNest { ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26269#pullrequestreview-3011251296 PR Review Comment: https://git.openjdk.org/jdk/pull/26269#discussion_r2201357032 PR Review Comment: https://git.openjdk.org/jdk/pull/26269#discussion_r2201362629 PR Review Comment: https://git.openjdk.org/jdk/pull/26269#discussion_r2201357876 From coleenp at openjdk.org Fri Jul 11 17:53:38 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 11 Jul 2025 17:53:38 GMT Subject: RFR: 8361952: Installation of MethodData::extra_data_lock() misses synchronization on reader side In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 09:47:05 GMT, Thomas Schatzl wrote: > Hi all, > > please review this change that fixes some recently introduced atomic installation of a mutex, where the memory barrier (`load_acquire`) on the reader side. Without it the reader might get a valid pointer to the `Mutex` created on the fly, without it being initialized properly. > > Found during code inspection for https://bugs.openjdk.org/browse/JDK-8361706 ; due to some suspicious hangs in the `MutexLocker` while cleaning klasses during class unloading in parallel (multiple threads hanging in `MethodData::clean_method_data`), executing the `vmTestbase/nsk/jvmti/RedefineClasses/StressRedefine/TestDescription.java` test. > > Testing: gha > > Thanks, > Thomas This looks good. ------------- Marked as reviewed by coleenp (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26262#pullrequestreview-3011379256 From jrose at openjdk.org Fri Jul 11 18:00:44 2025 From: jrose at openjdk.org (John R Rose) Date: Fri, 11 Jul 2025 18:00:44 GMT Subject: RFR: 8327963: C2: fix construction of memory graph around Initialize node to prevent incorrect execution if allocation is removed [v11] In-Reply-To: References: <3jUFOPYDIqmzEywhzf58guwS0qZGBUCMZ3lXeltlS3c=.5c82601f-cf4d-4b2a-a525-1f8f4c7c4a3b@github.com> Message-ID: On Fri, 6 Jun 2025 09:01:46 GMT, Roland Westrelin wrote: >> An `Initialize` node for an `Allocate` node is created with a memory >> `Proj` of adr type raw memory. In order for stores to be captured, the >> memory state out of the allocation is a `MergeMem` with slices for the >> various object fields/array element set to the raw memory `Proj` of >> the `Initialize` node. If `Phi`s need to be created during later >> transformations from this memory state, The `Phi` for a particular >> slice gets its adr type from the type of the `Proj` which is raw >> memory. If during macro expansion, the `Allocate` is found to have no >> use and so can be removed, the `Proj` out of the `Initialize` is >> replaced by the memory state on input to the `Allocate`. A `Phi` for >> some slice for a field of an object will end up with the raw memory >> state on input to the `Allocate` node. As a result, memory state at >> the `Phi` is incorrect and incorrect execution can happen. >> >> The fix I propose is, rather than have a single `Proj` for the memory >> state out of the `Initialize` with adr type raw memory, to use one >> `Proj` per slice added to the memory state after the `Initalize`. Each >> of the `Proj` should return the right adr type for its slice. For that >> I propose having a new type of `Proj`: `NarrowMemProj` that captures >> the right adr type. >> >> Logic for the construction of the `Allocate`/`Initialize` subgraph is >> tweaked so the right adr type captured in is own `NarrowMemProj` is >> added to the memory sugraph. Code that removes an allocation or moves >> it also has to be changed so it correctly takes the multiple memory >> projections out of the `Initialize` node into account. >> >> One tricky issue is that when EA split types for a scalar replaceable >> `Allocate` node: >> >> 1- the adr type captured in the `NarrowMemProj` becomes out of sync >> with the type of the slices for the allocation >> >> 2- before EA, the memory state for one particular field out of the >> `Initialize` node can be used for a `Store` to the just allocated >> object or some other. So we can have a chain of `Store`s, some to >> the newly allocated object, some to some other objects, all of them >> using the state of `NarrowMemProj` out of the `Initialize`. After >> split unique types, the `NarrowMemProj` is for the slice of a >> particular allocation. So `Store`s to some other objects shouldn't >> use that memory state but the memory state before the `Allocate`. >> >> For that, I added logic to update the adr type of `NarrowMemProj` >> during split uni... > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > more > > I still think it would be good to include test cases to confirm that these are not only theoretical concerns, but that should not block the progress of this PR. > > Here is a test case ? where all the damage is done early on when EA runs. A pass of loop opts before EA fully unrolls the loop and creates memory `Phi`s with incorrect `adr_type` (raw memory). 
Then EA removes the allocation. All that keeps the `Store` to `field1` alive then is uncommon traps from template predicates. Once they are removed, the `Store` goes away (first round of loop opts after EA). > > I'll add that test case to the PR. I think the moral of this story is: Any single compiler optimization must never depend on "always going first". If unrelated IR transforms do not commute, something will eventually go wrong, when some unpredictable source code requires (or requests) the fragile, non-commutative optimizations to happen in the "wrong order". Roland, I deeply appreciate your previous comment about fixing root causes, and avoiding shiny workarounds that seem to make the bug go away. (Often the workarounds attempt to restrict the free commutation/reordering of optimizations, by adding contextual checks like extra pattern matching or phase sensistivity.) Such workarounds "seem to work" but usually in the end "fail to work" as well. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24570#issuecomment-3063227169 From iveresov at openjdk.org Fri Jul 11 18:09:59 2025 From: iveresov at openjdk.org (Igor Veresov) Date: Fri, 11 Jul 2025 18:09:59 GMT Subject: RFR: 8358580: Rethink how classes are kept alive in training data [v2] In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 17:10:22 GMT, Igor Veresov wrote: >> Use OopStorage directly instead of JNI handles. Note that we never destroy TrainingData objects, so we don't need to concern ourselves with freeing the OopStorage entries. Also, keeping the klasses alive is only necessary during the training run. During the replay the klasses TD objects refer to are always alive. > > Igor Veresov has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments Thanks for the reviews Coleen and Alexey! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26233#issuecomment-3063249658 From iveresov at openjdk.org Fri Jul 11 18:09:59 2025 From: iveresov at openjdk.org (Igor Veresov) Date: Fri, 11 Jul 2025 18:09:59 GMT Subject: Integrated: 8358580: Rethink how classes are kept alive in training data In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 02:37:05 GMT, Igor Veresov wrote: > Use OopStorage directly instead of JNI handles. Note that we never destroy TrainingData objects, so we don't need to concern ourselves with freeing the OopStorage entries. Also, keeping the klasses alive is only necessary during the training run. During the replay the klasses TD objects refer to are always alive. This pull request has now been integrated. 
Changeset: 59bec29c Author: Igor Veresov URL: https://git.openjdk.org/jdk/commit/59bec29c35361b7b256a2d435ced3458b0c5ea58 Stats: 7 lines in 2 files changed: 1 ins; 3 del; 3 mod 8358580: Rethink how classes are kept alive in training data Reviewed-by: coleenp, shade ------------- PR: https://git.openjdk.org/jdk/pull/26233 From jrose at openjdk.org Fri Jul 11 18:22:50 2025 From: jrose at openjdk.org (John R Rose) Date: Fri, 11 Jul 2025 18:22:50 GMT Subject: RFR: 8327963: C2: fix construction of memory graph around Initialize node to prevent incorrect execution if allocation is removed [v8] In-Reply-To: References: <3jUFOPYDIqmzEywhzf58guwS0qZGBUCMZ3lXeltlS3c=.5c82601f-cf4d-4b2a-a525-1f8f4c7c4a3b@github.com> <1gdeBnZ7YuIf9CgQW2bCXkDDBWPjUgRnickHts-fvzE=.e6e901ba-3e9f-41a2-9c68-167a879e9655@github.com> <2m1_XtiSsW_LaBRrkX4qv7AKtLOjNgnl4mUp3zisasE=.dda62164-7aa0-4c1a-b83f-fa40ba7902e5@github.com> <4374L3lkQK90wLxxOA7POBmIKNX2DFK-4pO4vj1bkuQ=.5b8d7825-a7f1-497f-ab66-02a85a266659@github.com> Message-ID: On Thu, 12 Jun 2025 15:39:35 GMT, Roland Westrelin wrote: > I think it would be good (although not necessarily in the context of this PR) to establish the "no duplicate memory projection" invariant in the back-end, for sanity and to make sure we do not break any logic that might be implicitly relying on it. If you agree, could you file a follow-up RFE, ideally with a reproducer where the current logic fails to remove `NarrowMemProj`s? I see this as a request for a better "normal form" for the graph. The trick here is that, if we are allowing temporary "abnormal" forms of the graph, in order to give various transforms some "working room" to rearrange things, we need to decide when are the moments when the graph must be settled back down into a normal form. We sometimes check for some kinds of IR normality, and/or enforce some normality, in the "final graph reshape" phase. The problem with loading up too many ad hoc operations at that point is, it may create a completely new kind of graph with new invariants. (Don't like the current standard? Create a new one, and see how that goes! Same for global IR contracts.) Having two kinds of IR with two sets of invariants (one set more restrictive) has an obvious objection: We fragment our ability to enforce the rules; we need to write enforcement logic which says "which phase are we in?" before checking the right set of rules. And if the editing sessions are rare, we don't get much benefit from the rules that are enforced by that editing session. By definition "final graph reshape" is rare. It's worth it since we are going to a lower IR, which really must have different rules, but it's not a light thing to add to the design. In any case, adding a normalization requirement seems to need a "wash pass" of some sort over the whole graph, to do necessary cleanups. We do this sometimes, I think, after loop opts or EA, maybe other places, and at "final graph reshape". This is going to be a runtime expense, I think, unless it can be piggybacked on some other pass we already do. Maybe a hallmark of these "post-operative" cleanups is that the operation itself required some side data structure, created just for the operation (loop nest or connection graph) and discarded later in order to unleash unconstrained downstream transforms. During the operation, transforms are specialized just to keep the side data structure relevant. Afterwards, the graph "opens up" to unconstrained changes. 
But in all cases, local updates should be as free as possible, even if their order varies randomly due to worklist artifacts, etc. (BTW, this is why stress tests on worklist order are valuable.) I'm not advocating firmly for or against new normalizations, but here's a final thought to throw in: Performing normalizations seems to distrust an important design decision, noted in my previous comment. That is, IR transformations should be "confluent" or "commutative"; it should not matter in what order you perform the transforms; you should still get to a better program, with identical user-visible semantics, whichever order you apply the transforms. (Worklist stress tests, again?) Obviously we globally schedule our tactics at the top level, but (in a deep sense) it should rarely matter what order we schedule things, at least as far as correctness goes. And down in the details, at a local level, it really should not matter what form of graph you are working with. Specifically, if we are using narrow memory projections sometimes, we should be prepared to respect them always. (Except, perhaps, at very well defined global cut-points, like final graph reshape, or a comprehensive cleanup after episodes of loop opts or EA.) It's been a while since I coded C2 stuff as my day job, but HTH. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24570#issuecomment-3063284364 From mhaessig at openjdk.org Fri Jul 11 20:09:48 2025 From: mhaessig at openjdk.org (Manuel Hässig) Date: Fri, 11 Jul 2025 20:09:48 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v3] In-Reply-To: References: Message-ID: <9DKaxH7MifcYDT7qC4COJK4ggAUqZBaNInE5sgHuc8U=.36c78071-df29-4461-a924-0fb4f34443b0@github.com> On Tue, 8 Jul 2025 07:40:02 GMT, Hannes Greule wrote: >> Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. >> >> Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. >> >> I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. >> >> Please review. Thanks. > > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > re-add package, add methods to Run annotation That exception is indeed a deeper problem. After some investigation, I filed [JDK-8362046](https://bugs.openjdk.org/browse/JDK-8362046). Essentially, the code generation for the byte reverse unsigned short intrinsic on aarch64 does not narrow the result. So your test already paid dividends by exposing this long-standing bug. Thank you! Since this is a fix for a bug in JDK 25 and RDP2 is close, I would suggest that we comment out the offending line and merge it. But a Reviewer should sign off on that since this is my first rampdown. For those wondering why the signed short case is not commented out: code generation on aarch64 will sign extend the result. So far, I have not been able to make that case crash. If I was not creative enough in my endeavours to produce the same behaviour and there is a bug, the RNG will expose it at some point.
test/hotspot/jtreg/compiler/c2/gvn/ReverseBytesConstantsTests.java line 93: > 91: Asserts.assertEQ(Short.reverseBytes((short) 0x8070), testS3()); > 92: Asserts.assertEQ(Short.reverseBytes(C_SHORT), testS4()); > 93: Asserts.assertEQ(ReverseBytesConstantsHelper.reverseBytesShort(C_INT), testS5()); Suggestion: // TODO: uncomment after integration of JDK-8362046 // Asserts.assertEQ(ReverseBytesConstantsHelper.reverseBytesShort(C_INT), testS5()); ------------- Changes requested by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/25988#pullrequestreview-3011799731 PR Review Comment: https://git.openjdk.org/jdk/pull/25988#discussion_r2201726167 From duke at openjdk.org Fri Jul 11 20:33:49 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Fri, 11 Jul 2025 20:33:49 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 00:07:01 GMT, Dean Long wrote: > Why not have a new fix_relocation_after_xxx() that is platform-specific? For most platforms it can just delegate to fix_relocation_after_move() I think adding a new `fix_relocation_after_xxx()` might be a bit overkill, since every case except one would just delegate to `fix_relocation_after_move()` anyway. The trampoline handling logic already lives in the shared code `trampoline_stub_Relocation::fix_relocation_after_move`, which calls `pd_fix_owner_after_move` and handles the platform specific scenarios. Since this only comes up during nmethod relocation, I think it makes more sense to keep that logic within the nmethod relocation code itself. This change also keeps all existing code untouched so there isn't a concern about affecting `CodeBuffer::relocate_code_to()` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2201790815 From snatarajan at openjdk.org Fri Jul 11 22:24:54 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Fri, 11 Jul 2025 22:24:54 GMT Subject: RFR: 8353276: C2: simplify PhaseMacroExpand::opt_bits_test Message-ID: **Issue** The last three parameters of `PhaseMacroExpand::opt_bits_test(Node* ctrl, Node* region, int edge, Node* word, int mask, int bits, bool return_fast_path)` are unnecessary after the fix introduced in [JDK-8256425](https://bugs.openjdk.org/browse/JDK-8256425) **Fix** The proposed fix removes the last three parameters and makes the necessary modification to the methods. **Testing** GitHub Actions tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. ------------- Commit messages: - initial fix Changes: https://git.openjdk.org/jdk/pull/26276/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26276&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8353276 Stats: 15 lines in 2 files changed: 0 ins; 10 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/26276.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26276/head:pull/26276 PR: https://git.openjdk.org/jdk/pull/26276 From yadongwang at openjdk.org Sat Jul 12 06:17:49 2025 From: yadongwang at openjdk.org (Yadong Wang) Date: Sat, 12 Jul 2025 06:17:49 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 12:40:40 GMT, Aleksey Shipilev wrote: > It is such a beautiful bug to read about on Friday. > > So the net effect of this mismatch is that we miss oop relocation/record when `ConP` accidentally mismatches to card table base, did I get that right? 
> > > Yes, it maybe a better solution for jdk main line, because immPollPage was remove in https://bugs.openjdk.org/browse/JDK-8220051. But how about jdk8u backport? > > I think we should do these things separately: > 1. `immByteMapBase` rule removal in AArch64, this PR, then backport it to 25, 21, 17, maybe to 11, 8 > 2. `immByteMapBase` rule removal in RISC-V, separate PR, then backport it to 25, 21 > 3. `immPollPage` rule removal in AArch64, in 11u and 8u specific PRs, _IF_ we think that is a problem, which I don't think it is. > > The backports for (1) would not be clean, as Generational Shenandoah barrier checks would likely trigger technical conflicts in the code that is being removed. So there is doubly no point in going for clean backports, we should really slice them by the rule we are removing. Agree, very clear strategy. l'll just remove the bytemapbase rule in this pr and do enough tests. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3064755489 From hgreule at openjdk.org Sat Jul 12 08:16:25 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Sat, 12 Jul 2025 08:16:25 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v4] In-Reply-To: References: Message-ID: > Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. > > Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. > > I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. > > Please review. Thanks. Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: comment out Character cases ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25988/files - new: https://git.openjdk.org/jdk/pull/25988/files/f352726e..f8cc3496 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25988&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25988&range=02-03 Stats: 3 lines in 1 file changed: 1 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/25988.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25988/head:pull/25988 PR: https://git.openjdk.org/jdk/pull/25988 From hgreule at openjdk.org Sat Jul 12 08:16:25 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Sat, 12 Jul 2025 08:16:25 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v3] In-Reply-To: References: Message-ID: <5-Jz3kCmy4CfZonu0ra013G2R2OtG5P4k7tnpSzXTOE=.b7fb7b29-a7f2-4a55-8d9a-8bd5d315a8bd@github.com> On Tue, 8 Jul 2025 07:40:02 GMT, Hannes Greule wrote: >> Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. >> >> Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. >> >> I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. >> >> Please review. Thanks. 
> > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > re-add package, add methods to Run annotation Oh I see, that explains why I didn't run into that problem testing on my x86 machine :) The `Short.valueOf(short)` case doesn't crash because the method checks the lower bound too, it's just that `Character.valueOf(char)` assumes that the given value is >= 0 and therefore accesses the array if the value actually is < 0. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25988#issuecomment-3064923635 From hgreule at openjdk.org Sat Jul 12 08:16:25 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Sat, 12 Jul 2025 08:16:25 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v3] In-Reply-To: <9DKaxH7MifcYDT7qC4COJK4ggAUqZBaNInE5sgHuc8U=.36c78071-df29-4461-a924-0fb4f34443b0@github.com> References: <9DKaxH7MifcYDT7qC4COJK4ggAUqZBaNInE5sgHuc8U=.36c78071-df29-4461-a924-0fb4f34443b0@github.com> Message-ID: On Fri, 11 Jul 2025 19:54:29 GMT, Manuel H?ssig wrote: >> Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: >> >> re-add package, add methods to Run annotation > > test/hotspot/jtreg/compiler/c2/gvn/ReverseBytesConstantsTests.java line 93: > >> 91: Asserts.assertEQ(Short.reverseBytes((short) 0x8070), testS3()); >> 92: Asserts.assertEQ(Short.reverseBytes(C_SHORT), testS4()); >> 93: Asserts.assertEQ(ReverseBytesConstantsHelper.reverseBytesShort(C_INT), testS5()); > > Suggestion: > > // TODO: uncomment after integration of JDK-8362046 > // Asserts.assertEQ(ReverseBytesConstantsHelper.reverseBytesShort(C_INT), testS5()); I assume you mean the Character case here instead. I commented it out for now, but I agree a reviewer should acknowledge that this is the right way to go. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25988#discussion_r2202431063 From fjiang at openjdk.org Sat Jul 12 08:33:39 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Sat, 12 Jul 2025 08:33:39 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v2] In-Reply-To: References: Message-ID: <0LblAxbmz6kZ08-0YuE91e4AnpxQe-XkPF43tI4Do8s=.3d69b111-e666-4aba-9128-e671fd013215@github.com> On Mon, 7 Jul 2025 03:05:25 GMT, Dingli Zhang wrote: >> Hi, please consider this code cleanup change for native call. >> >> This removes the address parameter for NativeCall::reloc_destination and NativeFarCall::reloc_destination. >> This also removes several unnecessary code blob related runtime checks turning them into assertions. >> >> ### Testing >> * [x] hs:tier1 - hs:tier3 tested with linux-riscv64 fastdebug build > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Remove outdated comments src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 104: > 102: > 103: CodeBlob *code = CodeCache::find_blob(call_addr); > 104: assert(code != nullptr && code->is_nmethod(), "nmethod expected"); The `if (code != nullptr && code->is_nmethod())` statement indicates that `code->is_nmethod()` may not always be true. Do you think we should use assert here? Looks like the new assert logic is inconsistent with the original code. 
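For comparison, keeping the runtime check rather than asserting would look roughly like this (a sketch only; the enclosing function and its failure value are assumed here, not taken from the patch):

    // Bail out when the blob at call_addr is absent or is not an nmethod,
    // instead of asserting that it always is one.
    CodeBlob* code = CodeCache::find_blob(call_addr);
    if (code == nullptr || !code->is_nmethod()) {
      return nullptr; // assumed failure value for this sketch
    }
    address stub_addr = trampoline_stub_Relocation::get_trampoline_for(call_addr, code->as_nmethod());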
src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 105: > 103: CodeBlob *code = CodeCache::find_blob(call_addr); > 104: assert(code != nullptr && code->is_nmethod(), "nmethod expected"); > 105: nmethod* nm = code->as_nmethod(); `nm` is only used for `get_trampoline_for(call_addr, nm)`, maybe we can just use `code->as_nmethod()` directly instead of making a new variable. src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 156: > 154: if (code->is_nmethod()) { > 155: nmethod* nm = code->as_nmethod(); > 156: stub_addr = trampoline_stub_Relocation::get_trampoline_for(call_addr, nm); Same here Suggestion: stub_addr = trampoline_stub_Relocation::get_trampoline_for(call_addr, code->as_nmethod()); src/hotspot/cpu/riscv/relocInfo_riscv.cpp line 76: > 74: address Relocation::pd_call_destination(address orig_addr) { > 75: assert(is_call(), "should be a call here"); > 76: if (orig_addr == nullptr) { IIUC, it is synchronized with [JDK-8321509](https://bugs.openjdk.org/browse/JDK-8321509)? Should we add `USE_TRAMPOLINE_STUB_FIX_OWNER` and the other stuff? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26150#discussion_r2202426556 PR Review Comment: https://git.openjdk.org/jdk/pull/26150#discussion_r2202428678 PR Review Comment: https://git.openjdk.org/jdk/pull/26150#discussion_r2202431146 PR Review Comment: https://git.openjdk.org/jdk/pull/26150#discussion_r2202440479 From fjiang at openjdk.org Sat Jul 12 08:44:40 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Sat, 12 Jul 2025 08:44:40 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v2] In-Reply-To: <0LblAxbmz6kZ08-0YuE91e4AnpxQe-XkPF43tI4Do8s=.3d69b111-e666-4aba-9128-e671fd013215@github.com> References: <0LblAxbmz6kZ08-0YuE91e4AnpxQe-XkPF43tI4Do8s=.3d69b111-e666-4aba-9128-e671fd013215@github.com> Message-ID: <0chby2OGAaOlnuBaNhQ1KapoDWyqm9dUgd7sQMJIWY4=.7f97c489-9053-4f01-a8ad-9bfce008d797@github.com> On Sat, 12 Jul 2025 08:24:45 GMT, Feilong Jiang wrote: >> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove outdated comments > > src/hotspot/cpu/riscv/relocInfo_riscv.cpp line 76: > >> 74: address Relocation::pd_call_destination(address orig_addr) { >> 75: assert(is_call(), "should be a call here"); >> 76: if (orig_addr == nullptr) { > > IIUC, it is synchronized with [JDK-8321509](https://bugs.openjdk.org/browse/JDK-8321509)? Should we add `USE_TRAMPOLINE_STUB_FIX_OWNER` and the other stuff? [JDK-8343430](https://bugs.openjdk.org/browse/JDK-8343430) already removed the trampoline call for RISCV, so we are good, right? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26150#discussion_r2202449018 From yadongwang at openjdk.org Sun Jul 13 08:40:45 2025 From: yadongwang at openjdk.org (Yadong Wang) Date: Sun, 13 Jul 2025 08:40:45 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: > The bug is that the predicate rule of immByteMapBase would cause a ConP Node for oop incorrect matching with byte_map_base when the placeholder jni handle address was just allocated to the address of byte_map_base. > > C2 uses JNI handles as placeholders to encoding constant oops, and one of some handle maybe locate at the address of byte_map_base, which is not memory reserved by CardTable. It's possible because JNIHandleBlocks are allocated by malloc. 
> > // The assembler store_check code will do an unsigned shift of the oop, > // then add it to _byte_map_base, i.e. > // > // _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift) > _byte_map = (CardValue*) rs.base(); > _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift); > > In aarch64 port, C2 will incorrectly match ConP for oop to ConP for byte_map_base by the immByteMapBase operand. > > // Card Table Byte Map Base > operand immByteMapBase() > %{ > // Get base of card map > predicate((jbyte*)n->get_ptr() == > ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); > match(ConP); > > op_cost(0); > format %{ %} > interface(CONST_INTER); > %} > > // Load Byte Map Base Constant > instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) > %{ > match(Set dst con); > > ins_cost(INSN_COST); > format %{ "adr $dst, $con\t# Byte Map Base" %} > > ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); > > ins_pipe(ialu_imm); > %} > > As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: > 0xffff25caf08c: ldaxr x8, [x11] > 0xffff25caf090: cmp x10, x8 > 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any > 0xffff25caf098: stlxr w8, x28, [x11] > 0xffff25caf09c: cbnz w8, 0xffff25caf08c > 0xffff25caf0a0: orr x11, xzr, #0x3 > 0xffff25caf0a4: str x11, [x13] > 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none > 0xffff25caf0ac: str x14, [sp] > 0xffff25caf0b0: add x2, sp, #0x20 > 0xffff25caf0b4: adrp x1, 0xffff21730000 > 0xffff25caf0b8: bl 0xffff256fffc0 > 0xffff25caf0bc: ldr x14, [sp] > 0xffff25caf0c0: b 0xffff25caef80 > 0xffff25caf0c4: add x13, sp, #0x20 > 0xffff25caf0c8: adrp x12, 0xffff21730000 > 0xffff25caf0cc: ldr x10, [x13] > 0xffff25caf0d0: cmp x10, xzr > 0xffff25caf0d4: b.eq 0xffff25caf130 // b.none > 0xffff25caf0d8: ldr x11, [x12] > 0xffff25caf0dc: tbnz w10, #1, 0xffff25caf0f... Yadong Wang has updated the pull request incrementally with one additional commit since the last revision: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26249/files - new: https://git.openjdk.org/jdk/pull/26249/files/5b6b5859..e22b34ad Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26249&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26249&range=00-01 Stats: 33 lines in 1 file changed: 0 ins; 33 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26249.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26249/head:pull/26249 PR: https://git.openjdk.org/jdk/pull/26249 From aph at openjdk.org Sun Jul 13 09:34:47 2025 From: aph at openjdk.org (Andrew Haley) Date: Sun, 13 Jul 2025 09:34:47 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: <73AnlXOv0T8K25DgsNdH1PkBjcBXz0f3bBYZx44LpAw=.439f5383-ffd1-44e8-9e11-4b5af9b6a278@github.com> <3f1UnDuYp2iYVcciKF-BqdChOOY2PJJG5R0QuyfblVM=.37a92dfd-5de4-4924-83c5-f9c2e5d7548c@github.com> Message-ID: <85Fw_Bg0OrMd_LYl4PG_VFqFX2QTdcUK-DFOAxzyjIM=.bdbb0c13-f7a3-458f-a61a-004c6eadc1cc@github.com> On Wed, 2 Jul 2025 20:45:59 GMT, Chad Rakoczy wrote: >> If fixing call sites fails (like in the event of a missing trampoline) an assert will fail and the JVM will crash. 
I suppose it could be updated to abandon the relocation if that happens but that would require `fix_relocation_after_move` to return if it succeeded and proper handling by the caller. > > This is only an issue because Hotspot reduces the branch range for debug builds on aarch64 and Graal doesn't. If we're going to handle this case I think we should fail fast but it does raise the question of what should actually be done in this situation > If fixing call sites fails (like in the event of a missing trampoline) an assert will fail and the JVM will crash. In what circumstances would a trampoline be missing? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2203261856 From aph at openjdk.org Sun Jul 13 09:39:56 2025 From: aph at openjdk.org (Andrew Haley) Date: Sun, 13 Jul 2025 09:39:56 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v35] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 20:03:17 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 95 commits: > > - Typo > - Merge branch 'master' into JDK-8316694-Final > - Update justification for skipping CallRelocation > - Enclose ImmutableDataReferencesCounterSize in parentheses > - Let trampolines fix their owners > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Update how call sites are fixed > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Fix pointer printing > - Use set_destination_mt_safe > - ... and 85 more: https://git.openjdk.org/jdk/compare/117f0b40...66d73c16 src/hotspot/share/code/nmethod.cpp line 1392: > 1390: > 1391: > 1392: nmethod::nmethod(nmethod* nm) : CodeBlob(nm->_name, nm->_kind, nm->_size, nm->_header_size) Should this be a copy constructor? nmethod::nmethod(const nmethod &nm) : CodeBlob(nm._name, nm._kind, nm._size, nm._header_size) Even if we don't make it a copy constructor, it looks like its nmethod argument should be `const`, but I haven't checked very deeply. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2203266153 From jkarthikeyan at openjdk.org Sun Jul 13 21:30:52 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Sun, 13 Jul 2025 21:30:52 GMT Subject: RFR: 8350177: C2 SuperWord: Integer.numberOfLeadingZeros, numberOfTrailingZeros, reverse and bitCount have input types wrongly truncated for byte and short [v9] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 10:05:48 GMT, Tobias Hartmann wrote: >> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: >> >> Explicit nullptr checks > > All tests passed. @TobiHartmann @eme64 Thanks a lot for the testing and re-reviews! I've fixed the PR title and will integrate it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25440#issuecomment-3067310856 From jkarthikeyan at openjdk.org Sun Jul 13 21:30:53 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Sun, 13 Jul 2025 21:30:53 GMT Subject: Integrated: 8350177: C2 SuperWord: Integer.numberOfLeadingZeros, numberOfTrailingZeros, reverse and bitCount have input types wrongly truncated for byte and short In-Reply-To: References: Message-ID: On Mon, 26 May 2025 07:15:31 GMT, Jasmine Karthikeyan wrote: > Hi all, > This patch fixes cases in SuperWord when compiling subword types where vectorized code would be given a narrower type than expected, leading to miscompilation due to truncation. This fix is a generalization of the same fix applied for `Integer.reverseBytes` in [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). The patch introduces a check for nodes that are known to tolerate truncation, so that any future cases of subword truncation will avoid creating miscompiled code. > > The patch reuses the existing logic to set the type of the vectors to int, which currently disables vectorization for the affected patterns entirely. Once [JDK-8342095](https://bugs.openjdk.org/browse/JDK-8342095) is merged and automatic casting support is added the autovectorizer should automatically insert casts to and from int, maintaining correctness. > > I've added an IR test that checks for correctly compiled outputs. Thoughts and reviews would be appreciated! This pull request has now been integrated. Changeset: 77bd417c Author: Jasmine Karthikeyan URL: https://git.openjdk.org/jdk/commit/77bd417c9990f57525257d9df89b9df4d7991461 Stats: 464 lines in 2 files changed: 460 ins; 0 del; 4 mod 8350177: C2 SuperWord: Integer.numberOfLeadingZeros, numberOfTrailingZeros, reverse and bitCount have input types wrongly truncated for byte and short Reviewed-by: epeter, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/25440 From xgong at openjdk.org Mon Jul 14 02:12:41 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 14 Jul 2025 02:12:41 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 07:04:44 GMT, Xiaohong Gong wrote: > This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform. > > ### Background > Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). 
Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register. > > ### Implementation > > #### Challenges > Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints. > > For a 512-bit SVE machine, loading a `byte` vector with different vector species require different approaches: > - SPECIES_64: Single operation with mask (8 elements, 256-bit) > - SPECIES_128: Single operation, full register (16 elements, 512-bit) > - SPECIES_256: Two operations + merge (32 elements, 1024-bit) > - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit) > > Use `ByteVector.SPECIES_512` as an example: > - It contains 64 elements. So the index vector size should be `64 * 32` bits, which is 4 times of the SVE vector register size. > - It requires 4 times of vector gather-loads to finish the whole operation. > > > byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...] > int[] idx = [0, 1, 2, 3, ..., 63, ...] > > 4 gather-load: > idx_v1 = [15 14 13 ... 1 0] gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa] > idx_v2 = [31 30 29 ... 17 16] gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb] > idx_v3 = [47 46 45 ... 33 32] gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc] > idx_v4 = [63 62 61 ... 49 48] gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd] > merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa] > > > #### Solution > The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end. > > Here is the main changes: > - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher. > - Added `VectorSliceNode` for result merging. > - Added `VectorMaskWidenNode` for mask spliting and type conversion fo... Hi @Bhavana-Kilambi, @fg1417, could you please help take a look at this PR? BTW, since the vector register size of my SVE machine is 128-bit, could you please help test the correctness on a SVE machine with larger vector size (e.g. 512-bit vector size)? Thanks a lot in advance! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3067549711 From dholmes at openjdk.org Mon Jul 14 03:46:40 2025 From: dholmes at openjdk.org (David Holmes) Date: Mon, 14 Jul 2025 03:46:40 GMT Subject: RFR: 8361952: Installation of MethodData::extra_data_lock() misses synchronization on reader side In-Reply-To: References: Message-ID: <17de-tEa5r8YpcWhJukEJUvVDjIIXuANurWSfa7m3UQ=.f9745287-4454-4186-8382-0392e328c490@github.com> On Fri, 11 Jul 2025 09:47:05 GMT, Thomas Schatzl wrote: > Hi all, > > please review this change that fixes some recently introduced atomic installation of a mutex, where the memory barrier (`load_acquire`) on the reader side. Without it the reader might get a valid pointer to the `Mutex` created on the fly, without it being initialized properly. > > Found during code inspection for https://bugs.openjdk.org/browse/JDK-8361706 ; due to some suspicious hangs in the `MutexLocker` while cleaning klasses during class unloading in parallel (multiple threads hanging in `MethodData::clean_method_data`), executing the `vmTestbase/nsk/jvmti/RedefineClasses/StressRedefine/TestDescription.java` test. 
> > Testing: gha > > Thanks, > Thomas Marked as reviewed by dholmes (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26262#pullrequestreview-3014697109 From dzhang at openjdk.org Mon Jul 14 03:49:23 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 14 Jul 2025 03:49:23 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v3] In-Reply-To: References: Message-ID: > Hi, please consider this code cleanup change for native call. > > This removes the address parameter for NativeCall::reloc_destination and NativeFarCall::reloc_destination. > This also removes several unnecessary code blob related runtime checks turning them into assertions. > > ### Testing > * [x] hs:tier1 - hs:tier3 tested with linux-riscv64 fastdebug build Dingli Zhang has updated the pull request incrementally with two additional commits since the last revision: - Remove assert_lock for riscv - Use code->as_nmethod() directly in nativeInst_riscv.cpp ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26150/files - new: https://git.openjdk.org/jdk/pull/26150/files/d7ff8e53..da45cf52 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26150&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26150&range=01-02 Stats: 11 lines in 2 files changed: 0 ins; 3 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/26150.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26150/head:pull/26150 PR: https://git.openjdk.org/jdk/pull/26150 From dzhang at openjdk.org Mon Jul 14 03:52:39 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 14 Jul 2025 03:52:39 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v2] In-Reply-To: <0LblAxbmz6kZ08-0YuE91e4AnpxQe-XkPF43tI4Do8s=.3d69b111-e666-4aba-9128-e671fd013215@github.com> References: <0LblAxbmz6kZ08-0YuE91e4AnpxQe-XkPF43tI4Do8s=.3d69b111-e666-4aba-9128-e671fd013215@github.com> Message-ID: On Sat, 12 Jul 2025 08:13:00 GMT, Feilong Jiang wrote: >> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove outdated comments > > src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 156: > >> 154: if (code->is_nmethod()) { >> 155: nmethod* nm = code->as_nmethod(); >> 156: stub_addr = trampoline_stub_Relocation::get_trampoline_for(call_addr, nm); > > Same here > Suggestion: > > stub_addr = trampoline_stub_Relocation::get_trampoline_for(call_addr, code->as_nmethod()); Fixed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26150#discussion_r2203797452 From dzhang at openjdk.org Mon Jul 14 04:27:46 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 14 Jul 2025 04:27:46 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v2] In-Reply-To: <0chby2OGAaOlnuBaNhQ1KapoDWyqm9dUgd7sQMJIWY4=.7f97c489-9053-4f01-a8ad-9bfce008d797@github.com> References: <0LblAxbmz6kZ08-0YuE91e4AnpxQe-XkPF43tI4Do8s=.3d69b111-e666-4aba-9128-e671fd013215@github.com> <0chby2OGAaOlnuBaNhQ1KapoDWyqm9dUgd7sQMJIWY4=.7f97c489-9053-4f01-a8ad-9bfce008d797@github.com> Message-ID: On Sat, 12 Jul 2025 08:41:52 GMT, Feilong Jiang wrote: >> src/hotspot/cpu/riscv/relocInfo_riscv.cpp line 76: >> >>> 74: address Relocation::pd_call_destination(address orig_addr) { >>> 75: assert(is_call(), "should be a call here"); >>> 76: if (orig_addr == nullptr) { >> >> IIUC, it is synchronized with [JDK-8321509](https://bugs.openjdk.org/browse/JDK-8321509)? Should we add `USE_TRAMPOLINE_STUB_FIX_OWNER` and the other stuff? 
> > [JDK-8343430](https://bugs.openjdk.org/browse/JDK-8343430) already removed the trampoline call for RISCV, so we are good, right? Correct, it is partially synchronized with [JDK-8321509](https://bugs.openjdk.org/browse/JDK-8321509), mainly 4 and 6 [here](https://github.com/openjdk/jdk/pull/19796#issuecomment-2188094970). We do not really need `USE_TRAMPOLINE_STUB_FIX_OWNER` after [JDK-8343430](https://bugs.openjdk.org/browse/JDK-8343430). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26150#discussion_r2203821966 From dzhang at openjdk.org Mon Jul 14 04:49:40 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 14 Jul 2025 04:49:40 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v2] In-Reply-To: <0LblAxbmz6kZ08-0YuE91e4AnpxQe-XkPF43tI4Do8s=.3d69b111-e666-4aba-9128-e671fd013215@github.com> References: <0LblAxbmz6kZ08-0YuE91e4AnpxQe-XkPF43tI4Do8s=.3d69b111-e666-4aba-9128-e671fd013215@github.com> Message-ID: On Sat, 12 Jul 2025 08:04:55 GMT, Feilong Jiang wrote: >> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove outdated comments > > src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 104: > >> 102: >> 103: CodeBlob *code = CodeCache::find_blob(call_addr); >> 104: assert(code != nullptr && code->is_nmethod(), "nmethod expected"); > > The `if (code != nullptr && code->is_nmethod())` statement indicates that `code->is_nmethod()` may not always be true. Do you think we should use assert here? Looks like the new assert logic is inconsistent with the original code. Following up on 6 [here](https://github.com/openjdk/jdk/pull/19796#issuecomment-2188094970), we also added a fast path in `Relocation::pd_call_destination` , which for ARM64 has the same assertion in `destination()`: https://github.com/openjdk/jdk/blob/73e3e0edeb20c6f701b213423476f92fb05dd262/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp#L53-L64 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26150#discussion_r2203839770 From thartmann at openjdk.org Mon Jul 14 05:31:53 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 14 Jul 2025 05:31:53 GMT Subject: [jdk25] RFR: 8350177: C2 SuperWord: Integer.numberOfLeadingZeros, numberOfTrailingZeros, reverse and bitCount have input types wrongly truncated for byte and short Message-ID: Hi all, This pull request contains a backport of commit [77bd417c](https://github.com/openjdk/jdk/commit/77bd417c9990f57525257d9df89b9df4d7991461) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. The commit being backported was authored by Jasmine Karthikeyan on 13 Jul 2025 and was reviewed by Emanuel Peter and Tobias Hartmann. Thanks! ------------- Commit messages: - Backport 77bd417c9990f57525257d9df89b9df4d7991461 Changes: https://git.openjdk.org/jdk/pull/26286/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26286&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8350177 Stats: 464 lines in 2 files changed: 460 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/26286.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26286/head:pull/26286 PR: https://git.openjdk.org/jdk/pull/26286 From mchevalier at openjdk.org Mon Jul 14 06:16:32 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Mon, 14 Jul 2025 06:16:32 GMT Subject: RFR: 8361492: [IR Framework] Has too restrictive regex for load and store [v2] In-Reply-To: References: Message-ID: > Improving store and load regexes + adding test. 
It's mostly an improve version of a fix I had to do in Valhalla where it was blocking (part of JDK-8361250, blocking JDK-8357785). > > The new regex takes into account that classes can implement interfaces, nested classes, and various labels after the `@`. It should be more robust. > > Thanks, > Marc Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: Mostly add spaces and rename, a bit ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26269/files - new: https://git.openjdk.org/jdk/pull/26269/files/70c8f867..3dc823cb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26269&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26269&range=00-01 Stats: 13 lines in 1 file changed: 0 ins; 0 del; 13 mod Patch: https://git.openjdk.org/jdk/pull/26269.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26269/head:pull/26269 PR: https://git.openjdk.org/jdk/pull/26269 From mchevalier at openjdk.org Mon Jul 14 06:16:33 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Mon, 14 Jul 2025 06:16:33 GMT Subject: RFR: 8361492: [IR Framework] Has too restrictive regex for load and store In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 16:56:16 GMT, Marc Chevalier wrote: > Improving store and load regexes + adding test. It's mostly an improve version of a fix I had to do in Valhalla where it was blocking (part of JDK-8361250, blocking JDK-8357785). > > The new regex takes into account that classes can implement interfaces, nested classes, and various labels after the `@`. It should be more robust. > > Thanks, > Marc I've added all the spaces and renamed in a non-very original, but hopefully more explicit way. As explicit as a made up class for a test allows. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26269#issuecomment-3067965905 From fjiang at openjdk.org Mon Jul 14 06:54:40 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Mon, 14 Jul 2025 06:54:40 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v2] In-Reply-To: References: <0LblAxbmz6kZ08-0YuE91e4AnpxQe-XkPF43tI4Do8s=.3d69b111-e666-4aba-9128-e671fd013215@github.com> <0chby2OGAaOlnuBaNhQ1KapoDWyqm9dUgd7sQMJIWY4=.7f97c489-9053-4f01-a8ad-9bfce008d797@github.com> Message-ID: On Mon, 14 Jul 2025 04:24:58 GMT, Dingli Zhang wrote: > it is partially synchronized with [JDK-8321509](https://bugs.openjdk.org/browse/JDK-8321509), mainly 4 and 6 [here](https://github.com/openjdk/jdk/pull/19796#issuecomment-2188094970). As you mentioned in PR, this is just cleanup code. Do you think we should add this code in a separate PR? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26150#discussion_r2203984671 From dfenacci at openjdk.org Mon Jul 14 07:03:40 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Mon, 14 Jul 2025 07:03:40 GMT Subject: RFR: 8361492: [IR Framework] Has too restrictive regex for load and store [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 06:16:32 GMT, Marc Chevalier wrote: >> Improving store and load regexes + adding test. It's mostly an improve version of a fix I had to do in Valhalla where it was blocking (part of JDK-8361250, blocking JDK-8357785). >> >> The new regex takes into account that classes can implement interfaces, nested classes, and various labels after the `@`. It should be more robust. 
>> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Mostly add spaces > > and rename, a bit Well spotted @marc-chevalier and thanks for fixing it! LGTM ------------- Marked as reviewed by dfenacci (Committer). PR Review: https://git.openjdk.org/jdk/pull/26269#pullrequestreview-3014997859 From chagedorn at openjdk.org Mon Jul 14 07:03:39 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 14 Jul 2025 07:03:39 GMT Subject: [jdk25] RFR: 8350177: C2 SuperWord: Integer.numberOfLeadingZeros, numberOfTrailingZeros, reverse and bitCount have input types wrongly truncated for byte and short In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 05:26:35 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [77bd417c](https://github.com/openjdk/jdk/commit/77bd417c9990f57525257d9df89b9df4d7991461) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Jasmine Karthikeyan on 13 Jul 2025 and was reviewed by Emanuel Peter and Tobias Hartmann. > > Thanks! Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26286#pullrequestreview-3014998561 From thartmann at openjdk.org Mon Jul 14 07:08:39 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 14 Jul 2025 07:08:39 GMT Subject: [jdk25] RFR: 8350177: C2 SuperWord: Integer.numberOfLeadingZeros, numberOfTrailingZeros, reverse and bitCount have input types wrongly truncated for byte and short In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 05:26:35 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [77bd417c](https://github.com/openjdk/jdk/commit/77bd417c9990f57525257d9df89b9df4d7991461) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Jasmine Karthikeyan on 13 Jul 2025 and was reviewed by Emanuel Peter and Tobias Hartmann. > > Thanks! Thanks Christian! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26286#issuecomment-3068084628 From fyang at openjdk.org Mon Jul 14 07:14:38 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 14 Jul 2025 07:14:38 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v2] In-Reply-To: References: <0LblAxbmz6kZ08-0YuE91e4AnpxQe-XkPF43tI4Do8s=.3d69b111-e666-4aba-9128-e671fd013215@github.com> <0chby2OGAaOlnuBaNhQ1KapoDWyqm9dUgd7sQMJIWY4=.7f97c489-9053-4f01-a8ad-9bfce008d797@github.com> Message-ID: On Mon, 14 Jul 2025 06:52:18 GMT, Feilong Jiang wrote: >> Correct, it is partially synchronized with [JDK-8321509](https://bugs.openjdk.org/browse/JDK-8321509), mainly 4 and 6 [here](https://github.com/openjdk/jdk/pull/19796#issuecomment-2188094970). >> We do not really need `USE_TRAMPOLINE_STUB_FIX_OWNER` after [JDK-8343430](https://bugs.openjdk.org/browse/JDK-8343430). > >> it is partially synchronized with [JDK-8321509](https://bugs.openjdk.org/browse/JDK-8321509), mainly 4 and 6 [here](https://github.com/openjdk/jdk/pull/19796#issuecomment-2188094970). > > As you mentioned in PR, this is just cleanup. Should we add this code in a separate PR? I agree with @feilongjiang that we should keep this a cleanup change which shouldn't modify the original logic. 
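For readers skimming the thread, the two shapes being weighed above look roughly like this (a simplified sketch, not the exact RISC-V sources; `call_addr` is assumed to already point at the call instruction):

    // Shape kept by the cleanup: tolerate blobs that are not nmethods at runtime.
    address stub_addr = nullptr;
    CodeBlob* code = CodeCache::find_blob(call_addr);
    if (code != nullptr && code->is_nmethod()) {
      stub_addr = trampoline_stub_Relocation::get_trampoline_for(call_addr, code->as_nmethod());
    }

    // Shape that was reverted: treat a missing or non-nmethod blob as a
    // programming error, checked only in debug builds.
    CodeBlob* blob = CodeCache::find_blob(call_addr);
    assert(blob != nullptr && blob->is_nmethod(), "nmethod expected");
    address stub = trampoline_stub_Relocation::get_trampoline_for(call_addr, blob->as_nmethod());

Keeping the first form preserves the original runtime behaviour, which is why the assertion variant was dropped from this cleanup-only change.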
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26150#discussion_r2204016013 From dzhang at openjdk.org Mon Jul 14 07:28:33 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 14 Jul 2025 07:28:33 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v4] In-Reply-To: References: Message-ID: > Hi, please consider this code cleanup change for native call. > > This removes the address parameter for NativeCall::reloc_destination and NativeFarCall::reloc_destination. > This also removes several unnecessary code blob related runtime checks turning them into assertions. > > ### Testing > * [x] hs:tier1 - hs:tier3 tested with linux-riscv64 fastdebug build Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Revert changes not related to cleanup ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26150/files - new: https://git.openjdk.org/jdk/pull/26150/files/da45cf52..a04d7101 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26150&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26150&range=02-03 Stats: 13 lines in 2 files changed: 6 ins; 1 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/26150.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26150/head:pull/26150 PR: https://git.openjdk.org/jdk/pull/26150 From fyang at openjdk.org Mon Jul 14 07:28:33 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 14 Jul 2025 07:28:33 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v4] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 07:24:35 GMT, Dingli Zhang wrote: >> Hi, please consider this code cleanup change for native call. >> >> This removes the address parameter for NativeCall::reloc_destination and NativeFarCall::reloc_destination. >> This also removes several unnecessary code blob related runtime checks turning them into assertions. >> >> ### Testing >> * [x] hs:tier1 - hs:tier3 tested with linux-riscv64 fastdebug build > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Revert changes not related to cleanup Looks good. Thanks for the cleanup. ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26150#pullrequestreview-3015050495 From dzhang at openjdk.org Mon Jul 14 07:28:35 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 14 Jul 2025 07:28:35 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v2] In-Reply-To: References: <0LblAxbmz6kZ08-0YuE91e4AnpxQe-XkPF43tI4Do8s=.3d69b111-e666-4aba-9128-e671fd013215@github.com> Message-ID: On Mon, 14 Jul 2025 04:47:01 GMT, Dingli Zhang wrote: >> src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 104: >> >>> 102: >>> 103: CodeBlob *code = CodeCache::find_blob(call_addr); >>> 104: assert(code != nullptr && code->is_nmethod(), "nmethod expected"); >> >> The `if (code != nullptr && code->is_nmethod())` statement indicates that `code->is_nmethod()` may not always be true. Do you think we should use assert here? Looks like the new assert logic is inconsistent with the original code. 
> > Following up on 6 [here](https://github.com/openjdk/jdk/pull/19796#issuecomment-2188094970), we also added a fast path in `Relocation::pd_call_destination` , which for ARM64 has the same assertion in `destination()`: > https://github.com/openjdk/jdk/blob/73e3e0edeb20c6f701b213423476f92fb05dd262/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp#L53-L64 I will only keep the clean up part and revert the rest of the changes. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26150#discussion_r2204031907 From dzhang at openjdk.org Mon Jul 14 07:28:36 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 14 Jul 2025 07:28:36 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v2] In-Reply-To: References: <0LblAxbmz6kZ08-0YuE91e4AnpxQe-XkPF43tI4Do8s=.3d69b111-e666-4aba-9128-e671fd013215@github.com> <0chby2OGAaOlnuBaNhQ1KapoDWyqm9dUgd7sQMJIWY4=.7f97c489-9053-4f01-a8ad-9bfce008d797@github.com> Message-ID: On Mon, 14 Jul 2025 07:11:34 GMT, Fei Yang wrote: >>> it is partially synchronized with [JDK-8321509](https://bugs.openjdk.org/browse/JDK-8321509), mainly 4 and 6 [here](https://github.com/openjdk/jdk/pull/19796#issuecomment-2188094970). >> >> As you mentioned in PR, this is just cleanup. Should we add this code in a separate PR? > > I agree with @feilongjiang that we should keep this a cleanup change which shouldn't modify the original logic. Thanks all for the review! I will only keep the clean up part and revert the rest of the changes. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26150#discussion_r2204025695 From thartmann at openjdk.org Mon Jul 14 07:30:29 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 14 Jul 2025 07:30:29 GMT Subject: Integrated: 8362122: Problem list TestStressBailout until JDK-8361752 is fixed Message-ID: <8--uIYOUgtsBdjOokcvKQz1fQX-5KHbmfGMlSFtt9Q8=.4e866498-7144-4465-bb13-858fb30b2935@github.com> Let's problem list the test until [JDK-8361752](https://bugs.openjdk.org/browse/JDK-8361752) is fixed. The failure seems to be triggered by [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473). Thanks, Tobias ------------- Commit messages: - 8362122: Problem list TestStressBailout until JDK-8361752 is fixed Changes: https://git.openjdk.org/jdk/pull/26288/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26288&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362122 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26288.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26288/head:pull/26288 PR: https://git.openjdk.org/jdk/pull/26288 From chagedorn at openjdk.org Mon Jul 14 07:30:29 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 14 Jul 2025 07:30:29 GMT Subject: Integrated: 8362122: Problem list TestStressBailout until JDK-8361752 is fixed In-Reply-To: <8--uIYOUgtsBdjOokcvKQz1fQX-5KHbmfGMlSFtt9Q8=.4e866498-7144-4465-bb13-858fb30b2935@github.com> References: <8--uIYOUgtsBdjOokcvKQz1fQX-5KHbmfGMlSFtt9Q8=.4e866498-7144-4465-bb13-858fb30b2935@github.com> Message-ID: On Mon, 14 Jul 2025 07:18:44 GMT, Tobias Hartmann wrote: > Let's problem list the test until [JDK-8361752](https://bugs.openjdk.org/browse/JDK-8361752) is fixed. The failure seems to be triggered by [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473). > > Thanks, > Tobias Looks good and trivial. ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26288#pullrequestreview-3015042255 From thartmann at openjdk.org Mon Jul 14 07:30:29 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 14 Jul 2025 07:30:29 GMT Subject: Integrated: 8362122: Problem list TestStressBailout until JDK-8361752 is fixed In-Reply-To: <8--uIYOUgtsBdjOokcvKQz1fQX-5KHbmfGMlSFtt9Q8=.4e866498-7144-4465-bb13-858fb30b2935@github.com> References: <8--uIYOUgtsBdjOokcvKQz1fQX-5KHbmfGMlSFtt9Q8=.4e866498-7144-4465-bb13-858fb30b2935@github.com> Message-ID: On Mon, 14 Jul 2025 07:18:44 GMT, Tobias Hartmann wrote: > Let's problem list the test until [JDK-8361752](https://bugs.openjdk.org/browse/JDK-8361752) is fixed. The failure seems to be triggered by [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473). > > Thanks, > Tobias Thanks Christian! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26288#issuecomment-3068135063 From thartmann at openjdk.org Mon Jul 14 07:30:29 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 14 Jul 2025 07:30:29 GMT Subject: Integrated: 8362122: Problem list TestStressBailout until JDK-8361752 is fixed In-Reply-To: <8--uIYOUgtsBdjOokcvKQz1fQX-5KHbmfGMlSFtt9Q8=.4e866498-7144-4465-bb13-858fb30b2935@github.com> References: <8--uIYOUgtsBdjOokcvKQz1fQX-5KHbmfGMlSFtt9Q8=.4e866498-7144-4465-bb13-858fb30b2935@github.com> Message-ID: On Mon, 14 Jul 2025 07:18:44 GMT, Tobias Hartmann wrote: > Let's problem list the test until [JDK-8361752](https://bugs.openjdk.org/browse/JDK-8361752) is fixed. The failure seems to be triggered by [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473). > > Thanks, > Tobias This pull request has now been integrated. Changeset: 7c34bdf7 Author: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/7c34bdf73c063c9c1e1ebdc8e3a02ca3480175e1 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod 8362122: Problem list TestStressBailout until JDK-8361752 is fixed Reviewed-by: chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/26288 From thartmann at openjdk.org Mon Jul 14 07:34:46 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 14 Jul 2025 07:34:46 GMT Subject: [jdk25] Integrated: 8350177: C2 SuperWord: Integer.numberOfLeadingZeros, numberOfTrailingZeros, reverse and bitCount have input types wrongly truncated for byte and short In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 05:26:35 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [77bd417c](https://github.com/openjdk/jdk/commit/77bd417c9990f57525257d9df89b9df4d7991461) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Jasmine Karthikeyan on 13 Jul 2025 and was reviewed by Emanuel Peter and Tobias Hartmann. > > Thanks! This pull request has now been integrated. 
Changeset: dd82a092 Author: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/dd82a0922bdf7e3e99edab3246a2a7b5b1cb7bda Stats: 464 lines in 2 files changed: 460 ins; 0 del; 4 mod 8350177: C2 SuperWord: Integer.numberOfLeadingZeros, numberOfTrailingZeros, reverse and bitCount have input types wrongly truncated for byte and short Reviewed-by: chagedorn Backport-of: 77bd417c9990f57525257d9df89b9df4d7991461 ------------- PR: https://git.openjdk.org/jdk/pull/26286 From chagedorn at openjdk.org Mon Jul 14 07:41:45 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 14 Jul 2025 07:41:45 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp [v4] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 00:37:10 GMT, Guanqiang Han wrote: >> When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. >> >> This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this flag is disabled. > > Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision: > > - Remove the unused variable > - Merge remote-tracking branch 'upstream/master' into 8361140 > - update regression test > - Merge remote-tracking branch 'upstream/master' into 8361140 > - update modification and add regression test > - Merge remote-tracking branch 'upstream/master' into 8361140 > - 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp > > When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. > > This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this support is disabled. Testing was clean ? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26125#issuecomment-3068177336 From duke at openjdk.org Mon Jul 14 07:41:46 2025 From: duke at openjdk.org (Guanqiang Han) Date: Mon, 14 Jul 2025 07:41:46 GMT Subject: Integrated: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp In-Reply-To: References: Message-ID: On Fri, 4 Jul 2025 02:49:27 GMT, Guanqiang Han wrote: > When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. > > This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. 
Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this flag is disabled. This pull request has now been integrated. Changeset: 14c79be1 Author: han gq Committer: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/14c79be1613c9d737a9536087ac48914ee4ba8d9 Stats: 110 lines in 3 files changed: 107 ins; 1 del; 2 mod 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp Reviewed-by: chagedorn, cslucas ------------- PR: https://git.openjdk.org/jdk/pull/26125 From chagedorn at openjdk.org Mon Jul 14 07:48:39 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 14 Jul 2025 07:48:39 GMT Subject: RFR: 8353276: C2: simplify PhaseMacroExpand::opt_bits_test In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 21:53:35 GMT, Saranya Natarajan wrote: > **Issue** > The last three parameters of `PhaseMacroExpand::opt_bits_test(Node* ctrl, Node* region, int edge, Node* word, int mask, int bits, bool return_fast_path)` are unnecessary after the fix introduced in [JDK-8256425](https://bugs.openjdk.org/browse/JDK-8256425) > > **Fix** > The proposed fix removes the last three parameters and makes the necessary modification to the methods. > > **Testing** > GitHub Actions > tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. Otherwise, it looks good, thanks for cleaning it up! src/hotspot/share/opto/macro.cpp line 98: > 96: Node* PhaseMacroExpand::opt_bits_test(Node* ctrl, Node* region, int edge, Node* word) { > 97: Node* cmp; > 98: cmp = word; Could now be merged (I cannot make a direct suggestion due to deleted lines): Node* cmp = word; ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26276#pullrequestreview-3015116536 PR Review Comment: https://git.openjdk.org/jdk/pull/26276#discussion_r2204080988 From bmaillard at openjdk.org Mon Jul 14 08:05:34 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Mon, 14 Jul 2025 08:05:34 GMT Subject: RFR: 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal [v4] In-Reply-To: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> References: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> Message-ID: > This PR adds a node hash comparison after calling `Ideal` in `PhaseIterGVN::verify_Ideal_for` to introduce an additional layer of verification for missed optimizations. Previously, we relied on the return value of `Ideal`, which is expected to be `nullptr` if no transformation was done. > > By also checking the node's hash before and after `Ideal`, we could catch inconsistencies in the implementation or unintended modifications to the graph. Both of these may indicate missed or incomplete optimizations. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361144) > - [x] tier1-3, plus some internal testing > > Thank you for reviewing! Beno?t Maillard has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains six additional commits since the last revision: - Merge branch 'master' into JDK-8361144 - 8361144: add comment for consistency with node count - 8361144: update comment Co-authored-by: Damon Fenacci - 8361144: remove unintentional line break - 8361144: move hash check after return value check and use same format as unique counter check - 8361144: add check for node hash after verifying ideal ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26064/files - new: https://git.openjdk.org/jdk/pull/26064/files/75f81296..8660d6ae Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26064&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26064&range=02-03 Stats: 18711 lines in 635 files changed: 10936 ins; 3423 del; 4352 mod Patch: https://git.openjdk.org/jdk/pull/26064.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26064/head:pull/26064 PR: https://git.openjdk.org/jdk/pull/26064 From jbhateja at openjdk.org Mon Jul 14 08:17:41 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 14 Jul 2025 08:17:41 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> References: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> Message-ID: On Tue, 8 Jul 2025 22:44:55 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to enhance the existing x86 assembly stubs using PUSH and POP instructions with paired PUSHP/POPP instructions which are part of Intel APX technology. >> >> In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0?R31). Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs. >> >> Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures. 
> > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > rename to paired_push and paired_pop src/hotspot/cpu/x86/gc/z/zBarrierSetAssembler_x86.cpp line 114: > 112: __ paired_push(rax); > 113: } > 114: __ paired_push(rcx); Hi @vamsi-parasa , for consecutive push/pop we can use push2/pop2 and 16byte alignment can be guaranteed using following technique https://github.com/openjdk/jdk/pull/25351/files#diff-d5d721ebf93346ba66e81257e4f6c5e6268d59774313c61e97353c0dfbf686a5R94 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2204155927 From jbhateja at openjdk.org Mon Jul 14 08:23:46 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 14 Jul 2025 08:23:46 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 23:28:42 GMT, Srinivas Vamsi Parasa wrote: >> For a cleaner interface, I think we can also maintain a RAII style APXPushPopPairTracker in the stub snippets using push/pop instruction sequence and wrap the actual assembler call underneath. The idea here is to catch the balancing error upfront as PPX is purely a performance hint. Instructions with this hint have the same functional semantics as those without. PPX hints set by the compiler that violate the balancing rule may turn off the PPX >> optimization, but they will not affect program semantics.. >> >> >> class APXPushPopPairTracker { >> private: >> int _counter; >> >> public: >> APXPushPopPairTracker() _counter(0) { >> } >> >> ~APXPushPopPairTracker() { >> assert(_counter == 0, "Push/pop pair mismatch"); >> } >> >> void push(Register reg, bool has_matching_pop) { >> if (has_matching_pop && VM_Version::supports_apx_f()) { >> Assembler::pushp(reg); >> incrementCounter(); >> } else { >> Assembler::push(reg); >> } >> } >> void pop(Register reg, bool has_matching_push) { >> if (has_matching_push && VM_Version::supports_apx_f()) { >> Assembler::popp(reg); >> decrementCounter(); >> } else { >> Assembler::pop(reg); >> } >> } >> void incrementCounter() { >> _counter++; >> } >> void decrementCounter() { >> _counter--; >> } >> } > > Hi Jatin (@jatin-bhateja) and Vlad (@vpaprotsk), > > There's one more issue to be considered. The C++ PushPopTracker code will be run during the stub generation time. There are code bocks which do a single push onto the stack but due to multiple exit paths, there will be multiple pops as illustrated below. Will this reference counting approach not fail in such a scenario as the stub code is generated all at once during the stub generation phase? > > > #begin stack frame > push(r21) > > #exit condition 1 > pop(r21) > > # exit condition 2 > pop(r21) There is no one size fits all soution, idea is to be smart whereever possible, by maintaining a fixed stack of registers populated during push operation we can delegate the responsibility of emitting pop instructions in reverse order to tracker, @vamsi-parasa for now I am ok with maintaining existing implimentation. 
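A rough sketch of that suggestion (names and exact semantics here are assumptions, not taken from the linked change): two consecutive single-register pushes collapse into one two-register PUSH2, and the matching POP2 restores them on the way out.

    // Sketch only, assuming push2(a, b) behaves like push(a); push(b) and
    // pop2(a, b) like pop(a); pop(b). Each PUSH2/POP2 moves the stack pointer
    // by 16 bytes, which helps keep it 16-byte aligned across the stub.
    __ push2(rax, rcx);    // instead of: paired_push(rax); paired_push(rcx);
    // ... stub body ...
    __ pop2(rcx, rax);     // restore in reverse order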
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2204183436 From tschatzl at openjdk.org Mon Jul 14 09:02:50 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Mon, 14 Jul 2025 09:02:50 GMT Subject: RFR: 8361952: Installation of MethodData::extra_data_lock() misses synchronization on reader side In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 12:23:18 GMT, Aleksey Shipilev wrote: >> Hi all, >> >> please review this change that fixes some recently introduced atomic installation of a mutex, where the memory barrier (`load_acquire`) on the reader side. Without it the reader might get a valid pointer to the `Mutex` created on the fly, without it being initialized properly. >> >> Found during code inspection for https://bugs.openjdk.org/browse/JDK-8361706 ; due to some suspicious hangs in the `MutexLocker` while cleaning klasses during class unloading in parallel (multiple threads hanging in `MethodData::clean_method_data`), executing the `vmTestbase/nsk/jvmti/RedefineClasses/StressRedefine/TestDescription.java` test. >> >> Testing: gha >> >> Thanks, >> Thomas > > OK, sure. Thanks @shipilev @coleenp @dholmes-ora for your reviews ------------- PR Comment: https://git.openjdk.org/jdk/pull/26262#issuecomment-3068518246 From tschatzl at openjdk.org Mon Jul 14 09:02:51 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Mon, 14 Jul 2025 09:02:51 GMT Subject: Integrated: 8361952: Installation of MethodData::extra_data_lock() misses synchronization on reader side In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 09:47:05 GMT, Thomas Schatzl wrote: > Hi all, > > please review this change that fixes some recently introduced atomic installation of a mutex, where the memory barrier (`load_acquire`) on the reader side. Without it the reader might get a valid pointer to the `Mutex` created on the fly, without it being initialized properly. > > Found during code inspection for https://bugs.openjdk.org/browse/JDK-8361706 ; due to some suspicious hangs in the `MutexLocker` while cleaning klasses during class unloading in parallel (multiple threads hanging in `MethodData::clean_method_data`), executing the `vmTestbase/nsk/jvmti/RedefineClasses/StressRedefine/TestDescription.java` test. > > Testing: gha > > Thanks, > Thomas This pull request has now been integrated. Changeset: 272e66d0 Author: Thomas Schatzl URL: https://git.openjdk.org/jdk/commit/272e66d017a3497d9af4df6f042c741ad8a59dd6 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8361952: Installation of MethodData::extra_data_lock() misses synchronization on reader side Reviewed-by: shade, coleenp, dholmes ------------- PR: https://git.openjdk.org/jdk/pull/26262 From adinn at openjdk.org Mon Jul 14 09:17:39 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 14 Jul 2025 09:17:39 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Sun, 13 Jul 2025 08:40:45 GMT, Yadong Wang wrote: >> The bug is that the predicate rule of immByteMapBase would cause a ConP Node for oop incorrect matching with byte_map_base when the placeholder jni handle address was just allocated to the address of byte_map_base. >> >> C2 uses JNI handles as placeholders to encoding constant oops, and one of some handle maybe locate at the address of byte_map_base, which is not memory reserved by CardTable. It's possible because JNIHandleBlocks are allocated by malloc. 
>> >> // The assembler store_check code will do an unsigned shift of the oop, >> // then add it to _byte_map_base, i.e. >> // >> // _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift) >> _byte_map = (CardValue*) rs.base(); >> _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift); >> >> In aarch64 port, C2 will incorrectly match ConP for oop to ConP for byte_map_base by the immByteMapBase operand. >> >> // Card Table Byte Map Base >> operand immByteMapBase() >> %{ >> // Get base of card map >> predicate((jbyte*)n->get_ptr() == >> ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); >> match(ConP); >> >> op_cost(0); >> format %{ %} >> interface(CONST_INTER); >> %} >> >> // Load Byte Map Base Constant >> instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) >> %{ >> match(Set dst con); >> >> ins_cost(INSN_COST); >> format %{ "adr $dst, $con\t# Byte Map Base" %} >> >> ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); >> >> ins_pipe(ialu_imm); >> %} >> >> As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: >> 0xffff25caf08c: ldaxr x8, [x11] >> 0xffff25caf090: cmp x10, x8 >> 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any >> 0xffff25caf098: stlxr w8, x28, [x11] >> 0xffff25caf09c: cbnz w8, 0xffff25caf08c >> 0xffff25caf0a0: orr x11, xzr, #0x3 >> 0xffff25caf0a4: str x11, [x13] >> 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none >> 0xffff25caf0ac: str x14, [sp] >> 0xffff25caf0b0: add x2, sp, #0x20 >> 0xffff25caf0b4: adrp x1, 0xffff21730000 >> 0xffff25caf0b8: bl 0xffff256fffc0 >> 0xffff25caf0bc: ldr x14, [sp] >> 0xffff25caf0c0: b 0xffff25caef80 >> 0xffff25caf0c4: add x13, sp, #0x20 >> 0xffff25caf0c8: adrp x12, 0xffff21730000 >> 0xffff25caf0cc: ldr x10, [x13] >> 0xffff25caf0d0: cmp x10, xzr >> 0xffff25c... > > Yadong Wang has updated the pull request incrementally with one additional commit since the last revision: > > 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding The proposed solution is not simply going to work when the Leyden project introduces code save/restore. In the assembly phase for an AOT cache (i.e when compiling code to store in the cache) we need to recognize that an incoming address is the byte map base address and generate an lea with an external address relocation. So, the current code in premain relies on matching against immByteMapBase. I believe we can retain the current approach if we make the immByteMapBase predicate test the operand type for a RawPtr rather than an OopPtr. That is actually the key distinction that separates the cases we are dealing with. I believe it would work for both immByteMapBase and immPollPage and is a much smaller change to the status quo. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3068574613 From mhaessig at openjdk.org Mon Jul 14 09:21:39 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 14 Jul 2025 09:21:39 GMT Subject: RFR: 8360701: Add bailout when the register allocator interference graph grows unreasonably large In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 18:13:13 GMT, Daniel Lund?n wrote: > The changeset for JDK-8325467 (https://git.openjdk.org/jdk/pull/20404) enables compilation of methods with many parameters, which C2 previously bailed out on. 
As a side effect, the tests `BigArityTest.java`, `TestCatchExceptionWithVarargs.java`, and `VarargsArrayTest.java` compile more methods than before, and additionally these methods are designed, for stress testing purposes, to have a large number of parameters (at or close to the maximum of 255 parameters allowed by the JVM spec). > > Compiling such methods takes a very long time and >99% of the time is spent in the C2 phase Coalesce 2 (part of register allocation). The problem is that the interference graph becomes huge after the initial round of spilling (just before Coalesce 2), and that we do not check for this and bail out if necessary. We do already bail out if the number of IR nodes grows too large, but the interference graph can become huge even if we have a small number of nodes. In fact, the interference graph may (in the worst case) hava a size that is quadratic in the number of nodes. In the problematic tests, we have interference graphs with approximately 100 000 nodes and over 55 000 000 (!) IFG edges. For comparison, the IFG edge count in worst-case realistic scenarios caps out at around 40 000 nodes and 800 000 edges. For example, see the scatter matrix below from running the DaCapo benchmark. It displays, for each time an IFG was built, the number of current IR nodes, the number of live ranges ( the actual nodes in the IFG), and the number of IFG edges. > > ![dacapo](https://github.com/user-attachments/assets/7a070768-50da-42e4-b5ed-9958e1362673) > > ### Changeset > > - Add a new diagnostic flag `IFGEdgesLimit` and bail out whenever we reach the number of edges specified by the flag during IFG construction. The default is a very generous 10 000 000 edges, that still filters out the most degenerate compilations we have seen. > - Add tracking of edges in `PhaseIFG` to permit the new flag. > > It is worth noting that it is perhaps preferable to use a lower default than 10 000 000 edges. For example, in standard benchmarks such as DaCapo (see the scatter matrix above), Renaissance, SPECjvm, and SPECjbb, we never go over 1 000 000 edges (I verified this). The reason I went with the generous 10 000 000 limit is that I saw a fair amount of bailouts in testing with the flag set at 1 000 000 edges. Such bailouts are likely motivated, but I do not want to take any chances. Even at 10 000 000 edges, a few tests s... Thank you for working on this @dlunde! Overall, this change looks good to me. I only have one nit and a question. If I understand correctly, this change is mostly about very long compilation times. Bailing out will lead to shorter compilation times but slower execution of the methods. You benchmarked compilation time, but can you elaborate more, why this won't cause regressions in the execution time? For instance, what is the difference in execution time of the tests that now hit the limit vs. before your change? src/hotspot/share/opto/c2_globals.hpp line 269: > 267: \ > 268: product(uint, IFGEdgesLimit, 10000000, DIAGNOSTIC, \ > 269: "Maximum allowed edges in interference graphs") \ Suggestion: "Maximum allowed edges in the interference graphs") \ Nit: usually, the flag descriptions use "the" ------------- Changes requested by mhaessig (Committer). 
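For context, the guard being reviewed boils down to roughly this shape (a sketch under assumptions: the edge counter name and the exact point in IFG construction where the check sits are not taken from the actual patch):

    // Sketch only: once the interference graph exceeds the configured edge
    // budget, give up on this C2 compilation instead of spending a very long
    // time in register allocation; the method then runs interpreted or via C1.
    if (_edge_count > IFGEdgesLimit) {
      C->record_method_not_compilable("interference graph too large");
      return;
    }

Whether that trade-off can regress end-to-end performance for methods that now hit the limit is exactly the question asked above.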
PR Review: https://git.openjdk.org/jdk/pull/26118#pullrequestreview-3015463061 PR Review Comment: https://git.openjdk.org/jdk/pull/26118#discussion_r2204309434 From shade at openjdk.org Mon Jul 14 09:43:38 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 14 Jul 2025 09:43:38 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 09:15:14 GMT, Andrew Dinn wrote: > So, the current code in premain relies on matching against immByteMapBase. Wait, but only AArch64/RISC-V have `immByteMapBase`, so I presume other platforms deal with the need to record relocation for card table base through some other means? With special `immByteMapBase` removed, would the same way work for AArch64? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3068660737 From bkilambi at openjdk.org Mon Jul 14 09:47:42 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 14 Jul 2025 09:47:42 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 02:10:07 GMT, Xiaohong Gong wrote: >> This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform. >> >> ### Background >> Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register. >> >> ### Implementation >> >> #### Challenges >> Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints. >> >> For a 512-bit SVE machine, loading a `byte` vector with different vector species require different approaches: >> - SPECIES_64: Single operation with mask (8 elements, 256-bit) >> - SPECIES_128: Single operation, full register (16 elements, 512-bit) >> - SPECIES_256: Two operations + merge (32 elements, 1024-bit) >> - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit) >> >> Use `ByteVector.SPECIES_512` as an example: >> - It contains 64 elements. So the index vector size should be `64 * 32` bits, which is 4 times of the SVE vector register size. >> - It requires 4 times of vector gather-loads to finish the whole operation. >> >> >> byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...] >> int[] idx = [0, 1, 2, 3, ..., 63, ...] >> >> 4 gather-load: >> idx_v1 = [15 14 13 ... 1 0] gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa] >> idx_v2 = [31 30 29 ... 17 16] gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb] >> idx_v3 = [47 46 45 ... 33 32] gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc] >> idx_v4 = [63 62 61 ... 49 48] gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd] >> merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa] >> >> >> #### Solution >> The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end. 
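Spelling that analogy out for the card table (purely hypothetical; the per-thread field and its offset accessor below are made up, not existing code): the byte map base could likewise be read from a thread-local slot, which would remove the need for the immByteMapBase operand and its special encoding altogether.

    // Hypothetical counterpart, not existing code: load the card table base
    // from a per-thread field instead of matching it as an immediate constant.
    void MacroAssembler::load_byte_map_base(Register dest) {
      ldr(dest, Address(rthread, JavaThread::byte_map_base_offset()));
    }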
>> >> Here is the main changes: >> - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher. >> - Added `VectorSliceNode` for result mer... > > Hi @Bhavana-Kilambi, @fg1417, could you please help take a look at this PR? BTW, since the vector register size of my SVE machine is 128-bit, could you please help test the correctness on a SVE machine with larger vector size (e.g. 512-bit vector size)? Thanks a lot in advance! Hi @XiaohongGong , thank you for doing this. As for testing, we can currently only test on 256-bit SVE machines (we no longer have any 512bit machines). We will get back to you with the results soon. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3068674585 From aph at openjdk.org Mon Jul 14 09:48:39 2025 From: aph at openjdk.org (Andrew Haley) Date: Mon, 14 Jul 2025 09:48:39 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: <3-lkyT9mKB-3kNYx4XG7rRcu-PDYuWlF1M8jBB1uBSY=.b7b98943-0404-4fbb-834b-a511bb6725da@github.com> On Mon, 14 Jul 2025 09:15:14 GMT, Andrew Dinn wrote: > I believe it would work for both immByteMapBase and immPollPage and is a much smaller change to the status quo. Poll page does this. No steenkin' reloc necessary... // Move the address of the polling page into dest. void MacroAssembler::get_polling_page(Register dest, relocInfo::relocType rtype) { ldr(dest, Address(rthread, JavaThread::polling_page_offset())); } ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3068681486 From xgong at openjdk.org Mon Jul 14 10:13:43 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 14 Jul 2025 10:13:43 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 02:10:07 GMT, Xiaohong Gong wrote: >> This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform. >> >> ### Background >> Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register. >> >> ### Implementation >> >> #### Challenges >> Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints. >> >> For a 512-bit SVE machine, loading a `byte` vector with different vector species require different approaches: >> - SPECIES_64: Single operation with mask (8 elements, 256-bit) >> - SPECIES_128: Single operation, full register (16 elements, 512-bit) >> - SPECIES_256: Two operations + merge (32 elements, 1024-bit) >> - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit) >> >> Use `ByteVector.SPECIES_512` as an example: >> - It contains 64 elements. So the index vector size should be `64 * 32` bits, which is 4 times of the SVE vector register size. >> - It requires 4 times of vector gather-loads to finish the whole operation. >> >> >> byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...] 
>> int[] idx = [0, 1, 2, 3, ..., 63, ...] >> >> 4 gather-load: >> idx_v1 = [15 14 13 ... 1 0] gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa] >> idx_v2 = [31 30 29 ... 17 16] gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb] >> idx_v3 = [47 46 45 ... 33 32] gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc] >> idx_v4 = [63 62 61 ... 49 48] gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd] >> merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa] >> >> >> #### Solution >> The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end. >> >> Here is the main changes: >> - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher. >> - Added `VectorSliceNode` for result mer... > > Hi @Bhavana-Kilambi, @fg1417, could you please help take a look at this PR? BTW, since the vector register size of my SVE machine is 128-bit, could you please help test the correctness on a SVE machine with larger vector size (e.g. 512-bit vector size)? Thanks a lot in advance! > Hi @XiaohongGong , thank you for doing this. As for testing, we can currently only test on 256-bit SVE machines (we no longer have any 512bit machines). We will get back to you with the results soon. Testing on 256-bit SVE machines are fine to me. Thanks so much for your help! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3068787386 From chagedorn at openjdk.org Mon Jul 14 10:27:41 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 14 Jul 2025 10:27:41 GMT Subject: RFR: 8361492: [IR Framework] Has too restrictive regex for load and store [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 06:16:32 GMT, Marc Chevalier wrote: >> Improving store and load regexes + adding test. It's mostly an improve version of a fix I had to do in Valhalla where it was blocking (part of JDK-8361250, blocking JDK-8357785). >> >> The new regex takes into account that classes can implement interfaces, nested classes, and various labels after the `@`. It should be more robust. 
>> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Mostly add spaces > > and rename, a bit test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPhaseIRMatching.java line 575: > 573: } > 574: > 575: SingleNest.DoubleNest double_nest = new SingleNest.DoubleNest(); Last nit: We should use camelCase for Java code: Suggestion: SingleNest.DoubleNest doubleNest = new SingleNest.DoubleNest(); test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPhaseIRMatching.java line 763: > 761: // @ir_framework/tests/LoadStore$SingleNest$DoubleNest+12 * > 762: public int loadDoubleNested() { > 763: return double_nest.i; Suggestion: return doubleNest.i; test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPhaseIRMatching.java line 786: > 784: // @ir_framework/tests/LoadStore$SingleNest$DoubleNest+12 * > 785: public void storeDoubleNested() { > 786: double_nest.i = 1; Suggestion: doubleNest.i = 1; ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26269#discussion_r2204498393 PR Review Comment: https://git.openjdk.org/jdk/pull/26269#discussion_r2204498723 PR Review Comment: https://git.openjdk.org/jdk/pull/26269#discussion_r2204498990 From mchevalier at openjdk.org Mon Jul 14 10:30:55 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Mon, 14 Jul 2025 10:30:55 GMT Subject: RFR: 8361492: [IR Framework] Has too restrictive regex for load and store [v3] In-Reply-To: References: Message-ID: > Improving store and load regexes + adding test. It's mostly an improve version of a fix I had to do in Valhalla where it was blocking (part of JDK-8361250, blocking JDK-8357785). > > The new regex takes into account that classes can implement interfaces, nested classes, and various labels after the `@`. It should be more robust. > > Thanks, > Marc Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: ocamlCase ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26269/files - new: https://git.openjdk.org/jdk/pull/26269/files/3dc823cb..c281fbea Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26269&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26269&range=01-02 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/26269.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26269/head:pull/26269 PR: https://git.openjdk.org/jdk/pull/26269 From shade at openjdk.org Mon Jul 14 10:31:41 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 14 Jul 2025 10:31:41 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Sun, 13 Jul 2025 08:40:45 GMT, Yadong Wang wrote: >> The bug is that the predicate rule of immByteMapBase would cause a ConP Node for oop incorrect matching with byte_map_base when the placeholder jni handle address was just allocated to the address of byte_map_base. >> >> C2 uses JNI handles as placeholders to encoding constant oops, and one of some handle maybe locate at the address of byte_map_base, which is not memory reserved by CardTable. It's possible because JNIHandleBlocks are allocated by malloc. >> >> // The assembler store_check code will do an unsigned shift of the oop, >> // then add it to _byte_map_base, i.e. 
>> // >> // _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift) >> _byte_map = (CardValue*) rs.base(); >> _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift); >> >> In aarch64 port, C2 will incorrectly match ConP for oop to ConP for byte_map_base by the immByteMapBase operand. >> >> // Card Table Byte Map Base >> operand immByteMapBase() >> %{ >> // Get base of card map >> predicate((jbyte*)n->get_ptr() == >> ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); >> match(ConP); >> >> op_cost(0); >> format %{ %} >> interface(CONST_INTER); >> %} >> >> // Load Byte Map Base Constant >> instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) >> %{ >> match(Set dst con); >> >> ins_cost(INSN_COST); >> format %{ "adr $dst, $con\t# Byte Map Base" %} >> >> ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); >> >> ins_pipe(ialu_imm); >> %} >> >> As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: >> 0xffff25caf08c: ldaxr x8, [x11] >> 0xffff25caf090: cmp x10, x8 >> 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any >> 0xffff25caf098: stlxr w8, x28, [x11] >> 0xffff25caf09c: cbnz w8, 0xffff25caf08c >> 0xffff25caf0a0: orr x11, xzr, #0x3 >> 0xffff25caf0a4: str x11, [x13] >> 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none >> 0xffff25caf0ac: str x14, [sp] >> 0xffff25caf0b0: add x2, sp, #0x20 >> 0xffff25caf0b4: adrp x1, 0xffff21730000 >> 0xffff25caf0b8: bl 0xffff256fffc0 >> 0xffff25caf0bc: ldr x14, [sp] >> 0xffff25caf0c0: b 0xffff25caef80 >> 0xffff25caf0c4: add x13, sp, #0x20 >> 0xffff25caf0c8: adrp x12, 0xffff21730000 >> 0xffff25caf0cc: ldr x10, [x13] >> 0xffff25caf0d0: cmp x10, xzr >> 0xffff25c... > > Yadong Wang has updated the pull request incrementally with one additional commit since the last revision: > > 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding To be clear, I would very much prefer to remove the special-case handling for card table base in AArch64/RISC-V AD, rather than piling on more special cases into that rule. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3068863982 From chagedorn at openjdk.org Mon Jul 14 10:36:39 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 14 Jul 2025 10:36:39 GMT Subject: RFR: 8361492: [IR Framework] Has too restrictive regex for load and store [v3] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 10:30:55 GMT, Marc Chevalier wrote: >> Improving store and load regexes + adding test. It's mostly an improve version of a fix I had to do in Valhalla where it was blocking (part of JDK-8361250, blocking JDK-8357785). >> >> The new regex takes into account that classes can implement interfaces, nested classes, and various labels after the `@`. It should be more robust. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > ocamlCase Looks good, thanks for all the updates! ------------- Marked as reviewed by chagedorn (Reviewer). 
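For context on why the regex needed to change: the memory slice printed by C2 for a field access includes the declaring class, any '$'-separated nested classes, and the field offset after '+', e.g. `@ir_framework/tests/LoadStore$SingleNest$DoubleNest+12 *`. The short, standalone Java illustration below shows a pattern that accepts such strings; it is a simplified stand-in for demonstration, not the actual regex used by the IR framework, and the flat example string is made up.

import java.util.regex.Pattern;

// Simplified stand-in: the printed memory slice can contain a package path,
// '$'-separated nested class names and a field offset after '+', so a naive
// single-identifier pattern misses many legitimate cases.
public class LoadStoreRegexDemo {
    private static final Pattern MEM_SLICE =
        Pattern.compile("@\\S+/\\w+(\\$\\w+)*\\+\\d+\\s\\*");

    public static void main(String[] args) {
        String flat   = "@ir_framework/tests/LoadStore+16 *";
        String nested = "@ir_framework/tests/LoadStore$SingleNest$DoubleNest+12 *";
        System.out.println(MEM_SLICE.matcher(flat).matches());   // true
        System.out.println(MEM_SLICE.matcher(nested).matches()); // true
    }
}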
PR Review: https://git.openjdk.org/jdk/pull/26269#pullrequestreview-3015768745 From duke at openjdk.org Mon Jul 14 11:07:45 2025 From: duke at openjdk.org (duke) Date: Mon, 14 Jul 2025 11:07:45 GMT Subject: RFR: 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal [v4] In-Reply-To: References: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> Message-ID: On Mon, 14 Jul 2025 08:05:34 GMT, Benoît Maillard wrote: >> This PR adds a node hash comparison after calling `Ideal` in `PhaseIterGVN::verify_Ideal_for` to introduce an additional layer of verification for missed optimizations. Previously, we relied on the return value of `Ideal`, which is expected to be `nullptr` if no transformation was done. >> >> By also checking the node's hash before and after `Ideal`, we could catch inconsistencies in the implementation or unintended modifications to the graph. Both of these may indicate missed or incomplete optimizations. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361144) >> - [x] tier1-3, plus some internal testing >> >> Thank you for reviewing! > > Benoît Maillard has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - Merge branch 'master' into JDK-8361144 > - 8361144: add comment for consistency with node count > - 8361144: update comment > > Co-authored-by: Damon Fenacci > - 8361144: remove unintentional line break > - 8361144: move hash check after return value check and use same format as unique counter check > - 8361144: add check for node hash after verifying ideal @benoitmaillard Your change (at version 8660d6ae5903131b25fa02dd2c2eb59c80699cd0) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26064#issuecomment-3069009411 From adinn at openjdk.org Mon Jul 14 11:10:39 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 14 Jul 2025 11:10:39 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 09:40:37 GMT, Aleksey Shipilev wrote: > Wait, but only AArch64/RISC-V have immByteMapBase, so I presume other platforms deal with the need to record relocation for card table base through some other means? With special immByteMapBase removed, would the same way work for AArch64? I'm not sure how this works for the byte_map_base on x86. I just looked through the x86 premain code and I cannot find any special case handling for it. However, on both x86 and aarch64 we also use immediate ConP operand rules to inject relocs for constant addresses that refer to entries in the (global) AOT Runtime Constants area. So, this is another case similar to byte_map_base where AOT compilation needs to recognize a special address and handle it appropriately. Here are the x86 rules for AOT Constant addresses // AOT Runtime Constants Address operand immAOTRuntimeConstantsAddress() %{ // Check if the address is in the range of AOT Runtime Constants predicate(AOTRuntimeConstants::contains((address)(n->get_ptr()))); match(ConP); op_cost(0); format %{ %} interface(CONST_INTER); %} . . . 
instruct loadAOTRCAddress(rRegP dst, immAOTRuntimeConstantsAddress con) %{ match(Set dst con); format %{ "leaq $dst, $con\t# AOT Runtime Constants Address" %} ins_encode %{ __ load_aotrc_address($dst$$Register, (address)$con$$constant); %} ins_pipe(ialu_reg_fat); %} The referenced macro assembler method is defined as follows void MacroAssembler::load_aotrc_address(Register reg, address a) { #if INCLUDE_CDS assert(AOTRuntimeConstants::contains(a), "address out of range for data area"); if (AOTCodeCache::is_on_for_dump()) { // all aotrc field addresses should be registered in the AOTCodeCache address table lea(reg, ExternalAddress(a)); } else { mov64(reg, (uint64_t)a); } #else ShouldNotReachHere(); #endif } I'm not sure whether the latest premain code is just in flux and does not yet deal with byte_map_base on x86 or whether there is some other mechanism. Perhaps @ashu-mehra or @vnkozlov can clarify. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3069024341 From adinn at openjdk.org Mon Jul 14 11:10:40 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 14 Jul 2025 11:10:40 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Sun, 13 Jul 2025 08:40:45 GMT, Yadong Wang wrote: >> The bug is that the predicate rule of immByteMapBase would cause a ConP Node for oop incorrect matching with byte_map_base when the placeholder jni handle address was just allocated to the address of byte_map_base. >> >> C2 uses JNI handles as placeholders to encoding constant oops, and one of some handle maybe locate at the address of byte_map_base, which is not memory reserved by CardTable. It's possible because JNIHandleBlocks are allocated by malloc. >> >> // The assembler store_check code will do an unsigned shift of the oop, >> // then add it to _byte_map_base, i.e. >> // >> // _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift) >> _byte_map = (CardValue*) rs.base(); >> _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift); >> >> In aarch64 port, C2 will incorrectly match ConP for oop to ConP for byte_map_base by the immByteMapBase operand. 
>> >> // Card Table Byte Map Base >> operand immByteMapBase() >> %{ >> // Get base of card map >> predicate((jbyte*)n->get_ptr() == >> ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); >> match(ConP); >> >> op_cost(0); >> format %{ %} >> interface(CONST_INTER); >> %} >> >> // Load Byte Map Base Constant >> instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) >> %{ >> match(Set dst con); >> >> ins_cost(INSN_COST); >> format %{ "adr $dst, $con\t# Byte Map Base" %} >> >> ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); >> >> ins_pipe(ialu_imm); >> %} >> >> As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: >> 0xffff25caf08c: ldaxr x8, [x11] >> 0xffff25caf090: cmp x10, x8 >> 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any >> 0xffff25caf098: stlxr w8, x28, [x11] >> 0xffff25caf09c: cbnz w8, 0xffff25caf08c >> 0xffff25caf0a0: orr x11, xzr, #0x3 >> 0xffff25caf0a4: str x11, [x13] >> 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none >> 0xffff25caf0ac: str x14, [sp] >> 0xffff25caf0b0: add x2, sp, #0x20 >> 0xffff25caf0b4: adrp x1, 0xffff21730000 >> 0xffff25caf0b8: bl 0xffff256fffc0 >> 0xffff25caf0bc: ldr x14, [sp] >> 0xffff25caf0c0: b 0xffff25caef80 >> 0xffff25caf0c4: add x13, sp, #0x20 >> 0xffff25caf0c8: adrp x12, 0xffff21730000 >> 0xffff25caf0cc: ldr x10, [x13] >> 0xffff25caf0d0: cmp x10, xzr >> 0xffff25c... > > Yadong Wang has updated the pull request incrementally with one additional commit since the last revision: > > 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding To be clear: it is not just the card table base that needs special case handling in Leyden. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3069032031 From bkilambi at openjdk.org Mon Jul 14 11:12:33 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 14 Jul 2025 11:12:33 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v14] In-Reply-To: References: Message-ID: > This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. > > It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. > > For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. > > For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. > > This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. 
> > Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - > > > Benchmark (size) Mode Cnt Gain > SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 > SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 > SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 > SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 > SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 > SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 > SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 > SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 > SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 > SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 > > > Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: Updated x86 code. Patch contributed by @jatin-bhateja ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23570/files - new: https://git.openjdk.org/jdk/pull/23570/files/34566e7d..8fcab4f1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=13 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=12-13 Stats: 11 lines in 2 files changed: 3 ins; 6 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23570.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23570/head:pull/23570 PR: https://git.openjdk.org/jdk/pull/23570 From bkilambi at openjdk.org Mon Jul 14 11:17:41 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 14 Jul 2025 11:17:41 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: > This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. > > It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. > > For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. > > For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. > > This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. 
> > Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - > > > Benchmark (size) Mode Cnt Gain > SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 > SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 > SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 > SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 > SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 > SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 > SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 > SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 > SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 > SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 > > > Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: Addressed review comments to half the number of match rules ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23570/files - new: https://git.openjdk.org/jdk/pull/23570/files/8fcab4f1..6c7266d7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=14 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=13-14 Stats: 254 lines in 4 files changed: 61 ins; 143 del; 50 mod Patch: https://git.openjdk.org/jdk/pull/23570.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23570/head:pull/23570 PR: https://git.openjdk.org/jdk/pull/23570 From bkilambi at openjdk.org Mon Jul 14 11:17:42 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 14 Jul 2025 11:17:42 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v13] In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 03:15:24 GMT, Xiaohong Gong wrote: >> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2919: >> >>> 2917: ins(tmp, D, src2, 1, 0); >>> 2918: tbl(dst, size1, tmp, 1, dst); >>> 2919: } >> >> Is it better than we wrap this part as a help function, because the code is much the same with line2885-2898? > > These two functions can be refined more clearly. Following is my version: > > void C2_MacroAssembler::select_from_two_vectors_neon(FloatRegister dst, FloatRegister src1, > FloatRegister src2, FloatRegister index, > FloatRegister tmp, unsigned length_in_bytes) { > assert_different_registers(dst, src1, src2, tmp); > SIMD_Arrangement size = length_in_bytes == 16 ? T16B : T8B; > > if (length_in_bytes == 16) { > assert(UseSVE <= 1, "sve must be <= 1"); > // If the vector length is 16B, then use the Neon "tbl" instruction with two vector table > tbl(dst, size, src1, 2, index); > } else { // vector length == 8 > assert(UseSVE == 0, "must be Neon only"); > // We need to fit both the source vectors (src1, src2) in a 128-bit register because the > // Neon "tbl" instruction supports only looking up 16B vectors. 
We then use the Neon "tbl" > // instruction with one vector lookup > ins(tmp, D, src1, 0, 0); > ins(tmp, D, src2, 1, 0); > tbl(dst, size, tmp, 1, index); > } > } > > void C2_MacroAssembler::select_from_two_vectors_sve(FloatRegister dst, FloatRegister src1, > FloatRegister src2, FloatRegister index, > FloatRegister tmp, BasicType bt, > unsigned length_in_bytes) { > assert_different_registers(dst, src1, src2, index, tmp); > SIMD_RegVariant T = elemType_to_regVariant(bt); > if (length_in_bytes == 8) { > assert(UseSVE >= 1, "must be"); > ins(tmp, D, src1, 0, 0); > ins(tmp, D, src2, 1, 0); > sve_tbl(dst, T, tmp, index); > } else { > assert(UseSVE == 2 && length_in_bytes == MaxVectorSize, "must be"); > sve_tbl(dst, T, src1, src2, index); > } > } > > void C2_MacroAssembler::select_from_two_vectors(FloatRegister dst, FloatRegister src1, > FloatRegister src2, FloatRegister index, > FloatRegister tmp, BasicType bt, > unsigned length_in_bytes) { > > assert_different_registers(dst, src1, src2, index, tmp); > > if (UseSVE == 2 || (UseSVE == 1 && length_in_bytes == 8)) { > select_from_two_vectors_sve(dst, src1, src2, index, tmp, bt, length_in_bytes); > return; > } > > // The only BasicTypes that can reach here are T_SHORT, T_BYTE, T_INT and T_FLOAT > assert(bt != T_DOUBLE ... Hi @XiaohongGong , I have updated the code based on your suggestions. I did feel that the code could be slightly less readable but given that we could half the number of rules in the ad file, I felt this compromise should be ok. Thanks a lot for pointing this out. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2204633514 From adinn at openjdk.org Mon Jul 14 11:21:49 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 14 Jul 2025 11:21:49 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 10:28:47 GMT, Aleksey Shipilev wrote: >> Yadong Wang has updated the pull request incrementally with one additional commit since the last revision: >> >> 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding > > To be clear, I would very much prefer to remove the special-case handling for card table base in AArch64/RISC-V AD, rather than piling on more special cases into that rule. @shipilev One other thing. I believe that the only code that creates a RawPtr ConP for the card table base is in the shared C2 card table barrier set assembler. Now that we have moved to late barrier insertion for G1 (along with Shenandoah and Z) is there now any code -- in the dev tree or in jdk25 -- that will create one of these nodes? If not then perhaps the aarch64 immByteMapBase and loadByteMapBase rules are no longer needed and should be deleted? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3069103564 From aph at openjdk.org Mon Jul 14 11:23:45 2025 From: aph at openjdk.org (Andrew Haley) Date: Mon, 14 Jul 2025 11:23:45 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 11:17:41 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. 
>> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments to half the number of match rules src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2858: > 2856: > 2857: // Implement selecting from two vectors using Neon instructions > 2858: void C2_MacroAssembler::select_from_two_vectors_neon(FloatRegister dst, FloatRegister src1, Need a comment here saying what this does. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2204652557 From fjiang at openjdk.org Mon Jul 14 11:30:42 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Mon, 14 Jul 2025 11:30:42 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v4] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 07:28:33 GMT, Dingli Zhang wrote: >> Hi, please consider this code cleanup change for native call. >> >> This removes the address parameter for NativeCall::reloc_destination and NativeFarCall::reloc_destination. >> This also removes several unnecessary code blob related runtime checks turning them into assertions. >> >> ### Testing >> * [x] hs:tier1 - hs:tier3 tested with linux-riscv64 fastdebug build > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Revert changes not related to cleanup Looks good, thanks for the cleanup! ------------- Marked as reviewed by fjiang (Committer). 
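Returning to the SelectFromTwoVector change above: the two-register "tbl" lookup treats src1 and src2 as one concatenated table and uses each index lane to select an element from it, which is also what the lowered rearrange-plus-blend fallback computes. Below is a plain scalar Java sketch of that semantics, assuming all indices are in range; it is illustrative only and is neither the Vector API call nor the intrinsic itself.

import java.util.Arrays;

// Scalar reference for "select from two vectors": indices in [0, 2*LEN) pick
// elements from the concatenation of src1 and src2, which is what a Neon/SVE2
// two-table "tbl" lookup computes in a single instruction. Illustrative only;
// out-of-range index handling (zeroing on Neon) is not modelled here.
public class SelectFromTwoVectorsDemo {
    static int[] selectFromTwo(int[] src1, int[] src2, int[] index) {
        int len = src1.length;
        int[] dst = new int[len];
        for (int i = 0; i < len; i++) {
            int idx = index[i];
            dst[i] = (idx < len) ? src1[idx] : src2[idx - len];
        }
        return dst;
    }

    public static void main(String[] args) {
        int[] a = {10, 11, 12, 13};
        int[] b = {20, 21, 22, 23};
        int[] idx = {0, 5, 3, 6};
        System.out.println(Arrays.toString(selectFromTwo(a, b, idx))); // [10, 21, 13, 22]
    }
}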
PR Review: https://git.openjdk.org/jdk/pull/26150#pullrequestreview-3015981589 From bkilambi at openjdk.org Mon Jul 14 11:32:44 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 14 Jul 2025 11:32:44 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 11:20:57 GMT, Andrew Haley wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Addressed review comments to half the number of match rules > > src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2858: > >> 2856: >> 2857: // Implement selecting from two vectors using Neon instructions >> 2858: void C2_MacroAssembler::select_from_two_vectors_neon(FloatRegister dst, FloatRegister src1, > > Need a comment here saying what this does. Thanks for the comment. I did mention on line #2857 that this implements selecting from two vectors using Neon instructions. Do you think I should add more description here? I have also added the conditions on which this function gets called in `C2_MacroAssembler::select_from_two_vectors()`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2204669842 From bmaillard at openjdk.org Mon Jul 14 11:42:44 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Mon, 14 Jul 2025 11:42:44 GMT Subject: Integrated: 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal In-Reply-To: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> References: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> Message-ID: On Tue, 1 Jul 2025 11:35:06 GMT, Beno?t Maillard wrote: > This PR adds a node hash comparison after calling `Ideal` in `PhaseIterGVN::verify_Ideal_for` to introduce an additional layer of verification for missed optimizations. Previously, we relied on the return value of `Ideal`, which is expected to be `nullptr` if no transformation was done. > > By also checking the node's hash before and after `Ideal`, we could catch inconsistencies in the implementation or unintended modifications to the graph. Both of these may indicate missed or incomplete optimizations. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361144) > - [x] tier1-3, plus some internal testing > > Thank you for reviewing! This pull request has now been integrated. Changeset: a531c9ae Author: Beno?t Maillard Committer: Damon Fenacci URL: https://git.openjdk.org/jdk/commit/a531c9aece200d27d7870595eee8e14e39e9bd00 Stats: 12 lines in 1 file changed: 11 ins; 0 del; 1 mod 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal Co-authored-by: Emanuel Peter Reviewed-by: galder, dfenacci, epeter ------------- PR: https://git.openjdk.org/jdk/pull/26064 From dzhang at openjdk.org Mon Jul 14 11:54:38 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 14 Jul 2025 11:54:38 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v4] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 07:28:33 GMT, Dingli Zhang wrote: >> Hi, please consider this code cleanup change for native call. >> >> This removes the address parameter for NativeCall::reloc_destination and NativeFarCall::reloc_destination. 
>> This also removes several unnecessary code blob related runtime checks turning them into assertions. >> >> ### Testing >> * [x] hs:tier1 - hs:tier3 tested with linux-riscv64 fastdebug build > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Revert changes not related to cleanup Thanks all for the review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26150#issuecomment-3069196216 From duke at openjdk.org Mon Jul 14 11:54:39 2025 From: duke at openjdk.org (duke) Date: Mon, 14 Jul 2025 11:54:39 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v4] In-Reply-To: References: Message-ID: <8-zkg2vu7IXxEnmJ7OGuQX-xbBy6Nydzs_dMeYF-01c=.34425293-5c59-4849-a64c-a3f09e707e5f@github.com> On Mon, 14 Jul 2025 07:28:33 GMT, Dingli Zhang wrote: >> Hi, please consider this code cleanup change for native call. >> >> This removes the address parameter for NativeCall::reloc_destination and NativeFarCall::reloc_destination. >> This also removes several unnecessary code blob related runtime checks turning them into assertions. >> >> ### Testing >> * [x] hs:tier1 - hs:tier3 tested with linux-riscv64 fastdebug build > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Revert changes not related to cleanup @DingliZhang Your change (at version a04d7101d06bfa7df30e3e741a610f79d2e9d09b) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26150#issuecomment-3069199030 From dzhang at openjdk.org Mon Jul 14 11:58:49 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 14 Jul 2025 11:58:49 GMT Subject: Integrated: 8361449: RISC-V: Code cleanup for native call In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 02:30:48 GMT, Dingli Zhang wrote: > Hi, please consider this code cleanup change for native call. > > This removes the address parameter for NativeCall::reloc_destination and NativeFarCall::reloc_destination. > This also removes several unnecessary code blob related runtime checks turning them into assertions. > > ### Testing > * [x] hs:tier1 - hs:tier3 tested with linux-riscv64 fastdebug build This pull request has now been integrated. Changeset: 5edd5465 Author: Dingli Zhang Committer: Feilong Jiang URL: https://git.openjdk.org/jdk/commit/5edd546585d66f52c2e894ed212ee67945fe0785 Stats: 39 lines in 3 files changed: 4 ins; 10 del; 25 mod 8361449: RISC-V: Code cleanup for native call Reviewed-by: fyang, fjiang ------------- PR: https://git.openjdk.org/jdk/pull/26150 From jbhateja at openjdk.org Mon Jul 14 12:20:45 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 14 Jul 2025 12:20:45 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v13] In-Reply-To: References: Message-ID: <-olTdjMIhNFfAwGbtWC5xswpKbgM_6uPJBgqoL-joJg=.83566f34-801e-449d-b613-dc2f81f40e54@github.com> > Hi All, > > This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. > > Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. 
> > New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Review comments resolutions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23947/files - new: https://git.openjdk.org/jdk/pull/23947/files/c79efe09..7be678b5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=11-12 Stats: 131 lines in 2 files changed: 70 ins; 37 del; 24 mod Patch: https://git.openjdk.org/jdk/pull/23947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23947/head:pull/23947 PR: https://git.openjdk.org/jdk/pull/23947 From jbhateja at openjdk.org Mon Jul 14 12:20:46 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 14 Jul 2025 12:20:46 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v13] In-Reply-To: <-qsTG7NyclV8PbQ1CsbHobu0bCwIK-6JvsMhmzmpVtg=.51d21506-9da9-4bf2-93d8-6907a6b54c5b@github.com> References: <2YFuLETRIRASPPjocbdhIGklH-45xnIVuY6cYrAdIzU=.84c661ff-faf8-49e8-9c05-056bb9a0fcab@github.com> <-qsTG7NyclV8PbQ1CsbHobu0bCwIK-6JvsMhmzmpVtg=.51d21506-9da9-4bf2-93d8-6907a6b54c5b@github.com> Message-ID: On Tue, 3 Jun 2025 13:30:05 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/intrinsicnode.cpp line 241: >> >>> 239: jlong lo = bt == T_INT ? min_jint : min_jlong; >>> 240: >>> 241: if(mask_type->is_con() && mask_type->get_con_as_long(bt) != -1L) { >> >> Now you removed the condition `mask_type->get_con_as_long(bt) != -1L`. Do you know why it was there in the first place? >> >> It seems to me that if `mask_type->get_con_as_long(bt) == -1L`, then we can just return the type of `src`, right? > > This is a bug-fix for `CompressBitsNode::Value`, but this change also has an effect on `ExpandBitsNode::Value`, and that makes me a little nervous. For example: do we have enough test coverage for `expand`? It seems we did not have enough tests for `compress`, so probably also not for `expand`... > Now you removed the condition `mask_type->get_con_as_long(bt) != -1L`. Do you know why it was there in the first place? > > It seems to me that if `mask_type->get_con_as_long(bt) == -1L`, then we can just return the type of `src`, right? Correct, also for non-constant masks we can even find the maximum set bit count of entier value range and then estimate the bounds of the result. 
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <assert.h>

uint64_t popcnt(int64_t val) {
  uint64_t res = 0;
  asm volatile("popcntq %1, %0 " : "=r"(res) : "r"(val) : "cc");
  return res;
}

typedef struct _result {
  int64_t value;
  int64_t mask;
  int64_t count;
} result;

result compute_max_mask(int64_t hi, int64_t lo) {
  assert(hi >= lo);
  result res;
  int64_t max_true_bits = 0;
  int64_t max_true_bits_val = 0;
  for (int64_t iter = lo; iter < hi; iter++) {
    int setbitscnt = popcnt(iter);
    if (max_true_bits < setbitscnt) {
      max_true_bits = setbitscnt;
      max_true_bits_val = iter;
    }
  }
  res.value = max_true_bits_val;
  res.mask = (1L << max_true_bits) - 1L;
  res.count = max_true_bits;
  return res;
}

int main(int argc, char* argv[]) {
  if (argc != 3) {
    return printf("Invalid arguments, [lo] [hi]\n");
  }
  int64_t lo = atol(argv[1]);
  int64_t hi = atol(argv[2]);
  result mask = compute_max_mask(hi, lo);
  return printf("[lo] = %ld [hi] = %ld [set bits count] %ld [mask] = %ld [max bits count value] %ld\n", lo, hi, mask.count, mask.mask, mask.value);
}

For this bug fix patch I don't want to include this extension. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23947#discussion_r2204754331 From jbhateja at openjdk.org Mon Jul 14 12:20:48 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 14 Jul 2025 12:20:48 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v9] In-Reply-To: References: Message-ID: On Tue, 3 Jun 2025 09:14:36 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix aarch64 failure > > src/hotspot/share/opto/intrinsicnode.cpp line 267: > >> 265: // mask = 0xEFFFFFFF (constant mask) >> 266: // result.hi = 0x7FFFFFFF >> 267: // result.lo = 0 > > Should this not go inside the `CompressBits` scope? > `Hi` -> `lo` > > `Result.Hi = popcount(1 << mask_bits - 1)` > Does not look right. Is this not the wrong way around? > Just repeating code here also does not make sense. Either give a reason in English, or just drop the duplication if it is indeed trivial. > > I would also do the case distinction a bit clearer: > > If mask == -1 -> all ones -> just returns src: > result.lo = type_min (happens if src = type_min) > > Question: does that not mean we could just return the input type of `src`? > > If mask != -1 -> at least one zero in mask -> result cannot be negative: > result.lo = 0 > > > But if we are doing this with the comments, then why not just create an `if-else` block, and add the comments inside each block? > ``` > If mask == -1 -> all ones -> just returns src: > > Question: does that not mean we could just return the input type of `src`? This is incorrect, bit compression simply compacts the bits corresponding to set mask bits, thus for all true mask, if the mask of the source is 1, along with some other set bits, but if it includes at least one unset (zero) bit then the result will be a +ve value and not the same as src. > src/hotspot/share/opto/intrinsicnode.cpp line 292: > >> 290: // To compute minimum result value we assume all but last read source bit as zero, >> 291: // this is because sign bit of result will always be set to 1 while other bit >> 292: // corresponding to set mask bit should be zero. > > I don't understand, are you talking about `lo` if `mask < 0`? Don't we just keep `lo = type_min`, which is always ok? Correct, that is what the comment explains: all bits apart from the MSB bit are zero, i.e. type_min 0x80000000 (int), and similarly for long... 
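The bounds being debated here are easy to sanity-check against the public Integer.compress API (available since JDK 19): an all-ones mask leaves the source unchanged, so the result can stay negative, while any mask with at least one clear bit compresses into fewer than 32 significant bits and therefore yields a non-negative result. The specific inputs in the small Java check below are only illustrative.

// Quick scalar check of the value bounds discussed for CompressBitsNode::Value.
public class CompressBoundsCheck {
    public static void main(String[] args) {
        // All-ones mask: compress is the identity, so the result keeps src's sign.
        System.out.println(Integer.compress(Integer.MIN_VALUE, -1));  // -2147483648

        // Mask with one clear bit (0xEFFFFFFF has 31 set bits): the result fits in
        // 31 bits, so it is bounded above by 0x7FFFFFFF and can never be negative.
        System.out.println(Integer.compress(-1, 0xEFFFFFFF));         // 2147483647

        // A negative mask with only the sign bit set selects just bit 31 of src,
        // so compressing MIN_VALUE gives 1, again a non-negative result.
        System.out.println(Integer.compress(0x80000000, 0x80000000)); // 1
    }
}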
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23947#discussion_r2204754423 PR Review Comment: https://git.openjdk.org/jdk/pull/23947#discussion_r2204754550 From fjiang at openjdk.org Mon Jul 14 12:21:46 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Mon, 14 Jul 2025 12:21:46 GMT Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors [v2] In-Reply-To: <5PCmTwnensUBsUNqVlxUuK6L2nDHIOqek7KEH5r_h_M=.9a05eebc-f3ba-4b0e-b0e0-76e89661c89d@github.com> References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> <5PCmTwnensUBsUNqVlxUuK6L2nDHIOqek7KEH5r_h_M=.9a05eebc-f3ba-4b0e-b0e0-76e89661c89d@github.com> Message-ID: On Thu, 10 Jul 2025 09:26:53 GMT, Dingli Zhang wrote: >> Follow up [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419), RVV supports all vector type conversion APIs in the Vector API. >> So we only need to relax the length limit of the short type to achieve a significant improvement in JMH performance for converting between short and wider types. >> >> ### Test >> qemu-system UseRVV: >> * [x] Run jdk_vector (fastdebug) >> * [x] Run compiler/vectorapi (fastdebug) >> >> ### Performance >> Following shows the performance improvement of relative VectorAPI JMHs on k1 (256-bit RVV): >> >> >> Benchmark (SIZE) Mode Units Before After Gain >> VectorFPtoIntCastOperations.microDouble128ToShort128 512 thrpt ops/ms 52.280 840.112 16.07 >> VectorFPtoIntCastOperations.microDouble128ToShort128 1024 thrpt ops/ms 28.156 429.322 15.25 >> VectorFPtoIntCastOperations.microFloat64ToShort64 512 thrpt ops/ms 14.242 479.509 33.67 >> VectorFPtoIntCastOperations.microFloat64ToShort64 1024 thrpt ops/ms 6.906 242.690 35.14 >> >> PS: `VectorFPtoIntCastOperations.microFloat64ToShort64` is added by [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419). > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Adjust the position of comment src/hotspot/cpu/riscv/riscv.ad line 1999: > 1997: } else if (bt == T_SHORT) { > 1998: // To support vector type conversions between short and wider types. > 1999: size = 2; Should we add some `assert` or `guarantee` for uncovered types? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26239#discussion_r2204758147 From jbhateja at openjdk.org Mon Jul 14 12:22:46 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 14 Jul 2025 12:22:46 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v12] In-Reply-To: References: Message-ID: <7uuAHG1r4neGrCaP_VDnXWEfIrDeHH7iY5FBEH3hEjQ=.6dd6553c-efe1-4fa0-b722-5560468693b6@github.com> On Fri, 11 Jul 2025 14:16:03 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. >> >> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. >> >> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. >> >> Kindly review and share your feedback. 
>> >> Best Regards, >> Jatin > > Jatin Bhateja has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > Update test Hi @TobiHartmann , @eme64 , all your comments have been addressed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3069278272 From aph at openjdk.org Mon Jul 14 12:51:47 2025 From: aph at openjdk.org (Andrew Haley) Date: Mon, 14 Jul 2025 12:51:47 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 11:30:01 GMT, Bhavana Kilambi wrote: >> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2858: >> >>> 2856: >>> 2857: // Implement selecting from two vectors using Neon instructions >>> 2858: void C2_MacroAssembler::select_from_two_vectors_neon(FloatRegister dst, FloatRegister src1, >> >> Need a comment here saying what this does. > > Thanks for the comment. I did mention on line #2857 that this implements selecting from two vectors using Neon instructions. Do you think I should add more description here? I have also added the conditions on which this function gets called in `C2_MacroAssembler::select_from_two_vectors()`. Yes, you should. "Select items (on what basis?) from register(s) x and y and place them (in what order?) in dest. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2204857589 From thartmann at openjdk.org Mon Jul 14 13:24:45 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 14 Jul 2025 13:24:45 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v13] In-Reply-To: <-olTdjMIhNFfAwGbtWC5xswpKbgM_6uPJBgqoL-joJg=.83566f34-801e-449d-b613-dc2f81f40e54@github.com> References: <-olTdjMIhNFfAwGbtWC5xswpKbgM_6uPJBgqoL-joJg=.83566f34-801e-449d-b613-dc2f81f40e54@github.com> Message-ID: On Mon, 14 Jul 2025 12:20:45 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. >> >> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. >> >> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Review comments resolutions `compiler/intrinsics/TestBitShuffleOpers.java` fails with `-ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation`: # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/workspace/open/src/hotspot/share/opto/intrinsicnode.cpp:315), pid=3588220, tid=3588247 # Error: assert(lo == (T_INT ? 
min_jint : min_jlong)) failed # # JRE version: Java(TM) SE Runtime Environment (26.0) (fastdebug build 26-internal-2025-07-14-1229127.tobias.hartmann.jdk4) # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 26-internal-2025-07-14-1229127.tobias.hartmann.jdk4, mixed mode, sharing, compressed oops, compressed class ptrs, g1 gc, linux-amd64) # Problematic frame: # V [libjvm.so+0x1066720] bitshuffle_value(TypeInteger const*, TypeInteger const*, int, BasicType)+0x440 # urrent CompileTask: C2:2810 490 % b compiler.intrinsics.TestBitShuffleOpers::test17 @ 7 (1042 bytes) Stack: [0x00007f2910b4c000,0x00007f2910c4c000], sp=0x00007f2910c47e30, free space=1007k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x1066720] bitshuffle_value(TypeInteger const*, TypeInteger const*, int, BasicType)+0x440 (intrinsicnode.cpp:315) V [libjvm.so+0x182ddbf] PhaseGVN::transform(Node*)+0x1cf (phaseX.cpp:703) V [libjvm.so+0x14913a2] LibraryCallKit::inline_bitshuffle_methods(vmIntrinsicID)+0xb2 (library_call.cpp:2244) V [libjvm.so+0x14bcff8] LibraryCallKit::try_to_inline(int)+0x1b8 (library_call.cpp:556) V [libjvm.so+0x14bfea0] LibraryIntrinsic::generate(JVMState*)+0x230 (library_call.cpp:119) V [libjvm.so+0xd1aaa2] Parse::do_call()+0x712 (doCall.cpp:677) V [libjvm.so+0x17ff8b8] Parse::do_one_bytecode()+0x4b8 (parse2.cpp:2723) V [libjvm.so+0x17eca9c] Parse::do_one_block()+0x24c (parse1.cpp:1586) V [libjvm.so+0x17edea0] Parse::do_all_blocks()+0x130 (parse1.cpp:724) V [libjvm.so+0x17f1393] Parse::Parse(JVMState*, ciMethod*, float)+0xaa3 (parse1.cpp:628) V [libjvm.so+0x97cb6b] ParseGenerator::generate(JVMState*)+0x13b (callGenerator.cpp:97) V [libjvm.so+0xb54c49] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x18b9 (compile.cpp:804) V [libjvm.so+0x97a437] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x467 (c2compiler.cpp:141) V [libjvm.so+0xb64698] CompileBroker::invoke_compiler_on_method(CompileTask*)+0xb58 (compileBroker.cpp:2324) V [libjvm.so+0xb65868] CompileBroker::compiler_thread_loop()+0x578 (compileBroker.cpp:1968) V [libjvm.so+0x10bcacb] JavaThread::thread_main_inner()+0x13b (javaThread.cpp:773) V [libjvm.so+0x1b2ed66] Thread::call_run()+0xb6 (thread.cpp:243) V [libjvm.so+0x179e8d8] thread_native_entry(Thread*)+0x128 (os_linux.cpp:868) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3069604021 From bkilambi at openjdk.org Mon Jul 14 13:26:39 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 14 Jul 2025 13:26:39 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 07:04:44 GMT, Xiaohong Gong wrote: > This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform. > > ### Background > Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register. 
> > ### Implementation > > #### Challenges > Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints.
> > For a 512-bit SVE machine, loading a `byte` vector with different vector species require different approaches: > - SPECIES_64: Single operation with mask (8 elements, 256-bit) > - SPECIES_128: Single operation, full register (16 elements, 512-bit) > - SPECIES_256: Two operations + merge (32 elements, 1024-bit) > - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit) > > Use `ByteVector.SPECIES_512` as an example: > - It contains 64 elements. So the index vector size should be `64 * 32` bits, which is 4 times of the SVE vector register size. > - It requires 4 times of vector gather-loads to finish the whole operation. > > > byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...] > int[] idx = [0, 1, 2, 3, ..., 63, ...] > > 4 gather-load: > idx_v1 = [15 14 13 ... 1 0] gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa] > idx_v2 = [31 30 29 ... 17 16] gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb] > idx_v3 = [47 46 45 ... 33 32] gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc] > idx_v4 = [63 62 61 ... 49 48] gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd] > merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa] > > > #### Solution > The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end. > > Here is the main changes: > - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher. > - Added `VectorSliceNode` for result merging. > - Added `VectorMaskWidenNode` for mask spliting and type conversion fo... src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 352: > 350: // SVE requires vector indices for gather-load/scatter-store operations > 351: // on all data types. > 352: bool Matcher::gather_scatter_needs_vector_index(BasicType bt) { There's already a function that tests for `UseSVE > 0` here - https://github.com/openjdk/jdk/blob/bcd86d575fe0682a234228c18b0c2e817d3816da/src/hotspot/cpu/aarch64/matcher_aarch64.hpp#L36 Can it be reused? src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 3430: > 3428: > 3429: instruct vslice_neon(vReg dst, vReg src1, vReg src2, immI index) %{ > 3430: predicate(VM_Version::use_neon_for_vector(Matcher::vector_length_in_bytes(n))); nit: indentation. I think there're 3 spaces here.. Same with the SVE version below. src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 3434: > 3432: format %{ "vslice_neon $dst, $src1, $src2, $index" %} > 3433: ins_encode %{ > 3434: uint length_in_bytes = Matcher::vector_length_in_bytes(this); nit: indentation. two spaces.. src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 3448: > 3446: format %{ "vslice_sve $dst_src1, $dst_src1, $src2, $index" %} > 3447: ins_encode %{ > 3448: assert(UseSVE > 0, "must be sve"); nit: indentation. two spaces.. 
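Stepping back from the review nits for a moment, here is a minimal usage sketch of the Java-level subword gather load that this backend work accelerates. It is illustrative only: the class, method and array names are made up, it assumes the incubating jdk.incubator.vector module is enabled (e.g. --add-modules jdk.incubator.vector), and the species choice is arbitrary.

import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

public class SubwordGatherSketch {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

    // Gather load: lane i of the result is src[0 + indexMap[0 + i]].
    static byte[] gather(byte[] src, int[] indexMap) {
        ByteVector v = ByteVector.fromArray(SPECIES, src, 0, indexMap, 0);
        byte[] dst = new byte[SPECIES.length()];
        v.intoArray(dst, 0);
        return dst;
    }
}

On SVE, each such gather would be lowered to one or more int-indexed gather-load instructions plus a merge, following the splitting scheme quoted above.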
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2204954269 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2204961131 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2204958060 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2204959807 From mchevalier at openjdk.org Mon Jul 14 13:39:44 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Mon, 14 Jul 2025 13:39:44 GMT Subject: RFR: 8361492: [IR Framework] Has too restrictive regex for load and store [v3] In-Reply-To: References: Message-ID: <0MMhqUGu3wmj0WV4V4r5IA1sUt-V_JXyMNntiMnpYrs=.f973e6b2-b0ad-43eb-b28c-4068402cbcc5@github.com> On Mon, 14 Jul 2025 10:30:55 GMT, Marc Chevalier wrote: >> Improving store and load regexes + adding test. It's mostly an improve version of a fix I had to do in Valhalla where it was blocking (part of JDK-8361250, blocking JDK-8357785). >> >> The new regex takes into account that classes can implement interfaces, nested classes, and various labels after the `@`. It should be more robust. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > ocamlCase Thanks @dafedafe and @chhagedorn! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26269#issuecomment-3069643740 From mchevalier at openjdk.org Mon Jul 14 13:39:45 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Mon, 14 Jul 2025 13:39:45 GMT Subject: Integrated: 8361492: [IR Framework] Has too restrictive regex for load and store In-Reply-To: References: Message-ID: <8kft4VsEijQQc8qZ1FMwJcRiSG5FzLh3YKrBM2rOWBU=.7ef7ab6c-6931-4de0-a773-a780b993f1fe@github.com> On Fri, 11 Jul 2025 16:56:16 GMT, Marc Chevalier wrote: > Improving store and load regexes + adding test. It's mostly an improve version of a fix I had to do in Valhalla where it was blocking (part of JDK-8361250, blocking JDK-8357785). > > The new regex takes into account that classes can implement interfaces, nested classes, and various labels after the `@`. It should be more robust. > > Thanks, > Marc This pull request has now been integrated. Changeset: ebb10958 Author: Marc Chevalier URL: https://git.openjdk.org/jdk/commit/ebb1095805579f8f32a81bb350198fa1b7add9eb Stats: 253 lines in 2 files changed: 249 ins; 2 del; 2 mod 8361492: [IR Framework] Has too restrictive regex for load and store Reviewed-by: chagedorn, dfenacci ------------- PR: https://git.openjdk.org/jdk/pull/26269 From jbhateja at openjdk.org Mon Jul 14 13:48:07 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 14 Jul 2025 13:48:07 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v14] In-Reply-To: References: Message-ID: > Hi All, > > This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. > > Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. > > New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. > > Kindly review and share your feedback. 
> > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Broken assertions fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23947/files - new: https://git.openjdk.org/jdk/pull/23947/files/7be678b5..06eafe77 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=13 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=12-13 Stats: 5 lines in 1 file changed: 0 ins; 0 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/23947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23947/head:pull/23947 PR: https://git.openjdk.org/jdk/pull/23947 From jbhateja at openjdk.org Mon Jul 14 13:48:08 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 14 Jul 2025 13:48:08 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v13] In-Reply-To: References: <-olTdjMIhNFfAwGbtWC5xswpKbgM_6uPJBgqoL-joJg=.83566f34-801e-449d-b613-dc2f81f40e54@github.com> Message-ID: <260VpdfrxR3vKnrlKQPuVwzJJ3lXM6liDFV4mi-7swg=.eff991a1-0b1f-4fe5-bc31-69896640e654@github.com> On Mon, 14 Jul 2025 13:22:29 GMT, Tobias Hartmann wrote: > -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation Thanks @TobiHartmann , kindly verify with the latest version. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3069675244 From thartmann at openjdk.org Mon Jul 14 14:59:56 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 14 Jul 2025 14:59:56 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v13] In-Reply-To: <260VpdfrxR3vKnrlKQPuVwzJJ3lXM6liDFV4mi-7swg=.eff991a1-0b1f-4fe5-bc31-69896640e654@github.com> References: <-olTdjMIhNFfAwGbtWC5xswpKbgM_6uPJBgqoL-joJg=.83566f34-801e-449d-b613-dc2f81f40e54@github.com> <260VpdfrxR3vKnrlKQPuVwzJJ3lXM6liDFV4mi-7swg=.eff991a1-0b1f-4fe5-bc31-69896640e654@github.com> Message-ID: <8NlUra4EtAcDl_kYEQoD72fVjzZhhIw20Vp1BTmmtdg=.aa1d2a9c-2c40-41b1-ba60-e0918058fe8b@github.com> On Mon, 14 Jul 2025 13:45:07 GMT, Jatin Bhateja wrote: >> `compiler/intrinsics/TestBitShuffleOpers.java` fails with `-ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation`: >> >> >> # A fatal error has been detected by the Java Runtime Environment: >> # >> # Internal Error (/workspace/open/src/hotspot/share/opto/intrinsicnode.cpp:315), pid=3588220, tid=3588247 >> # Error: assert(lo == (T_INT ? 
min_jint : min_jlong)) failed >> # >> # JRE version: Java(TM) SE Runtime Environment (26.0) (fastdebug build 26-internal-2025-07-14-1229127.tobias.hartmann.jdk4) >> # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 26-internal-2025-07-14-1229127.tobias.hartmann.jdk4, mixed mode, sharing, compressed oops, compressed class ptrs, g1 gc, linux-amd64) >> # Problematic frame: >> # V [libjvm.so+0x1066720] bitshuffle_value(TypeInteger const*, TypeInteger const*, int, BasicType)+0x440 >> # >> >> urrent CompileTask: >> C2:2810 490 % b compiler.intrinsics.TestBitShuffleOpers::test17 @ 7 (1042 bytes) >> >> Stack: [0x00007f2910b4c000,0x00007f2910c4c000], sp=0x00007f2910c47e30, free space=1007k >> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) >> V [libjvm.so+0x1066720] bitshuffle_value(TypeInteger const*, TypeInteger const*, int, BasicType)+0x440 (intrinsicnode.cpp:315) >> V [libjvm.so+0x182ddbf] PhaseGVN::transform(Node*)+0x1cf (phaseX.cpp:703) >> V [libjvm.so+0x14913a2] LibraryCallKit::inline_bitshuffle_methods(vmIntrinsicID)+0xb2 (library_call.cpp:2244) >> V [libjvm.so+0x14bcff8] LibraryCallKit::try_to_inline(int)+0x1b8 (library_call.cpp:556) >> V [libjvm.so+0x14bfea0] LibraryIntrinsic::generate(JVMState*)+0x230 (library_call.cpp:119) >> V [libjvm.so+0xd1aaa2] Parse::do_call()+0x712 (doCall.cpp:677) >> V [libjvm.so+0x17ff8b8] Parse::do_one_bytecode()+0x4b8 (parse2.cpp:2723) >> V [libjvm.so+0x17eca9c] Parse::do_one_block()+0x24c (parse1.cpp:1586) >> V [libjvm.so+0x17edea0] Parse::do_all_blocks()+0x130 (parse1.cpp:724) >> V [libjvm.so+0x17f1393] Parse::Parse(JVMState*, ciMethod*, float)+0xaa3 (parse1.cpp:628) >> V [libjvm.so+0x97cb6b] ParseGenerator::generate(JVMState*)+0x13b (callGenerator.cpp:97) >> V [libjvm.so+0xb54c49] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x18b9 (compile.cpp:804) >> V [libjvm.so+0x97a437] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x467 (c2compiler.cpp:141) >> V [libjvm.so+0xb64698] CompileBroker::invoke_compiler_on_method(CompileTask*)+0xb58 (co... > >> -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation > > Thanks @TobiHartmann , kindly verify with the latest version. @jatin-bhateja This is with the latest version (webrev 13). ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3069918597 From dlunden at openjdk.org Mon Jul 14 15:14:22 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Mon, 14 Jul 2025 15:14:22 GMT Subject: RFR: 8360701: Add bailout when the register allocator interference graph grows unreasonably large [v2] In-Reply-To: References: Message-ID: > The changeset for JDK-8325467 (https://git.openjdk.org/jdk/pull/20404) enables compilation of methods with many parameters, which C2 previously bailed out on. As a side effect, the tests `BigArityTest.java`, `TestCatchExceptionWithVarargs.java`, and `VarargsArrayTest.java` compile more methods than before, and additionally these methods are designed, for stress testing purposes, to have a large number of parameters (at or close to the maximum of 255 parameters allowed by the JVM spec). > > Compiling such methods takes a very long time and >99% of the time is spent in the C2 phase Coalesce 2 (part of register allocation). The problem is that the interference graph becomes huge after the initial round of spilling (just before Coalesce 2), and that we do not check for this and bail out if necessary. 
We do already bail out if the number of IR nodes grows too large, but the interference graph can become huge even if we have a small number of nodes. In fact, the interference graph may (in the worst case) hava a size that is quadratic in the number of nodes. In the problematic tests, we have interference graphs with approximately 100 000 nodes and over 55 000 000 (!) IFG edges. For comparison, the IFG edge count in worst-case realistic scenarios caps out at around 40 000 nodes and 800 000 edges. For example, see the scatter matrix below from running the DaCapo benchmark. It displays, for each time an IFG was built, the number of current IR nodes, the number of live ranges ( the actual nodes in the IFG), and the number of IFG edges. > > ![dacapo](https://github.com/user-attachments/assets/7a070768-50da-42e4-b5ed-9958e1362673) > > ### Changeset > > - Add a new diagnostic flag `IFGEdgesLimit` and bail out whenever we reach the number of edges specified by the flag during IFG construction. The default is a very generous 10 000 000 edges, that still filters out the most degenerate compilations we have seen. > - Add tracking of edges in `PhaseIFG` to permit the new flag. > > It is worth noting that it is perhaps preferable to use a lower default than 10 000 000 edges. For example, in standard benchmarks such as DaCapo (see the scatter matrix above), Renaissance, SPECjvm, and SPECjbb, we never go over 1 000 000 edges (I verified this). The reason I went with the generous 10 000 000 limit is that I saw a fair amount of bailouts in testing with the flag set at 1 000 000 edges. Such bailouts are likely motivated, but I do not want to take any chances. Even at 10 000 000 edges, a few tests s... Daniel Lund?n has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/share/opto/c2_globals.hpp Co-authored-by: Manuel H?ssig ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26118/files - new: https://git.openjdk.org/jdk/pull/26118/files/d71d9a55..4beaa8a1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26118&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26118&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26118.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26118/head:pull/26118 PR: https://git.openjdk.org/jdk/pull/26118 From dlunden at openjdk.org Mon Jul 14 15:14:23 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Mon, 14 Jul 2025 15:14:23 GMT Subject: RFR: 8360701: Add bailout when the register allocator interference graph grows unreasonably large [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 09:19:20 GMT, Manuel H?ssig wrote: > Thank you for working on this @dlunde! Overall, this change looks good to me. I only have one nit and a question. Thanks for the review @mhaessig! > You benchmarked compilation time, but can you elaborate more, why this won't cause regressions in the execution time? I did run standard benchmarks as well! See the below from the PR description (I should have made it more explicit). > For example, in standard benchmarks such as DaCapo (see the scatter matrix above), Renaissance, SPECjvm, and SPECjbb, we never go over 1 000 000 edges (I verified this). That is, the _maximum_ IFG edge count among all compilations in this (hopefully) diverse and representative set of benchmarks is just below 1 000 000 edges. The limit is 10 times that. 
Stated alternatively, we never bail out in practice with this new limit. So, combined with the fact that compilation time is unaffected, the total execution time is unaffected. > For instance, what is the difference in execution time of the tests that now hit the limit vs. before your change? Good question! I checked this now on all Oracle-supported platforms and there is no clear difference in total execution time for the three tests I mentioned in the PR description. I ran them all without any additional flags. > src/hotspot/share/opto/c2_globals.hpp line 269: > >> 267: \ >> 268: product(uint, IFGEdgesLimit, 10000000, DIAGNOSTIC, \ >> 269: "Maximum allowed edges in interference graphs") \ > > Suggestion: > > "Maximum allowed edges in the interference graphs") \ > > Nit: usually, the flag descriptions use "the" Sure, added! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26118#issuecomment-3069963462 PR Review Comment: https://git.openjdk.org/jdk/pull/26118#discussion_r2205183222 From jbhateja at openjdk.org Mon Jul 14 15:16:50 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 14 Jul 2025 15:16:50 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v13] In-Reply-To: <260VpdfrxR3vKnrlKQPuVwzJJ3lXM6liDFV4mi-7swg=.eff991a1-0b1f-4fe5-bc31-69896640e654@github.com> References: <-olTdjMIhNFfAwGbtWC5xswpKbgM_6uPJBgqoL-joJg=.83566f34-801e-449d-b613-dc2f81f40e54@github.com> <260VpdfrxR3vKnrlKQPuVwzJJ3lXM6liDFV4mi-7swg=.eff991a1-0b1f-4fe5-bc31-69896640e654@github.com> Message-ID: On Mon, 14 Jul 2025 13:45:07 GMT, Jatin Bhateja wrote: >> `compiler/intrinsics/TestBitShuffleOpers.java` fails with `-ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation`: >> >> >> # A fatal error has been detected by the Java Runtime Environment: >> # >> # Internal Error (/workspace/open/src/hotspot/share/opto/intrinsicnode.cpp:315), pid=3588220, tid=3588247 >> # Error: assert(lo == (T_INT ? 
min_jint : min_jlong)) failed >> # >> # JRE version: Java(TM) SE Runtime Environment (26.0) (fastdebug build 26-internal-2025-07-14-1229127.tobias.hartmann.jdk4) >> # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 26-internal-2025-07-14-1229127.tobias.hartmann.jdk4, mixed mode, sharing, compressed oops, compressed class ptrs, g1 gc, linux-amd64) >> # Problematic frame: >> # V [libjvm.so+0x1066720] bitshuffle_value(TypeInteger const*, TypeInteger const*, int, BasicType)+0x440 >> # >> >> urrent CompileTask: >> C2:2810 490 % b compiler.intrinsics.TestBitShuffleOpers::test17 @ 7 (1042 bytes) >> >> Stack: [0x00007f2910b4c000,0x00007f2910c4c000], sp=0x00007f2910c47e30, free space=1007k >> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) >> V [libjvm.so+0x1066720] bitshuffle_value(TypeInteger const*, TypeInteger const*, int, BasicType)+0x440 (intrinsicnode.cpp:315) >> V [libjvm.so+0x182ddbf] PhaseGVN::transform(Node*)+0x1cf (phaseX.cpp:703) >> V [libjvm.so+0x14913a2] LibraryCallKit::inline_bitshuffle_methods(vmIntrinsicID)+0xb2 (library_call.cpp:2244) >> V [libjvm.so+0x14bcff8] LibraryCallKit::try_to_inline(int)+0x1b8 (library_call.cpp:556) >> V [libjvm.so+0x14bfea0] LibraryIntrinsic::generate(JVMState*)+0x230 (library_call.cpp:119) >> V [libjvm.so+0xd1aaa2] Parse::do_call()+0x712 (doCall.cpp:677) >> V [libjvm.so+0x17ff8b8] Parse::do_one_bytecode()+0x4b8 (parse2.cpp:2723) >> V [libjvm.so+0x17eca9c] Parse::do_one_block()+0x24c (parse1.cpp:1586) >> V [libjvm.so+0x17edea0] Parse::do_all_blocks()+0x130 (parse1.cpp:724) >> V [libjvm.so+0x17f1393] Parse::Parse(JVMState*, ciMethod*, float)+0xaa3 (parse1.cpp:628) >> V [libjvm.so+0x97cb6b] ParseGenerator::generate(JVMState*)+0x13b (callGenerator.cpp:97) >> V [libjvm.so+0xb54c49] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x18b9 (compile.cpp:804) >> V [libjvm.so+0x97a437] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x467 (c2compiler.cpp:141) >> V [libjvm.so+0xb64698] CompileBroker::invoke_compiler_on_method(CompileTask*)+0xb58 (co... > >> -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation > > Thanks @TobiHartmann , kindly verify with the latest version. > @jatin-bhateja This is with the latest version (webrev 13). Hi @TobiHartmann I don't see any failure at https://github.com/openjdk/jdk/pull/23947/commits/06eafe7712833d830bbd60cdb729ad261eca59b8 ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3069973582 From shade at openjdk.org Mon Jul 14 15:27:51 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 14 Jul 2025 15:27:51 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 Message-ID: See the bug for more analysis. The short summary is that `CompileQueue::delete_all` walks the entire compile queue and deletes the tasks. It normally goes smoothly, unless there are blocking tasks. Then, the actual waiters have to delete the task, lest we delete the task under waiter's feet. Full deletion and blocking waits coordinate with `waiting_for_completion_count` counter. This mechanism -- added by [JDK-8343938](https://bugs.openjdk.org/browse/JDK-8343938) in JDK 25 to solve a similar problem -- almost works. _Almost_. But, there is a subtle race window, where blocking waiter could have already unparked, dropped `waiting_for_completion_count` to `0` and proceeded to delete the task, see `CompileBroker::wait_for_completion()`. 
Then the queue deletion code could assume there are _no actual waiters_ on the blocking task, and proceed to delete the task _again_. Before [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473) this race was fairly innocuous, as second attempt at insertion into the free list was benign. But now, `CompileTask`-s are `delete`-d, and the second attempt leads to double free. I suspect we can fix that by complicating the coordination protocol even further, e.g. by tracking the counters more thoroughly. But, recognizing `CompileQueue::delete_all()` is basically only called from the compiler shutdown code (things are already bad), and it looks completely opportunistic (it does not delete the whole compiler _threads_, so skipping synchronous deletes on a few compile tasks are not a big deal), we should strive to simplify it. This PR summarily delegates _all_ blocking task deletes to waiters. I think it stands to reason (and can be seen in `CompilerBroker` code) that if a blocking task is in queue, then there _is_ a waiter that would call `CompileBroker::wait_for_completion()` on it. Additional testing: - [x] Linux AArch64 server fastdebug, `tier1` - [ ] Linux AArch64 server fastdebug, `all` ------------- Commit messages: - More comment touchups - Comment touchup - Remove ProblemList entry - Sample fix Changes: https://git.openjdk.org/jdk/pull/26294/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26294&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361752 Stats: 56 lines in 4 files changed: 13 ins; 39 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/26294.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26294/head:pull/26294 PR: https://git.openjdk.org/jdk/pull/26294 From shade at openjdk.org Mon Jul 14 15:35:38 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 14 Jul 2025 15:35:38 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 13:57:09 GMT, Aleksey Shipilev wrote: > See the bug for more analysis. > > The short summary is that `CompileQueue::delete_all` walks the entire compile queue and deletes the tasks. It normally goes smoothly, unless there are blocking tasks. Then, the actual waiters have to delete the task, lest we delete the task under waiter's feet. Full deletion and blocking waits coordinate with `waiting_for_completion_count` counter. This mechanism -- added by [JDK-8343938](https://bugs.openjdk.org/browse/JDK-8343938) in JDK 25 to solve a similar problem -- almost works. _Almost_. > > There is a subtle race window, where blocking waiter could have already unparked, dropped `waiting_for_completion_count` to `0` and proceeded to delete the task, see `CompileBroker::wait_for_completion()`. Then the queue deletion code could assume there are _no actual waiters_ on the blocking task, and proceed to delete the task _again_. Before [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473) this race was fairly innocuous, as second attempt at insertion into the free list was benign. But now, `CompileTask`-s are `delete`-d, and the second attempt leads to double free. > > I suspect we can fix that by complicating the coordination protocol even further, e.g. by tracking the counters more thoroughly. 
But, recognizing `CompileQueue::delete_all()` is basically only called from the compiler shutdown code (things are already bad), and it looks completely opportunistic (it does not delete the whole compiler _threads_, so skipping synchronous deletes on a few compile tasks are not a big deal), we should strive to simplify it. > > This PR summarily delegates _all_ blocking task deletes to waiters. I think it stands to reason (and can be seen in `CompilerBroker` code) that if a blocking task is in queue, then there _is_ a waiter that would call `CompileBroker::wait_for_completion()` on it. > > Additional testing: > - [x] Linux AArch64 server fastdebug, `tier1` > - [ ] Linux AArch64 server fastdebug, `all` I am pretty convinced this is it. But I still struggle to reproduce the failure locally. So I would appreciate if @TobiHartmann or @dholmes-ora could give it a spin through the CI where this reproduces. Probably after [JDK-8360048](https://bugs.openjdk.org/browse/JDK-8360048) lands, if that one is not a test-only bug? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26294#issuecomment-3070037202 From kvn at openjdk.org Mon Jul 14 15:39:39 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 14 Jul 2025 15:39:39 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Sun, 13 Jul 2025 08:40:45 GMT, Yadong Wang wrote: >> The bug is that the predicate rule of immByteMapBase can cause a ConP node for an oop to be incorrectly matched as byte_map_base when the placeholder JNI handle happens to be allocated at the very address of byte_map_base. >> >> C2 uses JNI handles as placeholders when encoding constant oops, and one of those handles may happen to lie at the address of byte_map_base, which is not memory reserved by the CardTable. This is possible because JNIHandleBlocks are allocated by malloc. >> >> // The assembler store_check code will do an unsigned shift of the oop, >> // then add it to _byte_map_base, i.e. >> // >> // _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift) >> _byte_map = (CardValue*) rs.base(); >> _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift); >> >> In the aarch64 port, C2 will then incorrectly match the ConP for the oop as the ConP for byte_map_base via the immByteMapBase operand.
>> >> // Card Table Byte Map Base >> operand immByteMapBase() >> %{ >> // Get base of card map >> predicate((jbyte*)n->get_ptr() == >> ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); >> match(ConP); >> >> op_cost(0); >> format %{ %} >> interface(CONST_INTER); >> %} >> >> // Load Byte Map Base Constant >> instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) >> %{ >> match(Set dst con); >> >> ins_cost(INSN_COST); >> format %{ "adr $dst, $con\t# Byte Map Base" %} >> >> ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); >> >> ins_pipe(ialu_imm); >> %} >> >> As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: >> 0xffff25caf08c: ldaxr x8, [x11] >> 0xffff25caf090: cmp x10, x8 >> 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any >> 0xffff25caf098: stlxr w8, x28, [x11] >> 0xffff25caf09c: cbnz w8, 0xffff25caf08c >> 0xffff25caf0a0: orr x11, xzr, #0x3 >> 0xffff25caf0a4: str x11, [x13] >> 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none >> 0xffff25caf0ac: str x14, [sp] >> 0xffff25caf0b0: add x2, sp, #0x20 >> 0xffff25caf0b4: adrp x1, 0xffff21730000 >> 0xffff25caf0b8: bl 0xffff256fffc0 >> 0xffff25caf0bc: ldr x14, [sp] >> 0xffff25caf0c0: b 0xffff25caef80 >> 0xffff25caf0c4: add x13, sp, #0x20 >> 0xffff25caf0c8: adrp x12, 0xffff21730000 >> 0xffff25caf0cc: ldr x10, [x13] >> 0xffff25caf0d0: cmp x10, xzr >> 0xffff25c... > > Yadong Wang has updated the pull request incrementally with one additional commit since the last revision: > > 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding On x86 byte_map_base is handled in GC code: https://github.com/openjdk/leyden/blob/premain/src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp#L314 https://github.com/openjdk/leyden/blob/premain/src/hotspot/cpu/x86/gc/shared/cardTableBarrierSetAssembler_x86.cpp#L67 Using relocation for byte_map_base is not safe (see comment in `g1BarrierSetAssembler_x86.cpp`). We are "safe" because we bailout AOT code caching if byte_map_base is not relocatable: https://github.com/openjdk/leyden/blob/premain/src/hotspot/share/code/aotCodeCache.cpp#L338 I have long standing work to use AOTRuntimeConstants table instead for it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3070047198 From mhaessig at openjdk.org Mon Jul 14 15:43:40 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 14 Jul 2025 15:43:40 GMT Subject: RFR: 8360701: Add bailout when the register allocator interference graph grows unreasonably large [v2] In-Reply-To: References: Message-ID: <11yDpTAB7uCCrx5givvBReRWXU4v_VMTQzhJKYMwXR4=.26076fc5-d6a3-4e00-be27-121fb04bce8b@github.com> On Mon, 14 Jul 2025 15:14:22 GMT, Daniel Lund?n wrote: >> The changeset for JDK-8325467 (https://git.openjdk.org/jdk/pull/20404) enables compilation of methods with many parameters, which C2 previously bailed out on. As a side effect, the tests `BigArityTest.java`, `TestCatchExceptionWithVarargs.java`, and `VarargsArrayTest.java` compile more methods than before, and additionally these methods are designed, for stress testing purposes, to have a large number of parameters (at or close to the maximum of 255 parameters allowed by the JVM spec). 
>> >> Compiling such methods takes a very long time and >99% of the time is spent in the C2 phase Coalesce 2 (part of register allocation). The problem is that the interference graph becomes huge after the initial round of spilling (just before Coalesce 2), and that we do not check for this and bail out if necessary. We do already bail out if the number of IR nodes grows too large, but the interference graph can become huge even if we have a small number of nodes. In fact, the interference graph may (in the worst case) hava a size that is quadratic in the number of nodes. In the problematic tests, we have interference graphs with approximately 100 000 nodes and over 55 000 000 (!) IFG edges. For comparison, the IFG edge count in worst-case realistic scenarios caps out at around 40 000 nodes and 800 000 edges. For example, see the scatter matrix below from running the DaCapo benchmark. It displays, for each time an IFG was built, the number of current IR nodes, the number of live ranges (the actual nodes in the IFG), and the number of IFG edges. >> >> ![dacapo](https://github.com/user-attachments/assets/7a070768-50da-42e4-b5ed-9958e1362673) >> >> ### Changeset >> >> - Add a new diagnostic flag `IFGEdgesLimit` and bail out whenever we reach the number of edges specified by the flag during IFG construction. The default is a very generous 10 000 000 edges, that still filters out the most degenerate compilations we have seen. >> - Add tracking of edges in `PhaseIFG` to permit the new flag. >> >> It is worth noting that it is perhaps preferable to use a lower default than 10 000 000 edges. For example, in standard benchmarks such as DaCapo (see the scatter matrix above), Renaissance, SPECjvm, and SPECjbb, we never go over 1 000 000 edges (I verified this). The reason I went with the generous 10 000 000 limit is that I saw a fair amount of bailouts in testing with the flag set at 1 000 000 edges. Such bailouts are likely motivated, but I do not want to take any chances. Even at 10 000 ... > > Daniel Lund?n has updated the pull request incrementally with one additional commit since the last revision: > > Update src/hotspot/share/opto/c2_globals.hpp > > Co-authored-by: Manuel H?ssig Thank you for elaborating. That makes sense. ------------- Marked as reviewed by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/26118#pullrequestreview-3016850524 From hgreule at openjdk.org Mon Jul 14 16:22:29 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Mon, 14 Jul 2025 16:22:29 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v5] In-Reply-To: References: Message-ID: > Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. > > Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. > > I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. > > Please review. Thanks. 
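To make the truncation behaviour described above concrete, here is a small plain-Java illustration (the helper names are made up; this is not the C2 Value() code itself): when an int wider than 16 bits reaches the short/char reverseBytes constant-folding path, only the low 16 bits take part in the swap, which is what the static_cast achieves.

static short reverseBytesViaShort(int wideInput) {
    short truncated = (short) wideInput;    // analogous to the static_cast<jshort> in the fix
    return Short.reverseBytes(truncated);   // swaps the remaining two bytes
}

static char reverseBytesViaChar(int wideInput) {
    char truncated = (char) wideInput;      // upper 16 bits are simply ignored
    return Character.reverseBytes(truncated);
}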
Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: cleanup ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25988/files - new: https://git.openjdk.org/jdk/pull/25988/files/f8cc3496..271d162a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25988&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25988&range=03-04 Stats: 7 lines in 1 file changed: 0 ins; 5 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/25988.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25988/head:pull/25988 PR: https://git.openjdk.org/jdk/pull/25988 From mhaessig at openjdk.org Mon Jul 14 16:32:42 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 14 Jul 2025 16:32:42 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v5] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 16:22:29 GMT, Hannes Greule wrote: >> Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. >> >> Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. >> >> I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. >> >> Please review. Thanks. > > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > cleanup Thank you for addressing our comments. Looks good. ------------- Marked as reviewed by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/25988#pullrequestreview-3016999054 From shade at openjdk.org Mon Jul 14 16:51:43 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 14 Jul 2025 16:51:43 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 11:19:07 GMT, Andrew Dinn wrote: > One other thing. I believe that the only code that creates a RawPtr ConP for the card table base is in the shared C2 card table barrier set assembler. Now that we have moved to late barrier insertion for G1 (along with Shenandoah and Z) Shenandoah still does not do late barrier expansion. But it also does not emit card table bases as constants, it loads card table bases from TLS. G1 would do this with throughput GC barriers soon too, AFAICS. I think Shenandoah also encodes some "unallocated" `ConP` for the similar biased-base trick, e.g. for collection set bitmap, but none of those ever pretend to be oops. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3070267666 From mhaessig at openjdk.org Mon Jul 14 16:13:39 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 14 Jul 2025 16:13:39 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 13:57:09 GMT, Aleksey Shipilev wrote: > See the bug for more analysis. > > The short summary is that `CompileQueue::delete_all` walks the entire compile queue and deletes the tasks. It normally goes smoothly, unless there are blocking tasks. Then, the actual waiters have to delete the task, lest we delete the task under waiter's feet. 
Full deletion and blocking waits coordinate with `waiting_for_completion_count` counter. This mechanism -- added by [JDK-8343938](https://bugs.openjdk.org/browse/JDK-8343938) in JDK 25 to solve a similar problem -- almost works. _Almost_. > > There is a subtle race window, where blocking waiter could have already unparked, dropped `waiting_for_completion_count` to `0` and proceeded to delete the task, see `CompileBroker::wait_for_completion()`. Then the queue deletion code could assume there are _no actual waiters_ on the blocking task, and proceed to delete the task _again_. Before [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473) this race was fairly innocuous, as second attempt at insertion into the free list was benign. But now, `CompileTask`-s are `delete`-d, and the second attempt leads to double free. > > I suspect we can fix that by complicating the coordination protocol even further, e.g. by tracking the counters more thoroughly. But, recognizing `CompileQueue::delete_all()` is basically only called from the compiler shutdown code (things are already bad), and it looks completely opportunistic (it does not delete the whole compiler _threads_, so skipping synchronous deletes on a few compile tasks are not a big deal), we should strive to simplify it. > > This PR summarily delegates _all_ blocking task deletes to waiters. I think it stands to reason (and can be seen in `CompilerBroker` code) that if a blocking task is in queue, then there _is_ a waiter that would call `CompileBroker::wait_for_completion()` on it. > > Additional testing: > - [x] Linux AArch64 server fastdebug, `tier1` > - [ ] Linux AArch64 server fastdebug, `all` I kicked off a CI run. I'll keep you posted on the results. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26294#issuecomment-3070158657 From kvn at openjdk.org Mon Jul 14 17:18:43 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 14 Jul 2025 17:18:43 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 13:57:09 GMT, Aleksey Shipilev wrote: > See the bug for more analysis. > > The short summary is that `CompileQueue::delete_all` walks the entire compile queue and deletes the tasks. It normally goes smoothly, unless there are blocking tasks. Then, the actual waiters have to delete the task, lest we delete the task under waiter's feet. Full deletion and blocking waits coordinate with `waiting_for_completion_count` counter. This mechanism -- added by [JDK-8343938](https://bugs.openjdk.org/browse/JDK-8343938) in JDK 25 to solve a similar problem -- almost works. _Almost_. > > There is a subtle race window, where blocking waiter could have already unparked, dropped `waiting_for_completion_count` to `0` and proceeded to delete the task, see `CompileBroker::wait_for_completion()`. Then the queue deletion code could assume there are _no actual waiters_ on the blocking task, and proceed to delete the task _again_. Before [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473) this race was fairly innocuous, as second attempt at insertion into the free list was benign. But now, `CompileTask`-s are `delete`-d, and the second attempt leads to double free. > > I suspect we can fix that by complicating the coordination protocol even further, e.g. by tracking the counters more thoroughly. 
But, recognizing `CompileQueue::delete_all()` is basically only called from the compiler shutdown code (things are already bad), and it looks completely opportunistic (it does not delete the whole compiler _threads_, so skipping synchronous deletes on a few compile tasks are not a big deal), we should strive to simplify it. > > This PR summarily delegates _all_ blocking task deletes to waiters. I think it stands to reason (and can be seen in `CompilerBroker` code) that if a blocking task is in queue, then there _is_ a waiter that would call `CompileBroker::wait_for_completion()` on it. > > Additional testing: > - [x] Linux AArch64 server fastdebug, `tier1` > - [ ] Linux AArch64 server fastdebug, `all` src/hotspot/share/compiler/compileBroker.cpp line 394: > 392: ml.notify_all(); > 393: } > 394: What about other compiler threads which still in process of compiling for blocking tasks? They still need it CompileTask object. `delete_all()` is called by one compiler thread which finished compilation but other threads may not. I don't see any compiler thread checks `shut_down` state to stop compilation. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26294#discussion_r2205424761 From adinn at openjdk.org Mon Jul 14 16:10:41 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 14 Jul 2025 16:10:41 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 15:36:41 GMT, Vladimir Kozlov wrote: > On x86 byte_map_base is handled in GC code: Ok, but those two methods cover cases where mov/lea instructions are directly generated into the code stream. Why is there no need for a C2 rule for immByteMapBaee and loadByteMapBase in x86_64.ad. Is it because nothing inserts the card table base into a C2 graph as a ConP node now that we have late barrier generation? If that is the case then I don't think we need the immAOTRuntimeConstantsAddress and loadAOTRCAddress rules in x86_64.ad either. Likewise we can drop the equivalents from aarch64.ad. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3070146912 From kvn at openjdk.org Mon Jul 14 17:20:41 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 14 Jul 2025 17:20:41 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 16:08:15 GMT, Andrew Dinn wrote: > Ok, but those two methods cover cases where mov/lea instructions are directly generated into the code stream. Why is there no need for a C2 rule for immByteMapBaee and loadByteMapBase in x86_64.ad. Is it because nothing inserts the card table base into a C2 graph as a ConP node now that we have late barrier generation? Correct. 
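For context on why a biased byte_map_base can collide with an unrelated C-heap address in the first place, here is the addressing scheme in a hedged, Java-flavoured sketch (illustrative only; the authoritative code is the HotSpot C++ quoted earlier in this thread):

// card = byte_map_base + (field_address >>> card_shift)
// byte_map_base is pre-biased by -(low_bound >>> card_shift), so it need not
// point into the reserved card table at all; it can happen to equal an
// arbitrary malloc'ed address such as a JNIHandleBlock slot.
static long cardAddress(long byteMapBase, long fieldAddress, int cardShift) {
    return byteMapBase + (fieldAddress >>> cardShift);
}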
------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3070348566 From sparasa at openjdk.org Mon Jul 14 17:30:42 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Mon, 14 Jul 2025 17:30:42 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: References: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> Message-ID: On Mon, 14 Jul 2025 08:15:13 GMT, Jatin Bhateja wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> rename to paired_push and paired_pop > > src/hotspot/cpu/x86/gc/z/zBarrierSetAssembler_x86.cpp line 114: > >> 112: __ paired_push(rax); >> 113: } >> 114: __ paired_push(rcx); > > Hi @vamsi-parasa , for consecutive push/pop we can use push2/pop2 and 16byte alignment can be guaranteed using following technique > https://github.com/openjdk/jdk/pull/25351/files#diff-d5d721ebf93346ba66e81257e4f6c5e6268d59774313c61e97353c0dfbf686a5R94 Hi Jatin (@jatin-bhateja), for the first iteration, would it be ok to get the push_paired/pop_paired changes integrated and then make the push2p/pop2p related optimizations in a separate PR? Thanks, Vamsi ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2205447287 From shade at openjdk.org Mon Jul 14 17:31:40 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 14 Jul 2025 17:31:40 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 17:16:07 GMT, Vladimir Kozlov wrote: >> See the bug for more analysis. >> >> The short summary is that `CompileQueue::delete_all` walks the entire compile queue and deletes the tasks. It normally goes smoothly, unless there are blocking tasks. Then, the actual waiters have to delete the task, lest we delete the task under waiter's feet. Full deletion and blocking waits coordinate with `waiting_for_completion_count` counter. This mechanism -- added by [JDK-8343938](https://bugs.openjdk.org/browse/JDK-8343938) in JDK 25 to solve a similar problem -- almost works. _Almost_. >> >> There is a subtle race window, where blocking waiter could have already unparked, dropped `waiting_for_completion_count` to `0` and proceeded to delete the task, see `CompileBroker::wait_for_completion()`. Then the queue deletion code could assume there are _no actual waiters_ on the blocking task, and proceed to delete the task _again_. Before [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473) this race was fairly innocuous, as second attempt at insertion into the free list was benign. But now, `CompileTask`-s are `delete`-d, and the second attempt leads to double free. >> >> I suspect we can fix that by complicating the coordination protocol even further, e.g. by tracking the counters more thoroughly. But, recognizing `CompileQueue::delete_all()` is basically only called from the compiler shutdown code (things are already bad), and it looks completely opportunistic (it does not delete the whole compiler _threads_, so skipping synchronous deletes on a few compile tasks are not a big deal), we should strive to simplify it. >> >> This PR summarily delegates _all_ blocking task deletes to waiters. I think it stands to reason (and can be seen in `CompilerBroker` code) that if a blocking task is in queue, then there _is_ a waiter that would call `CompileBroker::wait_for_completion()` on it. 
>> >> Additional testing: >> - [x] Linux AArch64 server fastdebug, `tier1` >> - [ ] Linux AArch64 server fastdebug, `all` > > src/hotspot/share/compiler/compileBroker.cpp line 394: > >> 392: ml.notify_all(); >> 393: } >> 394: > > What about other compiler threads which still in process of compiling for blocking tasks? They still need it CompileTask object. > `delete_all()` is called by one compiler thread which finished compilation but other threads may not. > > I don't see any compiler thread checks `shut_down` state to stop compilation. AFAIU, that's the point of the existing protocol to force _waiters_ to delete the task: the blocking waiter would wait for compiler thread to complete the task one way or the other. This PR makes that protocol even stronger: _only_ blocking waiters are allowed to delete the blocking task. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26294#discussion_r2205449314 From kvn at openjdk.org Mon Jul 14 17:31:40 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 14 Jul 2025 17:31:40 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 17:18:10 GMT, Vladimir Kozlov wrote: > > Ok, but those two methods cover cases where mov/lea instructions are directly generated into the code stream. Why is there no need for a C2 rule for immByteMapBaee and loadByteMapBase in x86_64.ad. Is it because nothing inserts the card table base into a C2 graph as a ConP node now that we have late barrier generation? > > Correct. Correction. It is true for G1, Z, Shenandoah. For others we still have constant in C2 IR: https://github.com/openjdk/leyden/blob/premain/src/hotspot/share/gc/shared/c2/cardTableBarrierSetC2.cpp#L38 I think we never fully tested AOT with them. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3070382780 From shade at openjdk.org Mon Jul 14 17:37:38 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 14 Jul 2025 17:37:38 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 17:28:43 GMT, Aleksey Shipilev wrote: >> src/hotspot/share/compiler/compileBroker.cpp line 394: >> >>> 392: ml.notify_all(); >>> 393: } >>> 394: >> >> What about other compiler threads which still in process of compiling for blocking tasks? They still need it CompileTask object. >> `delete_all()` is called by one compiler thread which finished compilation but other threads may not. >> >> I don't see any compiler thread checks `shut_down` state to stop compilation. > > AFAIU, that's the point of the existing protocol to force _waiters_ to delete the task: the blocking waiter would wait for compiler thread to complete the task one way or the other. This PR makes that protocol even stronger: _only_ blocking waiters are allowed to delete the blocking task. Ah, your question is what happens if we notify here, and compilations are still running? Well, I think current protocol should nominally allow waiters to wait until compilation is over and then allow them to delete the task. But then I see `wait_for_compilation` can exit when compilation is shut down: while (!task->is_complete() && !is_compilation_disabled_forever()) { ml.wait(); } This will proceed to delete the task while compiler thread is running. Grrr. Looks to be another hole in this protocol. 
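For readers following the thread, the ownership rule under discussion can be condensed into a self-contained, Java-flavoured sketch (illustrative only, not the HotSpot sources): the waiter is the sole party allowed to release a blocking task, and the early exit on a permanently disabled compiler is exactly where that guarantee breaks down.

// Illustrative sketch of the "waiter owns the blocking task" protocol.
final class BlockingTask {
    private boolean complete;
    private boolean disabledForever;   // stands in for "compilation disabled forever"

    synchronized void markComplete()   { complete = true; notifyAll(); }
    synchronized void disableForever() { disabledForever = true; notifyAll(); }

    // Called by the submitting thread, which afterwards owns deletion of the task.
    synchronized void awaitCompletion() throws InterruptedException {
        // The early exit below mirrors the hole pointed out above: the waiter can
        // return (and then free the task) while a compiler thread still uses it.
        while (!complete && !disabledForever) {
            wait();
        }
    }
}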
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26294#discussion_r2205459401 From kvn at openjdk.org Mon Jul 14 17:43:38 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 14 Jul 2025 17:43:38 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 17:34:03 GMT, Aleksey Shipilev wrote: >> AFAIU, that's the point of the existing protocol to force _waiters_ to delete the task: the blocking waiter would wait for compiler thread to complete the task one way or the other. This PR makes that protocol even stronger: _only_ blocking waiters are allowed to delete the blocking task. > > Ah, your question is what happens if we notify here, and compilations are still running? Well, I think current protocol should nominally allow waiters to wait until compilation is over and then allow them to delete the task. But then I see `wait_for_compilation` can exit when compilation is shut down: > > > while (!task->is_complete() && !is_compilation_disabled_forever()) { > ml.wait(); > } > > > This will proceed to delete the task while compiler thread is running. Grrr. Looks to be another hole in this protocol. Can compiler thread delete its **own** blocking task when it finished. And let Java thread resume execution when compilation disabled as it do now but do nothing about task in such case? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26294#discussion_r2205472440 From vpaprotski at openjdk.org Mon Jul 14 17:46:40 2025 From: vpaprotski at openjdk.org (Volodymyr Paprotski) Date: Mon, 14 Jul 2025 17:46:40 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> References: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> Message-ID: On Tue, 8 Jul 2025 22:44:55 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to enhance the existing x86 assembly stubs using PUSH and POP instructions with paired PUSHP/POPP instructions which are part of Intel APX technology. >> >> In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0?R31). Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs. >> >> Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures. > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > rename to paired_push and paired_pop My concerns have been addressed; thanks Vamsi for changing the names! ------------- Marked as reviewed by vpaprotski (Author). 
PR Review: https://git.openjdk.org/jdk/pull/25889#pullrequestreview-3017232260 From vpaprotski at openjdk.org Mon Jul 14 17:56:42 2025 From: vpaprotski at openjdk.org (Volodymyr Paprotski) Date: Mon, 14 Jul 2025 17:56:42 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: References: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> Message-ID: <9OjuCgSkGnhRtf-nqXBbTu74RaaoySY7JMRaaJ0kaIY=.5575082f-2048-4d03-a8e1-3c23f916f8db@github.com> On Mon, 14 Jul 2025 17:27:35 GMT, Srinivas Vamsi Parasa wrote: >> src/hotspot/cpu/x86/gc/z/zBarrierSetAssembler_x86.cpp line 114: >> >>> 112: __ paired_push(rax); >>> 113: } >>> 114: __ paired_push(rcx); >> >> Hi @vamsi-parasa , for consecutive push/pop we can use push2/pop2 and 16byte alignment can be guaranteed using following technique >> https://github.com/openjdk/jdk/pull/25351/files#diff-d5d721ebf93346ba66e81257e4f6c5e6268d59774313c61e97353c0dfbf686a5R94 > > Hi Jatin (@jatin-bhateja), for the first iteration, would it be ok to get the push_paired/pop_paired changes integrated and then make the push2p/pop2p related optimizations in a separate PR? > > Thanks, > Vamsi I like the current approach, unless we can come up with a very 'visually-low-overhead' way of adding extra alignment (i.e. this current change is across quite a few files, rather not complicate it). At most, perhaps something like `MacroAssembler::push_align()/MacroAssembler::pop_align()`, but I really rather not add more to this PR; It touches quite a few places so I like it being simpler. As it stands, if nothing else, its clear from the `if` statement that existing path is left unmodified. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2205494234 From snatarajan at openjdk.org Mon Jul 14 18:28:57 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Mon, 14 Jul 2025 18:28:57 GMT Subject: RFR: 8342941: IGV: Add new graph dumps for post loop, empty loop removal, and one iteration removal [v4] In-Reply-To: References: Message-ID: > This changeset adds BEFORE/AFTER graph dumps for creating a post loop (`insert_post_loop()`), removing an empty loop (`do_remove_empty_loop()`), and removing a one iteration loop (`do_one_iteration_loop()`). > > Changes: > - Added `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` for dumping graphs before and after `insert_post_loop()`. > - Added `BEFORE_REMOVE_EMPTY_LOOP` and `AFTER_REMOVE_EMPTY_LOOP` for dumping graphs before and after `do_remove_empty_loop()`. > - Added `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` for dumping graphs before and after `do_one_iteration_loop()`. > > Below are sample screenshots (IGV print level 4 ) mainly showing the new phase . > 1. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` > ![image](https://github.com/user-attachments/assets/1661cede-5d70-4e0d-abec-3d091c7675c8) > 2. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` with SuperWordLoopUnrollAnalysis enabled > ![image](https://github.com/user-attachments/assets/6a22e6f0-4e6c-4e9d-8b6b-2bf75fac783d) > 3.` BEFORE_REMOVE_EMPTY_LOOP `and `AFTER_REMOVE_EMPTY_LOOP` > ![image](https://github.com/user-attachments/assets/3281f00b-575e-4604-83dd-831037d8dd47) > 4. `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` > ![image](https://github.com/user-attachments/assets/efddbc9a-64f7-403d-acfe-330d75a00911) > > Question to reviewers: > Are the new compiler phases OK, or should we change anything? 
> > Testing: > GitHub Actions > tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. > Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: addressing review comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25756/files - new: https://git.openjdk.org/jdk/pull/25756/files/8e4ca211..37aab41d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25756&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25756&range=02-03 Stats: 12 lines in 4 files changed: 7 ins; 0 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/25756.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25756/head:pull/25756 PR: https://git.openjdk.org/jdk/pull/25756 From snatarajan at openjdk.org Mon Jul 14 18:34:39 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Mon, 14 Jul 2025 18:34:39 GMT Subject: RFR: 8342941: IGV: Add various new graph dumps during loop opts [v3] In-Reply-To: References: <_p5Jj77u1VyyW0eVneXqeNjmngTvSvFi94_FALv6swk=.d4e5aec1-dd73-48ed-8d7f-3080207be763@github.com> <-qvrPep0_75olkxXj9BT74oMIHTfxwgshrHnqQC9BuU=.501e3840-2b5d-4c7c-b2fe-891a167c66d8@github.com> Message-ID: <-Kqb9qaVnwdGmWRs2cR7CGGBMEt-SltMljgv2kR6AAM=.d9cee3d7-ec38-4427-ad31-99ef950afdce@github.com> On Fri, 11 Jul 2025 07:18:17 GMT, Christian Hagedorn wrote: >> src/hotspot/share/opto/phasetype.hpp line 83: >> >>> 81: flags(AFTER_REMOVE_EMPTY_LOOP, "After Remove Empty Loop") \ >>> 82: flags(BEFORE_ONE_ITERATION_LOOP, "Before Replacing One Iteration Loop") \ >>> 83: flags(AFTER_ONE_ITERATION_LOOP, "After Replacing One Iteration Loop") \ >> >> Very much a nit, but I think this should be "One-Iteration Loop". Or, is it in fact one _iteration loop_ (as it reads now)? Looking at the code, I think it is the former. @chhagedorn can maybe clarify? >> >> This is not specific to your changeset, but also appears in existing source code comments. Maybe a good opportunity to clean this up everywhere? >> >> Also, maybe "Replacing" should be "Replace"? Seems to better fit the style used for other phase names. > > One-Iteration loop sounds better indeed. I also agree with the other suggestions. > > Something else I've noticed is that we could also benefit when we add dumps for `duplicate_loop_backedge()` which creates a new loop node (i.e. could be seen as "major modification"). I just looked into recently and found myself adding dumps there manually for debugging. I guess since this is a dump adding RFE, we could also add that one. What do you think? But then we would need to update the PR title to something like "add various new graph dumps during loop opts". Thank you for the comments. I have made the suggested changes to the source code. Attached is a screenshot for the graph dump for `duplicate_loop_backedge() `. I have added the suggested new title. 
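For context, new IGV phases like the ones discussed in this thread are typically wired up in two small steps. The sketch below is illustrative only; the dump level and the exact shape of do_one_iteration_loop() are assumptions rather than the actual patch:

```c++
// 1) Declare the phases in opto/phasetype.hpp; the flags() macro generates the
//    PHASE_* enum values and the names IGV shows (continuation backslashes omitted):
//      flags(BEFORE_ONE_ITERATION_LOOP, "Before Replacing One-Iteration Loop")
//      flags(AFTER_ONE_ITERATION_LOOP,  "After Replacing One-Iteration Loop")
// 2) Bracket the transformation with graph dumps (illustrative shape only):
void PhaseIdealLoop::do_one_iteration_loop(IdealLoopTree* loop) {
  C->print_method(PHASE_BEFORE_ONE_ITERATION_LOOP, 4, loop->_head);
  // ... existing code that replaces the single-trip counted loop with its body ...
  C->print_method(PHASE_AFTER_ONE_ITERATION_LOOP, 4, loop->_head);
}
```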
image ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25756#discussion_r2205559207 From snatarajan at openjdk.org Mon Jul 14 18:50:23 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Mon, 14 Jul 2025 18:50:23 GMT Subject: RFR: 8353276: C2: simplify PhaseMacroExpand::opt_bits_test [v2] In-Reply-To: References: Message-ID: > **Issue** > The last three parameters of `PhaseMacroExpand::opt_bits_test(Node* ctrl, Node* region, int edge, Node* word, int mask, int bits, bool return_fast_path)` are unnecessary after the fix introduced in [JDK-8256425](https://bugs.openjdk.org/browse/JDK-8256425) > > **Fix** > The proposed fix removes the last three parameters and makes the necessary modification to the methods. > > **Testing** > GitHub Actions > tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: addressing review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26276/files - new: https://git.openjdk.org/jdk/pull/26276/files/c8164502..1b6be049 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26276&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26276&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 1 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26276.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26276/head:pull/26276 PR: https://git.openjdk.org/jdk/pull/26276 From snatarajan at openjdk.org Mon Jul 14 18:50:23 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Mon, 14 Jul 2025 18:50:23 GMT Subject: RFR: 8353276: C2: simplify PhaseMacroExpand::opt_bits_test [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 07:43:41 GMT, Christian Hagedorn wrote: >> Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: >> >> addressing review comments > > src/hotspot/share/opto/macro.cpp line 98: > >> 96: Node* PhaseMacroExpand::opt_bits_test(Node* ctrl, Node* region, int edge, Node* word) { >> 97: Node* cmp; >> 98: cmp = word; > > Could now be merged (I cannot make a direct suggestion due to deleted lines): > > Node* cmp = word; Thank you. I have addressed this in the new commit ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26276#discussion_r2205586068 From duke at openjdk.org Mon Jul 14 20:25:21 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Mon, 14 Jul 2025 20:25:21 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v36] In-Reply-To: References: Message-ID: > This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). > > When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. > > This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality > > Additional Testing: > - [ ] Linux x64 fastdebug all > - [ ] Linux aarch64 fastdebug all > - [ ] ... 
Chad Rakoczy has updated the pull request incrementally with two additional commits since the last revision: - Add nmethod copy constructor - Remove aarch64 trampoline check ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23573/files - new: https://git.openjdk.org/jdk/pull/23573/files/66d73c16..371e1303 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=35 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=34-35 Stats: 77 lines in 2 files changed: 0 ins; 19 del; 58 mod Patch: https://git.openjdk.org/jdk/pull/23573.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23573/head:pull/23573 PR: https://git.openjdk.org/jdk/pull/23573 From duke at openjdk.org Mon Jul 14 20:34:42 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Mon, 14 Jul 2025 20:34:42 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v37] In-Reply-To: References: Message-ID: > This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). > > When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. > > This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality > > Additional Testing: > - [ ] Linux x64 fastdebug all > - [ ] Linux aarch64 fastdebug all > - [ ] ... Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: Revert is_always_within_branch_range changes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23573/files - new: https://git.openjdk.org/jdk/pull/23573/files/371e1303..36834705 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=36 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=35-36 Stats: 4 lines in 2 files changed: 0 ins; 3 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23573.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23573/head:pull/23573 PR: https://git.openjdk.org/jdk/pull/23573 From duke at openjdk.org Mon Jul 14 20:51:51 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Mon, 14 Jul 2025 20:51:51 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: <85Fw_Bg0OrMd_LYl4PG_VFqFX2QTdcUK-DFOAxzyjIM=.bdbb0c13-f7a3-458f-a61a-004c6eadc1cc@github.com> References: <73AnlXOv0T8K25DgsNdH1PkBjcBXz0f3bBYZx44LpAw=.439f5383-ffd1-44e8-9e11-4b5af9b6a278@github.com> <3f1UnDuYp2iYVcciKF-BqdChOOY2PJJG5R0QuyfblVM=.37a92dfd-5de4-4924-83c5-f9c2e5d7548c@github.com> <85Fw_Bg0OrMd_LYl4PG_VFqFX2QTdcUK-DFOAxzyjIM=.bdbb0c13-f7a3-458f-a61a-004c6eadc1cc@github.com> Message-ID: On Sun, 13 Jul 2025 09:31:45 GMT, Andrew Haley wrote: > In what circumstances would a trampoline be missing? A trampoline could be missing if the nmethod is from JVMCI/Graal. Hotspot decreases the max branch size for debug builds on aarch64 ([source](https://github.com/openjdk/jdk/blob/a10ee46e6dd94a279e0821d431944bb096493664/src/hotspot/cpu/aarch64/assembler_aarch64.hpp#L928-L936)) for stress testing. 
Since Graal only ever uses the actual max range Hotspot may expect trampolines that Graal has determined aren't actually necessary. However after updating how call sites are fixed ([commit](https://github.com/openjdk/jdk/pull/23573/commits/a6302fdf5754b382702577e8e421c85a5fb9063c)) this code is no longer needed. Since the trampoline relocations are responsible for fixing their owners `CallRelocation::fix_relocation_after_move` no longer needs to perform range checks. So the situation I mentioned above about "missing trampolines" is no longer an issue for relocation ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2205791168 From adinn at openjdk.org Mon Jul 14 21:11:38 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 14 Jul 2025 21:11:38 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: <-QLF92QDigkY7M3h0GdnV-4J-OHpybRnOiolpdIwQIU=.8b0518d8-7c95-42d5-983a-adf7d0d464c9@github.com> On Mon, 14 Jul 2025 17:29:11 GMT, Vladimir Kozlov wrote: > Correction. It is true for G1, Z, Shenandoah. For others we still have constant in C2 IR Ok, so for Leyden premain we really need to detect a card table base ConP in an x86 C2 graph and generate an external reloc for it -- if, say, we use a generational serial/parallel GC. Likewise we will need to detect and generate external relocs for any ConP node that references an AOTRuntimeConstants field. We don't have to do it using a matching rule. We could instead implement the relevant logic in the encoding for a generic load(ConP) rule or in a macro assembler method called from that encoding. I still personally prefer the idea of distinguishing the relevant cases by detecting a RawPtr vs an OopPtr as being the least intrusive. However, as you say that may still leave us with a card base address that might cause other problems (e.g. it might be zero or a small negative offset). So, let's deal with the backports as Alexei proposed and, when we come to it, implement the premain variants in the encoding or macro assembler rather than via match rules. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3070995960 From dlong at openjdk.org Mon Jul 14 22:58:42 2025 From: dlong at openjdk.org (Dean Long) Date: Mon, 14 Jul 2025 22:58:42 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v5] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 16:22:29 GMT, Hannes Greule wrote: >> Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. >> >> Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. >> >> I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. >> >> Please review. Thanks. > > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > cleanup Marked as reviewed by dlong (Reviewer). I started a test run for the latest version. Please hold off on integrating until the results are in. 
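To make the fix described above concrete, here is a minimal sketch of the constant-folding path for the short flavour. The node and helper names (ReverseBytesSNode, byteswap) are assumptions for illustration; this is not the actual patch:

```c++
// Narrow the int carrier with static_cast before swapping, so constants with
// bits above the short range are folded instead of tripping an assert.
const Type* ReverseBytesSNode::Value(PhaseGVN* phase) const {
  const Type* t = phase->type(in(1));
  if (t == Type::TOP) {
    return Type::TOP;
  }
  if (t->singleton()) {
    jshort narrowed = static_cast<jshort>(t->is_int()->get_con()); // drop the upper bytes
    return TypeInt::make(byteswap(narrowed));                      // swap the low two bytes
  }
  return bottom_type();
}
```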
------------- PR Review: https://git.openjdk.org/jdk/pull/25988#pullrequestreview-3018022287 PR Comment: https://git.openjdk.org/jdk/pull/25988#issuecomment-3071272532 From dzhang at openjdk.org Tue Jul 15 01:30:46 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Tue, 15 Jul 2025 01:30:46 GMT Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors [v2] In-Reply-To: References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> <5PCmTwnensUBsUNqVlxUuK6L2nDHIOqek7KEH5r_h_M=.9a05eebc-f3ba-4b0e-b0e0-76e89661c89d@github.com> Message-ID: On Mon, 14 Jul 2025 12:19:01 GMT, Feilong Jiang wrote: >> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> Adjust the position of comment > > src/hotspot/cpu/riscv/riscv.ad line 1999: > >> 1997: } else if (bt == T_SHORT) { >> 1998: // To support vector type conversions between short and wider types. >> 1999: size = 2; > > Should we add some `assert` or `guarantee` for uncovered types? Thanks for the review! I think assert or guarantee are unnecessary for types not explicitly covered, as their behavior is safely constrained by general rules and global constraints (e.g., limit the min vector size to 8-byte and minimum size of 2). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26239#discussion_r2206096134 From xgong at openjdk.org Tue Jul 15 01:32:46 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 15 Jul 2025 01:32:46 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 13:23:33 GMT, Bhavana Kilambi wrote: >> This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform. >> >> ### Background >> Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register. >> >> ### Implementation >> >> #### Challenges >> Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints. >> >> For a 512-bit SVE machine, loading a `byte` vector with different vector species require different approaches: >> - SPECIES_64: Single operation with mask (8 elements, 256-bit) >> - SPECIES_128: Single operation, full register (16 elements, 512-bit) >> - SPECIES_256: Two operations + merge (32 elements, 1024-bit) >> - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit) >> >> Use `ByteVector.SPECIES_512` as an example: >> - It contains 64 elements. So the index vector size should be `64 * 32` bits, which is 4 times of the SVE vector register size. >> - It requires 4 times of vector gather-loads to finish the whole operation. >> >> >> byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...] >> int[] idx = [0, 1, 2, 3, ..., 63, ...] >> >> 4 gather-load: >> idx_v1 = [15 14 13 ... 1 0] gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa] >> idx_v2 = [31 30 29 ... 17 16] gather_v2 = [... 
0000 0000 0000 0000 bbbb bbbb bbbb bbbb] >> idx_v3 = [47 46 45 ... 33 32] gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc] >> idx_v4 = [63 62 61 ... 49 48] gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd] >> merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa] >> >> >> #### Solution >> The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end. >> >> Here is the main changes: >> - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher. >> - Added `VectorSliceNode` for result mer... > > src/hotspot/cpu/aarch64/aarch64_vector.ad line 5990: > >> 5988: %} >> 5989: >> 5990: instruct vmaskwiden_hi_sve(pReg dst, pReg src) %{ > > can both the hi and lo widen rules be combined into a single one as the arguments are the same? or would it make it less understandable? The main problem is that we cannot get the flag of `__is_lo` easily from the relative machnode as far as I know. > src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 352: > >> 350: // SVE requires vector indices for gather-load/scatter-store operations >> 351: // on all data types. >> 352: bool Matcher::gather_scatter_needs_vector_index(BasicType bt) { > > There's already a function that tests for `UseSVE > 0` here - https://github.com/openjdk/jdk/blob/bcd86d575fe0682a234228c18b0c2e817d3816da/src/hotspot/cpu/aarch64/matcher_aarch64.hpp#L36 > > Can it be reused? Do you mean directly using `supports_scalable_vector` instead of the new added method in mid-end? I'm afraid we cannot use it. Because on X86, the indexes for subword types are passed with address of the index array, while it's a vector for other types even on AVX-512. But yes, we can call `supports_scalable_vector()` in the new added method for AArch64. > src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 3430: > >> 3428: >> 3429: instruct vslice_neon(vReg dst, vReg src1, vReg src2, immI index) %{ >> 3430: predicate(VM_Version::use_neon_for_vector(Matcher::vector_length_in_bytes(n))); > > nit: indentation. I think there're 3 spaces here.. Same with the SVE version below. Good catch! I will update it. Thanks a lot! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2206092888 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2206096909 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2206097514 From xgong at openjdk.org Tue Jul 15 02:43:45 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 15 Jul 2025 02:43:45 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 11:17:41 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. 
The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments to half the number of match rules Thanks for your updating! Overall looks good to me, just with a minor assertion issue in macro assembler. Please see my comment below. src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2861: > 2859: FloatRegister src2, FloatRegister index, > 2860: FloatRegister tmp, unsigned vector_length_in_bytes) { > 2861: assert_different_registers(dst, src1, src2, tmp); It seems `dst` can be the same with either `src1`, `src2`, or `tmp` from following implementation instruction, right? Maybe we should assert more accurate for different cases, such as `src2 == src1 + 1` when `vector_length_in_bytes == 16`? ------------- PR Review: https://git.openjdk.org/jdk/pull/23570#pullrequestreview-3018320296 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2206159876 From fyang at openjdk.org Tue Jul 15 03:19:40 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 15 Jul 2025 03:19:40 GMT Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors [v2] In-Reply-To: References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> <5PCmTwnensUBsUNqVlxUuK6L2nDHIOqek7KEH5r_h_M=.9a05eebc-f3ba-4b0e-b0e0-76e89661c89d@github.com> Message-ID: On Tue, 15 Jul 2025 01:27:59 GMT, Dingli Zhang wrote: >> src/hotspot/cpu/riscv/riscv.ad line 1999: >> >>> 1997: } else if (bt == T_SHORT) { >>> 1998: // To support vector type conversions between short and wider types. >>> 1999: size = 2; >> >> Should we add some `assert` or `guarantee` for uncovered types? > > Thanks for the review! > I think assert or guarantee are unnecessary for types not explicitly covered, as their behavior is safely constrained by general rules and global constraints (e.g., limit the min vector size to 8-byte and minimum size of 2). I think it's more clear to make this a switch-case listing the other cases as the default like the aarch64 counterpart. 
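The switch-case shape suggested here could look roughly like the sketch below. It is illustrative only; the per-type minimums and comments are assumptions, not the final patch:

```c++
// Assumed shape of Matcher::min_vector_size in riscv.ad after the rework.
int Matcher::min_vector_size(const BasicType bt) {
  int max_size = max_vector_size(bt);
  int size;
  switch (bt) {
    case T_BYTE:
      size = 4;  // backend-specific minimum for byte vectors
      break;
    case T_SHORT:
      // 2 elements (32 bits) so casts between short and int/long/float/double
      // can still be vectorized for the smallest short species.
      size = 2;
      break;
    default:
      size = 8 / type2aelembytes(bt);  // limit the minimum vector size to 8 bytes
      break;
  }
  return MIN2(MAX2(size, 2), max_size);  // never fewer than 2 elements, never above the max
}
```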
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26239#discussion_r2206193272 From dzhang at openjdk.org Tue Jul 15 04:01:21 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Tue, 15 Jul 2025 04:01:21 GMT Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> Message-ID: > Follow up [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419), RVV supports all vector type conversion APIs in the Vector API. > So we only need to relax the length limit of the short type to achieve a significant improvement in JMH performance for converting between short and wider types. > > ### Test > qemu-system UseRVV: > * [x] Run jdk_vector (fastdebug) > * [x] Run compiler/vectorapi (fastdebug) > > ### Performance > Following shows the performance improvement of relative VectorAPI JMHs on k1 (256-bit RVV): > > > Benchmark (SIZE) Mode Units Before After Gain > VectorFPtoIntCastOperations.microDouble128ToShort128 512 thrpt ops/ms 52.280 840.112 16.07 > VectorFPtoIntCastOperations.microDouble128ToShort128 1024 thrpt ops/ms 28.156 429.322 15.25 > VectorFPtoIntCastOperations.microFloat64ToShort64 512 thrpt ops/ms 14.242 479.509 33.67 > VectorFPtoIntCastOperations.microFloat64ToShort64 1024 thrpt ops/ms 6.906 242.690 35.14 > > PS: `VectorFPtoIntCastOperations.microFloat64ToShort64` is added by [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419). Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Use switch-case in min_vector_size ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26239/files - new: https://git.openjdk.org/jdk/pull/26239/files/0773a366..84d6b25f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26239&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26239&range=01-02 Stats: 21 lines in 1 file changed: 8 ins; 0 del; 13 mod Patch: https://git.openjdk.org/jdk/pull/26239.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26239/head:pull/26239 PR: https://git.openjdk.org/jdk/pull/26239 From dzhang at openjdk.org Tue Jul 15 04:01:22 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Tue, 15 Jul 2025 04:01:22 GMT Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors [v2] In-Reply-To: References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> <5PCmTwnensUBsUNqVlxUuK6L2nDHIOqek7KEH5r_h_M=.9a05eebc-f3ba-4b0e-b0e0-76e89661c89d@github.com> Message-ID: On Tue, 15 Jul 2025 03:16:17 GMT, Fei Yang wrote: >> Thanks for the review! >> I think assert or guarantee are unnecessary for types not explicitly covered, as their behavior is safely constrained by general rules and global constraints (e.g., limit the min vector size to 8-byte and minimum size of 2). > > I think it's more clear to make this a switch-case listing the other cases as the default like the aarch64 counterpart. Fixed. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26239#discussion_r2206229388 From fjiang at openjdk.org Tue Jul 15 06:01:38 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Tue, 15 Jul 2025 06:01:38 GMT Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> Message-ID: On Tue, 15 Jul 2025 04:01:21 GMT, Dingli Zhang wrote: >> Follow up [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419), RVV supports all vector type conversion APIs in the Vector API. >> So we only need to relax the length limit of the short type to achieve a significant improvement in JMH performance for converting between short and wider types. >> >> ### Test >> qemu-system UseRVV: >> * [x] Run jdk_vector (fastdebug) >> * [x] Run compiler/vectorapi (fastdebug) >> >> ### Performance >> Following shows the performance improvement of relative VectorAPI JMHs on k1 (256-bit RVV): >> >> >> Benchmark (SIZE) Mode Units Before After Gain >> VectorFPtoIntCastOperations.microDouble128ToShort128 512 thrpt ops/ms 52.280 840.112 16.07 >> VectorFPtoIntCastOperations.microDouble128ToShort128 1024 thrpt ops/ms 28.156 429.322 15.25 >> VectorFPtoIntCastOperations.microFloat64ToShort64 512 thrpt ops/ms 14.242 479.509 33.67 >> VectorFPtoIntCastOperations.microFloat64ToShort64 1024 thrpt ops/ms 6.906 242.690 35.14 >> >> PS: `VectorFPtoIntCastOperations.microFloat64ToShort64` is added by [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419). > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Use switch-case in min_vector_size Marked as reviewed by fjiang (Committer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26239#pullrequestreview-3018777723 From thartmann at openjdk.org Tue Jul 15 06:25:47 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 15 Jul 2025 06:25:47 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v5] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 16:22:29 GMT, Hannes Greule wrote: >> Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. >> >> Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. >> >> I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. >> >> Please review. Thanks. > > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > cleanup Looks good to me. Also, @dean-long's testing is clean. Ship it! :slightly_smiling_face: ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/25988#pullrequestreview-3018843054 From hgreule at openjdk.org Tue Jul 15 06:30:50 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Tue, 15 Jul 2025 06:30:50 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v5] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 16:22:29 GMT, Hannes Greule wrote: >> Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. >> >> Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. >> >> I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. >> >> Please review. Thanks. > > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > cleanup Thank you all for your reviews and comments. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25988#issuecomment-3072174881 From hgreule at openjdk.org Tue Jul 15 06:30:52 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Tue, 15 Jul 2025 06:30:52 GMT Subject: Integrated: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() In-Reply-To: References: Message-ID: On Wed, 25 Jun 2025 19:34:40 GMT, Hannes Greule wrote: > Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. > > Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. > > I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. > > Please review. Thanks. This pull request has now been integrated. Changeset: e5ab2107 Author: Hannes Greule Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/e5ab210713f76c5307287bd97ce63f9e22d0ab8e Stats: 81 lines in 3 files changed: 71 ins; 0 del; 10 mod 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() Reviewed-by: mhaessig, dlong, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/25988 From thartmann at openjdk.org Tue Jul 15 06:35:22 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 15 Jul 2025 06:35:22 GMT Subject: [jdk25] RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() Message-ID: Hi all, This pull request contains a backport of commit [e5ab2107](https://github.com/openjdk/jdk/commit/e5ab210713f76c5307287bd97ce63f9e22d0ab8e) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. The commit being backported was authored by Hannes Greule on 15 Jul 2025 and was reviewed by Manuel H?ssig, Dean Long and Tobias Hartmann. Thanks! 
------------- Commit messages: - Backport e5ab210713f76c5307287bd97ce63f9e22d0ab8e Changes: https://git.openjdk.org/jdk/pull/26308/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26308&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8359678 Stats: 81 lines in 3 files changed: 71 ins; 0 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/26308.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26308/head:pull/26308 PR: https://git.openjdk.org/jdk/pull/26308 From snatarajan at openjdk.org Tue Jul 15 06:38:34 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Tue, 15 Jul 2025 06:38:34 GMT Subject: RFR: 8358641: C1 option -XX:+TimeEachLinearScan is broken [v2] In-Reply-To: References: Message-ID: > **Issue** > Using the command` java -Xcomp -XX:TieredStopAtLevel=1 -XX:+TimeEachLinearScan` results in an assert failure in line `assert(_cached_blocks.length() == ir()->linear_scan_order()->length()) failed: invalid cached block list`. > > **Suggestion** > Removal of flag as this is a very old issue > > **Fix** > Removed the flag by removing relevant methods and code while ensuring the removal does not affect other flags. Saranya Natarajan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - addressing review comments - merge master Merge branch 'master' of https://github.com/sarannat/jdk into JDK-8358641 - Initial Fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25933/files - new: https://git.openjdk.org/jdk/pull/25933/files/980f9a50..d66a4ce9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25933&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25933&range=00-01 Stats: 225058 lines in 4104 files changed: 131570 ins; 62711 del; 30777 mod Patch: https://git.openjdk.org/jdk/pull/25933.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25933/head:pull/25933 PR: https://git.openjdk.org/jdk/pull/25933 From xgong at openjdk.org Tue Jul 15 06:45:27 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 15 Jul 2025 06:45:27 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v2] In-Reply-To: References: Message-ID: > This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform. > > ### Background > Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register. > > ### Implementation > > #### Challenges > Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints. 
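As a rough way to picture the splitting, the number of gather-load operations needed for one subword vector follows directly from the 32-bit index element size. The helper below is purely hypothetical and only restates that arithmetic:

```c++
// Each SVE gather-load is driven by an int-index vector, so one operation can
// load at most (SVE vector width in bits) / 32 subword elements.
static int gather_loads_needed(int elem_num, int sve_width_in_bits) {
  int elems_per_gather = sve_width_in_bits / 32;
  int ops = elem_num / elems_per_gather;
  return ops < 1 ? 1 : ops;  // e.g. a 64-element byte vector on 512-bit SVE -> 4 operations
}
```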
> > For a 512-bit SVE machine, loading a `byte` vector with different vector species require different approaches: > - SPECIES_64: Single operation with mask (8 elements, 256-bit) > - SPECIES_128: Single operation, full register (16 elements, 512-bit) > - SPECIES_256: Two operations + merge (32 elements, 1024-bit) > - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit) > > Use `ByteVector.SPECIES_512` as an example: > - It contains 64 elements. So the index vector size should be `64 * 32` bits, which is 4 times of the SVE vector register size. > - It requires 4 times of vector gather-loads to finish the whole operation. > > > byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...] > int[] idx = [0, 1, 2, 3, ..., 63, ...] > > 4 gather-load: > idx_v1 = [15 14 13 ... 1 0] gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa] > idx_v2 = [31 30 29 ... 17 16] gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb] > idx_v3 = [47 46 45 ... 33 32] gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc] > idx_v4 = [63 62 61 ... 49 48] gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd] > merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa] > > > #### Solution > The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end. > > Here is the main changes: > - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher. > - Added `VectorSliceNode` for result merging. > - Added `VectorMaskWidenNode` for mask spliting and type conversion fo... Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - Fix indentation issue and move the helper matcher method to header files - Merge branch jdk:master into JDK-8351623-sve - 8351623: VectorAPI: Add SVE implementation of subword gather load operation ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26236/files - new: https://git.openjdk.org/jdk/pull/26236/files/a3db39c3..c39dade2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26236&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26236&range=00-01 Stats: 16304 lines in 537 files changed: 7374 ins; 5177 del; 3753 mod Patch: https://git.openjdk.org/jdk/pull/26236.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26236/head:pull/26236 PR: https://git.openjdk.org/jdk/pull/26236 From thartmann at openjdk.org Tue Jul 15 06:50:47 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 15 Jul 2025 06:50:47 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v13] In-Reply-To: References: <-olTdjMIhNFfAwGbtWC5xswpKbgM_6uPJBgqoL-joJg=.83566f34-801e-449d-b613-dc2f81f40e54@github.com> <260VpdfrxR3vKnrlKQPuVwzJJ3lXM6liDFV4mi-7swg=.eff991a1-0b1f-4fe5-bc31-69896640e654@github.com> Message-ID: On Mon, 14 Jul 2025 15:13:52 GMT, Jatin Bhateja wrote: >>> -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation >> >> Thanks @TobiHartmann , kindly verify with the latest version. > >> @jatin-bhateja This is with the latest version (webrev 13). 
> > Hi @TobiHartmann I don't see any failure at https://github.com/openjdk/jdk/pull/23947/commits/06eafe7712833d830bbd60cdb729ad261eca59b8 Hi @jatin-bhateja, you are right, my testing missed your latest commit. Re-running. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3072242632 From chagedorn at openjdk.org Tue Jul 15 07:02:47 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 15 Jul 2025 07:02:47 GMT Subject: [jdk25] RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() In-Reply-To: References: Message-ID: <5ee-QeU1KgK8FfKwb2ZCxjTwWR5jwAR4iA3xp5GBnqc=.aec9cf05-1524-498f-a4ab-717a30264a18@github.com> On Tue, 15 Jul 2025 06:29:18 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [e5ab2107](https://github.com/openjdk/jdk/commit/e5ab210713f76c5307287bd97ce63f9e22d0ab8e) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Hannes Greule on 15 Jul 2025 and was reviewed by Manuel H?ssig, Dean Long and Tobias Hartmann. > > Thanks! Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26308#pullrequestreview-3018958682 From thartmann at openjdk.org Tue Jul 15 07:15:40 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 15 Jul 2025 07:15:40 GMT Subject: [jdk25] RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 06:29:18 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [e5ab2107](https://github.com/openjdk/jdk/commit/e5ab210713f76c5307287bd97ce63f9e22d0ab8e) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Hannes Greule on 15 Jul 2025 and was reviewed by Manuel H?ssig, Dean Long and Tobias Hartmann. > > Thanks! Thanks Christian! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26308#issuecomment-3072340552 From fyang at openjdk.org Tue Jul 15 07:25:45 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 15 Jul 2025 07:25:45 GMT Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> Message-ID: On Tue, 15 Jul 2025 04:01:21 GMT, Dingli Zhang wrote: >> Follow up [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419), RVV supports all vector type conversion APIs in the Vector API. >> So we only need to relax the length limit of the short type to achieve a significant improvement in JMH performance for converting between short and wider types. 
>> >> ### Test >> qemu-system UseRVV: >> * [x] Run jdk_vector (fastdebug) >> * [x] Run compiler/vectorapi (fastdebug) >> >> ### Performance >> Following shows the performance improvement of relative VectorAPI JMHs on k1 (256-bit RVV): >> >> >> Benchmark (SIZE) Mode Units Before After Gain >> VectorFPtoIntCastOperations.microDouble128ToShort128 512 thrpt ops/ms 52.280 840.112 16.07 >> VectorFPtoIntCastOperations.microDouble128ToShort128 1024 thrpt ops/ms 28.156 429.322 15.25 >> VectorFPtoIntCastOperations.microFloat64ToShort64 512 thrpt ops/ms 14.242 479.509 33.67 >> VectorFPtoIntCastOperations.microFloat64ToShort64 1024 thrpt ops/ms 6.906 242.690 35.14 >> >> PS: `VectorFPtoIntCastOperations.microFloat64ToShort64` is added by [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419). > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Use switch-case in min_vector_size src/hotspot/cpu/riscv/riscv.ad line 1999: > 1997: break; > 1998: case T_SHORT: > 1999: // To support vector type conversions between short and wider types. The code comment doesn't seem to reflect the purpose of this change. Can you improve it adding more details? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26239#discussion_r2206635111 From duke at openjdk.org Tue Jul 15 08:11:28 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Tue, 15 Jul 2025 08:11:28 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v9] In-Reply-To: References: Message-ID: <7od62MdoD83EGfh9UTcLLE1pkkvaZclx2c9sIiLB58M=.066b94a9-e66e-475a-8434-2a6160c7642c@github.com> > The patch adds possibility to use RVV instructions for faster vectorizedHashCode calculations on RVV v1.0.0 capable hardware. > > Testing: hotspot/jtreg/compiler/ under QEMU-8.1 with RVV v1.0.0. Yuri Gaevsky has updated the pull request incrementally with one additional commit since the last revision: simplified arrays_hashcode_v() to be closer to VLA and use less general-purpose registers; minor cosmetic changes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17413/files - new: https://git.openjdk.org/jdk/pull/17413/files/4e9ad18f..6daaae6e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17413&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17413&range=07-08 Stats: 92 lines in 4 files changed: 13 ins; 34 del; 45 mod Patch: https://git.openjdk.org/jdk/pull/17413.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17413/head:pull/17413 PR: https://git.openjdk.org/jdk/pull/17413 From aph at openjdk.org Tue Jul 15 08:22:40 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 15 Jul 2025 08:22:40 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 15:36:41 GMT, Vladimir Kozlov wrote: > On x86 byte_map_base is handled in GC code: https://github.com/openjdk/leyden/blob/premain/src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp#L314 https://github.com/openjdk/leyden/blob/premain/src/hotspot/cpu/x86/gc/shared/cardTableBarrierSetAssembler_x86.cpp#L67 > > Using relocation for byte_map_base is not safe (see comment in `g1BarrierSetAssembler_x86.cpp`). 
We are "safe" because we bailout AOT code caching if byte_map_base is not relocatable: https://github.com/openjdk/leyden/blob/premain/src/hotspot/share/code/aotCodeCache.cpp#L338 Here: // Do not use ExternalAddress to load 'byte_map_base', since 'byte_map_base' is NOT // a valid address and therefore is not properly handled by the relocation code. if (AOTCodeCache::is_on_for_dump()) { // AOT code needs relocation info for this address __ lea(tmp2, ExternalAddress((address)ct->card_table()->byte_map_base())); // tmp2 := card table base address } else { It says "Do not use `ExternalAddress` to load 'byte_map_base'" but then uses `ExternalAddress`. I'm baffled... ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3072625120 From duke at openjdk.org Tue Jul 15 08:31:48 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Tue, 15 Jul 2025 08:31:48 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v9] In-Reply-To: <7od62MdoD83EGfh9UTcLLE1pkkvaZclx2c9sIiLB58M=.066b94a9-e66e-475a-8434-2a6160c7642c@github.com> References: <7od62MdoD83EGfh9UTcLLE1pkkvaZclx2c9sIiLB58M=.066b94a9-e66e-475a-8434-2a6160c7642c@github.com> Message-ID: On Tue, 15 Jul 2025 08:11:28 GMT, Yuri Gaevsky wrote: >> The patch adds possibility to use RVV instructions for faster vectorizedHashCode calculations on RVV v1.0.0 capable hardware. >> >> Testing: hotspot/jtreg/compiler/ under QEMU-8.1 with RVV v1.0.0. > > Yuri Gaevsky has updated the pull request incrementally with one additional commit since the last revision: > > simplified arrays_hashcode_v() to be closer to VLA and use less general-purpose registers; minor cosmetic changes bpif3-16g% ( for i in "-XX:DisableIntrinsic=_vectorizedHashCode" "-XX:-UseRVV" "-XX:+UseRVV" ; \ do ( echo "--- ${i} ---" && ${JAVA_HOME}/bin/java -jar benchmarks.jar \ --jvmArgs="-XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions ${i}" \ org.openjdk.bench.java.lang.ArraysHashCode.ints \ -p size=1,5,10,20,30,40,50,60,70,80,90,100,200,300 \ -f 3 -r 1 -w 1 -wi 10 -i 10 2>&1 | tail -15 ) done ) --- -XX:DisableIntrinsic=_vectorizedHashCode --- Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.ints 1 avgt 30 11.273 ? 0.003 ns/op ArraysHashCode.ints 5 avgt 30 28.817 ? 0.013 ns/op ArraysHashCode.ints 10 avgt 30 41.330 ? 0.280 ns/op ArraysHashCode.ints 20 avgt 30 68.236 ? 0.057 ns/op ArraysHashCode.ints 30 avgt 30 88.455 ? 0.142 ns/op ArraysHashCode.ints 40 avgt 30 115.251 ? 0.350 ns/op ArraysHashCode.ints 50 avgt 30 135.525 ? 0.685 ns/op ArraysHashCode.ints 60 avgt 30 161.547 ? 0.165 ns/op ArraysHashCode.ints 70 avgt 30 171.417 ? 0.402 ns/op ArraysHashCode.ints 80 avgt 30 193.232 ? 0.241 ns/op ArraysHashCode.ints 90 avgt 30 207.720 ? 0.304 ns/op ArraysHashCode.ints 100 avgt 30 232.256 ? 0.792 ns/op ArraysHashCode.ints 200 avgt 30 447.408 ? 0.308 ns/op ArraysHashCode.ints 300 avgt 30 656.444 ? 1.332 ns/op --- -XX:-UseRVV --- Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.ints 1 avgt 30 11.279 ? 0.013 ns/op ArraysHashCode.ints 5 avgt 30 24.427 ? 0.005 ns/op ArraysHashCode.ints 10 avgt 30 35.704 ? 0.011 ns/op ArraysHashCode.ints 20 avgt 30 58.894 ? 0.062 ns/op ArraysHashCode.ints 30 avgt 30 82.685 ? 0.015 ns/op ArraysHashCode.ints 40 avgt 30 105.861 ? 0.065 ns/op ArraysHashCode.ints 50 avgt 30 129.672 ? 0.038 ns/op ArraysHashCode.ints 60 avgt 30 152.865 ? 0.057 ns/op ArraysHashCode.ints 70 avgt 30 176.689 ? 0.063 ns/op ArraysHashCode.ints 80 avgt 30 199.823 ? 0.035 ns/op ArraysHashCode.ints 90 avgt 30 223.588 ? 
0.046 ns/op ArraysHashCode.ints 100 avgt 30 247.405 ? 0.661 ns/op ArraysHashCode.ints 200 avgt 30 481.698 ? 0.123 ns/op ArraysHashCode.ints 300 avgt 30 716.488 ? 0.104 ns/op --- -XX:+UseRVV --- Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.ints 1 avgt 30 11.276 ? 0.002 ns/op ArraysHashCode.ints 5 avgt 30 22.590 ? 0.039 ns/op ArraysHashCode.ints 10 avgt 30 35.075 ? 0.008 ns/op ArraysHashCode.ints 20 avgt 30 60.142 ? 0.015 ns/op ArraysHashCode.ints 30 avgt 30 85.185 ? 0.020 ns/op ArraysHashCode.ints 40 avgt 30 114.650 ? 1.260 ns/op ArraysHashCode.ints 50 avgt 30 115.520 ? 0.958 ns/op ArraysHashCode.ints 60 avgt 30 113.143 ? 0.416 ns/op ArraysHashCode.ints 70 avgt 30 139.685 ? 0.021 ns/op ArraysHashCode.ints 80 avgt 30 137.792 ? 0.644 ns/op ArraysHashCode.ints 90 avgt 30 139.445 ? 0.458 ns/op ArraysHashCode.ints 100 avgt 30 164.109 ? 0.036 ns/op ArraysHashCode.ints 200 avgt 30 237.400 ? 0.045 ns/op ArraysHashCode.ints 300 avgt 30 318.105 ? 0.562 ns/op ------------- PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-3072653412 From duke at openjdk.org Tue Jul 15 08:47:45 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Tue, 15 Jul 2025 08:47:45 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v9] In-Reply-To: References: <7od62MdoD83EGfh9UTcLLE1pkkvaZclx2c9sIiLB58M=.066b94a9-e66e-475a-8434-2a6160c7642c@github.com> Message-ID: <0ltHS3q6Eer8KH_h_TPorujtrPoJHlG6n3mnfv4ZBSY=.56f26d21-98e5-4319-9084-b38122834837@github.com> On Tue, 15 Jul 2025 08:28:51 GMT, Yuri Gaevsky wrote: >> Yuri Gaevsky has updated the pull request incrementally with one additional commit since the last revision: >> >> simplified arrays_hashcode_v() to be closer to VLA and use less general-purpose registers; minor cosmetic changes > > bpif3-16g% ( for i in "-XX:DisableIntrinsic=_vectorizedHashCode" "-XX:-UseRVV" "-XX:+UseRVV" ; \ > do ( echo "--- ${i} ---" && ${JAVA_HOME}/bin/java -jar benchmarks.jar \ > --jvmArgs="-XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions ${i}" \ > org.openjdk.bench.java.lang.ArraysHashCode.ints \ > -p size=1,5,10,20,30,40,50,60,70,80,90,100,200,300 \ > -f 3 -r 1 -w 1 -wi 10 -i 10 2>&1 | tail -15 ) done ) > --- -XX:DisableIntrinsic=_vectorizedHashCode --- > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.ints 1 avgt 30 11.273 ? 0.003 ns/op > ArraysHashCode.ints 5 avgt 30 28.817 ? 0.013 ns/op > ArraysHashCode.ints 10 avgt 30 41.330 ? 0.280 ns/op > ArraysHashCode.ints 20 avgt 30 68.236 ? 0.057 ns/op > ArraysHashCode.ints 30 avgt 30 88.455 ? 0.142 ns/op > ArraysHashCode.ints 40 avgt 30 115.251 ? 0.350 ns/op > ArraysHashCode.ints 50 avgt 30 135.525 ? 0.685 ns/op > ArraysHashCode.ints 60 avgt 30 161.547 ? 0.165 ns/op > ArraysHashCode.ints 70 avgt 30 171.417 ? 0.402 ns/op > ArraysHashCode.ints 80 avgt 30 193.232 ? 0.241 ns/op > ArraysHashCode.ints 90 avgt 30 207.720 ? 0.304 ns/op > ArraysHashCode.ints 100 avgt 30 232.256 ? 0.792 ns/op > ArraysHashCode.ints 200 avgt 30 447.408 ? 0.308 ns/op > ArraysHashCode.ints 300 avgt 30 656.444 ? 1.332 ns/op > --- -XX:-UseRVV --- > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.ints 1 avgt 30 11.279 ? 0.013 ns/op > ArraysHashCode.ints 5 avgt 30 24.427 ? 0.005 ns/op > ArraysHashCode.ints 10 avgt 30 35.704 ? 0.011 ns/op > ArraysHashCode.ints 20 avgt 30 58.894 ? 0.062 ns/op > ArraysHashCode.ints 30 avgt 30 82.685 ? 0.015 ns/op > ArraysHashCode.ints 40 avgt 30 105.861 ? 0.065 ns/op > ArraysHashCode.ints 50 avgt 30 129.672 ? 0.038 ns/op > ArraysHashCode.ints 60 avgt 30 152.865 ? 
0.057 ns/op > ArraysHashCode.ints 70 avgt 30 176.689 ? 0.063 ns/op > ArraysHashCode.ints 80 avgt 30 199.823 ? 0.035 ns/op > ArraysHashCode.ints 90 avgt 30 223.588 ? 0.046 ns/op > ArraysHashCode.ints 100 avgt 30 247.405 ? 0.661 ns/op > ArraysHashCode.ints 200 avgt 30 481.698 ? 0.123 ns/op > ArraysHashCode.ints 300 avgt 30 716.488 ? 0.104 ns/op > --- -XX:+UseRVV --- > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.ints 1 avgt 30 ... > > @ygaevsky @RealFYang how can we procced ? > > My apologies, just busy at the moment with other things, going to update the patch soon. Thinking more about suggestions to make the code VLA: I don't understand how to break the dependency on the `result` calculation, since it depends on the previous `result`: ``` result = 31^^M * result + 31^^(M-1) * val[i+0] + ... + 31^^0 * val[i+(M-1)]; ```
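One common way to limit that dependency, sketched below in plain scalar C++ (purely illustrative, not RVV code and not the proposed patch), is to fold a whole strip of M elements against precomputed powers of 31, so only one multiply-accumulate per strip stays serial:

```c++
// result' = 31^M * result + 31^(M-1)*val[i] + ... + 31^0*val[i+M-1]
// The strip sum does not depend on 'result', so it can be computed with vector
// multiplies against a coefficient vector; only the last line is serial.
int hash_strip(int result, const int* val, int i, int M, const int* pow31 /* pow31[k] == 31^k */) {
  int strip_sum = 0;
  for (int j = 0; j < M; j++) {
    strip_sum += pow31[M - 1 - j] * val[i + j];  // independent of 'result' -> vectorizable
  }
  return pow31[M] * result + strip_sum;          // single serial step per strip of M elements
}
```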
> > Additional testing: > - [x] Linux AArch64 server fastdebug, `tier1` > - [ ] Linux AArch64 server fastdebug, `all` Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: Also handle the corner case when compiler threads might be using the task ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26294/files - new: https://git.openjdk.org/jdk/pull/26294/files/13625998..76bfa8d1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26294&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26294&range=00-01 Stats: 18 lines in 2 files changed: 7 ins; 2 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/26294.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26294/head:pull/26294 PR: https://git.openjdk.org/jdk/pull/26294 From shade at openjdk.org Tue Jul 15 08:59:17 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 15 Jul 2025 08:59:17 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 17:40:56 GMT, Vladimir Kozlov wrote: >> Ah, your question is what happens if we notify here, and compilations are still running? Well, I think current protocol should nominally allow waiters to wait until compilation is over and then allow them to delete the task. But then I see `wait_for_compilation` can exit when compilation is shut down: >> >> >> while (!task->is_complete() && !is_compilation_disabled_forever()) { >> ml.wait(); >> } >> >> >> This will proceed to delete the task while compiler thread is running. Grrr. Looks to be another hole in this protocol. > > Can compiler thread delete its **own** blocking task when it finished. And let Java thread resume execution when compilation disabled as it do now but do nothing about task in such case? I don't think that works. There is no "own" blocking task, there are nearly always two threads involved: the compiler thread and the waiter (Java) thread. Waiter is checking the task status under the lock. Logically, the last _user_ should delete the task, that is waiter. But I think we can handle this hole by ignoring the blocking task deletion during compiler shutdown. For the same reason described in PR body: we already leave cruft behind in that case, and it costs us quite a bit of complexity to deal with every corner case during shutdown. So it seems simpler to just drop the tasks on the floor in that corner case. I did a variant of this in new commit, seems to still work well under stress testing. More testing is running now... ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26294#discussion_r2206893603 From dlunden at openjdk.org Tue Jul 15 09:11:45 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Tue, 15 Jul 2025 09:11:45 GMT Subject: RFR: 8342941: IGV: Add various new graph dumps during loop opts [v4] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 18:28:57 GMT, Saranya Natarajan wrote: >> This changeset adds BEFORE/AFTER graph dumps for creating a post loop (`insert_post_loop()`), removing an empty loop (`do_remove_empty_loop()`), and removing a one iteration loop (`do_one_iteration_loop()`). >> >> Changes: >> - Added `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` for dumping graphs before and after `insert_post_loop()`. >> - Added `BEFORE_REMOVE_EMPTY_LOOP` and `AFTER_REMOVE_EMPTY_LOOP` for dumping graphs before and after `do_remove_empty_loop()`. 
>> - Added `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` for dumping graphs before and after `do_one_iteration_loop()`. >> >> Below are sample screenshots (IGV print level 4 ) mainly showing the new phase . >> 1. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` >> ![image](https://github.com/user-attachments/assets/1661cede-5d70-4e0d-abec-3d091c7675c8) >> 2. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` with SuperWordLoopUnrollAnalysis enabled >> ![image](https://github.com/user-attachments/assets/6a22e6f0-4e6c-4e9d-8b6b-2bf75fac783d) >> 3.` BEFORE_REMOVE_EMPTY_LOOP `and `AFTER_REMOVE_EMPTY_LOOP` >> ![image](https://github.com/user-attachments/assets/3281f00b-575e-4604-83dd-831037d8dd47) >> 4. `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` >> ![image](https://github.com/user-attachments/assets/efddbc9a-64f7-403d-acfe-330d75a00911) >> >> Question to reviewers: >> Are the new compiler phases OK, or should we change anything? >> >> Testing: >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. >> Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > addressing review comment test/hotspot/jtreg/compiler/lib/ir_framework/CompilePhase.java line 94: > 92: AFTER_REMOVE_EMPTY_LOOP( "After Remove Empty Loop"), > 93: BEFORE_ONE_ITERATION_LOOP( "Before Replacing One Iteration Loop"), > 94: AFTER_ONE_ITERATION_LOOP( "After Replacing One Iteration Loop"), Suggestion: BEFORE_ONE_ITERATION_LOOP( "Before Replacing One-Iteration Loop"), AFTER_ONE_ITERATION_LOOP( "After Replacing One-Iteration Loop"), I see there are also a few more occurrences that I think needs updating (`grep` for "one iteration loop") ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25756#discussion_r2206928409 From bmaillard at openjdk.org Tue Jul 15 10:23:26 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Tue, 15 Jul 2025 10:23:26 GMT Subject: RFR: 8358573: CompileBroker fails with "expect failure reason" assert with -XX:-InstallMethods Message-ID: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> This PR prevents from hitting an assert when disabling method installation at the end of a successful compilation with the `-XX:-InstallMethods` flag. Previously `CompileBroker` failed to mark the `CompileTask` as complete when using this flag. ## Analysis We can see that the assert is triggered in `CompileBroker::invoke_compiler_on_method`: ```c++ if (!ci_env.failing() && !task->is_success()) { assert(ci_env.failure_reason() != nullptr, "expect failure reason"); assert(false, "compiler should always document failure: %s", ci_env.failure_reason()); // The compiler elected, without comment, not to register a result. // Do not attempt further compilations of this method. ci_env.record_method_not_compilable("compile failed"); } The `task->is_success()` call accesses the private `_is_success` field. This field is modified in `CompileTask::mark_success`. 
By setting a breakpoint there and executing the program without `-XX:-InstallMethods`, we get the following stack trace:

CompileTask::mark_success compileTask.hpp:185
nmethod::post_compiled_method nmethod.cpp:2212
ciEnv::register_method ciEnv.cpp:1127
Compilation::install_code c1_Compilation.cpp:425
Compilation::compile_method c1_Compilation.cpp:488
Compilation::Compilation c1_Compilation.cpp:609
Compiler::compile_method c1_Compiler.cpp:262
CompileBroker::invoke_compiler_on_method compileBroker.cpp:2324
CompileBroker::compiler_thread_loop compileBroker.cpp:1968
CompilerThread::thread_entry compilerThread.cpp:67
JavaThread::thread_main_inner javaThread.cpp:773
JavaThread::run javaThread.cpp:758
Thread::call_run thread.cpp:243
thread_native_entry os_linux.cpp:868

We go up the stack trace and see that in `Compilation::compile_method` we have:

```c++
if (should_install_code()) {
  // install code
  PhaseTraceTime timeit(_t_codeinstall);
  install_code(frame_size);
}
```

If we do not install methods after compilation, the code path that marks the task as successful is never executed, and we therefore hit the assert.

### Fix

We simply mark the task as complete when `should_install_code()` evaluates to `false` in the code block above.

### Testing

- [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8358573)
- [ ] tier1-3, plus some internal testing
- [x] Added a test that starts the VM with the `-XX:-InstallMethods` flag

Thank you for reviewing!

-------------

Commit messages:
 - 8358573: Add test for -XX:-InstallMethods
 - 8358573: Add missing task success notification

Changes: https://git.openjdk.org/jdk/pull/26310/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26310&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8358573
Stats: 50 lines in 2 files changed: 50 ins; 0 del; 0 mod
Patch: https://git.openjdk.org/jdk/pull/26310.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/26310/head:pull/26310

PR: https://git.openjdk.org/jdk/pull/26310

From dzhang at openjdk.org Tue Jul 15 10:28:56 2025
From: dzhang at openjdk.org (Dingli Zhang)
Date: Tue, 15 Jul 2025 10:28:56 GMT
Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors [v4]
In-Reply-To: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com>
References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com>
Message-ID: 

> Following up on [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419): RVV supports all vector type conversion APIs in the Vector API.
> So we only need to relax the length limit of the short type to achieve a significant improvement in JMH performance for converting between short and wider types.
>
> ### Test
> qemu-system UseRVV:
> * [x] Run jdk_vector (fastdebug)
> * [x] Run compiler/vectorapi (fastdebug)
>
> ### Performance
> The following shows the performance improvement of the relevant VectorAPI JMH benchmarks on k1 (256-bit RVV):
>
> Benchmark (SIZE) Mode Units Before After Gain
> VectorFPtoIntCastOperations.microDouble128ToShort128 512 thrpt ops/ms 52.280 840.112 16.07
> VectorFPtoIntCastOperations.microDouble128ToShort128 1024 thrpt ops/ms 28.156 429.322 15.25
> VectorFPtoIntCastOperations.microFloat64ToShort64 512 thrpt ops/ms 14.242 479.509 33.67
> VectorFPtoIntCastOperations.microFloat64ToShort64 1024 thrpt ops/ms 6.906 242.690 35.14
>
> PS: `VectorFPtoIntCastOperations.microFloat64ToShort64` is added by [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419).
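For context on the conversions benchmarked above, here is a hedged sketch of what a Float64-to-Short64 cast looks like at the Java level with the incubating Vector API; the class name, species choices, and loop-free shape are assumptions for illustration, not the benchmark source:

```java
// Sketch: lane-wise narrowing cast from a 64-bit float vector (2 lanes) to a
// 64-bit short vector, the kind of shape-changing conversion the short-vector
// length relaxation speeds up on RVV. Requires --add-modules jdk.incubator.vector.
import jdk.incubator.vector.*;

public class FloatToShortCast {
    static final VectorSpecies<Float> FSP = FloatVector.SPECIES_64;
    static final VectorSpecies<Short> SSP = ShortVector.SPECIES_64;

    static ShortVector castOne(float[] src, int i) {
        FloatVector fv = FloatVector.fromArray(FSP, src, i);
        // F2S narrows each lane; part 0 places the converted lanes in the low
        // lanes of the short species, with the remaining lanes zeroed.
        return (ShortVector) fv.convertShape(VectorOperators.F2S, SSP, 0);
    }
}
```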
Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Update comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26239/files - new: https://git.openjdk.org/jdk/pull/26239/files/84d6b25f..7120525b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26239&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26239&range=02-03 Stats: 6 lines in 1 file changed: 3 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/26239.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26239/head:pull/26239 PR: https://git.openjdk.org/jdk/pull/26239 From bulasevich at openjdk.org Tue Jul 15 10:29:03 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Tue, 15 Jul 2025 10:29:03 GMT Subject: RFR: 8362250: ARM32: forward_exception_entry missing return address Message-ID: <4O9aorKuQ5wpIGNVsjHd8K8lIQR-uRDxEl7HsFuUyXk=.e9c48f43-f406-4540-a231-cae9bdfc0f11@github.com> The ARM32 ForwardExceptionNode codegen needs to set the exception address to R5. And, since the https://github.com/openjdk/jdk/pull/20437 change, the TailCall generator does not need this because the StubRoutines::forward_exception_entry function is not called there. ------------- Commit messages: - 8362250: ARM32: forward_exception_entry missing return address Changes: https://git.openjdk.org/jdk/pull/26312/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26312&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362250 Stats: 5 lines in 1 file changed: 1 ins; 4 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26312.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26312/head:pull/26312 PR: https://git.openjdk.org/jdk/pull/26312 From aph at openjdk.org Tue Jul 15 11:02:47 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 15 Jul 2025 11:02:47 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet In-Reply-To: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: <02gzu7-f5QW8kKcXb2WROl3k1rdsNcM4q1sofUHZ6oE=.5d9898c4-0ea2-4dcb-9dbb-eabaffb93b91@github.com> On Thu, 26 Jun 2025 12:13:19 GMT, Samuel Chee wrote: > AtomicLong.CompareAndSet has the following assembly dump snippet which gets emitted from the intermediary LIRGenerator::atomic_cmpxchg: > > ;; cmpxchg { > 0x0000e708d144cf60: mov x8, x2 > 0x0000e708d144cf64: casal x8, x3, [x0] > 0x0000e708d144cf68: cmp x8, x2 > ;; 0x1F1F1F1F1F1F1F1F > 0x0000e708d144cf6c: mov x8, #0x1f1f1f1f1f1f1f1f > ;; } cmpxchg > 0x0000e708d144cf70: cset x8, ne // ne = any > 0x0000e708d144cf74: dmb ish > > > According to the Oracle Java Specification, AtomicLong.CompareAndSet [1] has the same memory effects as specified by VarHandle.compareAndSet which has the following effects: [2] > >> Atomically sets the value of a variable to the >> newValue with the memory semantics of setVolatile if >> the variable's current value, referred to as the witness >> value, == the expectedValue, as accessed with the memory >> semantics of getVolatile. > > > > Hence the release on the store due to setVolatile only occurs if the compare is successful. Since casal already satisfies these requirements, the dmb does not need to occur to ensure memory ordering in case the compare fails and a release does not happen. 
> > Hence we remove the dmb from both casl and casw (same logic applies to the non-long variant) > > This is also reflected by C2 not having a dmb for the same respective method. > > [1] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/util/concurrent/atomic/AtomicLong.html#compareAndSet(long,long) > [2] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/lang/invoke/VarHandle.html#compareAndSet(java.lang.Object...) With the help of Will Deacon, one of the authors of the memory model. I've now got to the bottom of this. It is indeed a change to the MM, dating from 2002. I agree that the DMB isn't needed here because the CASAL has both acquire and release semantics. However, I don't think that's related to the snippet of the architecture you have above but rather comes from: // DDI0487L_b // Barrier-ordered-before (B2-255) ... * All of the following apply: - E1 is an Explicit Memory Write Effect and is generated by an atomic instruction with both Acquire and Release semantics. - E1 appears in program order before E2. - One of the following applies: - E2 is an Explicit Memory Effect. - E2 is an Implicit Tag Memory Read Effect. - E2 is an MMU Fault Effect. Which says that the release store of the CASAL is ordered before the the subsequent store to y. Note that this _wouldn't_ work if you used CASL instead. The full details of the MM change are here: http://github.com/herd/herdtools7/commit/636b7163c0679c691b8cf9a04623cd3aa1cc0ec3 So, this change looks good, and we can remove trailing DMBs from most CASALs. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26000#issuecomment-3073154641 PR Comment: https://git.openjdk.org/jdk/pull/26000#issuecomment-3073156671 From aph-open at littlepinkcloud.com Tue Jul 15 11:04:10 2025 From: aph-open at littlepinkcloud.com (Andrew Haley) Date: Tue, 15 Jul 2025 12:04:10 +0100 Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet In-Reply-To: <02gzu7-f5QW8kKcXb2WROl3k1rdsNcM4q1sofUHZ6oE=.5d9898c4-0ea2-4dcb-9dbb-eabaffb93b91@github.com> References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> <02gzu7-f5QW8kKcXb2WROl3k1rdsNcM4q1sofUHZ6oE=.5d9898c4-0ea2-4dcb-9dbb-eabaffb93b91@github.com> Message-ID: On 15/07/2025 12:02, Andrew Haley wrote: > It is indeed a change to the MM, dating from 2002. 2022, obvs. From adinn at openjdk.org Tue Jul 15 11:23:39 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 15 Jul 2025 11:23:39 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: <8NL_uVAlbHrnK9t1Ec89Uk100mo0ADe-_ni9b7QXQss=.39f24034-b4d5-4b25-a2a7-d3930a50730f@github.com> On Tue, 15 Jul 2025 08:19:59 GMT, Andrew Haley wrote: > It says "Do not use ExternalAddress to load 'byte_map_base'" but then uses ExternalAddress. I'm baffled... You need to read it as a FIXME ;-) The current workaround that we have prototyped is to place the base 'address' in a field in the global AOTRuntimeConstants instance and load it via that field. That's not very attractive because we need to do an indirect load from an lea'd constant address (movz/movk/movk/ldr) at every occurrence of a GC barrier. We also need to relocate every mov sequence when we load the code from the archive. What we would like longer term is to store the address in a method/stub's constants section and use a pc-relative load to access it. 
We would need to provide the constant entry with a relocation so we can reinit it to the VM's current base at AOT code load, but we can use the same constant for every occurrence of the barrier. That also means we update fewer pages during reloc, which means we should eventually be able to rely on mmapping AOT code cache pages with limited copy-on-write for reloced insns.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3073218378

From bkilambi at openjdk.org Tue Jul 15 11:32:46 2025
From: bkilambi at openjdk.org (Bhavana Kilambi)
Date: Tue, 15 Jul 2025 11:32:46 GMT
Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15]
In-Reply-To: 
References: 
Message-ID: 

On Tue, 15 Jul 2025 02:36:45 GMT, Xiaohong Gong wrote:

>> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Addressed review comments to half the number of match rules
>
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2861:
>
>> 2859: FloatRegister src2, FloatRegister index,
>> 2860: FloatRegister tmp, unsigned vector_length_in_bytes) {
>> 2861: assert_different_registers(dst, src1, src2, tmp);
>
> It seems `dst` can be the same as either `src1`, `src2`, or `tmp` from the following implementation instructions, right? Maybe we should assert more accurately for the different cases, such as `src2 == src1 + 1` when `vector_length_in_bytes == 16`?

`dst, src1, src2` and `tmp` need to be different registers. Only `dst` and `index` can match, depending on the type of the input, which is the reason why I didn't add `index` to the assertion.

For the `src2 == src1 + 1` case, this is already checked in the definition of the `tbl` instruction for SVE in `src/hotspot/cpu/aarch64/assembler_aarch64.hpp` (but I realized it's not for Neon). Do you think it's enough if I make the Neon `tbl` instruction definition compatible with the SVE one? Or would it be better to add a separate assertion here as well?
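For readers unfamiliar with the operation this backend patch implements, a small usage sketch of the two-vector `selectFrom` from the incubating Vector API may help; the species, values, and class name are assumptions for illustration only:

```java
// Sketch: the receiver supplies per-lane indices into the concatenation of v1 and v2.
// Indices in [0, VLENGTH) pick from v1, indices in [VLENGTH, 2*VLENGTH) pick from v2.
import jdk.incubator.vector.*;

public class SelectFromTwoVectorExample {
    static final VectorSpecies<Integer> SP = IntVector.SPECIES_128; // 4 int lanes

    static int[] demo() {
        IntVector v1  = IntVector.fromArray(SP, new int[] {10, 11, 12, 13}, 0);
        IntVector v2  = IntVector.fromArray(SP, new int[] {20, 21, 22, 23}, 0);
        IntVector idx = IntVector.fromArray(SP, new int[] {0, 5, 2, 7}, 0);
        IntVector r   = (IntVector) idx.selectFrom(v1, v2); // lanes: 10, 21, 12, 23
        return r.toArray();
    }
}
```

This two-source rearrange is what the patch maps onto the two-table `tbl` lookup discussed in this thread.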
Changeset: 7aa3f317 Author: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/7aa3f31724844bf2f4e08111af8173b5d985f809 Stats: 81 lines in 3 files changed: 71 ins; 0 del; 10 mod 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() Reviewed-by: chagedorn Backport-of: e5ab210713f76c5307287bd97ce63f9e22d0ab8e ------------- PR: https://git.openjdk.org/jdk/pull/26308 From shade at openjdk.org Tue Jul 15 11:38:41 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 15 Jul 2025 11:38:41 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: <2WKDMaaFc9Hg9wAKR2_tRwmCKD6zrt-BZQ-3UXUYEPs=.acacec4e-7ccd-43d7-986d-50c06d805f37@github.com> On Sun, 13 Jul 2025 08:40:45 GMT, Yadong Wang wrote: >> The bug is that the predicate rule of immByteMapBase would cause a ConP Node for oop incorrect matching with byte_map_base when the placeholder jni handle address was just allocated to the address of byte_map_base. >> >> C2 uses JNI handles as placeholders to encoding constant oops, and one of some handle maybe locate at the address of byte_map_base, which is not memory reserved by CardTable. It's possible because JNIHandleBlocks are allocated by malloc. >> >> // The assembler store_check code will do an unsigned shift of the oop, >> // then add it to _byte_map_base, i.e. >> // >> // _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift) >> _byte_map = (CardValue*) rs.base(); >> _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift); >> >> In aarch64 port, C2 will incorrectly match ConP for oop to ConP for byte_map_base by the immByteMapBase operand. >> >> // Card Table Byte Map Base >> operand immByteMapBase() >> %{ >> // Get base of card map >> predicate((jbyte*)n->get_ptr() == >> ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); >> match(ConP); >> >> op_cost(0); >> format %{ %} >> interface(CONST_INTER); >> %} >> >> // Load Byte Map Base Constant >> instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) >> %{ >> match(Set dst con); >> >> ins_cost(INSN_COST); >> format %{ "adr $dst, $con\t# Byte Map Base" %} >> >> ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); >> >> ins_pipe(ialu_imm); >> %} >> >> As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: >> 0xffff25caf08c: ldaxr x8, [x11] >> 0xffff25caf090: cmp x10, x8 >> 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any >> 0xffff25caf098: stlxr w8, x28, [x11] >> 0xffff25caf09c: cbnz w8, 0xffff25caf08c >> 0xffff25caf0a0: orr x11, xzr, #0x3 >> 0xffff25caf0a4: str x11, [x13] >> 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none >> 0xffff25caf0ac: str x14, [sp] >> 0xffff25caf0b0: add x2, sp, #0x20 >> 0xffff25caf0b4: adrp x1, 0xffff21730000 >> 0xffff25caf0b8: bl 0xffff256fffc0 >> 0xffff25caf0bc: ldr x14, [sp] >> 0xffff25caf0c0: b 0xffff25caef80 >> 0xffff25caf0c4: add x13, sp, #0x20 >> 0xffff25caf0c8: adrp x12, 0xffff21730000 >> 0xffff25caf0cc: ldr x10, [x13] >> 0xffff25caf0d0: cmp x10, xzr >> 0xffff25c... 
> > Yadong Wang has updated the pull request incrementally with one additional commit since the last revision: > > 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding I read this that we are in consensus that removing the bad rule is the way to go. I ran Linux AArch64 server fastdebug `make test TEST=all` on Graviton 3 and seen no new problems. Therefore, I am approving this to get the reviewer train going. ------------- Marked as reviewed by shade (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26249#pullrequestreview-3019924039 From mhaessig at openjdk.org Tue Jul 15 11:55:41 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 15 Jul 2025 11:55:41 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 16:11:15 GMT, Manuel H?ssig wrote: > I kicked off a CI run. FWIW, tier1-tier3, and 100 repeats of `TestStressBailout.java` on Linux x64 & aarch64, Windows x64, and Mac aarch64 all passed. Let me know when I should kick off another round. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26294#issuecomment-3073304815 From snatarajan at openjdk.org Tue Jul 15 12:09:31 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Tue, 15 Jul 2025 12:09:31 GMT Subject: RFR: 8342941: IGV: Add various new graph dumps during loop opts [v5] In-Reply-To: References: Message-ID: <2O1F6Lj8vy0qWs_qHqmXFPkwbuOqHx1NheZsroEYKbc=.bafb3eba-164b-4c13-8c27-346d44d43486@github.com> > This changeset adds BEFORE/AFTER graph dumps for creating a post loop (`insert_post_loop()`), removing an empty loop (`do_remove_empty_loop()`), and removing a one iteration loop (`do_one_iteration_loop()`). > > Changes: > - Added `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` for dumping graphs before and after `insert_post_loop()`. > - Added `BEFORE_REMOVE_EMPTY_LOOP` and `AFTER_REMOVE_EMPTY_LOOP` for dumping graphs before and after `do_remove_empty_loop()`. > - Added `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` for dumping graphs before and after `do_one_iteration_loop()`. > > Below are sample screenshots (IGV print level 4 ) mainly showing the new phase . > 1. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` > ![image](https://github.com/user-attachments/assets/1661cede-5d70-4e0d-abec-3d091c7675c8) > 2. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` with SuperWordLoopUnrollAnalysis enabled > ![image](https://github.com/user-attachments/assets/6a22e6f0-4e6c-4e9d-8b6b-2bf75fac783d) > 3.` BEFORE_REMOVE_EMPTY_LOOP `and `AFTER_REMOVE_EMPTY_LOOP` > ![image](https://github.com/user-attachments/assets/3281f00b-575e-4604-83dd-831037d8dd47) > 4. `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` > ![image](https://github.com/user-attachments/assets/efddbc9a-64f7-403d-acfe-330d75a00911) > > Question to reviewers: > Are the new compiler phases OK, or should we change anything? > > Testing: > GitHub Actions > tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. 
> Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: modifying one iteration loop to one-iteration ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25756/files - new: https://git.openjdk.org/jdk/pull/25756/files/37aab41d..4f531f1a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25756&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25756&range=03-04 Stats: 6 lines in 4 files changed: 0 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/25756.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25756/head:pull/25756 PR: https://git.openjdk.org/jdk/pull/25756 From snatarajan at openjdk.org Tue Jul 15 12:09:34 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Tue, 15 Jul 2025 12:09:34 GMT Subject: RFR: 8342941: IGV: Add various new graph dumps during loop opts [v4] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 09:08:45 GMT, Daniel Lund?n wrote: >> Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: >> >> addressing review comment > > test/hotspot/jtreg/compiler/lib/ir_framework/CompilePhase.java line 94: > >> 92: AFTER_REMOVE_EMPTY_LOOP( "After Remove Empty Loop"), >> 93: BEFORE_ONE_ITERATION_LOOP( "Before Replacing One Iteration Loop"), >> 94: AFTER_ONE_ITERATION_LOOP( "After Replacing One Iteration Loop"), > > Suggestion: > > BEFORE_ONE_ITERATION_LOOP( "Before Replacing One-Iteration Loop"), > AFTER_ONE_ITERATION_LOOP( "After Replacing One-Iteration Loop"), > > > I see there are also a few more occurrences that I think needs updating (`grep` for "one iteration loop") Sorry about this. I have made the changes. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25756#discussion_r2207302326 From aph at openjdk.org Tue Jul 15 12:11:44 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 15 Jul 2025 12:11:44 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: <8NL_uVAlbHrnK9t1Ec89Uk100mo0ADe-_ni9b7QXQss=.39f24034-b4d5-4b25-a2a7-d3930a50730f@github.com> References: <8NL_uVAlbHrnK9t1Ec89Uk100mo0ADe-_ni9b7QXQss=.39f24034-b4d5-4b25-a2a7-d3930a50730f@github.com> Message-ID: On Tue, 15 Jul 2025 11:21:29 GMT, Andrew Dinn wrote: > > It says "Do not use ExternalAddress to load 'byte_map_base'" but then uses ExternalAddress. I'm baffled... > > You need to read it as a FIXME ;-) Aha! > The current workaround that we have prototyped is to place the base 'address' in a field in the global AOTRuntimeConstants instance and load it via that field. That's not very attractive because we need to do an indirect load from an lea'd constant address (movz/movk/movk/ldr) at every occurrence of a GC barrier. We also need to relocate every mov sequence when we load the code from the archive. Eww. > What we would like longer term is to store the address in a method/stub's constants section and use a pc-relative load to access it. IMHO that doesn't help very much. You're still looking at ~4 cycles to load from L1 dcache, and you kill a dcache line for it, and you consume a full xword of code space for it. If we put the byte-map base on a 32-bit boundary we only need a single MOVZ, and that can be relocated easily enough when we load the code from the archive. Surely that's better. 
Or is even that small work too much? We would need to provide the constant entry with a relocation so we can reinit it to the VM's current base at AOT code load but we can use the same constant for every occurrence of the barrier. That also means we update less pages during reloc which means we should eventually be able to rely on mmaping AOT cocde cache pages with limited copy on write for reloced insns. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3073351728 From aph at openjdk.org Tue Jul 15 12:19:43 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 15 Jul 2025 12:19:43 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 11:28:17 GMT, Bhavana Kilambi wrote: >> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2861: >> >>> 2859: FloatRegister src2, FloatRegister index, >>> 2860: FloatRegister tmp, unsigned vector_length_in_bytes) { >>> 2861: assert_different_registers(dst, src1, src2, tmp); >> >> It seems `dst` can be the same with either `src1`, `src2`, or `tmp` from following implementation instruction, right? Maybe we should assert more accurate for different cases, such as `src2 == src1 + 1` when `vector_length_in_bytes == 16`? > > `dst, src1, src2` and `tmp` need to be different registers. Only `dst` and `index` can match depending on the type of the input. The reason why I didn't add `index` to the assertion. > for the `src2 == src1 + 1` case, this is being checked in the definition of the `tbl` instruction for SVE in `src/hotspot/cpu/aarch64/assembler_aarch64.hpp` (but I realized it's not for Neon). Do you think it's enough if I make the Neon `tbl` instruction definition compatible with the SVE one? Or it's better to add a separate assertion here as well? Just assert at the start of every function whatever that function needs. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2207321740 From aph at openjdk.org Tue Jul 15 12:19:45 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 15 Jul 2025 12:19:45 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 11:17:41 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. 
>> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments to half the number of match rules src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2912: > 2910: // If control reaches here, then the Neon instructions would be executed and > 2911: // one of these conditions must satisfy - > 2912: // UseSVE == 0 || (UseSVE == 1 && length_in_bytes == 16) Why? Can't you make this logic correct regardless of `UseSVE`? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2207319386 From duke at openjdk.org Tue Jul 15 12:23:39 2025 From: duke at openjdk.org (Samuel Chee) Date: Tue, 15 Jul 2025 12:23:39 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet In-Reply-To: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: On Thu, 26 Jun 2025 12:13:19 GMT, Samuel Chee wrote: > AtomicLong.CompareAndSet has the following assembly dump snippet which gets emitted from the intermediary LIRGenerator::atomic_cmpxchg: > > ;; cmpxchg { > 0x0000e708d144cf60: mov x8, x2 > 0x0000e708d144cf64: casal x8, x3, [x0] > 0x0000e708d144cf68: cmp x8, x2 > ;; 0x1F1F1F1F1F1F1F1F > 0x0000e708d144cf6c: mov x8, #0x1f1f1f1f1f1f1f1f > ;; } cmpxchg > 0x0000e708d144cf70: cset x8, ne // ne = any > 0x0000e708d144cf74: dmb ish > > > According to the Oracle Java Specification, AtomicLong.CompareAndSet [1] has the same memory effects as specified by VarHandle.compareAndSet which has the following effects: [2] > >> Atomically sets the value of a variable to the >> newValue with the memory semantics of setVolatile if >> the variable's current value, referred to as the witness >> value, == the expectedValue, as accessed with the memory >> semantics of getVolatile. > > > > Hence the release on the store due to setVolatile only occurs if the compare is successful. Since casal already satisfies these requirements, the dmb does not need to occur to ensure memory ordering in case the compare fails and a release does not happen. > > Hence we remove the dmb from both casl and casw (same logic applies to the non-long variant) > > This is also reflected by C2 not having a dmb for the same respective method. 
> > [1] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/util/concurrent/atomic/AtomicLong.html#compareAndSet(long,long) > [2] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/lang/invoke/VarHandle.html#compareAndSet(java.lang.Object...) Great! Glad you're convinced of its correctness. > we can remove trailing DMBs from most CASALs. Just do bear in mind that CASAL doesn't emit release semantics if the compare fails so I imagine there might be cases where a trailing dmb might still necessary. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26000#issuecomment-3073385919 From adinn at openjdk.org Tue Jul 15 12:32:43 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 15 Jul 2025 12:32:43 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Sun, 13 Jul 2025 08:40:45 GMT, Yadong Wang wrote: >> The bug is that the predicate rule of immByteMapBase would cause a ConP Node for oop incorrect matching with byte_map_base when the placeholder jni handle address was just allocated to the address of byte_map_base. >> >> C2 uses JNI handles as placeholders to encoding constant oops, and one of some handle maybe locate at the address of byte_map_base, which is not memory reserved by CardTable. It's possible because JNIHandleBlocks are allocated by malloc. >> >> // The assembler store_check code will do an unsigned shift of the oop, >> // then add it to _byte_map_base, i.e. >> // >> // _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift) >> _byte_map = (CardValue*) rs.base(); >> _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift); >> >> In aarch64 port, C2 will incorrectly match ConP for oop to ConP for byte_map_base by the immByteMapBase operand. >> >> // Card Table Byte Map Base >> operand immByteMapBase() >> %{ >> // Get base of card map >> predicate((jbyte*)n->get_ptr() == >> ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); >> match(ConP); >> >> op_cost(0); >> format %{ %} >> interface(CONST_INTER); >> %} >> >> // Load Byte Map Base Constant >> instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) >> %{ >> match(Set dst con); >> >> ins_cost(INSN_COST); >> format %{ "adr $dst, $con\t# Byte Map Base" %} >> >> ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); >> >> ins_pipe(ialu_imm); >> %} >> >> As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: >> 0xffff25caf08c: ldaxr x8, [x11] >> 0xffff25caf090: cmp x10, x8 >> 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any >> 0xffff25caf098: stlxr w8, x28, [x11] >> 0xffff25caf09c: cbnz w8, 0xffff25caf08c >> 0xffff25caf0a0: orr x11, xzr, #0x3 >> 0xffff25caf0a4: str x11, [x13] >> 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none >> 0xffff25caf0ac: str x14, [sp] >> 0xffff25caf0b0: add x2, sp, #0x20 >> 0xffff25caf0b4: adrp x1, 0xffff21730000 >> 0xffff25caf0b8: bl 0xffff256fffc0 >> 0xffff25caf0bc: ldr x14, [sp] >> 0xffff25caf0c0: b 0xffff25caef80 >> 0xffff25caf0c4: add x13, sp, #0x20 >> 0xffff25caf0c8: adrp x12, 0xffff21730000 >> 0xffff25caf0cc: ldr x10, [x13] >> 0xffff25caf0d0: cmp x10, xzr >> 0xffff25c... 
> > Yadong Wang has updated the pull request incrementally with one additional commit since the last revision: > > 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding Yes, let's start the ball rolling. ------------- Marked as reviewed by adinn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26249#pullrequestreview-3020078780 From adinn at openjdk.org Tue Jul 15 12:32:45 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 15 Jul 2025 12:32:45 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: <8NL_uVAlbHrnK9t1Ec89Uk100mo0ADe-_ni9b7QXQss=.39f24034-b4d5-4b25-a2a7-d3930a50730f@github.com> Message-ID: On Tue, 15 Jul 2025 12:08:41 GMT, Andrew Haley wrote: > IMHO that doesn't help very much. You're still looking at ~4 cycles to load from L1 dcache, and you kill a dcache line for it, and you consume a full xword of code space for it. If we put the byte-map base on a 32-bit boundary we only need a single MOVZ, and that can be relocated easily enough when we load the code from the archive. Surely that's better. Or is even that small work too much? Well, it helps a bit even though it suffers from the faults you identify. The downside of using even a single MOVZ is that every on requries a reloc during AOT code reloading. So, the number of relocs in any code blob that we handle during loading is no longer 1. Instead it equals the number of post-barriers in the blob. The bigger hit will come when we try to optimize code loading by mmapping AOT code blobs into the code cache -- at present we copy-relocate it from an mmapped region. Instead of just one constant to patch at the start of the blob we will have many places to patch scattered throughout the code blob. So, many more copy-on-write pages rather than vanilla mapped pages. That drags back in copy overheads that the mmap is intended to avoid and also means less opportunities for co-hosted JVMs to share pages. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3073410591 From bkilambi at openjdk.org Tue Jul 15 12:33:44 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 15 Jul 2025 12:33:44 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 12:16:42 GMT, Andrew Haley wrote: >> `dst, src1, src2` and `tmp` need to be different registers. Only `dst` and `index` can match depending on the type of the input. The reason why I didn't add `index` to the assertion. >> for the `src2 == src1 + 1` case, this is being checked in the definition of the `tbl` instruction for SVE in `src/hotspot/cpu/aarch64/assembler_aarch64.hpp` (but I realized it's not for Neon). Do you think it's enough if I make the Neon `tbl` instruction definition compatible with the SVE one? Or it's better to add a separate assertion here as well? > > Just assert at the start of every function whatever that function needs. I'll add the `src2 == src1 + 1` assertion in my next PS. Thanks! 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2207350457 From bkilambi at openjdk.org Tue Jul 15 12:33:45 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 15 Jul 2025 12:33:45 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 12:15:27 GMT, Andrew Haley wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Addressed review comments to half the number of match rules > > src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2912: > >> 2910: // If control reaches here, then the Neon instructions would be executed and >> 2911: // one of these conditions must satisfy - >> 2912: // UseSVE == 0 || (UseSVE == 1 && length_in_bytes == 16) > > Why? Can't you make this logic correct regardless of `UseSVE`? So the Neon implementation gets kicked in when SVE is not available (UseSVE == 0) whether the vector length is 8 or 16 but we emit Neon instructions for UseSVE ==1 and vector length == 16 only. I am not sure how I can eliminate `UseSVE` here. When the vector length == 8 with SVE1, we generate the SVE `tbl` instruction (with single input). This is done for `T_INT` and `T_FLOAT` types so that we avoid generating the `mulv`/`addv` instructions for the Neon `tbl` instruction. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2207349617 From dhanalla at openjdk.org Tue Jul 15 12:36:44 2025 From: dhanalla at openjdk.org (Dhamoder Nalla) Date: Tue, 15 Jul 2025 12:36:44 GMT Subject: RFR: 8341293: Split field loads through Nested Phis [v9] In-Reply-To: References: Message-ID: On Fri, 25 Apr 2025 15:04:06 GMT, Dhamoder Nalla wrote: >> Dhamoder Nalla has updated the pull request incrementally with one additional commit since the last revision: >> >> address CR comments > >> Looks like you deleted some of the tests you had there. Can you explain why? This was not by any chance the one that failed? [3c56f98](https://github.com/openjdk/jdk/commit/3c56f98d88433a4fada2c7e43147fc2e91df5e89) > > Thanks @eme64 for taking a look at it, > I haven't deleted any tests; the test cases list below remains unchanged. 
The assertion is addressed by the code change https://github.com/openjdk/jdk/commit/3c56f98d88433a4fada2c7e43147fc2e91df5e89#diff-03f7ae3cf79ff61be6e4f0590b7809a87825b073341fdbfcf36143b99c304474L4523 > > try { > Asserts.assertEQ(testRematerialize_SingleObj_Interp(cond1, x, y), testRematerialize_SingleObj_C2(cond1, x, y)); > } catch (Exception e) {} > Asserts.assertEQ(testRematerialize_TryCatch_Interp(cond1, l, x, y), testRematerialize_TryCatch_C2(cond1, l, x, y)); > Asserts.assertEQ(testMerge_TryCatchFinally_Interp(cond1, l, x, y), testMerge_TryCatchFinally_C2(cond1, l, x, y)); > Asserts.assertEQ(testRematerialize_MultiObj_Interp(cond1, cond2, x, y), testRematerialize_MultiObj_C2(cond1, cond2, x, y)); > Asserts.assertEQ(testGlobalEscapeInThread_Intrep(cond1, l, x, y), testGlobalEscapeInThread_C2(cond1, l, x, y)); > Asserts.assertEQ(testGlobalEscapeInThreadWithSync_Intrep(cond1, x, y), testGlobalEscapeInThreadWithSync_C2(cond1, x, y)); > Asserts.assertEQ(testFieldEscapeWithMerge_Intrep(cond1, x, y), testFieldEscapeWithMerge_C2(cond1, x, y)); > Asserts.assertEQ(testNestedPhi_FieldLoad_Interp(cond1, cond2, x, y), testNestedPhi_FieldLoad_C2(cond1, cond2, x, y)); > Asserts.assertEQ(testThreeLevelNestedPhi_Interp(cond1, cond2, x, y), testThreeLevelNestedPhi_C2(cond1, cond2, x, y)); > Asserts.assertEQ(testNestedPhiProcessOrder_Interp(cond1, cond2, x, y), testNestedPhiProcessOrder_C2(cond1, cond2, x, y)); > Asserts.assertEQ(testNestedPhi_TryCatch_Interp(cond1, cond2, x, y), testNestedPhi_TryCatch_C2(cond1, cond2, x, y)); > Asserts.assertEQ(testBailOut_Interp(cond1, cond2, x, y), testBailOut_C2(cond1, cond2, x, y)); > Asserts.assertEQ(testNestedPhiPolymorphic_Interp(cond1, cond2, x, y), testNestedPhiPolymorphic_C2(cond1, cond2, x, y)); > Asserts.assertEQ(testNestedPhiWithTrap_Interp(cond1, cond2, x, y), testNestedPhiWithTrap_C2(cond1, cond2, x, y)); > Asserts.assertEQ(testNestedPhiWithLambda_Interp(cond1, cond2, x, y), tes... > Thank you for persisting on this @dhanalla . I just did a quick look. I'll look again and run tests as soon as I get some time. Thank you @JohnTortugo for reviewing this PR. ------------- PR Comment: https://git.openjdk.org/jdk/pull/21270#issuecomment-3073421226 From dhanalla at openjdk.org Tue Jul 15 12:36:45 2025 From: dhanalla at openjdk.org (Dhamoder Nalla) Date: Tue, 15 Jul 2025 12:36:45 GMT Subject: RFR: 8341293: Split field loads through Nested Phis [v9] In-Reply-To: References: Message-ID: <3o_BE9OqAVt924PEQfnUY7rc4ukgHtDZneKHEbP0xdo=.46beab4b-455d-42fb-8f5c-0a12a0854a48@github.com> On Tue, 17 Jun 2025 19:13:04 GMT, Cesar Soares Lucas wrote: >> Dhamoder Nalla has updated the pull request incrementally with one additional commit since the last revision: >> >> address CR comments > > src/hotspot/share/opto/escape.cpp line 1310: > >> 1308: Node* use = ophi->fast_out(i); >> 1309: if (use->is_Phi()) { >> 1310: assert(use->_idx != ophi->_idx, "Unexpected selfloop Phi."); > > Should we bailout of the reduction process if we somehow end up in this situation? > I.e., in a debug build we'll assert, but in a product build you're just ignoring the problem. This assert is redundant; the self-loop nodes are already filtered out before reaching reduce_phi. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21270#discussion_r2207356962 From aph at openjdk.org Tue Jul 15 12:39:39 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 15 Jul 2025 12:39:39 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet In-Reply-To: References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: On Tue, 15 Jul 2025 12:20:27 GMT, Samuel Chee wrote: > Great! Glad you're convinced of its correctness. > > > we can remove trailing DMBs from most CASALs. > > Just do bear in mind that CASAL doesn't emit release semantics if the compare fails so I imagine there might be cases where a trailing dmb might still necessary. Yes, that's what the change log in the link I posted says too. We don't care about such assumptions in Java code, but in some C++ code which assumes that CAS implies a full barrier. I don't think anyone really knows every bit of C++ code in HotSpot which does assume this, so we tend to assume the worst. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26000#issuecomment-3073435357 From aph at openjdk.org Tue Jul 15 12:44:39 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 15 Jul 2025 12:44:39 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: <8NL_uVAlbHrnK9t1Ec89Uk100mo0ADe-_ni9b7QXQss=.39f24034-b4d5-4b25-a2a7-d3930a50730f@github.com> Message-ID: On Tue, 15 Jul 2025 12:29:09 GMT, Andrew Dinn wrote: > > IMHO that doesn't help very much. You're still looking at ~4 cycles to load from L1 dcache, and you kill a dcache line for it, and you consume a full xword of code space for it. If we put the byte-map base on a 32-bit boundary we only need a single MOVZ, and that can be relocated easily enough when we load the code from the archive. Surely that's better. Or is even that small work too much? > > Well, it helps a bit even though it suffers from the faults you identify. > > The downside of using even a single MOVZ is that every on requries a reloc during AOT code reloading. So, the number of relocs in any code blob that we handle during loading is no longer 1. Instead it equals the number of post-barriers in the blob. > > The bigger hit will come when we try to optimize code loading by mmapping AOT code blobs into the code cache -- at present we copy-relocate it from an mmapped region. Instead of just one constant to patch at the start of the blob we will have many places to patch scattered throughout the code blob. So, many more copy-on-write pages rather than vanilla mapped pages. That drags back in copy overheads that the mmap is intended to avoid and also means less opportunities for co-hosted JVMs to share pages. Is anyone trying to load AOT blobs at a fixed address? If C2 can be persuaded to treat the BMB as a value to be propagated like any other value then all of this conversation effectively becomes a don't care. Except for legacy architectures with insufficient registers, of course... 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3073451278 From shade at openjdk.org Tue Jul 15 12:45:39 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 15 Jul 2025 12:45:39 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 11:52:04 GMT, Manuel H?ssig wrote: > FWIW, tier1-tier3, and 100 repeats of `TestStressBailout.java` on Linux x64 & aarch64, Windows x64, and Mac aarch64 all passed. > > Let me know when I should kick off another round. Thank you, that is good to know! New version handles even more obscure corner case, that I doubt would show up easily :) My Linux x86_64 server fastdebug `make test TEST=all` run just completed without problems, so we can test this version more broadly as well. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26294#issuecomment-3073452977 From aph at openjdk.org Tue Jul 15 12:48:47 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 15 Jul 2025 12:48:47 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: <8kuDtuUPOl5CsWzgmgN9V0X5hXmGUExY4rpOfAfn1ic=.b7023a27-6362-4a7f-ba77-05cc0b50e5e3@github.com> On Tue, 15 Jul 2025 12:30:33 GMT, Bhavana Kilambi wrote: >> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2912: >> >>> 2910: // If control reaches here, then the Neon instructions would be executed and >>> 2911: // one of these conditions must satisfy - >>> 2912: // UseSVE == 0 || (UseSVE == 1 && length_in_bytes == 16) >> >> Why? Can't you make this logic correct regardless of `UseSVE`? > > So the Neon implementation gets kicked in when SVE is not available (UseSVE == 0) whether the vector length is 8 or 16 but we emit Neon instructions for UseSVE ==1 and vector length == 16 only. I am not sure how I can eliminate `UseSVE` here. > > When the vector length == 8 with SVE1, we generate the SVE `tbl` instruction (with single input). This is done for `T_INT` and `T_FLOAT` types so that we avoid generating the `mulv`/`addv` instructions for the Neon `tbl` instruction. But why would the Neon implementation fail if UseSVE ==1? Surely it would still work, and if it still works this comment is wrong. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2207381415 From dlunden at openjdk.org Tue Jul 15 12:59:41 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Tue, 15 Jul 2025 12:59:41 GMT Subject: RFR: 8342941: IGV: Add various new graph dumps during loop opts [v5] In-Reply-To: <2O1F6Lj8vy0qWs_qHqmXFPkwbuOqHx1NheZsroEYKbc=.bafb3eba-164b-4c13-8c27-346d44d43486@github.com> References: <2O1F6Lj8vy0qWs_qHqmXFPkwbuOqHx1NheZsroEYKbc=.bafb3eba-164b-4c13-8c27-346d44d43486@github.com> Message-ID: On Tue, 15 Jul 2025 12:09:31 GMT, Saranya Natarajan wrote: >> This changeset adds BEFORE/AFTER graph dumps for creating a post loop (`insert_post_loop()`), removing an empty loop (`do_remove_empty_loop()`), and removing a one iteration loop (`do_one_iteration_loop()`). >> >> Changes: >> - Added `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` for dumping graphs before and after `insert_post_loop()`. >> - Added `BEFORE_REMOVE_EMPTY_LOOP` and `AFTER_REMOVE_EMPTY_LOOP` for dumping graphs before and after `do_remove_empty_loop()`. >> - Added `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` for dumping graphs before and after `do_one_iteration_loop()`. 
>> >> Below are sample screenshots (IGV print level 4 ) mainly showing the new phase . >> 1. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` >> ![image](https://github.com/user-attachments/assets/1661cede-5d70-4e0d-abec-3d091c7675c8) >> 2. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` with SuperWordLoopUnrollAnalysis enabled >> ![image](https://github.com/user-attachments/assets/6a22e6f0-4e6c-4e9d-8b6b-2bf75fac783d) >> 3.` BEFORE_REMOVE_EMPTY_LOOP `and `AFTER_REMOVE_EMPTY_LOOP` >> ![image](https://github.com/user-attachments/assets/3281f00b-575e-4604-83dd-831037d8dd47) >> 4. `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` >> ![image](https://github.com/user-attachments/assets/efddbc9a-64f7-403d-acfe-330d75a00911) >> >> Question to reviewers: >> Are the new compiler phases OK, or should we change anything? >> >> Testing: >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. >> Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > modifying one iteration loop to one-iteration Looks great now, thanks! ------------- Marked as reviewed by dlunden (Committer). PR Review: https://git.openjdk.org/jdk/pull/25756#pullrequestreview-3020172635 From duke at openjdk.org Tue Jul 15 14:05:25 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Tue, 15 Jul 2025 14:05:25 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v10] In-Reply-To: References: Message-ID: > The patch adds possibility to use RVV instructions for faster vectorizedHashCode calculations on RVV v1.0.0 capable hardware. > > Testing: hotspot/jtreg/compiler/ under QEMU-8.1 with RVV v1.0.0. Yuri Gaevsky has updated the pull request incrementally with one additional commit since the last revision: - removed tail processing with RVV instructions as simple scalar loop provides in general better results ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17413/files - new: https://git.openjdk.org/jdk/pull/17413/files/6daaae6e..0c2fbee9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17413&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17413&range=08-09 Stats: 14 lines in 1 file changed: 0 ins; 14 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/17413.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17413/head:pull/17413 PR: https://git.openjdk.org/jdk/pull/17413 From duke at openjdk.org Tue Jul 15 14:05:25 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Tue, 15 Jul 2025 14:05:25 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v9] In-Reply-To: <7od62MdoD83EGfh9UTcLLE1pkkvaZclx2c9sIiLB58M=.066b94a9-e66e-475a-8434-2a6160c7642c@github.com> References: <7od62MdoD83EGfh9UTcLLE1pkkvaZclx2c9sIiLB58M=.066b94a9-e66e-475a-8434-2a6160c7642c@github.com> Message-ID: On Tue, 15 Jul 2025 08:11:28 GMT, Yuri Gaevsky wrote: >> The patch adds possibility to use RVV instructions for faster vectorizedHashCode calculations on RVV v1.0.0 capable hardware. >> >> Testing: hotspot/jtreg/compiler/ under QEMU-8.1 with RVV v1.0.0. 
> > Yuri Gaevsky has updated the pull request incrementally with one additional commit since the last revision: > > simplified arrays_hashcode_v() to be closer to VLA and use less general-purpose registers; minor cosmetic changes bpif3-16g% ( for i in "-XX:DisableIntrinsic=_vectorizedHashCode" "-XX:-UseRVV" "-XX:+UseRVV" ; \ do ( echo "--- ${i} ---" && ${JAVA_HOME}/bin/java -jar benchmarks.jar \ --jvmArgs="-XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions ${i}" \ org.openjdk.bench.java.lang.ArraysHashCode.ints \ -p size=1,5,10,20,30,40,50,60,70,80,90,100,200,300 \ -f 3 -r 1 -w 1 -wi 10 -i 10 2>&1 | tail -15 ) done ) --- -XX:DisableIntrinsic=_vectorizedHashCode --- Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.ints 1 avgt 30 11.297 ? 0.021 ns/op ArraysHashCode.ints 5 avgt 30 28.907 ? 0.117 ns/op ArraysHashCode.ints 10 avgt 30 41.196 ? 0.218 ns/op ArraysHashCode.ints 20 avgt 30 68.403 ? 0.118 ns/op ArraysHashCode.ints 30 avgt 30 88.732 ? 0.506 ns/op ArraysHashCode.ints 40 avgt 30 115.166 ? 0.103 ns/op ArraysHashCode.ints 50 avgt 30 136.047 ? 0.487 ns/op ArraysHashCode.ints 60 avgt 30 161.985 ? 0.193 ns/op ArraysHashCode.ints 70 avgt 30 170.613 ? 0.506 ns/op ArraysHashCode.ints 80 avgt 30 194.457 ? 0.547 ns/op ArraysHashCode.ints 90 avgt 30 207.872 ? 0.305 ns/op ArraysHashCode.ints 100 avgt 30 231.960 ? 0.338 ns/op ArraysHashCode.ints 200 avgt 30 448.387 ? 1.186 ns/op ArraysHashCode.ints 300 avgt 30 655.308 ? 0.146 ns/op --- -XX:-UseRVV --- Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.ints 1 avgt 30 11.295 ? 0.022 ns/op ArraysHashCode.ints 5 avgt 30 24.426 ? 0.005 ns/op ArraysHashCode.ints 10 avgt 30 35.734 ? 0.034 ns/op ArraysHashCode.ints 20 avgt 30 58.876 ? 0.015 ns/op ArraysHashCode.ints 30 avgt 30 82.964 ? 0.271 ns/op ArraysHashCode.ints 40 avgt 30 105.866 ? 0.027 ns/op ArraysHashCode.ints 50 avgt 30 129.875 ? 0.230 ns/op ArraysHashCode.ints 60 avgt 30 153.074 ? 0.331 ns/op ArraysHashCode.ints 70 avgt 30 176.633 ? 0.072 ns/op ArraysHashCode.ints 80 avgt 30 199.799 ? 0.049 ns/op ArraysHashCode.ints 90 avgt 30 223.666 ? 0.087 ns/op ArraysHashCode.ints 100 avgt 30 247.609 ? 0.447 ns/op ArraysHashCode.ints 200 avgt 30 481.884 ? 0.612 ns/op ArraysHashCode.ints 300 avgt 30 716.558 ? 0.197 ns/op --- -XX:+UseRVV --- Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.ints 1 avgt 30 11.284 ? 0.016 ns/op ArraysHashCode.ints 5 avgt 30 21.298 ? 0.009 ns/op ArraysHashCode.ints 10 avgt 30 33.820 ? 0.007 ns/op ArraysHashCode.ints 20 avgt 30 58.937 ? 0.061 ns/op ArraysHashCode.ints 30 avgt 30 84.086 ? 0.132 ns/op ArraysHashCode.ints 40 avgt 30 99.785 ? 1.721 ns/op ArraysHashCode.ints 50 avgt 30 125.043 ? 1.614 ns/op ArraysHashCode.ints 60 avgt 30 147.438 ? 0.266 ns/op ArraysHashCode.ints 70 avgt 30 120.624 ? 1.068 ns/op ArraysHashCode.ints 80 avgt 30 144.821 ? 1.065 ns/op ArraysHashCode.ints 90 avgt 30 171.626 ? 0.052 ns/op ArraysHashCode.ints 100 avgt 30 140.918 ? 0.031 ns/op ArraysHashCode.ints 200 avgt 30 223.500 ? 1.228 ns/op ArraysHashCode.ints 300 avgt 30 316.135 ? 
0.361 ns/op ------------- PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-3073732444 From thartmann at openjdk.org Tue Jul 15 14:52:48 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 15 Jul 2025 14:52:48 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v14] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 13:48:07 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. >> >> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. >> >> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Broken assertions fix With the latest version, I see this failure: java/lang/CompressExpandTest.java -Xcomp -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:+TieredCompilation -XX:+DeoptimizeALot 870 Phi === 820 227 872 [[ 339 339 120 282 334 224 338 338 119 119 337 337 336 336 340 120 335 335 432 275 283 224 340 334 ]] #int !orig=[98] !jvms: Assert::assertEquals @ bci:1 (line 797) Assert::assertEquals @ bci:3 (line 807) AbstractCompressExpandTest::assertContiguousMask @ bci:13 (line 356) AbstractCompressExpandTest::testContiguousMasksInt @ bci:48 (line 251) 917 LoadI === 916 218 219 [[ 336 336 331 119 119 339 339 432 275 337 337 338 338 224 224 340 340 120 120 342 330 118 117 116 115 341 343 114 113 112 111 344 121 110 333 332 335 335 282 439 109 329 328 334 334 222 327 122 254 369 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=7; #int (does not depend only on test, unknown control) !orig=[86] !jvms: AbstractCompressExpandTest::testContiguousMasksInt @ bci:39 (line 250) 282 ExpandBits === _ 917 870 [[ 120 120 340 340 224 224 338 338 736 338 339 339 380 ]] #int !jvms: CompressExpandTest::actualExpand @ bci:2 (line 40) AbstractCompressExpandTest::assertContiguousMask @ bci:17 (line 265) AbstractCompressExpandTest::testContiguousMasksInt @ bci:92 (line 256) told = bool tnew = int:1..2 # # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/workspace/open/src/hotspot/share/opto/phaseX.cpp:2731), pid=6038, tid=6057 # fatal error: Not monotonic # # JRE version: Java(TM) SE Runtime Environment (26.0) (fastdebug build 26-internal-2025-07-15-0649374.tobias.hartmann.jdk4) # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 26-internal-2025-07-15-0649374.tobias.hartmann.jdk4, compiled mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64) # Problematic frame: # V [libjvm.so+0x1828409] PhaseCCP::verify_type(Node*, Type const*, Type const*)+0x169 #26314 Also happens with a few other configurations/flags. 
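For context on what the nodes in the dump compute at the Java level, here is a small standalone snippet using the public `Integer.compress`/`Integer.expand` API that backs the `CompressBits`/`ExpandBits` nodes. It only illustrates the semantics and the `expand(compress(v, m), m) == (v & m)` identity that the contiguous-mask tests exercise; the mask and value below are arbitrary, and the "Not monotonic" assert itself concerns the C2 type lattice during CCP rather than these Java-level results.

```java
public class CompressExpandSemantics {
    public static void main(String[] args) {
        int mask = 0x0000_0FF0;               // a contiguous 8-bit mask starting at bit 4
        int v    = 0xCAFE_BABE;

        // compress gathers the bits of v selected by mask into the low-order bits:
        int c = Integer.compress(v, mask);    // == (v >> 4) & 0xFF == 0xAB here
        System.out.println(Integer.toHexString(c));

        // expand scatters the low-order bits of c back under mask, so
        // expand(compress(v, m), m) must always equal v & m:
        int e = Integer.expand(c, mask);
        System.out.println(e == (v & mask));  // true
    }
}
```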
------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3073939505 From thartmann at openjdk.org Tue Jul 15 15:05:44 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 15 Jul 2025 15:05:44 GMT Subject: RFR: 8360701: Add bailout when the register allocator interference graph grows unreasonably large [v2] In-Reply-To: References: Message-ID: <2vWKKv_ArH3Op0xU_h6aIUJSByiPh4AhKbGPfFSvkOk=.866186ee-265e-43f6-b2c8-4624b1538f44@github.com> On Mon, 14 Jul 2025 15:14:22 GMT, Daniel Lund?n wrote: >> The changeset for JDK-8325467 (https://git.openjdk.org/jdk/pull/20404) enables compilation of methods with many parameters, which C2 previously bailed out on. As a side effect, the tests `BigArityTest.java`, `TestCatchExceptionWithVarargs.java`, and `VarargsArrayTest.java` compile more methods than before, and additionally these methods are designed, for stress testing purposes, to have a large number of parameters (at or close to the maximum of 255 parameters allowed by the JVM spec). >> >> Compiling such methods takes a very long time and >99% of the time is spent in the C2 phase Coalesce 2 (part of register allocation). The problem is that the interference graph becomes huge after the initial round of spilling (just before Coalesce 2), and that we do not check for this and bail out if necessary. We do already bail out if the number of IR nodes grows too large, but the interference graph can become huge even if we have a small number of nodes. In fact, the interference graph may (in the worst case) hava a size that is quadratic in the number of nodes. In the problematic tests, we have interference graphs with approximately 100 000 nodes and over 55 000 000 (!) IFG edges. For comparison, the IFG edge count in worst-case realistic scenarios caps out at around 40 000 nodes and 800 000 edges. For example, see the scatter matrix below from running the DaCapo benchmark. It displays, for each time an IFG was built, the number of current IR nodes, the number of live ranges (the actual nodes in the IFG), and the number of IFG edges. >> >> ![dacapo](https://github.com/user-attachments/assets/7a070768-50da-42e4-b5ed-9958e1362673) >> >> ### Changeset >> >> - Add a new diagnostic flag `IFGEdgesLimit` and bail out whenever we reach the number of edges specified by the flag during IFG construction. The default is a very generous 10 000 000 edges, that still filters out the most degenerate compilations we have seen. >> - Add tracking of edges in `PhaseIFG` to permit the new flag. >> >> It is worth noting that it is perhaps preferable to use a lower default than 10 000 000 edges. For example, in standard benchmarks such as DaCapo (see the scatter matrix above), Renaissance, SPECjvm, and SPECjbb, we never go over 1 000 000 edges (I verified this). The reason I went with the generous 10 000 000 limit is that I saw a fair amount of bailouts in testing with the flag set at 1 000 000 edges. Such bailouts are likely motivated, but I do not want to take any chances. Even at 10 000 ... > > Daniel Lund?n has updated the pull request incrementally with one additional commit since the last revision: > > Update src/hotspot/share/opto/c2_globals.hpp > > Co-authored-by: Manuel H?ssig Thanks for the thorough analysis! The fix looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26118#pullrequestreview-3020843001 From fgao at openjdk.org Tue Jul 15 15:25:58 2025 From: fgao at openjdk.org (Fei Gao) Date: Tue, 15 Jul 2025 15:25:58 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 10:10:47 GMT, Xiaohong Gong wrote: >> Hi @Bhavana-Kilambi, @fg1417, could you please help take a look at this PR? BTW, since the vector register size of my SVE machine is 128-bit, could you please help test the correctness on a SVE machine with larger vector size (e.g. 512-bit vector size)? Thanks a lot in advance! > >> Hi @XiaohongGong , thank you for doing this. As for testing, we can currently only test on 256-bit SVE machines (we no longer have any 512bit machines). We will get back to you with the results soon. > > Testing on 256-bit SVE machines are fine to me. Thanks so much for your help! @XiaohongGong thanks for your work! Tier1 - tier3 passed on `256-bit sve` machine without new failures. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3074114305 From dlunden at openjdk.org Tue Jul 15 15:40:45 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Tue, 15 Jul 2025 15:40:45 GMT Subject: RFR: 8360701: Add bailout when the register allocator interference graph grows unreasonably large [v2] In-Reply-To: <11yDpTAB7uCCrx5givvBReRWXU4v_VMTQzhJKYMwXR4=.26076fc5-d6a3-4e00-be27-121fb04bce8b@github.com> References: <11yDpTAB7uCCrx5givvBReRWXU4v_VMTQzhJKYMwXR4=.26076fc5-d6a3-4e00-be27-121fb04bce8b@github.com> Message-ID: On Mon, 14 Jul 2025 15:41:05 GMT, Manuel H?ssig wrote: >> Daniel Lund?n has updated the pull request incrementally with one additional commit since the last revision: >> >> Update src/hotspot/share/opto/c2_globals.hpp >> >> Co-authored-by: Manuel H?ssig > > Thank you for elaborating. That makes sense. Thanks for the reviews @mhaessig and @TobiHartmann! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26118#issuecomment-3074177375 From dlunden at openjdk.org Tue Jul 15 15:40:46 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Tue, 15 Jul 2025 15:40:46 GMT Subject: Integrated: 8360701: Add bailout when the register allocator interference graph grows unreasonably large In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 18:13:13 GMT, Daniel Lund?n wrote: > The changeset for JDK-8325467 (https://git.openjdk.org/jdk/pull/20404) enables compilation of methods with many parameters, which C2 previously bailed out on. As a side effect, the tests `BigArityTest.java`, `TestCatchExceptionWithVarargs.java`, and `VarargsArrayTest.java` compile more methods than before, and additionally these methods are designed, for stress testing purposes, to have a large number of parameters (at or close to the maximum of 255 parameters allowed by the JVM spec). > > Compiling such methods takes a very long time and >99% of the time is spent in the C2 phase Coalesce 2 (part of register allocation). The problem is that the interference graph becomes huge after the initial round of spilling (just before Coalesce 2), and that we do not check for this and bail out if necessary. We do already bail out if the number of IR nodes grows too large, but the interference graph can become huge even if we have a small number of nodes. In fact, the interference graph may (in the worst case) hava a size that is quadratic in the number of nodes. 
In the problematic tests, we have interference graphs with approximately 100 000 nodes and over 55 000 000 (!) IFG edges. For comparison, the IFG edge count in worst-case realistic scenarios caps out at around 40 000 nodes and 800 000 edges. For example, see the scatter matrix below from running the DaCapo benchmark. It displays, for each time an IFG was built, the number of current IR nodes, the number of live ranges ( the actual nodes in the IFG), and the number of IFG edges. > > ![dacapo](https://github.com/user-attachments/assets/7a070768-50da-42e4-b5ed-9958e1362673) > > ### Changeset > > - Add a new diagnostic flag `IFGEdgesLimit` and bail out whenever we reach the number of edges specified by the flag during IFG construction. The default is a very generous 10 000 000 edges, that still filters out the most degenerate compilations we have seen. > - Add tracking of edges in `PhaseIFG` to permit the new flag. > > It is worth noting that it is perhaps preferable to use a lower default than 10 000 000 edges. For example, in standard benchmarks such as DaCapo (see the scatter matrix above), Renaissance, SPECjvm, and SPECjbb, we never go over 1 000 000 edges (I verified this). The reason I went with the generous 10 000 000 limit is that I saw a fair amount of bailouts in testing with the flag set at 1 000 000 edges. Such bailouts are likely motivated, but I do not want to take any chances. Even at 10 000 000 edges, a few tests s... This pull request has now been integrated. Changeset: 820263e4 Author: Daniel Lund?n URL: https://git.openjdk.org/jdk/commit/820263e48abf3ddce9506eb19872871aa3ea8b50 Stats: 38 lines in 4 files changed: 37 ins; 0 del; 1 mod 8360701: Add bailout when the register allocator interference graph grows unreasonably large Reviewed-by: mhaessig, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/26118 From mchevalier at openjdk.org Tue Jul 15 15:47:39 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 15 Jul 2025 15:47:39 GMT Subject: RFR: 8358573: CompileBroker fails with "expect failure reason" assert with -XX:-InstallMethods In-Reply-To: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> References: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> Message-ID: On Tue, 15 Jul 2025 09:21:53 GMT, Beno?t Maillard wrote: > This PR prevents from hitting an assert when disabling method installation at the end of a successful compilation with the `-XX:-InstallMethods` flag. Previously `CompileBroker` failed to mark the `CompileTask` as complete when using this flag. > > ## Analysis > > We can see that the assert is triggered in `CompileBroker::invoke_compiler_on_method`: > > ```c++ > if (!ci_env.failing() && !task->is_success()) { > assert(ci_env.failure_reason() != nullptr, "expect failure reason"); > assert(false, "compiler should always document failure: %s", ci_env.failure_reason()); > // The compiler elected, without comment, not to register a result. > // Do not attempt further compilations of this method. > ci_env.record_method_not_compilable("compile failed"); > } > > > The `task->is_success()` call accesses the private `_is_success` field. > This field is modified in `CompileTask::mark_success`. 
> > By setting a breakpoint there, and execute the program without `-XX:-InstallMethods`, we get the following stacktrace: > > CompileTask::mark_success compileTask.hpp:185 > nmethod::post_compiled_method nmethod.cpp:2212 > ciEnv::register_method ciEnv.cpp:1127 > Compilation::install_code c1_Compilation.cpp:425 > Compilation::compile_method c1_Compilation.cpp:488 > Compilation::Compilation c1_Compilation.cpp:609 > Compiler::compile_method c1_Compiler.cpp:262 > CompileBroker::invoke_compiler_on_method compileBroker.cpp:2324 > CompileBroker::compiler_thread_loop compileBroker.cpp:1968 > CompilerThread::thread_entry compilerThread.cpp:67 > JavaThread::thread_main_inner javaThread.cpp:773 > JavaThread::run javaThread.cpp:758 > Thread::call_run thread.cpp:243 > thread_native_entry os_linux.cpp:868 > > > We go up the stacktrace and see that in `Compilation::compile_method` we have: > > ```c++ > if (should_install_code()) { > // install code > PhaseTraceTime timeit(_t_codeinstall); > install_code(frame_size); > } > > > If we do not install methods after compilation, the code path that marks the success is never executed > and therefore results in hitting the assert. > > ### Fix > We simply mark the task as complete when `should_install_code()` evaluates to `false` in the block code above. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8358573) > - [ ] tier1-3, plus some internal testing > - [x] Added a test that starts the VM with the `-XX:-InstallMethods` flag > > Thank you for reviewing! Nice description! src/hotspot/share/c1/c1_Compilation.cpp line 490: > 488: install_code(frame_size); > 489: } else { > 490: // nothing else to do Not sure how helpful is this comment: if one should not install the code, we don't install the code, but it doesn't really tells me why we need to do something then. I wouldn't mind no comment at all, and I can git blame and find the JBS issue/PR to see the motivation, but most of the time, I wouldn't question this line in particular, so I'm fine not having distractions in the code giving very specific details. There are very specific details I'm not challenging absolutely everywhere... Otherwise, maybe writing something like "making sure the lack of installed code is not confused with a compilation bailout". But once again, I'm rather on the side of "I'll look at the blame and PR if I ever question it, but I'll probably never". The PR explaining well the situation, I'm fine with it! test/hotspot/jtreg/compiler/c1/TestDisableInstallMethods.java line 45: > 43: ProcessBuilder pb = ProcessTools.createLimitedTestJavaProcessBuilder("-XX:-InstallMethods", "-version"); > 44: OutputAnalyzer output = new OutputAnalyzer(pb.start()); > 45: output.shouldHaveExitValue(0); Is it needed to go through a subprocess (at least, this explicitly)? Would `@run driver/othervm -XX:-InstallMethods compiler.c1.TestDisableInstallMethods` and an empty main (or just enough to do something) do the job? It seems simpler to me (doesn't need any import for instance), but also would allow testing it in interaction with other flags, as tests works. Then, with IgnoreUnrecognizedVMOptions, can we get rid of the `@requires`? Even tho... it means we are then testing an empty program with no actual flags... Might not be very interesting. 
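To make that suggestion concrete, the simpler shape could look roughly like the sketch below. The jtreg tags are placeholders (in particular the `@requires` line, the tier-limiting flag, and whether `driver` or `main/othervm` is the right action depend on how the flag is defined), so treat it as a shape rather than the final test:

```java
/*
 * @test
 * @bug 8358573
 * @summary Compile something with -XX:-InstallMethods and make sure the VM stays healthy
 * @requires vm.debug == true & vm.compiler1.enabled
 * @run main/othervm -XX:TieredStopAtLevel=1 -XX:-InstallMethods compiler.c1.TestDisableInstallMethods
 */
package compiler.c1;

public class TestDisableInstallMethods {
    public static void main(String[] args) {
        // Enough work that some method actually gets C1-compiled (and then
        // deliberately not installed) while the flag is active.
        long sum = 0;
        for (int i = 0; i < 200_000; i++) {
            sum += work(i);
        }
        System.out.println(sum);
    }

    private static int work(int i) {
        return (i * 31) ^ (i >>> 3);
    }
}
```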
------------- PR Review: https://git.openjdk.org/jdk/pull/26310#pullrequestreview-3021010979 PR Review Comment: https://git.openjdk.org/jdk/pull/26310#discussion_r2207883258 PR Review Comment: https://git.openjdk.org/jdk/pull/26310#discussion_r2207867332 From fgao at openjdk.org Tue Jul 15 16:02:39 2025 From: fgao at openjdk.org (Fei Gao) Date: Tue, 15 Jul 2025 16:02:39 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: Message-ID: <1kurEbCQo0eiwZOGC93pjbxBCsZyqEaIurlzhV8v3wM=.70db96fc-db0d-40a4-a640-94c8130fdf2a@github.com> On Mon, 14 Jul 2025 10:10:47 GMT, Xiaohong Gong wrote: >> Hi @Bhavana-Kilambi, @fg1417, could you please help take a look at this PR? BTW, since the vector register size of my SVE machine is 128-bit, could you please help test the correctness on a SVE machine with larger vector size (e.g. 512-bit vector size)? Thanks a lot in advance! > >> Hi @XiaohongGong , thank you for doing this. As for testing, we can currently only test on 256-bit SVE machines (we no longer have any 512bit machines). We will get back to you with the results soon. > > Testing on 256-bit SVE machines are fine to me. Thanks so much for your help! @XiaohongGong Please correct me if I?m missing something or got anything wrong. Taking `short` on `512-bit` machine as an example, these instructions would be generated: // vgather sve_dup vtmp, 0 sve_load_0 => [0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a] sve_uzp1 with vtmp => [00 00 00 00 00 00 00 00 aa aa aa aa aa aa aa aa] // vgather1 sve_dup vtmp, 0 sve_load_1 => [0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b] sve_uzp1 with vtmp => [00 00 00 00 00 00 00 00 bb bb bb bb bb bb bb bb] // Slice vgather1, vgather1 ext => [bb bb bb bb bb bb bb bb 00 00 00 00 00 00 00 00] // Or vgather, vslice sve_orr => [bb bb bb bb bb bb bb bb aa aa aa aa aa aa aa aa] Actually, we can get the target result directly by `uzp1` the output from `sve_load_0` and `sve_load_1`, like [0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a] [0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b] uzp1 => [bb bb bb bb bb bb bb bb aa aa aa aa aa aa aa aa] If so, the current design of `LoadVectorGather` may not be sufficiently low-level to suit `AArch64`. WDYT? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3074255909 From jbhateja at openjdk.org Tue Jul 15 16:36:46 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 15 Jul 2025 16:36:46 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v14] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 14:49:53 GMT, Tobias Hartmann wrote: > With the latest version, I see this failure: > > ``` > java/lang/CompressExpandTest.java > -Xcomp -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:+TieredCompilation -XX:+DeoptimizeALot > > 870 Phi === 820 227 872 [[ 339 339 120 282 334 224 338 338 119 119 337 337 336 336 340 120 335 335 432 275 283 224 340 334 ]] #int !orig=[98] !jvms: Assert::assertEquals @ bci:1 (line 797) Assert::assertEquals @ bci:3 (line 807) AbstractCompressExpandTest::assertContiguousMask @ bci:13 (line 356) AbstractCompressExpandTest::testContiguousMasksInt @ bci:48 (line 251) > 917 LoadI === 916 218 219 [[ 336 336 331 119 119 339 339 432 275 337 337 338 338 224 224 340 340 120 120 342 330 118 117 116 115 341 343 114 113 112 111 344 121 110 333 332 335 335 282 439 109 329 328 334 334 222 327 122 254 369 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=7; #int (does not depend only on test, unknown control) !orig=[86] !jvms: AbstractCompressExpandTest::testContiguousMasksInt @ bci:39 (line 250) > 282 ExpandBits === _ 917 870 [[ 120 120 340 340 224 224 338 338 736 338 339 339 380 ]] #int !jvms: CompressExpandTest::actualExpand @ bci:2 (line 40) AbstractCompressExpandTest::assertContiguousMask @ bci:17 (line 265) AbstractCompressExpandTest::testContiguousMasksInt @ bci:92 (line 256) > told = bool > tnew = int:1..2 > # > # A fatal error has been detected by the Java Runtime Environment: > # > # Internal Error (/workspace/open/src/hotspot/share/opto/phaseX.cpp:2731), pid=6038, tid=6057 > # fatal error: Not monotonic > # > # JRE version: Java(TM) SE Runtime Environment (26.0) (fastdebug build 26-internal-2025-07-15-0649374.tobias.hartmann.jdk4) > # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 26-internal-2025-07-15-0649374.tobias.hartmann.jdk4, compiled mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64) > # Problematic frame: > # V [libjvm.so+0x1828409] PhaseCCP::verify_type(Node*, Type const*, Type const*)+0x169 > #26314 > ``` > > Also happens with a few other configurations/flags. Thanks @TobiHartmann , looking at this. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3074269216 From thartmann at openjdk.org Tue Jul 15 16:55:44 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 15 Jul 2025 16:55:44 GMT Subject: RFR: 8358573: CompileBroker fails with "expect failure reason" assert with -XX:-InstallMethods In-Reply-To: References: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> Message-ID: On Tue, 15 Jul 2025 15:36:30 GMT, Marc Chevalier wrote: >> This PR prevents from hitting an assert when disabling method installation at the end of a successful compilation with the `-XX:-InstallMethods` flag. Previously `CompileBroker` failed to mark the `CompileTask` as complete when using this flag. 
>> >> ## Analysis >> >> We can see that the assert is triggered in `CompileBroker::invoke_compiler_on_method`: >> >> ```c++ >> if (!ci_env.failing() && !task->is_success()) { >> assert(ci_env.failure_reason() != nullptr, "expect failure reason"); >> assert(false, "compiler should always document failure: %s", ci_env.failure_reason()); >> // The compiler elected, without comment, not to register a result. >> // Do not attempt further compilations of this method. >> ci_env.record_method_not_compilable("compile failed"); >> } >> >> >> The `task->is_success()` call accesses the private `_is_success` field. >> This field is modified in `CompileTask::mark_success`. >> >> By setting a breakpoint there, and execute the program without `-XX:-InstallMethods`, we get the following stacktrace: >> >> CompileTask::mark_success compileTask.hpp:185 >> nmethod::post_compiled_method nmethod.cpp:2212 >> ciEnv::register_method ciEnv.cpp:1127 >> Compilation::install_code c1_Compilation.cpp:425 >> Compilation::compile_method c1_Compilation.cpp:488 >> Compilation::Compilation c1_Compilation.cpp:609 >> Compiler::compile_method c1_Compiler.cpp:262 >> CompileBroker::invoke_compiler_on_method compileBroker.cpp:2324 >> CompileBroker::compiler_thread_loop compileBroker.cpp:1968 >> CompilerThread::thread_entry compilerThread.cpp:67 >> JavaThread::thread_main_inner javaThread.cpp:773 >> JavaThread::run javaThread.cpp:758 >> Thread::call_run thread.cpp:243 >> thread_native_entry os_linux.cpp:868 >> >> >> We go up the stacktrace and see that in `Compilation::compile_method` we have: >> >> ```c++ >> if (should_install_code()) { >> // install code >> PhaseTraceTime timeit(_t_codeinstall); >> install_code(frame_size); >> } >> >> >> If we do not install methods after compilation, the code path that marks the success is never executed >> and therefore results in hitting the assert. >> >> ### Fix >> We simply mark the task as complete when `should_install_code()` evaluates to `false` in the block code above. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8358573) >> - [ ] tier1-3, plus some internal testing >> - [x] Added a test that starts the VM with the `-XX:-InstallMethods` ... > > test/hotspot/jtreg/compiler/c1/TestDisableInstallMethods.java line 45: > >> 43: ProcessBuilder pb = ProcessTools.createLimitedTestJavaProcessBuilder("-XX:-InstallMethods", "-version"); >> 44: OutputAnalyzer output = new OutputAnalyzer(pb.start()); >> 45: output.shouldHaveExitValue(0); > > Is it needed to go through a subprocess (at least, this explicitly)? Would `@run driver/othervm -XX:-InstallMethods compiler.c1.TestDisableInstallMethods` and an empty main (or just enough to do something) do the job? > > It seems simpler to me (doesn't need any import for instance), but also would allow testing it in interaction with other flags, as tests works. Then, with IgnoreUnrecognizedVMOptions, can we get rid of the `@requires`? Even tho... it means we are then testing an empty program with no actual flags... Might not be very interesting. I agree, you could also just add a test case to `test/hotspot/jtreg/compiler/arguments/TestC1Globals.java`. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26310#discussion_r2208026912 From shade at openjdk.org Tue Jul 15 17:11:38 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 15 Jul 2025 17:11:38 GMT Subject: RFR: 8362250: ARM32: forward_exception_entry missing return address In-Reply-To: <4O9aorKuQ5wpIGNVsjHd8K8lIQR-uRDxEl7HsFuUyXk=.e9c48f43-f406-4540-a231-cae9bdfc0f11@github.com> References: <4O9aorKuQ5wpIGNVsjHd8K8lIQR-uRDxEl7HsFuUyXk=.e9c48f43-f406-4540-a231-cae9bdfc0f11@github.com> Message-ID: On Tue, 15 Jul 2025 10:23:23 GMT, Boris Ulasevich wrote: > The ARM32 ForwardExceptionNode codegen needs to set the exception address to R5. And, since the https://github.com/openjdk/jdk/pull/20437 change, the TailCall generator does not need this because the StubRoutines::forward_exception_entry function is not called there. Looks right, but I reckon you want to adjust `format %{ ...` in both match rules as well. ------------- PR Review: https://git.openjdk.org/jdk/pull/26312#pullrequestreview-3021349646 From chagedorn at openjdk.org Tue Jul 15 17:21:45 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 15 Jul 2025 17:21:45 GMT Subject: RFR: 8342941: IGV: Add various new graph dumps during loop opts [v5] In-Reply-To: <2O1F6Lj8vy0qWs_qHqmXFPkwbuOqHx1NheZsroEYKbc=.bafb3eba-164b-4c13-8c27-346d44d43486@github.com> References: <2O1F6Lj8vy0qWs_qHqmXFPkwbuOqHx1NheZsroEYKbc=.bafb3eba-164b-4c13-8c27-346d44d43486@github.com> Message-ID: On Tue, 15 Jul 2025 12:09:31 GMT, Saranya Natarajan wrote: >> This changeset adds BEFORE/AFTER graph dumps for creating a post loop (`insert_post_loop()`), removing an empty loop (`do_remove_empty_loop()`), and removing a one iteration loop (`do_one_iteration_loop()`). >> >> Changes: >> - Added `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` for dumping graphs before and after `insert_post_loop()`. >> - Added `BEFORE_REMOVE_EMPTY_LOOP` and `AFTER_REMOVE_EMPTY_LOOP` for dumping graphs before and after `do_remove_empty_loop()`. >> - Added `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` for dumping graphs before and after `do_one_iteration_loop()`. >> >> Below are sample screenshots (IGV print level 4 ) mainly showing the new phase . >> 1. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` >> ![image](https://github.com/user-attachments/assets/1661cede-5d70-4e0d-abec-3d091c7675c8) >> 2. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` with SuperWordLoopUnrollAnalysis enabled >> ![image](https://github.com/user-attachments/assets/6a22e6f0-4e6c-4e9d-8b6b-2bf75fac783d) >> 3.` BEFORE_REMOVE_EMPTY_LOOP `and `AFTER_REMOVE_EMPTY_LOOP` >> ![image](https://github.com/user-attachments/assets/3281f00b-575e-4604-83dd-831037d8dd47) >> 4. `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` >> ![image](https://github.com/user-attachments/assets/efddbc9a-64f7-403d-acfe-330d75a00911) >> >> Question to reviewers: >> Are the new compiler phases OK, or should we change anything? >> >> Testing: >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. >> Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > modifying one iteration loop to one-iteration Awesome, thanks for adding the duplicate loop backedge dumps! Looks good! 
------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25756#pullrequestreview-3021387423 From bulasevich at openjdk.org Tue Jul 15 17:42:27 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Tue, 15 Jul 2025 17:42:27 GMT Subject: RFR: 8362250: ARM32: forward_exception_entry missing return address [v2] In-Reply-To: <4O9aorKuQ5wpIGNVsjHd8K8lIQR-uRDxEl7HsFuUyXk=.e9c48f43-f406-4540-a231-cae9bdfc0f11@github.com> References: <4O9aorKuQ5wpIGNVsjHd8K8lIQR-uRDxEl7HsFuUyXk=.e9c48f43-f406-4540-a231-cae9bdfc0f11@github.com> Message-ID: > The ARM32 ForwardExceptionNode codegen needs to set the exception address to R5. And, since the https://github.com/openjdk/jdk/pull/20437 change, the TailCall generator does not need this because the StubRoutines::forward_exception_entry function is not called there. Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: adjust ad rules format ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26312/files - new: https://git.openjdk.org/jdk/pull/26312/files/023c2216..fd61008b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26312&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26312&range=00-01 Stats: 4 lines in 1 file changed: 1 ins; 1 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26312.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26312/head:pull/26312 PR: https://git.openjdk.org/jdk/pull/26312 From shade at openjdk.org Tue Jul 15 18:37:41 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 15 Jul 2025 18:37:41 GMT Subject: RFR: 8362250: ARM32: forward_exception_entry missing return address [v2] In-Reply-To: References: <4O9aorKuQ5wpIGNVsjHd8K8lIQR-uRDxEl7HsFuUyXk=.e9c48f43-f406-4540-a231-cae9bdfc0f11@github.com> Message-ID: <9QDgz1EU5v5rT4q4nuxnyyQvTklgi0NYsyT8NikCv30=.967d86fe-4a7d-49e4-a7b7-30fd31d4d937@github.com> On Tue, 15 Jul 2025 17:42:27 GMT, Boris Ulasevich wrote: >> The ARM32 ForwardExceptionNode codegen needs to set the exception address to R5. And, since the https://github.com/openjdk/jdk/pull/20437 change, the TailCall generator does not need this because the StubRoutines::forward_exception_entry function is not called there. > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > adjust ad rules format Marked as reviewed by shade (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26312#pullrequestreview-3021799623 From mli at openjdk.org Tue Jul 15 18:45:51 2025 From: mli at openjdk.org (Hamlin Li) Date: Tue, 15 Jul 2025 18:45:51 GMT Subject: RFR: 8362284: RISC-V: cleanup NativeMovRegMem Message-ID: <7QlEqzQUzoDK6NycLx0HECjospeft1MwbOQh7aHVq8U=.efa2d92e-a024-4447-9565-8f6ee7ee4774@github.com> Hi, Can you help to review this simple patch? NativeMovRegMem on riscv is actually dead code, but it is still needed for C1 to compile. So make the code as simple as possible to avoid any reading and maintenance effort. No tests, as `offset()` and `set_offset()` were Unimplemented, are used in C1, and have never been triggered before. Thanks!
------------- Commit messages: - initial commit Changes: https://git.openjdk.org/jdk/pull/26328/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26328&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362284 Stats: 40 lines in 2 files changed: 0 ins; 33 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/26328.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26328/head:pull/26328 PR: https://git.openjdk.org/jdk/pull/26328 From jkarthikeyan at openjdk.org Tue Jul 15 21:57:31 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Tue, 15 Jul 2025 21:57:31 GMT Subject: RFR: 8362171: C2 fails with unexpected node in SuperWord truncation: ModI Message-ID: Hi all, This is a small fix for an assert failure in SuperWord truncation with ModI nodes. The failure itself is harmless and shouldn't lead to any miscompilations in product mode. I've added `ModI` to the assert switch and adapted the test in the bug report. Let me know what you think! ------------- Commit messages: - Fix truncation assert with ModI nodes Changes: https://git.openjdk.org/jdk/pull/26334/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26334&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362171 Stats: 14 lines in 2 files changed: 13 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26334.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26334/head:pull/26334 PR: https://git.openjdk.org/jdk/pull/26334 From dlong at openjdk.org Tue Jul 15 22:23:39 2025 From: dlong at openjdk.org (Dean Long) Date: Tue, 15 Jul 2025 22:23:39 GMT Subject: RFR: 8358573: CompileBroker fails with "expect failure reason" assert with -XX:-InstallMethods In-Reply-To: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> References: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> Message-ID: On Tue, 15 Jul 2025 09:21:53 GMT, Beno?t Maillard wrote: > This PR prevents from hitting an assert when disabling method installation at the end of a successful compilation with the `-XX:-InstallMethods` flag. Previously `CompileBroker` failed to mark the `CompileTask` as complete when using this flag. > > ## Analysis > > We can see that the assert is triggered in `CompileBroker::invoke_compiler_on_method`: > > ```c++ > if (!ci_env.failing() && !task->is_success()) { > assert(ci_env.failure_reason() != nullptr, "expect failure reason"); > assert(false, "compiler should always document failure: %s", ci_env.failure_reason()); > // The compiler elected, without comment, not to register a result. > // Do not attempt further compilations of this method. > ci_env.record_method_not_compilable("compile failed"); > } > > > The `task->is_success()` call accesses the private `_is_success` field. > This field is modified in `CompileTask::mark_success`. 
> > By setting a breakpoint there, and execute the program without `-XX:-InstallMethods`, we get the following stacktrace: > > CompileTask::mark_success compileTask.hpp:185 > nmethod::post_compiled_method nmethod.cpp:2212 > ciEnv::register_method ciEnv.cpp:1127 > Compilation::install_code c1_Compilation.cpp:425 > Compilation::compile_method c1_Compilation.cpp:488 > Compilation::Compilation c1_Compilation.cpp:609 > Compiler::compile_method c1_Compiler.cpp:262 > CompileBroker::invoke_compiler_on_method compileBroker.cpp:2324 > CompileBroker::compiler_thread_loop compileBroker.cpp:1968 > CompilerThread::thread_entry compilerThread.cpp:67 > JavaThread::thread_main_inner javaThread.cpp:773 > JavaThread::run javaThread.cpp:758 > Thread::call_run thread.cpp:243 > thread_native_entry os_linux.cpp:868 > > > We go up the stacktrace and see that in `Compilation::compile_method` we have: > > ```c++ > if (should_install_code()) { > // install code > PhaseTraceTime timeit(_t_codeinstall); > install_code(frame_size); > } > > > If we do not install methods after compilation, the code path that marks the success is never executed > and therefore results in hitting the assert. > > ### Fix > We simply mark the task as complete when `should_install_code()` evaluates to `false` in the block code above. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8358573) > - [ ] tier1-3, plus some internal testing > - [x] Added a test that starts the VM with the `-XX:-InstallMethods` flag > > Thank you for reviewing! Is this flag really supported? I can't find any tests for it. I wonder if anyone would miss it if we removed it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26310#issuecomment-3075924647 From sviswanathan at openjdk.org Tue Jul 15 23:22:41 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 15 Jul 2025 23:22:41 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> References: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> Message-ID: <7ldMQV8jpynvbli7ioHwQCB-_y0LkwphDmxKqu5r9E0=.9cfc493b-64cd-48aa-82c6-bc3c4b680814@github.com> On Tue, 8 Jul 2025 22:44:55 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to enhance the existing x86 assembly stubs using PUSH and POP instructions with paired PUSHP/POPP instructions which are part of Intel APX technology. >> >> In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0?R31). Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs. >> >> Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures. 
> > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > rename to paired_push and paired_pop src/hotspot/cpu/x86/gc/z/zBarrierSetAssembler_x86.cpp line 112: > 110: } else { > 111: if (_result != rax) { > 112: __ paired_push(rax); No need to use paired_push on this else path as it is for non APX. src/hotspot/cpu/x86/gc/z/zBarrierSetAssembler_x86.cpp line 198: > 196: } > 197: } else { > 198: __ paired_pop(r11); No need to use paired pop on this else path as this is for non-APX. src/hotspot/cpu/x86/vm_version_x86.cpp line 156: > 154: // rcx and rdx are first and second argument registers on windows > 155: > 156: __ paired_push(rbp); We should not use paired push/pop in vm_version_x86.cpp. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2208843869 PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2208845470 PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2208815480 From sparasa at openjdk.org Tue Jul 15 23:56:33 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Tue, 15 Jul 2025 23:56:33 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v3] In-Reply-To: References: Message-ID: > The goal of this PR is to enhance the existing x86 assembly stubs using PUSH and POP instructions with paired PUSHP/POPP instructions which are part of Intel APX technology. > > In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0?R31). Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs. > > Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures. 
Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: remove pushp/popp from vm_version_x86 and also when APX is not being used ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25889/files - new: https://git.openjdk.org/jdk/pull/25889/files/24e6da2c..2cc7c4b1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25889&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25889&range=01-02 Stats: 42 lines in 2 files changed: 0 ins; 0 del; 42 mod Patch: https://git.openjdk.org/jdk/pull/25889.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25889/head:pull/25889 PR: https://git.openjdk.org/jdk/pull/25889 From sparasa at openjdk.org Tue Jul 15 23:56:34 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Tue, 15 Jul 2025 23:56:34 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: <7ldMQV8jpynvbli7ioHwQCB-_y0LkwphDmxKqu5r9E0=.9cfc493b-64cd-48aa-82c6-bc3c4b680814@github.com> References: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> <7ldMQV8jpynvbli7ioHwQCB-_y0LkwphDmxKqu5r9E0=.9cfc493b-64cd-48aa-82c6-bc3c4b680814@github.com> Message-ID: On Tue, 15 Jul 2025 22:41:17 GMT, Sandhya Viswanathan wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> rename to paired_push and paired_pop > > src/hotspot/cpu/x86/gc/z/zBarrierSetAssembler_x86.cpp line 198: > >> 196: } >> 197: } else { >> 198: __ paired_pop(r11); > > No need to use paired pop on this else path as this is for non-APX. Thanks for the catch! Removed the pushp/popp for the non-APX else path in the updated code. > src/hotspot/cpu/x86/vm_version_x86.cpp line 156: > >> 154: // rcx and rdx are first and second argument registers on windows >> 155: >> 156: __ paired_push(rbp); > > We should not use paired push/pop in vm_version_x86.cpp. Please see the updated code removing the pushp/popp from vm_version_x86.cpp. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2208915438 PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2208913711 From duke at openjdk.org Wed Jul 16 00:00:57 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Wed, 16 Jul 2025 00:00:57 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v7] In-Reply-To: <72OW9wHbET022fBnWx1Wdxb_J9pbH2sLiAqlC9fGb-c=.6930c0b1-33bb-4c49-af02-11e2c79dbaf2@github.com> References: <72OW9wHbET022fBnWx1Wdxb_J9pbH2sLiAqlC9fGb-c=.6930c0b1-33bb-4c49-af02-11e2c79dbaf2@github.com> Message-ID: On Thu, 8 May 2025 20:25:50 GMT, Chad Rakoczy wrote: > Okay. Speaking of which, seems like the NMethodState_lock is held for way too long - usually just held when setting the Method code and updating the nmethod state after the initial state is set. Keeping the lock across other things makes me worried of deadlocks. This is an old comment but looking at [JDK-8358821](https://bugs.openjdk.org/browse/JDK-8358821) made me wonder if `NMethodState_lock` actually should be held for the entirety of the relocation. Otherwise the nmethod could be marked not entrant after we perform the `is_in_use()` check. I don't think `CodeCache_lock` and `Compile_lock` are enough to prevent this. What do you think @dean-long @fisk ? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23573#issuecomment-3076155602 From duke at openjdk.org Wed Jul 16 00:03:50 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Wed, 16 Jul 2025 00:03:50 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v37] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 20:34:42 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This does not modify existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: > > Revert is_always_within_branch_range changes I also don't think the `CodeCache_lock` can be acquired in the `relocate` function. This should be the responsibility of the caller. There is nothing preventing `relocate` from blocking on `CodeCache_lock` while the nmethod it is relocating gets purged from the CodeCache. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23573#issuecomment-3076195035 From sparasa at openjdk.org Wed Jul 16 00:06:53 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Wed, 16 Jul 2025 00:06:53 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v4] In-Reply-To: References: Message-ID: <-V4hpHvXdaDjmEyYzHcEpDJ2bzPTqoz2Ao8FLobkmB8=.d9e3b962-ae8d-4e4b-8ddb-c3ab42a2a619@github.com> > The goal of this PR is to enhance the existing x86 assembly stubs using PUSH and POP instructions with paired PUSHP/POPP instructions which are part of Intel APX technology. > > In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0–R31). Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs. > > Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures. Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains four commits: - merge with master - remove pushp/popp from vm_version_x86 and also when APX is not being used - rename to paired_push and paired_pop - 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs ------------- Changes: https://git.openjdk.org/jdk/pull/25889/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=25889&range=03 Stats: 341 lines in 22 files changed: 19 ins; 0 del; 322 mod Patch: https://git.openjdk.org/jdk/pull/25889.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25889/head:pull/25889 PR: https://git.openjdk.org/jdk/pull/25889 From dhanalla at openjdk.org Wed Jul 16 01:45:47 2025 From: dhanalla at openjdk.org (Dhamoder Nalla) Date: Wed, 16 Jul 2025 01:45:47 GMT Subject: RFR: 8341293: Split field loads through Nested Phis [v10] In-Reply-To: References: Message-ID: > This enhances the changes introduced in [JDK PR 12897](https://github.com/openjdk/jdk/pull/12897) by handling nested Phi nodes (phi -> phi -> AddP -> Load*) during scalar replacement. The primary goal is to split field loads (AddP -> Load*) involving nested Phi parent nodes, thereby increasing opportunities for scalar replacement and reducing memory allocations. > > > **Here is an illustration of the sequence of Ideal Graph Transformations applied to split through nested `Phi` nodes.** > > **1. Initial State (Before Transformation)** > The graph contains a nested Phi structure where two Allocate nodes merge via a Phi node. > > ![image](https://github.com/user-attachments/assets/c18e5ca0-c554-475c-814a-7cb288d96569) > > **2. After Splitting Through Child Phi** > The transformation separates field loads by introducing additional AddP and Load nodes for each Allocate input. > > ![image](https://github.com/user-attachments/assets/b279b5f2-9ec6-4d9b-a627-506451f1cf81) > > **3. After Splitting Load Field Through Parent Phi** > The field load operation (Load) is pushed even further up in the graph. > > Instead of merging AddP pointers in a Phi node and then performing a Load, the transformation ensures that each path has its AddP -> Load sequence before merging. > > This further eliminates the need to perform field loads on a Phi node, making the graph more conducive to scalar replacement. > > ![image](https://github.com/user-attachments/assets/f506b918-2dd0-4dbe-a440-ff253afa3961) > > ### JMH Benchmark Results: > > #### With Disabled RAM > > | Benchmark | Mode | Count | Score | Error | Units | > |-----------|------|-------|-------|-------|-------| > | testBailOut_runner | avgt | 15 | 13.969 | ? 0.248 | ms/op | > | testFieldEscapeWithMerge_runner | avgt | 15 | 80.300 | ? 4.306 | ms/op | > | testMerge_TryCatchFinally_runner | avgt | 15 | 72.182 | ? 1.781 | ms/op | > | testMultiParentPhi_runner | avgt | 15 | 2.983 | ? 0.001 | ms/op | > | testNestedPhiPolymorphic_runner | avgt | 15 | 18.342 | ? 0.731 | ms/op | > | testNestedPhiProcessOrder_runner | avgt | 15 | 14.315 | ? 0.443 | ms/op | > | testNestedPhiWithLambda_runner | avgt | 15 | 18.511 | ? 1.212 | ms/op | > | testNestedPhiWithTrap_runner | avgt | 15 | 66.277 | ? 1.478 | ms/op | > | testNestedPhi_FieldLoad_runner | avgt | 15 | 17.968 | ? 0.306 | ms/op | > | testNestedPhi_TryCatch_runner | avgt | 15 | 14.186 | ? 0.247 | ms/op | > | testRematerialize_MultiObj_runner | avgt | 15 | 88.435 | ? 4.869 | ms/op | > | testRematerialize_SingleObj_runner | avgt | 15 | 29560.130 | ? 48.797 ... 
Dhamoder Nalla has updated the pull request incrementally with one additional commit since the last revision: address CR comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/21270/files - new: https://git.openjdk.org/jdk/pull/21270/files/7947053b..ec176f20 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=21270&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=21270&range=08-09 Stats: 5 lines in 2 files changed: 2 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/21270.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/21270/head:pull/21270 PR: https://git.openjdk.org/jdk/pull/21270 From dzhang at openjdk.org Wed Jul 16 02:04:41 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 16 Jul 2025 02:04:41 GMT Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> Message-ID: On Tue, 15 Jul 2025 07:21:44 GMT, Fei Yang wrote: >> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> Use switch-case in min_vector_size > > src/hotspot/cpu/riscv/riscv.ad line 1999: > >> 1997: break; >> 1998: case T_SHORT: >> 1999: // To support vector type conversions between short and wider types. > > The code comment doesn't seem to reflect the purpose of this change. Can you improve it adding more details? Of course. Done. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26239#discussion_r2209048762 From xgong at openjdk.org Wed Jul 16 03:46:50 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 16 Jul 2025 03:46:50 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: <8kuDtuUPOl5CsWzgmgN9V0X5hXmGUExY4rpOfAfn1ic=.b7023a27-6362-4a7f-ba77-05cc0b50e5e3@github.com> References: <8kuDtuUPOl5CsWzgmgN9V0X5hXmGUExY4rpOfAfn1ic=.b7023a27-6362-4a7f-ba77-05cc0b50e5e3@github.com> Message-ID: <-PJGHyiAZjJWaqqCJo4pNsrDrZOr8RkGp-r43y6HGfY=.ef3c4a1c-5557-400c-83f8-71ed2e19fe21@github.com> On Tue, 15 Jul 2025 12:45:51 GMT, Andrew Haley wrote: >> So the Neon implementation gets kicked in when SVE is not available (UseSVE == 0) whether the vector length is 8 or 16 but we emit Neon instructions for UseSVE ==1 and vector length == 16 only. I am not sure how I can eliminate `UseSVE` here. >> >> When the vector length == 8 with SVE1, we generate the SVE `tbl` instruction (with single input). This is done for `T_INT` and `T_FLOAT` types so that we avoid generating the `mulv`/`addv` instructions for the Neon `tbl` instruction. > > But why would the Neon implementation fail if UseSVE ==1? Surely it would still work, and if it still works this comment is wrong. Yeah, I think this still work when UseSVE >=1. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2209144612 From xgong at openjdk.org Wed Jul 16 03:50:48 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 16 Jul 2025 03:50:48 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 01:23:43 GMT, Xiaohong Gong wrote: >> ### Background >> On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. 
`long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. >> >> For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. >> >> To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. >> >> ### Impact Analysis >> #### 1. Vector types >> Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. >> >> #### 2. Vector API >> No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. >> >> #### 3. Auto-vectorization >> Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. >> >> #### 4. Codegen of vector nodes >> NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. >> >> Details: >> - Lanewise vector operations are unaffected as explained above. >> - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). >> - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, addin... > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Disable auto-vectorization of double to short conversion for NEON and update tests Ping again! Hi, may I have another approval? Thanks in advance! Hi @theRealAph , would you mind taking another looking at the latest commit please? Thanks a lot! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3076631092 From bchristi at openjdk.org Wed Jul 16 03:55:40 2025 From: bchristi at openjdk.org (Brent Christian) Date: Wed, 16 Jul 2025 03:55:40 GMT Subject: [jdk25] RFR: Merge 121f5a72e4c23919b3a3b474cc3f1ac29ec611af Message-ID: This brings in cpu25_07 changes. ------------- Commit messages: - 8360147: Better Glyph drawing redux - 8355884: [macos] java/awt/Frame/I18NTitle.java fails on MacOS - 8350991: Improve HTTP client header handling - 8349594: Enhance TLS protocol support - 8349584: Improve compiler processing - 8349111: Enhance Swing supports - 8348989: Better Glyph drawing - 8349551: Failures in tests after JDK-8345625 - 8345625: Better HTTP connections The merge commit only contains trivial merges, so no merge-specific webrevs have been generated. 
Changes: https://git.openjdk.org/jdk/pull/26340/files Stats: 358 lines in 21 files changed: 275 ins; 24 del; 59 mod Patch: https://git.openjdk.org/jdk/pull/26340.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26340/head:pull/26340 PR: https://git.openjdk.org/jdk/pull/26340 From jpai at openjdk.org Wed Jul 16 03:55:41 2025 From: jpai at openjdk.org (Jaikiran Pai) Date: Wed, 16 Jul 2025 03:55:41 GMT Subject: [jdk25] RFR: Merge 121f5a72e4c23919b3a3b474cc3f1ac29ec611af In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 03:40:43 GMT, Brent Christian wrote: > This brings in cpu25_07 changes. Marked as reviewed by jpai (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26340#pullrequestreview-3023001849 From bchristi at openjdk.org Wed Jul 16 04:01:42 2025 From: bchristi at openjdk.org (Brent Christian) Date: Wed, 16 Jul 2025 04:01:42 GMT Subject: [jdk25] RFR: Merge 121f5a72e4c23919b3a3b474cc3f1ac29ec611af [v2] In-Reply-To: References: Message-ID: > This brings in cpu25_07 changes. Brent Christian has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26340/files - new: https://git.openjdk.org/jdk/pull/26340/files/121f5a72..121f5a72 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26340&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26340&range=00-01 Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26340.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26340/head:pull/26340 PR: https://git.openjdk.org/jdk/pull/26340 From bchristi at openjdk.org Wed Jul 16 04:01:43 2025 From: bchristi at openjdk.org (Brent Christian) Date: Wed, 16 Jul 2025 04:01:43 GMT Subject: [jdk25] Integrated: Merge 121f5a72e4c23919b3a3b474cc3f1ac29ec611af In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 03:40:43 GMT, Brent Christian wrote: > This brings in cpu25_07 changes. This pull request has now been integrated. Changeset: 0e6bf005 Author: Brent Christian URL: https://git.openjdk.org/jdk/commit/0e6bf0055057fae844748a300551549553f59f03 Stats: 358 lines in 21 files changed: 275 ins; 24 del; 59 mod Merge Reviewed-by: jpai ------------- PR: https://git.openjdk.org/jdk/pull/26340 From fyang at openjdk.org Wed Jul 16 04:15:44 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 16 Jul 2025 04:15:44 GMT Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> Message-ID: On Tue, 15 Jul 2025 10:28:56 GMT, Dingli Zhang wrote: >> Follow up [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419), RVV supports all vector type conversion APIs in the Vector API. >> So we only need to relax the length limit of the short type to achieve a significant improvement in JMH performance for converting between short and wider types. 
>> >> ### Test >> qemu-system UseRVV: >> * [x] Run jdk_vector (fastdebug) >> * [x] Run compiler/vectorapi (fastdebug) >> >> ### Performance >> Following shows the performance improvement of relative VectorAPI JMHs on k1 (256-bit RVV): >> >> >> Benchmark (SIZE) Mode Units Before After Gain >> VectorFPtoIntCastOperations.microDouble128ToShort128 512 thrpt ops/ms 52.280 840.112 16.07 >> VectorFPtoIntCastOperations.microDouble128ToShort128 1024 thrpt ops/ms 28.156 429.322 15.25 >> VectorFPtoIntCastOperations.microFloat64ToShort64 512 thrpt ops/ms 14.242 479.509 33.67 >> VectorFPtoIntCastOperations.microFloat64ToShort64 1024 thrpt ops/ms 6.906 242.690 35.14 >> >> PS: `VectorFPtoIntCastOperations.microFloat64ToShort64` is added by [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419). > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Update comments Marked as reviewed by fyang (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26239#pullrequestreview-3023024830 From dzhang at openjdk.org Wed Jul 16 04:27:40 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 16 Jul 2025 04:27:40 GMT Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> Message-ID: On Tue, 15 Jul 2025 10:28:56 GMT, Dingli Zhang wrote: >> Follow up [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419), RVV supports all vector type conversion APIs in the Vector API. >> So we only need to relax the length limit of the short type to achieve a significant improvement in JMH performance for converting between short and wider types. >> >> ### Test >> qemu-system UseRVV: >> * [x] Run jdk_vector (fastdebug) >> * [x] Run compiler/vectorapi (fastdebug) >> >> ### Performance >> Following shows the performance improvement of relative VectorAPI JMHs on k1 (256-bit RVV): >> >> >> Benchmark (SIZE) Mode Units Before After Gain >> VectorFPtoIntCastOperations.microDouble128ToShort128 512 thrpt ops/ms 52.280 840.112 16.07 >> VectorFPtoIntCastOperations.microDouble128ToShort128 1024 thrpt ops/ms 28.156 429.322 15.25 >> VectorFPtoIntCastOperations.microFloat64ToShort64 512 thrpt ops/ms 14.242 479.509 33.67 >> VectorFPtoIntCastOperations.microFloat64ToShort64 1024 thrpt ops/ms 6.906 242.690 35.14 >> >> PS: `VectorFPtoIntCastOperations.microFloat64ToShort64` is added by [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419). > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Update comments Thanks all for the review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26239#issuecomment-3076679007 From duke at openjdk.org Wed Jul 16 04:27:40 2025 From: duke at openjdk.org (duke) Date: Wed, 16 Jul 2025 04:27:40 GMT Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> Message-ID: On Tue, 15 Jul 2025 10:28:56 GMT, Dingli Zhang wrote: >> Follow up [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419), RVV supports all vector type conversion APIs in the Vector API. >> So we only need to relax the length limit of the short type to achieve a significant improvement in JMH performance for converting between short and wider types. 
>> >> ### Test >> qemu-system UseRVV: >> * [x] Run jdk_vector (fastdebug) >> * [x] Run compiler/vectorapi (fastdebug) >> >> ### Performance >> Following shows the performance improvement of relative VectorAPI JMHs on k1 (256-bit RVV): >> >> >> Benchmark (SIZE) Mode Units Before After Gain >> VectorFPtoIntCastOperations.microDouble128ToShort128 512 thrpt ops/ms 52.280 840.112 16.07 >> VectorFPtoIntCastOperations.microDouble128ToShort128 1024 thrpt ops/ms 28.156 429.322 15.25 >> VectorFPtoIntCastOperations.microFloat64ToShort64 512 thrpt ops/ms 14.242 479.509 33.67 >> VectorFPtoIntCastOperations.microFloat64ToShort64 1024 thrpt ops/ms 6.906 242.690 35.14 >> >> PS: `VectorFPtoIntCastOperations.microFloat64ToShort64` is added by [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419). > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Update comments @DingliZhang Your change (at version 7120525bbbe39467fdf57eac5a2de7e8eb92d072) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26239#issuecomment-3076681413 From thartmann at openjdk.org Wed Jul 16 05:35:39 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 16 Jul 2025 05:35:39 GMT Subject: RFR: 8362171: C2 fails with unexpected node in SuperWord truncation: ModI In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 21:52:23 GMT, Jasmine Karthikeyan wrote: > Hi all, > This is a small fix for an assert failure in SuperWord truncation with ModI nodes. The failure itself is harmless and shouldn't lead to any miscompilations in product mode. I've added `ModI` to the assert switch and adapted the test in the bug report. Let me know what you think! Thanks for quickly jumping on this! Looks good to me. I submitted testing and will report back once it passed. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26334#pullrequestreview-3023168478 From chagedorn at openjdk.org Wed Jul 16 05:38:42 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 16 Jul 2025 05:38:42 GMT Subject: RFR: 8362171: C2 fails with unexpected node in SuperWord truncation: ModI In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 21:52:23 GMT, Jasmine Karthikeyan wrote: > Hi all, > This is a small fix for an assert failure in SuperWord truncation with ModI nodes. The failure itself is harmless and shouldn't lead to any miscompilations in product mode. I've added `ModI` to the assert switch and adapted the test in the bug report. Let me know what you think! Looks good to me, too. Thanks for prioritizing this to get it in quickly. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26334#pullrequestreview-3023177481 From dzhang at openjdk.org Wed Jul 16 05:38:46 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 16 Jul 2025 05:38:46 GMT Subject: Integrated: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors In-Reply-To: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> Message-ID: On Thu, 10 Jul 2025 09:17:20 GMT, Dingli Zhang wrote: > Follow up [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419), RVV supports all vector type conversion APIs in the Vector API. 
> So we only need to relax the length limit of the short type to achieve a significant improvement in JMH performance for converting between short and wider types. > > ### Test > qemu-system UseRVV: > * [x] Run jdk_vector (fastdebug) > * [x] Run compiler/vectorapi (fastdebug) > > ### Performance > Following shows the performance improvement of relative VectorAPI JMHs on k1 (256-bit RVV): > > > Benchmark (SIZE) Mode Units Before After Gain > VectorFPtoIntCastOperations.microDouble128ToShort128 512 thrpt ops/ms 52.280 840.112 16.07 > VectorFPtoIntCastOperations.microDouble128ToShort128 1024 thrpt ops/ms 28.156 429.322 15.25 > VectorFPtoIntCastOperations.microFloat64ToShort64 512 thrpt ops/ms 14.242 479.509 33.67 > VectorFPtoIntCastOperations.microFloat64ToShort64 1024 thrpt ops/ms 6.906 242.690 35.14 > > PS: `VectorFPtoIntCastOperations.microFloat64ToShort64` is added by [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419). This pull request has now been integrated. Changeset: bdd37b0e Author: Dingli Zhang Committer: SendaoYan URL: https://git.openjdk.org/jdk/commit/bdd37b0e5eaa984e2ad2e9010af37dcd612cc05e Stats: 39 lines in 2 files changed: 29 ins; 0 del; 10 mod 8361836: RISC-V: Relax min vector length to 32-bit for short vectors Reviewed-by: fyang, fjiang ------------- PR: https://git.openjdk.org/jdk/pull/26239 From chagedorn at openjdk.org Wed Jul 16 05:42:41 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 16 Jul 2025 05:42:41 GMT Subject: RFR: 8358641: C1 option -XX:+TimeEachLinearScan is broken [v2] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 06:38:34 GMT, Saranya Natarajan wrote: >> **Issue** >> Using the command` java -Xcomp -XX:TieredStopAtLevel=1 -XX:+TimeEachLinearScan` results in an assert failure in line `assert(_cached_blocks.length() == ir()->linear_scan_order()->length()) failed: invalid cached block list`. >> >> **Suggestion** >> Removal of flag as this is a very old issue >> >> **Fix** >> Removed the flag by removing relevant methods and code while ensuring the removal does not affect other flags. > > Saranya Natarajan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - addressing review comments > - merge master > Merge branch 'master' of https://github.com/sarannat/jdk into JDK-8358641 > - Initial Fix Marked as reviewed by chagedorn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/25933#pullrequestreview-3023190834 From chagedorn at openjdk.org Wed Jul 16 05:43:40 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 16 Jul 2025 05:43:40 GMT Subject: RFR: 8353276: C2: simplify PhaseMacroExpand::opt_bits_test [v2] In-Reply-To: References: Message-ID: <2ZSJgkwLKqF48-n6ut1nmMANmQzNriWDgyp0WsDHxjQ=.29bcb649-8272-423c-a160-27c14fb9c0ed@github.com> On Mon, 14 Jul 2025 18:50:23 GMT, Saranya Natarajan wrote: >> **Issue** >> The last three parameters of `PhaseMacroExpand::opt_bits_test(Node* ctrl, Node* region, int edge, Node* word, int mask, int bits, bool return_fast_path)` are unnecessary after the fix introduced in [JDK-8256425](https://bugs.openjdk.org/browse/JDK-8256425) >> >> **Fix** >> The proposed fix removes the last three parameters and makes the necessary modification to the methods. 
>> >> **Testing** >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > addressing review comments Marked as reviewed by chagedorn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26276#pullrequestreview-3023193333 From chagedorn at openjdk.org Wed Jul 16 05:43:41 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 16 Jul 2025 05:43:41 GMT Subject: RFR: 8353276: C2: simplify PhaseMacroExpand::opt_bits_test [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 18:46:30 GMT, Saranya Natarajan wrote: >> src/hotspot/share/opto/macro.cpp line 98: >> >>> 96: Node* PhaseMacroExpand::opt_bits_test(Node* ctrl, Node* region, int edge, Node* word) { >>> 97: Node* cmp; >>> 98: cmp = word; >> >> Could now be merged (I cannot make a direct suggestion due to deleted lines): >> >> Node* cmp = word; > > Thank you. I have addressed this in the new commit Thanks for the update! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26276#discussion_r2209284555 From xgong at openjdk.org Wed Jul 16 05:56:40 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 16 Jul 2025 05:56:40 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: Message-ID: <38f2bvFqiVNQGGpMif0iflVFD8wXnyw4SwtKxwi_Dmo=.276fb2fb-b80c-4ea7-a32f-c326294f442a@github.com> On Mon, 14 Jul 2025 10:10:47 GMT, Xiaohong Gong wrote: >> Hi @Bhavana-Kilambi, @fg1417, could you please help take a look at this PR? BTW, since the vector register size of my SVE machine is 128-bit, could you please help test the correctness on a SVE machine with larger vector size (e.g. 512-bit vector size)? Thanks a lot in advance! > >> Hi @XiaohongGong , thank you for doing this. As for testing, we can currently only test on 256-bit SVE machines (we no longer have any 512bit machines). We will get back to you with the results soon. > > Testing on 256-bit SVE machines are fine to me. Thanks so much for your help! > @XiaohongGong thanks for your work! Tier1 - tier3 passed on `256-bit sve` machine without new failures. Good! Thanks so much for your help! 
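For context, a minimal, self-contained sketch of the Java-level operation this PR intrinsifies -- a gather load of subword (`short`) elements through an index map. The species, array names and index map below are assumptions made purely for illustration; they are not taken from the patch or its tests.

```java
// Requires: --add-modules jdk.incubator.vector (and -ea if the assert is kept)
import jdk.incubator.vector.ShortVector;
import jdk.incubator.vector.VectorSpecies;

public class ShortGatherSketch {
    // Assumed species: on a 128-bit NEON/SVE machine this holds 8 short lanes.
    static final VectorSpecies<Short> S = ShortVector.SPECIES_128;

    // Gathers src[map[mapBase + i]] into lane i, then stores the lanes contiguously.
    static void gather(short[] src, int[] map, int mapBase, short[] dst, int dstOff) {
        // fromArray with an index map is the gather-load form of the Vector API;
        // this is the shape of call the subword gather-load support is aimed at.
        ShortVector v = ShortVector.fromArray(S, src, 0, map, mapBase);
        v.intoArray(dst, dstOff);
    }

    public static void main(String[] args) {
        short[] src = new short[64];
        for (int i = 0; i < src.length; i++) src[i] = (short) i;
        int[] map = {3, 1, 4, 1, 5, 9, 2, 6};   // one index per lane
        short[] dst = new short[S.length()];
        gather(src, map, 0, dst, 0);
        assert dst[0] == 3 && dst[5] == 9;
    }
}
```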
------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3076927184 From jbhateja at openjdk.org Wed Jul 16 06:33:47 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 16 Jul 2025 06:33:47 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v14] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 16:04:47 GMT, Jatin Bhateja wrote: > > With the latest version, I see this failure: > > ``` > > java/lang/CompressExpandTest.java > > -Xcomp -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:+TieredCompilation -XX:+DeoptimizeALot > > > > 870 Phi === 820 227 872 [[ 339 339 120 282 334 224 338 338 119 119 337 337 336 336 340 120 335 335 432 275 283 224 340 334 ]] #int !orig=[98] !jvms: Assert::assertEquals @ bci:1 (line 797) Assert::assertEquals @ bci:3 (line 807) AbstractCompressExpandTest::assertContiguousMask @ bci:13 (line 356) AbstractCompressExpandTest::testContiguousMasksInt @ bci:48 (line 251) > > 917 LoadI === 916 218 219 [[ 336 336 331 119 119 339 339 432 275 337 337 338 338 224 224 340 340 120 120 342 330 118 117 116 115 341 343 114 113 112 111 344 121 110 333 332 335 335 282 439 109 329 328 334 334 222 327 122 254 369 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=7; #int (does not depend only on test, unknown control) !orig=[86] !jvms: AbstractCompressExpandTest::testContiguousMasksInt @ bci:39 (line 250) > > 282 ExpandBits === _ 917 870 [[ 120 120 340 340 224 224 338 338 736 338 339 339 380 ]] #int !jvms: CompressExpandTest::actualExpand @ bci:2 (line 40) AbstractCompressExpandTest::assertContiguousMask @ bci:17 (line 265) AbstractCompressExpandTest::testContiguousMasksInt @ bci:92 (line 256) > > told = bool > > tnew = int:1..2 > > # > > # A fatal error has been detected by the Java Runtime Environment: > > # > > # Internal Error (/workspace/open/src/hotspot/share/opto/phaseX.cpp:2731), pid=6038, tid=6057 > > # fatal error: Not monotonic > > # > > # JRE version: Java(TM) SE Runtime Environment (26.0) (fastdebug build 26-internal-2025-07-15-0649374.tobias.hartmann.jdk4) > > # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 26-internal-2025-07-15-0649374.tobias.hartmann.jdk4, compiled mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64) > > # Problematic frame: > > # V [libjvm.so+0x1828409] PhaseCCP::verify_type(Node*, Type const*, Type const*)+0x169 > > #26314 > > ``` > > > > > > > > > > > > > > > > > > > > > > > > Also happens with a few other configurations/flags. > > Thanks @TobiHartmann , looking at this. Hi @TobiHartmann , I have been able to reproduce this issue with a smaller test, in my last commit, incorrectly constrained the lower bound of the result for unknown mask, due to the OCA limitation will create a new PR with an updated patch. public class expand_bits { public static long micro(long src, long mask) { long res = 0; mask = Math.max(0, Math.min(1, mask)); // mask = {lo:0, hi:1} for (int i = 0; i < 5; i++) { if (i == 4) { mask = 3; } else if (i == 2) { mask = 7; } // meet(3, 7) = (lo:3, hi:7) res += Long.expand(src, mask); } return res; } public static void main(String [] args) { long res = 0; for (int i = 0; i < 100000; i++) { res += micro(i, i + 10); } System.out.println("[res] " + res); } } With java -Xcomp -Xbatch -XX:-TieredCompilation -XX:CompileOnly=expand_bits::micro -XX:+TracePhaseCCP -cp . 
expand_bits 158 Phi === 152 120 107 [[ 151 157 161 ]] #long:3..7 !orig=[117] !jvms: expand_bits::micro @ bci:38 (line 11) 10 Parm === 3 [[ 157 121 78 67 45 56 151 ]] Parm0: long !jvms: expand_bits::micro @ bci:-1 (line 5) 157 ExpandBits === _ 10 158 [[ 156 ]] #long:3..7:www !orig=121 !jvms: expand_bits::micro @ bci:49 (line 16) told = long:0..7:www tnew = long:3..7:www # # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/mnt/c/GitHub/jdk/src/hotspot/share/opto/phaseX.cpp:1806), pid=97823, tid=97837 # fatal error: Not monotonic ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3077054087 From xgong at openjdk.org Wed Jul 16 06:46:45 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 16 Jul 2025 06:46:45 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: <38f2bvFqiVNQGGpMif0iflVFD8wXnyw4SwtKxwi_Dmo=.276fb2fb-b80c-4ea7-a32f-c326294f442a@github.com> References: <38f2bvFqiVNQGGpMif0iflVFD8wXnyw4SwtKxwi_Dmo=.276fb2fb-b80c-4ea7-a32f-c326294f442a@github.com> Message-ID: On Wed, 16 Jul 2025 05:54:13 GMT, Xiaohong Gong wrote: >>> Hi @XiaohongGong , thank you for doing this. As for testing, we can currently only test on 256-bit SVE machines (we no longer have any 512bit machines). We will get back to you with the results soon. >> >> Testing on 256-bit SVE machines are fine to me. Thanks so much for your help! > >> @XiaohongGong thanks for your work! Tier1 - tier3 passed on `256-bit sve` machine without new failures. > > Good! Thanks so much for your help! > @XiaohongGong Please correct me if I?m missing something or got anything wrong. > > Taking `short` on `512-bit` machine as an example, these instructions would be generated: > > ``` > // vgather > sve_dup vtmp, 0 > sve_load_0 => [0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a] > sve_uzp1 with vtmp => [00 00 00 00 00 00 00 00 aa aa aa aa aa aa aa aa] > > // vgather1 > sve_dup vtmp, 0 > sve_load_1 => [0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b] > sve_uzp1 with vtmp => [00 00 00 00 00 00 00 00 bb bb bb bb bb bb bb bb] > > // Slice vgather1, vgather1 > ext => [bb bb bb bb bb bb bb bb 00 00 00 00 00 00 00 00] > > // Or vgather, vslice > sve_orr => [bb bb bb bb bb bb bb bb aa aa aa aa aa aa aa aa] > ``` > > Actually, we can get the target result directly by `uzp1` the output from `sve_load_0` and `sve_load_1`, like > > ``` > [0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a] > [0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b] > uzp1 => > [bb bb bb bb bb bb bb bb aa aa aa aa aa aa aa aa] > ``` > > If so, the current design of `LoadVectorGather` may not be sufficiently low-level to suit `AArch64`. WDYT? Yes, you are right! This can work for truncating and merging two gather load results. But we have to consider other scenarios together: 1) No merging 2) Need 4 times of gather-loads and merging. Additionally, we have to make `LoadVectorGatherNode` common sense for all scenarios and different architectures. To make the IR itself simple and unify the inputs for all types on kinds of architectures, I choose to pass one `index` to it now, and define that one `LoadVectorGatherNode` just finish one time of gather-load with the `index`. The element type of the result should be the subword type. So a followed type truncating is needed anyway. I think this makes sense for a single gather-load operation for subword types, right? 
For cases that need more than one gather, I choose to generate multiple `LoadVectorGatherNode`s and do the merging at the end. And I agree this may make the code less efficient than implementing all the different scenarios with a single `LoadVectorGatherNode`. Writing backend assemblers for all scenarios can be more efficient, but it makes the backend implementation more complex. In addition to the four normal gather cases, we have to consider the corresponding masked versions and the partial cases. BTW, the number of `index` vectors passed to `LoadVectorGatherNode` would differ (e.g. 1, 2, 4), which makes the IR itself harder to maintain. Regarding the refinement based on your suggestion: - case-1: no merging - It's not an issue (the current version is fine) - case-2: 2 gathers and a merge - Can be refined, but `LoadVectorGatherNode` would have to be changed to accept 2 `index` vectors. - case-3: 4 gathers and a merge (only for byte) - Can be refined. We can implement it like this: step-1: `v1 = gather1 + gather2 + 2 * uzp1` // merge the first and second gather-loads step-2: `v2 = gather3 + gather4 + 2 * uzp1` // merge the third and fourth gather-loads step-3: `v3 = slice(v2, v2)`, `v = or(v1, v3)` // do the final merging We would have to change `LoadVectorGatherNode` as well, at least making it accept 2 `index` vectors. In summary, `LoadVectorGatherNode` will be more complex than before, but the good thing is that giving it one more `index` input is OK. I'm not sure whether this is applicable to other architectures such as RVV, but I can try this change. Do you have a better idea? Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3077111123 From thartmann at openjdk.org Wed Jul 16 06:47:49 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 16 Jul 2025 06:47:49 GMT Subject: RFR: 8353276: C2: simplify PhaseMacroExpand::opt_bits_test [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 18:50:23 GMT, Saranya Natarajan wrote: >> **Issue** >> The last three parameters of `PhaseMacroExpand::opt_bits_test(Node* ctrl, Node* region, int edge, Node* word, int mask, int bits, bool return_fast_path)` are unnecessary after the fix introduced in [JDK-8256425](https://bugs.openjdk.org/browse/JDK-8256425) >> >> **Fix** >> The proposed fix removes the last three parameters and makes the necessary modification to the methods. >> >> **Testing** >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > addressing review comments Looks good to me too. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26276#pullrequestreview-3023394267 From thartmann at openjdk.org Wed Jul 16 06:48:48 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 16 Jul 2025 06:48:48 GMT Subject: RFR: 8358641: C1 option -XX:+TimeEachLinearScan is broken [v2] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 06:38:34 GMT, Saranya Natarajan wrote: >> **Issue** >> Using the command` java -Xcomp -XX:TieredStopAtLevel=1 -XX:+TimeEachLinearScan` results in an assert failure in line `assert(_cached_blocks.length() == ir()->linear_scan_order()->length()) failed: invalid cached block list`.
>> >> **Suggestion** >> Removal of flag as this is a very old issue >> >> **Fix** >> Removed the flag by removing relevant methods and code while ensuring the removal does not affect other flags. > > Saranya Natarajan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - addressing review comments > - merge master > Merge branch 'master' of https://github.com/sarannat/jdk into JDK-8358641 > - Initial Fix Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25933#pullrequestreview-3023397314 From thartmann at openjdk.org Wed Jul 16 06:51:45 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 16 Jul 2025 06:51:45 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v14] In-Reply-To: References: Message-ID: <10jWxhtjQENzTBjlNDFKhHQMN-ioETq3P6_qmVTq3bo=.0124e215-5c09-44c3-8dcb-cd692789907a@github.com> On Mon, 14 Jul 2025 13:48:07 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. >> >> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. >> >> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Broken assertions fix Thanks @jatin-bhateja. Isn't the OCA signature status verification independent of the PR? Let me ping a few people here to get it done. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3077126992 From dlong at openjdk.org Wed Jul 16 07:01:50 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 16 Jul 2025 07:01:50 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v7] In-Reply-To: References: <72OW9wHbET022fBnWx1Wdxb_J9pbH2sLiAqlC9fGb-c=.6930c0b1-33bb-4c49-af02-11e2c79dbaf2@github.com> Message-ID: On Tue, 15 Jul 2025 23:57:41 GMT, Chad Rakoczy wrote: > Speaking of which, seems like the NMethodState_lock is held for way too long I can't find who posted this and what lines it refers to. If it refers to nmethod::relocate, I don't think the lock is needed after 8358821, because nobody will be patching the relocations. > Otherwise the nmethod could be marked not entrant after we perform the is_in_use() check The source nmethod? I don't see how that would cause a problem for that small block of code. All it does to the source is call make_not_used(). 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23573#issuecomment-3077174999 From thartmann at openjdk.org Wed Jul 16 07:10:39 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 16 Jul 2025 07:10:39 GMT Subject: RFR: 8358573: CompileBroker fails with "expect failure reason" assert with -XX:-InstallMethods In-Reply-To: References: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> Message-ID: On Tue, 15 Jul 2025 22:20:48 GMT, Dean Long wrote: > Is this flag really supported? I can't find any tests for it. I wonder if anyone would miss it if we removed it. Right, I think in general it might have some values for testing when we don't want to pollute the code cache (similar to what we do for `RepeatCompilation`) but then again it should also be done for C2. Let's just remove it for now. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26310#issuecomment-3077237333 From snatarajan at openjdk.org Wed Jul 16 07:42:43 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Wed, 16 Jul 2025 07:42:43 GMT Subject: RFR: 8353276: C2: simplify PhaseMacroExpand::opt_bits_test [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 18:50:23 GMT, Saranya Natarajan wrote: >> **Issue** >> The last three parameters of `PhaseMacroExpand::opt_bits_test(Node* ctrl, Node* region, int edge, Node* word, int mask, int bits, bool return_fast_path)` are unnecessary after the fix introduced in [JDK-8256425](https://bugs.openjdk.org/browse/JDK-8256425) >> >> **Fix** >> The proposed fix removes the last three parameters and makes the necessary modification to the methods. >> >> **Testing** >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > addressing review comments Thank you for the reviews. Please sponsor. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26276#issuecomment-3077367957 From duke at openjdk.org Wed Jul 16 07:42:43 2025 From: duke at openjdk.org (duke) Date: Wed, 16 Jul 2025 07:42:43 GMT Subject: RFR: 8353276: C2: simplify PhaseMacroExpand::opt_bits_test [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 18:50:23 GMT, Saranya Natarajan wrote: >> **Issue** >> The last three parameters of `PhaseMacroExpand::opt_bits_test(Node* ctrl, Node* region, int edge, Node* word, int mask, int bits, bool return_fast_path)` are unnecessary after the fix introduced in [JDK-8256425](https://bugs.openjdk.org/browse/JDK-8256425) >> >> **Fix** >> The proposed fix removes the last three parameters and makes the necessary modification to the methods. >> >> **Testing** >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > addressing review comments @sarannat Your change (at version 1b6be049d4455da3e9102cb13033f78ce1b35dd8) is now ready to be sponsored by a Committer. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26276#issuecomment-3077372857 From bkilambi at openjdk.org Wed Jul 16 07:43:41 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 16 Jul 2025 07:43:41 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v2] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 01:24:08 GMT, Xiaohong Gong wrote: >> src/hotspot/cpu/aarch64/aarch64_vector.ad line 5990: >> >>> 5988: %} >>> 5989: >>> 5990: instruct vmaskwiden_hi_sve(pReg dst, pReg src) %{ >> >> can both the hi and lo widen rules be combined into a single one as the arguments are the same? or would it make it less understandable? > > The main problem is that we cannot get the flag of `__is_lo` easily from the relative machnode as far as I know. Agreed. I remember I had the same problem with `requires_strict_order` field in ReductionNodes. Thanks. >> src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 352: >> >>> 350: // SVE requires vector indices for gather-load/scatter-store operations >>> 351: // on all data types. >>> 352: bool Matcher::gather_scatter_needs_vector_index(BasicType bt) { >> >> There's already a function that tests for `UseSVE > 0` here - https://github.com/openjdk/jdk/blob/bcd86d575fe0682a234228c18b0c2e817d3816da/src/hotspot/cpu/aarch64/matcher_aarch64.hpp#L36 >> >> Can it be reused? > > Do you mean directly using `supports_scalable_vector` instead of the new added method in mid-end? I'm afraid we cannot use it. Because on X86, the indexes for subword types are passed with address of the index array, while it's a vector for other types even on AVX-512. > > But yes, we can call `supports_scalable_vector()` in the new added method for AArch64. Got it, thanks! I missed the point that this was added in the mid-end. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2209554006 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2209551617 From snatarajan at openjdk.org Wed Jul 16 07:44:51 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Wed, 16 Jul 2025 07:44:51 GMT Subject: RFR: 8342941: IGV: Add various new graph dumps during loop opts [v5] In-Reply-To: <2O1F6Lj8vy0qWs_qHqmXFPkwbuOqHx1NheZsroEYKbc=.bafb3eba-164b-4c13-8c27-346d44d43486@github.com> References: <2O1F6Lj8vy0qWs_qHqmXFPkwbuOqHx1NheZsroEYKbc=.bafb3eba-164b-4c13-8c27-346d44d43486@github.com> Message-ID: On Tue, 15 Jul 2025 12:09:31 GMT, Saranya Natarajan wrote: >> This changeset adds BEFORE/AFTER graph dumps for creating a post loop (`insert_post_loop()`), removing an empty loop (`do_remove_empty_loop()`), and removing a one iteration loop (`do_one_iteration_loop()`). >> >> Changes: >> - Added `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` for dumping graphs before and after `insert_post_loop()`. >> - Added `BEFORE_REMOVE_EMPTY_LOOP` and `AFTER_REMOVE_EMPTY_LOOP` for dumping graphs before and after `do_remove_empty_loop()`. >> - Added `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` for dumping graphs before and after `do_one_iteration_loop()`. >> >> Below are sample screenshots (IGV print level 4 ) mainly showing the new phase . >> 1. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` >> ![image](https://github.com/user-attachments/assets/1661cede-5d70-4e0d-abec-3d091c7675c8) >> 2. 
`BEFORE_POST_LOOP` and `AFTER_POST_LOOP` with SuperWordLoopUnrollAnalysis enabled >> ![image](https://github.com/user-attachments/assets/6a22e6f0-4e6c-4e9d-8b6b-2bf75fac783d) >> 3.` BEFORE_REMOVE_EMPTY_LOOP `and `AFTER_REMOVE_EMPTY_LOOP` >> ![image](https://github.com/user-attachments/assets/3281f00b-575e-4604-83dd-831037d8dd47) >> 4. `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` >> ![image](https://github.com/user-attachments/assets/efddbc9a-64f7-403d-acfe-330d75a00911) >> >> Question to reviewers: >> Are the new compiler phases OK, or should we change anything? >> >> Testing: >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. >> Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > modifying one iteration loop to one-iteration Thank you for the reviews. Please sponsor. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25756#issuecomment-3077372639 From duke at openjdk.org Wed Jul 16 07:44:51 2025 From: duke at openjdk.org (duke) Date: Wed, 16 Jul 2025 07:44:51 GMT Subject: RFR: 8342941: IGV: Add various new graph dumps during loop opts [v5] In-Reply-To: <2O1F6Lj8vy0qWs_qHqmXFPkwbuOqHx1NheZsroEYKbc=.bafb3eba-164b-4c13-8c27-346d44d43486@github.com> References: <2O1F6Lj8vy0qWs_qHqmXFPkwbuOqHx1NheZsroEYKbc=.bafb3eba-164b-4c13-8c27-346d44d43486@github.com> Message-ID: On Tue, 15 Jul 2025 12:09:31 GMT, Saranya Natarajan wrote: >> This changeset adds BEFORE/AFTER graph dumps for creating a post loop (`insert_post_loop()`), removing an empty loop (`do_remove_empty_loop()`), and removing a one iteration loop (`do_one_iteration_loop()`). >> >> Changes: >> - Added `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` for dumping graphs before and after `insert_post_loop()`. >> - Added `BEFORE_REMOVE_EMPTY_LOOP` and `AFTER_REMOVE_EMPTY_LOOP` for dumping graphs before and after `do_remove_empty_loop()`. >> - Added `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` for dumping graphs before and after `do_one_iteration_loop()`. >> >> Below are sample screenshots (IGV print level 4 ) mainly showing the new phase . >> 1. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` >> ![image](https://github.com/user-attachments/assets/1661cede-5d70-4e0d-abec-3d091c7675c8) >> 2. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` with SuperWordLoopUnrollAnalysis enabled >> ![image](https://github.com/user-attachments/assets/6a22e6f0-4e6c-4e9d-8b6b-2bf75fac783d) >> 3.` BEFORE_REMOVE_EMPTY_LOOP `and `AFTER_REMOVE_EMPTY_LOOP` >> ![image](https://github.com/user-attachments/assets/3281f00b-575e-4604-83dd-831037d8dd47) >> 4. `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` >> ![image](https://github.com/user-attachments/assets/efddbc9a-64f7-403d-acfe-330d75a00911) >> >> Question to reviewers: >> Are the new compiler phases OK, or should we change anything? >> >> Testing: >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. 
>> Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > modifying one iteration loop to one-iteration @sarannat Your change (at version 4f531f1a4f6581cfaeac0ad4ffc852cd47885b74) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25756#issuecomment-3077376975 From snatarajan at openjdk.org Wed Jul 16 07:47:47 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Wed, 16 Jul 2025 07:47:47 GMT Subject: Integrated: 8342941: IGV: Add various new graph dumps during loop opts In-Reply-To: References: Message-ID: On Wed, 11 Jun 2025 13:54:20 GMT, Saranya Natarajan wrote: > This changeset adds BEFORE/AFTER graph dumps for creating a post loop (`insert_post_loop()`), removing an empty loop (`do_remove_empty_loop()`), and removing a one iteration loop (`do_one_iteration_loop()`). > > Changes: > - Added `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` for dumping graphs before and after `insert_post_loop()`. > - Added `BEFORE_REMOVE_EMPTY_LOOP` and `AFTER_REMOVE_EMPTY_LOOP` for dumping graphs before and after `do_remove_empty_loop()`. > - Added `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` for dumping graphs before and after `do_one_iteration_loop()`. > > Below are sample screenshots (IGV print level 4 ) mainly showing the new phase . > 1. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` > ![image](https://github.com/user-attachments/assets/1661cede-5d70-4e0d-abec-3d091c7675c8) > 2. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` with SuperWordLoopUnrollAnalysis enabled > ![image](https://github.com/user-attachments/assets/6a22e6f0-4e6c-4e9d-8b6b-2bf75fac783d) > 3.` BEFORE_REMOVE_EMPTY_LOOP `and `AFTER_REMOVE_EMPTY_LOOP` > ![image](https://github.com/user-attachments/assets/3281f00b-575e-4604-83dd-831037d8dd47) > 4. `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` > ![image](https://github.com/user-attachments/assets/efddbc9a-64f7-403d-acfe-330d75a00911) > > Question to reviewers: > Are the new compiler phases OK, or should we change anything? > > Testing: > GitHub Actions > tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. > Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) This pull request has now been integrated. 
Changeset: 805f1dee Author: Saranya Natarajan Committer: Daniel Lund?n URL: https://git.openjdk.org/jdk/commit/805f1deebcf465ba10672a829f0a8c3e11716f9d Stats: 33 lines in 7 files changed: 26 ins; 0 del; 7 mod 8342941: IGV: Add various new graph dumps during loop opts Reviewed-by: chagedorn, dlunden ------------- PR: https://git.openjdk.org/jdk/pull/25756 From snatarajan at openjdk.org Wed Jul 16 07:51:47 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Wed, 16 Jul 2025 07:51:47 GMT Subject: Integrated: 8353276: C2: simplify PhaseMacroExpand::opt_bits_test In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 21:53:35 GMT, Saranya Natarajan wrote: > **Issue** > The last three parameters of `PhaseMacroExpand::opt_bits_test(Node* ctrl, Node* region, int edge, Node* word, int mask, int bits, bool return_fast_path)` are unnecessary after the fix introduced in [JDK-8256425](https://bugs.openjdk.org/browse/JDK-8256425) > > **Fix** > The proposed fix removes the last three parameters and makes the necessary modification to the methods. > > **Testing** > GitHub Actions > tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. This pull request has now been integrated. Changeset: 9f7dc19f Author: Saranya Natarajan Committer: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/9f7dc19ffded4608dd2c1ef1e4eacfa0d0a199ea Stats: 16 lines in 2 files changed: 0 ins; 11 del; 5 mod 8353276: C2: simplify PhaseMacroExpand::opt_bits_test Reviewed-by: chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/26276 From snatarajan at openjdk.org Wed Jul 16 08:00:50 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Wed, 16 Jul 2025 08:00:50 GMT Subject: RFR: 8358641: C1 option -XX:+TimeEachLinearScan is broken [v2] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 06:38:34 GMT, Saranya Natarajan wrote: >> **Issue** >> Using the command` java -Xcomp -XX:TieredStopAtLevel=1 -XX:+TimeEachLinearScan` results in an assert failure in line `assert(_cached_blocks.length() == ir()->linear_scan_order()->length()) failed: invalid cached block list`. >> >> **Suggestion** >> Removal of flag as this is a very old issue >> >> **Fix** >> Removed the flag by removing relevant methods and code while ensuring the removal does not affect other flags. > > Saranya Natarajan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - addressing review comments > - merge master > Merge branch 'master' of https://github.com/sarannat/jdk into JDK-8358641 > - Initial Fix Thank you for the reviews. Please sponsor. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25933#issuecomment-3077427714 From snatarajan at openjdk.org Wed Jul 16 08:00:51 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Wed, 16 Jul 2025 08:00:51 GMT Subject: Integrated: 8358641: C1 option -XX:+TimeEachLinearScan is broken In-Reply-To: References: Message-ID: On Mon, 23 Jun 2025 09:43:28 GMT, Saranya Natarajan wrote: > **Issue** > Using the command` java -Xcomp -XX:TieredStopAtLevel=1 -XX:+TimeEachLinearScan` results in an assert failure in line `assert(_cached_blocks.length() == ir()->linear_scan_order()->length()) failed: invalid cached block list`. 
> > **Suggestion** > Removal of flag as this is a very old issue > > **Fix** > Removed the flag by removing relevant methods and code while ensuring the removal does not affect other flags. This pull request has now been integrated. Changeset: 6b4a5ef1 Author: Saranya Natarajan Committer: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/6b4a5ef105ee548627a53e2b983eab7972e33669 Stats: 50 lines in 3 files changed: 0 ins; 49 del; 1 mod 8358641: C1 option -XX:+TimeEachLinearScan is broken Reviewed-by: chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/25933 From bkilambi at openjdk.org Wed Jul 16 08:38:49 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 16 Jul 2025 08:38:49 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: <8kuDtuUPOl5CsWzgmgN9V0X5hXmGUExY4rpOfAfn1ic=.b7023a27-6362-4a7f-ba77-05cc0b50e5e3@github.com> References: <8kuDtuUPOl5CsWzgmgN9V0X5hXmGUExY4rpOfAfn1ic=.b7023a27-6362-4a7f-ba77-05cc0b50e5e3@github.com> Message-ID: <7Az-yP8D4rH5uwzgkQMaFcK-t0cD8wMMlzjKIRXDvis=.9695b1bc-c9e0-490f-9371-41840cfc0a42@github.com> On Tue, 15 Jul 2025 12:45:51 GMT, Andrew Haley wrote: >> So the Neon implementation gets kicked in when SVE is not available (UseSVE == 0) whether the vector length is 8 or 16 but we emit Neon instructions for UseSVE ==1 and vector length == 16 only. I am not sure how I can eliminate `UseSVE` here. >> >> When the vector length == 8 with SVE1, we generate the SVE `tbl` instruction (with single input). This is done for `T_INT` and `T_FLOAT` types so that we avoid generating the `mulv`/`addv` instructions for the Neon `tbl` instruction. > > But why would the Neon implementation fail if UseSVE ==1? Surely it would still work, and if it still works this comment is wrong. @theRealAph Thanks for your comments. The Neon implementation would not fail if UseSVE == 1 (Does the comment imply something like this?). Only that we are making a choice of generating Neon instructions for UseSVE = 1 and vec_len = 16. The conditions that can reach this method are - UseSVE = 0, 1 with vec_len = 8 or 16 and UseSVE = 2 with any vec_len (based on the conditions in `Matcher::match_rule_supported_vector()`). We have already filtered out `UseSVE = 1` with `vec_len = 8` and `UseSVE = 2` at line #2904. So if the control reaches #2915 then it's either `UseSVE = 0` with any vec_len and `UseSVE = 1` with `vec_len = 16` and that's what the comment mentions. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2209676084 From mhaessig at openjdk.org Wed Jul 16 08:44:42 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Wed, 16 Jul 2025 08:44:42 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 12:42:34 GMT, Aleksey Shipilev wrote: > [...] we can test this version more broadly as well. tier1 - tier3 and 100 repeats of TestStressBailout.java on Linux x64 & aarch64, Windows x64, and Mac x64 & aarch64 all passed. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26294#issuecomment-3077573250 From fyang at openjdk.org Wed Jul 16 09:26:42 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 16 Jul 2025 09:26:42 GMT Subject: RFR: 8362284: RISC-V: cleanup NativeMovRegMem In-Reply-To: <7QlEqzQUzoDK6NycLx0HECjospeft1MwbOQh7aHVq8U=.efa2d92e-a024-4447-9565-8f6ee7ee4774@github.com> References: <7QlEqzQUzoDK6NycLx0HECjospeft1MwbOQh7aHVq8U=.efa2d92e-a024-4447-9565-8f6ee7ee4774@github.com> Message-ID: On Tue, 15 Jul 2025 18:41:56 GMT, Hamlin Li wrote: > Hi, > Can you help to review this simple patch? > > NativeMovRegMem on riscv is actually dead code, but still needed in case of compilation of C1. > So make the code as simple as possible to avoid any reading and maintainance effort. > > No tests, as `offset()` and `set_offset()` were Unimplemented and used in C1 and never triggered before. > > Thanks! Thanks for the cleanup. Looks fine modulo one minor comment. src/hotspot/cpu/riscv/nativeInst_riscv.hpp line 232: > 230: inline NativeMovRegMem* nativeMovRegMem_at(address addr) { > 231: Unimplemented(); > 232: return (NativeMovRegMem*)0; Maybe: `return (NativeMovRegMem*)nullptr;` ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26328#pullrequestreview-3023955944 PR Review Comment: https://git.openjdk.org/jdk/pull/26328#discussion_r2209781232 From adinn at openjdk.org Wed Jul 16 09:32:42 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Wed, 16 Jul 2025 09:32:42 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: <8NL_uVAlbHrnK9t1Ec89Uk100mo0ADe-_ni9b7QXQss=.39f24034-b4d5-4b25-a2a7-d3930a50730f@github.com> Message-ID: On Tue, 15 Jul 2025 12:42:02 GMT, Andrew Haley wrote: > Is anyone trying to load AOT blobs at a fixed address? That's not really a good idea for security. However, even if we did we would still need AOT code to refer to external symbols that may be at different locations in the Assembly (AOT compile) VM and production VM. In particular the BMB varies depending on where the card table and heap base are located, both determined by runtime allocations. n.b. it is important to note that BMB is not the base of the table. It is a pre-offset variant on that base address, computed by adding (card table base - heap base) >> log(card table granularity). As you are no doubt aware, a pre-offset address constant avoids the need for the barrier to do anything but shift a heap pointer down and add it to BMB to compute the associated card address. That explains why ConP(BMB) is problematic. It is not -- as often stated -- because the value is not an address (offseting an address by a constant still gives an address). The problem is rather what value that address might have. Depending on where the two regions are placed it could actually be, say, ConP(0) or some other ConP that we want to treat specially (especially a ConP in the oop range). > If C2 can be persuaded to treat the BMB as a value to be propagated like any other value then all of this conversation effectively becomes a don't care. Except for legacy architectures with insufficient registers, of course... In principle that can already happen when the BMB is inserted into a C2 graph as a ConP -- although I have never seen it actually occurring. 
However, we have recently migrated barrier insertion to the back end for G1, Shenanadoah and ZGC, So, there is no opportunity for the compiler to perform constant elision in those three cases. The only GCs for which a ConP(BMB) is currently inserted into the graph are (generational) Serial/Parallel. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3077742345 From roland at openjdk.org Wed Jul 16 09:38:18 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 16 Jul 2025 09:38:18 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v36] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 10:35:21 GMT, Christian Hagedorn wrote: > I gave your latest patch another spin in our testing. It's still running but it already found some issues: Thanks! All failures should be fixed now. I added a test case for that one: > ``` > # > # A fatal error has been detected by the Java Runtime Environment: > # > # Internal Error (/opt/mach5/mesos/work_dir/slaves/d2398cde-9325-49c3-b030-8961a4f0a253-S650407/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/05605dc0-bf5e-434a-82b5-65af69c62ec6/runs/591d89b1-11c0-415e-b2ce-4c0a13ce80f8/workspace/open/src/hotspot/share/opto/vectorization.cpp:141), pid=704535, tid=704555 > # assert(_cl->is_multiversion_fast_loop() == (_multiversioning_fast_proj != nullptr)) failed: must find the multiversion selector IFF loop is a multiversion fast loop > > Current CompileTask: > C2:7789 1280 jdk.incubator.vector.ByteVector::ldLongOp (48 bytes) > > Stack: [0x00007f9ef7cfe000,0x00007f9ef7dfe000], sp=0x00007f9ef7df8560, free space=1001k > Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) > V [libjvm.so+0x1bcb7a4] VLoop::check_preconditions_helper() [clone .part.0]+0x824 (vectorization.cpp:141) > V [libjvm.so+0x1bcba31] VLoop::check_preconditions()+0x41 (vectorization.cpp:41) > V [libjvm.so+0x1573ea1] PhaseIdealLoop::auto_vectorize(IdealLoopTree*, VSharedData&)+0x241 (loopopts.cpp:4449) > V [libjvm.so+0x155274d] PhaseIdealLoop::build_and_optimize()+0xfdd (loopnode.cpp:5270) > [...] > ``` Would you mind re-running testing? ------------- PR Comment: https://git.openjdk.org/jdk/pull/21630#issuecomment-3077761301 From roland at openjdk.org Wed Jul 16 09:38:18 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 16 Jul 2025 09:38:18 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v37] In-Reply-To: References: Message-ID: > To optimize a long counted loop and long range checks in a long or int > counted loop, the loop is turned into a loop nest. When the loop has > few iterations, the overhead of having an outer loop whose backedge is > never taken, has a measurable cost. Furthermore, creating the loop > nest usually causes one iteration of the loop to be peeled so > predicates can be set up. If the loop is short running, then it's an > extra iteration that's run with range checks (compared to an int > counted loop with int range checks). > > This change doesn't create a loop nest when: > > 1- it can be determined statically at loop nest creation time that the > loop runs for a short enough number of iterations > > 2- profiling reports that the loop runs for no more than ShortLoopIter > iterations (1000 by default). > > For 2-, a guard is added which is implemented as yet another predicate. 
> > While this change is in principle simple, I ran into a few > implementation issues: > > - while c2 has a way to compute the number of iterations of an int > counted loop, it doesn't have that for long counted loop. The > existing logic for int counted loops promotes values to long to > avoid overflows. I reworked it so it now works for both long and int > counted loops. > > - I added a new deoptimization reason (Reason_short_running_loop) for > the new predicate. Given the number of iterations is narrowed down > by the predicate, the limit of the loop after transformation is a > cast node that's control dependent on the short running loop > predicate. Because once the counted loop is transformed, it is > likely that range check predicates will be inserted and they will > depend on the limit, the short running loop predicate has to be the > one that's further away from the loop entry. Now it is also possible > that the limit before transformation depends on a predicate > (TestShortRunningLongCountedLoopPredicatesClone is an example), we > can have: new predicates inserted after the transformation that > depend on the casted limit that itself depend on old predicates > added before the transformation. To solve this cicular dependency, > parse and assert predicates are cloned between the old predicates > and the loop head. The cloned short running loop parse predicate is > the one that's used to insert the short running loop predicate. > > - In the case of a long counted loop, the loop is transformed into a > regular loop with a new limit and transformed range checks that's > later turned into an in counted loop. The int ... Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: test failures ------------- Changes: - all: https://git.openjdk.org/jdk/pull/21630/files - new: https://git.openjdk.org/jdk/pull/21630/files/bb69cc02..9cae7ead Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=21630&range=36 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=21630&range=35-36 Stats: 72 lines in 4 files changed: 69 ins; 2 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/21630.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/21630/head:pull/21630 PR: https://git.openjdk.org/jdk/pull/21630 From aph at openjdk.org Wed Jul 16 10:30:42 2025 From: aph at openjdk.org (Andrew Haley) Date: Wed, 16 Jul 2025 10:30:42 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: <8NL_uVAlbHrnK9t1Ec89Uk100mo0ADe-_ni9b7QXQss=.39f24034-b4d5-4b25-a2a7-d3930a50730f@github.com> Message-ID: On Wed, 16 Jul 2025 09:29:49 GMT, Andrew Dinn wrote: > That explains why ConP(BMB) is problematic. It is not -- as often stated -- because the value is not an address (offseting an address by a constant still gives an address). The problem is rather what value that address might have. We-ell, hmm. I'm not sure I agree with that. On AArch64, addresses are either 48, 52, 0r 56 bits in length. Anything outside that is, literally, not an address. I guess one could call such things "unmappable addresses", but then we'd be arguing about definitions, which is usually sterile and pointless. > > If C2 can be persuaded to treat the BMB as a value to be propagated like any other value then all of this conversation effectively becomes a don't care. Except for legacy architectures with insufficient registers, of course... 
> > In principle that can already happen when the BMB is inserted into a C2 graph as a ConP -- although I have never seen it actually occurring. However, we have recently migrated barrier insertion to the back end for G1, Shenanadoah and ZGC, So, there is no opportunity for the compiler to perform constant elision in those three cases. That's fixable. We'd need the late barrier expansion to be passed the location of the card table base, hopefully in a register. Surely we have to do _something_ sensible here. Adding a load latency to every oop store would be a Bad Thing. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3077937485 From duke at openjdk.org Wed Jul 16 10:43:34 2025 From: duke at openjdk.org (Jatin Bhateja) Date: Wed, 16 Jul 2025 10:43:34 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v15] In-Reply-To: References: Message-ID: > Hi All, > > This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. > > Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. > > New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Refine lower bound computation ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23947/files - new: https://git.openjdk.org/jdk/pull/23947/files/06eafe77..4f33d4b4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=14 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=13-14 Stats: 9 lines in 1 file changed: 4 ins; 1 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/23947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23947/head:pull/23947 PR: https://git.openjdk.org/jdk/pull/23947 From jbhateja at openjdk.org Wed Jul 16 10:43:35 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 16 Jul 2025 10:43:35 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v14] In-Reply-To: <10jWxhtjQENzTBjlNDFKhHQMN-ioETq3P6_qmVTq3bo=.0124e215-5c09-44c3-8dcb-cd692789907a@github.com> References: <10jWxhtjQENzTBjlNDFKhHQMN-ioETq3P6_qmVTq3bo=.0124e215-5c09-44c3-8dcb-cd692789907a@github.com> Message-ID: <5bnRh3lO2vbvZACelyqV-MuCQ8_1SMFi4PyGIhqWT7Q=.02cdd97d-4d16-40b4-8fa9-fd8a55f850c6@github.com> On Wed, 16 Jul 2025 06:49:06 GMT, Tobias Hartmann wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Broken assertions fix > > Thanks @jatin-bhateja. Isn't the OCA signature status verification independent of the PR? Let me ping a few people here to get it done. Hi @TobiHartmann . I have pushed a change, kindly verify with latest version. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3077983957 From mli at openjdk.org Wed Jul 16 10:48:23 2025 From: mli at openjdk.org (Hamlin Li) Date: Wed, 16 Jul 2025 10:48:23 GMT Subject: RFR: 8362284: RISC-V: cleanup NativeMovRegMem [v2] In-Reply-To: <7QlEqzQUzoDK6NycLx0HECjospeft1MwbOQh7aHVq8U=.efa2d92e-a024-4447-9565-8f6ee7ee4774@github.com> References: <7QlEqzQUzoDK6NycLx0HECjospeft1MwbOQh7aHVq8U=.efa2d92e-a024-4447-9565-8f6ee7ee4774@github.com> Message-ID: > Hi, > Can you help to review this simple patch? > > NativeMovRegMem on riscv is actually dead code, but still needed in case of compilation of C1. > So make the code as simple as possible to avoid any reading and maintainance effort. > > No tests, as `offset()` and `set_offset()` were Unimplemented and used in C1 and never triggered before. > > Thanks! Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: use nullptr ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26328/files - new: https://git.openjdk.org/jdk/pull/26328/files/48c55caa..de5fa377 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26328&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26328&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26328.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26328/head:pull/26328 PR: https://git.openjdk.org/jdk/pull/26328 From mli at openjdk.org Wed Jul 16 10:48:23 2025 From: mli at openjdk.org (Hamlin Li) Date: Wed, 16 Jul 2025 10:48:23 GMT Subject: RFR: 8362284: RISC-V: cleanup NativeMovRegMem [v2] In-Reply-To: References: <7QlEqzQUzoDK6NycLx0HECjospeft1MwbOQh7aHVq8U=.efa2d92e-a024-4447-9565-8f6ee7ee4774@github.com> Message-ID: On Wed, 16 Jul 2025 09:24:20 GMT, Fei Yang wrote: > Thanks for the cleanup. Looks fine modulo one minor comment. Thank you! > src/hotspot/cpu/riscv/nativeInst_riscv.hpp line 232: > >> 230: inline NativeMovRegMem* nativeMovRegMem_at(address addr) { >> 231: Unimplemented(); >> 232: return (NativeMovRegMem*)0; > > Maybe: `return (NativeMovRegMem*)nullptr;` fixed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26328#issuecomment-3077996498 PR Review Comment: https://git.openjdk.org/jdk/pull/26328#discussion_r2209970858 From jbhateja at openjdk.org Wed Jul 16 10:53:49 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 16 Jul 2025 10:53:49 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v15] In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 10:43:34 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. >> >> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. >> >> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. >> >> Kindly review and share your feedback. 
>> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Refine lower bound computation

Quick note on C2 Integral types
- Integral types now encapsulate 3 lattice structures per TypeInt/Long, namely signed, unsigned and knownbits.
- All three lattice values are in sync post-canonicalization.
- A lattice is a partial order relation, i.e., reflexive, transitive but anti-symmetric.
- An integral lattice contains two special values: a TOP (no value, no assumption can be drawn by the compiler) and BOTTOM (all possible values in the value range).
- Verification ensures that the lattice is symmetrical around the centerline, i.e., a semi-lattice.
- For a symmetrical lattice, only one operation, i.e., meet/join, is sufficient for value resolution; other operations can be computed by taking the dual of the first one using De Morgan's law: _join = dual(meet(dual(type1), dual(type2)))_
- In theory, meet b/w two lattice points takes us to the greatest lower bound in the Hasse diagram, while join b/w two lattice points takes us to the lowest upper bound. Also, TOP represents the entire value range of the lattice, while BOTTOM represents no value, but C2 follows an inverted lattice convention.

Inverted integral lattice Hasse diagram

          TOP (no value)
         /    |    |    \
      MIN     ...       MAX
         \    |    |    /
      BOTTOM (all possible values)

- Thus, a MEET of two lattice points takes us to the greatest upper bound of the lattice structure; in this case, it's the union of two lattice points, i.e., we pick the minimum of the lower bounds and the max of the upper bounds of the participating lattice points. JOIN takes us to the lowest upper-bound lattice points of the inverted lattice structure. In this case, it will be an intersection of lattice points, which constrains the value range, i.e., we pick the max of the lower bounds and the minimum of the upper bounds of the two participating integral lattice points.
- e.g., if TypeInt t1 = {lo:10, hi:100} and TypeInt t2 = {lo:1, hi:20}, then

  t1.meet(t2) = lowest upper bound
              = { lo = min(t1.lo, t2.lo), hi = max(t1.hi, t2.hi) }
              = { lo = min(10, 1), hi = max(100, 20) }
              = { lo = 1, hi = 100 }

  t1.join(t2) = dual(meet(dual(t1), dual(t2))), where dual swaps {lo : hi} => {hi : lo}
              = dual(meet(dual{lo:10, hi:100}, dual{lo:1, hi:20}))
              = dual(meet({lo:100, hi:10}, {lo:20, hi:1}))
              = dual({lo = min(100, 20), hi = max(10, 1)})
              = dual({lo:20, hi:10})
              = {lo:10, hi:20}

Additional identities:
- TOP meet VAL = VAL, since we cannot move to any other greatest lower bound when one of the inputs is TOP (unknown value); to move to the greatest lower bound both the inputs must be known values.
- BOTTOM meet VAL = BOTTOM

Now, some quick notes on CCP
- Optimistic data flow analysis using an RPOT walk on the ideal graph.
- Each lattice begins with a TOP value, and analysis progressively adds elements to the lattice. Analysis expects to expand the value range with each data flow iteration, thereby monotonically increasing the lattice set. After each value transformation, type verification checks that the new value is greater than the old value in the lattice; in other words, the new value should dominate the old value in the Hasse diagram of the lattice. Thus, tnew->meet(told) gives us the lowest upper bound of two lattice points, i.e., tnew should be a superset of told.
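A tiny stand-alone sketch of the meet/join/dual identities and the containment check described above (illustrative only; this is not C2's TypeInt/TypeLong code, and `Range`, `meet`, `dual`, `join` and `monotonic` are made-up names):

```c++
// Illustrative only -- interval meet/join with the dual trick from the notes above.
#include <algorithm>
#include <cstdio>

struct Range { long lo, hi; };

Range meet(Range a, Range b) {                 // union-like: widest bounds
  return { std::min(a.lo, b.lo), std::max(a.hi, b.hi) };
}

Range dual(Range a) { return { a.hi, a.lo }; } // swap the bounds

Range join(Range a, Range b) {                 // dual(meet(dual(a), dual(b)))
  return dual(meet(dual(a), dual(b)));         // == { max(a.lo,b.lo), min(a.hi,b.hi) }
}

// "tnew should be a superset of told": told is contained in tnew
// exactly when meet(tnew, told) == tnew.
bool monotonic(Range tnew, Range told) {
  Range m = meet(tnew, told);
  return m.lo == tnew.lo && m.hi == tnew.hi;
}

int main() {
  Range t1{10, 100}, t2{1, 20};
  Range m = meet(t1, t2), j = join(t1, t2);
  std::printf("meet = {%ld, %ld}, join = {%ld, %ld}, monotonic = %d\n",
              m.lo, m.hi, j.lo, j.hi, monotonic(m, t1));
  return 0;
}
```

On the ranges from the worked example it prints meet = {1, 100} and join = {10, 20}, matching the derivation above.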
CCP is an optimistic iterative data flow analysis which traverses the ideal graph in RPOT order and reaches a fixed point once value transforms as no side-effects on the graph. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3078010698 From aph at openjdk.org Wed Jul 16 11:34:45 2025 From: aph at openjdk.org (Andrew Haley) Date: Wed, 16 Jul 2025 11:34:45 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: <7Az-yP8D4rH5uwzgkQMaFcK-t0cD8wMMlzjKIRXDvis=.9695b1bc-c9e0-490f-9371-41840cfc0a42@github.com> References: <8kuDtuUPOl5CsWzgmgN9V0X5hXmGUExY4rpOfAfn1ic=.b7023a27-6362-4a7f-ba77-05cc0b50e5e3@github.com> <7Az-yP8D4rH5uwzgkQMaFcK-t0cD8wMMlzjKIRXDvis=.9695b1bc-c9e0-490f-9371-41840cfc0a42@github.com> Message-ID: On Wed, 16 Jul 2025 08:35:38 GMT, Bhavana Kilambi wrote: >> But why would the Neon implementation fail if UseSVE ==1? Surely it would still work, and if it still works this comment is wrong. > > @theRealAph Thanks for your comments. The Neon implementation would not fail if UseSVE == 1 (Does the comment imply something like this?). Only that we are making a choice of generating Neon instructions for UseSVE = 1 and vec_len = 16. > > The conditions that can reach this method are - UseSVE = 0, 1 with vec_len = 8 or 16 and UseSVE = 2 with any vec_len (based on the conditions in `Matcher::match_rule_supported_vector()`). We have already filtered out `UseSVE = 1` with `vec_len = 8` and `UseSVE = 2` at line #2904. So if the control reaches #2915 then it's either `UseSVE = 0` with any vec_len and `UseSVE = 1` with `vec_len = 16` and that's what the comment mentions. Then that's what your comment should say. It does not: it says "conditions must satisfy" which implies that this is something the following code needs for correct operation. The language in your reply here is fine. Say that instead of "one of these conditions must satisfy". ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2210089819 From bkilambi at openjdk.org Wed Jul 16 11:38:45 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 16 Jul 2025 11:38:45 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: <8kuDtuUPOl5CsWzgmgN9V0X5hXmGUExY4rpOfAfn1ic=.b7023a27-6362-4a7f-ba77-05cc0b50e5e3@github.com> <7Az-yP8D4rH5uwzgkQMaFcK-t0cD8wMMlzjKIRXDvis=.9695b1bc-c9e0-490f-9371-41840cfc0a42@github.com> Message-ID: <8EA4qZsKKGihKr2rRfVp1hC4wdVkpvkirEMmCd-xL6o=.9deaf8bc-0185-4a9e-8643-b31f0b6d311a@github.com> On Wed, 16 Jul 2025 11:31:53 GMT, Andrew Haley wrote: >> @theRealAph Thanks for your comments. The Neon implementation would not fail if UseSVE == 1 (Does the comment imply something like this?). Only that we are making a choice of generating Neon instructions for UseSVE = 1 and vec_len = 16. >> >> The conditions that can reach this method are - UseSVE = 0, 1 with vec_len = 8 or 16 and UseSVE = 2 with any vec_len (based on the conditions in `Matcher::match_rule_supported_vector()`). We have already filtered out `UseSVE = 1` with `vec_len = 8` and `UseSVE = 2` at line #2904. So if the control reaches #2915 then it's either `UseSVE = 0` with any vec_len and `UseSVE = 1` with `vec_len = 16` and that's what the comment mentions. > > Then that's what your comment should say. It does not: it says "conditions must satisfy" which implies that this is something the following code needs for correct operation. > > The language in your reply here is fine. 
Say that instead of "one of these conditions must satisfy". Thanks for the suggestion. Will do that in my next PS. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2210111107 From chagedorn at openjdk.org Wed Jul 16 11:56:49 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 16 Jul 2025 11:56:49 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v37] In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 09:38:18 GMT, Roland Westrelin wrote: >> To optimize a long counted loop and long range checks in a long or int >> counted loop, the loop is turned into a loop nest. When the loop has >> few iterations, the overhead of having an outer loop whose backedge is >> never taken, has a measurable cost. Furthermore, creating the loop >> nest usually causes one iteration of the loop to be peeled so >> predicates can be set up. If the loop is short running, then it's an >> extra iteration that's run with range checks (compared to an int >> counted loop with int range checks). >> >> This change doesn't create a loop nest when: >> >> 1- it can be determined statically at loop nest creation time that the >> loop runs for a short enough number of iterations >> >> 2- profiling reports that the loop runs for no more than ShortLoopIter >> iterations (1000 by default). >> >> For 2-, a guard is added which is implemented as yet another predicate. >> >> While this change is in principle simple, I ran into a few >> implementation issues: >> >> - while c2 has a way to compute the number of iterations of an int >> counted loop, it doesn't have that for long counted loop. The >> existing logic for int counted loops promotes values to long to >> avoid overflows. I reworked it so it now works for both long and int >> counted loops. >> >> - I added a new deoptimization reason (Reason_short_running_loop) for >> the new predicate. Given the number of iterations is narrowed down >> by the predicate, the limit of the loop after transformation is a >> cast node that's control dependent on the short running loop >> predicate. Because once the counted loop is transformed, it is >> likely that range check predicates will be inserted and they will >> depend on the limit, the short running loop predicate has to be the >> one that's further away from the loop entry. Now it is also possible >> that the limit before transformation depends on a predicate >> (TestShortRunningLongCountedLoopPredicatesClone is an example), we >> can have: new predicates inserted after the transformation that >> depend on the casted limit that itself depend on old predicates >> added before the transformation. To solve this cicular dependency, >> parse and assert predicates are cloned between the old predicates >> and the loop head. The cloned short running loop parse predicate is >> the one that's used to insert the short running loop predicate. >> >> - In the case of a long counted loop, the loop is transformed into a >> regular loop with a ... > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > test failures Great! Sure, I've submitted another round of testing. Will report back again. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/21630#issuecomment-3078236848 From bulasevich at openjdk.org Wed Jul 16 12:01:46 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Wed, 16 Jul 2025 12:01:46 GMT Subject: RFR: 8362250: ARM32: forward_exception_entry missing return address [v2] In-Reply-To: References: <4O9aorKuQ5wpIGNVsjHd8K8lIQR-uRDxEl7HsFuUyXk=.e9c48f43-f406-4540-a231-cae9bdfc0f11@github.com> Message-ID: On Tue, 15 Jul 2025 17:42:27 GMT, Boris Ulasevich wrote: >> The ARM32 ForwardExceptionNode codegen needs to set the exception address to R5. And, since the https://github.com/openjdk/jdk/pull/20437 change, the TailCall generator does not need this because the StubRoutines::forward_exception_entry function is not called there. > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > adjust ad rules format Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26312#issuecomment-3078253484 From bulasevich at openjdk.org Wed Jul 16 12:01:47 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Wed, 16 Jul 2025 12:01:47 GMT Subject: Integrated: 8362250: ARM32: forward_exception_entry missing return address In-Reply-To: <4O9aorKuQ5wpIGNVsjHd8K8lIQR-uRDxEl7HsFuUyXk=.e9c48f43-f406-4540-a231-cae9bdfc0f11@github.com> References: <4O9aorKuQ5wpIGNVsjHd8K8lIQR-uRDxEl7HsFuUyXk=.e9c48f43-f406-4540-a231-cae9bdfc0f11@github.com> Message-ID: On Tue, 15 Jul 2025 10:23:23 GMT, Boris Ulasevich wrote: > The ARM32 ForwardExceptionNode codegen needs to set the exception address to R5. And, since the https://github.com/openjdk/jdk/pull/20437 change, the TailCall generator does not need this because the StubRoutines::forward_exception_entry function is not called there. This pull request has now been integrated. Changeset: 6ed81641 Author: Boris Ulasevich URL: https://git.openjdk.org/jdk/commit/6ed81641b101658fbbd35445b6dd74ec17fc20f3 Stats: 9 lines in 1 file changed: 2 ins; 5 del; 2 mod 8362250: ARM32: forward_exception_entry missing return address Reviewed-by: shade ------------- PR: https://git.openjdk.org/jdk/pull/26312 From fyang at openjdk.org Wed Jul 16 12:33:44 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 16 Jul 2025 12:33:44 GMT Subject: RFR: 8362284: RISC-V: cleanup NativeMovRegMem [v2] In-Reply-To: References: <7QlEqzQUzoDK6NycLx0HECjospeft1MwbOQh7aHVq8U=.efa2d92e-a024-4447-9565-8f6ee7ee4774@github.com> Message-ID: On Wed, 16 Jul 2025 10:48:23 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review this simple patch? >> >> NativeMovRegMem on riscv is actually dead code, but still needed in case of compilation of C1. >> So make the code as simple as possible to avoid any reading and maintainance effort. >> >> No tests, as `offset()` and `set_offset()` were Unimplemented and used in C1 and never triggered before. >> >> Thanks! > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > use nullptr Marked as reviewed by fyang (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/26328#pullrequestreview-3024690862 From bmaillard at openjdk.org Wed Jul 16 12:53:00 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Wed, 16 Jul 2025 12:53:00 GMT Subject: RFR: 8361700: Missed optimization in PhaseIterGVN for mask and shift patterns due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist Message-ID: This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist`. The affected optimization is the transformation of `(x & mask) >> shift` into `(x >> shift) & (mask >> shift)`, where `mask` is a constant. This transformation is handled in `RShiftNode::IdealIL` for both `RShiftI` and `RShiftL` nodes. The dependency of this optimization extends beyond a direct input: from the viewpoint of a shift node, it relies on changes to the inputs of its inputs (i.e., an `AndI`/`AndL` input node to the shift). Therefore, when the `And` node changes, the corresponding shift node must be notified to allow the optimization to take place. Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. ### Testing - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361700) - [x] tier1-3, plus some internal testing - [x] Added test from the fuzzer Thank you for reviewing! ------------- Commit messages: - 8361700: Add comment for reference to the optimization - 8361700: Add test obtained from the fuzzer - 8361700: Add RShift nodes to worklist when candidates for mask and shift ideal optimization Changes: https://git.openjdk.org/jdk/pull/26347/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26347&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361700 Stats: 69 lines in 2 files changed: 69 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26347.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26347/head:pull/26347 PR: https://git.openjdk.org/jdk/pull/26347 From thartmann at openjdk.org Wed Jul 16 13:02:54 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 16 Jul 2025 13:02:54 GMT Subject: RFR: 8362171: C2 fails with unexpected node in SuperWord truncation: ModI In-Reply-To: References: Message-ID: <-x7-wP0qjhmk3PxdON7RftIv_oPWu3bBNmKjUCP-bCc=.e97704bb-df34-4f1c-9085-b47c7486c553@github.com> On Tue, 15 Jul 2025 21:52:23 GMT, Jasmine Karthikeyan wrote: > Hi all, > This is a small fix for an assert failure in SuperWord truncation with ModI nodes. The failure itself is harmless and shouldn't lead to any miscompilations in product mode. I've added `ModI` to the assert switch and adapted the test in the bug report. Let me know what you think! Testing is all clean. Ship it! 
:slightly_smiling_face: ------------- PR Comment: https://git.openjdk.org/jdk/pull/26334#issuecomment-3078413666 From jkarthikeyan at openjdk.org Wed Jul 16 13:02:55 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 16 Jul 2025 13:02:55 GMT Subject: RFR: 8362171: C2 fails with unexpected node in SuperWord truncation: ModI In-Reply-To: <-x7-wP0qjhmk3PxdON7RftIv_oPWu3bBNmKjUCP-bCc=.e97704bb-df34-4f1c-9085-b47c7486c553@github.com> References: <-x7-wP0qjhmk3PxdON7RftIv_oPWu3bBNmKjUCP-bCc=.e97704bb-df34-4f1c-9085-b47c7486c553@github.com> Message-ID: <_bmy_qwlDUmvAtArdSmfSnKlUyppVieqGe-HkfY9pNA=.6b4e2241-bdc2-43c6-89f6-0b65fe8b5f46@github.com> On Wed, 16 Jul 2025 12:43:09 GMT, Tobias Hartmann wrote: >> Hi all, >> This is a small fix for an assert failure in SuperWord truncation with ModI nodes. The failure itself is harmless and shouldn't lead to any miscompilations in product mode. I've added `ModI` to the assert switch and adapted the test in the bug report. Let me know what you think! > > Testing is all clean. Ship it! :slightly_smiling_face: Thanks for the reviews @TobiHartmann @chhagedorn! :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/26334#issuecomment-3078481539 From jkarthikeyan at openjdk.org Wed Jul 16 13:02:55 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 16 Jul 2025 13:02:55 GMT Subject: Integrated: 8362171: C2 fails with unexpected node in SuperWord truncation: ModI In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 21:52:23 GMT, Jasmine Karthikeyan wrote: > Hi all, > This is a small fix for an assert failure in SuperWord truncation with ModI nodes. The failure itself is harmless and shouldn't lead to any miscompilations in product mode. I've added `ModI` to the assert switch and adapted the test in the bug report. Let me know what you think! This pull request has now been integrated. Changeset: 70c1ff7e Author: Jasmine Karthikeyan URL: https://git.openjdk.org/jdk/commit/70c1ff7e1505eee11b2a9acd9e94a39cd2c9a932 Stats: 14 lines in 2 files changed: 13 ins; 0 del; 1 mod 8362171: C2 fails with unexpected node in SuperWord truncation: ModI Reviewed-by: thartmann, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/26334 From thartmann at openjdk.org Wed Jul 16 13:40:00 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 16 Jul 2025 13:40:00 GMT Subject: [jdk25] RFR: 8362171: C2 fails with unexpected node in SuperWord truncation: ModI Message-ID: Hi all, This pull request contains a backport of commit [70c1ff7e](https://github.com/openjdk/jdk/commit/70c1ff7e1505eee11b2a9acd9e94a39cd2c9a932) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. The commit being backported was authored by Jasmine Karthikeyan on 16 Jul 2025 and was reviewed by Tobias Hartmann and Christian Hagedorn. Thanks! 
------------- Commit messages: - Backport 70c1ff7e1505eee11b2a9acd9e94a39cd2c9a932 Changes: https://git.openjdk.org/jdk/pull/26350/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26350&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362171 Stats: 14 lines in 2 files changed: 13 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26350.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26350/head:pull/26350 PR: https://git.openjdk.org/jdk/pull/26350 From bmaillard at openjdk.org Wed Jul 16 13:49:56 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Wed, 16 Jul 2025 13:49:56 GMT Subject: RFR: 8358573: CompileBroker fails with "expect failure reason" assert with -XX:-InstallMethods [v2] In-Reply-To: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> References: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> Message-ID: <5wFwGsfzeYOy0uMPSbg9TfzEzZqhgPYLcY26FF7My9s=.1b474ce7-ea02-4adb-b938-31208cb31ec3@github.com> > This PR prevents from hitting an assert when disabling method installation at the end of a successful compilation with the `-XX:-InstallMethods` flag. Previously `CompileBroker` failed to mark the `CompileTask` as complete when using this flag. > > ## Analysis > > We can see that the assert is triggered in `CompileBroker::invoke_compiler_on_method`: > > ```c++ > if (!ci_env.failing() && !task->is_success()) { > assert(ci_env.failure_reason() != nullptr, "expect failure reason"); > assert(false, "compiler should always document failure: %s", ci_env.failure_reason()); > // The compiler elected, without comment, not to register a result. > // Do not attempt further compilations of this method. > ci_env.record_method_not_compilable("compile failed"); > } > > > The `task->is_success()` call accesses the private `_is_success` field. > This field is modified in `CompileTask::mark_success`. > > By setting a breakpoint there, and execute the program without `-XX:-InstallMethods`, we get the following stacktrace: > > CompileTask::mark_success compileTask.hpp:185 > nmethod::post_compiled_method nmethod.cpp:2212 > ciEnv::register_method ciEnv.cpp:1127 > Compilation::install_code c1_Compilation.cpp:425 > Compilation::compile_method c1_Compilation.cpp:488 > Compilation::Compilation c1_Compilation.cpp:609 > Compiler::compile_method c1_Compiler.cpp:262 > CompileBroker::invoke_compiler_on_method compileBroker.cpp:2324 > CompileBroker::compiler_thread_loop compileBroker.cpp:1968 > CompilerThread::thread_entry compilerThread.cpp:67 > JavaThread::thread_main_inner javaThread.cpp:773 > JavaThread::run javaThread.cpp:758 > Thread::call_run thread.cpp:243 > thread_native_entry os_linux.cpp:868 > > > We go up the stacktrace and see that in `Compilation::compile_method` we have: > > ```c++ > if (should_install_code()) { > // install code > PhaseTraceTime timeit(_t_codeinstall); > install_code(frame_size); > } > > > If we do not install methods after compilation, the code path that marks the success is never executed > and therefore results in hitting the assert. > > ### Fix > We simply mark the task as complete when `should_install_code()` evaluates to `false` in the block code above. > After careful consideration, it was decided to simply get rid of the `-XX:-InstallMethods` flag. 
> > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8358573) > - [x] tier1-3, plus some internal testing > - [x] Added a test that starts the VM with the `-XX:-InstallMethods` flag > > ... Beno?t Maillard has updated the pull request incrementally with three additional commits since the last revision: - 8358573: get rid of InstallMethods flags completely - Revert "8358573: Add missing task success notification" This reverts commit cd91c7c06ba05aba3500b95ba1317539363aa63c. - Revert "8358573: Add test for -XX:-InstallMethods" This reverts commit 6eab84718c3b60c2585bc2711c4bc8144472975b. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26310/files - new: https://git.openjdk.org/jdk/pull/26310/files/6eab8471..2da73a5d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26310&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26310&range=00-01 Stats: 54 lines in 4 files changed: 0 ins; 53 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26310.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26310/head:pull/26310 PR: https://git.openjdk.org/jdk/pull/26310 From bmaillard at openjdk.org Wed Jul 16 13:49:56 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Wed, 16 Jul 2025 13:49:56 GMT Subject: RFR: 8358573: CompileBroker fails with "expect failure reason" assert with -XX:-InstallMethods [v2] In-Reply-To: References: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> Message-ID: On Tue, 15 Jul 2025 15:45:10 GMT, Marc Chevalier wrote: >> Beno?t Maillard has updated the pull request incrementally with three additional commits since the last revision: >> >> - 8358573: get rid of InstallMethods flags completely >> - Revert "8358573: Add missing task success notification" >> >> This reverts commit cd91c7c06ba05aba3500b95ba1317539363aa63c. >> - Revert "8358573: Add test for -XX:-InstallMethods" >> >> This reverts commit 6eab84718c3b60c2585bc2711c4bc8144472975b. > > Nice description! I have reverted the changes and removed the `-XX:-InstallMethods` flag. Thank you @marc-chevalier, @dean-long and @TobiHartmann for your comments! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26310#issuecomment-3078713061 From chagedorn at openjdk.org Wed Jul 16 14:24:40 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 16 Jul 2025 14:24:40 GMT Subject: [jdk25] RFR: 8362171: C2 fails with unexpected node in SuperWord truncation: ModI In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 13:34:59 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [70c1ff7e](https://github.com/openjdk/jdk/commit/70c1ff7e1505eee11b2a9acd9e94a39cd2c9a932) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Jasmine Karthikeyan on 16 Jul 2025 and was reviewed by Tobias Hartmann and Christian Hagedorn. > > Thanks! Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26350#pullrequestreview-3025273815 From thartmann at openjdk.org Wed Jul 16 14:27:42 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 16 Jul 2025 14:27:42 GMT Subject: [jdk25] RFR: 8362171: C2 fails with unexpected node in SuperWord truncation: ModI In-Reply-To: References: Message-ID: <4BVmGbWHZUg2ZeJ0rKMmgi23fY0QTxPjLgAsMpzjAKo=.46ce4ef3-1b65-43c0-bcf6-8552b0fe2808@github.com> On Wed, 16 Jul 2025 13:34:59 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [70c1ff7e](https://github.com/openjdk/jdk/commit/70c1ff7e1505eee11b2a9acd9e94a39cd2c9a932) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Jasmine Karthikeyan on 16 Jul 2025 and was reviewed by Tobias Hartmann and Christian Hagedorn. > > Thanks! Thanks Christian! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26350#issuecomment-3078845879 From mchevalier at openjdk.org Wed Jul 16 14:31:49 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Wed, 16 Jul 2025 14:31:49 GMT Subject: RFR: 8358573: CompileBroker fails with "expect failure reason" assert with -XX:-InstallMethods [v2] In-Reply-To: <5wFwGsfzeYOy0uMPSbg9TfzEzZqhgPYLcY26FF7My9s=.1b474ce7-ea02-4adb-b938-31208cb31ec3@github.com> References: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> <5wFwGsfzeYOy0uMPSbg9TfzEzZqhgPYLcY26FF7My9s=.1b474ce7-ea02-4adb-b938-31208cb31ec3@github.com> Message-ID: On Wed, 16 Jul 2025 13:49:56 GMT, Beno?t Maillard wrote: >> This PR prevents from hitting an assert when disabling method installation at the end of a successful compilation with the `-XX:-InstallMethods` flag. Previously `CompileBroker` failed to mark the `CompileTask` as complete when using this flag. >> >> ## Analysis >> >> We can see that the assert is triggered in `CompileBroker::invoke_compiler_on_method`: >> >> ```c++ >> if (!ci_env.failing() && !task->is_success()) { >> assert(ci_env.failure_reason() != nullptr, "expect failure reason"); >> assert(false, "compiler should always document failure: %s", ci_env.failure_reason()); >> // The compiler elected, without comment, not to register a result. >> // Do not attempt further compilations of this method. >> ci_env.record_method_not_compilable("compile failed"); >> } >> >> >> The `task->is_success()` call accesses the private `_is_success` field. >> This field is modified in `CompileTask::mark_success`. 
>> >> By setting a breakpoint there, and execute the program without `-XX:-InstallMethods`, we get the following stacktrace: >> >> CompileTask::mark_success compileTask.hpp:185 >> nmethod::post_compiled_method nmethod.cpp:2212 >> ciEnv::register_method ciEnv.cpp:1127 >> Compilation::install_code c1_Compilation.cpp:425 >> Compilation::compile_method c1_Compilation.cpp:488 >> Compilation::Compilation c1_Compilation.cpp:609 >> Compiler::compile_method c1_Compiler.cpp:262 >> CompileBroker::invoke_compiler_on_method compileBroker.cpp:2324 >> CompileBroker::compiler_thread_loop compileBroker.cpp:1968 >> CompilerThread::thread_entry compilerThread.cpp:67 >> JavaThread::thread_main_inner javaThread.cpp:773 >> JavaThread::run javaThread.cpp:758 >> Thread::call_run thread.cpp:243 >> thread_native_entry os_linux.cpp:868 >> >> >> We go up the stacktrace and see that in `Compilation::compile_method` we have: >> >> ```c++ >> if (should_install_code()) { >> // install code >> PhaseTraceTime timeit(_t_codeinstall); >> install_code(frame_size); >> } >> >> >> If we do not install methods after compilation, the code path that marks the success is never executed >> and therefore results in hitting the assert. >> >> ### Fix >> We simply mark the task as complete when `should_install_code()` evaluates to `false` in the block code above. >> After careful consideration, it was decided to simply get rid of the `-XX:-InstallMethods` flag. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8358573) >> - [... > > Beno?t Maillard has updated the pull request incrementally with three additional commits since the last revision: > > - 8358573: get rid of InstallMethods flags completely > - Revert "8358573: Add missing task success notification" > > This reverts commit cd91c7c06ba05aba3500b95ba1317539363aa63c. > - Revert "8358573: Add test for -XX:-InstallMethods" > > This reverts commit 6eab84718c3b60c2585bc2711c4bc8144472975b. Now, `should_install_code` is `false` only for `RepeatCompilation`, fine with me. The fix looks good, but maybe it's worth changing the title of the issue/PR? Even if it still solves the problem mentioned in title, I'd say that's not the best description of what you're doing. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26310#issuecomment-3078862949 From mchevalier at openjdk.org Wed Jul 16 14:37:41 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Wed, 16 Jul 2025 14:37:41 GMT Subject: RFR: 8361700: Missed optimization in PhaseIterGVN for mask and shift patterns due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 12:42:32 GMT, Beno?t Maillard wrote: > This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist`. > > The affected optimization is the transformation of `(x & mask) >> shift` into `(x >> shift) & (mask >> shift)`, where `mask` is a constant. This transformation is handled in `RShiftNode::IdealIL` for both `RShiftI` and `RShiftL` nodes. > > The dependency of this optimization extends beyond a direct input: from the viewpoint of a shift node, it relies on changes to the inputs of its inputs (i.e., an `AndI`/`AndL` input node to the shift). Therefore, when the `And` node changes, the corresponding shift node must be notified to allow the optimization to take place. 
> > Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361700) > - [x] tier1-3, plus some internal testing > - [x] Added test from the fuzzer > > Thank you for reviewing! Yet another of this kind. Works for me! Would be good if we could declare what shape we are looking for, that would be used by the node idealization, and by the IGVN witchcraft to guess how deep nodes look, and update the relevant nodes automatically, without having to make it manually. I think I've read some similar vague idea before in a JBS issue, maybe from Roland? src/hotspot/share/opto/phaseX.cpp line 2552: > 2550: } > 2551: } > 2552: // If changed AndI/AndL inputs, check RShift users for "(x & mask) >> shift" reordering "reordering" sounds not quite right to me, but I don't have a much better idea. ------------- Marked as reviewed by mchevalier (Committer). PR Review: https://git.openjdk.org/jdk/pull/26347#pullrequestreview-3025313234 PR Review Comment: https://git.openjdk.org/jdk/pull/26347#discussion_r2210610118 From aph at openjdk.org Wed Jul 16 14:44:43 2025 From: aph at openjdk.org (Andrew Haley) Date: Wed, 16 Jul 2025 14:44:43 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet In-Reply-To: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: On Thu, 26 Jun 2025 12:13:19 GMT, Samuel Chee wrote: > AtomicLong.CompareAndSet has the following assembly dump snippet which gets emitted from the intermediary LIRGenerator::atomic_cmpxchg: > > ;; cmpxchg { > 0x0000e708d144cf60: mov x8, x2 > 0x0000e708d144cf64: casal x8, x3, [x0] > 0x0000e708d144cf68: cmp x8, x2 > ;; 0x1F1F1F1F1F1F1F1F > 0x0000e708d144cf6c: mov x8, #0x1f1f1f1f1f1f1f1f > ;; } cmpxchg > 0x0000e708d144cf70: cset x8, ne // ne = any > 0x0000e708d144cf74: dmb ish > > > According to the Oracle Java Specification, AtomicLong.CompareAndSet [1] has the same memory effects as specified by VarHandle.compareAndSet which has the following effects: [2] > >> Atomically sets the value of a variable to the >> newValue with the memory semantics of setVolatile if >> the variable's current value, referred to as the witness >> value, == the expectedValue, as accessed with the memory >> semantics of getVolatile. > > > > Hence the release on the store due to setVolatile only occurs if the compare is successful. Since casal already satisfies these requirements, the dmb does not need to occur to ensure memory ordering in case the compare fails and a release does not happen. > > Hence we remove the dmb from both casl and casw (same logic applies to the non-long variant) > > This is also reflected by C2 not having a dmb for the same respective method. > > [1] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/util/concurrent/atomic/AtomicLong.html#compareAndSet(long,long) > [2] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/lang/invoke/VarHandle.html#compareAndSet(java.lang.Object...) 
I think we still need a DMB after non-LSE CMPXCHG, which gets failures without this DMB: AArch64 MP { 0:X0=x; 0:X2=y; 1:X0=y; 1:X4=x; } P0 | P1 ; LDAR W1,[X0] | MOV W2,#1 ; | L0: ; LDR W3,[X2] | LDAXR W1,[X0] ; | STLXR W8,W2,[X0] ; | CBNZ W8,L0; | DMB ISH; | MOV W3,#1 ; | STR W3,[X4] ; exists (0:X1=1 /\ 0:X3=0 /\ 1:X1=0) ------------- PR Comment: https://git.openjdk.org/jdk/pull/26000#issuecomment-3078920694 From thartmann at openjdk.org Wed Jul 16 14:53:44 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 16 Jul 2025 14:53:44 GMT Subject: [jdk25] Integrated: 8362171: C2 fails with unexpected node in SuperWord truncation: ModI In-Reply-To: References: Message-ID: <8JcQuNToKOJz04zWf6x7wPV4j8gssC99g6VSeA1qRqM=.a7bb7e7f-693e-4eb2-a857-2bb85055e93f@github.com> On Wed, 16 Jul 2025 13:34:59 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [70c1ff7e](https://github.com/openjdk/jdk/commit/70c1ff7e1505eee11b2a9acd9e94a39cd2c9a932) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Jasmine Karthikeyan on 16 Jul 2025 and was reviewed by Tobias Hartmann and Christian Hagedorn. > > Thanks! This pull request has now been integrated. Changeset: b67fb82a Author: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/b67fb82a03cdb9634f71c0c39722611c852ade50 Stats: 14 lines in 2 files changed: 13 ins; 0 del; 1 mod 8362171: C2 fails with unexpected node in SuperWord truncation: ModI Reviewed-by: chagedorn Backport-of: 70c1ff7e1505eee11b2a9acd9e94a39cd2c9a932 ------------- PR: https://git.openjdk.org/jdk/pull/26350 From fgao at openjdk.org Wed Jul 16 14:53:50 2025 From: fgao at openjdk.org (Fei Gao) Date: Wed, 16 Jul 2025 14:53:50 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: <38f2bvFqiVNQGGpMif0iflVFD8wXnyw4SwtKxwi_Dmo=.276fb2fb-b80c-4ea7-a32f-c326294f442a@github.com> Message-ID: <1xAfD3mz5cbQpYtCYxoHqRQcOLadLKNHrvMUtFtFbGo=.34e5780a-e37a-427c-b745-1ed422c7a008@github.com> On Wed, 16 Jul 2025 06:44:13 GMT, Xiaohong Gong wrote: > * case-2: 2 times of gather and merge > > * Can be refined. But the `LoadVectorGatherNode` should be changed to accept 2 `index` vectors. > * case-3: 4 times of gather and merge (only for byte) > > * Can be refined. We can implement it just like: > step-1: `v1 = gather1 + gather2 + 2 * uzp1` // merging the first and second gather-loads > step-2: `v2 = gather3 + gather4 + 2 * uzp1` // merging the third and fourth gather-loads > step-3: `v3 = slice (v2, v2)`, `v = or(v1, v3)` // do the final merging > We have to change `LoadVectorGatherNode` as well. At least making it accept 2 `index` vectors. > > As a summary, `LoadVectorGatherNode` will be more complex than before. But the good thing is, giving it one more `index` input is ok. I'm not sure whether this is appliable for other architectures like maybe RVV. But I can try with this change. Do you have better idea? Thanks! @XiaohongGong thanks for your reply. This idea generally looks good to me. For case-2, we have gather1 + gather2 + uzp1: [0a 0a 0a 0a ... 0a 0a 0a 0a] [0b 0b 0b 0b ... 0b 0b 0b 0b] uzp1.H => [bb bb bb bb ... aa aa aa aa] Can we improve `case-3` by following the pattern of `case-2`? step-1: v1 = gather1 + gather2 + uzp1 [000a 000a 000a 000a ? 000a 000a 000a 000a] [000b 000b 000b 000b ? 000b 000b 000b 000b] uzp1.H => [0b0b 0b0b 0b0b 0b0b ? 0a0a 0a0a 0a0a 0a0a] step-2: v2 = gather3 + gather4 + uzp1 [000c 000c 000c 000c ? 
000c 000c 000c 000c] [000d 000d 000d 000d ? 000d 000d 000d 000d] uzp1.H => [0d0d 0d0d 0d0d 0d0d ? 0c0c 0c0c 0c0c 0c0c] step-3: v3 = uzp1 (v1, v2) [0b0b 0b0b 0b0b 0b0b ? 0a0a 0a0a 0a0a 0a0a] [0d0d 0d0d 0d0d 0d0d ? 0c0c 0c0c 0c0c 0c0c] uzp1.B => [dddd dddd cccc cccc ? bbbb bbbb aaaa aaaa] Then we can also consistently define the semantics of `LoadVectorGatherNode` as `gather1 + gather2 + uzp1.H `, which would make backend much cleaner. WDYT? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3078968856 From mhaessig at openjdk.org Wed Jul 16 14:58:45 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Wed, 16 Jul 2025 14:58:45 GMT Subject: RFR: 8361700: Missed optimization in PhaseIterGVN for mask and shift patterns due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 12:42:32 GMT, Beno?t Maillard wrote: > This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist`. > > The affected optimization is the transformation of `(x & mask) >> shift` into `(x >> shift) & (mask >> shift)`, where `mask` is a constant. This transformation is handled in `RShiftNode::IdealIL` for both `RShiftI` and `RShiftL` nodes. > > The dependency of this optimization extends beyond a direct input: from the viewpoint of a shift node, it relies on changes to the inputs of its inputs (i.e., an `AndI`/`AndL` input node to the shift). Therefore, when the `And` node changes, the corresponding shift node must be notified to allow the optimization to take place. > > Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361700) > - [x] tier1-3, plus some internal testing > - [x] Added test from the fuzzer > > Thank you for reviewing! Thank you for working on this, @benoitmaillard. I only have a few nits. Otherwise, this looks good to me. test/hotspot/jtreg/compiler/c2/TestMaskAndRShiftReorder.java line 28: > 26: * @bug 8361700 > 27: * @summary An expression of the form "(x & mask) >> shift", where the mask > 28: * is a constant, should be reordered to "(x >> shift) & (mask >> shift)" Suggestion: * is a constant, should be transformed to "(x >> shift) & (mask >> shift)" I agree with @marc-chevalier that "reordered" is not the right word. test/hotspot/jtreg/compiler/c2/TestMaskAndRShiftReorder.java line 60: > 58: return iArr.length; > 59: } > 60: } Suggestion: } ------------- Changes requested by mhaessig (Committer). 
PR Review: https://git.openjdk.org/jdk/pull/26347#pullrequestreview-3025384023 PR Review Comment: https://git.openjdk.org/jdk/pull/26347#discussion_r2210669477 PR Review Comment: https://git.openjdk.org/jdk/pull/26347#discussion_r2210671082 From mhaessig at openjdk.org Wed Jul 16 14:58:46 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Wed, 16 Jul 2025 14:58:46 GMT Subject: RFR: 8361700: Missed optimization in PhaseIterGVN for mask and shift patterns due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 14:31:19 GMT, Marc Chevalier wrote: >> This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist`. >> >> The affected optimization is the transformation of `(x & mask) >> shift` into `(x >> shift) & (mask >> shift)`, where `mask` is a constant. This transformation is handled in `RShiftNode::IdealIL` for both `RShiftI` and `RShiftL` nodes. >> >> The dependency of this optimization extends beyond a direct input: from the viewpoint of a shift node, it relies on changes to the inputs of its inputs (i.e., an `AndI`/`AndL` input node to the shift). Therefore, when the `And` node changes, the corresponding shift node must be notified to allow the optimization to take place. >> >> Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361700) >> - [x] tier1-3, plus some internal testing >> - [x] Added test from the fuzzer >> >> Thank you for reviewing! > > src/hotspot/share/opto/phaseX.cpp line 2552: > >> 2550: } >> 2551: } >> 2552: // If changed AndI/AndL inputs, check RShift users for "(x & mask) >> shift" reordering > > "reordering" sounds not quite right to me, but I don't have a much better idea. Suggestion: // If changed AndI/AndL inputs, check RShift users for "(x & mask) >> shift" optimization opportunity Those are my two cents ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26347#discussion_r2210649883 From bmaillard at openjdk.org Wed Jul 16 15:07:59 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Wed, 16 Jul 2025 15:07:59 GMT Subject: RFR: 8361700: Missed optimization in PhaseIterGVN for mask and shift patterns due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v2] In-Reply-To: References: Message-ID: > This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist`. > > The affected optimization is the transformation of `(x & mask) >> shift` into `(x >> shift) & (mask >> shift)`, where `mask` is a constant. This transformation is handled in `RShiftNode::IdealIL` for both `RShiftI` and `RShiftL` nodes. > > The dependency of this optimization extends beyond a direct input: from the viewpoint of a shift node, it relies on changes to the inputs of its inputs (i.e., an `AndI`/`AndL` input node to the shift). Therefore, when the `And` node changes, the corresponding shift node must be notified to allow the optimization to take place. 
> > Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361700) > - [x] tier1-3, plus some internal testing > - [x] Added test from the fuzzer > > Thank you for reviewing! Beno?t Maillard has updated the pull request incrementally with three additional commits since the last revision: - Update test/hotspot/jtreg/compiler/c2/TestMaskAndRShiftReorder.java Co-authored-by: Manuel H?ssig - Update test/hotspot/jtreg/compiler/c2/TestMaskAndRShiftReorder.java Co-authored-by: Manuel H?ssig - Update src/hotspot/share/opto/phaseX.cpp Co-authored-by: Manuel H?ssig ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26347/files - new: https://git.openjdk.org/jdk/pull/26347/files/ac389939..cc3ccc93 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26347&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26347&range=00-01 Stats: 3 lines in 2 files changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/26347.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26347/head:pull/26347 PR: https://git.openjdk.org/jdk/pull/26347 From bulasevich at openjdk.org Wed Jul 16 16:31:23 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Wed, 16 Jul 2025 16:31:23 GMT Subject: [jdk25] RFR: 8362250: ARM32: forward_exception_entry missing return address Message-ID: This pull request contains a backport of commit [6ed81641](https://github.com/openjdk/jdk/commit/6ed81641b101658fbbd35445b6dd74ec17fc20f3) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. The ARM32 ForwardExceptionNode code generation has been updated to set the exception address. This is a minimal, ARM32-specific change, it fixes a couple of failing hotspot jtreg tests. ------------- Commit messages: - Backport 6ed81641b101658fbbd35445b6dd74ec17fc20f3 Changes: https://git.openjdk.org/jdk/pull/26352/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26352&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362250 Stats: 9 lines in 1 file changed: 2 ins; 5 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26352.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26352/head:pull/26352 PR: https://git.openjdk.org/jdk/pull/26352 From kvn at openjdk.org Wed Jul 16 16:50:42 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 16 Jul 2025 16:50:42 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 [v2] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 08:59:17 GMT, Aleksey Shipilev wrote: >> See the bug for more analysis. >> >> The short summary is that `CompileQueue::delete_all` walks the entire compile queue and deletes the tasks. It normally goes smoothly, unless there are blocking tasks. Then, the actual waiters have to delete the task, lest we delete the task under waiter's feet. Full deletion and blocking waits coordinate with `waiting_for_completion_count` counter. This mechanism -- added by [JDK-8343938](https://bugs.openjdk.org/browse/JDK-8343938) in JDK 25 to solve a similar problem -- almost works. _Almost_. 
>> >> There is a subtle race window, where blocking waiter could have already unparked, dropped `waiting_for_completion_count` to `0` and proceeded to delete the task, see `CompileBroker::wait_for_completion()`. Then the queue deletion code could assume there are _no actual waiters_ on the blocking task, and proceed to delete the task _again_. Before [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473) this race was fairly innocuous, as second attempt at insertion into the free list was benign. But now, `CompileTask`-s are `delete`-d, and the second attempt leads to double free. >> >> I suspect we can fix that by complicating the coordination protocol even further, e.g. by tracking the counters more thoroughly. But, recognizing `CompileQueue::delete_all()` is basically only called from the compiler shutdown code (things are already bad), and it looks completely opportunistic (it does not delete the whole compiler _threads_, so skipping synchronous deletes on a few compile tasks are not a big deal), we should strive to simplify it. >> >> This PR summarily delegates _all_ blocking task deletes to waiters. I think it stands to reason (and can be seen in `CompilerBroker` code) that if a blocking task is in queue, then there _is_ a waiter that would call `CompileBroker::wait_for_completion()` on it. >> >> Additional testing: >> - [x] Linux AArch64 server fastdebug, `tier1` >> - [x] Linux x86_64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Also handle the corner case when compiler threads might be using the task Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26294#pullrequestreview-3025883834 From shade at openjdk.org Wed Jul 16 17:26:39 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 16 Jul 2025 17:26:39 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 [v2] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 08:59:17 GMT, Aleksey Shipilev wrote: >> See the bug for more analysis. >> >> The short summary is that `CompileQueue::delete_all` walks the entire compile queue and deletes the tasks. It normally goes smoothly, unless there are blocking tasks. Then, the actual waiters have to delete the task, lest we delete the task under waiter's feet. Full deletion and blocking waits coordinate with `waiting_for_completion_count` counter. This mechanism -- added by [JDK-8343938](https://bugs.openjdk.org/browse/JDK-8343938) in JDK 25 to solve a similar problem -- almost works. _Almost_. >> >> There is a subtle race window, where blocking waiter could have already unparked, dropped `waiting_for_completion_count` to `0` and proceeded to delete the task, see `CompileBroker::wait_for_completion()`. Then the queue deletion code could assume there are _no actual waiters_ on the blocking task, and proceed to delete the task _again_. Before [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473) this race was fairly innocuous, as second attempt at insertion into the free list was benign. But now, `CompileTask`-s are `delete`-d, and the second attempt leads to double free. >> >> I suspect we can fix that by complicating the coordination protocol even further, e.g. by tracking the counters more thoroughly. 
But, recognizing `CompileQueue::delete_all()` is basically only called from the compiler shutdown code (things are already bad), and it looks completely opportunistic (it does not delete the whole compiler _threads_, so skipping synchronous deletes on a few compile tasks are not a big deal), we should strive to simplify it. >> >> This PR summarily delegates _all_ blocking task deletes to waiters. I think it stands to reason (and can be seen in `CompilerBroker` code) that if a blocking task is in queue, then there _is_ a waiter that would call `CompileBroker::wait_for_completion()` on it. >> >> Additional testing: >> - [x] Linux AArch64 server fastdebug, `tier1` >> - [x] Linux x86_64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Also handle the corner case when compiler threads might be using the task Thanks! I think I need another Review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26294#issuecomment-3079552309 From shade at openjdk.org Wed Jul 16 17:27:39 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 16 Jul 2025 17:27:39 GMT Subject: [jdk25] RFR: 8362250: ARM32: forward_exception_entry missing return address In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 16:18:04 GMT, Boris Ulasevich wrote: > This pull request contains a backport of commit [6ed81641](https://github.com/openjdk/jdk/commit/6ed81641b101658fbbd35445b6dd74ec17fc20f3) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The ARM32 ForwardExceptionNode code generation has been updated to set the exception address. This is a minimal, ARM32-specific change, it fixes a couple of failing hotspot jtreg tests. Marked as reviewed by shade (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26352#pullrequestreview-3026033157 From duke at openjdk.org Wed Jul 16 17:37:54 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Wed, 16 Jul 2025 17:37:54 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v7] In-Reply-To: References: Message-ID: <6aQO8OMKcpAUSA9VzJICZXL0YY9zTQzVYDh0NuxueM8=.fc8ca8ce-6619-43dd-ac18-9178f1cb6007@github.com> On Wed, 16 Jul 2025 06:59:06 GMT, Dean Long wrote: > I can't find who posted this and what lines it refers to. If it refers to nmethod::relocate, I don't think the lock is needed after 8358821, because nobody will be patching the relocations. The [comment](https://github.com/openjdk/jdk/pull/23573#issuecomment-2831542576) was from @fisk on an old revision, but it was because `NMethodState_lock` was being held for the entirety of `nmethod::relocate` as opposed to just when the state is updated. However I'm curious about the case where the nmethod we are attempting to relocate gets updated. For example: 1. Call nmethod relocate 2. Check is_in_use() 3. Original nmethod marked not entrant from somewhere else in JVM 4. Perform relocation on stale nmethod I believe we need some way to guarantee the source does not change during relocation. > The source nmethod? I don't see how that would cause a problem for that small block of code. All it does to the source is call make_not_used(). I'm interested in the case where something else invalidates the source nmethod during relocation.
It can't be evicted from the code cache because the `CodeCache_lock` is held during relocation, but that doesn't stop another thread from marking it not entrant. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23573#issuecomment-3079589572 From sviswanathan at openjdk.org Wed Jul 16 20:57:47 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 16 Jul 2025 20:57:47 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v4] In-Reply-To: <-V4hpHvXdaDjmEyYzHcEpDJ2bzPTqoz2Ao8FLobkmB8=.d9e3b962-ae8d-4e4b-8ddb-c3ab42a2a619@github.com> References: <-V4hpHvXdaDjmEyYzHcEpDJ2bzPTqoz2Ao8FLobkmB8=.d9e3b962-ae8d-4e4b-8ddb-c3ab42a2a619@github.com> Message-ID: On Wed, 16 Jul 2025 00:06:53 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to enhance the existing x86 assembly stubs using PUSH and POP instructions with paired PUSHP/POPP instructions which are part of Intel APX technology. >> >> In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0-R31). Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs. >> >> Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures. > > Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: > > - merge with master > - remove pushp/popp from vm_version_x86 and also when APX is not being used > - rename to paired_push and paired_pop > - 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs src/hotspot/cpu/x86/macroAssembler_x86.cpp line 798: > 796: } > 797: > 798: void MacroAssembler::paired_push(Register src) { It would be better to call these push_ppx and pop_ppx. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2211592540 From dlong at openjdk.org Wed Jul 16 21:29:01 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 16 Jul 2025 21:29:01 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v37] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 20:34:42 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This does not modify existing code paths and therefore does not benefit much from existing tests.
New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: > > Revert is_always_within_branch_range changes In jdk25, if stale nmethod was marked as not entrant, we would patch the verified entry point, but we got rid of that in jdk26. So at least in jdk26, make_not_entrant() shouldn't be changing the nmethod much if at all. But let's say another thread is trying to mark the source nmethod as not entrant while nmethod::relocate is running, or soon after. What is the desired outcome for the newly relocated nmethod? It seems like any call to make_not_entrant() on the source would also want to do the same on the copy, right? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23573#issuecomment-3080964197 From dlong at openjdk.org Wed Jul 16 22:13:59 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 16 Jul 2025 22:13:59 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v37] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 20:34:42 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This does not modify existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: > > Revert is_always_within_branch_range changes BTW, I thought there was an earlier discussion that decided relocation would only happen at a safepoint, but now I can't find it. Did I remember wrong? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23573#issuecomment-3081358519 From dlong at openjdk.org Wed Jul 16 23:26:48 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 16 Jul 2025 23:26:48 GMT Subject: RFR: 8358573: CompileBroker fails with "expect failure reason" assert with -XX:-InstallMethods [v2] In-Reply-To: <5wFwGsfzeYOy0uMPSbg9TfzEzZqhgPYLcY26FF7My9s=.1b474ce7-ea02-4adb-b938-31208cb31ec3@github.com> References: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> <5wFwGsfzeYOy0uMPSbg9TfzEzZqhgPYLcY26FF7My9s=.1b474ce7-ea02-4adb-b938-31208cb31ec3@github.com> Message-ID: On Wed, 16 Jul 2025 13:49:56 GMT, Beno?t Maillard wrote: >> This PR prevents from hitting an assert when disabling method installation at the end of a successful compilation with the `-XX:-InstallMethods` flag. Previously `CompileBroker` failed to mark the `CompileTask` as complete when using this flag. 
>> >> ## Analysis >> >> We can see that the assert is triggered in `CompileBroker::invoke_compiler_on_method`: >> >> ```c++ >> if (!ci_env.failing() && !task->is_success()) { >> assert(ci_env.failure_reason() != nullptr, "expect failure reason"); >> assert(false, "compiler should always document failure: %s", ci_env.failure_reason()); >> // The compiler elected, without comment, not to register a result. >> // Do not attempt further compilations of this method. >> ci_env.record_method_not_compilable("compile failed"); >> } >> >> >> The `task->is_success()` call accesses the private `_is_success` field. >> This field is modified in `CompileTask::mark_success`. >> >> By setting a breakpoint there, and execute the program without `-XX:-InstallMethods`, we get the following stacktrace: >> >> CompileTask::mark_success compileTask.hpp:185 >> nmethod::post_compiled_method nmethod.cpp:2212 >> ciEnv::register_method ciEnv.cpp:1127 >> Compilation::install_code c1_Compilation.cpp:425 >> Compilation::compile_method c1_Compilation.cpp:488 >> Compilation::Compilation c1_Compilation.cpp:609 >> Compiler::compile_method c1_Compiler.cpp:262 >> CompileBroker::invoke_compiler_on_method compileBroker.cpp:2324 >> CompileBroker::compiler_thread_loop compileBroker.cpp:1968 >> CompilerThread::thread_entry compilerThread.cpp:67 >> JavaThread::thread_main_inner javaThread.cpp:773 >> JavaThread::run javaThread.cpp:758 >> Thread::call_run thread.cpp:243 >> thread_native_entry os_linux.cpp:868 >> >> >> We go up the stacktrace and see that in `Compilation::compile_method` we have: >> >> ```c++ >> if (should_install_code()) { >> // install code >> PhaseTraceTime timeit(_t_codeinstall); >> install_code(frame_size); >> } >> >> >> If we do not install methods after compilation, the code path that marks the success is never executed >> and therefore results in hitting the assert. >> >> ### Fix >> We simply mark the task as complete when `should_install_code()` evaluates to `false` in the block code above. >> After careful consideration, it was decided to simply get rid of the `-XX:-InstallMethods` flag. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8358573) >> - [... > > Beno?t Maillard has updated the pull request incrementally with three additional commits since the last revision: > > - 8358573: get rid of InstallMethods flags completely > - Revert "8358573: Add missing task success notification" > > This reverts commit cd91c7c06ba05aba3500b95ba1317539363aa63c. > - Revert "8358573: Add test for -XX:-InstallMethods" > > This reverts commit 6eab84718c3b60c2585bc2711c4bc8144472975b. It might be interesting for maybe a CompileTheWorld mode to be able to turn off should_install_code programmatically, but it doesn't need a global flag to do that. ------------- Marked as reviewed by dlong (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26310#pullrequestreview-3027224467 From duke at openjdk.org Thu Jul 17 00:02:57 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 17 Jul 2025 00:02:57 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v37] In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 21:26:05 GMT, Dean Long wrote: > But let's say another thread is trying to mark the source nmethod as not entrant while nmethod::relocate is running, or soon after. What is the desired outcome for the newly relocated nmethod? It seems like any call to make_not_entrant() on the source would also want to do the same on the copy, right? 
You're correct I think the code is fine as is. If the source gets marked not entrant we can just mark the copy as not entrant as well. However I do think it is important to require the caller of `nmethod::relocate()` to hold the `CodeCache_lock` instead of acquiring the lock inside of `relocate()`. Otherwise the nmethod that is blocked on the lock could be purged from the code cache ------------- PR Comment: https://git.openjdk.org/jdk/pull/23573#issuecomment-3081853751 From duke at openjdk.org Thu Jul 17 00:10:58 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 17 Jul 2025 00:10:58 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v37] In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 22:11:21 GMT, Dean Long wrote: > BTW, I thought there was an earlier discussion that decided relocation would only happen at a safepoint, but now I can't find it. Did I remember wrong? There was a discussion a while back on whether one was needed or not. Here is the argument against one from https://github.com/openjdk/jdk/pull/23573#issuecomment-2831542576 : > The safepoint is still causing more trouble than it solves. It was introduced due to oop phobia. What the oops really needed to stabilize is to run the entry barrier which you do now. The safepoint merely destabilizes the oops again while introducing latency problems and fun class redefinition interactions. It should be removed as I can't see it serves any purpose. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23573#issuecomment-3081871267 From xgong at openjdk.org Thu Jul 17 01:23:51 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 17 Jul 2025 01:23:51 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: <1xAfD3mz5cbQpYtCYxoHqRQcOLadLKNHrvMUtFtFbGo=.34e5780a-e37a-427c-b745-1ed422c7a008@github.com> References: <38f2bvFqiVNQGGpMif0iflVFD8wXnyw4SwtKxwi_Dmo=.276fb2fb-b80c-4ea7-a32f-c326294f442a@github.com> <1xAfD3mz5cbQpYtCYxoHqRQcOLadLKNHrvMUtFtFbGo=.34e5780a-e37a-427c-b745-1ed422c7a008@github.com> Message-ID: On Wed, 16 Jul 2025 14:49:19 GMT, Fei Gao wrote: > > * case-2: 2 times of gather and merge > > > > * Can be refined. But the `LoadVectorGatherNode` should be changed to accept 2 `index` vectors. > > * case-3: 4 times of gather and merge (only for byte) > > > > * Can be refined. We can implement it just like: > > step-1: `v1 = gather1 + gather2 + 2 * uzp1` // merging the first and second gather-loads > > step-2: `v2 = gather3 + gather4 + 2 * uzp1` // merging the third and fourth gather-loads > > step-3: `v3 = slice (v2, v2)`, `v = or(v1, v3)` // do the final merging > > We have to change `LoadVectorGatherNode` as well. At least making it accept 2 `index` vectors. > > > > As a summary, `LoadVectorGatherNode` will be more complex than before. But the good thing is, giving it one more `index` input is ok. I'm not sure whether this is appliable for other architectures like maybe RVV. But I can try with this change. Do you have better idea? Thanks! > > @XiaohongGong thanks for your reply. > > This idea generally looks good to me. > > For case-2, we have > > ``` > gather1 + gather2 + uzp1: > [0a 0a 0a 0a 0a 0a 0a 0a] > [0b 0b 0b 0b 0b 0b 0b 0b] > uzp1.H => > [bb bb bb bb aa aa aa aa] > ``` > > Can we improve `case-3` by following the pattern of `case-2`? 
> > ``` > step-1: v1 = gather1 + gather2 + uzp1 > [000a 000a 000a 000a 000a 000a 000a 000a] > [000b 000b 000b 000b 000b 000b 000b 000b] > uzp1.H => [0b0b 0b0b 0b0b 0b0b 0a0a 0a0a 0a0a 0a0a] > > step-2: v2 = gather3 + gather4 + uzp1 > [000c 000c 000c 000c 000c 000c 000c 000c] > [000d 000d 000d 000d 000d 000d 000d 000d] > uzp1.H => [0d0d 0d0d 0d0d 0d0d 0c0c 0c0c 0c0c 0c0c] > > step-3: v3 = uzp1 (v1, v2) > [0b0b 0b0b 0b0b 0b0b 0a0a 0a0a 0a0a 0a0a] > [0d0d 0d0d 0d0d 0d0d 0c0c 0c0c 0c0c 0c0c] > uzp1.B => [dddd dddd cccc cccc bbbb bbbb aaaa aaaa] > ``` > > Then we can also consistently define the semantics of `LoadVectorGatherNode` as `gather1 + gather2 + uzp1.H `, which would make backend much cleaner. WDYT? Thanks! We can write a macro-assembler helper for that. Regarding to the definitation of `LoadVectorGatherNode`, we'd better keep the vector type as it is for byte and short vectors. The SVE vector load gather instruction needs the type information. Additionally, the vector layout of the result should be matched with the vector type, right? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3082052940 From bulasevich at openjdk.org Thu Jul 17 01:32:54 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 17 Jul 2025 01:32:54 GMT Subject: [jdk25] Integrated: 8362250: ARM32: forward_exception_entry missing return address In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 16:18:04 GMT, Boris Ulasevich wrote: > This pull request contains a backport of commit [6ed81641](https://github.com/openjdk/jdk/commit/6ed81641b101658fbbd35445b6dd74ec17fc20f3) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The ARM32 ForwardExceptionNode code generation has been updated to set the exception address. This is a minimal, ARM32-specific change, it fixes a couple of failing hotspot jtreg tests. This pull request has now been integrated. Changeset: 5129887d Author: Boris Ulasevich URL: https://git.openjdk.org/jdk/commit/5129887dfead268672403265eb4f3795682ca699 Stats: 9 lines in 1 file changed: 2 ins; 5 del; 2 mod 8362250: ARM32: forward_exception_entry missing return address Reviewed-by: shade Backport-of: 6ed81641b101658fbbd35445b6dd74ec17fc20f3 ------------- PR: https://git.openjdk.org/jdk/pull/26352 From xgong at openjdk.org Thu Jul 17 02:43:48 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 17 Jul 2025 02:43:48 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: <38f2bvFqiVNQGGpMif0iflVFD8wXnyw4SwtKxwi_Dmo=.276fb2fb-b80c-4ea7-a32f-c326294f442a@github.com> <1xAfD3mz5cbQpYtCYxoHqRQcOLadLKNHrvMUtFtFbGo=.34e5780a-e37a-427c-b745-1ed422c7a008@github.com> Message-ID: On Thu, 17 Jul 2025 01:20:44 GMT, Xiaohong Gong wrote: > > > * case-2: 2 times of gather and merge > > > > > > * Can be refined. But the `LoadVectorGatherNode` should be changed to accept 2 `index` vectors. > > > * case-3: 4 times of gather and merge (only for byte) > > > > > > * Can be refined. We can implement it just like: > > > step-1: `v1 = gather1 + gather2 + 2 * uzp1` // merging the first and second gather-loads > > > step-2: `v2 = gather3 + gather4 + 2 * uzp1` // merging the third and fourth gather-loads > > > step-3: `v3 = slice (v2, v2)`, `v = or(v1, v3)` // do the final merging > > > We have to change `LoadVectorGatherNode` as well. At least making it accept 2 `index` vectors. > > > > > > As a summary, `LoadVectorGatherNode` will be more complex than before. 
But the good thing is, giving it one more `index` input is ok. I'm not sure whether this is appliable for other architectures like maybe RVV. But I can try with this change. Do you have better idea? Thanks! > > > > > > @XiaohongGong thanks for your reply. > > This idea generally looks good to me. > > For case-2, we have > > ``` > > gather1 + gather2 + uzp1: > > [0a 0a 0a 0a 0a 0a 0a 0a] > > [0b 0b 0b 0b 0b 0b 0b 0b] > > uzp1.H => > > [bb bb bb bb aa aa aa aa] > > ``` > > > > > > > > > > > > > > > > > > > > > > > > Can we improve `case-3` by following the pattern of `case-2`? > > ``` > > step-1: v1 = gather1 + gather2 + uzp1 > > [000a 000a 000a 000a 000a 000a 000a 000a] > > [000b 000b 000b 000b 000b 000b 000b 000b] > > uzp1.H => [0b0b 0b0b 0b0b 0b0b 0a0a 0a0a 0a0a 0a0a] > > > > step-2: v2 = gather3 + gather4 + uzp1 > > [000c 000c 000c 000c 000c 000c 000c 000c] > > [000d 000d 000d 000d 000d 000d 000d 000d] > > uzp1.H => [0d0d 0d0d 0d0d 0d0d 0c0c 0c0c 0c0c 0c0c] > > > > step-3: v3 = uzp1 (v1, v2) > > [0b0b 0b0b 0b0b 0b0b 0a0a 0a0a 0a0a 0a0a] > > [0d0d 0d0d 0d0d 0d0d 0c0c 0c0c 0c0c 0c0c] > > uzp1.B => [dddd dddd cccc cccc bbbb bbbb aaaa aaaa] > > ``` > > > > > > > > > > > > > > > > > > > > > > > > Then we can also consistently define the semantics of `LoadVectorGatherNode` as `gather1 + gather2 + uzp1.H `, which would make backend much cleaner. WDYT? > > Thanks! Regarding to the definitation of `LoadVectorGatherNode`, we'd better keep the vector type as it is for byte and short vectors. The SVE vector load gather instruction needs the type information. Additionally, the vector layout of the result should be matched with the vector type, right? We can handle this easily with pure backend implementation. But it seems not easy in mid-end IR level. BTW, `uzp1` is SVE specific instruction, we'd better define a common IR for that, which is also useful for other platforms that want to support subword gather API, right? I'm not sure whether this makes sense. I will take a considering for this suggestion. Maybe I can define the vector type of `LoadVectorGatherNode` as int vector type for subword types. An additional flag is necessary to denote whether it is a byte or short loading. It only finishes the gather operation (without any truncating). And define an IR like `VectorConcateNode` to merge all the gather results. It can merge either two gathers or four gathers. For cases that only one time of gather is needed, we can just return a type cast node like `VectorCastI2X`. Seems this will make the IR more common and code more clean. WDYT? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3082248439 From fyang at openjdk.org Thu Jul 17 05:59:58 2025 From: fyang at openjdk.org (Fei Yang) Date: Thu, 17 Jul 2025 05:59:58 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v10] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 14:05:25 GMT, Yuri Gaevsky wrote: >> The patch adds possibility to use RVV instructions for faster vectorizedHashCode calculations on RVV v1.0.0 capable hardware. >> >> Testing: hotspot/jtreg/compiler/ under QEMU-8.1 with RVV v1.0.0. > > Yuri Gaevsky has updated the pull request incrementally with one additional commit since the last revision: > > - removed tail processing with RVV instructions as simple scalar loop provides in general better results Looking at the JMH numbers, it's interesting to find that `-XX:DisableIntrinsic=_vectorizedHashCode` outperforms `-XX:-UseRVV`. 
If that is the case, then why would we want the scalar version (that is `C2_MacroAssembler::arrays_hashcode()`)? ------------- PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-3082664335 From chagedorn at openjdk.org Thu Jul 17 06:25:59 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 17 Jul 2025 06:25:59 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v37] In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 09:38:18 GMT, Roland Westrelin wrote: >> To optimize a long counted loop and long range checks in a long or int >> counted loop, the loop is turned into a loop nest. When the loop has >> few iterations, the overhead of having an outer loop whose backedge is >> never taken, has a measurable cost. Furthermore, creating the loop >> nest usually causes one iteration of the loop to be peeled so >> predicates can be set up. If the loop is short running, then it's an >> extra iteration that's run with range checks (compared to an int >> counted loop with int range checks). >> >> This change doesn't create a loop nest when: >> >> 1- it can be determined statically at loop nest creation time that the >> loop runs for a short enough number of iterations >> >> 2- profiling reports that the loop runs for no more than ShortLoopIter >> iterations (1000 by default). >> >> For 2-, a guard is added which is implemented as yet another predicate. >> >> While this change is in principle simple, I ran into a few >> implementation issues: >> >> - while c2 has a way to compute the number of iterations of an int >> counted loop, it doesn't have that for long counted loop. The >> existing logic for int counted loops promotes values to long to >> avoid overflows. I reworked it so it now works for both long and int >> counted loops. >> >> - I added a new deoptimization reason (Reason_short_running_loop) for >> the new predicate. Given the number of iterations is narrowed down >> by the predicate, the limit of the loop after transformation is a >> cast node that's control dependent on the short running loop >> predicate. Because once the counted loop is transformed, it is >> likely that range check predicates will be inserted and they will >> depend on the limit, the short running loop predicate has to be the >> one that's further away from the loop entry. Now it is also possible >> that the limit before transformation depends on a predicate >> (TestShortRunningLongCountedLoopPredicatesClone is an example), we >> can have: new predicates inserted after the transformation that >> depend on the casted limit that itself depend on old predicates >> added before the transformation. To solve this cicular dependency, >> parse and assert predicates are cloned between the old predicates >> and the loop head. The cloned short running loop parse predicate is >> the one that's used to insert the short running loop predicate. >> >> - In the case of a long counted loop, the loop is transformed into a >> regular loop with a ... > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > test failures `tier1-4,hs-precheckin-comp,hs-comp-stress` looked good! ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/21630#pullrequestreview-3028074186 From mchevalier at openjdk.org Thu Jul 17 07:51:55 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 17 Jul 2025 07:51:55 GMT Subject: RFR: 8361890: Aarch64: Removal of redundant dmb from C1 AtomicLong methods In-Reply-To: <60YMRP6cNslwEeVX2TWmnMYdO872xGaeShKMEj0dWGY=.2f4f504f-93d1-4bab-b721-e5c964f4c465@github.com> References: <60YMRP6cNslwEeVX2TWmnMYdO872xGaeShKMEj0dWGY=.2f4f504f-93d1-4bab-b721-e5c964f4c465@github.com> Message-ID: On Thu, 10 Jul 2025 15:49:40 GMT, Samuel Chee wrote: > The current C1 implementation of AtomicLong methods > which either adds or exchanges (such as getAndAdd) > emit one of a ldaddal and swpal respectively when using > LSE as well as an immediately proceeding dmb. Since > ldaddal/swpal have both acquire and release semantics, > this provides similar ordering guarantees to a dmb.full > so the dmb here is redundant and can be removed. > > This is due to both clause 7 and clause 11 of the > definition of Barrier-ordered-before in B2.3.7 of the > DDI0487 L.a Arm Architecture Reference Manual for A-profile > architecture being satisfied by the existence of a > ldaddal/swpal which ensures such memory ordering guarantees. Hi! Thanks for looking at this. I've started some testing, will keep you updated. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26245#issuecomment-3082996958 From thartmann at openjdk.org Thu Jul 17 08:31:57 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 17 Jul 2025 08:31:57 GMT Subject: RFR: 8358573: Remove the -XX:-InstallMethods debug flag [v2] In-Reply-To: <5wFwGsfzeYOy0uMPSbg9TfzEzZqhgPYLcY26FF7My9s=.1b474ce7-ea02-4adb-b938-31208cb31ec3@github.com> References: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> <5wFwGsfzeYOy0uMPSbg9TfzEzZqhgPYLcY26FF7My9s=.1b474ce7-ea02-4adb-b938-31208cb31ec3@github.com> Message-ID: On Wed, 16 Jul 2025 13:49:56 GMT, Beno?t Maillard wrote: >> This PR prevents from hitting an assert when disabling method installation at the end of a successful compilation with the `-XX:-InstallMethods` flag. Previously `CompileBroker` failed to mark the `CompileTask` as complete when using this flag. >> >> ## Analysis >> >> We can see that the assert is triggered in `CompileBroker::invoke_compiler_on_method`: >> >> ```c++ >> if (!ci_env.failing() && !task->is_success()) { >> assert(ci_env.failure_reason() != nullptr, "expect failure reason"); >> assert(false, "compiler should always document failure: %s", ci_env.failure_reason()); >> // The compiler elected, without comment, not to register a result. >> // Do not attempt further compilations of this method. >> ci_env.record_method_not_compilable("compile failed"); >> } >> >> >> The `task->is_success()` call accesses the private `_is_success` field. >> This field is modified in `CompileTask::mark_success`. 
>> >> By setting a breakpoint there, and execute the program without `-XX:-InstallMethods`, we get the following stacktrace: >> >> CompileTask::mark_success compileTask.hpp:185 >> nmethod::post_compiled_method nmethod.cpp:2212 >> ciEnv::register_method ciEnv.cpp:1127 >> Compilation::install_code c1_Compilation.cpp:425 >> Compilation::compile_method c1_Compilation.cpp:488 >> Compilation::Compilation c1_Compilation.cpp:609 >> Compiler::compile_method c1_Compiler.cpp:262 >> CompileBroker::invoke_compiler_on_method compileBroker.cpp:2324 >> CompileBroker::compiler_thread_loop compileBroker.cpp:1968 >> CompilerThread::thread_entry compilerThread.cpp:67 >> JavaThread::thread_main_inner javaThread.cpp:773 >> JavaThread::run javaThread.cpp:758 >> Thread::call_run thread.cpp:243 >> thread_native_entry os_linux.cpp:868 >> >> >> We go up the stacktrace and see that in `Compilation::compile_method` we have: >> >> ```c++ >> if (should_install_code()) { >> // install code >> PhaseTraceTime timeit(_t_codeinstall); >> install_code(frame_size); >> } >> >> >> If we do not install methods after compilation, the code path that marks the success is never executed >> and therefore results in hitting the assert. >> >> ### Fix >> We simply mark the task as complete when `should_install_code()` evaluates to `false` in the block code above. >> After careful consideration, it was decided to simply get rid of the `-XX:-InstallMethods` flag. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8358573) >> - [... > > Beno?t Maillard has updated the pull request incrementally with three additional commits since the last revision: > > - 8358573: get rid of InstallMethods flags completely > - Revert "8358573: Add missing task success notification" > > This reverts commit cd91c7c06ba05aba3500b95ba1317539363aa63c. > - Revert "8358573: Add test for -XX:-InstallMethods" > > This reverts commit 6eab84718c3b60c2585bc2711c4bc8144472975b. Looks good. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26310#pullrequestreview-3028488308 From mchevalier at openjdk.org Thu Jul 17 08:48:35 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 17 Jul 2025 08:48:35 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph Message-ID: Some crashes are consequences of earlier misshaped ideal graphs, which could be detected earlier, closer to the source, before the possibly many transformations that lead to the crash. Let's verify that the ideal graph is well-shaped earlier then! I propose here such a feature. This runs after IGVN, because at this point, the graph, should be cleaned up for any weirdness happening earlier or during IGVN. This feature is enabled with the develop flag `VerifyIdealStructuralInvariants`. Open to renaming. No problem with me! This feature is only available in debug builds, and most of the code is even not compiled in product, since it uses some debug-only functions, such as `Node::dump` or `Node::Name`. For now, only local checks are implemented: they are checks that only look at a node and its neighborhood, wherever it happens in the graph. Typically: under a `If` node, we have a `IfTrue` and a `IfFalse`. To ease development, each check is implemented in its own class, independently of the others. Nevertheless, one needs to do always the same kind of things: checking there is an output of such type, checking there is N inputs, that the k-th input has such type... 
To ease writing such checks, in a readable way, and in a less error-prone way than pile of copy-pasted code that manually traverse the graph, I propose a set of compositional helpers to write patterns that can be matched against the ideal graph. Since these patterns are... patterns, so not related to a specific graph, they can be allocated once and forever. When used, one provides the node (called center) around which one want to check if the pattern holds. On top of making the description of pattern easier, these helpers allows nice printing in case of error, by showing the path from the center to the violating node. For instance (made up for the purpose of showing the formatting), a violation with a path climbing only inputs: 1 failure for node 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 At node 209 CountedLoopEnd === 182 208 [[ 210 197 ]] [lt] P=0,948966, C=23799,000000 !orig=[196] !jvms: StringLatin1::equals @ bci:12 (line 100) From path: [center] 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 <-(0)- 215 SafePoint === 210 1 7 1 1 216 37 54 185 [[ 211 ]] SafePoint !orig=186 !jvms: StringLatin1::equals @ bci:29 (line 100) <-(0)- 210 IfFalse === 209 [[ 215 216 ]] #0 !orig=198 !jvms: StringLatin1::equals @ bci:12 (line 100) <-(0)- 209 CountedLoopEnd === 182 208 [[ 210 197 ]] [lt] P=0,948966, C=23799,000000 !orig=[196] !jvms: StringLatin1::equals @ bci:12 (line 100) # OuterStripMinedLoopInvariants: Unexpected type: CountedLoopEnd. or with outputs: 1 failure for node 413 OuterStripMinedLoopEnd === 417 41 [[ 414 399 ]] P=0,960468, C=22887,000000 At node 415 OuterStripMinedLoop === 415 180 414 [[ 415 416 ]] From path: [center] 413 OuterStripMinedLoopEnd === 417 41 [[ 414 399 ]] P=0,960468, C=22887,000000 --> 414 IfTrue === 413 [[ 415 ]] #1 --> 415 OuterStripMinedLoop === 415 180 414 [[ 415 416 ]] # OuterStripMinedLoopInvariants: Non-unique output of expected type. Found: 0. So far a small set of checks are implemented: - IfProjections: check that `If` nodes have a `IfTrue` and `IfFalse` - PhiArity: check that `Phi` nodes have a `Region` node of the same arity as 0th input - ControlSuccessor: check that control nodes have the right amount of successors (usually 1, but 2 for if-related nodes...) - RegionSelfLoop: check that regions are either copy, or have a self loop as 0th input - CountedLoopInvariants: check the structure around the backedge of a counted loop - OuterStripMinedLoopInvariants: check the structure around `OuterStripMinedLoopEnd` - MultiBranchNodeOut: check that for `MultiBranch`, `outcnt` is smaller than or equal to `required_outcnt` (it is legitimate to have a smaller number of output, especially after some optimizations). Some of these checks have an additional subtlety: it's ok to have some wrong shape in dead code, for instance `IfProjections`. After a lot of investigation, it seems that some dead loops are not always detected eagerly and can make some control path survive longer, until being removed before loop opts. This seems to be by design to avoid traversing the whole graph everytime a region lose an input. It seems such misshape is harmless because they are not reachable from the inputs, and the cost of removing them would be prohibitive. To deal with such cases, when such a check fails, we check whether it happened in dead code. The dead of unreachable control nodes is lazily computed to answer that, and it's shared across checkers. 
While computing unreachable nodes is somewhat expensive, it seems to happen rarely in practice. This verification has found [JDK-8359344](https://bugs.openjdk.org/browse/JDK-8359344) and [JDK-8359121](https://bugs.openjdk.org/browse/JDK-8359121). It has been run on tiers 1 to 3, plus some internal testing and, after fixing the above-mentioned, it seems all passing! Related future: add more checks, should be easy. Less related future: could we imagine using similar patterns (without the error reporting mechanism) to use for optimizations, instead of manual traversing? It could make the code clearer to understand. We could also imagine optionally using such things in idealization to declare which patterns nodes are looking for, and if they have depth greater than 1, automatically adapting the enqueuing strategy without having to pimp `PhaseIterGVN::add_users_of_use_to_worklist` everytime. Could at least cover some basic (but numerous) cases. ------------- Commit messages: - Fix declaration - More comments - Improve printing and memory footprint - Improve printing - Handle NeverBranch in ControlSuccessor - Verify structural invariants Changes: https://git.openjdk.org/jdk/pull/26362/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26362&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8350864 Stats: 728 lines in 5 files changed: 728 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26362.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26362/head:pull/26362 PR: https://git.openjdk.org/jdk/pull/26362 From fgao at openjdk.org Thu Jul 17 09:04:49 2025 From: fgao at openjdk.org (Fei Gao) Date: Thu, 17 Jul 2025 09:04:49 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: <38f2bvFqiVNQGGpMif0iflVFD8wXnyw4SwtKxwi_Dmo=.276fb2fb-b80c-4ea7-a32f-c326294f442a@github.com> <1xAfD3mz5cbQpYtCYxoHqRQcOLadLKNHrvMUtFtFbGo=.34e5780a-e37a-427c-b745-1ed422c7a008@github.com> Message-ID: On Thu, 17 Jul 2025 02:41:20 GMT, Xiaohong Gong wrote: > Thanks! Regarding to the definitation of `LoadVectorGatherNode`, we'd better keep the vector type as it is for byte and short vectors. The SVE vector load gather instruction needs the type information. Additionally, the vector layout of the result should be matched with the vector type, right? We can handle this easily with pure backend implementation. But it seems not easy in mid-end IR level. BTW, `uzp1` is SVE specific instruction, we'd better define a common IR for that, which is also useful for other platforms that want to support subword gather API, right? That makes sense to me. Thanks for your explanation! > Maybe I can define the vector type of `LoadVectorGatherNode` as int vector type for subword types. An additional flag is necessary to denote whether it is a byte or short loading. It only finishes the gather operation (without any truncating). And define an IR like `VectorConcateNode` to merge the gather results. For cases that only one time of gather is needed, we can just return a type cast node like `VectorCastI2X`. Seems this will make the IR more common and code more clean. 
> > The implementation would like: > > * case-1 one gather: > > * `gather (bt: int)` + `cast (bt: byte|short)` > * case-2 two gathers: > > * step-1: `gather1 (bt: int)` + `gather2 (bt: int)` + `concate(gather1, gather2) (bt: short)` > * step-2: `cast (bt: byte)` // just for byte vectors > * case-3 four gathers: > > * step-1: `gather1 (bt: int)` + `gather2 (bt: int)` + `concate(gather1, gather2) (bt: short)` > * step-2: `gather3 (bt: int)` + `gather4 (bt: int)` + `concate(gather3, gather3) (bt: short)` > * step-3: `concate (bt: byte)` > > Or more commonly: > > * case-1 one gather: > > * `gather (bt: int)` + `cast (bt: byte|short)` > * case-2 two gathers: > > * step-1: `gather1 (bt: int)` + `gather2 (bt: int)` + `concate(gather1, gather2) (bt: byte|short)` > * case-3 four gathers: > > * step-1: `gather1 (bt: int)` + `gather2 (bt: int)` + `gather3 (bt: int)` + `gather4 (bt: int)` > * step-2: `concate(gather1, gather2, gather3, gather4) (bt: byte|short)` > > From the IR level, which one do you think is better? I like this idea! The first one looks better, in which `concate` would provide lower-level and more fine-grained semantics, allowing us to define fewer IR node types while supporting more scenarios. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3083240544 From xgong at openjdk.org Thu Jul 17 09:04:50 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 17 Jul 2025 09:04:50 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: <38f2bvFqiVNQGGpMif0iflVFD8wXnyw4SwtKxwi_Dmo=.276fb2fb-b80c-4ea7-a32f-c326294f442a@github.com> <1xAfD3mz5cbQpYtCYxoHqRQcOLadLKNHrvMUtFtFbGo=.34e5780a-e37a-427c-b745-1ed422c7a008@github.com> Message-ID: <4tejg5hp-eHBmAEvKbpTg_mv_TUYU5kg0HIccmWyac8=.3638758e-5000-4d1f-924f-abb4a21952c6@github.com> On Thu, 17 Jul 2025 08:59:08 GMT, Fei Gao wrote: > I like this idea! The first one looks better, in which `concate` would provide lower-level and more fine-grained semantics, allowing us to define fewer IR node types while supporting more scenarios. Yes, I agree with you. I'm now working on refactoring the IR based on the first idea. I will update the patch as soon as possible. Thanks for your valuable suggestion! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3083249910 From duke at openjdk.org Thu Jul 17 09:09:14 2025 From: duke at openjdk.org (erifan) Date: Thu, 17 Jul 2025 09:09:14 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4] In-Reply-To: References: Message-ID: > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is > relative smaller than that of `fromLong`. So this patch does the conversion for these cases. > > The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. > > Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. > > This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. 
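To make the equivalence between `VectorMask.fromLong` with an all-lanes constant and `maskAll(true)` concrete, here is a small, self-contained Java sketch. It is illustrative only (not one of this PR's benchmarks), assumes a 256-bit int species, and needs `--add-modules jdk.incubator.vector` to run.

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

public class FromLongVsMaskAll {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256; // 8 int lanes

    public static void main(String[] args) {
        // A long constant whose low SPECIES.length() bits are all ones sets every lane.
        long allLanes = -1L >>> (64 - SPECIES.length());

        VectorMask<Integer> viaFromLong = VectorMask.fromLong(SPECIES, allLanes);
        VectorMask<Integer> viaMaskAll  = SPECIES.maskAll(true);

        // Both masks have every lane set, so they round-trip to the same bits.
        System.out.println(viaFromLong.allTrue());                       // true
        System.out.println(viaFromLong.toLong() == viaMaskAll.toLong()); // true
    }
}
```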
> > As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like > > VectorMaskToLong (VectorLongToMask x) => x > > > Hence, this patch also added the following optimizations: > > VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 > VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 > > VectorMaskCast (VectorMaskCast x) => x > > And we can see noticeable performance improvement with the above optimizations for floating-point types. > > Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: > > Benchmark Unit Before Error After Error Uplift > microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 > microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 > microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 > microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 > microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 > microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 > microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 > microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 > > > Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: > > Benchmark Unit Before Error After Error Uplift > microMaskFromLongToLong_Double... erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: - Refactor the implementation Do the convertion in C2's IGVN phase to cover more cases. - Merge branch 'master' into JDK-8356760 - Simplify the test code - Address some review comments Add support for the following patterns: toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) toLong(maskAll(false)) => 0 And add more test cases. - Merge branch 'master' into JDK-8356760 - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is relative smaller than that of `fromLong`. This patch does the conversion for these cases if `l` is a compile time constant. And this conversion also enables further optimizations that recognize maskAll patterns, see [1]. Some JTReg test cases are added to ensure the optimization is effective. I tried many different ways to write a JMH benchmark, but failed. Since the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific compile-time constant, the statement will be hoisted out of the loop. If we don't use a loop, the hotspot will become other instructions, and no obvious performance change was observed. However, combined with the optimization of [1], we can observe a performance improvement of about 7% on both aarch64 and x64. The patch was tested on both aarch64 and x64, all of tier1 tier2 and tier3 tests passed. 
[1] https://github.com/openjdk/jdk/pull/24674 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25793/files - new: https://git.openjdk.org/jdk/pull/25793/files/9f07d5c7..8ebe5e56 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=02-03 Stats: 21470 lines in 667 files changed: 10937 ins; 6238 del; 4295 mod Patch: https://git.openjdk.org/jdk/pull/25793.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25793/head:pull/25793 PR: https://git.openjdk.org/jdk/pull/25793 From duke at openjdk.org Thu Jul 17 09:09:52 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Thu, 17 Jul 2025 09:09:52 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v10] In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 05:57:33 GMT, Fei Yang wrote: > Looking at the JMH numbers, it's interesting to find that `-XX:DisableIntrinsic=_vectorizedHashCode` outperforms `-XX:-UseRVV`. If that is the case, then why would we want the scalar version (that is `C2_MacroAssembler::arrays_hashcode()`)? You are right: the non-RVV version of intrinsic performs worse on BPI-F3 hardware with size > 70, though originally it was better on StarFive JH7110 and T-Head RVB-ICE, please see https://github.com/openjdk/jdk/pull/16629. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-3083267477 From duke at openjdk.org Thu Jul 17 09:12:08 2025 From: duke at openjdk.org (erifan) Date: Thu, 17 Jul 2025 09:12:08 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: <7QVWVj5vpSB42THa2rx-oxMqhH76qMZ5MBJjindRiLo=.b825076a-aa9c-4b86-94b6-0a593f2240ac@github.com> References: <_RERljqu_FG7ZyneAk7Thd-9TwED18pQpEBz_i105fY=.b8948a23-273a-49f6-b9cb-6b611a5eedc6@github.com> <6SXA9ZrXBDhZLyXP3lXbkpl4dl3iocvDpzPrUpIQOl8=.9b025be2-848b-4b78-a5e4-929cb7e9f798@github.com> <7QVWVj5vpSB42THa2rx-oxMqhH76qMZ5MBJjindRiLo=.b825076a-aa9c-4b86-94b6-0a593f2240ac@github.com> Message-ID: On Thu, 10 Jul 2025 08:06:18 GMT, erifan wrote: >>> What if during iterative GVN a constant -1 seeps through IR graph and gets connected to the input of VectorLongToMaskNode, you won't be able to create maskAll true in that case? >> >> Yes, this PR doesn't support this case. Maybe we should do this optimization in `ideal`. If `VectorLongToMask` is not supported, then try to convert it to `maskAll` or `Replicate` in intrinsic. >> >>> Do you see any advantage of doing this at intrinsic layer over entirely handling it in Java implimentation by simply modifying the opcode of fromBitsCoerced to MODE_BROADCAST from existing MODE_BITS_COERCED_LONG_TO_MASK for 0 or -1 input. >> >> I had tried this method and gave it up, because it has up to 34% performance regression for specific cases on x64. > > OK. But in order to cover various cases, the implementation may be a bit troublesome. The solution I thought of is to **check whether the architecture supports VectorLongToMask, MaskAll and Replicate in `LibraryCallKit::inline_vector_frombits_coerced`. If it does, generate VectorLongToMask, and then convert it to MaskAll or Replicate in IGVN**. This is similar to the current implementation of vector rotate. > > At the same time, this conversion may affect some other optimizations, such as `VectorMaskToLong(VectorLongToMask (x)) => x` and `VectorStoreMask(VectorLoadMask (x)) => x`. So we also need to fix these effects. 
I have refactor the implementation, please help take a look, thanks~ ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2212787090 From duke at openjdk.org Thu Jul 17 09:16:50 2025 From: duke at openjdk.org (erifan) Date: Thu, 17 Jul 2025 09:16:50 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4] In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 09:09:14 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is >> relative smaller than that of `fromLong`. So this patch does the conversion for these cases. >> >> The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. >> >> Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. >> >> This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like >> >> VectorMaskToLong (VectorLongToMask x) => x >> >> >> Hence, this patch also added the following optimizations: >> >> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> >> VectorMaskCast (VectorMaskCast x) => x >> >> And we can see noticeable performance improvement with the above optimizations for floating-point types. >> >> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Error After Error Uplift >> microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 >> microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 >> microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 >> microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 >> microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 >> microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 >> microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 >> microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 >> >> >> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: >> >> Benchm... > > erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - Refactor the implementation > > Do the convertion in C2's IGVN phase to cover more cases. 
> - Merge branch 'master' into JDK-8356760 > - Simplify the test code > - Address some review comments > > Add support for the following patterns: > toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) > toLong(maskAll(false)) => 0 > > And add more test cases. > - Merge branch 'master' into JDK-8356760 > - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases > > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would > set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent > to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is > relative smaller than that of `fromLong`. This patch does the conversion > for these cases if `l` is a compile time constant. > > And this conversion also enables further optimizations that recognize > maskAll patterns, see [1]. > > Some JTReg test cases are added to ensure the optimization is effective. > > I tried many different ways to write a JMH benchmark, but failed. Since > the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific > compile-time constant, the statement will be hoisted out of the loop. > If we don't use a loop, the hotspot will become other instructions, and > no obvious performance change was observed. However, combined with the > optimization of [1], we can observe a performance improvement of about > 7% on both aarch64 and x64. > > The patch was tested on both aarch64 and x64, all of tier1 tier2 and > tier3 tests passed. > > [1] https://github.com/openjdk/jdk/pull/24674 As @jatin-bhateja suggested, I have refactored the implementation and updated the commit message, please help review this PR, thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3083287653 From shade at openjdk.org Thu Jul 17 09:38:59 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 17 Jul 2025 09:38:59 GMT Subject: RFR: 8358573: Remove the -XX:-InstallMethods debug flag [v2] In-Reply-To: <5wFwGsfzeYOy0uMPSbg9TfzEzZqhgPYLcY26FF7My9s=.1b474ce7-ea02-4adb-b938-31208cb31ec3@github.com> References: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> <5wFwGsfzeYOy0uMPSbg9TfzEzZqhgPYLcY26FF7My9s=.1b474ce7-ea02-4adb-b938-31208cb31ec3@github.com> Message-ID: On Wed, 16 Jul 2025 13:49:56 GMT, Beno?t Maillard wrote: >> This PR prevents from hitting an assert when disabling method installation at the end of a successful compilation with the `-XX:-InstallMethods` flag. Previously `CompileBroker` failed to mark the `CompileTask` as complete when using this flag. >> >> ## Analysis >> >> We can see that the assert is triggered in `CompileBroker::invoke_compiler_on_method`: >> >> ```c++ >> if (!ci_env.failing() && !task->is_success()) { >> assert(ci_env.failure_reason() != nullptr, "expect failure reason"); >> assert(false, "compiler should always document failure: %s", ci_env.failure_reason()); >> // The compiler elected, without comment, not to register a result. >> // Do not attempt further compilations of this method. >> ci_env.record_method_not_compilable("compile failed"); >> } >> >> >> The `task->is_success()` call accesses the private `_is_success` field. >> This field is modified in `CompileTask::mark_success`. 
>> >> By setting a breakpoint there, and execute the program without `-XX:-InstallMethods`, we get the following stacktrace: >> >> CompileTask::mark_success compileTask.hpp:185 >> nmethod::post_compiled_method nmethod.cpp:2212 >> ciEnv::register_method ciEnv.cpp:1127 >> Compilation::install_code c1_Compilation.cpp:425 >> Compilation::compile_method c1_Compilation.cpp:488 >> Compilation::Compilation c1_Compilation.cpp:609 >> Compiler::compile_method c1_Compiler.cpp:262 >> CompileBroker::invoke_compiler_on_method compileBroker.cpp:2324 >> CompileBroker::compiler_thread_loop compileBroker.cpp:1968 >> CompilerThread::thread_entry compilerThread.cpp:67 >> JavaThread::thread_main_inner javaThread.cpp:773 >> JavaThread::run javaThread.cpp:758 >> Thread::call_run thread.cpp:243 >> thread_native_entry os_linux.cpp:868 >> >> >> We go up the stacktrace and see that in `Compilation::compile_method` we have: >> >> ```c++ >> if (should_install_code()) { >> // install code >> PhaseTraceTime timeit(_t_codeinstall); >> install_code(frame_size); >> } >> >> >> If we do not install methods after compilation, the code path that marks the success is never executed >> and therefore results in hitting the assert. >> >> ### Fix >> We simply mark the task as complete when `should_install_code()` evaluates to `false` in the block code above. >> After careful consideration, it was decided to simply get rid of the `-XX:-InstallMethods` flag. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8358573) >> - [... > > Beno?t Maillard has updated the pull request incrementally with three additional commits since the last revision: > > - 8358573: get rid of InstallMethods flags completely > - Revert "8358573: Add missing task success notification" > > This reverts commit cd91c7c06ba05aba3500b95ba1317539363aa63c. > - Revert "8358573: Add test for -XX:-InstallMethods" > > This reverts commit 6eab84718c3b60c2585bc2711c4bc8144472975b. I agree CTW does not need this flag. We have (or will have) enough internal APIs to avoid installing the code, if we need it for any reason. ------------- Marked as reviewed by shade (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26310#pullrequestreview-3028738102 From vlivanov at openjdk.org Thu Jul 17 09:45:47 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 17 Jul 2025 09:45:47 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 07:25:10 GMT, Marc Chevalier wrote: > Some crashes are consequences of earlier misshaped ideal graphs, which could be detected earlier, closer to the source, before the possibly many transformations that lead to the crash. > > Let's verify that the ideal graph is well-shaped earlier then! I propose here such a feature. This runs after IGVN, because at this point, the graph, should be cleaned up for any weirdness happening earlier or during IGVN. > > This feature is enabled with the develop flag `VerifyIdealStructuralInvariants`. Open to renaming. No problem with me! This feature is only available in debug builds, and most of the code is even not compiled in product, since it uses some debug-only functions, such as `Node::dump` or `Node::Name`. > > For now, only local checks are implemented: they are checks that only look at a node and its neighborhood, wherever it happens in the graph. Typically: under a `If` node, we have a `IfTrue` and a `IfFalse`. 
To ease development, each check is implemented in its own class, independently of the others. Nevertheless, one needs to do always the same kind of things: checking there is an output of such type, checking there is N inputs, that the k-th input has such type... To ease writing such checks, in a readable way, and in a less error-prone way than pile of copy-pasted code that manually traverse the graph, I propose a set of compositional helpers to write patterns that can be matched against the ideal graph. Since these patterns are... patterns, so not related to a specific graph, they can be allocated once and forever. When used, one provides the node (called center) around which one want to check if the pattern holds. > > On top of making the description of pattern easier, these helpers allows nice printing in case of error, by showing the path from the center to the violating node. For instance (made up for the purpose of showing the formatting), a violation with a path climbing only inputs: > > 1 failure for node > 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 > At node > 209 CountedLoopEnd === 182 208 [[ 210 197 ]] [lt] P=0,948966, C=23799,000000 !orig=[196] !jvms: StringLatin1::equals @ bci:12 (line 100) > From path: > [center] 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 > <-(0)- 215 SafePoint === 210 1 7 1 1 216 37 54 185 [[ 211 ]] SafePoint !orig=186 !jvms: StringLatin1::equals @ bci:29 (line 100) > <-(0)- 210 IfFalse === 209 [[ 215 216 ]] #0 !orig=198 !jvms: StringL... Very nice! Some high-level comments: * IMO it's better to have node-specific invariant checks co-located with corresponding node (as `Node::verify()` maybe?); it would make it clearer what are the expectations when changing the implementation. * on naming: IMO `VerifyIdealGraph` would clearly describe what the logic does, fits existing conventions well, and easy to find ------------- PR Review: https://git.openjdk.org/jdk/pull/26362#pullrequestreview-3028766202 From luhenry at openjdk.org Thu Jul 17 10:18:51 2025 From: luhenry at openjdk.org (Ludovic Henry) Date: Thu, 17 Jul 2025 10:18:51 GMT Subject: RFR: 8362284: RISC-V: cleanup NativeMovRegMem [v2] In-Reply-To: References: <7QlEqzQUzoDK6NycLx0HECjospeft1MwbOQh7aHVq8U=.efa2d92e-a024-4447-9565-8f6ee7ee4774@github.com> Message-ID: <0GTMPYRNr_vi-NOJY3b5KYX68iXKPB3A4x_rv6K1J2c=.e824b546-3c1b-4945-ad0d-102aca58d26b@github.com> On Wed, 16 Jul 2025 10:48:23 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review this simple patch? >> >> NativeMovRegMem on riscv is actually dead code, but still needed in case of compilation of C1. >> So make the code as simple as possible to avoid any reading and maintainance effort. >> >> No tests, as `offset()` and `set_offset()` were Unimplemented and used in C1 and never triggered before. >> >> Thanks! > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > use nullptr Marked as reviewed by luhenry (Committer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/26328#pullrequestreview-3028884415 From mli at openjdk.org Thu Jul 17 10:48:55 2025 From: mli at openjdk.org (Hamlin Li) Date: Thu, 17 Jul 2025 10:48:55 GMT Subject: Integrated: 8362284: RISC-V: cleanup NativeMovRegMem In-Reply-To: <7QlEqzQUzoDK6NycLx0HECjospeft1MwbOQh7aHVq8U=.efa2d92e-a024-4447-9565-8f6ee7ee4774@github.com> References: <7QlEqzQUzoDK6NycLx0HECjospeft1MwbOQh7aHVq8U=.efa2d92e-a024-4447-9565-8f6ee7ee4774@github.com> Message-ID: On Tue, 15 Jul 2025 18:41:56 GMT, Hamlin Li wrote: > Hi, > Can you help to review this simple patch? > > NativeMovRegMem on riscv is actually dead code, but still needed in case of compilation of C1. > So make the code as simple as possible to avoid any reading and maintainance effort. > > No tests, as `offset()` and `set_offset()` were Unimplemented and used in C1 and never triggered before. > > Thanks! This pull request has now been integrated. Changeset: 3fd89be6 Author: Hamlin Li URL: https://git.openjdk.org/jdk/commit/3fd89be6d1a51b6fc99f4c0b5daba7a4bd64a08e Stats: 40 lines in 2 files changed: 0 ins; 33 del; 7 mod 8362284: RISC-V: cleanup NativeMovRegMem Reviewed-by: fyang, luhenry ------------- PR: https://git.openjdk.org/jdk/pull/26328 From mli at openjdk.org Thu Jul 17 11:14:00 2025 From: mli at openjdk.org (Hamlin Li) Date: Thu, 17 Jul 2025 11:14:00 GMT Subject: RFR: 8362493: Cleanup CodeBuffer::copy_relocations_to Message-ID: Hi, Can you help to review this simple patch? `CodeBuffer::copy_relocations_to(address buf, csize_t buf_limit, bool only_inst)` is only used in `copy_relocations_to(CodeBlob* dest)` which passes false to only_inst, so the former one should be able to be simplified. Thank you! ------------- Commit messages: - initial commit Changes: https://git.openjdk.org/jdk/pull/26366/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26366&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362493 Stats: 10 lines in 2 files changed: 1 ins; 6 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/26366.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26366/head:pull/26366 PR: https://git.openjdk.org/jdk/pull/26366 From mchevalier at openjdk.org Thu Jul 17 11:21:54 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 17 Jul 2025 11:21:54 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 09:42:47 GMT, Vladimir Ivanov wrote: > IMO it's better to have node-specific invariant checks co-located with corresponding node (as Node::verify() maybe?); it would make it clearer what are the expectations when changing the implementation. I understand the motivation, but I'm not sure what to do in every case. For instance, when the pattern is not so small (like strip mining), it's hard to associate the invariant with a single node: many are involved, and it's not really describable as the expectation of a single node. Of course, we could split the pattern into a lot of sub-patterns, centered each around each node type, but then, we lose the overview of the structure, and it becomes context-free (e.g. a IfFalse must have a CountedLoopEnd input only when it comes before a safepoint before a OuterStripMinedLoopEnd, but not in general). Another problematic case is the control successor check that has special handling for some kinds, that could be relocated to the said node types, but for the general case, it simply tests whether the node is a CFG node. 
One could still do that in `Node`, but it is then not tied to a specific node type, and I feel it bloats the `Node` class (which is already not so small). And then, I fear it makes the invariant harder to read since it will be distributed across many node types: I would need to find overrides of `Verify`. When working on a given node, it may be easier to see what I need to guarantee (or change the invariant), but when working on something else, it makes it harder to find the invariants I can actually rely on, because they could always be overridden in a derived class.

I also have a readability concern. Even if we sort them by node type, we then mix the implementation of all invariants of a given node in a single method, making it extra hard to understand the big picture, and when looking for overrides, I will find some, but maybe they won't be about the invariant I'm interested in.

And a code/maintenance concern. If I have a default implementation of the control successor check in `Node`, among other such general checks, I'm tempted to override it in `IfNode` to accept having more than one successor, but then, how do I perform the other general checks that still hold? I can't call `Node::Verify` since it will enforce the wrong number of successors. I could put these checks in another method and call it from `IfNode::Verify`, but that has other annoying consequences: if they are all in the same side method, I can't customize another of these checks in another node type; if each check is in its own method, all called from `Node::Verify`, I need to repeat the call in the overrides of `Verify` for all the checks one will add in the future... Overall, it seems risky to maintain. We could also just call `Node::Verify` and have some handling there to skip some steps for some node types, but I feel like that defeats the point of having invariants closer to the type.

Overall, it seems to me that it's beneficial to move checks to the node types if:
- the pattern is small and clearly has a privileged node, so we won't be surprised by having the invariant implemented in another node type
- the pattern doesn't have special cases for sub-types.

For instance, `PhiArity` would be a good candidate (about a special kind of node, no context needed, no exception). So, maybe a solution would be to split the checks into two sources: some that are like this, implemented directly in the node, and some that are less local (not about a single node, but about bigger shapes) or need more special cases, and that we keep standalone. I don't think having two sources of invariants is a problem at all.

> on naming: IMO VerifyIdealGraph would clearly describe what the logic does, fits existing conventions well, and easy to find

Sure, fine with me! I'd be curious to see if somebody has other ideas.
------------- PR Comment: https://git.openjdk.org/jdk/pull/26362#issuecomment-3083672692 PR Comment: https://git.openjdk.org/jdk/pull/26362#issuecomment-3083677141 From fgao at openjdk.org Thu Jul 17 11:30:48 2025 From: fgao at openjdk.org (Fei Gao) Date: Thu, 17 Jul 2025 11:30:48 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: <4tejg5hp-eHBmAEvKbpTg_mv_TUYU5kg0HIccmWyac8=.3638758e-5000-4d1f-924f-abb4a21952c6@github.com> References: <38f2bvFqiVNQGGpMif0iflVFD8wXnyw4SwtKxwi_Dmo=.276fb2fb-b80c-4ea7-a32f-c326294f442a@github.com> <1xAfD3mz5cbQpYtCYxoHqRQcOLadLKNHrvMUtFtFbGo=.34e5780a-e37a-427c-b745-1ed422c7a008@github.com> <4tejg5hp-eHBmAEvKbpTg_mv_TUYU5kg0HIccmWyac8=.3638758e-5000-4d1f-924f-abb4a21952c6@github.com> Message-ID: On Thu, 17 Jul 2025 09:02:00 GMT, Xiaohong Gong wrote: > > Yes, I agree with you. I'm now working on refactoring the IR based on the first idea. I will update the patch as soon as possible. Thanks for your valuable suggestion! Thanks! I?d suggest also highlighting `aarch64` in the JBS title, so others who are interested won?t miss it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3083699788 From mhaessig at openjdk.org Thu Jul 17 12:16:48 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Thu, 17 Jul 2025 12:16:48 GMT Subject: RFR: 8362493: Cleanup CodeBuffer::copy_relocations_to In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 11:09:09 GMT, Hamlin Li wrote: > Hi, > Can you help to review this simple patch? > > `CodeBuffer::copy_relocations_to(address buf, csize_t buf_limit, bool only_inst)` is only used in `copy_relocations_to(CodeBlob* dest)` which passes false to only_inst, so the former one should be able to be simplified. > > Thank you! Thank you for working on this cleanup, @Hamlin-Li! It looks good to me. I kicked off some testing on our side and will let you know what the results are. ------------- PR Review: https://git.openjdk.org/jdk/pull/26366#pullrequestreview-3029262121 From duke at openjdk.org Thu Jul 17 12:29:54 2025 From: duke at openjdk.org (duke) Date: Thu, 17 Jul 2025 12:29:54 GMT Subject: RFR: 8358573: Remove the -XX:-InstallMethods debug flag [v2] In-Reply-To: <5wFwGsfzeYOy0uMPSbg9TfzEzZqhgPYLcY26FF7My9s=.1b474ce7-ea02-4adb-b938-31208cb31ec3@github.com> References: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> <5wFwGsfzeYOy0uMPSbg9TfzEzZqhgPYLcY26FF7My9s=.1b474ce7-ea02-4adb-b938-31208cb31ec3@github.com> Message-ID: On Wed, 16 Jul 2025 13:49:56 GMT, Beno?t Maillard wrote: >> This PR prevents from hitting an assert when disabling method installation at the end of a successful compilation with the `-XX:-InstallMethods` flag. Previously `CompileBroker` failed to mark the `CompileTask` as complete when using this flag. >> >> ## Analysis >> >> We can see that the assert is triggered in `CompileBroker::invoke_compiler_on_method`: >> >> ```c++ >> if (!ci_env.failing() && !task->is_success()) { >> assert(ci_env.failure_reason() != nullptr, "expect failure reason"); >> assert(false, "compiler should always document failure: %s", ci_env.failure_reason()); >> // The compiler elected, without comment, not to register a result. >> // Do not attempt further compilations of this method. >> ci_env.record_method_not_compilable("compile failed"); >> } >> >> >> The `task->is_success()` call accesses the private `_is_success` field. >> This field is modified in `CompileTask::mark_success`. 
>> >> By setting a breakpoint there, and execute the program without `-XX:-InstallMethods`, we get the following stacktrace: >> >> CompileTask::mark_success compileTask.hpp:185 >> nmethod::post_compiled_method nmethod.cpp:2212 >> ciEnv::register_method ciEnv.cpp:1127 >> Compilation::install_code c1_Compilation.cpp:425 >> Compilation::compile_method c1_Compilation.cpp:488 >> Compilation::Compilation c1_Compilation.cpp:609 >> Compiler::compile_method c1_Compiler.cpp:262 >> CompileBroker::invoke_compiler_on_method compileBroker.cpp:2324 >> CompileBroker::compiler_thread_loop compileBroker.cpp:1968 >> CompilerThread::thread_entry compilerThread.cpp:67 >> JavaThread::thread_main_inner javaThread.cpp:773 >> JavaThread::run javaThread.cpp:758 >> Thread::call_run thread.cpp:243 >> thread_native_entry os_linux.cpp:868 >> >> >> We go up the stacktrace and see that in `Compilation::compile_method` we have: >> >> ```c++ >> if (should_install_code()) { >> // install code >> PhaseTraceTime timeit(_t_codeinstall); >> install_code(frame_size); >> } >> >> >> If we do not install methods after compilation, the code path that marks the success is never executed >> and therefore results in hitting the assert. >> >> ### Fix >> We simply mark the task as complete when `should_install_code()` evaluates to `false` in the block code above. >> After careful consideration, it was decided to simply get rid of the `-XX:-InstallMethods` flag. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8358573) >> - [... > > Beno?t Maillard has updated the pull request incrementally with three additional commits since the last revision: > > - 8358573: get rid of InstallMethods flags completely > - Revert "8358573: Add missing task success notification" > > This reverts commit cd91c7c06ba05aba3500b95ba1317539363aa63c. > - Revert "8358573: Add test for -XX:-InstallMethods" > > This reverts commit 6eab84718c3b60c2585bc2711c4bc8144472975b. @benoitmaillard Your change (at version 2da73a5df46f47addc9ae6a9d32c69be1a9fc2a2) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26310#issuecomment-3083877893 From bmaillard at openjdk.org Thu Jul 17 12:43:08 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Thu, 17 Jul 2025 12:43:08 GMT Subject: Integrated: 8358573: Remove the -XX:-InstallMethods debug flag In-Reply-To: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> References: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> Message-ID: On Tue, 15 Jul 2025 09:21:53 GMT, Beno?t Maillard wrote: > This PR prevents from hitting an assert when disabling method installation at the end of a successful compilation with the `-XX:-InstallMethods` flag. Previously `CompileBroker` failed to mark the `CompileTask` as complete when using this flag. > > ## Analysis > > We can see that the assert is triggered in `CompileBroker::invoke_compiler_on_method`: > > ```c++ > if (!ci_env.failing() && !task->is_success()) { > assert(ci_env.failure_reason() != nullptr, "expect failure reason"); > assert(false, "compiler should always document failure: %s", ci_env.failure_reason()); > // The compiler elected, without comment, not to register a result. > // Do not attempt further compilations of this method. 
> ci_env.record_method_not_compilable("compile failed"); > } > > > The `task->is_success()` call accesses the private `_is_success` field. > This field is modified in `CompileTask::mark_success`. > > By setting a breakpoint there, and execute the program without `-XX:-InstallMethods`, we get the following stacktrace: > > CompileTask::mark_success compileTask.hpp:185 > nmethod::post_compiled_method nmethod.cpp:2212 > ciEnv::register_method ciEnv.cpp:1127 > Compilation::install_code c1_Compilation.cpp:425 > Compilation::compile_method c1_Compilation.cpp:488 > Compilation::Compilation c1_Compilation.cpp:609 > Compiler::compile_method c1_Compiler.cpp:262 > CompileBroker::invoke_compiler_on_method compileBroker.cpp:2324 > CompileBroker::compiler_thread_loop compileBroker.cpp:1968 > CompilerThread::thread_entry compilerThread.cpp:67 > JavaThread::thread_main_inner javaThread.cpp:773 > JavaThread::run javaThread.cpp:758 > Thread::call_run thread.cpp:243 > thread_native_entry os_linux.cpp:868 > > > We go up the stacktrace and see that in `Compilation::compile_method` we have: > > ```c++ > if (should_install_code()) { > // install code > PhaseTraceTime timeit(_t_codeinstall); > install_code(frame_size); > } > > > If we do not install methods after compilation, the code path that marks the success is never executed > and therefore results in hitting the assert. > > ### Fix > We simply mark the task as complete when `should_install_code()` evaluates to `false` in the block code above. > After careful consideration, it was decided to simply get rid of the `-XX:-InstallMethods` flag. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8358573) > - [x] tier1-3, plus some internal testing > - [x] Added a test that starts the VM with the `-XX:-InstallMethods` flag > > ... This pull request has now been integrated. Changeset: 1d73f884 Author: Beno?t Maillard Committer: Manuel H?ssig URL: https://git.openjdk.org/jdk/commit/1d73f8842a6aa0fae7c7960eb5720447a1224792 Stats: 4 lines in 2 files changed: 0 ins; 3 del; 1 mod 8358573: Remove the -XX:-InstallMethods debug flag Reviewed-by: dlong, thartmann, shade ------------- PR: https://git.openjdk.org/jdk/pull/26310 From duke at openjdk.org Thu Jul 17 12:43:39 2025 From: duke at openjdk.org (Samuel Chee) Date: Thu, 17 Jul 2025 12:43:39 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet [v2] In-Reply-To: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: > AtomicLong.CompareAndSet has the following assembly dump snippet which gets emitted from the intermediary LIRGenerator::atomic_cmpxchg: > > ;; cmpxchg { > 0x0000e708d144cf60: mov x8, x2 > 0x0000e708d144cf64: casal x8, x3, [x0] > 0x0000e708d144cf68: cmp x8, x2 > ;; 0x1F1F1F1F1F1F1F1F > 0x0000e708d144cf6c: mov x8, #0x1f1f1f1f1f1f1f1f > ;; } cmpxchg > 0x0000e708d144cf70: cset x8, ne // ne = any > 0x0000e708d144cf74: dmb ish > > > According to the Oracle Java Specification, AtomicLong.CompareAndSet [1] has the same memory effects as specified by VarHandle.compareAndSet which has the following effects: [2] > >> Atomically sets the value of a variable to the >> newValue with the memory semantics of setVolatile if >> the variable's current value, referred to as the witness >> value, == the expectedValue, as accessed with the memory >> semantics of getVolatile. 
> > > > Hence the release on the store due to setVolatile only occurs if the compare is successful. Since casal already satisfies these requirements, the dmb does not need to occur to ensure memory ordering in case the compare fails and a release does not happen. > > Hence we remove the dmb from both casl and casw (same logic applies to the non-long variant) > > This is also reflected by C2 not having a dmb for the same respective method. > > [1] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/util/concurrent/atomic/AtomicLong.html#compareAndSet(long,long) > [2] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/lang/invoke/VarHandle.html#compareAndSet(java.lang.Object...) Samuel Chee has updated the pull request incrementally with one additional commit since the last revision: Add back in dmb membar for non-LSE Change-Id: Ie64565420a1758d3191eaebed82c80584ce54ef6 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26000/files - new: https://git.openjdk.org/jdk/pull/26000/files/577e9a20..181ce0b7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26000&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26000&range=00-01 Stats: 6 lines in 1 file changed: 6 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26000.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26000/head:pull/26000 PR: https://git.openjdk.org/jdk/pull/26000 From duke at openjdk.org Thu Jul 17 12:45:03 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Thu, 17 Jul 2025 12:45:03 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v10] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 14:05:25 GMT, Yuri Gaevsky wrote: >> The patch adds possibility to use RVV instructions for faster vectorizedHashCode calculations on RVV v1.0.0 capable hardware. >> >> Testing: hotspot/jtreg/compiler/ under QEMU-8.1 with RVV v1.0.0. > > Yuri Gaevsky has updated the pull request incrementally with one additional commit since the last revision: > > - removed tail processing with RVV instructions as simple scalar loop provides in general better results > > Looking at the JMH numbers, it's interesting to find that `-XX:DisableIntrinsic=_vectorizedHashCode` outperforms `-XX:-UseRVV`. If that is the case, then why would we want the scalar version (that is `C2_MacroAssembler::arrays_hashcode()`)? > > You are right: the non-RVV version of intrinsic performs worse on BPI-F3 hardware with size > 70, though originally it was better on StarFive JH7110 and T-Head RVB-ICE, please see #16629. Hm, it is still good on Lichee Pi 4A: $ ( for i in "-XX:DisableIntrinsic=_vectorizedHashCode" " " ; do ( echo "--- ${i} ---" && ${JAVA_HOME}/bin/java -jar benchmarks.jar --jvmArgs="-XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions ${i}" org.openjdk.bench.java.lang.ArraysHashCode.ints -p size=1,5,10,20,30,40,50,60,70,80,90,100,200,300 -f 3 -r 1 -w 1 -wi 10 -i 10 2>&1 | tail -15 ) done ) --- -XX:DisableIntrinsic=_vectorizedHashCode --- Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.ints 1 avgt 30 51.709 ? 3.815 ns/op ArraysHashCode.ints 5 avgt 30 68.146 ? 1.833 ns/op ArraysHashCode.ints 10 avgt 30 89.217 ? 0.496 ns/op ArraysHashCode.ints 20 avgt 30 140.807 ? 9.335 ns/op ArraysHashCode.ints 30 avgt 30 172.030 ? 4.025 ns/op ArraysHashCode.ints 40 avgt 30 222.927 ? 10.342 ns/op ArraysHashCode.ints 50 avgt 30 251.719 ? 0.686 ns/op ArraysHashCode.ints 60 avgt 30 305.947 ? 10.532 ns/op ArraysHashCode.ints 70 avgt 30 347.602 ? 
7.024 ns/op ArraysHashCode.ints 80 avgt 30 382.057 ? 1.520 ns/op ArraysHashCode.ints 90 avgt 30 426.022 ? 31.800 ns/op ArraysHashCode.ints 100 avgt 30 457.737 ? 0.652 ns/op ArraysHashCode.ints 200 avgt 30 913.501 ? 3.258 ns/op ArraysHashCode.ints 300 avgt 30 1297.355 ? 2.383 ns/op --- --- Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.ints 1 avgt 30 50.141 ? 0.463 ns/op ArraysHashCode.ints 5 avgt 30 62.921 ? 2.538 ns/op ArraysHashCode.ints 10 avgt 30 77.686 ? 2.577 ns/op ArraysHashCode.ints 20 avgt 30 102.736 ? 0.136 ns/op ArraysHashCode.ints 30 avgt 30 137.592 ? 4.232 ns/op ArraysHashCode.ints 40 avgt 30 157.376 ? 0.302 ns/op ArraysHashCode.ints 50 avgt 30 196.068 ? 3.812 ns/op ArraysHashCode.ints 60 avgt 30 212.956 ? 2.075 ns/op ArraysHashCode.ints 70 avgt 30 251.260 ? 1.176 ns/op ArraysHashCode.ints 80 avgt 30 266.223 ? 0.655 ns/op ArraysHashCode.ints 90 avgt 30 313.465 ? 6.810 ns/op ArraysHashCode.ints 100 avgt 30 373.024 ? 1.005 ns/op ArraysHashCode.ints 200 avgt 30 620.723 ? 24.313 ns/op ArraysHashCode.ints 300 avgt 30 881.358 ? 1.320 ns/op ------------- PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-3083927127 From duke at openjdk.org Thu Jul 17 12:46:53 2025 From: duke at openjdk.org (Samuel Chee) Date: Thu, 17 Jul 2025 12:46:53 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet [v2] In-Reply-To: References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: On Thu, 17 Jul 2025 12:43:39 GMT, Samuel Chee wrote: >> AtomicLong.CompareAndSet has the following assembly dump snippet which gets emitted from the intermediary LIRGenerator::atomic_cmpxchg: >> >> ;; cmpxchg { >> 0x0000e708d144cf60: mov x8, x2 >> 0x0000e708d144cf64: casal x8, x3, [x0] >> 0x0000e708d144cf68: cmp x8, x2 >> ;; 0x1F1F1F1F1F1F1F1F >> 0x0000e708d144cf6c: mov x8, #0x1f1f1f1f1f1f1f1f >> ;; } cmpxchg >> 0x0000e708d144cf70: cset x8, ne // ne = any >> 0x0000e708d144cf74: dmb ish >> >> >> According to the Oracle Java Specification, AtomicLong.CompareAndSet [1] has the same memory effects as specified by VarHandle.compareAndSet which has the following effects: [2] >> >>> Atomically sets the value of a variable to the >>> newValue with the memory semantics of setVolatile if >>> the variable's current value, referred to as the witness >>> value, == the expectedValue, as accessed with the memory >>> semantics of getVolatile. >> >> >> >> Hence the release on the store due to setVolatile only occurs if the compare is successful. Since casal already satisfies these requirements, the dmb does not need to occur to ensure memory ordering in case the compare fails and a release does not happen. >> >> Hence we remove the dmb from both casl and casw (same logic applies to the non-long variant) >> >> This is also reflected by C2 not having a dmb for the same respective method. >> >> [1] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/util/concurrent/atomic/AtomicLong.html#compareAndSet(long,long) >> [2] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/lang/invoke/VarHandle.html#compareAndSet(java.lang.Object...) > > Samuel Chee has updated the pull request incrementally with one additional commit since the last revision: > > Add back in dmb membar for non-LSE > > Change-Id: Ie64565420a1758d3191eaebed82c80584ce54ef6 Have just updated with change to have it still emit a dmb when LSE is not enabled. 
Should be good to go now hopefully :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/26000#issuecomment-3083935873 From mhaessig at openjdk.org Thu Jul 17 12:52:50 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Thu, 17 Jul 2025 12:52:50 GMT Subject: RFR: 8361700: Missed optimization in PhaseIterGVN for mask and shift patterns due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v2] In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 15:07:59 GMT, Beno?t Maillard wrote: >> This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist`. >> >> The affected optimization is the transformation of `(x & mask) >> shift` into `(x >> shift) & (mask >> shift)`, where `mask` is a constant. This transformation is handled in `RShiftNode::IdealIL` for both `RShiftI` and `RShiftL` nodes. >> >> The dependency of this optimization extends beyond a direct input: from the viewpoint of a shift node, it relies on changes to the inputs of its inputs (i.e., an `AndI`/`AndL` input node to the shift). Therefore, when the `And` node changes, the corresponding shift node must be notified to allow the optimization to take place. >> >> Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361700) >> - [x] tier1-3, plus some internal testing >> - [x] Added test from the fuzzer >> >> Thank you for reviewing! > > Beno?t Maillard has updated the pull request incrementally with three additional commits since the last revision: > > - Update test/hotspot/jtreg/compiler/c2/TestMaskAndRShiftReorder.java > > Co-authored-by: Manuel H?ssig > - Update test/hotspot/jtreg/compiler/c2/TestMaskAndRShiftReorder.java > > Co-authored-by: Manuel H?ssig > - Update src/hotspot/share/opto/phaseX.cpp > > Co-authored-by: Manuel H?ssig Looks good to me. ------------- Marked as reviewed by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/26347#pullrequestreview-3029402311 From kvn at openjdk.org Thu Jul 17 13:18:49 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 17 Jul 2025 13:18:49 GMT Subject: RFR: 8362493: Cleanup CodeBuffer::copy_relocations_to In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 11:09:09 GMT, Hamlin Li wrote: > Hi, > Can you help to review this simple patch? > > `CodeBuffer::copy_relocations_to(address buf, csize_t buf_limit, bool only_inst)` is only used in `copy_relocations_to(CodeBlob* dest)` which passes false to only_inst, so the former one should be able to be simplified. > > Thank you! Good. It is left over from JEP 243: Java-Level JVM Compiler Interface. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26366#pullrequestreview-3029502102 From mli at openjdk.org Thu Jul 17 14:22:25 2025 From: mli at openjdk.org (Hamlin Li) Date: Thu, 17 Jul 2025 14:22:25 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall Message-ID: Hi, Can you help to review this patch? By https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp#L1270, there are far call, indirect call, reloc call. 
NativeFarCall is in fact a reloc call, the name is confusing, better to rename it to RelocCall. Also add some comments and do some other simple cleanup. Thanks! ------------- Commit messages: - initial commit Changes: https://git.openjdk.org/jdk/pull/26370/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26370&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362515 Stats: 59 lines in 1 file changed: 7 ins; 7 del; 45 mod Patch: https://git.openjdk.org/jdk/pull/26370.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26370/head:pull/26370 PR: https://git.openjdk.org/jdk/pull/26370 From aph at openjdk.org Thu Jul 17 14:29:53 2025 From: aph at openjdk.org (Andrew Haley) Date: Thu, 17 Jul 2025 14:29:53 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet [v2] In-Reply-To: References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: On Thu, 17 Jul 2025 12:43:39 GMT, Samuel Chee wrote: >> AtomicLong.CompareAndSet has the following assembly dump snippet which gets emitted from the intermediary LIRGenerator::atomic_cmpxchg: >> >> ;; cmpxchg { >> 0x0000e708d144cf60: mov x8, x2 >> 0x0000e708d144cf64: casal x8, x3, [x0] >> 0x0000e708d144cf68: cmp x8, x2 >> ;; 0x1F1F1F1F1F1F1F1F >> 0x0000e708d144cf6c: mov x8, #0x1f1f1f1f1f1f1f1f >> ;; } cmpxchg >> 0x0000e708d144cf70: cset x8, ne // ne = any >> 0x0000e708d144cf74: dmb ish >> >> >> According to the Oracle Java Specification, AtomicLong.CompareAndSet [1] has the same memory effects as specified by VarHandle.compareAndSet which has the following effects: [2] >> >>> Atomically sets the value of a variable to the >>> newValue with the memory semantics of setVolatile if >>> the variable's current value, referred to as the witness >>> value, == the expectedValue, as accessed with the memory >>> semantics of getVolatile. >> >> >> >> Hence the release on the store due to setVolatile only occurs if the compare is successful. Since casal already satisfies these requirements, the dmb does not need to occur to ensure memory ordering in case the compare fails and a release does not happen. >> >> Hence we remove the dmb from both casl and casw (same logic applies to the non-long variant) >> >> This is also reflected by C2 not having a dmb for the same respective method. >> >> [1] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/util/concurrent/atomic/AtomicLong.html#compareAndSet(long,long) >> [2] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/lang/invoke/VarHandle.html#compareAndSet(java.lang.Object...) > > Samuel Chee has updated the pull request incrementally with one additional commit since the last revision: > > Add back in dmb membar for non-LSE > > Change-Id: Ie64565420a1758d3191eaebed82c80584ce54ef6 src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 1487: > 1485: if(!UseLSE) { > 1486: __ membar(__ AnyAny); > 1487: } Suggestion: if(!UseLSE) { // Prevent a later volatile store from being reordered with the STLXR in cmpxchg. 
__ membar(__ StoreLoad); } ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26000#discussion_r2213512465 From aph at openjdk.org Thu Jul 17 14:33:57 2025 From: aph at openjdk.org (Andrew Haley) Date: Thu, 17 Jul 2025 14:33:57 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet [v2] In-Reply-To: References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: On Thu, 17 Jul 2025 14:27:23 GMT, Andrew Haley wrote: >> Samuel Chee has updated the pull request incrementally with one additional commit since the last revision: >> >> Add back in dmb membar for non-LSE >> >> Change-Id: Ie64565420a1758d3191eaebed82c80584ce54ef6 > > src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 1487: > >> 1485: if(!UseLSE) { >> 1486: __ membar(__ AnyAny); >> 1487: } > > Suggestion: > > if(!UseLSE) { > // Prevent a later volatile load from being reordered with the STLXR in cmpxchg. > __ membar(__ StoreLoad); > } I wonder if it might be a good idea to add a `trailingDMB` boolean argument to `cmpxchg` and `atomic_##NAME` instead. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26000#discussion_r2213522976 From jkarthikeyan at openjdk.org Thu Jul 17 15:01:48 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Thu, 17 Jul 2025 15:01:48 GMT Subject: RFR: 8361700: Missed optimization in PhaseIterGVN for mask and shift patterns due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v2] In-Reply-To: References: Message-ID: <6LXR3PdFz6_cBIQ8tkQCx-BvR5XcVHtcpB1oJv1PVAU=.8ab902f0-cb84-40f6-b4b7-b38ac591f3d6@github.com> On Wed, 16 Jul 2025 15:07:59 GMT, Beno?t Maillard wrote: >> This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist`. >> >> The affected optimization is the transformation of `(x & mask) >> shift` into `(x >> shift) & (mask >> shift)`, where `mask` is a constant. This transformation is handled in `RShiftNode::IdealIL` for both `RShiftI` and `RShiftL` nodes. >> >> The dependency of this optimization extends beyond a direct input: from the viewpoint of a shift node, it relies on changes to the inputs of its inputs (i.e., an `AndI`/`AndL` input node to the shift). Therefore, when the `And` node changes, the corresponding shift node must be notified to allow the optimization to take place. >> >> Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361700) >> - [x] tier1-3, plus some internal testing >> - [x] Added test from the fuzzer >> >> Thank you for reviewing! > > Beno?t Maillard has updated the pull request incrementally with three additional commits since the last revision: > > - Update test/hotspot/jtreg/compiler/c2/TestMaskAndRShiftReorder.java > > Co-authored-by: Manuel H?ssig > - Update test/hotspot/jtreg/compiler/c2/TestMaskAndRShiftReorder.java > > Co-authored-by: Manuel H?ssig > - Update src/hotspot/share/opto/phaseX.cpp > > Co-authored-by: Manuel H?ssig I think this is a nice fix for the notification issue. I just have one code style comment. 
src/hotspot/share/opto/phaseX.cpp line 2556: > 2554: for (DUIterator_Fast i2max, i2 = use->fast_outs(i2max); i2 < i2max; i2++) { > 2555: Node* u = use->fast_out(i2); > 2556: if (u->Opcode() == Op_RShiftI || u->Opcode() == Op_RShiftL ) { Suggestion: if (u->Opcode() == Op_RShiftI || u->Opcode() == Op_RShiftL) { ------------- Marked as reviewed by jkarthikeyan (Committer). PR Review: https://git.openjdk.org/jdk/pull/26347#pullrequestreview-3029700362 PR Review Comment: https://git.openjdk.org/jdk/pull/26347#discussion_r2213451788 From duke at openjdk.org Thu Jul 17 15:12:50 2025 From: duke at openjdk.org (Samuel Chee) Date: Thu, 17 Jul 2025 15:12:50 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet [v2] In-Reply-To: References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: On Thu, 17 Jul 2025 14:31:18 GMT, Andrew Haley wrote: >> src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 1487: >> >>> 1485: if(!UseLSE) { >>> 1486: __ membar(__ AnyAny); >>> 1487: } >> >> Suggestion: >> >> if(!UseLSE) { >> // Prevent a later volatile load from being reordered with the STLXR in cmpxchg. >> __ membar(__ StoreLoad); >> } > > I wonder if it might be a good idea to add a `trailingDMB` boolean argument to `cmpxchg` and `atomic_##NAME` instead. Having a trailingDMB option is potentially a decent idea. Someone would probably need to investigate where the trailingDMB option would have to be enabled; I am not familiar enough to know where exactly would be affected by this. So for now I'd say leave it be and that is something someone else can maybe do in a later pr. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26000#discussion_r2213625095 From mli at openjdk.org Thu Jul 17 15:30:49 2025 From: mli at openjdk.org (Hamlin Li) Date: Thu, 17 Jul 2025 15:30:49 GMT Subject: RFR: 8362493: Cleanup CodeBuffer::copy_relocations_to In-Reply-To: References: Message-ID: <2jnLW9l_xgs35OLuqUXrnG9xvl9Nv_6M_KJnQwKqZqs=.27ffc0f3-0735-4058-814b-a022560be010@github.com> On Thu, 17 Jul 2025 12:14:08 GMT, Manuel H?ssig wrote: > Thank you for working on this cleanup, @Hamlin-Li! It looks good to me. > > I kicked off some testing on our side and will let you know what the results are. Thank you @mhaessig , will wait for your test result. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26366#issuecomment-3084505202 From mli at openjdk.org Thu Jul 17 15:30:50 2025 From: mli at openjdk.org (Hamlin Li) Date: Thu, 17 Jul 2025 15:30:50 GMT Subject: RFR: 8362493: Cleanup CodeBuffer::copy_relocations_to In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 13:16:18 GMT, Vladimir Kozlov wrote: > Good. It is left over from JEP 243: Java-Level JVM Compiler Interface. Thank you @vnkozlov for reviewing! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26366#issuecomment-3084506734 From duke at openjdk.org Thu Jul 17 15:36:32 2025 From: duke at openjdk.org (Samuel Chee) Date: Thu, 17 Jul 2025 15:36:32 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet [v3] In-Reply-To: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: > AtomicLong.CompareAndSet has the following assembly dump snippet which gets emitted from the intermediary LIRGenerator::atomic_cmpxchg: > > ;; cmpxchg { > 0x0000e708d144cf60: mov x8, x2 > 0x0000e708d144cf64: casal x8, x3, [x0] > 0x0000e708d144cf68: cmp x8, x2 > ;; 0x1F1F1F1F1F1F1F1F > 0x0000e708d144cf6c: mov x8, #0x1f1f1f1f1f1f1f1f > ;; } cmpxchg > 0x0000e708d144cf70: cset x8, ne // ne = any > 0x0000e708d144cf74: dmb ish > > > According to the Oracle Java Specification, AtomicLong.CompareAndSet [1] has the same memory effects as specified by VarHandle.compareAndSet which has the following effects: [2] > >> Atomically sets the value of a variable to the >> newValue with the memory semantics of setVolatile if >> the variable's current value, referred to as the witness >> value, == the expectedValue, as accessed with the memory >> semantics of getVolatile. > > > > Hence the release on the store due to setVolatile only occurs if the compare is successful. Since casal already satisfies these requirements, the dmb does not need to occur to ensure memory ordering in case the compare fails and a release does not happen. > > Hence we remove the dmb from both casl and casw (same logic applies to the non-long variant) > > This is also reflected by C2 not having a dmb for the same respective method. > > [1] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/util/concurrent/atomic/AtomicLong.html#compareAndSet(long,long) > [2] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/lang/invoke/VarHandle.html#compareAndSet(java.lang.Object...) Samuel Chee has updated the pull request incrementally with one additional commit since the last revision: Add comment Signed-off-by: Samuel Chee Change-Id: I9793ed6ffdff6c044552d069af23620d178f2284 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26000/files - new: https://git.openjdk.org/jdk/pull/26000/files/181ce0b7..8eb9096d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26000&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26000&range=01-02 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26000.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26000/head:pull/26000 PR: https://git.openjdk.org/jdk/pull/26000 From duke at openjdk.org Thu Jul 17 15:36:32 2025 From: duke at openjdk.org (Samuel Chee) Date: Thu, 17 Jul 2025 15:36:32 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet [v2] In-Reply-To: References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: On Thu, 17 Jul 2025 15:10:18 GMT, Samuel Chee wrote: >> I wonder if it might be a good idea to add a `trailingDMB` boolean argument to `cmpxchg` and `atomic_##NAME` instead. > > Having a trailingDMB option is potentially a decent idea. 
Someone would probably need to investigate where the trailingDMB option would have to be enabled; I am not familiar enough to know where exactly would be affected by this. > So for now I'd say leave it be and that is something someone else can maybe do in a later pr. Also have just added comment as you suggested thanks! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26000#discussion_r2213678810 From duke at openjdk.org Thu Jul 17 16:19:51 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 17 Jul 2025 16:19:51 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v38] In-Reply-To: References: Message-ID: > This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). > > When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. > > This does not modify existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality > > Additional Testing: > - [ ] Linux x64 fastdebug all > - [ ] Linux aarch64 fastdebug all > - [ ] ... Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: Require caller to hold locks ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23573/files - new: https://git.openjdk.org/jdk/pull/23573/files/36834705..1dcf47e4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=37 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=36-37 Stats: 20 lines in 2 files changed: 12 ins; 6 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23573.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23573/head:pull/23573 PR: https://git.openjdk.org/jdk/pull/23573 From jbhateja at openjdk.org Thu Jul 17 16:38:00 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 17 Jul 2025 16:38:00 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v15] In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 10:43:34 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. >> >> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. >> >> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Refine lower bound computation Quick note on C2 Integral types:- - Integral types now encapsulate 3 lattice structures perf TypeInt/Long, namely signed, unsigned and knownbits. - All three lattice values are in sync post-canonicalization. 
- A lattice is a partial order relation, i.e., reflexive, transitive and anti-symmetric.
- An integral lattice contains two special values: a TOP (no value; no assumption can be drawn by the compiler) and a BOTTOM (all possible values in the value range). Verification ensures that the lattice is symmetrical around the centerline, i.e., a semi-lattice.
- For a symmetrical lattice, only one operation, i.e., meet/join, is sufficient for value resolution; the other operation can be computed by taking the dual of the first one using de Morgan's law: join = dual(meet(dual(type1), dual(type2))).
- In theory, a meet of two lattice points takes us to the greatest lower bound in the Hasse diagram, while a join of two lattice points takes us to the lowest upper bound. Also, TOP represents the entire value range of the lattice, while BOTTOM represents no value, but C2 follows an inverted lattice convention.

Inverted integral lattice Hasse diagram:

            TOP (no value)
           /   |     |   \
        -MIN   |     |   MAX
           \   |     |   /
         BOTTOM (all possible values)

- Thus, a MEET of two lattice points takes us to the greatest upper bound of the lattice structure; in this case, it is the union of the two lattice points, i.e., we pick the minimum of the lower bounds and the maximum of the upper bounds of the participating lattice points. A JOIN takes us to the lowest upper-bound lattice point of the inverted lattice structure; in this case, it is an intersection of the lattice points, which constrains the value range, i.e., we pick the maximum of the lower bounds and the minimum of the upper bounds of the two participating integral lattice points.

  e.g., if TypeInt t1 = {lo:10, hi:100} and TypeInt t2 = {lo:1, hi:20}, then

    t1.meet(t2) = {lo = min(t1.lo, t2.lo), hi = max(t1.hi, t2.hi)}
                = {lo = min(10, 1), hi = max(100, 20)}
                = {lo:1, hi:100}

    t1.join(t2) = dual(meet(dual(t1), dual(t2))), where dual maps {lo, hi} => {hi, lo}
                = dual(meet(dual{lo:10, hi:100}, dual{lo:1, hi:20}))
                = dual(meet({lo:100, hi:10}, {lo:20, hi:1}))
                = dual({lo = min(100, 20), hi = max(10, 1)})
                = dual({lo:20, hi:10})
                = {lo:10, hi:20}

Additional identities:
- TOP meet VAL = VAL, since we cannot move to any other greatest lower bound when one of the inputs is TOP (unknown value); to move to the greatest lower bound, both inputs must be known values.
- BOTTOM meet VAL = BOTTOM

Now, some quick notes on CCP:
- It is an optimistic data flow analysis using an RPOT (reverse post-order traversal) walk over the ideal graph.
- Each lattice begins with a TOP value, and the analysis progressively adds elements to the lattice. The analysis expects to expand the value range with each data flow iteration, thereby monotonically increasing the lattice set.
- After each value transformation, type verification checks that the new value is greater than the old value in the lattice; in other words, the new value should dominate the old value in the Hasse diagram of the lattice. Thus, tnew->meet(told) gives us the lowest upper bound of the two lattice points, i.e., tnew should be a superset of told. CCP is an optimistic iterative data flow analysis which traverses the ideal graph in RPOT order and reaches a fixed point once value transforms have no further side effects on the graph.
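To make the meet/join arithmetic above easy to replay, here is a small self-contained Java sketch. It is purely illustrative and is not the C2 TypeInt implementation (which also deals with widening, duals of the full type, and the unsigned/knownbits lattices mentioned earlier); it only applies the interval rules to the example values.

```java
// Illustrative model of the interval part of an integral type lattice.
// Not HotSpot code: the type and method names are made up for this sketch.
record Interval(int lo, int hi) {
    // meet = union of the two ranges: min of the lows, max of the highs.
    Interval meet(Interval other) {
        return new Interval(Math.min(lo, other.lo), Math.max(hi, other.hi));
    }
    // dual swaps the bounds; join = dual(meet(dual(a), dual(b))).
    Interval dual() {
        return new Interval(hi, lo);
    }
    Interval join(Interval other) {
        return dual().meet(other.dual()).dual();
    }
}

class LatticeDemo {
    public static void main(String[] args) {
        Interval t1 = new Interval(10, 100);
        Interval t2 = new Interval(1, 20);
        System.out.println(t1.meet(t2)); // Interval[lo=1, hi=100]
        System.out.println(t1.join(t2)); // Interval[lo=10, hi=20]
    }
}
```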
------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3084708275 From jbhateja at openjdk.org Thu Jul 17 16:38:01 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 17 Jul 2025 16:38:01 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v8] In-Reply-To: <10jWxhtjQENzTBjlNDFKhHQMN-ioETq3P6_qmVTq3bo=.0124e215-5c09-44c3-8dcb-cd692789907a@github.com> References: <10jWxhtjQENzTBjlNDFKhHQMN-ioETq3P6_qmVTq3bo=.0124e215-5c09-44c3-8dcb-cd692789907a@github.com> Message-ID: On Wed, 16 Jul 2025 06:49:06 GMT, Tobias Hartmann wrote: >> Hi @eme64 , >> >> Updated the tests as per suggestion; however, for this bug fix patch, we are not doing aggressive value range optimization. >> I plan to extend value routines for compress/expand with the newly supported knownBits infrastructure in a subsequent RFE., Following is a prototype for the same. >> >> https://github.com/jatin-bhateja/external_staging/blob/main/Code/java/knownBits_DFA/bit_compress_expand_KnownBits.java >> >> Best Regards, >> Jatin > > Thanks @jatin-bhateja. Isn't the OCA signature status verification independent of the PR? Let me ping a few people here to get it done. Hi @TobiHartmann . Please let me know if it's good to land in. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3084710343 From sparasa at openjdk.org Thu Jul 17 17:17:07 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Thu, 17 Jul 2025 17:17:07 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v5] In-Reply-To: References: Message-ID: > The goal of this PR is to enhance the existing x86 assembly stubs using PUSH and POP instructions with paired PUSHP/POPP instructions which are part of Intel APX technology. > > In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0?R31). Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs. > > Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures. 
Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: change to push_ppx/pop_ppx ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25889/files - new: https://git.openjdk.org/jdk/pull/25889/files/8e6e96c2..78cbf243 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25889&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25889&range=03-04 Stats: 325 lines in 22 files changed: 0 ins; 0 del; 325 mod Patch: https://git.openjdk.org/jdk/pull/25889.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25889/head:pull/25889 PR: https://git.openjdk.org/jdk/pull/25889 From sparasa at openjdk.org Thu Jul 17 17:19:51 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Thu, 17 Jul 2025 17:19:51 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: References: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> Message-ID: <3LY0CQRtD6KU5wYpQSaCA9Cbey8yV7epET1OSUjngSw=.052051e9-138d-490f-80e8-226c0a4dd4a5@github.com> On Mon, 14 Jul 2025 17:44:15 GMT, Volodymyr Paprotski wrote: > My concerns have been addressed; thanks Vamsi for changing the names! Thank you Vlad for the reviewing the PR! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25889#issuecomment-3084838188 From sparasa at openjdk.org Thu Jul 17 17:19:54 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Thu, 17 Jul 2025 17:19:54 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v4] In-Reply-To: References: <-V4hpHvXdaDjmEyYzHcEpDJ2bzPTqoz2Ao8FLobkmB8=.d9e3b962-ae8d-4e4b-8ddb-c3ab42a2a619@github.com> Message-ID: <2Br9Fr3vnPluq7XaWc9PPrDvRgVF6UAOYdUFJ4IO23w=.f8ac1e39-dc16-429a-a8a0-c68a4f2a44ea@github.com> On Wed, 16 Jul 2025 20:55:35 GMT, Sandhya Viswanathan wrote: >> Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: >> >> - merge with master >> - remove pushp/popp from vm_version_x86 and also when APX is not being used >> - rename to paired_push and paired_pop >> - 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs > > src/hotspot/cpu/x86/macroAssembler_x86.cpp line 798: > >> 796: } >> 797: >> 798: void MacroAssembler::paired_push(Register src) { > > Would be better to call these as push_ppx and pop_ppx. Please see the updated code changed to push_ppx/pop_ppx. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2213890625 From jbhateja at openjdk.org Thu Jul 17 17:29:49 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 17 Jul 2025 17:29:49 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v5] In-Reply-To: References: Message-ID: <89ItZsQ_nWl3KWuRwdAqu3cMeostYVb1sO6qurvJ5qw=.2640ac03-ea33-4938-86c1-40033dea04a8@github.com> On Thu, 17 Jul 2025 17:17:07 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to enhance the existing x86 assembly stubs using PUSH and POP instructions with paired PUSHP/POPP instructions which are part of Intel APX technology. >> >> In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0?R31). 
Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs. >> >> Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures. > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > change to push_ppx/pop_ppx src/hotspot/cpu/x86/macroAssembler_x86.cpp line 806: > 804: } > 805: > 806: void MacroAssembler::pop_ppx(Register dst) { Hi @vamsi-parasa , If you rename pop_ppx to pop and push_ppx to push, it will cut down the changes in this patch significantly. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2213907529 From sparasa at openjdk.org Thu Jul 17 18:41:55 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Thu, 17 Jul 2025 18:41:55 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v5] In-Reply-To: <89ItZsQ_nWl3KWuRwdAqu3cMeostYVb1sO6qurvJ5qw=.2640ac03-ea33-4938-86c1-40033dea04a8@github.com> References: <89ItZsQ_nWl3KWuRwdAqu3cMeostYVb1sO6qurvJ5qw=.2640ac03-ea33-4938-86c1-40033dea04a8@github.com> Message-ID: On Thu, 17 Jul 2025 17:26:56 GMT, Jatin Bhateja wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> change to push_ppx/pop_ppx > > src/hotspot/cpu/x86/macroAssembler_x86.cpp line 806: > >> 804: } >> 805: >> 806: void MacroAssembler::pop_ppx(Register dst) { > > Hi @vamsi-parasa , If you rename pop_ppx to pop and push_ppx to push, it will cut down the changes in this patch significantly. Hi Jatin (@jatin-bhateja), the intent is to make the use of the `pushp/popp` instructions explicit to the user, as not all `push` or `pop` instructions require the PPX feature. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2214035483 From thartmann at openjdk.org Thu Jul 17 18:59:58 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 17 Jul 2025 18:59:58 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v15] In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 10:43:34 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. >> >> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. >> >> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. >> >> Kindly review and share your feedback. 
>> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Refine lower bound computation Thanks, testing looks good now! I'm out for the rest of the week and can review only next week. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3085103359 From jbhateja at openjdk.org Thu Jul 17 19:04:53 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 17 Jul 2025 19:04:53 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: References: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> Message-ID: On Mon, 14 Jul 2025 08:15:13 GMT, Jatin Bhateja wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> rename to paired_push and paired_pop > > src/hotspot/cpu/x86/gc/z/zBarrierSetAssembler_x86.cpp line 114: > >> 112: __ paired_push(rax); >> 113: } >> 114: __ paired_push(rcx); > > Hi @vamsi-parasa , for consecutive push/pop we can use push2/pop2 and 16byte alignment can be guaranteed using following technique > https://github.com/openjdk/jdk/pull/25351/files#diff-d5d721ebf93346ba66e81257e4f6c5e6268d59774313c61e97353c0dfbf686a5R94 > Hi Jatin (@jatin-bhateja), for the first iteration, would it be ok to get the push_paired/pop_paired changes integrated and then make the push2p/pop2p related optimizations in a separate PR? > > Thanks, Vamsi Please create a new RFE for this for tracking. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2214075909 From duke at openjdk.org Thu Jul 17 19:06:01 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 17 Jul 2025 19:06:01 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v24] In-Reply-To: References: Message-ID: On Tue, 3 Jun 2025 00:32:26 GMT, Vladimir Kozlov wrote: >> Thanks for pointing out the missing JVMTI event publication. I?m currently looking into what?s required to address that, along with JFR event publication that may also have been missed. I?d appreciate hearing others? thoughts on how critical this is: should we treat it as a blocker for integration, or would it be acceptable to follow up with a separate issue? >> >> We?re hoping to get this into JDK 25, as it would simplify both development and backporting of features related to hot code grouping. That said, if the consensus is that JVMTI/JFR support is essential upfront, this can be delayed until JDK 26. > >> We?re hoping to get this into JDK 25, as it would simplify both development and backporting of features related to hot code grouping. That said, if the consensus is that JVMTI/JFR support is essential upfront, this can be delayed until JDK 26. > > I don't think this can be put into JDK 25. Too late and changes are not simple. And yes, JVMTI/JFR support is essential - you have to support all public functionalities of VM. @vnkozlov When you get a chance, would you mind taking another look at this PR? 
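For readers following the 8350896 discussion above: `Integer.compress(i, mask)` (JDK 19+) gathers the bits of `i` selected by `mask` into the low-order bits of the result. The sketch below only illustrates the general shape being discussed (a constant input combined with a non-constant mask) and is not taken from the actual regression test; the class, method and constant are made up.

```
class CompressExample {
    // Integer.compress packs the bits of the first argument selected by the mask
    // into the low-order bits of the result, e.g.:
    //   Integer.compress(0b1010_1010, 0b0000_1111) == 0b1010   // == 10
    // With a constant input but a variable mask, the result still depends on the
    // mask, so the compiler has to keep a value range for it rather than folding
    // it to a single constant.
    static int compressConstInput(int mask) {
        return Integer.compress(42, mask);
    }
}
```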
------------- PR Comment: https://git.openjdk.org/jdk/pull/23573#issuecomment-3085128980 From jbhateja at openjdk.org Thu Jul 17 19:38:50 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 17 Jul 2025 19:38:50 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v5] In-Reply-To: References: <89ItZsQ_nWl3KWuRwdAqu3cMeostYVb1sO6qurvJ5qw=.2640ac03-ea33-4938-86c1-40033dea04a8@github.com> Message-ID: On Thu, 17 Jul 2025 18:39:04 GMT, Srinivas Vamsi Parasa wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> change to push_ppx/pop_ppx > > src/hotspot/cpu/x86/macroAssembler_x86.cpp line 806: > >> 804: } >> 805: >> 806: void MacroAssembler::pop_ppx(Register dst) { > > Hi Jatin (@jatin-bhateja), the intent is to make the use of the `pushp/popp` instructions explicit to the user, as not all `push` or `pop` instructions require the PPX feature. BTW, push/popp are meager optimization hints, if other constraints for balancing are not met the it will prevent value forwarding, so it's ok to keep the macro-assembly name same as the assembler name, fully qualified names Assembler::pop/push are sufficient to disambiguate in applicable scenarios or iff there is any such need. I would have preferred keeping the same name to limit the changes in this patch. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2214135683 From sviswanathan at openjdk.org Thu Jul 17 19:52:59 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Thu, 17 Jul 2025 19:52:59 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v5] In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 17:17:07 GMT, Srinivas Vamsi Parasa wrote: >> This PR adds support for the Push-Pop Acceleration (PPX) hint to legacy PUSH and POP instructions, enabling the PUSHP and POPP forms. The PPX hint improves performance by accelerating register value forwarding between matching push/pop pairs. >> >> **Purpose:** PPX is a performance hint that allows the processor to bypass memory and the training loop of Fast Store Forwarding Predictor (FSFP) by directly forwarding data between paired PUSHP and POPP instructions. >> >> **Requirements:** Both the PUSH and its matching POP must be marked with PPX. A "matching" pair accesses the same stack address (e.g., typical function prolog/epilog). Standalone PUSH instructions (e.g. for argument passing) must not be marked. >> >> **Encoding:** PUSHP/POPP is a replacement for legacy PUSH/POP (0x50+rd / 0x58+rd) and uses REX2.W = 1 (implies 64-bit operand size). PPX cannot be encoded with 16-bit operand size as REX2.W overrides the 0x66 prefix. > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > change to push_ppx/pop_ppx Looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/25889#pullrequestreview-3030827697 From sviswanathan at openjdk.org Thu Jul 17 19:58:48 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Thu, 17 Jul 2025 19:58:48 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v5] In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 17:17:07 GMT, Srinivas Vamsi Parasa wrote: >> This PR adds support for the Push-Pop Acceleration (PPX) hint to legacy PUSH and POP instructions, enabling the PUSHP and POPP forms. The PPX hint improves performance by accelerating register value forwarding between matching push/pop pairs. >> >> **Purpose:** PPX is a performance hint that allows the processor to bypass memory and the training loop of Fast Store Forwarding Predictor (FSFP) by directly forwarding data between paired PUSHP and POPP instructions. >> >> **Requirements:** Both the PUSH and its matching POP must be marked with PPX. A "matching" pair accesses the same stack address (e.g., typical function prolog/epilog). Standalone PUSH instructions (e.g. for argument passing) must not be marked. >> >> **Encoding:** PUSHP/POPP is a replacement for legacy PUSH/POP (0x50+rd / 0x58+rd) and uses REX2.W = 1 (implies 64-bit operand size). PPX cannot be encoded with 16-bit operand size as REX2.W overrides the 0x66 prefix. > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > change to push_ppx/pop_ppx @vnkozlov Would it be possible for you to run this PR through your testing before we integrate? ------------- PR Comment: https://git.openjdk.org/jdk/pull/25889#issuecomment-3085295891 From jbhateja at openjdk.org Fri Jul 18 03:17:54 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 18 Jul 2025 03:17:54 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: References: <3cr8Njt2flaQXy5sjOVOlhI9XDkEesagnYLwzCmgkoI=.089494aa-d622-47db-8d23-c9637519028c@github.com> Message-ID: <7zxjUTJq9ynYRau4UpWaFcARH8cp8Xka3cJovCwGVRY=.2bcd9dc6-a9df-47f2-8834-bc6c4a8469cf@github.com> On Mon, 7 Jul 2025 09:04:40 GMT, Jatin Bhateja wrote: >>> > > public static final VectorSpecies FSP = FloatVector.SPECIES_512; >>> > > public static long micro1(long a) { >>> > > long mask = Math.min(-1, Math.max(-1, a)); >>> > > return VectorMask.fromLong(FSP, mask).toLong(); >>> > > } >>> > > public static long micro2() { >>> > > return FSP.maskAll(true).toLong(); >>> > > } >>> > >>> > >>> > With this JMH method we can not see obvious performance improvement, because the hot spots are other instructions. Adding a loop is better. >>> >>> There is no hard and fast rule for the inclusion of a loop in a JMH micro in that case? >> >> You mean adding a loop is not a block, right ? > >> > > > public static final VectorSpecies FSP = FloatVector.SPECIES_512; >> > > > public static long micro1(long a) { >> > > > long mask = Math.min(-1, Math.max(-1, a)); >> > > > return VectorMask.fromLong(FSP, mask).toLong(); >> > > > } >> > > > public static long micro2() { >> > > > return FSP.maskAll(true).toLong(); >> > > > } >> > > >> > > >> > > With this JMH method we can not see obvious performance improvement, because the hot spots are other instructions. Adding a loop is better. >> > >> > >> > There is no hard and fast rule for the inclusion of a loop in a JMH micro in that case? >> >> You mean adding a loop is not a block, right ? > > Yes. If you see gains without loop go for it. 
> As @jatin-bhateja suggested, I have refactored the implementation and updated the commit message. please help review this PR, thanks! Thanks a lot @erifan , I am out for the rest of the week, will re-review early next week. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3086554704 From xgong at openjdk.org Fri Jul 18 06:05:51 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 18 Jul 2025 06:05:51 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: <38f2bvFqiVNQGGpMif0iflVFD8wXnyw4SwtKxwi_Dmo=.276fb2fb-b80c-4ea7-a32f-c326294f442a@github.com> <1xAfD3mz5cbQpYtCYxoHqRQcOLadLKNHrvMUtFtFbGo=.34e5780a-e37a-427c-b745-1ed422c7a008@github.com> <4tejg5hp-eHBmAEvKbpTg_mv_TUYU5kg0HIccmWyac8=.3638758e-5000-4d1f-924f-abb4a21952c6@github.com> Message-ID: On Thu, 17 Jul 2025 11:28:18 GMT, Fei Gao wrote: > > Yes, I agree with you. I'm now working on refactoring the IR based on the first idea. I will update the patch as soon as possible. Thanks for your valuable suggestion! > > Thanks! I?d suggest also highlighting `aarch64` in the JBS title, so others who are interested won?t miss it. Thanks for your point~ I'm not sure since this is not a pure AArch64 backend patch as I can see. Actually, the backend rules are so simple, and the mid-end IR change is relative more complex. Not sure whether this patch will be also missed by others that are not familiar with AArch64 if it is highlighted. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3087081866 From qxing at openjdk.org Fri Jul 18 06:19:52 2025 From: qxing at openjdk.org (Qizheng Xing) Date: Fri, 18 Jul 2025 06:19:52 GMT Subject: RFR: 8347499: C2: Make `PhaseIdealLoop` eliminate more redundant safepoints in loops [v2] In-Reply-To: References: Message-ID: On Wed, 2 Apr 2025 07:22:13 GMT, Emanuel Peter wrote: >> The second question: >> >>> If we now removed safepoints in places where we would actually have needed them: how would we find out? I suppose we would get longer time to safepoint - higher latency in some cases. How would we catch this with our tests? >> >> I tried running tier1 tests with `JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions -XX:+SafepointTimeout -XX:+AbortVMOnSafepointTimeout -XX:SafepointTimeoutDelay=1000`, and there were no failures. >> >> Running with `-XX:SafepointTimeoutDelay=500` caused 1 random JDK test case to fail. But then I tried to build a JDK without this patch, and it still had the random failure with this option. > > @MaxXSoft Would you mind improving the documentation comments, so that they are easier to understand? Maybe you can even add more comments around your code change, to "prove" why it is ok to do what we would do with your change? Hi @eme64, this PR is now ready for further reviews. Could you please continue to review this PR? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23057#issuecomment-3087162955 From kwei at openjdk.org Fri Jul 18 08:41:03 2025 From: kwei at openjdk.org (Kuai Wei) Date: Fri, 18 Jul 2025 08:41:03 GMT Subject: RFR: 8345485: C2 MergeLoads: merge adjacent array/native memory loads into larger load [v17] In-Reply-To: References: Message-ID: On Tue, 3 Jun 2025 10:50:39 GMT, Kuai Wei wrote: >> In this patch, I extent the merge stores optimization to merge adjacents loads. Tier1 tests are passed in my machine. 
>> >> The benchmark result of MergeLoadBench.java >> AMD EPYC 9T24 96-Core Processor: >> >> |name | -MergeLoads | +MergeLoads |delta| >> |---|---|---|---| >> |MergeLoadBench.getCharB |4352.150 |4407.435 | 55.29 | >> |MergeLoadBench.getCharBU |4075.320 |4084.663 | 9.34 | >> |MergeLoadBench.getCharBV |3221.302 |3221.528 | 0.23 | >> |MergeLoadBench.getCharC |2235.433 |2238.796 | 3.36 | >> |MergeLoadBench.getCharL |4363.244 |4372.281 | 9.04 | >> |MergeLoadBench.getCharLU |4072.550 |4075.744 | 3.19 | >> |MergeLoadBench.getCharLV |2227.825 |2231.612 | 3.79 | >> |MergeLoadBench.getIntB |11199.935 |6869.030 | -4330.90 | >> |MergeLoadBench.getIntBU |6853.862 |2763.923 | -4089.94 | >> |MergeLoadBench.getIntBV |306.953 |309.911 | 2.96 | >> |MergeLoadBench.getIntL |10426.843 |6523.716 | -3903.13 | >> |MergeLoadBench.getIntLU |6740.847 |2602.701 | -4138.15 | >> |MergeLoadBench.getIntLV |2233.151 |2231.745 | -1.41 | >> |MergeLoadBench.getIntRB |11335.756 |8980.619 | -2355.14 | >> |MergeLoadBench.getIntRBU |7439.873 |3190.208 | -4249.66 | >> |MergeLoadBench.getIntRL |16323.040 |7786.842 | -8536.20 | >> |MergeLoadBench.getIntRLU |7457.745 |3364.140 | -4093.61 | >> |MergeLoadBench.getIntRU |2512.621 |2511.668 | -0.95 | >> |MergeLoadBench.getIntU |2501.064 |2500.629 | -0.43 | >> |MergeLoadBench.getLongB |21175.442 |21103.660 | -71.78 | >> |MergeLoadBench.getLongBU |14042.046 |2512.784 | -11529.26 | >> |MergeLoadBench.getLongBV |606.448 |606.171 | -0.28 | >> |MergeLoadBench.getLongL |23142.178 |23217.785 | 75.61 | >> |MergeLoadBench.getLongLU |14112.972 |2237.659 | -11875.31 | >> |MergeLoadBench.getLongLV |2230.416 |2231.224 | 0.81 | >> |MergeLoadBench.getLongRB |21152.558 |21140.583 | -11.98 | >> |MergeLoadBench.getLongRBU |14031.178 |2520.317 | -11510.86 | >> |MergeLoadBench.getLongRL |23248.506 |23136.410 | -112.10 | >> |MergeLoadBench.getLongRLU |14125.032 |2240.481 | -11884.55 | >> |Merg... > > Kuai Wei has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 24 commits: > > - Merge remote-tracking branch 'origin/master' into dev/merge_loads > - Move _merge_memops_checks into OrI/OrL > - Fix test error after merging > - Merge remote-tracking branch 'origin/master' into dev/merge_loads > - Fix for comments > - Fix build error on mac and windows > - Add check flag for combine operator > - Make MergeLoadInfoList an in-place growable array > - Fix for comments > - Merge remote-tracking branch 'origin/master' into dev/merge_loads > - ... and 14 more: https://git.openjdk.org/jdk/compare/8674f491...bdaae3ee It need rework to combine merge loads and merge stores in sperate optimize phase. Close it now. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24023#issuecomment-3088524013 From kwei at openjdk.org Fri Jul 18 08:41:04 2025 From: kwei at openjdk.org (Kuai Wei) Date: Fri, 18 Jul 2025 08:41:04 GMT Subject: Withdrawn: 8345485: C2 MergeLoads: merge adjacent array/native memory loads into larger load In-Reply-To: References: Message-ID: <9mJe3dk0nRHoQ8IJVvKKu5Zua7xE7Py6p0Cw5yUK4gM=.35ac2c7c-29f5-45e7-a7b7-45db33b143de@github.com> On Thu, 13 Mar 2025 02:39:16 GMT, Kuai Wei wrote: > In this patch, I extent the merge stores optimization to merge adjacents loads. Tier1 tests are passed in my machine. 
> > The benchmark result of MergeLoadBench.java > AMD EPYC 9T24 96-Core Processor: > > |name | -MergeLoads | +MergeLoads |delta| > |---|---|---|---| > |MergeLoadBench.getCharB |4352.150 |4407.435 | 55.29 | > |MergeLoadBench.getCharBU |4075.320 |4084.663 | 9.34 | > |MergeLoadBench.getCharBV |3221.302 |3221.528 | 0.23 | > |MergeLoadBench.getCharC |2235.433 |2238.796 | 3.36 | > |MergeLoadBench.getCharL |4363.244 |4372.281 | 9.04 | > |MergeLoadBench.getCharLU |4072.550 |4075.744 | 3.19 | > |MergeLoadBench.getCharLV |2227.825 |2231.612 | 3.79 | > |MergeLoadBench.getIntB |11199.935 |6869.030 | -4330.90 | > |MergeLoadBench.getIntBU |6853.862 |2763.923 | -4089.94 | > |MergeLoadBench.getIntBV |306.953 |309.911 | 2.96 | > |MergeLoadBench.getIntL |10426.843 |6523.716 | -3903.13 | > |MergeLoadBench.getIntLU |6740.847 |2602.701 | -4138.15 | > |MergeLoadBench.getIntLV |2233.151 |2231.745 | -1.41 | > |MergeLoadBench.getIntRB |11335.756 |8980.619 | -2355.14 | > |MergeLoadBench.getIntRBU |7439.873 |3190.208 | -4249.66 | > |MergeLoadBench.getIntRL |16323.040 |7786.842 | -8536.20 | > |MergeLoadBench.getIntRLU |7457.745 |3364.140 | -4093.61 | > |MergeLoadBench.getIntRU |2512.621 |2511.668 | -0.95 | > |MergeLoadBench.getIntU |2501.064 |2500.629 | -0.43 | > |MergeLoadBench.getLongB |21175.442 |21103.660 | -71.78 | > |MergeLoadBench.getLongBU |14042.046 |2512.784 | -11529.26 | > |MergeLoadBench.getLongBV |606.448 |606.171 | -0.28 | > |MergeLoadBench.getLongL |23142.178 |23217.785 | 75.61 | > |MergeLoadBench.getLongLU |14112.972 |2237.659 | -11875.31 | > |MergeLoadBench.getLongLV |2230.416 |2231.224 | 0.81 | > |MergeLoadBench.getLongRB |21152.558 |21140.583 | -11.98 | > |MergeLoadBench.getLongRBU |14031.178 |2520.317 | -11510.86 | > |MergeLoadBench.getLongRL |23248.506 |23136.410 | -112.10 | > |MergeLoadBench.getLongRLU |14125.032 |2240.481 | -11884.55 | > |MergeLoadBench.getLongRU |3071.881 |3066.606 | -5.27 | > |Merg... This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/24023 From bmaillard at openjdk.org Fri Jul 18 08:52:34 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Fri, 18 Jul 2025 08:52:34 GMT Subject: RFR: 8361700: Missed optimization in PhaseIterGVN for mask and shift patterns due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v3] In-Reply-To: References: Message-ID: > This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist`. > > The affected optimization is the transformation of `(x & mask) >> shift` into `(x >> shift) & (mask >> shift)`, where `mask` is a constant. This transformation is handled in `RShiftNode::IdealIL` for both `RShiftI` and `RShiftL` nodes. > > The dependency of this optimization extends beyond a direct input: from the viewpoint of a shift node, it relies on changes to the inputs of its inputs (i.e., an `AndI`/`AndL` input node to the shift). Therefore, when the `And` node changes, the corresponding shift node must be notified to allow the optimization to take place. > > Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. 
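As an illustration of the transformation described above, here is a hedged Java sketch. Only the `(x & mask) >> shift` shape comes from the description; the class, method names and the particular constants are made up.

```
class MaskShiftExample {
    // Before: the And node feeds the RShift node.
    static int maskThenShift(int x) {
        return (x & 0xFF00) >> 8;
    }

    // Shape after the rewrite: shift first, then mask with the pre-shifted constant
    // (0xFF00 >> 8 == 0xFF). Both methods compute the same value for any int x.
    static int shiftThenMask(int x) {
        return (x >> 8) & 0xFF;
    }
}
```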
> > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361700) > - [x] tier1-3, plus some internal testing > - [x] Added test from the fuzzer > > Thank you for reviewing! Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/share/opto/phaseX.cpp Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26347/files - new: https://git.openjdk.org/jdk/pull/26347/files/cc3ccc93..3d620328 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26347&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26347&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26347.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26347/head:pull/26347 PR: https://git.openjdk.org/jdk/pull/26347 From xgong at openjdk.org Fri Jul 18 09:01:53 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 18 Jul 2025 09:01:53 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 01:23:43 GMT, Xiaohong Gong wrote: >> ### Background >> On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. >> >> For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. >> >> To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. >> >> ### Impact Analysis >> #### 1. Vector types >> Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. >> >> #### 2. Vector API >> No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. >> >> #### 3. Auto-vectorization >> Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. >> >> #### 4. Codegen of vector nodes >> NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. >> >> Details: >> - Lanewise vector operations are unaffected as explained above. >> - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). >> - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. 
The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, addin... > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Disable auto-vectorization of double to short conversion for NEON and update tests Hi @jatin-bhateja, would you mind help taking a look at the IR test part especially the IR check on X86? Thanks a lot in advance! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3088643964 From fyang at openjdk.org Fri Jul 18 11:13:53 2025 From: fyang at openjdk.org (Fei Yang) Date: Fri, 18 Jul 2025 11:13:53 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v10] In-Reply-To: References: Message-ID: On Fri, 18 Jul 2025 09:07:54 GMT, Yuri Gaevsky wrote: >>> > Looking at the JMH numbers, it's interesting to find that `-XX:DisableIntrinsic=_vectorizedHashCode` outperforms `-XX:-UseRVV`. If that is the case, then why would we want the scalar version (that is `C2_MacroAssembler::arrays_hashcode()`)? >>> >>> You are right: the non-RVV version of intrinsic performs worse on BPI-F3 hardware with size > 70, though originally it was better on StarFive JH7110 and T-Head RVB-ICE, please see #16629. >> >> Hm, it is still good on Lichee Pi 4A: >> >> $ ( for i in "-XX:DisableIntrinsic=_vectorizedHashCode" " " ; do ( echo "--- ${i} ---" && ${JAVA_HOME}/bin/java -jar benchmarks.jar --jvmArgs="-XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions ${i}" org.openjdk.bench.java.lang.ArraysHashCode.ints -p size=1,5,10,20,30,40,50,60,70,80,90,100,200,300 -f 3 -r 1 -w 1 -wi 10 -i 10 2>&1 | tail -15 ) done ) >> --- -XX:DisableIntrinsic=_vectorizedHashCode --- >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.ints 1 avgt 30 51.709 ? 3.815 ns/op >> ArraysHashCode.ints 5 avgt 30 68.146 ? 1.833 ns/op >> ArraysHashCode.ints 10 avgt 30 89.217 ? 0.496 ns/op >> ArraysHashCode.ints 20 avgt 30 140.807 ? 9.335 ns/op >> ArraysHashCode.ints 30 avgt 30 172.030 ? 4.025 ns/op >> ArraysHashCode.ints 40 avgt 30 222.927 ? 10.342 ns/op >> ArraysHashCode.ints 50 avgt 30 251.719 ? 0.686 ns/op >> ArraysHashCode.ints 60 avgt 30 305.947 ? 10.532 ns/op >> ArraysHashCode.ints 70 avgt 30 347.602 ? 7.024 ns/op >> ArraysHashCode.ints 80 avgt 30 382.057 ? 1.520 ns/op >> ArraysHashCode.ints 90 avgt 30 426.022 ? 31.800 ns/op >> ArraysHashCode.ints 100 avgt 30 457.737 ? 0.652 ns/op >> ArraysHashCode.ints 200 avgt 30 913.501 ? 3.258 ns/op >> ArraysHashCode.ints 300 avgt 30 1297.355 ? 2.383 ns/op >> --- --- >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.ints 1 avgt 30 50.141 ? 0.463 ns/op >> ArraysHashCode.ints 5 avgt 30 62.921 ? 2.538 ns/op >> ArraysHashCode.ints 10 avgt 30 77.686 ? 2.577 ns/op >> ArraysHashCode.ints 20 avgt 30 102.736 ? 0.136 ns/op >> ArraysHashCode.ints 30 avgt 30 137.592 ? 4.232 ns/op >> ArraysHashCode.ints 40 avgt 30 157.376 ? 0.302 ns/op >> ArraysHashCode.ints 50 avgt 30 196.068 ? 3.812 ns/op >> ArraysHashCode.ints 60 avgt 30 212.... > >> Looking at the JMH numbers, it's interesting to find that `-XX:DisableIntrinsic=_vectorizedHashCode` outperforms `-XX:-UseRVV`. If that is the case, then why would we want the scalar version (that is `C2_MacroAssembler::arrays_hashcode()`)? 
> > I've just found that the following change: > > $ git diff > diff --git a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp > index c62997310b3..f98b48adccd 100644 > --- a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp > +++ b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp > @@ -1953,16 +1953,15 @@ void C2_MacroAssembler::arrays_hashcode(Register ary, Register cnt, Register res > mv(pow31_3, 29791); // [31^^3] > mv(pow31_2, 961); // [31^^2] > > - slli(chunks_end, chunks, chunks_end_shift); > - add(chunks_end, ary, chunks_end); > + shadd(chunks_end, chunks, ary, t0, chunks_end_shift); > andi(cnt, cnt, stride - 1); // don't forget about tail! > > bind(WIDE_LOOP); > - mulw(result, result, pow31_4); // 31^^4 * h > arrays_hashcode_elload(t0, Address(ary, 0 * elsize), eltype); > arrays_hashcode_elload(t1, Address(ary, 1 * elsize), eltype); > arrays_hashcode_elload(tmp5, Address(ary, 2 * elsize), eltype); > arrays_hashcode_elload(tmp6, Address(ary, 3 * elsize), eltype); > + mulw(result, result, pow31_4); // 31^^4 * h > mulw(t0, t0, pow31_3); // 31^^3 * ary[i+0] > addw(result, result, t0); > mulw(t1, t1, pow31_2); // 31^^2 * ary[i+1] > @@ -1977,8 +1976,7 @@ void C2_MacroAssembler::arrays_hashcode(Register ary, Register cnt, Register res > beqz(cnt, DONE); > > bind(TAIL); > - slli(chunks_end, cnt, chunks_end_shift); > - add(chunks_end, ary, chunks_end); > + shadd(chunks_end, cnt, ary, t0, chunks_end_shift); > > bind(TAIL_LOOP); > arrays_hashcode_elload(t0, Address(ary), eltype); > > makes the numbers good again at BPI-F3 as well (mostly due to move `mulw` down in the loop): > > --- -XX:DisableIntrinsic=_vectorizedHashCode --- > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.ints 1 avgt 10 11.271 ? 0.003 ns/op > ArraysHashCode.ints 5 avgt 10 28.910 ? 0.036 ns/op > ArraysHashCode.ints 10 avgt 10 41.176 ? 0.383 ns/op > ArraysHashCode.ints 20 avgt 10 68.236 ? 0.087 ns/op > ArraysHashCode.ints 30 avgt 10 88.215 ? 0.272 ns/op > ArraysHashCode.ints 40 avgt 10 115.218 ? 0.065 ns/op > ArraysHashCode.ints 50 avgt 10 135.834 ? 0.374 ns/op > ArraysHashCode.in... @ygaevsky : Thanks for finding that. Could you please propose another PR to fix that? It looks like a micro-optimization for K1. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-3089113887 From duke at openjdk.org Fri Jul 18 11:13:52 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Fri, 18 Jul 2025 11:13:52 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v10] In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 12:41:47 GMT, Yuri Gaevsky wrote: > Looking at the JMH numbers, it's interesting to find that `-XX:DisableIntrinsic=_vectorizedHashCode` outperforms `-XX:-UseRVV`. If that is the case, then why would we want the scalar version (that is `C2_MacroAssembler::arrays_hashcode()`)? 
I've just found that the following change: $ git diff diff --git a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp index c62997310b3..f98b48adccd 100644 --- a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp +++ b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp @@ -1953,16 +1953,15 @@ void C2_MacroAssembler::arrays_hashcode(Register ary, Register cnt, Register res mv(pow31_3, 29791); // [31^^3] mv(pow31_2, 961); // [31^^2] - slli(chunks_end, chunks, chunks_end_shift); - add(chunks_end, ary, chunks_end); + shadd(chunks_end, chunks, ary, t0, chunks_end_shift); andi(cnt, cnt, stride - 1); // don't forget about tail! bind(WIDE_LOOP); - mulw(result, result, pow31_4); // 31^^4 * h arrays_hashcode_elload(t0, Address(ary, 0 * elsize), eltype); arrays_hashcode_elload(t1, Address(ary, 1 * elsize), eltype); arrays_hashcode_elload(tmp5, Address(ary, 2 * elsize), eltype); arrays_hashcode_elload(tmp6, Address(ary, 3 * elsize), eltype); + mulw(result, result, pow31_4); // 31^^4 * h mulw(t0, t0, pow31_3); // 31^^3 * ary[i+0] addw(result, result, t0); mulw(t1, t1, pow31_2); // 31^^2 * ary[i+1] @@ -1977,8 +1976,7 @@ void C2_MacroAssembler::arrays_hashcode(Register ary, Register cnt, Register res beqz(cnt, DONE); bind(TAIL); - slli(chunks_end, cnt, chunks_end_shift); - add(chunks_end, ary, chunks_end); + shadd(chunks_end, cnt, ary, t0, chunks_end_shift); bind(TAIL_LOOP); arrays_hashcode_elload(t0, Address(ary), eltype); makes the numbers good again at BPI-F3 as well (mostly due to move `mulw` down in the loop): --- -XX:DisableIntrinsic=_vectorizedHashCode --- Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.ints 1 avgt 10 11.271 ? 0.003 ns/op ArraysHashCode.ints 5 avgt 10 28.910 ? 0.036 ns/op ArraysHashCode.ints 10 avgt 10 41.176 ? 0.383 ns/op ArraysHashCode.ints 20 avgt 10 68.236 ? 0.087 ns/op ArraysHashCode.ints 30 avgt 10 88.215 ? 0.272 ns/op ArraysHashCode.ints 40 avgt 10 115.218 ? 0.065 ns/op ArraysHashCode.ints 50 avgt 10 135.834 ? 0.374 ns/op ArraysHashCode.ints 60 avgt 10 162.042 ? 0.488 ns/op ArraysHashCode.ints 70 avgt 10 170.784 ? 0.538 ns/op ArraysHashCode.ints 80 avgt 10 194.294 ? 0.407 ns/op ArraysHashCode.ints 90 avgt 10 208.811 ? 0.289 ns/op ArraysHashCode.ints 100 avgt 10 231.826 ? 0.471 ns/op ArraysHashCode.ints 200 avgt 10 446.403 ? 0.491 ns/op ArraysHashCode.ints 300 avgt 10 655.815 ? 0.603 ns/op --- -XX:-UseRVV --- Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.ints 1 avgt 10 11.281 ? 0.004 ns/op ArraysHashCode.ints 5 avgt 10 23.178 ? 0.011 ns/op ArraysHashCode.ints 10 avgt 10 33.183 ? 0.018 ns/op ArraysHashCode.ints 20 avgt 10 50.778 ? 0.027 ns/op ArraysHashCode.ints 30 avgt 10 70.892 ? 0.153 ns/op ArraysHashCode.ints 40 avgt 10 88.292 ? 0.018 ns/op ArraysHashCode.ints 50 avgt 10 108.978 ? 0.269 ns/op ArraysHashCode.ints 60 avgt 10 126.010 ? 0.064 ns/op ArraysHashCode.ints 70 avgt 10 146.115 ? 0.252 ns/op ArraysHashCode.ints 80 avgt 10 163.453 ? 0.078 ns/op ArraysHashCode.ints 90 avgt 10 184.433 ? 0.256 ns/op ArraysHashCode.ints 100 avgt 10 201.002 ? 0.036 ns/op ArraysHashCode.ints 200 avgt 10 388.929 ? 0.254 ns/op ArraysHashCode.ints 300 avgt 10 577.083 ? 0.325 ns/op And it's still good on other hardware mentioned earlier. 
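For context on the scalar loop being tuned above: it evaluates the standard `Arrays.hashCode` polynomial four elements at a time. Below is a Java sketch of that recurrence; the class and method are made up for illustration, but the constants 29791 (31^3) and 961 (31^2) are the same ones loaded by the assembly shown above.

```
class HashCodeExample {
    // Reference semantics: h = 31 * h + a[i], starting from h = 1.
    // The 4-way unrolled form consumes four elements per iteration:
    //   h = 31^4*h + 31^3*a[i] + 31^2*a[i+1] + 31*a[i+2] + a[i+3]
    static int hashCodeUnrolled4(int[] a) {
        int h = 1;
        int i = 0;
        for (; i + 4 <= a.length; i += 4) {
            h = 923521 * h           // 31^4
              + 29791 * a[i]         // 31^3
              + 961 * a[i + 1]       // 31^2
              + 31 * a[i + 2]
              + a[i + 3];
        }
        for (; i < a.length; i++) {  // tail elements
            h = 31 * h + a[i];
        }
        return h;
    }
}
```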
------------- PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-3088680935 From dbriemann at openjdk.org Fri Jul 18 13:21:00 2025 From: dbriemann at openjdk.org (David Briemann) Date: Fri, 18 Jul 2025 13:21:00 GMT Subject: RFR: 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts Message-ID: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> Add the TimeoutFactor property to the CompileFramework to avoid timeouts on different systems. ------------- Commit messages: - 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts Changes: https://git.openjdk.org/jdk/pull/26388/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26388&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362602 Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26388.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26388/head:pull/26388 PR: https://git.openjdk.org/jdk/pull/26388 From syan at openjdk.org Fri Jul 18 14:58:49 2025 From: syan at openjdk.org (SendaoYan) Date: Fri, 18 Jul 2025 14:58:49 GMT Subject: RFR: 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts In-Reply-To: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> References: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> Message-ID: On Fri, 18 Jul 2025 13:16:41 GMT, David Briemann wrote: > Add the TimeoutFactor property to the CompileFramework to avoid timeouts on different systems. test/hotspot/jtreg/compiler/lib/compile_framework/Compile.java line 43: > 41: class Compile { > 42: private static final int COMPILE_TIMEOUT = 60; > 43: private static final float timeoutFactor = Float.parseFloat(System.getProperty("test.timeout.factor", "1.0")); Should we update the copyright year. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26388#discussion_r2216275658 From galder at openjdk.org Fri Jul 18 17:02:56 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Fri, 18 Jul 2025 17:02:56 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v2] In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 14:24:07 GMT, Feilong Jiang wrote: > I can't really review it since I'm not familiar with neither riscv, ~nor the flag~ nor the COH logic. Of course I do know the flag ?! Sorry, a lot going on, I will provide a review ------------- PR Comment: https://git.openjdk.org/jdk/pull/25976#issuecomment-3089957669 From kvn at openjdk.org Fri Jul 18 17:58:50 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 18 Jul 2025 17:58:50 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 [v2] In-Reply-To: References: Message-ID: <4IPo0l9irNFt1HsnbWaV35OSGaBLnZ_nvu_65u7oByA=.eaf6bcce-7c73-47e9-bb50-ed465349b57c@github.com> On Tue, 15 Jul 2025 08:59:17 GMT, Aleksey Shipilev wrote: >> See the bug for more analysis. >> >> The short summary is that `CompileQueue::delete_all` walks the entire compile queue and deletes the tasks. It normally goes smoothly, unless there are blocking tasks. Then, the actual waiters have to delete the task, lest we delete the task under waiter's feet. Full deletion and blocking waits coordinate with `waiting_for_completion_count` counter. 
This mechanism -- added by [JDK-8343938](https://bugs.openjdk.org/browse/JDK-8343938) in JDK 25 to solve a similar problem -- almost works. _Almost_. >> >> There is a subtle race window, where blocking waiter could have already unparked, dropped `waiting_for_completion_count` to `0` and proceeded to delete the task, see `CompileBroker::wait_for_completion()`. Then the queue deletion code could assume there are _no actual waiters_ on the blocking task, and proceed to delete the task _again_. Before [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473) this race was fairly innocuous, as second attempt at insertion into the free list was benign. But now, `CompileTask`-s are `delete`-d, and the second attempt leads to double free. >> >> I suspect we can fix that by complicating the coordination protocol even further, e.g. by tracking the counters more thoroughly. But, recognizing `CompileQueue::delete_all()` is basically only called from the compiler shutdown code (things are already bad), and it looks completely opportunistic (it does not delete the whole compiler _threads_, so skipping synchronous deletes on a few compile tasks are not a big deal), we should strive to simplify it. >> >> This PR summarily delegates _all_ blocking task deletes to waiters. I think it stands to reason (and can be seen in `CompilerBroker` code) that if a blocking task is in queue, then there _is_ a waiter that would call `CompileBroker::wait_for_completion()` on it. >> >> Additional testing: >> - [x] Linux AArch64 server fastdebug, `tier1` >> - [x] Linux x86_64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Also handle the corner case when compiler threads might be using the task @iwanowww please look? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26294#issuecomment-3090241741 From duke at openjdk.org Fri Jul 18 19:29:56 2025 From: duke at openjdk.org (duke) Date: Fri, 18 Jul 2025 19:29:56 GMT Subject: Withdrawn: 8352141: UBSAN: fix the left shift of negative value in relocInfo.cpp, internal_word_Relocation::pack_data_to() In-Reply-To: References: Message-ID: On Mon, 24 Mar 2025 13:18:25 GMT, Afshin Zafari wrote: > The `offset` variable used in left-shift op can be a large number with its sign-bit set. This makes a negative value which is UB for left-shift and is reported as > `runtime error: left shift of negative value -25 at relocInfo.cpp:...` > > Using `java_left_shif()` function is the workaround to avoid UB. This function uses reinterpret_cast to cast from signed to unsigned and back. > > Tests: > linux-x64-debug tier1 on a UBSAN enabled build. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/24196 From vlivanov at openjdk.org Fri Jul 18 21:02:41 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 18 Jul 2025 21:02:41 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 [v2] In-Reply-To: References: Message-ID: <3S7A3vvhpPdXNU8_-NMJA99cqsHwMrVByf-nT_jdiA8=.044554dc-61df-490f-b582-2c276bdab309@github.com> On Tue, 15 Jul 2025 08:59:17 GMT, Aleksey Shipilev wrote: >> See the bug for more analysis. >> >> The short summary is that `CompileQueue::delete_all` walks the entire compile queue and deletes the tasks. It normally goes smoothly, unless there are blocking tasks. Then, the actual waiters have to delete the task, lest we delete the task under waiter's feet. 
Full deletion and blocking waits coordinate with `waiting_for_completion_count` counter. This mechanism -- added by [JDK-8343938](https://bugs.openjdk.org/browse/JDK-8343938) in JDK 25 to solve a similar problem -- almost works. _Almost_. >> >> There is a subtle race window, where blocking waiter could have already unparked, dropped `waiting_for_completion_count` to `0` and proceeded to delete the task, see `CompileBroker::wait_for_completion()`. Then the queue deletion code could assume there are _no actual waiters_ on the blocking task, and proceed to delete the task _again_. Before [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473) this race was fairly innocuous, as second attempt at insertion into the free list was benign. But now, `CompileTask`-s are `delete`-d, and the second attempt leads to double free. >> >> I suspect we can fix that by complicating the coordination protocol even further, e.g. by tracking the counters more thoroughly. But, recognizing `CompileQueue::delete_all()` is basically only called from the compiler shutdown code (things are already bad), and it looks completely opportunistic (it does not delete the whole compiler _threads_, so skipping synchronous deletes on a few compile tasks are not a big deal), we should strive to simplify it. >> >> This PR summarily delegates _all_ blocking task deletes to waiters. I think it stands to reason (and can be seen in `CompilerBroker` code) that if a blocking task is in queue, then there _is_ a waiter that would call `CompileBroker::wait_for_completion()` on it. >> >> Additional testing: >> - [x] Linux AArch64 server fastdebug, `tier1` >> - [x] Linux x86_64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Also handle the corner case when compiler threads might be using the task Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26294#pullrequestreview-3034820180 From liach at openjdk.org Sat Jul 19 01:36:51 2025 From: liach at openjdk.org (Chen Liang) Date: Sat, 19 Jul 2025 01:36:51 GMT Subject: RFR: 8355223: Improve documentation on @IntrinsicCandidate [v7] In-Reply-To: References: Message-ID: <5_qPhVu2mYEDcDhm-xAgdB75_852NRgbkZBvYx2l50w=.930af007-8d62-4c9e-9f48-cbeaebc98cf3@github.com> On Wed, 21 May 2025 21:31:16 GMT, Chen Liang wrote: >> In offline discussion, we noted that the documentation on this annotation does not recommend minimizing the intrinsified section and moving whatever can be done in Java to Java; thus I prepared this documentation update, to shrink a "TLDR" essay to something concise for readers, such as pointing to that list at `vmIntrinsics.hpp` instead of "a list". > > Chen Liang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 10 additional commits since the last revision: > > - More review updates > - Merge branch 'master' of https://github.com/openjdk/jdk into doc/intrinsic-candidate > - Move intrinsic to be a subsection; just one most common function of the annotation > - Merge branch 'master' of https://github.com/openjdk/jdk into doc/intrinsic-candidate > - Merge branch 'master' of https://github.com/openjdk/jdk into doc/intrinsic-candidate > - Update src/java.base/share/classes/jdk/internal/vm/annotation/IntrinsicCandidate.java > > Co-authored-by: Raffaello Giulietti > - Shorter first sentence > - Updates, thanks to John > - Refine validation and defensive copying > - 8355223: Improve documentation on @IntrinsicCandidate I think I will move this to a separate design document as Roger suggested. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24777#issuecomment-3091360211 From duke at openjdk.org Sat Jul 19 01:33:49 2025 From: duke at openjdk.org (duke) Date: Sat, 19 Jul 2025 01:33:49 GMT Subject: Withdrawn: 8355223: Improve documentation on @IntrinsicCandidate In-Reply-To: References: Message-ID: On Mon, 21 Apr 2025 19:29:44 GMT, Chen Liang wrote: > In offline discussion, we noted that the documentation on this annotation does not recommend minimizing the intrinsified section and moving whatever can be done in Java to Java; thus I prepared this documentation update, to shrink a "TLDR" essay to something concise for readers, such as pointing to that list at `vmIntrinsics.hpp` instead of "a list". This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/24777 From fjiang at openjdk.org Mon Jul 21 01:45:15 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Mon, 21 Jul 2025 01:45:15 GMT Subject: RFR: 8362838: RISC-V: Incorrect matching rule leading to improper oop instruction encoding Message-ID: Same as [JDK-8361892](https://bugs.openjdk.org/browse/JDK-8361892), but for riscv. Testing: - [x] Tier1-3 & hotspot:tier4 on linux-riscv64 ------------- Commit messages: - RISC-V: Incorrect matching rule leading to improper oop instruction encoding Changes: https://git.openjdk.org/jdk/pull/26318/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26318&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362838 Stats: 31 lines in 1 file changed: 0 ins; 31 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26318.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26318/head:pull/26318 PR: https://git.openjdk.org/jdk/pull/26318 From fyang at openjdk.org Mon Jul 21 02:25:38 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 21 Jul 2025 02:25:38 GMT Subject: RFR: 8362838: RISC-V: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 14:02:08 GMT, Feilong Jiang wrote: > Same as [JDK-8361892](https://bugs.openjdk.org/browse/JDK-8361892), but for riscv. > > Testing: > - [x] Tier1-3 & hotspot:tier4 on linux-riscv64 Look good to me. Thanks for fixing this. ------------- Marked as reviewed by fyang (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26318#pullrequestreview-3036344155 From fjiang at openjdk.org Mon Jul 21 02:30:41 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Mon, 21 Jul 2025 02:30:41 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v4] In-Reply-To: References: <30vvzTU6W2p0YpB8Z9bSfO9ajO_fHh79q9cX1G3gz3k=.521b26d7-b606-4fdc-bdcf-41fd6c4891cc@github.com> Message-ID: On Thu, 10 Jul 2025 22:43:16 GMT, Dean Long wrote: >> Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: >> >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - Revert RISCV Macro modification >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses >> - riscv: fix c1 primitive array clone intrinsic regression > > src/hotspot/cpu/riscv/c1_LIRGenerator_riscv.cpp line 775: > >> 773: arraycopy_helper(x, &flags, &expected_type); >> 774: if (x->check_flag(Instruction::OmitChecksFlag)) { >> 775: flags = (flags & LIR_OpArrayCopy::unaligned); > > Should be LIR_OpArrayCopy::unaligned|LIR_OpArrayCopy::overlapping? See below. fixed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2218083182 From dzhang at openjdk.org Mon Jul 21 02:35:25 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 21 Jul 2025 02:35:25 GMT Subject: RFR: 8357694: RISC-V: Several IR verification tests fail when vlen=128 Message-ID: Hi, Can you help to review this patch? Thanks! These tests failed because with a vlen of 128, these tests generate vectors containing only two elements. However, ??2-element reductions for INT/LONG are not profitable??, so the compiler won't generate the corresponding reductions IR. We can refer to here: https://github.com/openjdk/jdk/blob/441dbde2c3c915ffd916e39a5b4a91df5620d7f3/src/hotspot/share/opto/superword.cpp#L1606-L1633 According to the explanation above, when I use `-XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=2`, these cases passed with vlen=128. We can fix this problem by adding the restriction of `MaxVectorSize` greater than or equal to 32 (256 bits) to these test cases. 
## Test (fastdebug) ### Test on k1 and qemu (w/ RVV, vlen=128) - compiler/vectorization/runner/LoopReductionOpTest.java - compiler/c2/irTests/TestIfMinMax.java - compiler/loopopts/superword/RedTest_long.java - compiler/loopopts/superword/SumRed_Long.java - compiler/loopopts/superword/TestGeneralizedReductions.java - compiler/loopopts/superword/TestUnorderedReductionPartialVectorization.java ------------- Commit messages: - 8357694: RISC-V: Several IR verification tests fail when vlen=128 Changes: https://git.openjdk.org/jdk/pull/26408/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26408&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8357694 Stats: 19 lines in 6 files changed: 7 ins; 0 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/26408.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26408/head:pull/26408 PR: https://git.openjdk.org/jdk/pull/26408 From fyang at openjdk.org Mon Jul 21 04:17:44 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 21 Jul 2025 04:17:44 GMT Subject: RFR: 8357694: RISC-V: Several IR verification tests fail when vlen=128 In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 02:30:03 GMT, Dingli Zhang wrote: > Hi, > Can you help to review this patch? Thanks! > > These tests failed because with a vlen of 128, these tests generate vectors containing only two elements. However, ??2-element reductions for INT/LONG are not profitable??, so the compiler won't generate the corresponding reductions IR. > > We can refer to here: > https://github.com/openjdk/jdk/blob/441dbde2c3c915ffd916e39a5b4a91df5620d7f3/src/hotspot/share/opto/superword.cpp#L1606-L1633 > > According to the explanation above, when I use `-XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=2`, these cases passed with vlen=128. > > We can fix this problem by adding the restriction of `MaxVectorSize` greater than or equal to 32 (256 bits) to these test cases. > > ## Test (fastdebug) > ### Test on k1 and qemu (w/ RVV, vlen=128) > - compiler/vectorization/runner/LoopReductionOpTest.java > - compiler/c2/irTests/TestIfMinMax.java > - compiler/loopopts/superword/RedTest_long.java > - compiler/loopopts/superword/SumRed_Long.java > - compiler/loopopts/superword/TestGeneralizedReductions.java > - compiler/loopopts/superword/TestUnorderedReductionPartialVectorization.java test/hotspot/jtreg/compiler/c2/irTests/TestIfMinMax.java line 298: > 296: > 297: @Test > 298: @IR(applyIfAnd = { "SuperWordReductions", "true", "MaxVectorSize", ">=32" }, Maybe we should only add this new requirement for RVV? Seems that it is not needed for avx512. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26408#discussion_r2218149766 From dzhang at openjdk.org Mon Jul 21 05:01:46 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 21 Jul 2025 05:01:46 GMT Subject: RFR: 8357694: RISC-V: Several IR verification tests fail when vlen=128 [v2] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 04:15:19 GMT, Fei Yang wrote: >> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> Apply fix only on RISC-V > > test/hotspot/jtreg/compiler/c2/irTests/TestIfMinMax.java line 298: > >> 296: >> 297: @Test >> 298: @IR(applyIfAnd = { "SuperWordReductions", "true", "MaxVectorSize", ">=32" }, > > Maybe we should only add this new requirement for RVV? Seems that it is not needed for avx512. Thanks for the review! I will add restrictions only for RISC-V. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26408#discussion_r2218189115 From dzhang at openjdk.org Mon Jul 21 05:01:45 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 21 Jul 2025 05:01:45 GMT Subject: RFR: 8357694: RISC-V: Several IR verification tests fail when vlen=128 [v2] In-Reply-To: References: Message-ID: <__-vaE1gtG6vsSH8CWGTh4HPOy267edyiA3viq-YtWc=.163ece35-ae6a-4fec-9d7a-81dcbb72fa38@github.com> > Hi, > Can you help to review this patch? Thanks! > > These tests failed because with a vlen of 128, these tests generate vectors containing only two elements. However, ??2-element reductions for INT/LONG are not profitable??, so the compiler won't generate the corresponding reductions IR. > > We can refer to here: > https://github.com/openjdk/jdk/blob/441dbde2c3c915ffd916e39a5b4a91df5620d7f3/src/hotspot/share/opto/superword.cpp#L1606-L1633 > > According to the explanation above, when I use `-XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=2`, these cases passed with vlen=128. > > We can fix this problem by adding the restriction of `MaxVectorSize` greater than or equal to 32 (256 bits) to these test cases. > > ## Test (fastdebug) > ### Test on k1 and qemu (w/ RVV, vlen=128) > - compiler/vectorization/runner/LoopReductionOpTest.java > - compiler/c2/irTests/TestIfMinMax.java > - compiler/loopopts/superword/RedTest_long.java > - compiler/loopopts/superword/SumRed_Long.java > - compiler/loopopts/superword/TestGeneralizedReductions.java > - compiler/loopopts/superword/TestUnorderedReductionPartialVectorization.java Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Apply fix only on RISC-V ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26408/files - new: https://git.openjdk.org/jdk/pull/26408/files/611a0d85..43815f2f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26408&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26408&range=00-01 Stats: 44 lines in 4 files changed: 29 ins; 0 del; 15 mod Patch: https://git.openjdk.org/jdk/pull/26408.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26408/head:pull/26408 PR: https://git.openjdk.org/jdk/pull/26408 From shade at openjdk.org Mon Jul 21 06:06:49 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 21 Jul 2025 06:06:49 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 [v2] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 08:59:17 GMT, Aleksey Shipilev wrote: >> See the bug for more analysis. >> >> The short summary is that `CompileQueue::delete_all` walks the entire compile queue and deletes the tasks. It normally goes smoothly, unless there are blocking tasks. Then, the actual waiters have to delete the task, lest we delete the task under waiter's feet. Full deletion and blocking waits coordinate with `waiting_for_completion_count` counter. This mechanism -- added by [JDK-8343938](https://bugs.openjdk.org/browse/JDK-8343938) in JDK 25 to solve a similar problem -- almost works. _Almost_. >> >> There is a subtle race window, where blocking waiter could have already unparked, dropped `waiting_for_completion_count` to `0` and proceeded to delete the task, see `CompileBroker::wait_for_completion()`. Then the queue deletion code could assume there are _no actual waiters_ on the blocking task, and proceed to delete the task _again_. 
Before [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473) this race was fairly innocuous, as second attempt at insertion into the free list was benign. But now, `CompileTask`-s are `delete`-d, and the second attempt leads to double free. >> >> I suspect we can fix that by complicating the coordination protocol even further, e.g. by tracking the counters more thoroughly. But, recognizing `CompileQueue::delete_all()` is basically only called from the compiler shutdown code (things are already bad), and it looks completely opportunistic (it does not delete the whole compiler _threads_, so skipping synchronous deletes on a few compile tasks are not a big deal), we should strive to simplify it. >> >> This PR summarily delegates _all_ blocking task deletes to waiters. I think it stands to reason (and can be seen in `CompilerBroker` code) that if a blocking task is in queue, then there _is_ a waiter that would call `CompileBroker::wait_for_completion()` on it. >> >> Additional testing: >> - [x] Linux AArch64 server fastdebug, `tier1` >> - [x] Linux x86_64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Also handle the corner case when compiler threads might be using the task Thank you! I re-tested locally after local merge with current master, and it still works. Here goes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26294#issuecomment-3095320950 From shade at openjdk.org Mon Jul 21 06:06:49 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 21 Jul 2025 06:06:49 GMT Subject: Integrated: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 13:57:09 GMT, Aleksey Shipilev wrote: > See the bug for more analysis. > > The short summary is that `CompileQueue::delete_all` walks the entire compile queue and deletes the tasks. It normally goes smoothly, unless there are blocking tasks. Then, the actual waiters have to delete the task, lest we delete the task under waiter's feet. Full deletion and blocking waits coordinate with `waiting_for_completion_count` counter. This mechanism -- added by [JDK-8343938](https://bugs.openjdk.org/browse/JDK-8343938) in JDK 25 to solve a similar problem -- almost works. _Almost_. > > There is a subtle race window, where blocking waiter could have already unparked, dropped `waiting_for_completion_count` to `0` and proceeded to delete the task, see `CompileBroker::wait_for_completion()`. Then the queue deletion code could assume there are _no actual waiters_ on the blocking task, and proceed to delete the task _again_. Before [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473) this race was fairly innocuous, as second attempt at insertion into the free list was benign. But now, `CompileTask`-s are `delete`-d, and the second attempt leads to double free. > > I suspect we can fix that by complicating the coordination protocol even further, e.g. by tracking the counters more thoroughly. But, recognizing `CompileQueue::delete_all()` is basically only called from the compiler shutdown code (things are already bad), and it looks completely opportunistic (it does not delete the whole compiler _threads_, so skipping synchronous deletes on a few compile tasks are not a big deal), we should strive to simplify it. > > This PR summarily delegates _all_ blocking task deletes to waiters. 
I think it stands to reason (and can be seen in `CompilerBroker` code) that if a blocking task is in queue, then there _is_ a waiter that would call `CompileBroker::wait_for_completion()` on it. > > Additional testing: > - [x] Linux AArch64 server fastdebug, `tier1` > - [x] Linux x86_64 server fastdebug, `all` This pull request has now been integrated. Changeset: 9609f57c Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/9609f57cef684d2f44d3e12a3522811a3c0776f4 Stats: 71 lines in 5 files changed: 19 ins; 40 del; 12 mod 8361752: Double free in CompileQueue::delete_all after JDK-8357473 Reviewed-by: kvn, vlivanov ------------- PR: https://git.openjdk.org/jdk/pull/26294 From yadongwang at openjdk.org Mon Jul 21 06:52:43 2025 From: yadongwang at openjdk.org (Yadong Wang) Date: Mon, 21 Jul 2025 06:52:43 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Sun, 13 Jul 2025 08:40:45 GMT, Yadong Wang wrote: >> The bug is that the predicate rule of immByteMapBase would cause a ConP Node for oop incorrect matching with byte_map_base when the placeholder jni handle address was just allocated to the address of byte_map_base. >> >> C2 uses JNI handles as placeholders to encoding constant oops, and one of some handle maybe locate at the address of byte_map_base, which is not memory reserved by CardTable. It's possible because JNIHandleBlocks are allocated by malloc. >> >> // The assembler store_check code will do an unsigned shift of the oop, >> // then add it to _byte_map_base, i.e. >> // >> // _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift) >> _byte_map = (CardValue*) rs.base(); >> _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift); >> >> In aarch64 port, C2 will incorrectly match ConP for oop to ConP for byte_map_base by the immByteMapBase operand. >> >> // Card Table Byte Map Base >> operand immByteMapBase() >> %{ >> // Get base of card map >> predicate((jbyte*)n->get_ptr() == >> ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); >> match(ConP); >> >> op_cost(0); >> format %{ %} >> interface(CONST_INTER); >> %} >> >> // Load Byte Map Base Constant >> instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) >> %{ >> match(Set dst con); >> >> ins_cost(INSN_COST); >> format %{ "adr $dst, $con\t# Byte Map Base" %} >> >> ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); >> >> ins_pipe(ialu_imm); >> %} >> >> As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: >> 0xffff25caf08c: ldaxr x8, [x11] >> 0xffff25caf090: cmp x10, x8 >> 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any >> 0xffff25caf098: stlxr w8, x28, [x11] >> 0xffff25caf09c: cbnz w8, 0xffff25caf08c >> 0xffff25caf0a0: orr x11, xzr, #0x3 >> 0xffff25caf0a4: str x11, [x13] >> 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none >> 0xffff25caf0ac: str x14, [sp] >> 0xffff25caf0b0: add x2, sp, #0x20 >> 0xffff25caf0b4: adrp x1, 0xffff21730000 >> 0xffff25caf0b8: bl 0xffff256fffc0 >> 0xffff25caf0bc: ldr x14, [sp] >> 0xffff25caf0c0: b 0xffff25caef80 >> 0xffff25caf0c4: add x13, sp, #0x20 >> 0xffff25caf0c8: adrp x12, 0xffff21730000 >> 0xffff25caf0cc: ldr x10, [x13] >> 0xffff25caf0d0: cmp x10, xzr >> 0xffff25c... 
> > Yadong Wang has updated the pull request incrementally with one additional commit since the last revision: > > 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding Does anyone have any questions about this modification proposal? If not, it will be integrated tomorrow. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3095455340 From thartmann at openjdk.org Mon Jul 21 07:22:54 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 21 Jul 2025 07:22:54 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v15] In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 10:43:34 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. >> >> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. >> >> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Refine lower bound computation This looks good to me and I think you addressed all the comments that Emanuel had. Let's wait for another day or two in case someone else wants to take a look as well. In the meantime, please request approval for integration into JDK 25 since we are know at RDP 2: https://openjdk.org/jeps/3#Fix-Request-Process ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23947#pullrequestreview-3036859676 From jbhateja at openjdk.org Mon Jul 21 07:26:55 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 21 Jul 2025 07:26:55 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4] In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 09:09:14 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is >> relative smaller than that of `fromLong`. So this patch does the conversion for these cases. >> >> The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. >> >> Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. >> >> This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. 
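For reference, a minimal Vector API example of the equivalence being exploited here. It is illustrative only and not taken from the patch: the species, the constant and the class name are arbitrary choices, and running it needs --add-modules jdk.incubator.vector.

    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorMask;
    import jdk.incubator.vector.VectorSpecies;

    public class FromLongVsMaskAll {
        static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256; // 8 int lanes

        public static void main(String[] args) {
            long allLanes = -1L >>> (64 - SPECIES.length()); // lowest 8 bits set
            VectorMask<Integer> fromLong = VectorMask.fromLong(SPECIES, allLanes);
            VectorMask<Integer> maskAll  = SPECIES.maskAll(true);
            // Both masks set exactly the same lanes, so fromLong with this constant can be
            // strength-reduced to the cheaper maskAll(true); 0L similarly maps to maskAll(false).
            System.out.println(fromLong.toLong() == maskAll.toLong()); // true
        }
    }
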
>> >> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like >> >> VectorMaskToLong (VectorLongToMask x) => x >> >> >> Hence, this patch also added the following optimizations: >> >> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> >> VectorMaskCast (VectorMaskCast x) => x >> >> And we can see noticeable performance improvement with the above optimizations for floating-point types. >> >> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Error After Error Uplift >> microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 >> microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 >> microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 >> microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 >> microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 >> microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 >> microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 >> microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 >> >> >> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: >> >> Benchm... > > erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - Refactor the implementation > > Do the convertion in C2's IGVN phase to cover more cases. > - Merge branch 'master' into JDK-8356760 > - Simplify the test code > - Address some review comments > > Add support for the following patterns: > toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) > toLong(maskAll(false)) => 0 > > And add more test cases. > - Merge branch 'master' into JDK-8356760 > - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases > > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would > set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent > to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is > relative smaller than that of `fromLong`. This patch does the conversion > for these cases if `l` is a compile time constant. > > And this conversion also enables further optimizations that recognize > maskAll patterns, see [1]. > > Some JTReg test cases are added to ensure the optimization is effective. > > I tried many different ways to write a JMH benchmark, but failed. Since > the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific > compile-time constant, the statement will be hoisted out of the loop. > If we don't use a loop, the hotspot will become other instructions, and > no obvious performance change was observed. However, combined with the > optimization of [1], we can observe a performance improvement of about > 7% on both aarch64 and x64. > > The patch was tested on both aarch64 and x64, all of tier1 tier2 and > tier3 tests passed. > > [1] https://github.com/openjdk/jdk/pull/24674 src/hotspot/share/opto/vectorIntrinsics.cpp line 692: > 690: // generate a MaskAll or Replicate instead. 
> 691: > 692: // The "maskAll" API uses the corresponding integer types for floating-point data. This is because mask all only accepts -1 and 0 values, since -1.0f in float in IEEE 754 format does not set all bits hence an floating point to integral conversion is mandatory here. src/hotspot/share/opto/vectornode.cpp line 1520: > 1518: uint vlen = vt->length(); > 1519: BasicType bt = vt->element_basic_type(); > 1520: int opc = is_mask ? Op_MaskAll : Op_Replicate; You can remove this check, since VectorNode::scalar2vector alreday has a match rule for Op_MaskAll src/hotspot/share/opto/vectornode.cpp line 1532: > 1530: } else { > 1531: con = phase->intcon(con_value); > 1532: } Suggestion: phase->makecon(TypeInteger::make(bits_type->get_con(), maskall_bt) src/hotspot/share/opto/vectornode.cpp line 1544: > 1542: > 1543: Node* VectorLoadMaskNode::Ideal(PhaseGVN* phase, bool can_reshape) { > 1544: // VectorLoadMask(VectorLongToMask(-1/0)) => Replicate(-1/0) FTR: This is only useful for non-predicated targets. Since on predicated target VectorLongToMask is not succeeded by VectorLoadMask https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectorIntrinsics.cpp#L703 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2218327515 PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2218349016 PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2218364699 PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2218386052 From thartmann at openjdk.org Mon Jul 21 07:30:41 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 21 Jul 2025 07:30:41 GMT Subject: RFR: 8361700: Missed optimization in PhaseIterGVN for mask and shift patterns due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v3] In-Reply-To: References: Message-ID: On Fri, 18 Jul 2025 08:52:34 GMT, Beno?t Maillard wrote: >> This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist`. >> >> The affected optimization is the transformation of `(x & mask) >> shift` into `(x >> shift) & (mask >> shift)`, where `mask` is a constant. This transformation is handled in `RShiftNode::IdealIL` for both `RShiftI` and `RShiftL` nodes. >> >> The dependency of this optimization extends beyond a direct input: from the viewpoint of a shift node, it relies on changes to the inputs of its inputs (i.e., an `AndI`/`AndL` input node to the shift). Therefore, when the `And` node changes, the corresponding shift node must be notified to allow the optimization to take place. >> >> Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361700) >> - [x] tier1-3, plus some internal testing >> - [x] Added test from the fuzzer >> >> Thank you for reviewing! > > Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: > > Update src/hotspot/share/opto/phaseX.cpp > > Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> Looks good to me too. ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26347#pullrequestreview-3036879159 From duke at openjdk.org Mon Jul 21 07:33:40 2025 From: duke at openjdk.org (duke) Date: Mon, 21 Jul 2025 07:33:40 GMT Subject: RFR: 8361700: Missed optimization in PhaseIterGVN for mask and shift patterns due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v3] In-Reply-To: References: Message-ID: On Fri, 18 Jul 2025 08:52:34 GMT, Beno?t Maillard wrote: >> This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist`. >> >> The affected optimization is the transformation of `(x & mask) >> shift` into `(x >> shift) & (mask >> shift)`, where `mask` is a constant. This transformation is handled in `RShiftNode::IdealIL` for both `RShiftI` and `RShiftL` nodes. >> >> The dependency of this optimization extends beyond a direct input: from the viewpoint of a shift node, it relies on changes to the inputs of its inputs (i.e., an `AndI`/`AndL` input node to the shift). Therefore, when the `And` node changes, the corresponding shift node must be notified to allow the optimization to take place. >> >> Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361700) >> - [x] tier1-3, plus some internal testing >> - [x] Added test from the fuzzer >> >> Thank you for reviewing! > > Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: > > Update src/hotspot/share/opto/phaseX.cpp > > Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> @benoitmaillard Your change (at version 3d620328615205749d2de6bd7705a2cdb4df506c) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26347#issuecomment-3095563954 From bmaillard at openjdk.org Mon Jul 21 07:40:52 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Mon, 21 Jul 2025 07:40:52 GMT Subject: Integrated: 8361700: Missed optimization in PhaseIterGVN for mask and shift patterns due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist In-Reply-To: References: Message-ID: <2050toeE95z9-ARCuAGTtgmxrTcZbcAOwR31CJ5NGg0=.7f52002f-3984-489e-8c92-dc8d167a418e@github.com> On Wed, 16 Jul 2025 12:42:32 GMT, Beno?t Maillard wrote: > This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist`. > > The affected optimization is the transformation of `(x & mask) >> shift` into `(x >> shift) & (mask >> shift)`, where `mask` is a constant. This transformation is handled in `RShiftNode::IdealIL` for both `RShiftI` and `RShiftL` nodes. > > The dependency of this optimization extends beyond a direct input: from the viewpoint of a shift node, it relies on changes to the inputs of its inputs (i.e., an `AndI`/`AndL` input node to the shift). Therefore, when the `And` node changes, the corresponding shift node must be notified to allow the optimization to take place. 
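To make the rewritten pattern concrete, a small self-contained Java check of the transformation described above; the mask, shift amount and sample values are arbitrary and chosen only for illustration.

    public class MaskShiftSketch {
        public static void main(String[] args) {
            int mask = 0xFF00, shift = 8;
            int[] samples = {0, 1, -1, 0x12345678, Integer.MIN_VALUE, Integer.MAX_VALUE};
            for (int x : samples) {
                int before = (x & mask) >> shift;            // (x & mask) >> shift
                int after  = (x >> shift) & (mask >> shift); // (x >> shift) & (mask >> shift)
                // C2 rewrites the first form into the second, so both must agree.
                System.out.println(before == after);         // always true
            }
        }
    }
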
> > Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361700) > - [x] tier1-3, plus some internal testing > - [x] Added test from the fuzzer > > Thank you for reviewing! This pull request has now been integrated. Changeset: 62a58062 Author: Beno?t Maillard Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/62a58062e5f3d0a723608d98d2412ea779f73897 Stats: 69 lines in 2 files changed: 69 ins; 0 del; 0 mod 8361700: Missed optimization in PhaseIterGVN for mask and shift patterns due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist Reviewed-by: thartmann, mchevalier, mhaessig, jkarthikeyan ------------- PR: https://git.openjdk.org/jdk/pull/26347 From thartmann at openjdk.org Mon Jul 21 07:52:05 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 21 Jul 2025 07:52:05 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v37] In-Reply-To: References: Message-ID: <60gcSg6iIl4M7_jwgscpCL7GC7AQj4yulI1Jejo4W3E=.2ff2d416-d464-4d4e-a332-1205f56eafef@github.com> On Wed, 16 Jul 2025 09:38:18 GMT, Roland Westrelin wrote: >> To optimize a long counted loop and long range checks in a long or int >> counted loop, the loop is turned into a loop nest. When the loop has >> few iterations, the overhead of having an outer loop whose backedge is >> never taken, has a measurable cost. Furthermore, creating the loop >> nest usually causes one iteration of the loop to be peeled so >> predicates can be set up. If the loop is short running, then it's an >> extra iteration that's run with range checks (compared to an int >> counted loop with int range checks). >> >> This change doesn't create a loop nest when: >> >> 1- it can be determined statically at loop nest creation time that the >> loop runs for a short enough number of iterations >> >> 2- profiling reports that the loop runs for no more than ShortLoopIter >> iterations (1000 by default). >> >> For 2-, a guard is added which is implemented as yet another predicate. >> >> While this change is in principle simple, I ran into a few >> implementation issues: >> >> - while c2 has a way to compute the number of iterations of an int >> counted loop, it doesn't have that for long counted loop. The >> existing logic for int counted loops promotes values to long to >> avoid overflows. I reworked it so it now works for both long and int >> counted loops. >> >> - I added a new deoptimization reason (Reason_short_running_loop) for >> the new predicate. Given the number of iterations is narrowed down >> by the predicate, the limit of the loop after transformation is a >> cast node that's control dependent on the short running loop >> predicate. Because once the counted loop is transformed, it is >> likely that range check predicates will be inserted and they will >> depend on the limit, the short running loop predicate has to be the >> one that's further away from the loop entry. 
Now it is also possible >> that the limit before transformation depends on a predicate >> (TestShortRunningLongCountedLoopPredicatesClone is an example), we >> can have: new predicates inserted after the transformation that >> depend on the casted limit that itself depend on old predicates >> added before the transformation. To solve this cicular dependency, >> parse and assert predicates are cloned between the old predicates >> and the loop head. The cloned short running loop parse predicate is >> the one that's used to insert the short running loop predicate. >> >> - In the case of a long counted loop, the loop is transformed into a >> regular loop with a ... > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > test failures Looks reasonable to me. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/21630#pullrequestreview-3036936379 From fyang at openjdk.org Mon Jul 21 07:55:43 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 21 Jul 2025 07:55:43 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 14:17:45 GMT, Hamlin Li wrote: > Hi, > Can you help to review this patch? > > By https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp#L1270, there are far call, indirect call, reloc call. > NativeFarCall is in fact a reloc call, the name is confusing, better to rename it to RelocCall. > Also add some comments and do some other simple cleanup. > > Thanks! src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 131: > 129: } > 130: > 131: bool RelocCall::set_destination_mt_safe(address dest, bool assert_lock) { Seens you need to merge latest HEAD and rebase. The `assert_lock` param of `NativeFarCall::set_destination_mt_safe` has been removed recently. src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 190: > 188: assert(code != nullptr, "Could not find the containing code blob"); > 189: > 190: address dest = MacroAssembler::target_addr_for_insn(call_addr); Is this change safe? Seems it modifies the original logic. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2218446799 PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2218444040 From dbriemann at openjdk.org Mon Jul 21 08:00:29 2025 From: dbriemann at openjdk.org (David Briemann) Date: Mon, 21 Jul 2025 08:00:29 GMT Subject: RFR: 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts [v2] In-Reply-To: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> References: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> Message-ID: > Add the TimeoutFactor property to the CompileFramework to avoid timeouts on different systems. 
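For context, a minimal sketch of linear timeout scaling driven by jtreg's test.timeout.factor property. The property name follows jtreg's convention; the base timeout, class name and method are illustrative and are not the CompileFramework code.

    public class TimeoutScalingSketch {
        // Base timeout in milliseconds; illustrative value only.
        private static final long COMPILE_TIMEOUT_MS = 60_000;

        static long scaledTimeoutMs() {
            // jtreg exposes its -timeoutFactor option to tests as -Dtest.timeout.factor=...
            double factor = Double.parseDouble(System.getProperty("test.timeout.factor", "1.0"));
            return (long) (COMPILE_TIMEOUT_MS * factor);  // scale linearly
        }

        public static void main(String[] args) {
            System.out.println("timeout = " + scaledTimeoutMs() + " ms");
        }
    }
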
David Briemann has updated the pull request incrementally with one additional commit since the last revision: update copyright header ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26388/files - new: https://git.openjdk.org/jdk/pull/26388/files/7853b2a5..87fdf248 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26388&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26388&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26388.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26388/head:pull/26388 PR: https://git.openjdk.org/jdk/pull/26388 From duke at openjdk.org Mon Jul 21 08:10:45 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Mon, 21 Jul 2025 08:10:45 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v10] In-Reply-To: References: Message-ID: On Fri, 18 Jul 2025 09:07:54 GMT, Yuri Gaevsky wrote: >>> > Looking at the JMH numbers, it's interesting to find that `-XX:DisableIntrinsic=_vectorizedHashCode` outperforms `-XX:-UseRVV`. If that is the case, then why would we want the scalar version (that is `C2_MacroAssembler::arrays_hashcode()`)? >>> >>> You are right: the non-RVV version of intrinsic performs worse on BPI-F3 hardware with size > 70, though originally it was better on StarFive JH7110 and T-Head RVB-ICE, please see #16629. >> >> Hm, it is still good on Lichee Pi 4A: >> >> $ ( for i in "-XX:DisableIntrinsic=_vectorizedHashCode" " " ; do ( echo "--- ${i} ---" && ${JAVA_HOME}/bin/java -jar benchmarks.jar --jvmArgs="-XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions ${i}" org.openjdk.bench.java.lang.ArraysHashCode.ints -p size=1,5,10,20,30,40,50,60,70,80,90,100,200,300 -f 3 -r 1 -w 1 -wi 10 -i 10 2>&1 | tail -15 ) done ) >> --- -XX:DisableIntrinsic=_vectorizedHashCode --- >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.ints 1 avgt 30 51.709 ? 3.815 ns/op >> ArraysHashCode.ints 5 avgt 30 68.146 ? 1.833 ns/op >> ArraysHashCode.ints 10 avgt 30 89.217 ? 0.496 ns/op >> ArraysHashCode.ints 20 avgt 30 140.807 ? 9.335 ns/op >> ArraysHashCode.ints 30 avgt 30 172.030 ? 4.025 ns/op >> ArraysHashCode.ints 40 avgt 30 222.927 ? 10.342 ns/op >> ArraysHashCode.ints 50 avgt 30 251.719 ? 0.686 ns/op >> ArraysHashCode.ints 60 avgt 30 305.947 ? 10.532 ns/op >> ArraysHashCode.ints 70 avgt 30 347.602 ? 7.024 ns/op >> ArraysHashCode.ints 80 avgt 30 382.057 ? 1.520 ns/op >> ArraysHashCode.ints 90 avgt 30 426.022 ? 31.800 ns/op >> ArraysHashCode.ints 100 avgt 30 457.737 ? 0.652 ns/op >> ArraysHashCode.ints 200 avgt 30 913.501 ? 3.258 ns/op >> ArraysHashCode.ints 300 avgt 30 1297.355 ? 2.383 ns/op >> --- --- >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.ints 1 avgt 30 50.141 ? 0.463 ns/op >> ArraysHashCode.ints 5 avgt 30 62.921 ? 2.538 ns/op >> ArraysHashCode.ints 10 avgt 30 77.686 ? 2.577 ns/op >> ArraysHashCode.ints 20 avgt 30 102.736 ? 0.136 ns/op >> ArraysHashCode.ints 30 avgt 30 137.592 ? 4.232 ns/op >> ArraysHashCode.ints 40 avgt 30 157.376 ? 0.302 ns/op >> ArraysHashCode.ints 50 avgt 30 196.068 ? 3.812 ns/op >> ArraysHashCode.ints 60 avgt 30 212.... > >> Looking at the JMH numbers, it's interesting to find that `-XX:DisableIntrinsic=_vectorizedHashCode` outperforms `-XX:-UseRVV`. If that is the case, then why would we want the scalar version (that is `C2_MacroAssembler::arrays_hashcode()`)? 
> > I've just found that the following change: > > $ git diff > diff --git a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp > index c62997310b3..f98b48adccd 100644 > --- a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp > +++ b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp > @@ -1953,16 +1953,15 @@ void C2_MacroAssembler::arrays_hashcode(Register ary, Register cnt, Register res > mv(pow31_3, 29791); // [31^^3] > mv(pow31_2, 961); // [31^^2] > > - slli(chunks_end, chunks, chunks_end_shift); > - add(chunks_end, ary, chunks_end); > + shadd(chunks_end, chunks, ary, t0, chunks_end_shift); > andi(cnt, cnt, stride - 1); // don't forget about tail! > > bind(WIDE_LOOP); > - mulw(result, result, pow31_4); // 31^^4 * h > arrays_hashcode_elload(t0, Address(ary, 0 * elsize), eltype); > arrays_hashcode_elload(t1, Address(ary, 1 * elsize), eltype); > arrays_hashcode_elload(tmp5, Address(ary, 2 * elsize), eltype); > arrays_hashcode_elload(tmp6, Address(ary, 3 * elsize), eltype); > + mulw(result, result, pow31_4); // 31^^4 * h > mulw(t0, t0, pow31_3); // 31^^3 * ary[i+0] > addw(result, result, t0); > mulw(t1, t1, pow31_2); // 31^^2 * ary[i+1] > @@ -1977,8 +1976,7 @@ void C2_MacroAssembler::arrays_hashcode(Register ary, Register cnt, Register res > beqz(cnt, DONE); > > bind(TAIL); > - slli(chunks_end, cnt, chunks_end_shift); > - add(chunks_end, ary, chunks_end); > + shadd(chunks_end, cnt, ary, t0, chunks_end_shift); > > bind(TAIL_LOOP); > arrays_hashcode_elload(t0, Address(ary), eltype); > > makes the numbers good again at BPI-F3 as well (mostly due to move `mulw` down in the loop): > > --- -XX:DisableIntrinsic=_vectorizedHashCode --- > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.ints 1 avgt 10 11.271 ? 0.003 ns/op > ArraysHashCode.ints 5 avgt 10 28.910 ? 0.036 ns/op > ArraysHashCode.ints 10 avgt 10 41.176 ? 0.383 ns/op > ArraysHashCode.ints 20 avgt 10 68.236 ? 0.087 ns/op > ArraysHashCode.ints 30 avgt 10 88.215 ? 0.272 ns/op > ArraysHashCode.ints 40 avgt 10 115.218 ? 0.065 ns/op > ArraysHashCode.ints 50 avgt 10 135.834 ? 0.374 ns/op > ArraysHashCode.in... > @ygaevsky : Thanks for finding that. Could you please propose another PR to fix that? It looks like a micro-optimization for K1. Sure, done: please see https://github.com/openjdk/jdk/pull/26409. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-3095665554 From duke at openjdk.org Mon Jul 21 08:13:25 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Mon, 21 Jul 2025 08:13:25 GMT Subject: RFR: 8362596: RISC-V: Improve _vectorizedHashCode intrinsic Message-ID: This is a micro-optimization for RISC-V SpacemiT K1 CPU to fix [encountered performance regression](https://github.com/openjdk/jdk/pull/17413#issuecomment-3082664335). 
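As background for the unrolled loop touched by the diff above, a small Java check that one 4-way unrolled step (with the 31^4, 31^3 = 29791 and 31^2 = 961 constants seen in the intrinsic) matches the plain h = 31*h + a[i] recurrence; the array contents and class name are arbitrary.

    public class HashUnrollSketch {
        public static void main(String[] args) {
            int[] a = {7, -3, 42, 123456};
            int plain = 1;
            for (int v : a) {
                plain = 31 * plain + v;                 // scalar recurrence used by hashCode
            }
            // One iteration of the 4-way unrolled form, starting from the same seed h = 1:
            int unrolled = 923521 * 1 + 29791 * a[0] + 961 * a[1] + 31 * a[2] + a[3];
            System.out.println(plain == unrolled);      // true: both compute the same hash
        }
    }
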
------------- Commit messages: - 8362596: RISC-V: Improve _vectorizedHashCode intrinsic Changes: https://git.openjdk.org/jdk/pull/26409/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26409&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362596 Stats: 6 lines in 1 file changed: 1 ins; 3 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26409.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26409/head:pull/26409 PR: https://git.openjdk.org/jdk/pull/26409 From duke at openjdk.org Mon Jul 21 08:13:25 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Mon, 21 Jul 2025 08:13:25 GMT Subject: RFR: 8362596: RISC-V: Improve _vectorizedHashCode intrinsic In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 08:07:48 GMT, Yuri Gaevsky wrote: > This is a micro-optimization for RISC-V SpacemiT K1 CPU to fix [encountered performance regression](https://github.com/openjdk/jdk/pull/17413#issuecomment-3082664335). bpif3-16g% for i in "-XX:DisableIntrinsic=_vectorizedHashCode" "-XX:-UseRVV" "-XX:+UseRVV" ; \ do ( echo "--- ${i} ---" && ${JAVA_HOME}/bin/java -jar benchmarks.jar \ --jvmArgs="-XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions ${i}" \ org.openjdk.bench.java.lang.ArraysHashCode.ints \ -p size=1,5,10,20,30,40,50,60,70,80,90,100,200,300 \ -f 3 -r 1 -w 1 -wi 5 -i 10 2>&1 | tail -15 ) done --- -XX:DisableIntrinsic=_vectorizedHashCode --- Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.ints 1 avgt 30 11.280 ? 0.004 ns/op ArraysHashCode.ints 5 avgt 30 28.831 ? 0.032 ns/op ArraysHashCode.ints 10 avgt 30 41.179 ? 0.220 ns/op ArraysHashCode.ints 20 avgt 30 68.178 ? 0.142 ns/op ArraysHashCode.ints 30 avgt 30 88.911 ? 0.385 ns/op ArraysHashCode.ints 40 avgt 30 115.043 ? 0.186 ns/op ArraysHashCode.ints 50 avgt 30 135.755 ? 0.607 ns/op ArraysHashCode.ints 60 avgt 30 161.924 ? 0.187 ns/op ArraysHashCode.ints 70 avgt 30 170.367 ? 0.413 ns/op ArraysHashCode.ints 80 avgt 30 193.985 ? 0.707 ns/op ArraysHashCode.ints 90 avgt 30 207.633 ? 0.147 ns/op ArraysHashCode.ints 100 avgt 30 232.362 ? 0.259 ns/op ArraysHashCode.ints 200 avgt 30 447.390 ? 0.677 ns/op ArraysHashCode.ints 300 avgt 30 656.324 ? 1.100 ns/op --- -XX:-UseRVV --- Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.ints 1 avgt 30 11.291 ? 0.017 ns/op ArraysHashCode.ints 5 avgt 30 23.176 ? 0.011 ns/op ArraysHashCode.ints 10 avgt 30 33.264 ? 0.073 ns/op ArraysHashCode.ints 20 avgt 30 50.726 ? 0.006 ns/op ArraysHashCode.ints 30 avgt 30 70.990 ? 0.184 ns/op ArraysHashCode.ints 40 avgt 30 88.368 ? 0.050 ns/op ArraysHashCode.ints 50 avgt 30 108.582 ? 0.175 ns/op ArraysHashCode.ints 60 avgt 30 126.084 ? 0.202 ns/op ArraysHashCode.ints 70 avgt 30 146.201 ? 0.169 ns/op ArraysHashCode.ints 80 avgt 30 163.469 ? 0.033 ns/op ArraysHashCode.ints 90 avgt 30 183.653 ? 0.182 ns/op ArraysHashCode.ints 100 avgt 30 201.063 ? 0.156 ns/op ArraysHashCode.ints 200 avgt 30 389.657 ? 0.752 ns/op ArraysHashCode.ints 300 avgt 30 577.283 ? 0.434 ns/op --- -XX:+UseRVV --- Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.ints 1 avgt 30 11.273 ? 0.001 ns/op ArraysHashCode.ints 5 avgt 30 23.184 ? 0.010 ns/op ArraysHashCode.ints 10 avgt 30 33.262 ? 0.086 ns/op ArraysHashCode.ints 20 avgt 30 50.811 ? 0.078 ns/op ArraysHashCode.ints 30 avgt 30 71.019 ? 0.164 ns/op ArraysHashCode.ints 40 avgt 30 88.331 ? 0.096 ns/op ArraysHashCode.ints 50 avgt 30 108.852 ? 0.212 ns/op ArraysHashCode.ints 60 avgt 30 125.948 ? 0.039 ns/op ArraysHashCode.ints 70 avgt 30 146.518 ? 0.345 ns/op ArraysHashCode.ints 80 avgt 30 163.464 ? 
0.029 ns/op ArraysHashCode.ints 90 avgt 30 183.722 ? 0.237 ns/op ArraysHashCode.ints 100 avgt 30 201.307 ? 0.346 ns/op ArraysHashCode.ints 200 avgt 30 389.048 ? 0.322 ns/op ArraysHashCode.ints 300 avgt 30 576.821 ? 0.130 ns/op ------------- PR Comment: https://git.openjdk.org/jdk/pull/26409#issuecomment-3095669754 From mhaessig at openjdk.org Mon Jul 21 08:14:42 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 21 Jul 2025 08:14:42 GMT Subject: RFR: 8357694: RISC-V: Several IR verification tests fail when vlen=128 [v2] In-Reply-To: <__-vaE1gtG6vsSH8CWGTh4HPOy267edyiA3viq-YtWc=.163ece35-ae6a-4fec-9d7a-81dcbb72fa38@github.com> References: <__-vaE1gtG6vsSH8CWGTh4HPOy267edyiA3viq-YtWc=.163ece35-ae6a-4fec-9d7a-81dcbb72fa38@github.com> Message-ID: On Mon, 21 Jul 2025 05:01:45 GMT, Dingli Zhang wrote: >> Hi, >> Can you help to review this patch? Thanks! >> >> These tests failed because with a vlen of 128, these tests generate vectors containing only two elements. However, ??2-element reductions for INT/LONG are not profitable??, so the compiler won't generate the corresponding reductions IR. >> >> We can refer to here: >> https://github.com/openjdk/jdk/blob/441dbde2c3c915ffd916e39a5b4a91df5620d7f3/src/hotspot/share/opto/superword.cpp#L1606-L1633 >> >> According to the explanation above, when I use `-XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=2`, these cases passed with vlen=128. >> >> We can fix this problem by adding the restriction of `MaxVectorSize` greater than or equal to 32 (256 bits) to these test cases. >> >> ## Test (fastdebug) >> ### Test on k1 and qemu (w/ RVV, vlen=128) >> - compiler/vectorization/runner/LoopReductionOpTest.java >> - compiler/c2/irTests/TestIfMinMax.java >> - compiler/loopopts/superword/RedTest_long.java >> - compiler/loopopts/superword/SumRed_Long.java >> - compiler/loopopts/superword/TestGeneralizedReductions.java >> - compiler/loopopts/superword/TestUnorderedReductionPartialVectorization.java > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Apply fix only on RISC-V Thank you for this PR, @DingliZhang. It looks good to me. I kicked off testing on our side and will keep you posted on the results. ------------- PR Review: https://git.openjdk.org/jdk/pull/26408#pullrequestreview-3036997514 From fyang at openjdk.org Mon Jul 21 08:17:44 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 21 Jul 2025 08:17:44 GMT Subject: RFR: 8362596: RISC-V: Improve _vectorizedHashCode intrinsic In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 08:07:48 GMT, Yuri Gaevsky wrote: > This is a micro-optimization for RISC-V SpacemiT K1 CPU to fix [encountered performance regression](https://github.com/openjdk/jdk/pull/17413#issuecomment-3082664335). Thanks! ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26409#pullrequestreview-3037008439 From bmaillard at openjdk.org Mon Jul 21 08:17:59 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Mon, 21 Jul 2025 08:17:59 GMT Subject: RFR: 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist Message-ID: This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist` (again). 
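As a quick illustration of the first redundant sequence listed below (ConvD2L->ConvL2D->ConvD2L), a self-contained Java check that the long->double->long round trip is redundant once the value has already been truncated to a long; the sample values and class name are arbitrary.

    public class ConvRoundTripSketch {
        public static void main(String[] args) {
            double[] samples = {0.5, -1234.75, 1e18, 9.9e18, Double.NaN,
                                Double.POSITIVE_INFINITY, Double.MIN_VALUE};
            for (double d : samples) {
                long once        = (long) d;                    // ConvD2L
                long roundTripped = (long) (double) (long) d;   // ConvD2L -> ConvL2D -> ConvD2L
                // C2's identity rule replaces the triple conversion by the single ConvD2L.
                System.out.println(once == roundTripped);       // true for every sample
            }
        }
    }
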
This is similar to [JDK-8361700](https://bugs.openjdk.org/browse/JDK-8361700?filter=-1). The optimization in question is the removal of redundant `ConvX2Y->ConvY2X->ConvX2Y` sequences (where `X` and `Y` are primitive number types), which get replaced by a single `ConvX2Y` as an identity optimization. This missing optimization was originally reported only for `ConvD2LNode`, but it turns out that other conversion nodes have analog optimization patterns. After manual inspection of identity optimizations in `convertnode.cpp`, I was able to reproduce missing optimizations for the following conversion sequences: - `ConvD2L->ConvL2D->ConvD2L` - `ConvF2I->ConvI2F->ConvF2I` - `ConvF2L->ConvL2F->ConvF2L` - `ConvI2F->ConvF2I->ConvI2F` Similar optimization patterns exist for additional conversion nodes. However, it is not clear if these nodes are subject to the same missed optimization issue. Further investigation may be needed, as I was unable to reproduce such cases with simple tests. This is again a case where an optimization depends on the input of its input. Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. ### Testing - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8359603) - [x] tier1-3, plus some internal testing - [x] Added test from the fuzzer, and tests for other sequences (manually derived from the original one) Thank you for reviewing! ------------- Commit messages: - Merge branch 'master' of https://git.openjdk.org/jdk into JDK-8359603 - 8359603: Remove switch in tests - 8359603: Rename test and add cases for other number types with analog optimization patterns - 8359603: Add other similar conversion patterns for which a missing opt could be trigerred - 8359603: Add opcode check on current node - 8359603: Add comment explaining the notification - 8359603: Add test from fuzzer - 8359603: Add missing notification Changes: https://git.openjdk.org/jdk/pull/26368/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26368&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8359603 Stats: 123 lines in 2 files changed: 123 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26368.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26368/head:pull/26368 PR: https://git.openjdk.org/jdk/pull/26368 From galder at openjdk.org Mon Jul 21 08:30:41 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Mon, 21 Jul 2025 08:30:41 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v5] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 11:50:33 GMT, Feilong Jiang wrote: >> Hi, please consider. >> [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. >> The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. >> If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. 
>> This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. >> We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. >> >> This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. >> The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. >> >> Test on linux-riscv64: >> - [x] Tier1-3 >> >> JMH data on P550 SBC for reference (w/o and w/ the patch): >> >> Before: >> >> Without COH: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op >> ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op >> ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op >> ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op >> ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op >> ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op >> ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op >> ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op >> ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op >> ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op >> ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op >> ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op >> >> ------------------------------------------------------------------------- >> With COH: >> >> Benchmark (size) Mode Cnt Score Error Un... > > Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains eight additional commits since the last revision: > > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - also keep overlapping flag > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - Revert RISCV Macro modification > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses > - riscv: fix c1 primitive array clone intrinsic regression Changes requested by galder (Author). Also, some comments on why these flags are needed when doing array copies would be good for future reference. src/hotspot/cpu/riscv/c1_LIRGenerator_riscv.cpp line 775: > 773: arraycopy_helper(x, &flags, &expected_type); > 774: if (x->check_flag(Instruction::OmitChecksFlag)) { > 775: flags = (flags & (LIR_OpArrayCopy::unaligned | LIR_OpArrayCopy::overlapping)); The changes in the two files need to be in synch, so I wonder if `LIR_OpArrayCopy::unaligned | LIR_OpArrayCopy::overlapping` could be abstracted away within a function in `LIR_OpArrayCopy`. So something like this (apologies for any syntactic/semantic errors): flags = (flags & LIR_OpArrayGopy::get_array_copy_flags()); Then on the other method something like: ((flags & ~(LIR_OpArrayGopy::get_array_copy_flags())) == 0) Function name is just an example, feel free to suggest some other if you think it fits better. Thoughts? 
------------- PR Review: https://git.openjdk.org/jdk/pull/25976#pullrequestreview-3037040733 PR Comment: https://git.openjdk.org/jdk/pull/25976#issuecomment-3095714032 PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2218510943 From mli at openjdk.org Mon Jul 21 08:35:27 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 21 Jul 2025 08:35:27 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v2] In-Reply-To: References: Message-ID: > Hi, > Can you help to review this patch? > > By https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp#L1270, there are far call, indirect call, reloc call. > NativeFarCall is in fact a reloc call, the name is confusing, better to rename it to RelocCall. > Also add some comments and do some other simple cleanup. > > Thanks! Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: - merge master - initial commit ------------- Changes: https://git.openjdk.org/jdk/pull/26370/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26370&range=01 Stats: 46 lines in 1 file changed: 8 ins; 3 del; 35 mod Patch: https://git.openjdk.org/jdk/pull/26370.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26370/head:pull/26370 PR: https://git.openjdk.org/jdk/pull/26370 From mhaessig at openjdk.org Mon Jul 21 08:39:39 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 21 Jul 2025 08:39:39 GMT Subject: RFR: 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts [v2] In-Reply-To: References: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> Message-ID: On Mon, 21 Jul 2025 08:00:29 GMT, David Briemann wrote: >> Add the TimeoutFactor property to the CompileFramework to avoid timeouts on different systems. > > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > update copyright header test/hotspot/jtreg/compiler/lib/compile_framework/Compile.java line 186: > 184: try { > 185: Process process = builder.start(); > 186: long timeout = COMPILE_TIMEOUT * (long)Math.pow(2, timeoutFactor-1); Is there a reason for scaling the timeout exponentially instead of linearly, as jtreg does it and most users would expect? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26388#discussion_r2218529003 From mli at openjdk.org Mon Jul 21 08:45:36 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 21 Jul 2025 08:45:36 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v3] In-Reply-To: References: Message-ID: > Hi, > Can you help to review this patch? > > By https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp#L1270, there are far call, indirect call, reloc call. > NativeFarCall is in fact a reloc call, the name is confusing, better to rename it to RelocCall. > Also add some comments and do some other simple cleanup. > > Thanks! 
Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26370/files - new: https://git.openjdk.org/jdk/pull/26370/files/1099e13a..f72db245 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26370&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26370&range=01-02 Stats: 13 lines in 1 file changed: 0 ins; 5 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/26370.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26370/head:pull/26370 PR: https://git.openjdk.org/jdk/pull/26370 From mli at openjdk.org Mon Jul 21 08:55:46 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 21 Jul 2025 08:55:46 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v3] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 07:52:53 GMT, Fei Yang wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> fix > > src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 131: > >> 129: } >> 130: >> 131: bool RelocCall::set_destination_mt_safe(address dest, bool assert_lock) { > > Seens you need to merge latest HEAD and rebase. The `assert_lock` param of `NativeFarCall::set_destination_mt_safe` has been removed recently. Thanks for reminding, it's merged. > src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 190: > >> 188: assert(code != nullptr, "Could not find the containing code blob"); >> 189: >> 190: address dest = MacroAssembler::target_addr_for_insn(call_addr); > > Is this change safe? Seems it modifies the original logic. Yes, `MacroAssembler::pd_call_destination` only call `MacroAssembler::target_addr_for_insn`. And `MacroAssembler::target_addr_for_insn` are used in other places in NativeFarCall, so it's better to use `target_addr_for_insn` only to improve readability. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2218561860 PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2218560427 From dbriemann at openjdk.org Mon Jul 21 09:04:38 2025 From: dbriemann at openjdk.org (David Briemann) Date: Mon, 21 Jul 2025 09:04:38 GMT Subject: RFR: 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts [v3] In-Reply-To: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> References: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> Message-ID: > Add the TimeoutFactor property to the CompileFramework to avoid timeouts on different systems. 
David Briemann has updated the pull request incrementally with one additional commit since the last revision: make timeout factor scale linearly ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26388/files - new: https://git.openjdk.org/jdk/pull/26388/files/87fdf248..074bbca3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26388&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26388&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26388.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26388/head:pull/26388 PR: https://git.openjdk.org/jdk/pull/26388 From dbriemann at openjdk.org Mon Jul 21 09:04:41 2025 From: dbriemann at openjdk.org (David Briemann) Date: Mon, 21 Jul 2025 09:04:41 GMT Subject: RFR: 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts [v2] In-Reply-To: References: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> Message-ID: On Mon, 21 Jul 2025 08:36:08 GMT, Manuel H?ssig wrote: >> David Briemann has updated the pull request incrementally with one additional commit since the last revision: >> >> update copyright header > > test/hotspot/jtreg/compiler/lib/compile_framework/Compile.java line 186: > >> 184: try { >> 185: Process process = builder.start(); >> 186: long timeout = COMPILE_TIMEOUT * (long)Math.pow(2, timeoutFactor-1); > > Is there a reason for scaling the timeout exponentially instead of linearly, as jtreg does it and most users would expect? I thought that I saw this documented somewhere but I cannot find it anymore and in other places it is used linearly, like you say. I adapted it. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26388#discussion_r2218578639 From mli at openjdk.org Mon Jul 21 09:05:41 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 21 Jul 2025 09:05:41 GMT Subject: RFR: 8362493: Cleanup CodeBuffer::copy_relocations_to In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 12:14:08 GMT, Manuel H?ssig wrote: >> Hi, >> Can you help to review this simple patch? >> >> `CodeBuffer::copy_relocations_to(address buf, csize_t buf_limit, bool only_inst)` is only used in `copy_relocations_to(CodeBlob* dest)` which passes false to only_inst, so the former one should be able to be simplified. >> >> Thank you! > > Thank you for working on this cleanup, @Hamlin-Li! It looks good to me. > > I kicked off some testing on our side and will let you know what the results are. Hi @mhaessig , how's your test result? I guess it should be fine. Thanks ------------- PR Comment: https://git.openjdk.org/jdk/pull/26366#issuecomment-3095807017 From aph at openjdk.org Mon Jul 21 09:07:41 2025 From: aph at openjdk.org (Andrew Haley) Date: Mon, 21 Jul 2025 09:07:41 GMT Subject: RFR: 8361890: Aarch64: Removal of redundant dmb from C1 AtomicLong methods In-Reply-To: <60YMRP6cNslwEeVX2TWmnMYdO872xGaeShKMEj0dWGY=.2f4f504f-93d1-4bab-b721-e5c964f4c465@github.com> References: <60YMRP6cNslwEeVX2TWmnMYdO872xGaeShKMEj0dWGY=.2f4f504f-93d1-4bab-b721-e5c964f4c465@github.com> Message-ID: <-qo6EhOT6c-G4PL4EVQxmEyuPthroTNc-7SjwzhXpzk=.82ccc4c3-ec11-47f6-954b-7aa0a9251b43@github.com> On Thu, 10 Jul 2025 15:49:40 GMT, Samuel Chee wrote: > The current C1 implementation of AtomicLong methods > which either adds or exchanges (such as getAndAdd) > emit one of a ldaddal and swpal respectively when using > LSE as well as an immediately proceeding dmb. 
Since > ldaddal/swpal have both acquire and release semantics, > this provides similar ordering guarantees to a dmb.full > so the dmb here is redundant and can be removed. > > This is due to both clause 7 and clause 11 of the > definition of Barrier-ordered-before in B2.3.7 of the > DDI0487 L.a Arm Architecture Reference Manual for A-profile > architecture being satisfied by the existence of a > ldaddal/swpal which ensures such memory ordering guarantees. Marked as reviewed by aph (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26245#pullrequestreview-3037151111 From mchevalier at openjdk.org Mon Jul 21 09:07:42 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Mon, 21 Jul 2025 09:07:42 GMT Subject: RFR: 8361890: Aarch64: Removal of redundant dmb from C1 AtomicLong methods In-Reply-To: <60YMRP6cNslwEeVX2TWmnMYdO872xGaeShKMEj0dWGY=.2f4f504f-93d1-4bab-b721-e5c964f4c465@github.com> References: <60YMRP6cNslwEeVX2TWmnMYdO872xGaeShKMEj0dWGY=.2f4f504f-93d1-4bab-b721-e5c964f4c465@github.com> Message-ID: <-pN_mmfICeaEHEJa4rrngIs2v2LaDGTMGE4OuuXVs08=.b7ab52d2-5f3a-4621-884f-2c9a207655cd@github.com> On Thu, 10 Jul 2025 15:49:40 GMT, Samuel Chee wrote: > The current C1 implementation of AtomicLong methods > which either adds or exchanges (such as getAndAdd) > emit one of a ldaddal and swpal respectively when using > LSE as well as an immediately proceeding dmb. Since > ldaddal/swpal have both acquire and release semantics, > this provides similar ordering guarantees to a dmb.full > so the dmb here is redundant and can be removed. > > This is due to both clause 7 and clause 11 of the > definition of Barrier-ordered-before in B2.3.7 of the > DDI0487 L.a Arm Architecture Reference Manual for A-profile > architecture being satisfied by the existence of a > ldaddal/swpal which ensures such memory ordering guarantees. It took a bit, but I'm back! I've run tier 1..3 and some internal testing, and it's passing everything. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26245#issuecomment-3095293982 From aph at openjdk.org Mon Jul 21 09:07:42 2025 From: aph at openjdk.org (Andrew Haley) Date: Mon, 21 Jul 2025 09:07:42 GMT Subject: RFR: 8361890: Aarch64: Removal of redundant dmb from C1 AtomicLong methods In-Reply-To: <-pN_mmfICeaEHEJa4rrngIs2v2LaDGTMGE4OuuXVs08=.b7ab52d2-5f3a-4621-884f-2c9a207655cd@github.com> References: <60YMRP6cNslwEeVX2TWmnMYdO872xGaeShKMEj0dWGY=.2f4f504f-93d1-4bab-b721-e5c964f4c465@github.com> <-pN_mmfICeaEHEJa4rrngIs2v2LaDGTMGE4OuuXVs08=.b7ab52d2-5f3a-4621-884f-2c9a207655cd@github.com> Message-ID: On Mon, 21 Jul 2025 05:48:25 GMT, Marc Chevalier wrote: > It took a bit, but I'm back! I've run tier 1..3 and some internal testing, and it's passing everything. The right test for this is jcstress, but even then results are only valid for a single implementation. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26245#issuecomment-3095815004 From mhaessig at openjdk.org Mon Jul 21 09:14:45 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 21 Jul 2025 09:14:45 GMT Subject: RFR: 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts [v3] In-Reply-To: References: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> Message-ID: On Mon, 21 Jul 2025 09:04:38 GMT, David Briemann wrote: >> Add the TimeoutFactor property to the CompileFramework to avoid timeouts on different systems. 
> > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > make timeout factor scale linearly Thank you, @dbriemann, for addressing my comment. I just kicked off testing on our side and will keep you posted on the results. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26388#issuecomment-3095838390 From mhaessig at openjdk.org Mon Jul 21 09:15:45 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 21 Jul 2025 09:15:45 GMT Subject: RFR: 8362493: Cleanup CodeBuffer::copy_relocations_to In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 11:09:09 GMT, Hamlin Li wrote: > Hi, > Can you help to review this simple patch? > > `CodeBuffer::copy_relocations_to(address buf, csize_t buf_limit, bool only_inst)` is only used in `copy_relocations_to(CodeBlob* dest)` which passes false to only_inst, so the former one should be able to be simplified. > > Thank you! I messed up my first test run. The latest run should be done in about an hour and is all green so far. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26366#issuecomment-3095843654 From fyang at openjdk.org Mon Jul 21 09:48:40 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 21 Jul 2025 09:48:40 GMT Subject: RFR: 8357694: RISC-V: Several IR verification tests fail when vlen=128 [v2] In-Reply-To: <__-vaE1gtG6vsSH8CWGTh4HPOy267edyiA3viq-YtWc=.163ece35-ae6a-4fec-9d7a-81dcbb72fa38@github.com> References: <__-vaE1gtG6vsSH8CWGTh4HPOy267edyiA3viq-YtWc=.163ece35-ae6a-4fec-9d7a-81dcbb72fa38@github.com> Message-ID: On Mon, 21 Jul 2025 05:01:45 GMT, Dingli Zhang wrote: >> Hi, >> Can you help to review this patch? Thanks! >> >> These tests failed because with a vlen of 128, these tests generate vectors containing only two elements. However, ??2-element reductions for INT/LONG are not profitable??, so the compiler won't generate the corresponding reductions IR. >> >> We can refer to here: >> https://github.com/openjdk/jdk/blob/441dbde2c3c915ffd916e39a5b4a91df5620d7f3/src/hotspot/share/opto/superword.cpp#L1606-L1633 >> >> According to the explanation above, when I use `-XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=2`, these cases passed with vlen=128. >> >> We can fix this problem by adding the restriction of `MaxVectorSize` greater than or equal to 32 (256 bits) to these test cases. >> >> ## Test (fastdebug) >> ### Test on k1 and qemu (w/ RVV, vlen=128) >> - compiler/vectorization/runner/LoopReductionOpTest.java >> - compiler/c2/irTests/TestIfMinMax.java >> - compiler/loopopts/superword/RedTest_long.java >> - compiler/loopopts/superword/SumRed_Long.java >> - compiler/loopopts/superword/TestGeneralizedReductions.java >> - compiler/loopopts/superword/TestUnorderedReductionPartialVectorization.java > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Apply fix only on RISC-V Updated change LGTM. Thanks! ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26408#pullrequestreview-3037296678 From mhaessig at openjdk.org Mon Jul 21 10:40:39 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 21 Jul 2025 10:40:39 GMT Subject: RFR: 8362493: Cleanup CodeBuffer::copy_relocations_to In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 11:09:09 GMT, Hamlin Li wrote: > Hi, > Can you help to review this simple patch? 
> > `CodeBuffer::copy_relocations_to(address buf, csize_t buf_limit, bool only_inst)` is only used in `copy_relocations_to(CodeBlob* dest)` which passes false to only_inst, so the former one should be able to be simplified. > > Thank you! Testing passed. ------------- Marked as reviewed by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/26366#pullrequestreview-3037485780 From bkilambi at openjdk.org Mon Jul 21 11:09:04 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 21 Jul 2025 11:09:04 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v16] In-Reply-To: References: Message-ID: > This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. > > It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. > > For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. > > For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. > > This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. > > Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - > > > Benchmark (size) Mode Cnt Gain > SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 > SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 > SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 > SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 > SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 > SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 > SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 > SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 > SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 > SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 > > > Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. 
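For readers unfamiliar with the operation being intrinsified, a minimal, self-contained Java sketch of the two-vector selectFrom call that the new tbl-based backend targets; the species choice and array setup are illustrative assumptions, not code from this patch, and it needs --add-modules=jdk.incubator.vector to run:

    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorSpecies;

    public class SelectFromTwoVectorExample {
        static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;

        // idx holds per-lane indices into the concatenation of a and b:
        // values in [0, VLENGTH) pick from a, values in [VLENGTH, 2*VLENGTH) pick from b.
        static void select(int[] a, int[] b, int[] idx, int[] out) {
            IntVector va = IntVector.fromArray(SPECIES, a, 0);
            IntVector vb = IntVector.fromArray(SPECIES, b, 0);
            IntVector vi = IntVector.fromArray(SPECIES, idx, 0);
            vi.selectFrom(va, vb).intoArray(out, 0);
        }
    }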
Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: Refine comments in c2_MacroAssembler_aarch64.cpp ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23570/files - new: https://git.openjdk.org/jdk/pull/23570/files/6c7266d7..1d553a94 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=15 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=14-15 Stats: 37 lines in 1 file changed: 31 ins; 4 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23570.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23570/head:pull/23570 PR: https://git.openjdk.org/jdk/pull/23570 From mli at openjdk.org Mon Jul 21 11:13:50 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 21 Jul 2025 11:13:50 GMT Subject: Integrated: 8362493: Cleanup CodeBuffer::copy_relocations_to In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 11:09:09 GMT, Hamlin Li wrote: > Hi, > Can you help to review this simple patch? > > `CodeBuffer::copy_relocations_to(address buf, csize_t buf_limit, bool only_inst)` is only used in `copy_relocations_to(CodeBlob* dest)` which passes false to only_inst, so the former one should be able to be simplified. > > Thank you! This pull request has now been integrated. Changeset: fd7f78a5 Author: Hamlin Li URL: https://git.openjdk.org/jdk/commit/fd7f78a5351a5b00bc9a3173e7671afe2d1e6fe4 Stats: 10 lines in 2 files changed: 1 ins; 6 del; 3 mod 8362493: Cleanup CodeBuffer::copy_relocations_to Reviewed-by: mhaessig, kvn ------------- PR: https://git.openjdk.org/jdk/pull/26366 From mli at openjdk.org Mon Jul 21 11:13:50 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 21 Jul 2025 11:13:50 GMT Subject: RFR: 8362493: Cleanup CodeBuffer::copy_relocations_to In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 10:38:30 GMT, Manuel H?ssig wrote: > Testing passed. Thank you @mhaessig for testing! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26366#issuecomment-3096237665 From mli at openjdk.org Mon Jul 21 11:16:53 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 21 Jul 2025 11:16:53 GMT Subject: RFR: 8357694: RISC-V: Several IR verification tests fail when vlen=128 [v2] In-Reply-To: <__-vaE1gtG6vsSH8CWGTh4HPOy267edyiA3viq-YtWc=.163ece35-ae6a-4fec-9d7a-81dcbb72fa38@github.com> References: <__-vaE1gtG6vsSH8CWGTh4HPOy267edyiA3viq-YtWc=.163ece35-ae6a-4fec-9d7a-81dcbb72fa38@github.com> Message-ID: On Mon, 21 Jul 2025 05:01:45 GMT, Dingli Zhang wrote: >> Hi, >> Can you help to review this patch? Thanks! >> >> These tests failed because with a vlen of 128, these tests generate vectors containing only two elements. However, ??2-element reductions for INT/LONG are not profitable??, so the compiler won't generate the corresponding reductions IR. >> >> We can refer to here: >> https://github.com/openjdk/jdk/blob/441dbde2c3c915ffd916e39a5b4a91df5620d7f3/src/hotspot/share/opto/superword.cpp#L1606-L1633 >> >> According to the explanation above, when I use `-XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=2`, these cases passed with vlen=128. >> >> We can fix this problem by adding the restriction of `MaxVectorSize` greater than or equal to 32 (256 bits) to these test cases. 
>> >> ## Test (fastdebug) >> ### Test on k1 and qemu (w/ RVV, vlen=128) >> - compiler/vectorization/runner/LoopReductionOpTest.java >> - compiler/c2/irTests/TestIfMinMax.java >> - compiler/loopopts/superword/RedTest_long.java >> - compiler/loopopts/superword/SumRed_Long.java >> - compiler/loopopts/superword/TestGeneralizedReductions.java >> - compiler/loopopts/superword/TestUnorderedReductionPartialVectorization.java > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Apply fix only on RISC-V Thanks for working on this, looks good. ------------- Marked as reviewed by mli (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26408#pullrequestreview-3037608933 From jbhateja at openjdk.org Mon Jul 21 11:31:47 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 21 Jul 2025 11:31:47 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4] In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 09:09:14 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is >> relative smaller than that of `fromLong`. So this patch does the conversion for these cases. >> >> The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. >> >> Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. >> >> This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like >> >> VectorMaskToLong (VectorLongToMask x) => x >> >> >> Hence, this patch also added the following optimizations: >> >> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> >> VectorMaskCast (VectorMaskCast x) => x >> >> And we can see noticeable performance improvement with the above optimizations for floating-point types. 
>> >> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Error After Error Uplift >> microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 >> microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 >> microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 >> microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 >> microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 >> microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 >> microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 >> microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 >> >> >> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: >> >> Benchm... > > erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - Refactor the implementation > > Do the convertion in C2's IGVN phase to cover more cases. > - Merge branch 'master' into JDK-8356760 > - Simplify the test code > - Address some review comments > > Add support for the following patterns: > toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) > toLong(maskAll(false)) => 0 > > And add more test cases. > - Merge branch 'master' into JDK-8356760 > - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases > > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would > set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent > to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is > relative smaller than that of `fromLong`. This patch does the conversion > for these cases if `l` is a compile time constant. > > And this conversion also enables further optimizations that recognize > maskAll patterns, see [1]. > > Some JTReg test cases are added to ensure the optimization is effective. > > I tried many different ways to write a JMH benchmark, but failed. Since > the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific > compile-time constant, the statement will be hoisted out of the loop. > If we don't use a loop, the hotspot will become other instructions, and > no obvious performance change was observed. However, combined with the > optimization of [1], we can observe a performance improvement of about > 7% on both aarch64 and x64. > > The patch was tested on both aarch64 and x64, all of tier1 tier2 and > tier3 tests passed. > > [1] https://github.com/openjdk/jdk/pull/24674 src/hotspot/share/opto/vectornode.cpp line 1977: > 1975: vect_type()->eq(in1->in(1)->bottom_type())) { > 1976: return in1->in(1); > 1977: } This is nice to have ideal transformation, but lane-compatible mask casts are anyway no-ops. 
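For illustration, a self-contained sketch of the VectorMaskCast (VectorMaskCast x) => x collapse discussed in this comment; the condition and returned node follow the quoted diff, while the choice of the Identity hook and the surrounding method shape are assumptions, not the exact JDK change:

    // Sketch only: a mask cast of a mask cast is a no-op when the outer cast
    // returns to the inner input's original vector type.
    Node* VectorMaskCastNode::Identity(PhaseGVN* phase) {
      Node* in1 = in(1);
      if (in1->Opcode() == Op_VectorMaskCast &&
          vect_type()->eq(in1->in(1)->bottom_type())) {
        return in1->in(1);  // reuse the original mask
      }
      return this;
    }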
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2218816668 From jbhateja at openjdk.org Mon Jul 21 12:10:47 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 21 Jul 2025 12:10:47 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4] In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 09:09:14 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is >> relative smaller than that of `fromLong`. So this patch does the conversion for these cases. >> >> The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. >> >> Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. >> >> This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like >> >> VectorMaskToLong (VectorLongToMask x) => x >> >> >> Hence, this patch also added the following optimizations: >> >> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> >> VectorMaskCast (VectorMaskCast x) => x >> >> And we can see noticeable performance improvement with the above optimizations for floating-point types. >> >> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Error After Error Uplift >> microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 >> microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 >> microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 >> microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 >> microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 >> microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 >> microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 >> microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 >> >> >> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: >> >> Benchm... > > erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - Refactor the implementation > > Do the convertion in C2's IGVN phase to cover more cases. > - Merge branch 'master' into JDK-8356760 > - Simplify the test code > - Address some review comments > > Add support for the following patterns: > toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) > toLong(maskAll(false)) => 0 > > And add more test cases. 
> - Merge branch 'master' into JDK-8356760 > - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases > > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would > set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent > to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is > relative smaller than that of `fromLong`. This patch does the conversion > for these cases if `l` is a compile time constant. > > And this conversion also enables further optimizations that recognize > maskAll patterns, see [1]. > > Some JTReg test cases are added to ensure the optimization is effective. > > I tried many different ways to write a JMH benchmark, but failed. Since > the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific > compile-time constant, the statement will be hoisted out of the loop. > If we don't use a loop, the hotspot will become other instructions, and > no obvious performance change was observed. However, combined with the > optimization of [1], we can observe a performance improvement of about > 7% on both aarch64 and x64. > > The patch was tested on both aarch64 and x64, all of tier1 tier2 and > tier3 tests passed. > > [1] https://github.com/openjdk/jdk/pull/24674 Rest of the patch looks good to me apart from minor changes proposed test/micro/org/openjdk/bench/jdk/incubator/vector/MaskFromLongToLongBenchmark.java line 34: > 32: @Fork(value = 1, jvmArgs = {"--add-modules=jdk.incubator.vector"}) > 33: public class MaskFromLongToLongBenchmark { > 34: private static final int ITERATION = 10000; It will be nice to add a synthetic micro for cast chain transform added along with this patch. following micro shows around 1.5x gains on AVX2 system. import jdk.incubator.vector.*; import java.util.stream.IntStream; public class mask_cast_chain { public static final VectorSpecies FSP = FloatVector.SPECIES_128; public static long micro(float [] src1, float [] src2, int ctr) { long res = 0; for (int i = 0; i < FSP.loopBound(src1.length); i += FSP.length()) { res += FloatVector.fromArray(FSP, src1, i) .compare(VectorOperators.GE, FloatVector.fromArray(FSP, src2, i)) .cast(DoubleVector.SPECIES_256) .cast(FloatVector.SPECIES_128) .toLong(); } return res * ctr; } public static void main(String [] args) { float [] src1 = new float[1024]; float [] src2 = new float[1024]; IntStream.range(0, src1.length).forEach(i -> {src1[i] = (float)i;}); IntStream.range(0, src2.length).forEach(i -> {src2[i] = (float)500;}); long res = 0; for (int i = 0; i < 100000; i++) { res += micro(src1, src2, i); } long t1 = System.currentTimeMillis(); for (int i = 0; i < 100000; i++) { res += micro(src1, src2, i); } long t2 = System.currentTimeMillis(); System.out.println("[time] " + (t2 - t1) + "ms" + " [res] " + res); } } ------------- PR Review: https://git.openjdk.org/jdk/pull/25793#pullrequestreview-3037791349 PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2218999865 From mhaessig at openjdk.org Mon Jul 21 12:14:40 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 21 Jul 2025 12:14:40 GMT Subject: RFR: 8357694: RISC-V: Several IR verification tests fail when vlen=128 [v2] In-Reply-To: <__-vaE1gtG6vsSH8CWGTh4HPOy267edyiA3viq-YtWc=.163ece35-ae6a-4fec-9d7a-81dcbb72fa38@github.com> References: <__-vaE1gtG6vsSH8CWGTh4HPOy267edyiA3viq-YtWc=.163ece35-ae6a-4fec-9d7a-81dcbb72fa38@github.com> Message-ID: On Mon, 21 Jul 2025 05:01:45 GMT, Dingli Zhang wrote: >> Hi, >> Can you help to review this patch? 
Thanks! >> >> These tests failed because with a vlen of 128, these tests generate vectors containing only two elements. However, ??2-element reductions for INT/LONG are not profitable??, so the compiler won't generate the corresponding reductions IR. >> >> We can refer to here: >> https://github.com/openjdk/jdk/blob/441dbde2c3c915ffd916e39a5b4a91df5620d7f3/src/hotspot/share/opto/superword.cpp#L1606-L1633 >> >> According to the explanation above, when I use `-XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=2`, these cases passed with vlen=128. >> >> We can fix this problem by adding the restriction of `MaxVectorSize` greater than or equal to 32 (256 bits) to these test cases. >> >> ## Test (fastdebug) >> ### Test on k1 and qemu (w/ RVV, vlen=128) >> - compiler/vectorization/runner/LoopReductionOpTest.java >> - compiler/c2/irTests/TestIfMinMax.java >> - compiler/loopopts/superword/RedTest_long.java >> - compiler/loopopts/superword/SumRed_Long.java >> - compiler/loopopts/superword/TestGeneralizedReductions.java >> - compiler/loopopts/superword/TestUnorderedReductionPartialVectorization.java > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Apply fix only on RISC-V Testing passed. ------------- Marked as reviewed by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/26408#pullrequestreview-3037810228 From mhaessig at openjdk.org Mon Jul 21 12:52:42 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 21 Jul 2025 12:52:42 GMT Subject: RFR: 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts [v3] In-Reply-To: References: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> Message-ID: <5W0bw4i-UXhewzb599af1F95TD3JIBUvmGwRGuMu6BI=.5d99ac61-d0e6-41b9-89b1-c479033e21e8@github.com> On Mon, 21 Jul 2025 09:04:38 GMT, David Briemann wrote: >> Add the TimeoutFactor property to the CompileFramework to avoid timeouts on different systems. > > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > make timeout factor scale linearly Testing passed. ------------- Marked as reviewed by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/26388#pullrequestreview-3037962096 From snatarajan at openjdk.org Mon Jul 21 13:12:56 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Mon, 21 Jul 2025 13:12:56 GMT Subject: RFR: 8358696: Assert with extreme values for -XX:BciProfileWidth Message-ID: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> **Issue** Extreme values for BciProfileWidth flag such as `java -XX:BciProfileWidth=-1 -version` and `java -XX:BciProfileWidth=100000 -version `results in assert failure `assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. This is observed in a x86 machine. **Analysis** On debugging the issue, I found that increasing the size of the interpreter using the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` prevented the above mentioned assert from failing for large values of BciProfileWidth. **Proposal** Considering the fact that larger BciProfileWidth results in slower profiling, I have proposed a range between 0 to 5000 to restrict the value for BciProfileWidth for x86 machines. 
This maximum value is based on modifying the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` using the smallest `InterpreterCodeSize` for all the architectures. As for the lower bound, a value of -1 would be the same as 0, as this simply means no return bci's will be recorded in ret profile. **Issue in AArch64** Additionally running the command `java -XX:BciProfileWidth= 10000 -version` (or larger values) results in a different failure `assert(offset_ok_for_immed(offset(), size)) failed: must be, was: 32768, 3` on an AArch64 machine.This is an issue of maximum offset for `ldr/str` in AArch64 which can be fixed using `form_address` as mentioned in [JDK-8342736](https://bugs.openjdk.org/browse/JDK-8342736). In my preliminary fix using `form_address` on AArch64 machine. I had to modify 3 `ldr` and 1 `str` instruction (in file `src/hotspot/cpu/aarch64/interp_masm_aarch64.cpp` at line number 926, 983, and 997). With this fix using `form_address`, `BciProfileWidth` works for maximum of 5000 after which it crashes with`assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. Without this fix `BciProfileWidth` works for a maximum value of 1300. Currently, I have suggested to restrict the upper bound on AArch64 to 1000 instead of fixing it with `form_address`. **Question to reviewers** Do you think this is a reasonable fix ? For AArch64 do you suggest fixing using `form_address` ? If yes, do I fix it under this PR or create another one ? ------------- Commit messages: - Fix for AArch64 - Modified the upper bound - initial commit Changes: https://git.openjdk.org/jdk/pull/26139/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26139&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8358696 Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26139.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26139/head:pull/26139 PR: https://git.openjdk.org/jdk/pull/26139 From dzhang at openjdk.org Mon Jul 21 13:33:50 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 21 Jul 2025 13:33:50 GMT Subject: RFR: 8357694: RISC-V: Several IR verification tests fail when vlen=128 [v2] In-Reply-To: <__-vaE1gtG6vsSH8CWGTh4HPOy267edyiA3viq-YtWc=.163ece35-ae6a-4fec-9d7a-81dcbb72fa38@github.com> References: <__-vaE1gtG6vsSH8CWGTh4HPOy267edyiA3viq-YtWc=.163ece35-ae6a-4fec-9d7a-81dcbb72fa38@github.com> Message-ID: On Mon, 21 Jul 2025 05:01:45 GMT, Dingli Zhang wrote: >> Hi, >> Can you help to review this patch? Thanks! >> >> These tests failed because with a vlen of 128, these tests generate vectors containing only two elements. However, ??2-element reductions for INT/LONG are not profitable??, so the compiler won't generate the corresponding reductions IR. >> >> We can refer to here: >> https://github.com/openjdk/jdk/blob/441dbde2c3c915ffd916e39a5b4a91df5620d7f3/src/hotspot/share/opto/superword.cpp#L1606-L1633 >> >> According to the explanation above, when I use `-XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=2`, these cases passed with vlen=128. >> >> We can fix this problem by adding the restriction of `MaxVectorSize` greater than or equal to 32 (256 bits) to these test cases. 
>> >> ## Test (fastdebug) >> ### Test on k1 and qemu (w/ RVV, vlen=128) >> - compiler/vectorization/runner/LoopReductionOpTest.java >> - compiler/c2/irTests/TestIfMinMax.java >> - compiler/loopopts/superword/RedTest_long.java >> - compiler/loopopts/superword/SumRed_Long.java >> - compiler/loopopts/superword/TestGeneralizedReductions.java >> - compiler/loopopts/superword/TestUnorderedReductionPartialVectorization.java > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Apply fix only on RISC-V Thanks all for the review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26408#issuecomment-3096802885 From duke at openjdk.org Mon Jul 21 13:33:50 2025 From: duke at openjdk.org (duke) Date: Mon, 21 Jul 2025 13:33:50 GMT Subject: RFR: 8357694: RISC-V: Several IR verification tests fail when vlen=128 [v2] In-Reply-To: <__-vaE1gtG6vsSH8CWGTh4HPOy267edyiA3viq-YtWc=.163ece35-ae6a-4fec-9d7a-81dcbb72fa38@github.com> References: <__-vaE1gtG6vsSH8CWGTh4HPOy267edyiA3viq-YtWc=.163ece35-ae6a-4fec-9d7a-81dcbb72fa38@github.com> Message-ID: On Mon, 21 Jul 2025 05:01:45 GMT, Dingli Zhang wrote: >> Hi, >> Can you help to review this patch? Thanks! >> >> These tests failed because with a vlen of 128, these tests generate vectors containing only two elements. However, ??2-element reductions for INT/LONG are not profitable??, so the compiler won't generate the corresponding reductions IR. >> >> We can refer to here: >> https://github.com/openjdk/jdk/blob/441dbde2c3c915ffd916e39a5b4a91df5620d7f3/src/hotspot/share/opto/superword.cpp#L1606-L1633 >> >> According to the explanation above, when I use `-XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=2`, these cases passed with vlen=128. >> >> We can fix this problem by adding the restriction of `MaxVectorSize` greater than or equal to 32 (256 bits) to these test cases. >> >> ## Test (fastdebug) >> ### Test on k1 and qemu (w/ RVV, vlen=128) >> - compiler/vectorization/runner/LoopReductionOpTest.java >> - compiler/c2/irTests/TestIfMinMax.java >> - compiler/loopopts/superword/RedTest_long.java >> - compiler/loopopts/superword/SumRed_Long.java >> - compiler/loopopts/superword/TestGeneralizedReductions.java >> - compiler/loopopts/superword/TestUnorderedReductionPartialVectorization.java > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Apply fix only on RISC-V @DingliZhang Your change (at version 43815f2f88de706cf3badb524147c8ce278182a2) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26408#issuecomment-3096804806 From dzhang at openjdk.org Mon Jul 21 13:38:49 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 21 Jul 2025 13:38:49 GMT Subject: Integrated: 8357694: RISC-V: Several IR verification tests fail when vlen=128 In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 02:30:03 GMT, Dingli Zhang wrote: > Hi, > Can you help to review this patch? Thanks! > > These tests failed because with a vlen of 128, these tests generate vectors containing only two elements. However, ??2-element reductions for INT/LONG are not profitable??, so the compiler won't generate the corresponding reductions IR. 
> > We can refer to here: > https://github.com/openjdk/jdk/blob/441dbde2c3c915ffd916e39a5b4a91df5620d7f3/src/hotspot/share/opto/superword.cpp#L1606-L1633 > > According to the explanation above, when I use `-XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=2`, these cases passed with vlen=128. > > We can fix this problem by adding the restriction of `MaxVectorSize` greater than or equal to 32 (256 bits) to these test cases. > > ## Test (fastdebug) > ### Test on k1 and qemu (w/ RVV, vlen=128) > - compiler/vectorization/runner/LoopReductionOpTest.java > - compiler/c2/irTests/TestIfMinMax.java > - compiler/loopopts/superword/RedTest_long.java > - compiler/loopopts/superword/SumRed_Long.java > - compiler/loopopts/superword/TestGeneralizedReductions.java > - compiler/loopopts/superword/TestUnorderedReductionPartialVectorization.java This pull request has now been integrated. Changeset: 15b5b54a Author: Dingli Zhang Committer: Hamlin Li URL: https://git.openjdk.org/jdk/commit/15b5b54ac707ba0d4e473fd6eb02c38a8efe705c Stats: 49 lines in 6 files changed: 36 ins; 0 del; 13 mod 8357694: RISC-V: Several IR verification tests fail when vlen=128 Reviewed-by: mhaessig, fyang, mli ------------- PR: https://git.openjdk.org/jdk/pull/26408 From fjiang at openjdk.org Mon Jul 21 14:41:43 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Mon, 21 Jul 2025 14:41:43 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v5] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 08:26:34 GMT, Galder Zamarre?o wrote: >> Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains eight additional commits since the last revision: >> >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - also keep overlapping flag >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - Revert RISCV Macro modification >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses >> - riscv: fix c1 primitive array clone intrinsic regression > > src/hotspot/cpu/riscv/c1_LIRGenerator_riscv.cpp line 775: > >> 773: arraycopy_helper(x, &flags, &expected_type); >> 774: if (x->check_flag(Instruction::OmitChecksFlag)) { >> 775: flags = (flags & (LIR_OpArrayCopy::unaligned | LIR_OpArrayCopy::overlapping)); > > The changes in the two files need to be in synch, so I wonder if `LIR_OpArrayCopy::unaligned | LIR_OpArrayCopy::overlapping` could be abstracted away within a function in `LIR_OpArrayCopy`. > > So something like this (apologies for any syntactic/semantic errors): > > > flags = (flags & LIR_OpArrayGopy::get_array_copy_flags()); > > > Then on the other method something like: > > > ((flags & ~(LIR_OpArrayGopy::get_array_copy_flags())) == 0) > > > Function name is just an example, feel free to suggest some other if you think it fits better. > > Thoughts? Adding new flag check routines seems like a good idea, but it's a bit challenging to choose a name, as there are too many flags for `LIR_OPArrayCopy`[1]. Perhaps something like `should_check_unaligned_or_overlapping` would be suitable? 1. 
https://github.com/openjdk/jdk/blob/15b5b54ac707ba0d4e473fd6eb02c38a8efe705c/src/hotspot/share/c1/c1_LIR.hpp#L1257-L1271 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2219422716 From mbaesken at openjdk.org Mon Jul 21 14:49:40 2025 From: mbaesken at openjdk.org (Matthias Baesken) Date: Mon, 21 Jul 2025 14:49:40 GMT Subject: RFR: 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts [v3] In-Reply-To: References: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> Message-ID: On Mon, 21 Jul 2025 09:04:38 GMT, David Briemann wrote: >> Add the TimeoutFactor property to the CompileFramework to avoid timeouts on different systems. > > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > make timeout factor scale linearly Marked as reviewed by mbaesken (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26388#pullrequestreview-3038485564 From eastigeevich at openjdk.org Mon Jul 21 15:02:26 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Mon, 21 Jul 2025 15:02:26 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v38] In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 16:19:51 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This does not modify existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: > > Require caller to hold locks lgtm ------------- Marked as reviewed by eastigeevich (Committer). PR Review: https://git.openjdk.org/jdk/pull/23573#pullrequestreview-3038528227 From dbriemann at openjdk.org Mon Jul 21 15:29:31 2025 From: dbriemann at openjdk.org (David Briemann) Date: Mon, 21 Jul 2025 15:29:31 GMT Subject: RFR: 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts [v3] In-Reply-To: References: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> Message-ID: On Mon, 21 Jul 2025 09:04:38 GMT, David Briemann wrote: >> Add the TimeoutFactor property to the CompileFramework to avoid timeouts on different systems. > > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > make timeout factor scale linearly Thanks for your reviews! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26388#issuecomment-3097243336 From duke at openjdk.org Mon Jul 21 15:35:32 2025 From: duke at openjdk.org (duke) Date: Mon, 21 Jul 2025 15:35:32 GMT Subject: RFR: 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts [v3] In-Reply-To: References: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> Message-ID: On Mon, 21 Jul 2025 09:04:38 GMT, David Briemann wrote: >> Add the TimeoutFactor property to the CompileFramework to avoid timeouts on different systems. > > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > make timeout factor scale linearly @dbriemann Your change (at version 074bbca3256e00230ed2829e909d8dffc3076f50) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26388#issuecomment-3097264481 From kvn at openjdk.org Mon Jul 21 15:45:50 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 21 Jul 2025 15:45:50 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v24] In-Reply-To: References: Message-ID: On Tue, 3 Jun 2025 00:32:26 GMT, Vladimir Kozlov wrote: >> Thanks for pointing out the missing JVMTI event publication. I?m currently looking into what?s required to address that, along with JFR event publication that may also have been missed. I?d appreciate hearing others? thoughts on how critical this is: should we treat it as a blocker for integration, or would it be acceptable to follow up with a separate issue? >> >> We?re hoping to get this into JDK 25, as it would simplify both development and backporting of features related to hot code grouping. That said, if the consensus is that JVMTI/JFR support is essential upfront, this can be delayed until JDK 26. > >> We?re hoping to get this into JDK 25, as it would simplify both development and backporting of features related to hot code grouping. That said, if the consensus is that JVMTI/JFR support is essential upfront, this can be delayed until JDK 26. > > I don't think this can be put into JDK 25. Too late and changes are not simple. And yes, JVMTI/JFR support is essential - you have to support all public functionalities of VM. > @vnkozlov When you get a chance, would you mind taking another look at this PR? @chadrako I promise to look soon but currently I am busy with Leyden before JVMLS. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23573#issuecomment-3097295698 From jbhateja at openjdk.org Mon Jul 21 15:48:28 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 21 Jul 2025 15:48:28 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: References: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> Message-ID: On Mon, 14 Jul 2025 17:27:35 GMT, Srinivas Vamsi Parasa wrote: >> src/hotspot/cpu/x86/gc/z/zBarrierSetAssembler_x86.cpp line 114: >> >>> 112: __ paired_push(rax); >>> 113: } >>> 114: __ paired_push(rcx); >> >> Hi @vamsi-parasa , for consecutive push/pop we can use push2/pop2 and 16byte alignment can be guaranteed using following technique >> https://github.com/openjdk/jdk/pull/25351/files#diff-d5d721ebf93346ba66e81257e4f6c5e6268d59774313c61e97353c0dfbf686a5R94 > > Hi Jatin (@jatin-bhateja), for the first iteration, would it be ok to get the push_paired/pop_paired changes integrated and then make the push2p/pop2p related optimizations in a separate PR? > > Thanks, > Vamsi Hi @vamsi-parasa , I think it's ok not to expose pop_ppx / push_ppx as separate interfaces, and let processor forward the values b/w push and matching pop if balancing constraints are satisfied. image ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2219586716 From clanger at openjdk.org Mon Jul 21 15:48:38 2025 From: clanger at openjdk.org (Christoph Langer) Date: Mon, 21 Jul 2025 15:48:38 GMT Subject: RFR: 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts [v3] In-Reply-To: References: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> Message-ID: On Mon, 21 Jul 2025 09:04:38 GMT, David Briemann wrote: >> Add the TimeoutFactor property to the CompileFramework to avoid timeouts on different systems. > > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > make timeout factor scale linearly Marked as reviewed by clanger (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26388#pullrequestreview-3038708343 From dbriemann at openjdk.org Mon Jul 21 15:51:42 2025 From: dbriemann at openjdk.org (David Briemann) Date: Mon, 21 Jul 2025 15:51:42 GMT Subject: Integrated: 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts In-Reply-To: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> References: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> Message-ID: On Fri, 18 Jul 2025 13:16:41 GMT, David Briemann wrote: > Add the TimeoutFactor property to the CompileFramework to avoid timeouts on different systems. This pull request has now been integrated. 
Changeset: f8c8bcf4 Author: David Briemann Committer: Christoph Langer URL: https://git.openjdk.org/jdk/commit/f8c8bcf4fd31509fdb40d32e8e16ba4fba1f987d Stats: 4 lines in 1 file changed: 2 ins; 0 del; 2 mod 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts Reviewed-by: mhaessig, mbaesken, clanger ------------- PR: https://git.openjdk.org/jdk/pull/26388 From kxu at openjdk.org Mon Jul 21 16:49:56 2025 From: kxu at openjdk.org (Kangcheng Xu) Date: Mon, 21 Jul 2025 16:49:56 GMT Subject: RFR: 8347555: [REDO] C2: implement optimization for series of Add of unique value [v17] In-Reply-To: References: Message-ID: > [JDK-8347555](https://bugs.openjdk.org/browse/JDK-8347555) is a redo of [JDK-8325495](https://bugs.openjdk.org/browse/JDK-8325495) was [first merged](https://git.openjdk.org/jdk/pull/20754) then backed out due to a regression. This patch redos the feature and fixes the bit shift overflow problem. For more information please refer to the previous PR. > > When constanlizing multiplications (possibly in forms on `lshifts`), the multiplier is upgraded to long and then later narrowed to int if needed. However, when a `lshift` operand is exactly `32`, overflowing an int, using long has an unexpected result. (i.e., `(1 << 32) = 1` and `(int) (1L << 32) = 0`) > > The following was implemented to address this issue. > > if (UseNewCode2) { > *multiplier = bt == T_INT > ? (jlong) (1 << con->get_int()) // loss of precision is expected for int as it overflows > : ((jlong) 1) << con->get_int(); > } else { > *multiplier = ((jlong) 1 << con->get_int()); > } > > > Two new bitshift overflow tests were added. Kangcheng Xu has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 66 commits: - Merge remote-tracking branch 'origin/master' into arithmetic-canonicalization - Allow swapping LHS/RHS in case not matched - Merge branch 'refs/heads/master' into arithmetic-canonicalization - improve comment readability and struct helper functions - remove asserts, add more documentation - fix typo: lhs->rhs - update comments - use java_add to avoid cpp overflow UB - add assertion for MulLNode too - include simple addition as a case of power of two additions - ... and 56 more: https://git.openjdk.org/jdk/compare/f8c8bcf4...1f6f2bc6 ------------- Changes: https://git.openjdk.org/jdk/pull/23506/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23506&range=16 Stats: 849 lines in 6 files changed: 848 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23506.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23506/head:pull/23506 PR: https://git.openjdk.org/jdk/pull/23506 From kxu at openjdk.org Mon Jul 21 17:22:40 2025 From: kxu at openjdk.org (Kangcheng Xu) Date: Mon, 21 Jul 2025 17:22:40 GMT Subject: RFR: 8353290: C2: Refactor PhaseIdealLoop::is_counted_loop() [v7] In-Reply-To: References: Message-ID: > This PR refactors `PhaseIdealLoop::is_counted_loop()` into (mostly) `CountedLoopConverter::is_counted_loop()` and `CountedLoopConverter::convert()` to decouple the detection and conversion code. This enables us to try different loop configurations easily and finally convert once a counted loop is found. > > A nested `PhaseIdealLoop::CountedLoopConverter` class is created to handle the context, but I'm not if this is the best name or place for it. Please let me know what you think. > > Blocks [JDK-8336759](https://bugs.openjdk.org/browse/JDK-8336759). 
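A rough usage sketch of the decoupled detect/convert flow described above; only the class and method names come from the PR text, the constructor arguments and call site are assumptions for illustration:

    // Hypothetical call site inside PhaseIdealLoop:
    CountedLoopConverter converter(this, loop);
    if (converter.is_counted_loop()) {  // detection only, no graph mutation yet
      converter.convert();              // commit the conversion once a configuration matches
    }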
Kangcheng Xu has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 23 commits: - Merge remote-tracking branch 'origin/master' into counted-loop-refactor # Conflicts: # src/hotspot/share/opto/loopnode.cpp # src/hotspot/share/opto/loopnode.hpp - Merge branch 'master' into counted-loop-refactor # Conflicts: # src/hotspot/share/opto/loopnode.cpp # src/hotspot/share/opto/loopnode.hpp # src/hotspot/share/opto/loopopts.cpp - Merge remote-tracking branch 'origin/master' into counted-loop-refactor - further refactor is_counted_loop() by extracting functions - WIP: refactor is_counted_loop() - WIP: refactor is_counted_loop() - WIP: review followups - reviewer suggested changes - line break - remove TODOs - ... and 13 more: https://git.openjdk.org/jdk/compare/f8c8bcf4...345c6ccc ------------- Changes: https://git.openjdk.org/jdk/pull/24458/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24458&range=06 Stats: 926 lines in 3 files changed: 422 ins; 210 del; 294 mod Patch: https://git.openjdk.org/jdk/pull/24458.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24458/head:pull/24458 PR: https://git.openjdk.org/jdk/pull/24458 From sparasa at openjdk.org Mon Jul 21 17:25:26 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Mon, 21 Jul 2025 17:25:26 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: References: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> Message-ID: On Mon, 21 Jul 2025 15:44:47 GMT, Jatin Bhateja wrote: >> Hi Jatin (@jatin-bhateja), for the first iteration, would it be ok to get the push_paired/pop_paired changes integrated and then make the push2p/pop2p related optimizations in a separate PR? >> >> Thanks, >> Vamsi > > Hi @vamsi-parasa , I think it's ok not to expose pop_ppx / push_ppx as separate interfaces, and let processor forward the values b/w push and matching pop if balancing constraints are satisfied. > > image Hi Jatin (@jatin-bhateja), the reason to make the push_ppx/pop_ppx usage explicit is because an unbalanced push_ppx operation has a performance penalty. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2219810277 From sparasa at openjdk.org Mon Jul 21 17:34:37 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Mon, 21 Jul 2025 17:34:37 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: References: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> Message-ID: On Mon, 21 Jul 2025 17:23:15 GMT, Srinivas Vamsi Parasa wrote: >> Hi @vamsi-parasa , I think it's ok not to expose pop_ppx / push_ppx as separate interfaces, and let processor forward the values b/w push and matching pop if balancing constraints are satisfied. >> >> image > > Hi Jatin (@jatin-bhateja), the reason to make the push_ppx/pop_ppx usage explicit is because an unbalanced push_ppx operation has a performance penalty. > Please create a new RFE for its tracking. Hi Jatin(@jatin-bhateja) , please see the JBS issue (https://bugs.openjdk.org/browse/JDK-8362903) for push2/pop2 enabling in future. 
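To make the balancing constraint concrete, a hypothetical sketch of keeping PPX-hinted pushes and pops balanced and LIFO-ordered; the helper names are taken from this thread, but the signatures and the surrounding function are assumptions, not code from the patch:

    #define __ masm->
    // Every push_ppx must be matched by a pop_ppx of the same register in
    // reverse order; an unmatched PPX-hinted push defeats the hardware
    // push/pop pairing and costs performance, hence the explicit helpers.
    static void spill_across_call(MacroAssembler* masm) {
      __ push_ppx(rax);
      __ push_ppx(rcx);
      // ... code that may clobber rax/rcx, e.g. a runtime call ...
      __ pop_ppx(rcx);
      __ pop_ppx(rax);
    }
    #undef __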
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2219828160 From dlong at openjdk.org Mon Jul 21 20:03:01 2025 From: dlong at openjdk.org (Dean Long) Date: Mon, 21 Jul 2025 20:03:01 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v5] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 11:50:33 GMT, Feilong Jiang wrote: >> Hi, please consider. >> [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. >> The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. >> If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. >> This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. >> We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. >> >> This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. >> The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. >> >> Test on linux-riscv64: >> - [x] Tier1-3 >> >> JMH data on P550 SBC for reference (w/o and w/ the patch): >> >> Before: >> >> Without COH: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op >> ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op >> ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op >> ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op >> ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op >> ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op >> ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op >> ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op >> ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op >> ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op >> ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op >> ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op >> >> ------------------------------------------------------------------------- >> With COH: >> >> Benchmark (size) Mode Cnt Score Error Un... > > Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains eight additional commits since the last revision: > > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - also keep overlapping flag > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - Revert RISCV Macro modification > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses > - riscv: fix c1 primitive array clone intrinsic regression Testing at Oracle passed. ------------- Marked as reviewed by dlong (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25976#pullrequestreview-3039585537 From kvn at openjdk.org Mon Jul 21 20:24:33 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 21 Jul 2025 20:24:33 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v15] In-Reply-To: References: Message-ID: <6XRbYO3L5zyEIH2MuKSntl2IdAG2zgNVNm_0PakEI9g=.40266744-d0d0-421f-b5ff-17fd0117b693@github.com> On Wed, 16 Jul 2025 10:43:34 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. >> >> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. >> >> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Refine lower bound computation I am not sure about JDK 25 approval for these changes. Can you do simple fix for JDK 25 as @merykitty suggested: "I suggest removing all the logic and simply returning the bottom type" ? Will it be the same complexity? Will it affect performance (and how much)? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3098578332 From vlivanov at openjdk.org Mon Jul 21 21:15:31 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Mon, 21 Jul 2025 21:15:31 GMT Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v5] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 12:36:31 GMT, Marc Chevalier wrote: >> A first part toward a better support of pure functions, but this time, with guidance from @iwanowww. >> >> ## Pure Functions >> >> Pure functions (considered here) are functions that have no side effects, no effect on the control flow (no exception or such), cannot deopt etc.. It's really a function that you can execute anywhere, with whichever arguments without effect other than wasting time. Integer division is not pure as dividing by zero is throwing. But many floating point functions will just return `NaN` or `+/-infinity` in problematic cases. >> >> ## Scope >> >> We are not going all powerful for now! It's mostly about identifying some pure functions and being able to remove them if the result is unused. 
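For illustration (an editorial sketch, not part of the quoted description): the kind of Java code this targets is a call to a math routine whose result is dead. Whether a particular call is compiled as a removable pure runtime call depends on which intrinsics C2 models as pure; Math.cbrt is used here only as an assumed example.

    // Editorial sketch: 'unused' is computed but never read. If the runtime
    // call behind Math.cbrt is modeled as pure (no side effects, cannot
    // deoptimize), the compiler is free to drop the call entirely.
    public class UnusedPureCall {
        static double f(double x) {
            double unused = Math.cbrt(x); // dead result
            return x * 2.0;               // only this value escapes
        }

        public static void main(String[] args) {
            System.out.println(f(8.0));
        }
    }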
Some other things are not part of this PR, on purpose. Especially, this PR doesn't propose a way to move pure calls around. The reason is that pure calls are later expanded into regular calls, which require a control input. To be able to do the expansion, we just keep the control in the pure call as well. >> >> ## Implementation Overview >> >> We created here some new node kind for pure calls, inheriting leaf calls, that are expanded into regular leaf calls during final graph reshaping. The possibility to support pure call directly in AD file is left open. >> >> This PR also introduces `TupleNode` (largely based on an original idea/implem of @iwanowww), that just tie multiple input together and play well with `ProjNode`: the n-th projection of a `TupleNode` is the n-th input of the tuple. This is a convenient way to skip and remove nodes from the graph while delegating the difficulty of the surgery to the trusted IGVN's implementation. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Tentative to address Tobias' comments Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25760#pullrequestreview-3039873221 From kvn at openjdk.org Mon Jul 21 23:16:56 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 21 Jul 2025 23:16:56 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v38] In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 16:19:51 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This does not modify existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: > > Require caller to hold locks I have few comments. And I did not look on tests. Did you check `src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/NMethod.java` since you added Nmethod reference counter? src/hotspot/share/code/codeBehaviours.cpp line 46: > 44: bool DefaultICProtectionBehaviour::is_safe(nmethod* method) { > 45: return SafepointSynchronize::is_at_safepoint() || CompiledIC_lock->owned_by_self() || method->is_not_installed(); > 46: } Can you rename `method` to `nm` as we call it in similar code in GCs? src/hotspot/share/code/nmethod.cpp line 1164: > 1162: #endif > 1163: + align_up(debug_info->data_size() , oopSize) > 1164: + align_up(ImmutableDataReferencesCounterSize , oopSize); Why you need to realign this? There is no requirement to have spaces before `,` src/hotspot/share/code/nmethod.cpp line 1630: > 1628: if (!is_java_method()) { > 1629: return false; > 1630: } This should be first check. 
src/hotspot/share/code/nmethod.cpp line 2453: > 2451: // Free memory if this is the last nmethod referencing immutable data > 2452: if (get_immutable_data_references_counter() == 1) { > 2453: os::free(_immutable_data); You should add assert(get_immutable_data_references_counter() > 0) before `if (counter == 1)` and zero it when freed. ------------- PR Review: https://git.openjdk.org/jdk/pull/23573#pullrequestreview-3040057036 PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2220488851 PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2220501624 PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2220527842 PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2220532609 From kvn at openjdk.org Mon Jul 21 23:16:56 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 21 Jul 2025 23:16:56 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v38] In-Reply-To: References: Message-ID: On Wed, 12 Mar 2025 18:44:39 GMT, Vladimir Kozlov wrote: >> Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: >> >> Require caller to hold locks > > src/hotspot/share/code/nmethod.cpp line 1406: > >> 1404: _oop_maps = nm.oop_maps()->clone(); >> 1405: } >> 1406: _relocation_size = nm._relocation_size; > > Did you consider to use `memcpy()` and update only changed fields? You did not answer my question about using `memcpy()`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2220503348 From duke at openjdk.org Mon Jul 21 23:41:35 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Mon, 21 Jul 2025 23:41:35 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v38] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 22:39:27 GMT, Vladimir Kozlov wrote: >> Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: >> >> Require caller to hold locks > > src/hotspot/share/code/nmethod.cpp line 1164: > >> 1162: #endif >> 1163: + align_up(debug_info->data_size() , oopSize) >> 1164: + align_up(ImmutableDataReferencesCounterSize , oopSize); > > Why you need to realign this? There is no requirement to have spaces before `,` It was aligned this way before my change. The addition of `ImmutableDataReferencesCounterSize` requires more spaces to keep it the same. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2220580694 From duke at openjdk.org Mon Jul 21 23:51:36 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Mon, 21 Jul 2025 23:51:36 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v38] In-Reply-To: References: Message-ID: <8heatt0AXFDKj3NiHyA0RjjrNaOOMxGPFEwPuEuweSg=.44cef183-4e4f-4a1b-8add-046ce8e874c7@github.com> On Mon, 21 Jul 2025 22:40:32 GMT, Vladimir Kozlov wrote: >> src/hotspot/share/code/nmethod.cpp line 1406: >> >>> 1404: _oop_maps = nm.oop_maps()->clone(); >>> 1405: } >>> 1406: _relocation_size = nm._relocation_size; >> >> Did you consider to use `memcpy()` and update only changed fields? > > You did not answer my question about using `memcpy()`. I had changed the implementation to use `memcpy()` instead but as @fisk pointed out in https://github.com/openjdk/jdk/pull/23573#issuecomment-2797120660 it was easier to copy a value unintentionally so I reverted that change > >> > > I'm worried about copying the nmethod epoch counters >> > >> > We should clear them.
If not, it is a bug. > > I'd like to change copying from opt-out to opt-in instead; that would make me feel more comfortable. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2220591570 From kvn at openjdk.org Mon Jul 21 23:59:34 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 21 Jul 2025 23:59:34 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v38] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 23:38:27 GMT, Chad Rakoczy wrote: >> src/hotspot/share/code/nmethod.cpp line 1164: >> >>> 1162: #endif >>> 1163: + align_up(debug_info->data_size() , oopSize) >>> 1164: + align_up(ImmutableDataReferencesCounterSize , oopSize); >> >> Why you need to realign this? There is no requirement to have spaces before `,` > > It was aligned this way before my change. The addition of `ImmutableDataReferencesCounterSize` requires more spaces to keep it the same. No it does not - you have a lot of unneeded spaces after `ImmutableDataReferencesCounterSize` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2220600736 From kvn at openjdk.org Mon Jul 21 23:59:35 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 21 Jul 2025 23:59:35 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v38] In-Reply-To: <8heatt0AXFDKj3NiHyA0RjjrNaOOMxGPFEwPuEuweSg=.44cef183-4e4f-4a1b-8add-046ce8e874c7@github.com> References: <8heatt0AXFDKj3NiHyA0RjjrNaOOMxGPFEwPuEuweSg=.44cef183-4e4f-4a1b-8add-046ce8e874c7@github.com> Message-ID: <_gUMP-cUWsM_DfTzxl7j0sza2lrS4iEiwQ2_yoopffE=.fbdd0659-3da9-4d16-9d53-8dea0808ccba@github.com> On Mon, 21 Jul 2025 23:49:14 GMT, Chad Rakoczy wrote: >> You did not answered my question adout using `memcpy > > I had changed the implementation to use `memcpy()` instead but as @fisk pointed out in https://github.com/openjdk/jdk/pull/23573#issuecomment-2797120660 it was easier to accidentally copy a value unintentionally so I reverted that change > >> > > I'm worried about copying the nmethod epoch counters >> > >> > We should clear them. If not, it is a bug. >> >> I'd like to change copying from opt-out to opt-in instead; that would make me feel more comfortable. Okay. We may look later to optimize it. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2220602281 From duke at openjdk.org Tue Jul 22 00:05:38 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Tue, 22 Jul 2025 00:05:38 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v38] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 23:55:26 GMT, Vladimir Kozlov wrote: >> It was aligned this way before my change. The addition of `ImmutableDataReferencesCounterSize` requires more spaces to keep it the same. > > No it does not - you have a lot of unneeded spaces after `ImmutableDataReferencesCounterSize` Oh sorry I see what you mean. I was aligning it to the nearest tab. I'll delete the extra spaces ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2220608957 From duke at openjdk.org Tue Jul 22 00:56:36 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Tue, 22 Jul 2025 00:56:36 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v38] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 23:13:28 GMT, Vladimir Kozlov wrote: > Did you check `src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/NMethod.java` since you added Nmethod reference counter? 
No I didn't, good catch. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23573#issuecomment-3100183774 From duke at openjdk.org Tue Jul 22 01:05:53 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Tue, 22 Jul 2025 01:05:53 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v39] In-Reply-To: References: Message-ID: > This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). > > When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. > > This does not modify existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality. > > Additional Testing: > - [ ] Linux x64 fastdebug all > - [ ] Linux aarch64 fastdebug all > - [ ] ... Chad Rakoczy has updated the pull request incrementally with five additional commits since the last revision: - Fix spacing - Update NMethod.java with immutable data changes - Rename method to nm - Add assert before freeing immutable data - Reorder is_relocatable checks ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23573/files - new: https://git.openjdk.org/jdk/pull/23573/files/1dcf47e4..1b001df8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=38 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=37-38 Stats: 62 lines in 6 files changed: 10 ins; 4 del; 48 mod Patch: https://git.openjdk.org/jdk/pull/23573.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23573/head:pull/23573 PR: https://git.openjdk.org/jdk/pull/23573 From fjiang at openjdk.org Tue Jul 22 01:12:32 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Tue, 22 Jul 2025 01:12:32 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v5] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 19:59:34 GMT, Dean Long wrote: > Testing at Oracle passed. Thanks for the testing! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25976#issuecomment-3100212006 From yadongwang at openjdk.org Tue Jul 22 01:26:31 2025 From: yadongwang at openjdk.org (Yadong Wang) Date: Tue, 22 Jul 2025 01:26:31 GMT Subject: Integrated: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: <9JVlb4b9VLY29SI_ZH2UBrbsLnEySIxVKbZ939o-7EE=.8690c011-8971-4c20-b8cc-a5d85e3efdd0@github.com> On Thu, 10 Jul 2025 18:29:14 GMT, Yadong Wang wrote: > The bug is that the predicate rule of immByteMapBase would cause a ConP Node for an oop to be incorrectly matched as byte_map_base when the placeholder JNI handle address happened to be allocated at the address of byte_map_base. > > C2 uses JNI handles as placeholders to encode constant oops, and one such handle may be located at the address of byte_map_base, which is not memory reserved by the CardTable. This is possible because JNIHandleBlocks are allocated by malloc. > > // The assembler store_check code will do an unsigned shift of the oop, > // then add it to _byte_map_base, i.e.
> // > // _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift) > _byte_map = (CardValue*) rs.base(); > _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift); > > In aarch64 port, C2 will incorrectly match ConP for oop to ConP for byte_map_base by the immByteMapBase operand. > > // Card Table Byte Map Base > operand immByteMapBase() > %{ > // Get base of card map > predicate((jbyte*)n->get_ptr() == > ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); > match(ConP); > > op_cost(0); > format %{ %} > interface(CONST_INTER); > %} > > // Load Byte Map Base Constant > instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) > %{ > match(Set dst con); > > ins_cost(INSN_COST); > format %{ "adr $dst, $con\t# Byte Map Base" %} > > ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); > > ins_pipe(ialu_imm); > %} > > As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: > 0xffff25caf08c: ldaxr x8, [x11] > 0xffff25caf090: cmp x10, x8 > 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any > 0xffff25caf098: stlxr w8, x28, [x11] > 0xffff25caf09c: cbnz w8, 0xffff25caf08c > 0xffff25caf0a0: orr x11, xzr, #0x3 > 0xffff25caf0a4: str x11, [x13] > 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none > 0xffff25caf0ac: str x14, [sp] > 0xffff25caf0b0: add x2, sp, #0x20 > 0xffff25caf0b4: adrp x1, 0xffff21730000 > 0xffff25caf0b8: bl 0xffff256fffc0 > 0xffff25caf0bc: ldr x14, [sp] > 0xffff25caf0c0: b 0xffff25caef80 > 0xffff25caf0c4: add x13, sp, #0x20 > 0xffff25caf0c8: adrp x12, 0xffff21730000 > 0xffff25caf0cc: ldr x10, [x13] > 0xffff25caf0d0: cmp x10, xzr > 0xffff25caf0d4: b.eq 0xffff25caf130 // b.none > 0xffff25caf0d8: ldr x11, [x12] > 0xffff25caf0dc: tbnz w10, #1, 0xffff25caf0f... This pull request has now been integrated. Changeset: dccb1782 Author: Yadong Wang URL: https://git.openjdk.org/jdk/commit/dccb1782ec35d1ee95220a237aef29ddfc292cbd Stats: 32 lines in 1 file changed: 0 ins; 32 del; 0 mod 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding Reviewed-by: shade, adinn ------------- PR: https://git.openjdk.org/jdk/pull/26249 From duke at openjdk.org Tue Jul 22 03:22:26 2025 From: duke at openjdk.org (erifan) Date: Tue, 22 Jul 2025 03:22:26 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4] In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 09:09:14 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is >> relative smaller than that of `fromLong`. So this patch does the conversion for these cases. >> >> The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. >> >> Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. >> >> This conversion also enables further optimizations that recognize maskAll patterns, see [1]. 
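For illustration (an editorial sketch, not part of the quoted message): the equivalence described above shown at the Java level. The species and constant below are only example assumptions; the point is that when the constant bits cover all lanes (or none), fromLong reduces to maskAll.

    // Editorial sketch (requires --add-modules jdk.incubator.vector):
    // with every lane bit of the constant set, fromLong is equivalent to
    // maskAll(true); with none set, it is equivalent to maskAll(false).
    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorMask;
    import jdk.incubator.vector.VectorSpecies;

    public class FromLongVsMaskAll {
        static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256; // 8 lanes

        public static void main(String[] args) {
            long allLaneBits = (1L << SPECIES.length()) - 1;
            VectorMask<Integer> m1 = VectorMask.fromLong(SPECIES, allLaneBits);
            VectorMask<Integer> m2 = SPECIES.maskAll(true);
            System.out.println(m1.toLong() == m2.toLong()); // true
        }
    }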
And we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like >> >> VectorMaskToLong (VectorLongToMask x) => x >> >> >> Hence, this patch also added the following optimizations: >> >> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> >> VectorMaskCast (VectorMaskCast x) => x >> >> And we can see noticeable performance improvement with the above optimizations for floating-point types. >> >> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Error After Error Uplift >> microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 >> microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 >> microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 >> microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 >> microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 >> microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 >> microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 >> microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 >> >> >> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: >> >> Benchm... > > erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - Refactor the implementation > > Do the convertion in C2's IGVN phase to cover more cases. > - Merge branch 'master' into JDK-8356760 > - Simplify the test code > - Address some review comments > > Add support for the following patterns: > toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) > toLong(maskAll(false)) => 0 > > And add more test cases. > - Merge branch 'master' into JDK-8356760 > - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases > > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would > set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent > to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is > relative smaller than that of `fromLong`. This patch does the conversion > for these cases if `l` is a compile time constant. > > And this conversion also enables further optimizations that recognize > maskAll patterns, see [1]. > > Some JTReg test cases are added to ensure the optimization is effective. > > I tried many different ways to write a JMH benchmark, but failed. Since > the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific > compile-time constant, the statement will be hoisted out of the loop. > If we don't use a loop, the hotspot will become other instructions, and > no obvious performance change was observed. However, combined with the > optimization of [1], we can observe a performance improvement of about > 7% on both aarch64 and x64. > > The patch was tested on both aarch64 and x64, all of tier1 tier2 and > tier3 tests passed. 
> > [1] https://github.com/openjdk/jdk/pull/24674 Thanks for your review, I'll update the code soon. ------------- PR Review: https://git.openjdk.org/jdk/pull/25793#pullrequestreview-3040675294 From duke at openjdk.org Tue Jul 22 03:22:28 2025 From: duke at openjdk.org (erifan) Date: Tue, 22 Jul 2025 03:22:28 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4] In-Reply-To: References: Message-ID: <-ZNeXOcmEACkhP4QKXKnWWEiT6ucjPY7Zz1HqvMeAoI=.c8fae49e-fcb0-41fb-84d1-4aa52ee83790@github.com> On Mon, 21 Jul 2025 06:41:43 GMT, Jatin Bhateja wrote: >> erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: >> >> - Refactor the implementation >> >> Do the convertion in C2's IGVN phase to cover more cases. >> - Merge branch 'master' into JDK-8356760 >> - Simplify the test code >> - Address some review comments >> >> Add support for the following patterns: >> toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) >> toLong(maskAll(false)) => 0 >> >> And add more test cases. >> - Merge branch 'master' into JDK-8356760 >> - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases >> >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would >> set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent >> to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is >> relative smaller than that of `fromLong`. This patch does the conversion >> for these cases if `l` is a compile time constant. >> >> And this conversion also enables further optimizations that recognize >> maskAll patterns, see [1]. >> >> Some JTReg test cases are added to ensure the optimization is effective. >> >> I tried many different ways to write a JMH benchmark, but failed. Since >> the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific >> compile-time constant, the statement will be hoisted out of the loop. >> If we don't use a loop, the hotspot will become other instructions, and >> no obvious performance change was observed. However, combined with the >> optimization of [1], we can observe a performance improvement of about >> 7% on both aarch64 and x64. >> >> The patch was tested on both aarch64 and x64, all of tier1 tier2 and >> tier3 tests passed. >> >> [1] https://github.com/openjdk/jdk/pull/24674 > > src/hotspot/share/opto/vectorIntrinsics.cpp line 692: > >> 690: // generate a MaskAll or Replicate instead. >> 691: >> 692: // The "maskAll" API uses the corresponding integer types for floating-point data. > > This is because mask all only accepts -1 and 0 values, since -1.0f in float in IEEE 754 format does not set all bits hence an floating point to integral conversion is mandatory here. Good to know this, thanks! > src/hotspot/share/opto/vectornode.cpp line 1520: > >> 1518: uint vlen = vt->length(); >> 1519: BasicType bt = vt->element_basic_type(); >> 1520: int opc = is_mask ? Op_MaskAll : Op_Replicate; > > You can remove this check, since VectorNode::scalar2vector alreday has a match rule for Op_MaskAll Do you mean this check `Matcher::match_rule_supported_vector(opc, vlen, maskall_bt)` ? I think it's necessary ? Because in theory some platforms don't support both `MaskAll` and `Replicate`. Of course, this situation may not exist in reality. 
If `MaskAll` and `Replicate` are not supported, then `VectorLongToMask` should not be supported either, and this function will not be called. > src/hotspot/share/opto/vectornode.cpp line 1532: > >> 1530: } else { >> 1531: con = phase->intcon(con_value); >> 1532: } > > Suggestion: > > phase->makecon(TypeInteger::make(bits_type->get_con(), maskall_bt) This should be: `con = phase->makecon(TypeInteger::make(con_value, maskall_bt == T_LONG ? T_LONG : T_INT));` because `maskall_bt` can be `T_BYTE` or `T_SHORT`. Since we still need to check `maskall_bt`, I tend to the current approach because it has fewer function calls. > src/hotspot/share/opto/vectornode.cpp line 1544: > >> 1542: >> 1543: Node* VectorLoadMaskNode::Ideal(PhaseGVN* phase, bool can_reshape) { >> 1544: // VectorLoadMask(VectorLongToMask(-1/0)) => Replicate(-1/0) > > FTR: This is only useful for non-predicated targets. Since on predicated target VectorLongToMask is not succeeded by VectorLoadMask > https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectorIntrinsics.cpp#L703 Yes, thanks for your clarification. > test/micro/org/openjdk/bench/jdk/incubator/vector/MaskFromLongToLongBenchmark.java line 34: > >> 32: @Fork(value = 1, jvmArgs = {"--add-modules=jdk.incubator.vector"}) >> 33: public class MaskFromLongToLongBenchmark { >> 34: private static final int ITERATION = 10000; > > It will be nice to add a synthetic micro for cast chain transform added along with this patch. following micro shows around 1.5x gains on AVX2 system because of widening cast elision. > > > import jdk.incubator.vector.*; > import java.util.stream.IntStream; > > public class mask_cast_chain { > public static final VectorSpecies FSP = FloatVector.SPECIES_128; > > public static long micro(float [] src1, float [] src2, int ctr) { > long res = 0; > for (int i = 0; i < FSP.loopBound(src1.length); i += FSP.length()) { > res += FloatVector.fromArray(FSP, src1, i) > .compare(VectorOperators.GE, FloatVector.fromArray(FSP, src2, i)) > .cast(DoubleVector.SPECIES_256) > .cast(FloatVector.SPECIES_128) > .toLong(); > } > return res * ctr; > } > > public static void main(String [] args) { > float [] src1 = new float[1024]; > float [] src2 = new float[1024]; > > IntStream.range(0, src1.length).forEach(i -> {src1[i] = (float)i;}); > IntStream.range(0, src2.length).forEach(i -> {src2[i] = (float)500;}); > > long res = 0; > for (int i = 0; i < 100000; i++) { > res += micro(src1, src2, i); > } > long t1 = System.currentTimeMillis(); > for (int i = 0; i < 100000; i++) { > res += micro(src1, src2, i); > } > long t2 = System.currentTimeMillis(); > System.out.println("[time] " + (t2 - t1) + "ms" + " [res] " + res); > } > } Ok~ ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2220925155 PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2220905045 PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2220919188 PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2220924181 PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2220927997 From xgong at openjdk.org Tue Jul 22 06:20:34 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 22 Jul 2025 06:20:34 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 01:23:43 GMT, Xiaohong Gong wrote: >> ### Background >> On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` 
(32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. >> >> For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. >> >> To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. >> >> ### Impact Analysis >> #### 1. Vector types >> Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. >> >> #### 2. Vector API >> No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. >> >> #### 3. Auto-vectorization >> Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. >> >> #### 4. Codegen of vector nodes >> NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. >> >> Details: >> - Lanewise vector operations are unaffected as explained above. >> - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). >> - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, addin... > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Disable auto-vectorization of double to short conversion for NEON and update tests ping~ Hi @theRealAph, could you please help take a look at the latest commit? Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3101230913 From wanghaomin at openjdk.org Tue Jul 22 07:27:28 2025 From: wanghaomin at openjdk.org (Wang Haomin) Date: Tue, 22 Jul 2025 07:27:28 GMT Subject: RFR: 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD Message-ID: Same as https://bugs.openjdk.org/browse/JDK-8362171 , so I've added `IsFiniteF`, `IsFiniteD` to the assert switch. 
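For illustration (an editorial sketch; the actual reproducer is not included in the message above): the kind of loop that can produce IsFiniteD nodes when Double.isFinite is intrinsified and the loop is considered by SuperWord.

    // Editorial sketch of a loop over Double.isFinite; on platforms where
    // isFinite is intrinsified, C2 sees IsFiniteD nodes here, the node kind
    // the SuperWord truncation assert did not previously expect.
    public class IsFiniteLoop {
        static int countFinite(double[] a) {
            int n = 0;
            for (int i = 0; i < a.length; i++) {
                if (Double.isFinite(a[i])) {
                    n++;
                }
            }
            return n;
        }

        public static void main(String[] args) {
            double[] a = new double[1024];
            a[3] = Double.POSITIVE_INFINITY;
            System.out.println(countFinite(a)); // 1023
        }
    }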
------------- Commit messages: - 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD Changes: https://git.openjdk.org/jdk/pull/26423/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26423&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362972 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26423.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26423/head:pull/26423 PR: https://git.openjdk.org/jdk/pull/26423 From aph at openjdk.org Tue Jul 22 07:54:29 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 22 Jul 2025 07:54:29 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 01:23:43 GMT, Xiaohong Gong wrote: >> ### Background >> On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. >> >> For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. >> >> To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. >> >> ### Impact Analysis >> #### 1. Vector types >> Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. >> >> #### 2. Vector API >> No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. >> >> #### 3. Auto-vectorization >> Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. >> >> #### 4. Codegen of vector nodes >> NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. >> >> Details: >> - Lanewise vector operations are unaffected as explained above. >> - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). >> - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, addin... > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Disable auto-vectorization of double to short conversion for NEON and update tests Looks good, thanks. 
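For illustration (an editorial sketch, not part of the review above): the Java-level conversion the background paragraph describes, between a 128-bit short vector (8 lanes) and a 128-bit long vector (2 lanes). Whether it is intrinsified depends on the platform; the species, input values and part number below are just example assumptions.

    // Editorial sketch (requires --add-modules jdk.incubator.vector):
    // part 0 widens the first two short lanes into a 2-lane long vector.
    import jdk.incubator.vector.LongVector;
    import jdk.incubator.vector.ShortVector;
    import jdk.incubator.vector.VectorOperators;
    import jdk.incubator.vector.VectorSpecies;

    public class ShortToLong128 {
        static final VectorSpecies<Short> S128 = ShortVector.SPECIES_128; // 8 lanes
        static final VectorSpecies<Long>  L128 = LongVector.SPECIES_128;  // 2 lanes

        public static void main(String[] args) {
            short[] src = {1, 2, 3, 4, 5, 6, 7, 8};
            ShortVector sv = ShortVector.fromArray(S128, src, 0);
            LongVector lv = (LongVector) sv.convertShape(VectorOperators.S2L, L128, 0);
            System.out.println(lv); // [1, 2]
        }
    }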
------------- Marked as reviewed by aph (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26057#pullrequestreview-3041615796 From xgong at openjdk.org Tue Jul 22 07:54:29 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 22 Jul 2025 07:54:29 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: On Tue, 22 Jul 2025 07:50:48 GMT, Andrew Haley wrote: > Looks good, thanks. Thanks so much for your review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3101502718 From bkilambi at openjdk.org Tue Jul 22 08:09:31 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 22 Jul 2025 08:09:31 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector In-Reply-To: References: <03mNhjjP_PvR9nxPUCaIkN5NF--gH7-AMqiHJlAzJW0=.e0e1cd1e-f236-4a6d-b9da-1459eed6077d@github.com> Message-ID: On Fri, 6 Jun 2025 01:24:28 GMT, Xiaohong Gong wrote: >>> Good job @Bhavana-Kilambi ! Generally looks good to me. Just some minor issues that I have left the comments. Besides, could you please add some IR tests for this optimization? Thanks! >> >> Hi @XiaohongGong , there are tests already for this operation under `jdk/jdk/incubator/vector` for all the types and sizes to verify the results. Did you mean IR tests for verifying if the correct backend match rule is being generated ? > >> > Good job @Bhavana-Kilambi ! Generally looks good to me. Just some minor issues that I have left the comments. Besides, could you please add some IR tests for this optimization? Thanks! >> >> Hi @XiaohongGong , there are tests already for this operation under `jdk/jdk/incubator/vector` for all the types and sizes to verify the results. Did you mean IR tests for verifying if the correct backend match rule is being generated ? > > Yes, I think adding an IR check tests for this operation will be better. I think checking the mid-end IR is enough. Hi @XiaohongGong @theRealAph @shqking Can I please ask for another round of review for the latest patch? Thanks in advance! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23570#issuecomment-3101552923 From mchevalier at openjdk.org Tue Jul 22 08:16:33 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 22 Jul 2025 08:16:33 GMT Subject: RFR: 8358696: Assert with extreme values for -XX:BciProfileWidth In-Reply-To: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> References: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> Message-ID: On Fri, 4 Jul 2025 21:47:24 GMT, Saranya Natarajan wrote: > **Issue** > Extreme values for BciProfileWidth flag such as `java -XX:BciProfileWidth=-1 -version` and `java -XX:BciProfileWidth=100000 -version `results in assert failure `assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. This is observed in a x86 machine. > > **Analysis** > On debugging the issue, I found that increasing the size of the interpreter using the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` prevented the above mentioned assert from failing for large values of BciProfileWidth. > > **Proposal** > Considering the fact that larger BciProfileWidth results in slower profiling, I have proposed a range between 0 to 5000 to restrict the value for BciProfileWidth for x86 machines. 
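For illustration (an editorial sketch, not part of the message above): the Java-level operation behind the SelectFromTwoVector node. The two-vector selectFrom overload and the lane semantics in the comments are assumptions based on the API this PR intrinsifies, not something stated in the thread.

    // Editorial sketch (requires --add-modules jdk.incubator.vector):
    // each lane of 'idx' selects an element from the concatenation of a and b,
    // so indexes 0..3 pick from a and indexes 4..7 pick from b (4-lane species).
    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorSpecies;

    public class SelectFromTwo {
        static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128; // 4 lanes

        public static void main(String[] args) {
            IntVector a   = IntVector.fromArray(SPECIES, new int[]{10, 11, 12, 13}, 0);
            IntVector b   = IntVector.fromArray(SPECIES, new int[]{20, 21, 22, 23}, 0);
            IntVector idx = IntVector.fromArray(SPECIES, new int[]{0, 5, 2, 7}, 0);
            System.out.println(idx.selectFrom(a, b)); // expected: [10, 21, 12, 23]
        }
    }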
This maximum value is based on modifying the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` using the smallest `InterpreterCodeSize` for all the architectures. As for the lower bound, a value of -1 would be the same as 0, as this simply means no return bci's will be recorded in ret profile. > > **Issue in AArch64** > Additionally running the command `java -XX:BciProfileWidth= 10000 -version` (or larger values) results in a different failure `assert(offset_ok_for_immed(offset(), size)) failed: must be, was: 32768, 3` on an AArch64 machine.This is an issue of maximum offset for `ldr/str` in AArch64 which can be fixed using `form_address` as mentioned in [JDK-8342736](https://bugs.openjdk.org/browse/JDK-8342736). In my preliminary fix using `form_address` on AArch64 machine. I had to modify 3 `ldr` and 1 `str` instruction (in file `src/hotspot/cpu/aarch64/interp_masm_aarch64.cpp` at line number 926, 983, and 997). With this fix using `form_address`, `BciProfileWidth` works for maximum of 5000 after which it crashes with`assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. Without this fix `BciProfileWidth` works for a maximum value of 1300. Currently, I have suggested to restrict the upper bound on AArch64 to 1000 instead of fixing it with `form_address`. > > **Question to reviewers** > Do you think this is a reasonable fix ? For AArch64 do you suggest fixing using `form_address` ? If yes, do I fix it under this PR or create another one ? One nit, one open comment. Not of a lot of opinion on whether to use `form_address` instead. Since `BciProfileWidth` is a develop flag, I'm not too annoyed if we limit it to avoid some change that would affect product builds. Except of course if the offset issue is a deeper problem that deserves to be solved anyway. src/hotspot/share/runtime/globals.hpp line 1354: > 1352: range(0, 8) \ > 1353: \ > 1354: develop(intx, BciProfileWidth, 2, \ Recently, I've seen someone complaining about useless use of `intx`, saying that is brings less readability than a more fixed-width type when not needed. Here, [0, 5000] fits in 16 bits (even signed). One could change that into a simple `int` or something like that. src/hotspot/share/runtime/globals.hpp line 1357: > 1355: "Number of return bci's to record in ret profile") \ > 1356: range(0, AARCH64_ONLY(1000) NOT_AARCH64(5000)) \ > 1357: \ Maybe that's one empty line too much (cf. other spacing just around). ------------- PR Review: https://git.openjdk.org/jdk/pull/26139#pullrequestreview-3041683534 PR Review Comment: https://git.openjdk.org/jdk/pull/26139#discussion_r2221609770 PR Review Comment: https://git.openjdk.org/jdk/pull/26139#discussion_r2221600897 From xgong at openjdk.org Tue Jul 22 08:19:34 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 22 Jul 2025 08:19:34 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v16] In-Reply-To: References: Message-ID: <6hto0G_9vUo7YEPmaxArwIxndMaktX74csGZLApe5Nc=.fc5dab03-f357-4d08-8870-bdf070b5bf45@github.com> On Mon, 21 Jul 2025 11:09:04 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. 
>> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Refine comments in c2_MacroAssembler_aarch64.cpp LGTM! Thanks for your updating! ------------- Marked as reviewed by xgong (Committer). PR Review: https://git.openjdk.org/jdk/pull/23570#pullrequestreview-3041750591 From roland at openjdk.org Tue Jul 22 08:31:48 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 22 Jul 2025 08:31:48 GMT Subject: RFR: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? Message-ID: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> A node in a pre loop only has uses out of the loop dominated by the loop exit. `PhaseIdealLoop::try_sink_out_of_loop()` sets its control to the loop exit projection. A range check in the main loop has this node as input (through a chain of some other nodes). Range check elimination needs to update the exit condition of the pre loop with an expression that depends on the node pinned on its exit: that's impossible and the assert fires. This is a variant of 8314024 (this one was for a node with uses out of the pre loop on multiple paths). I propose the same fix: leave the node with control in the pre loop in this case. 
------------- Commit messages: - tests - fix Changes: https://git.openjdk.org/jdk/pull/26424/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26424&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361702 Stats: 178 lines in 4 files changed: 160 ins; 7 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/26424.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26424/head:pull/26424 PR: https://git.openjdk.org/jdk/pull/26424 From roland at openjdk.org Tue Jul 22 08:38:56 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 22 Jul 2025 08:38:56 GMT Subject: Integrated: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops In-Reply-To: References: Message-ID: On Tue, 22 Oct 2024 07:48:33 GMT, Roland Westrelin wrote: > To optimize a long counted loop and long range checks in a long or int > counted loop, the loop is turned into a loop nest. When the loop has > few iterations, the overhead of having an outer loop whose backedge is > never taken, has a measurable cost. Furthermore, creating the loop > nest usually causes one iteration of the loop to be peeled so > predicates can be set up. If the loop is short running, then it's an > extra iteration that's run with range checks (compared to an int > counted loop with int range checks). > > This change doesn't create a loop nest when: > > 1- it can be determined statically at loop nest creation time that the > loop runs for a short enough number of iterations > > 2- profiling reports that the loop runs for no more than ShortLoopIter > iterations (1000 by default). > > For 2-, a guard is added which is implemented as yet another predicate. > > While this change is in principle simple, I ran into a few > implementation issues: > > - while c2 has a way to compute the number of iterations of an int > counted loop, it doesn't have that for long counted loop. The > existing logic for int counted loops promotes values to long to > avoid overflows. I reworked it so it now works for both long and int > counted loops. > > - I added a new deoptimization reason (Reason_short_running_loop) for > the new predicate. Given the number of iterations is narrowed down > by the predicate, the limit of the loop after transformation is a > cast node that's control dependent on the short running loop > predicate. Because once the counted loop is transformed, it is > likely that range check predicates will be inserted and they will > depend on the limit, the short running loop predicate has to be the > one that's further away from the loop entry. Now it is also possible > that the limit before transformation depends on a predicate > (TestShortRunningLongCountedLoopPredicatesClone is an example), we > can have: new predicates inserted after the transformation that > depend on the casted limit that itself depend on old predicates > added before the transformation. To solve this cicular dependency, > parse and assert predicates are cloned between the old predicates > and the loop head. The cloned short running loop parse predicate is > the one that's used to insert the short running loop predicate. > > - In the case of a long counted loop, the loop is transformed into a > regular loop with a new limit and transformed range checks that's > later turned into an in counted loop. The int ... This pull request has now been integrated. 
Changeset: f1556611 Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/f155661151fc25cde3be17878aeb24056555961c Stats: 1688 lines in 27 files changed: 1609 ins; 23 del; 56 mod 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops Co-authored-by: Maurizio Cimadamore Co-authored-by: Christian Hagedorn Reviewed-by: chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/21630 From roland at openjdk.org Tue Jul 22 08:38:54 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 22 Jul 2025 08:38:54 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v37] In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 11:53:41 GMT, Christian Hagedorn wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> test failures > > Great! Sure, I've submitted another round of testing. Will report back again. @chhagedorn thanks for the review and testing @TobiHartmann thanks for the review @eme64 thanks for the comments ------------- PR Comment: https://git.openjdk.org/jdk/pull/21630#issuecomment-3101656783 From thartmann at openjdk.org Tue Jul 22 08:40:25 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 22 Jul 2025 08:40:25 GMT Subject: RFR: 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD In-Reply-To: References: Message-ID: On Tue, 22 Jul 2025 07:21:15 GMT, Wang Haomin wrote: > Same as https://bugs.openjdk.org/browse/JDK-8362171 , so I've added `IsFiniteF`, `IsFiniteD` to the assert switch. Hi @haominw, could you please add a regression test for this issue? Thanks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26423#issuecomment-3101670533 From mchevalier at openjdk.org Tue Jul 22 08:51:35 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 22 Jul 2025 08:51:35 GMT Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v5] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 12:36:31 GMT, Marc Chevalier wrote: >> A first part toward a better support of pure functions, but this time, with guidance from @iwanowww. >> >> ## Pure Functions >> >> Pure functions (considered here) are functions that have no side effects, no effect on the control flow (no exception or such), cannot deopt etc.. It's really a function that you can execute anywhere, with whichever arguments without effect other than wasting time. Integer division is not pure as dividing by zero is throwing. But many floating point functions will just return `NaN` or `+/-infinity` in problematic cases. >> >> ## Scope >> >> We are not going all powerful for now! It's mostly about identifying some pure functions and being able to remove them if the result is unused. Some other things are not part of this PR, on purpose. Especially, this PR doesn't propose a way to move pure calls around. The reason is that pure calls are later expanded into regular calls, which require a control input. To be able to do the expansion, we just keep the control in the pure call as well. >> >> ## Implementation Overview >> >> We created here some new node kind for pure calls, inheriting leaf calls, that are expanded into regular leaf calls during final graph reshaping. The possibility to support pure call directly in AD file is left open. 
>> >> This PR also introduces `TupleNode` (largely based on an original idea/implem of @iwanowww), that just tie multiple input together and play well with `ProjNode`: the n-th projection of a `TupleNode` is the n-th input of the tuple. This is a convenient way to skip and remove nodes from the graph while delegating the difficulty of the surgery to the trusted IGVN's implementation. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Tentative to address Tobias' comments Thanks all for your comments! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25760#issuecomment-3101702054 From mchevalier at openjdk.org Tue Jul 22 08:51:37 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 22 Jul 2025 08:51:37 GMT Subject: Integrated: 8347901: C2 should remove unused leaf / pure runtime calls In-Reply-To: References: Message-ID: On Wed, 11 Jun 2025 16:18:41 GMT, Marc Chevalier wrote: > A first part toward a better support of pure functions, but this time, with guidance from @iwanowww. > > ## Pure Functions > > Pure functions (considered here) are functions that have no side effects, no effect on the control flow (no exception or such), cannot deopt etc.. It's really a function that you can execute anywhere, with whichever arguments without effect other than wasting time. Integer division is not pure as dividing by zero is throwing. But many floating point functions will just return `NaN` or `+/-infinity` in problematic cases. > > ## Scope > > We are not going all powerful for now! It's mostly about identifying some pure functions and being able to remove them if the result is unused. Some other things are not part of this PR, on purpose. Especially, this PR doesn't propose a way to move pure calls around. The reason is that pure calls are later expanded into regular calls, which require a control input. To be able to do the expansion, we just keep the control in the pure call as well. > > ## Implementation Overview > > We created here some new node kind for pure calls, inheriting leaf calls, that are expanded into regular leaf calls during final graph reshaping. The possibility to support pure call directly in AD file is left open. > > This PR also introduces `TupleNode` (largely based on an original idea/implem of @iwanowww), that just tie multiple input together and play well with `ProjNode`: the n-th projection of a `TupleNode` is the n-th input of the tuple. This is a convenient way to skip and remove nodes from the graph while delegating the difficulty of the surgery to the trusted IGVN's implementation. > > Thanks, > Marc This pull request has now been integrated. 
Changeset: ed70910b Author: Marc Chevalier URL: https://git.openjdk.org/jdk/commit/ed70910b0f3e1b19d915ec13ac3434407d01bc5d Stats: 343 lines in 15 files changed: 198 ins; 61 del; 84 mod 8347901: C2 should remove unused leaf / pure runtime calls Reviewed-by: thartmann, vlivanov ------------- PR: https://git.openjdk.org/jdk/pull/25760 From jbhateja at openjdk.org Tue Jul 22 08:57:26 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 22 Jul 2025 08:57:26 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4] In-Reply-To: <-ZNeXOcmEACkhP4QKXKnWWEiT6ucjPY7Zz1HqvMeAoI=.c8fae49e-fcb0-41fb-84d1-4aa52ee83790@github.com> References: <-ZNeXOcmEACkhP4QKXKnWWEiT6ucjPY7Zz1HqvMeAoI=.c8fae49e-fcb0-41fb-84d1-4aa52ee83790@github.com> Message-ID: <1RlpmwLAF5ALeZQRS_DAqixgD6MUno5cUbguqHTlUU0=.6a594ecc-b8f7-489d-b801-a41e87d1deeb@github.com> On Tue, 22 Jul 2025 03:01:43 GMT, erifan wrote: > Do you mean this check `Matcher::match_rule_supported_vector(opc, vlen, maskall_bt)` ? I think it's necessary ? Because in theory some platforms don't support both `MaskAll` and `Replicate`. Of course, this situation may not exist in reality. If `MaskAll` and `Replicate` are not supported, then `VectorLongToMask` should not be supported either, and this function will not be called. My suggestion was to check for Op_Replicate here as Op_MaskAll is already checked underneath VectorNode::scalar2vector under an assumption that MaskAll is a special case for replicate applicable to masks ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2221783983 From duke at openjdk.org Tue Jul 22 09:07:26 2025 From: duke at openjdk.org (erifan) Date: Tue, 22 Jul 2025 09:07:26 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4] In-Reply-To: <1RlpmwLAF5ALeZQRS_DAqixgD6MUno5cUbguqHTlUU0=.6a594ecc-b8f7-489d-b801-a41e87d1deeb@github.com> References: <-ZNeXOcmEACkhP4QKXKnWWEiT6ucjPY7Zz1HqvMeAoI=.c8fae49e-fcb0-41fb-84d1-4aa52ee83790@github.com> <1RlpmwLAF5ALeZQRS_DAqixgD6MUno5cUbguqHTlUU0=.6a594ecc-b8f7-489d-b801-a41e87d1deeb@github.com> Message-ID: On Tue, 22 Jul 2025 08:54:23 GMT, Jatin Bhateja wrote: >> Do you mean this check `Matcher::match_rule_supported_vector(opc, vlen, maskall_bt)` ? I think it's necessary ? Because in theory some platforms don't support both `MaskAll` and `Replicate`. Of course, this situation may not exist in reality. If `MaskAll` and `Replicate` are not supported, then `VectorLongToMask` should not be supported either, and this function will not be called. > >> Do you mean this check `Matcher::match_rule_supported_vector(opc, vlen, maskall_bt)` ? I think it's necessary ? Because in theory some platforms don't support both `MaskAll` and `Replicate`. Of course, this situation may not exist in reality. If `MaskAll` and `Replicate` are not supported, then `VectorLongToMask` should not be supported either, and this function will not be called. > > My suggestion was to check for Op_Replicate here as Op_MaskAll is already checked underneath VectorNode::scalar2vector under an assumption that MaskAll is a special case for replicate applicable to masks Oh I misunderstood what you meant, now I understand, thank you! 
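As a Java-level illustration of the all-true/all-false cases discussed above (a sketch for context only, not code from the patch; it assumes the incubating jdk.incubator.vector module is enabled on the command line):

    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorMask;
    import jdk.incubator.vector.VectorSpecies;

    public class FromLongDemo {
        static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

        public static void main(String[] args) {
            // -1L sets every bit and 0L sets none, so these masks are the
            // all-true/all-false shapes that behave like maskAll(true)/maskAll(false).
            VectorMask<Integer> allTrue  = VectorMask.fromLong(SPECIES, -1L);
            VectorMask<Integer> allFalse = VectorMask.fromLong(SPECIES, 0L);
            System.out.println(allTrue.allTrue() + " " + allFalse.anyTrue()); // true false
        }
    }

It can be run with `java --add-modules jdk.incubator.vector FromLongDemo.java`.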
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2221829278 From xgong at openjdk.org Tue Jul 22 09:09:36 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 22 Jul 2025 09:09:36 GMT Subject: Integrated: 8359419: AArch64: Relax min vector length to 32-bit for short vectors In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 05:59:15 GMT, Xiaohong Gong wrote: > ### Background > On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. > > For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. > > To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. > > ### Impact Analysis > #### 1. Vector types > Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. > > #### 2. Vector API > No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. > > #### 3. Auto-vectorization > Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. > > #### 4. Codegen of vector nodes > NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. > > Details: > - Lanewise vector operations are unaffected as explained above. > - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). > - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, adding an explicit vector size check in `match_rule_s... This pull request has now been integrated. 
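As a Java-level sketch of the short-to-long conversions that this change allows to be intrinsified (illustrative only, not taken from the PR; the incubating Vector API is assumed):

    import jdk.incubator.vector.LongVector;
    import jdk.incubator.vector.ShortVector;
    import jdk.incubator.vector.VectorOperators;

    public class ShortToLongDemo {
        public static void main(String[] args) {
            short[] src = new short[ShortVector.SPECIES_128.length()];
            for (int i = 0; i < src.length; i++) src[i] = (short) (i + 1);
            ShortVector sv = ShortVector.fromArray(ShortVector.SPECIES_128, src, 0);
            // Widen the first two short lanes (part 0) into a 128-bit long vector,
            // the conversion shape that needs a 32-bit intermediate short vector.
            LongVector lv = (LongVector) sv.convertShape(VectorOperators.S2L, LongVector.SPECIES_128, 0);
            System.out.println(lv);
        }
    }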
Changeset: ac141c2f Author: Xiaohong Gong URL: https://git.openjdk.org/jdk/commit/ac141c2fa1d818858e7a12a50837bb282282ecac Stats: 359 lines in 10 files changed: 231 ins; 9 del; 119 mod 8359419: AArch64: Relax min vector length to 32-bit for short vectors Reviewed-by: aph, fgao, bkilambi, dlunden ------------- PR: https://git.openjdk.org/jdk/pull/26057 From wanghaomin at openjdk.org Tue Jul 22 09:14:24 2025 From: wanghaomin at openjdk.org (Wang Haomin) Date: Tue, 22 Jul 2025 09:14:24 GMT Subject: RFR: 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD In-Reply-To: References: Message-ID: On Tue, 22 Jul 2025 08:38:03 GMT, Tobias Hartmann wrote: > Hi @haominw, could you please add a regression test for this issue? Thanks. Hi, test/hotspot/jtreg/compiler/intrinsics/TestDoubleIsFinite.java and TestFloatIsFinite.java can trigger this issue. I encountered the issue while adding the matcher `match(Set dst (CMoveI (Binary cop (CmpI (IsFiniteF op) zero)) (Binary src1 src2)));` on riscv ad file. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26423#issuecomment-3101798232 From duke at openjdk.org Tue Jul 22 10:04:33 2025 From: duke at openjdk.org (erifan) Date: Tue, 22 Jul 2025 10:04:33 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 11:17:41 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. 
> > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments to half the number of match rules Just some minor comments, not a block. src/hotspot/cpu/aarch64/aarch64.ad line 923: > 921: V24, V24_H, V24_J, V24_K > 922: ); > 923: Not a big matter, but it looks better to me if you can move this change `after line 810` of this file. src/hotspot/cpu/aarch64/aarch64.ad line 5091: > 5089: format %{ %} > 5090: interface(REG_INTER); > 5091: %} Ditto, I tend to moving this change `after line 5101` of this file. src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 5181: > 5179: %}')dnl > 5180: dnl > 5181: Remove this blank otherwise two blank lines will be generated. See `src/hotspot/cpu/aarch64/aarch64_vector.ad` line 7180 and line 7181 ------------- PR Review: https://git.openjdk.org/jdk/pull/23570#pullrequestreview-3042166970 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2221959277 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2221970019 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2221941918 From jbhateja at openjdk.org Tue Jul 22 10:28:11 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 22 Jul 2025 10:28:11 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v16] In-Reply-To: <93ZKsln0-lAlg3-KYPCFRIZfT5gm8I6ebltkWWRLzVY=.8af16ece-4c80-42ff-9ddb-70036ecd6290@github.com> References: <93ZKsln0-lAlg3-KYPCFRIZfT5gm8I6ebltkWWRLzVY=.8af16ece-4c80-42ff-9ddb-70036ecd6290@github.com> Message-ID: On Tue, 22 Jul 2025 10:24:43 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. >> >> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. >> >> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Update intrinsicnode.cpp Hi @kvn, Apart from couple of lines of fix, patch mainly re-structured existing handling and added additional comments and proofs along with an exhaustive test. So effective code changes are very minimal. Should I check-in this in jdk-mainline and then prepare minimal fix (if needed) for jdk25 ? ------------- PR Review: https://git.openjdk.org/jdk/pull/23947#pullrequestreview-3042257403 From jbhateja at openjdk.org Tue Jul 22 10:28:11 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 22 Jul 2025 10:28:11 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v16] In-Reply-To: References: Message-ID: <93ZKsln0-lAlg3-KYPCFRIZfT5gm8I6ebltkWWRLzVY=.8af16ece-4c80-42ff-9ddb-70036ecd6290@github.com> > Hi All, > > This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. 
>
> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value. In this case, an erroneous value range estimation results in a constant value. The existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of the result based on the upper and lower bounds of the mask type.
>
> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning.
>
> Kindly review and share your feedback.
>
> Best Regards,
> Jatin

Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:

  Update intrinsicnode.cpp

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/23947/files
  - new: https://git.openjdk.org/jdk/pull/23947/files/4f33d4b4..161487e6

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=15
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=14-15
  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/23947.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23947/head:pull/23947

PR: https://git.openjdk.org/jdk/pull/23947

From jbhateja at openjdk.org Tue Jul 22 10:28:12 2025
From: jbhateja at openjdk.org (Jatin Bhateja)
Date: Tue, 22 Jul 2025 10:28:12 GMT
Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v15]
In-Reply-To:
References:
Message-ID:

On Wed, 16 Jul 2025 10:43:34 GMT, Jatin Bhateja wrote:

>> Hi All,
>>
>> This bugfix patch fixes incorrect value computation for the Integer/Long.compress APIs.
>>
>> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value. In this case, an erroneous value range estimation results in a constant value. The existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of the result based on the upper and lower bounds of the mask type.
>>
>> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning.
>>
>> Kindly review and share your feedback.
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
>   Refine lower bound computation

src/hotspot/share/opto/intrinsicnode.cpp line 358:

> 356: assert(lo == (bt == T_INT ? min_jint : min_jlong) || lo == 0, "");
> 357:
> 358: if (src_type->hi_as_long() >= 0) {

In order to check for a non-negative non-constant src_type, the check should be against the lower bound, i.e. src_type->lo_as_long() >= 0, since C2's integral types (TypeInt/TypeLong) maintain the invariant that _lo < _hi for non-constant values; iff _lo == _hi then it's a singleton value, i.e. a constant.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23947#discussion_r2222010818

From thartmann at openjdk.org Tue Jul 22 10:48:23 2025
From: thartmann at openjdk.org (Tobias Hartmann)
Date: Tue, 22 Jul 2025 10:48:23 GMT
Subject: RFR: 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD
In-Reply-To:
References:
Message-ID:

On Tue, 22 Jul 2025 07:21:15 GMT, Wang Haomin wrote:

> Same as https://bugs.openjdk.org/browse/JDK-8362171 , so I've added `IsFiniteF`, `IsFiniteD` to the assert switch.
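For context, `IsFiniteF`/`IsFiniteD` nodes come from the `Float.isFinite`/`Double.isFinite` intrinsics, and a conditional select like the one below is the kind of Java shape that can end up as the `CMoveI`-over-`IsFiniteF` pattern discussed in this thread (a minimal illustration, not the jtreg test itself):

    public class IsFiniteSelect {
        // C2 may compile this ternary into a CMoveI whose condition tests IsFiniteF
        // when the Float.isFinite intrinsic is available on the target.
        static int select(float f, int a, int b) {
            return Float.isFinite(f) ? a : b;
        }

        public static void main(String[] args) {
            int sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                float f = (i % 3 == 0) ? Float.NaN : i * 0.5f;
                sum += select(f, 1, 2);
            }
            System.out.println(sum);
        }
    }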
We run these test regularly but we didn't observe the issue. Are you running with any non-default VM flags? Does it only reproduce on RISCV? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26423#issuecomment-3102193035 From galder at openjdk.org Tue Jul 22 10:49:33 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Tue, 22 Jul 2025 10:49:33 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v5] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 14:38:59 GMT, Feilong Jiang wrote: >> src/hotspot/cpu/riscv/c1_LIRGenerator_riscv.cpp line 775: >> >>> 773: arraycopy_helper(x, &flags, &expected_type); >>> 774: if (x->check_flag(Instruction::OmitChecksFlag)) { >>> 775: flags = (flags & (LIR_OpArrayCopy::unaligned | LIR_OpArrayCopy::overlapping)); >> >> The changes in the two files need to be in synch, so I wonder if `LIR_OpArrayCopy::unaligned | LIR_OpArrayCopy::overlapping` could be abstracted away within a function in `LIR_OpArrayCopy`. >> >> So something like this (apologies for any syntactic/semantic errors): >> >> >> flags = (flags & LIR_OpArrayGopy::get_array_copy_flags()); >> >> >> Then on the other method something like: >> >> >> ((flags & ~(LIR_OpArrayGopy::get_array_copy_flags())) == 0) >> >> >> Function name is just an example, feel free to suggest some other if you think it fits better. >> >> Thoughts? > > Adding new flag check routines seems like a good idea, but it's a bit challenging to choose a name, as there are too many flags for `LIR_OPArrayCopy`[1]. Perhaps something like `should_check_unaligned_or_overlapping` would be suitable? > > 1. https://github.com/openjdk/jdk/blob/15b5b54ac707ba0d4e473fd6eb02c38a8efe705c/src/hotspot/share/c1/c1_LIR.hpp#L1257-L1271 Hmmm, I don't think I like that name. It's too specific on the flags but does not convey what it's used for. The aim of `flag=0` was to avoid instantiation of array copy stubs, so maybe the name could be `init_flags_for_copy_stubs`? It could be prepended with a `get_` if needs be to avoid confusion. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2222146055 From jbhateja at openjdk.org Tue Jul 22 10:49:41 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 22 Jul 2025 10:49:41 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v15] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 07:19:42 GMT, Tobias Hartmann wrote: > This looks good to me and I think you addressed all the comments that Emanuel had. Let's wait for another day or two in case someone else wants to take a look as well. > > In the meantime, please request approval for integration into JDK 25 since we are know at RDP 2: https://openjdk.org/jeps/3#Fix-Request-Process Hi @TobiHartmann, Can you please re-verify the latest version of patch and approve if all tests are green. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3102195153 From fyang at openjdk.org Tue Jul 22 11:08:26 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 22 Jul 2025 11:08:26 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v3] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 08:52:38 GMT, Hamlin Li wrote: >> src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 190: >> >>> 188: assert(code != nullptr, "Could not find the containing code blob"); >>> 189: >>> 190: address dest = MacroAssembler::target_addr_for_insn(call_addr); >> >> Is this change safe? 
>> Seems it modifies the original logic.
>
> Yes, `MacroAssembler::pd_call_destination` only calls `MacroAssembler::target_addr_for_insn`.
> And `MacroAssembler::target_addr_for_insn` is used in other places in NativeFarCall, so it's better to use `target_addr_for_insn` only, to improve readability.

There seems to be a subtle difference here. I see `MacroAssembler::pd_call_destination` delegates work to `NativeFarCall::reloc_destination` which calls `MacroAssembler::target_addr_for_insn` under condition `if (stub_addr != nullptr)`. After this change, that condition is gone. I haven't looked into how this may make a difference.

I see this function was introduced by JDK-8332689, maybe @robehn could comment?

[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/nativeInst_riscv.cpp#L112

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2222185471

From wanghaomin at openjdk.org Tue Jul 22 11:20:29 2025
From: wanghaomin at openjdk.org (Wang Haomin)
Date: Tue, 22 Jul 2025 11:20:29 GMT
Subject: RFR: 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD
In-Reply-To:
References:
Message-ID:

On Tue, 22 Jul 2025 10:46:15 GMT, Tobias Hartmann wrote:

> We run these tests regularly but we didn't observe the issue. Are you running with any non-default VM flags? Does it only reproduce on RISCV?

Only adding a matcher like `match(Set dst (CMoveI (Binary cop (CmpI (IsFiniteF op) zero)) (Binary src1 src2)));` will trigger this issue. There are no issues with the default VM. I saw that `IsInfinite` has already been added to the non-truncating list, so I wanted to add `IsFinite` as well.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26423#issuecomment-3102285982

From mli at openjdk.org Tue Jul 22 11:55:27 2025
From: mli at openjdk.org (Hamlin Li)
Date: Tue, 22 Jul 2025 11:55:27 GMT
Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v3]
In-Reply-To:
References:
Message-ID:

On Tue, 22 Jul 2025 11:04:56 GMT, Fei Yang wrote:

> I see `MacroAssembler::pd_call_destination` delegates work to `NativeFarCall::reloc_destination` which calls `MacroAssembler::target_addr_for_insn` under condition `if (stub_addr != nullptr)`.

Can you clarify "`MacroAssembler::pd_call_destination` delegates work to `NativeFarCall::reloc_destination`"?
Robbin is on vacation for weeks, so I'm afraid he's not going to respond in time.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2222283203

From mli at openjdk.org Tue Jul 22 11:59:25 2025
From: mli at openjdk.org (Hamlin Li)
Date: Tue, 22 Jul 2025 11:59:25 GMT
Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v3]
In-Reply-To:
References:
Message-ID:

On Tue, 22 Jul 2025 11:52:44 GMT, Hamlin Li wrote:

>> There seems to be a subtle difference here. I see `MacroAssembler::pd_call_destination` delegates work to `NativeFarCall::reloc_destination` which calls `MacroAssembler::target_addr_for_insn` under condition `if (stub_addr != nullptr)`. After this change, that condition is gone. I haven't looked into how this may make a difference.
>>
>> I see this function was introduced by JDK-8332689, maybe @robehn could comment?
>>
>> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/nativeInst_riscv.cpp#L112
>
>> I see `MacroAssembler::pd_call_destination` delegates work to `NativeFarCall::reloc_destination` which calls `MacroAssembler::target_addr_for_insn` under condition `if (stub_addr != nullptr)`.
> > Can you clarify "`MacroAssembler::pd_call_destination` delegates work to `NativeFarCall::reloc_destination`"? > Robbin is on vacation for weeks, so I'm afraid he's not going to reponse in time. I don't think this pr changes the logic in `NativeFarCall::reloc_destination`, am I right? Or maybe you're misled by the name change from `stub_address` to `reloc_destination_without_check` and existing method `reloc_destination()`? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2222291033 From snatarajan at openjdk.org Tue Jul 22 13:03:10 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Tue, 22 Jul 2025 13:03:10 GMT Subject: RFR: 8358696: Assert with extreme values for -XX:BciProfileWidth [v2] In-Reply-To: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> References: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> Message-ID: > **Issue** > Extreme values for BciProfileWidth flag such as `java -XX:BciProfileWidth=-1 -version` and `java -XX:BciProfileWidth=100000 -version `results in assert failure `assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. This is observed in a x86 machine. > > **Analysis** > On debugging the issue, I found that increasing the size of the interpreter using the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` prevented the above mentioned assert from failing for large values of BciProfileWidth. > > **Proposal** > Considering the fact that larger BciProfileWidth results in slower profiling, I have proposed a range between 0 to 5000 to restrict the value for BciProfileWidth for x86 machines. This maximum value is based on modifying the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` using the smallest `InterpreterCodeSize` for all the architectures. As for the lower bound, a value of -1 would be the same as 0, as this simply means no return bci's will be recorded in ret profile. > > **Issue in AArch64** > Additionally running the command `java -XX:BciProfileWidth= 10000 -version` (or larger values) results in a different failure `assert(offset_ok_for_immed(offset(), size)) failed: must be, was: 32768, 3` on an AArch64 machine.This is an issue of maximum offset for `ldr/str` in AArch64 which can be fixed using `form_address` as mentioned in [JDK-8342736](https://bugs.openjdk.org/browse/JDK-8342736). In my preliminary fix using `form_address` on AArch64 machine. I had to modify 3 `ldr` and 1 `str` instruction (in file `src/hotspot/cpu/aarch64/interp_masm_aarch64.cpp` at line number 926, 983, and 997). With this fix using `form_address`, `BciProfileWidth` works for maximum of 5000 after which it crashes with`assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. Without this fix `BciProfileWidth` works for a maximum value of 1300. Currently, I have suggested to restrict the upper bound on AArch64 to 1000 instead of fixing it with `form_address`. > > **Question to reviewers** > Do you think this is a reasonable fix ? For AArch64 do you suggest fixing using `form_address` ? If yes, do I fix it under this PR or create another one ? 
Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: addressing review comment 1 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26139/files - new: https://git.openjdk.org/jdk/pull/26139/files/6a36457d..a32b6ead Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26139&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26139&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 1 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26139.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26139/head:pull/26139 PR: https://git.openjdk.org/jdk/pull/26139 From snatarajan at openjdk.org Tue Jul 22 13:03:11 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Tue, 22 Jul 2025 13:03:11 GMT Subject: RFR: 8358696: Assert with extreme values for -XX:BciProfileWidth [v2] In-Reply-To: References: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> Message-ID: <-UlWQi6Pf7UwQKUR8sL4_Rhoj9MEd8UMlH7naG_W7QM=.d8947291-df9e-40e8-8007-4438c6c490c3@github.com> On Tue, 22 Jul 2025 08:08:28 GMT, Marc Chevalier wrote: >> Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: >> >> addressing review comment 1 > > src/hotspot/share/runtime/globals.hpp line 1354: > >> 1352: range(0, 8) \ >> 1353: \ >> 1354: develop(intx, BciProfileWidth, 2, \ > > Recently, I've seen someone complaining about useless use of `intx`, saying that is brings less readability than a more fixed-width type when not needed. Here, [0, 5000] fits in 16 bits (even signed). One could change that into a simple `int` or something like that. Since `int` seems to fit the range. I have changed `intx` to `int ` > src/hotspot/share/runtime/globals.hpp line 1357: > >> 1355: "Number of return bci's to record in ret profile") \ >> 1356: range(0, AARCH64_ONLY(1000) NOT_AARCH64(5000)) \ >> 1357: \ > > Maybe that's one empty line too much (cf. other spacing just around). Thank you. I fixed this. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26139#discussion_r2222457832 PR Review Comment: https://git.openjdk.org/jdk/pull/26139#discussion_r2222456543 From thartmann at openjdk.org Tue Jul 22 14:27:27 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 22 Jul 2025 14:27:27 GMT Subject: RFR: 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD In-Reply-To: References: Message-ID: <15Vd5Kn403cC3nHdk8UWhTwAKoJZQ00eT-VKkZn0iL8=.ae0e3372-8f2d-40f3-96a9-1b251260386a@github.com> On Tue, 22 Jul 2025 07:21:15 GMT, Wang Haomin wrote: > Same as https://bugs.openjdk.org/browse/JDK-8362171 , so I've added `IsFiniteF`, `IsFiniteD` to the assert switch. Marked as reviewed by thartmann (Reviewer). Okay, thanks for the details. The fix looks good to me. ------------- PR Review: https://git.openjdk.org/jdk/pull/26423#pullrequestreview-3043269319 PR Comment: https://git.openjdk.org/jdk/pull/26423#issuecomment-3102992278 From fjiang at openjdk.org Tue Jul 22 14:35:26 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Tue, 22 Jul 2025 14:35:26 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v5] In-Reply-To: References: Message-ID: On Tue, 22 Jul 2025 10:46:47 GMT, Galder Zamarre?o wrote: >> Adding new flag check routines seems like a good idea, but it's a bit challenging to choose a name, as there are too many flags for `LIR_OPArrayCopy`[1]. 
Perhaps something like `should_check_unaligned_or_overlapping` would be suitable? >> >> 1. https://github.com/openjdk/jdk/blob/15b5b54ac707ba0d4e473fd6eb02c38a8efe705c/src/hotspot/share/c1/c1_LIR.hpp#L1257-L1271 > > Hmmm, I don't think I like that name. It's too specific on the flags but does not convey what it's used for. The aim of `flag=0` was to avoid instantiation of array copy stubs, so maybe the name could be `init_flags_for_copy_stubs`? It could be prepended with a `get_` if needs be to avoid confusion. `init_flags_for_copy_stubs` appears to be misleading, as it suggests we want to generate array copy stubs for those flags. How about `get_necessary_copy_flags`? We can add other flags if needed in the future. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2222749793 From thartmann at openjdk.org Tue Jul 22 14:35:36 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 22 Jul 2025 14:35:36 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v15] In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 18:56:46 GMT, Tobias Hartmann wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Refine lower bound computation > > Thanks, testing looks good now! I'm out for the rest of the week and can review only next week. > Hi @TobiHartmann, Can you please re-verify the latest version of patch and approve if all tests are green. Sure, I'll re-run testing. How did you find the issue that you fixed in the latest commit? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3103028817 From thartmann at openjdk.org Tue Jul 22 14:35:37 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 22 Jul 2025 14:35:37 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v15] In-Reply-To: <6XRbYO3L5zyEIH2MuKSntl2IdAG2zgNVNm_0PakEI9g=.40266744-d0d0-421f-b5ff-17fd0117b693@github.com> References: <6XRbYO3L5zyEIH2MuKSntl2IdAG2zgNVNm_0PakEI9g=.40266744-d0d0-421f-b5ff-17fd0117b693@github.com> Message-ID: <6LUCFCu5hkbtz24H3-KIU5OCbaOsutU0HUakZhZEdzY=.31d897c4-fe36-4c3b-a8f7-7f4a713a0cf0@github.com> On Mon, 21 Jul 2025 20:21:25 GMT, Vladimir Kozlov wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Refine lower bound computation > > I am not sure about JDK 25 approval for these changes. > > Can you do simple fix for JDK 25 as @merykitty suggested: "I suggest removing all the logic and simply returning the bottom type" ? Will it be the same complexity? Will it affect performance (and how much)? @vnkozlov Given that this is an old issue already affecting JDK 19, maybe we should just defer to JDK 26 for now and then backport to JDK 25u only once the fix is stable? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3103037832 From galder at openjdk.org Tue Jul 22 15:37:09 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Tue, 22 Jul 2025 15:37:09 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v5] In-Reply-To: References: Message-ID: <-nkeoFoOTMWI6RfM_Jer0VWfjpFQI4Ky8_IUtlplDQE=.ee03ce55-ffae-4bf6-b237-9ad8d5a93d58@github.com> On Tue, 22 Jul 2025 14:32:24 GMT, Feilong Jiang wrote: >> Hmmm, I don't think I like that name. It's too specific on the flags but does not convey what it's used for. 
>> The aim of `flag=0` was to avoid instantiation of array copy stubs, so maybe the name could be `init_flags_for_copy_stubs`? It could be prepended with a `get_` if need be, to avoid confusion.
>
> `init_flags_for_copy_stubs` appears to be misleading, as it suggests we want to generate array copy stubs for those flags.
> How about `get_necessary_copy_flags`? We can add other flags if needed in the future.

I'm unsure about the use of the word `necessary`. What about `get_initial_copy_flags`?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2222930750

From fyang at openjdk.org Tue Jul 22 15:37:20 2025
From: fyang at openjdk.org (Fei Yang)
Date: Tue, 22 Jul 2025 15:37:20 GMT
Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v3]
In-Reply-To:
References:
Message-ID:

On Tue, 22 Jul 2025 11:56:35 GMT, Hamlin Li wrote:

>>> I see `MacroAssembler::pd_call_destination` delegates work to `NativeFarCall::reloc_destination` which calls `MacroAssembler::target_addr_for_insn` under condition `if (stub_addr != nullptr)`.
>>
>> Can you clarify "`MacroAssembler::pd_call_destination` delegates work to `NativeFarCall::reloc_destination`"?
>> Robbin is on vacation for weeks, so I'm afraid he's not going to respond in time.
>
> I don't think this pr changes the logic in `NativeFarCall::reloc_destination`, am I right?
> Or maybe you're misled by the name change from `stub_address` to `reloc_destination_without_check` and the existing method `reloc_destination()`?
>
> I see `MacroAssembler::pd_call_destination` delegates work to `NativeFarCall::reloc_destination` which calls `MacroAssembler::target_addr_for_insn` under condition `if (stub_addr != nullptr)`.
>
> Can you clarify "`MacroAssembler::pd_call_destination` delegates work to `NativeFarCall::reloc_destination`"? Robbin is on vacation for weeks, so I'm afraid he's not going to respond in time.

Sorry for not being clear. Let me clarify. I mean this code snippet at [1]:

 74 address Relocation::pd_call_destination(address orig_addr) {
 75   assert(is_call(), "should be a call here");
 76   if (NativeCall::is_at(addr())) {
 77     return nativeCall_at(addr())->reloc_destination(); <======================
 78   }
 79
 80   if (orig_addr != nullptr) {

Before this change, we call `MacroAssembler::pd_call_destination` here. And `NativeFarCall::reloc_destination` at L77 will only call `MacroAssembler::target_addr_for_insn` under condition `if (stub_addr != nullptr)` [2]. But we will always/unconditionally call `MacroAssembler::target_addr_for_insn` here after this change. That seems to make a difference? Did I miss anything?

[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/relocInfo_riscv.cpp#L77
[2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/nativeInst_riscv.cpp#L112

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2222880013

From kxu at openjdk.org Tue Jul 22 15:37:34 2025
From: kxu at openjdk.org (Kangcheng Xu)
Date: Tue, 22 Jul 2025 15:37:34 GMT
Subject: RFR: 8353290: C2: Refactor PhaseIdealLoop::is_counted_loop() [v3]
In-Reply-To:
References:
Message-ID: <2WHjT7pRiNmnEhn10X2stupsaA5R4uYI1SAgsgTb7Qo=.16d561b8-9afd-4e21-8518-c6b77382c9c3@github.com>

On Thu, 19 Jun 2025 16:30:51 GMT, Christian Hagedorn wrote:

>> Resolved conflict with [JDK-8357951](https://bugs.openjdk.org/browse/JDK-8357951). @chhagedorn I'd appreciate a re-review. Thank you so much!
>
> Thanks @tabjy for coming back with an update and pinging me again! Sorry, I completely missed it the first time.
I will be on vacation starting tomorrow for two weeks but I'm happy to take another look when I'm back :-) Resolved conflicts. @chhagedorn would kindly take a look? Thank you so much! ------------- PR Comment: https://git.openjdk.org/jdk/pull/24458#issuecomment-3103390781 From fjiang at openjdk.org Tue Jul 22 15:50:25 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Tue, 22 Jul 2025 15:50:25 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v6] In-Reply-To: References: Message-ID: > Hi, please consider. > [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. > The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. > If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. > This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. > We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. > > This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. > The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. > > Test on linux-riscv64: > - [x] Tier1-3 > > JMH data on P550 SBC for reference (w/o and w/ the patch): > > Before: > > Without COH: > > Benchmark (size) Mode Cnt Score Error Units > ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op > ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op > ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op > ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op > ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op > ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op > ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op > ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op > ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op > ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op > ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op > ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op > ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op > ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op > ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op > ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op > > ------------------------------------------------------------------------- > With COH: > > Benchmark (size) Mode Cnt Score Error Units > ArrayClone.byteArraycopy 0 avgt 15 50.667 ? 0.622 ns/op > Arra... Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 10 commits: - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - Add get_initial_copy_flags - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - also keep overlapping flag - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - Revert RISCV Macro modification - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses - riscv: fix c1 primitive array clone intrinsic regression ------------- Changes: https://git.openjdk.org/jdk/pull/25976/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=25976&range=05 Stats: 5 lines in 3 files changed: 3 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/25976.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25976/head:pull/25976 PR: https://git.openjdk.org/jdk/pull/25976 From kvn at openjdk.org Tue Jul 22 16:06:09 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 22 Jul 2025 16:06:09 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v15] In-Reply-To: <6XRbYO3L5zyEIH2MuKSntl2IdAG2zgNVNm_0PakEI9g=.40266744-d0d0-421f-b5ff-17fd0117b693@github.com> References: <6XRbYO3L5zyEIH2MuKSntl2IdAG2zgNVNm_0PakEI9g=.40266744-d0d0-421f-b5ff-17fd0117b693@github.com> Message-ID: On Mon, 21 Jul 2025 20:21:25 GMT, Vladimir Kozlov wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Refine lower bound computation > > I am not sure about JDK 25 approval for these changes. > > Can you do simple fix for JDK 25 as @merykitty suggested: "I suggest removing all the logic and simply returning the bottom type" ? Will it be the same complexity? Will it affect performance (and how much)? > @vnkozlov Given that this is an old issue already affecting JDK 19, maybe we should just defer to JDK 26 for now and then backport to JDK 25u only once the fix is stable? Yes, I agree. Please replace fix request with defer request in JBS bug report. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3103585873 From dlong at openjdk.org Tue Jul 22 19:59:57 2025 From: dlong at openjdk.org (Dean Long) Date: Tue, 22 Jul 2025 19:59:57 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v6] In-Reply-To: References: Message-ID: On Tue, 22 Jul 2025 15:50:25 GMT, Feilong Jiang wrote: >> Hi, please consider. >> [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. >> The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. >> If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. >> This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. >> We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. 
>> >> This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. >> The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. >> >> Test on linux-riscv64: >> - [x] Tier1-3 >> >> JMH data on P550 SBC for reference (w/o and w/ the patch): >> >> Before: >> >> Without COH: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op >> ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op >> ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op >> ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op >> ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op >> ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op >> ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op >> ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op >> ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op >> ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op >> ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op >> ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op >> >> ------------------------------------------------------------------------- >> With COH: >> >> Benchmark (size) Mode Cnt Score Error Un... > > Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: > > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - Add get_initial_copy_flags > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - also keep overlapping flag > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - Revert RISCV Macro modification > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses > - riscv: fix c1 primitive array clone intrinsic regression Marked as reviewed by dlong (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/25976#pullrequestreview-3044686749 From duke at openjdk.org Tue Jul 22 22:20:56 2025 From: duke at openjdk.org (duke) Date: Tue, 22 Jul 2025 22:20:56 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v5] In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 17:17:07 GMT, Srinivas Vamsi Parasa wrote: >> This PR adds support for the Push-Pop Acceleration (PPX) hint to legacy PUSH and POP instructions, enabling the PUSHP and POPP forms. The PPX hint improves performance by accelerating register value forwarding between matching push/pop pairs. >> >> **Purpose:** PPX is a performance hint that allows the processor to bypass memory and the training loop of Fast Store Forwarding Predictor (FSFP) by directly forwarding data between paired PUSHP and POPP instructions. >> >> **Requirements:** Both the PUSH and its matching POP must be marked with PPX. A "matching" pair accesses the same stack address (e.g., typical function prolog/epilog). Standalone PUSH instructions (e.g. for argument passing) must not be marked. 
>> >> **Encoding:** PUSHP/POPP is a replacement for legacy PUSH/POP (0x50+rd / 0x58+rd) and uses REX2.W = 1 (implies 64-bit operand size). PPX cannot be encoded with 16-bit operand size as REX2.W overrides the 0x66 prefix. > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > change to push_ppx/pop_ppx @vamsi-parasa Your change (at version 78cbf2430d2a8179d97201f799026747a38367a4) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25889#issuecomment-3104989703 From kvn at openjdk.org Tue Jul 22 22:40:07 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 22 Jul 2025 22:40:07 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v39] In-Reply-To: References: Message-ID: <1S6OFTVdnqqXUthfAtaa4PMg9AI6iA6a1xC2Un2yROk=.3c3ae976-1c3c-4a0d-8ef6-7963d105cff2@github.com> On Tue, 22 Jul 2025 01:05:53 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This does not modify existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request incrementally with five additional commits since the last revision: > > - Fix spacing > - Update NMethod.java with immutable data changes > - Rename method to nm > - Add assert before freeing immutable data > - Reorder is_relocatable checks Looks good to me now. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23573#pullrequestreview-3045062588 From fjiang at openjdk.org Wed Jul 23 01:08:04 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Wed, 23 Jul 2025 01:08:04 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v5] In-Reply-To: <-nkeoFoOTMWI6RfM_Jer0VWfjpFQI4Ky8_IUtlplDQE=.ee03ce55-ffae-4bf6-b237-9ad8d5a93d58@github.com> References: <-nkeoFoOTMWI6RfM_Jer0VWfjpFQI4Ky8_IUtlplDQE=.ee03ce55-ffae-4bf6-b237-9ad8d5a93d58@github.com> Message-ID: On Tue, 22 Jul 2025 15:25:25 GMT, Galder Zamarre?o wrote: >> `init_flags_for_copy_stubs` appears to be misleading, as it suggests we want to generate array copy stubs for those flags. >> How about `get_necessary_copy_flags`? We can add other flags if needed in the future. > > I'm unsure about the use of the word `necessary`. What about `get_initial_copy_flags`? `get_initial_copy_flags` looks fine, I have added it to `LIR_OpArrayCopy`. 
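For reference, the Java-level pattern exercised by this C1 intrinsic (and by the ArrayClone micro-benchmarks quoted earlier in the thread) is a plain primitive array clone; a minimal standalone example, for illustration only:

    import java.util.Arrays;

    public class PrimitiveCloneDemo {
        public static void main(String[] args) {
            int[] src = new int[1000];
            Arrays.setAll(src, i -> i);
            // clone() of a primitive array is the shape that the C1 clone intrinsic
            // implements by reusing the arraycopy code discussed in this thread.
            int[] copy = src.clone();
            System.out.println(copy.length + " " + copy[999]);
        }
    }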
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2224099302 From wanghaomin at openjdk.org Wed Jul 23 01:24:53 2025 From: wanghaomin at openjdk.org (Wang Haomin) Date: Wed, 23 Jul 2025 01:24:53 GMT Subject: RFR: 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD In-Reply-To: <15Vd5Kn403cC3nHdk8UWhTwAKoJZQ00eT-VKkZn0iL8=.ae0e3372-8f2d-40f3-96a9-1b251260386a@github.com> References: <15Vd5Kn403cC3nHdk8UWhTwAKoJZQ00eT-VKkZn0iL8=.ae0e3372-8f2d-40f3-96a9-1b251260386a@github.com> Message-ID: <4sAK_7XL1elXYXPKPbqmZpwmrQ5DfXanPaTkSBtfEw4=.d49d3155-fc14-4597-b3ae-c38f32191098@github.com> On Tue, 22 Jul 2025 14:25:03 GMT, Tobias Hartmann wrote: >> Same as https://bugs.openjdk.org/browse/JDK-8362171 , so I've added `IsFiniteF`, `IsFiniteD` to the assert switch. > > Okay, thanks for the details. The fix looks good to me. @TobiHartmann Thanks. Could you push it? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26423#issuecomment-3105315236 From jbhateja at openjdk.org Wed Jul 23 01:52:54 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 23 Jul 2025 01:52:54 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v17] In-Reply-To: References: Message-ID: > Hi All, > > This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. > > Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. > > New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Update intrinsicnode.cpp ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23947/files - new: https://git.openjdk.org/jdk/pull/23947/files/161487e6..68e24cf8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=16 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=15-16 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23947/head:pull/23947 PR: https://git.openjdk.org/jdk/pull/23947 From jbhateja at openjdk.org Wed Jul 23 02:00:05 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 23 Jul 2025 02:00:05 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v15] In-Reply-To: <6LUCFCu5hkbtz24H3-KIU5OCbaOsutU0HUakZhZEdzY=.31d897c4-fe36-4c3b-a8f7-7f4a713a0cf0@github.com> References: <6XRbYO3L5zyEIH2MuKSntl2IdAG2zgNVNm_0PakEI9g=.40266744-d0d0-421f-b5ff-17fd0117b693@github.com> <6LUCFCu5hkbtz24H3-KIU5OCbaOsutU0HUakZhZEdzY=.31d897c4-fe36-4c3b-a8f7-7f4a713a0cf0@github.com> Message-ID: On Tue, 22 Jul 2025 14:32:57 GMT, Tobias Hartmann wrote: >> I am not sure about JDK 25 approval for these changes. >> >> Can you do simple fix for JDK 25 as @merykitty suggested: "I suggest removing all the logic and simply returning the bottom type" ? Will it be the same complexity? Will it affect performance (and how much)? 
> > @vnkozlov Given that this is an old issue already affecting JDK 19, maybe we should just defer to JDK 26 for now and then backport to JDK 25u only once the fix is stable? > > Hi @TobiHartmann, Can you please re-verify the latest version of patch and approve if all tests are green. > > Sure, I'll re-run testing. How did you find the issue that you fixed in the latest commit? Hi @TobiHartmann , On a re-review I found this incompatibility between code and comments, its not a correctness issue but trying to constrain the result value range further :-) Hi @vnkozlov , On performance front, I see 500x improvement with and without patch for following micro. {C9168262-BEFD-43FE-A2DF-E935FA312A41} {60D7988B-3DBC-4C84-AC13-68E858360045} public class test_compress { public static int micro(int src, int mask) { src = Math.max(0, Math.min(5, src)); int cond = Integer.compress(src, mask); if (cond < 0) { throw new AssertionError("Unexpected control path"); } return cond; } public static void main(String [] args) { int res = 0; for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) { res = micro(i, i); } long t1 = System.currentTimeMillis(); for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) { res = micro(i, i); } long t2 = System.currentTimeMillis(); System.out.println("[time] " + (t2-t1) + " ms [res] " + res); } } ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3105371975 From aoqi at openjdk.org Wed Jul 23 02:59:03 2025 From: aoqi at openjdk.org (Ao Qi) Date: Wed, 23 Jul 2025 02:59:03 GMT Subject: RFR: 8363895: Minimal build fails with slowdebug builds after JDK-8354887 Message-ID: Configure with `--with-jvm-variants=minimal --with-debug-level=slowdebug`. Error message: ... Compiling up to 136 files for BUILD_java.compiler.interim Creating support/modules_libs/java.base/minimal/libjvm.so from 628 file(s) /opt/rh/devtoolset-11/root/usr/libexec/gcc/x86_64-redhat-linux/11/ld: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/hotspot/variant-minimal/libjvm/objs/macroAssembler_x86.o: in function `AOTCodeCache::is_on_for_dump()': /home/aoqi/work/openjdk/jdk/src/hotspot/share/code/aotCodeCache.hpp:383: undefined reference to `AOTCodeCache::_cache' collect2: error: ld returned 1 exit status gmake[3]: *** [lib/CompileJvm.gmk:175: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/support/modules_libs/java.base/minimal/libjvm.so] Error 1 gmake[2]: *** [make/Main.gmk:242: hotspot-minimal-libs] Error 1 gmake[2]: *** Waiting for unfinished jobs.... `AOTCodeCache::is_on_for_dump()` is used in `macroAssembler_x86.cpp` but not defined when cds is disabled. 
------------- Commit messages: - 8363895: Minimal build fails with slowdebug builds after JDK-8354887 Changes: https://git.openjdk.org/jdk/pull/26436/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26436&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8363895 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26436.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26436/head:pull/26436 PR: https://git.openjdk.org/jdk/pull/26436 From yadongwang at openjdk.org Wed Jul 23 03:06:53 2025 From: yadongwang at openjdk.org (Yadong Wang) Date: Wed, 23 Jul 2025 03:06:53 GMT Subject: RFR: 8362838: RISC-V: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 14:02:08 GMT, Feilong Jiang wrote: > Same as [JDK-8361892](https://bugs.openjdk.org/browse/JDK-8361892), but for riscv. > > Testing: > - [x] Tier1-3 & hotspot:tier4 on linux-riscv64 LGTM. Can you confirm the generated code in the case of non-legal addresses for byte_map_base? ------------- PR Review: https://git.openjdk.org/jdk/pull/26318#pullrequestreview-3045416106 From dzhang at openjdk.org Wed Jul 23 03:38:55 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 23 Jul 2025 03:38:55 GMT Subject: RFR: 8363898: RISC-V: TestRangeCheckHoistingScaledIV.java fails after JDK-8355293 when running without RVV Message-ID: Hi all, Please take a look and review this PR, thanks! test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java fails without RVV after [JDK-8355293](https://bugs.openjdk.org/browse/JDK-8355293) in fastdebug mode. VectorAPI needs vector intrinsic in this case, so RVV needs to be enabled on RISC-V. ### Test (fastdebug) - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on k1 and sg2042 - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on qemu-system w/ and w/o RVV ------------- Commit messages: - 8363898: RISC-V: TestRangeCheckHoistingScaledIV.java fails after JDK-8355293 when running without RVV Changes: https://git.openjdk.org/jdk/pull/26437/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26437&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8363898 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26437.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26437/head:pull/26437 PR: https://git.openjdk.org/jdk/pull/26437 From fyang at openjdk.org Wed Jul 23 03:43:57 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 23 Jul 2025 03:43:57 GMT Subject: RFR: 8363898: RISC-V: TestRangeCheckHoistingScaledIV.java fails after JDK-8355293 when running without RVV In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 03:32:26 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java fails without RVV after [JDK-8355293](https://bugs.openjdk.org/browse/JDK-8355293) in fastdebug mode. > > VectorAPI needs vector intrinsic in this case, so RVV needs to be enabled on RISC-V. > > ### Test (fastdebug) > - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on k1 and sg2042 > - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on qemu-system w/ and w/o RVV Looks fine. Thanks for fixing this test. ------------- Marked as reviewed by fyang (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26437#pullrequestreview-3045475368 From dzhang at openjdk.org Wed Jul 23 03:51:58 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 23 Jul 2025 03:51:58 GMT Subject: RFR: 8363898: RISC-V: TestRangeCheckHoistingScaledIV.java fails after JDK-8355293 when running without RVV In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 03:32:26 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java fails without RVV after [JDK-8355293](https://bugs.openjdk.org/browse/JDK-8355293) in fastdebug mode. > > In [JDK-8291669](https://bugs.openjdk.org/browse/JDK-8291669), which introduced this case, it is mentioned: >>Previously attached jtreg case fails on ppc64 because VectorAPI has no >>vector intrinsics on ppc64 so there's no long range check to hoist. In >>this patch, we limit the test architecture to x64 and AArch64. > > So we need RVV to use vector intrinsics on RISC-V. > > ### Test (fastdebug) > - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on k1 and sg2042 > - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on qemu-system w/ and w/o RVV Hi @Hamlin-Li , could you help to review this patch? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26437#issuecomment-3105570798 From fjiang at openjdk.org Wed Jul 23 04:19:54 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Wed, 23 Jul 2025 04:19:54 GMT Subject: RFR: 8362596: RISC-V: Improve _vectorizedHashCode intrinsic In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 08:07:48 GMT, Yuri Gaevsky wrote: > This is a micro-optimization for RISC-V SpacemiT K1 CPU to fix [encountered performance regression](https://github.com/openjdk/jdk/pull/17413#issuecomment-3082664335). Thanks for finding this! ------------- Marked as reviewed by fjiang (Committer). PR Review: https://git.openjdk.org/jdk/pull/26409#pullrequestreview-3045529365 From haosun at openjdk.org Wed Jul 23 04:45:07 2025 From: haosun at openjdk.org (Hao Sun) Date: Wed, 23 Jul 2025 04:45:07 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v16] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 11:09:04 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. 
>> Performance numbers with this patch on a 128-bit, SVE2 supporting machine are shown below -
>>
>> Benchmark                                    (size)  Mode  Cnt   Gain
>> SelectFromBenchmark.selectFromByteVector       1024  thrpt    9   1.43
>> SelectFromBenchmark.selectFromByteVector       2048  thrpt    9   1.48
>> SelectFromBenchmark.selectFromDoubleVector     1024  thrpt    9  68.55
>> SelectFromBenchmark.selectFromDoubleVector     2048  thrpt    9  72.07
>> SelectFromBenchmark.selectFromFloatVector      1024  thrpt    9   1.69
>> SelectFromBenchmark.selectFromFloatVector      2048  thrpt    9   1.52
>> SelectFromBenchmark.selectFromIntVector        1024  thrpt    9   1.50
>> SelectFromBenchmark.selectFromIntVector        2048  thrpt    9   1.52
>> SelectFromBenchmark.selectFromLongVector       1024  thrpt    9  85.38
>> SelectFromBenchmark.selectFromLongVector       2048  thrpt    9  80.93
>> SelectFromBenchmark.selectFromShortVector      1024  thrpt    9   1.48
>> SelectFromBenchmark.selectFromShortVector      2048  thrpt    9   1.49
>>
>> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander.
>
> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision:
>
>   Refine comments in c2_MacroAssembler_aarch64.cpp

Marked as reviewed by haosun (Committer).

------------- PR Review: https://git.openjdk.org/jdk/pull/23570#pullrequestreview-3045577632

From kvn at openjdk.org Wed Jul 23 05:39:58 2025
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Wed, 23 Jul 2025 05:39:58 GMT
Subject: RFR: 8363895: Minimal build fails with slowdebug builds after JDK-8354887
In-Reply-To:
References:
Message-ID:

On Wed, 23 Jul 2025 02:16:46 GMT, Ao Qi wrote:

> Configure with `--with-jvm-variants=minimal --with-debug-level=slowdebug`.
>
> Error message:
>
> ...
> Compiling up to 136 files for BUILD_java.compiler.interim
> Creating support/modules_libs/java.base/minimal/libjvm.so from 628 file(s)
> /opt/rh/devtoolset-11/root/usr/libexec/gcc/x86_64-redhat-linux/11/ld: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/hotspot/variant-minimal/libjvm/objs/macroAssembler_x86.o: in function `AOTCodeCache::is_on_for_dump()':
> /home/aoqi/work/openjdk/jdk/src/hotspot/share/code/aotCodeCache.hpp:383: undefined reference to `AOTCodeCache::_cache'
> collect2: error: ld returned 1 exit status
> gmake[3]: *** [lib/CompileJvm.gmk:175: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/support/modules_libs/java.base/minimal/libjvm.so] Error 1
> gmake[2]: *** [make/Main.gmk:242: hotspot-minimal-libs] Error 1
> gmake[2]: *** Waiting for unfinished jobs....
>
> `AOTCodeCache::is_on_for_dump()` is used in `macroAssembler_x86.cpp` but not defined when cds is disabled.

src/hotspot/share/code/aotCodeCache.hpp line 382:

> 380: static void close() NOT_CDS_RETURN;
> 381: static bool is_on() CDS_ONLY({ return cache() != nullptr && !_cache->closing(); }) NOT_CDS_RETURN_(false);
> 382: static bool is_on_for_use() { return is_on() && _cache->for_use(); }

This one also should use macros.

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26436#discussion_r2224410023

From fjiang at openjdk.org Wed Jul 23 05:59:36 2025
From: fjiang at openjdk.org (Feilong Jiang)
Date: Wed, 23 Jul 2025 05:59:36 GMT
Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v7]
In-Reply-To:
References:
Message-ID: <2-0RIz4ucTeGZ90La3716py33X4u8-Vj-4-WqjC_jck=.afe9207b-7093-4a72-82c8-924abcb1054a@github.com>

> Hi, please consider.
> [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. > The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. > If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. > This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. > We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. > > This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. > The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. > > Test on linux-riscv64: > - [x] Tier1-3 > > JMH data on P550 SBC for reference (w/o and w/ the patch): > > Before: > > Without COH: > > Benchmark (size) Mode Cnt Score Error Units > ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op > ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op > ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op > ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op > ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op > ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op > ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op > ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op > ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op > ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op > ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op > ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op > ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op > ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op > ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op > ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op > > ------------------------------------------------------------------------- > With COH: > > Benchmark (size) Mode Cnt Score Error Units > ArrayClone.byteArraycopy 0 avgt 15 50.667 ? 0.622 ns/op > Arra... Feilong Jiang has updated the pull request incrementally with one additional commit since the last revision: fix build ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25976/files - new: https://git.openjdk.org/jdk/pull/25976/files/cc2a329f..657f92f9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25976&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25976&range=05-06 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/25976.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25976/head:pull/25976 PR: https://git.openjdk.org/jdk/pull/25976 From fjiang at openjdk.org Wed Jul 23 06:06:54 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Wed, 23 Jul 2025 06:06:54 GMT Subject: RFR: 8362838: RISC-V: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 03:04:38 GMT, Yadong Wang wrote: >> Same as [JDK-8361892](https://bugs.openjdk.org/browse/JDK-8361892), but for riscv. >> >> Testing: >> - [x] Tier1-3 & hotspot:tier4 on linux-riscv64 > > LGTM. Can you confirm the generated code in the case of non-legal addresses for byte_map_base? @yadongw Thanks for the review! > LGTM. 
Can you confirm the generated code in the case of non-legal addresses for byte_map_base? Yes. I added some code to allocate the JNI handler around the byte_map_base (https://github.com/feilongjiang/jdk/commit/4c9b39d5657b5dc3fb52d78c73797da8348aeca2). We can reproduce the crash on RISC-V, and the crash is gone after applying the fix. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26318#issuecomment-3105877365 From fyang at openjdk.org Wed Jul 23 06:29:56 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 23 Jul 2025 06:29:56 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v7] In-Reply-To: <2-0RIz4ucTeGZ90La3716py33X4u8-Vj-4-WqjC_jck=.afe9207b-7093-4a72-82c8-924abcb1054a@github.com> References: <2-0RIz4ucTeGZ90La3716py33X4u8-Vj-4-WqjC_jck=.afe9207b-7093-4a72-82c8-924abcb1054a@github.com> Message-ID: On Wed, 23 Jul 2025 05:59:36 GMT, Feilong Jiang wrote: >> Hi, please consider. >> [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. >> The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. >> If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. >> This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. >> We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. >> >> This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. >> The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. >> >> Test on linux-riscv64: >> - [x] Tier1-3 >> >> JMH data on P550 SBC for reference (w/o and w/ the patch): >> >> Before: >> >> Without COH: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op >> ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op >> ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op >> ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op >> ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op >> ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op >> ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op >> ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op >> ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op >> ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op >> ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op >> ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op >> >> ------------------------------------------------------------------------- >> With COH: >> >> Benchmark (size) Mode Cnt Score Error Un... > > Feilong Jiang has updated the pull request incrementally with one additional commit since the last revision: > > fix build Marked as reviewed by fyang (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/25976#pullrequestreview-3045849105 From aoqi at openjdk.org Wed Jul 23 06:33:39 2025 From: aoqi at openjdk.org (Ao Qi) Date: Wed, 23 Jul 2025 06:33:39 GMT Subject: RFR: 8363895: Minimal build fails with slowdebug builds after JDK-8354887 [v2] In-Reply-To: References: Message-ID: > Configure with `--with-jvm-variants=minimal --with-debug-level=slowdebug`. > > Error message: > > ... > Compiling up to 136 files for BUILD_java.compiler.interim > Creating support/modules_libs/java.base/minimal/libjvm.so from 628 file(s) > /opt/rh/devtoolset-11/root/usr/libexec/gcc/x86_64-redhat-linux/11/ld: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/hotspot/variant-minimal/libjvm/objs/macroAssembler_x86.o: in function `AOTCodeCache::is_on_for_dump()': > /home/aoqi/work/openjdk/jdk/src/hotspot/share/code/aotCodeCache.hpp:383: undefined reference to `AOTCodeCache::_cache' > collect2: error: ld returned 1 exit status > gmake[3]: *** [lib/CompileJvm.gmk:175: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/support/modules_libs/java.base/minimal/libjvm.so] Error 1 > gmake[2]: *** [make/Main.gmk:242: hotspot-minimal-libs] Error 1 > gmake[2]: *** Waiting for unfinished jobs.... > > > `AOTCodeCache::is_on_for_dump()` is used in `macroAssembler_x86.cpp` but not defined when cds is disabled. Ao Qi has updated the pull request incrementally with one additional commit since the last revision: missing macros for is_on_for_use() ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26436/files - new: https://git.openjdk.org/jdk/pull/26436/files/03c4269b..dc329c36 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26436&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26436&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26436.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26436/head:pull/26436 PR: https://git.openjdk.org/jdk/pull/26436 From aoqi at openjdk.org Wed Jul 23 06:33:39 2025 From: aoqi at openjdk.org (Ao Qi) Date: Wed, 23 Jul 2025 06:33:39 GMT Subject: RFR: 8363895: Minimal build fails with slowdebug builds after JDK-8354887 [v2] In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 05:37:22 GMT, Vladimir Kozlov wrote: >> Ao Qi has updated the pull request incrementally with one additional commit since the last revision: >> >> missing macros for is_on_for_use() > > src/hotspot/share/code/aotCodeCache.hpp line 382: > >> 380: static void close() NOT_CDS_RETURN; >> 381: static bool is_on() CDS_ONLY({ return cache() != nullptr && !_cache->closing(); }) NOT_CDS_RETURN_(false); >> 382: static bool is_on_for_use() { return is_on() && _cache->for_use(); } > > This one also should use macros. Done. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26436#discussion_r2224540114 From thartmann at openjdk.org Wed Jul 23 06:39:59 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 23 Jul 2025 06:39:59 GMT Subject: RFR: 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD In-Reply-To: <4sAK_7XL1elXYXPKPbqmZpwmrQ5DfXanPaTkSBtfEw4=.d49d3155-fc14-4597-b3ae-c38f32191098@github.com> References: <15Vd5Kn403cC3nHdk8UWhTwAKoJZQ00eT-VKkZn0iL8=.ae0e3372-8f2d-40f3-96a9-1b251260386a@github.com> <4sAK_7XL1elXYXPKPbqmZpwmrQ5DfXanPaTkSBtfEw4=.d49d3155-fc14-4597-b3ae-c38f32191098@github.com> Message-ID: On Wed, 23 Jul 2025 01:22:23 GMT, Wang Haomin wrote: >> Okay, thanks for the details. 
The fix looks good to me. > > @TobiHartmann Thanks. Could you push it? @haominw You need a second review (maybe @jaskarth?) and then you first need to do `/integrate` before a committer can sponsor your change (see 11. of https://openjdk.org/guide/#life-of-a-pr). ------------- PR Comment: https://git.openjdk.org/jdk/pull/26423#issuecomment-3106039682 From thartmann at openjdk.org Wed Jul 23 06:53:03 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 23 Jul 2025 06:53:03 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v17] In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 01:52:54 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. >> >> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. >> >> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Update intrinsicnode.cpp Still looks good to me. I'll re-run testing with the latest changes. @jatin-bhateja could you please adjust your "Fix request" comment in JBS to be a "Defer request", see https://openjdk.org/jeps/3#Bug-Deferral-Process ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23947#pullrequestreview-3045910095 From wanghaomin at openjdk.org Wed Jul 23 06:56:59 2025 From: wanghaomin at openjdk.org (Wang Haomin) Date: Wed, 23 Jul 2025 06:56:59 GMT Subject: RFR: 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD In-Reply-To: References: Message-ID: On Tue, 22 Jul 2025 07:21:15 GMT, Wang Haomin wrote: > Same as https://bugs.openjdk.org/browse/JDK-8362171 , so I've added `IsFiniteF`, `IsFiniteD` to the assert switch. @jaskarth Could you review this change? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26423#issuecomment-3106096188 From mli at openjdk.org Wed Jul 23 07:33:53 2025 From: mli at openjdk.org (Hamlin Li) Date: Wed, 23 Jul 2025 07:33:53 GMT Subject: RFR: 8363898: RISC-V: TestRangeCheckHoistingScaledIV.java fails after JDK-8355293 when running without RVV In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 03:32:26 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java fails without RVV after [JDK-8355293](https://bugs.openjdk.org/browse/JDK-8355293) in fastdebug mode. > > In [JDK-8291669](https://bugs.openjdk.org/browse/JDK-8291669), which introduced this case, it is mentioned: >>Previously attached jtreg case fails on ppc64 because VectorAPI has no >>vector intrinsics on ppc64 so there's no long range check to hoist. In >>this patch, we limit the test architecture to x64 and AArch64. > > So we need RVV to use vector intrinsics on RISC-V. 
>
> ### Test (fastdebug)
> - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on k1 and sg2042
> - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on qemu-system w/ and w/o RVV

Thank you for fixing this! Looks good.

------------- Marked as reviewed by mli (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/26437#pullrequestreview-3046043263

From mli at openjdk.org Wed Jul 23 07:51:04 2025
From: mli at openjdk.org (Hamlin Li)
Date: Wed, 23 Jul 2025 07:51:04 GMT
Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v3]
In-Reply-To:
References:
Message-ID:

On Tue, 22 Jul 2025 15:14:22 GMT, Fei Yang wrote:

>> I don't think this pr changes the logic in `NativeFarCall::reloc_destination`, am I right?
>> Or maybe you're misled by the name change from `stub_address` to `reloc_destination_without_check` and existing method `reloc_destination()`?
>
>> > I see `MacroAssembler::pd_call_destination` delegates work to `NativeFarCall::reloc_destination` which calls `MacroAssembler::target_addr_for_insn` under condition `if (stub_addr != nullptr)`.
>>
>> Can you clarify "`MacroAssembler::pd_call_destination` delegates work to `NativeFarCall::reloc_destination`"? Robbin is on vacation for weeks, so I'm afraid he's not going to respond in time.
>
> Sorry for not being clear. Let me clarify. I mean this code snippet at [1]:
>
> 74 address Relocation::pd_call_destination(address orig_addr) {
> 75   assert(is_call(), "should be a call here");
> 76   if (NativeCall::is_at(addr())) {
> 77     return nativeCall_at(addr())->reloc_destination(); <======================
> 78   }
> 79
> 80   if (orig_addr != nullptr) {
>
> Before this change, we call `MacroAssembler::pd_call_destination` here.
> And `NativeFarCall::reloc_destination` at L77 will only call `MacroAssembler::target_addr_for_insn` under condition `if (stub_addr != nullptr)` [2]. But we will always/unconditionally call `MacroAssembler::target_addr_for_insn` here after this change. That seems to make a difference? Did I miss anything?
>
> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/relocInfo_riscv.cpp#L77
> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/nativeInst_riscv.cpp#L112

The call path was:
`Relocation::pd_call_destination(address orig_addr)` => `NativeCall::reloc_destination()` => `NativeFarCall::reloc_destination()`

After the change, it is:
`Relocation::pd_call_destination(address orig_addr)` => `NativeCall::reloc_destination()` => `RelocCall::reloc_destination()`

The code path does not change except that `NativeFarCall::reloc_destination` is changed to `RelocCall::reloc_destination` and `assert(NativeFarCall...` is changed to `assert(RelocCall`. It seems it does not change any logic in this code path, just the names.

> Before this change, we call MacroAssembler::pd_call_destination here. And NativeFarCall::reloc_destination at L77 will only call MacroAssembler::target_addr_for_insn under condition if (stub_addr != nullptr)

The code between `NativeFarCall::reloc_destination` and `RelocCall::reloc_destination` is almost the same, except for the names.

> But we will always/unconditionally call MacroAssembler::target_addr_for_insn here after this change.

No. I guess you might have mistaken `RelocCall::reloc_destination_without_check()` for `RelocCall::reloc_destination()`?
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2224710718 From syan at openjdk.org Wed Jul 23 07:54:53 2025 From: syan at openjdk.org (SendaoYan) Date: Wed, 23 Jul 2025 07:54:53 GMT Subject: RFR: 8363898: RISC-V: TestRangeCheckHoistingScaledIV.java fails after JDK-8355293 when running without RVV In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 03:32:26 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java fails without RVV after [JDK-8355293](https://bugs.openjdk.org/browse/JDK-8355293) in fastdebug mode. > > In [JDK-8291669](https://bugs.openjdk.org/browse/JDK-8291669), which introduced this case, it is mentioned: >>Previously attached jtreg case fails on ppc64 because VectorAPI has no >>vector intrinsics on ppc64 so there's no long range check to hoist. In >>this patch, we limit the test architecture to x64 and AArch64. > > So we need RVV to use vector intrinsics on RISC-V. > > ### Test (fastdebug) > - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on k1 and sg2042 > - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on qemu-system w/ and w/o RVV Marked as reviewed by syan (Committer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26437#pullrequestreview-3046116433 From duke at openjdk.org Wed Jul 23 08:03:05 2025 From: duke at openjdk.org (duke) Date: Wed, 23 Jul 2025 08:03:05 GMT Subject: RFR: 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD In-Reply-To: References: Message-ID: <8lElzEMSLMK6teT4daIRdLHm4Ljemg6effWnCABEL9E=.2e39871d-6f15-435e-a3b8-9c19e8be4aec@github.com> On Tue, 22 Jul 2025 07:21:15 GMT, Wang Haomin wrote: > Same as https://bugs.openjdk.org/browse/JDK-8362171 , so I've added `IsFiniteF`, `IsFiniteD` to the assert switch. @haominw Your change (at version 19bb3fd0443e651c7abde8042cb4f8e1ede441da) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26423#issuecomment-3106370977 From jkarthikeyan at openjdk.org Wed Jul 23 08:03:05 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 23 Jul 2025 08:03:05 GMT Subject: RFR: 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD In-Reply-To: References: Message-ID: On Tue, 22 Jul 2025 07:21:15 GMT, Wang Haomin wrote: > Same as https://bugs.openjdk.org/browse/JDK-8362171 , so I've added `IsFiniteF`, `IsFiniteD` to the assert switch. Thanks for the ping, this looks good to me! Thanks for adding the nodes to the list. ------------- Marked as reviewed by jkarthikeyan (Committer). PR Review: https://git.openjdk.org/jdk/pull/26423#pullrequestreview-3046142465 From wanghaomin at openjdk.org Wed Jul 23 08:05:54 2025 From: wanghaomin at openjdk.org (Wang Haomin) Date: Wed, 23 Jul 2025 08:05:54 GMT Subject: RFR: 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD In-Reply-To: References: Message-ID: <4e2jSf2RNwx3P0Tp8jMBucTKWfBQDV3dNOBsKrmFYD8=.fab5ec70-f0da-4a64-b47e-f7236cf5f71b@github.com> On Wed, 23 Jul 2025 07:58:04 GMT, Jasmine Karthikeyan wrote: >> Same as https://bugs.openjdk.org/browse/JDK-8362171 , so I've added `IsFiniteF`, `IsFiniteD` to the assert switch. > > Thanks for the ping, this looks good to me! Thanks for adding the nodes to the list. @jaskarth Thanks. Could you push it? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26423#issuecomment-3106383788 From wanghaomin at openjdk.org Wed Jul 23 08:11:00 2025 From: wanghaomin at openjdk.org (Wang Haomin) Date: Wed, 23 Jul 2025 08:11:00 GMT Subject: Integrated: 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD In-Reply-To: References: Message-ID: On Tue, 22 Jul 2025 07:21:15 GMT, Wang Haomin wrote: > Same as https://bugs.openjdk.org/browse/JDK-8362171 , so I've added `IsFiniteF`, `IsFiniteD` to the assert switch. This pull request has now been integrated. Changeset: 9f796da3 Author: Wang Haomin Committer: Jasmine Karthikeyan URL: https://git.openjdk.org/jdk/commit/9f796da3774b2e2f92dca178fdccd93989919256 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD Reviewed-by: thartmann, jkarthikeyan ------------- PR: https://git.openjdk.org/jdk/pull/26423 From galder at openjdk.org Wed Jul 23 08:11:57 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Wed, 23 Jul 2025 08:11:57 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v7] In-Reply-To: <2-0RIz4ucTeGZ90La3716py33X4u8-Vj-4-WqjC_jck=.afe9207b-7093-4a72-82c8-924abcb1054a@github.com> References: <2-0RIz4ucTeGZ90La3716py33X4u8-Vj-4-WqjC_jck=.afe9207b-7093-4a72-82c8-924abcb1054a@github.com> Message-ID: On Wed, 23 Jul 2025 05:59:36 GMT, Feilong Jiang wrote: >> Hi, please consider. >> [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. >> The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. >> If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. >> This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. >> We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. >> >> This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. >> The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. >> >> Test on linux-riscv64: >> - [x] Tier1-3 >> >> JMH data on P550 SBC for reference (w/o and w/ the patch): >> >> Before: >> >> Without COH: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op >> ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op >> ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op >> ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op >> ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op >> ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op >> ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op >> ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op >> ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op >> ArrayClone.intClone 10 avgt 15 183.619 ? 
0.588 ns/op >> ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op >> ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op >> >> ------------------------------------------------------------------------- >> With COH: >> >> Benchmark (size) Mode Cnt Score Error Un... > > Feilong Jiang has updated the pull request incrementally with one additional commit since the last revision: > > fix build Looks good now, thanks for the fix @feilongjiang! ------------- Marked as reviewed by galder (Author). PR Review: https://git.openjdk.org/jdk/pull/25976#pullrequestreview-3046195250 From chagedorn at openjdk.org Wed Jul 23 08:47:55 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 23 Jul 2025 08:47:55 GMT Subject: RFR: 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 12:25:33 GMT, Beno?t Maillard wrote: > This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist` (again). This is similar to [JDK-8361700](https://bugs.openjdk.org/browse/JDK-8361700?filter=-1). > > The optimization in question is the removal of redundant `ConvX2Y->ConvY2X->ConvX2Y` sequences (where `X` and `Y` are primitive number types), which get replaced by a single `ConvX2Y` as an identity optimization. This missing optimization was originally reported only for `ConvD2LNode`, but it turns out that other conversion nodes have analog optimization patterns. After manual inspection of identity optimizations in `convertnode.cpp`, I was able to reproduce missing optimizations for the following conversion sequences: > - `ConvD2L->ConvL2D->ConvD2L` > - `ConvF2I->ConvI2F->ConvF2I` > - `ConvF2L->ConvL2F->ConvF2L` > - `ConvI2F->ConvF2I->ConvI2F` > > Similar optimization patterns exist for additional conversion nodes. However, it is not clear if these nodes are subject to the same missed optimization issue. Further investigation may be needed, as I was unable to reproduce such cases with simple tests. > > This is again a case where an optimization depends on the input of its input. Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8359603) > - [x] tier1-3, plus some internal testing > - [x] Added test from the fuzzer, and tests for other sequences (manually derived from the original one) > > Thank you for reviewing! Two minor comments, otherwise, it looks good to me! test/hotspot/jtreg/compiler/c2/TestEliminateRedundantConversionSequences.java line 30: > 28: * simplified to a single ConvX2Y operation when applicable > 29: * VerifyIterativeGVN checks that this optimization was applied > 30: * @run main/othervm -XX:+IgnoreUnrecognizedVMOptions -XX:CompileCommand=quiet I think you can remove `quiet`: Suggestion: * @run main/othervm -XX:+IgnoreUnrecognizedVMOptions test/hotspot/jtreg/compiler/c2/TestEliminateRedundantConversionSequences.java line 94: > 92: > 93: public static void main(String[] strArr) { > 94: for (int i = 0; i < 50_000; ++i) { Do you really need 50000 iterations each? 
Would less also trigger the bug? ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26368#pullrequestreview-3046338959 PR Review Comment: https://git.openjdk.org/jdk/pull/26368#discussion_r2224845814 PR Review Comment: https://git.openjdk.org/jdk/pull/26368#discussion_r2224849384 From chagedorn at openjdk.org Wed Jul 23 08:51:55 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 23 Jul 2025 08:51:55 GMT Subject: RFR: 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 12:25:33 GMT, Beno?t Maillard wrote: > This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist` (again). This is similar to [JDK-8361700](https://bugs.openjdk.org/browse/JDK-8361700?filter=-1). > > The optimization in question is the removal of redundant `ConvX2Y->ConvY2X->ConvX2Y` sequences (where `X` and `Y` are primitive number types), which get replaced by a single `ConvX2Y` as an identity optimization. This missing optimization was originally reported only for `ConvD2LNode`, but it turns out that other conversion nodes have analog optimization patterns. After manual inspection of identity optimizations in `convertnode.cpp`, I was able to reproduce missing optimizations for the following conversion sequences: > - `ConvD2L->ConvL2D->ConvD2L` > - `ConvF2I->ConvI2F->ConvF2I` > - `ConvF2L->ConvL2F->ConvF2L` > - `ConvI2F->ConvF2I->ConvI2F` > > Similar optimization patterns exist for additional conversion nodes. However, it is not clear if these nodes are subject to the same missed optimization issue. Further investigation may be needed, as I was unable to reproduce such cases with simple tests. > > This is again a case where an optimization depends on the input of its input. Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8359603) > - [x] tier1-3, plus some internal testing > - [x] Added test from the fuzzer, and tests for other sequences (manually derived from the original one) > > Thank you for reviewing! src/hotspot/share/opto/phaseX.cpp line 2565: > 2563: // ConvF2I->ConvI2F->ConvF2I > 2564: // ConvF2L->ConvL2F->ConvF2L > 2565: // ConvI2F->ConvF2I->ConvI2F Another thought: Since this is an incomplete list of variations (especially missing, for example, the I2D version while the I2F version is here), should we leave a comment about not being able to trigger issues with the other versions? Otherwise, it could suggest that it was just forgotten. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26368#discussion_r2224869135 From qxing at openjdk.org Wed Jul 23 09:31:18 2025 From: qxing at openjdk.org (Qizheng Xing) Date: Wed, 23 Jul 2025 09:31:18 GMT Subject: RFR: 8360192: C2: Make the type of count leading/trailing zero nodes more precise [v5] In-Reply-To: References: Message-ID: > The result of count leading/trailing zeros is always non-negative, and the maximum value is integer type's size in bits. 
In previous versions, when C2 can not know the operand value of a CLZ/CTZ node at compile time, it will generate a full-width integer type for its result. This can significantly affect the efficiency of code in some cases. > > This patch makes the type of CLZ/CTZ nodes more precise, to make C2 generate better code. For example, the following implementation runs ~115% faster on x86-64 with this patch: > > > public static int numberOfNibbles(int i) { > int mag = Integer.SIZE - Integer.numberOfLeadingZeros(i); > return Math.max((mag + 3) / 4, 1); > } > > > Testing: tier1, IR test Qizheng Xing has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: - Merge branch 'master' into enhance-clz-type - Move `TestCountBitsRange` to `compiler.c2.gvn` - Fix null checks - Narrow type bound - Use `BitsPerX` constant instead of `sizeof` - Make the type of count leading/trailing zero nodes more precise ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25928/files - new: https://git.openjdk.org/jdk/pull/25928/files/c965311b..2f9bca68 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25928&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25928&range=03-04 Stats: 51740 lines in 1709 files changed: 29830 ins; 11769 del; 10141 mod Patch: https://git.openjdk.org/jdk/pull/25928.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25928/head:pull/25928 PR: https://git.openjdk.org/jdk/pull/25928 From fjiang at openjdk.org Wed Jul 23 09:38:02 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Wed, 23 Jul 2025 09:38:02 GMT Subject: Integrated: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 In-Reply-To: References: Message-ID: On Wed, 25 Jun 2025 12:40:06 GMT, Feilong Jiang wrote: > Hi, please consider. > [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. > The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. > If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. > This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. > We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. > > This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. > The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. > > Test on linux-riscv64: > - [x] Tier1-3 > > JMH data on P550 SBC for reference (w/o and w/ the patch): > > Before: > > Without COH: > > Benchmark (size) Mode Cnt Score Error Units > ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op > ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op > ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op > ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op > ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op > ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op > ArrayClone.byteClone 100 avgt 15 251.942 ? 
0.122 ns/op > ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op > ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op > ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op > ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op > ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op > ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op > ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op > ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op > ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op > > ------------------------------------------------------------------------- > With COH: > > Benchmark (size) Mode Cnt Score Error Units > ArrayClone.byteArraycopy 0 avgt 15 50.667 ? 0.622 ns/op > Arra... This pull request has now been integrated. Changeset: e6ac956a Author: Feilong Jiang URL: https://git.openjdk.org/jdk/commit/e6ac956a7ac613b916c0dbfda7e57856c1b8a83c Stats: 5 lines in 3 files changed: 3 ins; 0 del; 2 mod 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 Reviewed-by: fyang, galder, dlong ------------- PR: https://git.openjdk.org/jdk/pull/25976 From fjiang at openjdk.org Wed Jul 23 09:38:01 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Wed, 23 Jul 2025 09:38:01 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v6] In-Reply-To: References: Message-ID: On Tue, 22 Jul 2025 19:57:01 GMT, Dean Long wrote: >> Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: >> >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - Add get_initial_copy_flags >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - also keep overlapping flag >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - Revert RISCV Macro modification >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses >> - riscv: fix c1 primitive array clone intrinsic regression > > Marked as reviewed by dlong (Reviewer). @dean-long @RealFYang @galderz -- Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25976#issuecomment-3106722711 From mchevalier at openjdk.org Wed Jul 23 10:12:08 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Wed, 23 Jul 2025 10:12:08 GMT Subject: RFR: 8363357: Remove unused flag VerifyAdapterCalls Message-ID: It seems that the flag VerifyAdapterCalls is unused since [JDK-8350209](https://bugs.openjdk.org/browse/JDK-8350209), so pretty recently. Let's remove it, very direct, no trick. 
------------- Commit messages: - Remove VerifyAdapterCalls Changes: https://git.openjdk.org/jdk/pull/26440/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26440&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8363357 Stats: 3 lines in 1 file changed: 0 ins; 3 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26440.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26440/head:pull/26440 PR: https://git.openjdk.org/jdk/pull/26440 From snatarajan at openjdk.org Wed Jul 23 11:05:56 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Wed, 23 Jul 2025 11:05:56 GMT Subject: RFR: 8358696: Assert with extreme values for -XX:BciProfileWidth [v3] In-Reply-To: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> References: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> Message-ID: > **Issue** > Extreme values for BciProfileWidth flag such as `java -XX:BciProfileWidth=-1 -version` and `java -XX:BciProfileWidth=100000 -version `results in assert failure `assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. This is observed in a x86 machine. > > **Analysis** > On debugging the issue, I found that increasing the size of the interpreter using the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` prevented the above mentioned assert from failing for large values of BciProfileWidth. > > **Proposal** > Considering the fact that larger BciProfileWidth results in slower profiling, I have proposed a range between 0 to 5000 to restrict the value for BciProfileWidth for x86 machines. This maximum value is based on modifying the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` using the smallest `InterpreterCodeSize` for all the architectures. As for the lower bound, a value of -1 would be the same as 0, as this simply means no return bci's will be recorded in ret profile. > > **Issue in AArch64** > Additionally running the command `java -XX:BciProfileWidth= 10000 -version` (or larger values) results in a different failure `assert(offset_ok_for_immed(offset(), size)) failed: must be, was: 32768, 3` on an AArch64 machine.This is an issue of maximum offset for `ldr/str` in AArch64 which can be fixed using `form_address` as mentioned in [JDK-8342736](https://bugs.openjdk.org/browse/JDK-8342736). In my preliminary fix using `form_address` on AArch64 machine. I had to modify 3 `ldr` and 1 `str` instruction (in file `src/hotspot/cpu/aarch64/interp_masm_aarch64.cpp` at line number 926, 983, and 997). With this fix using `form_address`, `BciProfileWidth` works for maximum of 5000 after which it crashes with`assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. Without this fix `BciProfileWidth` works for a maximum value of 1300. Currently, I have suggested to restrict the upper bound on AArch64 to 1000 instead of fixing it with `form_address`. > > **Question to reviewers** > Do you think this is a reasonable fix ? For AArch64 do you suggest fixing using `form_address` ? If yes, do I fix it under this PR or create another one ? 
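For illustration, a range-constrained declaration along the lines of this proposal could look roughly like the sketch below. The range(0, 5000) values come from the proposal text itself, but the flag kind, default value, file, and description string are assumptions here rather than the actual patch; HotSpot's flag tables do support a range() clause that is validated at VM startup.

  // Sketch only: a range() constraint makes the VM reject out-of-range values at
  // startup instead of letting an extreme BciProfileWidth overflow the
  // interpreter's code buffer later.
  develop(int, BciProfileWidth, 2,                                          \
          "Number of return bci's to record in ret profile")                \
          range(0, 5000)                                                    \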
Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: fixing test failures due to intx -> int of BciProfileWidth ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26139/files - new: https://git.openjdk.org/jdk/pull/26139/files/a32b6ead..c3a85cd4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26139&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26139&range=01-02 Stats: 6 lines in 5 files changed: 0 ins; 2 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/26139.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26139/head:pull/26139 PR: https://git.openjdk.org/jdk/pull/26139 From yadongwang at openjdk.org Wed Jul 23 11:07:54 2025 From: yadongwang at openjdk.org (Yadong Wang) Date: Wed, 23 Jul 2025 11:07:54 GMT Subject: RFR: 8362838: RISC-V: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 14:02:08 GMT, Feilong Jiang wrote: > Same as [JDK-8361892](https://bugs.openjdk.org/browse/JDK-8361892), but for riscv. > > Testing: > - [x] Tier1-3 & hotspot:tier4 on linux-riscv64 Marked as reviewed by yadongwang (Committer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26318#pullrequestreview-3046818867 From chagedorn at openjdk.org Wed Jul 23 12:07:53 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 23 Jul 2025 12:07:53 GMT Subject: RFR: 8363357: Remove unused flag VerifyAdapterCalls In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 07:46:58 GMT, Marc Chevalier wrote: > It seems that the flag VerifyAdapterCalls is unused since [JDK-8350209](https://bugs.openjdk.org/browse/JDK-8350209), so pretty recently. > > Let's remove it, very direct, no trick. Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26440#pullrequestreview-3047071545 From myankelevich at openjdk.org Wed Jul 23 13:01:58 2025 From: myankelevich at openjdk.org (Mikhail Yankelevich) Date: Wed, 23 Jul 2025 13:01:58 GMT Subject: RFR: 8358696: Assert with extreme values for -XX:BciProfileWidth [v3] In-Reply-To: References: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> Message-ID: On Wed, 23 Jul 2025 11:05:56 GMT, Saranya Natarajan wrote: >> **Issue** >> Extreme values for BciProfileWidth flag such as `java -XX:BciProfileWidth=-1 -version` and `java -XX:BciProfileWidth=100000 -version `results in assert failure `assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. This is observed in a x86 machine. >> >> **Analysis** >> On debugging the issue, I found that increasing the size of the interpreter using the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` prevented the above mentioned assert from failing for large values of BciProfileWidth. >> >> **Proposal** >> Considering the fact that larger BciProfileWidth results in slower profiling, I have proposed a range between 0 to 5000 to restrict the value for BciProfileWidth for x86 machines. This maximum value is based on modifying the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` using the smallest `InterpreterCodeSize` for all the architectures. As for the lower bound, a value of -1 would be the same as 0, as this simply means no return bci's will be recorded in ret profile. 
>> >> **Issue in AArch64** >> Additionally running the command `java -XX:BciProfileWidth= 10000 -version` (or larger values) results in a different failure `assert(offset_ok_for_immed(offset(), size)) failed: must be, was: 32768, 3` on an AArch64 machine.This is an issue of maximum offset for `ldr/str` in AArch64 which can be fixed using `form_address` as mentioned in [JDK-8342736](https://bugs.openjdk.org/browse/JDK-8342736). In my preliminary fix using `form_address` on AArch64 machine. I had to modify 3 `ldr` and 1 `str` instruction (in file `src/hotspot/cpu/aarch64/interp_masm_aarch64.cpp` at line number 926, 983, and 997). With this fix using `form_address`, `BciProfileWidth` works for maximum of 5000 after which it crashes with`assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. Without this fix `BciProfileWidth` works for a maximum value of 1300. Currently, I have suggested to restrict the upper bound on AArch64 to 1000 instead of fixing it with `form_address`. >> >> **Question to reviewers** >> Do you think this is a reasonable fix ? For AArch64 do you suggest fixing using `form_address` ? If yes, do I fix it under this PR or create another one ? > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > fixing test failures due to intx -> int of BciProfileWidth test/lib-test/jdk/test/whitebox/vm_flags/IntxTest.java line 36: > 34: * @author igor.ignatyev at oracle.com > 35: */ > 36: import jdk.test.lib.Platform; NIt: copyright date ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26139#discussion_r2225524788 From thartmann at openjdk.org Wed Jul 23 13:29:05 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 23 Jul 2025 13:29:05 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v17] In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 01:52:54 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. >> >> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. >> >> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Update intrinsicnode.cpp Testing is all clean. I think this is good to go into JDK 26. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3108402632 From jbhateja at openjdk.org Wed Jul 23 13:34:12 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 23 Jul 2025 13:34:12 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v17] In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 13:26:19 GMT, Tobias Hartmann wrote: > Testing is all clean. I think this is good to go into JDK 26. Thanks @TobiHartmann , integrating it. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3108436810 From jbhateja at openjdk.org Wed Jul 23 13:34:13 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 23 Jul 2025 13:34:13 GMT Subject: Integrated: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value In-Reply-To: References: Message-ID: <3F6BoP7F_VfKrEzctUL2L6ueDl1hl6enz2vOG02j6Dk=.c43e5e04-8766-479b-aa3c-f39157f49aae@github.com> On Fri, 7 Mar 2025 17:37:36 GMT, Jatin Bhateja wrote: > Hi All, > > This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. > > Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. > > New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. > > Kindly review and share your feedback. > > Best Regards, > Jatin This pull request has now been integrated. Changeset: b02c1256 Author: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/b02c1256768bc9983d4dba899cd19219e11a380a Stats: 849 lines in 3 files changed: 812 ins; 16 del; 21 mod 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value Co-authored-by: Emanuel Peter Reviewed-by: thartmann ------------- PR: https://git.openjdk.org/jdk/pull/23947 From kvn at openjdk.org Wed Jul 23 14:38:54 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 23 Jul 2025 14:38:54 GMT Subject: RFR: 8363895: Minimal build fails with slowdebug builds after JDK-8354887 [v2] In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 06:33:39 GMT, Ao Qi wrote: >> Configure with `--with-jvm-variants=minimal --with-debug-level=slowdebug`. >> >> Error message: >> >> ... >> Compiling up to 136 files for BUILD_java.compiler.interim >> Creating support/modules_libs/java.base/minimal/libjvm.so from 628 file(s) >> /opt/rh/devtoolset-11/root/usr/libexec/gcc/x86_64-redhat-linux/11/ld: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/hotspot/variant-minimal/libjvm/objs/macroAssembler_x86.o: in function `AOTCodeCache::is_on_for_dump()': >> /home/aoqi/work/openjdk/jdk/src/hotspot/share/code/aotCodeCache.hpp:383: undefined reference to `AOTCodeCache::_cache' >> collect2: error: ld returned 1 exit status >> gmake[3]: *** [lib/CompileJvm.gmk:175: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/support/modules_libs/java.base/minimal/libjvm.so] Error 1 >> gmake[2]: *** [make/Main.gmk:242: hotspot-minimal-libs] Error 1 >> gmake[2]: *** Waiting for unfinished jobs.... >> >> >> `AOTCodeCache::is_on_for_dump()` is used in `macroAssembler_x86.cpp` but not defined when cds is disabled. > > Ao Qi has updated the pull request incrementally with one additional commit since the last revision: > > missing macros for is_on_for_use() Good. ------------- Marked as reviewed by kvn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26436#pullrequestreview-3047751072 From snatarajan at openjdk.org Wed Jul 23 15:18:39 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Wed, 23 Jul 2025 15:18:39 GMT Subject: RFR: 8358696: Assert with extreme values for -XX:BciProfileWidth [v4] In-Reply-To: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> References: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> Message-ID: <_preMnRE0tqL476Pb8bPPfkixInRa-ZH5Qom7W70AW4=.a71e36da-d0e1-44e4-a3fe-9091460b813f@github.com> > **Issue** > Extreme values for BciProfileWidth flag such as `java -XX:BciProfileWidth=-1 -version` and `java -XX:BciProfileWidth=100000 -version `results in assert failure `assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. This is observed in a x86 machine. > > **Analysis** > On debugging the issue, I found that increasing the size of the interpreter using the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` prevented the above mentioned assert from failing for large values of BciProfileWidth. > > **Proposal** > Considering the fact that larger BciProfileWidth results in slower profiling, I have proposed a range between 0 to 5000 to restrict the value for BciProfileWidth for x86 machines. This maximum value is based on modifying the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` using the smallest `InterpreterCodeSize` for all the architectures. As for the lower bound, a value of -1 would be the same as 0, as this simply means no return bci's will be recorded in ret profile. > > **Issue in AArch64** > Additionally running the command `java -XX:BciProfileWidth= 10000 -version` (or larger values) results in a different failure `assert(offset_ok_for_immed(offset(), size)) failed: must be, was: 32768, 3` on an AArch64 machine.This is an issue of maximum offset for `ldr/str` in AArch64 which can be fixed using `form_address` as mentioned in [JDK-8342736](https://bugs.openjdk.org/browse/JDK-8342736). In my preliminary fix using `form_address` on AArch64 machine. I had to modify 3 `ldr` and 1 `str` instruction (in file `src/hotspot/cpu/aarch64/interp_masm_aarch64.cpp` at line number 926, 983, and 997). With this fix using `form_address`, `BciProfileWidth` works for maximum of 5000 after which it crashes with`assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. Without this fix `BciProfileWidth` works for a maximum value of 1300. Currently, I have suggested to restrict the upper bound on AArch64 to 1000 instead of fixing it with `form_address`. > > **Question to reviewers** > Do you think this is a reasonable fix ? For AArch64 do you suggest fixing using `form_address` ? If yes, do I fix it under this PR or create another one ? 
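To make the proposed bounds concrete, here is a small launcher sketch; it is not part of the patch, and it assumes a debug build since BciProfileWidth is a develop flag:

    // With a range attached to the flag, an extreme value should be rejected by
    // the flag parser ("outside the allowed range") instead of reaching the
    // interpreter generator and tripping the CodeBuffer assert.
    public class BciProfileWidthRangeSketch {
        public static void main(String[] args) throws Exception {
            Process p = new ProcessBuilder(
                    System.getProperty("java.home") + "/bin/java",
                    "-XX:BciProfileWidth=100000", "-version")
                    .inheritIO()
                    .start();
            System.out.println("exit code: " + p.waitFor()); // expected non-zero
        }
    }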
Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: fixing copyright ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26139/files - new: https://git.openjdk.org/jdk/pull/26139/files/c3a85cd4..2d0084ba Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26139&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26139&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26139.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26139/head:pull/26139 PR: https://git.openjdk.org/jdk/pull/26139 From snatarajan at openjdk.org Wed Jul 23 15:18:39 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Wed, 23 Jul 2025 15:18:39 GMT Subject: RFR: 8358696: Assert with extreme values for -XX:BciProfileWidth [v4] In-Reply-To: <-UlWQi6Pf7UwQKUR8sL4_Rhoj9MEd8UMlH7naG_W7QM=.d8947291-df9e-40e8-8007-4438c6c490c3@github.com> References: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> <-UlWQi6Pf7UwQKUR8sL4_Rhoj9MEd8UMlH7naG_W7QM=.d8947291-df9e-40e8-8007-4438c6c490c3@github.com> Message-ID: On Tue, 22 Jul 2025 12:59:35 GMT, Saranya Natarajan wrote: >> src/hotspot/share/runtime/globals.hpp line 1354: >> >>> 1352: range(0, 8) \ >>> 1353: \ >>> 1354: develop(intx, BciProfileWidth, 2, \ >> >> Recently, I've seen someone complaining about useless use of `intx`, saying that is brings less readability than a more fixed-width type when not needed. Here, [0, 5000] fits in 16 bits (even signed). One could change that into a simple `int` or something like that. > > Since `int` seems to fit the range. I have changed `intx` to `int `. Currently, fixing some test failures. I will address the failures in next commit. I have now resolved the failing test cases. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26139#discussion_r2225921554 From kvn at openjdk.org Wed Jul 23 15:21:21 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 23 Jul 2025 15:21:21 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v17] In-Reply-To: References: Message-ID: <1_vllGAMI458pIw8ZxfzzObNY2S6vSotABXfSmNWzIY=.8b476018-fd56-420c-98dd-1a2cd44b3c08@github.com> On Wed, 23 Jul 2025 13:29:26 GMT, Jatin Bhateja wrote: >> Testing is all clean. I think this is good to go into JDK 26. > >> Testing is all clean. I think this is good to go into JDK 26. > > Thanks @TobiHartmann , integrating it. Thank you, @jatin-bhateja , for providing performance numbers. I approved deference request. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3109080513 From duke at openjdk.org Wed Jul 23 15:34:02 2025 From: duke at openjdk.org (duke) Date: Wed, 23 Jul 2025 15:34:02 GMT Subject: Withdrawn: 8324720: Instruction selection does not respect -XX:-UseBMI2Instructions flag In-Reply-To: References: Message-ID: On Fri, 23 May 2025 13:50:09 GMT, Saranya Natarajan wrote: > While executing a function performing `a >> b` operation with `?XX:-UseBMI2Instructions` flag, the generated code contains BMI2 instruction `sarx eax,esi,edx`. The expected output should not contain any BMI2 instruction. 
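For context, the report boils down to a method of the following shape (an illustrative sketch; the flag from the report is -XX:-UseBMI2Instructions):

    // Compiled with -XX:-UseBMI2Instructions, this shift was reported to still
    // be emitted as the BMI2 form (sarx eax, esi, edx).
    static int shiftRight(int a, int b) {
        return a >> b;
    }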
> > ### Analysis and solution
> >
> > As suggested by @merykitty in [JDK-8324720](https://bugs.openjdk.org/browse/JDK-8324720), the initial idea was to make `VM_Version::supports_bmi2()` respect the `UseBMI2Instructions` flag by disabling the BMI2 feature when the `UseBMI2Instructions` runtime flag is explicitly set to false. This fix is similar to how other runtime flags, such as `UseAPX` and `UseAVX`, enable or disable specific code and register sets. However, some test failures were encountered while running tests on this fix.
> >
> > The first set of failures was caused by the assertion check on `VM_Version::supports_bmi2()` while generating some BMI2-specific instructions. This was caused by the stub generator generating AVX-512 specific code that uses these BMI2 instructions. It should be noted that the `UseAVX` flag is set by default to the highest supported version available on an x86 machine. This in turn allows AVX-512 specific code generation whenever possible. In order not to compromise the performance benefits of using AVX-512, the proposed fix only disables the BMI2 feature if AVX-512 features are also disabled (or not available in the machine) along with the `UseBMI2Instructions` flag.
> >
> > The second failure occurred in `compiler/intrinsics/sha/cli/TestUseSHA512IntrinsicsOptionOnSupportedCPU.java`, where a warning "_Intrinsics for SHA-384 and SHA-512 crypto hash functions not available on this CPU._" was returned on an AMD64 machine that had support for SHA512. Looking into `compiler/testlibrary/sha/predicate/IntrinsicPredicates.java`, it was found that the predicate for AMD64 was not in line with the changes introduced by [JDK-8341052](https://bugs.openjdk.org/browse/JDK-8341052) in commit [85c1aea](https://github.com/openjdk/jdk/pull/20633/commits/85c1aea90b10014aa34dfc902dff2bfd31bd70c0).

This pull request has been closed without being integrated.

-------------

PR: https://git.openjdk.org/jdk/pull/25415

From shade at openjdk.org Wed Jul 23 17:00:00 2025
From: shade at openjdk.org (Aleksey Shipilev)
Date: Wed, 23 Jul 2025 17:00:00 GMT
Subject: RFR: 8363895: Minimal build fails with slowdebug builds after JDK-8354887 [v2]
In-Reply-To: 
References: 
Message-ID: 

On Wed, 23 Jul 2025 06:33:39 GMT, Ao Qi wrote:

>> Configure with `--with-jvm-variants=minimal --with-debug-level=slowdebug`.
>>
>> Error message:
>>
>> ...
>> Compiling up to 136 files for BUILD_java.compiler.interim
>> Creating support/modules_libs/java.base/minimal/libjvm.so from 628 file(s)
>> /opt/rh/devtoolset-11/root/usr/libexec/gcc/x86_64-redhat-linux/11/ld: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/hotspot/variant-minimal/libjvm/objs/macroAssembler_x86.o: in function `AOTCodeCache::is_on_for_dump()':
>> /home/aoqi/work/openjdk/jdk/src/hotspot/share/code/aotCodeCache.hpp:383: undefined reference to `AOTCodeCache::_cache'
>> collect2: error: ld returned 1 exit status
>> gmake[3]: *** [lib/CompileJvm.gmk:175: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/support/modules_libs/java.base/minimal/libjvm.so] Error 1
>> gmake[2]: *** [make/Main.gmk:242: hotspot-minimal-libs] Error 1
>> gmake[2]: *** Waiting for unfinished jobs....
>>
>>
>> `AOTCodeCache::is_on_for_dump()` is used in `macroAssembler_x86.cpp` but not defined when cds is disabled.
>
> Ao Qi has updated the pull request incrementally with one additional commit since the last revision:
>
>   missing macros for is_on_for_use()

Looks fine, thanks.

-------------

Marked as reviewed by shade (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/26436#pullrequestreview-3048306246 From kvn at openjdk.org Wed Jul 23 17:40:01 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 23 Jul 2025 17:40:01 GMT Subject: RFR: 8363895: Minimal build fails with slowdebug builds after JDK-8354887 [v2] In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 06:33:39 GMT, Ao Qi wrote: >> Configure with `--with-jvm-variants=minimal --with-debug-level=slowdebug`. >> >> Error message: >> >> ... >> Compiling up to 136 files for BUILD_java.compiler.interim >> Creating support/modules_libs/java.base/minimal/libjvm.so from 628 file(s) >> /opt/rh/devtoolset-11/root/usr/libexec/gcc/x86_64-redhat-linux/11/ld: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/hotspot/variant-minimal/libjvm/objs/macroAssembler_x86.o: in function `AOTCodeCache::is_on_for_dump()': >> /home/aoqi/work/openjdk/jdk/src/hotspot/share/code/aotCodeCache.hpp:383: undefined reference to `AOTCodeCache::_cache' >> collect2: error: ld returned 1 exit status >> gmake[3]: *** [lib/CompileJvm.gmk:175: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/support/modules_libs/java.base/minimal/libjvm.so] Error 1 >> gmake[2]: *** [make/Main.gmk:242: hotspot-minimal-libs] Error 1 >> gmake[2]: *** Waiting for unfinished jobs.... >> >> >> `AOTCodeCache::is_on_for_dump()` is used in `macroAssembler_x86.cpp` but not defined when cds is disabled. > > Ao Qi has updated the pull request incrementally with one additional commit since the last revision: > > missing macros for is_on_for_use() In Leyden premain branch we already have this. I forgot to port it to mainline. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26436#issuecomment-3109531143 From dlong at openjdk.org Wed Jul 23 19:55:34 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 23 Jul 2025 19:55:34 GMT Subject: RFR: 8361376: Regressions 1-6% in several Renaissance in 26-b4 only MacOSX aarch64 Message-ID: This PR removes the recently added lock around set_guard_value, using instead Atomic::cmpxchg to atomically update bit-fields of the guard value. Further, it takes a fast-path that uses the previous direct store when at a safepoint. Combined, these changes should get us back to almost where we were before in terms of overhead. If necessary, we could go even further and allow make_not_entrant() do a direct byte store, leaving 24 bits for the guard value. ------------- Commit messages: - trailing white space - fix s390 copy-paste - use fast path for safepoints - ppc align check - ppc align - simplify x86 - In CAS loop, update old_value from result of CAS - fix ppc build - fix zero build - ppc typos - ... 
and 3 more: https://git.openjdk.org/jdk/compare/cf75f1f9...ecc6e68e Changes: https://git.openjdk.org/jdk/pull/26399/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26399&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361376 Stats: 214 lines in 13 files changed: 118 ins; 58 del; 38 mod Patch: https://git.openjdk.org/jdk/pull/26399.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26399/head:pull/26399 PR: https://git.openjdk.org/jdk/pull/26399 From dlong at openjdk.org Wed Jul 23 21:42:54 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 23 Jul 2025 21:42:54 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v2] In-Reply-To: References: Message-ID: <0hUerokqC_F8hnTVSlcSO7eRDfuq_JhFOMZJGtnVJLg=.85ed9e6e-9fec-4855-a4e0-53748a7b42fa@github.com> On Fri, 4 Jul 2025 02:33:33 GMT, Dean Long wrote: >> The VerifyStack logic in Deoptimization::unpack_frames() attempts to check the expression stack size of the interpreter frame against what GenerateOopMap computes. To do this, it needs to know if the state at the current bci represents the "before" state, meaning the bytecode will be reexecuted, or the "after" state, meaning we will advance to the next bytecode. The old code didn't know how to determine exactly what state we were in, so it checked both. This PR cleans that up, so we only have to compute the oopmap once. It also removes old SPARC support. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > fix optimized build @eme64 and @mhaessig, this is somewhat a followup to 8336906, so you may be the best candidates to volunteer to look at this :-) ------------- PR Comment: https://git.openjdk.org/jdk/pull/26121#issuecomment-3110272992 From duke at openjdk.org Wed Jul 23 23:34:10 2025 From: duke at openjdk.org (duke) Date: Wed, 23 Jul 2025 23:34:10 GMT Subject: Withdrawn: 8342095: Add autovectorizer support for subword vector casts In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 04:40:23 GMT, Jasmine Karthikeyan wrote: > Hi all, > This patch adds initial support for the autovectorizer to generate conversions between subword types. Currently, when superword sees two packs that have different basic types, it discards them and bails out of vectorization. This patch changes the behavior to ask the backend if a cast between the conflicting types is supported, and keeps the pack if it is. Later, when the `VTransform` graph is built, a synthetic cast is emitted when packs requiring casts are detected. Currently, only narrowing casts are supported as I wanted to re-use existing `VectorCastX2Y` logic for the initial version, but adding more conversions is simple and can be done with a subsequent RFE. I have attached a JMH benchmark and got these results on my Zen 3 machine: > > > Baseline Patch > Benchmark (SIZE) Mode Cnt Score Error Units Score Error Units Improvement > VectorSubword.intToByte 1024 avgt 12 200.049 ? 19.787 ns/op 56.228 ? 3.535 ns/op (3.56x) > VectorSubword.intToShort 1024 avgt 12 179.826 ? 1.539 ns/op 43.332 ? 1.166 ns/op (4.15x) > VectorSubword.shortToByte 1024 avgt 12 245.580 ? 6.150 ns/op 29.757 ? 1.055 ns/op (8.25x) > > > I've also added some IR tests and they pass on my linux x64 machine. Thoughts and reviews would be appreciated! This pull request has been closed without being integrated. 
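For readers skimming the withdrawn 8342095 change above, the kind of loop it targeted is an ordinary narrowing conversion like the sketch below, loosely mirroring the VectorSubword benchmark names; the method itself is illustrative and not taken from the patch:

    // A narrowing int -> byte loop. With subword cast support, superword can
    // keep the mixed-type packs and emit a vector cast instead of bailing out.
    static void intToByte(int[] src, byte[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = (byte) src[i];
        }
    }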
------------- PR: https://git.openjdk.org/jdk/pull/23413 From duke at openjdk.org Wed Jul 23 23:49:10 2025 From: duke at openjdk.org (duke) Date: Wed, 23 Jul 2025 23:49:10 GMT Subject: Withdrawn: 8341697: C2: Register allocation inefficiency in tight loop In-Reply-To: References: Message-ID: On Fri, 11 Oct 2024 15:50:20 GMT, Quan Anh Mai wrote: > Hi, > > This patch improves the spill placement in the presence of loops. Currently, when trying to spill a live range, we will create a `Phi` at the loop head, this `Phi` will then be spilt inside the loop body, and as the `Phi` is `UP` (lives in register) at the loop head, we need to emit an additional reload at the loop back-edge block. This introduces loop-carried dependencies, greatly reduces loop throughput. > > My proposal is to be aware of loop heads and try to eagerly spill or reload live ranges at the loop entries. In general, if a live range is spilt in the loop common path, then we should spill it in the loop entries and reload it at its use sites, this may increase the number of loads but will eliminate loop-carried dependencies, making the load latency-free. On the otherhand, if a live range is only spilt in the uncommon path but is used in the common path, then we should reload it eagerly. I think it is appropriate to bias towards spilling, i.e. if a live range is both spilt and reloaded in the common path, we spill it. This eliminates loop-carried dependencies. > > A downfall of this algorithm is that we may overspill, which means that after spilling some live ranges, the others do not need to be spilt anymore but are unnecessarily spilt. > > - A possible approach is to split the live ranges one-by-one and try to colour them afterwards. This seems prohibitively expensive. > - Another approach is to be aware of the number of registers that need spilling, sorting the live ones accordingly. > - Finally, we can eagerly split a live range at uncommon branches and do conservative coalescing afterwards. I think this is the most elegant and efficient solution for that. > > Please take a look and leave your reviews, thanks a lot. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/21472 From jbhateja at openjdk.org Thu Jul 24 00:41:02 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 24 Jul 2025 00:41:02 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: References: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> Message-ID: <1lFbFokLiW3pWxrvq8WtLoiXj-TkYFq2xk-cBtgHvhI=.ab2f363a-60a9-4d9b-aca4-82c770eb1cb2@github.com> On Mon, 21 Jul 2025 15:44:47 GMT, Jatin Bhateja wrote: >> Hi Jatin (@jatin-bhateja), for the first iteration, would it be ok to get the push_paired/pop_paired changes integrated and then make the push2p/pop2p related optimizations in a separate PR? >> >> Thanks, >> Vamsi > > Hi @vamsi-parasa , I think it's ok not to expose pop_ppx / push_ppx as separate interfaces, and let processor forward the values b/w push and matching pop if balancing constraints are satisfied. > > image > Hi Jatin (@jatin-bhateja), the reason to make the push_ppx/pop_ppx usage explicit is because an unbalanced push_ppx operation has a performance penalty. Thanks @vamsi-parasa , as per APX specification PPX is an optimization hint and should only improve performance if balancing contraintins are met. so I don't think it will have any performance penalty. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2227032048 From aoqi at openjdk.org Thu Jul 24 01:16:53 2025 From: aoqi at openjdk.org (Ao Qi) Date: Thu, 24 Jul 2025 01:16:53 GMT Subject: RFR: 8363895: Minimal build fails with slowdebug builds after JDK-8354887 [v2] In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 17:37:10 GMT, Vladimir Kozlov wrote: >> Ao Qi has updated the pull request incrementally with one additional commit since the last revision: >> >> missing macros for is_on_for_use() > > In Leyden premain branch we already have this. I forgot to port it to mainline. Thanks for the review, @vnkozlov and @shipilev . ------------- PR Comment: https://git.openjdk.org/jdk/pull/26436#issuecomment-3111628906 From duke at openjdk.org Thu Jul 24 01:16:54 2025 From: duke at openjdk.org (duke) Date: Thu, 24 Jul 2025 01:16:54 GMT Subject: RFR: 8363895: Minimal build fails with slowdebug builds after JDK-8354887 [v2] In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 06:33:39 GMT, Ao Qi wrote: >> Configure with `--with-jvm-variants=minimal --with-debug-level=slowdebug`. >> >> Error message: >> >> ... >> Compiling up to 136 files for BUILD_java.compiler.interim >> Creating support/modules_libs/java.base/minimal/libjvm.so from 628 file(s) >> /opt/rh/devtoolset-11/root/usr/libexec/gcc/x86_64-redhat-linux/11/ld: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/hotspot/variant-minimal/libjvm/objs/macroAssembler_x86.o: in function `AOTCodeCache::is_on_for_dump()': >> /home/aoqi/work/openjdk/jdk/src/hotspot/share/code/aotCodeCache.hpp:383: undefined reference to `AOTCodeCache::_cache' >> collect2: error: ld returned 1 exit status >> gmake[3]: *** [lib/CompileJvm.gmk:175: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/support/modules_libs/java.base/minimal/libjvm.so] Error 1 >> gmake[2]: *** [make/Main.gmk:242: hotspot-minimal-libs] Error 1 >> gmake[2]: *** Waiting for unfinished jobs.... >> >> >> `AOTCodeCache::is_on_for_dump()` is used in `macroAssembler_x86.cpp` but not defined when cds is disabled. > > Ao Qi has updated the pull request incrementally with one additional commit since the last revision: > > missing macros for is_on_for_use() @theaoqi Your change (at version dc329c366d9d76f5d123effe5530d490710df5e7) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26436#issuecomment-3111630759 From duke at openjdk.org Thu Jul 24 01:20:46 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 24 Jul 2025 01:20:46 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v40] In-Reply-To: References: Message-ID: > This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). > > When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. > > This does not modify existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality > > Additional Testing: > - [ ] Linux x64 fastdebug all > - [ ] Linux aarch64 fastdebug all > - [ ] ... 
Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: Use CompiledICLocker instead of CompiledIC_lock ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23573/files - new: https://git.openjdk.org/jdk/pull/23573/files/1b001df8..d4e3dd31 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=39 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=38-39 Stats: 19 lines in 3 files changed: 9 ins; 8 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23573.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23573/head:pull/23573 PR: https://git.openjdk.org/jdk/pull/23573 From dzhang at openjdk.org Thu Jul 24 01:34:53 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 24 Jul 2025 01:34:53 GMT Subject: RFR: 8363898: RISC-V: TestRangeCheckHoistingScaledIV.java fails after JDK-8355293 when running without RVV In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 03:32:26 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java fails without RVV after [JDK-8355293](https://bugs.openjdk.org/browse/JDK-8355293) in fastdebug mode. > > In [JDK-8291669](https://bugs.openjdk.org/browse/JDK-8291669), which introduced this case, it is mentioned: >>Previously attached jtreg case fails on ppc64 because VectorAPI has no >>vector intrinsics on ppc64 so there's no long range check to hoist. In >>this patch, we limit the test architecture to x64 and AArch64. > > So we need RVV to use vector intrinsics on RISC-V. > > ### Test (fastdebug) > - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on k1 and sg2042 > - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on qemu-system w/ and w/o RVV Thanks all for the review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26437#issuecomment-3111652273 From duke at openjdk.org Thu Jul 24 01:34:53 2025 From: duke at openjdk.org (duke) Date: Thu, 24 Jul 2025 01:34:53 GMT Subject: RFR: 8363898: RISC-V: TestRangeCheckHoistingScaledIV.java fails after JDK-8355293 when running without RVV In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 03:32:26 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java fails without RVV after [JDK-8355293](https://bugs.openjdk.org/browse/JDK-8355293) in fastdebug mode. > > In [JDK-8291669](https://bugs.openjdk.org/browse/JDK-8291669), which introduced this case, it is mentioned: >>Previously attached jtreg case fails on ppc64 because VectorAPI has no >>vector intrinsics on ppc64 so there's no long range check to hoist. In >>this patch, we limit the test architecture to x64 and AArch64. > > So we need RVV to use vector intrinsics on RISC-V. > > ### Test (fastdebug) > - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on k1 and sg2042 > - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on qemu-system w/ and w/o RVV @DingliZhang Your change (at version f0f92003c0afd2da54e33ebbb1ca65a596fc056f) is now ready to be sponsored by a Committer. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26437#issuecomment-3111653889 From aoqi at openjdk.org Thu Jul 24 01:36:06 2025 From: aoqi at openjdk.org (Ao Qi) Date: Thu, 24 Jul 2025 01:36:06 GMT Subject: Integrated: 8363895: Minimal build fails with slowdebug builds after JDK-8354887 In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 02:16:46 GMT, Ao Qi wrote: > Configure with `--with-jvm-variants=minimal --with-debug-level=slowdebug`. > > Error message: > > ... > Compiling up to 136 files for BUILD_java.compiler.interim > Creating support/modules_libs/java.base/minimal/libjvm.so from 628 file(s) > /opt/rh/devtoolset-11/root/usr/libexec/gcc/x86_64-redhat-linux/11/ld: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/hotspot/variant-minimal/libjvm/objs/macroAssembler_x86.o: in function `AOTCodeCache::is_on_for_dump()': > /home/aoqi/work/openjdk/jdk/src/hotspot/share/code/aotCodeCache.hpp:383: undefined reference to `AOTCodeCache::_cache' > collect2: error: ld returned 1 exit status > gmake[3]: *** [lib/CompileJvm.gmk:175: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/support/modules_libs/java.base/minimal/libjvm.so] Error 1 > gmake[2]: *** [make/Main.gmk:242: hotspot-minimal-libs] Error 1 > gmake[2]: *** Waiting for unfinished jobs.... > > > `AOTCodeCache::is_on_for_dump()` is used in `macroAssembler_x86.cpp` but not defined when cds is disabled. This pull request has now been integrated. Changeset: 2da0cdad Author: Ao Qi Committer: Vladimir Kozlov URL: https://git.openjdk.org/jdk/commit/2da0cdadb898efb9af827374368471102bfe0ccd Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8363895: Minimal build fails with slowdebug builds after JDK-8354887 Reviewed-by: kvn, shade ------------- PR: https://git.openjdk.org/jdk/pull/26436 From dzhang at openjdk.org Thu Jul 24 01:40:07 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 24 Jul 2025 01:40:07 GMT Subject: Integrated: 8363898: RISC-V: TestRangeCheckHoistingScaledIV.java fails after JDK-8355293 when running without RVV In-Reply-To: References: Message-ID: <3E1vKnICiR96AWrpvco-qR4t6nEdx4raClkTKrdZGG8=.2dd135d7-0886-43cc-902b-93508c47527a@github.com> On Wed, 23 Jul 2025 03:32:26 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java fails without RVV after [JDK-8355293](https://bugs.openjdk.org/browse/JDK-8355293) in fastdebug mode. > > In [JDK-8291669](https://bugs.openjdk.org/browse/JDK-8291669), which introduced this case, it is mentioned: >>Previously attached jtreg case fails on ppc64 because VectorAPI has no >>vector intrinsics on ppc64 so there's no long range check to hoist. In >>this patch, we limit the test architecture to x64 and AArch64. > > So we need RVV to use vector intrinsics on RISC-V. > > ### Test (fastdebug) > - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on k1 and sg2042 > - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on qemu-system w/ and w/o RVV This pull request has now been integrated. 
Changeset: b746701e Author: Dingli Zhang Committer: Fei Yang URL: https://git.openjdk.org/jdk/commit/b746701e5769a7a5a1e7900ddfdd285706ac5fe1 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod 8363898: RISC-V: TestRangeCheckHoistingScaledIV.java fails after JDK-8355293 when running without RVV Reviewed-by: fyang, mli, syan ------------- PR: https://git.openjdk.org/jdk/pull/26437 From qxing at openjdk.org Thu Jul 24 02:13:02 2025 From: qxing at openjdk.org (Qizheng Xing) Date: Thu, 24 Jul 2025 02:13:02 GMT Subject: RFR: 8360192: C2: Make the type of count leading/trailing zero nodes more precise [v5] In-Reply-To: References: Message-ID: <2oRnOjBC-_tOqT53pt6ozG3ENpv7CLsA4HBt5RqP3PY=.ec27d42f-e41e-4a55-91c5-977c2211c666@github.com> On Wed, 23 Jul 2025 09:31:18 GMT, Qizheng Xing wrote: >> The result of count leading/trailing zeros is always non-negative, and the maximum value is integer type's size in bits. In previous versions, when C2 can not know the operand value of a CLZ/CTZ node at compile time, it will generate a full-width integer type for its result. This can significantly affect the efficiency of code in some cases. >> >> This patch makes the type of CLZ/CTZ nodes more precise, to make C2 generate better code. For example, the following implementation runs ~115% faster on x86-64 with this patch: >> >> >> public static int numberOfNibbles(int i) { >> int mag = Integer.SIZE - Integer.numberOfLeadingZeros(i); >> return Math.max((mag + 3) / 4, 1); >> } >> >> >> Testing: tier1, IR test > > Qizheng Xing has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - Merge branch 'master' into enhance-clz-type > - Move `TestCountBitsRange` to `compiler.c2.gvn` > - Fix null checks > - Narrow type bound > - Use `BitsPerX` constant instead of `sizeof` > - Make the type of count leading/trailing zero nodes more precise Hi all, This patch has now passed all GHA tests and is ready for further reviews. If there are any other suggestions for this PR, please let me know. Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25928#issuecomment-3111718442 From fjiang at openjdk.org Thu Jul 24 02:24:58 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Thu, 24 Jul 2025 02:24:58 GMT Subject: RFR: 8362838: RISC-V: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 02:22:49 GMT, Fei Yang wrote: >> Same as [JDK-8361892](https://bugs.openjdk.org/browse/JDK-8361892), but for riscv. >> >> Testing: >> - [x] Tier1-3 & hotspot:tier4 on linux-riscv64 > > Look good to me. Thanks for fixing this. @RealFYang @yadongw -- Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26318#issuecomment-3111737114 From fjiang at openjdk.org Thu Jul 24 02:24:59 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Thu, 24 Jul 2025 02:24:59 GMT Subject: Integrated: 8362838: RISC-V: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: <_HGmi-o7Hq1tbGCGLEODQbkx78no47FeMUI7GHEslOg=.489223b2-4211-4e0c-af7a-ef286056e8e4@github.com> On Tue, 15 Jul 2025 14:02:08 GMT, Feilong Jiang wrote: > Same as [JDK-8361892](https://bugs.openjdk.org/browse/JDK-8361892), but for riscv. > > Testing: > - [x] Tier1-3 & hotspot:tier4 on linux-riscv64 This pull request has now been integrated. 
Changeset: 0ba2942c Author: Feilong Jiang URL: https://git.openjdk.org/jdk/commit/0ba2942c6e7aadc3d091c40f6bd8d9f7502f5f76 Stats: 31 lines in 1 file changed: 0 ins; 31 del; 0 mod 8362838: RISC-V: Incorrect matching rule leading to improper oop instruction encoding Reviewed-by: fyang, yadongwang ------------- PR: https://git.openjdk.org/jdk/pull/26318 From duke at openjdk.org Thu Jul 24 02:56:43 2025 From: duke at openjdk.org (erifan) Date: Thu, 24 Jul 2025 02:56:43 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v5] In-Reply-To: References: Message-ID: > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is > relative smaller than that of `fromLong`. So this patch does the conversion for these cases. > > The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. > > Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. > > This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. > > As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like > > VectorMaskToLong (VectorLongToMask x) => x > > > Hence, this patch also added the following optimizations: > > VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 > VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 > > VectorMaskCast (VectorMaskCast x) => x > > And we can see noticeable performance improvement with the above optimizations for floating-point types. > > Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: > > Benchmark Unit Before Error After Error Uplift > microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 > microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 > microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 > microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 > microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 > microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 > microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 > microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 > > > Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: > > Benchmark Unit Before Error After Error Uplift > microMaskFromLongToLong_Double... erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains eight additional commits since the last revision: - Add JMH benchmarks for cast chain transformation - Merge branch 'master' into JDK-8356760 - Refactor the implementation Do the convertion in C2's IGVN phase to cover more cases. - Merge branch 'master' into JDK-8356760 - Simplify the test code - Address some review comments Add support for the following patterns: toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) toLong(maskAll(false)) => 0 And add more test cases. - Merge branch 'master' into JDK-8356760 - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is relative smaller than that of `fromLong`. This patch does the conversion for these cases if `l` is a compile time constant. And this conversion also enables further optimizations that recognize maskAll patterns, see [1]. Some JTReg test cases are added to ensure the optimization is effective. I tried many different ways to write a JMH benchmark, but failed. Since the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific compile-time constant, the statement will be hoisted out of the loop. If we don't use a loop, the hotspot will become other instructions, and no obvious performance change was observed. However, combined with the optimization of [1], we can observe a performance improvement of about 7% on both aarch64 and x64. The patch was tested on both aarch64 and x64, all of tier1 tier2 and tier3 tests passed. [1] https://github.com/openjdk/jdk/pull/24674 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25793/files - new: https://git.openjdk.org/jdk/pull/25793/files/8ebe5e56..6ae43e17 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=03-04 Stats: 11783 lines in 334 files changed: 9323 ins; 800 del; 1660 mod Patch: https://git.openjdk.org/jdk/pull/25793.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25793/head:pull/25793 PR: https://git.openjdk.org/jdk/pull/25793 From duke at openjdk.org Thu Jul 24 03:41:54 2025 From: duke at openjdk.org (erifan) Date: Thu, 24 Jul 2025 03:41:54 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: <7zxjUTJq9ynYRau4UpWaFcARH8cp8Xka3cJovCwGVRY=.2bcd9dc6-a9df-47f2-8834-bc6c4a8469cf@github.com> References: <3cr8Njt2flaQXy5sjOVOlhI9XDkEesagnYLwzCmgkoI=.089494aa-d622-47db-8d23-c9637519028c@github.com> <7zxjUTJq9ynYRau4UpWaFcARH8cp8Xka3cJovCwGVRY=.2bcd9dc6-a9df-47f2-8834-bc6c4a8469cf@github.com> Message-ID: On Fri, 18 Jul 2025 03:14:58 GMT, Jatin Bhateja wrote: >>> > > > public static final VectorSpecies FSP = FloatVector.SPECIES_512; >>> > > > public static long micro1(long a) { >>> > > > long mask = Math.min(-1, Math.max(-1, a)); >>> > > > return VectorMask.fromLong(FSP, mask).toLong(); >>> > > > } >>> > > > public static long micro2() { >>> > > > return FSP.maskAll(true).toLong(); >>> > > > } >>> > > >>> > > >>> > > With this JMH method we can not see obvious performance improvement, because the hot spots are other instructions. Adding a loop is better. >>> > >>> > >>> > There is no hard and fast rule for the inclusion of a loop in a JMH micro in that case? >>> >>> You mean adding a loop is not a block, right ? >> >> Yes. If you see gains without loop go for it. 
> >> As @jatin-bhateja suggested, I have refactored the implementation and updated the commit message. please help review this PR, thanks! > > Thanks a lot @erifan , I am out for the rest of the week, will re-review early next week. @jatin-bhateja I have addressed your comments, would you mind take another look, thanks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3111848754 From duke at openjdk.org Thu Jul 24 03:41:55 2025 From: duke at openjdk.org (erifan) Date: Thu, 24 Jul 2025 03:41:55 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4] In-Reply-To: References: <-ZNeXOcmEACkhP4QKXKnWWEiT6ucjPY7Zz1HqvMeAoI=.c8fae49e-fcb0-41fb-84d1-4aa52ee83790@github.com> <1RlpmwLAF5ALeZQRS_DAqixgD6MUno5cUbguqHTlUU0=.6a594ecc-b8f7-489d-b801-a41e87d1deeb@github.com> Message-ID: On Tue, 22 Jul 2025 09:04:56 GMT, erifan wrote: >>> Do you mean this check `Matcher::match_rule_supported_vector(opc, vlen, maskall_bt)` ? I think it's necessary ? Because in theory some platforms don't support both `MaskAll` and `Replicate`. Of course, this situation may not exist in reality. If `MaskAll` and `Replicate` are not supported, then `VectorLongToMask` should not be supported either, and this function will not be called. >> >> My suggestion was to check for Op_Replicate here as Op_MaskAll is already checked underneath VectorNode::scalar2vector under an assumption that MaskAll is a special case for replicate applicable to masks > > Oh I misunderstood what you meant, now I understand, thank you! Done, thanks! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2227253339 From duke at openjdk.org Thu Jul 24 03:41:56 2025 From: duke at openjdk.org (erifan) Date: Thu, 24 Jul 2025 03:41:56 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4] In-Reply-To: <-ZNeXOcmEACkhP4QKXKnWWEiT6ucjPY7Zz1HqvMeAoI=.c8fae49e-fcb0-41fb-84d1-4aa52ee83790@github.com> References: <-ZNeXOcmEACkhP4QKXKnWWEiT6ucjPY7Zz1HqvMeAoI=.c8fae49e-fcb0-41fb-84d1-4aa52ee83790@github.com> Message-ID: On Tue, 22 Jul 2025 03:18:19 GMT, erifan wrote: >> test/micro/org/openjdk/bench/jdk/incubator/vector/MaskFromLongToLongBenchmark.java line 34: >> >>> 32: @Fork(value = 1, jvmArgs = {"--add-modules=jdk.incubator.vector"}) >>> 33: public class MaskFromLongToLongBenchmark { >>> 34: private static final int ITERATION = 10000; >> >> It will be nice to add a synthetic micro for cast chain transform added along with this patch. following micro shows around 1.5x gains on AVX2 system because of widening cast elision. 
>> >> >> import jdk.incubator.vector.*; >> import java.util.stream.IntStream; >> >> public class mask_cast_chain { >> public static final VectorSpecies FSP = FloatVector.SPECIES_128; >> >> public static long micro(float [] src1, float [] src2, int ctr) { >> long res = 0; >> for (int i = 0; i < FSP.loopBound(src1.length); i += FSP.length()) { >> res += FloatVector.fromArray(FSP, src1, i) >> .compare(VectorOperators.GE, FloatVector.fromArray(FSP, src2, i)) >> .cast(DoubleVector.SPECIES_256) >> .cast(FloatVector.SPECIES_128) >> .toLong(); >> } >> return res * ctr; >> } >> >> public static void main(String [] args) { >> float [] src1 = new float[1024]; >> float [] src2 = new float[1024]; >> >> IntStream.range(0, src1.length).forEach(i -> {src1[i] = (float)i;}); >> IntStream.range(0, src2.length).forEach(i -> {src2[i] = (float)500;}); >> >> long res = 0; >> for (int i = 0; i < 100000; i++) { >> res += micro(src1, src2, i); >> } >> long t1 = System.currentTimeMillis(); >> for (int i = 0; i < 100000; i++) { >> res += micro(src1, src2, i); >> } >> long t2 = System.currentTimeMillis(); >> System.out.println("[time] " + (t2 - t1) + "ms" + " [res] " + res); >> } >> } > > Ok~ Added some JMH benchmarks, the code is slightly different with your code. Test results show that on my avx2 system, there are ~17% performance improvement for applicable cases. No performance change on avx3 system because `cast` is lowered as empty. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2227250664 From duke at openjdk.org Thu Jul 24 03:45:00 2025 From: duke at openjdk.org (erifan) Date: Thu, 24 Jul 2025 03:45:00 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v7] In-Reply-To: <15TW6hiffz65NhHevPefL_6swSC07UD-GwiJ4tPDtFs=.b83081df-8abd-4756-b4e0-1d969678a0d2@github.com> References: <15TW6hiffz65NhHevPefL_6swSC07UD-GwiJ4tPDtFs=.b83081df-8abd-4756-b4e0-1d969678a0d2@github.com> Message-ID: On Thu, 5 Jun 2025 11:05:48 GMT, Emanuel Peter wrote: >>> > FYI: `BoolTest::negate` already does what you want: `mask negate( ) const { return mask(_test^4); }` I think you should use that instead :) >>> >>> Indeed, I hadn't noticed that, thank you. >> >> Oh I think we still cannot use `BoolTest::negate`, because we cannot instantiate a `BoolTest` object with **unsigned** comparison. `BoolTest::negate` is a non-static function. > >> Oh I think we still cannot use `BoolTest::negate`, because we cannot instantiate a `BoolTest` object with **unsigned** comparison. `BoolTest::negate` is a non-static function. > > I see. Ok. Hmm. I still think that the logic should be in `BoolTest`, because that is where the exact implementation of the enum values is. In that context it is easier to see why `^4` does the negation. And imagine we were ever to change the enum values, then it would be harder to find your code and fix it. > > Maybe it could be called `BoolTest::negate_mask(mast btm)` and explain in a comment that both signed and unsigned is supported. Hi @eme64 @jatin-bhateja , could you please take a look at this patch? Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/24674#issuecomment-3111851722 From galder at openjdk.org Thu Jul 24 06:51:36 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Thu, 24 Jul 2025 06:51:36 GMT Subject: RFR: 8354244: Use random data in MinMaxRed_Long data arrays Message-ID: Simplified the data used in the tests added in [JDK-8307513](https://bugs.openjdk.org/browse/JDK-8307513). 
The data does not need to have a specific shape because this test focuses on verifying the IR when vectorization kicks in, and when it does, the data can just be random. Shaping the data to control branch taken/not-taken paths makes sense when CMov macro expansion kicks in instead of vectorization. When switching to random data I noticed that the test was randomly failing. This was due to potential overflows that result from takin the min/max and then multiplying it by 11, so I've adjusted that section of the test as well. I've run the test on both aarch64 and x64 platforms where this test would get vectorized. To verify that I made sure the test passed and verified that the jtr output to make sure the IR conditions were matched. ------------- Commit messages: - Simplify test Changes: https://git.openjdk.org/jdk/pull/26451/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26451&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8354244 Stats: 84 lines in 1 file changed: 11 ins; 62 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/26451.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26451/head:pull/26451 PR: https://git.openjdk.org/jdk/pull/26451 From dfenacci at openjdk.org Thu Jul 24 07:12:55 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Thu, 24 Jul 2025 07:12:55 GMT Subject: RFR: 8358696: Assert with extreme values for -XX:BciProfileWidth [v4] In-Reply-To: <_preMnRE0tqL476Pb8bPPfkixInRa-ZH5Qom7W70AW4=.a71e36da-d0e1-44e4-a3fe-9091460b813f@github.com> References: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> <_preMnRE0tqL476Pb8bPPfkixInRa-ZH5Qom7W70AW4=.a71e36da-d0e1-44e4-a3fe-9091460b813f@github.com> Message-ID: <51aYnCiXel-vz4Zu40K08E1lyBtX5JXD8PXoCr5wWUE=.15def8e4-f7c3-42ae-976e-f79ed7415bfa@github.com> On Wed, 23 Jul 2025 15:18:39 GMT, Saranya Natarajan wrote: >> **Issue** >> Extreme values for BciProfileWidth flag such as `java -XX:BciProfileWidth=-1 -version` and `java -XX:BciProfileWidth=100000 -version `results in assert failure `assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. This is observed in a x86 machine. >> >> **Analysis** >> On debugging the issue, I found that increasing the size of the interpreter using the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` prevented the above mentioned assert from failing for large values of BciProfileWidth. >> >> **Proposal** >> Considering the fact that larger BciProfileWidth results in slower profiling, I have proposed a range between 0 to 5000 to restrict the value for BciProfileWidth for x86 machines. This maximum value is based on modifying the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` using the smallest `InterpreterCodeSize` for all the architectures. As for the lower bound, a value of -1 would be the same as 0, as this simply means no return bci's will be recorded in ret profile. >> >> **Issue in AArch64** >> Additionally running the command `java -XX:BciProfileWidth= 10000 -version` (or larger values) results in a different failure `assert(offset_ok_for_immed(offset(), size)) failed: must be, was: 32768, 3` on an AArch64 machine.This is an issue of maximum offset for `ldr/str` in AArch64 which can be fixed using `form_address` as mentioned in [JDK-8342736](https://bugs.openjdk.org/browse/JDK-8342736). In my preliminary fix using `form_address` on AArch64 machine. 
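A tiny sketch of the overflow mentioned above; the helper name is illustrative and not taken from the test:

    // The reduction multiplies the running max by 11. For random 64-bit inputs
    // this wraps as soon as the max exceeds Long.MAX_VALUE / 11, which is why
    // that part of the test needed adjusting once the data became random.
    static long scaledMax(long a, long b) {
        return 11 * Math.max(a, b);
    }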
I had to modify 3 `ldr` and 1 `str` instruction (in file `src/hotspot/cpu/aarch64/interp_masm_aarch64.cpp` at line number 926, 983, and 997). With this fix using `form_address`, `BciProfileWidth` works for maximum of 5000 after which it crashes with`assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. Without this fix `BciProfileWidth` works for a maximum value of 1300. Currently, I have suggested to restrict the upper bound on AArch64 to 1000 instead of fixing it with `form_address`. >> >> **Question to reviewers** >> Do you think this is a reasonable fix ? For AArch64 do you suggest fixing using `form_address` ? If yes, do I fix it under this PR or create another one ? > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > fixing copyright Thanks for fixing this @sarannat. I left a couple of inline comments. src/hotspot/share/runtime/globals.hpp line 1356: > 1354: develop(int, BciProfileWidth, 2, \ > 1355: "Number of return bci's to record in ret profile") \ > 1356: range(0, AARCH64_ONLY(1000) NOT_AARCH64(5000)) \ I'm not too sure of the usual number of returns but even just 1000 sounds quite big as maximum. Do you think we could use that for all architectures? test/lib-test/jdk/test/whitebox/vm_flags/IntxTest.java line 39: > 37: public class IntxTest { > 38: private static final String FLAG_NAME = "OnStackReplacePercentage"; > 39: private static final String FLAG_DEBUG_NAME = "BciProfileWidth"; Maybe we might want use another `intx` flag instead of just removing this (just to keep testing the WhiteBox) ------------- PR Review: https://git.openjdk.org/jdk/pull/26139#pullrequestreview-3050359574 PR Review Comment: https://git.openjdk.org/jdk/pull/26139#discussion_r2227627961 PR Review Comment: https://git.openjdk.org/jdk/pull/26139#discussion_r2227611611 From bkilambi at openjdk.org Thu Jul 24 07:40:02 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Thu, 24 Jul 2025 07:40:02 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: On Tue, 22 Jul 2025 09:57:30 GMT, erifan wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Addressed review comments to half the number of match rules > > src/hotspot/cpu/aarch64/aarch64.ad line 923: > >> 921: V24, V24_H, V24_J, V24_K >> 922: ); >> 923: > > Not a big matter, but it looks better to me if you can move this change `after line 810` of this file. Thanks but I feel having all the vector classes (like for vecA, vecX etc) together would be better and keeping the reg_class definitions with other reg_class feels better to me. Hope that's ok? > src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 5181: > >> 5179: %}')dnl >> 5180: dnl >> 5181: > > Remove this blank otherwise two blank lines will be generated. See `src/hotspot/cpu/aarch64/aarch64_vector.ad` line 7180 and line 7181 Hi @erifan Thanks for the comment. This is a good catch. Will update patch soon. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2227688969 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2227686648 From bkilambi at openjdk.org Thu Jul 24 07:44:09 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Thu, 24 Jul 2025 07:44:09 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: <0kEjeM7HkOm_kbmaY13VChiaRlLLQH9KCh45Hw9B2us=.abf6e115-5ce7-4a72-bbdf-ec958aca7cc3@github.com> On Tue, 22 Jul 2025 10:01:02 GMT, erifan wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Addressed review comments to half the number of match rules > > src/hotspot/cpu/aarch64/aarch64.ad line 5091: > >> 5089: format %{ %} >> 5090: interface(REG_INTER); >> 5091: %} > > Ditto, I tend to moving this change `after line 5101` of this file. Same explanation here. the registers are matching `vReg` which can be either `vecA`, `vecD` or `vecX` and thus I placed the operand definitions right after the `vReg` definition. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2227696114 From duke at openjdk.org Thu Jul 24 08:18:02 2025 From: duke at openjdk.org (erifan) Date: Thu, 24 Jul 2025 08:18:02 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: <4WxJs9VADTHxqICEzhByWO1YAbt5AGsfIXsAI3BFZfU=.7c75ecea-d5aa-4a58-8862-5aa771469201@github.com> On Thu, 24 Jul 2025 07:37:46 GMT, Bhavana Kilambi wrote: >> src/hotspot/cpu/aarch64/aarch64.ad line 923: >> >>> 921: V24, V24_H, V24_J, V24_K >>> 922: ); >>> 923: >> >> Not a big matter, but it looks better to me if you can move this change `after line 810` of this file. > > Thanks but I feel having all the vector classes (like for vecA, vecX etc) together would be better and keeping the reg_class definitions with other reg_class feels better to me. Hope that's ok? ACK >> src/hotspot/cpu/aarch64/aarch64.ad line 5091: >> >>> 5089: format %{ %} >>> 5090: interface(REG_INTER); >>> 5091: %} >> >> Ditto, I tend to moving this change `after line 5101` of this file. > > Same explanation here. the registers are matching `vReg` which can be either `vecA`, `vecD` or `vecX` and thus I placed the operand definitions right after the `vReg` definition. ACK ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2227776401 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2227777777 From chagedorn at openjdk.org Thu Jul 24 08:29:07 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 24 Jul 2025 08:29:07 GMT Subject: RFR: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? In-Reply-To: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> References: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> Message-ID: On Tue, 22 Jul 2025 08:25:08 GMT, Roland Westrelin wrote: > A node in a pre loop only has uses out of the loop dominated by the > loop exit. `PhaseIdealLoop::try_sink_out_of_loop()` sets its control > to the loop exit projection. A range check in the main loop has this > node as input (through a chain of some other nodes). 
Range check > elimination needs to update the exit condition of the pre loop with an > expression that depends on the node pinned on its exit: that's > impossible and the assert fires. This is a variant of 8314024 (this > one was for a node with uses out of the pre loop on multiple paths). I > propose the same fix: leave the node with control in the pre loop in > this case. Some small comments, otherwise, looks good! I launched some testing. Somehow forgot to submit my review from 2 days ago... src/hotspot/share/opto/loopopts.cpp line 1929: > 1927: // Sinking a node from a pre loop to its main loop pins the node between the pre and main loops. If that node is input > 1928: // to a check that's eliminated by range check elimination, it becomes input to an expression that feeds into the exit > 1929: // test of the pre loop above the point in the graph where it's pinned. I suggest to move it up as a method comment: Suggestion: // Sinking a node from a pre loop to its main loop pins the node between the pre and main loops. If that node is input // to a check that's eliminated by range check elimination, it becomes input to an expression that feeds into the exit // test of the pre loop above the point in the graph where it's pinned. bool PhaseIdealLoop::would_sink_below_pre_loop_exit(IdealLoopTree* n_loop, Node* ctrl) { test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java line 30: > 28: * > 29: * @run main/othervm -XX:CompileCommand=compileonly,*TestSunkRangeFromPreLoopRCE2*::* -Xbatch TestSunkRangeFromPreLoopRCE2 > 30: * Suggestion: test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java line 63: > 61: } > 62: } > 63: Suggestion: test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE3.java line 31: > 29: * > 30: * @run main/othervm -XX:-BackgroundCompilation -XX:LoopUnrollLimit=100 -XX:-UseLoopPredicate -XX:-UseProfiledLoopPredicate TestSunkRangeFromPreLoopRCE3 > 31: * Suggestion: ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26424#pullrequestreview-3042110108 PR Review Comment: https://git.openjdk.org/jdk/pull/26424#discussion_r2221904998 PR Review Comment: https://git.openjdk.org/jdk/pull/26424#discussion_r2221907804 PR Review Comment: https://git.openjdk.org/jdk/pull/26424#discussion_r2221901405 PR Review Comment: https://git.openjdk.org/jdk/pull/26424#discussion_r2221909155 From chagedorn at openjdk.org Thu Jul 24 08:39:01 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 24 Jul 2025 08:39:01 GMT Subject: RFR: 8354244: Use random data in MinMaxRed_Long data arrays In-Reply-To: References: Message-ID: On Thu, 24 Jul 2025 06:45:59 GMT, Galder Zamarre?o wrote: > Simplified the data used in the tests added in [JDK-8307513](https://bugs.openjdk.org/browse/JDK-8307513). The data does not need to have a specific shape because this test focuses on verifying the IR when vectorization kicks in, and when it does, the data can just be random. Shaping the data to control branch taken/not-taken paths makes sense when CMov macro expansion kicks in instead of vectorization. > > When switching to random data I noticed that the test was randomly failing. This was due to potential overflows that result from takin the min/max and then multiplying it by 11, so I've adjusted that section of the test as well. > > I've run the test on both aarch64 and x64 platforms where this test would get vectorized. 
To verify that I made sure the test passed and verified that the jtr output to make sure the IR conditions were matched. Nice clean-up by using the `Generators`. Looks good to me! Let me submit some testing with the updated test only. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26451#pullrequestreview-3050667024 From roland at openjdk.org Thu Jul 24 08:41:52 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 24 Jul 2025 08:41:52 GMT Subject: RFR: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? [v2] In-Reply-To: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> References: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> Message-ID: > A node in a pre loop only has uses out of the loop dominated by the > loop exit. `PhaseIdealLoop::try_sink_out_of_loop()` sets its control > to the loop exit projection. A range check in the main loop has this > node as input (through a chain of some other nodes). Range check > elimination needs to update the exit condition of the pre loop with an > expression that depends on the node pinned on its exit: that's > impossible and the assert fires. This is a variant of 8314024 (this > one was for a node with uses out of the pre loop on multiple paths). I > propose the same fix: leave the node with control in the pre loop in > this case. Roland Westrelin has updated the pull request incrementally with four additional commits since the last revision: - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE3.java Co-authored-by: Christian Hagedorn - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java Co-authored-by: Christian Hagedorn - Update src/hotspot/share/opto/loopopts.cpp Co-authored-by: Christian Hagedorn - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26424/files - new: https://git.openjdk.org/jdk/pull/26424/files/8abae076..2140c98d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26424&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26424&range=00-01 Stats: 9 lines in 3 files changed: 3 ins; 6 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26424.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26424/head:pull/26424 PR: https://git.openjdk.org/jdk/pull/26424 From aph at openjdk.org Thu Jul 24 08:42:17 2025 From: aph at openjdk.org (Andrew Haley) Date: Thu, 24 Jul 2025 08:42:17 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v16] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 11:09:04 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. 
The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Refine comments in c2_MacroAssembler_aarch64.cpp src/hotspot/cpu/aarch64/aarch64_vector.ad line 261: > 259: > 260: // Because the SVE2 "tbl" instruction is unpredicated and partial operations cannot be generated > 261: // using masks, we currently disable this operation on machines where length_in_bytes < Suggestion: // using masks, we disable this operation on machines where length_in_bytes < ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2227850978 From aph at openjdk.org Thu Jul 24 08:45:09 2025 From: aph at openjdk.org (Andrew Haley) Date: Thu, 24 Jul 2025 08:45:09 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v16] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 11:09:04 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. 
>> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Refine comments in c2_MacroAssembler_aarch64.cpp src/hotspot/cpu/aarch64/aarch64_vector.ad line 256: > 254: // the default VectorRearrange + VectorBlend is generated as the performance of the default > 255: // implementation was slightly better/similar than the implementation for SelectFromTwoVector. > 256: if (UseSVE < 2 && (type2aelembytes(bt) == 8 || length_in_bytes > 16)) { Suggestion: // This operation is disabled for doubles and longs on machines with SVE < 2 and instead // the default VectorRearrange + VectorBlend is generated because the performance of the default // implementation was better than or equal to the implementation for SelectFromTwoVector. if (UseSVE < 2 && (type2aelembytes(bt) == 8 || length_in_bytes > 16)) { ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2227858836 From thartmann at openjdk.org Thu Jul 24 09:03:05 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 24 Jul 2025 09:03:05 GMT Subject: RFR: 8363357: Remove unused flag VerifyAdapterCalls In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 07:46:58 GMT, Marc Chevalier wrote: > It seems that the flag VerifyAdapterCalls is unused since [JDK-8350209](https://bugs.openjdk.org/browse/JDK-8350209), so pretty recently. > > Let's remove it, very direct, no trick. Looks good to me too. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26440#pullrequestreview-3050754843 From mhaessig at openjdk.org Thu Jul 24 09:16:07 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Thu, 24 Jul 2025 09:16:07 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v2] In-Reply-To: References: Message-ID: On Fri, 4 Jul 2025 02:33:33 GMT, Dean Long wrote: >> The VerifyStack logic in Deoptimization::unpack_frames() attempts to check the expression stack size of the interpreter frame against what GenerateOopMap computes. To do this, it needs to know if the state at the current bci represents the "before" state, meaning the bytecode will be reexecuted, or the "after" state, meaning we will advance to the next bytecode. The old code didn't know how to determine exactly what state we were in, so it checked both. This PR cleans that up, so we only have to compute the oopmap once. It also removes old SPARC support. 
> > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > fix optimized build Thank you for this enhancement, @dean-long! I made a first pass to try and understand the logic, but ended up only commenting on cosmetics. I'll do a second pass next week. src/hotspot/share/runtime/deoptimization.cpp line 847: > 845: > 846: #ifndef PRODUCT > 847: #ifdef ASSERT Why is both `NOT_PRODUCT` and `ASSERT` needed here? So far, I thought that `ASSERT` implies `NOT_PRODUCT`. src/hotspot/share/runtime/deoptimization.cpp line 939: > 937: bool is_top_frame = true; > 938: int callee_size_of_parameters = 0; > 939: for (int i = 0; i < cur_array->frames(); i++) { I would suggest renaming `i`to `frame_idx` because there is one usage 50 lines down that would be much more clear with a more verbose name. src/hotspot/share/runtime/deoptimization.cpp line 947: > 945: > 946: // Get the oop map for this bci > 947: InterpreterOopMap mask; Perhaps you could move that down to line 977. It would just be one less variable to keep track. src/hotspot/share/runtime/deoptimization.cpp line 950: > 948: int cur_invoke_parameter_size = 0; > 949: int top_frame_expression_stack_adjustment = 0; > 950: int bci = iframe->interpreter_frame_bci(); `bci` is only used in the `BytecodeStream` constructor below. I would suggest to just call `iframe->interpreter_frame_bci()` in the constructor and forego the variable definition. src/hotspot/share/runtime/deoptimization.cpp line 999: > 997: (iframe_expr_size == mask.expression_stack_size() + callee_size_of_parameters) > 998: )))) > 999: { Suggestion: int iframe_expr_size = iframe->interpreter_frame_expression_stack_size(); int expr_stack_size_before = iframe_expr_size + (is_top_frame ? top_frame_expression_stack_adjustment : 0); if (!((is_top_frame && exec_mode == Unpack_exception && iframe_expr_size == 0) || (reexecute ? (expr_stack_size_before == mask.expression_stack_size() + cur_invoke_parameter_size) : (iframe_expr_size == mask.expression_stack_size() + callee_size_of_parameters) ))) { These parentheses can be simplified a bit. src/hotspot/share/runtime/vframeArray.cpp line 195: > 193: Bytecodes::Code code = Bytecodes::code_at(method(), bcp); > 194: assert(!Interpreter::bytecode_should_reexecute(code), "should_reexecute mismatch"); > 195: } This might be a candidate for `#ifdef ASSERT`? Suggestion: #ifdef ASSERT if (!reexec) { address bcp = method()->bcp_from(bci()); Bytecodes::Code code = Bytecodes::code_at(method(), bcp); assert(!Interpreter::bytecode_should_reexecute(code), "should_reexecute mismatch"); } #endif src/hotspot/share/runtime/vframeArray.cpp line 239: > 237: pc = Interpreter::deopt_reexecute_entry(method(), bcp); > 238: } > 239: assert(reexecute, "must be"); This assert is a bit redundant with the condition on this branch and `reexecute` not being assigned to. Suggestion: src/hotspot/share/runtime/vframeArray.hpp line 1: > 1: /* Please update the copyright year in this file. ------------- Changes requested by mhaessig (Committer). 
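As background for the NOT_PRODUCT vs. ASSERT question above, a minimal sketch of how the two guards usually differ across build flavors (comments are illustrative only; the "optimized" flavor is where the two diverge):

    #ifndef PRODUCT
      // compiled into debug, fastdebug and "optimized" builds:
      // diagnostic and printing support stays available even when asserts are off
    #endif

    #ifdef ASSERT
      // compiled into debug and fastdebug builds only:
      // assert(...)-based verification is stripped from "optimized" builds
    #endif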
PR Review: https://git.openjdk.org/jdk/pull/26121#pullrequestreview-3050547887 PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2227775314 PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2227853547 PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2227821569 PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2227832230 PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2227790942 PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2227879608 PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2227895145 PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2227887220 From mhaessig at openjdk.org Thu Jul 24 09:16:08 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Thu, 24 Jul 2025 09:16:08 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v2] In-Reply-To: References: Message-ID: On Thu, 24 Jul 2025 08:13:40 GMT, Manuel H?ssig wrote: >> Dean Long has updated the pull request incrementally with one additional commit since the last revision: >> >> fix optimized build > > src/hotspot/share/runtime/deoptimization.cpp line 999: > >> 997: (iframe_expr_size == mask.expression_stack_size() + callee_size_of_parameters) >> 998: )))) >> 999: { > > Suggestion: > > int iframe_expr_size = iframe->interpreter_frame_expression_stack_size(); > int expr_stack_size_before = iframe_expr_size + (is_top_frame ? top_frame_expression_stack_adjustment : 0); > > if (!((is_top_frame && exec_mode == Unpack_exception && iframe_expr_size == 0) || > (reexecute ? > (expr_stack_size_before == mask.expression_stack_size() + cur_invoke_parameter_size) : > (iframe_expr_size == mask.expression_stack_size() + callee_size_of_parameters) > ))) { > > These parentheses can be simplified a bit. Suggestion: int iframe_expr_ssize = iframe->interpreter_frame_expression_stack_size(); int map_expr_invoke_ssize = mask.expression_stack_size() + cur_invoke_parameter_size; int expr_ssize_before = iframe_expr_ssize + (is_top_frame ? top_frame_expression_stack_adjustment : 0); int map_expr_callee_ssize = mask.expression_stack_size() + callee_size_of_parameters; if (!((is_top_frame && exec_mode == Unpack_exception && iframe_expr_ssize == 0) || (reexecute ? expr_ssize_before == map_expr_invoke_ssize : iframe_expr_ssize == map_expr_callee_ssize) )) { Personally, I would write something like this, but feel free to disregard. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2227794608 From mchevalier at openjdk.org Thu Jul 24 09:25:00 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 24 Jul 2025 09:25:00 GMT Subject: RFR: 8363357: Remove unused flag VerifyAdapterCalls In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 07:46:58 GMT, Marc Chevalier wrote: > It seems that the flag VerifyAdapterCalls is unused since [JDK-8350209](https://bugs.openjdk.org/browse/JDK-8350209), so pretty recently. > > Let's remove it, very direct, no trick. Thanks @chhagedorn & @TobiHartmann! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26440#issuecomment-3112717716 From mchevalier at openjdk.org Thu Jul 24 09:25:00 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 24 Jul 2025 09:25:00 GMT Subject: Integrated: 8363357: Remove unused flag VerifyAdapterCalls In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 07:46:58 GMT, Marc Chevalier wrote: > It seems that the flag VerifyAdapterCalls is unused since [JDK-8350209](https://bugs.openjdk.org/browse/JDK-8350209), so pretty recently. > > Let's remove it, very direct, no trick. This pull request has now been integrated. Changeset: 67e93281 Author: Marc Chevalier URL: https://git.openjdk.org/jdk/commit/67e93281a4f9e76419f1d6e05099ecf2214ebbfd Stats: 3 lines in 1 file changed: 0 ins; 3 del; 0 mod 8363357: Remove unused flag VerifyAdapterCalls Reviewed-by: chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/26440 From mhaessig at openjdk.org Thu Jul 24 09:29:55 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Thu, 24 Jul 2025 09:29:55 GMT Subject: RFR: 8354244: Use random data in MinMaxRed_Long data arrays In-Reply-To: References: Message-ID: <12f24F5wkFZHU2Y-lhvDKeyyBUy-DDzQVrYK5djx5AI=.fef6b9d3-d761-4429-9e57-c2829b24f59f@github.com> On Thu, 24 Jul 2025 06:45:59 GMT, Galder Zamarre?o wrote: > Simplified the data used in the tests added in [JDK-8307513](https://bugs.openjdk.org/browse/JDK-8307513). The data does not need to have a specific shape because this test focuses on verifying the IR when vectorization kicks in, and when it does, the data can just be random. Shaping the data to control branch taken/not-taken paths makes sense when CMov macro expansion kicks in instead of vectorization. > > When switching to random data I noticed that the test was randomly failing. This was due to potential overflows that result from takin the min/max and then multiplying it by 11, so I've adjusted that section of the test as well. > > I've run the test on both aarch64 and x64 platforms where this test would get vectorized. To verify that I made sure the test passed and verified that the jtr output to make sure the IR conditions were matched. Thank you for this nice simplification, @galderz! It looks good to me as well. ------------- Marked as reviewed by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/26451#pullrequestreview-3050855728 From bkilambi at openjdk.org Thu Jul 24 09:32:06 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Thu, 24 Jul 2025 09:32:06 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v16] In-Reply-To: References: Message-ID: On Thu, 24 Jul 2025 08:38:52 GMT, Andrew Haley wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Refine comments in c2_MacroAssembler_aarch64.cpp > > src/hotspot/cpu/aarch64/aarch64_vector.ad line 261: > >> 259: >> 260: // Because the SVE2 "tbl" instruction is unpredicated and partial operations cannot be generated >> 261: // using masks, we currently disable this operation on machines where length_in_bytes < > > Suggestion: > > // using masks, we disable this operation on machines where length_in_bytes < Thanks. 
I use "currently" here because even now we can add support for cases where `length_in_bytes < MaxVectorSize && length_in_bytes > 8` but we currently do not have machines with SVE2 enabled and `length_in_bytes < MaxVectorSize && length_in_bytes > 8` for example - we do not have an SVE2 machine with `MaxVectorSize = 256` to test this operation for `length_in_bytes = 128` but we can add that support in future if such machines become available for testing. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2227979550 From eastigeevich at openjdk.org Thu Jul 24 10:05:57 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Thu, 24 Jul 2025 10:05:57 GMT Subject: RFR: 8359963: compiler/c2/aarch64/TestStaticCallStub.java fails with for code cache > 250MB the static call stub is expected to be implemented using far branch In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 15:24:42 GMT, Mikhail Ablakatov wrote: > The test assumed that hsdis is always available which is not the case. Make the test accept and scan either real or pseudo disassembly. test/hotspot/jtreg/compiler/c2/aarch64/TestStaticCallStub.java line 319: > 317: ProcessBuilder pb = ProcessTools.createLimitedTestJavaProcessBuilder(procArgs); > 318: OutputAnalyzer output = new OutputAnalyzer(pb.start()); > 319: System.out.println(output.getOutput()); I think it is worth to have `output.shouldHaveExitValue(0);`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26047#discussion_r2228065539 From eastigeevich at openjdk.org Thu Jul 24 10:21:54 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Thu, 24 Jul 2025 10:21:54 GMT Subject: RFR: 8359963: compiler/c2/aarch64/TestStaticCallStub.java fails with for code cache > 250MB the static call stub is expected to be implemented using far branch In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 15:24:42 GMT, Mikhail Ablakatov wrote: > The test assumed that hsdis is always available which is not the case. Make the test accept and scan either real or pseudo disassembly. test/hotspot/jtreg/compiler/c2/aarch64/TestStaticCallStub.java line 272: > 270: while (itr.hasNext() && extracted.size() < n) { > 271: int left = n - extracted.size(); > 272: extractOpcodeOrBytecodes(itr.next()).stream().limit(left).forEach(extracted::add); You can detect whether you have hex codes or disassembly. See https://github.com/openjdk/jdk/blob/a86dd56de34f730b42593236f17118ef5ce4985a/test/hotspot/jtreg/compiler/onSpinWait/TestOnSpinWaitAArch64.java#L122 In TestOnSpinWaitAArch64 you can see how the checking code is organized not to depend on the instruction representation. IMO you don't need the Instruction class hierarchy. You need `nearStaticCallOpcodeSeq` and `farStaticCallOpcodeSeq` to be filled either with opcodes or hex codes. Of course their names need to be changed to something like `nearStaticCallInstSeq`. You will need to change `extractOpcodesN` and `extractOpcode` to `extractInstructionsN` and `extractInstruction`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26047#discussion_r2228099896 From galder at openjdk.org Thu Jul 24 10:34:38 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Thu, 24 Jul 2025 10:34:38 GMT Subject: RFR: 8329077: C2 SuperWord: Add MoveD2L, MoveL2D, MoveF2I, MoveI2F Message-ID: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com> I've added support to vectorize `MoveD2L`, `MoveL2D`, `MoveF2I` and `MoveI2F` nodes. 
The implementation follows a similar pattern to what is done with conversion (`Conv*`) nodes. The tests in `TestCompatibleUseDefTypeSize` have been updated with the new expectations. Also added a JMH benchmark which measures throughput (the higher the number the better) for methods that exercise these nodes. On darwin/aarch64 it shows: Benchmark (seed) (size) Mode Cnt Base Patch Units Diff VectorBitConversion.doubleToLongBits 0 2048 thrpt 8 1168.782 1157.717 ops/ms -1% VectorBitConversion.doubleToRawLongBits 0 2048 thrpt 8 3999.387 7353.936 ops/ms +83% VectorBitConversion.floatToIntBits 0 2048 thrpt 8 1200.338 1188.206 ops/ms -1% VectorBitConversion.floatToRawIntBits 0 2048 thrpt 8 4058.248 14792.474 ops/ms +264% VectorBitConversion.intBitsToFloat 0 2048 thrpt 8 3050.313 14984.246 ops/ms +391% VectorBitConversion.longBitsToDouble 0 2048 thrpt 8 3022.691 7379.360 ops/ms +144% The improvements observed are a result of vectorization. The lack of vectorization in `doubleToLongBits` and `floatToIntBits` demonstrates that these changes do not affect their performance. These methods do not vectorize because of flow control. I've run the tier1-3 tests on linux/aarch64 and didn't observe any regressions. ------------- Commit messages: - Removed unnecessary assert methods - Adjust IR test after adding Move* vector support - Delete IR test because it's already covered by other test - Merge branch 'master' into topic.fp-bits-vector - Add longBitsToDouble and intBitsToFloat - Fix test for vectorized and add floatToRawIntBits - Add basic IR test - Add JMH benchmark for doubleTo*LongBits - Support doubleToRawLongBits - add floatToIntBits benchmark - ... and 4 more: https://git.openjdk.org/jdk/compare/c68697e1...b6ec784e Changes: https://git.openjdk.org/jdk/pull/26457/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26457&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8329077 Stats: 164 lines in 7 files changed: 153 ins; 4 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/26457.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26457/head:pull/26457 PR: https://git.openjdk.org/jdk/pull/26457 From duke at openjdk.org Thu Jul 24 10:48:55 2025 From: duke at openjdk.org (Samuel Chee) Date: Thu, 24 Jul 2025 10:48:55 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet [v2] In-Reply-To: References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: On Thu, 17 Jul 2025 14:31:18 GMT, Andrew Haley wrote: >> src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 1487: >> >>> 1485: if(!UseLSE) { >>> 1486: __ membar(__ AnyAny); >>> 1487: } >> >> Suggestion: >> >> if(!UseLSE) { >> // Prevent a later volatile load from being reordered with the STLXR in cmpxchg. >> __ membar(__ StoreLoad); >> } > > I wonder if it might be a good idea to add a `trailingDMB` boolean argument to `cmpxchg` and `atomic_##NAME` instead. @theRealAph coincidentally, I have been looking at `MacroAssembler::cmpxchgw` and `MacroAssembler::cmpxchgptr` recently, and it appears their trailing DMBs may also be unnecessary. I have been unable to find any particular use patterns which relies on the existence of these trailing dmbs, so it does not seem necessary to add the trailingDMB option. Although would like to hear your thoughts on the issue. 
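For concreteness, the shape of the option discussed (and argued against) above would be roughly the following; the helper name and its parameters are assumptions for illustration, not existing code in macroAssembler_aarch64:

    // Sketch only: let each caller decide whether the trailing barrier is emitted.
    void emit_cmpxchg_with_optional_barrier(MacroAssembler* masm, bool trailing_dmb) {
      // ... emit the existing LSE or LL/SC compare-and-swap sequence here ...
      if (trailing_dmb && !UseLSE) {
        // only callers that still rely on the trailing barrier pay for it
        masm->membar(Assembler::StoreLoad);
      }
    }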
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26000#discussion_r2228158946 From aph at openjdk.org Thu Jul 24 10:55:03 2025 From: aph at openjdk.org (Andrew Haley) Date: Thu, 24 Jul 2025 10:55:03 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v16] In-Reply-To: References: Message-ID: <8iU5uANNkE7vz5W5c47kFrh0mjZzvzbIRfI41QOKTrk=.3a657c40-c2c2-4e29-8364-352a240f06a5@github.com> On Thu, 24 Jul 2025 09:29:43 GMT, Bhavana Kilambi wrote: >> src/hotspot/cpu/aarch64/aarch64_vector.ad line 261: >> >>> 259: >>> 260: // Because the SVE2 "tbl" instruction is unpredicated and partial operations cannot be generated >>> 261: // using masks, we currently disable this operation on machines where length_in_bytes < >> >> Suggestion: >> >> // using masks, we disable this operation on machines where length_in_bytes < > > Thanks. I use "currently" here because even now we can add support for cases where `length_in_bytes < MaxVectorSize && length_in_bytes > 8` but we currently do not have machines with SVE2 enabled and `length_in_bytes < MaxVectorSize && length_in_bytes > 8` for example - we do not have an SVE2 machine with `MaxVectorSize = 256` to test this operation for `length_in_bytes = 128` but we can add that support in future if such machines become available for testing. Sure, but "currently" doesn't help the reader to understand that. If you want to say we don't support this because at the time of writing we don't have machines with SVE2 enabled and length_in_bytes >128 so we can't test it, then say so. Explicitly. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2228171020 From mli at openjdk.org Thu Jul 24 11:25:37 2025 From: mli at openjdk.org (Hamlin Li) Date: Thu, 24 Jul 2025 11:25:37 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v4] In-Reply-To: References: Message-ID: > Hi, > Can you help to review this patch? > > By https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp#L1270, there are far call, indirect call, reloc call. > NativeFarCall is in fact a reloc call, the name is confusing, better to rename it to RelocCall. > Also add some comments and do some other simple cleanup. > > Thanks! Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: remove NativeFarCall/RelocCall ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26370/files - new: https://git.openjdk.org/jdk/pull/26370/files/f72db245..37820220 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26370&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26370&range=02-03 Stats: 133 lines in 2 files changed: 11 ins; 96 del; 26 mod Patch: https://git.openjdk.org/jdk/pull/26370.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26370/head:pull/26370 PR: https://git.openjdk.org/jdk/pull/26370 From ghan at openjdk.org Thu Jul 24 15:16:31 2025 From: ghan at openjdk.org (Guanqiang Han) Date: Thu, 24 Jul 2025 15:16:31 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" Message-ID: I'm able to consistently reproduce the problem using the following command line and test program ? 
java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java import java.util.Arrays; public class Test{ public static void main(String[] args) { System.out.println("begin"); byte[] arr1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; byte[] arr2 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; System.out.println(Arrays.equals(arr1, arr2)); System.out.println("end"); } } >From my analysis, the root cause appears to be a mismatch in operand handling between T_ADDRESS and T_LONG in LIR_Assembler::stack2reg, especially when the source is marked as double stack (e.g., T_LONG) and the destination as single CPU register (e.g., T_ADDRESS), leading to assertion failures like assert(is_single_stack())(because T_LONG is double_size). In the test program above , the call chain is: Arrays.equals ? ArraysSupport.vectorizedMismatch ? LIRGenerator::do_vectorizedMismatch Within the do_vectorizedMismatch() method, a move instruction constructs an LIR_Op1. During LIR to machine code generation, LIR_Assembler::stack2reg was called. In this case, the src operand has type T_LONG and the dst operand has type T_ADDRESS. This combination triggers an assert in stack2reg, due to a mismatch between the stack slot type and register type handling. Importantly, this path ( LIR_Assembler::stack2reg was called ) is only taken when src is forced onto the stack. To reliably trigger this condition, the test is run with the -Xcomp option to force compilation and increase register pressure. A reference to the relevant code paths is provided below : image1 image2 On 64-bit platforms, although T_ADDRESS is classified as single_size, it is in fact 64 bits wide ,represent a single 64-bit general-purpose register and it can hold a T_LONG value, which is also 64 bits. However, T_LONG is defined as double_size, requiring two local variable slots or a pair of registers in the JVM's abstract model. This mismatch stems from the fact that T_ADDRESS is platform-dependent: it's 32 bits on 32-bit platforms, and 64 bits on 64-bit platforms ? yet its size classification remains single_size regardless. This classification leads to mismatches in internal logic, such as when T_LONG values are moved into a T_ADDRESS-typed destination on 64-bit platforms, despite both being 64-bit wide. I believe this discrepancy is largely historical, originating from the need to support both 32-bit and 64-bit architectures. Given that most modern platforms are 64-bit, I propose to simplify and clarify this handling by allowing T_ADDRESS to accept T_LONG data during move operations when targeting 64-bit platforms. So i suggest relaxing the type checks for platform-dependent types such as T_ADDRESS and T_METADATA. 
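To make the size mismatch concrete, here is a small illustrative sketch (assumed helper behavior, not the proposed patch) of the two ways a pointer-sized temporary can be allocated in C1's LIRGenerator:

    // Illustration only, not the actual fix.
    LIR_Opr t1 = new_pointer_register();   // on 64-bit platforms this yields a T_LONG operand:
                                           // double_size, so it spills to a pair of stack slots
    LIR_Opr t2 = new_register(T_ADDRESS);  // single_size: one 64-bit register, one stack slot

    // If t1 is spilled under register pressure (which -Xcomp makes likely) and is then moved
    // into a T_ADDRESS destination, LIR_Assembler::stack2reg sees a double-stack source paired
    // with a single-register destination, which is the combination rejected by
    // assert(is_single_stack() && !is_virtual(), "type check").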
------------- Commit messages: - 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" Changes: https://git.openjdk.org/jdk/pull/26462/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26462&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8359235 Stats: 2 lines in 2 files changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26462.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26462/head:pull/26462 PR: https://git.openjdk.org/jdk/pull/26462 From tschatzl at openjdk.org Thu Jul 24 15:29:56 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Thu, 24 Jul 2025 15:29:56 GMT Subject: RFR: 8361376: Regressions 1-6% in several Renaissance in 26-b4 only MacOSX aarch64 In-Reply-To: References: Message-ID: <9OGIoq7EaDQnBhnzMzX3sHGq99xQHthjkG4xbxvSDzc=.9d5ed8b8-62bf-4f4e-b0dc-3cfb3193afca@github.com> On Sat, 19 Jul 2025 01:39:12 GMT, Dean Long wrote: > This PR removes the recently added lock around set_guard_value, using instead Atomic::cmpxchg to atomically update bit-fields of the guard value. Further, it takes a fast-path that uses the previous direct store when at a safepoint. Combined, these changes should get us back to almost where we were before in terms of overhead. If necessary, we could go even further and allow make_not_entrant() to perform a direct byte store, leaving 24 bits for the guard value. Afaics the `NMethodEntryBarrier_lock` declaration/definition can also be removed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26399#issuecomment-3113913189 From duke at openjdk.org Thu Jul 24 18:40:07 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 24 Jul 2025 18:40:07 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v40] In-Reply-To: References: Message-ID: On Wed, 12 Mar 2025 18:49:38 GMT, Vladimir Kozlov wrote: >> Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: >> >> Use CompiledICLocker instead of CompiledIC_lock > > src/hotspot/share/code/nmethod.cpp line 1514: > >> 1512: >> 1513: // Copy all nmethod data outside of header >> 1514: memcpy(content_begin(), nm.content_begin(), nm.size() - nm.header_size()); > > You would not need it if you `memcpy` whole nmethod. Decided not to use `memcpy` for the time being https://github.com/openjdk/jdk/pull/23573#discussion_r2220591570 > src/hotspot/share/code/nmethod.cpp line 1595: > >> 1593: } >> 1594: >> 1595: bool nmethod::is_relocatable() const { > > Native nmethods should be skipped too. May be also check `is_in_use()`. `is_relocatable()` was updated to check `is_java_method()` and `is_in_use()` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229264958 PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229269758 From dlong at openjdk.org Thu Jul 24 18:43:53 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 24 Jul 2025 18:43:53 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" In-Reply-To: References: Message-ID: On Thu, 24 Jul 2025 15:10:37 GMT, Guanqiang Han wrote: > I'm able to consistently reproduce the problem using the following command line and test program ? 
> > java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java > > import java.util.Arrays; > public class Test{ > public static void main(String[] args) { > System.out.println("begin"); > byte[] arr1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > byte[] arr2 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > System.out.println(Arrays.equals(arr1, arr2)); > System.out.println("end"); > } > } > > From my analysis, the root cause appears to be a mismatch in operand handling between T_ADDRESS and T_LONG in LIR_Assembler::stack2reg, especially when the source is marked as double stack (e.g., T_LONG) and the destination as single CPU register (e.g., T_ADDRESS), leading to assertion failures like assert(is_single_stack())(because T_LONG is double_size). > > In the test program above , the call chain is: Arrays.equals ? ArraysSupport.vectorizedMismatch ? LIRGenerator::do_vectorizedMismatch > Within the do_vectorizedMismatch() method, a move instruction constructs an LIR_Op1. During LIR to machine code generation, LIR_Assembler::stack2reg was called. > > In this case, the src operand has type T_LONG and the dst operand has type T_ADDRESS. This combination triggers an assert in stack2reg, due to a mismatch between the stack slot type and register type handling. > > Importantly, this path ( LIR_Assembler::stack2reg was called ) is only taken when src is forced onto the stack. To reliably trigger this condition, the test is run with the -Xcomp option to force compilation and increase register pressure. > > A reference to the relevant code paths is provided below : > image1 > image2 > > On 64-bit platforms, although T_ADDRESS is classified as single_size, it is in fact 64 bits wide ,represent a single 64-bit general-purpose register and it can hold a T_LONG value, which is also 64 bits. > > However, T_LONG is defined as double_size, requiring two local variable slots or a pair of registers in the JVM's abstract model. This mismatch stems from the fact that T_ADDRESS is platform-dependent: it's 32 bits on 32-bit platforms, and 64 bits on 64-bit platforms ? yet its size classification remains single_size regardless. > > This classification... I think it is good to detect mismatches between T_LONG and T_ADDRESS, so I'd rather not relax the checks. Why not fix do_vectorizedMismatch() to use new_register(T_ADDRESS)? And maybe file a separate RFE to cleanup this confusion that new_pointer_register() causes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26462#issuecomment-3114475833 From duke at openjdk.org Thu Jul 24 18:51:12 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 24 Jul 2025 18:51:12 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v4] In-Reply-To: References: <3C9M1LWF86Hjsu8s3SbJLBpP5HfI3BHOkhid2SHFqVw=.6195af54-50c3-480b-8994-df7c317ac3bc@github.com> Message-ID: On Mon, 17 Mar 2025 22:18:15 GMT, Vladimir Kozlov wrote: >> Yes, we need to update call sites. Should we replace all resolved calls with calls to `resolve_*_call` blobs? >> Actually `clean_if_nmethod_is_unloaded()` do that. May be we indeed need to call `nmethod::cleanup_inline_caches_impl()` but without VM operation. > > We need to do that only for new copy of nmethod and not for old. `clear_inline_caches()` is called for the new copy. 
A safe point is not required because the code is not installed and therefore not executing ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229298345 From dlong at openjdk.org Thu Jul 24 18:51:22 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 24 Jul 2025 18:51:22 GMT Subject: RFR: 8361376: Regressions 1-6% in several Renaissance in 26-b4 only MacOSX aarch64 [v2] In-Reply-To: References: Message-ID: > This PR removes the recently added lock around set_guard_value, using instead Atomic::cmpxchg to atomically update bit-fields of the guard value. Further, it takes a fast-path that uses the previous direct store when at a safepoint. Combined, these changes should get us back to almost where we were before in terms of overhead. If necessary, we could go even further and allow make_not_entrant() to perform a direct byte store, leaving 24 bits for the guard value. Dean Long has updated the pull request incrementally with one additional commit since the last revision: remove NMethodEntryBarrier_lock ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26399/files - new: https://git.openjdk.org/jdk/pull/26399/files/ecc6e68e..e05605eb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26399&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26399&range=00-01 Stats: 4 lines in 2 files changed: 0 ins; 4 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26399.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26399/head:pull/26399 PR: https://git.openjdk.org/jdk/pull/26399 From duke at openjdk.org Thu Jul 24 18:56:05 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 24 Jul 2025 18:56:05 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v13] In-Reply-To: References: <1_OG6abFZwl9AWbxsm5eCrL6RWq1wTnPngdDky6V3f8=.3a126cd0-22c0-45bb-9f85-3f096de116d6@github.com> Message-ID: On Thu, 8 May 2025 19:59:01 GMT, Chad Rakoczy wrote: >> Actually the issue is not during code buffer expansion. It's called when creating a new nmethod that I can only get to occur when using the Graal compiler. So it may not be true that calls always have trampolines in the case of Graal. This _fix_ may just make the bug harder to encounter > > For debug builds Hotspot uses the 2M range to determine if there should be a trampoline or not for a call. Graal uses 128M regardless of debug or release builds. This means that Graal compiled methods may not have trampolines but this check will expect them too. I reverted this change as it just means there is a difference on how Graal and Hotspot determine max branch range This change was reverted ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229309026 From duke at openjdk.org Thu Jul 24 18:56:06 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 24 Jul 2025 18:56:06 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v13] In-Reply-To: References: <-Y_yLYd2I_ZJk0kcEF8ZqyVKtI0OiXT0WFtvHLhUWJU=.be99c2c6-e27d-448d-910d-ee5dfc97e12d@github.com> Message-ID: On Fri, 25 Apr 2025 18:04:37 GMT, Chad Rakoczy wrote: >> @chadrako, what issue are you trying to fix with the code? 
> > After relocation it is possible that the call can no longer reach the destination without calling the trampoline The implemented solution for this issue is to allow trampoline relocations to fix their owners instead of modifying the call relocation logic ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229308367 From duke at openjdk.org Thu Jul 24 18:56:07 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 24 Jul 2025 18:56:07 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v13] In-Reply-To: References: Message-ID: On Mon, 28 Apr 2025 19:28:33 GMT, Chad Rakoczy wrote: >> src/hotspot/share/code/relocInfo.cpp line 379: >> >>> 377: } else { >>> 378: // Reassert the callee address, this time in the new copy of the code. >>> 379: pd_set_call_destination(callee); >> >> if (src->contains(callee)) { >> // ... >> int offset = pointer_delta_as_int(callee, orig_addr); >> callee = addr() + offset; >> } >> pd_set_call_destination(callee); > > I'll use this refactor to remove the else but can't use `pointer_delta_as_int` as it only works for positive offsets This change is no longer needed. Trampolines are responsible for fixing their owners ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229311317 From duke at openjdk.org Thu Jul 24 19:00:11 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 24 Jul 2025 19:00:11 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v15] In-Reply-To: References: Message-ID: On Wed, 14 May 2025 15:58:53 GMT, Tom Rodriguez wrote: >> I assume that's copied from `JVMCINMethodData::invalidate_nmethod_mirror` which was updated in https://github.com/openjdk/jdk/commit/f81c192da929d72be5134ccf195be2a985737504. The description for [JDK-8234359](https://bugs.openjdk.org/browse/JDK-8234359) implies that this somehow avoids enqueuing potentially dead objects to the SATB buffer. Is that what we want here @tkrodriguez ? > > It should be passing true here as we are not in the middle of a GC so it should be alive and valid. This change was removed as JVMCI nmethods with a mirror are currently excluded ([JDK-8357926](https://bugs.openjdk.org/browse/JDK-8357926)) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229319404 From duke at openjdk.org Thu Jul 24 19:00:12 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 24 Jul 2025 19:00:12 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v15] In-Reply-To: References: Message-ID: On Wed, 28 May 2025 17:37:54 GMT, Tom Rodriguez wrote: >> src/hotspot/share/jvmci/jvmciRuntime.cpp line 858: >> >>> 856: >>> 857: JVMCIEnv* jvmciEnv = nullptr; >>> 858: HotSpotJVMCI::InstalledCode::set_address(jvmciEnv, nmethod_mirror, (jlong)(nm)); >> >> What's the sync story here? Any lock protecting this? If not, I wonder if readers are okay with inconsistencies. I haven't checked. > > In the current implementation the fields of InstalledCode are initialized to valid values from the nmethod* during code installation. Those fields only ever transition to 0 as part of nmethod invalidation. Hosted methods may read `InstalledCode.entryPoint` and dispatch to it if it's non-null. So a transition of these values should be safe if they moved from a non-null value to another non-null value and the existing nmethod stayed alive until the next safepoint in the normal nmethod reclamation cycle. 
Currently writes to those fields by the VM are done in make_not_entrant or at a safepoint so we might want to perform more explicit locking to support transfer of these values. > > We might consider revisiting the design of InstalledCode itself now that Graal is aligned with the JDK. Backward compatibility precluded that in the past. That might simplify the whole thing. This change was removed as JVMCI nmethods with a mirror are currently excluded ([JDK-8357926](https://bugs.openjdk.org/browse/JDK-8357926)) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229321074 From duke at openjdk.org Thu Jul 24 19:05:10 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 24 Jul 2025 19:05:10 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v35] In-Reply-To: References: Message-ID: On Sun, 13 Jul 2025 09:36:48 GMT, Andrew Haley wrote: >> Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 95 commits: >> >> - Typo >> - Merge branch 'master' into JDK-8316694-Final >> - Update justification for skipping CallRelocation >> - Enclose ImmutableDataReferencesCounterSize in parentheses >> - Let trampolines fix their owners >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Update how call sites are fixed >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Fix pointer printing >> - Use set_destination_mt_safe >> - ... and 85 more: https://git.openjdk.org/jdk/compare/117f0b40...66d73c16 > > src/hotspot/share/code/nmethod.cpp line 1392: > >> 1390: >> 1391: >> 1392: nmethod::nmethod(nmethod* nm) : CodeBlob(nm->_name, nm->_kind, nm->_size, nm->_header_size) > > Should this be a copy constructor? > > nmethod::nmethod(const nmethod &nm) : CodeBlob(nm._name, nm._kind, nm._size, nm._header_size) > > Even if we don't make it a copy constructor, it looks like its nmethod argument should be `const`, but I haven't checked very deeply. The constructor was updated to be a copy constructor ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229328235 From duke at openjdk.org Thu Jul 24 19:05:11 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 24 Jul 2025 19:05:11 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v40] In-Reply-To: References: Message-ID: <1QJ2fDZMIMhn7vXB8_1Gpg_ZG2aFkuzWgmV9hmvI_E0=.e9f232f9-89e6-4c9b-a713-3cc016be0520@github.com> On Thu, 13 Mar 2025 13:54:43 GMT, Evgeny Astigeevich wrote: >> src/hotspot/share/code/nmethod.cpp line 1396: >> >>> 1394: } >>> 1395: >>> 1396: nmethod::nmethod(nmethod& nm) : CodeBlob(nm.name(), CodeBlobKind::Nmethod, nm.size(), nm.header_size()) >> >> Should this be `clone()` method instead of constructor. Then you will not need `new()`. 
> > +1 Decided not to use `memcpy` for the time being https://github.com/openjdk/jdk/pull/23573#discussion_r2220591570 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229331986 From dlong at openjdk.org Thu Jul 24 19:05:55 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 24 Jul 2025 19:05:55 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v2] In-Reply-To: References: Message-ID: On Thu, 24 Jul 2025 08:07:42 GMT, Manuel H?ssig wrote: >> Dean Long has updated the pull request incrementally with one additional commit since the last revision: >> >> fix optimized build > > src/hotspot/share/runtime/deoptimization.cpp line 847: > >> 845: >> 846: #ifndef PRODUCT >> 847: #ifdef ASSERT > > Why is both `NOT_PRODUCT` and `ASSERT` needed here? So far, I thought that `ASSERT` implies `NOT_PRODUCT`. Unfortunately, they are not the same, thanks to "optimized" builds. We can clean this up if optimizes builds get removed. See https://bugs.openjdk.org/browse/JDK-8183287. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2229334193 From duke at openjdk.org Thu Jul 24 19:11:11 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 24 Jul 2025 19:11:11 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v38] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 22:29:22 GMT, Vladimir Kozlov wrote: >> Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: >> >> Require caller to hold locks > > src/hotspot/share/code/codeBehaviours.cpp line 46: > >> 44: bool DefaultICProtectionBehaviour::is_safe(nmethod* method) { >> 45: return SafepointSynchronize::is_at_safepoint() || CompiledIC_lock->owned_by_self() || method->is_not_installed(); >> 46: } > > Can you rename `method` to `nm` as we call it in similar code in GCs? This has been updated > src/hotspot/share/code/nmethod.cpp line 1630: > >> 1628: if (!is_java_method()) { >> 1629: return false; >> 1630: } > > This should be first check. This has been fixed > src/hotspot/share/code/nmethod.cpp line 2453: > >> 2451: // Free memory if this is the last nmethod referencing immutable data >> 2452: if (get_immutable_data_references_counter() == 1) { >> 2453: os::free(_immutable_data); > > You should add assert(get_immutable_data_references_counter() > 0 before `if (counter == 1)` > and zero it when freed. 
This has been fixed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229341201 PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229345858 PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229346439 From duke at openjdk.org Thu Jul 24 19:11:12 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 24 Jul 2025 19:11:12 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: <3f1UnDuYp2iYVcciKF-BqdChOOY2PJJG5R0QuyfblVM=.37a92dfd-5de4-4924-83c5-f9c2e5d7548c@github.com> References: <73AnlXOv0T8K25DgsNdH1PkBjcBXz0f3bBYZx44LpAw=.439f5383-ffd1-44e8-9e11-4b5af9b6a278@github.com> <3f1UnDuYp2iYVcciKF-BqdChOOY2PJJG5R0QuyfblVM=.37a92dfd-5de4-4924-83c5-f9c2e5d7548c@github.com> Message-ID: On Wed, 2 Jul 2025 20:47:44 GMT, Chad Rakoczy wrote: >> test/hotspot/jtreg/vmTestbase/nsk/jvmti/NMethodRelocation/nmethodrelocation.java line 37: >> >>> 35: import jdk.test.whitebox.code.BlobType; >>> 36: >>> 37: public class nmethodrelocation extends DebugeeClass { >> >> Why is the class name not following the Java code conventions? > > I was following the naming conventions of other JVMTI tests. > https://github.com/openjdk/jdk/tree/master/test/hotspot/jtreg/vmTestbase/nsk/jvmti Is the name of this test acceptable? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229340005 From dlong at openjdk.org Thu Jul 24 19:14:11 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 24 Jul 2025 19:14:11 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v3] In-Reply-To: References: Message-ID: > The VerifyStack logic in Deoptimization::unpack_frames() attempts to check the expression stack size of the interpreter frame against what GenerateOopMap computes. To do this, it needs to know if the state at the current bci represents the "before" state, meaning the bytecode will be reexecuted, or the "after" state, meaning we will advance to the next bytecode. The old code didn't know how to determine exactly what state we were in, so it checked both. This PR cleans that up, so we only have to compute the oopmap once. It also removes old SPARC support. Dean Long has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/share/runtime/deoptimization.cpp Co-authored-by: Manuel H?ssig ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26121/files - new: https://git.openjdk.org/jdk/pull/26121/files/5b7d4bca..6042a475 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26121&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26121&range=01-02 Stats: 5 lines in 1 file changed: 0 ins; 1 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/26121.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26121/head:pull/26121 PR: https://git.openjdk.org/jdk/pull/26121 From dlong at openjdk.org Thu Jul 24 19:25:42 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 24 Jul 2025 19:25:42 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v4] In-Reply-To: References: Message-ID: > The VerifyStack logic in Deoptimization::unpack_frames() attempts to check the expression stack size of the interpreter frame against what GenerateOopMap computes. To do this, it needs to know if the state at the current bci represents the "before" state, meaning the bytecode will be reexecuted, or the "after" state, meaning we will advance to the next bytecode. 
The old code didn't know how to determine exactly what state we were in, so it checked both. This PR cleans that up, so we only have to compute the oopmap once. It also removes old SPARC support. Dean Long has updated the pull request incrementally with one additional commit since the last revision: better name for frame index ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26121/files - new: https://git.openjdk.org/jdk/pull/26121/files/6042a475..c93ca9e1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26121&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26121&range=02-03 Stats: 5 lines in 1 file changed: 1 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/26121.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26121/head:pull/26121 PR: https://git.openjdk.org/jdk/pull/26121 From dlong at openjdk.org Thu Jul 24 19:34:15 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 24 Jul 2025 19:34:15 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v5] In-Reply-To: References: Message-ID: > The VerifyStack logic in Deoptimization::unpack_frames() attempts to check the expression stack size of the interpreter frame against what GenerateOopMap computes. To do this, it needs to know if the state at the current bci represents the "before" state, meaning the bytecode will be reexecuted, or the "after" state, meaning we will advance to the next bytecode. The old code didn't know how to determine exactly what state we were in, so it checked both. This PR cleans that up, so we only have to compute the oopmap once. It also removes old SPARC support. Dean Long has updated the pull request incrementally with two additional commits since the last revision: - Update src/hotspot/share/runtime/vframeArray.cpp Co-authored-by: Manuel H?ssig - Update src/hotspot/share/runtime/vframeArray.cpp Co-authored-by: Manuel H?ssig ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26121/files - new: https://git.openjdk.org/jdk/pull/26121/files/c93ca9e1..54919be0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26121&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26121&range=03-04 Stats: 3 lines in 1 file changed: 2 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26121.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26121/head:pull/26121 PR: https://git.openjdk.org/jdk/pull/26121 From dlong at openjdk.org Thu Jul 24 19:38:43 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 24 Jul 2025 19:38:43 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v6] In-Reply-To: References: Message-ID: > The VerifyStack logic in Deoptimization::unpack_frames() attempts to check the expression stack size of the interpreter frame against what GenerateOopMap computes. To do this, it needs to know if the state at the current bci represents the "before" state, meaning the bytecode will be reexecuted, or the "after" state, meaning we will advance to the next bytecode. The old code didn't know how to determine exactly what state we were in, so it checked both. This PR cleans that up, so we only have to compute the oopmap once. It also removes old SPARC support. 
Dean Long has updated the pull request incrementally with one additional commit since the last revision: reviewer suggestions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26121/files - new: https://git.openjdk.org/jdk/pull/26121/files/54919be0..1e46cce1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26121&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26121&range=04-05 Stats: 8 lines in 2 files changed: 3 ins; 3 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26121.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26121/head:pull/26121 PR: https://git.openjdk.org/jdk/pull/26121 From dlong at openjdk.org Thu Jul 24 20:03:33 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 24 Jul 2025 20:03:33 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v7] In-Reply-To: References: Message-ID: > The VerifyStack logic in Deoptimization::unpack_frames() attempts to check the expression stack size of the interpreter frame against what GenerateOopMap computes. To do this, it needs to know if the state at the current bci represents the "before" state, meaning the bytecode will be reexecuted, or the "after" state, meaning we will advance to the next bytecode. The old code didn't know how to determine exactly what state we were in, so it checked both. This PR cleans that up, so we only have to compute the oopmap once. It also removes old SPARC support. Dean Long has updated the pull request incrementally with one additional commit since the last revision: readability suggestion ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26121/files - new: https://git.openjdk.org/jdk/pull/26121/files/1e46cce1..535fbb05 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26121&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26121&range=05-06 Stats: 8 lines in 1 file changed: 0 ins; 1 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/26121.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26121/head:pull/26121 PR: https://git.openjdk.org/jdk/pull/26121 From duke at openjdk.org Thu Jul 24 20:50:06 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 24 Jul 2025 20:50:06 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v24] In-Reply-To: References: <8eMagllT-Sxnvp6tnIkYNyUe7PetzaHXqhiqHnAiApU=.3b4c422a-e9ad-492f-a82d-98ee16f053dd@github.com> Message-ID: On Thu, 19 Jun 2025 23:54:20 GMT, Chad Rakoczy wrote: >>> We still need this check in the event that there is a direct call that no longer reaches. >> >> OK, I didn't realize that was what Relocation::pd_set_call_destination() was doing. I think it would be better for the CPU-specific code to take care of that, rather than the shared code. We already have functions like NativeCall::set_destination_mt_safe() that do the right thing regarding trampolines. I think this could be refactored into a commonm function that Relocation::pd_set_call_destination() could also use. Sorry for the churn, but hopefully we are converging on a solution. I thought I had done the refactoring for 8321509, but it looks like I went with the simply fix at the time of adding a parameter to set_destination_mt_safe() to make it lock-free. > > Thanks for the suggestion. I updated to use `set_destination_mt_safe()` instead ([reference](https://github.com/openjdk/jdk/pull/23573/commits/b02e8bdb63db8042418b92ade4a26647e4e2dd8b)) This change has been reverted. 
Trampolines are responsible for fixing their owners so this is no longer needed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229542873 From ghan at openjdk.org Fri Jul 25 03:25:53 2025 From: ghan at openjdk.org (Guanqiang Han) Date: Fri, 25 Jul 2025 03:25:53 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" In-Reply-To: References: Message-ID: On Thu, 24 Jul 2025 18:41:13 GMT, Dean Long wrote: >> I'm able to consistently reproduce the problem using the following command line and test program ? >> >> java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java >> >> import java.util.Arrays; >> public class Test{ >> public static void main(String[] args) { >> System.out.println("begin"); >> byte[] arr1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; >> byte[] arr2 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; >> System.out.println(Arrays.equals(arr1, arr2)); >> System.out.println("end"); >> } >> } >> >> From my analysis, the root cause appears to be a mismatch in operand handling between T_ADDRESS and T_LONG in LIR_Assembler::stack2reg, especially when the source is marked as double stack (e.g., T_LONG) and the destination as single CPU register (e.g., T_ADDRESS), leading to assertion failures like assert(is_single_stack())(because T_LONG is double_size). >> >> In the test program above , the call chain is: Arrays.equals ? ArraysSupport.vectorizedMismatch ? LIRGenerator::do_vectorizedMismatch >> Within the do_vectorizedMismatch() method, a move instruction constructs an LIR_Op1. During LIR to machine code generation, LIR_Assembler::stack2reg was called. >> >> In this case, the src operand has type T_LONG and the dst operand has type T_ADDRESS. This combination triggers an assert in stack2reg, due to a mismatch between the stack slot type and register type handling. >> >> Importantly, this path ( LIR_Assembler::stack2reg was called ) is only taken when src is forced onto the stack. To reliably trigger this condition, the test is run with the -Xcomp option to force compilation and increase register pressure. >> >> A reference to the relevant code paths is provided below : >> image1 >> image2 >> >> On 64-bit platforms, although T_ADDRESS is classified as single_size, it is in fact 64 bits wide ,represent a single 64-bit general-purpose register and it can hold a T_LONG value, which is also 64 bits. >> >> However, T_LONG is defined as double_size, requiring two local variable slots or a pair of registers in the JVM's abstract model. This mismatch stems from the fact that T_ADDRESS is platform-dependent: it's 32 bits on 32-bit platforms, and 64 bits on 64-bit platforms ? yet its size class... > > I think it is good to detect mismatches between T_LONG and T_ADDRESS, so I'd rather not relax the checks. Why not fix > do_vectorizedMismatch() to use new_register(T_ADDRESS)? And maybe file a separate RFE to cleanup this confusion that new_pointer_register() causes. @dean-long Thanks for the feedback! Initially, I also considered modifying do_vectorizedMismatch() to use new_register(T_ADDRESS), as you suggested. However, I found that this change would trigger a series of follow-up modifications. as shown below: image3 image4 That?s why I opted for a more localized fix . I believe this is still a reasonable compromise. On 64-bit platforms, both T_ADDRESS and T_LONG are 64-bit wide, and general-purpose registers are capable of holding either type. 
Moreover, the code already uses movptr for moving 64-bit wide data , as shown below: image5 So semantically, this modification in PR seems safe and practical in this context. That said, I fully agree that the current treatment of new_pointer_register() is a bit confusing, If you, or other experts familiar with this area, believe the RFE is reasonable and it gets opened, I?d be happy to take on the implementation. Thanks again for your insights, and I look forward to your feedback. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26462#issuecomment-3116255535 From xgong at openjdk.org Fri Jul 25 03:26:36 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 25 Jul 2025 03:26:36 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v3] In-Reply-To: References: Message-ID: > This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform. > > ### Background > Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register. > > ### Implementation > > #### Challenges > Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints. > > For a 512-bit SVE machine, loading a `byte` vector with different vector species require different approaches: > - SPECIES_64: Single operation with mask (8 elements, 256-bit) > - SPECIES_128: Single operation, full register (16 elements, 512-bit) > - SPECIES_256: Two operations + merge (32 elements, 1024-bit) > - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit) > > Use `ByteVector.SPECIES_512` as an example: > - It contains 64 elements. So the index vector size should be `64 * 32` bits, which is 4 times of the SVE vector register size. > - It requires 4 times of vector gather-loads to finish the whole operation. > > > byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...] > int[] idx = [0, 1, 2, 3, ..., 63, ...] > > 4 gather-load: > idx_v1 = [15 14 13 ... 1 0] gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa] > idx_v2 = [31 30 29 ... 17 16] gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb] > idx_v3 = [47 46 45 ... 33 32] gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc] > idx_v4 = [63 62 61 ... 49 48] gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd] > merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa] > > > #### Solution > The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end. > > Here is the main changes: > - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher. > - Added `VectorSliceNode` for result merging. > - Added `VectorMaskWidenNode` for mask spliting and type conversion fo... 
Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: Refine IR pattern and clean backend rules ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26236/files - new: https://git.openjdk.org/jdk/pull/26236/files/c39dade2..be63ade6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26236&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26236&range=01-02 Stats: 854 lines in 17 files changed: 308 ins; 275 del; 271 mod Patch: https://git.openjdk.org/jdk/pull/26236.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26236/head:pull/26236 PR: https://git.openjdk.org/jdk/pull/26236 From fyang at openjdk.org Fri Jul 25 03:37:02 2025 From: fyang at openjdk.org (Fei Yang) Date: Fri, 25 Jul 2025 03:37:02 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v4] In-Reply-To: References: Message-ID: On Thu, 24 Jul 2025 11:25:37 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review this patch? >> >> By https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp#L1270, there are far call, indirect call, reloc call. >> NativeFarCall is in fact a reloc call, the name is confusing, better to rename it to RelocCall. >> Also add some comments and do some other simple cleanup. >> >> Thanks! > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > remove NativeFarCall/RelocCall It's great to see that NativeFarCall class is factored out. Thanks for the update! Overall looks good. Would you mind two minor tweaks about code comment? src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 51: > 49: // NativeCall > 50: // > 51: // Implements direct far calling loading an address from the stub section version of reloc call. Suggestion: `// Implements indirect far call loading an address from the stub section of reloc call.` And I think this comment should be moved to immediately before definition of `MacroAssembler::reloc_call` [1]. [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp#L4982 src/hotspot/cpu/riscv/nativeInst_riscv.hpp line 114: > 112: // call instructions (used to manipulate inline caches, primitive & > 113: // DSO calls, etc.). > 114: // On riscv, NativeCall is a reloc call. Suggestion: `NativeCall is reloc call on RISC-V. See MacroAssembler::reloc_call` ------------- PR Review: https://git.openjdk.org/jdk/pull/26370#pullrequestreview-3053949128 PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2230047481 PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2230035010 From xgong at openjdk.org Fri Jul 25 03:43:54 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 25 Jul 2025 03:43:54 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: <38f2bvFqiVNQGGpMif0iflVFD8wXnyw4SwtKxwi_Dmo=.276fb2fb-b80c-4ea7-a32f-c326294f442a@github.com> <1xAfD3mz5cbQpYtCYxoHqRQcOLadLKNHrvMUtFtFbGo=.34e5780a-e37a-427c-b745-1ed422c7a008@github.com> <4tejg5hp-eHBmAEvKbpTg_mv_TUYU5kg0HIccmWyac8=.3638758e-5000-4d1f-924f-abb4a21952c6@github.com> Message-ID: On Thu, 17 Jul 2025 11:28:18 GMT, Fei Gao wrote: >>> I like this idea! The first one looks better, in which `concate` would provide lower-level and more fine-grained semantics, allowing us to define fewer IR node types while supporting more scenarios. >> >> Yes, I agree with you. I'm now working on refactoring the IR based on the first idea. 
I will update the patch as soon as possible. Thanks for your valuable suggestion! > >> >> Yes, I agree with you. I'm now working on refactoring the IR based on the first idea. I will update the patch as soon as possible. Thanks for your valuable suggestion! > > Thanks! I?d suggest also highlighting `aarch64` in the JBS title, so others who are interested won?t miss it. Hi @fg1417 , the latest commit refactored the whole IR patterns and `LoadVectorGather[Masked]` IR based on above discussions. Could you please help take another look? Thanks~ ### Main changes - Type of `LoadVectorGather[Masked]` are changed from original subword vector type to `int` vector type. Additionally, a `_mem_bt` member is added to denote the load type. - backend rules are clean - mask generation for partial cases are clean - Define `VectorConcatenateNode` and remove `VectorSliceNode`. - `VectorConcatenateNode` has the same function with SVE/NEON's `uzp1`. It is used to narrow the element size of input to half size and concatenate narrowed results from src1 and src2 to dst (src1 is in lower part and src2 is in higher part of dst). - The matcher helper function `vector_idea_reg_size()` is needless and removed. Originally it is used by `VectorSlice`. - More IR tests are added for kinds of different vector species. ### IR implementation - It needs one gather-load - `LoadVectorGather (bt: int)` + `VectorCastI2X (bt: byte|short)` - It needs two gather-loads and merge - step-1: `v1 = LoadVectorGather (bt: int)`, `v2 = LoadVectorGather (bt: int)` - step-2: `merge = VectorConcatenate(v1, v2) (bt: short)` - step-3: (only byte) `v = VectorCastS2X(merge) (bt: byte)` - It needs four gather-loads and merge - (only byte vector) - step-1: `v1 = LoadVectorGather (bt: int)`, `v2 = LoadVectorGather (bt: int)` - step-2: `merge1 = VectorConcatenate(v1, v2) (bt: short)` - step-3: `v3 = LoadVectorGather (bt: int)`, `v4 = LoadVectorGather (bt: int)` - step-4: `merge2 = VectorConcatenate(v3, v4) (bt: short)` - step-5: `v = VectorConcatenate(merge1, merge2) (bt: byte)` ### Performance change It can observe about 4% ~ 9% uplifts on some micro benchmarks. No significant regressions are observed. 
Following is the performance change on NVIDIA Grace with latest commit: Benchmark (SIZE) Mode Units Before After Gain microByteGather128 64 thrpt ops/ms 48405.283 48668.502 1.005 microByteGather128 256 thrpt ops/ms 12821.924 12662.342 0.987 microByteGather128 1024 thrpt ops/ms 3253.778 3198.608 0.983 microByteGather128 4096 thrpt ops/ms 817.604 801.250 0.979 microByteGather128_MASK 64 thrpt ops/ms 46124.722 48334.916 1.047 microByteGather128_MASK 256 thrpt ops/ms 12152.575 12652.821 1.041 microByteGather128_MASK 1024 thrpt ops/ms 3075.066 3193.787 1.038 microByteGather128_MASK 4096 thrpt ops/ms 812.738 803.017 0.988 microByteGather128_MASK_NZ_OFF 64 thrpt ops/ms 46130.244 48384.633 1.048 microByteGather128_MASK_NZ_OFF 256 thrpt ops/ms 12139.800 12624.298 1.039 microByteGather128_MASK_NZ_OFF 1024 thrpt ops/ms 3078.040 3203.049 1.040 microByteGather128_MASK_NZ_OFF 4096 thrpt ops/ms 812.716 802.712 0.987 microByteGather128_NZ_OFF 64 thrpt ops/ms 48369.524 48643.937 1.005 microByteGather128_NZ_OFF 256 thrpt ops/ms 12814.552 12672.757 0.988 microByteGather128_NZ_OFF 1024 thrpt ops/ms 3253.294 3202.016 0.984 microByteGather128_NZ_OFF 4096 thrpt ops/ms 818.389 805.488 0.984 microByteGather64 64 thrpt ops/ms 48491.633 50615.848 1.043 microByteGather64 256 thrpt ops/ms 12340.778 13156.762 1.066 microByteGather64 1024 thrpt ops/ms 3067.592 3322.777 1.083 microByteGather64 4096 thrpt ops/ms 767.111 832.409 1.085 microByteGather64_MASK 64 thrpt ops/ms 48526.894 50730.468 1.045 microByteGather64_MASK 256 thrpt ops/ms 12340.398 13159.723 1.066 microByteGather64_MASK 1024 thrpt ops/ms 3066.227 3327.964 1.085 microByteGather64_MASK 4096 thrpt ops/ms 767.390 833.327 1.085 microByteGather64_MASK_NZ_OFF 64 thrpt ops/ms 48472.912 51287.634 1.058 microByteGather64_MASK_NZ_OFF 256 thrpt ops/ms 12331.578 13258.954 1.075 microByteGather64_MASK_NZ_OFF 1024 thrpt ops/ms 3070.319 3345.911 1.089 microByteGather64_MASK_NZ_OFF 4096 thrpt ops/ms 767.097 838.008 1.092 microByteGather64_NZ_OFF 64 thrpt ops/ms 48492.984 51224.743 1.056 microByteGather64_NZ_OFF 256 thrpt ops/ms 12334.944 13240.494 1.073 microByteGather64_NZ_OFF 1024 thrpt ops/ms 3067.754 3343.387 1.089 microByteGather64_NZ_OFF 4096 thrpt ops/ms 767.123 837.642 1.091 microShortGather128 64 thrpt ops/ms 37717.835 37041.162 0.982 microShortGather128 256 thrpt ops/ms 9467.160 9890.109 1.044 microShortGather128 1024 thrpt ops/ms 2376.520 2481.753 1.044 microShortGather128 4096 thrpt ops/ms 595.030 621.274 1.044 microShortGather128_MASK 64 thrpt ops/ms 37655.017 37036.887 0.983 microShortGather128_MASK 256 thrpt ops/ms 9471.324 9859.461 1.040 microShortGather128_MASK 1024 thrpt ops/ms 2376.811 2477.106 1.042 microShortGather128_MASK 4096 thrpt ops/ms 595.049 620.082 1.042 microShortGather128_MASK_NZ_OFF 64 thrpt ops/ms 37636.229 37029.468 0.983 microShortGather128_MASK_NZ_OFF 256 thrpt ops/ms 9483.674 9867.427 1.040 microShortGather128_MASK_NZ_OFF 1024 thrpt ops/ms 2379.877 2478.608 1.041 microShortGather128_MASK_NZ_OFF 4096 thrpt ops/ms 594.710 620.455 1.043 microShortGather128_NZ_OFF 64 thrpt ops/ms 37706.896 37044.505 0.982 microShortGather128_NZ_OFF 256 thrpt ops/ms 9487.006 9882.079 1.041 microShortGather128_NZ_OFF 1024 thrpt ops/ms 2379.571 2482.341 1.043 microShortGather128_NZ_OFF 4096 thrpt ops/ms 595.099 621.392 1.044 microShortGather64 64 thrpt ops/ms 37773.485 37502.698 0.992 microShortGather64 256 thrpt ops/ms 9591.046 9640.225 1.005 microShortGather64 1024 thrpt ops/ms 2406.013 2420.376 1.005 microShortGather64 4096 thrpt ops/ms 603.270 606.541 1.005 
microShortGather64_MASK 64 thrpt ops/ms 37781.860 37479.295 0.991 microShortGather64_MASK 256 thrpt ops/ms 9608.015 9657.010 1.005 microShortGather64_MASK 1024 thrpt ops/ms 2406.828 2422.170 1.006 microShortGather64_MASK 4096 thrpt ops/ms 602.965 606.283 1.005 microShortGather64_MASK_NZ_OFF 64 thrpt ops/ms 37740.577 37487.740 0.993 microShortGather64_MASK_NZ_OFF 256 thrpt ops/ms 9593.611 9663.041 1.007 microShortGather64_MASK_NZ_OFF 1024 thrpt ops/ms 2404.846 2423.493 1.007 microShortGather64_MASK_NZ_OFF 4096 thrpt ops/ms 602.691 605.911 1.005 microShortGather64_NZ_OFF 64 thrpt ops/ms 37723.586 37507.899 0.994 microShortGather64_NZ_OFF 256 thrpt ops/ms 9589.985 9630.033 1.004 microShortGather64_NZ_OFF 1024 thrpt ops/ms 2405.774 2423.655 1.007 microShortGather64_NZ_OFF 4096 thrpt ops/ms 602.778 606.151 1.005 ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3116280179 From jbhateja at openjdk.org Fri Jul 25 03:44:00 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 25 Jul 2025 03:44:00 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v5] In-Reply-To: References: Message-ID: On Thu, 24 Jul 2025 02:56:43 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is >> relative smaller than that of `fromLong`. So this patch does the conversion for these cases. >> >> The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. >> >> Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. >> >> This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like >> >> VectorMaskToLong (VectorLongToMask x) => x >> >> >> Hence, this patch also added the following optimizations: >> >> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> >> VectorMaskCast (VectorMaskCast x) => x >> >> And we can see noticeable performance improvement with the above optimizations for floating-point types. 
>> >> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Error After Error Uplift >> microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 >> microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 >> microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 >> microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 >> microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 >> microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 >> microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 >> microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 >> >> >> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: >> >> Benchm... > > erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains eight additional commits since the last revision: > > - Add JMH benchmarks for cast chain transformation > - Merge branch 'master' into JDK-8356760 > - Refactor the implementation > > Do the convertion in C2's IGVN phase to cover more cases. > - Merge branch 'master' into JDK-8356760 > - Simplify the test code > - Address some review comments > > Add support for the following patterns: > toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) > toLong(maskAll(false)) => 0 > > And add more test cases. > - Merge branch 'master' into JDK-8356760 > - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases > > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would > set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent > to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is > relative smaller than that of `fromLong`. This patch does the conversion > for these cases if `l` is a compile time constant. > > And this conversion also enables further optimizations that recognize > maskAll patterns, see [1]. > > Some JTReg test cases are added to ensure the optimization is effective. > > I tried many different ways to write a JMH benchmark, but failed. Since > the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific > compile-time constant, the statement will be hoisted out of the loop. > If we don't use a loop, the hotspot will become other instructions, and > no obvious performance change was observed. However, combined with the > optimization of [1], we can observe a performance improvement of about > 7% on both aarch64 and x64. > > The patch was tested on both aarch64 and x64, all of tier1 tier2 and > tier3 tests passed. > > [1] https://github.com/openjdk/jdk/pull/24674 Your changes looks good to me. Thanks @erifan src/hotspot/share/opto/vectornode.cpp line 1986: > 1984: Node* VectorMaskToLongNode::Ideal_MaskAll(PhaseGVN* phase) { > 1985: Node* in1 = in(1); > 1986: // VectorMaskToLong follows a VectorStoreMask if predicate is not supported. It's always good to add an assertion check for coding assumptions. ------------- Marked as reviewed by jbhateja (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/25793#pullrequestreview-3053985885 PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2230064475 From duke at openjdk.org Fri Jul 25 07:28:42 2025 From: duke at openjdk.org (erifan) Date: Fri, 25 Jul 2025 07:28:42 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v6] In-Reply-To: References: Message-ID: > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is > relative smaller than that of `fromLong`. So this patch does the conversion for these cases. > > The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. > > Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. > > This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. > > As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like > > VectorMaskToLong (VectorLongToMask x) => x > > > Hence, this patch also added the following optimizations: > > VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 > VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 > > VectorMaskCast (VectorMaskCast x) => x > > And we can see noticeable performance improvement with the above optimizations for floating-point types. > > Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: > > Benchmark Unit Before Error After Error Uplift > microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 > microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 > microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 > microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 > microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 > microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 > microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 > microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 > > > Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: > > Benchmark Unit Before Error After Error Uplift > microMaskFromLongToLong_Double... 
erifan has updated the pull request incrementally with one additional commit since the last revision: Add an assertion ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25793/files - new: https://git.openjdk.org/jdk/pull/25793/files/6ae43e17..4ffc8d91 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=04-05 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/25793.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25793/head:pull/25793 PR: https://git.openjdk.org/jdk/pull/25793 From jbhateja at openjdk.org Fri Jul 25 07:28:42 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 25 Jul 2025 07:28:42 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v6] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 07:24:28 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is >> relative smaller than that of `fromLong`. So this patch does the conversion for these cases. >> >> The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. >> >> Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. >> >> This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like >> >> VectorMaskToLong (VectorLongToMask x) => x >> >> >> Hence, this patch also added the following optimizations: >> >> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> >> VectorMaskCast (VectorMaskCast x) => x >> >> And we can see noticeable performance improvement with the above optimizations for floating-point types. >> >> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Error After Error Uplift >> microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 >> microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 >> microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 >> microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 >> microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 >> microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 >> microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 >> microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 >> >> >> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: >> >> Benchm... 
> > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Add an assertion Still looks good. src/hotspot/share/opto/vectornode.cpp line 1989: > 1987: if (in1->Opcode() == Op_VectorStoreMask) { > 1988: in1 = in1->in(1); > 1989: assert(!in1->bottom_type()->isa_vectmask(), "sanity"); Assertion should precede before any other statement in the block :-) ------------- Marked as reviewed by jbhateja (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25793#pullrequestreview-3054381237 PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2230359167 From duke at openjdk.org Fri Jul 25 07:28:43 2025 From: duke at openjdk.org (erifan) Date: Fri, 25 Jul 2025 07:28:43 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v5] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 03:35:11 GMT, Jatin Bhateja wrote: >> erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains eight additional commits since the last revision: >> >> - Add JMH benchmarks for cast chain transformation >> - Merge branch 'master' into JDK-8356760 >> - Refactor the implementation >> >> Do the convertion in C2's IGVN phase to cover more cases. >> - Merge branch 'master' into JDK-8356760 >> - Simplify the test code >> - Address some review comments >> >> Add support for the following patterns: >> toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) >> toLong(maskAll(false)) => 0 >> >> And add more test cases. >> - Merge branch 'master' into JDK-8356760 >> - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases >> >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would >> set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent >> to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is >> relative smaller than that of `fromLong`. This patch does the conversion >> for these cases if `l` is a compile time constant. >> >> And this conversion also enables further optimizations that recognize >> maskAll patterns, see [1]. >> >> Some JTReg test cases are added to ensure the optimization is effective. >> >> I tried many different ways to write a JMH benchmark, but failed. Since >> the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific >> compile-time constant, the statement will be hoisted out of the loop. >> If we don't use a loop, the hotspot will become other instructions, and >> no obvious performance change was observed. However, combined with the >> optimization of [1], we can observe a performance improvement of about >> 7% on both aarch64 and x64. >> >> The patch was tested on both aarch64 and x64, all of tier1 tier2 and >> tier3 tests passed. >> >> [1] https://github.com/openjdk/jdk/pull/24674 > > src/hotspot/share/opto/vectornode.cpp line 1986: > >> 1984: Node* VectorMaskToLongNode::Ideal_MaskAll(PhaseGVN* phase) { >> 1985: Node* in1 = in(1); >> 1986: // VectorMaskToLong follows a VectorStoreMask if predicate is not supported. > > It's always good to add an assertion check for coding assumptions. Done, thanks! 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2230351518 From aph at openjdk.org Fri Jul 25 08:05:55 2025 From: aph at openjdk.org (Andrew Haley) Date: Fri, 25 Jul 2025 08:05:55 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet [v2] In-Reply-To: References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: On Thu, 24 Jul 2025 10:46:00 GMT, Samuel Chee wrote: > I have been unable to find any particular use patterns which relies on the existence of these trailing dmbs, so it does not seem necessary to add the trailingDMB option. Although would like to hear your thoughts on the issue. Maybe simply move the `dmb` after the non-LSE ldxr/stxr logic, then. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26000#discussion_r2230446292 From mli at openjdk.org Fri Jul 25 08:34:55 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 25 Jul 2025 08:34:55 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v4] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 03:01:08 GMT, Fei Yang wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> remove NativeFarCall/RelocCall > > src/hotspot/cpu/riscv/nativeInst_riscv.hpp line 114: > >> 112: // call instructions (used to manipulate inline caches, primitive & >> 113: // DSO calls, etc.). >> 114: // On riscv, NativeCall is a reloc call. > > Suggestion: `NativeCall is reloc call on RISC-V. See MacroAssembler::reloc_call` fixed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2230508486 From mli at openjdk.org Fri Jul 25 08:39:39 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 25 Jul 2025 08:39:39 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v4] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 03:13:34 GMT, Fei Yang wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> remove NativeFarCall/RelocCall > > src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 51: > >> 49: // NativeCall >> 50: // >> 51: // Implements direct far calling loading an address from the stub section version of reloc call. > > Suggestion: `// Implements indirect far call loading an address from the stub section of reloc call.` > > And I think this comment should be moved to immediately before definition of `MacroAssembler::reloc_call` [1]. > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp#L4982 As it still has "far call" which is what we want to cleanup in this pr, and [1] explain the different types of call in more details, I'll just remove this comment to avoid misleading. [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp#L1313 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2230513589 From mli at openjdk.org Fri Jul 25 08:39:39 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 25 Jul 2025 08:39:39 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v5] In-Reply-To: References: Message-ID: > Hi, > Can you help to review this patch? > > By https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp#L1270, there are far call, indirect call, reloc call. > NativeFarCall is in fact a reloc call, the name is confusing, better to rename it to RelocCall. 
> Also add some comments and do some other simple cleanup. > > Thanks! Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26370/files - new: https://git.openjdk.org/jdk/pull/26370/files/37820220..724c44d2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26370&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26370&range=03-04 Stats: 3 lines in 2 files changed: 0 ins; 1 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26370.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26370/head:pull/26370 PR: https://git.openjdk.org/jdk/pull/26370 From bkilambi at openjdk.org Fri Jul 25 08:58:41 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Fri, 25 Jul 2025 08:58:41 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v17] In-Reply-To: References: Message-ID: > This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. > > It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. > > For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. > > For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. > > This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. > > Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - > > > Benchmark (size) Mode Cnt Gain > SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 > SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 > SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 > SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 > SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 > SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 > SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 > SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 > SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 > SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 > > > Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. Bhavana Kilambi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 17 commits: - Merge master - Refine comments in c2_MacroAssembler_aarch64.cpp - Addressed review comments to half the number of match rules - Updated x86 code. Patch contributed by @jatin-bhateja - Change match rule names to lowercase - Addressed review comments - x86_64: JTREG test update for x86. The patch is contributed by @jatin-bhateja - Addressed review comments - Merge master - code style issues fixed - ... 
and 7 more: https://git.openjdk.org/jdk/compare/518d5f4b...f79d2f00 ------------- Changes: https://git.openjdk.org/jdk/pull/23570/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=16 Stats: 969 lines in 13 files changed: 943 ins; 0 del; 26 mod Patch: https://git.openjdk.org/jdk/pull/23570.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23570/head:pull/23570 PR: https://git.openjdk.org/jdk/pull/23570 From snatarajan at openjdk.org Fri Jul 25 09:01:12 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Fri, 25 Jul 2025 09:01:12 GMT Subject: RFR: 8358696: Assert with extreme values for -XX:BciProfileWidth [v5] In-Reply-To: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> References: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> Message-ID: > **Issue** > Extreme values for BciProfileWidth flag such as `java -XX:BciProfileWidth=-1 -version` and `java -XX:BciProfileWidth=100000 -version `results in assert failure `assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. This is observed in a x86 machine. > > **Analysis** > On debugging the issue, I found that increasing the size of the interpreter using the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` prevented the above mentioned assert from failing for large values of BciProfileWidth. > > **Proposal** > Considering the fact that larger BciProfileWidth results in slower profiling, I have proposed a range between 0 to 5000 to restrict the value for BciProfileWidth for x86 machines. This maximum value is based on modifying the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` using the smallest `InterpreterCodeSize` for all the architectures. As for the lower bound, a value of -1 would be the same as 0, as this simply means no return bci's will be recorded in ret profile. > > **Issue in AArch64** > Additionally running the command `java -XX:BciProfileWidth= 10000 -version` (or larger values) results in a different failure `assert(offset_ok_for_immed(offset(), size)) failed: must be, was: 32768, 3` on an AArch64 machine.This is an issue of maximum offset for `ldr/str` in AArch64 which can be fixed using `form_address` as mentioned in [JDK-8342736](https://bugs.openjdk.org/browse/JDK-8342736). In my preliminary fix using `form_address` on AArch64 machine. I had to modify 3 `ldr` and 1 `str` instruction (in file `src/hotspot/cpu/aarch64/interp_masm_aarch64.cpp` at line number 926, 983, and 997). With this fix using `form_address`, `BciProfileWidth` works for maximum of 5000 after which it crashes with`assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. Without this fix `BciProfileWidth` works for a maximum value of 1300. Currently, I have suggested to restrict the upper bound on AArch64 to 1000 instead of fixing it with `form_address`. > > **Question to reviewers** > Do you think this is a reasonable fix ? For AArch64 do you suggest fixing using `form_address` ? If yes, do I fix it under this PR or create another one ? 
Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: addressing review comment by adding intx flag ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26139/files - new: https://git.openjdk.org/jdk/pull/26139/files/2d0084ba..2fc4b0b7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26139&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26139&range=03-04 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26139.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26139/head:pull/26139 PR: https://git.openjdk.org/jdk/pull/26139 From fyang at openjdk.org Fri Jul 25 09:05:54 2025 From: fyang at openjdk.org (Fei Yang) Date: Fri, 25 Jul 2025 09:05:54 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v5] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 08:39:39 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review this patch? >> >> By https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp#L1270, there are far call, indirect call, reloc call. >> NativeFarCall is in fact a reloc call, the name is confusing, better to rename it to RelocCall. >> Also add some comments and do some other simple cleanup. >> >> Thanks! > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > comments Thanks! ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26370#pullrequestreview-3054670513 From mli at openjdk.org Fri Jul 25 09:05:54 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 25 Jul 2025 09:05:54 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v5] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 09:02:30 GMT, Fei Yang wrote: > Thanks! Thank you for reviewing! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26370#issuecomment-3116991764 From snatarajan at openjdk.org Fri Jul 25 09:15:57 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Fri, 25 Jul 2025 09:15:57 GMT Subject: RFR: 8358696: Assert with extreme values for -XX:BciProfileWidth [v4] In-Reply-To: <51aYnCiXel-vz4Zu40K08E1lyBtX5JXD8PXoCr5wWUE=.15def8e4-f7c3-42ae-976e-f79ed7415bfa@github.com> References: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> <_preMnRE0tqL476Pb8bPPfkixInRa-ZH5Qom7W70AW4=.a71e36da-d0e1-44e4-a3fe-9091460b813f@github.com> <51aYnCiXel-vz4Zu40K08E1lyBtX5JXD8PXoCr5wWUE=.15def8e4-f7c3-42ae-976e-f79ed7415bfa@github.com> Message-ID: On Thu, 24 Jul 2025 07:08:24 GMT, Damon Fenacci wrote: >> Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: >> >> fixing copyright > > src/hotspot/share/runtime/globals.hpp line 1356: > >> 1354: develop(int, BciProfileWidth, 2, \ >> 1355: "Number of return bci's to record in ret profile") \ >> 1356: range(0, AARCH64_ONLY(1000) NOT_AARCH64(5000)) \ > > I'm not too sure of the usual number of returns but even just 1000 sounds quite big as maximum. Do you think we could use that for all architectures? Thank you for the review. I have tested 1000 by reducing the `InterpreterCodeSize` to the smallest value in all the specified architecture in the source code on both AArch64 and x86. It works for 1000. Hence, I think it should work on all architectures. Do you propose I make it 1000 (or a lesser value) for all architecture ? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26139#discussion_r2230592823 From snatarajan at openjdk.org Fri Jul 25 09:15:59 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Fri, 25 Jul 2025 09:15:59 GMT Subject: RFR: 8358696: Assert with extreme values for -XX:BciProfileWidth [v5] In-Reply-To: <51aYnCiXel-vz4Zu40K08E1lyBtX5JXD8PXoCr5wWUE=.15def8e4-f7c3-42ae-976e-f79ed7415bfa@github.com> References: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> <_preMnRE0tqL476Pb8bPPfkixInRa-ZH5Qom7W70AW4=.a71e36da-d0e1-44e4-a3fe-9091460b813f@github.com> <51aYnCiXel-vz4Zu40K08E1lyBtX5JXD8PXoCr5wWUE=.15def8e4-f7c3-42ae-976e-f79ed7415bfa@github.com> Message-ID: On Thu, 24 Jul 2025 06:59:52 GMT, Damon Fenacci wrote: >> Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: >> >> addressing review comment by adding intx flag > > test/lib-test/jdk/test/whitebox/vm_flags/IntxTest.java line 39: > >> 37: public class IntxTest { >> 38: private static final String FLAG_NAME = "OnStackReplacePercentage"; >> 39: private static final String FLAG_DEBUG_NAME = "BciProfileWidth"; > > Maybe we might want use another `intx` flag instead of just removing this (just to keep testing the WhiteBox) I addressed this comment by adding `BinarySwitchThreshold` intx develop flag now. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26139#discussion_r2230595014 From bkilambi at openjdk.org Fri Jul 25 09:17:19 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Fri, 25 Jul 2025 09:17:19 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v18] In-Reply-To: References: Message-ID: > This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. > > It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. > > For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. > > For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. > > This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. 
> > Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - > > > Benchmark (size) Mode Cnt Gain > SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 > SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 > SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 > SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 > SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 > SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 > SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 > SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 > SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 > SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 > > > Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: Refine comments in the ad file ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23570/files - new: https://git.openjdk.org/jdk/pull/23570/files/f79d2f00..3675bf34 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=17 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=16-17 Stats: 16 lines in 2 files changed: 6 ins; 2 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/23570.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23570/head:pull/23570 PR: https://git.openjdk.org/jdk/pull/23570 From bkilambi at openjdk.org Fri Jul 25 09:17:21 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Fri, 25 Jul 2025 09:17:21 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v17] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 08:58:41 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. 
>> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 17 commits: > > - Merge master > - Refine comments in c2_MacroAssembler_aarch64.cpp > - Addressed review comments to half the number of match rules > - Updated x86 code. Patch contributed by @jatin-bhateja > - Change match rule names to lowercase > - Addressed review comments > - x86_64: JTREG test update for x86. The patch is contributed by @jatin-bhateja > - Addressed review comments > - Merge master > - code style issues fixed > - ... and 7 more: https://git.openjdk.org/jdk/compare/518d5f4b...f79d2f00 Hi @theRealAph I have refined my comments. Could I please get another review? Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23570#issuecomment-3117022123 From duke at openjdk.org Fri Jul 25 09:27:19 2025 From: duke at openjdk.org (erifan) Date: Fri, 25 Jul 2025 09:27:19 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v7] In-Reply-To: References: Message-ID: > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is > relative smaller than that of `fromLong`. So this patch does the conversion for these cases. > > The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. > > Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. > > This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. 
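As a concrete illustration of the equivalence this transform relies on, here is a small self-contained sketch using the incubating jdk.incubator.vector API (class name and species choice are illustrative, not from the patch):

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

public class FromLongAllTrueSketch {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256; // 8 int lanes

    public static void main(String[] args) {
        long allBits = (1L << SPECIES.length()) - 1;   // sets every lane for this species
        VectorMask<Integer> fromLongTrue  = VectorMask.fromLong(SPECIES, allBits);
        VectorMask<Integer> maskAllTrue   = SPECIES.maskAll(true);
        VectorMask<Integer> fromLongFalse = VectorMask.fromLong(SPECIES, 0L);
        VectorMask<Integer> maskAllFalse  = SPECIES.maskAll(false);
        // Both pairs describe identical lane values, which is what allows the compiler
        // to rewrite the constant fromLong calls into the cheaper maskAll form.
        System.out.println(fromLongTrue.toLong()  == maskAllTrue.toLong());   // true
        System.out.println(fromLongFalse.toLong() == maskAllFalse.toLong());  // true
        // Run with: java --add-modules jdk.incubator.vector FromLongAllTrueSketch.java
    }
}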
> > As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like > > VectorMaskToLong (VectorLongToMask x) => x > > > Hence, this patch also added the following optimizations: > > VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 > VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 > > VectorMaskCast (VectorMaskCast x) => x > > And we can see noticeable performance improvement with the above optimizations for floating-point types. > > Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: > > Benchmark Unit Before Error After Error Uplift > microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 > microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 > microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 > microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 > microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 > microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 > microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 > microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 > > > Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: > > Benchmark Unit Before Error After Error Uplift > microMaskFromLongToLong_Double... erifan has updated the pull request incrementally with one additional commit since the last revision: Move the assertion to the beginning of the code block ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25793/files - new: https://git.openjdk.org/jdk/pull/25793/files/4ffc8d91..8418ebdd Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=05-06 Stats: 2 lines in 1 file changed: 1 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/25793.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25793/head:pull/25793 PR: https://git.openjdk.org/jdk/pull/25793 From duke at openjdk.org Fri Jul 25 09:27:20 2025 From: duke at openjdk.org (erifan) Date: Fri, 25 Jul 2025 09:27:20 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v6] In-Reply-To: References: Message-ID: <_LIu4LYcFUSkUTpCaLX4zA8f_xWGSV2lW917o6YEp40=.b4b7476a-ff51-4194-b691-94b6b35490c6@github.com> On Fri, 25 Jul 2025 07:21:38 GMT, Jatin Bhateja wrote: >> erifan has updated the pull request incrementally with one additional commit since the last revision: >> >> Add an assertion > > src/hotspot/share/opto/vectornode.cpp line 1989: > >> 1987: if (in1->Opcode() == Op_VectorStoreMask) { >> 1988: in1 = in1->in(1); >> 1989: assert(!in1->bottom_type()->isa_vectmask(), "sanity"); > > Assertion should precede before any other statement in the block :-) Done, thanks~ ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2230615716 From aph at openjdk.org Fri Jul 25 09:29:00 2025 From: aph at openjdk.org (Andrew Haley) Date: Fri, 25 Jul 2025 09:29:00 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v18] In-Reply-To: References: Message-ID: <7tKfqCZHB1fAcrN7hU2mVZBrAfE2XkMUa5M-fG2dERc=.9a40f17a-3860-4c7b-bc22-73480865276f@github.com> On Fri, 25 Jul 2025 09:17:19 
GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Refine comments in the ad file OK, that looks like a good job. You'll need another reviewer. ------------- Marked as reviewed by aph (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23570#pullrequestreview-3054740405 From bkilambi at openjdk.org Fri Jul 25 10:06:01 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Fri, 25 Jul 2025 10:06:01 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v18] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 09:17:19 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. 
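The fallback mentioned above ("two rearrange and one blend") corresponds roughly to the following Java-level decomposition. This is only an illustrative sketch of the lowering, assuming the incubating Vector API; it is not code from the patch.

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class SelectFromFallbackSketch {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128; // 4 int lanes

    // Pick from v1 where the index lane is < VLENGTH, from v2 where it is >= VLENGTH.
    static IntVector selectFromTwo(IntVector idx, IntVector v1, IntVector v2) {
        int vlen = SPECIES.length();
        IntVector wrapped = idx.and(vlen - 1);                    // index modulo VLENGTH (power of two)
        IntVector fromFirst  = v1.rearrange(wrapped.toShuffle()); // rearrange #1
        IntVector fromSecond = v2.rearrange(wrapped.toShuffle()); // rearrange #2
        VectorMask<Integer> pickSecond = idx.compare(VectorOperators.GE, vlen);
        return fromFirst.blend(fromSecond, pickSecond);           // blend
    }
}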
>> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Refine comments in the ad file Hi @jatin-bhateja could I ask for your review for the x86 part please? I also fixed a minor merge conflict in the latest merge which is related to x86. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23570#issuecomment-3117161192 From jbhateja at openjdk.org Fri Jul 25 13:50:55 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 25 Jul 2025 13:50:55 GMT Subject: RFR: 8303762: Optimize vector slice operation with constant index using VPALIGNR instruction Message-ID: Patch optimizes Vector. slice operation with constant index using x86 ALIGNR instruction. It also adds a new hybrid call generator to facilitate lazy intrinsification or else perform procedural inlining to prevent call overhead and boxing penalties in case the fallback implementation expects to operate over vectors. The existing vector API-based slice implementation is now the fallback code that gets inlined in case intrinsification fails. Idea here is to add infrastructure support to enable intrinsification of fast path for selected vector APIs, else enable inlining of fall-back implementation if it's based on vector APIs. Existing call generators like PredictedCallGenerator, used to handle bi-morphic inlining, already make use of multiple call generators to handle hit/miss scenarios for a particular receiver type. The newly added hybrid call generator is lazy and called during incremental inlining optimization. It also relieves the inline expander to handle slow paths, which can easily be implemented library side (Java). Vector API jtreg tests pass at AVX level 2, remaining validation in progress. 
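For context, the constant-index form being optimized is the two-argument Vector.slice(origin, v2) call where the origin is a compile-time constant. A minimal loop of that shape, assuming the incubating jdk.incubator.vector API (class name and the chosen origin are illustrative only), looks like this:

import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

public class VectorSliceSketch {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_128; // 16 byte lanes

    // Result lane i is a[i + 3] while i + 3 < 16, then b[i + 3 - 16] afterwards;
    // with a constant origin this is the shape a single (v)palignr can cover.
    static void sliceConstantOrigin(byte[] a, byte[] b, byte[] r) {
        int upper = SPECIES.loopBound(Math.min(a.length, Math.min(b.length, r.length)));
        for (int i = 0; i < upper; i += SPECIES.length()) {
            ByteVector va = ByteVector.fromArray(SPECIES, a, i);
            ByteVector vb = ByteVector.fromArray(SPECIES, b, i);
            va.slice(3, vb).intoArray(r, i); // origin 3 is a compile-time constant
        }
    }
}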
Performance numbers: System : 13th Gen Intel(R) Core(TM) i3-1315U Baseline: Benchmark (size) Mode Cnt Score Error Units VectorSliceBenchmark.byteVectorSliceWithConstantIndex1 1024 thrpt 2 9444.444 ops/ms VectorSliceBenchmark.byteVectorSliceWithConstantIndex2 1024 thrpt 2 10009.319 ops/ms VectorSliceBenchmark.byteVectorSliceWithVariableIndex 1024 thrpt 2 9081.926 ops/ms VectorSliceBenchmark.intVectorSliceWithConstantIndex1 1024 thrpt 2 6085.825 ops/ms VectorSliceBenchmark.intVectorSliceWithConstantIndex2 1024 thrpt 2 6505.378 ops/ms VectorSliceBenchmark.intVectorSliceWithVariableIndex 1024 thrpt 2 6204.489 ops/ms VectorSliceBenchmark.longVectorSliceWithConstantIndex1 1024 thrpt 2 1651.334 ops/ms VectorSliceBenchmark.longVectorSliceWithConstantIndex2 1024 thrpt 2 1642.784 ops/ms VectorSliceBenchmark.longVectorSliceWithVariableIndex 1024 thrpt 2 1474.808 ops/ms VectorSliceBenchmark.shortVectorSliceWithConstantIndex1 1024 thrpt 2 10399.394 ops/ms VectorSliceBenchmark.shortVectorSliceWithConstantIndex2 1024 thrpt 2 10502.894 ops/ms VectorSliceBenchmark.shortVectorSliceWithVariableIndex 1024 thrpt 2 9756.573 ops/ms With opt: Benchmark (size) Mode Cnt Score Error Units VectorSliceBenchmark.byteVectorSliceWithConstantIndex1 1024 thrpt 2 34122.435 ops/ms VectorSliceBenchmark.byteVectorSliceWithConstantIndex2 1024 thrpt 2 33281.868 ops/ms VectorSliceBenchmark.byteVectorSliceWithVariableIndex 1024 thrpt 2 9345.154 ops/ms VectorSliceBenchmark.intVectorSliceWithConstantIndex1 1024 thrpt 2 8283.247 ops/ms VectorSliceBenchmark.intVectorSliceWithConstantIndex2 1024 thrpt 2 8510.695 ops/ms VectorSliceBenchmark.intVectorSliceWithVariableIndex 1024 thrpt 2 5626.367 ops/ms VectorSliceBenchmark.longVectorSliceWithConstantIndex1 1024 thrpt 2 960.958 ops/ms VectorSliceBenchmark.longVectorSliceWithConstantIndex2 1024 thrpt 2 4155.801 ops/ms VectorSliceBenchmark.longVectorSliceWithVariableIndex 1024 thrpt 2 1465.953 ops/ms VectorSliceBenchmark.shortVectorSliceWithConstantIndex1 1024 thrpt 2 32748.061 ops/ms VectorSliceBenchmark.shortVectorSliceWithConstantIndex2 1024 thrpt 2 33674.408 ops/ms VectorSliceBenchmark.shortVectorSliceWithVariableIndex 1024 thrpt 2 9346.148 ops/ms Please share your feedback. Best Regards, Jatin ------------- Commit messages: - Fixes for failing regressions - Optimizing AVX2 backend and some re-factoring - new benchmark - Merge branch 'master' of https://github.com/openjdk/jdk into JDK-8303762 - 8303762: Optimize vector slice operation with constant index using VPALIGNR instruction Changes: https://git.openjdk.org/jdk/pull/24104/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24104&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8303762 Stats: 747 lines in 32 files changed: 664 ins; 0 del; 83 mod Patch: https://git.openjdk.org/jdk/pull/24104.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24104/head:pull/24104 PR: https://git.openjdk.org/jdk/pull/24104 From jbhateja at openjdk.org Fri Jul 25 13:50:56 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 25 Jul 2025 13:50:56 GMT Subject: RFR: 8303762: Optimize vector slice operation with constant index using VPALIGNR instruction In-Reply-To: References: Message-ID: On Tue, 18 Mar 2025 20:51:46 GMT, Jatin Bhateja wrote: > Patch optimizes Vector. slice operation with constant index using x86 ALIGNR instruction. 
> It also adds a new hybrid call generator to facilitate lazy intrinsification or else perform procedural inlining to prevent call overhead and boxing penalties in case the fallback implementation expects to operate over vectors. The existing vector API-based slice implementation is now the fallback code that gets inlined in case intrinsification fails. > > Idea here is to add infrastructure support to enable intrinsification of fast path for selected vector APIs, else enable inlining of fall-back implementation if it's based on vector APIs. Existing call generators like PredictedCallGenerator, used to handle bi-morphic inlining, already make use of multiple call generators to handle hit/miss scenarios for a particular receiver type. The newly added hybrid call generator is lazy and called during incremental inlining optimization. It also relieves the inline expander to handle slow paths, which can easily be implemented library side (Java). > > Vector API jtreg tests pass at AVX level 2, remaining validation in progress. > > Performance numbers: > > > System : 13th Gen Intel(R) Core(TM) i3-1315U > > Baseline: > Benchmark (size) Mode Cnt Score Error Units > VectorSliceBenchmark.byteVectorSliceWithConstantIndex1 1024 thrpt 2 9444.444 ops/ms > VectorSliceBenchmark.byteVectorSliceWithConstantIndex2 1024 thrpt 2 10009.319 ops/ms > VectorSliceBenchmark.byteVectorSliceWithVariableIndex 1024 thrpt 2 9081.926 ops/ms > VectorSliceBenchmark.intVectorSliceWithConstantIndex1 1024 thrpt 2 6085.825 ops/ms > VectorSliceBenchmark.intVectorSliceWithConstantIndex2 1024 thrpt 2 6505.378 ops/ms > VectorSliceBenchmark.intVectorSliceWithVariableIndex 1024 thrpt 2 6204.489 ops/ms > VectorSliceBenchmark.longVectorSliceWithConstantIndex1 1024 thrpt 2 1651.334 ops/ms > VectorSliceBenchmark.longVectorSliceWithConstantIndex2 1024 thrpt 2 1642.784 ops/ms > VectorSliceBenchmark.longVectorSliceWithVariableIndex 1024 thrpt 2 1474.808 ops/ms > VectorSliceBenchmark.shortVectorSliceWithConstantIndex1 1024 thrpt 2 10399.394 ops/ms > VectorSliceBenchmark.shortVectorSliceWithConstantIndex2 1024 thrpt 2 10502.894 ops/ms > VectorSliceBenchmark.shortVectorSliceWithVariableIndex 1024 ... 
Performance after AVX2 backend modifications Benchmark (size) Mode Cnt Score Error Units VectorSliceBenchmark.byteVectorSliceWithConstantIndex1 1024 thrpt 2 51644.530 ops/ms VectorSliceBenchmark.byteVectorSliceWithConstantIndex2 1024 thrpt 2 48171.079 ops/ms VectorSliceBenchmark.byteVectorSliceWithVariableIndex 1024 thrpt 2 9662.306 ops/ms VectorSliceBenchmark.intVectorSliceWithConstantIndex1 1024 thrpt 2 14358.347 ops/ms VectorSliceBenchmark.intVectorSliceWithConstantIndex2 1024 thrpt 2 14619.920 ops/ms VectorSliceBenchmark.intVectorSliceWithVariableIndex 1024 thrpt 2 6675.824 ops/ms VectorSliceBenchmark.longVectorSliceWithConstantIndex1 1024 thrpt 2 818.911 ops/ms VectorSliceBenchmark.longVectorSliceWithConstantIndex2 1024 thrpt 2 4778.321 ops/ms VectorSliceBenchmark.longVectorSliceWithVariableIndex 1024 thrpt 2 1612.264 ops/ms VectorSliceBenchmark.shortVectorSliceWithConstantIndex1 1024 thrpt 2 35961.146 ops/ms VectorSliceBenchmark.shortVectorSliceWithConstantIndex2 1024 thrpt 2 39072.170 ops/ms VectorSliceBenchmark.shortVectorSliceWithVariableIndex 1024 thrpt 2 11209.685 ops/ms ------------- PR Comment: https://git.openjdk.org/jdk/pull/24104#issuecomment-3116214722 From chagedorn at openjdk.org Fri Jul 25 13:55:55 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 25 Jul 2025 13:55:55 GMT Subject: RFR: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? [v2] In-Reply-To: References: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> Message-ID: On Thu, 24 Jul 2025 08:41:52 GMT, Roland Westrelin wrote: >> A node in a pre loop only has uses out of the loop dominated by the >> loop exit. `PhaseIdealLoop::try_sink_out_of_loop()` sets its control >> to the loop exit projection. A range check in the main loop has this >> node as input (through a chain of some other nodes). Range check >> elimination needs to update the exit condition of the pre loop with an >> expression that depends on the node pinned on its exit: that's >> impossible and the assert fires. This is a variant of 8314024 (this >> one was for a node with uses out of the pre loop on multiple paths). I >> propose the same fix: leave the node with control in the pre loop in >> this case. > > Roland Westrelin has updated the pull request incrementally with four additional commits since the last revision: > > - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE3.java > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java > > Co-authored-by: Christian Hagedorn > - Update src/hotspot/share/opto/loopopts.cpp > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java > > Co-authored-by: Christian Hagedorn Thanks for the update, testing looked good! src/hotspot/share/opto/loopopts.cpp line 1926: > 1924: } > 1925: > 1926: // Sinking a node from a pre loop to its main loop pins the node between the pre and main loops. If that node is input Suggestion: // Sinking a node from a pre loop to its main loop pins the node between the pre and main loops. If that node is input ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26424#pullrequestreview-3055490801 PR Review Comment: https://git.openjdk.org/jdk/pull/26424#discussion_r2231144292 From dfenacci at openjdk.org Fri Jul 25 14:25:54 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Fri, 25 Jul 2025 14:25:54 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" In-Reply-To: References: Message-ID: <5JIvw451TX46AxSet-I-cjt0STj5riAlK4ajGDvIWrI=.2de4150e-a259-49c6-a628-3a0675e723fa@github.com> On Thu, 24 Jul 2025 15:10:37 GMT, Guanqiang Han wrote: > I'm able to consistently reproduce the problem using the following command line and test program ? > > java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java > > import java.util.Arrays; > public class Test{ > public static void main(String[] args) { > System.out.println("begin"); > byte[] arr1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > byte[] arr2 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > System.out.println(Arrays.equals(arr1, arr2)); > System.out.println("end"); > } > } > > From my analysis, the root cause appears to be a mismatch in operand handling between T_ADDRESS and T_LONG in LIR_Assembler::stack2reg, especially when the source is marked as double stack (e.g., T_LONG) and the destination as single CPU register (e.g., T_ADDRESS), leading to assertion failures like assert(is_single_stack())(because T_LONG is double_size). > > In the test program above , the call chain is: Arrays.equals ? ArraysSupport.vectorizedMismatch ? LIRGenerator::do_vectorizedMismatch > Within the do_vectorizedMismatch() method, a move instruction constructs an LIR_Op1. During LIR to machine code generation, LIR_Assembler::stack2reg was called. > > In this case, the src operand has type T_LONG and the dst operand has type T_ADDRESS. This combination triggers an assert in stack2reg, due to a mismatch between the stack slot type and register type handling. > > Importantly, this path ( LIR_Assembler::stack2reg was called ) is only taken when src is forced onto the stack. To reliably trigger this condition, the test is run with the -Xcomp option to force compilation and increase register pressure. > > A reference to the relevant code paths is provided below : > image1 > image2 > > On 64-bit platforms, although T_ADDRESS is classified as single_size, it is in fact 64 bits wide ,represent a single 64-bit general-purpose register and it can hold a T_LONG value, which is also 64 bits. > > However, T_LONG is defined as double_size, requiring two local variable slots or a pair of registers in the JVM's abstract model. This mismatch stems from the fact that T_ADDRESS is platform-dependent: it's 32 bits on 32-bit platforms, and 64 bits on 64-bit platforms ? yet its size classification remains single_size regardless. > > This classification... Thanks for looking into this @hgqxjj. Since we have a failing test, I think it would be nice to add a simple regression test. src/hotspot/share/c1/c1_LIR.hpp line 430: > 428: int single_stack_ix() const { assert(is_single_stack() && !is_virtual(), "type check"); return (int)data(); } > 429: int double_stack_ix() const { assert(is_double_stack() && !is_virtual(), "type check"); return (int)data(); } > 430: int stack_ix() const { assert((is_double_stack() || is_single_stack()) && !is_virtual(), "type check"); return (int)data(); } Minor thing, but I would follow the alignment of the other methods. 
------------- PR Review: https://git.openjdk.org/jdk/pull/26462#pullrequestreview-3055586948 PR Review Comment: https://git.openjdk.org/jdk/pull/26462#discussion_r2231205680 From thartmann at openjdk.org Fri Jul 25 14:42:53 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 25 Jul 2025 14:42:53 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" In-Reply-To: References: Message-ID: On Thu, 24 Jul 2025 15:10:37 GMT, Guanqiang Han wrote: > I'm able to consistently reproduce the problem using the following command line and test program: > > java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java > > import java.util.Arrays; > public class Test{ > public static void main(String[] args) { > System.out.println("begin"); > byte[] arr1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > byte[] arr2 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > System.out.println(Arrays.equals(arr1, arr2)); > System.out.println("end"); > } > } > > From my analysis, the root cause appears to be a mismatch in operand handling between T_ADDRESS and T_LONG in LIR_Assembler::stack2reg, especially when the source is marked as double stack (e.g., T_LONG) and the destination as a single CPU register (e.g., T_ADDRESS), leading to assertion failures like assert(is_single_stack()) (because T_LONG is double_size). > > In the test program above, the call chain is: Arrays.equals -> ArraysSupport.vectorizedMismatch -> LIRGenerator::do_vectorizedMismatch > Within the do_vectorizedMismatch() method, a move instruction constructs an LIR_Op1. During LIR-to-machine-code generation, LIR_Assembler::stack2reg is called. > > In this case, the src operand has type T_LONG and the dst operand has type T_ADDRESS. This combination triggers an assert in stack2reg, due to a mismatch between the stack slot type and register type handling. > > Importantly, this path (where LIR_Assembler::stack2reg is called) is only taken when src is forced onto the stack. To reliably trigger this condition, the test is run with the -Xcomp option to force compilation and increase register pressure. > > A reference to the relevant code paths is provided below: > image1 > image2 > > On 64-bit platforms, although T_ADDRESS is classified as single_size, it is in fact 64 bits wide, representing a single 64-bit general-purpose register, and it can hold a T_LONG value, which is also 64 bits. > > However, T_LONG is defined as double_size, requiring two local variable slots or a pair of registers in the JVM's abstract model. This mismatch stems from the fact that T_ADDRESS is platform-dependent: it's 32 bits on 32-bit platforms and 64 bits on 64-bit platforms, yet its size classification remains single_size regardless. > > This classification... Thanks for looking into this! When I run your test with `java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java` we still crash when compiling `java.lang.invoke.LambdaFormEditor::putInCache` and if I restrict compilation to your test method via `-XX:CompileCommand=compileonly,Test::*`, the issue does not reproduce anymore. Could you please add a targeted regression test for this issue?
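One possible shape for such a targeted test, based purely on the reproducer and flags quoted above (the test name, bug tag, and warm-up loop below are illustrative assumptions, not the test that was eventually added to the PR):

/*
 * @test
 * @bug 8359235
 * @summary C1 must not hit the stack2reg type check when spilling the
 *          T_ADDRESS operands of the vectorizedMismatch intrinsic
 * @run main/othervm -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200
 *      compiler.c1.TestVectorizedMismatchSpill
 */
package compiler.c1;

import java.util.Arrays;

public class TestVectorizedMismatchSpill {
    public static void main(String[] args) {
        byte[] a = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        byte[] b = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        boolean equal = true;
        // Repeatedly exercise the intrinsified Arrays.equals path so the enclosing
        // method is C1-compiled under enough register pressure to spill operands.
        for (int i = 0; i < 100_000; i++) {
            equal &= Arrays.equals(a, b);
        }
        if (!equal) {
            throw new RuntimeException("arrays should be equal");
        }
    }
}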
------------- PR Review: https://git.openjdk.org/jdk/pull/26462#pullrequestreview-3055704956 From thartmann at openjdk.org Fri Jul 25 14:47:53 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 25 Jul 2025 14:47:53 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" In-Reply-To: References: Message-ID: On Thu, 24 Jul 2025 15:10:37 GMT, Guanqiang Han wrote: > I'm able to consistently reproduce the problem using the following command line and test program ? > > java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java > > import java.util.Arrays; > public class Test{ > public static void main(String[] args) { > System.out.println("begin"); > byte[] arr1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > byte[] arr2 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > System.out.println(Arrays.equals(arr1, arr2)); > System.out.println("end"); > } > } > > From my analysis, the root cause appears to be a mismatch in operand handling between T_ADDRESS and T_LONG in LIR_Assembler::stack2reg, especially when the source is marked as double stack (e.g., T_LONG) and the destination as single CPU register (e.g., T_ADDRESS), leading to assertion failures like assert(is_single_stack())(because T_LONG is double_size). > > In the test program above , the call chain is: Arrays.equals ? ArraysSupport.vectorizedMismatch ? LIRGenerator::do_vectorizedMismatch > Within the do_vectorizedMismatch() method, a move instruction constructs an LIR_Op1. During LIR to machine code generation, LIR_Assembler::stack2reg was called. > > In this case, the src operand has type T_LONG and the dst operand has type T_ADDRESS. This combination triggers an assert in stack2reg, due to a mismatch between the stack slot type and register type handling. > > Importantly, this path ( LIR_Assembler::stack2reg was called ) is only taken when src is forced onto the stack. To reliably trigger this condition, the test is run with the -Xcomp option to force compilation and increase register pressure. > > A reference to the relevant code paths is provided below : > image1 > image2 > > On 64-bit platforms, although T_ADDRESS is classified as single_size, it is in fact 64 bits wide ,represent a single 64-bit general-purpose register and it can hold a T_LONG value, which is also 64 bits. > > However, T_LONG is defined as double_size, requiring two local variable slots or a pair of registers in the JVM's abstract model. This mismatch stems from the fact that T_ADDRESS is platform-dependent: it's 32 bits on 32-bit platforms, and 64 bits on 64-bit platforms ? yet its size classification remains single_size regardless. > > This classification... +1 to what Dean suggested. I think other intrinsics are affected by this as well though, for example: https://github.com/openjdk/jdk/blob/b1fa1ecc988fb07f191892a459625c2c8f2de3b5/src/hotspot/cpu/x86/c1_LIRGenerator_x86.cpp#L953-L962 Also, what about other platforms than x86? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26462#issuecomment-3118233262 From roland at openjdk.org Fri Jul 25 14:58:47 2025 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 25 Jul 2025 14:58:47 GMT Subject: RFR: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? 
[v3] In-Reply-To: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> References: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> Message-ID: > A node in a pre loop only has uses out of the loop dominated by the > loop exit. `PhaseIdealLoop::try_sink_out_of_loop()` sets its control > to the loop exit projection. A range check in the main loop has this > node as input (through a chain of some other nodes). Range check > elimination needs to update the exit condition of the pre loop with an > expression that depends on the node pinned on its exit: that's > impossible and the assert fires. This is a variant of 8314024 (this > one was for a node with uses out of the pre loop on multiple paths). I > propose the same fix: leave the node with control in the pre loop in > this case. Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/share/opto/loopopts.cpp Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26424/files - new: https://git.openjdk.org/jdk/pull/26424/files/2140c98d..1b658c4b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26424&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26424&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26424.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26424/head:pull/26424 PR: https://git.openjdk.org/jdk/pull/26424 From fjiang at openjdk.org Fri Jul 25 15:38:59 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Fri, 25 Jul 2025 15:38:59 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v5] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 08:39:39 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review this patch? >> >> By https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp#L1270, there are far call, indirect call, reloc call. >> NativeFarCall is in fact a reloc call, the name is confusing, better to rename it to RelocCall. >> Finally, choose to remove the NativeFarCall and delegation from NativeCall to NativeFarCall, and move all the implementation to NativeCall itself. >> Also add some comments and do some other simple cleanup. >> >> Thanks! > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > comments Thanks for the cleanup! I have one minor comment. src/hotspot/cpu/riscv/nativeInst_riscv.hpp line 118: > 116: // private: when common code is using byte_size() > 117: private: > 118: enum { I see the enum of NativeFarCall was named as `RISCV_specific_constants`, do we need this for NativeCall? ------------- PR Review: https://git.openjdk.org/jdk/pull/26370#pullrequestreview-3055942557 PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2231440772 From epeter at openjdk.org Fri Jul 25 17:46:58 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 25 Jul 2025 17:46:58 GMT Subject: RFR: 8324751: C2 SuperWord: Aliasing Analysis runtime check In-Reply-To: <2r_uZpbgYMypJFPIgI_t3NuTg1_S40mbeGrsvqi7IvE=.c4771662-0c55-4c02-96bb-99d5cfdb3697@github.com> References: <2r_uZpbgYMypJFPIgI_t3NuTg1_S40mbeGrsvqi7IvE=.c4771662-0c55-4c02-96bb-99d5cfdb3697@github.com> Message-ID: On Thu, 27 Mar 2025 13:00:20 GMT, Emanuel Peter wrote: > This is a big patch, but about 3.5k lines are tests. 
And a large part of the VM changes is comments / proofs. > > I am adding a dynamic (runtime) aliasing check to the auto-vectorizer (SuperWord). We use the infrastructure from https://github.com/openjdk/jdk/pull/22016: > - Use the auto-vectorization `predicate` when available: we speculate that there is no aliasing, else we trap and re-compile without the predicate. > - If the predicate is not available, we use `multiversioning`, i.e. we have a `fast_loop` where there is no aliasing, and hence vectorization. And a `slow_loop` if the check fails, with no vectorization. > > -------------------------- > > **Where to start reviewing** > > - `src/hotspot/share/opto/mempointer.hpp`: > - Read the class comment for `MemPointerRawSummand`. > - Familiarize yourself with the `MemPointer Linearity Corrolary`. We need it for the proofs of the aliasing runtime checks. > > - `src/hotspot/share/opto/vectorization.cpp`: > - Read the explanations and proofs above `VPointer::can_make_speculative_aliasing_check_with`. It explains how the aliasing runtime check works. > > - `src/hotspot/share/opto/vtransform.hpp`: > - Understand the difference between weak and strong edges. > > If you need to see some examples, then look at the tests: > - `test/hotspot/jtreg/compiler/loopopts/superword/TestAliasing.java`: simple array cases. IR rules that check for vectors and in somecases if we used multiversioning. > - `test/micro/org/openjdk/bench/vm/compiler/VectorAliasing.java`: the miro-benchmarks I show below. Simple array cases. > - `test/hotspot/jtreg/compiler/loopopts/superword/TestMemorySegmentAliasing.java`: a bit advanced, but similar cases. > - `test/hotspot/jtreg/compiler/loopopts/superword/TestAliasingFuzzer.java`: very large and rather compliex. Generates random loops, some with and some without aliasing at runtime. IR verification, but mostly currently only for array cases, MemorySegment cases have some issues (see comments). > -------------------------- > > **Details** > > Most fundamentally: > - I had to refactor / extend `MemPointer` so that we have access to `MemPointerRawSummand`s. > - These raw summands us to reconstruct the `VPointer` at any `iv` value with `VPointer::make_pointer_expression(Node* iv_value)`. > - With the raw summands, a pointer may look like this: `p = base + ConvI2L(x + 2) + ConvI2L(y + 2)` > - With "regular" summands, this gets simplified to `p = base + 4L +ConvI2L(x) + ConvI2L(y)` > - For aliasing analysis (adjacency and overlap), the "regu... Still hoping for reviewers :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/24278#issuecomment-3119727592 From mli at openjdk.org Fri Jul 25 19:39:55 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 25 Jul 2025 19:39:55 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v5] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 15:33:56 GMT, Feilong Jiang wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> comments > > src/hotspot/cpu/riscv/nativeInst_riscv.hpp line 118: > >> 116: // private: when common code is using byte_size() >> 117: private: >> 118: enum { > > I see the enum of NativeFarCall was named as `RISCV_specific_constants`, do we need this for NativeCall? Thank you having a look. Seems not, there is no name for this enum in orignal code . And in this file some enums have names, some not, but seems either way is fine, although I think the names are all redundant here. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2231880820 From jbhateja at openjdk.org Fri Jul 25 20:09:40 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 25 Jul 2025 20:09:40 GMT Subject: RFR: 8303762: Optimize vector slice operation with constant index using VPALIGNR instruction [v2] In-Reply-To: References: Message-ID: > Patch optimizes Vector. slice operation with constant index using x86 ALIGNR instruction. > It also adds a new hybrid call generator to facilitate lazy intrinsification or else perform procedural inlining to prevent call overhead and boxing penalties in case the fallback implementation expects to operate over vectors. The existing vector API-based slice implementation is now the fallback code that gets inlined in case intrinsification fails. > > Idea here is to add infrastructure support to enable intrinsification of fast path for selected vector APIs, else enable inlining of fall-back implementation if it's based on vector APIs. Existing call generators like PredictedCallGenerator, used to handle bi-morphic inlining, already make use of multiple call generators to handle hit/miss scenarios for a particular receiver type. The newly added hybrid call generator is lazy and called during incremental inlining optimization. It also relieves the inline expander to handle slow paths, which can easily be implemented library side (Java). > > Vector API jtreg tests pass at AVX level 2, remaining validation in progress. > > Performance numbers: > > > System : 13th Gen Intel(R) Core(TM) i3-1315U > > Baseline: > Benchmark (size) Mode Cnt Score Error Units > VectorSliceBenchmark.byteVectorSliceWithConstantIndex1 1024 thrpt 2 9444.444 ops/ms > VectorSliceBenchmark.byteVectorSliceWithConstantIndex2 1024 thrpt 2 10009.319 ops/ms > VectorSliceBenchmark.byteVectorSliceWithVariableIndex 1024 thrpt 2 9081.926 ops/ms > VectorSliceBenchmark.intVectorSliceWithConstantIndex1 1024 thrpt 2 6085.825 ops/ms > VectorSliceBenchmark.intVectorSliceWithConstantIndex2 1024 thrpt 2 6505.378 ops/ms > VectorSliceBenchmark.intVectorSliceWithVariableIndex 1024 thrpt 2 6204.489 ops/ms > VectorSliceBenchmark.longVectorSliceWithConstantIndex1 1024 thrpt 2 1651.334 ops/ms > VectorSliceBenchmark.longVectorSliceWithConstantIndex2 1024 thrpt 2 1642.784 ops/ms > VectorSliceBenchmark.longVectorSliceWithVariableIndex 1024 thrpt 2 1474.808 ops/ms > VectorSliceBenchmark.shortVectorSliceWithConstantIndex1 1024 thrpt 2 10399.394 ops/ms > VectorSliceBenchmark.shortVectorSliceWithConstantIndex2 1024 thrpt 2 10502.894 ops/ms > VectorSliceBenchmark.shortVectorSliceWithVariableIndex 1024 ... 
Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Updating predicate checks ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24104/files - new: https://git.openjdk.org/jdk/pull/24104/files/b2e93434..04be59a6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24104&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24104&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/24104.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24104/head:pull/24104 PR: https://git.openjdk.org/jdk/pull/24104 From dlong at openjdk.org Fri Jul 25 21:13:58 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 25 Jul 2025 21:13:58 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 03:23:38 GMT, Guanqiang Han wrote: >> I think it is good to detect mismatches between T_LONG and T_ADDRESS, so I'd rather not relax the checks. Why not fix >> do_vectorizedMismatch() to use new_register(T_ADDRESS)? And maybe file a separate RFE to cleanup this confusion that new_pointer_register() causes. > > @dean-long Thanks for the feedback! > Initially, I also considered modifying do_vectorizedMismatch() to use new_register(T_ADDRESS), as you suggested. However, I found that this change would trigger a series of follow-up modifications. as shown below: > image3 > image4 > That?s why I opted for a more localized fix . I believe this is still a reasonable compromise. On 64-bit platforms, both T_ADDRESS and T_LONG are 64-bit wide, and general-purpose registers are capable of holding either type. Moreover, the code already uses movptr for moving 64-bit wide data , as shown below: > image5 > So semantically, this modification in PR seems safe and practical in this context. > That said, I fully agree that the current treatment of new_pointer_register() is a bit confusing, If you, or other experts familiar with this area, believe the RFE is reasonable and it gets opened, I?d be happy to take on the implementation. > Thanks again for your insights, and I look forward to your feedback. @hgqxjj , I wasn't suggesting changing the new_pointer_register() implementation to use T_ADDRESS at this time, but to change intrinsics that call signature.append(T_ADDRESS) to use new_register(T_ADDRESS) for the register instead of with new_pointer_register(). As @TobiHartmann pointed out, we should fix all the intrinsics that are using signature.append(T_ADDRESS). ------------- PR Comment: https://git.openjdk.org/jdk/pull/26462#issuecomment-3120372534 From fjiang at openjdk.org Sat Jul 26 00:25:54 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Sat, 26 Jul 2025 00:25:54 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v5] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 08:39:39 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review this patch? >> >> By https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp#L1270, there are far call, indirect call, reloc call. >> NativeFarCall is in fact a reloc call, the name is confusing, better to rename it to RelocCall. >> Finally, choose to remove the NativeFarCall and delegation from NativeCall to NativeFarCall, and move all the implementation to NativeCall itself. >> Also add some comments and do some other simple cleanup. >> >> Thanks! 
> > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > comments Marked as reviewed by fjiang (Committer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26370#pullrequestreview-3057183153 From dlong at openjdk.org Sat Jul 26 01:40:08 2025 From: dlong at openjdk.org (Dean Long) Date: Sat, 26 Jul 2025 01:40:08 GMT Subject: RFR: 8361376: Regressions 1-6% in several Renaissance in 26-b4 only MacOSX aarch64 [v2] In-Reply-To: References: Message-ID: On Thu, 24 Jul 2025 18:51:22 GMT, Dean Long wrote: >> This PR removes the recently added lock around set_guard_value, using instead Atomic::cmpxchg to atomically update bit-fields of the guard value. Further, it takes a fast-path that uses the previous direct store when at a safepoint. Combined, these changes should get us back to almost where we were before in terms of overhead. If necessary, we could go even further and allow make_not_entrant() to perform a direct byte store, leaving 24 bits for the guard value. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > remove NMethodEntryBarrier_lock Unfortunately, I am still seeing a small 1% regression in Renaissance-NaiveBayes with ZGC. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26399#issuecomment-3120938875 From ghan at openjdk.org Sun Jul 27 10:06:53 2025 From: ghan at openjdk.org (Guanqiang Han) Date: Sun, 27 Jul 2025 10:06:53 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" In-Reply-To: <5JIvw451TX46AxSet-I-cjt0STj5riAlK4ajGDvIWrI=.2de4150e-a259-49c6-a628-3a0675e723fa@github.com> References: <5JIvw451TX46AxSet-I-cjt0STj5riAlK4ajGDvIWrI=.2de4150e-a259-49c6-a628-3a0675e723fa@github.com> Message-ID: On Fri, 25 Jul 2025 14:19:13 GMT, Damon Fenacci wrote: >> I'm able to consistently reproduce the problem using the following command line and test program ? >> >> java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java >> >> import java.util.Arrays; >> public class Test{ >> public static void main(String[] args) { >> System.out.println("begin"); >> byte[] arr1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; >> byte[] arr2 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; >> System.out.println(Arrays.equals(arr1, arr2)); >> System.out.println("end"); >> } >> } >> >> From my analysis, the root cause appears to be a mismatch in operand handling between T_ADDRESS and T_LONG in LIR_Assembler::stack2reg, especially when the source is marked as double stack (e.g., T_LONG) and the destination as single CPU register (e.g., T_ADDRESS), leading to assertion failures like assert(is_single_stack())(because T_LONG is double_size). >> >> In the test program above , the call chain is: Arrays.equals ? ArraysSupport.vectorizedMismatch ? LIRGenerator::do_vectorizedMismatch >> Within the do_vectorizedMismatch() method, a move instruction constructs an LIR_Op1. During LIR to machine code generation, LIR_Assembler::stack2reg was called. >> >> In this case, the src operand has type T_LONG and the dst operand has type T_ADDRESS. This combination triggers an assert in stack2reg, due to a mismatch between the stack slot type and register type handling. >> >> Importantly, this path ( LIR_Assembler::stack2reg was called ) is only taken when src is forced onto the stack. To reliably trigger this condition, the test is run with the -Xcomp option to force compilation and increase register pressure. 
>> >> A reference to the relevant code paths is provided below : >> image1 >> image2 >> >> On 64-bit platforms, although T_ADDRESS is classified as single_size, it is in fact 64 bits wide ,represent a single 64-bit general-purpose register and it can hold a T_LONG value, which is also 64 bits. >> >> However, T_LONG is defined as double_size, requiring two local variable slots or a pair of registers in the JVM's abstract model. This mismatch stems from the fact that T_ADDRESS is platform-dependent: it's 32 bits on 32-bit platforms, and 64 bits on 64-bit platforms ? yet its size class... > > src/hotspot/share/c1/c1_LIR.hpp line 430: > >> 428: int single_stack_ix() const { assert(is_single_stack() && !is_virtual(), "type check"); return (int)data(); } >> 429: int double_stack_ix() const { assert(is_double_stack() && !is_virtual(), "type check"); return (int)data(); } >> 430: int stack_ix() const { assert((is_double_stack() || is_single_stack()) && !is_virtual(), "type check"); return (int)data(); } > > Minor thing, but I would follow the alignment of the other methods. @dafedafe Thanks for the feedback. I'll be more careful with these details next time. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26462#discussion_r2233894328 From ghan at openjdk.org Sun Jul 27 10:20:35 2025 From: ghan at openjdk.org (Guanqiang Han) Date: Sun, 27 Jul 2025 10:20:35 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" [v2] In-Reply-To: References: Message-ID: > I'm able to consistently reproduce the problem using the following command line and test program ? > > java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java > > import java.util.Arrays; > public class Test{ > public static void main(String[] args) { > System.out.println("begin"); > byte[] arr1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > byte[] arr2 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > System.out.println(Arrays.equals(arr1, arr2)); > System.out.println("end"); > } > } > > From my analysis, the root cause appears to be a mismatch in operand handling between T_ADDRESS and T_LONG in LIR_Assembler::stack2reg, especially when the source is marked as double stack (e.g., T_LONG) and the destination as single CPU register (e.g., T_ADDRESS), leading to assertion failures like assert(is_single_stack())(because T_LONG is double_size). > > In the test program above , the call chain is: Arrays.equals ? ArraysSupport.vectorizedMismatch ? LIRGenerator::do_vectorizedMismatch > Within the do_vectorizedMismatch() method, a move instruction constructs an LIR_Op1. During LIR to machine code generation, LIR_Assembler::stack2reg was called. > > In this case, the src operand has type T_LONG and the dst operand has type T_ADDRESS. This combination triggers an assert in stack2reg, due to a mismatch between the stack slot type and register type handling. > > Importantly, this path ( LIR_Assembler::stack2reg was called ) is only taken when src is forced onto the stack. To reliably trigger this condition, the test is run with the -Xcomp option to force compilation and increase register pressure. > > A reference to the relevant code paths is provided below : > image1 > image2 > > On 64-bit platforms, although T_ADDRESS is classified as single_size, it is in fact 64 bits wide ,represent a single 64-bit general-purpose register and it can hold a T_LONG value, which is also 64 bits. 
> > However, T_LONG is defined as double_size, requiring two local variable slots or a pair of registers in the JVM's abstract model. This mismatch stems from the fact that T_ADDRESS is platform-dependent: it's 32 bits on 32-bit platforms, and 64 bits on 64-bit platforms ? yet its size classification remains single_size regardless. > > This classification... Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - add regression test - Merge remote-tracking branch 'upstream/master' into 8359235 - 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26462/files - new: https://git.openjdk.org/jdk/pull/26462/files/f4de477b..bc8b5c17 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26462&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26462&range=00-01 Stats: 611 lines in 64 files changed: 389 ins; 99 del; 123 mod Patch: https://git.openjdk.org/jdk/pull/26462.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26462/head:pull/26462 PR: https://git.openjdk.org/jdk/pull/26462 From ghan at openjdk.org Sun Jul 27 14:00:42 2025 From: ghan at openjdk.org (Guanqiang Han) Date: Sun, 27 Jul 2025 14:00:42 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" [v3] In-Reply-To: References: Message-ID: <609eLCJRQp0h2hjIo_CD_K3E2CJ3GA9F0HGnAr5Ufk0=.f2d82969-1566-4ffe-bfd5-74b20bd5d417@github.com> > I'm able to consistently reproduce the problem using the following command line and test program ? > > java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java > > import java.util.Arrays; > public class Test{ > public static void main(String[] args) { > System.out.println("begin"); > byte[] arr1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > byte[] arr2 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > System.out.println(Arrays.equals(arr1, arr2)); > System.out.println("end"); > } > } > > From my analysis, the root cause appears to be a mismatch in operand handling between T_ADDRESS and T_LONG in LIR_Assembler::stack2reg, especially when the source is marked as double stack (e.g., T_LONG) and the destination as single CPU register (e.g., T_ADDRESS), leading to assertion failures like assert(is_single_stack())(because T_LONG is double_size). > > In the test program above , the call chain is: Arrays.equals ? ArraysSupport.vectorizedMismatch ? LIRGenerator::do_vectorizedMismatch > Within the do_vectorizedMismatch() method, a move instruction constructs an LIR_Op1. During LIR to machine code generation, LIR_Assembler::stack2reg was called. > > In this case, the src operand has type T_LONG and the dst operand has type T_ADDRESS. This combination triggers an assert in stack2reg, due to a mismatch between the stack slot type and register type handling. > > Importantly, this path ( LIR_Assembler::stack2reg was called ) is only taken when src is forced onto the stack. To reliably trigger this condition, the test is run with the -Xcomp option to force compilation and increase register pressure. 
> > A reference to the relevant code paths is provided below : > image1 > image2 > > On 64-bit platforms, although T_ADDRESS is classified as single_size, it is in fact 64 bits wide ,represent a single 64-bit general-purpose register and it can hold a T_LONG value, which is also 64 bits. > > However, T_LONG is defined as double_size, requiring two local variable slots or a pair of registers in the JVM's abstract model. This mismatch stems from the fact that T_ADDRESS is platform-dependent: it's 32 bits on 32-bit platforms, and 64 bits on 64-bit platforms ? yet its size classification remains single_size regardless. > > This classification... Guanqiang Han has updated the pull request incrementally with one additional commit since the last revision: Increase sleep time to ensure the method gets compiled ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26462/files - new: https://git.openjdk.org/jdk/pull/26462/files/bc8b5c17..611d2fd1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26462&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26462&range=01-02 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26462.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26462/head:pull/26462 PR: https://git.openjdk.org/jdk/pull/26462 From duke at openjdk.org Sun Jul 27 14:25:58 2025 From: duke at openjdk.org (duke) Date: Sun, 27 Jul 2025 14:25:58 GMT Subject: RFR: 8362596: RISC-V: Improve _vectorizedHashCode intrinsic In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 08:07:48 GMT, Yuri Gaevsky wrote: > This is a micro-optimization for RISC-V SpacemiT K1 CPU to fix [encountered performance regression](https://github.com/openjdk/jdk/pull/17413#issuecomment-3082664335). @ygaevsky Your change (at version 1d5cb89486b935e2f30365f66e6bf5afd2058424) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26409#issuecomment-3124451738 From duke at openjdk.org Sun Jul 27 14:57:59 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Sun, 27 Jul 2025 14:57:59 GMT Subject: Integrated: 8362596: RISC-V: Improve _vectorizedHashCode intrinsic In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 08:07:48 GMT, Yuri Gaevsky wrote: > This is a micro-optimization for RISC-V SpacemiT K1 CPU to fix [encountered performance regression](https://github.com/openjdk/jdk/pull/17413#issuecomment-3082664335). This pull request has now been integrated. Changeset: 4189fcba Author: Yuri Gaevsky Committer: Feilong Jiang URL: https://git.openjdk.org/jdk/commit/4189fcbac40943f3b26c3a01938837b4e4762285 Stats: 6 lines in 1 file changed: 1 ins; 3 del; 2 mod 8362596: RISC-V: Improve _vectorizedHashCode intrinsic Reviewed-by: fyang, fjiang ------------- PR: https://git.openjdk.org/jdk/pull/26409 From ghan at openjdk.org Sun Jul 27 15:58:52 2025 From: ghan at openjdk.org (Guanqiang Han) Date: Sun, 27 Jul 2025 15:58:52 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" [v3] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 14:39:54 GMT, Tobias Hartmann wrote: > Thanks for looking into this! > > When I run your test with `java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java` we still crash when compiling `java.lang.invoke.LambdaFormEditor::putInCache` and if I restrict compilation to your test method via `-XX:CompileCommand=compileonly,Test::*`, the issue does not reproduce anymore. 
Could you please add a targeted regression test for this issue? @TobiHartmann Thanks for the feedback! I think there might be a bit of a misunderstanding. The original test program I provided is actually meant to reproduce the issue when run with: "java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java",it usually takes about 10 minutes to trigger the crash. Using -XX:CompileCommand=compileonly,Test::* won?t reproduce the issue because it compiles only a small subset of methods, which doesn?t put enough pressure on the register allocator to cause the spill (stack2reg can not be called ). Let?s forget about the previous test ? I?ve redesigned a new one and already committed it. Feel free to give it a try when you have time! > +1 to what Dean suggested. I think other intrinsics are affected by this as well though, for example: > > https://github.com/openjdk/jdk/blob/b1fa1ecc988fb07f191892a459625c2c8f2de3b5/src/hotspot/cpu/x86/c1_LIRGenerator_x86.cpp#L953-L962 > > Also, what about other platforms than x86? @TobiHartmann Other methods such as do_update_CRC32 may have similar issues, but they are harder to reproduce. Fortunately, other architectures have not implemented do_vectorizedMismatch. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26462#issuecomment-3124508002 From ghan at openjdk.org Sun Jul 27 16:12:55 2025 From: ghan at openjdk.org (Guanqiang Han) Date: Sun, 27 Jul 2025 16:12:55 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" [v3] In-Reply-To: References: Message-ID: On Sun, 27 Jul 2025 15:56:25 GMT, Guanqiang Han wrote: >> Thanks for looking into this! >> >> When I run your test with `java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java` we still crash when compiling `java.lang.invoke.LambdaFormEditor::putInCache` and if I restrict compilation to your test method via `-XX:CompileCommand=compileonly,Test::*`, the issue does not reproduce anymore. Could you please add a targeted regression test for this issue? > >> Thanks for looking into this! >> >> When I run your test with `java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java` we still crash when compiling `java.lang.invoke.LambdaFormEditor::putInCache` and if I restrict compilation to your test method via `-XX:CompileCommand=compileonly,Test::*`, the issue does not reproduce anymore. Could you please add a targeted regression test for this issue? > > @TobiHartmann Thanks for the feedback! I think there might be a bit of a misunderstanding. > The original test program I provided is actually meant to reproduce the issue when run with: "java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java",it usually takes about 10 minutes to trigger the crash. Using -XX:CompileCommand=compileonly,Test::* won?t reproduce the issue because it compiles only a small subset of methods, which doesn?t put enough pressure on the register allocator to cause the spill (stack2reg can not be called ). Let?s forget about the previous test ? I?ve redesigned a new one and already committed it. Feel free to give it a try when you have time! > > > >> +1 to what Dean suggested. I think other intrinsics are affected by this as well though, for example: >> >> https://github.com/openjdk/jdk/blob/b1fa1ecc988fb07f191892a459625c2c8f2de3b5/src/hotspot/cpu/x86/c1_LIRGenerator_x86.cpp#L953-L962 >> >> Also, what about other platforms than x86? 
> > @TobiHartmann Other methods such as do_update_CRC32 may have similar issues, but they are harder to reproduce. Fortunately, other architectures have not implemented do_vectorizedMismatch. > @hgqxjj , I wasn't suggesting changing the new_pointer_register() implementation to use T_ADDRESS at this time, but to change intrinsics that call signature.append(T_ADDRESS) to use new_register(T_ADDRESS) for the register instead of with new_pointer_register(). As @TobiHartmann pointed out, we should fix all the intrinsics that are using signature.append(T_ADDRESS). @dean-long Thanks for your suggestion! After reviewing the code again, I think your approach would work fine for the x86 architecture. However, for other architectures like aarch64, we would also need to modify the implementation of leal accordingly, since it checks the type of the target operand. The relevant code is as follows: https://github.com/openjdk/jdk/blob/4189fcbac40943f3b26c3a01938837b4e4762285/src/hotspot/cpu/aarch64/c1_LIRGenerator_aarch64.cpp#L985 https://github.com/openjdk/jdk/blob/4189fcbac40943f3b26c3a01938837b4e4762285/src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp#L2826 https://github.com/openjdk/jdk/blob/4189fcbac40943f3b26c3a01938837b4e4762285/src/hotspot/share/c1/c1_LIR.cpp#L40 https://github.com/openjdk/jdk/blob/4189fcbac40943f3b26c3a01938837b4e4762285/src/hotspot/share/c1/c1_LIR.hpp#L431 T_ADDRESS is not double cpu? so i need to modify the implementation of leal accordingly . @dean-long @TobiHartmann Do you think this approach is okay, or do you have any other suggestions? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26462#issuecomment-3124515635 From haosun at openjdk.org Mon Jul 28 01:01:06 2025 From: haosun at openjdk.org (Hao Sun) Date: Mon, 28 Jul 2025 01:01:06 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v18] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 09:17:19 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. 
>> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Refine comments in the ad file Marked as reviewed by haosun (Committer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23570#pullrequestreview-3059765593 From jkarthikeyan at openjdk.org Mon Jul 28 02:39:02 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Mon, 28 Jul 2025 02:39:02 GMT Subject: RFR: 8362979: C2 fails with unexpected node in SuperWord truncation: CmpLTMask, RoundF Message-ID: Hi all, This is a fix for a debug assert failure in SuperWord truncation for `CmpLTMask` and `RoundF` nodes, as discovered by CTW in the linked JBS report. I've added the nodes to the switch, and added reduced test cases. I've made a similar fix for `RoundD` nodes as well. Thanks! ------------- Commit messages: - Fix truncation assert for CmpLTMask and rounding Changes: https://git.openjdk.org/jdk/pull/26494/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26494&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362979 Stats: 57 lines in 3 files changed: 57 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26494.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26494/head:pull/26494 PR: https://git.openjdk.org/jdk/pull/26494 From chagedorn at openjdk.org Mon Jul 28 05:55:54 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 28 Jul 2025 05:55:54 GMT Subject: RFR: 8362979: C2 fails with unexpected node in SuperWord truncation: CmpLTMask, RoundF In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 02:34:25 GMT, Jasmine Karthikeyan wrote: > Hi all, > This is a fix for a debug assert failure in SuperWord truncation for `CmpLTMask` and `RoundF` nodes, as discovered by CTW in the linked JBS report. I've added the nodes to the switch, and added reduced test cases. I've made a similar fix for `RoundD` nodes as well. Thanks! Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26494#pullrequestreview-3060502412 From jbhateja at openjdk.org Mon Jul 28 05:55:55 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 28 Jul 2025 05:55:55 GMT Subject: RFR: 8303762: Optimize vector slice operation with constant index using VPALIGNR instruction [v2] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 20:09:40 GMT, Jatin Bhateja wrote: >> Patch optimizes Vector. slice operation with constant index using x86 ALIGNR instruction. 
>> It also adds a new hybrid call generator to facilitate lazy intrinsification or else perform procedural inlining to prevent call overhead and boxing penalties in case the fallback implementation expects to operate over vectors. The existing vector API-based slice implementation is now the fallback code that gets inlined in case intrinsification fails. >> >> Idea here is to add infrastructure support to enable intrinsification of fast path for selected vector APIs, else enable inlining of fall-back implementation if it's based on vector APIs. Existing call generators like PredictedCallGenerator, used to handle bi-morphic inlining, already make use of multiple call generators to handle hit/miss scenarios for a particular receiver type. The newly added hybrid call generator is lazy and called during incremental inlining optimization. It also relieves the inline expander to handle slow paths, which can easily be implemented library side (Java). >> >> Vector API jtreg tests pass at AVX level 2, remaining validation in progress. >> >> Performance numbers: >> >> >> System : 13th Gen Intel(R) Core(TM) i3-1315U >> >> Baseline: >> Benchmark (size) Mode Cnt Score Error Units >> VectorSliceBenchmark.byteVectorSliceWithConstantIndex1 1024 thrpt 2 9444.444 ops/ms >> VectorSliceBenchmark.byteVectorSliceWithConstantIndex2 1024 thrpt 2 10009.319 ops/ms >> VectorSliceBenchmark.byteVectorSliceWithVariableIndex 1024 thrpt 2 9081.926 ops/ms >> VectorSliceBenchmark.intVectorSliceWithConstantIndex1 1024 thrpt 2 6085.825 ops/ms >> VectorSliceBenchmark.intVectorSliceWithConstantIndex2 1024 thrpt 2 6505.378 ops/ms >> VectorSliceBenchmark.intVectorSliceWithVariableIndex 1024 thrpt 2 6204.489 ops/ms >> VectorSliceBenchmark.longVectorSliceWithConstantIndex1 1024 thrpt 2 1651.334 ops/ms >> VectorSliceBenchmark.longVectorSliceWithConstantIndex2 1024 thrpt 2 1642.784 ops/ms >> VectorSliceBenchmark.longVectorSliceWithVariableIndex 1024 thrpt 2 1474.808 ops/ms >> VectorSliceBenchmark.shortVectorSliceWithConstantIndex1 1024 thrpt 2 10399.394 ops/ms >> VectorSliceBenchmark.shortVectorSliceWithConstantIndex2 1024 thrpt 2 10502.894 ops/ms >> VectorSliceB... > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Updating predicate checks Performance on AVX512 machine Baseline: Benchmark (size) Mode Cnt Score Error Units VectorSliceBenchmark.byteVectorSliceWithConstantIndex1 1024 thrpt 4 35741.780 ? 1561.065 ops/ms VectorSliceBenchmark.byteVectorSliceWithConstantIndex2 1024 thrpt 4 35011.929 ? 5886.902 ops/ms VectorSliceBenchmark.byteVectorSliceWithVariableIndex 1024 thrpt 4 32366.844 ? 1489.449 ops/ms VectorSliceBenchmark.intVectorSliceWithConstantIndex1 1024 thrpt 4 10636.281 ? 608.705 ops/ms VectorSliceBenchmark.intVectorSliceWithConstantIndex2 1024 thrpt 4 10750.833 ? 328.997 ops/ms VectorSliceBenchmark.intVectorSliceWithVariableIndex 1024 thrpt 4 10257.338 ? 2027.422 ops/ms VectorSliceBenchmark.longVectorSliceWithConstantIndex1 1024 thrpt 4 5362.330 ? 4199.651 ops/ms VectorSliceBenchmark.longVectorSliceWithConstantIndex2 1024 thrpt 4 4992.399 ? 6053.641 ops/ms VectorSliceBenchmark.longVectorSliceWithVariableIndex 1024 thrpt 4 4941.258 ? 478.193 ops/ms VectorSliceBenchmark.shortVectorSliceWithConstantIndex1 1024 thrpt 4 40432.828 ? 26672.673 ops/ms VectorSliceBenchmark.shortVectorSliceWithConstantIndex2 1024 thrpt 4 41300.811 ? 34342.482 ops/ms VectorSliceBenchmark.shortVectorSliceWithVariableIndex 1024 thrpt 4 36958.309 ? 
1899.676 ops/ms Withopt: Benchmark (size) Mode Cnt Score Error Units VectorSliceBenchmark.byteVectorSliceWithConstantIndex1 1024 thrpt 10 67936.711 ? 389.783 ops/ms VectorSliceBenchmark.byteVectorSliceWithConstantIndex2 1024 thrpt 10 70086.731 ? 5972.968 ops/ms VectorSliceBenchmark.byteVectorSliceWithVariableIndex 1024 thrpt 10 31879.187 ? 148.213 ops/ms VectorSliceBenchmark.intVectorSliceWithConstantIndex1 1024 thrpt 10 17676.883 ? 217.238 ops/ms VectorSliceBenchmark.intVectorSliceWithConstantIndex2 1024 thrpt 10 16983.007 ? 3988.548 ops/ms VectorSliceBenchmark.intVectorSliceWithVariableIndex 1024 thrpt 10 9851.266 ? 31.773 ops/ms VectorSliceBenchmark.longVectorSliceWithConstantIndex1 1024 thrpt 10 9194.216 ? 42.772 ops/ms VectorSliceBenchmark.longVectorSliceWithConstantIndex2 1024 thrpt 10 8411.738 ? 33.209 ops/ms VectorSliceBenchmark.longVectorSliceWithVariableIndex 1024 thrpt 10 5244.850 ? 12.214 ops/ms VectorSliceBenchmark.shortVectorSliceWithConstantIndex1 1024 thrpt 10 61233.526 ? 20472.895 ops/ms VectorSliceBenchmark.shortVectorSliceWithConstantIndex2 1024 thrpt 10 61545.276 ? 20722.066 ops/ms VectorSliceBenchmark.shortVectorSliceWithVariableIndex 1024 thrpt 10 41208.718 ? 5374.829 ops/ms ------------- PR Comment: https://git.openjdk.org/jdk/pull/24104#issuecomment-3125629912 From chagedorn at openjdk.org Mon Jul 28 06:36:53 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 28 Jul 2025 06:36:53 GMT Subject: RFR: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? [v3] In-Reply-To: References: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> Message-ID: On Fri, 25 Jul 2025 14:58:47 GMT, Roland Westrelin wrote: >> A node in a pre loop only has uses out of the loop dominated by the >> loop exit. `PhaseIdealLoop::try_sink_out_of_loop()` sets its control >> to the loop exit projection. A range check in the main loop has this >> node as input (through a chain of some other nodes). Range check >> elimination needs to update the exit condition of the pre loop with an >> expression that depends on the node pinned on its exit: that's >> impossible and the assert fires. This is a variant of 8314024 (this >> one was for a node with uses out of the pre loop on multiple paths). I >> propose the same fix: leave the node with control in the pre loop in >> this case. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > Update src/hotspot/share/opto/loopopts.cpp > > Co-authored-by: Christian Hagedorn Marked as reviewed by chagedorn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26424#pullrequestreview-3060627024 From thartmann at openjdk.org Mon Jul 28 06:44:53 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 28 Jul 2025 06:44:53 GMT Subject: RFR: 8362979: C2 fails with unexpected node in SuperWord truncation: CmpLTMask, RoundF In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 02:34:25 GMT, Jasmine Karthikeyan wrote: > Hi all, > This is a fix for a debug assert failure in SuperWord truncation for `CmpLTMask` and `RoundF` nodes, as discovered by CTW in the linked JBS report. I've added the nodes to the switch, and added reduced test cases. I've made a similar fix for `RoundD` nodes as well. Thanks! Looks good to me too. Thanks for quickly jumping on this! ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26494#pullrequestreview-3060659436 From mhaessig at openjdk.org Mon Jul 28 08:04:56 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 28 Jul 2025 08:04:56 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v2] In-Reply-To: References: Message-ID: <5AHjMamXpJNedikYebcuP6DhN7NM5Vg5xJgxDnxcV-s=.34299053-9cc9-4035-bcb0-0d23e521162b@github.com> On Thu, 24 Jul 2025 19:03:24 GMT, Dean Long wrote: >> src/hotspot/share/runtime/deoptimization.cpp line 847: >> >>> 845: >>> 846: #ifndef PRODUCT >>> 847: #ifdef ASSERT >> >> Why is both `NOT_PRODUCT` and `ASSERT` needed here? So far, I thought that `ASSERT` implies `NOT_PRODUCT`. > > Unfortunately, they are not the same, thanks to "optimized" builds. We can clean this up if optimizes builds get removed. See https://bugs.openjdk.org/browse/JDK-8183287. Ah, "optimized" is with neither `ASSERT` nor `PRODUCT` defined. Makes sense now. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2235198526 From mhaessig at openjdk.org Mon Jul 28 08:36:55 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 28 Jul 2025 08:36:55 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v7] In-Reply-To: References: Message-ID: <_gJoTNnBpV2Y2ENO9s153NWZeq_ujs40-zoyuZstOqM=.69d1d039-5022-4beb-ae79-7fc4193f3a11@github.com> On Thu, 24 Jul 2025 20:03:33 GMT, Dean Long wrote: >> The VerifyStack logic in Deoptimization::unpack_frames() attempts to check the expression stack size of the interpreter frame against what GenerateOopMap computes. To do this, it needs to know if the state at the current bci represents the "before" state, meaning the bytecode will be reexecuted, or the "after" state, meaning we will advance to the next bytecode. The old code didn't know how to determine exactly what state we were in, so it checked both. This PR cleans that up, so we only have to compute the oopmap once. It also removes old SPARC support. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > readability suggestion Thank you for addressing my comments. I have done another pass and it looks good to me. ------------- Marked as reviewed by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/26121#pullrequestreview-3061222726 From jsjolen at openjdk.org Mon Jul 28 08:41:06 2025 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Mon, 28 Jul 2025 08:41:06 GMT Subject: RFR: 8352112: [ubsan] hotspot/share/code/relocInfo.cpp:130:37: runtime error: applying non-zero offset 18446744073709551614 to null pointer [v2] In-Reply-To: References: Message-ID: On Wed, 19 Mar 2025 17:52:46 GMT, Vladimir Kozlov wrote: >> Before [JDK-8343789](https://bugs.openjdk.org/browse/JDK-8343789) `relocation_begin()` was never null even when there was no relocations - it pointed to the beginning of constant or code section in such case. It was used by relocation code to simplify code and avoid null checks. >> With that fix `relocation_begin()` points to address in `CodeBlob::_mutable_data` field which could be `nullptr` if there is no relocation and metadata. >> >> There easy fix is to avoid `nullptr` in `CodeBlob::_mutable_data`. We could do that similar to what we do for `nmethod::_immutable_data`: [nmethod.cpp#L1514](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/nmethod.cpp#L1514). >> >> Tested tier1-4, stress, xcomp. Verified with failed tests listed in bug report. 
> > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Update field default setting I suspect that this change fixes the UBSAN issue but instead causes a runtime issue which NMT detects, see this bug: https://bugs.openjdk.org/browse/JDK-8361382 I added a caused-by link, but I'm not 100% sure that this is the case yet. src/hotspot/share/code/codeBlob.cpp line 156: > 154: } else { > 155: // We need unique and valid not null address > 156: assert(_mutable_data = blob_end(), "sanity"); Did this mean to assign the `_mutable_data`? I think it should be `==`. ------------- PR Review: https://git.openjdk.org/jdk/pull/24102#pullrequestreview-3061046981 PR Review Comment: https://git.openjdk.org/jdk/pull/24102#discussion_r2235188508 From mli at openjdk.org Mon Jul 28 08:44:02 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 28 Jul 2025 08:44:02 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v5] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 15:36:36 GMT, Feilong Jiang wrote: > Thanks for the cleanup! I have one minor comment. Thank you! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26370#issuecomment-3126157777 From mli at openjdk.org Mon Jul 28 08:44:03 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 28 Jul 2025 08:44:03 GMT Subject: Integrated: 8362515: RISC-V: cleanup NativeFarCall In-Reply-To: References: Message-ID: <3esJOaI8GSNF45pD-JZCANQ5GsMDic8r_emp6rmkED8=.76b3804b-49ef-408e-81a1-a6c2cfeb288a@github.com> On Thu, 17 Jul 2025 14:17:45 GMT, Hamlin Li wrote: > Hi, > Can you help to review this patch? > > By https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp#L1270, there are far call, indirect call, reloc call. > NativeFarCall is in fact a reloc call, the name is confusing, better to rename it to RelocCall. > Finally, choose to remove the NativeFarCall and delegation from NativeCall to NativeFarCall, and move all the implementation to NativeCall itself. > Also add some comments and do some other simple cleanup. > > Thanks! This pull request has now been integrated. Changeset: 3e2d12d8 Author: Hamlin Li URL: https://git.openjdk.org/jdk/commit/3e2d12d85a35d9724c2ddf17a2dccf4b0866bc62 Stats: 139 lines in 2 files changed: 11 ins; 97 del; 31 mod 8362515: RISC-V: cleanup NativeFarCall Reviewed-by: fyang, fjiang ------------- PR: https://git.openjdk.org/jdk/pull/26370 From bmaillard at openjdk.org Mon Jul 28 11:52:41 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Mon, 28 Jul 2025 11:52:41 GMT Subject: RFR: 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v2] In-Reply-To: References: Message-ID: <6d5-5z-1q5VZ3bY9xGKsAbiLbz4e8IySfI7NYXZOdS0=.9574728a-5afe-496c-a89b-480562cd96db@github.com> > This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist` (again). This is similar to [JDK-8361700](https://bugs.openjdk.org/browse/JDK-8361700?filter=-1). > > The optimization in question is the removal of redundant `ConvX2Y->ConvY2X->ConvX2Y` sequences (where `X` and `Y` are primitive number types), which get replaced by a single `ConvX2Y` as an identity optimization. 
This missing optimization was originally reported only for `ConvD2LNode`, but it turns out that other conversion nodes have analog optimization patterns. After manual inspection of identity optimizations in `convertnode.cpp`, I was able to reproduce missing optimizations for the following conversion sequences: > - `ConvD2L->ConvL2D->ConvD2L` > - `ConvF2I->ConvI2F->ConvF2I` > - `ConvF2L->ConvL2F->ConvF2L` > - `ConvI2F->ConvF2I->ConvI2F` > > Similar optimization patterns exist for additional conversion nodes. However, it is not clear if these nodes are subject to the same missed optimization issue. Further investigation may be needed, as I was unable to reproduce such cases with simple tests. > > This is again a case where an optimization depends on the input of its input. Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8359603) > - [x] tier1-3, plus some internal testing > - [x] Added test from the fuzzer, and tests for other sequences (manually derived from the original one) > > Thank you for reviewing! Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: Update test/hotspot/jtreg/compiler/c2/TestEliminateRedundantConversionSequences.java Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26368/files - new: https://git.openjdk.org/jdk/pull/26368/files/9c49b040..0b3244fd Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26368&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26368&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26368.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26368/head:pull/26368 PR: https://git.openjdk.org/jdk/pull/26368 From bmaillard at openjdk.org Mon Jul 28 11:58:12 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Mon, 28 Jul 2025 11:58:12 GMT Subject: RFR: 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v3] In-Reply-To: References: Message-ID: <1sVn4xoZ_PcWL36gmBVi_IBEaYO4AQzSXuJMFKygdvI=.5a54417c-ed5d-4e95-b997-bf7bacf673af@github.com> > This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist` (again). This is similar to [JDK-8361700](https://bugs.openjdk.org/browse/JDK-8361700?filter=-1). > > The optimization in question is the removal of redundant `ConvX2Y->ConvY2X->ConvX2Y` sequences (where `X` and `Y` are primitive number types), which get replaced by a single `ConvX2Y` as an identity optimization. This missing optimization was originally reported only for `ConvD2LNode`, but it turns out that other conversion nodes have analog optimization patterns. 
After manual inspection of identity optimizations in `convertnode.cpp`, I was able to reproduce missing optimizations for the following conversion sequences: > - `ConvD2L->ConvL2D->ConvD2L` > - `ConvF2I->ConvI2F->ConvF2I` > - `ConvF2L->ConvL2F->ConvF2L` > - `ConvI2F->ConvF2I->ConvI2F` > > Similar optimization patterns exist for additional conversion nodes. However, it is not clear if these nodes are subject to the same missed optimization issue. Further investigation may be needed, as I was unable to reproduce such cases with simple tests. > > This is again a case where an optimization depends on the input of its input. Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8359603) > - [x] tier1-3, plus some internal testing > - [x] Added test from the fuzzer, and tests for other sequences (manually derived from the original one) > > Thank you for reviewing! Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: 8359603: Reduce number of iterations in tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26368/files - new: https://git.openjdk.org/jdk/pull/26368/files/0b3244fd..10f79866 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26368&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26368&range=01-02 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/26368.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26368/head:pull/26368 PR: https://git.openjdk.org/jdk/pull/26368 From bmaillard at openjdk.org Mon Jul 28 11:58:13 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Mon, 28 Jul 2025 11:58:13 GMT Subject: RFR: 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v3] In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 08:44:36 GMT, Christian Hagedorn wrote: >> Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: >> >> 8359603: Reduce number of iterations in tests > > test/hotspot/jtreg/compiler/c2/TestEliminateRedundantConversionSequences.java line 94: > >> 92: >> 93: public static void main(String[] strArr) { >> 94: for (int i = 0; i < 50_000; ++i) { > > Do you really need 50000 iterations each? Would less also trigger the bug? I was able to reduce it to ~1550 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26368#discussion_r2236079880 From chagedorn at openjdk.org Mon Jul 28 11:58:56 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 28 Jul 2025 11:58:56 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v7] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 09:27:19 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is >> relative smaller than that of `fromLong`. So this patch does the conversion for these cases. 
>> >> The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. >> >> Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. >> >> This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like >> >> VectorMaskToLong (VectorLongToMask x) => x >> >> >> Hence, this patch also added the following optimizations: >> >> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> >> VectorMaskCast (VectorMaskCast x) => x >> >> And we can see noticeable performance improvement with the above optimizations for floating-point types. >> >> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Error After Error Uplift >> microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 >> microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 >> microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 >> microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 >> microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 >> microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 >> microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 >> microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 >> >> >> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: >> >> Benchm... > > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Move the assertion to the beginning of the code block I'll give this a spin in our testing - will report the results back later. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3126886620 From bmaillard at openjdk.org Mon Jul 28 12:25:56 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Mon, 28 Jul 2025 12:25:56 GMT Subject: RFR: 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v3] In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 08:49:38 GMT, Christian Hagedorn wrote: >> Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: >> >> 8359603: Reduce number of iterations in tests > > src/hotspot/share/opto/phaseX.cpp line 2565: > >> 2563: // ConvF2I->ConvI2F->ConvF2I >> 2564: // ConvF2L->ConvL2F->ConvF2L >> 2565: // ConvI2F->ConvF2I->ConvI2F > > Another thought: Since this is an incomplete list of variations (especially missing, for example, the I2D version while the I2F version is here), should we leave a comment about not being able to trigger issues with the other versions? 
Otherwise, it could suggest that it was just forgotten. The notification issue that is solved here only happens with optimizations that have this chain pattern with three nodes (checking the input of the input) and this is specific to conversions that have a loss of precision. ConvI2D is not here because the `ConvI2D->ConvD2I` gets optimized to a NOP already (`ConvD2INode::Identity`). But there are also chains for which there is a known optimization and for which I was not able to trigger a missed optimization, so it would make sense to have mention this in any case. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26368#discussion_r2236184320 From fyang at openjdk.org Mon Jul 28 12:30:06 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 28 Jul 2025 12:30:06 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call Message-ID: Hi, please consider this small change. JDK-8343430 removed the old trampoline call on RISC-V. And the new solution (reloc call) loads a target address from stub section and do an indirect call. This means the stub is always there for a NativeCall. So there's no need to check existence of the stub when doing `CallRelocation::fix_relocation_after_move` [1]. We can always return the stub address in `NativeCall::reloc_destination` and use that address in `NativeCall::reloc_set_destination`. This helps simplify the code and saves one `MacroAssembler::target_addr_for_insn` call and one `trampoline_stub_Relocation::get_trampoline_for` call in these two functions respectively. Testing on linux-riscv64: - [x] tier1-tier3 (release build) - [x] hs:tier1-hs:tier3 (fastdebug build) [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.cpp#L404-L406 ------------- Commit messages: - 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call Changes: https://git.openjdk.org/jdk/pull/26495/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26495&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8364150 Stats: 12 lines in 2 files changed: 1 ins; 7 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/26495.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26495/head:pull/26495 PR: https://git.openjdk.org/jdk/pull/26495 From bmaillard at openjdk.org Mon Jul 28 12:35:15 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Mon, 28 Jul 2025 12:35:15 GMT Subject: RFR: 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v4] In-Reply-To: References: Message-ID: > This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist` (again). This is similar to [JDK-8361700](https://bugs.openjdk.org/browse/JDK-8361700?filter=-1). > > The optimization in question is the removal of redundant `ConvX2Y->ConvY2X->ConvX2Y` sequences (where `X` and `Y` are primitive number types), which get replaced by a single `ConvX2Y` as an identity optimization. This missing optimization was originally reported only for `ConvD2LNode`, but it turns out that other conversion nodes have analog optimization patterns. 
After manual inspection of identity optimizations in `convertnode.cpp`, I was able to reproduce missing optimizations for the following conversion sequences: > - `ConvD2L->ConvL2D->ConvD2L` > - `ConvF2I->ConvI2F->ConvF2I` > - `ConvF2L->ConvL2F->ConvF2L` > - `ConvI2F->ConvF2I->ConvI2F` > > Similar optimization patterns exist for additional conversion nodes. However, it is not clear if these nodes are subject to the same missed optimization issue. Further investigation may be needed, as I was unable to reproduce such cases with simple tests. > > This is again a case where an optimization depends on the input of its input. Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8359603) > - [x] tier1-3, plus some internal testing > - [x] Added test from the fuzzer, and tests for other sequences (manually derived from the original one) > > Thank you for reviewing! Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: 8359603: Add note ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26368/files - new: https://git.openjdk.org/jdk/pull/26368/files/10f79866..2e5efdcc Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26368&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26368&range=02-03 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26368.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26368/head:pull/26368 PR: https://git.openjdk.org/jdk/pull/26368 From bmaillard at openjdk.org Mon Jul 28 12:35:17 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Mon, 28 Jul 2025 12:35:17 GMT Subject: RFR: 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v4] In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 12:23:27 GMT, Beno?t Maillard wrote: >> src/hotspot/share/opto/phaseX.cpp line 2565: >> >>> 2563: // ConvF2I->ConvI2F->ConvF2I >>> 2564: // ConvF2L->ConvL2F->ConvF2L >>> 2565: // ConvI2F->ConvF2I->ConvI2F >> >> Another thought: Since this is an incomplete list of variations (especially missing, for example, the I2D version while the I2F version is here), should we leave a comment about not being able to trigger issues with the other versions? Otherwise, it could suggest that it was just forgotten. > > The notification issue that is solved here only happens with optimizations that have this chain pattern with three nodes (checking the input of the input) and this is specific to conversions that have a loss of precision. ConvI2D is not here because the `ConvI2D->ConvD2I` gets optimized to a NOP already (`ConvD2INode::Identity`). But there are also chains for which there is a known optimization and for which I was not able to trigger a missed optimization, so it would make sense to have mention this in any case. I have added a short note, let me know what you think! 
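For readers following along, a minimal Java sketch of the kind of chain these identities collapse; this is only an illustration of the ConvD2L->ConvL2D->ConvD2L pattern discussed above, not code taken from the PR or its regression test:

    // Illustration only: javac emits d2l, l2d, d2l for the three casts, so C2 sees the
    // chain ConvD2L -> ConvL2D -> ConvD2L and the identity optimization can replace the
    // whole expression with the first ConvD2L (the value of 'a').
    static long roundTrip(double d) {
        long a = (long) d;       // ConvD2L
        double b = (double) a;   // ConvL2D
        return (long) b;         // ConvD2L, redundant once the chain is recognized
    }

(As described in the PR, the identity looks at the input of its input, which is why the extra notification in add_users_of_use_to_worklist is needed for it to be retried.)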
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26368#discussion_r2236219134 From bulasevich at openjdk.org Mon Jul 28 12:42:04 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Mon, 28 Jul 2025 12:42:04 GMT Subject: RFR: 8352112: [ubsan] hotspot/share/code/relocInfo.cpp:130:37: runtime error: applying non-zero offset 18446744073709551614 to null pointer [v2] In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 08:00:16 GMT, Johan Sj?len wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Update field default setting > > src/hotspot/share/code/codeBlob.cpp line 156: > >> 154: } else { >> 155: // We need unique and valid not null address >> 156: assert(_mutable_data = blob_end(), "sanity"); > > Did this mean to assign the `_mutable_data`? I think it should be `==`. Right. This typo was fixed in https://github.com/openjdk/jdk/pull/26175 For now I do not see how this change is related with [JDK-8361382: NMT corruption](https://bugs.openjdk.org/browse/JDK-8361382) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24102#discussion_r2236254069 From mli at openjdk.org Mon Jul 28 12:53:53 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 28 Jul 2025 12:53:53 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 04:05:20 GMT, Fei Yang wrote: > Hi, please consider this small change. > > JDK-8343430 removed the old trampoline call on RISC-V. And the new solution (reloc call) loads a target address from stub section and do an indirect call. This means the stub is always there for a NativeCall. So there's no need to check existence of the stub when doing `CallRelocation::fix_relocation_after_move` [1]. > > We can always return the stub address in `NativeCall::reloc_destination` and use that address in `NativeCall::reloc_set_destination`. This helps simplify the code and saves one `MacroAssembler::target_addr_for_insn` call > and one `trampoline_stub_Relocation::get_trampoline_for` call in these two functions respectively. > > Testing on linux-riscv64: > - [x] tier1-tier3 (release build) > - [x] hs:tier1-hs:tier3 (fastdebug build) > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.cpp#L404-L406 src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 77: > 75: stub_addr = trampoline_stub_Relocation::get_trampoline_for(call_addr, code->as_nmethod()); > 76: assert(stub_addr != nullptr, "Sanity"); > 77: return stub_addr; Seems this line is not necessary. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26495#discussion_r2236316512 From fyang at openjdk.org Mon Jul 28 13:01:42 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 28 Jul 2025 13:01:42 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v2] In-Reply-To: References: Message-ID: <_66H8SCUT2si17IPmfPJ7g3tj9QW6m3Q7YOGMOsBKI8=.b1a9d462-df1f-4bb3-b4d0-07deadf28749@github.com> > Hi, please consider this small change. > > JDK-8343430 removed the old trampoline call on RISC-V. And the new solution (reloc call) loads a target address from stub section and do an indirect call. This means the stub is always there for a NativeCall. So there's no need to check existence of the stub when doing `CallRelocation::fix_relocation_after_move` [1]. 
> > We can always return the stub address in `NativeCall::reloc_destination` and use that address in `NativeCall::reloc_set_destination`. This helps simplify the code and saves one `MacroAssembler::target_addr_for_insn` call > and one `trampoline_stub_Relocation::get_trampoline_for` call in these two functions respectively. > > Testing on linux-riscv64: > - [x] tier1-tier3 (release build) > - [x] hs:tier1-hs:tier3 (fastdebug build) > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.cpp#L404-L406 Fei Yang has updated the pull request incrementally with one additional commit since the last revision: Comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26495/files - new: https://git.openjdk.org/jdk/pull/26495/files/05b69ad4..473b9fe1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26495&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26495&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26495.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26495/head:pull/26495 PR: https://git.openjdk.org/jdk/pull/26495 From mli at openjdk.org Mon Jul 28 13:01:43 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 28 Jul 2025 13:01:43 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v2] In-Reply-To: <_66H8SCUT2si17IPmfPJ7g3tj9QW6m3Q7YOGMOsBKI8=.b1a9d462-df1f-4bb3-b4d0-07deadf28749@github.com> References: <_66H8SCUT2si17IPmfPJ7g3tj9QW6m3Q7YOGMOsBKI8=.b1a9d462-df1f-4bb3-b4d0-07deadf28749@github.com> Message-ID: <9jNyVzYYc0O9nHTJYWpgbqkcV-mMzyKUvdgzpGNf6qM=.bb7b8abb-fc13-4057-8559-cc9a92a56393@github.com> On Mon, 28 Jul 2025 12:58:21 GMT, Fei Yang wrote: >> Hi, please consider this small change. >> >> JDK-8343430 removed the old trampoline call on RISC-V. And the new solution (reloc call) loads a target address from stub section and do an indirect call. This means the stub is always there for a NativeCall. So there's no need to check existence of the stub when doing `CallRelocation::fix_relocation_after_move` [1]. >> >> We can always return the stub address in `NativeCall::reloc_destination` and use that address in `NativeCall::reloc_set_destination`. This helps simplify the code and saves one `MacroAssembler::target_addr_for_insn` call >> and one `trampoline_stub_Relocation::get_trampoline_for` call in these two functions respectively. >> >> Testing on linux-riscv64: >> - [x] tier1-tier3 (release build) >> - [x] hs:tier1-hs:tier3 (fastdebug build) >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.cpp#L404-L406 > > Fei Yang has updated the pull request incrementally with one additional commit since the last revision: > > Comment Thanks for working on this! There are several comments below. src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 116: > 114: if (code->is_nmethod()) { > 115: assert(dest != nullptr, "Sanity"); > 116: MacroAssembler::pd_patch_instruction_size(call_addr, dest); `dest` here is the reloc call desitnation, ie. the dest stored in the stub, and it should be able to reach anywhere in the address space. The patch here should patch the `auipc + jalr` to the address of this stub, rather than `dest`? 
------------- PR Review: https://git.openjdk.org/jdk/pull/26495#pullrequestreview-3062577961 PR Review Comment: https://git.openjdk.org/jdk/pull/26495#discussion_r2236343787 From fyang at openjdk.org Mon Jul 28 13:01:43 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 28 Jul 2025 13:01:43 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v2] In-Reply-To: References: Message-ID: <-fTfStxKEpyiYk2WJ-w6kHhekWJZJK-H4DA03UhoVi8=.53308b2f-7f7d-4f39-ba8e-b2965fc7ce01@github.com> On Mon, 28 Jul 2025 12:51:27 GMT, Hamlin Li wrote: >> Fei Yang has updated the pull request incrementally with one additional commit since the last revision: >> >> Comment > > src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 77: > >> 75: stub_addr = trampoline_stub_Relocation::get_trampoline_for(call_addr, code->as_nmethod()); >> 76: assert(stub_addr != nullptr, "Sanity"); >> 77: return stub_addr; > > Seems this line is not necessary. Yes! I have removed this redundant return statement. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26495#discussion_r2236346532 From fyang at openjdk.org Mon Jul 28 13:13:57 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 28 Jul 2025 13:13:57 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v2] In-Reply-To: <9jNyVzYYc0O9nHTJYWpgbqkcV-mMzyKUvdgzpGNf6qM=.bb7b8abb-fc13-4057-8559-cc9a92a56393@github.com> References: <_66H8SCUT2si17IPmfPJ7g3tj9QW6m3Q7YOGMOsBKI8=.b1a9d462-df1f-4bb3-b4d0-07deadf28749@github.com> <9jNyVzYYc0O9nHTJYWpgbqkcV-mMzyKUvdgzpGNf6qM=.bb7b8abb-fc13-4057-8559-cc9a92a56393@github.com> Message-ID: On Mon, 28 Jul 2025 12:57:41 GMT, Hamlin Li wrote: >> Fei Yang has updated the pull request incrementally with one additional commit since the last revision: >> >> Comment > > src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 116: > >> 114: if (code->is_nmethod()) { >> 115: assert(dest != nullptr, "Sanity"); >> 116: MacroAssembler::pd_patch_instruction_size(call_addr, dest); > > `dest` here is the reloc call desitnation, ie. the dest stored in the stub, and it should be able to reach anywhere in the address space. > The patch here should patch the `auipc + jalr` to the address of this stub, rather than `dest`? Yes, the `dest` param here holds the address of this stub. In `CallRelocation::fix_relocation_after_move`, we first get the `callee` address by calling `pd_call_destination` which delegates work to `NativeCall::reloc_destination` for a NativeCall. And we have modified `NativeCall::reloc_destination` to return the stub address in this PR at the same time. So `callee` will hold the stub address. Immediatedly after that, `callee` is passed to `pd_set_call_destination` which delegates work to `NativeCall::reloc_set_destination`. Make sense? void CallRelocation::fix_relocation_after_move(const CodeBuffer* src, CodeBuffer* dest) { // Usually a self-relative reference to an external routine. // On some platforms, the reference is absolute (not self-relative). // The enhanced use of pd_call_destination sorts this all out. address orig_addr = old_addr_for(addr(), src, dest); address callee = pd_call_destination(orig_addr); <=========== callee is stub address // Reassert the callee address, this time in the new copy of the code. 
pd_set_call_destination(callee); <=========== callee passed as param } ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26495#discussion_r2236392322 From fyang at openjdk.org Mon Jul 28 13:17:15 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 28 Jul 2025 13:17:15 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v3] In-Reply-To: References: Message-ID: > Hi, please consider this small change. > > JDK-8343430 removed the old trampoline call on RISC-V. And the new solution (reloc call) loads a target address from stub section and do an indirect call. This means the stub is always there for a NativeCall. So there's no need to check existence of the stub when doing `CallRelocation::fix_relocation_after_move` [1]. > > We can always return the stub address in `NativeCall::reloc_destination` and use that address in `NativeCall::reloc_set_destination`. This helps simplify the code and saves one `MacroAssembler::target_addr_for_insn` call > and one `trampoline_stub_Relocation::get_trampoline_for` call in these two functions respectively. > > Testing on linux-riscv64: > - [x] tier1-tier3 (release build) > - [x] hs:tier1-hs:tier3 (fastdebug build) > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.cpp#L404-L406 Fei Yang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - Merge remote-tracking branch 'upstream/master' into JDK-8364150 - Comment - 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26495/files - new: https://git.openjdk.org/jdk/pull/26495/files/473b9fe1..01949774 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26495&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26495&range=01-02 Stats: 590 lines in 4 files changed: 332 ins; 220 del; 38 mod Patch: https://git.openjdk.org/jdk/pull/26495.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26495/head:pull/26495 PR: https://git.openjdk.org/jdk/pull/26495 From thartmann at openjdk.org Mon Jul 28 13:33:57 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 28 Jul 2025 13:33:57 GMT Subject: RFR: 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v4] In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 12:35:15 GMT, Beno?t Maillard wrote: >> This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist` (again). This is similar to [JDK-8361700](https://bugs.openjdk.org/browse/JDK-8361700?filter=-1). >> >> The optimization in question is the removal of redundant `ConvX2Y->ConvY2X->ConvX2Y` sequences (where `X` and `Y` are primitive number types), which get replaced by a single `ConvX2Y` as an identity optimization. This missing optimization was originally reported only for `ConvD2LNode`, but it turns out that other conversion nodes have analog optimization patterns. 
After manual inspection of identity optimizations in `convertnode.cpp`, I was able to reproduce missing optimizations for the following conversion sequences: >> - `ConvD2L->ConvL2D->ConvD2L` >> - `ConvF2I->ConvI2F->ConvF2I` >> - `ConvF2L->ConvL2F->ConvF2L` >> - `ConvI2F->ConvF2I->ConvI2F` >> >> Similar optimization patterns exist for additional conversion nodes. However, it is not clear if these nodes are subject to the same missed optimization issue. Further investigation may be needed, as I was unable to reproduce such cases with simple tests. >> >> This is again a case where an optimization depends on the input of its input. Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8359603) >> - [x] tier1-3, plus some internal testing >> - [x] Added test from the fuzzer, and tests for other sequences (manually derived from the original one) >> >> Thank you for reviewing! > > Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: > > 8359603: Add note Looks good to me! ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26368#pullrequestreview-3062756127 From mli at openjdk.org Mon Jul 28 13:36:55 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 28 Jul 2025 13:36:55 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v3] In-Reply-To: References: <_66H8SCUT2si17IPmfPJ7g3tj9QW6m3Q7YOGMOsBKI8=.b1a9d462-df1f-4bb3-b4d0-07deadf28749@github.com> <9jNyVzYYc0O9nHTJYWpgbqkcV-mMzyKUvdgzpGNf6qM=.bb7b8abb-fc13-4057-8559-cc9a92a56393@github.com> Message-ID: On Mon, 28 Jul 2025 13:09:42 GMT, Fei Yang wrote: >> src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 116: >> >>> 114: if (code->is_nmethod()) { >>> 115: assert(dest != nullptr, "Sanity"); >>> 116: MacroAssembler::pd_patch_instruction_size(call_addr, dest); >> >> `dest` here is the reloc call desitnation, ie. the dest stored in the stub, and it should be able to reach anywhere in the address space. >> The patch here should patch the `auipc + jalr` to the address of this stub, rather than `dest`? > > Yes, the `dest` param here holds the address of this stub. > In `CallRelocation::fix_relocation_after_move`, we first get the `callee` address by calling `pd_call_destination` which delegates work to `NativeCall::reloc_destination` for a NativeCall. And we have modified `NativeCall::reloc_destination` to return the stub address in this PR at the same time. So `callee` will hold the stub address. Immediatedly after that, `callee` is passed to `pd_set_call_destination` which delegates work to `NativeCall::reloc_set_destination`. Make sense? > > > void CallRelocation::fix_relocation_after_move(const CodeBuffer* src, CodeBuffer* dest) { > // Usually a self-relative reference to an external routine. > // On some platforms, the reference is absolute (not self-relative). > // The enhanced use of pd_call_destination sorts this all out. > address orig_addr = old_addr_for(addr(), src, dest); > address callee = pd_call_destination(orig_addr); <=========== callee is stub address > // Reassert the callee address, this time in the new copy of the code. 
> pd_set_call_destination(callee); <=========== callee passed as param > } There is another call of `Relocation::pd_set_call_destination(address x)` from `CallRelocation::set_destination(address x)`, not sure if this `x` passed in from set_destination is also the stub addr? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26495#discussion_r2236501965 From mhaessig at openjdk.org Mon Jul 28 13:40:05 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 28 Jul 2025 13:40:05 GMT Subject: RFR: 8324751: C2 SuperWord: Aliasing Analysis runtime check In-Reply-To: <2r_uZpbgYMypJFPIgI_t3NuTg1_S40mbeGrsvqi7IvE=.c4771662-0c55-4c02-96bb-99d5cfdb3697@github.com> References: <2r_uZpbgYMypJFPIgI_t3NuTg1_S40mbeGrsvqi7IvE=.c4771662-0c55-4c02-96bb-99d5cfdb3697@github.com> Message-ID: <1jRz0k69pSoITg9V5DiMv7pYixyilnf68vOkwEm-34w=.b982d419-795e-445f-92f7-a3abfc76fa37@github.com> On Thu, 27 Mar 2025 13:00:20 GMT, Emanuel Peter wrote: > This is a big patch, but about 3.5k lines are tests. And a large part of the VM changes is comments / proofs. > > I am adding a dynamic (runtime) aliasing check to the auto-vectorizer (SuperWord). We use the infrastructure from https://github.com/openjdk/jdk/pull/22016: > - Use the auto-vectorization `predicate` when available: we speculate that there is no aliasing, else we trap and re-compile without the predicate. > - If the predicate is not available, we use `multiversioning`, i.e. we have a `fast_loop` where there is no aliasing, and hence vectorization. And a `slow_loop` if the check fails, with no vectorization. > > -------------------------- > > **Where to start reviewing** > > - `src/hotspot/share/opto/mempointer.hpp`: > - Read the class comment for `MemPointerRawSummand`. > - Familiarize yourself with the `MemPointer Linearity Corrolary`. We need it for the proofs of the aliasing runtime checks. > > - `src/hotspot/share/opto/vectorization.cpp`: > - Read the explanations and proofs above `VPointer::can_make_speculative_aliasing_check_with`. It explains how the aliasing runtime check works. > > - `src/hotspot/share/opto/vtransform.hpp`: > - Understand the difference between weak and strong edges. > > If you need to see some examples, then look at the tests: > - `test/hotspot/jtreg/compiler/loopopts/superword/TestAliasing.java`: simple array cases. IR rules that check for vectors and in somecases if we used multiversioning. > - `test/micro/org/openjdk/bench/vm/compiler/VectorAliasing.java`: the miro-benchmarks I show below. Simple array cases. > - `test/hotspot/jtreg/compiler/loopopts/superword/TestMemorySegmentAliasing.java`: a bit advanced, but similar cases. > - `test/hotspot/jtreg/compiler/loopopts/superword/TestAliasingFuzzer.java`: very large and rather compliex. Generates random loops, some with and some without aliasing at runtime. IR verification, but mostly currently only for array cases, MemorySegment cases have some issues (see comments). > -------------------------- > > **Details** > > Most fundamentally: > - I had to refactor / extend `MemPointer` so that we have access to `MemPointerRawSummand`s. > - These raw summands us to reconstruct the `VPointer` at any `iv` value with `VPointer::make_pointer_expression(Node* iv_value)`. > - With the raw summands, a pointer may look like this: `p = base + ConvI2L(x + 2) + ConvI2L(y + 2)` > - With "regular" summands, this gets simplified to `p = base + 4L +ConvI2L(x) + ConvI2L(y)` > - For aliasing analysis (adjacency and overlap), the "regu... 
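As an illustration of the aliasing situation this runtime check targets (a hypothetical loop, not one of the tests listed above): whether `a` and `b` refer to the same array is only known at runtime, so vectorization needs either the speculative predicate (assume no aliasing, trap and recompile otherwise) or multiversioning into a fast_loop and a slow_loop.

    class AliasingExample {
        // Hypothetical shape, not taken from the PR's tests: if a and b are the same
        // array and 0 < invar < the vector length, iteration i writes a[i + invar],
        // which a later iteration reads back as b[i + invar], i.e. a loop-carried
        // dependence. Whether that overlap exists is only known at runtime, which is
        // what the predicate or the fast_loop/slow_loop multiversioning decides.
        static void scaleInto(int[] a, int[] b, int invar) {
            for (int i = 0; i < b.length && i + invar < a.length; i++) {
                a[i + invar] = b[i] * 2;
            }
        }
    }

If the check passes at runtime the vectorized fast_loop executes; otherwise the scalar slow_loop (or a deoptimization via the predicate) preserves correctness, matching the scheme described in the quoted description.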
Thank you, @eme64, for this good work! I left some comments below. src/hotspot/share/opto/mempointer.cpp line 732: > 730: // -> Unknown if overlap at runtime -> return false > 731: bool MemPointer::always_overlaps_with(const MemPointer& other) const { > 732: const MemPointerAliasing aliasing = get_aliasing_with(other NOT_PRODUCT( COMMA _trace )); Suggestion: const MemPointerAliasing aliasing = get_aliasing_with(other NOT_PRODUCT(COMMA _trace)); Nit: You used this without spaces already above. src/hotspot/share/opto/mempointer.hpp line 411: > 409: // Both p and mp have a linear form for v in r: > 410: // p(v) = p(lo) - lo * scale_v + iv * scale_v (Corrolary P) > 411: // mp(v) = mp(lo) - lo * scale_v + iv * scale_v (Corrolary MP) Where does `iv`come from? Is `v==iv`? src/hotspot/share/opto/mempointer.hpp line 444: > 442: // = summand_rest + scale_v * (v0 + stride_v) + con > 443: // = summand_rest + scale_v * v0 + scale_v * stride_v * con > 444: // = summand_rest + scale_v * v0 + scale_v * stride_v * con Suggestion: // = summand_rest + scale_v * v0 + scale_v * stride_v + con // = summand_rest + scale_v * v0 + scale_v * stride_v + con These ought to be plusses. src/hotspot/share/opto/mempointer.hpp line 663: > 661: }; > 662: > 663: // The MemPointerSummand is designed to allow the simplification of Shouldn't this be `MemPointerRawSummand`? src/hotspot/share/opto/mempointer.hpp line 706: > 704: // Note: we also need to track constants as separate raw summands. For > 705: // this, we say that a raw summand tracks a constant iff _variable == null, > 706: // and we store the constant value in _scaleI. This contradicts the `con2` example above. src/hotspot/share/opto/mempointer.hpp line 731: > 729: } > 730: > 731: bool is_valid() const { return _int_group >= 0; } Why is _int_group not a `uint` if it is always positive or 0? src/hotspot/share/opto/superword.cpp line 836: > 834: > 835: // If we cannot speculate (aliasing analysis runtime checks), we need to respect all edges. > 836: bool with_weak_memory_edges = !_vloop.use_speculative_aliasing_checks(); Edges that always have to be respected are strong edges. So, if we cannot speculate, we only have strong edges. With this comment and understanding, I would write the expression as bool with_weak_memory_edges = _vloop.use_speculative_aliasing_checks(); or bool with_strong_memory_edges = !_vloop.use_speculative_aliasing_checks(); src/hotspot/share/opto/superword.cpp line 878: > 876: > 877: // If we cannot speculate (aliasing analysis runtime checks), we need to respect all edges. > 878: bool with_weak_memory_edges = !_vloop.use_speculative_aliasing_checks(); Same as above. src/hotspot/share/opto/vectorization.hpp line 240: > 238: } > 239: > 240: // But in some cases, we ctrl of n is between the pre and Suggestion: // But in some cases, the ctrl of n is between the pre and Nit: spelling src/hotspot/share/opto/vtransform.hpp line 286: > 284: // dependency chain. Instead, we model the memory edges between all memory nodes, which > 285: // could be quadratic in the worst case. For vectorization, we must essencially reorder the > 286: // instructions in the graph. For this we must model all memory dependencies. Suggestion: // The C2 IR Node memory edges essentially define a linear order of all memory operations // (only Loads with the same memory input can be executed in an arbitrary order). This is // efficient, because it means every Load and Store has exactly one input memory edge, // which keeps the memory edge count linear. 
This is approach is too restrictive for // vectorization, for example, we could never vectorize stores, since they are all in a // dependency chain. Instead, we model the memory edges between all memory nodes, which // could be quadratic in the worst case. For vectorization, we must essentially reorder the // instructions in the graph. For this we must model all memory dependencies. Spelling test/hotspot/jtreg/compiler/loopopts/superword/TestAliasingFuzzer.java line 176: > 174: long t0 = System.nanoTime(); > 175: // Add a java source file. > 176: comp.addJavaSourceCode("p.xyz.InnerTest", generate(comp)); Nit: perhaps a package related to the test might be nicer in the logs. Like `compiler.loopopts.superword.templated.AliasingFuzzer` test/hotspot/jtreg/compiler/loopopts/superword/TestAliasingFuzzer.java line 270: > 268: // > 269: // The idea is that invarRest is always close to zero, with some small range [-err .. err]. > 270: // The invar variables for invarRest must be in the range [-1, 1, 1], so that we can Suggestion: // The invar variables for invarRest must be in the range [-1, 0, 1], so that we can ------------- Changes requested by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/24278#pullrequestreview-3061496983 PR Review Comment: https://git.openjdk.org/jdk/pull/24278#discussion_r2235976063 PR Review Comment: https://git.openjdk.org/jdk/pull/24278#discussion_r2235538590 PR Review Comment: https://git.openjdk.org/jdk/pull/24278#discussion_r2235767065 PR Review Comment: https://git.openjdk.org/jdk/pull/24278#discussion_r2235785157 PR Review Comment: https://git.openjdk.org/jdk/pull/24278#discussion_r2235881210 PR Review Comment: https://git.openjdk.org/jdk/pull/24278#discussion_r2235887862 PR Review Comment: https://git.openjdk.org/jdk/pull/24278#discussion_r2236162728 PR Review Comment: https://git.openjdk.org/jdk/pull/24278#discussion_r2236175460 PR Review Comment: https://git.openjdk.org/jdk/pull/24278#discussion_r2236187788 PR Review Comment: https://git.openjdk.org/jdk/pull/24278#discussion_r2236082499 PR Review Comment: https://git.openjdk.org/jdk/pull/24278#discussion_r2236409739 PR Review Comment: https://git.openjdk.org/jdk/pull/24278#discussion_r2236425351 From fyang at openjdk.org Mon Jul 28 13:47:56 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 28 Jul 2025 13:47:56 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v3] In-Reply-To: References: <_66H8SCUT2si17IPmfPJ7g3tj9QW6m3Q7YOGMOsBKI8=.b1a9d462-df1f-4bb3-b4d0-07deadf28749@github.com> <9jNyVzYYc0O9nHTJYWpgbqkcV-mMzyKUvdgzpGNf6qM=.bb7b8abb-fc13-4057-8559-cc9a92a56393@github.com> Message-ID: On Mon, 28 Jul 2025 13:34:07 GMT, Hamlin Li wrote: >> Yes, the `dest` param here holds the address of this stub. >> In `CallRelocation::fix_relocation_after_move`, we first get the `callee` address by calling `pd_call_destination` which delegates work to `NativeCall::reloc_destination` for a NativeCall. And we have modified `NativeCall::reloc_destination` to return the stub address in this PR at the same time. So `callee` will hold the stub address. Immediatedly after that, `callee` is passed to `pd_set_call_destination` which delegates work to `NativeCall::reloc_set_destination`. Make sense? >> >> >> void CallRelocation::fix_relocation_after_move(const CodeBuffer* src, CodeBuffer* dest) { >> // Usually a self-relative reference to an external routine. >> // On some platforms, the reference is absolute (not self-relative). 
>> // The enhanced use of pd_call_destination sorts this all out. >> address orig_addr = old_addr_for(addr(), src, dest); >> address callee = pd_call_destination(orig_addr); <=========== callee is stub address >> // Reassert the callee address, this time in the new copy of the code. >> pd_set_call_destination(callee); <=========== callee passed as param >> } > > There is another call of `Relocation::pd_set_call_destination(address x)` from `CallRelocation::set_destination(address x)`, not sure if this `x` passed in from set_destination is also the stub addr? Sure, I'll take a look. Thanks. BTW: I noticed one use of `set_destination` in file ./code/aotCodeCache.cpp. Seems AOT related. Maybe you can help confirm that while you are working on enabling AOT support on riscv64? ./code/aotCodeCache.cpp: ((CallRelocation*)iter.reloc())->set_destination(dest); ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26495#discussion_r2236539009 From mli at openjdk.org Mon Jul 28 13:59:54 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 28 Jul 2025 13:59:54 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v3] In-Reply-To: References: <_66H8SCUT2si17IPmfPJ7g3tj9QW6m3Q7YOGMOsBKI8=.b1a9d462-df1f-4bb3-b4d0-07deadf28749@github.com> <9jNyVzYYc0O9nHTJYWpgbqkcV-mMzyKUvdgzpGNf6qM=.bb7b8abb-fc13-4057-8559-cc9a92a56393@github.com> Message-ID: <2_gymTx2AihIEDELcrUUe9jIOq6QlKxZtl7rUQaCTgg=.d3246389-7bf8-4206-b40d-0fbe47436f37@github.com> On Mon, 28 Jul 2025 13:42:57 GMT, Fei Yang wrote: >> There is another call of `Relocation::pd_set_call_destination(address x)` from `CallRelocation::set_destination(address x)`, not sure if this `x` passed in from set_destination is also the stub addr? > > Sure, I'll take a look. Thanks. > BTW: I noticed one use of `set_destination` in file ./code/aotCodeCache.cpp. Seems AOT related. > Maybe you can help confirm that while you are working on enabling AOT support on riscv64? > > > ./code/aotCodeCache.cpp: ((CallRelocation*)iter.reloc())->set_destination(dest); One suggestion: maybe we can add an additional parameter here for `NativeCall::reloc_set_destination(address dest)` like `is_stub_addr`, so on the path we're sure `dest` is the stub address rather than reloc call destination, we can pass true for `is_stub_addr`, and in other paths we can pass false. And we can also add an assert in the `is_stub_addr == true` path, like `assert(CodeCache::contains(x), "must");`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26495#discussion_r2236601333 From chagedorn at openjdk.org Mon Jul 28 14:06:57 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 28 Jul 2025 14:06:57 GMT Subject: RFR: 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v4] In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 12:35:15 GMT, Beno?t Maillard wrote: >> This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist` (again). This is similar to [JDK-8361700](https://bugs.openjdk.org/browse/JDK-8361700?filter=-1). >> >> The optimization in question is the removal of redundant `ConvX2Y->ConvY2X->ConvX2Y` sequences (where `X` and `Y` are primitive number types), which get replaced by a single `ConvX2Y` as an identity optimization. 
This missing optimization was originally reported only for `ConvD2LNode`, but it turns out that other conversion nodes have analog optimization patterns. After manual inspection of identity optimizations in `convertnode.cpp`, I was able to reproduce missing optimizations for the following conversion sequences: >> - `ConvD2L->ConvL2D->ConvD2L` >> - `ConvF2I->ConvI2F->ConvF2I` >> - `ConvF2L->ConvL2F->ConvF2L` >> - `ConvI2F->ConvF2I->ConvI2F` >> >> Similar optimization patterns exist for additional conversion nodes. However, it is not clear if these nodes are subject to the same missed optimization issue. Further investigation may be needed, as I was unable to reproduce such cases with simple tests. >> >> This is again a case where an optimization depends on the input of its input. Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8359603) >> - [x] tier1-3, plus some internal testing >> - [x] Added test from the fuzzer, and tests for other sequences (manually derived from the original one) >> >> Thank you for reviewing! > > Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: > > 8359603: Add note Update looks good, thanks! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26368#pullrequestreview-3062939535 From fyang at openjdk.org Mon Jul 28 14:15:56 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 28 Jul 2025 14:15:56 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v3] In-Reply-To: <2_gymTx2AihIEDELcrUUe9jIOq6QlKxZtl7rUQaCTgg=.d3246389-7bf8-4206-b40d-0fbe47436f37@github.com> References: <_66H8SCUT2si17IPmfPJ7g3tj9QW6m3Q7YOGMOsBKI8=.b1a9d462-df1f-4bb3-b4d0-07deadf28749@github.com> <9jNyVzYYc0O9nHTJYWpgbqkcV-mMzyKUvdgzpGNf6qM=.bb7b8abb-fc13-4057-8559-cc9a92a56393@github.com> <2_gymTx2AihIEDELcrUUe9jIOq6QlKxZtl7rUQaCTgg=.d3246389-7bf8-4206-b40d-0fbe47436f37@github.com> Message-ID: On Mon, 28 Jul 2025 13:57:08 GMT, Hamlin Li wrote: >> Sure, I'll take a look. Thanks. >> BTW: I noticed one use of `set_destination` in file ./code/aotCodeCache.cpp. Seems AOT related. >> Maybe you can help confirm that while you are working on enabling AOT support on riscv64? >> >> >> ./code/aotCodeCache.cpp: ((CallRelocation*)iter.reloc())->set_destination(dest); > > One suggestion: maybe we can add an additional parameter here for `NativeCall::reloc_set_destination(address dest)` like `is_stub_addr`, so on the path we're sure `dest` is the stub address rather than reloc call destination, we can pass true for `is_stub_addr`, and in other paths we can pass false. > And we can also add an assert in the `is_stub_addr == true` path, like `assert(CodeCache::contains(x), "must");`. I intend to think that by design we only want `NativeCall::reloc_destination` and `NativeCall::reloc_set_destination` to get and set stub address. There are other functions like `NativeCall::destination`, `NativeCall::set_destination` and `NativeCall::set_destination_mt_safe` which are supposed to deal with the real call target. So I am going to check the `CallRelocation::set_destination` cases in shared code as you mentioned. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26495#discussion_r2236663847 From kvn at openjdk.org Mon Jul 28 14:42:03 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 28 Jul 2025 14:42:03 GMT Subject: RFR: 8352112: [ubsan] hotspot/share/code/relocInfo.cpp:130:37: runtime error: applying non-zero offset 18446744073709551614 to null pointer [v2] In-Reply-To: References: Message-ID: <7ja3_KpFi1NPc4EPFpMk3af7RgGtQYu0zGmrv05lCj0=.a7fb616e-8923-47f1-b869-3bb064d27f58@github.com> On Mon, 28 Jul 2025 12:39:41 GMT, Boris Ulasevich wrote: >> src/hotspot/share/code/codeBlob.cpp line 156: >> >>> 154: } else { >>> 155: // We need unique and valid not null address >>> 156: assert(_mutable_data = blob_end(), "sanity"); >> >> Did this mean to assign the `_mutable_data`? I think it should be `==`. > > Right. This typo was fixed in https://github.com/openjdk/jdk/pull/26175 > For now I do not see how this change is related to [JDK-8361382: NMT corruption](https://bugs.openjdk.org/browse/JDK-8361382) Yes, it was fixed. And they were harmless. I think @jdksjolen linked it because of the call stack. But I also don't know how it could cause the NMT bug. @jdksjolen did you try to undo these changes and reproduce https://bugs.openjdk.org/browse/JDK-8361382 ? V [libjvm.dylib+0xbf1c8c] VMError::report(outputStream*, bool)+0xa9c (mallocHeader.inline.hpp:107) V [libjvm.dylib+0xbf5d25] VMError::report_and_die(int, char const*, char const*, __va_list_tag*, Thread*, unsigned char*, void const*, void const*, char const*, int, unsigned long)+0x575 V [libjvm.dylib+0x404e20] DebuggingContext::~DebuggingContext()+0x0 V [libjvm.dylib+0x8f770f] MallocHeader* MallocHeader::resolve_checked_impl(void*)+0x15f V [libjvm.dylib+0x8f720c] MallocTracker::record_free_block(void*)+0xc V [libjvm.dylib+0x9a719a] os::free(void*)+0xea V [libjvm.dylib+0x388fb4] CodeBlob::purge()+0x44 V [libjvm.dylib+0x978e98] nmethod::purge(bool)+0x308 V [libjvm.dylib+0x380439] ClassUnloadingContext::purge_nmethods()+0x69 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24102#discussion_r2236768883 From jkarthikeyan at openjdk.org Mon Jul 28 17:17:05 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Mon, 28 Jul 2025 17:17:05 GMT Subject: RFR: 8362979: C2 fails with unexpected node in SuperWord truncation: CmpLTMask, RoundF In-Reply-To: References: Message-ID: <3idG5MS8JXypvRF8lzh878fj5MDTg9u4qDVeW4QmZ0Q=.00e61365-e0f4-4232-a8a2-384dc5d1b452@github.com> On Mon, 28 Jul 2025 02:34:25 GMT, Jasmine Karthikeyan wrote: > Hi all, > This is a fix for a debug assert failure in SuperWord truncation for `CmpLTMask` and `RoundF` nodes, as discovered by CTW in the linked JBS report. I've added the nodes to the switch, and added reduced test cases. I've made a similar fix for `RoundD` nodes as well. Thanks! Thanks for the reviews! Hmm, looks like the bot didn't catch it. Retrying...
------------- PR Comment: https://git.openjdk.org/jdk/pull/26494#issuecomment-3127376221 PR Comment: https://git.openjdk.org/jdk/pull/26494#issuecomment-3128202904 From jkarthikeyan at openjdk.org Mon Jul 28 17:17:06 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Mon, 28 Jul 2025 17:17:06 GMT Subject: Integrated: 8362979: C2 fails with unexpected node in SuperWord truncation: CmpLTMask, RoundF In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 02:34:25 GMT, Jasmine Karthikeyan wrote: > Hi all, > This is a fix for a debug assert failure in SuperWord truncation for `CmpLTMask` and `RoundF` nodes, as discovered by CTW in the linked JBS report. I've added the nodes to the switch, and added reduced test cases. I've made a similar fix for `RoundD` nodes as well. Thanks! This pull request has now been integrated. Changeset: ea0b49c3 Author: Jasmine Karthikeyan URL: https://git.openjdk.org/jdk/commit/ea0b49c36db7dce508aec7e72e73c7274d65bc15 Stats: 57 lines in 3 files changed: 57 ins; 0 del; 0 mod 8362979: C2 fails with unexpected node in SuperWord truncation: CmpLTMask, RoundF Reviewed-by: chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/26494 From fyang at openjdk.org Tue Jul 29 01:55:10 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 29 Jul 2025 01:55:10 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v4] In-Reply-To: References: Message-ID: <_gBsBRuwEYg_z4Fy1eTSI0ATppAF85SFB9fNkXSwe8E=.7c88580a-bb52-4066-977e-29c84b8b8b56@github.com> > Hi, please consider this small change. > > JDK-8343430 removed the old trampoline call on RISC-V. And the new solution (reloc call) loads a target address from stub section and do an indirect call. This means the stub is always there for a NativeCall. So there's no need to check existence of the stub when doing `CallRelocation::fix_relocation_after_move` [1]. > > We can always return the stub address in `NativeCall::reloc_destination` and use that address in `NativeCall::reloc_set_destination`. This helps simplify the code and saves one `MacroAssembler::target_addr_for_insn` call > and one `trampoline_stub_Relocation::get_trampoline_for` call in these two functions respectively. 
> > Testing on linux-riscv64: > - [x] tier1-tier3 (release build) > - [x] hs:tier1-hs:tier3 (fastdebug build) > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.cpp#L404-L406 Fei Yang has updated the pull request incrementally with one additional commit since the last revision: Assert ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26495/files - new: https://git.openjdk.org/jdk/pull/26495/files/01949774..8fa3d037 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26495&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26495&range=02-03 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26495.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26495/head:pull/26495 PR: https://git.openjdk.org/jdk/pull/26495 From fjiang at openjdk.org Tue Jul 29 01:55:10 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Tue, 29 Jul 2025 01:55:10 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v4] In-Reply-To: <_gBsBRuwEYg_z4Fy1eTSI0ATppAF85SFB9fNkXSwe8E=.7c88580a-bb52-4066-977e-29c84b8b8b56@github.com> References: <_gBsBRuwEYg_z4Fy1eTSI0ATppAF85SFB9fNkXSwe8E=.7c88580a-bb52-4066-977e-29c84b8b8b56@github.com> Message-ID: On Tue, 29 Jul 2025 01:24:02 GMT, Fei Yang wrote: >> Hi, please consider this small change. >> >> JDK-8343430 removed the old trampoline call on RISC-V. And the new solution (reloc call) loads a target address from stub section and do an indirect call. This means the stub is always there for a NativeCall. So there's no need to check existence of the stub when doing `CallRelocation::fix_relocation_after_move` [1]. >> >> We can always return the stub address in `NativeCall::reloc_destination` and use that address in `NativeCall::reloc_set_destination`. This helps simplify the code and saves one `MacroAssembler::target_addr_for_insn` call >> and one `trampoline_stub_Relocation::get_trampoline_for` call in these two functions respectively. >> >> Testing on linux-riscv64: >> - [x] tier1-tier3 (release build) >> - [x] hs:tier1-hs:tier3 (fastdebug build) >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.cpp#L404-L406 > > Fei Yang has updated the pull request incrementally with one additional commit since the last revision: > > Assert Looks good, thanks! ------------- Marked as reviewed by fjiang (Committer). PR Review: https://git.openjdk.org/jdk/pull/26495#pullrequestreview-3064923750 From fyang at openjdk.org Tue Jul 29 01:55:11 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 29 Jul 2025 01:55:11 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v4] In-Reply-To: References: <_66H8SCUT2si17IPmfPJ7g3tj9QW6m3Q7YOGMOsBKI8=.b1a9d462-df1f-4bb3-b4d0-07deadf28749@github.com> <9jNyVzYYc0O9nHTJYWpgbqkcV-mMzyKUvdgzpGNf6qM=.bb7b8abb-fc13-4057-8559-cc9a92a56393@github.com> <2_gymTx2AihIEDELcrUUe9jIOq6QlKxZtl7rUQaCTgg=.d3246389-7bf8-4206-b40d-0fbe47436f37@github.com> Message-ID: <_6L3mwdTfWCFoSohvs3SejxeeRIxW-XZQquD-I_Nay8=.951d51f8-0e98-4047-a24d-156c6b9d2e18@github.com> On Mon, 28 Jul 2025 14:12:32 GMT, Fei Yang wrote: >> One suggestion: maybe we can add an additional parameter here for `NativeCall::reloc_set_destination(address dest)` like `is_stub_addr`, so on the path we're sure `dest` is the stub address rather than reloc call destination, we can pass true for `is_stub_addr`, and in other paths we can pass false. 
>> And we can also add an assert in the `is_stub_addr == true` path, like `assert(CodeCache::contains(x), "must");`. > > I intend to think that by design we only want `NativeCall::reloc_destination` and `NativeCall::reloc_set_destination` to get and set stub address. There are other functions like `NativeCall::destination`, `NativeCall::set_destination` and `NativeCall::set_destination_mt_safe` which are supposed to deal with the real call target. So I am going to check the `CallRelocation::set_destination` cases in shared code as you mentioned. Here is what I find. There are only two use cases of `CallRelocation::set_destination(address x)` in hotspot shared code. One is the AOT case [1] and the other is `CallRelocation::set_value(address x)` [2]. And `CallRelocation::set_value(address x)` is never used. Given that we don't have AOT support for riscv64 in JDK upstream yet, I think we are safe. And I have added assertion in `CallRelocation::set_value(address x)` about `dest` param to make sure it's the same as the stub address. I see `hs:tier1` still test good with fastdebug build. Also I think it's better to investigate and fix the AOT use case in Leyden premain as you are currently working on if it turns out to be an issue. Does that work for you? ./relocInfo.hpp: void set_value(address x) override { set_destination(x); } [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/aotCodeCache.cpp#L1111 [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.hpp#L955 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26495#discussion_r2238085011 From duke at openjdk.org Tue Jul 29 07:11:55 2025 From: duke at openjdk.org (duke) Date: Tue, 29 Jul 2025 07:11:55 GMT Subject: RFR: 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v4] In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 12:35:15 GMT, Beno?t Maillard wrote: >> This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist` (again). This is similar to [JDK-8361700](https://bugs.openjdk.org/browse/JDK-8361700?filter=-1). >> >> The optimization in question is the removal of redundant `ConvX2Y->ConvY2X->ConvX2Y` sequences (where `X` and `Y` are primitive number types), which get replaced by a single `ConvX2Y` as an identity optimization. This missing optimization was originally reported only for `ConvD2LNode`, but it turns out that other conversion nodes have analog optimization patterns. After manual inspection of identity optimizations in `convertnode.cpp`, I was able to reproduce missing optimizations for the following conversion sequences: >> - `ConvD2L->ConvL2D->ConvD2L` >> - `ConvF2I->ConvI2F->ConvF2I` >> - `ConvF2L->ConvL2F->ConvF2L` >> - `ConvI2F->ConvF2I->ConvI2F` >> >> Similar optimization patterns exist for additional conversion nodes. However, it is not clear if these nodes are subject to the same missed optimization issue. Further investigation may be needed, as I was unable to reproduce such cases with simple tests. >> >> This is again a case where an optimization depends on the input of its input. Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. 
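For a concrete picture of one such sequence (a hypothetical Java shape, not the reduced test added by this PR): casting double to long, back to double, and to long again produces a ConvD2L feeding a ConvL2D feeding another ConvD2L, which is exactly the chain the identity optimization is meant to collapse once IGVN revisits the outer node.

    class ConvChainExample {
        // Hypothetical shape only: the three casts map to ConvD2L -> ConvL2D -> ConvD2L
        // in C2's ideal graph; by the identity described above, the outer two
        // conversions are redundant and the expression can fold to the inner (long) cast.
        static long roundTrip(double d) {
            long l = (long) d;        // ConvD2L
            double back = (double) l; // ConvL2D
            return (long) back;       // outer ConvD2L, candidate for the identity
        }
    }

When the input of the input changes and no notification reaches the outer conversion, IGVN never re-examines it and the fold is missed, which is the scenario addressed next.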
The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8359603) >> - [x] tier1-3, plus some internal testing >> - [x] Added test from the fuzzer, and tests for other sequences (manually derived from the original one) >> >> Thank you for reviewing! > > Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: > > 8359603: Add note @benoitmaillard Your change (at version 2e5efdcc2ce20f8f311371388bcfe9614435816b) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26368#issuecomment-3131004577 From bmaillard at openjdk.org Tue Jul 29 07:36:05 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Tue, 29 Jul 2025 07:36:05 GMT Subject: Integrated: 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 12:25:33 GMT, Beno?t Maillard wrote: > This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist` (again). This is similar to [JDK-8361700](https://bugs.openjdk.org/browse/JDK-8361700?filter=-1). > > The optimization in question is the removal of redundant `ConvX2Y->ConvY2X->ConvX2Y` sequences (where `X` and `Y` are primitive number types), which get replaced by a single `ConvX2Y` as an identity optimization. This missing optimization was originally reported only for `ConvD2LNode`, but it turns out that other conversion nodes have analog optimization patterns. After manual inspection of identity optimizations in `convertnode.cpp`, I was able to reproduce missing optimizations for the following conversion sequences: > - `ConvD2L->ConvL2D->ConvD2L` > - `ConvF2I->ConvI2F->ConvF2I` > - `ConvF2L->ConvL2F->ConvF2L` > - `ConvI2F->ConvF2I->ConvI2F` > > Similar optimization patterns exist for additional conversion nodes. However, it is not clear if these nodes are subject to the same missed optimization issue. Further investigation may be needed, as I was unable to reproduce such cases with simple tests. > > This is again a case where an optimization depends on the input of its input. Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8359603) > - [x] tier1-3, plus some internal testing > - [x] Added test from the fuzzer, and tests for other sequences (manually derived from the original one) > > Thank you for reviewing! This pull request has now been integrated. 
Changeset: 28297411 Author: Benoît Maillard Committer: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/28297411b19551dd8585165200f5f8158f3d5bb3 Stats: 125 lines in 2 files changed: 125 ins; 0 del; 0 mod 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist Reviewed-by: chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/26368 From duke at openjdk.org Tue Jul 29 07:42:57 2025 From: duke at openjdk.org (erifan) Date: Tue, 29 Jul 2025 07:42:57 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v7] In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 11:55:52 GMT, Christian Hagedorn wrote: > I'll give this a spin in our testing - will report the results back later. Ok, thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3131099095 From chagedorn at openjdk.org Tue Jul 29 08:24:58 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 29 Jul 2025 08:24:58 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v7] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 09:27:19 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is >> relatively smaller than that of `fromLong`. So this patch does the conversion for these cases. >> >> The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during the intrinsification process if `MaskAll` or `Replicate` is supported. >> >> Since this optimization requires the input long value of `VectorMask.fromLong` to be a specific compile-time constant, and such expressions are usually hoisted out of the loop, we can't see a noticeable performance change. >> >> This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing `VectorLongToMask` will be affected, like >> >> VectorMaskToLong (VectorLongToMask x) => x >> >> >> Hence, this patch also added the following optimizations: >> >> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> >> VectorMaskCast (VectorMaskCast x) => x >> >> And we can see a noticeable performance improvement with the above optimizations for floating-point types.
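To make the Java-level shape concrete (a hypothetical snippet, not one of the micro-benchmarks below; it assumes `--add-modules jdk.incubator.vector`): `IntVector.SPECIES_256` has 8 int lanes, so a constant whose low 8 bits are all set selects every lane, and the two calls below are expected to produce the same mask, with `maskAll` being the cheaper form that the conversion above rewrites to.

    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorMask;
    import jdk.incubator.vector.VectorSpecies;

    class FromLongAllTrueExample {
        static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256; // 8 int lanes

        static VectorMask<Integer> viaFromLong() {
            return VectorMask.fromLong(SPECIES, 0xFFL); // low 8 bits set -> all lanes true
        }

        static VectorMask<Integer> viaMaskAll() {
            return SPECIES.maskAll(true); // equivalent mask, cheaper to construct
        }
    }

Because the mask bits are a compile-time constant here, C2 can rewrite the first form into the second, which is the conversion whose effect the benchmarks below try to measure.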
>> >> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Error After Error Uplift >> microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 >> microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 >> microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 >> microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 >> microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 >> microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 >> microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 >> microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 >> >> >> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: >> >> Benchm... > > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Move the assertion to the beginning of the code block Testing is currently slow - still running but I report what I have so far. There is one test failure on `linux-aarch64` and `macosx-aarch64` with the new test `VectorMaskToLongTest.java`: Additional flags: `-ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation` (probably only related to `-XX:-TieredComilation`, maybe we have not enough profiling and need to increase the warm-up but that's just a wild guess without looking at the test)
Log Compilations (5) of Failed Methods (5) -------------------------------------- 1) Compilation of "public static void compiler.vectorapi.VectorMaskToLongTest.testFromLongToLongByte()": > Phase "PrintIdeal": AFTER: print_ideal 0 Root === 0 203 230 269 270 [[ 0 1 3 225 198 189 23 186 167 28 44 124 104 54 277 295 ]] inner 1 Con === 0 [[ ]] #top 3 Start === 3 0 [[ 3 5 6 7 8 9 ]] #{0:control, 1:abIO, 2:memory, 3:rawptr:BotPTR, 4:return_address} 5 Parm === 3 [[ 168 ]] Control !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:-1 (line 182) 6 Parm === 3 [[ 168 ]] I_O !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:-1 (line 182) 7 Parm === 3 [[ 168 279 286 ]] Memory Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:-1 (line 182) 8 Parm === 3 [[ 270 269 255 233 230 199 203 168 226 ]] FramePtr !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:-1 (line 182) 9 Parm === 3 [[ 270 269 199 226 ]] ReturnAdr !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:-1 (line 182) 23 ConP === 0 [[ 255 168 ]] #jdk/incubator/vector/ByteVector$ByteSpecies (jdk/incubator/vector/VectorSpecies):exact * Oop:jdk/incubator/vector/ByteVector$ByteSpecies (jdk/incubator/vector/VectorSpecies):exact * 28 ConI === 0 [[ 168 ]] #int:1 44 ConI === 0 [[ 168 ]] #int:16 54 ConL === 0 [[ 255 233 168 168 199 ]] #long:65534 104 ConP === 0 [[ 168 ]] #java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * Oop:java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * 124 ConP === 0 [[ 168 ]] #java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * Oop:java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * 167 ConP === 0 [[ 168 ]] #jdk/incubator/vector/VectorMask$$Lambda+0x0000060001060748 (jdk/internal/vm/vector/VectorSupport$FromBitsCoercedOperation):exact * Oop:jdk/incubator/vector/VectorMask$$Lambda+0x0000060001060748 (jdk/internal/vm/vector/VectorSupport$FromBitsCoercedOperation):exact * 168 CallStaticJava === 5 6 7 8 1 (104 124 44 54 1 28 23 167 54 1 1 1 1 1 1 1 ) [[ 169 181 182 173 ]] # Static jdk.internal.vm.vector.VectorSupport::fromBitsCoerced jdk/internal/vm/vector/VectorSupport$VectorPayload * ( java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact *, java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact *, int, long, half, int, jdk/internal/vm/vector/VectorSupport$VectorSpecies *, java/lang/Object * ) VectorMask::fromLong @ bci:39 (line 243) 
VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongTo LongByte @ bci:22 (line 183) 169 Proj === 168 [[ 175 ]] #0 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 173 Proj === 168 [[ 226 190 278 278 222 ]] #5 Oop:jdk/internal/vm/vector/VectorSupport$VectorPayload * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 175 Catch === 169 181 [[ 176 177 ]] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 176 CatchProj === 175 [[ 192 ]] #0 at bci -1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 177 CatchProj === 175 [[ 251 180 ]] #1 at bci -1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 180 CreateEx === 177 181 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 181 Proj === 168 [[ 233 199 226 252 175 180 ]] #1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 182 Proj === 168 [[ 233 226 199 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 186 ConP === 0 [[ 296 ]] #precise jdk/incubator/vector/VectorMask: 0x0000000148423080:Constant:exact * Klass:precise jdk/incubator/vector/VectorMask: 0x0000000148423080:Constant:exact * 189 ConP === 0 [[ 190 199 ]] #null 190 CmpP === _ 173 189 [[ 191 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 191 Bool === _ 190 [[ 192 ]] [ne] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 192 If === 176 191 [[ 193 194 ]] P=0.999999, C=-1.000000 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 193 IfFalse === 192 [[ 199 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 194 IfTrue === 192 [[ 299 279 ]] #1 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 198 ConI === 0 [[ 199 ]] #int:-12 199 CallStaticJava === 193 181 182 8 9 (198 54 1 1 1 1 1 1 1 189 ) [[ 200 ]] # Static uncommon_trap(reason='null_check' action='make_not_entrant' debug_id='0') void ( int ) C=0.000100 VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 200 Proj === 199 [[ 203 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 203 Halt === 200 1 1 8 1 [[ 0 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 222 CheckCastPP === 300 173 [[ 233 ]] #jdk/incubator/vector/VectorMask:NotNull * Oop:jdk/incubator/vector/VectorMask:NotNull * !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 
(line 183) 225 ConI === 0 [[ 226 ]] #int:-34 226 CallStaticJava === 301 181 182 8 9 (225 1 1 1 1 1 1 1 1 173 ) [[ 227 ]] # Static uncommon_trap(reason='class_check' action='maybe_recompile' debug_id='0') void ( int ) C=0.000100 VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 227 Proj === 226 [[ 230 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 230 Halt === 227 1 1 8 1 [[ 0 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 233 CallDynamicJava === 300 181 182 8 1 (222 54 1 1 1 ) [[ 234 246 247 238 ]] # Dynamic jdk.incubator.vector.VectorMask::toLong long/half ( jdk/incubator/vector/VectorMask:NotNull * ) VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) 234 Proj === 233 [[ 240 ]] #0 !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) 238 Proj === 233 [[ 255 ]] #5 !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) 240 Catch === 234 246 [[ 241 242 ]] !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) 241 CatchProj === 240 [[ 255 ]] #0 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) 242 CatchProj === 240 [[ 251 245 ]] #1 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) 245 CreateEx === 242 246 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) 246 Proj === 233 [[ 255 252 240 245 ]] #1 !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) 247 Proj === 233 [[ 255 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) 251 Region === 251 177 242 263 [[ 251 252 253 254 270 ]] !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) 252 Phi === 251 181 246 257 [[ 270 ]] #abIO !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) 253 Phi === 251 182 247 258 [[ 270 ]] #memory Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) 254 Phi === 251 180 245 266 [[ 270 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) 255 CallStaticJava === 241 246 247 8 1 (23 54 1 238 1 1 1 1 1 ) [[ 256 257 258 ]] # Static compiler.vectorapi.VectorMaskToLongTest::verifyMaskToLong void ( java/lang/Object *, long, half, long, half ) VectorMaskToLongTest::testFromLongToLongByte @ bci:34 (line 184) !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:34 (line 184) 256 Proj === 255 [[ 261 ]] #0 !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:34 (line 184) 257 Proj === 255 [[ 269 261 252 266 ]] #1 !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:34 (line 184) 258 Proj === 255 [[ 269 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:34 (line 184) 261 Catch === 256 257 [[ 262 263 ]] !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:34 (line 184) 262 CatchProj === 261 [[ 269 ]] #0 at bci -1 !jvms: 
VectorMaskToLongTest::testFromLongToLongByte @ bci:34 (line 184) 263 CatchProj === 261 [[ 251 266 ]] #1 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:34 (line 184) 266 CreateEx === 263 257 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:34 (line 184) 269 Return === 262 257 258 8 9 [[ 0 ]] 270 Rethrow === 251 252 253 8 9 exception 254 [[ 0 ]] 277 ConL === 0 [[ 278 ]] #long:8 278 AddP === _ 173 173 277 [[ 279 ]] Oop:jdk/internal/vm/vector/VectorSupport$VectorPayload+8 * [narrowklass] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 279 LoadNKlass === 194 7 278 [[ 280 ]] @java/lang/Object+8 * [narrowklass], idx=5; #narrowklass: jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 280 DecodeNKlass === _ 279 [[ 285 285 ]] #jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * Klass:jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 285 AddP === _ 280 280 295 [[ 286 ]] Klass:jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8+80 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 286 LoadKlass === _ 7 285 [[ 296 ]] @java/lang/Object: 0x000000014800a050+any *, idx=6; # * Klass: * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 295 ConL === 0 [[ 285 ]] #long:80 296 CmpP === _ 286 186 [[ 298 ]] !orig=[289] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 298 Bool === _ 296 [[ 299 ]] [ne] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 299 If === 194 298 [[ 300 301 ]] P=0.170000, C=-1.000000 !orig=[291] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 300 IfFalse === 299 [[ 222 233 ]] #0 !orig=[292],[273],[220] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 301 IfTrue === 299 [[ 226 ]] #1 !orig=[293],[274],[221] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 2) Compilation of "public static void compiler.vectorapi.VectorMaskToLongTest.testFromLongToLongDouble()": > Phase "PrintIdeal": AFTER: print_ideal 0 Root === 0 203 230 269 270 [[ 0 1 3 225 198 189 23 186 167 28 44 124 104 54 277 295 ]] inner 1 Con === 0 [[ ]] #top 3 Start === 3 0 [[ 3 5 6 7 8 9 ]] #{0:control, 1:abIO, 2:memory, 3:rawptr:BotPTR, 4:return_address} 5 Parm === 3 [[ 168 ]] Control !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:-1 (line 252) 6 Parm === 3 [[ 168 ]] I_O !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:-1 (line 252) 7 Parm === 3 [[ 168 279 286 ]] Memory Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:-1 (line 252) 8 Parm === 3 [[ 270 269 255 233 230 199 203 168 226 ]] FramePtr !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:-1 (line 252) 9 Parm === 3 [[ 270 269 199 226 ]] ReturnAdr 
!jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:-1 (line 252) 23 ConP === 0 [[ 255 168 ]] #jdk/incubator/vector/DoubleVector$DoubleSpecies (jdk/incubator/vector/VectorSpecies):exact * Oop:jdk/incubator/vector/DoubleVector$DoubleSpecies (jdk/incubator/vector/VectorSpecies):exact * 28 ConI === 0 [[ 168 ]] #int:1 44 ConI === 0 [[ 168 ]] #int:2 54 ConL === 0 [[ 255 233 168 168 199 ]] #long:2 104 ConP === 0 [[ 168 ]] #java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * Oop:java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * 124 ConP === 0 [[ 168 ]] #java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * Oop:java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * 167 ConP === 0 [[ 168 ]] #jdk/incubator/vector/VectorMask$$Lambda+0x0000060001060748 (jdk/internal/vm/vector/VectorSupport$FromBitsCoercedOperation):exact * Oop:jdk/incubator/vector/VectorMask$$Lambda+0x0000060001060748 (jdk/internal/vm/vector/VectorSupport$FromBitsCoercedOperation):exact * 168 CallStaticJava === 5 6 7 8 1 (104 124 44 54 1 28 23 167 54 1 1 1 1 1 1 1 ) [[ 169 181 182 173 ]] # Static jdk.internal.vm.vector.VectorSupport::fromBitsCoerced jdk/internal/vm/vector/VectorSupport$VectorPayload * ( java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact *, java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact *, int, long, half, int, jdk/internal/vm/vector/VectorSupport$VectorSpecies *, java/lang/Object * ) VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLong ToLongDouble @ bci:22 (line 253) 169 Proj === 168 [[ 175 ]] #0 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 173 Proj === 168 [[ 226 190 278 278 222 ]] #5 Oop:jdk/internal/vm/vector/VectorSupport$VectorPayload * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 175 Catch === 169 181 [[ 176 177 ]] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 176 CatchProj === 175 [[ 192 ]] #0 at bci -1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 177 CatchProj === 175 [[ 251 180 ]] #1 at bci -1 !jvms: VectorMask::fromLong @ bci:39 (line 243) 
VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 180 CreateEx === 177 181 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 181 Proj === 168 [[ 233 199 226 252 175 180 ]] #1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 182 Proj === 168 [[ 233 226 199 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 186 ConP === 0 [[ 296 ]] #precise jdk/incubator/vector/VectorMask: 0x00000001484db080:Constant:exact * Klass:precise jdk/incubator/vector/VectorMask: 0x00000001484db080:Constant:exact * 189 ConP === 0 [[ 190 199 ]] #null 190 CmpP === _ 173 189 [[ 191 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 191 Bool === _ 190 [[ 192 ]] [ne] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 192 If === 176 191 [[ 193 194 ]] P=0.999999, C=-1.000000 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 193 IfFalse === 192 [[ 199 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 194 IfTrue === 192 [[ 299 279 ]] #1 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 198 ConI === 0 [[ 199 ]] #int:-12 199 CallStaticJava === 193 181 182 8 9 (198 54 1 1 1 1 1 1 1 189 ) [[ 200 ]] # Static uncommon_trap(reason='null_check' action='make_not_entrant' debug_id='0') void ( int ) C=0.000100 VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 200 Proj === 199 [[ 203 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 203 Halt === 200 1 1 8 1 [[ 0 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 222 CheckCastPP === 300 173 [[ 233 ]] #jdk/incubator/vector/VectorMask:NotNull * Oop:jdk/incubator/vector/VectorMask:NotNull * !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 225 ConI === 0 [[ 226 ]] #int:-34 226 CallStaticJava === 301 181 182 8 9 (225 1 1 1 1 1 1 1 1 173 ) [[ 227 ]] # Static uncommon_trap(reason='class_check' action='maybe_recompile' debug_id='0') void ( int ) C=0.000100 VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 227 Proj === 226 [[ 230 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 230 Halt === 227 1 1 8 1 [[ 0 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 233 CallDynamicJava === 300 181 182 8 1 (222 54 1 1 1 ) [[ 234 246 247 238 ]] # Dynamic jdk.incubator.vector.VectorMask::toLong long/half ( 
jdk/incubator/vector/VectorMask:NotNull * ) VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) 234 Proj === 233 [[ 240 ]] #0 !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) 238 Proj === 233 [[ 255 ]] #5 !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) 240 Catch === 234 246 [[ 241 242 ]] !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) 241 CatchProj === 240 [[ 255 ]] #0 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) 242 CatchProj === 240 [[ 251 245 ]] #1 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) 245 CreateEx === 242 246 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) 246 Proj === 233 [[ 255 252 240 245 ]] #1 !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) 247 Proj === 233 [[ 255 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) 251 Region === 251 177 242 263 [[ 251 252 253 254 270 ]] !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) 252 Phi === 251 181 246 257 [[ 270 ]] #abIO !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) 253 Phi === 251 182 247 258 [[ 270 ]] #memory Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) 254 Phi === 251 180 245 266 [[ 270 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) 255 CallStaticJava === 241 246 247 8 1 (23 54 1 238 1 1 1 1 1 ) [[ 256 257 258 ]] # Static compiler.vectorapi.VectorMaskToLongTest::verifyMaskToLong void ( java/lang/Object *, long, half, long, half ) VectorMaskToLongTest::testFromLongToLongDouble @ bci:34 (line 254) !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:34 (line 254) 256 Proj === 255 [[ 261 ]] #0 !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:34 (line 254) 257 Proj === 255 [[ 269 261 252 266 ]] #1 !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:34 (line 254) 258 Proj === 255 [[ 269 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:34 (line 254) 261 Catch === 256 257 [[ 262 263 ]] !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:34 (line 254) 262 CatchProj === 261 [[ 269 ]] #0 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:34 (line 254) 263 CatchProj === 261 [[ 251 266 ]] #1 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:34 (line 254) 266 CreateEx === 263 257 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:34 (line 254) 269 Return === 262 257 258 8 9 [[ 0 ]] 270 Rethrow === 251 252 253 8 9 exception 254 [[ 0 ]] 277 ConL === 0 [[ 278 ]] #long:8 278 AddP === _ 173 173 277 [[ 279 ]] Oop:jdk/internal/vm/vector/VectorSupport$VectorPayload+8 * [narrowklass] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 279 LoadNKlass === 194 7 278 [[ 280 ]] @java/lang/Object+8 * [narrowklass], idx=5; 
#narrowklass: jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 280 DecodeNKlass === _ 279 [[ 285 285 ]] #jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * Klass:jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 285 AddP === _ 280 280 295 [[ 286 ]] Klass:jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8+80 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 286 LoadKlass === _ 7 285 [[ 296 ]] @java/lang/Object: 0x000000014800a050+any *, idx=6; # * Klass: * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 295 ConL === 0 [[ 285 ]] #long:80 296 CmpP === _ 286 186 [[ 298 ]] !orig=[289] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 298 Bool === _ 296 [[ 299 ]] [ne] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 299 If === 194 298 [[ 300 301 ]] P=0.170000, C=-1.000000 !orig=[291] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 300 IfFalse === 299 [[ 222 233 ]] #0 !orig=[292],[273],[220] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 301 IfTrue === 299 [[ 226 ]] #1 !orig=[293],[274],[221] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 3) Compilation of "public static void compiler.vectorapi.VectorMaskToLongTest.testFromLongToLongInt()": > Phase "PrintIdeal": AFTER: print_ideal 0 Root === 0 203 230 269 270 [[ 0 1 3 225 198 189 23 186 167 28 44 124 104 54 277 295 ]] inner 1 Con === 0 [[ ]] #top 3 Start === 3 0 [[ 3 5 6 7 8 9 ]] #{0:control, 1:abIO, 2:memory, 3:rawptr:BotPTR, 4:return_address} 5 Parm === 3 [[ 168 ]] Control !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:-1 (line 210) 6 Parm === 3 [[ 168 ]] I_O !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:-1 (line 210) 7 Parm === 3 [[ 168 279 286 ]] Memory Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:-1 (line 210) 8 Parm === 3 [[ 270 269 255 233 230 199 203 168 226 ]] FramePtr !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:-1 (line 210) 9 Parm === 3 [[ 270 269 199 226 ]] ReturnAdr !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:-1 (line 210) 23 ConP === 0 [[ 255 168 ]] #jdk/incubator/vector/IntVector$IntSpecies (jdk/incubator/vector/VectorSpecies):exact * Oop:jdk/incubator/vector/IntVector$IntSpecies (jdk/incubator/vector/VectorSpecies):exact * 28 ConI === 0 [[ 168 ]] #int:1 44 ConI === 0 [[ 168 ]] #int:4 54 ConL === 0 [[ 255 233 168 168 199 ]] #long:14 104 ConP === 0 [[ 168 ]] #java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * Oop:java/lang/Class 
(java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * 124 ConP === 0 [[ 168 ]] #java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * Oop:java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * 167 ConP === 0 [[ 168 ]] #jdk/incubator/vector/VectorMask$$Lambda+0x0000060001060748 (jdk/internal/vm/vector/VectorSupport$FromBitsCoercedOperation):exact * Oop:jdk/incubator/vector/VectorMask$$Lambda+0x0000060001060748 (jdk/internal/vm/vector/VectorSupport$FromBitsCoercedOperation):exact * 168 CallStaticJava === 5 6 7 8 1 (104 124 44 54 1 28 23 167 54 1 1 1 1 1 1 1 ) [[ 169 181 182 173 ]] # Static jdk.internal.vm.vector.VectorSupport::fromBitsCoerced jdk/internal/vm/vector/VectorSupport$VectorPayload * ( java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact *, java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact *, int, long, half, int, jdk/internal/vm/vector/VectorSupport$VectorSpecies *, java/lang/Object * ) VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToL ongInt @ bci:22 (line 211) 169 Proj === 168 [[ 175 ]] #0 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 173 Proj === 168 [[ 226 190 278 278 222 ]] #5 Oop:jdk/internal/vm/vector/VectorSupport$VectorPayload * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 175 Catch === 169 181 [[ 176 177 ]] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 176 CatchProj === 175 [[ 192 ]] #0 at bci -1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 177 CatchProj === 175 [[ 251 180 ]] #1 at bci -1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 180 CreateEx === 177 181 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 181 Proj === 168 [[ 233 199 226 252 175 180 ]] #1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 182 Proj === 168 [[ 233 226 199 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 186 ConP === 0 [[ 296 ]] #precise jdk/incubator/vector/VectorMask: 
0x000000015829b080:Constant:exact * Klass:precise jdk/incubator/vector/VectorMask: 0x000000015829b080:Constant:exact * 189 ConP === 0 [[ 190 199 ]] #null 190 CmpP === _ 173 189 [[ 191 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 191 Bool === _ 190 [[ 192 ]] [ne] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 192 If === 176 191 [[ 193 194 ]] P=0.999999, C=-1.000000 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 193 IfFalse === 192 [[ 199 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 194 IfTrue === 192 [[ 299 279 ]] #1 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 198 ConI === 0 [[ 199 ]] #int:-12 199 CallStaticJava === 193 181 182 8 9 (198 54 1 1 1 1 1 1 1 189 ) [[ 200 ]] # Static uncommon_trap(reason='null_check' action='make_not_entrant' debug_id='0') void ( int ) C=0.000100 VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 200 Proj === 199 [[ 203 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 203 Halt === 200 1 1 8 1 [[ 0 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 222 CheckCastPP === 300 173 [[ 233 ]] #jdk/incubator/vector/VectorMask:NotNull * Oop:jdk/incubator/vector/VectorMask:NotNull * !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 225 ConI === 0 [[ 226 ]] #int:-34 226 CallStaticJava === 301 181 182 8 9 (225 1 1 1 1 1 1 1 1 173 ) [[ 227 ]] # Static uncommon_trap(reason='class_check' action='maybe_recompile' debug_id='0') void ( int ) C=0.000100 VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 227 Proj === 226 [[ 230 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 230 Halt === 227 1 1 8 1 [[ 0 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 233 CallDynamicJava === 300 181 182 8 1 (222 54 1 1 1 ) [[ 234 246 247 238 ]] # Dynamic jdk.incubator.vector.VectorMask::toLong long/half ( jdk/incubator/vector/VectorMask:NotNull * ) VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) 234 Proj === 233 [[ 240 ]] #0 !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) 238 Proj === 233 [[ 255 ]] #5 !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) 240 Catch === 234 246 [[ 241 242 ]] !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) 241 CatchProj === 240 [[ 255 ]] #0 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) 242 CatchProj === 240 [[ 251 245 ]] #1 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) 245 CreateEx === 242 246 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * 
Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) 246 Proj === 233 [[ 255 252 240 245 ]] #1 !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) 247 Proj === 233 [[ 255 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) 251 Region === 251 177 242 263 [[ 251 252 253 254 270 ]] !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) 252 Phi === 251 181 246 257 [[ 270 ]] #abIO !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) 253 Phi === 251 182 247 258 [[ 270 ]] #memory Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) 254 Phi === 251 180 245 266 [[ 270 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) 255 CallStaticJava === 241 246 247 8 1 (23 54 1 238 1 1 1 1 1 ) [[ 256 257 258 ]] # Static compiler.vectorapi.VectorMaskToLongTest::verifyMaskToLong void ( java/lang/Object *, long, half, long, half ) VectorMaskToLongTest::testFromLongToLongInt @ bci:34 (line 212) !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:34 (line 212) 256 Proj === 255 [[ 261 ]] #0 !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:34 (line 212) 257 Proj === 255 [[ 269 261 252 266 ]] #1 !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:34 (line 212) 258 Proj === 255 [[ 269 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:34 (line 212) 261 Catch === 256 257 [[ 262 263 ]] !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:34 (line 212) 262 CatchProj === 261 [[ 269 ]] #0 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:34 (line 212) 263 CatchProj === 261 [[ 251 266 ]] #1 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:34 (line 212) 266 CreateEx === 263 257 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:34 (line 212) 269 Return === 262 257 258 8 9 [[ 0 ]] 270 Rethrow === 251 252 253 8 9 exception 254 [[ 0 ]] 277 ConL === 0 [[ 278 ]] #long:8 278 AddP === _ 173 173 277 [[ 279 ]] Oop:jdk/internal/vm/vector/VectorSupport$VectorPayload+8 * [narrowklass] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 279 LoadNKlass === 194 7 278 [[ 280 ]] @java/lang/Object+8 * [narrowklass], idx=5; #narrowklass: jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 280 DecodeNKlass === _ 279 [[ 285 285 ]] #jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * Klass:jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 285 AddP === _ 280 280 295 [[ 286 ]] Klass:jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8+80 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 286 LoadKlass === _ 7 285 [[ 296 ]] @java/lang/Object: 0x000000014800a050+any *, idx=6; # * Klass: * !jvms: VectorMask::fromLong @ bci:39 (line 243) 
VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 295 ConL === 0 [[ 285 ]] #long:80 296 CmpP === _ 286 186 [[ 298 ]] !orig=[289] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 298 Bool === _ 296 [[ 299 ]] [ne] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 299 If === 194 298 [[ 300 301 ]] P=0.170000, C=-1.000000 !orig=[291] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 300 IfFalse === 299 [[ 222 233 ]] #0 !orig=[292],[273],[220] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 301 IfTrue === 299 [[ 226 ]] #1 !orig=[293],[274],[221] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 4) Compilation of "public static void compiler.vectorapi.VectorMaskToLongTest.testFromLongToLongLong()": > Phase "PrintIdeal": AFTER: print_ideal 0 Root === 0 203 230 269 270 [[ 0 1 3 225 198 189 23 186 167 28 44 124 104 54 277 295 ]] inner 1 Con === 0 [[ ]] #top 3 Start === 3 0 [[ 3 5 6 7 8 9 ]] #{0:control, 1:abIO, 2:memory, 3:rawptr:BotPTR, 4:return_address} 5 Parm === 3 [[ 168 ]] Control !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:-1 (line 224) 6 Parm === 3 [[ 168 ]] I_O !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:-1 (line 224) 7 Parm === 3 [[ 168 279 286 ]] Memory Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:-1 (line 224) 8 Parm === 3 [[ 270 269 255 233 230 199 203 168 226 ]] FramePtr !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:-1 (line 224) 9 Parm === 3 [[ 270 269 199 226 ]] ReturnAdr !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:-1 (line 224) 23 ConP === 0 [[ 255 168 ]] #jdk/incubator/vector/LongVector$LongSpecies (jdk/incubator/vector/VectorSpecies):exact * Oop:jdk/incubator/vector/LongVector$LongSpecies (jdk/incubator/vector/VectorSpecies):exact * 28 ConI === 0 [[ 168 ]] #int:1 44 ConI === 0 [[ 168 ]] #int:2 54 ConL === 0 [[ 255 233 168 168 199 ]] #long:2 104 ConP === 0 [[ 168 ]] #java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * Oop:java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * 124 ConP === 0 [[ 168 ]] #java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * Oop:java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * 167 ConP === 0 [[ 168 ]] #jdk/incubator/vector/VectorMask$$Lambda+0x0000060001060748 (jdk/internal/vm/vector/VectorSupport$FromBitsCoercedOperation):exact * Oop:jdk/incubator/vector/VectorMask$$Lambda+0x0000060001060748 (jdk/internal/vm/vector/VectorSupport$FromBitsCoercedOperation):exact * 168 CallStaticJava === 5 
6 7 8 1 (104 124 44 54 1 28 23 167 54 1 1 1 1 1 1 1 ) [[ 169 181 182 173 ]] # Static jdk.internal.vm.vector.VectorSupport::fromBitsCoerced jdk/internal/vm/vector/VectorSupport$VectorPayload * ( java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact *, java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact *, int, long, half, int, jdk/internal/vm/vector/VectorSupport$VectorSpecies *, java/lang/Object * ) VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongTo LongLong @ bci:22 (line 225) 169 Proj === 168 [[ 175 ]] #0 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 173 Proj === 168 [[ 226 190 278 278 222 ]] #5 Oop:jdk/internal/vm/vector/VectorSupport$VectorPayload * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 175 Catch === 169 181 [[ 176 177 ]] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 176 CatchProj === 175 [[ 192 ]] #0 at bci -1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 177 CatchProj === 175 [[ 251 180 ]] #1 at bci -1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 180 CreateEx === 177 181 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 181 Proj === 168 [[ 233 199 226 252 175 180 ]] #1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 182 Proj === 168 [[ 233 226 199 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 186 ConP === 0 [[ 296 ]] #precise jdk/incubator/vector/VectorMask: 0x0000000158383080:Constant:exact * Klass:precise jdk/incubator/vector/VectorMask: 0x0000000158383080:Constant:exact * 189 ConP === 0 [[ 190 199 ]] #null 190 CmpP === _ 173 189 [[ 191 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 191 Bool === _ 190 [[ 192 ]] [ne] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 192 If === 176 191 [[ 193 194 ]] P=0.999999, C=-1.000000 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 193 IfFalse === 192 [[ 199 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 194 IfTrue === 192 [[ 299 279 ]] #1 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 198 ConI === 0 [[ 199 ]] #int:-12 199 CallStaticJava === 193 181 182 8 9 (198 54 1 1 1 1 1 1 1 189 ) [[ 200 ]] # Static 
uncommon_trap(reason='null_check' action='make_not_entrant' debug_id='0') void ( int ) C=0.000100 VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 200 Proj === 199 [[ 203 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 203 Halt === 200 1 1 8 1 [[ 0 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 222 CheckCastPP === 300 173 [[ 233 ]] #jdk/incubator/vector/VectorMask:NotNull * Oop:jdk/incubator/vector/VectorMask:NotNull * !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 225 ConI === 0 [[ 226 ]] #int:-34 226 CallStaticJava === 301 181 182 8 9 (225 1 1 1 1 1 1 1 1 173 ) [[ 227 ]] # Static uncommon_trap(reason='class_check' action='maybe_recompile' debug_id='0') void ( int ) C=0.000100 VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 227 Proj === 226 [[ 230 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 230 Halt === 227 1 1 8 1 [[ 0 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 233 CallDynamicJava === 300 181 182 8 1 (222 54 1 1 1 ) [[ 234 246 247 238 ]] # Dynamic jdk.incubator.vector.VectorMask::toLong long/half ( jdk/incubator/vector/VectorMask:NotNull * ) VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) 234 Proj === 233 [[ 240 ]] #0 !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) 238 Proj === 233 [[ 255 ]] #5 !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) 240 Catch === 234 246 [[ 241 242 ]] !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) 241 CatchProj === 240 [[ 255 ]] #0 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) 242 CatchProj === 240 [[ 251 245 ]] #1 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) 245 CreateEx === 242 246 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) 246 Proj === 233 [[ 255 252 240 245 ]] #1 !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) 247 Proj === 233 [[ 255 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) 251 Region === 251 177 242 263 [[ 251 252 253 254 270 ]] !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) 252 Phi === 251 181 246 257 [[ 270 ]] #abIO !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) 253 Phi === 251 182 247 258 [[ 270 ]] #memory Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) 254 Phi === 251 180 245 266 [[ 270 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) 255 CallStaticJava === 241 
246 247 8 1 (23 54 1 238 1 1 1 1 1 ) [[ 256 257 258 ]] # Static compiler.vectorapi.VectorMaskToLongTest::verifyMaskToLong void ( java/lang/Object *, long, half, long, half ) VectorMaskToLongTest::testFromLongToLongLong @ bci:34 (line 226) !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:34 (line 226) 256 Proj === 255 [[ 261 ]] #0 !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:34 (line 226) 257 Proj === 255 [[ 269 261 252 266 ]] #1 !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:34 (line 226) 258 Proj === 255 [[ 269 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:34 (line 226) 261 Catch === 256 257 [[ 262 263 ]] !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:34 (line 226) 262 CatchProj === 261 [[ 269 ]] #0 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:34 (line 226) 263 CatchProj === 261 [[ 251 266 ]] #1 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:34 (line 226) 266 CreateEx === 263 257 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:34 (line 226) 269 Return === 262 257 258 8 9 [[ 0 ]] 270 Rethrow === 251 252 253 8 9 exception 254 [[ 0 ]] 277 ConL === 0 [[ 278 ]] #long:8 278 AddP === _ 173 173 277 [[ 279 ]] Oop:jdk/internal/vm/vector/VectorSupport$VectorPayload+8 * [narrowklass] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 279 LoadNKlass === 194 7 278 [[ 280 ]] @java/lang/Object+8 * [narrowklass], idx=5; #narrowklass: jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 280 DecodeNKlass === _ 279 [[ 285 285 ]] #jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * Klass:jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 285 AddP === _ 280 280 295 [[ 286 ]] Klass:jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8+80 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 286 LoadKlass === _ 7 285 [[ 296 ]] @java/lang/Object: 0x000000014800a050+any *, idx=6; # * Klass: * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 295 ConL === 0 [[ 285 ]] #long:80 296 CmpP === _ 286 186 [[ 298 ]] !orig=[289] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 298 Bool === _ 296 [[ 299 ]] [ne] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 299 If === 194 298 [[ 300 301 ]] P=0.170000, C=-1.000000 !orig=[291] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 300 IfFalse === 299 [[ 222 233 ]] #0 !orig=[292],[273],[220] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 301 IfTrue === 299 [[ 226 ]] #1 !orig=[293],[274],[221] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 5) Compilation of "public static void 
compiler.vectorapi.VectorMaskToLongTest.testFromLongToLongShort()": > Phase "PrintIdeal": AFTER: print_ideal 0 Root === 0 203 230 269 270 [[ 0 1 3 225 198 189 23 186 167 28 44 124 104 54 277 295 ]] inner 1 Con === 0 [[ ]] #top 3 Start === 3 0 [[ 3 5 6 7 8 9 ]] #{0:control, 1:abIO, 2:memory, 3:rawptr:BotPTR, 4:return_address} 5 Parm === 3 [[ 168 ]] Control !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:-1 (line 196) 6 Parm === 3 [[ 168 ]] I_O !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:-1 (line 196) 7 Parm === 3 [[ 168 279 286 ]] Memory Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:-1 (line 196) 8 Parm === 3 [[ 270 269 255 233 230 199 203 168 226 ]] FramePtr !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:-1 (line 196) 9 Parm === 3 [[ 270 269 199 226 ]] ReturnAdr !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:-1 (line 196) 23 ConP === 0 [[ 255 168 ]] #jdk/incubator/vector/ShortVector$ShortSpecies (jdk/incubator/vector/VectorSpecies):exact * Oop:jdk/incubator/vector/ShortVector$ShortSpecies (jdk/incubator/vector/VectorSpecies):exact * 28 ConI === 0 [[ 168 ]] #int:1 44 ConI === 0 [[ 168 ]] #int:8 54 ConL === 0 [[ 255 233 168 168 199 ]] #long:254 104 ConP === 0 [[ 168 ]] #java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * Oop:java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * 124 ConP === 0 [[ 168 ]] #java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * Oop:java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * 167 ConP === 0 [[ 168 ]] #jdk/incubator/vector/VectorMask$$Lambda+0x0000060001060748 (jdk/internal/vm/vector/VectorSupport$FromBitsCoercedOperation):exact * Oop:jdk/incubator/vector/VectorMask$$Lambda+0x0000060001060748 (jdk/internal/vm/vector/VectorSupport$FromBitsCoercedOperation):exact * 168 CallStaticJava === 5 6 7 8 1 (104 124 44 54 1 28 23 167 54 1 1 1 1 1 1 1 ) [[ 169 181 182 173 ]] # Static jdk.internal.vm.vector.VectorSupport::fromBitsCoerced jdk/internal/vm/vector/VectorSupport$VectorPayload * ( java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact *, java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact *, int, long, half, int, jdk/internal/vm/vector/VectorSupport$VectorSpecies *, java/lang/Object * ) VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) !jvms: VectorMask::fromLong @ bci:39 (line 243) 
VectorMaskToLongTest::testFromLongT oLongShort @ bci:22 (line 197) 169 Proj === 168 [[ 175 ]] #0 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 173 Proj === 168 [[ 226 190 278 278 222 ]] #5 Oop:jdk/internal/vm/vector/VectorSupport$VectorPayload * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 175 Catch === 169 181 [[ 176 177 ]] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 176 CatchProj === 175 [[ 192 ]] #0 at bci -1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 177 CatchProj === 175 [[ 251 180 ]] #1 at bci -1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 180 CreateEx === 177 181 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 181 Proj === 168 [[ 233 199 226 252 175 180 ]] #1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 182 Proj === 168 [[ 233 226 199 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 186 ConP === 0 [[ 296 ]] #precise jdk/incubator/vector/VectorMask: 0x00000001484bb080:Constant:exact * Klass:precise jdk/incubator/vector/VectorMask: 0x00000001484bb080:Constant:exact * 189 ConP === 0 [[ 190 199 ]] #null 190 CmpP === _ 173 189 [[ 191 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 191 Bool === _ 190 [[ 192 ]] [ne] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 192 If === 176 191 [[ 193 194 ]] P=0.999999, C=-1.000000 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 193 IfFalse === 192 [[ 199 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 194 IfTrue === 192 [[ 299 279 ]] #1 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 198 ConI === 0 [[ 199 ]] #int:-12 199 CallStaticJava === 193 181 182 8 9 (198 54 1 1 1 1 1 1 1 189 ) [[ 200 ]] # Static uncommon_trap(reason='null_check' action='make_not_entrant' debug_id='0') void ( int ) C=0.000100 VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 200 Proj === 199 [[ 203 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 203 Halt === 200 1 1 8 1 [[ 0 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 222 CheckCastPP === 300 173 [[ 233 ]] #jdk/incubator/vector/VectorMask:NotNull * Oop:jdk/incubator/vector/VectorMask:NotNull * !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 225 ConI === 0 [[ 226 ]] #int:-34 226 CallStaticJava === 301 181 182 8 9 (225 1 1 1 1 1 
1 1 1 173 ) [[ 227 ]] # Static uncommon_trap(reason='class_check' action='maybe_recompile' debug_id='0') void ( int ) C=0.000100 VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 227 Proj === 226 [[ 230 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 230 Halt === 227 1 1 8 1 [[ 0 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 233 CallDynamicJava === 300 181 182 8 1 (222 54 1 1 1 ) [[ 234 246 247 238 ]] # Dynamic jdk.incubator.vector.VectorMask::toLong long/half ( jdk/incubator/vector/VectorMask:NotNull * ) VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) 234 Proj === 233 [[ 240 ]] #0 !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) 238 Proj === 233 [[ 255 ]] #5 !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) 240 Catch === 234 246 [[ 241 242 ]] !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) 241 CatchProj === 240 [[ 255 ]] #0 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) 242 CatchProj === 240 [[ 251 245 ]] #1 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) 245 CreateEx === 242 246 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) 246 Proj === 233 [[ 255 252 240 245 ]] #1 !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) 247 Proj === 233 [[ 255 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) 251 Region === 251 177 242 263 [[ 251 252 253 254 270 ]] !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) 252 Phi === 251 181 246 257 [[ 270 ]] #abIO !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) 253 Phi === 251 182 247 258 [[ 270 ]] #memory Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) 254 Phi === 251 180 245 266 [[ 270 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) 255 CallStaticJava === 241 246 247 8 1 (23 54 1 238 1 1 1 1 1 ) [[ 256 257 258 ]] # Static compiler.vectorapi.VectorMaskToLongTest::verifyMaskToLong void ( java/lang/Object *, long, half, long, half ) VectorMaskToLongTest::testFromLongToLongShort @ bci:34 (line 198) !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:34 (line 198) 256 Proj === 255 [[ 261 ]] #0 !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:34 (line 198) 257 Proj === 255 [[ 269 261 252 266 ]] #1 !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:34 (line 198) 258 Proj === 255 [[ 269 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:34 (line 198) 261 Catch === 256 257 [[ 262 263 ]] !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:34 (line 198) 262 CatchProj === 261 [[ 269 ]] #0 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:34 (line 198) 263 CatchProj === 
261 [[ 251 266 ]] #1 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:34 (line 198) 266 CreateEx === 263 257 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:34 (line 198) 269 Return === 262 257 258 8 9 [[ 0 ]] 270 Rethrow === 251 252 253 8 9 exception 254 [[ 0 ]] 277 ConL === 0 [[ 278 ]] #long:8 278 AddP === _ 173 173 277 [[ 279 ]] Oop:jdk/internal/vm/vector/VectorSupport$VectorPayload+8 * [narrowklass] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 279 LoadNKlass === 194 7 278 [[ 280 ]] @java/lang/Object+8 * [narrowklass], idx=5; #narrowklass: jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 280 DecodeNKlass === _ 279 [[ 285 285 ]] #jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * Klass:jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 285 AddP === _ 280 280 295 [[ 286 ]] Klass:jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8+80 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 286 LoadKlass === _ 7 285 [[ 296 ]] @java/lang/Object: 0x000000014800a050+any *, idx=6; # * Klass: * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 295 ConL === 0 [[ 285 ]] #long:80 296 CmpP === _ 286 186 [[ 298 ]] !orig=[289] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 298 Bool === _ 296 [[ 299 ]] [ne] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 299 If === 194 298 [[ 300 301 ]] P=0.170000, C=-1.000000 !orig=[291] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 300 IfFalse === 299 [[ 222 233 ]] #0 !orig=[292],[273],[220] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 301 IfTrue === 299 [[ 226 ]] #1 !orig=[293],[274],[221] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) [...] One or more @IR rules failed: Failed IR Rules (5) of Methods (5) ---------------------------------- 1) Method "public static void compiler.vectorapi.VectorMaskToLongTest.testFromLongToLongByte()" - [Failed IR rules: 1]: * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#VECTOR_LONG_TO_MASK#_", "= 0", "_#VECTOR_MASK_TO_LONG#_", "= 1"}, failOn={}, applyIfPlatformOr={}, applyIfPlatform={}, applyIfOr={}, applyIfCPUFeatureAnd={"asimd", "true", "sve", "false"}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})" > Phase "PrintIdeal": - counts: Graph contains wrong number of nodes: * Constraint 2: "(\\d+(\\s){2}(VectorMaskToLong.*)+(\\s){2}===.*)" - Failed comparison: [found] 0 = 1 [given] - No nodes matched! 
2) Method "public static void compiler.vectorapi.VectorMaskToLongTest.testFromLongToLongDouble()" - [Failed IR rules: 1]: * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#VECTOR_LONG_TO_MASK#_", "= 0", "_#VECTOR_MASK_TO_LONG#_", "= 1"}, failOn={}, applyIfPlatformOr={}, applyIfPlatform={}, applyIfOr={}, applyIfCPUFeatureAnd={"asimd", "true", "sve", "false"}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})" > Phase "PrintIdeal": - counts: Graph contains wrong number of nodes: * Constraint 2: "(\\d+(\\s){2}(VectorMaskToLong.*)+(\\s){2}===.*)" - Failed comparison: [found] 0 = 1 [given] - No nodes matched! 3) Method "public static void compiler.vectorapi.VectorMaskToLongTest.testFromLongToLongInt()" - [Failed IR rules: 1]: * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#VECTOR_LONG_TO_MASK#_", "= 0", "_#VECTOR_MASK_TO_LONG#_", "= 1"}, failOn={}, applyIfPlatformOr={}, applyIfPlatform={}, applyIfOr={}, applyIfCPUFeatureAnd={"asimd", "true", "sve", "false"}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})" > Phase "PrintIdeal": - counts: Graph contains wrong number of nodes: * Constraint 2: "(\\d+(\\s){2}(VectorMaskToLong.*)+(\\s){2}===.*)" - Failed comparison: [found] 0 = 1 [given] - No nodes matched! 4) Method "public static void compiler.vectorapi.VectorMaskToLongTest.testFromLongToLongLong()" - [Failed IR rules: 1]: * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#VECTOR_LONG_TO_MASK#_", "= 0", "_#VECTOR_MASK_TO_LONG#_", "= 1"}, failOn={}, applyIfPlatformOr={}, applyIfPlatform={}, applyIfOr={}, applyIfCPUFeatureAnd={"asimd", "true", "sve", "false"}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})" > Phase "PrintIdeal": - counts: Graph contains wrong number of nodes: * Constraint 2: "(\\d+(\\s){2}(VectorMaskToLong.*)+(\\s){2}===.*)" - Failed comparison: [found] 0 = 1 [given] - No nodes matched! 5) Method "public static void compiler.vectorapi.VectorMaskToLongTest.testFromLongToLongShort()" - [Failed IR rules: 1]: * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#VECTOR_LONG_TO_MASK#_", "= 0", "_#VECTOR_MASK_TO_LONG#_", "= 1"}, failOn={}, applyIfPlatformOr={}, applyIfPlatform={}, applyIfOr={}, applyIfCPUFeatureAnd={"asimd", "true", "sve", "false"}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})" > Phase "PrintIdeal": - counts: Graph contains wrong number of nodes: * Constraint 2: "(\\d+(\\s){2}(VectorMaskToLong.*)+(\\s){2}===.*)" - Failed comparison: [found] 0 = 1 [given] - No nodes matched!
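For readers who don't have the test in front of them, the fromLong/toLong round trip that these failing methods exercise looks roughly like the standalone sketch below. It is illustrative only: the class name, species choice and input value are assumptions, not the actual VectorMaskToLongTest source (compile and run with --add-modules jdk.incubator.vector).

    import jdk.incubator.vector.DoubleVector;
    import jdk.incubator.vector.VectorMask;
    import jdk.incubator.vector.VectorSpecies;

    public class MaskFromLongRoundTrip {
        // 128-bit double species has 2 lanes
        static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_128;

        public static void main(String[] args) {
            long input = 2L;                                    // sets lane 1 only
            VectorMask<Double> m = VectorMask.fromLong(SPECIES, input);
            // toLong() must give back the input bits, truncated to the lane count
            long expected = input & (-1L >>> (64 - SPECIES.length()));
            System.out.println(m.toLong() == expected);         // true
            // an input that sets every lane is equivalent to maskAll(true)
            VectorMask<Double> all = VectorMask.fromLong(SPECIES, -1L);
            System.out.println(all.allTrue());                  // true
        }
    }

The @IR rules quoted above only constrain how C2 compiles this pattern on asimd-without-sve targets (no VectorLongToMask node and exactly one VectorMaskToLong node); the Java-level result is the same either way.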
------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3131236429 From fyang at openjdk.org Tue Jul 29 08:40:35 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 29 Jul 2025 08:40:35 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v5] In-Reply-To: References: Message-ID: <8N7cCiZeQ3dTGViz9_mj55YnfS7qh0T-02f4h0ZVUnM=.0aa9aea5-3db4-4251-a949-bc73f451ca8e@github.com> > Hi, please consider this small change. > > JDK-8343430 removed the old trampoline call on RISC-V. And the new solution (reloc call) loads a target address from stub section and do an indirect call. This means the stub is always there for a NativeCall. So there's no need to check existence of the stub when doing `CallRelocation::fix_relocation_after_move` [1]. > > We can always return the stub address in `NativeCall::reloc_destination` and use that address in `NativeCall::reloc_set_destination`. This helps simplify the code and saves one `MacroAssembler::target_addr_for_insn` call > and one `trampoline_stub_Relocation::get_trampoline_for` call in these two functions respectively. > > Testing on linux-riscv64: > - [x] tier1-tier3 (release build) > - [x] hs:tier1-hs:tier3 (fastdebug build) > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.cpp#L404-L406 Fei Yang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: - Merge remote-tracking branch 'upstream/master' into JDK-8364150 - Assert - Merge remote-tracking branch 'upstream/master' into JDK-8364150 - Comment - 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26495/files - new: https://git.openjdk.org/jdk/pull/26495/files/8fa3d037..32ee090f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26495&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26495&range=03-04 Stats: 341 lines in 19 files changed: 254 ins; 54 del; 33 mod Patch: https://git.openjdk.org/jdk/pull/26495.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26495/head:pull/26495 PR: https://git.openjdk.org/jdk/pull/26495 From duke at openjdk.org Tue Jul 29 10:16:57 2025 From: duke at openjdk.org (erifan) Date: Tue, 29 Jul 2025 10:16:57 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v7] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 09:27:19 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is >> relative smaller than that of `fromLong`. So this patch does the conversion for these cases. >> >> The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. >> >> Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. >> >> This conversion also enables further optimizations that recognize maskAll patterns, see [1]. 
And we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like >> >> VectorMaskToLong (VectorLongToMask x) => x >> >> >> Hence, this patch also added the following optimizations: >> >> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> >> VectorMaskCast (VectorMaskCast x) => x >> >> And we can see noticeable performance improvement with the above optimizations for floating-point types. >> >> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Error After Error Uplift >> microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 >> microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 >> microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 >> microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 >> microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 >> microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 >> microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 >> microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 >> >> >> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: >> >> Benchm... > > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Move the assertion to the beginning of the code block > Testing is currently slow - still running but I report what I have so far. There is one test failure on `linux-aarch64-debug` and `macosx-aarch64-debug` with the new test `VectorMaskToLongTest.java`: > > Additional flags: `-ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation` (probably only related to `-XX:-TieredComilation`, maybe we have not enough profiling and need to increase the warm-up but that's just a wild guess without looking at the test) > > Log Thanks @chhagedorn , and yes you are right. I can reproduce the failure with `-XX:-TieredComilation` on NEON system. And increasing the default warm up value fixes the issue. I'll update the code tomorrow. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3131740432 From mli at openjdk.org Tue Jul 29 10:42:54 2025 From: mli at openjdk.org (Hamlin Li) Date: Tue, 29 Jul 2025 10:42:54 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v5] In-Reply-To: <_6L3mwdTfWCFoSohvs3SejxeeRIxW-XZQquD-I_Nay8=.951d51f8-0e98-4047-a24d-156c6b9d2e18@github.com> References: <_66H8SCUT2si17IPmfPJ7g3tj9QW6m3Q7YOGMOsBKI8=.b1a9d462-df1f-4bb3-b4d0-07deadf28749@github.com> <9jNyVzYYc0O9nHTJYWpgbqkcV-mMzyKUvdgzpGNf6qM=.bb7b8abb-fc13-4057-8559-cc9a92a56393@github.com> <2_gymTx2AihIEDELcrUUe9jIOq6QlKxZtl7rUQaCTgg=.d3246389-7bf8-4206-b40d-0fbe47436f37@github.com> <_6L3mwdTfWCFoSohvs3SejxeeRIxW-XZQquD-I_Nay8=.951d51f8-0e98-4047-a24d-156c6b9d2e18@github.com> Message-ID: On Tue, 29 Jul 2025 00:45:00 GMT, Fei Yang wrote: >> I intend to think that by design we only want `NativeCall::reloc_destination` and `NativeCall::reloc_set_destination` to get and set stub address. 
There are other functions like `NativeCall::destination`, `NativeCall::set_destination` and `NativeCall::set_destination_mt_safe` which are supposed to deal with the real call target. So I am going to check the `CallRelocation::set_destination` cases in shared code as you mentioned. > > Here is what I find. There are only two use cases of `CallRelocation::set_destination(address x)` in hotspot shared code. One is the AOT case [1] and the other is `CallRelocation::set_value(address x)` [2]. And `CallRelocation::set_value(address x)` is never used. Given that we don't have AOT support for riscv64 in JDK upstream yet, I think we are safe. > > And I have added assertion in `CallRelocation::set_value(address x)` about `dest` param to make sure it's the same as the stub address. I see `hs:tier1` still test good with fastdebug build. Also I think it's better to investigate and fix the AOT use case in Leyden premain as you are currently working on if it turns out to be an issue. Does that work for you? > > > ./relocInfo.hpp: void set_value(address x) override { set_destination(x); } > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/aotCodeCache.cpp#L1111 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.hpp#L955 Thanks for investigating. The pr makes sense to me. Can you just help to add an TODO or FIXUP here to state that during the implementation of aot series, we might need to revisit here? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26495#discussion_r2239358922 From fyang at openjdk.org Tue Jul 29 12:08:32 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 29 Jul 2025 12:08:32 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v6] In-Reply-To: References: Message-ID: > Hi, please consider this small change. > > JDK-8343430 removed the old trampoline call on RISC-V. And the new solution (reloc call) loads a target address from stub section and do an indirect call. This means the stub is always there for a NativeCall. So there's no need to check existence of the stub when doing `CallRelocation::fix_relocation_after_move` [1]. > > We can always return the stub address in `NativeCall::reloc_destination` and use that address in `NativeCall::reloc_set_destination`. This helps simplify the code and saves one `MacroAssembler::target_addr_for_insn` call > and one `trampoline_stub_Relocation::get_trampoline_for` call in these two functions respectively. 
> > Testing on linux-riscv64: > - [x] tier1-tier3 (release build) > - [x] hs:tier1-hs:tier3 (fastdebug build) > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.cpp#L404-L406 Fei Yang has updated the pull request incrementally with one additional commit since the last revision: Review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26495/files - new: https://git.openjdk.org/jdk/pull/26495/files/32ee090f..2839d9d2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26495&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26495&range=04-05 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26495.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26495/head:pull/26495 PR: https://git.openjdk.org/jdk/pull/26495 From fyang at openjdk.org Tue Jul 29 12:08:32 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 29 Jul 2025 12:08:32 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v6] In-Reply-To: References: <_66H8SCUT2si17IPmfPJ7g3tj9QW6m3Q7YOGMOsBKI8=.b1a9d462-df1f-4bb3-b4d0-07deadf28749@github.com> <9jNyVzYYc0O9nHTJYWpgbqkcV-mMzyKUvdgzpGNf6qM=.bb7b8abb-fc13-4057-8559-cc9a92a56393@github.com> <2_gymTx2AihIEDELcrUUe9jIOq6QlKxZtl7rUQaCTgg=.d3246389-7bf8-4206-b40d-0fbe47436f37@github.com> <_6L3mwdTfWCFoSohvs3SejxeeRIxW-XZQquD-I_Nay8=.951d51f8-0e98-4047-a24d-156c6b9d2e18@github.com> Message-ID: <25TWyckXRpAhJQR8cTHMUUgUA7zqJvHbWmHNoW7iY5E=.a603813b-4b51-4334-94ac-fb20e11ba47d@github.com> On Tue, 29 Jul 2025 10:40:45 GMT, Hamlin Li wrote: >> Here is what I find. There are only two use cases of `CallRelocation::set_destination(address x)` in hotspot shared code. One is the AOT case [1] and the other is `CallRelocation::set_value(address x)` [2]. And `CallRelocation::set_value(address x)` is never used. Given that we don't have AOT support for riscv64 in JDK upstream yet, I think we are safe. >> >> And I have added assertion in `CallRelocation::set_value(address x)` about `dest` param to make sure it's the same as the stub address. I see `hs:tier1` still test good with fastdebug build. Also I think it's better to investigate and fix the AOT use case in Leyden premain as you are currently working on if it turns out to be an issue. Does that work for you? >> >> >> ./relocInfo.hpp: void set_value(address x) override { set_destination(x); } >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/aotCodeCache.cpp#L1111 >> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.hpp#L955 > > Thanks for investigating. The pr makes sense to me. Can you just help to add an TODO or FIXUP here to state that during the implementation of aot series, we might need to revisit here? > My concern is existing test might or might not catch the issue, currently I'm not sure, so it's better to have a label here to catch the attention at that time, so I will review the related shared code in more details when implement the riscv part. Sure. I have added TODO comments for both functions. Let me know. Thanks. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26495#discussion_r2239584900 From mli at openjdk.org Tue Jul 29 12:31:55 2025 From: mli at openjdk.org (Hamlin Li) Date: Tue, 29 Jul 2025 12:31:55 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v6] In-Reply-To: References: Message-ID: On Tue, 29 Jul 2025 12:08:32 GMT, Fei Yang wrote: >> Hi, please consider this small change. >> >> JDK-8343430 removed the old trampoline call on RISC-V. And the new solution (reloc call) loads a target address from stub section and do an indirect call. This means the stub is always there for a NativeCall. So there's no need to check existence of the stub when doing `CallRelocation::fix_relocation_after_move` [1]. >> >> We can always return the stub address in `NativeCall::reloc_destination` and use that address in `NativeCall::reloc_set_destination`. This helps simplify the code and saves one `MacroAssembler::target_addr_for_insn` call >> and one `trampoline_stub_Relocation::get_trampoline_for` call in these two functions respectively. >> >> Testing on linux-riscv64: >> - [x] tier1-tier3 (release build) >> - [x] hs:tier1-hs:tier3 (fastdebug build) >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.cpp#L404-L406 > > Fei Yang has updated the pull request incrementally with one additional commit since the last revision: > > Review Thank you! Looks good. ------------- Marked as reviewed by mli (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26495#pullrequestreview-3067283595 From duke at openjdk.org Tue Jul 29 16:36:03 2025 From: duke at openjdk.org (duke) Date: Tue, 29 Jul 2025 16:36:03 GMT Subject: Withdrawn: 8355574: Fatal error in abort_verify_int_in_range due to Invalid CastII In-Reply-To: References: Message-ID: On Sat, 17 May 2025 14:54:43 GMT, Quan Anh Mai wrote: > Hi, > > The issue here is that the `CastLLNode` is created before the actual check that ensures the range of the input. This patch fixes it by moving the creation to the correct place, which is under `inline_block`. I also noticed that the code there seems incorrect and confusing. `ArrayCopyNode::get_partial_inline_vector_lane_count` takes the length of the array, not the size in bytes. If you look into the method it will multiply `const_len` with `type2aelementbytes(bt)` to get the size in bytes of the array. In the runtime test, we compare `length << log2(type2bytes(bt))` with `ArrayOperationPartialInlineSize`. This seems confusing, why don't we just compare `length` with `ArrayOperationPartialInlineSize / type2bytes(bt)`, it also unifies the test with the actual cast. > > Please take a look and leave your reviews, thanks a lot. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/25284 From kvn at openjdk.org Tue Jul 29 17:01:56 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 29 Jul 2025 17:01:56 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 07:25:10 GMT, Marc Chevalier wrote: > Some crashes are consequences of earlier misshaped ideal graphs, which could be detected earlier, closer to the source, before the possibly many transformations that lead to the crash. > > Let's verify that the ideal graph is well-shaped earlier then! I propose here such a feature. This runs after IGVN, because at this point, the graph, should be cleaned up for any weirdness happening earlier or during IGVN. 
> > This feature is enabled with the develop flag `VerifyIdealStructuralInvariants`. Open to renaming. No problem with me! This feature is only available in debug builds, and most of the code is even not compiled in product, since it uses some debug-only functions, such as `Node::dump` or `Node::Name`. > > For now, only local checks are implemented: they are checks that only look at a node and its neighborhood, wherever it happens in the graph. Typically: under a `If` node, we have a `IfTrue` and a `IfFalse`. To ease development, each check is implemented in its own class, independently of the others. Nevertheless, one needs to do always the same kind of things: checking there is an output of such type, checking there is N inputs, that the k-th input has such type... To ease writing such checks, in a readable way, and in a less error-prone way than pile of copy-pasted code that manually traverse the graph, I propose a set of compositional helpers to write patterns that can be matched against the ideal graph. Since these patterns are... patterns, so not related to a specific graph, they can be allocated once and forever. When used, one provides the node (called center) around which one want to check if the pattern holds. > > On top of making the description of pattern easier, these helpers allows nice printing in case of error, by showing the path from the center to the violating node. For instance (made up for the purpose of showing the formatting), a violation with a path climbing only inputs: > > 1 failure for node > 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 > At node > 209 CountedLoopEnd === 182 208 [[ 210 197 ]] [lt] P=0,948966, C=23799,000000 !orig=[196] !jvms: StringLatin1::equals @ bci:12 (line 100) > From path: > [center] 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 > <-(0)- 215 SafePoint === 210 1 7 1 1 216 37 54 185 [[ 211 ]] SafePoint !orig=186 !jvms: StringLatin1::equals @ bci:29 (line 100) > <-(0)- 210 IfFalse === 209 [[ 215 216 ]] #0 !orig=198 !jvms: StringL... I am fine with `VerifyIdealGraph` flag. The main concern is we have tons of `Verify*` flags but I don't think we use them in CI testing. So we are forgetting about them, they will brake and few years later we are removing them like we did with `VerifyOpto`. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26362#issuecomment-3133341084 From duke at openjdk.org Tue Jul 29 17:42:13 2025 From: duke at openjdk.org (duke) Date: Tue, 29 Jul 2025 17:42:13 GMT Subject: Withdrawn: 8347555: [REDO] C2: implement optimization for series of Add of unique value In-Reply-To: References: Message-ID: On Thu, 6 Feb 2025 23:29:51 GMT, Kangcheng Xu wrote: > [JDK-8347555](https://bugs.openjdk.org/browse/JDK-8347555) is a redo of [JDK-8325495](https://bugs.openjdk.org/browse/JDK-8325495) was [first merged](https://git.openjdk.org/jdk/pull/20754) then backed out due to a regression. This patch redos the feature and fixes the bit shift overflow problem. For more information please refer to the previous PR. > > When constanlizing multiplications (possibly in forms on `lshifts`), the multiplier is upgraded to long and then later narrowed to int if needed. However, when a `lshift` operand is exactly `32`, overflowing an int, using long has an unexpected result. (i.e., `(1 << 32) = 1` and `(int) (1L << 32) = 0`) > > The following was implemented to address this issue. > > if (UseNewCode2) { > *multiplier = bt == T_INT > ? 
(jlong) (1 << con->get_int()) // loss of precision is expected for int as it overflows > : ((jlong) 1) << con->get_int(); > } else { > *multiplier = ((jlong) 1 << con->get_int()); > } > > > Two new bitshift overflow tests were added. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/23506 From bulasevich at openjdk.org Tue Jul 29 18:24:16 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Tue, 29 Jul 2025 18:24:16 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v40] In-Reply-To: References: Message-ID: On Thu, 24 Jul 2025 01:20:46 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This does not modify existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [x] Linux x64 fastdebug tier 1/2/3/4 >> - [x] Linux aarch64 fastdebug tier 1/2/3/4 > > Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: > > Use CompiledICLocker instead of CompiledIC_lock Marked as reviewed by bulasevich (Committer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23573#pullrequestreview-3068694877 From fyang at openjdk.org Wed Jul 30 01:05:13 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 30 Jul 2025 01:05:13 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v5] In-Reply-To: References: <_gBsBRuwEYg_z4Fy1eTSI0ATppAF85SFB9fNkXSwe8E=.7c88580a-bb52-4066-977e-29c84b8b8b56@github.com> Message-ID: On Tue, 29 Jul 2025 01:22:52 GMT, Feilong Jiang wrote: >> Fei Yang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: >> >> - Merge remote-tracking branch 'upstream/master' into JDK-8364150 >> - Assert >> - Merge remote-tracking branch 'upstream/master' into JDK-8364150 >> - Comment >> - 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call > > Looks good, thanks! @feilongjiang @Hamlin-Li : Thanks for the reviews. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26495#issuecomment-3134543384 From fyang at openjdk.org Wed Jul 30 01:05:14 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 30 Jul 2025 01:05:14 GMT Subject: Integrated: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 04:05:20 GMT, Fei Yang wrote: > Hi, please consider this small change. > > JDK-8343430 removed the old trampoline call on RISC-V. And the new solution (reloc call) loads a target address from stub section and do an indirect call. This means the stub is always there for a NativeCall. So there's no need to check existence of the stub when doing `CallRelocation::fix_relocation_after_move` [1]. 
> > We can always return the stub address in `NativeCall::reloc_destination` and use that address in `NativeCall::reloc_set_destination`. This helps simplify the code and saves one `MacroAssembler::target_addr_for_insn` call > and one `trampoline_stub_Relocation::get_trampoline_for` call in these two functions respectively. > > Testing on linux-riscv64: > - [x] tier1-tier3 (release build) > - [x] hs:tier1-hs:tier3 (fastdebug build) > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.cpp#L404-L406 This pull request has now been integrated. Changeset: 3488f53d Author: Fei Yang URL: https://git.openjdk.org/jdk/commit/3488f53d2c3083bd886644684ec6885046ea7f8e Stats: 14 lines in 2 files changed: 3 ins; 6 del; 5 mod 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call Reviewed-by: mli, fjiang ------------- PR: https://git.openjdk.org/jdk/pull/26495 From duke at openjdk.org Wed Jul 30 06:14:40 2025 From: duke at openjdk.org (erifan) Date: Wed, 30 Jul 2025 06:14:40 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v8] In-Reply-To: References: Message-ID: <-otlKVhe_xfmpET_cwn5CdvzDduOfFApGSH5VoZSwuk=.7eb8a0e3-4ad6-4ffb-97fd-11a2120a3eaf@github.com> > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is > relative smaller than that of `fromLong`. So this patch does the conversion for these cases. > > The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. > > Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. > > This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. > > As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like > > VectorMaskToLong (VectorLongToMask x) => x > > > Hence, this patch also added the following optimizations: > > VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 > VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 > > VectorMaskCast (VectorMaskCast x) => x > > And we can see noticeable performance improvement with the above optimizations for floating-point types. 
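For illustration, the equivalence the rewrite relies on can be seen directly at the Java level. Below is a minimal standalone sketch (not part of this patch; the species choice and the printed checks are made up for the example, and it needs `--add-modules jdk.incubator.vector` to compile and run):

    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorMask;
    import jdk.incubator.vector.VectorSpecies;

    public class FromLongMaskAllSketch {
        static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256; // 8 int lanes

        public static void main(String[] args) {
            int vlen = SPECIES.length();
            long allSet = -1L >>> (64 - vlen); // low vlen bits set, here 0xFF

            // fromLong with all lane bits set selects every lane, like maskAll(true)
            VectorMask<Integer> m1 = VectorMask.fromLong(SPECIES, allSet);
            VectorMask<Integer> m2 = SPECIES.maskAll(true);
            System.out.println(m1.allTrue() && m2.allTrue()); // true

            // toLong on an all-true mask is the low-vlen-bits constant, matching
            // the (x & (-1ULL >> (64 - vlen))) rewrite listed above
            System.out.println(m1.toLong() == allSet && m2.toLong() == allSet); // true

            // fromLong(0) deselects every lane, like maskAll(false)
            System.out.println(VectorMask.fromLong(SPECIES, 0L).toLong() == 0L); // true
        }
    }

When the long argument is a compile-time constant of one of these two forms, replacing the fromLong call with the cheaper maskAll shape is exactly the conversion described above.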
> > Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: > > Benchmark Unit Before Error After Error Uplift > microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 > microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 > microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 > microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 > microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 > microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 > microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 > microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 > > > Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: > > Benchmark Unit Before Error After Error Uplift > microMaskFromLongToLong_Double... erifan has updated the pull request incrementally with one additional commit since the last revision: Set default warm up to 10000 for JTReg tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25793/files - new: https://git.openjdk.org/jdk/pull/25793/files/8418ebdd..b1a768eb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=06-07 Stats: 6 lines in 3 files changed: 3 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/25793.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25793/head:pull/25793 PR: https://git.openjdk.org/jdk/pull/25793 From duke at openjdk.org Wed Jul 30 06:16:57 2025 From: duke at openjdk.org (erifan) Date: Wed, 30 Jul 2025 06:16:57 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v7] In-Reply-To: References: Message-ID: On Tue, 29 Jul 2025 08:21:57 GMT, Christian Hagedorn wrote: >> erifan has updated the pull request incrementally with one additional commit since the last revision: >> >> Move the assertion to the beginning of the code block > > Testing is currently slow - still running but I report what I have so far. There is one test failure on `linux-aarch64-debug` and `macosx-aarch64-debug` with the new test `VectorMaskToLongTest.java`: > > Additional flags: `-ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation` (probably only related to `-XX:-TieredComilation`, maybe we have not enough profiling and need to increase the warm-up but that's just a wild guess without looking at the test) > >
> Log > > Compilations (5) of Failed Methods (5) > -------------------------------------- > 1) Compilation of "public static void compiler.vectorapi.VectorMaskToLongTest.testFromLongToLongByte()": >> Phase "PrintIdeal": > AFTER: print_ideal > 0 Root === 0 203 230 269 270 [[ 0 1 3 225 198 189 23 186 167 28 44 124 104 54 277 295 ]] inner > 1 Con === 0 [[ ]] #top > 3 Start === 3 0 [[ 3 5 6 7 8 9 ]] #{0:control, 1:abIO, 2:memory, 3:rawptr:BotPTR, 4:return_address} > 5 Parm === 3 [[ 168 ]] Control !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:-1 (line 182) > 6 Parm === 3 [[ 168 ]] I_O !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:-1 (line 182) > 7 Parm === 3 [[ 168 279 286 ]] Memory Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:-1 (line 182) > 8 Parm === 3 [[ 270 269 255 233 230 199 203 168 226 ]] FramePtr !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:-1 (line 182) > 9 Parm === 3 [[ 270 269 199 226 ]] ReturnAdr !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:-1 (line 182) > 23 ConP === 0 [[ 255 168 ]] #jdk/incubator/vector/ByteVector$ByteSpecies (jdk/incubator/vector/VectorSpecies):exact * Oop:jdk/incubator/vector/ByteVector$ByteSpecies (jdk/incubator/vector/VectorSpecies):exact * > 28 ConI === 0 [[ 168 ]] #int:1 > 44 ConI === 0 [[ 168 ]] #int:16 > 54 ConL === 0 [[ 255 233 168 168 199 ]] #long:65534 > 104 ConP === 0 [[ 168 ]] #java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * Oop:java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * > 124 ... Hi @chhagedorn , I have increased the warm up times, could you help test the PR again ? Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3134981739 From hgreule at openjdk.org Wed Jul 30 07:09:57 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Wed, 30 Jul 2025 07:09:57 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v8] In-Reply-To: <-otlKVhe_xfmpET_cwn5CdvzDduOfFApGSH5VoZSwuk=.7eb8a0e3-4ad6-4ffb-97fd-11a2120a3eaf@github.com> References: <-otlKVhe_xfmpET_cwn5CdvzDduOfFApGSH5VoZSwuk=.7eb8a0e3-4ad6-4ffb-97fd-11a2120a3eaf@github.com> Message-ID: On Wed, 30 Jul 2025 06:14:40 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is >> relative smaller than that of `fromLong`. So this patch does the conversion for these cases. >> >> The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. >> >> Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. >> >> This conversion also enables further optimizations that recognize maskAll patterns, see [1]. 
And we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like >> >> VectorMaskToLong (VectorLongToMask x) => x >> >> >> Hence, this patch also added the following optimizations: >> >> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> >> VectorMaskCast (VectorMaskCast x) => x >> >> And we can see noticeable performance improvement with the above optimizations for floating-point types. >> >> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Error After Error Uplift >> microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 >> microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 >> microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 >> microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 >> microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 >> microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 >> microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 >> microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 >> >> >> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: >> >> Benchm... > > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Set default warm up to 10000 for JTReg tests I think there are a few (follow-up?) improvements that can be made: 1. Using KnownBits and checking against that rather than requiring a constant in `is_maskall_type`. This is probably a bit difficult to test for now. 2. If the range of an input is known to be [-1, 0], we can use that as an input for a MaskAllNode. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3135107549 From mchevalier at openjdk.org Wed Jul 30 08:27:37 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Wed, 30 Jul 2025 08:27:37 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v2] In-Reply-To: References: Message-ID: <1A8oR7hEgev2U_ys1H_AVJS5kjw6LWoPgrVPhJXSFqI=.34cbd04b-bf88-441f-9c3d-97f9aee7f3c3@github.com> > Some crashes are consequences of earlier misshaped ideal graphs, which could be detected earlier, closer to the source, before the possibly many transformations that lead to the crash. > > Let's verify that the ideal graph is well-shaped earlier then! I propose here such a feature. This runs after IGVN, because at this point, the graph, should be cleaned up for any weirdness happening earlier or during IGVN. > > This feature is enabled with the develop flag `VerifyIdealStructuralInvariants`. Open to renaming. No problem with me! This feature is only available in debug builds, and most of the code is even not compiled in product, since it uses some debug-only functions, such as `Node::dump` or `Node::Name`. > > For now, only local checks are implemented: they are checks that only look at a node and its neighborhood, wherever it happens in the graph. Typically: under a `If` node, we have a `IfTrue` and a `IfFalse`. 
To ease development, each check is implemented in its own class, independently of the others. Nevertheless, one needs to do always the same kind of things: checking there is an output of such type, checking there is N inputs, that the k-th input has such type... To ease writing such checks, in a readable way, and in a less error-prone way than pile of copy-pasted code that manually traverse the graph, I propose a set of compositional helpers to write patterns that can be matched against the ideal graph. Since these patterns are... patterns, so not related to a specific graph, they can be allocated once and forever. When used, one provides the node (called center) around which one want to check if the pattern holds. > > On top of making the description of pattern easier, these helpers allows nice printing in case of error, by showing the path from the center to the violating node. For instance (made up for the purpose of showing the formatting), a violation with a path climbing only inputs: > > 1 failure for node > 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 > At node > 209 CountedLoopEnd === 182 208 [[ 210 197 ]] [lt] P=0,948966, C=23799,000000 !orig=[196] !jvms: StringLatin1::equals @ bci:12 (line 100) > From path: > [center] 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 > <-(0)- 215 SafePoint === 210 1 7 1 1 216 37 54 185 [[ 211 ]] SafePoint !orig=186 !jvms: StringLatin1::equals @ bci:29 (line 100) > <-(0)- 210 IfFalse === 209 [[ 215 216 ]] #0 !orig=198 !jvms: StringL... Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: Rename flag as suggested ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26362/files - new: https://git.openjdk.org/jdk/pull/26362/files/944a8fe4..9117fde8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26362&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26362&range=00-01 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26362.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26362/head:pull/26362 PR: https://git.openjdk.org/jdk/pull/26362 From jsjolen at openjdk.org Wed Jul 30 09:06:04 2025 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Wed, 30 Jul 2025 09:06:04 GMT Subject: RFR: 8352112: [ubsan] hotspot/share/code/relocInfo.cpp:130:37: runtime error: applying non-zero offset 18446744073709551614 to null pointer [v2] In-Reply-To: <7ja3_KpFi1NPc4EPFpMk3af7RgGtQYu0zGmrv05lCj0=.a7fb616e-8923-47f1-b869-3bb064d27f58@github.com> References: <7ja3_KpFi1NPc4EPFpMk3af7RgGtQYu0zGmrv05lCj0=.a7fb616e-8923-47f1-b869-3bb064d27f58@github.com> Message-ID: On Mon, 28 Jul 2025 14:39:26 GMT, Vladimir Kozlov wrote: >> Right. This typo was fixed in https://github.com/openjdk/jdk/pull/26175 >> For now I do not see how this change is related with [JDK-8361382: NMT corruption](https://bugs.openjdk.org/browse/JDK-8361382) > > Yes, it was fixed. And they were harmless. > > I think @jdksjolen linked it because of call stack. But I also don't know how it could cause NMT bug. > @jdksjolen did you try to to undo these changes and reproduce https://bugs.openjdk.org/browse/JDK-8361382 ? 
> > > V [libjvm.dylib+0xbf1c8c] VMError::report(outputStream*, bool)+0xa9c (mallocHeader.inline.hpp:107) > V [libjvm.dylib+0xbf5d25] VMError::report_and_die(int, char const*, char const*, __va_list_tag*, Thread*, unsigned char*, void const*, void const*, char const*, int, unsigned long)+0x575 > V [libjvm.dylib+0x404e20] DebuggingContext::~DebuggingContext()+0x0 > V [libjvm.dylib+0x8f770f] MallocHeader* MallocHeader::resolve_checked_impl(void*)+0x15f > V [libjvm.dylib+0x8f720c] MallocTracker::record_free_block(void*)+0xc > V [libjvm.dylib+0x9a719a] os::free(void*)+0xea > V [libjvm.dylib+0x388fb4] CodeBlob::purge()+0x44 > V [libjvm.dylib+0x978e98] nmethod::purge(bool)+0x308 > V [libjvm.dylib+0x380439] ClassUnloadingContext::purge_nmethods()+0x69 My reasoning was based on the fact that what used to be set to a constant, `nullptr`, is no longer set to a constant. I'm saying that `blob_end()` isn't a constant, because it can be changed with `adjust_size()`, but I think that @vnkozlov said that that's unlikely, as it's only changed for interpreter stubs and not nmethods. The only other possibility is that `nmethod`s are copied, so their `this` pointer changes, this will make `blob_end()` change, and this may incur a double free. This double free is detected by NMT, which leads to the crash. I think it may still be best to 'fix' this by setting the `_mutable_data` to `nullptr` again and fixing the iterators, as it does simplify reasoning around this (and imho, understanding the code). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24102#discussion_r2241999516 From chagedorn at openjdk.org Wed Jul 30 09:07:58 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 30 Jul 2025 09:07:58 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v7] In-Reply-To: References: Message-ID: <3yPAlpt3kRdPI4dCUey57DXQvlF8QFkhJpNiv4742Og=.4889e854-5f03-4d01-87e0-14938779e396@github.com> On Tue, 29 Jul 2025 10:14:20 GMT, erifan wrote: > And increasing the default warm up value fixes the issue. Nice, that's good to hear! > Hi @chhagedorn , I have increased the warm up times, could you help test the PR again ? Thanks! Thanks for coming back with a fix! I'll resubmit testing and report back again. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3135443727 From mhaessig at openjdk.org Wed Jul 30 12:05:36 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Wed, 30 Jul 2025 12:05:36 GMT Subject: RFR: 8356176: C2 MemorySegment: missing RCE with byteSize() in Loop Exit Check inside the for Expression Message-ID: A loop of the form MemorySegment ms = {}; for (long i = 0; i < ms.byteSize() / 8L; i++) { // vectorizable work } does not vectorize, whereas MemorySegment ms = {}; long size = ms.byteSize(); for (long i = 0; i < size / 8L; i++) { // vectorizable work } vectorizes. The reason is that the loop with the loop limit lifted manually out of the loop exit check is immediately detected as a counted loop, whereas the other (more intuitive) loop has to be cleaned up a bit, before it is recognized as counted. Tragically, the `LShift` used in the loop exit check gets split through the phi preventing range check elimination, which is why the loop does not get vectorized. 
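For reference, a self-contained sketch of the two loop shapes being compared is below (the array-backed segment and the summing body are made up for the example and simply stand in for the vectorizable work):

    import java.lang.foreign.MemorySegment;
    import java.lang.foreign.ValueLayout;

    public class ByteSizeLoopShapes {
        // Limit recomputed from ms.byteSize() in the exit check: the shape that
        // misses range check elimination and vectorization before this change.
        static long sumLimitInCondition(MemorySegment ms) {
            long sum = 0;
            for (long i = 0; i < ms.byteSize() / 8L; i++) {
                sum += ms.getAtIndex(ValueLayout.JAVA_LONG, i); // vectorizable work
            }
            return sum;
        }

        // Limit hoisted by hand: recognized as a counted loop and vectorized.
        static long sumHoistedLimit(MemorySegment ms) {
            long sum = 0;
            long size = ms.byteSize();
            for (long i = 0; i < size / 8L; i++) {
                sum += ms.getAtIndex(ValueLayout.JAVA_LONG, i); // vectorizable work
            }
            return sum;
        }

        public static void main(String[] args) {
            MemorySegment ms = MemorySegment.ofArray(new long[1024]);
            System.out.println(sumLimitInCondition(ms) + " " + sumHoistedLimit(ms));
        }
    }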
Before splitting through the phi, there is a check to prevent splitting `LShift`s modifying the IV of a *counted loop*: https://github.com/openjdk/jdk/blob/e3f85c961b4c1e5e01aedf3a0f4e1b0e6ff457fd/src/hotspot/share/opto/loopopts.cpp#L1172-L1176 Hence, not detecting the counted loop earlier is the main culprit for the missing vectorization. So, why is the counted loop not detected? Because the call to `byteSize()` is inside the loop head, and `CiTypeFlow::clone_loop_heads()` duplicates it into the loop body. The loop limit in the cloned loop head is loop variant and thus cannot be detected as a counted loop. The first `ITER_GVN` in `PHASEIDEALLOOP1` will already remove the cloned loop head, enabling counted loop detection in the following iteration, which in turn enables vectorization. @merykitty also provides an alternative explanation. A node is only split through a phi if that splitting is profitable. While the split looks to be profitable in the example above, it only generates wins on the loop entry edge. This ends up destroying the canonical loop structure and prevents further optimization. Other issues like [JDK-8348096](https://bugs.openjdk.org/browse/JDK-8348096) suffer from the same problem ## Change Description Based on @merykitty's reasoning, this PR tracks if wins in `split_through_phi()` are on the loop entry edge or the loop backedge. If there are wins on a loop entry edge, we do not consider the split to be profitable unless there are a lot of wins on the backedge.
Explored Alternatives
1. Prevent splitting `LShift`s in uncounted loops that have the same shape as a counted loop would have. This fixes this specific issue, but causes potential regressions with uncounted loops.
2. Insert a "`PHASEIDEALLOOP0`" with `LoopOptsNone` that only performs loop tree building and then a round of IGVN where `Loop` nodes have been created. This cleans up the duplicated loop limit field access inside the loop, which enables the counted loop detection in `PHASEIDEALLOOP1`. This fixes this issue and a few others, but has loads of unforeseen consequences for loopopts down the line, including some regressions.
This solution also has an impact on some tests: - `compiler/loopopts/InvariantCodeMotionReassociateAddSub.java` observes fewer `AddI` nodes ([d9a59af](https://github.com/openjdk/jdk/pull/26429/commits/d9a59af977da70575a1e215c504958b1fb3db6a6)) - `compiler/vectorization/runner/ArrayIndexFillTest.java` only remains with the `fillLongArray` case attributed to [JDK-8332878](https://bugs.openjdk.org/browse/JDK-8332878) and the previously failing floating point cases fixed ([5839f15](https://github.com/openjdk/jdk/pull/26429/commits/5839f157cae57f80fb041251a0a28327a0970fae)) - `compiler/loopopts/superword/TestMemorySegment.java` shows that the failing test cases tracked by [JDK-8331659](https://bugs.openjdk.org/browse/JDK-8331659) pass now ([63689f8](https://github.com/openjdk/jdk/pull/26429/commits/63689f84b364828f7b50979acf1443498dddd1da)) - the reproducer from [JDK-8348096](https://bugs.openjdk.org/browse/JDK-8348096) is fixed with this PR. Added `TestMemorySegmentField.java` as regression test. ## Testing - [x] Github Actions - [x] tier1 - tier3 plus some internal testing on Oracle supported platforms - [x] tier4 - tier6 on Oracle supported platforms - [ ] SPECjbb2025, SPECjvm2008, Dacapo23 ## Acknowledgements Big thanks to @merykitty for coming up with the solution to this issue and providing feedback, as well as @eme64, @chhagedorn, @TobiHartmann, and @rwestrel for discussing this issue and providing feedback. ------------- Commit messages: - Address review comments - Add test from JDK-8348096 - Fix cases of JDK-8332878 not caused by push through add - Adjust previously failing tests tracked by JDK-8331659 - Adjust for eliminated nodes in InvariantCodeMotionReassociateAddSub.java - Split only profitable when not on entry edge - Add regression test Changes: https://git.openjdk.org/jdk/pull/26429/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26429&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8356176 Stats: 228 lines in 7 files changed: 187 ins; 18 del; 23 mod Patch: https://git.openjdk.org/jdk/pull/26429.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26429/head:pull/26429 PR: https://git.openjdk.org/jdk/pull/26429 From qamai at openjdk.org Wed Jul 30 12:05:37 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 30 Jul 2025 12:05:37 GMT Subject: RFR: 8356176: C2 MemorySegment: missing RCE with byteSize() in Loop Exit Check inside the for Expression In-Reply-To: References: Message-ID: <2BCEe5coWSwvmmoUBLWZlzBs81azC4xeekoxZgLv_7I=.61ff0c53-5671-471e-96fb-85875116b5ac@github.com> On Tue, 22 Jul 2025 15:05:29 GMT, Manuel H?ssig wrote: > A loop of the form > > MemorySegment ms = {}; > for (long i = 0; i < ms.byteSize() / 8L; i++) { > // vectorizable work > } > > does not vectorize, whereas > > MemorySegment ms = {}; > long size = ms.byteSize(); > for (long i = 0; i < size / 8L; i++) { > // vectorizable work > } > > vectorizes. The reason is that the loop with the loop limit lifted manually out of the loop exit check is immediately detected as a counted loop, whereas the other (more intuitive) loop has to be cleaned up a bit, before it is recognized as counted. Tragically, the `LShift` used in the loop exit check gets split through the phi preventing range check elimination, which is why the loop does not get vectorized. 
Before splitting through the phi, there is a check to prevent splitting `LShift`s modifying the IV of a *counted loop*: > > https://github.com/openjdk/jdk/blob/e3f85c961b4c1e5e01aedf3a0f4e1b0e6ff457fd/src/hotspot/share/opto/loopopts.cpp#L1172-L1176 > > Hence, not detecting the counted loop earlier is the main culprit for the missing vectorization. > > So, why is the counted loop not detected? Because the call to `byteSize()` is inside the loop head, and `CiTypeFlow::clone_loop_heads()` duplicates it into the loop body. The loop limit in the cloned loop head is loop variant and thus cannot be detected as a counted loop. The first `ITER_GVN` in `PHASEIDEALLOOP1` will already remove the cloned loop head, enabling counted loop detection in the following iteration, which in turn enables vectorization. > > @merykitty also provides an alternative explanation. A node is only split through a phi if that splitting is profitable. While the split looks to be profitable in the example above, it only generates wins on the loop entry edge. This ends up destroying the canonical loop structure and prevents further optimization. Other issues like [JDK-8348096](https://bugs.openjdk.org/browse/JDK-8348096) suffer from the same problem > > ## Change Description > > Based on @merykitty's reasoning, this PR tracks if wins in `split_through_phi()` are on the loop entry edge or the loop backedge. If there are wins on a loop entry edge, we do not consider the split to be profitable unless there are a lot of wins on the backedge. > >
Explored Alternatives > 1. Prevent splitting `LShift`s in uncounted loops that have the same shape as a counted loop would have. This fixes this specific issue, but causes potential regressions with uncounted loops. > 2. Insert a "`PHASEIDEALLOOP0`" with `LoopOptsNone` that only perfor... Thanks a lot for your work, I have some suggestions. LGTM, I have some small suggestions. I think one possible solution is to avoid splitting through `Phi` if there is no benefit. In this case, the only benefit is in the loop entry, while there is none in the loop backedge. If we take frequency into consideration, we may be able to determine that the splitting is not profitable. What do you think? >From the principle point of view, splitting a node through the loop `Phi` is only profitable if the profit is in the loop backedge. From the practical point of view, there are some issues when `split_through_phi` is applied recklessly such as [JDK-8348096](https://bugs.openjdk.org/browse/JDK-8348096). I believe taking loop head into consideration when splitting through `Phi`s can solve these issues. As a result, I think while you are at this issue, it is worth investigating this approach. src/hotspot/share/opto/loopnode.hpp line 1639: > 1637: class SplitWins { > 1638: private: > 1639: uint _total_wins; I think using `int` here is totally fine, it frees you from all the refactoring `int` to `uint`, too. src/hotspot/share/opto/loopnode.hpp line 1655: > 1653: } > 1654: if (region->is_Loop() && ctrl_index == LoopNode::LoopBackControl) { > 1655: _backedge_wins++; Since the corresponding thing is called `LoopBack` in `LoopNode`, calling this `loop_back_wins` would be a little bit more consistent. src/hotspot/share/opto/loopnode.hpp line 1660: > 1658: } > 1659: bool profitable(uint policy) const { > 1660: return _total_wins >= policy && !(_backedge_wins == 0 && _entry_wins > 0); I think this should be `(_entry_wins == 0 && _total_wins >= policy) || _backedge_wins >= policy`. src/hotspot/share/opto/loopnode.hpp line 1663: > 1661: // split if profitable. > 1662: bool profitable(int policy) const { > 1663: return policy < 0 || (_loop_entry_wins == 0 && _total_wins > policy) || _loop_back_wins > policy; `policy < 0` seems unnecessary, `wins` is initialized with 0 and is always incremented, so it cannot be negative. I assume you are guarding against a hypothetical arithmetic overflow, but signed overflow is UB in C++. So the program is ill-formed if that happens. Additionally, we will catch that with UBSAN. src/hotspot/share/opto/loopopts.cpp line 70: > 68: } > 69: > 70: SplitWins wins = SplitWins(); `SplitWins wins` will initialize a `SplitWins` variable using the default constructor, so `= SplitWins()` is unnecessary. src/hotspot/share/opto/loopopts.cpp line 1091: > 1089: } > 1090: > 1091: // Detect if split_through_phi would split an LShift that multiplies a I tried your patch and without this the test still vectorizes well. If this is necessary please provide another test demonstrating its necessity. src/hotspot/share/opto/loopopts.cpp line 1202: > 1200: // so 1 win is considered profitable. Big merges will require big > 1201: // cloning, so get a larger policy. > 1202: int policy = checked_cast(n_blk->req() >> 2); This change seems unnecessary. src/hotspot/share/opto/loopopts.cpp line 1508: > 1506: > 1507: // Now split the bool up thru the phi > 1508: Node *bolphi = split_thru_phi(bol, n_ctrl, 0); This change could be reverted if you keep `policy` being an `int`. 
------------- PR Review: https://git.openjdk.org/jdk/pull/26429#pullrequestreview-3048157767 Marked as reviewed by qamai (Committer). PR Review: https://git.openjdk.org/jdk/pull/26429#pullrequestreview-3070996121 PR Comment: https://git.openjdk.org/jdk/pull/26429#issuecomment-3106388733 PR Comment: https://git.openjdk.org/jdk/pull/26429#issuecomment-3107857642 PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2226074337 PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2226077544 PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2226075898 PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2242338007 PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2242339561 PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2226072833 PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2242341678 PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2226078521 From mhaessig at openjdk.org Wed Jul 30 12:05:38 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Wed, 30 Jul 2025 12:05:38 GMT Subject: RFR: 8356176: C2 MemorySegment: missing RCE with byteSize() in Loop Exit Check inside the for Expression In-Reply-To: References: Message-ID: On Tue, 22 Jul 2025 15:05:29 GMT, Manuel H?ssig wrote: > A loop of the form > > MemorySegment ms = {}; > for (long i = 0; i < ms.byteSize() / 8L; i++) { > // vectorizable work > } > > does not vectorize, whereas > > MemorySegment ms = {}; > long size = ms.byteSize(); > for (long i = 0; i < size / 8L; i++) { > // vectorizable work > } > > vectorizes. The reason is that the loop with the loop limit lifted manually out of the loop exit check is immediately detected as a counted loop, whereas the other (more intuitive) loop has to be cleaned up a bit, before it is recognized as counted. Tragically, the `LShift` used in the loop exit check gets split through the phi preventing range check elimination, which is why the loop does not get vectorized. Before splitting through the phi, there is a check to prevent splitting `LShift`s modifying the IV of a *counted loop*: > > https://github.com/openjdk/jdk/blob/e3f85c961b4c1e5e01aedf3a0f4e1b0e6ff457fd/src/hotspot/share/opto/loopopts.cpp#L1172-L1176 > > Hence, not detecting the counted loop earlier is the main culprit for the missing vectorization. > > So, why is the counted loop not detected? Because the call to `byteSize()` is inside the loop head, and `CiTypeFlow::clone_loop_heads()` duplicates it into the loop body. The loop limit in the cloned loop head is loop variant and thus cannot be detected as a counted loop. The first `ITER_GVN` in `PHASEIDEALLOOP1` will already remove the cloned loop head, enabling counted loop detection in the following iteration, which in turn enables vectorization. > > @merykitty also provides an alternative explanation. A node is only split through a phi if that splitting is profitable. While the split looks to be profitable in the example above, it only generates wins on the loop entry edge. This ends up destroying the canonical loop structure and prevents further optimization. Other issues like [JDK-8348096](https://bugs.openjdk.org/browse/JDK-8348096) suffer from the same problem > > ## Change Description > > Based on @merykitty's reasoning, this PR tracks if wins in `split_through_phi()` are on the loop entry edge or the loop backedge. 
If there are wins on a loop entry edge, we do not consider the split to be profitable unless there are a lot of wins on the backedge. > >
Explored Alternatives > 1. Prevent splitting `LShift`s in uncounted loops that have the same shape as a counted loop would have. This fixes this specific issue, but causes potential regressions with uncounted loops. > 2. Insert a "`PHASEIDEALLOOP0`" with `LoopOptsNone` that only perfor... I would not rely solely on profile information to solve this, but it might be a good additional piece of information for the first proposed solution. I see, but, if I understand correctly, any logic that relates to profit will have to go into `split_through_phi()`. Since we already have special logic for not splitting `LShifts` in `split_if_with_blocks_pre()`, I would prefer to only have it there. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26429#issuecomment-3106653136 PR Comment: https://git.openjdk.org/jdk/pull/26429#issuecomment-3107735579 From qamai at openjdk.org Wed Jul 30 12:05:38 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 30 Jul 2025 12:05:38 GMT Subject: RFR: 8356176: C2 MemorySegment: missing RCE with byteSize() in Loop Exit Check inside the for Expression In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 09:11:46 GMT, Manuel H?ssig wrote: >> A loop of the form >> >> MemorySegment ms = {}; >> for (long i = 0; i < ms.byteSize() / 8L; i++) { >> // vectorizable work >> } >> >> does not vectorize, whereas >> >> MemorySegment ms = {}; >> long size = ms.byteSize(); >> for (long i = 0; i < size / 8L; i++) { >> // vectorizable work >> } >> >> vectorizes. The reason is that the loop with the loop limit lifted manually out of the loop exit check is immediately detected as a counted loop, whereas the other (more intuitive) loop has to be cleaned up a bit, before it is recognized as counted. Tragically, the `LShift` used in the loop exit check gets split through the phi preventing range check elimination, which is why the loop does not get vectorized. Before splitting through the phi, there is a check to prevent splitting `LShift`s modifying the IV of a *counted loop*: >> >> https://github.com/openjdk/jdk/blob/e3f85c961b4c1e5e01aedf3a0f4e1b0e6ff457fd/src/hotspot/share/opto/loopopts.cpp#L1172-L1176 >> >> Hence, not detecting the counted loop earlier is the main culprit for the missing vectorization. >> >> So, why is the counted loop not detected? Because the call to `byteSize()` is inside the loop head, and `CiTypeFlow::clone_loop_heads()` duplicates it into the loop body. The loop limit in the cloned loop head is loop variant and thus cannot be detected as a counted loop. The first `ITER_GVN` in `PHASEIDEALLOOP1` will already remove the cloned loop head, enabling counted loop detection in the following iteration, which in turn enables vectorization. >> >> @merykitty also provides an alternative explanation. A node is only split through a phi if that splitting is profitable. While the split looks to be profitable in the example above, it only generates wins on the loop entry edge. This ends up destroying the canonical loop structure and prevents further optimization. Other issues like [JDK-8348096](https://bugs.openjdk.org/browse/JDK-8348096) suffer from the same problem >> >> ## Change Description >> >> Based on @merykitty's reasoning, this PR tracks if wins in `split_through_phi()` are on the loop entry edge or the loop backedge. If there are wins on a loop entry edge, we do not consider the split to be profitable unless there are a lot of wins on the backedge. >> >>
Explored Alternatives >> 1. Prevent splitting `LShift`s in uncounted loops that have the same shape as a counted loop would have. This fixes this specific issue, but causes potential regressions with uncounted loops. >> 2. I... > > I would not rely solely on profile information to solve this, but it might be a good additional piece of information for the first proposed solution. @mhaessig You don't really need profile information, only that the profit is on the loop entry path and there is no profit on the loop backedge. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26429#issuecomment-3106679172 From mhaessig at openjdk.org Wed Jul 30 12:05:38 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Wed, 30 Jul 2025 12:05:38 GMT Subject: RFR: 8356176: C2 MemorySegment: missing RCE with byteSize() in Loop Exit Check inside the for Expression In-Reply-To: <2BCEe5coWSwvmmoUBLWZlzBs81azC4xeekoxZgLv_7I=.61ff0c53-5671-471e-96fb-85875116b5ac@github.com> References: <2BCEe5coWSwvmmoUBLWZlzBs81azC4xeekoxZgLv_7I=.61ff0c53-5671-471e-96fb-85875116b5ac@github.com> Message-ID: On Wed, 23 Jul 2025 12:34:37 GMT, Quan Anh Mai wrote: >> A loop of the form >> >> MemorySegment ms = {}; >> for (long i = 0; i < ms.byteSize() / 8L; i++) { >> // vectorizable work >> } >> >> does not vectorize, whereas >> >> MemorySegment ms = {}; >> long size = ms.byteSize(); >> for (long i = 0; i < size / 8L; i++) { >> // vectorizable work >> } >> >> vectorizes. The reason is that the loop with the loop limit lifted manually out of the loop exit check is immediately detected as a counted loop, whereas the other (more intuitive) loop has to be cleaned up a bit, before it is recognized as counted. Tragically, the `LShift` used in the loop exit check gets split through the phi preventing range check elimination, which is why the loop does not get vectorized. Before splitting through the phi, there is a check to prevent splitting `LShift`s modifying the IV of a *counted loop*: >> >> https://github.com/openjdk/jdk/blob/e3f85c961b4c1e5e01aedf3a0f4e1b0e6ff457fd/src/hotspot/share/opto/loopopts.cpp#L1172-L1176 >> >> Hence, not detecting the counted loop earlier is the main culprit for the missing vectorization. >> >> So, why is the counted loop not detected? Because the call to `byteSize()` is inside the loop head, and `CiTypeFlow::clone_loop_heads()` duplicates it into the loop body. The loop limit in the cloned loop head is loop variant and thus cannot be detected as a counted loop. The first `ITER_GVN` in `PHASEIDEALLOOP1` will already remove the cloned loop head, enabling counted loop detection in the following iteration, which in turn enables vectorization. >> >> @merykitty also provides an alternative explanation. A node is only split through a phi if that splitting is profitable. While the split looks to be profitable in the example above, it only generates wins on the loop entry edge. This ends up destroying the canonical loop structure and prevents further optimization. Other issues like [JDK-8348096](https://bugs.openjdk.org/browse/JDK-8348096) suffer from the same problem >> >> ## Change Description >> >> Based on @merykitty's reasoning, this PR tracks if wins in `split_through_phi()` are on the loop entry edge or the loop backedge. If there are wins on a loop entry edge, we do not consider the split to be profitable unless there are a lot of wins on the backedge. >> >>
Explored Alternatives >> 1. Prevent splitting `LShift`s in uncounted loops that have the same shape as a counted loop would have. This fixes this specific issue, but causes potential regressions with uncounted loops. >> 2. I... > > From the principle point of view, splitting a node through the loop `Phi` is only profitable if the profit is in the loop backedge. From the practical point of view, there are some issues when `split_through_phi` is applied recklessly such as [JDK-8348096](https://bugs.openjdk.org/browse/JDK-8348096). I believe taking loop head into consideration when splitting through `Phi`s can solve these issues. As a result, I think while you are at this issue, it is worth investigating this approach. @merykitty, I took me a while to understand, but now I implemented your suggestion and it works at least the case of this issue (testing is ongoing). Thank you for pushing back. EDIT: It also fixes [JDK-8348096](https://bugs.openjdk.org/browse/JDK-8348096). Thank you for your review and your help with this PR, @merykitty! > src/hotspot/share/opto/loopnode.hpp line 1663: > >> 1661: // split if profitable. >> 1662: bool profitable(int policy) const { >> 1663: return policy < 0 || (_loop_entry_wins == 0 && _total_wins > policy) || _loop_back_wins > policy; > > `policy < 0` seems unnecessary, `wins` is initialized with 0 and is always incremented, so it cannot be negative. I assume you are guarding against a hypothetical arithmetic overflow, but signed overflow is UB in C++. So the program is ill-formed if that happens. Additionally, we will catch that with UBSAN. My intention was to clearly state that `policy = -1` means "always split". That confused me before. > src/hotspot/share/opto/loopopts.cpp line 70: > >> 68: } >> 69: >> 70: SplitWins wins = SplitWins(); > > `SplitWins wins` will initialize a `SplitWins` variable using the default constructor, so `= SplitWins()` is unnecessary. Thank you for pointing it out. Will remove. > src/hotspot/share/opto/loopopts.cpp line 1091: > >> 1089: } >> 1090: >> 1091: // Detect if split_through_phi would split an LShift that multiplies a > > I tried your patch and without this the test still vectorizes well. If this is necessary please provide another test demonstrating its necessity. Will do, when cleaning up for RFR. > src/hotspot/share/opto/loopopts.cpp line 1202: > >> 1200: // so 1 win is considered profitable. Big merges will require big >> 1201: // cloning, so get a larger policy. >> 1202: int policy = checked_cast(n_blk->req() >> 2); > > This change seems unnecessary. Now that you say it, yes, the shift right makes that check obsolete. Wll remove. > src/hotspot/share/opto/loopopts.cpp line 1508: > >> 1506: >> 1507: // Now split the bool up thru the phi >> 1508: Node *bolphi = split_thru_phi(bol, n_ctrl, 0); > > This change could be reverted if you keep `policy` being an `int`. Also, my change is wrong, because `policy == -1` means "split even if there are no wins". 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26429#issuecomment-3108961271 PR Comment: https://git.openjdk.org/jdk/pull/26429#issuecomment-3135981220 PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2242362758 PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2242367711 PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2228078394 PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2242366956 PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2228089983 From qamai at openjdk.org Wed Jul 30 12:05:38 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 30 Jul 2025 12:05:38 GMT Subject: RFR: 8356176: C2 MemorySegment: missing RCE with byteSize() in Loop Exit Check inside the for Expression In-Reply-To: <2BCEe5coWSwvmmoUBLWZlzBs81azC4xeekoxZgLv_7I=.61ff0c53-5671-471e-96fb-85875116b5ac@github.com> References: <2BCEe5coWSwvmmoUBLWZlzBs81azC4xeekoxZgLv_7I=.61ff0c53-5671-471e-96fb-85875116b5ac@github.com> Message-ID: On Wed, 23 Jul 2025 16:13:47 GMT, Quan Anh Mai wrote: >> A loop of the form >> >> MemorySegment ms = {}; >> for (long i = 0; i < ms.byteSize() / 8L; i++) { >> // vectorizable work >> } >> >> does not vectorize, whereas >> >> MemorySegment ms = {}; >> long size = ms.byteSize(); >> for (long i = 0; i < size / 8L; i++) { >> // vectorizable work >> } >> >> vectorizes. The reason is that the loop with the loop limit lifted manually out of the loop exit check is immediately detected as a counted loop, whereas the other (more intuitive) loop has to be cleaned up a bit, before it is recognized as counted. Tragically, the `LShift` used in the loop exit check gets split through the phi preventing range check elimination, which is why the loop does not get vectorized. Before splitting through the phi, there is a check to prevent splitting `LShift`s modifying the IV of a *counted loop*: >> >> https://github.com/openjdk/jdk/blob/e3f85c961b4c1e5e01aedf3a0f4e1b0e6ff457fd/src/hotspot/share/opto/loopopts.cpp#L1172-L1176 >> >> Hence, not detecting the counted loop earlier is the main culprit for the missing vectorization. >> >> So, why is the counted loop not detected? Because the call to `byteSize()` is inside the loop head, and `CiTypeFlow::clone_loop_heads()` duplicates it into the loop body. The loop limit in the cloned loop head is loop variant and thus cannot be detected as a counted loop. The first `ITER_GVN` in `PHASEIDEALLOOP1` will already remove the cloned loop head, enabling counted loop detection in the following iteration, which in turn enables vectorization. >> >> @merykitty also provides an alternative explanation. A node is only split through a phi if that splitting is profitable. While the split looks to be profitable in the example above, it only generates wins on the loop entry edge. This ends up destroying the canonical loop structure and prevents further optimization. Other issues like [JDK-8348096](https://bugs.openjdk.org/browse/JDK-8348096) suffer from the same problem >> >> ## Change Description >> >> Based on @merykitty's reasoning, this PR tracks if wins in `split_through_phi()` are on the loop entry edge or the loop backedge. If there are wins on a loop entry edge, we do not consider the split to be profitable unless there are a lot of wins on the backedge. >> >>
Explored Alternatives >> 1. Prevent splitting `LShift`s in uncounted loops that have the same shape as a counted loop would have. This fixes this specific issue, but causes potential regressions with uncounted loops. >> 2. I... > > src/hotspot/share/opto/loopnode.hpp line 1660: > >> 1658: } >> 1659: bool profitable(uint policy) const { >> 1660: return _total_wins >= policy && !(_backedge_wins == 0 && _entry_wins > 0); > > I think this should be `(_entry_wins == 0 && _total_wins >= policy) || _backedge_wins >= policy`. My bad, since the original negative condition is `wins <= policy`, this should be `(_entry_wins == 0 && _total_wins > policy) || _backedge_wins > policy`. Fixing this solves all the issues in GHA. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2227776568 From mhaessig at openjdk.org Wed Jul 30 12:05:38 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Wed, 30 Jul 2025 12:05:38 GMT Subject: RFR: 8356176: C2 MemorySegment: missing RCE with byteSize() in Loop Exit Check inside the for Expression In-Reply-To: References: <2BCEe5coWSwvmmoUBLWZlzBs81azC4xeekoxZgLv_7I=.61ff0c53-5671-471e-96fb-85875116b5ac@github.com> Message-ID: On Thu, 24 Jul 2025 08:08:16 GMT, Quan Anh Mai wrote: >> src/hotspot/share/opto/loopnode.hpp line 1660: >> >>> 1658: } >>> 1659: bool profitable(uint policy) const { >>> 1660: return _total_wins >= policy && !(_backedge_wins == 0 && _entry_wins > 0); >> >> I think this should be `(_entry_wins == 0 && _total_wins >= policy) || _backedge_wins >= policy`. > > My bad, since the original negative condition is `wins <= policy`, this should be `(_entry_wins == 0 && _total_wins > policy) || _backedge_wins > policy`. Fixing this solves all the issues in GHA. Can confirm. This also seems to solve JDK-8331659 and JDK-8332878. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2228283016 From fgao at openjdk.org Wed Jul 30 15:01:02 2025 From: fgao at openjdk.org (Fei Gao) Date: Wed, 30 Jul 2025 15:01:02 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v3] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 03:26:36 GMT, Xiaohong Gong wrote: >> This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform. >> >> ### Background >> Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register. >> >> ### Implementation >> >> #### Challenges >> Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints. >> >> For a 512-bit SVE machine, loading a `byte` vector with different vector species require different approaches: >> - SPECIES_64: Single operation with mask (8 elements, 256-bit) >> - SPECIES_128: Single operation, full register (16 elements, 512-bit) >> - SPECIES_256: Two operations + merge (32 elements, 1024-bit) >> - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit) >> >> Use `ByteVector.SPECIES_512` as an example: >> - It contains 64 elements. 
So the index vector size should be `64 * 32` bits, which is 4 times the SVE vector register size. >> - It requires 4 vector gather-loads to finish the whole operation. >> >> >> byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...] >> int[] idx = [0, 1, 2, 3, ..., 63, ...] >> >> 4 gather-load: >> idx_v1 = [15 14 13 ... 1 0] gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa] >> idx_v2 = [31 30 29 ... 17 16] gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb] >> idx_v3 = [47 46 45 ... 33 32] gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc] >> idx_v4 = [63 62 61 ... 49 48] gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd] >> merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa] >> >> >> #### Solution >> The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end. >> >> Here are the main changes: >> - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher. >> - Added `VectorSliceNode` for result mer... > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Refine IR pattern and clean backend rules Thanks for updating it! I've submitted a test on a 256-bit SVE machine. I'll get back to you once it's finished. src/hotspot/share/opto/vectorIntrinsics.cpp line 1176: > 1174: } > 1175: > 1176: // Generate a vector mask by casting the input mask from "byte|short" type to "int" type for vector It seems that we're not doing "casting" here. Suggestion: // Widen the input mask "in" from "byte|short" to "int" for use in vector gather loads. // The "part" parameter selects which segment of the original mask to extend. src/hotspot/share/opto/vectorIntrinsics.cpp line 1186: > 1184: assert(part < 4, "must be"); > 1185: const TypeVect* temp_vt = TypeVect::makemask(T_SHORT, vt->length() * 2); > 1186: // If part == 0, the elements of the lowest 1/4 part are extended. Suggestion: // If part == 0, extend elements from the lowest 1/4 of the input. // If part == 1, extend elements from the second 1/4. // If part == 2, extend elements from the third 1/4. // If part == 3, extend elements from the highest 1/4. ------------- PR Review: https://git.openjdk.org/jdk/pull/26236#pullrequestreview-3071808794 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2242854832 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2242873220 From fbredberg at openjdk.org Wed Jul 30 15:38:44 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Wed, 30 Jul 2025 15:38:44 GMT Subject: RFR: 8364141: Remove LockingMode related code from x86 Message-ID: Since the integration of [JDK-8359437](https://bugs.openjdk.org/browse/JDK-8359437) the `LockingMode` flag can no longer be set by the user; instead it's declared as `const int LockingMode = LM_LIGHTWEIGHT;`. This means that we can now safely remove all `LockingMode` related code from all platforms. This PR removes `LockingMode` related code from the **x86** platform. When all the `LockingMode` code has been removed from all platforms, we can go on and remove it from shared (non-platform specific) files as well. And finally remove the `LockingMode` variable itself. Passes tier1-tier5 with no added problems.
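To make the cleanup concrete, here is a minimal standalone sketch of why such branches are now dead (only the `const int LockingMode = LM_LIGHTWEIGHT;` declaration is taken from the description above; the enum values and emit_* helpers are illustrative assumptions, not HotSpot code):

#include <cstdio>

enum { LM_MONITOR = 0, LM_LEGACY = 1, LM_LIGHTWEIGHT = 2 };  // values illustrative
const int LockingMode = LM_LIGHTWEIGHT;  // now a constant, no longer a user-settable flag

static void emit_lightweight_lock() { std::puts("lightweight locking path"); }
static void emit_legacy_lock()      { std::puts("legacy locking path"); }

static void emit_lock() {
  // With LockingMode a compile-time constant, this test folds and the legacy
  // branch is unreachable -- which is exactly the kind of code being deleted.
  if (LockingMode == LM_LIGHTWEIGHT) {
    emit_lightweight_lock();
  } else {
    emit_legacy_lock();
  }
}

int main() { emit_lock(); return 0; }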
------------- Commit messages: - 8364141: Remove LockingMode related code from x86 Changes: https://git.openjdk.org/jdk/pull/26552/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26552&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8364141 Stats: 637 lines in 10 files changed: 44 ins; 537 del; 56 mod Patch: https://git.openjdk.org/jdk/pull/26552.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26552/head:pull/26552 PR: https://git.openjdk.org/jdk/pull/26552 From kvn at openjdk.org Wed Jul 30 15:54:06 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 30 Jul 2025 15:54:06 GMT Subject: RFR: 8352112: [ubsan] hotspot/share/code/relocInfo.cpp:130:37: runtime error: applying non-zero offset 18446744073709551614 to null pointer [v2] In-Reply-To: References: <7ja3_KpFi1NPc4EPFpMk3af7RgGtQYu0zGmrv05lCj0=.a7fb616e-8923-47f1-b869-3bb064d27f58@github.com> Message-ID: On Wed, 30 Jul 2025 09:03:29 GMT, Johan Sj?len wrote: >> Yes, it was fixed. And they were harmless. >> >> I think @jdksjolen linked it because of call stack. But I also don't know how it could cause NMT bug. >> @jdksjolen did you try to to undo these changes and reproduce https://bugs.openjdk.org/browse/JDK-8361382 ? >> >> >> V [libjvm.dylib+0xbf1c8c] VMError::report(outputStream*, bool)+0xa9c (mallocHeader.inline.hpp:107) >> V [libjvm.dylib+0xbf5d25] VMError::report_and_die(int, char const*, char const*, __va_list_tag*, Thread*, unsigned char*, void const*, void const*, char const*, int, unsigned long)+0x575 >> V [libjvm.dylib+0x404e20] DebuggingContext::~DebuggingContext()+0x0 >> V [libjvm.dylib+0x8f770f] MallocHeader* MallocHeader::resolve_checked_impl(void*)+0x15f >> V [libjvm.dylib+0x8f720c] MallocTracker::record_free_block(void*)+0xc >> V [libjvm.dylib+0x9a719a] os::free(void*)+0xea >> V [libjvm.dylib+0x388fb4] CodeBlob::purge()+0x44 >> V [libjvm.dylib+0x978e98] nmethod::purge(bool)+0x308 >> V [libjvm.dylib+0x380439] ClassUnloadingContext::purge_nmethods()+0x69 > > My reasoning was based on the fact that what used to be set to a constant, `nullptr`, is no longer set to a constant. I'm saying that `blob_end()` isn't a constant, because it can be changed with `adjust_size()`, but I think that @vnkozlov said that that's unlikely, as it's only changed for interpreter stubs and not nmethods. The only other possibility is that `nmethod`s are copied, so their `this` pointer changes, this will make `blob_end()` change, and this may incur a double free. This double free is detected by NMT, which leads to the crash. > > I think it may still be best to 'fix' this by setting the `_mutable_data` to `nullptr` again and fixing the iterators, as it does simplify reasoning around this (and imho, understanding the code). We do not copy nmethods. At least until #23573 is integrated - and it will be under flag. `_mutable_data` field is initialized during final method installation into CodeCache - nothing modifies it for nmethods. I can add debug flag to CodeBlob to catch double free. But as I commented in [JDK-8361382](https://bugs.openjdk.org/browse/JDK-8361382) it is most likely the issue is a buffer overflow from preceding memory block which stomped over header. 
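A minimal sketch of the failure mode being debated here (purely illustrative; the names follow the discussion, not the actual CodeBlob/nmethod code): when the "nothing to free" sentinel is a computed address such as blob_end() rather than a constant nullptr, anything that changes that address between allocation and purge turns the guard into a bad free, which NMT then reports.

#include <cstddef>
#include <cstdlib>

struct Blob {
  char* _mutable_data;
  char  _body[16];

  char* blob_end() { return _body + sizeof(_body); }  // sentinel: "no separate allocation"

  void init(size_t mutable_data_size) {
    _mutable_data = (mutable_data_size > 0)
        ? static_cast<char*>(std::malloc(mutable_data_size))
        : blob_end();
  }

  void purge() {
    // If blob_end() moved (adjust_size(), or the blob was copied so 'this' changed),
    // the sentinel is no longer recognized and we free a pointer that was never malloc'ed.
    if (_mutable_data != blob_end()) {
      std::free(_mutable_data);
    }
    _mutable_data = blob_end();
  }
};

int main() {
  Blob b;
  b.init(0);   // no mutable data: sentinel in use
  b.purge();   // safe only as long as blob_end() is stable
  return 0;
}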
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24102#discussion_r2243145779 From kvn at openjdk.org Wed Jul 30 15:54:07 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 30 Jul 2025 15:54:07 GMT Subject: RFR: 8352112: [ubsan] hotspot/share/code/relocInfo.cpp:130:37: runtime error: applying non-zero offset 18446744073709551614 to null pointer [v2] In-Reply-To: References: <7ja3_KpFi1NPc4EPFpMk3af7RgGtQYu0zGmrv05lCj0=.a7fb616e-8923-47f1-b869-3bb064d27f58@github.com> Message-ID: <_yvfEt6gxtA0gaoAyeuaOzN8u7Og_QhZyKtWCp9_q2c=.864cd245-1681-4d42-80c7-cd9a00e45cef@github.com> On Wed, 30 Jul 2025 15:51:03 GMT, Vladimir Kozlov wrote: >> My reasoning was based on the fact that what used to be set to a constant, `nullptr`, is no longer set to a constant. I'm saying that `blob_end()` isn't a constant, because it can be changed with `adjust_size()`, but I think that @vnkozlov said that that's unlikely, as it's only changed for interpreter stubs and not nmethods. The only other possibility is that `nmethod`s are copied, so their `this` pointer changes, this will make `blob_end()` change, and this may incur a double free. This double free is detected by NMT, which leads to the crash. >> >> I think it may still be best to 'fix' this by setting the `_mutable_data` to `nullptr` again and fixing the iterators, as it does simplify reasoning around this (and imho, understanding the code). > > We do not copy nmethods. At least until #23573 is integrated - and it will be under flag. > > `_mutable_data` field is initialized during final method installation into CodeCache - nothing modifies it for nmethods. > > I can add debug flag to CodeBlob to catch double free. But as I commented in [JDK-8361382](https://bugs.openjdk.org/browse/JDK-8361382) it is most likely the issue is a buffer overflow from preceding memory block which stomped over header. I will do experiment with flag and let you know. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24102#discussion_r2243147688 From coleenp at openjdk.org Wed Jul 30 16:24:53 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 30 Jul 2025 16:24:53 GMT Subject: RFR: 8364141: Remove LockingMode related code from x86 In-Reply-To: References: Message-ID: On Wed, 30 Jul 2025 13:17:37 GMT, Fredrik Bredberg wrote: > Since the integration of [JDK-8359437](https://bugs.openjdk.org/browse/JDK-8359437) the `LockingMode` flag can no longer be set by the user, instead it's declared as `const int LockingMode = LM_LIGHTWEIGHT;`. This means that we can now safely remove all `LockingMode` related code from all platforms. > > This PR removes `LockingMode` related code from the **x86** platform. > > When all the `LockingMode` code has been removed from all platforms, we can go on and remove it from shared (non-platform specific) files as well. And finally remove the `LockingMode` variable itself. > > Passes tier1-tier5 with no added problems. This looks really good. Thank you for the comment about balanced locking. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 389: > 387: // obj: object to lock > 388: // rax: tmp -- KILLED > 389: // t : tmp - cannot be obj nor rax -- KILLED This same comment is repeated just above so you probably don't need it here. ------------- Marked as reviewed by coleenp (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26552#pullrequestreview-3072397438 PR Review Comment: https://git.openjdk.org/jdk/pull/26552#discussion_r2243231761 From duke at openjdk.org Wed Jul 30 16:37:08 2025 From: duke at openjdk.org (duke) Date: Wed, 30 Jul 2025 16:37:08 GMT Subject: Withdrawn: 8252473: [TESTBUG] compiler tests fail with minimal VM: Unrecognized VM option In-Reply-To: References: Message-ID: On Wed, 26 Mar 2025 17:37:58 GMT, Zdenek Zambersky wrote: > This change adds ` -XX:-IgnoreUnrecognizedVMOptions` to problematic tests (or `@requires vm.compiler2.enabled` in one case), to prevent failures `Unrecognized VM option` on client VM. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/24262 From shade at openjdk.org Wed Jul 30 18:51:26 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 30 Jul 2025 18:51:26 GMT Subject: RFR: 8361211: C2: Final graph reshaping generates unencodeable klass constants Message-ID: See the bug for more investigation. I have tried to come up with an isolated test, but failed. So I am doing this change somewhat blindly, without a clear regression test. The investigation on the CTW points directly to this code, and I believe we should be more conservative in final graph reshaping. [JDK-8343206](https://bugs.openjdk.org/browse/JDK-8343206) added the assert for `ConNKlass`, which somehow does not trigger. I think it is safe to bail out of this transformation. Also, this only plugs this particular leak. I think we should really be disabling the abstract/interface encoding optimization until C2 does not expose itself to this issue on more paths. There is [JDK-8343218](https://bugs.openjdk.org/browse/JDK-8343218) that we can re-open. Additional testing: - [x] Linux x86_64 server fastdebug, a rare CTW failure does not reproduce anymore - [x] Linux x86_64 server fastdebug, `tier1` - [ ] Linux x86_64 server fastdebug, `all` - [ ] Linux AArch64 server fastdebug, `all` ------------- Commit messages: - Fix Changes: https://git.openjdk.org/jdk/pull/26559/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26559&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361211 Stats: 8 lines in 1 file changed: 6 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26559.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26559/head:pull/26559 PR: https://git.openjdk.org/jdk/pull/26559 From ghan at openjdk.org Wed Jul 30 22:58:50 2025 From: ghan at openjdk.org (Guanqiang Han) Date: Wed, 30 Jul 2025 22:58:50 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" [v4] In-Reply-To: References: Message-ID: > I'm able to consistently reproduce the problem using the following command line and test program ? 
> > java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java > > import java.util.Arrays; > public class Test{ > public static void main(String[] args) { > System.out.println("begin"); > byte[] arr1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > byte[] arr2 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > System.out.println(Arrays.equals(arr1, arr2)); > System.out.println("end"); > } > } > > From my analysis, the root cause appears to be a mismatch in operand handling between T_ADDRESS and T_LONG in LIR_Assembler::stack2reg, especially when the source is marked as double stack (e.g., T_LONG) and the destination as single CPU register (e.g., T_ADDRESS), leading to assertion failures like assert(is_single_stack()) (because T_LONG is double_size). > > In the test program above, the call chain is: Arrays.equals -> ArraysSupport.vectorizedMismatch -> LIRGenerator::do_vectorizedMismatch > Within the do_vectorizedMismatch() method, a move instruction constructs an LIR_Op1. During LIR to machine code generation, LIR_Assembler::stack2reg was called. > > In this case, the src operand has type T_LONG and the dst operand has type T_ADDRESS. This combination triggers an assert in stack2reg, due to a mismatch between the stack slot type and register type handling. > > Importantly, this path (where LIR_Assembler::stack2reg is called) is only taken when src is forced onto the stack. To reliably trigger this condition, the test is run with the -Xcomp option to force compilation and increase register pressure. > > A reference to the relevant code paths is provided below: > image1 > image2 > > On 64-bit platforms, although T_ADDRESS is classified as single_size, it is in fact 64 bits wide, representing a single 64-bit general-purpose register, and it can hold a T_LONG value, which is also 64 bits. > > However, T_LONG is defined as double_size, requiring two local variable slots or a pair of registers in the JVM's abstract model. This mismatch stems from the fact that T_ADDRESS is platform-dependent: it's 32 bits on 32-bit platforms, and 64 bits on 64-bit platforms, yet its size classification remains single_size regardless. > > This classification... Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains six additional commits since the last revision: - change T_LONG to T_ADDRESS in some intrinsic functions - Merge remote-tracking branch 'upstream/master' into 8359235 - Increase sleep time to ensure the method gets compiled - add regression test - Merge remote-tracking branch 'upstream/master' into 8359235 - 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26462/files - new: https://git.openjdk.org/jdk/pull/26462/files/611d2fd1..c90be2b5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26462&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26462&range=02-03 Stats: 6945 lines in 261 files changed: 3887 ins; 2617 del; 441 mod Patch: https://git.openjdk.org/jdk/pull/26462.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26462/head:pull/26462 PR: https://git.openjdk.org/jdk/pull/26462 From kvn at openjdk.org Wed Jul 30 23:12:53 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 30 Jul 2025 23:12:53 GMT Subject: Re: RFR: 8361211: C2: Final graph reshaping generates unencodeable klass constants In-Reply-To: References: Message-ID: <6sEYx73eZK7XQXOLg5u2qHUqXc2OoqqR_iAlc5C9QrU=.4474e9ad-0e3d-4fe0-9011-0dfc3f2e069a@github.com> On Wed, 30 Jul 2025 16:20:43 GMT, Aleksey Shipilev wrote: > See the bug for more investigation. I have tried to come up with an isolated test, but failed. So I am doing this change somewhat blindly, without a clear regression test. The investigation on the CTW points directly to this code, and I believe we should be more conservative in final graph reshaping. [JDK-8343206](https://bugs.openjdk.org/browse/JDK-8343206) added the assert for `ConNKlass`, which somehow does not trigger. I think it is safe to bail out of this transformation. > > Also, this only plugs this particular leak. I think we should really be disabling the abstract/interface encoding optimization until C2 does not expose itself to this issue on more paths. There is [JDK-8343218](https://bugs.openjdk.org/browse/JDK-8343218) that we can re-open. > > Additional testing: > - [x] Linux x86_64 server fastdebug, a rare CTW failure does not reproduce anymore > - [x] Linux x86_64 server fastdebug, `tier1` > - [ ] Linux x86_64 server fastdebug, `all` > - [ ] Linux AArch64 server fastdebug, `all` Do you know why the logic in `CmpPNode::Ideal()` did not work? That is what @TobiHartmann pointed out before. I have 2 assumptions which could be wrong: - We did not call IGVN transform when we do some constant folding and replaced its input with a more exact klass which is not encodable. - Node's inputs were swapped when `CmpPNode::Ideal()` is called - code assumes that Decode is in(1) and Klass is in(2). Maybe it is something else. It would be interesting to track it down because it may cause other issues.
as shown below: >> image3 >> image4 >> That's why I opted for a more localized fix. I believe this is still a reasonable compromise. On 64-bit platforms, both T_ADDRESS and T_LONG are 64-bit wide, and general-purpose registers are capable of holding either type. Moreover, the code already uses movptr for moving 64-bit wide data, as shown below: >> image5 >> So semantically, this modification in the PR seems safe and practical in this context. >> That said, I fully agree that the current treatment of new_pointer_register() is a bit confusing. If you, or other experts familiar with this area, believe the RFE is reasonable and it gets opened, I'd be happy to take on the implementation. >> Thanks again for your insights, and I look forward to your feedback. > > @hgqxjj , I wasn't suggesting changing the new_pointer_register() implementation to use T_ADDRESS at this time, but to change intrinsics that call signature.append(T_ADDRESS) to use new_register(T_ADDRESS) for the register instead of new_pointer_register(). As @TobiHartmann pointed out, we should fix all the intrinsics that are using signature.append(T_ADDRESS). Hi @dean-long @TobiHartmann, I've updated the patch based on your feedback. Please take another look when you have time. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26462#issuecomment-3138299913 From fyang at openjdk.org Thu Jul 31 02:02:00 2025 From: fyang at openjdk.org (Fei Yang) Date: Thu, 31 Jul 2025 02:02:00 GMT Subject: Re: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v10] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 14:05:25 GMT, Yuri Gaevsky wrote: >> The patch adds the possibility to use RVV instructions for faster vectorizedHashCode calculations on RVV v1.0.0 capable hardware. >> >> Testing: hotspot/jtreg/compiler/ under QEMU-8.1 with RVV v1.0.0. > > Yuri Gaevsky has updated the pull request incrementally with one additional commit since the last revision: > > - removed tail processing with RVV instructions as a simple scalar loop provides better results in general src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 2062: > 2060: vmv_s_x(v_powmax, pow31_highest); > 2061: > 2062: vsetvli(consumed, cnt, Assembler::e32, Assembler::m4); What does the performance look like with a smaller `lmul` (m1 or m2)? I am asking this because there is hardware out there (like SG2044) with a VLEN of 128 instead of 256 like on K1. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17413#discussion_r2244179129 From duke at openjdk.org Thu Jul 31 02:40:07 2025 From: duke at openjdk.org (erifan) Date: Thu, 31 Jul 2025 02:40:07 GMT Subject: Re: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v8] In-Reply-To: References: <-otlKVhe_xfmpET_cwn5CdvzDduOfFApGSH5VoZSwuk=.7eb8a0e3-4ad6-4ffb-97fd-11a2120a3eaf@github.com> Message-ID: <7WsFhgNuJV99O_IxmrhPCWDMvRMsxY9ZRh_VGCYCL_M=.0612a34c-cb05-4ad6-b2dd-1ebc1fc03244@github.com> On Wed, 30 Jul 2025 07:06:52 GMT, Hannes Greule wrote: >> erifan has updated the pull request incrementally with one additional commit since the last revision: >> >> Set default warm up to 10000 for JTReg tests > > I think there are a few (follow-up?) improvements that can be made: > 1. Using KnownBits and checking against that rather than requiring a constant in `is_maskall_type`. This is probably a bit difficult to test for now. > 2. If the range of an input is known to be [-1, 0], we can use that as an input for a MaskAllNode. Hi @SirYwell thanks for your suggestions.
But I'm not quite understand what you meant, can you elaborate? ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3138376026 From jbhateja at openjdk.org Thu Jul 31 03:11:08 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 31 Jul 2025 03:11:08 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v8] In-Reply-To: References: <-otlKVhe_xfmpET_cwn5CdvzDduOfFApGSH5VoZSwuk=.7eb8a0e3-4ad6-4ffb-97fd-11a2120a3eaf@github.com> Message-ID: On Wed, 30 Jul 2025 07:06:52 GMT, Hannes Greule wrote: > I think there are a few (follow-up?) improvements that can be made: > > 1. Using KnownBits and checking against that rather than requiring a constant in `is_maskall_type`. This is probably a bit difficult to test for now. > 2. If the range of an input is known to be [-1, 0], we can use that as an input for a MaskAllNode. Constants are the limiting case of KnownBits where all the bits are known, i.e., KnownBits.ZEROS | Known.Bits.ONES = -1, since the pattern check is especially over -1 / 0 constant values, hence what we have currently looks reasonable. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3138416237 From dlong at openjdk.org Thu Jul 31 03:34:55 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 31 Jul 2025 03:34:55 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v7] In-Reply-To: <_gJoTNnBpV2Y2ENO9s153NWZeq_ujs40-zoyuZstOqM=.69d1d039-5022-4beb-ae79-7fc4193f3a11@github.com> References: <_gJoTNnBpV2Y2ENO9s153NWZeq_ujs40-zoyuZstOqM=.69d1d039-5022-4beb-ae79-7fc4193f3a11@github.com> Message-ID: On Mon, 28 Jul 2025 08:33:50 GMT, Manuel H?ssig wrote: >> Dean Long has updated the pull request incrementally with one additional commit since the last revision: >> >> readability suggestion > > Thank you for addressing my comments. I have done another pass and it looks good to me. Thank you, @mhaessig , for looking at this complicated code! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26121#issuecomment-3138444068 From jkarthikeyan at openjdk.org Thu Jul 31 03:37:47 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Thu, 31 Jul 2025 03:37:47 GMT Subject: RFR: 8342095: Add autovectorizer support for subword vector casts [v14] In-Reply-To: References: Message-ID: > Hi all, > This patch adds initial support for the autovectorizer to generate conversions between subword types. Currently, when superword sees two packs that have different basic types, it discards them and bails out of vectorization. This patch changes the behavior to ask the backend if a cast between the conflicting types is supported, and keeps the pack if it is. Later, when the `VTransform` graph is built, a synthetic cast is emitted when packs requiring casts are detected. Currently, only narrowing casts are supported as I wanted to re-use existing `VectorCastX2Y` logic for the initial version, but adding more conversions is simple and can be done with a subsequent RFE. I have attached a JMH benchmark and got these results on my Zen 3 machine: > > > Baseline Patch > Benchmark (SIZE) Mode Cnt Score Error Units Score Error Units Improvement > VectorSubword.intToByte 1024 avgt 12 200.049 ? 19.787 ns/op 56.228 ? 3.535 ns/op (3.56x) > VectorSubword.intToShort 1024 avgt 12 179.826 ? 1.539 ns/op 43.332 ? 1.166 ns/op (4.15x) > VectorSubword.shortToByte 1024 avgt 12 245.580 ? 6.150 ns/op 29.757 ? 1.055 ns/op (8.25x) > > > I've also added some IR tests and they pass on my linux x64 machine. 
Thoughts and reviews would be appreciated! Jasmine Karthikeyan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 16 commits: - Update tests, cleanup logic - Merge branch 'master' into vectorize-subword - Check for AVX2 for byte/long conversions - Whitespace and benchmark tweak - Address more comments, make test and benchmark more exhaustive - Merge from master - Fix copyright after merge - Fix copyright - Merge - Implement patch with VectorCastNode::implemented - ... and 6 more: https://git.openjdk.org/jdk/compare/8fcbb110...aabaafba ------------- Changes: https://git.openjdk.org/jdk/pull/23413/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23413&range=13 Stats: 578 lines in 12 files changed: 519 ins; 11 del; 48 mod Patch: https://git.openjdk.org/jdk/pull/23413.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23413/head:pull/23413 PR: https://git.openjdk.org/jdk/pull/23413 From jkarthikeyan at openjdk.org Thu Jul 31 03:37:49 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Thu, 31 Jul 2025 03:37:49 GMT Subject: RFR: 8342095: Add autovectorizer support for subword vector casts [v13] In-Reply-To: References: Message-ID: <_rkrkHcE6r68dUmpuYnzY3evs6Q_GksPTuopyxDD1JY=.0da7ba0c-1257-4f9e-a9d7-51af0679ef6d@github.com> On Wed, 28 May 2025 15:25:19 GMT, Vladimir Kozlov wrote: >> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: >> >> Check for AVX2 for byte/long conversions > > src/hotspot/share/opto/superword.cpp line 2361: > >> 2359: >> 2360: // Subword cast: Element sizes differ, but the platform supports a cast to change the def shape to the use shape. >> 2361: if ((is_subword_type(def_bt) || is_subword_type(use_bt)) && VectorCastNode::implemented(-1, pack_size, def_bt, use_bt)) { > > I see you use this set of conditions 2 time. Can it be separate function? Also `-1` is strange argument for people who not familiar with code. May be add `/* comment */` to it. Or use some `#define` to have meaningful name for it. Thanks for the suggestion! I've moved the logic to a function to reduce code duplication and added a comment explaining the `-1`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r2244268512 From jkarthikeyan at openjdk.org Thu Jul 31 03:51:05 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Thu, 31 Jul 2025 03:51:05 GMT Subject: RFR: 8342095: Add autovectorizer support for subword vector casts [v14] In-Reply-To: References: Message-ID: On Thu, 31 Jul 2025 03:37:47 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This patch adds initial support for the autovectorizer to generate conversions between subword types. Currently, when superword sees two packs that have different basic types, it discards them and bails out of vectorization. This patch changes the behavior to ask the backend if a cast between the conflicting types is supported, and keeps the pack if it is. Later, when the `VTransform` graph is built, a synthetic cast is emitted when packs requiring casts are detected. Currently, only narrowing casts are supported as I wanted to re-use existing `VectorCastX2Y` logic for the initial version, but adding more conversions is simple and can be done with a subsequent RFE. 
I have attached a JMH benchmark and got these results on my Zen 3 machine: >> >> >> Baseline Patch >> Benchmark (SIZE) Mode Cnt Score Error Units Score Error Units Improvement >> VectorSubword.intToByte 1024 avgt 12 200.049 ? 19.787 ns/op 56.228 ? 3.535 ns/op (3.56x) >> VectorSubword.intToShort 1024 avgt 12 179.826 ? 1.539 ns/op 43.332 ? 1.166 ns/op (4.15x) >> VectorSubword.shortToByte 1024 avgt 12 245.580 ? 6.150 ns/op 29.757 ? 1.055 ns/op (8.25x) >> >> >> I've also added some IR tests and they pass on my linux x64 machine. Thoughts and reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 16 commits: > > - Update tests, cleanup logic > - Merge branch 'master' into vectorize-subword > - Check for AVX2 for byte/long conversions > - Whitespace and benchmark tweak > - Address more comments, make test and benchmark more exhaustive > - Merge from master > - Fix copyright after merge > - Fix copyright > - Merge > - Implement patch with VectorCastNode::implemented > - ... and 6 more: https://git.openjdk.org/jdk/compare/8fcbb110...aabaafba I've merged from master and updated the tests to support the changes from [JDK-8350177](https://bugs.openjdk.org/browse/JDK-8350177). Now the non-truncating nodes can compile again :) I've also added the changes from the code review. Let me know what you all think! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23413#issuecomment-3138463834 From snatarajan at openjdk.org Thu Jul 31 07:22:21 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Thu, 31 Jul 2025 07:22:21 GMT Subject: RFR: 8325482: Test that distinct seeds produce distinct traces for compiler stress flags Message-ID: The existing test (`compiler/debug/TestStress.java`) verifies that compiler stress options produce consistent traces when using the same seed. However, there is currently no test to ensure that different seeds result in different traces. ### Solution Added a test case to assess the distinctness of traces generated from different seeds. This fix addresses the fragility concern highlighted in [JDK-8325482](https://bugs.openjdk.org/browse/JDK-8325482) by verifying that traces produced using N (in this case 10) distinct seeds are all not identical. ### Changes to `compiler/debug/TestStress.java` While investigating this issue, I observed that in `compiler/debug/TestStress.java`, the stress options for macro expansion and macro elimination were not being triggered because there were fewer than 2 macro nodes. Note that the `shuffle_macro_nodes()` in` compile.cpp` is only meaningful when there are more than two macro nodes. The generated traces for macro expansion and macro elimination in `TestStress.java` were empty. I have proposed changes to address this problem. 
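A condensed sketch of that approach follows (the `trace(seed)` helper is a stand-in for the test's per-phase trace helpers, which would fork a JVM with a given -XX:StressSeed and capture the compilation trace; this is an assumption about the shape of the test, not the actual TestStressDistinctSeed code):

import java.util.HashSet;
import java.util.Set;

public class DistinctSeedSketch {
    // Stand-in for a helper that would launch a JVM with -XX:StressSeed=<seed>
    // (plus the relevant -XX:+Stress* flag) and return the captured trace.
    static String trace(int seed) {
        return "trace-for-seed-" + seed;  // stubbed for illustration
    }

    public static void main(String[] args) {
        Set<String> traces = new HashSet<>();
        for (int seed = 0; seed < 10; seed++) {  // N = 10 distinct seeds
            traces.add(trace(seed));
        }
        // If every seed produced an identical trace, the stress seed had no effect.
        if (traces.size() <= 1) {
            throw new AssertionError("all seeds produced identical traces");
        }
    }
}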
------------- Commit messages: - Initial Fix Changes: https://git.openjdk.org/jdk/pull/26554/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26554&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8325482 Stats: 127 lines in 2 files changed: 126 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26554.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26554/head:pull/26554 PR: https://git.openjdk.org/jdk/pull/26554 From hgreule at openjdk.org Thu Jul 31 07:29:59 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Thu, 31 Jul 2025 07:29:59 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v8] In-Reply-To: References: <-otlKVhe_xfmpET_cwn5CdvzDduOfFApGSH5VoZSwuk=.7eb8a0e3-4ad6-4ffb-97fd-11a2120a3eaf@github.com> Message-ID: On Wed, 30 Jul 2025 07:06:52 GMT, Hannes Greule wrote: >> erifan has updated the pull request incrementally with one additional commit since the last revision: >> >> Set default warm up to 10000 for JTReg tests > > I think there are a few (follow-up?) improvements that can be made: > 1. Using KnownBits and checking against that rather than requiring a constant in `is_maskall_type`. This is probably a bit difficult to test for now. > 2. If the range of an input is known to be [-1, 0], we can use that as an input for a MaskAllNode. > Hi @SirYwell thanks for your suggestions. But I'm not quite understand what you meant, can you elaborate? @erifan for my first point, knowing that the lower n bits are all 0 or all 1 is enough, i.e., whether `(type->_bits._ones & mask) == mask` (equivalent to `maskAll(true)`) or `(type->_bits._zeros & mask) == mask` (equivalent to `maskAll(false)`). I think we can't test that part well right now because other nodes are missing KnownBits specific Value() implementations. For the second one, if `type->_lo == -1 && type->_hi == 0`, then we know that the node with this type can be used to represent true or false respectively. I hacked something together to clarify what I mean: https://github.com/SirYwell/jdk/commit/02e13a479f5e627cc997939865cd1816942d8309 Please let me know if there's still something unclear. (That said I'm completely fine with the PR as-is, especially as the KnownBits part is hard to test right now.) ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3138849095 From chagedorn at openjdk.org Thu Jul 31 07:35:57 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 31 Jul 2025 07:35:57 GMT Subject: RFR: 8325482: Test that distinct seeds produce distinct traces for compiler stress flags In-Reply-To: References: Message-ID: On Wed, 30 Jul 2025 13:38:37 GMT, Saranya Natarajan wrote: > The existing test (`compiler/debug/TestStress.java`) verifies that compiler stress options produce consistent traces when using the same seed. However, there is currently no test to ensure that different seeds result in different traces. > > ### Solution > Added a test case to assess the distinctness of traces generated from different seeds. This fix addresses the fragility concern highlighted in [JDK-8325482](https://bugs.openjdk.org/browse/JDK-8325482) by verifying that traces produced using N (in this case 10) distinct seeds are all not identical. > > ### Changes to `compiler/debug/TestStress.java` > While investigating this issue, I observed that in `compiler/debug/TestStress.java`, the stress options for macro expansion and macro elimination were not being triggered because there were fewer than 2 macro nodes. 
Note that the `shuffle_macro_nodes()` in` compile.cpp` is only meaningful when there are more than two macro nodes. The generated traces for macro expansion and macro elimination in `TestStress.java` were empty. I have proposed changes to address this problem. Thanks for adding such a test. A few comments, otherwise, it looks good! test/hotspot/jtreg/compiler/debug/TestStressDistinctSeed.java line 36: > 34: * @key stress randomness > 35: * @requires vm.debug == true & vm.compiler2.enabled > 36: * @requires vm.flagless can be merged: Suggestion: * @requires vm.debug == true & vm.compiler2.enabled & vm.flagless test/hotspot/jtreg/compiler/debug/TestStressDistinctSeed.java line 37: > 35: * @requires vm.debug == true & vm.compiler2.enabled > 36: * @requires vm.flagless > 37: * @summary Tests that stress compilations with the N different seed yield different Suggestion: * @summary Tests that stress compilations with the N different seeds yield different test/hotspot/jtreg/compiler/debug/TestStressDistinctSeed.java line 102: > 100: ccpTraceSet.add(ccpTrace(s)); > 101: macroExpansionTraceSet.add(macroExpansionTrace(s)); > 102: macroEliminationTraceSet.add(macroEliminationTrace(s)); A suggestion, do you also want to check here that two runs with the same seed produce the same result to show that different seeds really produce different results due to the seed and not just some indeterminism with the test itself? How long does your test need now and afterwards with a fastdebug build? Maybe we can also lower the number of seeds if it takes too long or only do the equality-test for a single seed. ------------- PR Review: https://git.openjdk.org/jdk/pull/26554#pullrequestreview-3074273830 PR Review Comment: https://git.openjdk.org/jdk/pull/26554#discussion_r2244573969 PR Review Comment: https://git.openjdk.org/jdk/pull/26554#discussion_r2244574237 PR Review Comment: https://git.openjdk.org/jdk/pull/26554#discussion_r2244588666 From duke at openjdk.org Thu Jul 31 08:09:57 2025 From: duke at openjdk.org (erifan) Date: Thu, 31 Jul 2025 08:09:57 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v8] In-Reply-To: References: <-otlKVhe_xfmpET_cwn5CdvzDduOfFApGSH5VoZSwuk=.7eb8a0e3-4ad6-4ffb-97fd-11a2120a3eaf@github.com> Message-ID: On Thu, 31 Jul 2025 07:27:43 GMT, Hannes Greule wrote: >> I think there are a few (follow-up?) improvements that can be made: >> 1. Using KnownBits and checking against that rather than requiring a constant in `is_maskall_type`. This is probably a bit difficult to test for now. >> 2. If the range of an input is known to be [-1, 0], we can use that as an input for a MaskAllNode. > >> Hi @SirYwell thanks for your suggestions. But I'm not quite understand what you meant, can you elaborate? > > @erifan for my first point, knowing that the lower n bits are all 0 or all 1 is enough, i.e., whether `(type->_bits._ones & mask) == mask` (equivalent to `maskAll(true)`) or `(type->_bits._zeros & mask) == mask` (equivalent to `maskAll(false)`). I think we can't test that part well right now because other nodes are missing KnownBits specific Value() implementations. > > For the second one, if `type->_lo == -1 && type->_hi == 0`, then we know that the node with this type can be used to represent true or false respectively. > > I hacked something together to clarify what I mean: https://github.com/SirYwell/jdk/commit/02e13a479f5e627cc997939865cd1816942d8309 > > Please let me know if there's still something unclear. 
> > (That said I'm completely fine with the PR as-is, especially as the KnownBits part is hard to test right now.) @SirYwell, thanks for your explanation, now I got your points. It's a good idea, with your suggestions, this optimization may apply to more cases. As you said, the KnownBits part is hard to test right now, so that's it for now. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3138958939 From jsjolen at openjdk.org Thu Jul 31 08:34:03 2025 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Thu, 31 Jul 2025 08:34:03 GMT Subject: RFR: 8352112: [ubsan] hotspot/share/code/relocInfo.cpp:130:37: runtime error: applying non-zero offset 18446744073709551614 to null pointer [v2] In-Reply-To: <_yvfEt6gxtA0gaoAyeuaOzN8u7Og_QhZyKtWCp9_q2c=.864cd245-1681-4d42-80c7-cd9a00e45cef@github.com> References: <7ja3_KpFi1NPc4EPFpMk3af7RgGtQYu0zGmrv05lCj0=.a7fb616e-8923-47f1-b869-3bb064d27f58@github.com> <_yvfEt6gxtA0gaoAyeuaOzN8u7Og_QhZyKtWCp9_q2c=.864cd245-1681-4d42-80c7-cd9a00e45cef@github.com> Message-ID: On Wed, 30 Jul 2025 15:51:42 GMT, Vladimir Kozlov wrote: >> We do not copy nmethods. At least until #23573 is integrated - and it will be under flag. >> >> `_mutable_data` field is initialized during final method installation into CodeCache - nothing modifies it for nmethods. >> >> I can add debug flag to CodeBlob to catch double free. But as I commented in [JDK-8361382](https://bugs.openjdk.org/browse/JDK-8361382) it is most likely the issue is a buffer overflow from preceding memory block which stomped over header. > > I will do experiment with flag and let you know. Thank you ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24102#discussion_r2244724403 From shade at openjdk.org Thu Jul 31 08:45:54 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 31 Jul 2025 08:45:54 GMT Subject: RFR: 8361211: C2: Final graph reshaping generates unencodeable klass constants In-Reply-To: <6sEYx73eZK7XQXOLg5u2qHUqXc2OoqqR_iAlc5C9QrU=.4474e9ad-0e3d-4fe0-9011-0dfc3f2e069a@github.com> References: <6sEYx73eZK7XQXOLg5u2qHUqXc2OoqqR_iAlc5C9QrU=.4474e9ad-0e3d-4fe0-9011-0dfc3f2e069a@github.com> Message-ID: On Wed, 30 Jul 2025 23:10:27 GMT, Vladimir Kozlov wrote: > Do you know why logic in `CmpPNode::Ideal()` did not work? That is what @TobiHartmann pointed before. I have not been able to track it down. The CTW failure I was chasing was highly intermittent and only reproducible in rare conditions. I suspect the same: the conditions where `CmpPNode::Ideal` run are limited and not guaranteed to fold out _all_ the unencodable constants. > Would be interesting to track it done because it may cause other issues. I agree this likely points to a more widespread problem. To be honest, I am pretty horrified that we emit the unencodeable `ConN`-s, and _then_ rely on various node idealization rules to knock them down. So now, if we hold `ConN`, we cannot be sure it would not break down the road! Speaking of nightmarish scenarios, I cannot see, for example, what prevents a particular arch-specific matching rule to assume that `ConN` is encodeable and start doing tricks based on that assumption. This PR only handles one limited case in final graph reshaping where I saw this definitely failing. But we also emit these constants as the matter of course during parsing, see the [JDK-8343218](https://bugs.openjdk.org/browse/JDK-8343218) comment for example stack trace. 
I think we must do a change that puts the abstract/interface encoding optimization behind the feature flag, and we should disable that flag by default, until we are 99.(9)% sure C2 is immune to these issues. A better place to discuss this would be [JDK-8343218](https://bugs.openjdk.org/browse/JDK-8343218). Meanwhile, I would like to plug the leak in final graph reshaping with this PR. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26559#issuecomment-3139062739 From thartmann at openjdk.org Thu Jul 31 09:43:30 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 31 Jul 2025 09:43:30 GMT Subject: RFR: 8364409: [BACKOUT] 8350988: Consolidate Identity of self-inverse operations In-Reply-To: References: Message-ID: On Thu, 31 Jul 2025 09:36:20 GMT, Manuel H?ssig wrote: > This reverts commit 66b5dba (review #23851). Unfortunately, it does not back out cleanly due to a `ReverseBytesNode` base class introduced in #24382, which was easily resolved by using the new base class. > > Testing: > - [ ] Github Actions > - [ ] tier1 - tier3 plus some internal testing Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26570#pullrequestreview-3074680676 From mhaessig at openjdk.org Thu Jul 31 09:43:30 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Thu, 31 Jul 2025 09:43:30 GMT Subject: RFR: 8364409: [BACKOUT] 8350988: Consolidate Identity of self-inverse operations Message-ID: This reverts commit 66b5dba (review #23851). Unfortunately, it does not back out cleanly due to a `ReverseBytesNode` base class introduced in #24382, which was easily resolved by using the new base class. Testing: - [ ] Github Actions - [ ] tier1 - tier3 plus some internal testing ------------- Commit messages: - Revert "8350988: Consolidate Identity of self-inverse operations" Changes: https://git.openjdk.org/jdk/pull/26570/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26570&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8364409 Stats: 245 lines in 4 files changed: 8 ins; 225 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/26570.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26570/head:pull/26570 PR: https://git.openjdk.org/jdk/pull/26570 From bmaillard at openjdk.org Thu Jul 31 09:48:00 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Thu, 31 Jul 2025 09:48:00 GMT Subject: RFR: 8364409: [BACKOUT] 8350988: Consolidate Identity of self-inverse operations In-Reply-To: References: Message-ID: On Thu, 31 Jul 2025 09:36:20 GMT, Manuel H?ssig wrote: > This reverts commit 66b5dba (review #23851). Unfortunately, it does not back out cleanly due to a `ReverseBytesNode` base class introduced in #24382, which was easily resolved by using the new base class. > > Testing: > - [ ] Github Actions > - [ ] tier1 - tier3 plus some internal testing Looks good to me as well! ------------- Marked as reviewed by bmaillard (Author). PR Review: https://git.openjdk.org/jdk/pull/26570#pullrequestreview-3074701126 From hgreule at openjdk.org Thu Jul 31 10:30:09 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Thu, 31 Jul 2025 10:30:09 GMT Subject: RFR: 8364409: [BACKOUT] Consolidate Identity of self-inverse operations In-Reply-To: References: Message-ID: On Thu, 31 Jul 2025 09:36:20 GMT, Manuel H?ssig wrote: > This reverts commit 66b5dba (review #23851). 
Unfortunately, it does not back out cleanly due to a `ReverseBytesNode` base class introduced in #24382, which was easily resolved by using the new base class. > > Testing: > - [ ] Github Actions > - [ ] tier1 - tier3 plus some internal testing Thanks! ------------- Marked as reviewed by hgreule (Author). PR Review: https://git.openjdk.org/jdk/pull/26570#pullrequestreview-3074837242 From shade at openjdk.org Thu Jul 31 11:13:56 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 31 Jul 2025 11:13:56 GMT Subject: RFR: 8360557: CTW: Inline cold methods to reach more code [v2] In-Reply-To: <5znMFGgSuss2iAJ3cUBnmIKrfniGHx5W6CpY3TpNO_8=.0148fb6b-206a-4b57-8886-db80d606b18f@github.com> References: <5znMFGgSuss2iAJ3cUBnmIKrfniGHx5W6CpY3TpNO_8=.0148fb6b-206a-4b57-8886-db80d606b18f@github.com> Message-ID: On Wed, 2 Jul 2025 08:27:24 GMT, Aleksey Shipilev wrote: >> We use CTW testing for making sure compilers behave well. But we compile the code that is not executed at all, and since our inlining heuristics often look back at profiles, we end up not actually inlining all that much! This means CTW testing likely misses lots of bugs that normal code is exposed to, especially e.g. in loop optimizations. >> >> There is an intrinsic tradeoff with accepting more inlined methods in CTW: the compilation time gets significantly worse. With just accepting the cold methods we have reasonable CTW times, eating into the improvements we have committed in mainline recently. And it still finds bugs. See the RFE for sample data. >> >> After this lands and CTW starts to compile cold methods, one can greatly expand the scope of the CTW testing by overriding the static inlining limits. Doing e.g. `TEST_VM_OPTS="-XX:MaxInlineSize=70 -XX:C1MaxInlineSize=70"` finds even more bugs. Unfortunately, the compilation times suffer so much that they are impractical to run in standard configurations; see data in the RFE. We will enable some of that testing in special testing pipelines. >> >> Pre-empting the question: "Well, why not use -Xcomp then, and make sure it inlines well?" The answer is in the RFE as well: Xcomp causes _a lot_ of stray compilations for the JDK and CTW infra itself. For small JARs in a large corpus this eats precious testing time that we would instead like to spend on deeper inlining in the actual JAR code. This also does not force us to look into how CTW works in Xcomp at all; I expect some surprises there. Feather-touching the inlining heuristic paths to just accept methods without looking at profiles looks better. >> >> Tobias had an idea to implement stress randomized inlining that would expand the scope of inlining. This improvement stacks well with it. This improvement provides the base case of inlining most reasonable methods, and then allows the stress infra to inline some more on top of that. >> >> Additional testing: >> - [x] GHA >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` >> - [x] Linux x86_64 server fastdebug, large CTW corpus (now failing in interesting ways) > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Update src/hotspot/share/compiler/compiler_globals.hpp > > Co-authored-by: Tobias Hartmann Still working out the bugs.
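For context, a CTW run with the expanded inlining limits described above would look something like this (a sketch: the exact make target is an assumption based on the `applications/ctw/modules` job listed under Additional testing, while the TEST_VM_OPTS flags are the ones quoted in the description):

make test TEST="applications/ctw/modules" TEST_VM_OPTS="-XX:MaxInlineSize=70 -XX:C1MaxInlineSize=70"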
------------- PR Comment: https://git.openjdk.org/jdk/pull/26068#issuecomment-3139510011 From mchevalier at openjdk.org Thu Jul 31 11:14:50 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 31 Jul 2025 11:14:50 GMT Subject: RFR: 8360561: PhaseIdealLoop::create_new_if_for_predicate hits "must be a uct if pattern" assert Message-ID: Did you know that ranges can be disjoint and yet not ordered?! Well, in modular arithmetic. Let's look at a simplistic example: int x; if (?) { x = -1; } else { x = 1; } if (x != 0) { return; } // Unreachable With signed ranges, before the second `if`, `x` is in `[-1, 1]`. That is enough to enter the second if, but not enough to prove you have to enter it: the code after the second `if` wrongly seems to be still reachable. Twaddle! With unsigned ranges, at this point `x` is in `[1, 2^32-1]`, and then it is clear that `x != 0`. This information is used to refine the value of `x` in the (missing) else-branch, and so, after the if. This is done with a simple lattice meet (Hotspot's join): in the else-branch, the possible values of `x` are the meet of what it was worth before and the interval in the guard, that is `[0, 0]`. Thanks to the unsigned range, this is known to be empty (that is bottom, or Hotspot's top). And with a little reduced product, the whole type of `x` is empty as well. Yet, this information is not used to kill control yet. Here is the center of the problem: we have a situation such as: [image 2: after-CastII] After node `110 CastII` is idealized, it is found to be Top, and then the uncommon trap at `129` is replaced by `238 Halt` by being value-dead. [image 1: before-CastII] Since the control is not killed, the node stays there, eventually making some predicate-related assert fail as a trap is expected under a `ParsePredicate`. And that's what this change proposes: when comparing integers with non-ordered ranges, let's see if the unsigned ranges overlap, by computing the meet. If the intersection is empty, then the values can't be equal, without being able to order them. This is new! Without unsigned information for signed integers, either they overlap, or we can order them. Adding modular arithmetic makes it possible to have non-overlapping ranges that are also not ordered. Let's also notice that 0 is special: it is important that the bounds are on each side of 0 (or 2^31, the other discontinuity). For instance, if `x` can be 1 or 5, both the signed and unsigned ranges will agree on `[1, 5]` and not be able to prove it's not, let's say, 3. What other ways would there be to treat this problem a bit more generally? The classic solution is not to use intervals all the time: allow a small set of values, up to a fixed cardinality (for instance 5 or 10), after which we switch to a range. This is quite easy and handles many cases: it is not that common that it is important for a variable to be equal to one of 10 distinct values, but not anything else in between. A modulo domain would also work along with intervals (with a reduced product), but only for two values, or specific cases. That is not very general. A donut domain can also be helpful, but it needs a smart heuristic. For 2 points, there are two donuts: in the previous example, `[1, 5]` and `[INT_MIN, INT_MAX] \ [2, 4]`, but only the second allows proving that 3 is not in the set. Having signed and unsigned ranges is somewhat like having both donuts in some cases, and having just one when they agree.
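To make the worked example concrete, a tiny self-contained check (this only illustrates the arithmetic above; it is not code from the patch):

#include <cstdint>
#include <cstdio>

int main() {
  // x is either -1 or 1. Signed bounds: [-1, 1], which still contains 0, so
  // "x != 0" cannot be proven from the signed range alone. The unsigned view of
  // the same two values is {1, 0xFFFFFFFF}, i.e. the range [1, 2^32-1], which
  // excludes 0 -- so the unsigned range does prove x != 0.
  int32_t  lo  = -1, hi = 1;            // signed bounds of x
  uint32_t ulo = 1,  uhi = UINT32_MAX;  // unsigned bounds of the same value set
  bool zero_in_signed   = (lo <= 0 && 0 <= hi);
  bool zero_in_unsigned = (ulo <= 0u && 0u <= uhi);
  std::printf("0 possible per signed range: %d, per unsigned range: %d\n",
              zero_in_signed, zero_in_unsigned);  // prints: 1, 0
  return 0;
}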
There is another underlying question: why do we need to have code both for meet (HS's join, to refine the value of `x`), and for guarding (to know whether a branch is taken). A typical abstract interpreter would actually do that with just one step, using a `Guard` function that refines a abstract state given a condition to satisfy. The resulting state is whatever enters the branch, already refined. If the branch is impossible, then the state has an empty concretization. This happens typically when one variable (in a non-relational domain) has an empty value (bottom), then the whole abstract state is empty. It can then be optimized into skipping the whole branch. In Hotspot, there are some major differences: - the evaluation of the condition is not monolithically done in the abstract domain, but instead we want abstract value of each node - at the end, we request the value of a comparison, without knowing which operator we are going to use, so the abstract value needs to specify all the operators that would allow entering the branch: instead of having a refined abstract state, we just know for which comparison operator, the abstract state is not empty. We could imagine another way of working, returning the refined value of each variable in a condition (using a side table or spamming Cast nodes), for a given `BoolNode`, without holding the abstract domain by the hand too much. But of course, asking first "for which operators is the comparison non-empty", and then "give me the refined value of this variable for this given operator" leads to duplication of work. Thanks, Marc ------------- Commit messages: - Tests - Trying for CmpU CmpL CmpUL - Fix EOF - + tests - cc2logical: CC_NE with != -> yes! - != if empty overlap Changes: https://git.openjdk.org/jdk/pull/26504/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26504&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8360561 Stats: 227 lines in 4 files changed: 227 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26504.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26504/head:pull/26504 PR: https://git.openjdk.org/jdk/pull/26504 From qamai at openjdk.org Thu Jul 31 11:14:50 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 31 Jul 2025 11:14:50 GMT Subject: RFR: 8360561: PhaseIdealLoop::create_new_if_for_predicate hits "must be a uct if pattern" assert In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 12:31:49 GMT, Marc Chevalier wrote: > Did you know that ranges can be disjoints and yet not ordered?! Well, in modular arithmetic. > > Let's look at a simplistic example: > > int x; > if (?) { > x = -1; > } else { > x = 1; > } > > if (x != 0) { > return; > } > // Unreachable > > > With signed ranges, before the second `if`, `x` is in `[-1, 1]`. Which is enough to enter to second if, but not enough to prove you have to enter it: it wrongly seems that after the second `if` is still reachable. Twaddle! > > With unsigned ranges, at this point `x` is in `[1, 2^32-1]`, and then, it is clear that `x != 0`. This information is used to refine the value of `x` in the (missing) else-branch, and so, after the if. This is done with simple lattice meet (Hotspot's join): in the else-branch, the possible values of `x` are the meet of what is was worth before, and the interval in the guard, that is `[0, 0]`. Thanks to the unsigned range, this is known to be empty (that is bottom, or Hotspot's top). And with a little reduced product, the whole type of `x` is empty as well. 
Yet, this information is not used to kill control yet. > > This is here the center of the problem: we have a situation such as: > 2 after-CastII > After node `110 CastII` is idealized, it is found to be Top, and then the uncommon trap at `129` is replaced by `238 Halt` by being value-dead. > 1 before-CastII > Since the control is not killed, the node stay there, eventually making some predicate-related assert fail as a trap is expected under a `ParsePredicate`. > > And that's what this change proposes: when comparing integers with non-ordered ranges, let's see if the unsigned ranges overlap, by computing the meet. If the intersection is empty, then the values can't be equals, without being able to order them. This is new! Without unsigned information for signed integer, either they overlap, or we can order them. Adding modular arithmetic allows to have non-overlapping ranges that are also not ordered. > > Let's also notice that 0 is special: it is important bounds are on each side of 0 (or 2^31, the other discontinuity). For instance if `x` can be 1 or 5, for instance, both the signed and unsigned range will agree on `[1, 5]` and not be able to prove it's, let's say, 3. > > What would there be other ways to treat this problem a bit ... Should these be done for `CmpL`, `CmpU`, `CmpUL` as well? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26504#issuecomment-3128074724 From mchevalier at openjdk.org Thu Jul 31 11:14:50 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 31 Jul 2025 11:14:50 GMT Subject: RFR: 8360561: PhaseIdealLoop::create_new_if_for_predicate hits "must be a uct if pattern" assert In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 16:36:12 GMT, Quan Anh Mai wrote: >> Did you know that ranges can be disjoints and yet not ordered?! Well, in modular arithmetic. >> >> Let's look at a simplistic example: >> >> int x; >> if (?) { >> x = -1; >> } else { >> x = 1; >> } >> >> if (x != 0) { >> return; >> } >> // Unreachable >> >> >> With signed ranges, before the second `if`, `x` is in `[-1, 1]`. Which is enough to enter to second if, but not enough to prove you have to enter it: it wrongly seems that after the second `if` is still reachable. Twaddle! >> >> With unsigned ranges, at this point `x` is in `[1, 2^32-1]`, and then, it is clear that `x != 0`. This information is used to refine the value of `x` in the (missing) else-branch, and so, after the if. This is done with simple lattice meet (Hotspot's join): in the else-branch, the possible values of `x` are the meet of what is was worth before, and the interval in the guard, that is `[0, 0]`. Thanks to the unsigned range, this is known to be empty (that is bottom, or Hotspot's top). And with a little reduced product, the whole type of `x` is empty as well. Yet, this information is not used to kill control yet. >> >> This is here the center of the problem: we have a situation such as: >> 2 after-CastII >> After node `110 CastII` is idealized, it is found to be Top, and then the uncommon trap at `129` is replaced by `238 Halt` by being value-dead. >> 1 before-CastII >> Since the control is not killed, the node stay there, eventually making some predicate-related assert fail as a trap is expected under a `ParsePredicate`. >> >> And that's what this change proposes: when comparing integers with non-ordered ranges, let's see if the unsigned ranges overlap, by computing the meet. If the intersection is empty, then the values can't be equals, without being able to order them. This is new! 
Without unsigned information for signed integer, either they overlap, or we can order them. Adding modular arithmetic allows to have non-overlapping ranges that are also not ordered. >> >> Let's also notice that 0 is special: it is important bounds are on each side of 0 (or 2^31, the other discontinuity). For instance if `x` can be 1 or 5, for instance, both the signed and unsigned range will agree on `[1, 5]` and not be able to prove it's, let's say, 3. > ... > > Should these be done for `CmpL`, `CmpU`, `CmpUL` as well? @merykitty yes, probably, I was indeed looking into which flavors of `Cmp` would need something like that, and how hard it'd be to exhibit the problem. It's still a draft, it wasn't quite done ? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26504#issuecomment-3130996399 From mchevalier at openjdk.org Thu Jul 31 11:14:51 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 31 Jul 2025 11:14:51 GMT Subject: RFR: 8360561: PhaseIdealLoop::create_new_if_for_predicate hits "must be a uct if pattern" assert In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 12:31:49 GMT, Marc Chevalier wrote: > Did you know that ranges can be disjoints and yet not ordered?! Well, in modular arithmetic. > > Let's look at a simplistic example: > > int x; > if (?) { > x = -1; > } else { > x = 1; > } > > if (x != 0) { > return; > } > // Unreachable > > > With signed ranges, before the second `if`, `x` is in `[-1, 1]`. Which is enough to enter to second if, but not enough to prove you have to enter it: it wrongly seems that after the second `if` is still reachable. Twaddle! > > With unsigned ranges, at this point `x` is in `[1, 2^32-1]`, and then, it is clear that `x != 0`. This information is used to refine the value of `x` in the (missing) else-branch, and so, after the if. This is done with simple lattice meet (Hotspot's join): in the else-branch, the possible values of `x` are the meet of what is was worth before, and the interval in the guard, that is `[0, 0]`. Thanks to the unsigned range, this is known to be empty (that is bottom, or Hotspot's top). And with a little reduced product, the whole type of `x` is empty as well. Yet, this information is not used to kill control yet. > > This is here the center of the problem: we have a situation such as: > 2 after-CastII > After node `110 CastII` is idealized, it is found to be Top, and then the uncommon trap at `129` is replaced by `238 Halt` by being value-dead. > 1 before-CastII > Since the control is not killed, the node stay there, eventually making some predicate-related assert fail as a trap is expected under a `ParsePredicate`. > > And that's what this change proposes: when comparing integers with non-ordered ranges, let's see if the unsigned ranges overlap, by computing the meet. If the intersection is empty, then the values can't be equals, without being able to order them. This is new! Without unsigned information for signed integer, either they overlap, or we can order them. Adding modular arithmetic allows to have non-overlapping ranges that are also not ordered. > > Let's also notice that 0 is special: it is important bounds are on each side of 0 (or 2^31, the other discontinuity). For instance if `x` can be 1 or 5, for instance, both the signed and unsigned range will agree on `[1, 5]` and not be able to prove it's, let's say, 3. > > What would there be other ways to treat this problem a bit ... 
After looking deeper, it seems that we don't have to do the same change: this crash doesn't seem possible in other situations because of the refinement (leading to Top integers) happening only in specific cases, and especially not for longs. Yet, we surely can do the change, seems correct to me. Also, even if it doesn't solve a crash, it can be beneficial (see the test with longs): it already worked, but now, we can get rid of a path, and the graph is simpler. Overall, we don't have to, but we can and should, so there it is! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26504#issuecomment-3139500654 From syan at openjdk.org Thu Jul 31 11:14:52 2025 From: syan at openjdk.org (SendaoYan) Date: Thu, 31 Jul 2025 11:14:52 GMT Subject: RFR: 8360561: PhaseIdealLoop::create_new_if_for_predicate hits "must be a uct if pattern" assert In-Reply-To: References: Message-ID: <1zdDyoDt1yJzlNEY8VziwbjNt7PWsQvjp513qSX4Gbs=.a3115e6e-af16-4b82-a03f-b0e766fecabc@github.com> On Mon, 28 Jul 2025 12:31:49 GMT, Marc Chevalier wrote: > Did you know that ranges can be disjoints and yet not ordered?! Well, in modular arithmetic. > > Let's look at a simplistic example: > > int x; > if (?) { > x = -1; > } else { > x = 1; > } > > if (x != 0) { > return; > } > // Unreachable > > > With signed ranges, before the second `if`, `x` is in `[-1, 1]`. Which is enough to enter to second if, but not enough to prove you have to enter it: it wrongly seems that after the second `if` is still reachable. Twaddle! > > With unsigned ranges, at this point `x` is in `[1, 2^32-1]`, and then, it is clear that `x != 0`. This information is used to refine the value of `x` in the (missing) else-branch, and so, after the if. This is done with simple lattice meet (Hotspot's join): in the else-branch, the possible values of `x` are the meet of what is was worth before, and the interval in the guard, that is `[0, 0]`. Thanks to the unsigned range, this is known to be empty (that is bottom, or Hotspot's top). And with a little reduced product, the whole type of `x` is empty as well. Yet, this information is not used to kill control yet. > > This is here the center of the problem: we have a situation such as: > 2 after-CastII > After node `110 CastII` is idealized, it is found to be Top, and then the uncommon trap at `129` is replaced by `238 Halt` by being value-dead. > 1 before-CastII > Since the control is not killed, the node stay there, eventually making some predicate-related assert fail as a trap is expected under a `ParsePredicate`. > > And that's what this change proposes: when comparing integers with non-ordered ranges, let's see if the unsigned ranges overlap, by computing the meet. If the intersection is empty, then the values can't be equals, without being able to order them. This is new! Without unsigned information for signed integer, either they overlap, or we can order them. Adding modular arithmetic allows to have non-overlapping ranges that are also not ordered. > > Let's also notice that 0 is special: it is important bounds are on each side of 0 (or 2^31, the other discontinuity). For instance if `x` can be 1 or 5, for instance, both the signed and unsigned range will agree on `[1, 5]` and not be able to prove it's, let's say, 3. > > What would there be other ways to treat this problem a bit ... test/hotspot/jtreg/compiler/igvn/CmpDisjointButNonOrderedRanges2.java line 30: > 28: * Comparing such values in such range with != should always be true. 
> 29: * @modules java.base/jdk.internal.util > 30: * @run main/othervm -Xbatch Since these two new tests use specific JVM options, do these tests needed '@requires vm.flagless' directive. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26504#discussion_r2236771174 From mchevalier at openjdk.org Thu Jul 31 11:14:53 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 31 Jul 2025 11:14:53 GMT Subject: RFR: 8360561: PhaseIdealLoop::create_new_if_for_predicate hits "must be a uct if pattern" assert In-Reply-To: References: <1zdDyoDt1yJzlNEY8VziwbjNt7PWsQvjp513qSX4Gbs=.a3115e6e-af16-4b82-a03f-b0e766fecabc@github.com> Message-ID: On Tue, 29 Jul 2025 06:55:42 GMT, Marc Chevalier wrote: >> test/hotspot/jtreg/compiler/igvn/CmpDisjointButNonOrderedRanges2.java line 30: >> >>> 28: * Comparing such values in such range with != should always be true. >>> 29: * @modules java.base/jdk.internal.util >>> 30: * @run main/othervm -Xbatch >> >> Since these two new tests use specific JVM options, do these tests needed '@requires vm.flagless' directive. > > I'm not very sure how that works, so I'm not sure... But I don't think so, also from looking at other examples with many flags, of a similar kind, without this directive. I think I could use more explanations of how that works in details. After gathering more info, I think we don't need flagless, and so we should avoid it. The reason is that despite needing some flags to reproduce the crash (at least, reliably), the test should not crash at all, even with more flags. It might become more or less interesting, but it should still not crash. In this meaning, the test is still correct even with more flags, so flagless is not needed. And who knows, maybe sprinkling additional flags on this test will eventually uncover another issue! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26504#discussion_r2241853532 From mchevalier at openjdk.org Thu Jul 31 11:14:52 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 31 Jul 2025 11:14:52 GMT Subject: RFR: 8360561: PhaseIdealLoop::create_new_if_for_predicate hits "must be a uct if pattern" assert In-Reply-To: <1zdDyoDt1yJzlNEY8VziwbjNt7PWsQvjp513qSX4Gbs=.a3115e6e-af16-4b82-a03f-b0e766fecabc@github.com> References: <1zdDyoDt1yJzlNEY8VziwbjNt7PWsQvjp513qSX4Gbs=.a3115e6e-af16-4b82-a03f-b0e766fecabc@github.com> Message-ID: On Mon, 28 Jul 2025 14:40:03 GMT, SendaoYan wrote: >> Did you know that ranges can be disjoints and yet not ordered?! Well, in modular arithmetic. >> >> Let's look at a simplistic example: >> >> int x; >> if (?) { >> x = -1; >> } else { >> x = 1; >> } >> >> if (x != 0) { >> return; >> } >> // Unreachable >> >> >> With signed ranges, before the second `if`, `x` is in `[-1, 1]`. Which is enough to enter to second if, but not enough to prove you have to enter it: it wrongly seems that after the second `if` is still reachable. Twaddle! >> >> With unsigned ranges, at this point `x` is in `[1, 2^32-1]`, and then, it is clear that `x != 0`. This information is used to refine the value of `x` in the (missing) else-branch, and so, after the if. This is done with simple lattice meet (Hotspot's join): in the else-branch, the possible values of `x` are the meet of what is was worth before, and the interval in the guard, that is `[0, 0]`. Thanks to the unsigned range, this is known to be empty (that is bottom, or Hotspot's top). And with a little reduced product, the whole type of `x` is empty as well. Yet, this information is not used to kill control yet. 
>> >> This is here the center of the problem: we have a situation such as: >> 2 after-CastII >> After node `110 CastII` is idealized, it is found to be Top, and then the uncommon trap at `129` is replaced by `238 Halt` by being value-dead. >> 1 before-CastII >> Since the control is not killed, the node stay there, eventually making some predicate-related assert fail as a trap is expected under a `ParsePredicate`. >> >> And that's what this change proposes: when comparing integers with non-ordered ranges, let's see if the unsigned ranges overlap, by computing the meet. If the intersection is empty, then the values can't be equals, without being able to order them. This is new! Without unsigned information for signed integer, either they overlap, or we can order them. Adding modular arithmetic allows to have non-overlapping ranges that are also not ordered. >> >> Let's also notice that 0 is special: it is important bounds are on each side of 0 (or 2^31, the other discontinuity). For instance if `x` can be 1 or 5, for instance, both the signed and unsigned range will agree on `[1, 5]` and not be able to prove it's, let's say, 3. > ... > > test/hotspot/jtreg/compiler/igvn/CmpDisjointButNonOrderedRanges2.java line 30: > >> 28: * Comparing such values in such range with != should always be true. >> 29: * @modules java.base/jdk.internal.util >> 30: * @run main/othervm -Xbatch > > Since these two new tests use specific JVM options, do these tests needed '@requires vm.flagless' directive. I'm not very sure how that works, so I'm not sure... But I don't think so, also from looking at other examples with many flags, of a similar kind, without this directive. I think I could use more explanations of how that works in details. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26504#discussion_r2238724866 From mhaessig at openjdk.org Thu Jul 31 12:15:02 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Thu, 31 Jul 2025 12:15:02 GMT Subject: RFR: 8364409: [BACKOUT] Consolidate Identity of self-inverse operations In-Reply-To: References: Message-ID: On Thu, 31 Jul 2025 09:36:20 GMT, Manuel H?ssig wrote: > This reverts commit 66b5dba (review #23851). Unfortunately, it does not back out cleanly due to a `ReverseBytesNode` base class introduced in #24382, which was easily resolved by using the new base class. > > Testing: > - [ ] Github Actions > - [x] tier1 - tier3 plus some internal testing Thank you all for your reviews! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26570#issuecomment-3139717341 From mhaessig at openjdk.org Thu Jul 31 12:15:03 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Thu, 31 Jul 2025 12:15:03 GMT Subject: Integrated: 8364409: [BACKOUT] Consolidate Identity of self-inverse operations In-Reply-To: References: Message-ID: On Thu, 31 Jul 2025 09:36:20 GMT, Manuel H?ssig wrote: > This reverts commit 66b5dba (review #23851). Unfortunately, it does not back out cleanly due to a `ReverseBytesNode` base class introduced in #24382, which was easily resolved by using the new base class. > > Testing: > - [ ] Github Actions > - [x] tier1 - tier3 plus some internal testing This pull request has now been integrated. 
Changeset: ddb64836 Author: Manuel H?ssig URL: https://git.openjdk.org/jdk/commit/ddb64836e5bafededb705329137e353f8c74dd5d Stats: 245 lines in 4 files changed: 8 ins; 225 del; 12 mod 8364409: [BACKOUT] Consolidate Identity of self-inverse operations Reviewed-by: thartmann, bmaillard, hgreule ------------- PR: https://git.openjdk.org/jdk/pull/26570 From epeter at openjdk.org Thu Jul 31 12:52:59 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 31 Jul 2025 12:52:59 GMT Subject: RFR: 8329077: C2 SuperWord: Add MoveD2L, MoveL2D, MoveF2I, MoveI2F In-Reply-To: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com> References: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com> Message-ID: <8Yc5bUUeWRBGV2XrdVi9uPVfXlMGaZ_fj-H6IjdWO_4=.c3bd36cf-86be-4926-a14d-9046d6bc862d@github.com> On Thu, 24 Jul 2025 10:29:15 GMT, Galder Zamarre?o wrote: > I've added support to vectorize `MoveD2L`, `MoveL2D`, `MoveF2I` and `MoveI2F` nodes. The implementation follows a similar pattern to what is done with conversion (`Conv*`) nodes. The tests in `TestCompatibleUseDefTypeSize` have been updated with the new expectations. > > Also added a JMH benchmark which measures throughput (the higher the number the better) for methods that exercise these nodes. On darwin/aarch64 it shows: > > > Benchmark (seed) (size) Mode Cnt Base Patch Units Diff > VectorBitConversion.doubleToLongBits 0 2048 thrpt 8 1168.782 1157.717 ops/ms -1% > VectorBitConversion.doubleToRawLongBits 0 2048 thrpt 8 3999.387 7353.936 ops/ms +83% > VectorBitConversion.floatToIntBits 0 2048 thrpt 8 1200.338 1188.206 ops/ms -1% > VectorBitConversion.floatToRawIntBits 0 2048 thrpt 8 4058.248 14792.474 ops/ms +264% > VectorBitConversion.intBitsToFloat 0 2048 thrpt 8 3050.313 14984.246 ops/ms +391% > VectorBitConversion.longBitsToDouble 0 2048 thrpt 8 3022.691 7379.360 ops/ms +144% > > > The improvements observed are a result of vectorization. The lack of vectorization in `doubleToLongBits` and `floatToIntBits` demonstrates that these changes do not affect their performance. These methods do not vectorize because of flow control. > > I've run the tier1-3 tests on linux/aarch64 and didn't observe any regressions. Thanks for working on this, it looks really good :) I'll need to do some testing later though. test/micro/org/openjdk/bench/java/lang/VectorBitConversion.java line 1: > 1: package org.openjdk.bench.java.lang; I think this benchmark belongs with the other vectorization benchmarks under `test/micro/org/openjdk/bench/vm/compiler/Vector*` ------------- Changes requested by epeter (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26457#pullrequestreview-3075276901 PR Review Comment: https://git.openjdk.org/jdk/pull/26457#discussion_r2245295082 From epeter at openjdk.org Thu Jul 31 12:53:00 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 31 Jul 2025 12:53:00 GMT Subject: RFR: 8329077: C2 SuperWord: Add MoveD2L, MoveL2D, MoveF2I, MoveI2F In-Reply-To: <8Yc5bUUeWRBGV2XrdVi9uPVfXlMGaZ_fj-H6IjdWO_4=.c3bd36cf-86be-4926-a14d-9046d6bc862d@github.com> References: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com> <8Yc5bUUeWRBGV2XrdVi9uPVfXlMGaZ_fj-H6IjdWO_4=.c3bd36cf-86be-4926-a14d-9046d6bc862d@github.com> Message-ID: <0by7e3QZ36BPsVlRJMxdgHKX-IbsbumQrFg-iBaSZBY=.ddfa469e-8718-4c62-aaaf-766c5d7f6473@github.com> On Thu, 31 Jul 2025 12:49:16 GMT, Emanuel Peter wrote: >> I've added support to vectorize `MoveD2L`, `MoveL2D`, `MoveF2I` and `MoveI2F` nodes. The implementation follows a similar pattern to what is done with conversion (`Conv*`) nodes. The tests in `TestCompatibleUseDefTypeSize` have been updated with the new expectations. >> >> Also added a JMH benchmark which measures throughput (the higher the number the better) for methods that exercise these nodes. On darwin/aarch64 it shows: >> >> >> Benchmark (seed) (size) Mode Cnt Base Patch Units Diff >> VectorBitConversion.doubleToLongBits 0 2048 thrpt 8 1168.782 1157.717 ops/ms -1% >> VectorBitConversion.doubleToRawLongBits 0 2048 thrpt 8 3999.387 7353.936 ops/ms +83% >> VectorBitConversion.floatToIntBits 0 2048 thrpt 8 1200.338 1188.206 ops/ms -1% >> VectorBitConversion.floatToRawIntBits 0 2048 thrpt 8 4058.248 14792.474 ops/ms +264% >> VectorBitConversion.intBitsToFloat 0 2048 thrpt 8 3050.313 14984.246 ops/ms +391% >> VectorBitConversion.longBitsToDouble 0 2048 thrpt 8 3022.691 7379.360 ops/ms +144% >> >> >> The improvements observed are a result of vectorization. The lack of vectorization in `doubleToLongBits` and `floatToIntBits` demonstrates that these changes do not affect their performance. These methods do not vectorize because of flow control. >> >> I've run the tier1-3 tests on linux/aarch64 and didn't observe any regressions. > > test/micro/org/openjdk/bench/java/lang/VectorBitConversion.java line 1: > >> 1: package org.openjdk.bench.java.lang; > > I think this benchmark belongs with the other vectorization benchmarks under > `test/micro/org/openjdk/bench/vm/compiler/Vector*` It's not really a language feature, more for vm/compiler ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26457#discussion_r2245296534 From bmaillard at openjdk.org Thu Jul 31 13:10:57 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Thu, 31 Jul 2025 13:10:57 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v2] In-Reply-To: <1A8oR7hEgev2U_ys1H_AVJS5kjw6LWoPgrVPhJXSFqI=.34cbd04b-bf88-441f-9c3d-97f9aee7f3c3@github.com> References: <1A8oR7hEgev2U_ys1H_AVJS5kjw6LWoPgrVPhJXSFqI=.34cbd04b-bf88-441f-9c3d-97f9aee7f3c3@github.com> Message-ID: <7KtjKex3ik1mzDGUP0J7vI0bdzWz-OaqprBbXlbhbE0=.7299f51f-4316-4371-bca2-846ae5bc6671@github.com> On Wed, 30 Jul 2025 08:27:37 GMT, Marc Chevalier wrote: >> Some crashes are consequences of earlier misshaped ideal graphs, which could be detected earlier, closer to the source, before the possibly many transformations that lead to the crash. >> >> Let's verify that the ideal graph is well-shaped earlier then! I propose here such a feature. 
This runs after IGVN, because at this point, the graph, should be cleaned up for any weirdness happening earlier or during IGVN. >> >> This feature is enabled with the develop flag `VerifyIdealStructuralInvariants`. Open to renaming. No problem with me! This feature is only available in debug builds, and most of the code is even not compiled in product, since it uses some debug-only functions, such as `Node::dump` or `Node::Name`. >> >> For now, only local checks are implemented: they are checks that only look at a node and its neighborhood, wherever it happens in the graph. Typically: under a `If` node, we have a `IfTrue` and a `IfFalse`. To ease development, each check is implemented in its own class, independently of the others. Nevertheless, one needs to do always the same kind of things: checking there is an output of such type, checking there is N inputs, that the k-th input has such type... To ease writing such checks, in a readable way, and in a less error-prone way than pile of copy-pasted code that manually traverse the graph, I propose a set of compositional helpers to write patterns that can be matched against the ideal graph. Since these patterns are... patterns, so not related to a specific graph, they can be allocated once and forever. When used, one provides the node (called center) around which one want to check if the pattern holds. >> >> On top of making the description of pattern easier, these helpers allows nice printing in case of error, by showing the path from the center to the violating node. For instance (made up for the purpose of showing the formatting), a violation with a path climbing only inputs: >> >> 1 failure for node >> 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 >> At node >> 209 CountedLoopEnd === 182 208 [[ 210 197 ]] [lt] P=0,948966, C=23799,000000 !orig=[196] !jvms: StringLatin1::equals @ bci:12 (line 100) >> From path: >> [center] 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 >> <-(0)- 215 SafePoint === 210 1 7 1 1 216 37 54 185 [[ 211 ]] SafePoint !orig=186 !jvms: StringLatin1::equals @ bci:29 (line 100) >> <-(0)- 210 IfFalse === 209 [[ 21... > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Rename flag as suggested Great work, and great explanation as well! The invariants that are already implemented seem quite useful already, and it seems there is a lot of potential. Having recently worked on a few missed optimizations related to `PhaseIterGVN::add_users_of_use_to_worklist`, I agree that it would be interesting to use such patterns for automatic notifications. The way I see it, we would need to somehow "reverse" the patterns, as they would be expressed from the point of view of the node on which the optimizations is applied, and would require notification when dependencies changes. Probably quite non-trivial, but interesting nonetheless. I only have a few basic remarks/questions. 
src/hotspot/share/opto/graphInvariants.cpp line 181: > 179: AtInput(uint which_input, const Pattern* pattern) : _which_input(which_input), _pattern(pattern) {} > 180: bool check(const Node* center, Node_List& steps, GrowableArray& path, stringStream& ss) const override { > 181: assert(_which_input < center->req(), "First check the input number"); This really a detail, but I would use something more explicit: Suggestion: assert(_which_input < center->req(), "Input number is out of range"); src/hotspot/share/opto/graphInvariants.cpp line 197: > 195: }; > 196: > 197: struct HasType : Pattern { Could we make it slightly more general and accept any predicate on the type? From a previous PR that I worked on I remember that for example for `ModINode` if it has no control input then its divisor input should never be `0`. Maybe this is the kind of properties we could check in the future. This is just a random idea, feel free to ignore. src/hotspot/share/opto/graphInvariants.hpp line 32: > 30: > 31: // An invariant that needs only a local view of the graph, around a given node. > 32: class LocalGraphInvariant : public ResourceObj { Can't we put the whole definition behind a `#ifndef PRODUCT` check? It seems there are other instances where it is done with classes, such as `VTrace`. Or is there a reason not to? ------------- PR Review: https://git.openjdk.org/jdk/pull/26362#pullrequestreview-3074918758 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2245262469 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2245282832 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2245044927 From mchevalier at openjdk.org Thu Jul 31 13:30:55 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 31 Jul 2025 13:30:55 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v2] In-Reply-To: <7KtjKex3ik1mzDGUP0J7vI0bdzWz-OaqprBbXlbhbE0=.7299f51f-4316-4371-bca2-846ae5bc6671@github.com> References: <1A8oR7hEgev2U_ys1H_AVJS5kjw6LWoPgrVPhJXSFqI=.34cbd04b-bf88-441f-9c3d-97f9aee7f3c3@github.com> <7KtjKex3ik1mzDGUP0J7vI0bdzWz-OaqprBbXlbhbE0=.7299f51f-4316-4371-bca2-846ae5bc6671@github.com> Message-ID: On Thu, 31 Jul 2025 12:44:02 GMT, Beno?t Maillard wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> Rename flag as suggested > > src/hotspot/share/opto/graphInvariants.cpp line 197: > >> 195: }; >> 196: >> 197: struct HasType : Pattern { > > Could we make it slightly more general and accept any predicate on the type? From a previous PR that I worked on I remember that for example for `ModINode` if it has no control input then its divisor input should never be `0`. Maybe this is the kind of properties we could check in the future. This is just a random idea, feel free to ignore. I think there is a misunderstanding here. I'm talking about node type, as in which C++ class is it, not type as abstract values for nodes. I could rename this struct then. Maybe HasNodeType? `NodeClass` one could see `NodeClass(&Node::is_Region)` that reads almost as "node class is Region". Open to ideas... Also, in theory, it accepts any method of `Node` of type `bool()`. This could be used for something else. The idea was to make easy to say "I want a Node of type `IfNode` here". It's not that great to do with Opcode because of derived classes. I also considered something that would take any `Node -> bool` function, but that made the simple case harder. 
Instead of `HasType(&Node::is_If)`, I would have had to write something like `HasType([](const Node& n) { return n.is_If(); })`. Functional programming is possible in C++, but not quite syntactically elegant, and I think readability here is important. If such a need arises, I suggest adding a `UnaryPredicate` (or `NodePredicate` etc.) to do that. If the predicates are complicated enough, the few extra symbols needed to write a lambda don't matter so much. As for your case, yes, we can add that in the future. It could be done with the UnaryPredicate I describe above, or with a more specific pattern that would work on types, and take a method `bool (Type::*)()` or a function `bool(const Type&)`; the pattern would take care of finding the type and submitting it to the predicate. Not that it's a lot of work, but it communicates the intention more clearly, in my opinion. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2245405917 From mchevalier at openjdk.org Thu Jul 31 13:35:58 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 31 Jul 2025 13:35:58 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v2] In-Reply-To: <7KtjKex3ik1mzDGUP0J7vI0bdzWz-OaqprBbXlbhbE0=.7299f51f-4316-4371-bca2-846ae5bc6671@github.com> References: <1A8oR7hEgev2U_ys1H_AVJS5kjw6LWoPgrVPhJXSFqI=.34cbd04b-bf88-441f-9c3d-97f9aee7f3c3@github.com> <7KtjKex3ik1mzDGUP0J7vI0bdzWz-OaqprBbXlbhbE0=.7299f51f-4316-4371-bca2-846ae5bc6671@github.com> Message-ID: <1ywMAXL2kqH3OohOP8GFEnqoUei-X2DNoYi4E2Z9m1E=.015d1d30-f63d-41b3-80d2-248bcd76890c@github.com>
>> >> ### Background >> Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register. >> >> ### Implementation >> >> #### Challenges >> Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints. >> >> For a 512-bit SVE machine, loading a `byte` vector with different vector species require different approaches: >> - SPECIES_64: Single operation with mask (8 elements, 256-bit) >> - SPECIES_128: Single operation, full register (16 elements, 512-bit) >> - SPECIES_256: Two operations + merge (32 elements, 1024-bit) >> - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit) >> >> Use `ByteVector.SPECIES_512` as an example: >> - It contains 64 elements. So the index vector size should be `64 * 32` bits, which is 4 times of the SVE vector register size. >> - It requires 4 times of vector gather-loads to finish the whole operation. >> >> >> byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...] >> int[] idx = [0, 1, 2, 3, ..., 63, ...] >> >> 4 gather-load: >> idx_v1 = [15 14 13 ... 1 0] gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa] >> idx_v2 = [31 30 29 ... 17 16] gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb] >> idx_v3 = [47 46 45 ... 33 32] gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc] >> idx_v4 = [63 62 61 ... 49 48] gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd] >> merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa] >> >> >> #### Solution >> The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end. >> >> Here is the main changes: >> - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher. >> - Added `VectorSliceNode` for result mer... > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Refine IR pattern and clean backend rules > I've submitted a test on a 256-bit sve machine. I'll get back to you once it?s finished. The new commit passed tier1 - tier3 on 256-bit `sve` machine without new failures. Thanks! src/hotspot/cpu/arm/matcher_arm.hpp line 160: > 158: static const bool supports_encode_ascii_array = false; > 159: > 160: // Return true if vector gather-load/scatter-store needs vector index as input. If the function returns `false`, does it indicate one of the following cases? - Vector gather-load or scatter-store does not accept a vector index for the current use case on this platform. - The current platform does not support vector gather-load or scatter-store at all. 
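As a side note for readers of the thread, the splitting described in the PR summary can be modeled in plain Java (purely illustrative; the class name is made up, and the real implementation emits SVE gather-load instructions plus a vector merge rather than scalar loops):

class GatherSplitSketch {
    // 64 byte lanes, but every index is a 32-bit int, so on a 512-bit SVE
    // machine only 16 indices fit into one gather: four partial gathers, merged.
    static byte[] gather64(byte[] arr, int[] idx) {
        final int lanes = 64, perGather = 16;
        byte[] merged = new byte[lanes];
        for (int chunk = 0; chunk < lanes / perGather; chunk++) {   // 4 gather-loads
            for (int lane = 0; lane < perGather; lane++) {
                int i = chunk * perGather + lane;
                merged[i] = arr[idx[i]];                            // one gathered lane
            }
        }
        return merged;
    }
}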
------------- PR Review: https://git.openjdk.org/jdk/pull/26236#pullrequestreview-3075415497 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2245383326 From chagedorn at openjdk.org Thu Jul 31 15:31:56 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 31 Jul 2025 15:31:56 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v8] In-Reply-To: <-otlKVhe_xfmpET_cwn5CdvzDduOfFApGSH5VoZSwuk=.7eb8a0e3-4ad6-4ffb-97fd-11a2120a3eaf@github.com> References: <-otlKVhe_xfmpET_cwn5CdvzDduOfFApGSH5VoZSwuk=.7eb8a0e3-4ad6-4ffb-97fd-11a2120a3eaf@github.com> Message-ID: On Wed, 30 Jul 2025 06:14:40 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is >> relative smaller than that of `fromLong`. So this patch does the conversion for these cases. >> >> The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. >> >> Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. >> >> This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like >> >> VectorMaskToLong (VectorLongToMask x) => x >> >> >> Hence, this patch also added the following optimizations: >> >> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> >> VectorMaskCast (VectorMaskCast x) => x >> >> And we can see noticeable performance improvement with the above optimizations for floating-point types. >> >> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Error After Error Uplift >> microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 >> microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 >> microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 >> microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 >> microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 >> microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 >> microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 >> microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 >> >> >> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: >> >> Benchm... > > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Set default warm up to 10000 for JTReg tests Testing looked good! 
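For context, the equivalence being folded here can be written down directly against the incubating Vector API (a minimal sketch; the class name is made up, and it assumes jdk.incubator.vector is available, e.g. via --add-modules jdk.incubator.vector):

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

class FromLongSketch {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256; // 8 int lanes

    public static void main(String[] args) {
        long allLanes = (1L << SPECIES.length()) - 1;   // low 8 bits set
        // A constant that sets every lane is equivalent to maskAll(true) ...
        System.out.println(VectorMask.fromLong(SPECIES, allLanes)
                                     .equals(SPECIES.maskAll(true)));   // true
        // ... and a constant that sets no lane is equivalent to maskAll(false).
        System.out.println(VectorMask.fromLong(SPECIES, 0L)
                                     .equals(SPECIES.maskAll(false)));  // true
    }
}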
------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3140411816 From bkilambi at openjdk.org Thu Jul 31 16:22:04 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Thu, 31 Jul 2025 16:22:04 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v18] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 09:17:19 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Refine comments in the ad file Hi @sviswa7 there's some x86 code in this patch which I would like an x86 expert to review. Would you be able to take a look please? It's not a big change. Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23570#issuecomment-3140565389 From sviswanathan at openjdk.org Thu Jul 31 18:52:02 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Thu, 31 Jul 2025 18:52:02 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v18] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 09:17:19 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. 
>> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Refine comments in the ad file x86 changes look good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23570#pullrequestreview-3076555656 From dlong at openjdk.org Thu Jul 31 20:08:00 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 31 Jul 2025 20:08:00 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" [v4] In-Reply-To: References: Message-ID: On Wed, 30 Jul 2025 22:58:50 GMT, Guanqiang Han wrote: >> I'm able to consistently reproduce the problem using the following command line and test program ? >> >> java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java >> >> import java.util.Arrays; >> public class Test{ >> public static void main(String[] args) { >> System.out.println("begin"); >> byte[] arr1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; >> byte[] arr2 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; >> System.out.println(Arrays.equals(arr1, arr2)); >> System.out.println("end"); >> } >> } >> >> From my analysis, the root cause appears to be a mismatch in operand handling between T_ADDRESS and T_LONG in LIR_Assembler::stack2reg, especially when the source is marked as double stack (e.g., T_LONG) and the destination as single CPU register (e.g., T_ADDRESS), leading to assertion failures like assert(is_single_stack())(because T_LONG is double_size). >> >> In the test program above , the call chain is: Arrays.equals ? ArraysSupport.vectorizedMismatch ? LIRGenerator::do_vectorizedMismatch >> Within the do_vectorizedMismatch() method, a move instruction constructs an LIR_Op1. 
During LIR to machine code generation, LIR_Assembler::stack2reg was called. >> >> In this case, the src operand has type T_LONG and the dst operand has type T_ADDRESS. This combination triggers an assert in stack2reg, due to a mismatch between the stack slot type and register type handling. >> >> Importantly, this path ( LIR_Assembler::stack2reg was called ) is only taken when src is forced onto the stack. To reliably trigger this condition, the test is run with the -Xcomp option to force compilation and increase register pressure. >> >> A reference to the relevant code paths is provided below : >> image1 >> image2 >> >> On 64-bit platforms, although T_ADDRESS is classified as single_size, it is in fact 64 bits wide ,represent a single 64-bit general-purpose register and it can hold a T_LONG value, which is also 64 bits. >> >> However, T_LONG is defined as double_size, requiring two local variable slots or a pair of registers in the JVM's abstract model. This mismatch stems from the fact that T_ADDRESS is platform-dependent: it's 32 bits on 32-bit platforms, and 64 bits on 64-bit platforms ? yet its size class... > > Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - change T_LONG to T_ADDRESS in some intrinsic functions > - Merge remote-tracking branch 'upstream/master' into 8359235 > - Increase sleep time to ensure the method gets compiled > - add regression test > - Merge remote-tracking branch 'upstream/master' into 8359235 > - 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" Looks good now. Testing in progress... ------------- Marked as reviewed by dlong (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26462#pullrequestreview-3076741459 From never at openjdk.org Thu Jul 31 20:51:59 2025 From: never at openjdk.org (Tom Rodriguez) Date: Thu, 31 Jul 2025 20:51:59 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v7] In-Reply-To: References: Message-ID: <2hi_TI_U0T4NwVPEG7LIMjDD3xvEYx-q-BBJGIGQikA=.82f2b333-7084-4e53-937f-2c014b37635d@github.com> On Thu, 24 Jul 2025 20:03:33 GMT, Dean Long wrote: >> The VerifyStack logic in Deoptimization::unpack_frames() attempts to check the expression stack size of the interpreter frame against what GenerateOopMap computes. To do this, it needs to know if the state at the current bci represents the "before" state, meaning the bytecode will be reexecuted, or the "after" state, meaning we will advance to the next bytecode. The old code didn't know how to determine exactly what state we were in, so it checked both. This PR cleans that up, so we only have to compute the oopmap once. It also removes old SPARC support. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > readability suggestion src/hotspot/share/runtime/deoptimization.cpp line 940: > 938: int callee_size_of_parameters = 0; > 939: for (int frame_idx = 0; frame_idx < cur_array->frames(); frame_idx++) { > 940: assert(is_top_frame == (frame_idx == 0), "must be"); Why not replace this with direct computation of the value: bool is_top_frame = (frame_idx == 0); then you don't even need the final reset of the value either. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2246351521 From never at openjdk.org Thu Jul 31 21:04:56 2025 From: never at openjdk.org (Tom Rodriguez) Date: Thu, 31 Jul 2025 21:04:56 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v7] In-Reply-To: References: Message-ID: On Thu, 24 Jul 2025 20:03:33 GMT, Dean Long wrote: >> The VerifyStack logic in Deoptimization::unpack_frames() attempts to check the expression stack size of the interpreter frame against what GenerateOopMap computes. To do this, it needs to know if the state at the current bci represents the "before" state, meaning the bytecode will be reexecuted, or the "after" state, meaning we will advance to the next bytecode. The old code didn't know how to determine exactly what state we were in, so it checked both. This PR cleans that up, so we only have to compute the oopmap once. It also removes old SPARC support. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > readability suggestion src/hotspot/share/runtime/deoptimization.cpp line 971: > 969: > 970: cur_code = str.next(); > 971: reexecute = true; This seems a little unsavory, particularly since there's a later step which will print that value as if it was the original one. Since there's only one later logic use of the variable maybe there's should be a new flag to mark special case? Like `rolled_forward`? It might be fine as is with comments explaining this and fixing the printing to reflect what occurred here. src/hotspot/share/runtime/deoptimization.cpp line 995: > 993: int map_expr_invoke_ssize = mask.expression_stack_size() + cur_invoke_parameter_size; > 994: int expr_ssize_before = iframe_expr_ssize + (is_top_frame ? top_frame_expression_stack_adjustment : 0); > 995: int map_expr_callee_ssize = mask.expression_stack_size() + callee_size_of_parameters; `map` in these names might be more clearly `oopmap`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2246372241 PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2246372513 From dlong at openjdk.org Thu Jul 31 22:19:55 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 31 Jul 2025 22:19:55 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" [v4] In-Reply-To: References: Message-ID: On Wed, 30 Jul 2025 22:58:50 GMT, Guanqiang Han wrote: >> I'm able to consistently reproduce the problem using the following command line and test program ? >> >> java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java >> >> import java.util.Arrays; >> public class Test{ >> public static void main(String[] args) { >> System.out.println("begin"); >> byte[] arr1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; >> byte[] arr2 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; >> System.out.println(Arrays.equals(arr1, arr2)); >> System.out.println("end"); >> } >> } >> >> From my analysis, the root cause appears to be a mismatch in operand handling between T_ADDRESS and T_LONG in LIR_Assembler::stack2reg, especially when the source is marked as double stack (e.g., T_LONG) and the destination as single CPU register (e.g., T_ADDRESS), leading to assertion failures like assert(is_single_stack())(because T_LONG is double_size). >> >> In the test program above , the call chain is: Arrays.equals ? ArraysSupport.vectorizedMismatch ? 
LIRGenerator::do_vectorizedMismatch >> Within the do_vectorizedMismatch() method, a move instruction constructs an LIR_Op1. During LIR to machine code generation, LIR_Assembler::stack2reg was called. >> >> In this case, the src operand has type T_LONG and the dst operand has type T_ADDRESS. This combination triggers an assert in stack2reg, due to a mismatch between the stack slot type and register type handling. >> >> Importantly, this path ( LIR_Assembler::stack2reg was called ) is only taken when src is forced onto the stack. To reliably trigger this condition, the test is run with the -Xcomp option to force compilation and increase register pressure. >> >> A reference to the relevant code paths is provided below : >> image1 >> image2 >> >> On 64-bit platforms, although T_ADDRESS is classified as single_size, it is in fact 64 bits wide ,represent a single 64-bit general-purpose register and it can hold a T_LONG value, which is also 64 bits. >> >> However, T_LONG is defined as double_size, requiring two local variable slots or a pair of registers in the JVM's abstract model. This mismatch stems from the fact that T_ADDRESS is platform-dependent: it's 32 bits on 32-bit platforms, and 64 bits on 64-bit platforms ? yet its size class... > > Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - change T_LONG to T_ADDRESS in some intrinsic functions > - Merge remote-tracking branch 'upstream/master' into 8359235 > - Increase sleep time to ensure the method gets compiled > - add regression test > - Merge remote-tracking branch 'upstream/master' into 8359235 > - 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" Testing results are good. You need one more review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26462#issuecomment-3141506661 From dlong at openjdk.org Thu Jul 31 22:33:29 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 31 Jul 2025 22:33:29 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v8] In-Reply-To: References: Message-ID: > The VerifyStack logic in Deoptimization::unpack_frames() attempts to check the expression stack size of the interpreter frame against what GenerateOopMap computes. To do this, it needs to know if the state at the current bci represents the "before" state, meaning the bytecode will be reexecuted, or the "after" state, meaning we will advance to the next bytecode. The old code didn't know how to determine exactly what state we were in, so it checked both. This PR cleans that up, so we only have to compute the oopmap once. It also removes old SPARC support. 
Dean Long has updated the pull request incrementally with two additional commits since the last revision: - more cleanup - simplify is_top_frame ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26121/files - new: https://git.openjdk.org/jdk/pull/26121/files/535fbb05..6257de6c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26121&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26121&range=06-07 Stats: 15 lines in 1 file changed: 6 ins; 2 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/26121.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26121/head:pull/26121 PR: https://git.openjdk.org/jdk/pull/26121 From dlong at openjdk.org Thu Jul 31 22:33:29 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 31 Jul 2025 22:33:29 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v7] In-Reply-To: References: Message-ID: On Thu, 31 Jul 2025 21:02:01 GMT, Tom Rodriguez wrote: >> Dean Long has updated the pull request incrementally with one additional commit since the last revision: >> >> readability suggestion > > src/hotspot/share/runtime/deoptimization.cpp line 971: > >> 969: >> 970: cur_code = str.next(); >> 971: reexecute = true; > > This seems a little unsavory, particularly since there's a later step which will print that value as if it was the original one. Since there's only one later logic use of the variable maybe there's should be a new flag to mark special case? Like `rolled_forward`? It might be fine as is with comments explaining this and fixing the printing to reflect what occurred here. OK, I cleaned this up a bit. I think this code could be cleaned up further and use fewer variables, but I'd like to save that for another day. > src/hotspot/share/runtime/deoptimization.cpp line 995: > >> 993: int map_expr_invoke_ssize = mask.expression_stack_size() + cur_invoke_parameter_size; >> 994: int expr_ssize_before = iframe_expr_ssize + (is_top_frame ? top_frame_expression_stack_adjustment : 0); >> 995: int map_expr_callee_ssize = mask.expression_stack_size() + callee_size_of_parameters; > > `map` in these names might be more clearly `oopmap`. Done. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2246507655 PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2246507871 From chen.l.liang at oracle.com Thu Jul 31 23:29:53 2025 From: chen.l.liang at oracle.com (Chen Liang) Date: Thu, 31 Jul 2025 23:29:53 +0000 Subject: =?gb2312?B?UmU6ILvYuLSjulJldXNlIHRoZSBTdHJpbmdVVEYxNjo6cHV0Q2hhcnNTQiBt?= =?gb2312?B?ZXRob2QgaW5zdGVhZCBvZiB0aGUgSW50cmluc2ljIGluIHRoZSBTdHJpbmdV?= =?gb2312?Q?TF16::toBytes?= In-Reply-To: References: <086fd3c9-30e0-4294-b674-ece0bd91051c.shaojin.wensj@alibaba-inc.com> , <9b1f2b83-2e0e-443f-a1af-307bcf871974@oracle.com> Message-ID: Hi all, I think the key takeaway here is that we should reduce the number of intrinsics for easier maintenance. With the same number of unsafe Java methods, it is still feasible to reduce the number of distinct intrinsics simply for the reduced maintenance cost. For example, the toBytes and getChars of StringUTF16 both have intrinsics. However, in essence, they are just two array copy functions - it would be more reasonable for hotspot to implement a generic array copy intrinsic that StringUTF16 can use instead. Such an intrinsic may take place on Unsafe.copyMemory itself, or may be somewhere else. 
Chen ________________________________ From: core-libs-dev on behalf of wenshao Sent: Wednesday, July 30, 2025 9:45 PM To: Roger Riggs ; core-libs-dev Subject: ???Reuse the StringUTF16::putCharsSB method instead of the Intrinsic in the StringUTF16::toBytes Thanks to Roger Riggs for suggesting that the code should not be called with Unsafe.uninitializedArray. After replacing it with `new byte[]` and running `StringConstructor.newStringFromCharsMixedBegin`, I verified that performance remained consistent on x64. On aarch64, performance improved by 8% for size = 7, but decreased by 7% for size = 64. For detailed performance data, see the Markdown data in the draft pull request I submitted.https://github.com/openjdk/jdk/pull/26553#issuecomment-3138357748 - Shaojin Wen ------------------------------------------------------------------ ????Roger Riggs ?????2025?7?31?(??) 03:17 ????"core-libs-dev" ????Re: Reuse the StringUTF16::putCharsSB method instead of the Intrinsic in the StringUTF16::toBytes Hi, Unsafe.uninitializedArray and StringConcatHelper.newArray was created for the exclusive use of StringConcatHelper and by HotSpot optimizations. Unsafe.uninitializedArray and StringConcatHelper.newArray area very sensitive APIs and should NOT be used anywhere except in StringConcatHelper and HotSpot. Regards, Roger On 7/30/25 11:40 AM, jaikiran.pai at oracle.com wrote: I'll let others knowledgeable in this area to comment and provide inputs to this proposal. I just want to say thank you for bringing up this discussion to the mailing list first, providing the necessary context and explanation and seeking feedback, before creating a JBS issue or a RFR PR. -Jaikiran On 30/07/25 7:48 pm, wenshao wrote: In the discussion of `8355177: Speed up StringBuilder::append(char[]) via Unsafe::copyMemory` (https://github.com/openjdk/jdk/pull/24773), @liach (Chen Liang) suggested reusing the StringUTF16::putCharsSB method introduced in PR #24773 instead of the Intrinsic implementation in the StringUTF16::toBytes method. Original: ```java @IntrinsicCandidate public static byte[] toBytes(char[] value, int off, int len) { byte[] val = newBytesFor(len); for (int i = 0; i < len; i++) { putChar(val, i, value[off]); off++; } return val; } ``` After: ```java public static byte[] toBytes(char[] value, int off, int len) { byte[] val = (byte[]) Unsafe.getUnsafe().allocateUninitializedArray(byte.class, newBytesLength(len)); putCharsSB(val, 0, value, off, off + len); return val; } ``` This replacement does not degrade performance. Running StringConstructor.newStringFromCharsMixedBegin verified that performance is consistent with the original on x64 and slightly improved on aarch64. The implementation after replacing the Intrinsic implementation removed 100 lines of C++ code, leaving only Java and Unsafe code, no Intrinsic or C++ code, which makes the code more maintainable. I've submitted a draft PR https://github.com/openjdk/jdk/pull/26553 , please give me some feedback. - Shaojin Wen -------------- next part -------------- An HTML attachment was scrubbed... URL: