From sparasa at openjdk.org Tue Jul 1 00:01:59 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Tue, 1 Jul 2025 00:01:59 GMT Subject: RFR: 8358179: Performance regression in Math.cbrt [v2] In-Reply-To: References: <45l5EvxoRINI1_Ep2_snJzKNMPo4-dPXADalLN1fq1Y=.9f697a35-ee7b-4e7a-9e5e-ff33911b3b21@github.com> Message-ID: <27l1noh4qLvBGFOqhDNxmv-Ikyuc8AOQNRgIT4RtbZM=.5c199ba5-a2a7-4e98-9459-68ed4c55b73f@github.com> On Fri, 27 Jun 2025 01:43:16 GMT, Mohamed Issa wrote: >> The changes described below are meant to resolve the performance regression introduced by the **x86_64 cbrt** double precision floating point scalar intrinsic in #24470. >> >> 1. Check for +0, -0, +INF, -INF, and NaN before any other input values. >> 2. If these special values are found, return immediately with minimal modifications to the result register. >> 3. Performance testing shows the modified intrinsic improves throughput by 65.1% over the original intrinsic on average for the special values while throughput drops by 5.5% for the normal value range (-INF, -2^(-1022)], [2^(-1022), INF). >> >> The commands to run all relevant micro-benchmarks are posted below. >> >> `make test TEST="micro:CbrtPerf.CbrtPerfRanges"` >> `make test TEST="micro:CbrtPerf.CbrtPerfSpecialValues"` >> >> The results of all tests posted below were captured with an [Intel? Xeon 8488C](https://www.intel.com/content/www/us/en/products/sku/231730/intel-xeon-platinum-8480c-processor-105m-cache-2-00-ghz/specifications.html) using [OpenJDK v26-b1](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B1) as the baseline version. The term _baseline1_ refers to runs with the intrinsic enabled and _baseline2_ refers to runs with the intrinsic disabled. >> >> Each result is the mean of 8 individual runs, and the input ranges used match those from the original Java implementation. Overall, the changes provide a significant uplift over _baseline1_ except for a mild regression in the (**2^(-1022) <= |x| < INF**) input range, which is expected due to the extra checks. When comparing against _baseline2_, the modified intrinsic significantly still outperforms for the inputs (**-INF < x < INF**) that require heavy compute. However, the special value inputs that trigger fast path returns still perform better with _baseline2_. >> >> | Input range(s) | Baseline1 (ops/ms) | Change (ops/ms) | Change vs baseline1 (%) | >> | :-------------------------------------: | :-------------------: | :------------------: | :--------------------------: | >> | [-2^(-1022), 2^(-1022)] | 18470 | 20847 | +12.87 | >> | (-INF, -2^(-1022)], [2^(-1022), INF) | 210538 | 198925 | -5.52 | >> | [0] | 344990 | 627561 | +81.91 | >> | [-0] ... > > Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: > > Ensure ABS_MASK is a 128-bit memory sized location and only use equal enum for UCOMISD checks I did independent testing by running the correctness tests and performance benchmarks. The change looks good to me. Thanks, Vamsi ------------- Marked as reviewed by sparasa (Author). PR Review: https://git.openjdk.org/jdk/pull/25962#pullrequestreview-2973095482 From haosun at openjdk.org Tue Jul 1 02:54:47 2025 From: haosun at openjdk.org (Hao Sun) Date: Tue, 1 Jul 2025 02:54:47 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v4] In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 13:25:09 GMT, Mikhail Ablakatov wrote: >> Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. 
It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. >> >> Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still. >> >> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reducion micro-benchmarks. >> >> Benchmarks results: >> >> Neoverse-V1 (SVE 256-bit) >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms >> >> >> Fujitsu A64FX (SVE 512-bit): >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms > > Mikhail Ablakatov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: > > - cleanup: address nits, rename several symbols > - cleanup: remove unreferenced definitions > - Address review comments. > > - fixup: disable FP mul reduction auto-vectorization for all targets > - fixup: add a tmp vReg to reduce_mul_integral_gt128b and > reduce_non_strict_order_mul_fp_gt128bto keep vsrc unmodified > - cleanup: replace a complex lambda in the above methods with a loop > - cleanup: rename symbols to follow the existing naming convention > - cleanup: add asserts to SVE only instructions > - split mul FP reduction instructions into strictly-ordered (default) > and explicitly non strictly-ordered > - remove redundant conditions in TestVectorFPReduction.java > > Benchmarks results: > > Neoverse-V1 (SVE 256-bit) > > | Benchmark | Before | After | Units | Diff | > |---------------------------|----------|----------|--------|-------| > | ByteMaxVector.MULLanes | 619.156 | 9884.578 | ops/ms | 1496% | > | DoubleMaxVector.MULLanes | 184.693 | 2712.051 | ops/ms | 1368% | > | FloatMaxVector.MULLanes | 277.818 | 3388.038 | ops/ms | 1119% | > | IntMaxVector.MULLanes | 371.225 | 4765.434 | ops/ms | 1183% | > | LongMaxVector.MULLanes | 205.149 | 2672.975 | ops/ms | 1203% | > | ShortMaxVector.MULLanes | 472.804 | 5122.917 | ops/ms | 984% | > - Merge branch 'master' into 8343689-rebase > - fixup: don't modify the value in vsrc > > Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this > change, the result of recursive folding is held in vtmp1. To be able to > pass this intermediate result to reduce_mul_integral_le128b(), we would > have to use another temporary FloatRegister, as vtmp1 would essentially > act as vsrc. It's possible to get around this however: > reduce_mul_integral_le128b() is modified so it's possible to pass > matching vsrc and vtmp2 arguments. 
By doing this, we save ourselves a > temporary register in rules that match to reduce_mul_integral_gt128b(). > - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formating > - Use EXT instead of COMPACT to split a vector into two halves > > Benchmarks results: > > Neoverse-V1 (SVE 256-bit) > > Benchmark (size) Mode master PR Units > ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms > Short... src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3729: > 3727: #undef INSN > 3728: > 3729: // SVE aliases In the inital commit, asm test for `sve_(mov|movs|not|nots)` is added into `test/hotspot/gtest/aarch64/aarch64-asmtest.py`. Since the definition is removed in this commit, the corresponding asm test should be removed as well. Otherwise, JDK build failed on AArch64. See the error log in GHA test. https://github.com/mikabl-arm/jdk/actions/runs/15974069085/job/45051902618 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2176310497 From xgong at openjdk.org Tue Jul 1 06:04:29 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 1 Jul 2025 06:04:29 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors Message-ID: ### Background On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. ### Impact Analysis #### 1. Vector types Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. #### 2. Vector API No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. #### 3. Auto-vectorization Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. #### 4. Codegen of vector nodes NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. Details: - Lanewise vector operations are unaffected as explained above. - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. 
Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, adding an explicit vector size check in `match_rule_supported_vector()` would be beneficial. - Missing codegen support for type conversions with 32-bit input or output vector size should be added. ### Main changes: - Support 2 shorts vector types. The supported min vector element count for each basic type is: - `T_BOOLEAN`: 2 - `T_BYTE`: 4 - `T_CHAR`: 4 - `T_SHORT`: 2 (new supported) - `T_INT`/`T_FLOAT`/`T_LONG`/`T_DOUBLE`: 2 - Add codegen support for `Vector[U]Cast` with 32-bit input or output vector size. `VectorReinterpret` has already considered the 32-bit vector size cases. - Unsupport reductions with less than 8 bytes vector size explicitly. - Add additional IR tests for Vector API type conversions. - Add JMH benchmark for auto-vectorization with two 16-bit lanes. ### Test Tested hotspot/jdk/langtools - all tests passed. ### Performance Following shows the performance improvement of relative VectorAPI JMHs on a NVIDIA Grace (128-bit SVE2) machine: Benchmark SIZE Mode Unit Before After Gain VectorFPtoIntCastOperations.microDouble128ToShort128 512 thrpt ops/ms 731.529 26278.599 35.92 VectorFPtoIntCastOperations.microDouble128ToShort128 1024 thrpt ops/ms 366.461 10595.767 28.91 VectorFPtoIntCastOperations.microFloat64ToShort64 512 thrpt ops/ms 315.791 14327.682 45.37 VectorFPtoIntCastOperations.microFloat64ToShort64 1024 thrpt ops/ms 158.485 7261.847 45.82 VectorZeroExtend.short2Long 128 thrpt ops/ms 1447.243 898666.972 620.95 And here is the performance improvement of the added JMH on Grace: Benchmark LEN Mode Unit Before After Gain VectorTwoShorts.addVec2S 64 avgt ns/op 20.948 12.683 1.65 VectorTwoShorts.addVec2S 128 avgt ns/op 40.073 22.703 1.76 VectorTwoShorts.addVec2S 512 avgt ns/op 157.447 83.691 1.88 VectorTwoShorts.addVec2S 1024 avgt ns/op 313.022 165.085 1.89 VectorTwoShorts.mulVec2S 64 avgt ns/op 20.981 12.647 1.65 VectorTwoShorts.mulVec2S 128 avgt ns/op 40.279 22.637 1.77 VectorTwoShorts.mulVec2S 512 avgt ns/op 158.642 83.371 1.90 VectorTwoShorts.mulVec2S 1024 avgt ns/op 314.788 165.205 1.90 VectorTwoShorts.reverseBytesVec2S 64 avgt ns/op 17.739 9.106 1.94 VectorTwoShorts.reverseBytesVec2S 128 avgt ns/op 32.591 15.632 2.08 VectorTwoShorts.reverseBytesVec2S 512 avgt ns/op 126.154 55.284 2.28 VectorTwoShorts.reverseBytesVec2S 1024 avgt ns/op 254.592 107.457 2.36 We can observe the similar uplift on an AArch64 N1 (NEON) machine. ------------- Commit messages: - 8359419: AArch64: Relax min vector length to 32-bit for short vectors Changes: https://git.openjdk.org/jdk/pull/26057/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26057&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8359419 Stats: 306 lines in 8 files changed: 196 ins; 9 del; 101 mod Patch: https://git.openjdk.org/jdk/pull/26057.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26057/head:pull/26057 PR: https://git.openjdk.org/jdk/pull/26057 From xgong at openjdk.org Tue Jul 1 06:09:44 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 1 Jul 2025 06:09:44 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: References: Message-ID: On Wed, 25 Jun 2025 09:16:48 GMT, Xiaohong Gong wrote: >> JDK-8318650 introduced hotspot intrinsification of subword gather load APIs for X86 platforms [1]. 
However, the current implementation is not optimal for AArch64 SVE platform, which natively supports vector instructions for subword gather load operations using an int vector for indices (see [2][3]). >> >> Two key areas require improvement: >> 1. At the Java level, vector indices generated for range validation could be reused for the subsequent gather load operation on architectures with native vector instructions like AArch64 SVE. However, the current implementation prevents compiler reuse of these index vectors due to divergent control flow, potentially impacting performance. >> 2. At the compiler IR level, the additional `offset` input for `LoadVectorGather`/`LoadVectorGatherMasked` with subword types increases IR complexity and complicates backend implementation. Furthermore, generating `add` instructions before each memory access negatively impacts performance. >> >> This patch refactors the implementation at both the Java level and compiler mid-end to improve efficiency and maintainability across different architectures. >> >> Main changes: >> 1. Java-side API refactoring: >> - Explicitly passes generated index vectors to hotspot, eliminating duplicate index vectors for gather load instructions on >> architectures like AArch64. >> 2. C2 compiler IR refactoring: >> - Refactors `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword types by removing the memory offset input and incorporating it into the memory base `addr` at the IR level. This simplifies backend implementation, reduces add operations, and unifies the IR across all types. >> 3. Backend changes: >> - Streamlines X86 implementation of subword gather operations following the removal of the offset input from the IR level. >> >> Performance: >> The performance of the relative JMH improves up to 27% on a X86 AVX512 system. Please see the data below: >> >> Benchmark Mode Cnt Unit SIZE Before After Gain >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 64 53682.012 52650.325 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 256 14484.252 14255.156 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 1024 3664.900 3595.615 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 4096 908.31... > > Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: > > - Address review comments > - Merge 'jdk:master' into JDK-8355563 > - 8355563: VectorAPI: Refactor current implementation of subword gather load API Ping again! Thanks in advance! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3021961883 From dfenacci at openjdk.org Tue Jul 1 06:25:42 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Tue, 1 Jul 2025 06:25:42 GMT Subject: RFR: 8360641: TestCompilerCounts fails after 8354727 [v3] In-Reply-To: References: <3mMrDF_446r7HudbsHIpdoWByBlnUpjFo7YzIty0KG8=.facc058f-3975-44c4-b2d4-93b8c64db185@github.com> Message-ID: On Mon, 30 Jun 2025 08:58:07 GMT, Manuel H?ssig wrote: >> After integrating #25872 the calculation of the`CICompilerCount` ergonomic became dependent on the size of `NonNMethodCodeHeapSize`, which itself is an ergonomic based on the available memory. Thus, depending on the system, the test `compiler/arguments/TestCompilerCounts.java` failed, i.e. locally this failed, but not on CI servers. >> >> This PR changes the test to reflect the changes introduced in #25872. 
>> >> Testing: >> - [ ] [Github Actions](https://github.com/mhaessig/jdk/actions/runs/15932906313) >> - [x] tier1,tier2 plus Oracle internal testing > > Manuel H?ssig has updated the pull request incrementally with two additional commits since the last revision: > > - Remove superfluous newline > - Add copyright Looks good to me. Thanks! ------------- Marked as reviewed by dfenacci (Committer). PR Review: https://git.openjdk.org/jdk/pull/26024#pullrequestreview-2973682287 From xgong at openjdk.org Tue Jul 1 06:27:43 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 1 Jul 2025 06:27:43 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v3] In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 12:05:08 GMT, Mikhail Ablakatov wrote: >> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2087: >> >>> 2085: assert(vector_length_in_bytes > FloatRegister::neon_vl, "ASIMD impl should be used instead"); >>> 2086: assert(vector_length_in_bytes <= FloatRegister::sve_vl_max, "unsupported vector length"); >>> 2087: assert(is_power_of_2(vector_length_in_bytes), "unsupported vector length"); >> >> Better to compare with `MaxVectorSize`. >> >> I suggest using `assert(length_in_bytes == MaxVectorSize, "invalid vector length");` and putting this assertion in `aarch64_vector.ad` file, i.e. inside the matching rule. > > Why is it better that way? Currently the assertions check that we end up here if there computations that can be done only using SVE (length > neon && length <= sve). What would happen if a user operates 256b VectorAPI vectors on a 512b SVE platform? That would be the operations with partial vector size valid. For such cases, we will generate a mask in IR level, and a `VectorBlend` will be generated for this reduction case. Otherwise the result will be incorrect. So the vector size should be equal to MaxVectorSize theoretically. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2176524365 From xgong at openjdk.org Tue Jul 1 06:27:44 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 1 Jul 2025 06:27:44 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v4] In-Reply-To: References: <2jvFY4hq9FPdk9e4Zg6LRPdRVhDTGgxofL-we8c-mns=.4e6ce509-67a4-4e46-a661-2b0951f88731@github.com> Message-ID: On Mon, 30 Jun 2025 12:20:19 GMT, Mikhail Ablakatov wrote: >> I have the same concern about the order issue with @eme64. >> Should we only enable this only for VectorAPI case, which doesn't require strict-order? > > FP reductions have been disabled for auto-vectorization, please see the following comment: https://github.com/openjdk/jdk/pull/23181/files#diff-edf6d70f65d81dc12a483088e0610f4e059bd40697f242aedfed5c2da7475f1aR130 . You can also check https://github.com/openjdk/jdk/pull/23181#issuecomment-3018988067 to see how the patch affects auto-vectorization performance. The only benchmarks that saw a performance uplift on a 256b SVE platform is `VectorReduction2.WithSuperword.intMulBig` (which is fine since it's an integer benchmark). Yes, these operations are disabled for SLP. But maybe we could add an assertion to check the restrict flag in the match rules. 
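To make the ordering point above concrete, here is a minimal Java sketch (my own illustration, not code from the PR; the class and method names are made up) of the Vector API reduction that is allowed to take the relaxed-order path, in contrast to an auto-vectorized scalar loop, which must keep the source order of the floating-point multiplications. It assumes the incubating jdk.incubator.vector module is enabled.

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class MulReductionExample {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float mulLanes(float[] a) {
        float r = 1.0f;
        int i = 0;
        for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
            // reduceLanes(MUL) on a float vector does not promise a strict
            // left-to-right order, so a relaxed-order folding of the lanes is legal here.
            r *= FloatVector.fromArray(SPECIES, a, i).reduceLanes(VectorOperators.MUL);
        }
        for (; i < a.length; i++) {
            r *= a[i];  // scalar tail
        }
        return r;
    }
}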
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2176528442 From epeter at openjdk.org Tue Jul 1 06:30:44 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 1 Jul 2025 06:30:44 GMT Subject: RFR: 8358179: Performance regression in Math.cbrt [v2] In-Reply-To: References: <45l5EvxoRINI1_Ep2_snJzKNMPo4-dPXADalLN1fq1Y=.9f697a35-ee7b-4e7a-9e5e-ff33911b3b21@github.com> Message-ID: On Fri, 27 Jun 2025 01:43:16 GMT, Mohamed Issa wrote: >> The changes described below are meant to resolve the performance regression introduced by the **x86_64 cbrt** double precision floating point scalar intrinsic in #24470. >> >> 1. Check for +0, -0, +INF, -INF, and NaN before any other input values. >> 2. If these special values are found, return immediately with minimal modifications to the result register. >> 3. Performance testing shows the modified intrinsic improves throughput by 65.1% over the original intrinsic on average for the special values while throughput drops by 5.5% for the normal value range (-INF, -2^(-1022)], [2^(-1022), INF). >> >> The commands to run all relevant micro-benchmarks are posted below. >> >> `make test TEST="micro:CbrtPerf.CbrtPerfRanges"` >> `make test TEST="micro:CbrtPerf.CbrtPerfSpecialValues"` >> >> The results of all tests posted below were captured with an [Intel? Xeon 8488C](https://www.intel.com/content/www/us/en/products/sku/231730/intel-xeon-platinum-8480c-processor-105m-cache-2-00-ghz/specifications.html) using [OpenJDK v26-b1](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B1) as the baseline version. The term _baseline1_ refers to runs with the intrinsic enabled and _baseline2_ refers to runs with the intrinsic disabled. >> >> Each result is the mean of 8 individual runs, and the input ranges used match those from the original Java implementation. Overall, the changes provide a significant uplift over _baseline1_ except for a mild regression in the (**2^(-1022) <= |x| < INF**) input range, which is expected due to the extra checks. When comparing against _baseline2_, the modified intrinsic significantly still outperforms for the inputs (**-INF < x < INF**) that require heavy compute. However, the special value inputs that trigger fast path returns still perform better with _baseline2_. >> >> | Input range(s) | Baseline1 (ops/ms) | Change (ops/ms) | Change vs baseline1 (%) | >> | :-------------------------------------: | :-------------------: | :------------------: | :--------------------------: | >> | [-2^(-1022), 2^(-1022)] | 18470 | 20847 | +12.87 | >> | (-INF, -2^(-1022)], [2^(-1022), INF) | 210538 | 198925 | -5.52 | >> | [0] | 344990 | 627561 | +81.91 | >> | [-0] ... > > Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: > > Ensure ABS_MASK is a 128-bit memory sized location and only use equal enum for UCOMISD checks Did not review the patch in detail, but looks reasonable. Tests are passing on my end with commit 3 / v01. @missa-prime Thanks for taking care of this! ------------- Marked as reviewed by epeter (Reviewer). 
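For anyone re-checking the fast path locally: the special-value behavior it must preserve is fixed by the Math.cbrt specification (NaN maps to NaN, infinities and signed zeros keep their sign). A tiny stand-alone check along those lines (a hypothetical snippet, not one of the PR's micro-benchmarks or jtreg tests) could look like this:

public class CbrtSpecialValuesCheck {
    public static void main(String[] args) {
        // NaN -> NaN
        assertTrue(Double.isNaN(Math.cbrt(Double.NaN)));
        // +/-Infinity keep their sign
        assertTrue(Math.cbrt(Double.POSITIVE_INFINITY) == Double.POSITIVE_INFINITY);
        assertTrue(Math.cbrt(Double.NEGATIVE_INFINITY) == Double.NEGATIVE_INFINITY);
        // +/-0.0 keep their sign; compare raw bits so -0.0 is not confused with +0.0
        assertTrue(Double.doubleToRawLongBits(Math.cbrt(0.0)) == Double.doubleToRawLongBits(0.0));
        assertTrue(Double.doubleToRawLongBits(Math.cbrt(-0.0)) == Double.doubleToRawLongBits(-0.0));
        System.out.println("Math.cbrt special values look correct");
    }

    private static void assertTrue(boolean condition) {
        if (!condition) {
            throw new AssertionError("unexpected Math.cbrt result");
        }
    }
}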
PR Review: https://git.openjdk.org/jdk/pull/25962#pullrequestreview-2973696349 From epeter at openjdk.org Tue Jul 1 06:38:45 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 1 Jul 2025 06:38:45 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 06:07:03 GMT, Xiaohong Gong wrote: >> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: >> >> - Address review comments >> - Merge 'jdk:master' into JDK-8355563 >> - 8355563: VectorAPI: Refactor current implementation of subword gather load API > > Ping again! Thanks in advance! @XiaohongGong I'm a little busy at the moment, and soon going on a summer vacation, so I cannot promise a full review soon. Feel free to ask someone else to have a look. I quickly looked through your new benchmark results you published after integration of https://github.com/openjdk/jdk/pull/25539. There seem to still be a few cases where `Gain < 1`. Especially: GatherOperationsBenchmark.microShortGather512_MASK 256 thrpt 30 ops/ms 11587.465 10674.598 0.92 GatherOperationsBenchmark.microShortGather512_MASK 1024 thrpt 30 ops/ms 2902.731 2629.739 0.90 GatherOperationsBenchmark.microShortGather512_MASK 4096 thrpt 30 ops/ms 741.546 671.124 0.90 and GatherOperationsBenchmark.microShortGather256_MASK 256 thrpt 30 ops/ms 11339.217 10951.141 0.96 GatherOperationsBenchmark.microShortGather256_MASK 1024 thrpt 30 ops/ms 2840.081 2718.823 0.95 GatherOperationsBenchmark.microShortGather256_MASK 4096 thrpt 30 ops/ms 725.334 696.343 0.96 and GatherOperationsBenchmark.microByteGather512_MASK 64 thrpt 30 ops/ms 50588.210 48220.741 0.95 Do you know what happens in those cases? That said: https://github.com/openjdk/jdk/pull/25539 seems to have been quite the sucess, there are way fewer regressions now than before ? ------------- PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3022057434 From xgong at openjdk.org Tue Jul 1 06:43:44 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 1 Jul 2025 06:43:44 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: References: Message-ID: <7-WqNSzjPLOsHJ4DHogxqbiInl8TIz5sxIEXbIfo2OQ=.912568b8-830d-47cc-a837-46af6be618f3@github.com> On Tue, 1 Jul 2025 06:07:03 GMT, Xiaohong Gong wrote: >> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: >> >> - Address review comments >> - Merge 'jdk:master' into JDK-8355563 >> - 8355563: VectorAPI: Refactor current implementation of subword gather load API > > Ping again! Thanks in advance! > @XiaohongGong I'm a little busy at the moment, and soon going on a summer vacation, so I cannot promise a full review soon. Feel free to ask someone else to have a look. > > I quickly looked through your new benchmark results you published after integration of #25539. There seem to still be a few cases where `Gain < 1`. 
Especially: > > ``` > GatherOperationsBenchmark.microShortGather512_MASK 256 thrpt 30 ops/ms 11587.465 10674.598 0.92 > GatherOperationsBenchmark.microShortGather512_MASK 1024 thrpt 30 ops/ms 2902.731 2629.739 0.90 > GatherOperationsBenchmark.microShortGather512_MASK 4096 thrpt 30 ops/ms 741.546 671.124 0.90 > ``` > > and > > ``` > GatherOperationsBenchmark.microShortGather256_MASK 256 thrpt 30 ops/ms 11339.217 10951.141 0.96 > GatherOperationsBenchmark.microShortGather256_MASK 1024 thrpt 30 ops/ms 2840.081 2718.823 0.95 > GatherOperationsBenchmark.microShortGather256_MASK 4096 thrpt 30 ops/ms 725.334 696.343 0.96 > ``` > > and > > ``` > GatherOperationsBenchmark.microByteGather512_MASK 64 thrpt 30 ops/ms 50588.210 48220.741 0.95 > ``` > > Do you know what happens in those cases? Thanks for your input! Yes, I spent some time making an analysis on these little regressions. Seems there are the architecture HW influences like the cache miss or code alignment. I tried with a larger loop alignment like 32, and the performance will be improved and regressions are gone. Since I'm not quite familiar with X86 architectures, I'm not sure of the exact point. Any suggestions on that? ------------- PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3022088710 From jbhateja at openjdk.org Tue Jul 1 06:45:42 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 1 Jul 2025 06:45:42 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs In-Reply-To: References: Message-ID: On Thu, 19 Jun 2025 05:15:40 GMT, Srinivas Vamsi Parasa wrote: > The goal of this PR is to enhance the existing x86 assembly stubs using PUSH and POP instructions with paired PUSHP/POPP instructions which are part of Intel APX technology. > > In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0?R31). Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs. > > Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures. src/hotspot/cpu/x86/macroAssembler_x86.cpp line 800: > 798: void MacroAssembler::push(Register src, bool is_pair) { > 799: if (is_pair && VM_Version::supports_apx_f()) { > 800: pushp(src); What does is_pair signify here ? You are just pushing one register. Do you intend to use has_matching_pop ? src/hotspot/cpu/x86/macroAssembler_x86.cpp line 807: > 805: > 806: void MacroAssembler::pop(Register dst, bool is_pair) { > 807: if (is_pair && VM_Version::supports_apx_f()) { Same as above, new argument suggestion: please use has_matching_push. I understand your purpose here is to delegate the responsibility of balancing of PPX pair to the user. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2176508727 PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2176511119 From jbhateja at openjdk.org Tue Jul 1 06:45:42 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 1 Jul 2025 06:45:42 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 06:11:29 GMT, Jatin Bhateja wrote: >> The goal of this PR is to enhance the existing x86 assembly stubs using PUSH and POP instructions with paired PUSHP/POPP instructions which are part of Intel APX technology. >> >> In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0?R31). Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs. >> >> Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures. > > src/hotspot/cpu/x86/macroAssembler_x86.cpp line 807: > >> 805: >> 806: void MacroAssembler::pop(Register dst, bool is_pair) { >> 807: if (is_pair && VM_Version::supports_apx_f()) { > > Same as above, new argument suggestion: please use has_matching_push. > I understand your purpose here is to delegate the responsibility of balancing of PPX pair to the user. For a cleaner interface, I think we can also maintain a RAII style APXPushPopPairTracker in the stub snippets using push/pop instruction sequence and wrap the actual assembler call underneath. The idea here is to catch the balancing error upfront as PPX is purely a performance hint. Instructions with this hint have the same functional semantics as those without. PPX hints set by the compiler that violate the balancing rule may turn off the PPX optimization, but they will not affect program semantics.. 
class APXPushPopPairTracker {
 private:
  int _counter;

 public:
  APXPushPopPairTracker() : _counter(0) { }
  ~APXPushPopPairTracker() {
    // Every PPX-hinted push must have been matched by a PPX-hinted pop.
    assert(_counter == 0, "Push/pop pair mismatch");
  }

  void push(Register reg, bool has_matching_pop) {
    if (has_matching_pop && VM_Version::supports_apx_f()) {
      Assembler::pushp(reg);
      incrementCounter();
    } else {
      Assembler::push(reg);
    }
  }

  void pop(Register reg, bool has_matching_push) {
    if (has_matching_push && VM_Version::supports_apx_f()) {
      Assembler::popp(reg);
      decrementCounter();
    } else {
      Assembler::pop(reg);
    }
  }

  void incrementCounter() { _counter++; }
  void decrementCounter() { _counter--; }
};

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2176549150 From jbhateja at openjdk.org Tue Jul 1 06:48:42 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 1 Jul 2025 06:48:42 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 06:11:29 GMT, Jatin Bhateja wrote: >> The goal of this PR is to enhance the existing x86 assembly stubs using PUSH and POP instructions with paired PUSHP/POPP instructions which are part of Intel APX technology. >> >> In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0–R31). Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs. >> >> Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures. > > src/hotspot/cpu/x86/macroAssembler_x86.cpp line 807: > >> 805: >> 806: void MacroAssembler::pop(Register dst, bool is_pair) { >> 807: if (is_pair && VM_Version::supports_apx_f()) { > > Same as above, new argument suggestion: please use has_matching_push. > I understand your purpose here is to delegate the responsibility of balancing of PPX pair to the user. For a cleaner interface, I think we can also maintain a RAII style APXPushPopPairTracker in the stub snippets using push/pop instruction sequence and wrap the actual assembler call underneath. The idea here is to catch the balancing error upfront as PPX is purely a performance hint. Instructions with this hint have the same functional semantics as those without. PPX hints set by the compiler that violate the balancing rule may turn off the PPX optimization, but they will not affect program semantics.
class APXPushPopPairTracker {
 private:
  int _counter;

 public:
  APXPushPopPairTracker() : _counter(0) { }
  ~APXPushPopPairTracker() {
    // Every PPX-hinted push must have been matched by a PPX-hinted pop.
    assert(_counter == 0, "Push/pop pair mismatch");
  }

  void push(Register reg, bool has_matching_pop) {
    if (has_matching_pop && VM_Version::supports_apx_f()) {
      Assembler::pushp(reg);
      incrementCounter();
    } else {
      Assembler::push(reg);
    }
  }

  void pop(Register reg, bool has_matching_push) {
    if (has_matching_push && VM_Version::supports_apx_f()) {
      Assembler::popp(reg);
      decrementCounter();
    } else {
      Assembler::pop(reg);
    }
  }

  void incrementCounter() { _counter++; }
  void decrementCounter() { _counter--; }
};

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2176564840 From mhaessig at openjdk.org Tue Jul 1 06:50:46 2025 From: mhaessig at openjdk.org (Manuel Hässig) Date: Tue, 1 Jul 2025 06:50:46 GMT Subject: RFR: 8361092: Remove trailing spaces in x86 ad files In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 15:34:18 GMT, Manuel Hässig wrote: > This PR fixes some trailing spaces in `x86_64.ad`. > > Testing: > - [ ] Github Actions Thank you for your reviews. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26048#issuecomment-3022106129 From mhaessig at openjdk.org Tue Jul 1 06:50:47 2025 From: mhaessig at openjdk.org (Manuel Hässig) Date: Tue, 1 Jul 2025 06:50:47 GMT Subject: Integrated: 8361092: Remove trailing spaces in x86 ad files In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 15:34:18 GMT, Manuel Hässig wrote: > This PR fixes some trailing spaces in `x86_64.ad`. > > Testing: > - [ ] Github Actions This pull request has now been integrated. Changeset: b32ccf2c Author: Manuel Hässig URL: https://git.openjdk.org/jdk/commit/b32ccf2cb23e0180187f4238140583a923fc27c4 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod 8361092: Remove trailing spaces in x86 ad files Reviewed-by: kvn, sviswanathan ------------- PR: https://git.openjdk.org/jdk/pull/26048 From mhaessig at openjdk.org Tue Jul 1 06:52:32 2025 From: mhaessig at openjdk.org (Manuel Hässig) Date: Tue, 1 Jul 2025 06:52:32 GMT Subject: RFR: 8360641: TestCompilerCounts fails after 8354727 [v4] In-Reply-To: <3mMrDF_446r7HudbsHIpdoWByBlnUpjFo7YzIty0KG8=.facc058f-3975-44c4-b2d4-93b8c64db185@github.com> References: <3mMrDF_446r7HudbsHIpdoWByBlnUpjFo7YzIty0KG8=.facc058f-3975-44c4-b2d4-93b8c64db185@github.com> Message-ID: > After integrating #25872 the calculation of the`CICompilerCount` ergonomic became dependent on the size of `NonNMethodCodeHeapSize`, which itself is an ergonomic based on the available memory. Thus, depending on the system, the test `compiler/arguments/TestCompilerCounts.java` failed, i.e. locally this failed, but not on CI servers. > > This PR changes the test to reflect the changes introduced in #25872.
> > Testing: > - [ ] [Github Actions](https://github.com/mhaessig/jdk/actions/runs/15932906313) > - [x] tier1,tier2 plus Oracle internal testing Manuel H?ssig has updated the pull request incrementally with one additional commit since the last revision: Fix whitespace Co-authored-by: Andrey Turbanov ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26024/files - new: https://git.openjdk.org/jdk/pull/26024/files/8beb5898..71767802 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26024&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26024&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26024.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26024/head:pull/26024 PR: https://git.openjdk.org/jdk/pull/26024 From mhaessig at openjdk.org Tue Jul 1 06:52:33 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 1 Jul 2025 06:52:33 GMT Subject: RFR: 8360641: TestCompilerCounts fails after 8354727 [v3] In-Reply-To: References: <3mMrDF_446r7HudbsHIpdoWByBlnUpjFo7YzIty0KG8=.facc058f-3975-44c4-b2d4-93b8c64db185@github.com> Message-ID: On Mon, 30 Jun 2025 19:48:44 GMT, Andrey Turbanov wrote: >> Manuel H?ssig has updated the pull request incrementally with two additional commits since the last revision: >> >> - Remove superfluous newline >> - Add copyright > > test/hotspot/jtreg/compiler/arguments/TestCompilerCounts.java line 159: > >> 157: // Tiered modes >> 158: int tieredCount = heuristicCount(cpus, Compilation.Tiered, debug); >> 159: pass(tieredCount, opt, "-XX:NonNMethodCodeHeapSize=" + NonNMethodCodeHeapSize); > > Suggestion: > > pass(tieredCount, opt, "-XX:NonNMethodCodeHeapSize=" + NonNMethodCodeHeapSize); Good catch, thank you. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26024#discussion_r2176568786 From epeter at openjdk.org Tue Jul 1 06:55:41 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 1 Jul 2025 06:55:41 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: <7-WqNSzjPLOsHJ4DHogxqbiInl8TIz5sxIEXbIfo2OQ=.912568b8-830d-47cc-a837-46af6be618f3@github.com> References: <7-WqNSzjPLOsHJ4DHogxqbiInl8TIz5sxIEXbIfo2OQ=.912568b8-830d-47cc-a837-46af6be618f3@github.com> Message-ID: On Tue, 1 Jul 2025 06:41:32 GMT, Xiaohong Gong wrote: >> Ping again! Thanks in advance! > >> @XiaohongGong I'm a little busy at the moment, and soon going on a summer vacation, so I cannot promise a full review soon. Feel free to ask someone else to have a look. >> >> I quickly looked through your new benchmark results you published after integration of #25539. There seem to still be a few cases where `Gain < 1`. Especially: >> >> ``` >> GatherOperationsBenchmark.microShortGather512_MASK 256 thrpt 30 ops/ms 11587.465 10674.598 0.92 >> GatherOperationsBenchmark.microShortGather512_MASK 1024 thrpt 30 ops/ms 2902.731 2629.739 0.90 >> GatherOperationsBenchmark.microShortGather512_MASK 4096 thrpt 30 ops/ms 741.546 671.124 0.90 >> ``` >> >> and >> >> ``` >> GatherOperationsBenchmark.microShortGather256_MASK 256 thrpt 30 ops/ms 11339.217 10951.141 0.96 >> GatherOperationsBenchmark.microShortGather256_MASK 1024 thrpt 30 ops/ms 2840.081 2718.823 0.95 >> GatherOperationsBenchmark.microShortGather256_MASK 4096 thrpt 30 ops/ms 725.334 696.343 0.96 >> ``` >> >> and >> >> ``` >> GatherOperationsBenchmark.microByteGather512_MASK 64 thrpt 30 ops/ms 50588.210 48220.741 0.95 >> ``` >> >> Do you know what happens in those cases? 
> > Thanks for your input! Yes, I spent some time making an analysis on these little regressions. Seems there are the architecture HW influences like the cache miss or code alignment. I tried with a larger loop alignment like 32, and the performance will be improved and regressions are gone. Since I'm not quite familiar with X86 architectures, I'm not sure of the exact point. Any suggestions on that? @XiaohongGong Maybe someone from Intel (@jatin-bhateja @sviswa7) can help you with the x86 specific issues. You could always use hardware counters to measure cache misses. Also if the vectors are not cache-line aligned, there may be split loads or stores. Also that can be measured with hardware counters. Maybe the benchmark needs to be improved somehow, to account for issues with alignment. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3022132271 From dfenacci at openjdk.org Tue Jul 1 06:58:40 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Tue, 1 Jul 2025 06:58:40 GMT Subject: RFR: 8360783: CTW: Skip deoptimization between tiers [v2] In-Reply-To: References: Message-ID: On Fri, 27 Jun 2025 08:38:31 GMT, Aleksey Shipilev wrote: >> When profiling CTW runs, I noticed we spend a lot of time dealing with deoptimization. We do this excessively, deoptimizing before compilation on every tier. This is excessive: Hotspot honors compilation requests on subsequent levels without the need for explicit deoptimization. Not doing deopt between tiers greatly improves CTW performance. >> >> A taste of improvements, about 15% less CPU spent: >> >> >> $ time make test TEST=applications/ctw/modules >> >> # Current >> real 5m1.616s >> user 79m41.398s >> sys 14m39.607s >> >> # Patched >> real 3m55.411s >> user 69m19.227s >> sys 5m24.323s >> >> >> The compilation still works as expected, progressing through tiers 1..4: >> >> >> $ JAVA_OPTIONS="-XX:+PrintCompilation -XX:CICompilerCount=2" ./ctw.sh modules:jdk.compiler | tee out >> ... >> $ grep sun.tools.serialver.resources.serialver_de::getContents out >> 101783 55033 b 1 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101785 55036 b 2 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101786 55033 1 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used >> 101786 55038 b 3 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101787 55036 2 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used >> 101792 55040 b 4 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101797 55038 3 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used >> 101798 55040 4 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: marked for deoptimization > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Update test/hotspot/jtreg/testlibrary/ctw/src/sun/hotspot/tools/ctw/Compiler.java > > Co-authored-by: Tobias Hartmann Thanks @shipilev! I really welcome any change that makes CTW a bit faster ? Looks good to me. ------------- Marked as reviewed by dfenacci (Committer). 
PR Review: https://git.openjdk.org/jdk/pull/26013#pullrequestreview-2973784592 From xgong at openjdk.org Tue Jul 1 07:02:48 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 1 Jul 2025 07:02:48 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v4] In-Reply-To: References: Message-ID: <3sWLk_sAMLtcvRUjXk9hYe-K2MBQl9fH2Qg0MF7lwDk=.b8867d51-e822-43c0-93ab-58228c6eb1d5@github.com> On Mon, 30 Jun 2025 13:25:09 GMT, Mikhail Ablakatov wrote: >> Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. >> >> Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still. >> >> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reducion micro-benchmarks. >> >> Benchmarks results: >> >> Neoverse-V1 (SVE 256-bit) >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms >> >> >> Fujitsu A64FX (SVE 512-bit): >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms > > Mikhail Ablakatov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: > > - cleanup: address nits, rename several symbols > - cleanup: remove unreferenced definitions > - Address review comments. 
> > - fixup: disable FP mul reduction auto-vectorization for all targets > - fixup: add a tmp vReg to reduce_mul_integral_gt128b and > reduce_non_strict_order_mul_fp_gt128bto keep vsrc unmodified > - cleanup: replace a complex lambda in the above methods with a loop > - cleanup: rename symbols to follow the existing naming convention > - cleanup: add asserts to SVE only instructions > - split mul FP reduction instructions into strictly-ordered (default) > and explicitly non strictly-ordered > - remove redundant conditions in TestVectorFPReduction.java > > Benchmarks results: > > Neoverse-V1 (SVE 256-bit) > > | Benchmark | Before | After | Units | Diff | > |---------------------------|----------|----------|--------|-------| > | ByteMaxVector.MULLanes | 619.156 | 9884.578 | ops/ms | 1496% | > | DoubleMaxVector.MULLanes | 184.693 | 2712.051 | ops/ms | 1368% | > | FloatMaxVector.MULLanes | 277.818 | 3388.038 | ops/ms | 1119% | > | IntMaxVector.MULLanes | 371.225 | 4765.434 | ops/ms | 1183% | > | LongMaxVector.MULLanes | 205.149 | 2672.975 | ops/ms | 1203% | > | ShortMaxVector.MULLanes | 472.804 | 5122.917 | ops/ms | 984% | > - Merge branch 'master' into 8343689-rebase > - fixup: don't modify the value in vsrc > > Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this > change, the result of recursive folding is held in vtmp1. To be able to > pass this intermediate result to reduce_mul_integral_le128b(), we would > have to use another temporary FloatRegister, as vtmp1 would essentially > act as vsrc. It's possible to get around this however: > reduce_mul_integral_le128b() is modified so it's possible to pass > matching vsrc and vtmp2 arguments. By doing this, we save ourselves a > temporary register in rules that match to reduce_mul_integral_gt128b(). > - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formating > - Use EXT instead of COMPACT to split a vector into two halves > > Benchmarks results: > > Neoverse-V1 (SVE 256-bit) > > Benchmark (size) Mode master PR Units > ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms > Short... src/hotspot/cpu/aarch64/aarch64_vector.ad line 3536: > 3534: > 3535: instruct reduce_mulF_gt128b(vRegF dst, vRegF fsrc, vReg vsrc, vReg tmp) %{ > 3536: predicate(Matcher::vector_length_in_bytes(n->in(2)) > 16 && n->as_Reduction()->requires_strict_order()); Are there the cases that can match with this rule? src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2097: > 2095: sve_movprfx(vtmp1, vsrc); // copy > 2096: sve_ext(vtmp1, vtmp1, vector_length_in_bytes / 2); // swap halves > 2097: sve_mul(vtmp1, elemType_to_regVariant(bt), pgtmp, vsrc); // multiply halves > sve_mul(vtmp1, elemType_to_regVariant(bt), pgtmp, vsrc); Can we use `ptrue` instread of `pgtmp` here? The higher bits can be computed, but they have not influences to the final results, right? src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2106: > 2104: sve_mul(vtmp1, elemType_to_regVariant(bt), pgtmp, vtmp2); // multiply halves > 2105: vector_length_in_bytes = vector_length_in_bytes / 2; > 2106: vector_length = vector_length / 2; I guess you want to update the `pgtmp` with new `vector_length`? But seems the code is missing. Anyway, maybe the it's not necessary to generate a predicate as I commented above. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2176590314 PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2176584327 PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2176587011 From xgong at openjdk.org Tue Jul 1 07:10:41 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 1 Jul 2025 07:10:41 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: <7-WqNSzjPLOsHJ4DHogxqbiInl8TIz5sxIEXbIfo2OQ=.912568b8-830d-47cc-a837-46af6be618f3@github.com> References: <7-WqNSzjPLOsHJ4DHogxqbiInl8TIz5sxIEXbIfo2OQ=.912568b8-830d-47cc-a837-46af6be618f3@github.com> Message-ID: On Tue, 1 Jul 2025 06:41:32 GMT, Xiaohong Gong wrote: >> Ping again! Thanks in advance! > >> @XiaohongGong I'm a little busy at the moment, and soon going on a summer vacation, so I cannot promise a full review soon. Feel free to ask someone else to have a look. >> >> I quickly looked through your new benchmark results you published after integration of #25539. There seem to still be a few cases where `Gain < 1`. Especially: >> >> ``` >> GatherOperationsBenchmark.microShortGather512_MASK 256 thrpt 30 ops/ms 11587.465 10674.598 0.92 >> GatherOperationsBenchmark.microShortGather512_MASK 1024 thrpt 30 ops/ms 2902.731 2629.739 0.90 >> GatherOperationsBenchmark.microShortGather512_MASK 4096 thrpt 30 ops/ms 741.546 671.124 0.90 >> ``` >> >> and >> >> ``` >> GatherOperationsBenchmark.microShortGather256_MASK 256 thrpt 30 ops/ms 11339.217 10951.141 0.96 >> GatherOperationsBenchmark.microShortGather256_MASK 1024 thrpt 30 ops/ms 2840.081 2718.823 0.95 >> GatherOperationsBenchmark.microShortGather256_MASK 4096 thrpt 30 ops/ms 725.334 696.343 0.96 >> ``` >> >> and >> >> ``` >> GatherOperationsBenchmark.microByteGather512_MASK 64 thrpt 30 ops/ms 50588.210 48220.741 0.95 >> ``` >> >> Do you know what happens in those cases? > > Thanks for your input! Yes, I spent some time making an analysis on these little regressions. Seems there are the architecture HW influences like the cache miss or code alignment. I tried with a larger loop alignment like 32, and the performance will be improved and regressions are gone. Since I'm not quite familiar with X86 architectures, I'm not sure of the exact point. Any suggestions on that? > @XiaohongGong Maybe someone from Intel (@jatin-bhateja @sviswa7) can help you with the x86 specific issues. You could always use hardware counters to measure cache misses. Also if the vectors are not cache-line aligned, there may be split loads or stores. Also that can be measured with hardware counters. Maybe the benchmark needs to be improved somehow, to account for issues with alignment. I also tried to measure cache misses with perf on my x86 machine, and I noticed the cache miss is increased. The generated code layout of the test/benchmark is changed with my changes in Java side, so I guess maybe the alignment is different with before. To verify my thought, I used the vm option `-XX:OptoLoopAlignment=32`, and the performance can be improved a lot compared with the version without my change. So I think the patch itself maybe acceptable even we noticed minor regressions. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3022195040 From bmaillard at openjdk.org Tue Jul 1 07:11:42 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Tue, 1 Jul 2025 07:11:42 GMT Subject: RFR: 8359602: Ideal optimizations depending on input type are missed because of missing notification mechanism from CCP [v2] In-Reply-To: <0MJe_8nA-ILWqoVG-9rzuq5Pe9xX-FG2LN3k9Cy8nqU=.d724c6cf-cb02-45c4-95a4-5bd1fef7462b@github.com> References: <-_MqCH6QmE-o_d7c9-aet-Cq-ptZJ6CZV6rodpDNWq0=.173e6f7a-3cfe-4791-8253-36e06d892069@github.com> <3cLLB7fms3S4WgqOVeb7D_ZDRFsJ_-ca3qfALlmzFeU=.1002ac91-1e35-4499-9d88-6d1f76c955d0@github.com> <0MJe_8nA-ILWqoVG-9rzuq5Pe9xX-FG2LN3k9Cy8nqU=.d724c6cf-cb02-45c4-95a4-5bd1fef7462b@github.com> Message-ID: On Mon, 30 Jun 2025 13:52:01 GMT, Emanuel Peter wrote: > @benoitmaillard Very nice work, and great description :) Thank you! > > Did you check if this allows enabling any of the other disabled verifications from [JDK-8347273](https://bugs.openjdk.org/browse/JDK-8347273)? > > That may be a lot of work. Not sure if it is worth checking all of them now. @TobiHartmann how much should he invest in this now? An alternative is just tackling all the other cases later. What do you think? I have started to take a look at this and it seems that there are a lot of cases to check indeed. > @benoitmaillard One more open question for me: `raise_bottom_type` only sets the node internal `_type`. But in IGVN, we do not read from `_type` but `phase->type(in(2))`. Do you know when the `phase->type(in(2))` value changes? Is that also during CCP? Before or after the `_type` is modified? Yes, good point, I should I have mentioned this somewhere. The `phase->type(in(2))` call uses the type array from `PhaseValues`. The type array entry is actually modified earlier, in `PhaseCCP::analyze`, right after the `Value` call. You can see the `set_type` call [here](https://github.com/benoitmaillard/jdk/blob/75de51dff6d9cc3e9764737b29b9358992b488b7/src/hotspot/share/opto/phaseX.cpp#L2765). When this happens, users are added to the (local) worklist but again it does not change our issue as only value optimizations occur in that context. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26017#issuecomment-3022192988 From shade at openjdk.org Tue Jul 1 07:41:40 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 1 Jul 2025 07:41:40 GMT Subject: RFR: 8360783: CTW: Skip deoptimization between tiers [v2] In-Reply-To: References: Message-ID: On Fri, 27 Jun 2025 08:38:31 GMT, Aleksey Shipilev wrote: >> When profiling CTW runs, I noticed we spend a lot of time dealing with deoptimization. We do this excessively, deoptimizing before compilation on every tier. This is excessive: Hotspot honors compilation requests on subsequent levels without the need for explicit deoptimization. Not doing deopt between tiers greatly improves CTW performance. >> >> A taste of improvements, about 15% less CPU spent: >> >> >> $ time make test TEST=applications/ctw/modules >> >> # Current >> real 5m1.616s >> user 79m41.398s >> sys 14m39.607s >> >> # Patched >> real 3m55.411s >> user 69m19.227s >> sys 5m24.323s >> >> >> The compilation still works as expected, progressing through tiers 1..4: >> >> >> $ JAVA_OPTIONS="-XX:+PrintCompilation -XX:CICompilerCount=2" ./ctw.sh modules:jdk.compiler | tee out >> ... 
>> $ grep sun.tools.serialver.resources.serialver_de::getContents out >> 101783 55033 b 1 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101785 55036 b 2 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101786 55033 1 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used >> 101786 55038 b 3 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101787 55036 2 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used >> 101792 55040 b 4 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101797 55038 3 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used >> 101798 55040 4 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: marked for deoptimization > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Update test/hotspot/jtreg/testlibrary/ctw/src/sun/hotspot/tools/ctw/Compiler.java > > Co-authored-by: Tobias Hartmann Thanks! Here goes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26013#issuecomment-3022327111 From thartmann at openjdk.org Tue Jul 1 07:47:40 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 1 Jul 2025 07:47:40 GMT Subject: RFR: 8360783: CTW: Skip deoptimization between tiers [v2] In-Reply-To: References: Message-ID: On Fri, 27 Jun 2025 08:38:31 GMT, Aleksey Shipilev wrote: >> When profiling CTW runs, I noticed we spend a lot of time dealing with deoptimization. We do this excessively, deoptimizing before compilation on every tier. This is excessive: Hotspot honors compilation requests on subsequent levels without the need for explicit deoptimization. Not doing deopt between tiers greatly improves CTW performance. >> >> A taste of improvements, about 15% less CPU spent: >> >> >> $ time make test TEST=applications/ctw/modules >> >> # Current >> real 5m1.616s >> user 79m41.398s >> sys 14m39.607s >> >> # Patched >> real 3m55.411s >> user 69m19.227s >> sys 5m24.323s >> >> >> The compilation still works as expected, progressing through tiers 1..4: >> >> >> $ JAVA_OPTIONS="-XX:+PrintCompilation -XX:CICompilerCount=2" ./ctw.sh modules:jdk.compiler | tee out >> ... >> $ grep sun.tools.serialver.resources.serialver_de::getContents out >> 101783 55033 b 1 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101785 55036 b 2 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101786 55033 1 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used >> 101786 55038 b 3 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101787 55036 2 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used >> 101792 55040 b 4 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101797 55038 3 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used >> 101798 55040 4 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: marked for deoptimization > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Update test/hotspot/jtreg/testlibrary/ctw/src/sun/hotspot/tools/ctw/Compiler.java > > Co-authored-by: Tobias Hartmann Marked as reviewed by thartmann (Reviewer). 
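To make the reviewed change above concrete: the CTW harness drives compilation through the WhiteBox API, and the idea is to request the next tier directly instead of deoptimizing first. A rough sketch of that pattern follows; it is illustrative only and not the actual `Compiler.java` code:

```java
import java.lang.reflect.Executable;
import jdk.test.whitebox.WhiteBox;

// Illustrative sketch of requesting successive compilation tiers without
// deoptimizing in between. Running it needs the WhiteBox test library on the
// boot class path plus -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI.
public class TieredEnqueueSketch {
    private static final WhiteBox WB = WhiteBox.getWhiteBox();

    static void compileThroughTiers(Executable method) {
        for (int level = 1; level <= 4; level++) {
            // No WB.deoptimizeMethod(method) between tiers: HotSpot honors a
            // request at a higher tier without the previous code being thrown away.
            WB.enqueueMethodForCompilation(method, level);
        }
    }
}
```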
------------- PR Review: https://git.openjdk.org/jdk/pull/26013#pullrequestreview-2973981844 From shade at openjdk.org Tue Jul 1 08:02:45 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 1 Jul 2025 08:02:45 GMT Subject: RFR: 8360783: CTW: Skip deoptimization between tiers [v2] In-Reply-To: References: Message-ID: On Fri, 27 Jun 2025 08:38:31 GMT, Aleksey Shipilev wrote: >> When profiling CTW runs, I noticed we spend a lot of time dealing with deoptimization. We do this excessively, deoptimizing before compilation on every tier. This is excessive: Hotspot honors compilation requests on subsequent levels without the need for explicit deoptimization. Not doing deopt between tiers greatly improves CTW performance. >> >> A taste of improvements, about 15% less CPU spent: >> >> >> $ time make test TEST=applications/ctw/modules >> >> # Current >> real 5m1.616s >> user 79m41.398s >> sys 14m39.607s >> >> # Patched >> real 3m55.411s >> user 69m19.227s >> sys 5m24.323s >> >> >> The compilation still works as expected, progressing through tiers 1..4: >> >> >> $ JAVA_OPTIONS="-XX:+PrintCompilation -XX:CICompilerCount=2" ./ctw.sh modules:jdk.compiler | tee out >> ... >> $ grep sun.tools.serialver.resources.serialver_de::getContents out >> 101783 55033 b 1 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101785 55036 b 2 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101786 55033 1 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used >> 101786 55038 b 3 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101787 55036 2 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used >> 101792 55040 b 4 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) >> 101797 55038 3 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used >> 101798 55040 4 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: marked for deoptimization > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Update test/hotspot/jtreg/testlibrary/ctw/src/sun/hotspot/tools/ctw/Compiler.java > > Co-authored-by: Tobias Hartmann Aw. Thanks! Here goes again. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26013#issuecomment-3022410574 From shade at openjdk.org Tue Jul 1 08:02:45 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 1 Jul 2025 08:02:45 GMT Subject: Integrated: 8360783: CTW: Skip deoptimization between tiers In-Reply-To: References: Message-ID: On Fri, 27 Jun 2025 08:19:34 GMT, Aleksey Shipilev wrote: > When profiling CTW runs, I noticed we spend a lot of time dealing with deoptimization. We do this excessively, deoptimizing before compilation on every tier. This is excessive: Hotspot honors compilation requests on subsequent levels without the need for explicit deoptimization. Not doing deopt between tiers greatly improves CTW performance. > > A taste of improvements, about 15% less CPU spent: > > > $ time make test TEST=applications/ctw/modules > > # Current > real 5m1.616s > user 79m41.398s > sys 14m39.607s > > # Patched > real 3m55.411s > user 69m19.227s > sys 5m24.323s > > > The compilation still works as expected, progressing through tiers 1..4: > > > $ JAVA_OPTIONS="-XX:+PrintCompilation -XX:CICompilerCount=2" ./ctw.sh modules:jdk.compiler | tee out > ... 
> $ grep sun.tools.serialver.resources.serialver_de::getContents out > 101783 55033 b 1 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) > 101785 55036 b 2 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) > 101786 55033 1 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used > 101786 55038 b 3 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) > 101787 55036 2 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used > 101792 55040 b 4 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) > 101797 55038 3 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: not used > 101798 55040 4 sun.tools.serialver.resources.serialver_de::getContents (108 bytes) made not entrant: marked for deoptimization This pull request has now been integrated. Changeset: cd6caedd Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/cd6caedd0a3c9ebd4c8c57e64f62b60161c5cd7c Stats: 8 lines in 1 file changed: 6 ins; 1 del; 1 mod 8360783: CTW: Skip deoptimization between tiers Reviewed-by: thartmann, mhaessig, dfenacci ------------- PR: https://git.openjdk.org/jdk/pull/26013 From eastigeevich at openjdk.org Tue Jul 1 08:08:49 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 1 Jul 2025 08:08:49 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: Message-ID: <0TjXtL5ABEBUwmu1VlJ9kNDs95zi8HGA-S2A0BU9GeY=.2fa893f4-96c4-4761-91b9-3b6250212c7a@github.com> On Thu, 26 Jun 2025 16:20:44 GMT, Chad Rakoczy wrote: >> src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp line 90: >> >>> 88: // Patch the constant in the call's trampoline stub. >>> 89: address trampoline_stub_addr = get_trampoline(); >>> 90: if (trampoline_stub_addr != nullptr && dest != trampoline_stub_addr) { >> >> I think you will not need the checks if you rewrite the code as follows: >> ```c++ >> address addr_call = ...; >> assert(); >> >> if (!Assembler::reachable_from_branch_at(addr_call, dest)) { >> address trampoline_stub_addr = get_trampoline(); >> assert (trampoline_stub_addr != nullptr, "we need a trampoline"); >> assert (! is_NativeCallTrampolineStub_at(dest), "chained trampolines"); >> nativeCallTrampolineStub_at(trampoline_stub_addr)->set_destination(dest); >> dest = trampoline_stub_addr; >> } >> set_destination(dest); >> ICache::invalidate_range(addr_call, instruction_size); >> >> >> If `dest` is a trampoline in the current nmethod, it is always reachable. So you will not go into setting trampoline's target to itself. Also we will call `get_trampoline`, which involves `CodeCache::find_blob` and ` a traversal of relocations, only if we need a trampoline. > > I would need to check the assumptions that other callers make about this function. In the current state it updates the trampoline regardless if the branch is reachable or not. With your change it would require the caller to also update the trampoline to make sure it is not stale. @theRealAph When we don't need a trampoline (a call site is a direct call), we update the trampoline to have the same destination as the call site. I have not found places in Hotspot relying on this. Do you remember why we are doing this? Is it Ok not to update trampolines in the case of reachable destinations? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2176748370 From aph at openjdk.org Tue Jul 1 08:13:41 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 1 Jul 2025 08:13:41 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 05:59:15 GMT, Xiaohong Gong wrote: > ### Background > On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. > > For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. > > To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. > > ### Impact Analysis > #### 1. Vector types > Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. > > #### 2. Vector API > No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. > > #### 3. Auto-vectorization > Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. > > #### 4. Codegen of vector nodes > NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. > > Details: > - Lanewise vector operations are unaffected as explained above. > - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). > - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, adding an explicit vector size check in `match_rule_s... src/hotspot/cpu/aarch64/aarch64.ad line 2371: > 2369: switch(bt) { > 2370: case T_BOOLEAN: > 2371: // It needs to load/store a vector mask with only 2 elements Suggestion: // Load/store a vector mask with only 2 elements Same with the other cases. src/hotspot/cpu/aarch64/aarch64.ad line 2386: > 2384: break; > 2385: default: > 2386: // Limit the min vector length to 64-bit normally. Suggestion: // Limit the min vector length to 64-bit. src/hotspot/cpu/aarch64/aarch64_vector.ad line 199: > 197: case Op_MaxReductionV: > 198: // Reductions with less than 8 bytes vector length are > 199: // not supported for now. 
Suggestion: // not supported. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26057#discussion_r2176759967 PR Review Comment: https://git.openjdk.org/jdk/pull/26057#discussion_r2176761846 PR Review Comment: https://git.openjdk.org/jdk/pull/26057#discussion_r2176762709 From aph at openjdk.org Tue Jul 1 08:30:48 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 1 Jul 2025 08:30:48 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: <0TjXtL5ABEBUwmu1VlJ9kNDs95zi8HGA-S2A0BU9GeY=.2fa893f4-96c4-4761-91b9-3b6250212c7a@github.com> References: <0TjXtL5ABEBUwmu1VlJ9kNDs95zi8HGA-S2A0BU9GeY=.2fa893f4-96c4-4761-91b9-3b6250212c7a@github.com> Message-ID: On Tue, 1 Jul 2025 08:05:50 GMT, Evgeny Astigeevich wrote: > @theRealAph When we don't need a trampoline (a call site is a direct call), we update the trampoline to have the same destination as the call site. Yes, that's fundamental to the design. > I have not found places in Hotspot relying on this. Do you remember why we are doing this? Is it Ok not to update trampolines in the case of reachable destinations? No. We always keep the trampoline up to date so that we don't have to deal with a race condition when patching trampoline calls. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2176812614 From aph at openjdk.org Tue Jul 1 08:34:48 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 1 Jul 2025 08:34:48 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: <0TjXtL5ABEBUwmu1VlJ9kNDs95zi8HGA-S2A0BU9GeY=.2fa893f4-96c4-4761-91b9-3b6250212c7a@github.com> Message-ID: On Tue, 1 Jul 2025 08:28:00 GMT, Andrew Haley wrote: >> @theRealAph When we don't need a trampoline (a call site is a direct call), we update the trampoline to have the same destination as the call site. I have not found places in Hotspot relying on this. >> Do you remember why we are doing this? Is it Ok not to update trampolines in the case of reachable destinations? > >> @theRealAph When we don't need a trampoline (a call site is a direct call), we update the trampoline to have the same destination as the call site. > > Yes, that's fundamental to the design. > >> I have not found places in Hotspot relying on this. Do you remember why we are doing this? Is it Ok not to update trampolines in the case of reachable destinations? > > No. We always keep the trampoline up to date so that we don't have to deal with a race condition when patching trampoline calls. Please read the comments which begin: `AArch64 OpenJDK uses four different types of calls:` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2176825935 From xgong at openjdk.org Tue Jul 1 08:35:42 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 1 Jul 2025 08:35:42 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 08:10:16 GMT, Andrew Haley wrote: >> ### Background >> On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. 
>> >> For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. >> >> To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. >> >> ### Impact Analysis >> #### 1. Vector types >> Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. >> >> #### 2. Vector API >> No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. >> >> #### 3. Auto-vectorization >> Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. >> >> #### 4. Codegen of vector nodes >> NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. >> >> Details: >> - Lanewise vector operations are unaffected as explained above. >> - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). >> - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, addin... > > src/hotspot/cpu/aarch64/aarch64.ad line 2371: > >> 2369: switch(bt) { >> 2370: case T_BOOLEAN: >> 2371: // It needs to load/store a vector mask with only 2 elements > > Suggestion: > > // Load/store a vector mask with only 2 elements > > Same with the other cases. Thanks so much for your comment. I will fix them soon. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26057#discussion_r2176831961 From aph at openjdk.org Tue Jul 1 08:40:55 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 1 Jul 2025 08:40:55 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: Message-ID: On Wed, 25 Jun 2025 22:32:24 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. 
New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 90 commits: > > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Update how call sites are fixed > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Fix pointer printing > - Use set_destination_mt_safe > - Print address as pointer > - Use new _metadata_size instead of _jvmci_data_size > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Only check branch distance for aarch64 and riscv > - Move far branch fix to fix_relocation_after_move > - ... and 80 more: https://git.openjdk.org/jdk/compare/f799cf18...70e4164e src/hotspot/cpu/aarch64/relocInfo_aarch64.cpp line 84: > 82: if (NativeCall::is_call_at(addr())) { > 83: NativeCall* call = nativeCall_at(addr()); > 84: if (be_safe) { Why is this change necessary? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2176847208 From aph at openjdk.org Tue Jul 1 08:44:51 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 1 Jul 2025 08:44:51 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: Message-ID: On Wed, 25 Jun 2025 22:32:24 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 90 commits: > > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Update how call sites are fixed > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Fix pointer printing > - Use set_destination_mt_safe > - Print address as pointer > - Use new _metadata_size instead of _jvmci_data_size > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Only check branch distance for aarch64 and riscv > - Move far branch fix to fix_relocation_after_move > - ... 
and 80 more: https://git.openjdk.org/jdk/compare/f799cf18...70e4164e src/hotspot/cpu/aarch64/relocInfo_aarch64.cpp line 117: > 115: } > 116: > 117: void poll_Relocation::fix_relocation_after_move(const CodeBuffer* src, CodeBuffer* dest, bool is_nmethod_relocation) { Suggestion: void poll_Relocation::fix_relocation_after_move(const CodeBuffer* src, CodeBuffer* dest, bool) { ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2176861287 From mhaessig at openjdk.org Tue Jul 1 09:11:32 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 1 Jul 2025 09:11:32 GMT Subject: RFR: 8308094: Add a compilation timeout flag to catch long running compilations [v2] In-Reply-To: References: Message-ID: <_Ye19u_7PlqlsoRSuR0dNeAGbeuHyN_oqD1ZS4q9Nvk=.b94fd29d-d43e-4561-9926-7f5a46434d8e@github.com> > This PR adds `-XX:CompileTaskTimeout` on Linux to limit the amount of time a compilation task can run. The goal of this is initially to be able to find and investigate long-running compilations. > > The timeout is implemented using a POSIX timer that sends a `SIGALRM` to the compiler thread the compile task is running on. Each compiler thread registers a signal handler that triggers an assert upon receiving `SIGALRM`. This is currently only implemented for Linux, because it relies on `SIGEV_THREAD_ID` to get the signal delivered to the same thread that timed out. > > Since `SIGALRM` is now used, the test `runtime/signal/TestSigalrm.java` now requires `vm.flagless` so it will not interfere with the compiler thread signal handlers. > > Testing: > - [ ] Github Actions > - [x] tier1, tier2 on all platforms > - [x] tier3, tier4 and Oracle internal testing on Linux fastdebug > - [x] tier1 through tier4 with `-XX:CompileTaskTimeout=60000` (one minute timeout) to see what fails (`compiler/codegen/TestAntiDependenciesHighMemUsage2.java`, `compiler/loopopts/TestMaxLoopOptsCountReached.java`, and `compiler/c2/TestScalarReplacementMaxLiveNodes.java` fail) Manuel H?ssig has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - Merge branch 'master' into JDK-8308094-timeout - Fix SIGALRM test - Add timeout functionality to compiler threads ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26023/files - new: https://git.openjdk.org/jdk/pull/26023/files/09e0e58c..5840cc2e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26023&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26023&range=00-01 Stats: 4936 lines in 244 files changed: 2913 ins; 773 del; 1250 mod Patch: https://git.openjdk.org/jdk/pull/26023.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26023/head:pull/26023 PR: https://git.openjdk.org/jdk/pull/26023 From duke at openjdk.org Tue Jul 1 09:20:43 2025 From: duke at openjdk.org (duke) Date: Tue, 1 Jul 2025 09:20:43 GMT Subject: RFR: 8358179: Performance regression in Math.cbrt [v2] In-Reply-To: References: <45l5EvxoRINI1_Ep2_snJzKNMPo4-dPXADalLN1fq1Y=.9f697a35-ee7b-4e7a-9e5e-ff33911b3b21@github.com> Message-ID: On Fri, 27 Jun 2025 01:43:16 GMT, Mohamed Issa wrote: >> The changes described below are meant to resolve the performance regression introduced by the **x86_64 cbrt** double precision floating point scalar intrinsic in #24470. >> >> 1. Check for +0, -0, +INF, -INF, and NaN before any other input values. >> 2. 
If these special values are found, return immediately with minimal modifications to the result register. >> 3. Performance testing shows the modified intrinsic improves throughput by 65.1% over the original intrinsic on average for the special values while throughput drops by 5.5% for the normal value range (-INF, -2^(-1022)], [2^(-1022), INF). >> >> The commands to run all relevant micro-benchmarks are posted below. >> >> `make test TEST="micro:CbrtPerf.CbrtPerfRanges"` >> `make test TEST="micro:CbrtPerf.CbrtPerfSpecialValues"` >> >> The results of all tests posted below were captured with an [Intel? Xeon 8488C](https://www.intel.com/content/www/us/en/products/sku/231730/intel-xeon-platinum-8480c-processor-105m-cache-2-00-ghz/specifications.html) using [OpenJDK v26-b1](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B1) as the baseline version. The term _baseline1_ refers to runs with the intrinsic enabled and _baseline2_ refers to runs with the intrinsic disabled. >> >> Each result is the mean of 8 individual runs, and the input ranges used match those from the original Java implementation. Overall, the changes provide a significant uplift over _baseline1_ except for a mild regression in the (**2^(-1022) <= |x| < INF**) input range, which is expected due to the extra checks. When comparing against _baseline2_, the modified intrinsic significantly still outperforms for the inputs (**-INF < x < INF**) that require heavy compute. However, the special value inputs that trigger fast path returns still perform better with _baseline2_. >> >> | Input range(s) | Baseline1 (ops/ms) | Change (ops/ms) | Change vs baseline1 (%) | >> | :-------------------------------------: | :-------------------: | :------------------: | :--------------------------: | >> | [-2^(-1022), 2^(-1022)] | 18470 | 20847 | +12.87 | >> | (-INF, -2^(-1022)], [2^(-1022), INF) | 210538 | 198925 | -5.52 | >> | [0] | 344990 | 627561 | +81.91 | >> | [-0] ... > > Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: > > Ensure ABS_MASK is a 128-bit memory sized location and only use equal enum for UCOMISD checks @missa-prime Your change (at version 615169d8aa679c665ac4c5ad30ea011505e503b7) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25962#issuecomment-3022902863 From mhaessig at openjdk.org Tue Jul 1 09:34:50 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 1 Jul 2025 09:34:50 GMT Subject: RFR: 8361086: JVMCIGlobals::check_jvmci_flags_are_consistent has incorrect format string In-Reply-To: References: Message-ID: <5kdHAQ86j5eDq6OgIb6Bn7HFWxgc24W8ywubudeGa-Q=.5d8b392a-de5c-49d7-a3f2-3ade541c6643@github.com> On Mon, 30 Jun 2025 16:14:08 GMT, Kim Barrett wrote: > Please review this trivial fix of a format string. The value being printed is > TieredStopAtLevel, which is of type intx, so "%zd" should be used instead of "%d". > > Testing: mach5 tier1 Looks good and trivial to me. ------------- Marked as reviewed by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/26051#pullrequestreview-2974517185 From yzheng at openjdk.org Tue Jul 1 09:38:48 2025 From: yzheng at openjdk.org (Yudi Zheng) Date: Tue, 1 Jul 2025 09:38:48 GMT Subject: RFR: 8361086: JVMCIGlobals::check_jvmci_flags_are_consistent has incorrect format string In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 16:14:08 GMT, Kim Barrett wrote: > Please review this trivial fix of a format string. 
The value being printed is > TieredStopAtLevel, which is of type intx, so "%zd" should be used instead of "%d". > > Testing: mach5 tier1 LGTM ------------- Marked as reviewed by yzheng (Committer). PR Review: https://git.openjdk.org/jdk/pull/26051#pullrequestreview-2974540441 From eastigeevich at openjdk.org Tue Jul 1 09:51:55 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 1 Jul 2025 09:51:55 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: Message-ID: On Wed, 25 Jun 2025 22:32:24 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 90 commits: > > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Update how call sites are fixed > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Fix pointer printing > - Use set_destination_mt_safe > - Print address as pointer > - Use new _metadata_size instead of _jvmci_data_size > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Only check branch distance for aarch64 and riscv > - Move far branch fix to fix_relocation_after_move > - ... and 80 more: https://git.openjdk.org/jdk/compare/f799cf18...70e4164e src/hotspot/share/code/nmethod.cpp line 1547: > 1545: CodeBuffer dst(nm_copy); > 1546: while (iter.next()) { > 1547: iter.reloc()->fix_relocation_after_move(&src, &dst, true); What if, instead of a bool parameter we introduce a function `fix_relocation_after_copy`: ```c++ virtual void Relocation::fix_relocation_after_copy(const CodeBuffer* src, CodeBuffer* dest) { fix_relocation_after_move(src, dest); } void CallRelocation::fix_relocation_after_copy(const CodeBuffer* src, CodeBuffer* dest) { address orig_addr = old_addr_for(addr(), src, dest); address callee = pd_call_destination(orig_addr); if (src->contains(callee)) { // If the original call is to an address in the src CodeBuffer (such as a stub call) // the updated call should be to the corresponding address in dest CodeBuffer ptrdiff_t offset = callee - orig_addr; callee = addr() + offset; } pd_set_call_destination(callee); } With this change we don't need to modify `relocInfo_*.cpp` files. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2177056209 From jbhateja at openjdk.org Tue Jul 1 10:13:22 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 1 Jul 2025 10:13:22 GMT Subject: RFR: 8361037: [ubsan] compiler/c2/irTests/TestFloat16ScalarOperations division by 0 Message-ID: Floating point division by zero is undefined per the C and C++ standards, but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing either an infinity or NaN value. While Java semantics defined in section 15.17.2 "Division Operator" of JLS-24 are well-defined for these constant-folding scenarios This bug fix patch fixes division by 0 error reported after integration of [JDK-8352635.](https://bugs.openjdk.org/browse/JDK-8352635) Kindly review and share your feedback. Best Regards, Jatin ------------- Commit messages: - 8361037: [ubsan] compiler/c2/irTests/TestFloat16ScalarOperations division by 0 Changes: https://git.openjdk.org/jdk/pull/26062/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26062&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361037 Stats: 11 lines in 1 file changed: 11 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26062.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26062/head:pull/26062 PR: https://git.openjdk.org/jdk/pull/26062 From eastigeevich at openjdk.org Tue Jul 1 10:19:57 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 1 Jul 2025 10:19:57 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 09:49:08 GMT, Evgeny Astigeevich wrote: >> Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 90 commits: >> >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Update how call sites are fixed >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Fix pointer printing >> - Use set_destination_mt_safe >> - Print address as pointer >> - Use new _metadata_size instead of _jvmci_data_size >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Only check branch distance for aarch64 and riscv >> - Move far branch fix to fix_relocation_after_move >> - ... and 80 more: https://git.openjdk.org/jdk/compare/f799cf18...70e4164e > > src/hotspot/share/code/nmethod.cpp line 1547: > >> 1545: CodeBuffer dst(nm_copy); >> 1546: while (iter.next()) { >> 1547: iter.reloc()->fix_relocation_after_move(&src, &dst, true); > > What if, instead of a bool parameter we introduce a function `fix_relocation_after_copy`: > ```c++ > virtual void Relocation::fix_relocation_after_copy(const CodeBuffer* src, CodeBuffer* dest) { > fix_relocation_after_move(src, dest); > } > > void CallRelocation::fix_relocation_after_copy(const CodeBuffer* src, CodeBuffer* dest) { > address orig_addr = old_addr_for(addr(), src, dest); > address callee = pd_call_destination(orig_addr); > > if (src->contains(callee)) { > // If the original call is to an address in the src CodeBuffer (such as a stub call) > // the updated call should be to the corresponding address in dest CodeBuffer > ptrdiff_t offset = callee - orig_addr; > callee = addr() + offset; > } > > pd_set_call_destination(callee); > } > > > With this change we don't need to modify `relocInfo_*.cpp` files. IMO, we might consider moving `pd_set_call_destination` to `CallRelocation` because only CallRelocation uses it. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2177119955 From shade at openjdk.org Tue Jul 1 10:53:25 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 1 Jul 2025 10:53:25 GMT Subject: RFR: 8361180: Disable CompiledDirectCall verification with -VerifyInlineCaches Message-ID: Missed the spot when doing [JDK-8360867](https://bugs.openjdk.org/browse/JDK-8360867). There is a path from GC that calls into IC verification when cleaning the caches. See `nmethod::cleanup_inline_caches_impl`. It does verification per callsite, and does the whole thing during parallel GC cleanup, which is STW at least in G1. This gets expensive for CTW scenarios. We should wrap that under the same flag introduced by [JDK-8360867](https://bugs.openjdk.org/browse/JDK-8360867). Motivational improvements: $ time CONF=linux-x86_64-server-fastdebug make test TEST=applications/ctw/modules/ # Current mainline real 3m59.274s user 68m9.663s sys 5m19.026s # This PR real 3m49.118s user 65m37.962s sys 5m15.441s ------------- Commit messages: - Fix Changes: https://git.openjdk.org/jdk/pull/26063/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26063&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361180 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26063.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26063/head:pull/26063 PR: https://git.openjdk.org/jdk/pull/26063 From mhaessig at openjdk.org Tue Jul 1 11:23:42 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 1 Jul 2025 11:23:42 GMT Subject: RFR: 8361037: [ubsan] compiler/c2/irTests/TestFloat16ScalarOperations division by 0 In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 10:08:20 GMT, Jatin Bhateja wrote: > Floating point division by zero is undefined per the C and C++ standards, but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing either an infinity or NaN value. > > While Java semantics defined in section 15.17.2 "Division Operator" of JLS-24 are well-defined for these constant-folding scenarios > > This bug fix patch fixes division by 0 error reported after integration of [JDK-8352635.](https://bugs.openjdk.org/browse/JDK-8352635) > Kindly review and share your feedback. > > Best Regards, > Jatin Hi, @jatin-bhateja. Thank you for providing this fix. I took a look at it and have a question. Otherwise, this looks good. src/hotspot/share/opto/divnode.cpp line 833: > 831: } > 832: > 833: if (g_isfinite(t1->getf()) && t2->getf() == 0.0) { Is the `g_isfinite` for `t1` really needed? If the dividend is infinite then the result is also an infinity with the appropriate sign. Does this not result in `INF / 0.0` being calculated below? This would also be undefined by the C++ standard, would it not? Since as far as I know not all s390 models implement IEEE754, perhaps it would be better to remove the `g_isfinite` to prevent the native `INF / 0.0` below. ------------- Changes requested by mhaessig (Committer). 
PR Review: https://git.openjdk.org/jdk/pull/26062#pullrequestreview-2974972341 PR Review Comment: https://git.openjdk.org/jdk/pull/26062#discussion_r2177311121 From eastigeevich at openjdk.org Tue Jul 1 11:26:52 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 1 Jul 2025 11:26:52 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: Message-ID: <73AnlXOv0T8K25DgsNdH1PkBjcBXz0f3bBYZx44LpAw=.439f5383-ffd1-44e8-9e11-4b5af9b6a278@github.com> On Wed, 25 Jun 2025 22:32:24 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 90 commits: > > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Update how call sites are fixed > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Fix pointer printing > - Use set_destination_mt_safe > - Print address as pointer > - Use new _metadata_size instead of _jvmci_data_size > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Only check branch distance for aarch64 and riscv > - Move far branch fix to fix_relocation_after_move > - ... and 80 more: https://git.openjdk.org/jdk/compare/f799cf18...70e4164e src/hotspot/share/code/nmethod.cpp line 1653: > 1651: } > 1652: } > 1653: } Do we need this code? Shouldn't missing trampolined be caught during fixing call sites? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2177325220 From eastigeevich at openjdk.org Tue Jul 1 11:40:54 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 1 Jul 2025 11:40:54 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: Message-ID: On Wed, 25 Jun 2025 22:32:24 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... 
> > Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 90 commits: > > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Update how call sites are fixed > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Fix pointer printing > - Use set_destination_mt_safe > - Print address as pointer > - Use new _metadata_size instead of _jvmci_data_size > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Only check branch distance for aarch64 and riscv > - Move far branch fix to fix_relocation_after_move > - ... and 80 more: https://git.openjdk.org/jdk/compare/f799cf18...70e4164e src/hotspot/share/code/nmethod.hpp line 172: > 170: friend class DeoptimizationScope; > 171: > 172: #define ImmutableDataReferencesCounterSize (int)sizeof(int) Macros defining an expression need to be enclosed in parentheses. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2177369434 From epeter at openjdk.org Tue Jul 1 11:56:43 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 1 Jul 2025 11:56:43 GMT Subject: RFR: 8359602: Ideal optimizations depending on input type are missed because of missing notification mechanism from CCP [v3] In-Reply-To: References: <-_MqCH6QmE-o_d7c9-aet-Cq-ptZJ6CZV6rodpDNWq0=.173e6f7a-3cfe-4791-8253-36e06d892069@github.com> Message-ID: On Mon, 30 Jun 2025 15:42:01 GMT, Beno?t Maillard wrote: >> This PR prevents some missed ideal optimizations in IGVN by notifying users of type refinements made during CCP, addressing a missed optimization that caused a verification failure with `-XX:VerifyIterativeGVN=1110`. >> >> ### Context >> During the compilation of the input program (obtained from the fuzzer, then simplified and added as a test) by C2, we end up with node `591 ModI` that takes `138 Phi` as its divisor input. An existing `Ideal` optimization is to get rid of the control input of a `ModINode` when we can prove that the divisor is never `0`. >> >> In this specific case, the type of the `PhiNode` gets refined during CCP, but the refinement fails to propagate to its users for the IGVN phase and the ideal optimization for the `ModINode` never happens. This results in a missed optimization and hits an assert in the verification phase of IGVN (when using `-XX:VerifyIterativeGVN=1110`). >> >> ![IGV screenshot](https://github.com/user-attachments/assets/5dee1ae6-9146-4115-922d-df33b7ccbd37) >> >> ### Detailed Analysis >> >> In `PhaseCCP::analyze`, we call `Value` for the `PhiNode`, which >> results in a type refinement: the range gets restricted to `int:-13957..-1191`. >> >> ```c++ >> // Pull from worklist; compute new value; push changes out. >> // This loop is the meat of CCP. >> while (worklist.size() != 0) { >> Node* n = fetch_next_node(worklist); >> DEBUG_ONLY(worklist_verify.push(n);) >> if (n->is_SafePoint()) { >> // Make sure safepoints are processed by PhaseCCP::transform even if they are >> // not reachable from the bottom. Otherwise, infinite loops would be removed. >> _root_and_safepoints.push(n); >> } >> const Type* new_type = n->Value(this); >> if (new_type != type(n)) { >> DEBUG_ONLY(verify_type(n, new_type, type(n));) >> dump_type_and_node(n, new_type); >> set_type(n, new_type); >> push_child_nodes_to_worklist(worklist, n); >> } >> if (KillPathsReachableByDeadTypeNode && n->is_Type() && new_type == Type::TOP) { >> // Keep track of Type nodes to kill CFG paths that use Type >> // nodes that become dead. 
>> _maybe_top_type_nodes.push(n); >> } >> } >> DEBUG_ONLY(verify_analyze(worklist_verify);) >> >> >> At the end of `PhaseCCP::analyze`, we obtain the following types in the side table: >> - `int` for node `591` (`ModINode`) >> - `int:-13957..-1191` for node `138` (`PhiNode`) >> >> If we call `find_node(138)->bottom_type()`, we get: >> - `int` for both nodes >> >> The... > > Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: > > Fix bad test class name Nice work @benoitmaillard ! src/hotspot/share/opto/phaseX.cpp line 3124: > 3122: n->raise_bottom_type(t); > 3123: _worklist.push(n); // n re-enters the hash table via the worklist > 3124: add_users_to_worklist(n); // if ideal or identity optimizations depend on the input type, users need to be notified Suggestion: add_users_to_worklist(n); // if Ideal or Identity optimizations depend on the input type, users need to be notified I would make them upper-case, just like the method names. ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26017#pullrequestreview-2975094882 PR Review Comment: https://git.openjdk.org/jdk/pull/26017#discussion_r2177396474 From epeter at openjdk.org Tue Jul 1 11:56:44 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 1 Jul 2025 11:56:44 GMT Subject: RFR: 8359602: Ideal optimizations depending on input type are missed because of missing notification mechanism from CCP [v2] In-Reply-To: References: <-_MqCH6QmE-o_d7c9-aet-Cq-ptZJ6CZV6rodpDNWq0=.173e6f7a-3cfe-4791-8253-36e06d892069@github.com> <3cLLB7fms3S4WgqOVeb7D_ZDRFsJ_-ca3qfALlmzFeU=.1002ac91-1e35-4499-9d88-6d1f76c955d0@github.com> <0MJe_8nA-ILWqoVG-9rzuq5Pe9xX-FG2LN3k9Cy8nqU=.d724c6cf-cb02-45c4-95a4-5bd1fef7462b@github.com> Message-ID: On Tue, 1 Jul 2025 07:07:40 GMT, Beno?t Maillard wrote: > > @benoitmaillard One more open question for me: `raise_bottom_type` only sets the node internal `_type`. But in IGVN, we do not read from `_type` but `phase->type(in(2))`. Do you know when the `phase->type(in(2))` value changes? Is that also during CCP? Before or after the `_type` is modified? > > Yes, good point, I should I have mentioned this somewhere. The `phase->type(in(2))` call uses the type array from `PhaseValues`. The type array entry is actually modified earlier, in `PhaseCCP::analyze`, right after the `Value` call. You can see the `set_type` call [here](https://github.com/benoitmaillard/jdk/blob/75de51dff6d9cc3e9764737b29b9358992b488b7/src/hotspot/share/opto/phaseX.cpp#L2765). When this happens, users are added to the (local) worklist but again it does not change our issue as only value optimizations occur in that context. Thanks for the explanation! So it seems that `CCP` and `IGVN` share the type array, right? Ah yes, it is the `Compile::_types`: 461 // Shared type array for GVN, IGVN and CCP. It maps node idx -> Type*. 462 Type_Array* _types; If the value behind `phase->type(in(2))` (the type array entry) is modified in `PhaseCCP::analyze`, right after the `Value` call, then why not do the notification there? If we did that, we would do more notification than what you now proposed (to do the notification in `PhaseCCP::transform_once` on the nodes that have a type that is different than the `bottom_type`). Are we possibly missing any important case with your approach now? Probably not, I would argue: with your approach we still notify for all live nodes that have a modified type, or are replaced with a constant. 
If we notified after every type update in `PhaseCCP::analyze`, we might notify for nodes multiple times, and we would also notify for nodes that are dead after CCP - both are unnecessary overheads. Alright, I just wanted to think this through - but it seems your approach is good :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/26017#issuecomment-3023637471 From bmaillard at openjdk.org Tue Jul 1 12:00:18 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Tue, 1 Jul 2025 12:00:18 GMT Subject: RFR: 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal Message-ID: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> This PR adds a node hash comparison after calling `Ideal` in `PhaseIterGVN::verify_Ideal_for` to introduce an additional layer of verification for missed optimizations. Previously, we relied on the return value of `Ideal`, which is expected to be `nullptr` if no transformation was done. By also checking the node's hash before and after `Ideal`, we could catch inconsistencies in the implementation or unintended modifications to the graph. Both of these may indicate missed or incomplete optimizations. ### Testing - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361144) - [x] tier1-3, plus some internal testing Thank you for reviewing! ------------- Commit messages: - 8361144: remove unintentional line break - 8361144: move hash check after return value check and use same format as unique counter check - 8361144: add check for node hash after verifying ideal Changes: https://git.openjdk.org/jdk/pull/26064/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26064&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361144 Stats: 11 lines in 1 file changed: 10 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26064.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26064/head:pull/26064 PR: https://git.openjdk.org/jdk/pull/26064 From epeter at openjdk.org Tue Jul 1 12:02:16 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 1 Jul 2025 12:02:16 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v28] In-Reply-To: References: <7r3C8BAViyHKVVJjv4w0YxfIUkfk9PmY0OEt73V_aRI=.baf51fc4-d996-44d0-a1f5-10cf6dc4de8d@github.com> Message-ID: On Thu, 12 Jun 2025 15:40:49 GMT, Roland Westrelin wrote: >> @rwestrel Let me know if you want us to run some extra testing. Christian said that you might be planning to wait until the JDK26 fork, and merge then, and then we can run testing. Up to you :) > > @eme64 in case you forgot about that one, it's ready for another round of reviews. @rwestrel I'm quite busy right now. I will soon go on vacation and travel, and I have a presentation to prepare in the next weeks. I hope I can come back to this in early August though. Feel free to ask someone else for a review, I don't want to hold this up. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/21630#issuecomment-3023679612 From eastigeevich at openjdk.org Tue Jul 1 12:07:58 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 1 Jul 2025 12:07:58 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: Message-ID: On Wed, 25 Jun 2025 22:32:24 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 90 commits: > > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Update how call sites are fixed > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Fix pointer printing > - Use set_destination_mt_safe > - Print address as pointer > - Use new _metadata_size instead of _jvmci_data_size > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Only check branch distance for aarch64 and riscv > - Move far branch fix to fix_relocation_after_move > - ... and 80 more: https://git.openjdk.org/jdk/compare/f799cf18...70e4164e test/hotspot/jtreg/vmTestbase/nsk/jvmti/NMethodRelocation/nmethodrelocation.java line 37: > 35: import jdk.test.whitebox.code.BlobType; > 36: > 37: public class nmethodrelocation extends DebugeeClass { Why is the class name not following the Java code conventions? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2177424604 From mbaesken at openjdk.org Tue Jul 1 12:28:39 2025 From: mbaesken at openjdk.org (Matthias Baesken) Date: Tue, 1 Jul 2025 12:28:39 GMT Subject: RFR: 8361037: [ubsan] compiler/c2/irTests/TestFloat16ScalarOperations division by 0 In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 10:08:20 GMT, Jatin Bhateja wrote: > Floating point division by zero is undefined per the C and C++ standards, but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing either an infinity or NaN value. > > While Java semantics defined in section 15.17.2 "Division Operator" of JLS-24 are well-defined for these constant-folding scenarios > > This bug fix patch fixes division by 0 error reported after integration of [JDK-8352635.](https://bugs.openjdk.org/browse/JDK-8352635) > Kindly review and share your feedback. > > Best Regards, > Jatin With your patch included, the test compiler/c2/irTests/TestFloat16ScalarOperations.java now passes on macOS aarch64 with ubsan enabled. 
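For reference, the Java-level results that this constant folding has to reproduce are fixed by JLS 15.17.2. The snippet below simply prints them; it is illustrative and not part of the patch or its tests:

```java
// Ordinary Java semantics for floating-point division by zero (JLS 15.17.2).
public class FpDivByZero {
    public static void main(String[] args) {
        System.out.println( 1.0f /  0.0f);                   // Infinity
        System.out.println(-1.0f /  0.0f);                   // -Infinity
        System.out.println( 1.0f / -0.0f);                   // -Infinity
        System.out.println( 0.0f /  0.0f);                   // NaN
        System.out.println(Float.NaN / 0.0f);                // NaN
        System.out.println(Float.POSITIVE_INFINITY / 0.0f);  // Infinity
    }
}
```

The compiler must produce exactly these values when folding a constant division, while the C++ code doing the folding should avoid evaluating the division natively, since that is the undefined behaviour ubsan flags.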
------------- PR Comment: https://git.openjdk.org/jdk/pull/26062#issuecomment-3023799985 From shade at openjdk.org Tue Jul 1 12:33:51 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 1 Jul 2025 12:33:51 GMT Subject: RFR: 8360557: CTW: Inline cold methods to reach more code Message-ID: We use CTW testing for making sure compilers behave well. But we compile the code that is not executed at all, and since our inlining heuristics often looks back at profiles, we end up not actually inlining all too much! This means CTW testing likely misses lots of bugs that normal code is exposed to, especially e.g. in loop optimizations. There is an intrinsic tradeoff with accepting more inilned methods in CTW: the compilation time gets significantly worse. With just accepting the cold methods we have reasonable CTW times, eating the improvements we have committed in mainline recently. And it still finds bugs. See the RFE for sample data. After this lands and CTW starts to compile cold methods, one can greatly expand the scope of the CTW testing by overriding the static inlining limits. Doing e.g. `TEST_VM_OPTS="-XX:MaxInlineSize=70 -XX:MaxInlineSize=70"` finds even more bugs. Unfortunately, the compilation times suffer so much, they are impractical to run in standard configurations, see data in RFE. We will enable some of that testing in special testing pipelines. Additional testing: - [ ] GHA - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` - [ ] Linux x86_64 server fastdebug, large CTW corpus (now failing in interesting ways) ------------- Commit messages: - Revert separate patch - Final - Proper option name and bump the limits - Fix Changes: https://git.openjdk.org/jdk/pull/26068/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26068&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8360557 Stats: 15 lines in 3 files changed: 15 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26068.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26068/head:pull/26068 PR: https://git.openjdk.org/jdk/pull/26068 From shade at openjdk.org Tue Jul 1 12:46:38 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 1 Jul 2025 12:46:38 GMT Subject: RFR: 8360557: CTW: Inline cold methods to reach more code In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 12:26:44 GMT, Aleksey Shipilev wrote: > We use CTW testing for making sure compilers behave well. But we compile the code that is not executed at all, and since our inlining heuristics often looks back at profiles, we end up not actually inlining all too much! This means CTW testing likely misses lots of bugs that normal code is exposed to, especially e.g. in loop optimizations. > > There is an intrinsic tradeoff with accepting more inilned methods in CTW: the compilation time gets significantly worse. With just accepting the cold methods we have reasonable CTW times, eating the improvements we have committed in mainline recently. And it still finds bugs. See the RFE for sample data. > > After this lands and CTW starts to compile cold methods, one can greatly expand the scope of the CTW testing by overriding the static inlining limits. Doing e.g. `TEST_VM_OPTS="-XX:MaxInlineSize=70 -XX:C1MaxInlineSize=70"` finds even more bugs. Unfortunately, the compilation times suffer so much, they are impractical to run in standard configurations, see data in RFE. We will enable some of that testing in special testing pipelines. > > Pre-empting the question: "Well, why not use -Xcomp then, and make sure it inlines well?" 
The answer is in RFE as well: Xcomp causes _a lot_ of stray compilations for JDK and CTW infra itself. For small JARs in large corpus this eats precious testing time that we would instead like to spend on deeper inlining in the actual JAR code. This also does not force us to look into how CTW works in Xcomp at all; I expect some surprises there. Feather-touching the inlining heuristic paths to just accept methods without looking at profiles looks better. > > Tobias had an idea to implement the stress randomized inlining that would expand the scope of inlining. This improvement stacks well with it. This improvement provides the base case of inlining most reasonable methods, and then allow stress infra to inline some more on top of that. > > Additional testing: > - [ ] GHA > - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > - [x] Linux x86_64 server fastdebug, large CTW corpus (now failing in interesting ways) We are on par for CTW testing time, comparing to the state a week back: # Before CTW perf improvements real 5m0.528s user 79m5.193s sys 14m16.678s # Current mainline real 3m59.274s user 68m9.663s sys 5m19.026s # This PR real 4m56.248s user 89m48.364s sys 5m24.091s ------------- PR Comment: https://git.openjdk.org/jdk/pull/26068#issuecomment-3023863192 From mbaesken at openjdk.org Tue Jul 1 12:48:19 2025 From: mbaesken at openjdk.org (Matthias Baesken) Date: Tue, 1 Jul 2025 12:48:19 GMT Subject: RFR: 8361040: compiler/codegen/TestRedundantLea.java#StringInflate fails with failed IR rules In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 14:44:03 GMT, Manuel H?ssig wrote: > `TestRedundantLea.java#StringInflate` failed on Alpine Linux because fewer `DecodeHeapOop_not_null`s than expected are generated even though the expected reduction is still present. This PR fixes this. > > Unfortunately, this fix makes the test less precise. I filed [JDK-8361045](https://bugs.openjdk.org/browse/JDK-8361045) to fix this when the IR-framework allows for it. > > Testing: > - [x] Github Actions > - [x] tier1, tier2 plus Oracle internal testing > - [x] `TestRedundantLea.java` on Alpine Linux With your patch included, the issue is gone on our Linux Alpine test machine. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26046#issuecomment-3023856713 From mhaessig at openjdk.org Tue Jul 1 12:48:19 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 1 Jul 2025 12:48:19 GMT Subject: RFR: 8361040: compiler/codegen/TestRedundantLea.java#StringInflate fails with failed IR rules In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 12:42:05 GMT, Matthias Baesken wrote: >> `TestRedundantLea.java#StringInflate` failed on Alpine Linux because fewer `DecodeHeapOop_not_null`s than expected are generated even though the expected reduction is still present. This PR fixes this. >> >> Unfortunately, this fix makes the test less precise. I filed [JDK-8361045](https://bugs.openjdk.org/browse/JDK-8361045) to fix this when the IR-framework allows for it. >> >> Testing: >> - [x] Github Actions >> - [x] tier1, tier2 plus Oracle internal testing >> - [x] `TestRedundantLea.java` on Alpine Linux > > With your patch included, the issue is gone on our Linux Alpine test machine. @MBaesken, thank you for testing. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26046#issuecomment-3023862806 From mhaessig at openjdk.org Tue Jul 1 12:48:19 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 1 Jul 2025 12:48:19 GMT Subject: RFR: 8361040: compiler/codegen/TestRedundantLea.java#StringInflate fails with failed IR rules Message-ID: `TestRedundantLea.java#StringInflate` failed on Alpine Linux because fewer `DecodeHeapOop_not_null`s than expected are generated even though the expected reduction is still present. This PR fixes this. Unfortunately, this fix makes the test less precise. I filed [JDK-8361045](https://bugs.openjdk.org/browse/JDK-8361045) to fix this when the IR-framework allows for it. Testing: - [x] Github Actions - [x] tier1, tier2 plus Oracle internal testing - [x] `TestRedundantLea.java` on Alpine Linux ------------- Commit messages: - Fix test Changes: https://git.openjdk.org/jdk/pull/26046/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26046&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361040 Stats: 12 lines in 1 file changed: 2 ins; 6 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/26046.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26046/head:pull/26046 PR: https://git.openjdk.org/jdk/pull/26046 From bmaillard at openjdk.org Tue Jul 1 12:58:29 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Tue, 1 Jul 2025 12:58:29 GMT Subject: RFR: 8359602: Ideal optimizations depending on input type are missed because of missing notification mechanism from CCP [v4] In-Reply-To: <-_MqCH6QmE-o_d7c9-aet-Cq-ptZJ6CZV6rodpDNWq0=.173e6f7a-3cfe-4791-8253-36e06d892069@github.com> References: <-_MqCH6QmE-o_d7c9-aet-Cq-ptZJ6CZV6rodpDNWq0=.173e6f7a-3cfe-4791-8253-36e06d892069@github.com> Message-ID: > This PR prevents some missed ideal optimizations in IGVN by notifying users of type refinements made during CCP, addressing a missed optimization that caused a verification failure with `-XX:VerifyIterativeGVN=1110`. > > ### Context > During the compilation of the input program (obtained from the fuzzer, then simplified and added as a test) by C2, we end up with node `591 ModI` that takes `138 Phi` as its divisor input. An existing `Ideal` optimization is to get rid of the control input of a `ModINode` when we can prove that the divisor is never `0`. > > In this specific case, the type of the `PhiNode` gets refined during CCP, but the refinement fails to propagate to its users for the IGVN phase and the ideal optimization for the `ModINode` never happens. This results in a missed optimization and hits an assert in the verification phase of IGVN (when using `-XX:VerifyIterativeGVN=1110`). > > ![IGV screenshot](https://github.com/user-attachments/assets/5dee1ae6-9146-4115-922d-df33b7ccbd37) > > ### Detailed Analysis > > In `PhaseCCP::analyze`, we call `Value` for the `PhiNode`, which > results in a type refinement: the range gets restricted to `int:-13957..-1191`. > > ```c++ > // Pull from worklist; compute new value; push changes out. > // This loop is the meat of CCP. > while (worklist.size() != 0) { > Node* n = fetch_next_node(worklist); > DEBUG_ONLY(worklist_verify.push(n);) > if (n->is_SafePoint()) { > // Make sure safepoints are processed by PhaseCCP::transform even if they are > // not reachable from the bottom. Otherwise, infinite loops would be removed. 
> _root_and_safepoints.push(n); > } > const Type* new_type = n->Value(this); > if (new_type != type(n)) { > DEBUG_ONLY(verify_type(n, new_type, type(n));) > dump_type_and_node(n, new_type); > set_type(n, new_type); > push_child_nodes_to_worklist(worklist, n); > } > if (KillPathsReachableByDeadTypeNode && n->is_Type() && new_type == Type::TOP) { > // Keep track of Type nodes to kill CFG paths that use Type > // nodes that become dead. > _maybe_top_type_nodes.push(n); > } > } > DEBUG_ONLY(verify_analyze(worklist_verify);) > > > At the end of `PhaseCCP::analyze`, we obtain the following types in the side table: > - `int` for node `591` (`ModINode`) > - `int:-13957..-1191` for node `138` (`PhiNode`) > > If we call `find_node(138)->bottom_type()`, we get: > - `int` for both nodes > > There is no progress on the type of `ModINode` during CCP, because `ModINode::Value` > is not able to... Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: 8359602: update case for consistency Co-authored-by: Emanuel Peter ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26017/files - new: https://git.openjdk.org/jdk/pull/26017/files/75de51df..005b2825 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26017&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26017&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26017.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26017/head:pull/26017 PR: https://git.openjdk.org/jdk/pull/26017 From bmaillard at openjdk.org Tue Jul 1 13:18:40 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Tue, 1 Jul 2025 13:18:40 GMT Subject: RFR: 8359602: Ideal optimizations depending on input type are missed because of missing notification mechanism from CCP [v2] In-Reply-To: References: <-_MqCH6QmE-o_d7c9-aet-Cq-ptZJ6CZV6rodpDNWq0=.173e6f7a-3cfe-4791-8253-36e06d892069@github.com> <3cLLB7fms3S4WgqOVeb7D_ZDRFsJ_-ca3qfALlmzFeU=.1002ac91-1e35-4499-9d88-6d1f76c955d0@github.com> <0MJe_8nA-ILWqoVG-9rzuq5Pe9xX-FG2LN3k9Cy8nqU=.d724c6cf-cb02-45c4-95a4-5bd1fef7462b@github.com> Message-ID: On Tue, 1 Jul 2025 07:07:40 GMT, Beno?t Maillard wrote: >> @benoitmaillard Very nice work, and great description :) >> >>>Did you check if this allows enabling any of the other disabled verifications from [JDK-8347273](https://bugs.openjdk.org/browse/JDK-8347273)? >> >> That may be a lot of work. Not sure if it is worth checking all of them now. @TobiHartmann how much should he invest in this now? An alternative is just tackling all the other cases later. What do you think? >> >> @benoitmaillard One more open question for me: `raise_bottom_type` only sets the node internal `_type`. But in IGVN, we do not read from `_type` but `phase->type(in(2))`. Do you know when the `phase->type(in(2))` value changes? Is that also during CCP? Before or after the `_type` is modified? > >> @benoitmaillard Very nice work, and great description :) > > Thank you! @eme64 > >> > Did you check if this allows enabling any of the other disabled verifications from [JDK-8347273](https://bugs.openjdk.org/browse/JDK-8347273)? >> >> That may be a lot of work. Not sure if it is worth checking all of them now. @TobiHartmann how much should he invest in this now? An alternative is just tackling all the other cases later. What do you think? > > I have started to take a look at this and it seems that there are a lot of cases to check indeed. 
> >> @benoitmaillard One more open question for me: `raise_bottom_type` only sets the node internal `_type`. But in IGVN, we do not read from `_type` but `phase->type(in(2))`. Do you know when the `phase->type(in(2))` value changes? Is that also during CCP? Before or after the `_type` is modified? > > Yes, good point, I should I have mentioned this somewhere. The `phase->type(in(2))` call uses the type array from `PhaseValues`. The type array entry is actually modified earlier, in `PhaseCCP::analyze`, right after the `Value` call. You can see the `set_type` call [here](https://github.com/benoitmaillard/jdk/blob/75de51dff6d9cc3e9764737b29b9358992b488b7/src/hotspot/share/opto/phaseX.cpp#L2765). When this happens, users are added to the (local) worklist but again it does not change our issue as only value optimizations occur in that context. > > > @benoitmaillard One more open question for me: `raise_bottom_type` only sets the node internal `_type`. But in IGVN, we do not read from `_type` but `phase->type(in(2))`. Do you know when the `phase->type(in(2))` value changes? Is that also during CCP? Before or after the `_type` is modified? > > > > > > Yes, good point, I should I have mentioned this somewhere. The `phase->type(in(2))` call uses the type array from `PhaseValues`. The type array entry is actually modified earlier, in `PhaseCCP::analyze`, right after the `Value` call. You can see the `set_type` call [here](https://github.com/benoitmaillard/jdk/blob/75de51dff6d9cc3e9764737b29b9358992b488b7/src/hotspot/share/opto/phaseX.cpp#L2765). When this happens, users are added to the (local) worklist but again it does not change our issue as only value optimizations occur in that context. > > Thanks for the explanation! So it seems that `CCP` and `IGVN` share the type array, right? Ah yes, it is the `Compile::_types`: > > ``` > 461 // Shared type array for GVN, IGVN and CCP. It maps node idx -> Type*. > 462 Type_Array* _types; > ``` > > If the value behind `phase->type(in(2))` (the type array entry) is modified in `PhaseCCP::analyze`, right after the `Value` call, then why not do the notification there? If we did that, we would do more notification than what you now proposed (to do the notification in `PhaseCCP::transform_once` on the nodes that have a type that is different than the `bottom_type`). Are we possibly missing any important case with your approach now? Probably not, I would argue: with your approach we still notify for all live nodes that have a modified type, or are replaced with a constant. If we notified after every type update in `PhaseCCP::analyze`, we might notify for nodes multiple times, and we would also notify for nodes that are dead after CCP - both are unnecessary overheads. Alright, I just wanted to think this through - but it seems your approach is good :) I also considered doing it there in `PhaseCCP::analyze`, but I reached the same conclusion. Thanks for your help! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26017#issuecomment-3023978823 From snatarajan at openjdk.org Tue Jul 1 13:27:47 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Tue, 1 Jul 2025 13:27:47 GMT Subject: RFR: 8325478: Restructure the macro expansion compiler phase to not include macro elimination [v7] In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 16:24:03 GMT, Vladimir Kozlov wrote: >> Saranya Natarajan has refreshed the contents of this pull request, and previous commits have been removed. 
The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: >> >> merge with master >> Merge branch 'master' of https://git.openjdk.org/jdk into JDK-8325478 > > src/hotspot/share/opto/compile.cpp line 2533: > >> 2531: { >> 2532: TracePhase tp(_t_macroExpand); >> 2533: print_method(PHASE_BEFORE_MACRO_EXPANSION, 3); > > Should we move it before `mex.expand_macro_nodes()` call? Moving this would break the assumption of needing a `BEFORE_MACRO_ELIMINATION` as explained in the above reply. One way to go about this would be to include a `BEFORE_MACRO_ELIMINATION` phase and remove the `PHASE_BEFORE_MACRO_EXPANSION` phase as this is only place where it is used. Would this be a reasonable fix ? > src/hotspot/share/opto/phasetype.hpp line 94: > >> 92: flags(AFTER_LOOP_OPTS, "After Loop Optimizations") \ >> 93: flags(AFTER_MERGE_STORES, "After Merge Stores") \ >> 94: flags(AFTER_MACRO_ELIMINATION_STEP, "After Macro Elimination Step") \ > > What is the reason to not have `BEFORE_MACRO_ELIMINATION`? The two main reasons for not having a `BEFORE_MACRO_ELIMINATION` are as follows: - There is a dump in line 2426 (`print_method(PHASE_ITER_GVN_AFTER_EA, 2)`) before we call `mexp.eliminate_macro_nodes` which performs the functionality of having a `BEFORE_MACRO_ELIMINATION` for phase dump. - There is dump in line 2533 (`print_method(PHASE_BEFORE_MACRO_EXPANSION, 3)`) before eliminating macro nodes which performs the similar function. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25682#discussion_r2177603003 PR Review Comment: https://git.openjdk.org/jdk/pull/25682#discussion_r2177602894 From jbhateja at openjdk.org Tue Jul 1 13:28:21 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 1 Jul 2025 13:28:21 GMT Subject: RFR: 8361037: [ubsan] compiler/c2/irTests/TestFloat16ScalarOperations division by 0 [v2] In-Reply-To: References: Message-ID: > Floating point division by zero is undefined per the C and C++ standards, but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing either an infinity or NaN value. > > While Java semantics defined in section 15.17.2 "Division Operator" of JLS-24 are well-defined for these constant-folding scenarios > > This bug fix patch fixes division by 0 error reported after integration of [JDK-8352635.](https://bugs.openjdk.org/browse/JDK-8352635) > Kindly review and share your feedback. 
> > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Review comments resolution ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26062/files - new: https://git.openjdk.org/jdk/pull/26062/files/bf78fbe6..d39c76f4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26062&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26062&range=00-01 Stats: 8 lines in 1 file changed: 4 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/26062.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26062/head:pull/26062 PR: https://git.openjdk.org/jdk/pull/26062 From jbhateja at openjdk.org Tue Jul 1 13:28:22 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 1 Jul 2025 13:28:22 GMT Subject: RFR: 8361037: [ubsan] compiler/c2/irTests/TestFloat16ScalarOperations division by 0 [v2] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 11:19:04 GMT, Manuel H?ssig wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Review comments resolution > > src/hotspot/share/opto/divnode.cpp line 833: > >> 831: } >> 832: >> 833: if (g_isfinite(t1->getf()) && t2->getf() == 0.0) { > > Is the `g_isfinite` for `t1` really needed? If the dividend is infinite then the result is also an infinity with the appropriate sign. Does this not result in `INF / 0.0` being calculated below? This would also be undefined by the C++ standard, would it not? Since as far as I know not all s390 models implement IEEE754, perhaps it would be better to remove the `g_isfinite` to prevent the native `INF / 0.0` below. As per C++ standard section 7.6.5 (expr.mul), behavior is undefined only if the second operand is 0.0. In all other situations, we can expect a standard-compliant C++ compiler to generate code following IEEE 754 semantics, irrespective of target floating point model, but Java semantics expect to return a NaN value if either of the operands is a NaN. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26062#discussion_r2177604366 From jbhateja at openjdk.org Tue Jul 1 13:36:20 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 1 Jul 2025 13:36:20 GMT Subject: RFR: 8361037: [ubsan] compiler/c2/irTests/TestFloat16ScalarOperations division by 0 [v3] In-Reply-To: References: Message-ID: <6cWhCvx8g-Gx4VoBHW1wA7atsa_Eq5wBhkDolUbP_X0=.31f8e688-7401-4f81-9b50-46b1997e96b5@github.com> > Floating point division by zero is undefined per the C and C++ standards, but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing either an infinity or NaN value. > > While Java semantics defined in section 15.17.2 "Division Operator" of JLS-24 are well-defined for these constant-folding scenarios > > This bug fix patch fixes division by 0 error reported after integration of [JDK-8352635.](https://bugs.openjdk.org/browse/JDK-8352635) > Kindly review and share your feedback. 
> > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Adding comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26062/files - new: https://git.openjdk.org/jdk/pull/26062/files/d39c76f4..0038654e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26062&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26062&range=01-02 Stats: 8 lines in 1 file changed: 8 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26062.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26062/head:pull/26062 PR: https://git.openjdk.org/jdk/pull/26062 From galder at openjdk.org Tue Jul 1 13:47:38 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Tue, 1 Jul 2025 13:47:38 GMT Subject: RFR: 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal In-Reply-To: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> References: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> Message-ID: On Tue, 1 Jul 2025 11:35:06 GMT, Beno?t Maillard wrote: > This PR adds a node hash comparison after calling `Ideal` in `PhaseIterGVN::verify_Ideal_for` to introduce an additional layer of verification for missed optimizations. Previously, we relied on the return value of `Ideal`, which is expected to be `nullptr` if no transformation was done. > > By also checking the node's hash before and after `Ideal`, we could catch inconsistencies in the implementation or unintended modifications to the graph. Both of these may indicate missed or incomplete optimizations. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361144) > - [x] tier1-3, plus some internal testing > > Thank you for reviewing! Have you considered adding a test for this? Is that feasible? ------------- PR Review: https://git.openjdk.org/jdk/pull/26064#pullrequestreview-2975520753 From eastigeevich at openjdk.org Tue Jul 1 15:33:53 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 1 Jul 2025 15:33:53 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 Message-ID: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. Release builds might not generate needed debug info. This PR adds a requirement for the test to be run on debug builds only. Tested: - Fastdebug: test passed - Slowdebug: test passed. - Release: test skipped. 
------------- Commit messages: - 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 Changes: https://git.openjdk.org/jdk/pull/26072/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8360936 Stats: 3 lines in 2 files changed: 1 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26072.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26072/head:pull/26072 PR: https://git.openjdk.org/jdk/pull/26072 From missa at openjdk.org Tue Jul 1 15:37:47 2025 From: missa at openjdk.org (Mohamed Issa) Date: Tue, 1 Jul 2025 15:37:47 GMT Subject: Integrated: 8358179: Performance regression in Math.cbrt In-Reply-To: <45l5EvxoRINI1_Ep2_snJzKNMPo4-dPXADalLN1fq1Y=.9f697a35-ee7b-4e7a-9e5e-ff33911b3b21@github.com> References: <45l5EvxoRINI1_Ep2_snJzKNMPo4-dPXADalLN1fq1Y=.9f697a35-ee7b-4e7a-9e5e-ff33911b3b21@github.com> Message-ID: <12bHfivFgRF2s-Sr0SZY6DIywI30LQ63uedYzsncO0A=.ba272456-15df-493b-8247-e38a67796968@github.com> On Tue, 24 Jun 2025 22:33:56 GMT, Mohamed Issa wrote: > The changes described below are meant to resolve the performance regression introduced by the **x86_64 cbrt** double precision floating point scalar intrinsic in #24470. > > 1. Check for +0, -0, +INF, -INF, and NaN before any other input values. > 2. If these special values are found, return immediately with minimal modifications to the result register. > 3. Performance testing shows the modified intrinsic improves throughput by 65.1% over the original intrinsic on average for the special values while throughput drops by 5.5% for the normal value range (-INF, -2^(-1022)], [2^(-1022), INF). > > The commands to run all relevant micro-benchmarks are posted below. > > `make test TEST="micro:CbrtPerf.CbrtPerfRanges"` > `make test TEST="micro:CbrtPerf.CbrtPerfSpecialValues"` > > The results of all tests posted below were captured with an [Intel? Xeon 8488C](https://www.intel.com/content/www/us/en/products/sku/231730/intel-xeon-platinum-8480c-processor-105m-cache-2-00-ghz/specifications.html) using [OpenJDK v26-b1](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B1) as the baseline version. The term _baseline1_ refers to runs with the intrinsic enabled and _baseline2_ refers to runs with the intrinsic disabled. > > Each result is the mean of 8 individual runs, and the input ranges used match those from the original Java implementation. Overall, the changes provide a significant uplift over _baseline1_ except for a mild regression in the (**2^(-1022) <= |x| < INF**) input range, which is expected due to the extra checks. When comparing against _baseline2_, the modified intrinsic significantly still outperforms for the inputs (**-INF < x < INF**) that require heavy compute. However, the special value inputs that trigger fast path returns still perform better with _baseline2_. > > | Input range(s) | Baseline1 (ops/ms) | Change (ops/ms) | Change vs baseline1 (%) | > | :-------------------------------------: | :-------------------: | :------------------: | :--------------------------: | > | [-2^(-1022), 2^(-1022)] | 18470 | 20847 | +12.87 | > | (-INF, -2^(-1022)], [2^(-1022), INF) | 210538 | 198925 | -5.52 | > | [0] | 344990 | 627561 | +81.91 | > | [-0] | 291... This pull request has now been integrated. 
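To make the special-value fast path concrete, here is a small illustrative sketch in plain C++, with std::cbrt standing in for the heavy polynomial path (the real change is x86_64 macro-assembler code, not this function):

```c++
#include <cmath>

// For +/-0.0, +/-Inf and NaN, cbrt(x) == x, so the special values can be
// returned with minimal changes to the result register, skipping the
// expensive polynomial evaluation entirely.
static double cbrt_with_special_value_fast_path(double x) {
  if (x == 0.0 || std::isinf(x) || std::isnan(x)) {
    return x;            // +0 -> +0, -0 -> -0, +Inf -> +Inf, -Inf -> -Inf, NaN -> NaN
  }
  return std::cbrt(x);   // normal-range inputs take the heavy compute path
}
```

This is why the special-value inputs gain so much while the normal range only pays for the extra compares up front.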
Changeset: 38f59f84 Author: Mohamed Issa Committer: Sandhya Viswanathan URL: https://git.openjdk.org/jdk/commit/38f59f84c98dfd974eec0c05541b2138b149def7 Stats: 50 lines in 1 file changed: 11 ins; 36 del; 3 mod 8358179: Performance regression in Math.cbrt Reviewed-by: sviswanathan, sparasa, epeter ------------- PR: https://git.openjdk.org/jdk/pull/25962 From sviswanathan at openjdk.org Tue Jul 1 15:37:46 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 1 Jul 2025 15:37:46 GMT Subject: RFR: 8358179: Performance regression in Math.cbrt In-Reply-To: References: <45l5EvxoRINI1_Ep2_snJzKNMPo4-dPXADalLN1fq1Y=.9f697a35-ee7b-4e7a-9e5e-ff33911b3b21@github.com> Message-ID: On Mon, 30 Jun 2025 05:51:58 GMT, Emanuel Peter wrote: >>> I'll hold off with approval until someone else who is more knowledgeable has reviewed first. But feel free to ping me for a second review. >> >> @eme64 Second review with the latest changes? > > @missa-prime The patch still looks good, though I ran testing again because of the new changes. Should complete in about 24h. Thanks a lot @eme64. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25962#issuecomment-3024541704 From shade at openjdk.org Tue Jul 1 15:39:43 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 1 Jul 2025 15:39:43 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 15:29:10 GMT, Evgeny Astigeevich wrote: > Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. Release builds might not generate needed debug info. > > This PR adds a requirement for the test to be run on debug builds only. > > Tested: > - Fastdebug: test passed > - Slowdebug: test passed. > - Release: test skipped. Looks okay, but I am confused why the test did not fail before JDK-8359435? test/hotspot/jtreg/compiler/onSpinWait/TestOnSpinWaitAArch64.java line 32: > 30: * @requires vm.flagless > 31: * @requires os.arch=="aarch64" > 32: * @requires vm.debug==true Can be just `@requires vm.debug`. ------------- PR Review: https://git.openjdk.org/jdk/pull/26072#pullrequestreview-2975983374 PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2177921439 From phh at openjdk.org Tue Jul 1 15:41:41 2025 From: phh at openjdk.org (Paul Hohensee) Date: Tue, 1 Jul 2025 15:41:41 GMT Subject: RFR: 8358183: [JVMCI] crash accessing nmethod::jvmci_name in CodeCache::aggregate In-Reply-To: References: Message-ID: On Tue, 3 Jun 2025 06:39:18 GMT, Boris Ulasevich wrote: > This change addresses an intermittent crash in CompileBroker::print_heapinfo() when accessing JVMCI metadata after a CodeBlob::purge(). > > The issue is a regression after: > - JDK-8343789: JVMCI metadata was moved from nmethod into a separate blob. > - JDK-8352112: CodeBlob::purge() was updated to set _mutable_data to blob_end(). > > The change zeroes out _mutable_data_size, _relocation_size, and _metadata_size in purge() so that after purge jvmci_data_size() returns 0 and CompileBroker::print_heapinfo() won?t touch an invalid _metadata. Marked as reviewed by phh (Reviewer). 
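As an aside, a toy model of the purge() idea described above, with hypothetical names that follow the description rather than the real CodeBlob class:

```c++
#include <cstdint>

// Toy stand-in, for illustration only.
struct ToyCodeBlob {
  uint8_t* _blob_end;          // end of the immutable part of the blob
  uint8_t* _mutable_data;
  int      _mutable_data_size;
  int      _relocation_size;
  int      _metadata_size;

  void purge() {
    _mutable_data      = _blob_end;  // sentinel value, per JDK-8352112
    _mutable_data_size = 0;          // size queries now report "nothing here"
    _relocation_size   = 0;
    _metadata_size     = 0;          // so jvmci_data_size() below returns 0
  }

  // Simplified: readers such as print_heapinfo() consult this before touching
  // the (already freed) mutable data.
  int jvmci_data_size() const { return _metadata_size; }
};
```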
------------- PR Review: https://git.openjdk.org/jdk/pull/25608#pullrequestreview-2975990062 From mablakatov at openjdk.org Tue Jul 1 15:48:00 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Tue, 1 Jul 2025 15:48:00 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v5] In-Reply-To: References: Message-ID: > Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. > > Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still. > > The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reducion micro-benchmarks. > > Benchmarks results: > > Neoverse-V1 (SVE 256-bit) > > Benchmark (size) Mode master PR Units > ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms > ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms > IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms > LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms > FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms > DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms > > > Fujitsu A64FX (SVE 512-bit): > > Benchmark (size) Mode master PR Units > ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms > ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms > IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms > LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms > FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms > DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: fixup: remove undefined insts from aarch64-asmtest.py ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23181/files - new: https://git.openjdk.org/jdk/pull/23181/files/025d5166..df09ab65 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=03-04 Stats: 30 lines in 2 files changed: 0 ins; 9 del; 21 mod Patch: https://git.openjdk.org/jdk/pull/23181.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23181/head:pull/23181 PR: https://git.openjdk.org/jdk/pull/23181 From kvn at openjdk.org Tue Jul 1 15:48:42 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 1 Jul 2025 15:48:42 GMT Subject: RFR: 8361101: AOTCodeAddressTable::_stubs_addr not initialized/freed properly In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 19:45:49 GMT, Ashutosh Mehra wrote: > Please reivew this patch to fix initialization and freeing of `AOTCodeAddressTable::_stubs_addr`. Changes are trivial I missed that this is for mainline. Approved. ------------- Marked as reviewed by kvn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26053#pullrequestreview-2976010588 From kvn at openjdk.org Tue Jul 1 15:52:37 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 1 Jul 2025 15:52:37 GMT Subject: RFR: 8361101: AOTCodeAddressTable::_stubs_addr not initialized/freed properly In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 19:45:49 GMT, Ashutosh Mehra wrote: > Please reivew this patch to fix initialization and freeing of `AOTCodeAddressTable::_stubs_addr`. Changes are trivial Yes, it is trivial. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26053#issuecomment-3024597730 From kvn at openjdk.org Tue Jul 1 15:59:38 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 1 Jul 2025 15:59:38 GMT Subject: RFR: 8361180: Disable CompiledDirectCall verification with -VerifyInlineCaches In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 10:47:40 GMT, Aleksey Shipilev wrote: > Missed the spot when doing [JDK-8360867](https://bugs.openjdk.org/browse/JDK-8360867). There is a path from GC that calls into IC verification when cleaning the caches. See `nmethod::cleanup_inline_caches_impl`. It does verification per callsite, and does the whole thing during parallel GC cleanup, which is STW at least in G1. This gets expensive for CTW scenarios. We should wrap that under the same flag introduced by [JDK-8360867](https://bugs.openjdk.org/browse/JDK-8360867). > > Motivational improvements: > > > $ time CONF=linux-x86_64-server-fastdebug make test TEST=applications/ctw/modules/ > > # Current mainline > real 3m59.274s > user 68m9.663s > sys 5m19.026s > > # This PR > real 3m49.118s > user 65m37.962s > sys 5m15.441s Trivial. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26063#pullrequestreview-2976063372 From shade at openjdk.org Tue Jul 1 15:59:39 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 1 Jul 2025 15:59:39 GMT Subject: RFR: 8361101: AOTCodeAddressTable::_stubs_addr not initialized/freed properly In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 19:45:49 GMT, Ashutosh Mehra wrote: > Please reivew this patch to fix initialization and freeing of `AOTCodeAddressTable::_stubs_addr`. Changes are trivial Marked as reviewed by shade (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26053#pullrequestreview-2976066699 From eastigeevich at openjdk.org Tue Jul 1 16:05:07 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 1 Jul 2025 16:05:07 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v2] In-Reply-To: References: Message-ID: > Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. Release builds might not generate needed debug info. > > This PR adds a requirement for the test to be run on debug builds only. > > Tested: > - Fastdebug: test passed > - Slowdebug: test passed. > - Release: test skipped. 
Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: Simplify requirement for debug build ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26072/files - new: https://git.openjdk.org/jdk/pull/26072/files/b2ba0a92..e91036bc Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26072.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26072/head:pull/26072 PR: https://git.openjdk.org/jdk/pull/26072 From kvn at openjdk.org Tue Jul 1 16:06:39 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 1 Jul 2025 16:06:39 GMT Subject: RFR: 8360557: CTW: Inline cold methods to reach more code In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 12:26:44 GMT, Aleksey Shipilev wrote: > We use CTW testing for making sure compilers behave well. But we compile the code that is not executed at all, and since our inlining heuristics often looks back at profiles, we end up not actually inlining all too much! This means CTW testing likely misses lots of bugs that normal code is exposed to, especially e.g. in loop optimizations. > > There is an intrinsic tradeoff with accepting more inilned methods in CTW: the compilation time gets significantly worse. With just accepting the cold methods we have reasonable CTW times, eating the improvements we have committed in mainline recently. And it still finds bugs. See the RFE for sample data. > > After this lands and CTW starts to compile cold methods, one can greatly expand the scope of the CTW testing by overriding the static inlining limits. Doing e.g. `TEST_VM_OPTS="-XX:MaxInlineSize=70 -XX:C1MaxInlineSize=70"` finds even more bugs. Unfortunately, the compilation times suffer so much, they are impractical to run in standard configurations, see data in RFE. We will enable some of that testing in special testing pipelines. > > Pre-empting the question: "Well, why not use -Xcomp then, and make sure it inlines well?" The answer is in RFE as well: Xcomp causes _a lot_ of stray compilations for JDK and CTW infra itself. For small JARs in large corpus this eats precious testing time that we would instead like to spend on deeper inlining in the actual JAR code. This also does not force us to look into how CTW works in Xcomp at all; I expect some surprises there. Feather-touching the inlining heuristic paths to just accept methods without looking at profiles looks better. > > Tobias had an idea to implement the stress randomized inlining that would expand the scope of inlining. This improvement stacks well with it. This improvement provides the base case of inlining most reasonable methods, and then allow stress infra to inline some more on top of that. > > Additional testing: > - [x] GHA > - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > - [x] Linux x86_64 server fastdebug, large CTW corpus (now failing in interesting ways) This has to be tested by us to make sure we clean up all issues this change find. 
------------- PR Review: https://git.openjdk.org/jdk/pull/26068#pullrequestreview-2976094320 From mablakatov at openjdk.org Tue Jul 1 16:10:49 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Tue, 1 Jul 2025 16:10:49 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v4] In-Reply-To: <3sWLk_sAMLtcvRUjXk9hYe-K2MBQl9fH2Qg0MF7lwDk=.b8867d51-e822-43c0-93ab-58228c6eb1d5@github.com> References: <3sWLk_sAMLtcvRUjXk9hYe-K2MBQl9fH2Qg0MF7lwDk=.b8867d51-e822-43c0-93ab-58228c6eb1d5@github.com> Message-ID: <19rf4A0bxc4BstRmLivGkoCOm7Qa7YD6z1VJHJivCtg=.4a643c7b-4e79-4f37-b230-7231df3c68a8@github.com> On Tue, 1 Jul 2025 06:57:10 GMT, Xiaohong Gong wrote: >> Mikhail Ablakatov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: >> >> - cleanup: address nits, rename several symbols >> - cleanup: remove unreferenced definitions >> - Address review comments. >> >> - fixup: disable FP mul reduction auto-vectorization for all targets >> - fixup: add a tmp vReg to reduce_mul_integral_gt128b and >> reduce_non_strict_order_mul_fp_gt128bto keep vsrc unmodified >> - cleanup: replace a complex lambda in the above methods with a loop >> - cleanup: rename symbols to follow the existing naming convention >> - cleanup: add asserts to SVE only instructions >> - split mul FP reduction instructions into strictly-ordered (default) >> and explicitly non strictly-ordered >> - remove redundant conditions in TestVectorFPReduction.java >> >> Benchmarks results: >> >> Neoverse-V1 (SVE 256-bit) >> >> | Benchmark | Before | After | Units | Diff | >> |---------------------------|----------|----------|--------|-------| >> | ByteMaxVector.MULLanes | 619.156 | 9884.578 | ops/ms | 1496% | >> | DoubleMaxVector.MULLanes | 184.693 | 2712.051 | ops/ms | 1368% | >> | FloatMaxVector.MULLanes | 277.818 | 3388.038 | ops/ms | 1119% | >> | IntMaxVector.MULLanes | 371.225 | 4765.434 | ops/ms | 1183% | >> | LongMaxVector.MULLanes | 205.149 | 2672.975 | ops/ms | 1203% | >> | ShortMaxVector.MULLanes | 472.804 | 5122.917 | ops/ms | 984% | >> - Merge branch 'master' into 8343689-rebase >> - fixup: don't modify the value in vsrc >> >> Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this >> change, the result of recursive folding is held in vtmp1. To be able to >> pass this intermediate result to reduce_mul_integral_le128b(), we would >> have to use another temporary FloatRegister, as vtmp1 would essentially >> act as vsrc. It's possible to get around this however: >> reduce_mul_integral_le128b() is modified so it's possible to pass >> matching vsrc and vtmp2 arguments. By doing this, we save ourselves a >> temporary register in rules that match to reduce_mul_integral_gt128b(). >> - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formating >> - Use EXT instead of COMPACT to split a vector into two halves >> >> Benchmarks results: >> >> Neoverse-V1 (SVE 256-bit) >> >> Benchmark (size) Mode master ... > > src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2097: > >> 2095: sve_movprfx(vtmp1, vsrc); // copy >> 2096: sve_ext(vtmp1, vtmp1, vector_length_in_bytes / 2); // swap halves >> 2097: sve_mul(vtmp1, elemType_to_regVariant(bt), pgtmp, vsrc); // multiply halves > >> sve_mul(vtmp1, elemType_to_regVariant(bt), pgtmp, vsrc); > > Can we use `ptrue` instread of `pgtmp` here? The higher bits can be computed, but they have not influences to the final results, right? Thanks! 
For some reason I thought that we don't have a dedicated predicate register for that. > src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2106: > >> 2104: sve_mul(vtmp1, elemType_to_regVariant(bt), pgtmp, vtmp2); // multiply halves >> 2105: vector_length_in_bytes = vector_length_in_bytes / 2; >> 2106: vector_length = vector_length / 2; > > I guess you want to update the `pgtmp` with new `vector_length`? But seems the code is missing. Anyway, maybe the it's not necessary to generate a predicate as I commented above. It isn't exactly necessary similarly to how we can always use `ptrue` here. But yeah, I'll just remove it following the suggestion above. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2178009839 PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2178007165 From mchevalier at openjdk.org Tue Jul 1 16:14:00 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 1 Jul 2025 16:14:00 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v2] In-Reply-To: References: Message-ID: <1cFRkcs5JmgnbWEIaEoT8I9RiUtNxgKieAdkSB2Fgmc=.1d97b5c4-b6ef-43c6-b721-1e52eee19d3a@github.com> > When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 > > This is enforced by restoring the old state, like in > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 > > That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: > > ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) > > > Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. > > Another situation is somewhat worse, when happening during parsing. It can lead to such cases: > > ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) > > The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? > > This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. 
But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 > > And here there is the challenge: > - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) > - we can't really change the pointer, just the content > -... Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: Remove useless loop ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25936/files - new: https://git.openjdk.org/jdk/pull/25936/files/54b07e94..d51853ca Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25936&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25936&range=00-01 Stats: 24 lines in 1 file changed: 0 ins; 2 del; 22 mod Patch: https://git.openjdk.org/jdk/pull/25936.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25936/head:pull/25936 PR: https://git.openjdk.org/jdk/pull/25936 From mablakatov at openjdk.org Tue Jul 1 16:14:47 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Tue, 1 Jul 2025 16:14:47 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v4] In-Reply-To: <3sWLk_sAMLtcvRUjXk9hYe-K2MBQl9fH2Qg0MF7lwDk=.b8867d51-e822-43c0-93ab-58228c6eb1d5@github.com> References: <3sWLk_sAMLtcvRUjXk9hYe-K2MBQl9fH2Qg0MF7lwDk=.b8867d51-e822-43c0-93ab-58228c6eb1d5@github.com> Message-ID: On Tue, 1 Jul 2025 07:00:08 GMT, Xiaohong Gong wrote: >> Mikhail Ablakatov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: >> >> - cleanup: address nits, rename several symbols >> - cleanup: remove unreferenced definitions >> - Address review comments. >> >> - fixup: disable FP mul reduction auto-vectorization for all targets >> - fixup: add a tmp vReg to reduce_mul_integral_gt128b and >> reduce_non_strict_order_mul_fp_gt128bto keep vsrc unmodified >> - cleanup: replace a complex lambda in the above methods with a loop >> - cleanup: rename symbols to follow the existing naming convention >> - cleanup: add asserts to SVE only instructions >> - split mul FP reduction instructions into strictly-ordered (default) >> and explicitly non strictly-ordered >> - remove redundant conditions in TestVectorFPReduction.java >> >> Benchmarks results: >> >> Neoverse-V1 (SVE 256-bit) >> >> | Benchmark | Before | After | Units | Diff | >> |---------------------------|----------|----------|--------|-------| >> | ByteMaxVector.MULLanes | 619.156 | 9884.578 | ops/ms | 1496% | >> | DoubleMaxVector.MULLanes | 184.693 | 2712.051 | ops/ms | 1368% | >> | FloatMaxVector.MULLanes | 277.818 | 3388.038 | ops/ms | 1119% | >> | IntMaxVector.MULLanes | 371.225 | 4765.434 | ops/ms | 1183% | >> | LongMaxVector.MULLanes | 205.149 | 2672.975 | ops/ms | 1203% | >> | ShortMaxVector.MULLanes | 472.804 | 5122.917 | ops/ms | 984% | >> - Merge branch 'master' into 8343689-rebase >> - fixup: don't modify the value in vsrc >> >> Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this >> change, the result of recursive folding is held in vtmp1. To be able to >> pass this intermediate result to reduce_mul_integral_le128b(), we would >> have to use another temporary FloatRegister, as vtmp1 would essentially >> act as vsrc. 
It's possible to get around this however: >> reduce_mul_integral_le128b() is modified so it's possible to pass >> matching vsrc and vtmp2 arguments. By doing this, we save ourselves a >> temporary register in rules that match to reduce_mul_integral_gt128b(). >> - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formating >> - Use EXT instead of COMPACT to split a vector into two halves >> >> Benchmarks results: >> >> Neoverse-V1 (SVE 256-bit) >> >> Benchmark (size) Mode master ... > > src/hotspot/cpu/aarch64/aarch64_vector.ad line 3536: > >> 3534: >> 3535: instruct reduce_mulF_gt128b(vRegF dst, vRegF fsrc, vReg vsrc, vReg tmp) %{ >> 3536: predicate(Matcher::vector_length_in_bytes(n->in(2)) > 16 && n->as_Reduction()->requires_strict_order()); > > Are there the cases that can match with this rule? Well, we don't match it right now for auto-vectorization as it doesn't worth it performance-wise. This might change for future implementations of SVE(2). I'd still prefer to keep it so the set of instructions is complete. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2178014966 From mablakatov at openjdk.org Tue Jul 1 16:14:49 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Tue, 1 Jul 2025 16:14:49 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v5] In-Reply-To: References: Message-ID: <4XhaHrk4r0mgFmgfVUFvy0mktRz25oXfbln2Nhjcxg4=.a7e60853-979f-48de-9fa0-b8530a3b2ba5@github.com> On Tue, 1 Jul 2025 02:51:56 GMT, Hao Sun wrote: >> Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: >> >> fixup: remove undefined insts from aarch64-asmtest.py > > src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3729: > >> 3727: #undef INSN >> 3728: >> 3729: // SVE aliases > > In the inital commit, asm test for `sve_(mov|movs|not|nots)` is added into `test/hotspot/gtest/aarch64/aarch64-asmtest.py`. Since the definition is removed in this commit, the corresponding asm test should be removed as well. Otherwise, JDK build failed on AArch64. > See the error log in GHA test. https://github.com/mikabl-arm/jdk/actions/runs/15974069085/job/45051902618 Thanks, fixed by https://github.com/openjdk/jdk/pull/23181/commits/df09ab65f75c7b6f99e0088b3871d7df7a8c4d1b ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2178016339 From mablakatov at openjdk.org Tue Jul 1 16:25:49 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Tue, 1 Jul 2025 16:25:49 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v3] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 06:21:43 GMT, Xiaohong Gong wrote: >> Why is it better that way? Currently the assertions check that we end up here if there computations that can be done only using SVE (length > neon && length <= sve). What would happen if a user operates 256b VectorAPI vectors on a 512b SVE platform? > > That would be the operations with partial vector size valid. For such cases, we will generate a mask in IR level, and a `VectorBlend` will be generated for this reduction case. Otherwise the result will be incorrect. So the vector size should be equal to MaxVectorSize theoretically. 
Thank you for elaborating on this :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2178035000 From shade at openjdk.org Tue Jul 1 16:27:42 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 1 Jul 2025 16:27:42 GMT Subject: RFR: 8360557: CTW: Inline cold methods to reach more code In-Reply-To: References: Message-ID: <3T_kZY0tk0WcS4kkuGcoifEHjo1TlLbLBcjLxb4sD-I=.42bd833a-7fa2-4173-a165-f05e05e6e124@github.com> On Tue, 1 Jul 2025 16:04:12 GMT, Vladimir Kozlov wrote: > This has to be tested by us to make sure we clean up all issues this change find. Sure thing. There is a chicken-and-egg kind of problem that some bugs reproduce only with this PR, and maybe with extra inline tuning :) I am following up on failures that we are seeing. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26068#issuecomment-3024727152 From snatarajan at openjdk.org Tue Jul 1 16:28:27 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Tue, 1 Jul 2025 16:28:27 GMT Subject: RFR: 8325478: Restructure the macro expansion compiler phase to not include macro elimination [v8] In-Reply-To: References: Message-ID: <4W6QHi3F3RN-JYfYAKUATR_xCUnOiUR0vT73ndqNZtk=.0e193c07-cad0-4cbd-86f2-1758a8c8bac9@github.com> > This changeset restructures the macro expansion phase to not include macro elimination and also adds a flag StressMacroElimination which randomizes macro elimination ordering for stress testing purposes. > > Changes: > - Implemented a method `eliminate_opaque_looplimit_macro_nodes` that removes the functionality for eliminating Opaque and LoopLimit nodes from the `expand_macro_nodes ` method. > - Introduced compiler phases` PHASE_AFTER_MACRO_ELIMINATION` > - Added a new Ideal phase for individual macro elimination steps. > - Implemented the flag `StressMacroElimination`. Added functionality tests for `StressMacroElimination`, similar to previous stress flag `StressMacroExpansion` ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)). > > Below is a sample screenshot (IGV print level 4 ) mainly showing the new phase . > ![image](https://github.com/user-attachments/assets/16013cd4-6ec6-4939-ac66-33bb03d59cd6) > > Questions to reviewers: > - Is the new macro elimination phase OK, or should we change anything? > - In `compile.cpp `, `PHASE_ITER_GVN_AFTER_ELIMINATION` follows `PHASE_AFTER_MACRO_ELIMINATION` in the current fix. Should `PHASE_ITER_GVN_AFTER_ELIMINATION` be removed ? > > Testing: > GitHub Actions > tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. 
> Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: review comments fix part 1 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25682/files - new: https://git.openjdk.org/jdk/pull/25682/files/939be78b..791b6a0c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25682&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25682&range=06-07 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/25682.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25682/head:pull/25682 PR: https://git.openjdk.org/jdk/pull/25682 From eastigeevich at openjdk.org Tue Jul 1 16:43:38 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 1 Jul 2025 16:43:38 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v2] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 15:37:04 GMT, Aleksey Shipilev wrote: > Looks okay, but I am confused why the test did not fail before JDK-8359435? Just checked. It's not because of JDK-8359435. There were some changes which disabled printing debug info in release build. > test/hotspot/jtreg/compiler/onSpinWait/TestOnSpinWaitAArch64.java line 32: > >> 30: * @requires vm.flagless >> 31: * @requires os.arch=="aarch64" >> 32: * @requires vm.debug==true > > Can be just `@requires vm.debug`. Done ------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3024772962 PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2178059281 From eastigeevich at openjdk.org Tue Jul 1 16:43:39 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 1 Jul 2025 16:43:39 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v2] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 16:05:07 GMT, Evgeny Astigeevich wrote: >> Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. Release builds might not generate needed debug info. >> >> This PR adds a requirement for the test to be run on debug builds only. >> >> Tested: >> - Fastdebug: test passed >> - Slowdebug: test passed. >> - Release: test skipped. > > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Simplify requirement for debug build The test started failing after I had updated my branch. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3024774351 From kvn at openjdk.org Tue Jul 1 16:54:42 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 1 Jul 2025 16:54:42 GMT Subject: RFR: 8360641: TestCompilerCounts fails after 8354727 [v4] In-Reply-To: References: <3mMrDF_446r7HudbsHIpdoWByBlnUpjFo7YzIty0KG8=.facc058f-3975-44c4-b2d4-93b8c64db185@github.com> Message-ID: On Tue, 1 Jul 2025 06:52:32 GMT, Manuel H?ssig wrote: >> After integrating #25872 the calculation of the`CICompilerCount` ergonomic became dependent on the size of `NonNMethodCodeHeapSize`, which itself is an ergonomic based on the available memory. Thus, depending on the system, the test `compiler/arguments/TestCompilerCounts.java` failed, i.e. locally this failed, but not on CI servers. 
>> >> This PR changes the test to reflect the changes introduced in #25872. >> >> Testing: >> - [ ] [Github Actions](https://github.com/mhaessig/jdk/actions/runs/15932906313) >> - [x] tier1,tier2 plus Oracle internal testing > > Manuel H?ssig has updated the pull request incrementally with one additional commit since the last revision: > > Fix whitespace > > Co-authored-by: Andrey Turbanov Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26024#pullrequestreview-2976217571 From kvn at openjdk.org Tue Jul 1 17:08:43 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 1 Jul 2025 17:08:43 GMT Subject: RFR: 8325478: Restructure the macro expansion compiler phase to not include macro elimination [v7] In-Reply-To: References: Message-ID: <0KyjZgLy8vVqV3du6Y1LIKmGTnDYxEPlYgTrVVd_ey4=.b2d40e6c-ff0e-4e88-bc70-e06219a15608@github.com> On Tue, 1 Jul 2025 13:24:49 GMT, Saranya Natarajan wrote: >> src/hotspot/share/opto/compile.cpp line 2533: >> >>> 2531: { >>> 2532: TracePhase tp(_t_macroExpand); >>> 2533: print_method(PHASE_BEFORE_MACRO_EXPANSION, 3); >> >> Should we move it before `mex.expand_macro_nodes()` call? > > Moving this would break the assumption of needing a `BEFORE_MACRO_ELIMINATION` as explained in the above reply. One way to go about this would be to include a `BEFORE_MACRO_ELIMINATION` phase and remove the `PHASE_BEFORE_MACRO_EXPANSION` phase as this is only place where it is used. Would this be a reasonable fix ? So `MACRO_ELIMINATION` is subset of `MACRO_EXPANSION` >> src/hotspot/share/opto/phasetype.hpp line 94: >> >>> 92: flags(AFTER_LOOP_OPTS, "After Loop Optimizations") \ >>> 93: flags(AFTER_MERGE_STORES, "After Merge Stores") \ >>> 94: flags(AFTER_MACRO_ELIMINATION_STEP, "After Macro Elimination Step") \ >> >> What is the reason to not have `BEFORE_MACRO_ELIMINATION`? > > The two main reasons for not having a `BEFORE_MACRO_ELIMINATION` are as follows: > - There is a dump in line 2426 (`print_method(PHASE_ITER_GVN_AFTER_EA, 2)`) before we call `mexp.eliminate_macro_nodes` which performs the functionality of having a `BEFORE_MACRO_ELIMINATION` for phase dump. > - There is dump in line 2533 (`print_method(PHASE_BEFORE_MACRO_EXPANSION, 3)`) before eliminating macro nodes which performs the similar function. ok ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25682#discussion_r2178120635 PR Review Comment: https://git.openjdk.org/jdk/pull/25682#discussion_r2178120168 From kvn at openjdk.org Tue Jul 1 17:08:41 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 1 Jul 2025 17:08:41 GMT Subject: RFR: 8325478: Restructure the macro expansion compiler phase to not include macro elimination [v8] In-Reply-To: <4W6QHi3F3RN-JYfYAKUATR_xCUnOiUR0vT73ndqNZtk=.0e193c07-cad0-4cbd-86f2-1758a8c8bac9@github.com> References: <4W6QHi3F3RN-JYfYAKUATR_xCUnOiUR0vT73ndqNZtk=.0e193c07-cad0-4cbd-86f2-1758a8c8bac9@github.com> Message-ID: On Tue, 1 Jul 2025 16:28:27 GMT, Saranya Natarajan wrote: >> This changeset restructures the macro expansion phase to not include macro elimination and also adds a flag StressMacroElimination which randomizes macro elimination ordering for stress testing purposes. >> >> Changes: >> - Implemented a method `eliminate_opaque_looplimit_macro_nodes` that removes the functionality for eliminating Opaque and LoopLimit nodes from the `expand_macro_nodes ` method. >> - Introduced compiler phases` PHASE_AFTER_MACRO_ELIMINATION` >> - Added a new Ideal phase for individual macro elimination steps. 
>> - Implemented the flag `StressMacroElimination`. Added functionality tests for `StressMacroElimination`, similar to previous stress flag `StressMacroExpansion` ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)). >> >> Below is a sample screenshot (IGV print level 4 ) mainly showing the new phase . >> ![image](https://github.com/user-attachments/assets/16013cd4-6ec6-4939-ac66-33bb03d59cd6) >> >> Questions to reviewers: >> - Is the new macro elimination phase OK, or should we change anything? >> - In `compile.cpp `, `PHASE_ITER_GVN_AFTER_ELIMINATION` follows `PHASE_AFTER_MACRO_ELIMINATION` in the current fix. Should `PHASE_ITER_GVN_AFTER_ELIMINATION` be removed ? >> >> Testing: >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. >> Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > review comments fix part 1 Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25682#pullrequestreview-2976289575 From shade at openjdk.org Tue Jul 1 17:13:43 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 1 Jul 2025 17:13:43 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v2] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 16:05:07 GMT, Evgeny Astigeevich wrote: >> Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. Release builds might not generate needed debug info. >> >> This PR adds a requirement for the test to be run on debug builds only. >> >> Tested: >> - Fastdebug: test passed >> - Slowdebug: test passed. >> - Release: test skipped. > > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Simplify requirement for debug build OK, are you able to bisect which change? This fix to only do debug VM needs to be correctly linked to the actual cause, IMO. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3024882710 From psandoz at openjdk.org Tue Jul 1 18:06:45 2025 From: psandoz at openjdk.org (Paul Sandoz) Date: Tue, 1 Jul 2025 18:06:45 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: References: Message-ID: On Wed, 25 Jun 2025 09:16:48 GMT, Xiaohong Gong wrote: >> JDK-8318650 introduced hotspot intrinsification of subword gather load APIs for X86 platforms [1]. However, the current implementation is not optimal for AArch64 SVE platform, which natively supports vector instructions for subword gather load operations using an int vector for indices (see [2][3]). >> >> Two key areas require improvement: >> 1. At the Java level, vector indices generated for range validation could be reused for the subsequent gather load operation on architectures with native vector instructions like AArch64 SVE. However, the current implementation prevents compiler reuse of these index vectors due to divergent control flow, potentially impacting performance. >> 2. At the compiler IR level, the additional `offset` input for `LoadVectorGather`/`LoadVectorGatherMasked` with subword types increases IR complexity and complicates backend implementation. 
Furthermore, generating `add` instructions before each memory access negatively impacts performance. >> >> This patch refactors the implementation at both the Java level and compiler mid-end to improve efficiency and maintainability across different architectures. >> >> Main changes: >> 1. Java-side API refactoring: >> - Explicitly passes generated index vectors to hotspot, eliminating duplicate index vectors for gather load instructions on >> architectures like AArch64. >> 2. C2 compiler IR refactoring: >> - Refactors `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword types by removing the memory offset input and incorporating it into the memory base `addr` at the IR level. This simplifies backend implementation, reduces add operations, and unifies the IR across all types. >> 3. Backend changes: >> - Streamlines X86 implementation of subword gather operations following the removal of the offset input from the IR level. >> >> Performance: >> The performance of the relative JMH improves up to 27% on a X86 AVX512 system. Please see the data below: >> >> Benchmark Mode Cnt Unit SIZE Before After Gain >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 64 53682.012 52650.325 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 256 14484.252 14255.156 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 1024 3664.900 3595.615 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 4096 908.31... > > Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: > > - Address review comments > - Merge 'jdk:master' into JDK-8355563 > - 8355563: VectorAPI: Refactor current implementation of subword gather load API Marked as reviewed by psandoz (Reviewer). This is a nice simplification, Java changes look good. I'll let the Intel folks sign-off related to regressions. IMO minor regressions like this are acceptable if the generated code quality is good, and if the benchmark reports higher variance and averaging results from multiple forks close the gap. (In this case i don't understand how the Java changes impacts alignment). ------------- PR Review: https://git.openjdk.org/jdk/pull/25138#pullrequestreview-2976493924 PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3025029477 From dlunden at openjdk.org Tue Jul 1 18:08:40 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Tue, 1 Jul 2025 18:08:40 GMT Subject: RFR: 8325478: Restructure the macro expansion compiler phase to not include macro elimination [v8] In-Reply-To: <4W6QHi3F3RN-JYfYAKUATR_xCUnOiUR0vT73ndqNZtk=.0e193c07-cad0-4cbd-86f2-1758a8c8bac9@github.com> References: <4W6QHi3F3RN-JYfYAKUATR_xCUnOiUR0vT73ndqNZtk=.0e193c07-cad0-4cbd-86f2-1758a8c8bac9@github.com> Message-ID: On Tue, 1 Jul 2025 16:28:27 GMT, Saranya Natarajan wrote: >> This changeset restructures the macro expansion phase to not include macro elimination and also adds a flag StressMacroElimination which randomizes macro elimination ordering for stress testing purposes. >> >> Changes: >> - Implemented a method `eliminate_opaque_looplimit_macro_nodes` that removes the functionality for eliminating Opaque and LoopLimit nodes from the `expand_macro_nodes ` method. >> - Introduced compiler phases` PHASE_AFTER_MACRO_ELIMINATION` >> - Added a new Ideal phase for individual macro elimination steps. >> - Implemented the flag `StressMacroElimination`. 
Added functionality tests for `StressMacroElimination`, similar to previous stress flag `StressMacroExpansion` ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)). >> >> Below is a sample screenshot (IGV print level 4 ) mainly showing the new phase . >> ![image](https://github.com/user-attachments/assets/16013cd4-6ec6-4939-ac66-33bb03d59cd6) >> >> Questions to reviewers: >> - Is the new macro elimination phase OK, or should we change anything? >> - In `compile.cpp `, `PHASE_ITER_GVN_AFTER_ELIMINATION` follows `PHASE_AFTER_MACRO_ELIMINATION` in the current fix. Should `PHASE_ITER_GVN_AFTER_ELIMINATION` be removed ? >> >> Testing: >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. >> Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > review comments fix part 1 Marked as reviewed by dlunden (Committer). ------------- PR Review: https://git.openjdk.org/jdk/pull/25682#pullrequestreview-2976500092 From sviswanathan at openjdk.org Tue Jul 1 21:33:44 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 1 Jul 2025 21:33:44 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: References: Message-ID: On Wed, 25 Jun 2025 09:16:48 GMT, Xiaohong Gong wrote: >> JDK-8318650 introduced hotspot intrinsification of subword gather load APIs for X86 platforms [1]. However, the current implementation is not optimal for AArch64 SVE platform, which natively supports vector instructions for subword gather load operations using an int vector for indices (see [2][3]). >> >> Two key areas require improvement: >> 1. At the Java level, vector indices generated for range validation could be reused for the subsequent gather load operation on architectures with native vector instructions like AArch64 SVE. However, the current implementation prevents compiler reuse of these index vectors due to divergent control flow, potentially impacting performance. >> 2. At the compiler IR level, the additional `offset` input for `LoadVectorGather`/`LoadVectorGatherMasked` with subword types increases IR complexity and complicates backend implementation. Furthermore, generating `add` instructions before each memory access negatively impacts performance. >> >> This patch refactors the implementation at both the Java level and compiler mid-end to improve efficiency and maintainability across different architectures. >> >> Main changes: >> 1. Java-side API refactoring: >> - Explicitly passes generated index vectors to hotspot, eliminating duplicate index vectors for gather load instructions on >> architectures like AArch64. >> 2. C2 compiler IR refactoring: >> - Refactors `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword types by removing the memory offset input and incorporating it into the memory base `addr` at the IR level. This simplifies backend implementation, reduces add operations, and unifies the IR across all types. >> 3. Backend changes: >> - Streamlines X86 implementation of subword gather operations following the removal of the offset input from the IR level. >> >> Performance: >> The performance of the relative JMH improves up to 27% on a X86 AVX512 system. 
Please see the data below: >> >> Benchmark Mode Cnt Unit SIZE Before After Gain >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 64 53682.012 52650.325 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 256 14484.252 14255.156 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 1024 3664.900 3595.615 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 4096 908.31... > > Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: > > - Address review comments > - Merge 'jdk:master' into JDK-8355563 > - 8355563: VectorAPI: Refactor current implementation of subword gather load API Marked as reviewed by sviswanathan (Reviewer). Agree with Paul, these are minor regressions. Let us proceed with this patch. ------------- PR Review: https://git.openjdk.org/jdk/pull/25138#pullrequestreview-2977019367 PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3025596784 From sviswanathan at openjdk.org Wed Jul 2 00:04:39 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 2 Jul 2025 00:04:39 GMT Subject: RFR: 8360116: Add support for AVX10 floating point minmax instruction [v5] In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 08:38:27 GMT, Jatin Bhateja wrote: >> Intel@ AVX10 ISA [1] extensions added new floating point MIN/MAX instructions which comply with definitions in IEEE-754-2019 standard section 9.6 and can directly emulate Math.min/max semantics without the need for any special handling for NaN, +0.0 or -0.0 detection. >> >> **The following pseudo-code describes the existing algorithm for min/max[FD]:** >> >> Move the non-negative value to the second operand; this will ensure that we correctly handle 0.0 and -0.0 values, if values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is returned. Existing MINPS and MAXPS semantics only check for NaN as the second operand; hence, we need special handling to check for NaN at the first operand. >> >> btmp = (b < +0.0) ? a : b >> atmp = (b < +0.0) ? b : a >> Tmp = Max_Float(atmp , btmp) >> Res = (atmp == NaN) ? atmp : Tmp >> >> For min[FD] we need a small tweak in the above algorithm, i.e., move the non-negative value to the first operand, this will ensure that we correctly select -0.0 if both the operands being compared are 0.0 or -0.0. >> >> btmp = (b < +0.0) ? b : a >> atmp = (b < +0.0) ? a : b >> Tmp = Max_Float(atmp , btmp) >> Res = (atmp == NaN) ? atmp : Tmp >> >> Thus, we need additional special handling for NaNs and +/-0.0 to compute floating-point min/max values to comply with the semantics of Math.max/min APIs using existing MINPS / MAXPS instructions. AVX10.2 added a new instruction, VPMINMAX[SH,SS,SD]/[PH,PS,PD], which comprehensively handles special cases, thereby eliminating the need for special handling. >> >> Patch emits new instructions for reduction and non-reduction operations for single, double, and Float16 type. >> >> Kindly review and share your feedback. 
>> >> Best Regards, >> Jatin >> >> [1] https://www.intel.com/content/www/us/en/content-details/856721/intel-advanced-vector-extensions-10-2-intel-avx10-2-architecture-specification.html?wapkw=AVX10 > > Jatin Bhateja has updated the pull request incrementally with two additional commits since the last revision: > > - Update src/hotspot/cpu/x86/x86_64.ad > > Co-authored-by: Manuel H?ssig > - Update src/hotspot/cpu/x86/x86_64.ad > > Co-authored-by: Manuel H?ssig src/hotspot/cpu/x86/assembler_x86.cpp line 8800: > 8798: attributes.set_is_evex_instruction(); > 8799: attributes.set_embedded_opmask_register_specifier(mask); > 8800: attributes.set_address_attributes(/* tuple_type */ EVEX_FVM, /* input_size_in_bits */ EVEX_NObit); It looks to me that the tuple_type should be EVEX_FV for all of evminmax ps, pd, ph. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25914#discussion_r2178735442 From kbarrett at openjdk.org Wed Jul 2 00:30:44 2025 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 2 Jul 2025 00:30:44 GMT Subject: RFR: 8361086: JVMCIGlobals::check_jvmci_flags_are_consistent has incorrect format string In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 23:16:20 GMT, Vladimir Kozlov wrote: >> Please review this trivial fix of a format string. The value being printed is >> TieredStopAtLevel, which is of type intx, so "%zd" should be used instead of "%d". >> >> Testing: mach5 tier1 > > Thank you for checking other solutions. > > Current fix is good. Thanks for reviews @vnkozlov , @mhaessig , and @mur47x111 ------------- PR Comment: https://git.openjdk.org/jdk/pull/26051#issuecomment-3025913681 From kbarrett at openjdk.org Wed Jul 2 00:30:44 2025 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 2 Jul 2025 00:30:44 GMT Subject: Integrated: 8361086: JVMCIGlobals::check_jvmci_flags_are_consistent has incorrect format string In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 16:14:08 GMT, Kim Barrett wrote: > Please review this trivial fix of a format string. The value being printed is > TieredStopAtLevel, which is of type intx, so "%zd" should be used instead of "%d". > > Testing: mach5 tier1 This pull request has now been integrated. Changeset: c6448dc3 Author: Kim Barrett URL: https://git.openjdk.org/jdk/commit/c6448dc3afb1da9d93bb94804aa1971a650b91b7 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8361086: JVMCIGlobals::check_jvmci_flags_are_consistent has incorrect format string Reviewed-by: kvn, mhaessig, yzheng ------------- PR: https://git.openjdk.org/jdk/pull/26051 From sviswanathan at openjdk.org Wed Jul 2 00:31:41 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 2 Jul 2025 00:31:41 GMT Subject: RFR: 8360116: Add support for AVX10 floating point minmax instruction [v5] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 23:49:30 GMT, Sandhya Viswanathan wrote: >> Jatin Bhateja has updated the pull request incrementally with two additional commits since the last revision: >> >> - Update src/hotspot/cpu/x86/x86_64.ad >> >> Co-authored-by: Manuel H?ssig >> - Update src/hotspot/cpu/x86/x86_64.ad >> >> Co-authored-by: Manuel H?ssig > > src/hotspot/cpu/x86/assembler_x86.cpp line 8800: > >> 8798: attributes.set_is_evex_instruction(); >> 8799: attributes.set_embedded_opmask_register_specifier(mask); >> 8800: attributes.set_address_attributes(/* tuple_type */ EVEX_FVM, /* input_size_in_bits */ EVEX_NObit); > > It looks to me that the tuple_type should be EVEX_FV for all of evminmax ps, pd, ph. 
Other than that the rest of the PR looks good to me. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25914#discussion_r2178762877 From xgong at openjdk.org Wed Jul 2 01:45:46 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 2 Jul 2025 01:45:46 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v4] In-Reply-To: <19rf4A0bxc4BstRmLivGkoCOm7Qa7YD6z1VJHJivCtg=.4a643c7b-4e79-4f37-b230-7231df3c68a8@github.com> References: <3sWLk_sAMLtcvRUjXk9hYe-K2MBQl9fH2Qg0MF7lwDk=.b8867d51-e822-43c0-93ab-58228c6eb1d5@github.com> <19rf4A0bxc4BstRmLivGkoCOm7Qa7YD6z1VJHJivCtg=.4a643c7b-4e79-4f37-b230-7231df3c68a8@github.com> Message-ID: On Tue, 1 Jul 2025 16:07:59 GMT, Mikhail Ablakatov wrote: >> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2097: >> >>> 2095: sve_movprfx(vtmp1, vsrc); // copy >>> 2096: sve_ext(vtmp1, vtmp1, vector_length_in_bytes / 2); // swap halves >>> 2097: sve_mul(vtmp1, elemType_to_regVariant(bt), pgtmp, vsrc); // multiply halves >> >>> sve_mul(vtmp1, elemType_to_regVariant(bt), pgtmp, vsrc); >> >> Can we use `ptrue` instread of `pgtmp` here? The higher bits can be computed, but they have not influences to the final results, right? > > Thanks! For some reason I thought that we don't have a dedicated predicate register for that. We can directly use `ptrue` here which maps to `p7` and has been preserved and initialized as all true. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2178816427 From xgong at openjdk.org Wed Jul 2 01:48:50 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 2 Jul 2025 01:48:50 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v4] In-Reply-To: References: <3sWLk_sAMLtcvRUjXk9hYe-K2MBQl9fH2Qg0MF7lwDk=.b8867d51-e822-43c0-93ab-58228c6eb1d5@github.com> Message-ID: On Tue, 1 Jul 2025 16:10:58 GMT, Mikhail Ablakatov wrote: >> src/hotspot/cpu/aarch64/aarch64_vector.ad line 3536: >> >>> 3534: >>> 3535: instruct reduce_mulF_gt128b(vRegF dst, vRegF fsrc, vReg vsrc, vReg tmp) %{ >>> 3536: predicate(Matcher::vector_length_in_bytes(n->in(2)) > 16 && n->as_Reduction()->requires_strict_order()); >> >> Are there the cases that can match with this rule? > > Well, we don't match it right now for auto-vectorization as it doesn't worth it performance-wise. This might change for future implementations of SVE(2). I'd still prefer to keep it so the set of instructions is complete. Removing is fine to me, as actually we do not have the case to test the correctness. Or maybe you could just do some changes locally (e.g. removing the `requires_strict_order` predication and the un-strict-order rule), and test it with VectorAPI cases? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2178819064 From xgong at openjdk.org Wed Jul 2 01:54:50 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 2 Jul 2025 01:54:50 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 18:03:33 GMT, Paul Sandoz wrote: > This is a nice simplification, Java changes look good. I'll let the Intel folks sign-off related to regressions. IMO minor regressions like this are acceptable if the generated code quality is good, and if the benchmark reports higher variance and averaging results from multiple forks close the gap. (In this case i don't understand how the Java changes impacts alignment). Thanks for your review and comments @PaulSandoz ! 
The Java changes in this patch mean the outer loop in the test is no longer peeled as before, since all the range checks and branches are hoisted outside of the loop, whereas previously one iteration of loop peeling was needed to eliminate the branches. I think this changes the layout of the generated code quite a lot. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3026080127 From xgong at openjdk.org Wed Jul 2 01:54:51 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 2 Jul 2025 01:54:51 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: References: Message-ID: <7GrGfBF_v8F0v02sRHC78ofMZwpMdzQZaHeYlNvi_N0=.93defb9e-ca9b-41b4-8722-1746692e2316@github.com> On Tue, 1 Jul 2025 21:30:20 GMT, Sandhya Viswanathan wrote: > Agree with Paul, these are minor regressions. Let us proceed with this patch. Thanks so much for your review @sviswa7 ! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3026080679 From jbhateja at openjdk.org Wed Jul 2 01:57:46 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 2 Jul 2025 01:57:46 GMT Subject: RFR: 8360116: Add support for AVX10 floating point minmax instruction [v6] In-Reply-To: References: Message-ID: > Intel@ AVX10 ISA [1] extensions added new floating point MIN/MAX instructions which comply with definitions in IEEE-754-2019 standard section 9.6 and can directly emulate Math.min/max semantics without the need for any special handling for NaN, +0.0 or -0.0 detection. > > **The following pseudo-code describes the existing algorithm for min/max[FD]:** > > Move the non-negative value to the second operand; this will ensure that we correctly handle 0.0 and -0.0 values, if values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is returned. Existing MINPS and MAXPS semantics only check for NaN as the second operand; hence, we need special handling to check for NaN at the first operand. > > btmp = (b < +0.0) ? a : b > atmp = (b < +0.0) ? b : a > Tmp = Max_Float(atmp , btmp) > Res = (atmp == NaN) ? atmp : Tmp > > For min[FD] we need a small tweak in the above algorithm, i.e., move the non-negative value to the first operand, this will ensure that we correctly select -0.0 if both the operands being compared are 0.0 or -0.0. > > btmp = (b < +0.0) ? b : a > atmp = (b < +0.0) ? a : b > Tmp = Max_Float(atmp , btmp) > Res = (atmp == NaN) ? atmp : Tmp > > Thus, we need additional special handling for NaNs and +/-0.0 to compute floating-point min/max values to comply with the semantics of Math.max/min APIs using existing MINPS / MAXPS instructions. AVX10.2 added a new instruction, VPMINMAX[SH,SS,SD]/[PH,PS,PD], which comprehensively handles special cases, thereby eliminating the need for special handling. > > Patch emits new instructions for reduction and non-reduction operations for single, double, and Float16 type. > > Kindly review and share your feedback.
> > Best Regards, > Jatin > > [1] https://www.intel.com/content/www/us/en/content-details/856721/intel-advanced-vector-extensions-10-2-intel-avx10-2-architecture-specification.html?wapkw=AVX10 Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Sandhya's review comments resolution ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25914/files - new: https://git.openjdk.org/jdk/pull/25914/files/5597b615..3854a871 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25914&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25914&range=04-05 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/25914.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25914/head:pull/25914 PR: https://git.openjdk.org/jdk/pull/25914 From jbhateja at openjdk.org Wed Jul 2 02:04:41 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 2 Jul 2025 02:04:41 GMT Subject: RFR: 8360116: Add support for AVX10 floating point minmax instruction [v5] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 00:29:02 GMT, Sandhya Viswanathan wrote: >> src/hotspot/cpu/x86/assembler_x86.cpp line 8800: >> >>> 8798: attributes.set_is_evex_instruction(); >>> 8799: attributes.set_embedded_opmask_register_specifier(mask); >>> 8800: attributes.set_address_attributes(/* tuple_type */ EVEX_FVM, /* input_size_in_bits */ EVEX_NObit); >> >> It looks to me that the tuple_type should be EVEX_FV for all of evminmax ps, pd, ph. > > Other than that the rest of the PR looks good to me. > It looks to me that the tuple_type should be EVEX_FV for all of evminmax ps, pd, ph. Yes, all these new vector instructions do have embedded broadcast variants. We don't use them currently, in the absence of embedded broadcasting, the scalar factor (N) selection for compressed disp8 displacement is the same for both EVEX_FV and EVEX_FVM tuple types. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25914#discussion_r2178831749 From xgong at openjdk.org Wed Jul 2 02:39:33 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 2 Jul 2025 02:39:33 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v2] In-Reply-To: References: Message-ID: > ### Background > On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. > > For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. > > To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. > > ### Impact Analysis > #### 1. Vector types > Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. > > #### 2. Vector API > No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. 
Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. > > #### 3. Auto-vectorization > Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. > > #### 4. Codegen of vector nodes > NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. > > Details: > - Lanewise vector operations are unaffected as explained above. > - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). > - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, adding an explicit vector size check in `match_rule_s... Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: Refine comments based on review suggestion ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26057/files - new: https://git.openjdk.org/jdk/pull/26057/files/5af5bd49..4e15e588 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26057&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26057&range=00-01 Stats: 9 lines in 3 files changed: 0 ins; 0 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/26057.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26057/head:pull/26057 PR: https://git.openjdk.org/jdk/pull/26057 From xgong at openjdk.org Wed Jul 2 02:39:34 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 2 Jul 2025 02:39:34 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors In-Reply-To: References: Message-ID: <0PdYt-pCobM5mAb4q3nDcR9PKz89QVFCsZF-jnMAv4Q=.6a5d9f1f-8b68-448c-ab72-2f7f4a12322e@github.com> On Tue, 1 Jul 2025 05:59:15 GMT, Xiaohong Gong wrote: > ### Background > On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. > > For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. > > To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. > > ### Impact Analysis > #### 1. Vector types > Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. > > #### 2. Vector API > No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. 
It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. > > #### 3. Auto-vectorization > Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. > > #### 4. Codegen of vector nodes > NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. > > Details: > - Lanewise vector operations are unaffected as explained above. > - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). > - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, adding an explicit vector size check in `match_rule_s... Hi @theRealAph , I'v updated the patch by fixing the comment issues. Could you please take a look at it again? Thanks a lot! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3026147575 From thartmann at openjdk.org Wed Jul 2 05:22:38 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Jul 2025 05:22:38 GMT Subject: RFR: 8361180: Disable CompiledDirectCall verification with -VerifyInlineCaches In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 10:47:40 GMT, Aleksey Shipilev wrote: > Missed the spot when doing [JDK-8360867](https://bugs.openjdk.org/browse/JDK-8360867). There is a path from GC that calls into IC verification when cleaning the caches. See `nmethod::cleanup_inline_caches_impl`. It does verification per callsite, and does the whole thing during parallel GC cleanup, which is STW at least in G1. This gets expensive for CTW scenarios. We should wrap that under the same flag introduced by [JDK-8360867](https://bugs.openjdk.org/browse/JDK-8360867). > > Motivational improvements: > > > $ time CONF=linux-x86_64-server-fastdebug make test TEST=applications/ctw/modules/ > > # Current mainline > real 3m59.274s > user 68m9.663s > sys 5m19.026s > > # This PR > real 3m49.118s > user 65m37.962s > sys 5m15.441s Looks good to me too. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26063#pullrequestreview-2977769711 From thartmann at openjdk.org Wed Jul 2 05:36:24 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Jul 2025 05:36:24 GMT Subject: [jdk25] RFR: 8358179: Performance regression in Math.cbrt Message-ID: Hi all, This pull request contains a backport of commit [38f59f84](https://github.com/openjdk/jdk/commit/38f59f84c98dfd974eec0c05541b2138b149def7) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. The commit being backported was authored by Mohamed Issa on 1 Jul 2025 and was reviewed by Sandhya Viswanathan, Srinivas Vamsi Parasa and Emanuel Peter. Thanks! 
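(As an aside on the 8359419 "Relax min vector length to 32-bit for short vectors" thread above: the ShortVector-to-LongVector conversion it describes looks roughly like the sketch below. This is only an illustration, not code from that patch; the class and method names are invented, and it needs `--add-modules jdk.incubator.vector` to compile and run.)

```java
import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.ShortVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class ShortToLongConvertSketch {
    static final VectorSpecies<Short> S_128 = ShortVector.SPECIES_128;
    static final VectorSpecies<Long>  L_128 = LongVector.SPECIES_128;

    // Widens the two lowest short lanes of 'a' into a 128-bit long vector and adds them to 'b'.
    // The intermediate two-element short vector is the 32-bit shape discussed in the thread.
    static void addWidened(short[] a, long[] b) {
        ShortVector sv = ShortVector.fromArray(S_128, a, 0);                          // 8 short lanes
        LongVector lv = (LongVector) sv.convertShape(VectorOperators.S2L, L_128, 0);  // lanes 0..1 widened
        lv.add(LongVector.fromArray(L_128, b, 0)).intoArray(b, 0);
    }
}
```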
------------- Commit messages: - Backport 38f59f84c98dfd974eec0c05541b2138b149def7 Changes: https://git.openjdk.org/jdk/pull/26085/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26085&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8358179 Stats: 50 lines in 1 file changed: 11 ins; 36 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/26085.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26085/head:pull/26085 PR: https://git.openjdk.org/jdk/pull/26085 From shade at openjdk.org Wed Jul 2 05:40:42 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 2 Jul 2025 05:40:42 GMT Subject: RFR: 8361180: Disable CompiledDirectCall verification with -VerifyInlineCaches In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 10:47:40 GMT, Aleksey Shipilev wrote: > Missed the spot when doing [JDK-8360867](https://bugs.openjdk.org/browse/JDK-8360867). There is a path from GC that calls into IC verification when cleaning the caches. See `nmethod::cleanup_inline_caches_impl`. It does verification per callsite, and does the whole thing during parallel GC cleanup, which is STW at least in G1. This gets expensive for CTW scenarios. We should wrap that under the same flag introduced by [JDK-8360867](https://bugs.openjdk.org/browse/JDK-8360867). > > Motivational improvements: > > > $ time CONF=linux-x86_64-server-fastdebug make test TEST=applications/ctw/modules/ > > # Current mainline > real 3m59.274s > user 68m9.663s > sys 5m19.026s > > # This PR > real 3m49.118s > user 65m37.962s > sys 5m15.441s Thanks! Here goes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26063#issuecomment-3026519823 From shade at openjdk.org Wed Jul 2 05:40:43 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 2 Jul 2025 05:40:43 GMT Subject: Integrated: 8361180: Disable CompiledDirectCall verification with -VerifyInlineCaches In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 10:47:40 GMT, Aleksey Shipilev wrote: > Missed the spot when doing [JDK-8360867](https://bugs.openjdk.org/browse/JDK-8360867). There is a path from GC that calls into IC verification when cleaning the caches. See `nmethod::cleanup_inline_caches_impl`. It does verification per callsite, and does the whole thing during parallel GC cleanup, which is STW at least in G1. This gets expensive for CTW scenarios. We should wrap that under the same flag introduced by [JDK-8360867](https://bugs.openjdk.org/browse/JDK-8360867). > > Motivational improvements: > > > $ time CONF=linux-x86_64-server-fastdebug make test TEST=applications/ctw/modules/ > > # Current mainline > real 3m59.274s > user 68m9.663s > sys 5m19.026s > > # This PR > real 3m49.118s > user 65m37.962s > sys 5m15.441s This pull request has now been integrated. Changeset: 1ac74898 Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/1ac74898745ce9b109db5571d9dcbd907dd05831 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8361180: Disable CompiledDirectCall verification with -VerifyInlineCaches Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/26063 From yongheng_hgq at 126.com Wed Jul 2 05:49:19 2025 From: yongheng_hgq at 126.com (h) Date: Wed, 2 Jul 2025 13:49:19 +0800 (CST) Subject: =?GBK?Q?RFR:_8358568=A3=BAC2_compilation_hits_"must_have_a_mon?= =?GBK?Q?itor"_assert_with_-XX:-GenerateSynchronizationCode?= Message-ID: <5f3eb53a.5267.197c9aeb416.Coremail.yongheng_hgq@126.com> Hi all, Please review this fix for JDK-8358568. 
It addresses a crash caused by accessing monitor info when -XX:-GenerateSynchronizationCode is set. The fix adds a guard in Parse::do_monitor_exit() to avoid the crash. Thank you in advance. Changes: https://github.com/openjdk/jdk8u-dev/pull/664/files webrev: https://openjdk.github.io/cr/?repo=jdk8u-dev&pr=664&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8358568 Patch: https://git.openjdk.org/jdk8u-dev/pull/664.diff PR: https://github.com/openjdk/jdk8u-dev/pull/664 BR -------------- next part -------------- An HTML attachment was scrubbed... URL: From haosun at openjdk.org Wed Jul 2 06:45:46 2025 From: haosun at openjdk.org (Hao Sun) Date: Wed, 2 Jul 2025 06:45:46 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v3] In-Reply-To: References: Message-ID: <8L1C1JR9H-GIASZlUG7Gk5Jf9rjVEVuBn-Sf9r8STYA=.843085aa-efb3-436e-acb3-ab4d1f52a9d8@github.com> On Wed, 18 Jun 2025 12:12:16 GMT, Mikhail Ablakatov wrote: >> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2002: >> >>> 2000: assert(vector_length_in_bytes == 8 || vector_length_in_bytes == 16, "unsupported"); >>> 2001: assert_different_registers(vtmp1, vsrc); >>> 2002: assert_different_registers(vtmp1, vtmp2); >> >> nit: would be neat to use >> Suggestion: >> >> assert_different_registers(vsrc, vtmp1, vtmp2); > > `vsrc` and `vtmp2` are allowed to match. I see your point. IIUC, we should not modify `vsrc` as it's the source operand. If we allow `vsrc` and `vtmp2` to match, then `vsrc` is modified then. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2179185158 From haosun at openjdk.org Wed Jul 2 06:45:48 2025 From: haosun at openjdk.org (Hao Sun) Date: Wed, 2 Jul 2025 06:45:48 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v5] In-Reply-To: References: Message-ID: <2zMCHzKXQ1kBfjcU5Fc8s6fa2W6TTCKpSSjhB0dMdLw=.3c43071b-3982-4e0e-a300-e0547f4fbbec@github.com> On Tue, 1 Jul 2025 15:48:00 GMT, Mikhail Ablakatov wrote: >> Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. >> >> Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still. >> >> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reducion micro-benchmarks.
>> >> Benchmarks results: >> >> Neoverse-V1 (SVE 256-bit) >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms >> >> >> Fujitsu A64FX (SVE 512-bit): >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms > > Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: > > fixup: remove undefined insts from aarch64-asmtest.py test/hotspot/jtreg/compiler/loopopts/superword/TestVectorFPReduction.java line 2: > 1: /* > 2: * Copyright (c) 2025, Arm Limited. All rights reserved. `XX, YY,` means this file was created at XX year and the latest update was made at YY year. If `XX=YY`, then use `XX,`. Suggestion: * Copyright (c) 2024, 2025, Arm Limited. All rights reserved. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2178924210 From dfenacci at openjdk.org Wed Jul 2 07:05:40 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Wed, 2 Jul 2025 07:05:40 GMT Subject: RFR: 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal In-Reply-To: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> References: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> Message-ID: On Tue, 1 Jul 2025 11:35:06 GMT, Beno?t Maillard wrote: > This PR adds a node hash comparison after calling `Ideal` in `PhaseIterGVN::verify_Ideal_for` to introduce an additional layer of verification for missed optimizations. Previously, we relied on the return value of `Ideal`, which is expected to be `nullptr` if no transformation was done. > > By also checking the node's hash before and after `Ideal`, we could catch inconsistencies in the implementation or unintended modifications to the graph. Both of these may indicate missed or incomplete optimizations. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361144) > - [x] tier1-3, plus some internal testing > > Thank you for reviewing! Thanks @benoitmaillard! Definitely an additional check worth doing. I left a couple of inline comments. src/hotspot/share/opto/phaseX.cpp line 1821: > 1819: // The number of nodes shoud not increase. > 1820: uint old_unique = C->unique(); > 1821: uint old_hash = n->hash(); Just to be consistent with `old_unique` we could add a small comment (here or below for both). What do you think? src/hotspot/share/opto/phaseX.cpp line 1838: > 1836: stringStream ss; // Print as a block without tty lock. 
> 1837: ss.cr(); > 1838: ss.print_cr("Ideal optimization did not make progress but hash node changed."); Suggestion: ss.print_cr("Ideal optimization did not make progress but node hash changed."); ------------- PR Review: https://git.openjdk.org/jdk/pull/26064#pullrequestreview-2977964471 PR Review Comment: https://git.openjdk.org/jdk/pull/26064#discussion_r2179270798 PR Review Comment: https://git.openjdk.org/jdk/pull/26064#discussion_r2179279429 From bmaillard at openjdk.org Wed Jul 2 07:19:30 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Wed, 2 Jul 2025 07:19:30 GMT Subject: RFR: 8359602: Ideal optimizations depending on input type are missed because of missing notification mechanism from CCP [v5] In-Reply-To: <-_MqCH6QmE-o_d7c9-aet-Cq-ptZJ6CZV6rodpDNWq0=.173e6f7a-3cfe-4791-8253-36e06d892069@github.com> References: <-_MqCH6QmE-o_d7c9-aet-Cq-ptZJ6CZV6rodpDNWq0=.173e6f7a-3cfe-4791-8253-36e06d892069@github.com> Message-ID: > This PR prevents some missed ideal optimizations in IGVN by notifying users of type refinements made during CCP, addressing a missed optimization that caused a verification failure with `-XX:VerifyIterativeGVN=1110`. > > ### Context > During the compilation of the input program (obtained from the fuzzer, then simplified and added as a test) by C2, we end up with node `591 ModI` that takes `138 Phi` as its divisor input. An existing `Ideal` optimization is to get rid of the control input of a `ModINode` when we can prove that the divisor is never `0`. > > In this specific case, the type of the `PhiNode` gets refined during CCP, but the refinement fails to propagate to its users for the IGVN phase and the ideal optimization for the `ModINode` never happens. This results in a missed optimization and hits an assert in the verification phase of IGVN (when using `-XX:VerifyIterativeGVN=1110`). > > ![IGV screenshot](https://github.com/user-attachments/assets/5dee1ae6-9146-4115-922d-df33b7ccbd37) > > ### Detailed Analysis > > In `PhaseCCP::analyze`, we call `Value` for the `PhiNode`, which > results in a type refinement: the range gets restricted to `int:-13957..-1191`. > > ```c++ > // Pull from worklist; compute new value; push changes out. > // This loop is the meat of CCP. > while (worklist.size() != 0) { > Node* n = fetch_next_node(worklist); > DEBUG_ONLY(worklist_verify.push(n);) > if (n->is_SafePoint()) { > // Make sure safepoints are processed by PhaseCCP::transform even if they are > // not reachable from the bottom. Otherwise, infinite loops would be removed. > _root_and_safepoints.push(n); > } > const Type* new_type = n->Value(this); > if (new_type != type(n)) { > DEBUG_ONLY(verify_type(n, new_type, type(n));) > dump_type_and_node(n, new_type); > set_type(n, new_type); > push_child_nodes_to_worklist(worklist, n); > } > if (KillPathsReachableByDeadTypeNode && n->is_Type() && new_type == Type::TOP) { > // Keep track of Type nodes to kill CFG paths that use Type > // nodes that become dead. > _maybe_top_type_nodes.push(n); > } > } > DEBUG_ONLY(verify_analyze(worklist_verify);) > > > At the end of `PhaseCCP::analyze`, we obtain the following types in the side table: > - `int` for node `591` (`ModINode`) > - `int:-13957..-1191` for node `138` (`PhiNode`) > > If we call `find_node(138)->bottom_type()`, we get: > - `int` for both nodes > > There is no progress on the type of `ModINode` during CCP, because `ModINode::Value` > is not able to... Beno?t Maillard has updated the pull request with a new target base due to a merge or a rebase. 
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains eight additional commits since the last revision: - Fix bad test class name - 8359602: rename test - 8359602: remove requires.debug=true and add -XX:+IgnoreUnrecognizedVMOptions flag - 8359602: add comment - 8359602: add test summary and comments - 8359602: tag requires vm.debug == true - 8359602: Add test from fuzzer - 8359602: Add users to IGVN worklist when type is refined in CCP ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26017/files - new: https://git.openjdk.org/jdk/pull/26017/files/005b2825..a66d3fb4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26017&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26017&range=03-04 Stats: 18268 lines in 747 files changed: 7677 ins; 6510 del; 4081 mod Patch: https://git.openjdk.org/jdk/pull/26017.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26017/head:pull/26017 PR: https://git.openjdk.org/jdk/pull/26017 From thartmann at openjdk.org Wed Jul 2 07:19:31 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Jul 2025 07:19:31 GMT Subject: RFR: 8359602: Ideal optimizations depending on input type are missed because of missing notification mechanism from CCP [v4] In-Reply-To: References: <-_MqCH6QmE-o_d7c9-aet-Cq-ptZJ6CZV6rodpDNWq0=.173e6f7a-3cfe-4791-8253-36e06d892069@github.com> Message-ID: On Tue, 1 Jul 2025 12:58:29 GMT, Beno?t Maillard wrote: >> This PR prevents some missed ideal optimizations in IGVN by notifying users of type refinements made during CCP, addressing a missed optimization that caused a verification failure with `-XX:VerifyIterativeGVN=1110`. >> >> ### Context >> During the compilation of the input program (obtained from the fuzzer, then simplified and added as a test) by C2, we end up with node `591 ModI` that takes `138 Phi` as its divisor input. An existing `Ideal` optimization is to get rid of the control input of a `ModINode` when we can prove that the divisor is never `0`. >> >> In this specific case, the type of the `PhiNode` gets refined during CCP, but the refinement fails to propagate to its users for the IGVN phase and the ideal optimization for the `ModINode` never happens. This results in a missed optimization and hits an assert in the verification phase of IGVN (when using `-XX:VerifyIterativeGVN=1110`). >> >> ![IGV screenshot](https://github.com/user-attachments/assets/5dee1ae6-9146-4115-922d-df33b7ccbd37) >> >> ### Detailed Analysis >> >> In `PhaseCCP::analyze`, we call `Value` for the `PhiNode`, which >> results in a type refinement: the range gets restricted to `int:-13957..-1191`. >> >> ```c++ >> // Pull from worklist; compute new value; push changes out. >> // This loop is the meat of CCP. >> while (worklist.size() != 0) { >> Node* n = fetch_next_node(worklist); >> DEBUG_ONLY(worklist_verify.push(n);) >> if (n->is_SafePoint()) { >> // Make sure safepoints are processed by PhaseCCP::transform even if they are >> // not reachable from the bottom. Otherwise, infinite loops would be removed. >> _root_and_safepoints.push(n); >> } >> const Type* new_type = n->Value(this); >> if (new_type != type(n)) { >> DEBUG_ONLY(verify_type(n, new_type, type(n));) >> dump_type_and_node(n, new_type); >> set_type(n, new_type); >> push_child_nodes_to_worklist(worklist, n); >> } >> if (KillPathsReachableByDeadTypeNode && n->is_Type() && new_type == Type::TOP) { >> // Keep track of Type nodes to kill CFG paths that use Type >> // nodes that become dead. 
>> _maybe_top_type_nodes.push(n); >> } >> } >> DEBUG_ONLY(verify_analyze(worklist_verify);) >> >> >> At the end of `PhaseCCP::analyze`, we obtain the following types in the side table: >> - `int` for node `591` (`ModINode`) >> - `int:-13957..-1191` for node `138` (`PhiNode`) >> >> If we call `find_node(138)->bottom_type()`, we get: >> - `int` for both nodes >> >> The... > > Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: > > 8359602: update case for consistency > > Co-authored-by: Emanuel Peter Still good, thanks for making these changes. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26017#pullrequestreview-2978014920 From thartmann at openjdk.org Wed Jul 2 07:20:41 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Jul 2025 07:20:41 GMT Subject: RFR: 8360557: CTW: Inline cold methods to reach more code In-Reply-To: References: Message-ID: <-7cfzVghCWnUCfB1F3dcyG2fvJGnqREUW98qiVJEvQQ=.db06fb1e-e96e-4e00-bac0-098b4e1de54c@github.com> On Tue, 1 Jul 2025 12:26:44 GMT, Aleksey Shipilev wrote: > We use CTW testing for making sure compilers behave well. But we compile the code that is not executed at all, and since our inlining heuristics often looks back at profiles, we end up not actually inlining all too much! This means CTW testing likely misses lots of bugs that normal code is exposed to, especially e.g. in loop optimizations. > > There is an intrinsic tradeoff with accepting more inilned methods in CTW: the compilation time gets significantly worse. With just accepting the cold methods we have reasonable CTW times, eating the improvements we have committed in mainline recently. And it still finds bugs. See the RFE for sample data. > > After this lands and CTW starts to compile cold methods, one can greatly expand the scope of the CTW testing by overriding the static inlining limits. Doing e.g. `TEST_VM_OPTS="-XX:MaxInlineSize=70 -XX:C1MaxInlineSize=70"` finds even more bugs. Unfortunately, the compilation times suffer so much, they are impractical to run in standard configurations, see data in RFE. We will enable some of that testing in special testing pipelines. > > Pre-empting the question: "Well, why not use -Xcomp then, and make sure it inlines well?" The answer is in RFE as well: Xcomp causes _a lot_ of stray compilations for JDK and CTW infra itself. For small JARs in large corpus this eats precious testing time that we would instead like to spend on deeper inlining in the actual JAR code. This also does not force us to look into how CTW works in Xcomp at all; I expect some surprises there. Feather-touching the inlining heuristic paths to just accept methods without looking at profiles looks better. > > Tobias had an idea to implement the stress randomized inlining that would expand the scope of inlining. This improvement stacks well with it. This improvement provides the base case of inlining most reasonable methods, and then allow stress infra to inline some more on top of that. > > Additional testing: > - [x] GHA > - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > - [x] Linux x86_64 server fastdebug, large CTW corpus (now failing in interesting ways) I submitted some testing to make sure that CTW is clean in our CI. 
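A usage note from my side (a sketch, not part of the patch): because the new switch is declared as a DIAGNOSTIC flag (see the compiler_globals.hpp hunk quoted below), exercising it in a CTW run would presumably look something like `make test TEST="applications/ctw/modules" TEST_VM_OPTS="-XX:+UnlockDiagnosticVMOptions -XX:+InlineColdMethods"`, optionally combined with the `MaxInlineSize`/`C1MaxInlineSize` overrides mentioned in the description.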
src/hotspot/share/compiler/compiler_globals.hpp line 400: > 398: product(bool, InlineColdMethods, false, DIAGNOSTIC, \ > 399: "Inline methods cold methods that would otherwise rejected " \ > 400: "based on profile information. Only useful for compiler testing.")\ Suggestion: "Inline cold methods that would otherwise be rejected based" \ "on profile information. Only useful for compiler testing.") \ ------------- PR Comment: https://git.openjdk.org/jdk/pull/26068#issuecomment-3026732006 PR Review Comment: https://git.openjdk.org/jdk/pull/26068#discussion_r2179310625 From eastigeevich at openjdk.org Wed Jul 2 07:40:40 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Wed, 2 Jul 2025 07:40:40 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v2] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 16:05:07 GMT, Evgeny Astigeevich wrote: >> Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. Release builds might not generate needed debug info. >> >> This PR adds a requirement for the test to be run on debug builds only. >> >> Tested: >> - Fastdebug: test passed >> - Slowdebug: test passed. >> - Release: test skipped. > > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Simplify requirement for debug build I finished bisecting. This is my changes in the test which made it failing: @@ -56,7 +58,6 @@ public static void main(String[] args) throws Exception { command.add("-showversion"); command.add("-XX:-BackgroundCompilation"); command.add("-XX:+UnlockDiagnosticVMOptions"); - command.add("-XX:+PrintAssembly"); if (compiler.equals("c2")) { command.add("-XX:-TieredCompilation"); } else if (compiler.equals("c1")) { @@ -69,13 +70,17 @@ public static void main(String[] args) throws Exception { command.add("-XX:OnSpinWaitInst=" + spinWaitInst); command.add("-XX:OnSpinWaitInstCount=" + spinWaitInstCount); command.add("-XX:CompileCommand=compileonly," + Launcher.class.getName() + "::" + "test"); + command.add("-XX:CompileCommand=print," + Launcher.class.getName() + "::" + "test"); command.add(Launcher.class.getName()); It looks like `XX:+PrintAssembly` prints out debug info in release builds but `XX:CompileCommand=print` does not. I am switching back to `XX:+PrintAssembly`. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3026790161 From thartmann at openjdk.org Wed Jul 2 07:43:47 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Jul 2025 07:43:47 GMT Subject: RFR: 8359602: Ideal optimizations depending on input type are missed because of missing notification mechanism from CCP [v4] In-Reply-To: References: <-_MqCH6QmE-o_d7c9-aet-Cq-ptZJ6CZV6rodpDNWq0=.173e6f7a-3cfe-4791-8253-36e06d892069@github.com> Message-ID: <6XXAMA5_Jq8NxpK0TOTAJWkYhDXIo4Wrnz_0X32SkqQ=.b9e29a9c-fe36-4c04-88bc-d276a66fd711@github.com> On Wed, 2 Jul 2025 07:14:34 GMT, Tobias Hartmann wrote: >> Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: >> >> 8359602: update case for consistency >> >> Co-authored-by: Emanuel Peter > > Still good, thanks for making these changes. > @TobiHartmann how much should he invest in this now? An alternative is just tackling all the other cases later. What do you think? Yes, agreed. Let's handle this later. 
(Sorry, somehow I thought I had replied to this already - must have missed pressing the Comment button..) ------------- PR Comment: https://git.openjdk.org/jdk/pull/26017#issuecomment-3026800468 From thartmann at openjdk.org Wed Jul 2 07:50:50 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Jul 2025 07:50:50 GMT Subject: RFR: 8359120: Improve warning message when fail to load hsdis library [v2] In-Reply-To: <-i6UPk-bhy9RnnCus_JbJ1nQ63nMX9djubON9WBbHQ8=.a2305566-563e-4171-b526-bcd645de51a3@github.com> References: <-i6UPk-bhy9RnnCus_JbJ1nQ63nMX9djubON9WBbHQ8=.a2305566-563e-4171-b526-bcd645de51a3@github.com> Message-ID: On Mon, 23 Jun 2025 08:56:12 GMT, Taizo Kurashige wrote: >> This PR is improvement of warning message when fail to load hsdis library. >> >> [JDK-8287001](https://bugs.openjdk.org/browse/JDK-8287001) introduced a warning on hsdis library load failure. This is useful when the user executes -XX:+PrintAssembly, etc. >> >> However, I think that when hs_err occurs, users might be confused by this warning printed by Xlog. Because users are not likely to know that hsdis is loaded for the [MachCode] section of the hs_err report, they may wonder, for example, "Why do I get warnings about hsdis load errors when -XX:+PrintAssembly is not specified?." >> >> To clear up this confusion, I suggest printing a warning just before [MachCode]. >> >>
>> >> sample output >> >> If hs_err occurs and hsdis load fails without the option to specify where the hs_err report should be output, the following is output to the hs_err_pir log file: >> >> . >> . >> native method entry point (kind = native) [0x000001ae8753cec0, 0x000001ae8753dac0] 3072 bytes >> >> Loading hsdis library failed, so undisassembled code is printed in the below [MachCode] section >> [MachCode] >> 0x000001ae8753cec0: 488b 4b08 | 0fb7 492e | 584c 8d74 | ccf8 6800 | 0000 0068 | 0000 0000 | 5055 488b | ec41 5548 >> 0x000001ae8753cee0: 8b43 084c | 8d68 3848 | 8b40 0868 | 0000 0000 | 5348 8b50 | 18 >> . >> . >> >> >> If -XX:+PrintAssembly is specified and hsdis load fails, the following is output to the stdout >> >> $ java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -version >> OpenJDK 64-Bit Server VM warning: PrintAssembly is enabled; turning on DebugNonSafepoints to gain additional output >> >> ============================= C1-compiled nmethod ============================== >> ----------------------------------- Assembly ----------------------------------- >> >> Compiled method (c1) 57 2 3 java.lang.Object:: (1 bytes) >> total in heap [0x0000024a08a00008,0x0000024a08a00208] = 512 >> . >> . >> >> [Constant Pool (empty)] >> >> >> Loading hsdis library failed, so undisassembled code is printed in the below [MachCode] section >> [MachCode] >> [Instructions begin] >> 0x0000024a08a00100: 6666 660f | 1f84 0000 | 0000 0066 | 6666 9066 | 6690 448b | 5208 443b >> . >> . >> [Constant Pool (empty)] >> >> >> Loading hsdis library failed, so undisassembled code is printed in the below [MachCode] section >> [MachCode] >> [Verified Entry Point] >> # {method} {0x00000000251a1898} 'toUnsignedInt' '(B)I' in 'java/lang/Byte >> . >> . >> >> >>
>> >> Since... > > Taizo Kurashige has updated the pull request incrementally with one additional commit since the last revision: > > Fix message and revert lines for Xlog Marked as reviewed by thartmann (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/25726#pullrequestreview-2978107826 From thartmann at openjdk.org Wed Jul 2 07:50:52 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Jul 2025 07:50:52 GMT Subject: RFR: 8359120: Improve warning message when fail to load hsdis library [v2] In-Reply-To: References: <-i6UPk-bhy9RnnCus_JbJ1nQ63nMX9djubON9WBbHQ8=.a2305566-563e-4171-b526-bcd645de51a3@github.com> <7UkSbnceEz4PY3UDwyR9iOseuvS4sD8FBBGl96mG_lk=.e94b4126-9df5-406b-a3f3-b21439d848e6@github.com> Message-ID: On Mon, 30 Jun 2025 11:08:45 GMT, Taizo Kurashige wrote: > but since nullptr is passed at [src/hotspot/share/compiler/disassembler.hpp#L66](https://github.com/openjdk/jdk/blob/c2d76f9844aadf77a0b213a9169a7c5c8c8f1ffb/src/hotspot/share/compiler/disassembler.hpp#L66), that reporting doesn't actually work. Right, it will be set to `tty` when Verbose is true: https://github.com/openjdk/jdk/blob/c2d76f9844aadf77a0b213a9169a7c5c8c8f1ffb/src/hotspot/share/compiler/disassembler.cpp#L780 Thanks for the additional details of why you decided to not use that code. I'm fine with these changes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25726#issuecomment-3026818978 From aph at openjdk.org Wed Jul 2 08:05:44 2025 From: aph at openjdk.org (Andrew Haley) Date: Wed, 2 Jul 2025 08:05:44 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v2] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 07:37:40 GMT, Evgeny Astigeevich wrote: > > It looks like `XX:+PrintAssembly` prints out debug info in release builds but `XX:CompileCommand=print` does not. I am switching back to `XX:+PrintAssembly`. That's not great. What info do you need, exactly? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3026870108 From bkilambi at openjdk.org Wed Jul 2 08:10:24 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 2 Jul 2025 08:10:24 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v9] In-Reply-To: References: Message-ID: > This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. > > It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. > > For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. > > For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. > > This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. 
> > Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - > > > Benchmark (size) Mode Cnt Gain > SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 > SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 > SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 > SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 > SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 > SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 > SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 > SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 > SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 > SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 > > > Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. Bhavana Kilambi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains nine commits: - Merge master - code style issues fixed - Addressed review comments - Addressed review comments - Revert a small change in c2_MacroAssembler.hpp - Addressed review comments - Addressed review comments and added a JTREG test - Merge master - 8348868: AArch64: Add backend support for SelectFromTwoVector This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. For 64-bit vector length : Neon tbl instruction is generated for T_SHORT and T_BYTE types only. For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - Benchmark (size) Mode Cnt Gain SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. 
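For readers less familiar with the API being intrinsified here, a minimal Java sketch of a two-vector selectFrom call, i.e. the operation that should now map onto the tbl-based path on 128-bit NEON/SVE2. The class name and sample values are my own, and it assumes the two-vector selectFrom overload from the incubating jdk.incubator.vector module (compile and run with --add-modules jdk.incubator.vector):

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

public class SelectFromTwoVectorExample {
    static final VectorSpecies<Integer> S = IntVector.SPECIES_128;

    public static void main(String[] args) {
        IntVector a   = IntVector.fromArray(S, new int[] {10, 11, 12, 13}, 0);
        IntVector b   = IntVector.fromArray(S, new int[] {20, 21, 22, 23}, 0);
        // Index lanes in [0, VLENGTH) pick from a; lanes in [VLENGTH, 2*VLENGTH) pick from b.
        IntVector idx = IntVector.fromArray(S, new int[] {0, 5, 2, 7}, 0);
        IntVector r   = idx.selectFrom(a, b);
        System.out.println(r); // [10, 21, 12, 23]
    }
}
```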
------------- Changes: https://git.openjdk.org/jdk/pull/23570/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=08 Stats: 987 lines in 11 files changed: 952 ins; 0 del; 35 mod Patch: https://git.openjdk.org/jdk/pull/23570.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23570/head:pull/23570 PR: https://git.openjdk.org/jdk/pull/23570 From epeter at openjdk.org Wed Jul 2 08:11:44 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 2 Jul 2025 08:11:44 GMT Subject: [jdk25] RFR: 8358179: Performance regression in Math.cbrt In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 05:30:39 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [38f59f84](https://github.com/openjdk/jdk/commit/38f59f84c98dfd974eec0c05541b2138b149def7) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Mohamed Issa on 1 Jul 2025 and was reviewed by Sandhya Viswanathan, Srinivas Vamsi Parasa and Emanuel Peter. > > Thanks! LGTM Marked as reviewed by epeter (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26085#pullrequestreview-2978172863 PR Review: https://git.openjdk.org/jdk/pull/26085#pullrequestreview-2978173371 From aph at openjdk.org Wed Jul 2 08:18:44 2025 From: aph at openjdk.org (Andrew Haley) Date: Wed, 2 Jul 2025 08:18:44 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v2] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 02:39:33 GMT, Xiaohong Gong wrote: >> ### Background >> On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. >> >> For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. >> >> To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. >> >> ### Impact Analysis >> #### 1. Vector types >> Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. >> >> #### 2. Vector API >> No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. >> >> #### 3. Auto-vectorization >> Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. >> >> #### 4. Codegen of vector nodes >> NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. >> >> Details: >> - Lanewise vector operations are unaffected as explained above. 
>> - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). >> - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, addin... > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Refine comments based on review suggestion src/hotspot/cpu/aarch64/aarch64.ad line 2367: > 2365: // Theoretically, the minimal vector length supported by AArch64 > 2366: // ISA and Vector API species is 64-bit. However, 32-bit or 16-bit > 2367: // vector length is also allowed for special Vector API usages. Suggestion: // Usually, the shortest vector length supported by AArch64 // ISA and Vector API species is 64 bits. However, we allow // 32-bit or 16-bit vectors in a few special cases. Reason for change: it wasn't clear what "supported" meant. Supported by the hardware, or by HotSpot. And why do we only support it in a few special cases? This comment raises more questions than it answers. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26057#discussion_r2179423549 From thartmann at openjdk.org Wed Jul 2 08:25:45 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Jul 2025 08:25:45 GMT Subject: [jdk25] RFR: 8358179: Performance regression in Math.cbrt In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 05:30:39 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [38f59f84](https://github.com/openjdk/jdk/commit/38f59f84c98dfd974eec0c05541b2138b149def7) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Mohamed Issa on 1 Jul 2025 and was reviewed by Sandhya Viswanathan, Srinivas Vamsi Parasa and Emanuel Peter. > > Thanks! Thanks for the review Emanuel! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26085#issuecomment-3026926520 From thartmann at openjdk.org Wed Jul 2 08:25:46 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Jul 2025 08:25:46 GMT Subject: [jdk25] Integrated: 8358179: Performance regression in Math.cbrt In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 05:30:39 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [38f59f84](https://github.com/openjdk/jdk/commit/38f59f84c98dfd974eec0c05541b2138b149def7) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Mohamed Issa on 1 Jul 2025 and was reviewed by Sandhya Viswanathan, Srinivas Vamsi Parasa and Emanuel Peter. > > Thanks! This pull request has now been integrated. 
Changeset: 0a151c68 Author: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/0a151c68d6529f3a1d3a44fbccc42b67a60b25d9 Stats: 50 lines in 1 file changed: 11 ins; 36 del; 3 mod 8358179: Performance regression in Math.cbrt Reviewed-by: epeter Backport-of: 38f59f84c98dfd974eec0c05541b2138b149def7 ------------- PR: https://git.openjdk.org/jdk/pull/26085 From bkilambi at openjdk.org Wed Jul 2 08:26:00 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 2 Jul 2025 08:26:00 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v10] In-Reply-To: References: Message-ID: > This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. > > It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. > > For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. > > For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. > > This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. > > Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - > > > Benchmark (size) Mode Cnt Gain > SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 > SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 > SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 > SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 > SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 > SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 > SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 > SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 > SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 > SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 > > > Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. 
Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: Addressed review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23570/files - new: https://git.openjdk.org/jdk/pull/23570/files/80a1f67f..e86d55df Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=08-09 Stats: 36 lines in 6 files changed: 13 ins; 0 del; 23 mod Patch: https://git.openjdk.org/jdk/pull/23570.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23570/head:pull/23570 PR: https://git.openjdk.org/jdk/pull/23570 From bkilambi at openjdk.org Wed Jul 2 08:26:01 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 2 Jul 2025 08:26:01 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v8] In-Reply-To: References: Message-ID: On Fri, 27 Jun 2025 15:21:28 GMT, Andrew Haley wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> code style issues fixed > > src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 4231: > >> 4229: >> 4230: // SVE/SVE2 Programmable table lookup in one or two vector table (zeroing) >> 4231: void sve_tbl(FloatRegister Zd, SIMD_RegVariant T, FloatRegister Zn, unsigned reg_count, FloatRegister Zm) { > > [Edited] > > This would be better: > > private: > void _sve_tbl(FloatRegister Zd, SIMD_RegVariant T, FloatRegister Zn, unsigned reg_count, FloatRegister Zm) { > > > ... then 2 patterns ... > > > public: > void sve_tbl(FloatRegister Zd, SIMD_RegVariant T, FloatRegister Zn1, FloatRegister Zn2, FloatRegister Zm); > void sve_tbl(FloatRegister Zd, SIMD_RegVariant T, FloatRegister Zn, FloatRegister Zm); > > > ... and make sure that `Zn1+ 1 == Zn2` Done. Please review the latest patch. Thanks! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2179438846 From epeter at openjdk.org Wed Jul 2 08:26:46 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 2 Jul 2025 08:26:46 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: <7GrGfBF_v8F0v02sRHC78ofMZwpMdzQZaHeYlNvi_N0=.93defb9e-ca9b-41b4-8722-1746692e2316@github.com> References: <7GrGfBF_v8F0v02sRHC78ofMZwpMdzQZaHeYlNvi_N0=.93defb9e-ca9b-41b4-8722-1746692e2316@github.com> Message-ID: On Wed, 2 Jul 2025 01:52:19 GMT, Xiaohong Gong wrote: >> Agree with Paul, these are minor regressions. Let us proceed with this patch. > >> Agree with Paul, these are minor regressions. Let us proceed with this patch. > > Thanks so much for your review @sviswa7 ! @XiaohongGong I quickly scanned the patch, it looks good to me too. I'm submitting some internal testing now, to make sure our extended testing does not break on integration. Should take about 24h. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3026931008 From shade at openjdk.org Wed Jul 2 08:27:24 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 2 Jul 2025 08:27:24 GMT Subject: RFR: 8360557: CTW: Inline cold methods to reach more code [v2] In-Reply-To: References: Message-ID: <5znMFGgSuss2iAJ3cUBnmIKrfniGHx5W6CpY3TpNO_8=.0148fb6b-206a-4b57-8886-db80d606b18f@github.com> > We use CTW testing for making sure compilers behave well. But we compile the code that is not executed at all, and since our inlining heuristics often looks back at profiles, we end up not actually inlining all too much! 
This means CTW testing likely misses lots of bugs that normal code is exposed to, especially e.g. in loop optimizations. > > There is an intrinsic tradeoff with accepting more inilned methods in CTW: the compilation time gets significantly worse. With just accepting the cold methods we have reasonable CTW times, eating the improvements we have committed in mainline recently. And it still finds bugs. See the RFE for sample data. > > After this lands and CTW starts to compile cold methods, one can greatly expand the scope of the CTW testing by overriding the static inlining limits. Doing e.g. `TEST_VM_OPTS="-XX:MaxInlineSize=70 -XX:C1MaxInlineSize=70"` finds even more bugs. Unfortunately, the compilation times suffer so much, they are impractical to run in standard configurations, see data in RFE. We will enable some of that testing in special testing pipelines. > > Pre-empting the question: "Well, why not use -Xcomp then, and make sure it inlines well?" The answer is in RFE as well: Xcomp causes _a lot_ of stray compilations for JDK and CTW infra itself. For small JARs in large corpus this eats precious testing time that we would instead like to spend on deeper inlining in the actual JAR code. This also does not force us to look into how CTW works in Xcomp at all; I expect some surprises there. Feather-touching the inlining heuristic paths to just accept methods without looking at profiles looks better. > > Tobias had an idea to implement the stress randomized inlining that would expand the scope of inlining. This improvement stacks well with it. This improvement provides the base case of inlining most reasonable methods, and then allow stress infra to inline some more on top of that. > > Additional testing: > - [x] GHA > - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > - [x] Linux x86_64 server fastdebug, large CTW corpus (now failing in interesting ways) Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/share/compiler/compiler_globals.hpp Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26068/files - new: https://git.openjdk.org/jdk/pull/26068/files/b16cbabb..dedbcfed Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26068&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26068&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26068.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26068/head:pull/26068 PR: https://git.openjdk.org/jdk/pull/26068 From mhaessig at openjdk.org Wed Jul 2 08:35:51 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Wed, 2 Jul 2025 08:35:51 GMT Subject: RFR: 8360116: Add support for AVX10 floating point minmax instruction [v6] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 01:57:46 GMT, Jatin Bhateja wrote: >> Intel@ AVX10 ISA [1] extensions added new floating point MIN/MAX instructions which comply with definitions in IEEE-754-2019 standard section 9.6 and can directly emulate Math.min/max semantics without the need for any special handling for NaN, +0.0 or -0.0 detection. >> >> **The following pseudo-code describes the existing algorithm for min/max[FD]:** >> >> Move the non-negative value to the second operand; this will ensure that we correctly handle 0.0 and -0.0 values, if values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is returned. 
Existing MINPS and MAXPS semantics only check for NaN as the second operand; hence, we need special handling to check for NaN at the first operand. >> >> btmp = (b < +0.0) ? a : b >> atmp = (b < +0.0) ? b : a >> Tmp = Max_Float(atmp , btmp) >> Res = (atmp == NaN) ? atmp : Tmp >> >> For min[FD] we need a small tweak in the above algorithm, i.e., move the non-negative value to the first operand, this will ensure that we correctly select -0.0 if both the operands being compared are 0.0 or -0.0. >> >> btmp = (b < +0.0) ? b : a >> atmp = (b < +0.0) ? a : b >> Tmp = Max_Float(atmp , btmp) >> Res = (atmp == NaN) ? atmp : Tmp >> >> Thus, we need additional special handling for NaNs and +/-0.0 to compute floating-point min/max values to comply with the semantics of Math.max/min APIs using existing MINPS / MAXPS instructions. AVX10.2 added a new instruction, VPMINMAX[SH,SS,SD]/[PH,PS,PD], which comprehensively handles special cases, thereby eliminating the need for special handling. >> >> Patch emits new instructions for reduction and non-reduction operations for single, double, and Float16 type. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin >> >> [1] https://www.intel.com/content/www/us/en/content-details/856721/intel-advanced-vector-extensions-10-2-intel-avx10-2-architecture-specification.html?wapkw=AVX10 > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Sandhya's review comments resolution Marked as reviewed by mhaessig (Committer). ------------- PR Review: https://git.openjdk.org/jdk/pull/25914#pullrequestreview-2978248768 From mhaessig at openjdk.org Wed Jul 2 08:38:47 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Wed, 2 Jul 2025 08:38:47 GMT Subject: RFR: 8360641: TestCompilerCounts fails after 8354727 [v4] In-Reply-To: References: <3mMrDF_446r7HudbsHIpdoWByBlnUpjFo7YzIty0KG8=.facc058f-3975-44c4-b2d4-93b8c64db185@github.com> Message-ID: <2Er7Cp5ry6llaeyDvSv7Tg0hIOvS9AOzrJM0zfIW1JM=.edce3d10-ad95-4c03-80e0-0e985ba692ab@github.com> On Tue, 1 Jul 2025 06:52:32 GMT, Manuel H?ssig wrote: >> After integrating #25872 the calculation of the`CICompilerCount` ergonomic became dependent on the size of `NonNMethodCodeHeapSize`, which itself is an ergonomic based on the available memory. Thus, depending on the system, the test `compiler/arguments/TestCompilerCounts.java` failed, i.e. locally this failed, but not on CI servers. >> >> This PR changes the test to reflect the changes introduced in #25872. >> >> Testing: >> - [ ] [Github Actions](https://github.com/mhaessig/jdk/actions/runs/15932906313) >> - [x] tier1,tier2 plus Oracle internal testing > > Manuel H?ssig has updated the pull request incrementally with one additional commit since the last revision: > > Fix whitespace > > Co-authored-by: Andrey Turbanov Thank you for your reviews! 
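Stepping back to the AVX10 min/max thread above: the special values the intrinsic has to preserve are simply the ordinary Math.min/max semantics, which a plain Java snippet (my own illustration, not from the patch) makes concrete:

```java
public class MinMaxSpecialValues {
    public static void main(String[] args) {
        // -0.0 is ordered below +0.0, and NaN propagates through both min and max.
        System.out.println(Math.max(0.0, -0.0));        // 0.0
        System.out.println(Math.min(0.0, -0.0));        // -0.0
        System.out.println(Math.max(Double.NaN, 1.0));  // NaN
        System.out.println(Math.min(1.0, Double.NaN));  // NaN
    }
}
```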
------------- PR Comment: https://git.openjdk.org/jdk/pull/26024#issuecomment-3026962659 From mhaessig at openjdk.org Wed Jul 2 08:38:47 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Wed, 2 Jul 2025 08:38:47 GMT Subject: Integrated: 8360641: TestCompilerCounts fails after 8354727 In-Reply-To: <3mMrDF_446r7HudbsHIpdoWByBlnUpjFo7YzIty0KG8=.facc058f-3975-44c4-b2d4-93b8c64db185@github.com> References: <3mMrDF_446r7HudbsHIpdoWByBlnUpjFo7YzIty0KG8=.facc058f-3975-44c4-b2d4-93b8c64db185@github.com> Message-ID: On Fri, 27 Jun 2025 18:09:23 GMT, Manuel H?ssig wrote: > After integrating #25872 the calculation of the`CICompilerCount` ergonomic became dependent on the size of `NonNMethodCodeHeapSize`, which itself is an ergonomic based on the available memory. Thus, depending on the system, the test `compiler/arguments/TestCompilerCounts.java` failed, i.e. locally this failed, but not on CI servers. > > This PR changes the test to reflect the changes introduced in #25872. > > Testing: > - [ ] [Github Actions](https://github.com/mhaessig/jdk/actions/runs/15932906313) > - [x] tier1,tier2 plus Oracle internal testing This pull request has now been integrated. Changeset: 2304044a Author: Manuel H?ssig URL: https://git.openjdk.org/jdk/commit/2304044ab2f228fe2fe4adb5975291e733b12d5c Stats: 49 lines in 1 file changed: 34 ins; 1 del; 14 mod 8360641: TestCompilerCounts fails after 8354727 Reviewed-by: kvn, dfenacci, mdoerr ------------- PR: https://git.openjdk.org/jdk/pull/26024 From snatarajan at openjdk.org Wed Jul 2 08:40:56 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Wed, 2 Jul 2025 08:40:56 GMT Subject: RFR: 8325478: Restructure the macro expansion compiler phase to not include macro elimination [v8] In-Reply-To: <4W6QHi3F3RN-JYfYAKUATR_xCUnOiUR0vT73ndqNZtk=.0e193c07-cad0-4cbd-86f2-1758a8c8bac9@github.com> References: <4W6QHi3F3RN-JYfYAKUATR_xCUnOiUR0vT73ndqNZtk=.0e193c07-cad0-4cbd-86f2-1758a8c8bac9@github.com> Message-ID: On Tue, 1 Jul 2025 16:28:27 GMT, Saranya Natarajan wrote: >> This changeset restructures the macro expansion phase to not include macro elimination and also adds a flag StressMacroElimination which randomizes macro elimination ordering for stress testing purposes. >> >> Changes: >> - Implemented a method `eliminate_opaque_looplimit_macro_nodes` that removes the functionality for eliminating Opaque and LoopLimit nodes from the `expand_macro_nodes ` method. >> - Introduced compiler phases` PHASE_AFTER_MACRO_ELIMINATION` >> - Added a new Ideal phase for individual macro elimination steps. >> - Implemented the flag `StressMacroElimination`. Added functionality tests for `StressMacroElimination`, similar to previous stress flag `StressMacroExpansion` ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)). >> >> Below is a sample screenshot (IGV print level 4 ) mainly showing the new phase . >> ![image](https://github.com/user-attachments/assets/16013cd4-6ec6-4939-ac66-33bb03d59cd6) >> >> Questions to reviewers: >> - Is the new macro elimination phase OK, or should we change anything? >> - In `compile.cpp `, `PHASE_ITER_GVN_AFTER_ELIMINATION` follows `PHASE_AFTER_MACRO_ELIMINATION` in the current fix. Should `PHASE_ITER_GVN_AFTER_ELIMINATION` be removed ? >> >> Testing: >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. 
>> Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > review comments fix part 1 Thanks for the reviews everyone. Please sponsor ------------- PR Comment: https://git.openjdk.org/jdk/pull/25682#issuecomment-3026967089 From snatarajan at openjdk.org Wed Jul 2 08:40:57 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Wed, 2 Jul 2025 08:40:57 GMT Subject: Integrated: 8325478: Restructure the macro expansion compiler phase to not include macro elimination In-Reply-To: References: Message-ID: On Fri, 6 Jun 2025 22:40:34 GMT, Saranya Natarajan wrote: > This changeset restructures the macro expansion phase to not include macro elimination and also adds a flag StressMacroElimination which randomizes macro elimination ordering for stress testing purposes. > > Changes: > - Implemented a method `eliminate_opaque_looplimit_macro_nodes` that removes the functionality for eliminating Opaque and LoopLimit nodes from the `expand_macro_nodes ` method. > - Introduced compiler phases` PHASE_AFTER_MACRO_ELIMINATION` > - Added a new Ideal phase for individual macro elimination steps. > - Implemented the flag `StressMacroElimination`. Added functionality tests for `StressMacroElimination`, similar to previous stress flag `StressMacroExpansion` ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)). > > Below is a sample screenshot (IGV print level 4 ) mainly showing the new phase . > ![image](https://github.com/user-attachments/assets/16013cd4-6ec6-4939-ac66-33bb03d59cd6) > > Questions to reviewers: > - Is the new macro elimination phase OK, or should we change anything? > - In `compile.cpp `, `PHASE_ITER_GVN_AFTER_ELIMINATION` follows `PHASE_AFTER_MACRO_ELIMINATION` in the current fix. Should `PHASE_ITER_GVN_AFTER_ELIMINATION` be removed ? > > Testing: > GitHub Actions > tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. > Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) This pull request has now been integrated. Changeset: eac8f5d2 Author: Saranya Natarajan Committer: Daniel Lund?n URL: https://git.openjdk.org/jdk/commit/eac8f5d2c99e1bcc526da0f6a05af76e815c2db9 Stats: 77 lines in 11 files changed: 54 ins; 8 del; 15 mod 8325478: Restructure the macro expansion compiler phase to not include macro elimination Reviewed-by: kvn, dlunden ------------- PR: https://git.openjdk.org/jdk/pull/25682 From eastigeevich at openjdk.org Wed Jul 2 08:47:44 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Wed, 2 Jul 2025 08:47:44 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v2] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 16:05:07 GMT, Evgeny Astigeevich wrote: >> Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. Release builds might not generate needed debug info. >> >> This PR adds a requirement for the test to be run on debug builds only. >> >> Tested: >> - Fastdebug: test passed >> - Slowdebug: test passed. >> - Release: test skipped. 
> > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Simplify requirement for debug build > OK, are you able to bisect which change? This fix to only do debug VM needs to be correctly linked to the actual cause, IMO. > > It looks like `XX:+PrintAssembly` prints out debug info in release builds but `XX:CompileCommand=print` does not. I am switching back to `XX:+PrintAssembly`. > > That's not great. What info do you need, exactly? # {method} {0x0000ffff50400378} 'test' '()V' in 'compiler/onSpinWait/TestOnSpinWaitAArch64$Launcher' # [sp+0x20] (sp of caller) 0x0000ffff985731c0: ff83 00d1 | fd7b 01a9 | 2803 0018 | 8923 40b9 | 1f01 09eb 0x0000ffff985731d4: ;*synchronization entry ; - compiler.onSpinWait.TestOnSpinWaitAArch64$Launcher::test at -1 (line 224) 0x0000ffff985731d4: 2102 0054 | 1f20 03d5 | 1f20 03d5 | 1f20 03d5 | 1f20 03d5 | 1f20 03d5 | 1f20 03d5 0x0000ffff985731f0: ;*invokestatic onSpinWait {reexecute=0 rethrow=0 return_oop=0} ; - compiler.onSpinWait.TestOnSpinWaitAArch64$Launcher::test at 0 (line 224) 0x0000ffff985731f0: 1f20 03d5 | fd7b 41a9 | ff83 0091 0x0000ffff985731fc: ; {poll_return} 0x0000ffff985731fc: 8817 40f9 | ff63 28eb | 4800 0054 | c003 5fd6 0x0000ffff9857320c: ; {internal_word} 0x0000ffff9857320c: 88ff ff10 | 88a3 02f9 0x0000ffff98573214: ; {runtime_call SafepointBlob} 0x0000ffff98573214: 5bc3 fe17 0x0000ffff98573218: ; {runtime_call Stub::method_entry_barrier} 0x0000ffff98573218: 0850 96d2 | 480a b3f2 | e8ff dff2 | 0001 3fd6 | ecff ff17 The test searches for `- compiler.onSpinWait.TestOnSpinWaitAArch64$Launcher::test at 0` and `invokestatic onSpinWait`. They identify the place where to search instructions. Assembly from all builds always has `{poll_return}`. I can use it as a search point. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3026996074 From mablakatov at openjdk.org Wed Jul 2 08:48:59 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Wed, 2 Jul 2025 08:48:59 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v6] In-Reply-To: References: Message-ID: > Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. > > Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still. > > The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reducion micro-benchmarks. 
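For context, the MULLanes benchmarks cited below boil down to a multiply-reduction over a vector species. A minimal Java sketch of that shape (my own approximation, not the actual panama-vector benchmark code; it assumes the incubating jdk.incubator.vector module):

```java
import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class MulReduceExample {
    static final VectorSpecies<Long> S = LongVector.SPECIES_MAX;

    // Folds each vector to a scalar with reduceLanes(MUL); this is the step the
    // SVE specialization accelerates for vectors longer than 128 bits.
    static long mulReduce(long[] a) {
        long r = 1;
        int i = 0;
        for (; i <= a.length - S.length(); i += S.length()) {
            r *= LongVector.fromArray(S, a, i).reduceLanes(VectorOperators.MUL);
        }
        for (; i < a.length; i++) {
            r *= a[i];
        }
        return r;
    }

    public static void main(String[] args) {
        long[] a = new long[1024];
        java.util.Arrays.fill(a, 1L);
        a[0] = 2; a[1] = 3;
        System.out.println(mulReduce(a)); // 6
    }
}
```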
> > Benchmarks results: > > Neoverse-V1 (SVE 256-bit) > > Benchmark (size) Mode master PR Units > ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms > ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms > IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms > LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms > FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms > DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms > > > Fujitsu A64FX (SVE 512-bit): > > Benchmark (size) Mode master PR Units > ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms > ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms > IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms > LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms > FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms > DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: cleanup: update a copyright notice Co-authored-by: Hao Sun ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23181/files - new: https://git.openjdk.org/jdk/pull/23181/files/df09ab65..ebad6dd3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=04-05 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23181.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23181/head:pull/23181 PR: https://git.openjdk.org/jdk/pull/23181 From mablakatov at openjdk.org Wed Jul 2 08:48:59 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Wed, 2 Jul 2025 08:48:59 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v5] In-Reply-To: <2zMCHzKXQ1kBfjcU5Fc8s6fa2W6TTCKpSSjhB0dMdLw=.3c43071b-3982-4e0e-a300-e0547f4fbbec@github.com> References: <2zMCHzKXQ1kBfjcU5Fc8s6fa2W6TTCKpSSjhB0dMdLw=.3c43071b-3982-4e0e-a300-e0547f4fbbec@github.com> Message-ID: On Wed, 2 Jul 2025 03:28:10 GMT, Hao Sun wrote: >> Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: >> >> fixup: remove undefined insts from aarch64-asmtest.py > > test/hotspot/jtreg/compiler/loopopts/superword/TestVectorFPReduction.java line 2: > >> 1: /* >> 2: * Copyright (c) 2025, Arm Limited. All rights reserved. > > `XX, YY,` means this file was created at XX year and the latest update was made at YY year. If `XX=YY`, then use `XX,`. > > Suggestion: > > * Copyright (c) 2024, 2025, Arm Limited. All rights reserved. Thank you for catching this! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2179486265 From aph at openjdk.org Wed Jul 2 08:52:46 2025 From: aph at openjdk.org (Andrew Haley) Date: Wed, 2 Jul 2025 08:52:46 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v2] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 08:45:23 GMT, Evgeny Astigeevich wrote: > ``` > > ``` > > > > > > The test searches for `- compiler.onSpinWait.TestOnSpinWaitAArch64$Launcher::test at 0` and `invokestatic onSpinWait`. They identify the place where to search instructions. That's not great. C2 is free to move stuff around, so it's not certain this test will keep working. If you just want to make sure that the pattern is used, a block_comment() would be more reliable. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3027010064 From xgong at openjdk.org Wed Jul 2 08:59:47 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 2 Jul 2025 08:59:47 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: <7GrGfBF_v8F0v02sRHC78ofMZwpMdzQZaHeYlNvi_N0=.93defb9e-ca9b-41b4-8722-1746692e2316@github.com> References: <7GrGfBF_v8F0v02sRHC78ofMZwpMdzQZaHeYlNvi_N0=.93defb9e-ca9b-41b4-8722-1746692e2316@github.com> Message-ID: On Wed, 2 Jul 2025 01:52:19 GMT, Xiaohong Gong wrote: >> Agree with Paul, these are minor regressions. Let us proceed with this patch. > >> Agree with Paul, these are minor regressions. Let us proceed with this patch. > > Thanks so much for your review @sviswa7 ! > @XiaohongGong I quickly scanned the patch, it looks good to me too. I'm submitting some internal testing now, to make sure our extended testing does not break on integration. Should take about 24h. Good to know that. Thanks so much for your testing @eme64 ! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3027032342 From roland at openjdk.org Wed Jul 2 09:00:30 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 2 Jul 2025 09:00:30 GMT Subject: RFR: 8275202: C2: optimize out more redundant conditions [v6] In-Reply-To: <978cgwy3Nb_x7yU6jZz0f6zhTBZfphstisAkBf1Vktc=.283d06eb-4f79-40cf-b8dd-a9c230e59902@github.com> References: <978cgwy3Nb_x7yU6jZz0f6zhTBZfphstisAkBf1Vktc=.283d06eb-4f79-40cf-b8dd-a9c230e59902@github.com> Message-ID: > This change adds a new loop opts pass to optimize redundant conditions > such as the second one in: > > > if (i < 10) { > if (i < 42) { > > > In the branch of the first if, the type of i can be narrowed down to > [min_jint, 9] which can then be used to constant fold the second > condition. > > The compiler already keeps track of type[n] for every node in the > current compilation unit. That's not sufficient to optimize the > snippet above though because the type of i can only be narrowed in > some sections of the control flow (that is a subset of all > controls). The solution is to build a new table that tracks the type > of n at every control c > > > type'[n, root] = type[n] // initialized from igvn's type table > type'[n, c] = type[n, idom(c)] > > > This pass iterates over the CFG looking for conditions such as: > > > if (i < 10) { > > > that allows narrowing the type of i and updates the type' table > accordingly. > > At a region r: > > > type'[n, r] = meet(type'[n, r->in(1)], type'[n, r->in(2)]...) > > > For a Phi phi at a region r: > > > type'[phi, r] = meet(type'[phi->in(1), r->in(1)], type'[phi->in(2), r->in(2)]...) > > > Once a type is narrowed, uses are enqueued and their types are > computed by calling the Value() methods. If a use's type is narrowed, > it's recorded at c in the type' table. Value() methods retrieve types > from the type table, not the type' table. To address that issue while > leaving Value() methods unchanged, before calling Value() at c, the > type table is updated so: > > > type[n] = type'[n, c] > > > An exception is for Phi::Value which needs to retrieve the type of > nodes are various controls: there, a new type(Node* n, Node* c) > method is used. > > For most n and c, type'[n, c] is likely the same as type[n], the type > recorded in the global igvn table (that is there shouldn't be many > nodes at only a few control for which we can narrow the type down). 
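To make the motivating pattern concrete, here is a tiny Java example of the redundant condition the new pass is meant to fold (my own illustration, not taken from the patch):

```java
public class RedundantCondition {
    // Inside the outer branch the type of i narrows to [min_jint, 9], so the
    // inner "i < 42" test is provably true and can be constant-folded.
    static int test(int i) {
        if (i < 10) {
            if (i < 42) {
                return 1;
            }
            return 2; // dead once the inner condition folds to true
        }
        return 3;
    }

    public static void main(String[] args) {
        System.out.println(test(5) + " " + test(50)); // prints "1 3"
    }
}
```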
As > a consequence, the types'[n, c] table is implemented with: > > - At c, narrowed down types are stored in a GrowableArray. Each entry > records the previous type at idom(c) and the narrowed down type at > c. > > - The GrowableArray of type updates is recorded in a hash table > indexed by c. If there's no update at c, there's no entry in the > hash table. > > This pass operates in 2 steps: > > - it first iterates over the graph looking for conditions that narrow > the types of some nodes and propagate type updates to uses until a > fix point. > > - it transforms the graph so newly found constant nodes are folded. > > > The new pass is run on every loop opts. There are a couple rea... Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 14 commits: - more - more - more - Merge branch 'master' into JDK-8275202 - more - more - more - Merge branch 'master' into JDK-8275202 - review - Update src/hotspot/share/opto/loopConditionalPropagation.cpp Co-authored-by: Roberto Casta?eda Lozano - ... and 4 more: https://git.openjdk.org/jdk/compare/c220b135...9d093971 ------------- Changes: https://git.openjdk.org/jdk/pull/14586/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=14586&range=05 Stats: 4588 lines in 34 files changed: 4483 ins; 40 del; 65 mod Patch: https://git.openjdk.org/jdk/pull/14586.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/14586/head:pull/14586 PR: https://git.openjdk.org/jdk/pull/14586 From roland at openjdk.org Wed Jul 2 09:02:44 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 2 Jul 2025 09:02:44 GMT Subject: RFR: 8275202: C2: optimize out more redundant conditions [v3] In-Reply-To: References: <978cgwy3Nb_x7yU6jZz0f6zhTBZfphstisAkBf1Vktc=.283d06eb-4f79-40cf-b8dd-a9c230e59902@github.com> Message-ID: On Mon, 9 Jun 2025 07:35:10 GMT, Roberto Casta?eda Lozano wrote: > I tested this changeset applied on top of jdk-25+26 (Oracle CI tier1-5) and found the following issues (besides the trivial `NULL` occurrence reported above): I pushed new commits that should address those failures. I added a test case for that one (a tricky issue): > * `assert(c->_idx >= _unique || _type_table->find_type_between(c, c, _phase->C->root()) != Type::TOP) failed: for If we don't follow dead projections` in multiple tests, e.g. `compiler/predicates/TestHoistedPredicateForNonRangeCheck.java` and `compiler/predicates/assertion/TestOpaqueInitializedAssertionPredicateNode.java`. New commits also include some tweaks and cleanup. @robcasloz would you mind running tests again? ------------- PR Comment: https://git.openjdk.org/jdk/pull/14586#issuecomment-3027043159 From xgong at openjdk.org Wed Jul 2 09:02:46 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 2 Jul 2025 09:02:46 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v2] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 08:15:34 GMT, Andrew Haley wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: >> >> Refine comments based on review suggestion > > src/hotspot/cpu/aarch64/aarch64.ad line 2367: > >> 2365: // Theoretically, the minimal vector length supported by AArch64 >> 2366: // ISA and Vector API species is 64-bit. However, 32-bit or 16-bit >> 2367: // vector length is also allowed for special Vector API usages. 
> > Suggestion: > > // Usually, the shortest vector length supported by AArch64 > // ISA and Vector API species is 64 bits. However, we allow > // 32-bit or 16-bit vectors in a few special cases. > > > Reason for change: it wasn't clear what "supported" meant. Supported by the hardware, or by HotSpot. And why do we only support it in a few special cases? This comment raises more questions than it answers. Thanks so much for your suggestion! Looks better to me. I will update soon. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26057#discussion_r2179517582 From tkurashige at openjdk.org Wed Jul 2 09:08:43 2025 From: tkurashige at openjdk.org (Taizo Kurashige) Date: Wed, 2 Jul 2025 09:08:43 GMT Subject: RFR: 8359120: Improve warning message when fail to load hsdis library [v2] In-Reply-To: <-i6UPk-bhy9RnnCus_JbJ1nQ63nMX9djubON9WBbHQ8=.a2305566-563e-4171-b526-bcd645de51a3@github.com> References: <-i6UPk-bhy9RnnCus_JbJ1nQ63nMX9djubON9WBbHQ8=.a2305566-563e-4171-b526-bcd645de51a3@github.com> Message-ID: On Mon, 23 Jun 2025 08:56:12 GMT, Taizo Kurashige wrote: >> This PR is improvement of warning message when fail to load hsdis library. >> >> [JDK-8287001](https://bugs.openjdk.org/browse/JDK-8287001) introduced a warning on hsdis library load failure. This is useful when the user executes -XX:+PrintAssembly, etc. >> >> However, I think that when hs_err occurs, users might be confused by this warning printed by Xlog. Because users are not likely to know that hsdis is loaded for the [MachCode] section of the hs_err report, they may wonder, for example, "Why do I get warnings about hsdis load errors when -XX:+PrintAssembly is not specified?." >> >> To clear up this confusion, I suggest printing a warning just before [MachCode]. >> >>
>> >> sample output >> >> If hs_err occurs and hsdis load fails without the option to specify where the hs_err report should be output, the following is output to the hs_err_pir log file: >> >> . >> . >> native method entry point (kind = native) [0x000001ae8753cec0, 0x000001ae8753dac0] 3072 bytes >> >> Loading hsdis library failed, so undisassembled code is printed in the below [MachCode] section >> [MachCode] >> 0x000001ae8753cec0: 488b 4b08 | 0fb7 492e | 584c 8d74 | ccf8 6800 | 0000 0068 | 0000 0000 | 5055 488b | ec41 5548 >> 0x000001ae8753cee0: 8b43 084c | 8d68 3848 | 8b40 0868 | 0000 0000 | 5348 8b50 | 18 >> . >> . >> >> >> If -XX:+PrintAssembly is specified and hsdis load fails, the following is output to the stdout >> >> $ java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -version >> OpenJDK 64-Bit Server VM warning: PrintAssembly is enabled; turning on DebugNonSafepoints to gain additional output >> >> ============================= C1-compiled nmethod ============================== >> ----------------------------------- Assembly ----------------------------------- >> >> Compiled method (c1) 57 2 3 java.lang.Object:: (1 bytes) >> total in heap [0x0000024a08a00008,0x0000024a08a00208] = 512 >> . >> . >> >> [Constant Pool (empty)] >> >> >> Loading hsdis library failed, so undisassembled code is printed in the below [MachCode] section >> [MachCode] >> [Instructions begin] >> 0x0000024a08a00100: 6666 660f | 1f84 0000 | 0000 0066 | 6666 9066 | 6690 448b | 5208 443b >> . >> . >> [Constant Pool (empty)] >> >> >> Loading hsdis library failed, so undisassembled code is printed in the below [MachCode] section >> [MachCode] >> [Verified Entry Point] >> # {method} {0x00000000251a1898} 'toUnsignedInt' '(B)I' in 'java/lang/Byte >> . >> . >> >> >>
>> >> Since... > > Taizo Kurashige has updated the pull request incrementally with one additional commit since the last revision: > > Fix message and revert lines for Xlog Thank you for your review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25726#issuecomment-3027058301 From duke at openjdk.org Wed Jul 2 09:08:43 2025 From: duke at openjdk.org (duke) Date: Wed, 2 Jul 2025 09:08:43 GMT Subject: RFR: 8359120: Improve warning message when fail to load hsdis library [v2] In-Reply-To: <-i6UPk-bhy9RnnCus_JbJ1nQ63nMX9djubON9WBbHQ8=.a2305566-563e-4171-b526-bcd645de51a3@github.com> References: <-i6UPk-bhy9RnnCus_JbJ1nQ63nMX9djubON9WBbHQ8=.a2305566-563e-4171-b526-bcd645de51a3@github.com> Message-ID: On Mon, 23 Jun 2025 08:56:12 GMT, Taizo Kurashige wrote: >> This PR is improvement of warning message when fail to load hsdis library. >> >> [JDK-8287001](https://bugs.openjdk.org/browse/JDK-8287001) introduced a warning on hsdis library load failure. This is useful when the user executes -XX:+PrintAssembly, etc. >> >> However, I think that when hs_err occurs, users might be confused by this warning printed by Xlog. Because users are not likely to know that hsdis is loaded for the [MachCode] section of the hs_err report, they may wonder, for example, "Why do I get warnings about hsdis load errors when -XX:+PrintAssembly is not specified?." >> >> To clear up this confusion, I suggest printing a warning just before [MachCode]. >> >>
>> >> sample output >> >> If hs_err occurs and hsdis load fails without the option to specify where the hs_err report should be output, the following is output to the hs_err_pir log file: >> >> . >> . >> native method entry point (kind = native) [0x000001ae8753cec0, 0x000001ae8753dac0] 3072 bytes >> >> Loading hsdis library failed, so undisassembled code is printed in the below [MachCode] section >> [MachCode] >> 0x000001ae8753cec0: 488b 4b08 | 0fb7 492e | 584c 8d74 | ccf8 6800 | 0000 0068 | 0000 0000 | 5055 488b | ec41 5548 >> 0x000001ae8753cee0: 8b43 084c | 8d68 3848 | 8b40 0868 | 0000 0000 | 5348 8b50 | 18 >> . >> . >> >> >> If -XX:+PrintAssembly is specified and hsdis load fails, the following is output to the stdout >> >> $ java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -version >> OpenJDK 64-Bit Server VM warning: PrintAssembly is enabled; turning on DebugNonSafepoints to gain additional output >> >> ============================= C1-compiled nmethod ============================== >> ----------------------------------- Assembly ----------------------------------- >> >> Compiled method (c1) 57 2 3 java.lang.Object:: (1 bytes) >> total in heap [0x0000024a08a00008,0x0000024a08a00208] = 512 >> . >> . >> >> [Constant Pool (empty)] >> >> >> Loading hsdis library failed, so undisassembled code is printed in the below [MachCode] section >> [MachCode] >> [Instructions begin] >> 0x0000024a08a00100: 6666 660f | 1f84 0000 | 0000 0066 | 6666 9066 | 6690 448b | 5208 443b >> . >> . >> [Constant Pool (empty)] >> >> >> Loading hsdis library failed, so undisassembled code is printed in the below [MachCode] section >> [MachCode] >> [Verified Entry Point] >> # {method} {0x00000000251a1898} 'toUnsignedInt' '(B)I' in 'java/lang/Byte >> . >> . >> >> >>
>> >> Since... > > Taizo Kurashige has updated the pull request incrementally with one additional commit since the last revision: > > Fix message and revert lines for Xlog @kurashige23 Your change (at version 6ff4f9b5a3f6302ae4605ee985755fbccd3e24fb) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25726#issuecomment-3027062091 From tkurashige at openjdk.org Wed Jul 2 09:24:44 2025 From: tkurashige at openjdk.org (Taizo Kurashige) Date: Wed, 2 Jul 2025 09:24:44 GMT Subject: Integrated: 8359120: Improve warning message when fail to load hsdis library In-Reply-To: References: Message-ID: On Tue, 10 Jun 2025 13:38:03 GMT, Taizo Kurashige wrote: > This PR is improvement of warning message when fail to load hsdis library. > > [JDK-8287001](https://bugs.openjdk.org/browse/JDK-8287001) introduced a warning on hsdis library load failure. This is useful when the user executes -XX:+PrintAssembly, etc. > > However, I think that when hs_err occurs, users might be confused by this warning printed by Xlog. Because users are not likely to know that hsdis is loaded for the [MachCode] section of the hs_err report, they may wonder, for example, "Why do I get warnings about hsdis load errors when -XX:+PrintAssembly is not specified?." > > To clear up this confusion, I suggest printing a warning just before [MachCode]. > >
> > sample output > > If hs_err occurs and hsdis load fails without the option to specify where the hs_err report should be output, the following is output to the hs_err_pir log file: > > . > . > native method entry point (kind = native) [0x000001ae8753cec0, 0x000001ae8753dac0] 3072 bytes > > Loading hsdis library failed, so undisassembled code is printed in the below [MachCode] section > [MachCode] > 0x000001ae8753cec0: 488b 4b08 | 0fb7 492e | 584c 8d74 | ccf8 6800 | 0000 0068 | 0000 0000 | 5055 488b | ec41 5548 > 0x000001ae8753cee0: 8b43 084c | 8d68 3848 | 8b40 0868 | 0000 0000 | 5348 8b50 | 18 > . > . > > > If -XX:+PrintAssembly is specified and hsdis load fails, the following is output to the stdout > > $ java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -version > OpenJDK 64-Bit Server VM warning: PrintAssembly is enabled; turning on DebugNonSafepoints to gain additional output > > ============================= C1-compiled nmethod ============================== > ----------------------------------- Assembly ----------------------------------- > > Compiled method (c1) 57 2 3 java.lang.Object:: (1 bytes) > total in heap [0x0000024a08a00008,0x0000024a08a00208] = 512 > . > . > > [Constant Pool (empty)] > > > Loading hsdis library failed, so undisassembled code is printed in the below [MachCode] section > [MachCode] > [Instructions begin] > 0x0000024a08a00100: 6666 660f | 1f84 0000 | 0000 0066 | 6666 9066 | 6690 448b | 5208 443b > . > . > [Constant Pool (empty)] > > > Loading hsdis library failed, so undisassembled code is printed in the below [MachCode] section > [MachCode] > [Verified Entry Point] > # {method} {0x00000000251a1898} 'toUnsignedInt' '(B)I' in 'java/lang/Byte > . > . > > >
> > Since the warning added in this fix cover the role of warning introduced in [JDK-8287001](https://bugs.openjdk.org/browse/JDK-828... This pull request has now been integrated. Changeset: ce998699 Author: Taizo Kurashige Committer: Manuel H?ssig URL: https://git.openjdk.org/jdk/commit/ce9986991d60e116ac6680a1b6a4b3ee5384d105 Stats: 9 lines in 2 files changed: 9 ins; 0 del; 0 mod 8359120: Improve warning message when fail to load hsdis library Reviewed-by: mhaessig, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/25726 From rrich at openjdk.org Wed Jul 2 09:36:15 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 2 Jul 2025 09:36:15 GMT Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining Message-ID: This PR adds CompileCommands to the test DumpThreadsWithEliminatedLock.java to force inlining of java/lang/String*.* methods. This will make inlining more stable to allow for the expected lock elimination based on c2 escape analysis. Forcing inlining of java/lang/StringBuffer.* wasn't sufficient on x86_64. With that the test still failed with TieredCompilation disabled. Testing: x86_64, ppc64 Failed inlining on x86_64 with TieredCompilation disabled: make test TEST=com/sun/management/HotSpotDiagnosticMXBean/DumpThreadsWithEliminatedLock.java TEST_VM_OPTS="-XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=PrintInlining,DumpThreadsWithEliminatedLock.*" JTREG=TIMEOUT_FACTOR=0.1 [...] STDOUT: CompileCommand: PrintInlining DumpThreadsWithEliminatedLock.* bool PrintInlining = true @ 1 java.util.concurrent.atomic.AtomicBoolean::get (13 bytes) inline (hot) @ 11 java.lang.StringBuffer:: (7 bytes) inline (hot) late inline succeeded (string method) @ 3 java.lang.AbstractStringBuilder:: (39 bytes) inline (hot) @ 1 java.lang.Object:: (1 bytes) inline (hot) @ 16 java.lang.System::currentTimeMillis (0 bytes) (intrinsic) s @ 19 java.lang.StringBuffer::append (13 bytes) failed to inline: already compiled into a big method s @ 24 java.lang.StringBuffer::toString (44 bytes) inline (hot) late inline succeeded (string method) s @ 1 java.lang.StringBuffer::length (5 bytes) accessor @ 24 java.lang.String:: (98 bytes) failed to inline: already compiled into a big method @ 30 java.util.concurrent.atomic.AtomicReference::set (6 bytes) accessor 2025-07-02T09:25:53.396634900Z Attempt 1, found: false 2025-07-02T09:25:53.415673072Z Attempt 2, found: false 2025-07-02T09:25:53.418876867Z Attempt 3, found: false [...] ------------- Commit messages: - Force inlining of String*.* methods - Force inlining of StringBuffer methods Changes: https://git.openjdk.org/jdk/pull/26033/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26033&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8360599 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/26033.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26033/head:pull/26033 PR: https://git.openjdk.org/jdk/pull/26033 From rrich at openjdk.org Wed Jul 2 09:36:15 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 2 Jul 2025 09:36:15 GMT Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining In-Reply-To: References: Message-ID: On Sun, 29 Jun 2025 15:26:14 GMT, Richard Reingruber wrote: > This PR adds CompileCommands to the test DumpThreadsWithEliminatedLock.java to force inlining of java/lang/String*.* methods. 
This will make inlining more stable to allow for the expected lock elimination based on c2 escape analysis. > > Forcing inlining of java/lang/StringBuffer.* wasn't sufficient on x86_64. With that the test still failed with TieredCompilation disabled. > > Testing: x86_64, ppc64 > > Failed inlining on x86_64 with TieredCompilation disabled: > > > make test TEST=com/sun/management/HotSpotDiagnosticMXBean/DumpThreadsWithEliminatedLock.java TEST_VM_OPTS="-XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=PrintInlining,DumpThreadsWithEliminatedLock.*" JTREG=TIMEOUT_FACTOR=0.1 > > [...] > > STDOUT: > CompileCommand: PrintInlining DumpThreadsWithEliminatedLock.* bool PrintInlining = true > @ 1 java.util.concurrent.atomic.AtomicBoolean::get (13 bytes) inline (hot) > @ 11 java.lang.StringBuffer::<init> (7 bytes) inline (hot) late inline succeeded (string method) > @ 3 java.lang.AbstractStringBuilder::<init> (39 bytes) inline (hot) > @ 1 java.lang.Object::<init> (1 bytes) inline (hot) > @ 16 java.lang.System::currentTimeMillis (0 bytes) (intrinsic) > s @ 19 java.lang.StringBuffer::append (13 bytes) failed to inline: already compiled into a big method > s @ 24 java.lang.StringBuffer::toString (44 bytes) inline (hot) late inline succeeded (string method) > s @ 1 java.lang.StringBuffer::length (5 bytes) accessor > @ 24 java.lang.String::<init> (98 bytes) failed to inline: already compiled into a big method > @ 30 java.util.concurrent.atomic.AtomicReference::set (6 bytes) accessor > 2025-07-02T09:25:53.396634900Z Attempt 1, found: false > 2025-07-02T09:25:53.415673072Z Attempt 2, found: false > 2025-07-02T09:25:53.418876867Z Attempt 3, found: false > > [...] JIT compiler folks might want to have a look at this PR. Maybe there's a better way to have the StringBuilder locks eliminated deterministically. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26033#issuecomment-3027054192 From mdoerr at openjdk.org Wed Jul 2 09:54:41 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 2 Jul 2025 09:54:41 GMT Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining In-Reply-To: References: Message-ID: On Sun, 29 Jun 2025 15:26:14 GMT, Richard Reingruber wrote: > This PR adds CompileCommands to the test DumpThreadsWithEliminatedLock.java to force inlining of java/lang/String*.* methods. This will make inlining more stable to allow for the expected lock elimination based on c2 escape analysis. > > Forcing inlining of java/lang/StringBuffer.* wasn't sufficient on x86_64. With that the test still failed with TieredCompilation disabled. > > Testing: x86_64, ppc64 > > Failed inlining on x86_64 with TieredCompilation disabled: > > > make test TEST=com/sun/management/HotSpotDiagnosticMXBean/DumpThreadsWithEliminatedLock.java TEST_VM_OPTS="-XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=PrintInlining,DumpThreadsWithEliminatedLock.*" JTREG=TIMEOUT_FACTOR=0.1 > > [...]
> > STDOUT: > CompileCommand: PrintInlining DumpThreadsWithEliminatedLock.* bool PrintInlining = true > @ 1 java.util.concurrent.atomic.AtomicBoolean::get (13 bytes) inline (hot) > @ 11 java.lang.StringBuffer::<init> (7 bytes) inline (hot) late inline succeeded (string method) > @ 3 java.lang.AbstractStringBuilder::<init> (39 bytes) inline (hot) > @ 1 java.lang.Object::<init> (1 bytes) inline (hot) > @ 16 java.lang.System::currentTimeMillis (0 bytes) (intrinsic) > s @ 19 java.lang.StringBuffer::append (13 bytes) failed to inline: already compiled into a big method > s @ 24 java.lang.StringBuffer::toString (44 bytes) inline (hot) late inline succeeded (string method) > s @ 1 java.lang.StringBuffer::length (5 bytes) accessor > @ 24 java.lang.String::<init> (98 bytes) failed to inline: already compiled into a big method > @ 30 java.util.concurrent.atomic.AtomicReference::set (6 bytes) accessor > 2025-07-02T09:25:53.396634900Z Attempt 1, found: false > 2025-07-02T09:25:53.415673072Z Attempt 2, found: false > 2025-07-02T09:25:53.418876867Z Attempt 3, found: false > > [...] LGTM. ------------- Marked as reviewed by mdoerr (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26033#pullrequestreview-2978510981 From bmaillard at openjdk.org Wed Jul 2 10:03:23 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Wed, 2 Jul 2025 10:03:23 GMT Subject: RFR: 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal [v2] In-Reply-To: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> References: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> Message-ID: <3wDbLni8c6Up8_W56fFOv_meffgHHjzch0e3QESao1A=.03a7c7a7-787d-4fb0-b081-64865636bf14@github.com> > This PR adds a node hash comparison after calling `Ideal` in `PhaseIterGVN::verify_Ideal_for` to introduce an additional layer of verification for missed optimizations. Previously, we relied on the return value of `Ideal`, which is expected to be `nullptr` if no transformation was done. > > By also checking the node's hash before and after `Ideal`, we could catch inconsistencies in the implementation or unintended modifications to the graph. Both of these may indicate missed or incomplete optimizations. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361144) > - [x] tier1-3, plus some internal testing > > Thank you for reviewing!
Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: 8361144: update comment Co-authored-by: Damon Fenacci ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26064/files - new: https://git.openjdk.org/jdk/pull/26064/files/e06b4d53..28851936 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26064&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26064&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26064.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26064/head:pull/26064 PR: https://git.openjdk.org/jdk/pull/26064 From bmaillard at openjdk.org Wed Jul 2 10:19:59 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Wed, 2 Jul 2025 10:19:59 GMT Subject: RFR: 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal [v3] In-Reply-To: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> References: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> Message-ID: > This PR adds a node hash comparison after calling `Ideal` in `PhaseIterGVN::verify_Ideal_for` to introduce an additional layer of verification for missed optimizations. Previously, we relied on the return value of `Ideal`, which is expected to be `nullptr` if no transformation was done. > > By also checking the node's hash before and after `Ideal`, we could catch inconsistencies in the implementation or unintended modifications to the graph. Both of these may indicate missed or incomplete optimizations. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361144) > - [x] tier1-3, plus some internal testing > > Thank you for reviewing! Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: 8361144: add comment for consistency with node count ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26064/files - new: https://git.openjdk.org/jdk/pull/26064/files/28851936..75f81296 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26064&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26064&range=01-02 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26064.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26064/head:pull/26064 PR: https://git.openjdk.org/jdk/pull/26064 From bmaillard at openjdk.org Wed Jul 2 10:20:00 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Wed, 2 Jul 2025 10:20:00 GMT Subject: RFR: 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal [v3] In-Reply-To: References: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> Message-ID: On Wed, 2 Jul 2025 06:55:34 GMT, Damon Fenacci wrote: >> Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: >> >> 8361144: add comment for consistency with node count > > src/hotspot/share/opto/phaseX.cpp line 1821: > >> 1819: // The number of nodes shoud not increase. >> 1820: uint old_unique = C->unique(); >> 1821: uint old_hash = n->hash(); > > Just to be consistent with `old_unique` we could add a small comment (here or below for both). What do you think? 
Sounds reasonable! Made the update ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26064#discussion_r2179682863 From shade at openjdk.org Wed Jul 2 10:20:48 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 2 Jul 2025 10:20:48 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems Message-ID: We routinely CTW 3rd party JARs to make sure our compilers work. By the nature of the JARs, they have dependencies on other JARs, and CTW runner frequently warns out with NCDFE. It does so very crudely, missing opportunities to compile the methods that _do not_ trigger NCDFEs. CTW should be made more tolerant to this. I think the normal "modules" CTW runs into the similar problem, but on a lesser scale, as we do not have a very hairy dependency graph within JDK. The CTW logs are also fairly noisy with full exception traces when NCDFE is semi-expected. This PR does _not_ print exception stack traces in these cases, only "NOTE"-s about it. This makes the log fairly clean and more understandable. Motivational scope improvement compiling a sample 3rd party JAR (cassandra-2.1.4.0.jar): Before: Done (2487 classes, 9866 methods, 24584 ms) After: Done (2487 classes, 10074 methods, 24150 ms) ; +2% more methods Additional testing: - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` ------------- Commit messages: - Move clinit compile back - Initial - Fix Changes: https://git.openjdk.org/jdk/pull/26090/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26090&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361255 Stats: 41 lines in 2 files changed: 35 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/26090.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26090/head:pull/26090 PR: https://git.openjdk.org/jdk/pull/26090 From bmaillard at openjdk.org Wed Jul 2 10:24:38 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Wed, 2 Jul 2025 10:24:38 GMT Subject: RFR: 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal [v3] In-Reply-To: References: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> Message-ID: On Tue, 1 Jul 2025 13:45:04 GMT, Galder Zamarre?o wrote: > Have you considered adding a test for this? Is that feasible? @galderz I have considered doing it, but there is no known case that triggers the condition. This change was suggested by @eme64 when discussing the related [JDK-8359602](https://bugs.openjdk.org/browse/JDK-8359602). ------------- PR Comment: https://git.openjdk.org/jdk/pull/26064#issuecomment-3027305541 From galder at openjdk.org Wed Jul 2 10:56:40 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Wed, 2 Jul 2025 10:56:40 GMT Subject: RFR: 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal [v3] In-Reply-To: References: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> Message-ID: On Wed, 2 Jul 2025 10:19:59 GMT, Beno?t Maillard wrote: >> This PR adds a node hash comparison after calling `Ideal` in `PhaseIterGVN::verify_Ideal_for` to introduce an additional layer of verification for missed optimizations. Previously, we relied on the return value of `Ideal`, which is expected to be `nullptr` if no transformation was done. 
>> >> By also checking the node's hash before and after `Ideal`, we could catch inconsistencies in the implementation or unintended modifications to the graph. Both of these may indicate missed or incomplete optimizations. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361144) >> - [x] tier1-3, plus some internal testing >> >> Thank you for reviewing! > > Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: > > 8361144: add comment for consistency with node count Marked as reviewed by galder (Author). ------------- PR Review: https://git.openjdk.org/jdk/pull/26064#pullrequestreview-2978685354 From mdoerr at openjdk.org Wed Jul 2 11:01:49 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 2 Jul 2025 11:01:49 GMT Subject: [jdk25] RFR: 8361259: JDK25: Backout JDK-8258229 Message-ID: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> This is a backout of [JDK-8258229](https://bugs.openjdk.org/browse/JDK-8258229) for JDK25 only. The problematic code has already been removed by [JDK-8358821](https://bugs.openjdk.org/browse/JDK-8358821) in JDK26. The backout is clean for the C++ code, but the test backout includes the backout of the follow-up change [JDK-8356310](https://bugs.openjdk.org/browse/JDK-8356310). Rationale: Minimize risk for JDK25. We should use the better fix JDK-8358821 in the long term. However, that one should get some more stabilization time before backporting it. Also see JBS issue. Proposed long term solution: Backport JDK-8358821 to jdk25u and revert this change again after an appropriate time. Short term: The issue solved by JDK-8258229 is not critical. It should be ok to postpone the fix to jdk25u. ------------- Commit messages: - 8361259: JDK25: Backout JDK-8258229 Changes: https://git.openjdk.org/jdk/pull/26091/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26091&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361259 Stats: 93 lines in 2 files changed: 0 ins; 93 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26091.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26091/head:pull/26091 PR: https://git.openjdk.org/jdk/pull/26091 From yzheng at openjdk.org Wed Jul 2 11:28:47 2025 From: yzheng at openjdk.org (Yudi Zheng) Date: Wed, 2 Jul 2025 11:28:47 GMT Subject: RFR: 8358179: Performance regression in Math.cbrt [v2] In-Reply-To: References: <45l5EvxoRINI1_Ep2_snJzKNMPo4-dPXADalLN1fq1Y=.9f697a35-ee7b-4e7a-9e5e-ff33911b3b21@github.com> Message-ID: On Fri, 27 Jun 2025 01:43:16 GMT, Mohamed Issa wrote: >> The changes described below are meant to resolve the performance regression introduced by the **x86_64 cbrt** double precision floating point scalar intrinsic in #24470. >> >> 1. Check for +0, -0, +INF, -INF, and NaN before any other input values. >> 2. If these special values are found, return immediately with minimal modifications to the result register. >> 3. Performance testing shows the modified intrinsic improves throughput by 65.1% over the original intrinsic on average for the special values while throughput drops by 5.5% for the normal value range (-INF, -2^(-1022)], [2^(-1022), INF). >> >> The commands to run all relevant micro-benchmarks are posted below. >> >> `make test TEST="micro:CbrtPerf.CbrtPerfRanges"` >> `make test TEST="micro:CbrtPerf.CbrtPerfSpecialValues"` >> >> The results of all tests posted below were captured with an [Intel? 
Xeon 8488C](https://www.intel.com/content/www/us/en/products/sku/231730/intel-xeon-platinum-8480c-processor-105m-cache-2-00-ghz/specifications.html) using [OpenJDK v26-b1](https://github.com/openjdk/jdk/releases/tag/jdk-26%2B1) as the baseline version. The term _baseline1_ refers to runs with the intrinsic enabled and _baseline2_ refers to runs with the intrinsic disabled. >> >> Each result is the mean of 8 individual runs, and the input ranges used match those from the original Java implementation. Overall, the changes provide a significant uplift over _baseline1_ except for a mild regression in the (**2^(-1022) <= |x| < INF**) input range, which is expected due to the extra checks. When comparing against _baseline2_, the modified intrinsic significantly still outperforms for the inputs (**-INF < x < INF**) that require heavy compute. However, the special value inputs that trigger fast path returns still perform better with _baseline2_. >> >> | Input range(s) | Baseline1 (ops/ms) | Change (ops/ms) | Change vs baseline1 (%) | >> | :-------------------------------------: | :-------------------: | :------------------: | :--------------------------: | >> | [-2^(-1022), 2^(-1022)] | 18470 | 20847 | +12.87 | >> | (-INF, -2^(-1022)], [2^(-1022), INF) | 210538 | 198925 | -5.52 | >> | [0] | 344990 | 627561 | +81.91 | >> | [-0] ... > > Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: > > Ensure ABS_MASK is a 128-bit memory sized location and only use equal enum for UCOMISD checks src/hotspot/cpu/x86/stubGenerator_x86_64_cbrt.cpp line 350: > 348: > 349: __ bind(L_2TAG_PACKET_6_0_1); > 350: __ movsd(xmm0, ExternalAddress(NEG_INF), r11 /*rscratch*/); note that `NEG_INF` is now unused ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25962#discussion_r2179808403 From mhaessig at openjdk.org Wed Jul 2 11:39:43 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Wed, 2 Jul 2025 11:39:43 GMT Subject: [jdk25] RFR: 8361259: JDK25: Backout JDK-8258229 In-Reply-To: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> References: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> Message-ID: <3IMUQfwLLDneX5SFYKzLTLk_queN_r2Q7VPC7B31vow=.d6f7600f-acf3-482f-88da-5e260cb16aa1@github.com> On Wed, 2 Jul 2025 10:54:13 GMT, Martin Doerr wrote: > This is a backout of [JDK-8258229](https://bugs.openjdk.org/browse/JDK-8258229) for JDK25 only. The problematic code has already been removed by [JDK-8358821](https://bugs.openjdk.org/browse/JDK-8358821) in JDK26. > > The backout is clean for the C++ code, but the test backout includes the backout of the follow-up change [JDK-8356310](https://bugs.openjdk.org/browse/JDK-8356310). > > Rationale: Minimize risk for JDK25. We should use the better fix JDK-8358821 in the long term. However, that one should get some more stabilization time before backporting it. Also see JBS issue. > > Proposed long term solution: Backport JDK-8358821 to jdk25u and revert this change again after an appropriate time. > > Short term: The issue solved by JDK-8258229 is not critical. It should be ok to postpone the fix to jdk25u. The change and the proposed plans look good to me. Apologies for all the troubles I have caused. ------------- Marked as reviewed by mhaessig (Committer). 
PR Review: https://git.openjdk.org/jdk/pull/26091#pullrequestreview-2978804263 From shade at openjdk.org Wed Jul 2 12:02:07 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 2 Jul 2025 12:02:07 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v2] In-Reply-To: References: Message-ID: > We routinely CTW 3rd party JARs to make sure our compilers work. By the nature of the JARs, they have dependencies on other JARs, and CTW runner frequently warns out with NCDFE. It does so very crudely, missing opportunities to compile the methods that _do not_ trigger NCDFEs. CTW should be made more tolerant to this. I think the normal "modules" CTW runs into the similar problem, but on a lesser scale, as we do not have a very hairy dependency graph within JDK. > > The CTW logs are also fairly noisy with full exception traces when NCDFE is semi-expected. This PR does _not_ print exception stack traces in these cases, only "NOTE"-s about it. This makes the log fairly clean and more understandable. > > Motivational scope improvement compiling a sample 3rd party JAR (cassandra-2.1.4.0.jar): > > > Before: Done (2487 classes, 9866 methods, 24584 ms) > After: Done (2487 classes, 10074 methods, 24150 ms) ; +2% more methods > > > Additional testing: > - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: - Merge branch 'master' into JDK-8361255-ctw-ncdfe - Move clinit compile back - Initial - Fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26090/files - new: https://git.openjdk.org/jdk/pull/26090/files/ba0cc87b..9d41f80a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26090&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26090&range=00-01 Stats: 1189 lines in 72 files changed: 623 ins; 239 del; 327 mod Patch: https://git.openjdk.org/jdk/pull/26090.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26090/head:pull/26090 PR: https://git.openjdk.org/jdk/pull/26090 From thartmann at openjdk.org Wed Jul 2 12:00:39 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 2 Jul 2025 12:00:39 GMT Subject: [jdk25] RFR: 8361259: JDK25: Backout JDK-8258229 In-Reply-To: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> References: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> Message-ID: On Wed, 2 Jul 2025 10:54:13 GMT, Martin Doerr wrote: > This is a backout of [JDK-8258229](https://bugs.openjdk.org/browse/JDK-8258229) for JDK25 only. The problematic code has already been removed by [JDK-8358821](https://bugs.openjdk.org/browse/JDK-8358821) in JDK26. > > The backout is clean for the C++ code, but the test backout includes the backout of the follow-up change [JDK-8356310](https://bugs.openjdk.org/browse/JDK-8356310). > > Rationale: Minimize risk for JDK25. We should use the better fix JDK-8358821 in the long term. However, that one should get some more stabilization time before backporting it. Also see JBS issue. > > Proposed long term solution: Backport JDK-8358821 to jdk25u and revert this change again after an appropriate time. > > Short term: The issue solved by JDK-8258229 is not critical. It should be ok to postpone the fix to jdk25u. Looks good to me too. 
------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26091#pullrequestreview-2978863065 From shade at openjdk.org Wed Jul 2 12:10:40 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 2 Jul 2025 12:10:40 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v2] In-Reply-To: References: Message-ID: <_U8Ws402jgYrpmU1GxnfiHkfein2Rsl1Rh4RKJFwvRQ=.5b74a2c1-5d67-4221-bce8-d00adeb63207@github.com> On Wed, 2 Jul 2025 12:02:07 GMT, Aleksey Shipilev wrote: >> We routinely CTW 3rd party JARs to make sure our compilers work. By the nature of the JARs, they have dependencies on other JARs, and CTW runner frequently warns out with NCDFE. It does so very crudely, missing opportunities to compile the methods that _do not_ trigger NCDFEs. CTW should be made more tolerant to this. I think the normal "modules" CTW runs into the similar problem, but on a lesser scale, as we do not have a very hairy dependency graph within JDK. >> >> The CTW logs are also fairly noisy with full exception traces when NCDFE is semi-expected. This PR does _not_ print exception stack traces in these cases, only "NOTE"-s about it. This makes the log fairly clean and more understandable. >> >> Motivational scope improvement compiling a sample 3rd party JAR (cassandra-2.1.4.0.jar): >> >> >> Before: Done (2487 classes, 9866 methods, 24584 ms) >> After: Done (2487 classes, 10074 methods, 24150 ms) ; +2% more methods >> >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Merge branch 'master' into JDK-8361255-ctw-ncdfe > - Move clinit compile back > - Initial > - Fix Sanity-checking CTW times: $ time CONF=linux-x86_64-server-fastdebug make test TEST=applications/ctw/modules/ # Base real 3m49.952s user 67m50.313s sys 5m24.288s # This PR real 3m53.800s user 67m26.925s sys 5m22.429s ------------- PR Comment: https://git.openjdk.org/jdk/pull/26090#issuecomment-3027631058 From mdoerr at openjdk.org Wed Jul 2 13:03:38 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 2 Jul 2025 13:03:38 GMT Subject: [jdk25] RFR: 8361259: JDK25: Backout JDK-8258229 In-Reply-To: <3IMUQfwLLDneX5SFYKzLTLk_queN_r2Q7VPC7B31vow=.d6f7600f-acf3-482f-88da-5e260cb16aa1@github.com> References: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> <3IMUQfwLLDneX5SFYKzLTLk_queN_r2Q7VPC7B31vow=.d6f7600f-acf3-482f-88da-5e260cb16aa1@github.com> Message-ID: On Wed, 2 Jul 2025 11:36:38 GMT, Manuel H?ssig wrote: > Apologies for all the troubles I have caused. Never mind. The related code is quite tricky. And your problem analysis was good. Thanks for the 2 reviews! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26091#issuecomment-3027802039 From asmehra at openjdk.org Wed Jul 2 13:27:45 2025 From: asmehra at openjdk.org (Ashutosh Mehra) Date: Wed, 2 Jul 2025 13:27:45 GMT Subject: RFR: 8361101: AOTCodeAddressTable::_stubs_addr not initialized/freed properly In-Reply-To: References: Message-ID: <-pMpqoug81IwYPE7M1In40E0z5SHdeRM0Dianb9yzsM=.ad435e03-ba15-45ab-89c3-e5331b709735@github.com> On Tue, 1 Jul 2025 15:50:29 GMT, Vladimir Kozlov wrote: >> Please reivew this patch to fix initialization and freeing of `AOTCodeAddressTable::_stubs_addr`. 
Changes are trivial > > Yes, it is trivial. @vnkozlov @shipilev thanks for the review. Integrating it now. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26053#issuecomment-3027875790 From asmehra at openjdk.org Wed Jul 2 13:27:46 2025 From: asmehra at openjdk.org (Ashutosh Mehra) Date: Wed, 2 Jul 2025 13:27:46 GMT Subject: Integrated: 8361101: AOTCodeAddressTable::_stubs_addr not initialized/freed properly In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 19:45:49 GMT, Ashutosh Mehra wrote: > Please review this patch to fix initialization and freeing of `AOTCodeAddressTable::_stubs_addr`. Changes are trivial. This pull request has now been integrated. Changeset: 3066a67e Author: Ashutosh Mehra URL: https://git.openjdk.org/jdk/commit/3066a67e6279f7e3896ab545bc6c291d279d2b03 Stats: 4 lines in 2 files changed: 4 ins; 0 del; 0 mod 8361101: AOTCodeAddressTable::_stubs_addr not initialized/freed properly Reviewed-by: kvn, shade ------------- PR: https://git.openjdk.org/jdk/pull/26053 From missa at openjdk.org Wed Jul 2 14:58:50 2025 From: missa at openjdk.org (Mohamed Issa) Date: Wed, 2 Jul 2025 14:58:50 GMT Subject: RFR: 8358179: Performance regression in Math.cbrt [v2] In-Reply-To: References: <45l5EvxoRINI1_Ep2_snJzKNMPo4-dPXADalLN1fq1Y=.9f697a35-ee7b-4e7a-9e5e-ff33911b3b21@github.com> Message-ID: On Wed, 2 Jul 2025 11:25:33 GMT, Yudi Zheng wrote: >> Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision: >> >> Ensure ABS_MASK is a 128-bit memory sized location and only use equal enum for UCOMISD checks > > src/hotspot/cpu/x86/stubGenerator_x86_64_cbrt.cpp line 350: > >> 348: >> 349: __ bind(L_2TAG_PACKET_6_0_1); >> 350: __ movsd(xmm0, ExternalAddress(NEG_INF), r11 /*rscratch*/); > > note that `NEG_INF` is now unused Got it - thanks. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25962#discussion_r2180280115 From asmehra at openjdk.org Wed Jul 2 15:06:17 2025 From: asmehra at openjdk.org (Ashutosh Mehra) Date: Wed, 2 Jul 2025 15:06:17 GMT Subject: [jdk25] RFR: 8361101: AOTCodeAddressTable::_stubs_addr not initialized/freed properly Message-ID: Backporting the fix to jdk25 ------------- Commit messages: - Backport 3066a67e6279f7e3896ab545bc6c291d279d2b03 Changes: https://git.openjdk.org/jdk/pull/26095/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26095&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361101 Stats: 4 lines in 2 files changed: 4 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26095.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26095/head:pull/26095 PR: https://git.openjdk.org/jdk/pull/26095 From shade at openjdk.org Wed Jul 2 16:00:44 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 2 Jul 2025 16:00:44 GMT Subject: [jdk25] RFR: 8361101: AOTCodeAddressTable::_stubs_addr not initialized/freed properly In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 14:56:23 GMT, Ashutosh Mehra wrote: > Backporting the fix to jdk25 Marked as reviewed by shade (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26095#pullrequestreview-2979750627 From lmesnik at openjdk.org Wed Jul 2 16:29:39 2025 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Wed, 2 Jul 2025 16:29:39 GMT Subject: RFR: 8357739: [jittester] disable the hashCode method In-Reply-To: References: Message-ID: On Tue, 17 Jun 2025 19:49:34 GMT, Evgeny Nikitin wrote: > JITTester often uses the `hashCode` method (in fact, in almost every generated test).
Given that the method can be unstable between runs or in interpreted vs compiled runs, it can create false-positives. > > This PR fixes the issue by adding support for method templates similar to the ones used in CompilerCommands). All of those exclude templates match (and exclude) `String.indexOf(String)`, for example: > > java/lang/::*(Ljava/lang/String;I) > *String::indexOf(*) > java/lang/*::indexOf > > > Additionally, the PR adds support for comments (starting from '#') and empty lines in the excludes file. Marked as reviewed by lmesnik (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/25859#pullrequestreview-2979848692 From vpaprotski at openjdk.org Wed Jul 2 17:30:51 2025 From: vpaprotski at openjdk.org (Volodymyr Paprotski) Date: Wed, 2 Jul 2025 17:30:51 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 06:45:58 GMT, Jatin Bhateja wrote: > For a cleaner interface, I think we can also maintain a RAII style APXPushPopPairTracker ... Using the suggested code as a base, Vamsi and I tinkered with the idea some more! Here is what we came up with. This also tracks the correct order of registers being pushed/poped.. (haven't compiled it, so might have some syntax bugs). @dholmes-ora would you mind sharing your opinion? We seem to be making things more complicated, but hopefully in a good way? Also included a sample usage in a stub. #define __ _masm-> class PushPopTracker { private: int _counter; MacroAssembler *_masm; const int REGS = 32; // Increase as needed int regs[REGS]; public: PushPopTracker(MacroAssembler *_masm) : _counter(0), _masm(_masm) {} ~PushPopTracker() { assert(_counter == 0, "Push/pop pair mismatch"); } void push(Register reg) { assert(_counter0, "Push/pop underflow"); assert(regs[_counter] == reg.encoding(), "Push/pop pair mismatch: %d != %d", regs[_counter], reg.encoding()); _counter--; if (VM_Version::supports_apx_f()) { __ popp(reg); } else { __ pop(reg); } } } address StubGenerator::generate_intpoly_montgomeryMult_P256() { __ align(CodeEntryAlignment); /*...*/ address start = __ pc(); __ enter(); PushPopTracker s(_masm); s.push(r12); //1 s.push(r13); //2 s.push(r14); //3 #ifdef _WIN64 s.push(rsi); //4 s.push(rdi); //5 #endif s.push(rbp); //6 __ movq(rbp, rsp); __ andq(rsp, -32); __ subptr(rsp, 32); // Register Map const Register aLimbs = c_rarg0; // c_rarg0: rdi | rcx const Register bLimbs = rsi; // c_rarg1: rsi | rdx const Register rLimbs = r8; // c_rarg2: rdx | r8 const Register tmp1 = r9; const Register tmp2 = r10; /*...*/ __ movq(rsp, rbp); s.pop(rbp); //5 #ifdef _WIN64 s.pop(rdi); //4 s.pop(rsi); //3 #endif s.pop(r14); //2 s.pop(r13); //1 s.pop(r12); //0 __ leave(); __ ret(0); return start; } ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2180606586 From jbhateja at openjdk.org Wed Jul 2 17:47:41 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 2 Jul 2025 17:47:41 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 17:27:41 GMT, Volodymyr Paprotski wrote: >> For a cleaner interface, I think we can also maintain a RAII style APXPushPopPairTracker in the stub snippets using push/pop instruction sequence and wrap the actual assembler call underneath. The idea here is to catch the balancing error upfront as PPX is purely a performance hint. 
Instructions with this hint have the same functional semantics as those without. PPX hints set by the compiler that violate the balancing rule may turn off the PPX >> optimization, but they will not affect program semantics.. >> >> >> class APXPushPopPairTracker { >> private: >> int _counter; >> >> public: >> APXPushPopPairTracker() _counter(0) { >> } >> >> ~APXPushPopPairTracker() { >> assert(_counter == 0, "Push/pop pair mismatch"); >> } >> >> void push(Register reg, bool has_matching_pop) { >> if (has_matching_pop && VM_Version::supports_apx_f()) { >> Assembler::pushp(reg); >> incrementCounter(); >> } else { >> Assembler::push(reg); >> } >> } >> void pop(Register reg, bool has_matching_push) { >> if (has_matching_push && VM_Version::supports_apx_f()) { >> Assembler::popp(reg); >> decrementCounter(); >> } else { >> Assembler::pop(reg); >> } >> } >> void incrementCounter() { >> _counter++; >> } >> void decrementCounter() { >> _counter--; >> } >> } > >> For a cleaner interface, I think we can also maintain a RAII style APXPushPopPairTracker ... > > Using the suggested code as a base, Vamsi and I tinkered with the idea some more! Here is what we came up with. This also tracks the correct order of registers being pushed/poped.. (haven't compiled it, so might have some syntax bugs). > > @dholmes-ora would you mind sharing your opinion? We seem to be making things more complicated, but hopefully in a good way? > > Also included a sample usage in a stub. > > > #define __ _masm-> > > class PushPopTracker { > private: > int _counter; > MacroAssembler *_masm; > const int REGS = 32; // Increase as needed > int regs[REGS]; > public: > PushPopTracker(MacroAssembler *_masm) : _counter(0), _masm(_masm) {} > ~PushPopTracker() { > assert(_counter == 0, "Push/pop pair mismatch"); > } > > void push(Register reg) { > assert(_counter regs[_counter++] = reg.encoding(); > if (VM_Version::supports_apx_f()) { > __ pushp(reg); > } else { > __ push(reg); > } > } > void pop(Register reg) { > assert(_counter>0, "Push/pop underflow"); > assert(regs[_counter] == reg.encoding(), "Push/pop pair mismatch: %d != %d", regs[_counter], reg.encoding()); > _counter--; > if (VM_Version::supports_apx_f()) { > __ popp(reg); > } else { > __ pop(reg); > } > } > } > > address StubGenerator::generate_intpoly_montgomeryMult_P256() { > __ align(CodeEntryAlignment); > /*...*/ > address start = __ pc(); > __ enter(); > PushPopTracker s(_masm); > s.push(r12); //1 > s.push(r13); //2 > s.push(r14); //3 > #ifdef _WIN64 > s.push(rsi); //4 > s.push(rdi); //5 > #endif > s.push(rbp); //6 > __ movq(rbp, rsp); > __ andq(rsp, -32); > __ subptr(rsp, 32); > // Register Map > const Register aLimbs = c_rarg0; // c_rarg0: rdi | rcx > const Register bLimbs = rsi; // c_rarg1: rsi | rdx > const Register rLimbs = r8; // c_rarg2: rdx | r8 > const Register tmp1 = r9; > const Register tmp2 = r10; > /*...*/ > __ movq(rsp, rbp); > s.pop(rbp); //5 > #ifdef _WIN64 > s.pop(rdi); //4 > s.pop(rsi); //3 > #endif > s.pop(r14); //2 > s.pop(r13); //1 > s.pop(r12); //0 > __ leave(); > __ ret(0); > return start; > } @vamsi-parasa, It's better to make this as a subclass of MacroAssembler in src/hotspot/cpu/x86/macroAssembler_x86.hpp and pass Tracker as an argument to push / pop for a cleaner interface. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2180636365 From sviswanathan at openjdk.org Wed Jul 2 17:49:49 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 2 Jul 2025 17:49:49 GMT Subject: RFR: 8360116: Add support for AVX10 floating point minmax instruction [v6] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 01:57:46 GMT, Jatin Bhateja wrote: >> Intel@ AVX10 ISA [1] extensions added new floating point MIN/MAX instructions which comply with definitions in IEEE-754-2019 standard section 9.6 and can directly emulate Math.min/max semantics without the need for any special handling for NaN, +0.0 or -0.0 detection. >> >> **The following pseudo-code describes the existing algorithm for min/max[FD]:** >> >> Move the non-negative value to the second operand; this will ensure that we correctly handle 0.0 and -0.0 values, if values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is returned. Existing MINPS and MAXPS semantics only check for NaN as the second operand; hence, we need special handling to check for NaN at the first operand. >> >> btmp = (b < +0.0) ? a : b >> atmp = (b < +0.0) ? b : a >> Tmp = Max_Float(atmp , btmp) >> Res = (atmp == NaN) ? atmp : Tmp >> >> For min[FD] we need a small tweak in the above algorithm, i.e., move the non-negative value to the first operand, this will ensure that we correctly select -0.0 if both the operands being compared are 0.0 or -0.0. >> >> btmp = (b < +0.0) ? b : a >> atmp = (b < +0.0) ? a : b >> Tmp = Max_Float(atmp , btmp) >> Res = (atmp == NaN) ? atmp : Tmp >> >> Thus, we need additional special handling for NaNs and +/-0.0 to compute floating-point min/max values to comply with the semantics of Math.max/min APIs using existing MINPS / MAXPS instructions. AVX10.2 added a new instruction, VPMINMAX[SH,SS,SD]/[PH,PS,PD], which comprehensively handles special cases, thereby eliminating the need for special handling. >> >> Patch emits new instructions for reduction and non-reduction operations for single, double, and Float16 type. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin >> >> [1] https://www.intel.com/content/www/us/en/content-details/856721/intel-advanced-vector-extensions-10-2-intel-avx10-2-architecture-specification.html?wapkw=AVX10 > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Sandhya's review comments resolution Looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25914#pullrequestreview-2980096085 From jbhateja at openjdk.org Wed Jul 2 17:49:49 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 2 Jul 2025 17:49:49 GMT Subject: RFR: 8360116: Add support for AVX10 floating point minmax instruction [v6] In-Reply-To: References: Message-ID: <69bq-sgmNdZBGkcLyGo1dccJoCcC04FacUZW4CPHqkE=.ab942681-27ab-4ed6-b425-66b8487b9ab8@github.com> On Wed, 2 Jul 2025 17:45:29 GMT, Sandhya Viswanathan wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Sandhya's review comments resolution > > Looks good to me. Thanks @sviswa7 and @mhaessig for approvals. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/25914#issuecomment-3028779791 From jbhateja at openjdk.org Wed Jul 2 17:49:50 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 2 Jul 2025 17:49:50 GMT Subject: Integrated: 8360116: Add support for AVX10 floating point minmax instruction In-Reply-To: References: Message-ID: On Fri, 20 Jun 2025 11:08:54 GMT, Jatin Bhateja wrote: > Intel@ AVX10 ISA [1] extensions added new floating point MIN/MAX instructions which comply with definitions in IEEE-754-2019 standard section 9.6 and can directly emulate Math.min/max semantics without the need for any special handling for NaN, +0.0 or -0.0 detection. > > **The following pseudo-code describes the existing algorithm for min/max[FD]:** > > Move the non-negative value to the second operand; this will ensure that we correctly handle 0.0 and -0.0 values, if values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is returned. Existing MINPS and MAXPS semantics only check for NaN as the second operand; hence, we need special handling to check for NaN at the first operand. > > btmp = (b < +0.0) ? a : b > atmp = (b < +0.0) ? b : a > Tmp = Max_Float(atmp , btmp) > Res = (atmp == NaN) ? atmp : Tmp > > For min[FD] we need a small tweak in the above algorithm, i.e., move the non-negative value to the first operand, this will ensure that we correctly select -0.0 if both the operands being compared are 0.0 or -0.0. > > btmp = (b < +0.0) ? b : a > atmp = (b < +0.0) ? a : b > Tmp = Max_Float(atmp , btmp) > Res = (atmp == NaN) ? atmp : Tmp > > Thus, we need additional special handling for NaNs and +/-0.0 to compute floating-point min/max values to comply with the semantics of Math.max/min APIs using existing MINPS / MAXPS instructions. AVX10.2 added a new instruction, VPMINMAX[SH,SS,SD]/[PH,PS,PD], which comprehensively handles special cases, thereby eliminating the need for special handling. > > Patch emits new instructions for reduction and non-reduction operations for single, double, and Float16 type. > > Kindly review and share your feedback. > > Best Regards, > Jatin > > [1] https://www.intel.com/content/www/us/en/content-details/856721/intel-advanced-vector-extensions-10-2-intel-avx10-2-architecture-specification.html?wapkw=AVX10 This pull request has now been integrated. Changeset: 5e30bf68 Author: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/5e30bf68353d989aadc2d8176181226b2debd283 Stats: 465 lines in 7 files changed: 423 ins; 4 del; 38 mod 8360116: Add support for AVX10 floating point minmax instruction Reviewed-by: mhaessig, sviswanathan ------------- PR: https://git.openjdk.org/jdk/pull/25914 From asmehra at openjdk.org Wed Jul 2 17:52:48 2025 From: asmehra at openjdk.org (Ashutosh Mehra) Date: Wed, 2 Jul 2025 17:52:48 GMT Subject: [jdk25] Integrated: 8361101: AOTCodeAddressTable::_stubs_addr not initialized/freed properly In-Reply-To: References: Message-ID: <5B6SRjnVrt014BK6iJT8kEIv_qoyJ74xh0bE5VCVoOg=.8fabdd60-a675-4741-b741-08b6c5f44b99@github.com> On Wed, 2 Jul 2025 14:56:23 GMT, Ashutosh Mehra wrote: > Backporting the fix to jdk25 This pull request has now been integrated. 
Changeset: ab013962 Author: Ashutosh Mehra URL: https://git.openjdk.org/jdk/commit/ab013962093a427ae0f2acac82748d0c9f86ab3f Stats: 4 lines in 2 files changed: 4 ins; 0 del; 0 mod 8361101: AOTCodeAddressTable::_stubs_addr not initialized/freed properly Reviewed-by: shade Backport-of: 3066a67e6279f7e3896ab545bc6c291d279d2b03 ------------- PR: https://git.openjdk.org/jdk/pull/26095 From asmehra at openjdk.org Wed Jul 2 18:01:44 2025 From: asmehra at openjdk.org (Ashutosh Mehra) Date: Wed, 2 Jul 2025 18:01:44 GMT Subject: [jdk25] RFR: 8361101: AOTCodeAddressTable::_stubs_addr not initialized/freed properly In-Reply-To: References: Message-ID: <_2YlrJyouZjttbLFcWchpFVh-fRdt6p6crYJEND1kH8=.b14e5332-225c-49b0-b50a-22a9163cdd73@github.com> On Wed, 2 Jul 2025 15:57:43 GMT, Aleksey Shipilev wrote: >> Backporting the fix to jdk25 > > Marked as reviewed by shade (Reviewer). Thanks @shipilev for the review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26095#issuecomment-3028825791 From vpaprotski at openjdk.org Wed Jul 2 18:35:40 2025 From: vpaprotski at openjdk.org (Volodymyr Paprotski) Date: Wed, 2 Jul 2025 18:35:40 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 17:44:34 GMT, Jatin Bhateja wrote: > @vamsi-parasa, It's better to make this as a subclass of MacroAssembler in src/hotspot/cpu/x86/macroAssembler_x86.hpp and pass Tracker as an argument to push / pop for a cleaner interface. I don't think its possible? Unless I am missing something.. - Subclass has an instance of the base class (i.e. the memory allocation of `PushPopTracker` would have the `MacroAssembler` base class with extra fields appended); and `MacroAssembler` has already been allocated (i.e. you can't tack on more fields onto the end of the underlying memory of existing `MacroAssembler`..) - If its a subclass, there is no reason to pass it as a parameter, because it already would have the parent's instance? Also, the extra parameter to push/pop (flag) was what I had originally objected to? (i.e. would like for push/pop to still just take one register as a parameter..) - This class is sort of a stripped-down implementation of reference counting; we want the stack-allocated variable (i.e. explicit constructor call) and the implicit destructor calls (i.e. inserted by g++ on all function exits). That is, we must have a stack allocated variable for it to be deallocated (and destructor called for assert check) Here is an attempt to make it a subclass? And sample usage... class PushPopTracker : public MacroAssembler { private: int _counter; const int REGS = 32; // Increase as needed int regs[REGS]; public: // MacroAssembler(CodeBuffer* code) is the only constructor? PushPopTracker() : _counter(0), MacroAssembler(???) {} //FIXME??? ~PushPopTracker() { assert(_counter == 0, "Push/pop pair mismatch"); } void push(Register reg) { assert(_counter References: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> Message-ID: On Wed, 2 Jul 2025 10:54:13 GMT, Martin Doerr wrote: > This is a backout of [JDK-8258229](https://bugs.openjdk.org/browse/JDK-8258229) for JDK25 only. The problematic code has already been removed by [JDK-8358821](https://bugs.openjdk.org/browse/JDK-8358821) in JDK26. > > The backout is clean for the C++ code, but the test backout includes the backout of the follow-up change [JDK-8356310](https://bugs.openjdk.org/browse/JDK-8356310). 
> > Rationale: Minimize risk for JDK25. We should use the better fix JDK-8358821 in the long term. However, that one should get some more stabilization time before backporting it. Also see JBS issue. > > Proposed long term solution: Backport JDK-8358821 to jdk25u and revert this change again after an appropriate time. > > Short term: The issue solved by JDK-8258229 is not critical. It should be ok to postpone the fix to jdk25u. LGTM. I'm running Oracle testing now. I'm not sure how to handle JDK-8357017 now in JBS. Close it as a duplicate of the backout? According to https://openjdk.org/guide, it sounds like it might have been more correct to use JDK-8357017 for the backout, and make it a subtask of JDK-8258229. @TobiHartmann @JesperIRL , what do you think? ------------- Marked as reviewed by dlong (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26091#pullrequestreview-2980343622 From mdoerr at openjdk.org Wed Jul 2 19:29:39 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 2 Jul 2025 19:29:39 GMT Subject: [jdk25] RFR: 8361259: JDK25: Backout JDK-8258229 In-Reply-To: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> References: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> Message-ID: <0t_ct-w4lOpvbe4c8DJD9jgU-VRgbMRSVG_ibd8lpkU=.4ebc56ab-9d1c-4531-98fc-4bca442434b9@github.com> On Wed, 2 Jul 2025 10:54:13 GMT, Martin Doerr wrote: > This is a backout of [JDK-8258229](https://bugs.openjdk.org/browse/JDK-8258229) for JDK25 only. The problematic code has already been removed by [JDK-8358821](https://bugs.openjdk.org/browse/JDK-8358821) in JDK26. > > The backout is clean for the C++ code, but the test backout includes the backout of the follow-up change [JDK-8356310](https://bugs.openjdk.org/browse/JDK-8356310). > > Rationale: Minimize risk for JDK25. We should use the better fix JDK-8358821 in the long term. However, that one should get some more stabilization time before backporting it. Also see JBS issue. > > Proposed long term solution: Backport JDK-8358821 to jdk25u and revert this change again after an appropriate time. > > Short term: The issue solved by JDK-8258229 is not critical. It should be ok to postpone the fix to jdk25u. Thanks for the review! [JDK-8357017](https://bugs.openjdk.org/browse/JDK-8357017) will be fixed by [JDK-8361259](https://bugs.openjdk.org/browse/JDK-8361259) in JDK25 and it is fixed by [JDK-8358821](https://bugs.openjdk.org/browse/JDK-8358821) in JDK26. So, my plan is to close JDK-8357017 as fixed referring to the other 2 issues. Does that make sense? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26091#issuecomment-3029079372 From dlong at openjdk.org Wed Jul 2 20:13:41 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 2 Jul 2025 20:13:41 GMT Subject: [jdk25] RFR: 8361259: JDK25: Backout JDK-8258229 In-Reply-To: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> References: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> Message-ID: <9UlIKIQGn3vum3P71THXlyJwJ1efJmJNlImCpYErex8=.e794eff7-516d-4c5f-8e02-f15e5b34cba6@github.com> On Wed, 2 Jul 2025 10:54:13 GMT, Martin Doerr wrote: > This is a backout of [JDK-8258229](https://bugs.openjdk.org/browse/JDK-8258229) for JDK25 only. The problematic code has already been removed by [JDK-8358821](https://bugs.openjdk.org/browse/JDK-8358821) in JDK26. 
> > The backout is clean for the C++ code, but the test backout includes the backout of the follow-up change [JDK-8356310](https://bugs.openjdk.org/browse/JDK-8356310). > > Rationale: Minimize risk for JDK25. We should use the better fix JDK-8358821 in the long term. However, that one should get some more stabilization time before backporting it. Also see JBS issue. > > Proposed long term solution: Backport JDK-8358821 to jdk25u and revert this change again after an appropriate time. > > Short term: The issue solved by JDK-8258229 is not critical. It should be ok to postpone the fix to jdk25u. Makes sense, but according to the Developers' Guide, we can't do that because "A Bug or Enhancement with resolution Fixed is required to have a corresponding changeset in one of the OpenJDK repositories." ------------- PR Comment: https://git.openjdk.org/jdk/pull/26091#issuecomment-3029184612 From duke at openjdk.org Wed Jul 2 20:50:51 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Wed, 2 Jul 2025 20:50:51 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: <73AnlXOv0T8K25DgsNdH1PkBjcBXz0f3bBYZx44LpAw=.439f5383-ffd1-44e8-9e11-4b5af9b6a278@github.com> References: <73AnlXOv0T8K25DgsNdH1PkBjcBXz0f3bBYZx44LpAw=.439f5383-ffd1-44e8-9e11-4b5af9b6a278@github.com> Message-ID: <3f1UnDuYp2iYVcciKF-BqdChOOY2PJJG5R0QuyfblVM=.37a92dfd-5de4-4924-83c5-f9c2e5d7548c@github.com> On Tue, 1 Jul 2025 11:24:09 GMT, Evgeny Astigeevich wrote: >> Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 90 commits: >> >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Update how call sites are fixed >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Fix pointer printing >> - Use set_destination_mt_safe >> - Print address as pointer >> - Use new _metadata_size instead of _jvmci_data_size >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Only check branch distance for aarch64 and riscv >> - Move far branch fix to fix_relocation_after_move >> - ... and 80 more: https://git.openjdk.org/jdk/compare/f799cf18...70e4164e > > src/hotspot/share/code/nmethod.cpp line 1653: > >> 1651: } >> 1652: } >> 1653: } > > Do we need this code? Shouldn't missing trampolined be caught during fixing call sites? If fixing call sites fails (like in the event of a missing trampoline) an assert will fail and the JVM will crash. I suppose it could be updated to abandon the relocation if that happens but that would require `fix_relocation_after_move` to return if it succeeded and proper handling by the caller. > test/hotspot/jtreg/vmTestbase/nsk/jvmti/NMethodRelocation/nmethodrelocation.java line 37: > >> 35: import jdk.test.whitebox.code.BlobType; >> 36: >> 37: public class nmethodrelocation extends DebugeeClass { > > Why is the class name not following the Java code conventions? I was following the naming conventions of other JVMTI tests. 
https://github.com/openjdk/jdk/tree/master/test/hotspot/jtreg/vmTestbase/nsk/jvmti ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2180937766 PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2180943465 From duke at openjdk.org Wed Jul 2 20:50:52 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Wed, 2 Jul 2025 20:50:52 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: <3f1UnDuYp2iYVcciKF-BqdChOOY2PJJG5R0QuyfblVM=.37a92dfd-5de4-4924-83c5-f9c2e5d7548c@github.com> References: <73AnlXOv0T8K25DgsNdH1PkBjcBXz0f3bBYZx44LpAw=.439f5383-ffd1-44e8-9e11-4b5af9b6a278@github.com> <3f1UnDuYp2iYVcciKF-BqdChOOY2PJJG5R0QuyfblVM=.37a92dfd-5de4-4924-83c5-f9c2e5d7548c@github.com> Message-ID: On Wed, 2 Jul 2025 20:43:35 GMT, Chad Rakoczy wrote: >> src/hotspot/share/code/nmethod.cpp line 1653: >> >>> 1651: } >>> 1652: } >>> 1653: } >> >> Do we need this code? Shouldn't missing trampolined be caught during fixing call sites? > > If fixing call sites fails (like in the event of a missing trampoline) an assert will fail and the JVM will crash. I suppose it could be updated to abandon the relocation if that happens but that would require `fix_relocation_after_move` to return if it succeeded and proper handling by the caller. This is only an issue because Hotspot reduces the branch range for debug builds on aarch64 and Graal doesn't. If we're going to handle this case I think we should fail fast but it does raise the question of what should actually be done in this situation ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2180940888 From bulasevich at openjdk.org Wed Jul 2 21:18:46 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Wed, 2 Jul 2025 21:18:46 GMT Subject: Integrated: 8358183: [JVMCI] crash accessing nmethod::jvmci_name in CodeCache::aggregate In-Reply-To: References: Message-ID: On Tue, 3 Jun 2025 06:39:18 GMT, Boris Ulasevich wrote: > This change addresses an intermittent crash in CompileBroker::print_heapinfo() when accessing JVMCI metadata after a CodeBlob::purge(). > > The issue is a regression after: > - JDK-8343789: JVMCI metadata was moved from nmethod into a separate blob. > - JDK-8352112: CodeBlob::purge() was updated to set _mutable_data to blob_end(). > > The change zeroes out _mutable_data_size, _relocation_size, and _metadata_size in purge() so that after purge jvmci_data_size() returns 0 and CompileBroker::print_heapinfo() won?t touch an invalid _metadata. This pull request has now been integrated. 
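The pattern described in that change can be reduced to a small standalone sketch (class and field names below are illustrative only, not the actual HotSpot code): once purge() clears the size fields, any reader that checks the size before following the pointer can no longer touch freed metadata.

```java
// Hypothetical sketch of the "zero the sizes on purge" idea from the description above.
class BlobSketch {
    private byte[] metadata = new byte[16];
    private int metadataSize = metadata.length;

    void purge() {
        metadata = null;     // backing storage is gone
        metadataSize = 0;    // record that there is nothing left to read
    }

    int metadataSize() { return metadataSize; }

    void printHeapInfo() {
        if (metadataSize() == 0) {
            return;          // skip aggregation instead of dereferencing a stale pointer
        }
        System.out.println("metadata bytes: " + metadata.length);
    }
}
```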
Changeset: 74822ce1 Author: Boris Ulasevich URL: https://git.openjdk.org/jdk/commit/74822ce12acaf9816aa49b75ab5817ced3710776 Stats: 3 lines in 2 files changed: 3 ins; 0 del; 0 mod 8358183: [JVMCI] crash accessing nmethod::jvmci_name in CodeCache::aggregate Reviewed-by: eastigeevich, phh ------------- PR: https://git.openjdk.org/jdk/pull/25608 From mdoerr at openjdk.org Wed Jul 2 21:43:39 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 2 Jul 2025 21:43:39 GMT Subject: [jdk25] RFR: 8361259: JDK25: Backout JDK-8258229 In-Reply-To: <9UlIKIQGn3vum3P71THXlyJwJ1efJmJNlImCpYErex8=.e794eff7-516d-4c5f-8e02-f15e5b34cba6@github.com> References: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> <9UlIKIQGn3vum3P71THXlyJwJ1efJmJNlImCpYErex8=.e794eff7-516d-4c5f-8e02-f15e5b34cba6@github.com> Message-ID: On Wed, 2 Jul 2025 20:11:16 GMT, Dean Long wrote: > Makes sense, but according to the Developers' Guide, we can't do that because "A Bug or Enhancement with resolution Fixed is required to have a corresponding changeset in one of the OpenJDK repositories." https://github.com/openjdk/jdk/commit/cf75f1f9c6d2bc70c7133cb81c73a0ce0946dff9 is a corresponding changset. We can link it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26091#issuecomment-3029400997 From duke at openjdk.org Wed Jul 2 22:11:41 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Wed, 2 Jul 2025 22:11:41 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v33] In-Reply-To: References: Message-ID: > This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). > > When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. > > This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality > > Additional Testing: > - [ ] Linux x64 fastdebug all > - [ ] Linux aarch64 fastdebug all > - [ ] ... Chad Rakoczy has updated the pull request incrementally with two additional commits since the last revision: - Enclose ImmutableDataReferencesCounterSize in parentheses - Let trampolines fix their owners ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23573/files - new: https://git.openjdk.org/jdk/pull/23573/files/70e4164e..c3245fb7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=32 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=31-32 Stats: 62 lines in 13 files changed: 11 ins; 19 del; 32 mod Patch: https://git.openjdk.org/jdk/pull/23573.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23573/head:pull/23573 PR: https://git.openjdk.org/jdk/pull/23573 From duke at openjdk.org Wed Jul 2 22:24:07 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Wed, 2 Jul 2025 22:24:07 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v34] In-Reply-To: References: Message-ID: > This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). 
It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). > > When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. > > This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality > > Additional Testing: > - [ ] Linux x64 fastdebug all > - [ ] Linux aarch64 fastdebug all > - [ ] ... Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: Update justification for skipping CallRelocation ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23573/files - new: https://git.openjdk.org/jdk/pull/23573/files/c3245fb7..0f4ff964 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=33 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=32-33 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23573.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23573/head:pull/23573 PR: https://git.openjdk.org/jdk/pull/23573 From duke at openjdk.org Wed Jul 2 22:24:09 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Wed, 2 Jul 2025 22:24:09 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 08:37:32 GMT, Andrew Haley wrote: >> Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 90 commits: >> >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Update how call sites are fixed >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Fix pointer printing >> - Use set_destination_mt_safe >> - Print address as pointer >> - Use new _metadata_size instead of _jvmci_data_size >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Only check branch distance for aarch64 and riscv >> - Move far branch fix to fix_relocation_after_move >> - ... and 80 more: https://git.openjdk.org/jdk/compare/f799cf18...70e4164e > > src/hotspot/cpu/aarch64/relocInfo_aarch64.cpp line 84: > >> 82: if (NativeCall::is_call_at(addr())) { >> 83: NativeCall* call = nativeCall_at(addr()); >> 84: if (be_safe) { > > Why is this change necessary? The original motivation was to address far call sites. After relocation, some calls that previously didn't require a trampoline might now need one, hence the introduction of the `be_safe` parameter. However, upon further review, this change is unnecessary. The method `trampoline_stub_Relocation::fix_relocation_after_move` already updates the owner and contains the logic to determine whether a direct call can be performed. Therefore, we can skip invoking `CallRelocation::fix_relocation_after_move` for calls that use trampolines, as all required adjustments will be handled correctly by the trampoline relocations. 
([Reference](https://github.com/chadrako/jdk/blob/0f4ff9646d1f7f43214c5ccd4bbe572fffd08d16/src/hotspot/share/code/nmethod.cpp#L1547-L1556)) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2181076900 From sviswanathan at openjdk.org Wed Jul 2 23:05:42 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 2 Jul 2025 23:05:42 GMT Subject: RFR: 8361037: [ubsan] compiler/c2/irTests/TestFloat16ScalarOperations division by 0 [v3] In-Reply-To: <6cWhCvx8g-Gx4VoBHW1wA7atsa_Eq5wBhkDolUbP_X0=.31f8e688-7401-4f81-9b50-46b1997e96b5@github.com> References: <6cWhCvx8g-Gx4VoBHW1wA7atsa_Eq5wBhkDolUbP_X0=.31f8e688-7401-4f81-9b50-46b1997e96b5@github.com> Message-ID: On Tue, 1 Jul 2025 13:36:20 GMT, Jatin Bhateja wrote: >> Floating point division by zero is undefined per the C and C++ standards, but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing either an infinity or NaN value. >> >> While Java semantics defined in section 15.17.2 "Division Operator" of JLS-24 are well-defined for these constant-folding scenarios >> >> This bug fix patch fixes division by 0 error reported after integration of [JDK-8352635.](https://bugs.openjdk.org/browse/JDK-8352635) >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Adding comments Looks good to me. It will be good to get second review. ------------- Marked as reviewed by sviswanathan (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26062#pullrequestreview-2980870863 From sparasa at openjdk.org Wed Jul 2 23:32:41 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Wed, 2 Jul 2025 23:32:41 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 17:44:34 GMT, Jatin Bhateja wrote: >>> For a cleaner interface, I think we can also maintain a RAII style APXPushPopPairTracker ... >> >> Using the suggested code as a base, Vamsi and I tinkered with the idea some more! Here is what we came up with. This also tracks the correct order of registers being pushed/poped.. (haven't compiled it, so might have some syntax bugs). >> >> @dholmes-ora would you mind sharing your opinion? We seem to be making things more complicated, but hopefully in a good way? >> >> Also included a sample usage in a stub. 
>> >> >> #define __ _masm-> >> >> class PushPopTracker { >> private: >> int _counter; >> MacroAssembler *_masm; >> const int REGS = 32; // Increase as needed >> int regs[REGS]; >> public: >> PushPopTracker(MacroAssembler *_masm) : _counter(0), _masm(_masm) {} >> ~PushPopTracker() { >> assert(_counter == 0, "Push/pop pair mismatch"); >> } >> >> void push(Register reg) { >> assert(_counter < REGS, "Push/pop overflow"); >> regs[_counter++] = reg.encoding(); >> if (VM_Version::supports_apx_f()) { >> __ pushp(reg); >> } else { >> __ push(reg); >> } >> } >> void pop(Register reg) { >> assert(_counter>0, "Push/pop underflow"); >> assert(regs[_counter] == reg.encoding(), "Push/pop pair mismatch: %d != %d", regs[_counter], reg.encoding()); >> _counter--; >> if (VM_Version::supports_apx_f()) { >> __ popp(reg); >> } else { >> __ pop(reg); >> } >> } >> } >> >> address StubGenerator::generate_intpoly_montgomeryMult_P256() { >> __ align(CodeEntryAlignment); >> /*...*/ >> address start = __ pc(); >> __ enter(); >> PushPopTracker s(_masm); >> s.push(r12); //1 >> s.push(r13); //2 >> s.push(r14); //3 >> #ifdef _WIN64 >> s.push(rsi); //4 >> s.push(rdi); //5 >> #endif >> s.push(rbp); //6 >> __ movq(rbp, rsp); >> __ andq(rsp, -32); >> __ subptr(rsp, 32); >> // Register Map >> const Register aLimbs = c_rarg0; // c_rarg0: rdi | rcx >> const Register bLimbs = rsi; // c_rarg1: rsi | rdx >> const Register rLimbs = r8; // c_rarg2: rdx | r8 >> const Register tmp1 = r9; >> const Register tmp2 = r10; >> /*...*/ >> __ movq(rsp, rbp); >> s.pop(rbp); //5 >> #ifdef _WIN64 >> s.pop(rdi); //4 >> s.pop(rsi); //3 >> #endif >> s.pop(r14); //2 >> s.pop(r13); //1 >> s.pop(r12); //0 >> __ leave(); >> __ ret(0); >> return start; >> } > > @vamsi-parasa, It's better to make this as a subclass of MacroAssembler in src/hotspot/cpu/x86/macroAssembler_x86.hpp and pass Tracker as an argument to push / pop for a cleaner interface. Hi Jatin (@jatin-bhateja) and Vlad (@vpaprotsk), There's one more issue to be considered. The C++ PushPopTracker code will be run during the stub generation time. There are code blocks which do a single push onto the stack but due to multiple exit paths, there will be multiple pops as illustrated below. Will this reference counting approach not fail in such a scenario as the stub code is generated all at once during the stub generation phase? #begin stack frame push(r21) #exit condition 1 pop(r21) # exit condition 2 pop(r21) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2181146890 From dlong at openjdk.org Wed Jul 2 23:53:39 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 2 Jul 2025 23:53:39 GMT Subject: [jdk25] RFR: 8361259: JDK25: Backout JDK-8258229 In-Reply-To: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> References: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> Message-ID: <1LxESKwrZ2cxtTlNTIKruyyebF-hportTvFYoYc4htY=.207724e0-418f-4289-8190-2545c74fc191@github.com> On Wed, 2 Jul 2025 10:54:13 GMT, Martin Doerr wrote: > This is a backout of [JDK-8258229](https://bugs.openjdk.org/browse/JDK-8258229) for JDK25 only. The problematic code has already been removed by [JDK-8358821](https://bugs.openjdk.org/browse/JDK-8358821) in JDK26. > > The backout is clean for the C++ code, but the test backout includes the backout of the follow-up change [JDK-8356310](https://bugs.openjdk.org/browse/JDK-8356310). > > Rationale: Minimize risk for JDK25. We should use the better fix JDK-8358821 in the long term.
However, that one should get some more stabilization time before backporting it. Also see JBS issue. > > Proposed long term solution: Backport JDK-8358821 to jdk25u and revert this change again after an appropriate time. > > Short term: The issue solved by JDK-8258229 is not critical. It should be ok to postpone the fix to jdk25u. Testing results look good. There was one timeout in a jshell test, but it seems unrelated. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26091#issuecomment-3029712997 From duke at openjdk.org Thu Jul 3 01:52:52 2025 From: duke at openjdk.org (erifan) Date: Thu, 3 Jul 2025 01:52:52 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v2] In-Reply-To: References: Message-ID: > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is relative smaller than that of `fromLong`. This patch does the conversion for these cases if `l` is a compile time constant. > > And this conversion also enables further optimizations that recognize maskAll patterns, see [1]. > > Some JTReg test cases are added to ensure the optimization is effective. > > I tried many different ways to write a JMH benchmark, but failed. Since the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific compile-time constant, the statement will be hoisted out of the loop. If we don't use a loop, the hotspot will become other instructions, and no obvious performance change was observed. However, combined with the optimization of [1], we can observe a performance improvement of about 7% on both aarch64 and x64. > > The patch was tested on both aarch64 and x64, all of tier1 tier2 and tier3 tests passed. > > [1] https://github.com/openjdk/jdk/pull/24674 erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - Address some review comments Add support for the following patterns: toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) toLong(maskAll(false)) => 0 And add more test cases. - Merge branch 'master' into JDK-8356760 - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is relative smaller than that of `fromLong`. This patch does the conversion for these cases if `l` is a compile time constant. And this conversion also enables further optimizations that recognize maskAll patterns, see [1]. Some JTReg test cases are added to ensure the optimization is effective. I tried many different ways to write a JMH benchmark, but failed. Since the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific compile-time constant, the statement will be hoisted out of the loop. If we don't use a loop, the hotspot will become other instructions, and no obvious performance change was observed. However, combined with the optimization of [1], we can observe a performance improvement of about 7% on both aarch64 and x64. The patch was tested on both aarch64 and x64, all of tier1 tier2 and tier3 tests passed. 
[1] https://github.com/openjdk/jdk/pull/24674 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25793/files - new: https://git.openjdk.org/jdk/pull/25793/files/38664b06..791e0ab7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=00-01 Stats: 24487 lines in 940 files changed: 11237 ins; 8323 del; 4927 mod Patch: https://git.openjdk.org/jdk/pull/25793.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25793/head:pull/25793 PR: https://git.openjdk.org/jdk/pull/25793 From duke at openjdk.org Thu Jul 3 02:00:49 2025 From: duke at openjdk.org (erifan) Date: Thu, 3 Jul 2025 02:00:49 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v2] In-Reply-To: References: Message-ID: <9NNhM-s8jWMJnb_DcTeEzeVBxpIYODi611mDQ-so7DQ=.a238b776-fb3b-43fe-b4ac-782d41c8d9aa@github.com> On Thu, 3 Jul 2025 01:52:52 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is relative smaller than that of `fromLong`. This patch does the conversion for these cases if `l` is a compile time constant. >> >> And this conversion also enables further optimizations that recognize maskAll patterns, see [1]. >> >> Some JTReg test cases are added to ensure the optimization is effective. >> >> I tried many different ways to write a JMH benchmark, but failed. Since the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific compile-time constant, the statement will be hoisted out of the loop. If we don't use a loop, the hotspot will become other instructions, and no obvious performance change was observed. However, combined with the optimization of [1], we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> The patch was tested on both aarch64 and x64, all of tier1 tier2 and tier3 tests passed. >> >> [1] https://github.com/openjdk/jdk/pull/24674 > > erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Address some review comments > > Add support for the following patterns: > toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) > toLong(maskAll(false)) => 0 > > And add more test cases. > - Merge branch 'master' into JDK-8356760 > - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases > > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would > set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent > to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is > relative smaller than that of `fromLong`. This patch does the conversion > for these cases if `l` is a compile time constant. > > And this conversion also enables further optimizations that recognize > maskAll patterns, see [1]. > > Some JTReg test cases are added to ensure the optimization is effective. > > I tried many different ways to write a JMH benchmark, but failed. Since > the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific > compile-time constant, the statement will be hoisted out of the loop. > If we don't use a loop, the hotspot will become other instructions, and > no obvious performance change was observed. 
However, combined with the > optimization of [1], we can observe a performance improvement of about > 7% on both aarch64 and x64. > > The patch was tested on both aarch64 and x64, all of tier1 tier2 and > tier3 tests passed. > > [1] https://github.com/openjdk/jdk/pull/24674 Thanks for your review! Would you mind taking another look, thanks! ------------- PR Review: https://git.openjdk.org/jdk/pull/25793#pullrequestreview-2981231350 From duke at openjdk.org Thu Jul 3 02:00:50 2025 From: duke at openjdk.org (erifan) Date: Thu, 3 Jul 2025 02:00:50 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v2] In-Reply-To: <34p1DHverqucTroSmERaeSx94Knl2FMfVWxedlij0JA=.a4ab7090-8a1c-421a-bc4b-7e1c17f03246@github.com> References: <34p1DHverqucTroSmERaeSx94Knl2FMfVWxedlij0JA=.a4ab7090-8a1c-421a-bc4b-7e1c17f03246@github.com> Message-ID: On Fri, 27 Jun 2025 06:04:54 GMT, Xiaohong Gong wrote: >> src/hotspot/share/opto/vectorIntrinsics.cpp line 80: >> >>> 78: return false; >>> 79: } >>> 80: long mask = (0xFFFFFFFFFFFFFFFFULL >> (64 - vlen)); >> >> The higher bits of long input should be cleared. So we should generate an unsigned right shift instead of the signed one? > > I noticed that you used `ULL` suffix. So it should be fine. Please ignore above comment. Thanks! Yeah, thanks~ ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2181359573 From duke at openjdk.org Thu Jul 3 02:00:51 2025 From: duke at openjdk.org (erifan) Date: Thu, 3 Jul 2025 02:00:51 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v2] In-Reply-To: <1mmwSiX2OCyFw8bKOj6U1yabINpsZiNblYbvAF8l6dM=.00a75235-c87a-4f04-b863-1f6dc046e4e4@github.com> References: <1mmwSiX2OCyFw8bKOj6U1yabINpsZiNblYbvAF8l6dM=.00a75235-c87a-4f04-b863-1f6dc046e4e4@github.com> Message-ID: On Thu, 26 Jun 2025 07:49:28 GMT, Xiaohong Gong wrote: >> erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - Address some review comments >> >> Add support for the following patterns: >> toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) >> toLong(maskAll(false)) => 0 >> >> And add more test cases. >> - Merge branch 'master' into JDK-8356760 >> - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases >> >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would >> set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent >> to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is >> relative smaller than that of `fromLong`. This patch does the conversion >> for these cases if `l` is a compile time constant. >> >> And this conversion also enables further optimizations that recognize >> maskAll patterns, see [1]. >> >> Some JTReg test cases are added to ensure the optimization is effective. >> >> I tried many different ways to write a JMH benchmark, but failed. Since >> the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific >> compile-time constant, the statement will be hoisted out of the loop. >> If we don't use a loop, the hotspot will become other instructions, and >> no obvious performance change was observed. However, combined with the >> optimization of [1], we can observe a performance improvement of about >> 7% on both aarch64 and x64. 
>> >> The patch was tested on both aarch64 and x64, all of tier1 tier2 and >> tier3 tests passed. >> >> [1] https://github.com/openjdk/jdk/pull/24674 > src/hotspot/share/opto/vectorIntrinsics.cpp line 706: > >> 704: opc = Op_Replicate; >> 705: elem_bt = converted_elem_bt; >> 706: bits = gvn().longcon(bits_type->get_con() == 0L ? 0L : -1L); > > Code style. Suggest: > > if (opc == Op_VectorLongToMask && > is_maskall_type(bits_type, num_elem) && > arch_supports_vector(Op_Replicate, num_elem, converted_elem_bt, checkFlags, true /*has_scalar_args*/)) { > opc = Op_Replicate; > elem_bt = converted_elem_bt; > bits = gvn().longcon(bits_type->get_con() == 0L ? 0L : -1L); > } else if ( Done > So if bits = 0xf0, and the vlen = 4, what is the expected mask? This is not possible because the input value has been processed in `VectorMask::fromLong`. See https://github.com/openjdk/jdk/blob/74822ce12acaf9816aa49b75ab5817ced3710776/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorMask.java#L242 But for safety, double checked the lowest bit. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2181360080 PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2181377150 From duke at openjdk.org Thu Jul 3 02:05:21 2025 From: duke at openjdk.org (hanguanqiang) Date: Thu, 3 Jul 2025 02:05:21 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode Message-ID: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode Problem: When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler's do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist. This causes an assertion failure and JVM crash. Root Cause: Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally, but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. Fix: Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. ------------- Commit messages: - 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode Changes: https://git.openjdk.org/jdk/pull/26108/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26108&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8358568 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26108.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26108/head:pull/26108 PR: https://git.openjdk.org/jdk/pull/26108 From dlong at openjdk.org Thu Jul 3 02:17:38 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 3 Jul 2025 02:17:38 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode In-Reply-To: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> Message-ID: On Thu, 3 Jul 2025 01:59:55 GMT, hanguanqiang wrote: > This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode > > Problem:
> When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. > > Root Cause? > Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. > > Fix > Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. I don't see the point of trying to support this flag. Can we just get rid of it? I don't think it is ever tested, because testing would surely crash unless the JVM ran as single-threaded somehow, which it doesn't. Maybe at some point this flag was useful for getting a new port limping along, but I think stubbing sync code would work just as well. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26108#issuecomment-3030292214 From haosun at openjdk.org Thu Jul 3 02:19:45 2025 From: haosun at openjdk.org (Hao Sun) Date: Thu, 3 Jul 2025 02:19:45 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v10] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 08:26:00 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. 
> > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments Overall, looks good to me except several nits. src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 5159: > 5157: // consecutive. The match rules for SelectFromTwoVector reserve two consecutive vector registers > 5158: // for src1 and src2. > 5159: // Four combinations of vector registers each for vselect_from_two_vectors_HS_Neon and I suppose the function names are changed now. Should use `select_from_two_vectors_Neon` and `select_from_two_vectors_SVE` instead. src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 5199: > 5197: __ select_from_two_vectors_SVE($dst$$FloatRegister, $src1$$FloatRegister, > 5198: $src2$$FloatRegister, $index$$FloatRegister, > 5199: $tmp$$FloatRegister, bt, length_in_bytes); nit: Inside `select_from_two_vectors_SVE()`, `bt` is only used to compute `elemType_to_regVariant(bt)`. I suggest using `get_reg_variant(this)` here directly. src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2886: > 2884: bool is_byte = (bt == T_BYTE); > 2885: > 2886: if (is_byte) { Suggestion: if (bt == T_BYTE) { src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2901: > 2899: } > 2900: } else { > 2901: int elemSize = (bt == T_SHORT) ? 2 : 4; nit: use `elem_size` src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2902: > 2900: } else { > 2901: int elemSize = (bt == T_SHORT) ? 2 : 4; > 2902: uint64_t tblOffset = (bt == T_SHORT) ? 0x0100u : 0x03020100u; nit: use `tbl_offset` src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.hpp line 197: > 195: > 196: // Select from a table of two vectors > 197: void select_from_two_vectors_Neon(FloatRegister dst, FloatRegister src1, FloatRegister src2, As for the function name, I suggest using `select_from_two_vectors_(neon|sve)`. E.g., `vector_signum_(neon|sve)` or `vector_round_(neon|sve)` as defined in this file. ------------- Marked as reviewed by haosun (Committer). PR Review: https://git.openjdk.org/jdk/pull/23570#pullrequestreview-2978225584 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2179445324 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2181370525 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2181383791 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2181384078 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2181384185 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2179465592 From xgong at openjdk.org Thu Jul 3 02:24:47 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 3 Jul 2025 02:24:47 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v2] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 01:52:52 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is relative smaller than that of `fromLong`. This patch does the conversion for these cases if `l` is a compile time constant. >> >> And this conversion also enables further optimizations that recognize maskAll patterns, see [1]. >> >> Some JTReg test cases are added to ensure the optimization is effective. >> >> I tried many different ways to write a JMH benchmark, but failed. 
Since the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific compile-time constant, the statement will be hoisted out of the loop. If we don't use a loop, the hotspot will become other instructions, and no obvious performance change was observed. However, combined with the optimization of [1], we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> The patch was tested on both aarch64 and x64, all of tier1 tier2 and tier3 tests passed. >> >> [1] https://github.com/openjdk/jdk/pull/24674 > > erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Address some review comments > > Add support for the following patterns: > toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) > toLong(maskAll(false)) => 0 > > And add more test cases. > - Merge branch 'master' into JDK-8356760 > - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases > > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would > set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent > to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is > relative smaller than that of `fromLong`. This patch does the conversion > for these cases if `l` is a compile time constant. > > And this conversion also enables further optimizations that recognize > maskAll patterns, see [1]. > > Some JTReg test cases are added to ensure the optimization is effective. > > I tried many different ways to write a JMH benchmark, but failed. Since > the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific > compile-time constant, the statement will be hoisted out of the loop. > If we don't use a loop, the hotspot will become other instructions, and > no obvious performance change was observed. However, combined with the > optimization of [1], we can observe a performance improvement of about > 7% on both aarch64 and x64. > > The patch was tested on both aarch64 and x64, all of tier1 tier2 and > tier3 tests passed. > > [1] https://github.com/openjdk/jdk/pull/24674 Looks much better to me. Thanks for your updating! ------------- Marked as reviewed by xgong (Committer). PR Review: https://git.openjdk.org/jdk/pull/25793#pullrequestreview-2981322138 From dlong at openjdk.org Thu Jul 3 02:35:45 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 3 Jul 2025 02:35:45 GMT Subject: [jdk25] RFR: 8361259: JDK25: Backout JDK-8258229 In-Reply-To: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> References: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> Message-ID: On Wed, 2 Jul 2025 10:54:13 GMT, Martin Doerr wrote: > This is a backout of [JDK-8258229](https://bugs.openjdk.org/browse/JDK-8258229) for JDK25 only. The problematic code has already been removed by [JDK-8358821](https://bugs.openjdk.org/browse/JDK-8358821) in JDK26. > > The backout is clean for the C++ code, but the test backout includes the backout of the follow-up change [JDK-8356310](https://bugs.openjdk.org/browse/JDK-8356310). > > Rationale: Minimize risk for JDK25. We should use the better fix JDK-8358821 in the long term. However, that one should get some more stabilization time before backporting it. Also see JBS issue. 
> > Proposed long term solution: Backport JDK-8358821 to jdk25u and revert this change again after an appropriate time. > > Short term: The issue solved by JDK-8258229 is not critical. It should be ok to postpone the fix to jdk25u. > > Makes sense, but according to the Developers' Guide, we can't do that because "A Bug or Enhancement with resolution Fixed is required to have a corresponding changeset in one of the OpenJDK repositories." > > [cf75f1f](https://github.com/openjdk/jdk/commit/cf75f1f9c6d2bc70c7133cb81c73a0ce0946dff9) is a corresponding changeset. We can link it. So two bugs would reference the same changeset, but the changeset only names 8358821? It might be better to close 8357017 as a duplicate instead of as Fixed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26091#issuecomment-3030333773 From duke at openjdk.org Thu Jul 3 03:16:43 2025 From: duke at openjdk.org (hanguanqiang) Date: Thu, 3 Jul 2025 03:16:43 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 01:59:55 GMT, hanguanqiang wrote: > This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode > > Problem: > When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler's do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist. This causes an assertion failure and JVM crash. > > Root Cause: > Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally, but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. > > Fix > Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. I've investigated some of the earliest versions of the source code, including JDK 6, but was unable to identify the original author of this flag or its intended purpose. In any case, if someone with the authority agrees that this flag is no longer relevant, I'd be glad to take on the task of removing it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26108#issuecomment-3030446171 From jkarthikeyan at openjdk.org Thu Jul 3 03:27:32 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Thu, 3 Jul 2025 03:27:32 GMT Subject: RFR: 8349563: Improve AbsNode::Value() for integer types [v4] In-Reply-To: References: Message-ID: <_sSUlLFhpG8Ton-bIB3u6Nf7YSxb8LQNzngDDLqrwcA=.5c456420-a5bd-406b-8cea-e6d2ac8d74c9@github.com> > Hi all, > This is a small patch that improves the implementation of Value() for `AbsINode` and `AbsLNode` by returning the absolute value of the input range. Most of the logic is trivial except for the special case where `_lo == jint_min/jlong_min` which must return the entire type range when encountered, for which I've added a small proof in the comments. I've also added some unit tests and updated the file to limit IR check platforms with more granularity. > > Thoughts and reviews would be appreciated! Jasmine Karthikeyan has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains six commits: - Code review and constant folding test - Merge - Replace uabs usage with ABS - Merge branch 'master' into abs-value - Merge - Improve AbsNode::Value ------------- Changes: https://git.openjdk.org/jdk/pull/23685/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23685&range=03 Stats: 299 lines in 2 files changed: 284 ins; 4 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/23685.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23685/head:pull/23685 PR: https://git.openjdk.org/jdk/pull/23685 From jkarthikeyan at openjdk.org Thu Jul 3 03:34:44 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Thu, 3 Jul 2025 03:34:44 GMT Subject: RFR: 8349563: Improve AbsNode::Value() for integer types [v3] In-Reply-To: References: Message-ID: On Wed, 28 May 2025 11:56:02 GMT, Emanuel Peter wrote: >> Jasmine Karthikeyan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: >> >> - Replace uabs usage with ABS >> - Merge branch 'master' into abs-value >> - Merge >> - Improve AbsNode::Value > > test/hotspot/jtreg/compiler/c2/irTests/TestIRAbs.java line 333: > >> 331: // [-9, -2] => [2, 9] >> 332: return Math.abs(-((in & 7) + 2)) > 9; >> 333: } > > Could we have some randomized cases here too? Or do we already have them somewhere? I've added support for randomized ranges and if statement folding as suggested in the review comment. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23685#discussion_r2181563680 From jkarthikeyan at openjdk.org Thu Jul 3 03:40:43 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Thu, 3 Jul 2025 03:40:43 GMT Subject: RFR: 8349563: Improve AbsNode::Value() for integer types [v3] In-Reply-To: References: Message-ID: On Wed, 28 May 2025 11:57:51 GMT, Emanuel Peter wrote: >> Jasmine Karthikeyan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: >> >> - Replace uabs usage with ABS >> - Merge branch 'master' into abs-value >> - Merge >> - Improve AbsNode::Value > > @jaskarth Nice work! I have a few comments below. > > One is about more randomized tests. I'm thinking about something like this: > > - compute `res = Math.abs(x)` > - Truncate `x` with randomly produced bounds from Generators, like this: `x = Math.max(lo, Math.min(hi, x))`. > - Below, add all sorts of comparisons with random constants, like this: `if (res < CON) { sum += 1; }`. If the output range is wrong, this could wrongly constant fold, and allow us to catch that. > - Then fuzz the generated method a few times with random inputs for `x`, and check that the sum and res value are the same for compiled and interpreted code. > > I hope that makes sense :) > This is currently my best method to check if ranges are correct, and I think it is quite important because often tests are only written with constants in mind, but less so with ranges, and then we mess up the ranges because it is just too tricky. > > This is an example, where I asked someone to try this out as well: > https://github.com/openjdk/jdk/pull/23089/files#diff-12bebea175a260a6ab62c22a3681ccae0c3d9027900d2fdbd8c5e856ae7d1123R404-R422 @eme64 Thanks for the review and comments! The method of checking for constant folding with if statements and range filtering you mentioned is pretty clever. I've adapted it to the test and added it to the PR. Let me know what you think! 
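For reference, the shape of such a randomized check is roughly the sketch below; the bounds LO/HI, the constant CON and the iteration count are made-up placeholders, not the values used in the actual TestIRAbs code.

```java
import java.util.Random;

// Illustrative sketch of the randomized range check described above; not the real test.
public class AbsRangeFuzz {
    static final int LO = -100, HI = 250, CON = 42;

    static long checkedAbs(int x) {
        x = Math.max(LO, Math.min(HI, x));  // truncate x to a known range
        int res = Math.abs(x);              // abs of a value in [LO, HI]
        long sum = res;
        if (res < CON) {                    // folds incorrectly if the computed range of res is wrong
            sum += 1;
        }
        return sum;
    }

    public static void main(String[] args) {
        Random r = new Random(42);
        long checksum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            checksum += checkedAbs(r.nextInt());
        }
        // A harness would compare this checksum between interpreted and compiled runs;
        // a bad AbsNode::Value() range shows up as a divergence.
        System.out.println(checksum);
    }
}
```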
> src/hotspot/share/opto/subnode.cpp line 1947: > >> 1945: >> 1946: return IntegerType::make(ABS(t->get_con())); >> 1947: } > > We used `uabs` before, what prevents you from doing that now? I guess you would need a templated version, hmm. Could be worth looking into creating one. There was an earlier discussion in the review: https://github.com/openjdk/jdk/pull/23685#discussion_r1972735806 Essentially, the implementation of `uabs` relies on converting ints/longs from signed to unsigned which is implementation defined until C++20. I believe the implementation works as expected on most platforms, but to be cautious I thought it would be better to just handle it manually to avoid any potential problems. We should revisit when we're at C++20 ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23685#issuecomment-3030523657 PR Review Comment: https://git.openjdk.org/jdk/pull/23685#discussion_r2181570566 From dholmes at openjdk.org Thu Jul 3 04:43:39 2025 From: dholmes at openjdk.org (David Holmes) Date: Thu, 3 Jul 2025 04:43:39 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode In-Reply-To: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> Message-ID: On Thu, 3 Jul 2025 01:59:55 GMT, hanguanqiang wrote: > This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode > > Problem? > When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. > > Root Cause? > Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. > > Fix > Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. The patch seems reasonable from a backporting perspective. Though it does beg the question as to why `do_monitor_enter` does not need the same fix. I suspect this is a very old flag and the code has bit-rotted somewhat. A question for the compiler folk: does `GenerateSynchronizationCode` still have any use or should it be scrapped? Thanks ------------- PR Review: https://git.openjdk.org/jdk/pull/26108#pullrequestreview-2981633438 From haosun at openjdk.org Thu Jul 3 04:47:41 2025 From: haosun at openjdk.org (Hao Sun) Date: Thu, 3 Jul 2025 04:47:41 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v6] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 08:48:59 GMT, Mikhail Ablakatov wrote: >> Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. >> >> Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still. >> >> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). 
To the best of my knowledge, openjdk/jdk is missing VectorAPI reducion micro-benchmarks. >> >> Benchmarks results: >> >> Neoverse-V1 (SVE 256-bit) >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms >> >> >> Fujitsu A64FX (SVE 512-bit): >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms > > Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: > > cleanup: update a copyright notice > > Co-authored-by: Hao Sun Hi. This PR involves the change to {Int Mul Reduction, FP Mul Reduction} X { auto-vectorization, VectorAPI}. After the offiline discussion with @XiaohongGong , we have one question about the impact of this PR on **FP Mul Reduction + auto-vectorization**. Here lists the change before and after this PR in whether **FP Mul Reduction + auto-vectorization** is on or off. | | Check | before | after| | :-------- | :-------: | --------: | --------: | | case-1 | UseSVE=0 | off | off | | case-2 | UseSVE>0 and length_in_bytes=8or16 | on | off | | case-3 | UseSVE>0 and length_in_bytes>16 | off | off | ## case-1 and case-2 Background: case-1 was set off after @fg1417 's patch [8275275: AArch64: Fix performance regression after auto-vectorization on NEON](https://github.com/openjdk/jdk/pull/10175). But case-2 was not touched. We are not sure about the reason. There was no 128b SVE machine then? Or there was some limitation of SLP on **reduction**? **Limitation** of SLP as mentioned in @fg1417 's patch > Because superword doesn't vectorize reductions unconnected with other vector packs, Performance data in this PR on case-2: From your provided [test data](https://github.com/openjdk/jdk/pull/23181#issuecomment-3018988067) on `Neoverse V2 (SVE 128-bit). Auto-vectorization section`, there is no obvious performance change on FP Mul Reduction benchmarks `(float|double)Mul(Big|Simple)`. As we checked the generated code of `floatMul(Big|Simple)` on Nvidia Grace machine(128b SVE2), we found that before this PR: - `floatMulBig` is vectorized. - `floatMulSimple` is not vectorized because SLP determines that there is no profit. Discussion: should we enable case-1 and case-2? - if the SLP limitation on reductions is fixed? - If there is no such limitation, we may consider enable case-1 and case-2 because a) there is perf regression at least based on current performance results and b) it may provide more auto-vectorization opportunities for other packs inside the loop. It would be appreciated if @eme64 or @fg1417 could provide more inputs. ## case-3 Status: this PR adds rules `reduce_mulF_gt128b` and `reduce_mulD_gt128b` but these two rules are **not** selected. See the [comment from Xiaohong](https://github.com/openjdk/jdk/pull/23181/files#r2176590314). Our suggestion: we're not sure if it's profitable to enable case-3. 
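For context, the loops in question are strictly-ordered floating-point multiply reductions of roughly the following shape (a minimal sketch, not the actual floatMulBig/floatMulSimple benchmark source):

```java
// Minimal sketch of an FP mul reduction candidate for SLP; names are illustrative.
static float mulReduce(float[] a) {
    float acc = 1.0f;
    for (int i = 0; i < a.length; i++) {
        acc *= a[i];  // FP multiply reduction: rounding makes the order observable,
                      // so the auto-vectorizer has to preserve the sequential order
    }
    return acc;
}
```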
Could you help do more test on `Neoverse V1 (SVE 256-bit)`? Note that local change should be made to enable case-3, e.g. removing [these lines](https://github.com/openjdk/jdk/pull/23181/files#diff-edf6d70f65d81dc12a483088e0610f4e059bd40697f242aedfed5c2da7475f1aR130-R136). Expected result: - If there is performance gain, we may consider enabling case-3 for auto-vectorization. - If there is no performance gain, we suggest removing these two match rules because they are dead code. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-3030705608 From dholmes at openjdk.org Thu Jul 3 04:55:39 2025 From: dholmes at openjdk.org (David Holmes) Date: Thu, 3 Jul 2025 04:55:39 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs In-Reply-To: References: Message-ID: On Thu, 19 Jun 2025 06:39:52 GMT, David Holmes wrote: >> The goal of this PR is to enhance the existing x86 assembly stubs using PUSH and POP instructions with paired PUSHP/POPP instructions which are part of Intel APX technology. >> >> In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0?R31). Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs. >> >> Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures. > > Just a drive-by comment as this isn't code I normally have much to do with but to me it would look a lot cleaner to define `push_paired`/`pop_paired` (maybe abbreviating directly to `pushp`/`popp`?) rather than passing the boolean. > @dholmes-ora would you mind sharing your opinion? We seem to be making things more complicated, but hopefully in a good way? Seems very complicated to me. Really this is for compiler folk to discuss. And as noted above this "tracker" class only helps where the push/pop are paired in the same scope. Personally I think a "pushp" that is defined to be a "push-paired" when available, else a regular "push", would suffice in terms of API design. But again this is for compiler folk to determine. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25889#issuecomment-3030744652 From epeter at openjdk.org Thu Jul 3 05:02:40 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 3 Jul 2025 05:02:40 GMT Subject: RFR: 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal [v3] In-Reply-To: References: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> Message-ID: On Wed, 2 Jul 2025 10:19:59 GMT, Beno?t Maillard wrote: >> This PR adds a node hash comparison after calling `Ideal` in `PhaseIterGVN::verify_Ideal_for` to introduce an additional layer of verification for missed optimizations. 
Previously, we relied on the return value of `Ideal`, which is expected to be `nullptr` if no transformation was done.
>> 
>> By also checking the node's hash before and after `Ideal`, we could catch inconsistencies in the implementation or unintended modifications to the graph. Both of these may indicate missed or incomplete optimizations.
>> 
>> ### Testing
>> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361144)
>> - [x] tier1-3, plus some internal testing
>> 
>> Thank you for reviewing!
>
> Benoît Maillard has updated the pull request incrementally with one additional commit since the last revision:
> 
>   8361144: add comment for consistency with node count

Thanks for adding this @benoitmaillard !
I don't think you need to add a regression test here. What you should do though: run tier1-3 + additional testing, once with the verification enabled and once without. Just to see if there are any cases that currently fail with this verification.

-------------

Marked as reviewed by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/26064#pullrequestreview-2981695058

From epeter at openjdk.org  Thu Jul 3 05:26:46 2025
From: epeter at openjdk.org (Emanuel Peter)
Date: Thu, 3 Jul 2025 05:26:46 GMT
Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v3]
In-Reply-To: 
References: 
Message-ID: 

On Mon, 30 Jun 2025 12:35:47 GMT, Mikhail Ablakatov wrote:

>> Mikhail Ablakatov has updated the pull request incrementally with two additional commits since the last revision:
>> 
>>  - fixup: don't modify the value in vsrc
>>    
>>    Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this
>>    change, the result of recursive folding is held in vtmp1. To be able to
>>    pass this intermediate result to reduce_mul_integral_le128b(), we would
>>    have to use another temporary FloatRegister, as vtmp1 would essentially
>>    act as vsrc. It's possible to get around this however:
>>    reduce_mul_integral_le128b() is modified so it's possible to pass
>>    matching vsrc and vtmp2 arguments. By doing this, we save ourselves a
>>    temporary register in rules that match to reduce_mul_integral_gt128b().
>>  - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formatting
>
> This patch improves mul reduction VectorAPIs on SVE targets with 256b or wider vectors. This comment also provides performance numbers for NEON / SVE 128b platforms that aren't expected to benefit from these implementations and for auto-vectorization benchmarks.
> 
> ### Neoverse N1 (NEON)
> 
> > Auto-vectorization > > | Benchmark | Before | After | Units | Diff | > |---------------------------|----------|----------|-------|------| > | mulRedD | 739.699 | 740.884 | ns/op | ~ | > | byteAddBig | 2670.248 | 2670.562 | ns/op | ~ | > | byteAddSimple | 1639.796 | 1639.940 | ns/op | ~ | > | byteMulBig | 2707.900 | 2708.063 | ns/op | ~ | > | byteMulSimple | 2452.939 | 2452.906 | ns/op | ~ | > | charAddBig | 2772.363 | 2772.269 | ns/op | ~ | > | charAddSimple | 1639.867 | 1639.751 | ns/op | ~ | > | charMulBig | 2796.533 | 2796.375 | ns/op | ~ | > | charMulSimple | 2453.034 | 2453.004 | ns/op | ~ | > | doubleAddBig | 2943.613 | 2936.897 | ns/op | ~ | > | doubleAddSimple | 1635.031 | 1634.797 | ns/op | ~ | > | doubleMulBig | 3001.937 | 3003.240 | ns/op | ~ | > | doubleMulSimple | 2448.154 | 2448.117 | ns/op | ~ | > | floatAddBig | 2963.086 | 2962.215 | ns/op | ~ | > | floatAddSimple | 1634.987 | 1634.798 | ns/op | ~ | > | floatMulBig | 3022.442 | 3021.356 | ns/op | ~ | > | floatMulSimple | 2447.976 | 2448.091 | ns/op | ~ | > | intAddBig | 832.346 | 832.382 | ns/op | ~ | > | intAddSimple | 841.276 | 841.287 | ns/op | ~ | > | intMulBig | 1245.155 | 1245.095 | ns/op | ~ | > | intMulSimple | 1638.762 | 1638.826 | ns/op | ~ | > | longAddBig | 4924.541 | 4924.328 | ns/op | ~ | > | longAddSimple | 841.623 | 841.625 | ns/op | ~ | > | longMulBig | 9848.954 | 9848.807 | ns/op | ~ | > | longMulSimple | 3427.169 | 3427.279 | ns/op | ~ | > | shortAddBig | 2670.027 | 2670.345 | ns/op | ~ | > | shortAddSimple | 1639.869 | 1639.876 | ns/op | ~ | > | shortMulBig | 2750.812 | 2750.562 | ns/op | ~ | > | shortMulSimple | 2453.030 | 2452.937 | ns/op | ~ | > >
>
> > VectorAPI > > | Benchmark ... @mikabl-arm @XiaohongGong I'm a little busy these weeks before going on vacation, so I won't have time to look into this more deeply. However, I do plan to remove the auto-vectorization restrictions for simple reductions. https://bugs.openjdk.org/browse/JDK-8307516 You can already now disable the (bad) reduction heuristic, using `AutoVectorizationOverrideProfitability`. https://bugs.openjdk.org/browse/JDK-8357530 I published benchmark results there: https://github.com/openjdk/jdk/pull/25387 You can see that enabling simple reductions is in most cases actually profitable now. But float/double add and mul have strict reduction order, and that usually prevents vectorization from being profitable. The strict-order vector reduction is quite expensive, and it only becomes beneficial if there is a lot of other code in the loop that can be vectorized. Soon, I plan to add a cost-model, so that we can predict if vectorization is profitable. It would also be nice to actually find a benchmark where float add/mul reductions lead to a speedup with vectorization. So far I have not seen any example in my benchmarks: https://github.com/openjdk/jdk/pull/25387 If you find any such example, please let me know ;) I don't have access to any SVE machines, so I cannot help you there, unfortunately. Is this helpful to you? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-3030798159 From epeter at openjdk.org Thu Jul 3 05:30:40 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 3 Jul 2025 05:30:40 GMT Subject: RFR: 8349563: Improve AbsNode::Value() for integer types [v3] In-Reply-To: References: Message-ID: <-iWCtGzKfoilC1bFXj726ZTS8glyDlqRdY76ddUdgb0=.2b303302-ee5b-4982-a72d-a56be53a5101@github.com> On Thu, 3 Jul 2025 03:38:13 GMT, Jasmine Karthikeyan wrote: >> @jaskarth Nice work! I have a few comments below. >> >> One is about more randomized tests. I'm thinking about something like this: >> >> - compute `res = Math.abs(x)` >> - Truncate `x` with randomly produced bounds from Generators, like this: `x = Math.max(lo, Math.min(hi, x))`. >> - Below, add all sorts of comparisons with random constants, like this: `if (res < CON) { sum += 1; }`. If the output range is wrong, this could wrongly constant fold, and allow us to catch that. >> - Then fuzz the generated method a few times with random inputs for `x`, and check that the sum and res value are the same for compiled and interpreted code. >> >> I hope that makes sense :) >> This is currently my best method to check if ranges are correct, and I think it is quite important because often tests are only written with constants in mind, but less so with ranges, and then we mess up the ranges because it is just too tricky. >> >> This is an example, where I asked someone to try this out as well: >> https://github.com/openjdk/jdk/pull/23089/files#diff-12bebea175a260a6ab62c22a3681ccae0c3d9027900d2fdbd8c5e856ae7d1123R404-R422 > > @eme64 Thanks for the review and comments! The method of checking for constant folding with if statements and range filtering you mentioned is pretty clever. I've adapted it to the test and added it to the PR. Let me know what you think! @jaskarth Nice, thanks for adding the range tests! Unfortunately, I'm quite busy before going on vacation. I hope someone else can review this. Otherwise I can come back to it in August. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23685#issuecomment-3030813369 From xgong at openjdk.org Thu Jul 3 05:56:41 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 3 Jul 2025 05:56:41 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 05:23:38 GMT, Emanuel Peter wrote: > You can see that enabling simple reductions is in most cases actually profitable now. But float/double add and mul have strict reduction order, and that usually prevents vectorization from being profitable. The strict-order vector reduction is quite expensive, and it only becomes beneficial if there is a lot of other code in the loop that can be vectorized. Soon, I plan to add a cost-model, so that we can predict if vectorization is profitable. > > It would also be nice to actually find a benchmark where float add/mul reductions lead to a speedup with vectorization. So far I have not seen any example in my benchmarks: https://github.com/openjdk/jdk/pull/25387 If you find any such example, please let me know ;) > > I don't have access to any SVE machines, so I cannot help you there, unfortunately. > >Is this helpful to you? Thanks for your input @eme64 ! It's really helpful to me. And it would be the right direction that using the cost model to guide whether vectorizing FP mul reduction is profitable or not. With this, I think the backend check of auto-vectorization for such operations can be removed safely. We can relay on the SLP's analysis. BTW, the current profitability heuristics can provide help on disabling auto-vectorization for the simple cases while enabling the complex ones. This is also helpful to us. I tested the performance of `VectorReduction2` with/without auto-vectorization for FP mul reductions on my SVE 128-bit machine. The performance difference is not very significant for both `floatMulSimple` and `floatMulBig`. But I guess the performance change would be different with auto-vectorization on HWs with larger vector size. As we do not have the SVE machines with larger vector size as well, we may need help from @mikabl-arm ! If the performance of `floatMulBig` is improved with auto-vectorization, I think we can remove the limitation of such reductions for auto-vectorization on AArch64. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-3030931690 From xgong at openjdk.org Thu Jul 3 06:10:28 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 3 Jul 2025 06:10:28 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: References: Message-ID: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> > ### Background > On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. > > For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. 
> > To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. > > ### Impact Analysis > #### 1. Vector types > Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. > > #### 2. Vector API > No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. > > #### 3. Auto-vectorization > Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. > > #### 4. Codegen of vector nodes > NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. > > Details: > - Lanewise vector operations are unaffected as explained above. > - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). > - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, adding an explicit vector size check in `match_rule_s... Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: Refine the comment in ad file ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26057/files - new: https://git.openjdk.org/jdk/pull/26057/files/4e15e588..dfda42a3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26057&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26057&range=01-02 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/26057.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26057/head:pull/26057 PR: https://git.openjdk.org/jdk/pull/26057 From mhaessig at openjdk.org Thu Jul 3 06:13:40 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Thu, 3 Jul 2025 06:13:40 GMT Subject: RFR: 8361037: [ubsan] compiler/c2/irTests/TestFloat16ScalarOperations division by 0 [v3] In-Reply-To: <6cWhCvx8g-Gx4VoBHW1wA7atsa_Eq5wBhkDolUbP_X0=.31f8e688-7401-4f81-9b50-46b1997e96b5@github.com> References: <6cWhCvx8g-Gx4VoBHW1wA7atsa_Eq5wBhkDolUbP_X0=.31f8e688-7401-4f81-9b50-46b1997e96b5@github.com> Message-ID: On Tue, 1 Jul 2025 13:36:20 GMT, Jatin Bhateja wrote: >> Floating point division by zero is undefined per the C and C++ standards, but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing either an infinity or NaN value. >> >> While Java semantics defined in section 15.17.2 "Division Operator" of JLS-24 are well-defined for these constant-folding scenarios >> >> This bug fix patch fixes division by 0 error reported after integration of [JDK-8352635.](https://bugs.openjdk.org/browse/JDK-8352635) >> Kindly review and share your feedback. 
>> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Adding comments Thank you for addressing my comments @jatin-bhateja. Looks good to me. ------------- Marked as reviewed by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/26062#pullrequestreview-2981858755 From dfenacci at openjdk.org Thu Jul 3 06:19:38 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Thu, 3 Jul 2025 06:19:38 GMT Subject: RFR: 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal [v3] In-Reply-To: References: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> Message-ID: On Wed, 2 Jul 2025 10:19:59 GMT, Beno?t Maillard wrote: >> This PR adds a node hash comparison after calling `Ideal` in `PhaseIterGVN::verify_Ideal_for` to introduce an additional layer of verification for missed optimizations. Previously, we relied on the return value of `Ideal`, which is expected to be `nullptr` if no transformation was done. >> >> By also checking the node's hash before and after `Ideal`, we could catch inconsistencies in the implementation or unintended modifications to the graph. Both of these may indicate missed or incomplete optimizations. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361144) >> - [x] tier1-3, plus some internal testing >> >> Thank you for reviewing! > > Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: > > 8361144: add comment for consistency with node count Looks good to me. Thanks @benoitmaillard! ------------- Marked as reviewed by dfenacci (Committer). PR Review: https://git.openjdk.org/jdk/pull/26064#pullrequestreview-2981870146 From duke at openjdk.org Thu Jul 3 07:03:49 2025 From: duke at openjdk.org (erifan) Date: Thu, 3 Jul 2025 07:03:49 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v7] In-Reply-To: <15TW6hiffz65NhHevPefL_6swSC07UD-GwiJ4tPDtFs=.b83081df-8abd-4756-b4e0-1d969678a0d2@github.com> References: <15TW6hiffz65NhHevPefL_6swSC07UD-GwiJ4tPDtFs=.b83081df-8abd-4756-b4e0-1d969678a0d2@github.com> Message-ID: On Thu, 5 Jun 2025 11:05:48 GMT, Emanuel Peter wrote: >>> > FYI: `BoolTest::negate` already does what you want: `mask negate( ) const { return mask(_test^4); }` I think you should use that instead :) >>> >>> Indeed, I hadn't noticed that, thank you. >> >> Oh I think we still cannot use `BoolTest::negate`, because we cannot instantiate a `BoolTest` object with **unsigned** comparison. `BoolTest::negate` is a non-static function. > >> Oh I think we still cannot use `BoolTest::negate`, because we cannot instantiate a `BoolTest` object with **unsigned** comparison. `BoolTest::negate` is a non-static function. > > I see. Ok. Hmm. I still think that the logic should be in `BoolTest`, because that is where the exact implementation of the enum values is. In that context it is easier to see why `^4` does the negation. And imagine we were ever to change the enum values, then it would be harder to find your code and fix it. > > Maybe it could be called `BoolTest::negate_mask(mast btm)` and explain in a comment that both signed and unsigned is supported. 
Hi @eme64 @jatin-bhateja , would you mind taking another look at this PR, thanks~

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24674#issuecomment-3031109432

From duke at openjdk.org  Thu Jul 3 07:10:22 2025
From: duke at openjdk.org (erifan)
Date: Thu, 3 Jul 2025 07:10:22 GMT
Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3]
In-Reply-To: 
References: 
Message-ID: 

> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is relatively lower than that of `fromLong`. This patch does the conversion for these cases if `l` is a compile time constant.
> 
> And this conversion also enables further optimizations that recognize maskAll patterns, see [1].
> 
> Some JTReg test cases are added to ensure the optimization is effective.
> 
> I tried many different ways to write a JMH benchmark, but failed. Since the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific compile-time constant, the statement will be hoisted out of the loop. If we don't use a loop, the hot spot moves to other instructions, and no obvious performance change was observed. However, combined with the optimization of [1], we can observe a performance improvement of about 7% on both aarch64 and x64.
> 
> The patch was tested on both aarch64 and x64, all of tier1 tier2 and tier3 tests passed.
> 
> [1] https://github.com/openjdk/jdk/pull/24674

erifan has updated the pull request incrementally with one additional commit since the last revision:

  Simplify the test code

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/25793/files
  - new: https://git.openjdk.org/jdk/pull/25793/files/791e0ab7..9f07d5c7

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=01-02

Stats: 233 lines in 3 files changed: 40 ins; 180 del; 13 mod
Patch: https://git.openjdk.org/jdk/pull/25793.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/25793/head:pull/25793

PR: https://git.openjdk.org/jdk/pull/25793

From duke at openjdk.org  Thu Jul 3 07:10:23 2025
From: duke at openjdk.org (erifan)
Date: Thu, 3 Jul 2025 07:10:23 GMT
Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v2]
In-Reply-To: 
References: 
Message-ID: 

On Thu, 3 Jul 2025 01:52:52 GMT, erifan wrote:

>> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is relatively lower than that of `fromLong`. This patch does the conversion for these cases if `l` is a compile time constant.
>> 
>> And this conversion also enables further optimizations that recognize maskAll patterns, see [1].
>> 
>> Some JTReg test cases are added to ensure the optimization is effective.
>> 
>> I tried many different ways to write a JMH benchmark, but failed. Since the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific compile-time constant, the statement will be hoisted out of the loop. If we don't use a loop, the hot spot moves to other instructions, and no obvious performance change was observed. However, combined with the optimization of [1], we can observe a performance improvement of about 7% on both aarch64 and x64.
>> >> The patch was tested on both aarch64 and x64, all of tier1 tier2 and tier3 tests passed. >> >> [1] https://github.com/openjdk/jdk/pull/24674 > > erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Address some review comments > > Add support for the following patterns: > toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) > toLong(maskAll(false)) => 0 > > And add more test cases. > - Merge branch 'master' into JDK-8356760 > - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases > > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would > set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent > to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is > relative smaller than that of `fromLong`. This patch does the conversion > for these cases if `l` is a compile time constant. > > And this conversion also enables further optimizations that recognize > maskAll patterns, see [1]. > > Some JTReg test cases are added to ensure the optimization is effective. > > I tried many different ways to write a JMH benchmark, but failed. Since > the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific > compile-time constant, the statement will be hoisted out of the loop. > If we don't use a loop, the hotspot will become other instructions, and > no obvious performance change was observed. However, combined with the > optimization of [1], we can observe a performance improvement of about > 7% on both aarch64 and x64. > > The patch was tested on both aarch64 and x64, all of tier1 tier2 and > tier3 tests passed. > > [1] https://github.com/openjdk/jdk/pull/24674 Hi @eme64 @jatin-bhateja , could you help review this PR? Thanks~ ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3031127486 From duke at openjdk.org Thu Jul 3 07:10:42 2025 From: duke at openjdk.org (duke) Date: Thu, 3 Jul 2025 07:10:42 GMT Subject: RFR: 8359602: Ideal optimizations depending on input type are missed because of missing notification mechanism from CCP [v5] In-Reply-To: References: <-_MqCH6QmE-o_d7c9-aet-Cq-ptZJ6CZV6rodpDNWq0=.173e6f7a-3cfe-4791-8253-36e06d892069@github.com> Message-ID: On Wed, 2 Jul 2025 07:19:30 GMT, Beno?t Maillard wrote: >> This PR prevents some missed ideal optimizations in IGVN by notifying users of type refinements made during CCP, addressing a missed optimization that caused a verification failure with `-XX:VerifyIterativeGVN=1110`. >> >> ### Context >> During the compilation of the input program (obtained from the fuzzer, then simplified and added as a test) by C2, we end up with node `591 ModI` that takes `138 Phi` as its divisor input. An existing `Ideal` optimization is to get rid of the control input of a `ModINode` when we can prove that the divisor is never `0`. >> >> In this specific case, the type of the `PhiNode` gets refined during CCP, but the refinement fails to propagate to its users for the IGVN phase and the ideal optimization for the `ModINode` never happens. This results in a missed optimization and hits an assert in the verification phase of IGVN (when using `-XX:VerifyIterativeGVN=1110`). 
>> >> ![IGV screenshot](https://github.com/user-attachments/assets/5dee1ae6-9146-4115-922d-df33b7ccbd37) >> >> ### Detailed Analysis >> >> In `PhaseCCP::analyze`, we call `Value` for the `PhiNode`, which >> results in a type refinement: the range gets restricted to `int:-13957..-1191`. >> >> ```c++ >> // Pull from worklist; compute new value; push changes out. >> // This loop is the meat of CCP. >> while (worklist.size() != 0) { >> Node* n = fetch_next_node(worklist); >> DEBUG_ONLY(worklist_verify.push(n);) >> if (n->is_SafePoint()) { >> // Make sure safepoints are processed by PhaseCCP::transform even if they are >> // not reachable from the bottom. Otherwise, infinite loops would be removed. >> _root_and_safepoints.push(n); >> } >> const Type* new_type = n->Value(this); >> if (new_type != type(n)) { >> DEBUG_ONLY(verify_type(n, new_type, type(n));) >> dump_type_and_node(n, new_type); >> set_type(n, new_type); >> push_child_nodes_to_worklist(worklist, n); >> } >> if (KillPathsReachableByDeadTypeNode && n->is_Type() && new_type == Type::TOP) { >> // Keep track of Type nodes to kill CFG paths that use Type >> // nodes that become dead. >> _maybe_top_type_nodes.push(n); >> } >> } >> DEBUG_ONLY(verify_analyze(worklist_verify);) >> >> >> At the end of `PhaseCCP::analyze`, we obtain the following types in the side table: >> - `int` for node `591` (`ModINode`) >> - `int:-13957..-1191` for node `138` (`PhiNode`) >> >> If we call `find_node(138)->bottom_type()`, we get: >> - `int` for both nodes >> >> The... > > Beno?t Maillard has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: > > - Fix bad test class name > - 8359602: rename test > - 8359602: remove requires.debug=true and add -XX:+IgnoreUnrecognizedVMOptions flag > - 8359602: add comment > - 8359602: add test summary and comments > - 8359602: tag requires vm.debug == true > - 8359602: Add test from fuzzer > - 8359602: Add users to IGVN worklist when type is refined in CCP @benoitmaillard Your change (at version a66d3fb492541a17e28b3e0fe0f60080c14bdc2c) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26017#issuecomment-3031130840 From duke at openjdk.org Thu Jul 3 07:17:43 2025 From: duke at openjdk.org (duke) Date: Thu, 3 Jul 2025 07:17:43 GMT Subject: RFR: 8357739: [jittester] disable the hashCode method In-Reply-To: References: Message-ID: On Tue, 17 Jun 2025 19:49:34 GMT, Evgeny Nikitin wrote: > JITTester often uses the `hasCode` method (in fact, in almost every generated test). Given that the method can be unstable between runs or in interpreted vs compiled runs, it can create false-positives. > > This PR fixes the issue by adding support for method templates similar to the ones used in CompilerCommands). All of those exclude templates match (and exclude) `String.indexOf(String)`, for example: > > java/lang/::*(Ljava/lang/String;I) > *String::indexOf(*) > java/lang/*::indexOf > > > Additionally, the PR adds support for comments (starting from '#') and empty lines in the excludes file. @lepestock Your change (at version 5c9a71b9c5b6f418a97e6b0557431aafc73addc6) is now ready to be sponsored by a Committer. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/25859#issuecomment-3031148211 From thartmann at openjdk.org Thu Jul 3 07:22:39 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 3 Jul 2025 07:22:39 GMT Subject: RFR: 8360557: CTW: Inline cold methods to reach more code In-Reply-To: <-7cfzVghCWnUCfB1F3dcyG2fvJGnqREUW98qiVJEvQQ=.db06fb1e-e96e-4e00-bac0-098b4e1de54c@github.com> References: <-7cfzVghCWnUCfB1F3dcyG2fvJGnqREUW98qiVJEvQQ=.db06fb1e-e96e-4e00-bac0-098b4e1de54c@github.com> Message-ID: On Wed, 2 Jul 2025 07:16:44 GMT, Tobias Hartmann wrote: > I submitted some testing to make sure that CTW is clean in our CI. I see the following crashes that would need to be fixed before this is integrated: # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/workspace/open/src/hotspot/share/opto/phaseX.cpp:2790), pid=3196445, tid=3196462 # assert(!failure) failed: PhaseCCP not at fixpoint: analysis result may be unsound. # # JRE version: Java(TM) SE Runtime Environment (26.0) (fastdebug build 26-internal-2025-07-02-0711056.tobias.hartmann.jdk4) # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 26-internal-2025-07-02-0711056.tobias.hartmann.jdk4, mixed mode, sharing, compressed oops, compressed class ptrs, g1 gc, linux-amd64) # Problematic frame: # V [libjvm.so+0x180cff8] PhaseCCP::verify_analyze(Unique_Node_List&) [clone .part.0]+0x28 Current CompileTask: C2:13166 2238 b com.ibm.icu.impl.LocaleUtility::fallback (78 bytes) Stack: [0x00007f20eca0c000,0x00007f20ecb0c000], sp=0x00007f20ecb07050, free space=1004k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x180cff8] PhaseCCP::verify_analyze(Unique_Node_List&) [clone .part.0]+0x28 (phaseX.cpp:2790) V [libjvm.so+0x181e8aa] PhaseCCP::analyze()+0x7ca (phaseX.cpp:2790) V [libjvm.so+0xb44c94] Compile::Optimize()+0x964 (compile.cpp:2479) V [libjvm.so+0xb480d3] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1ec3 (compile.cpp:858) V [libjvm.so+0x96d157] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x467 (c2compiler.cpp:141) V [libjvm.so+0xb574f8] CompileBroker::invoke_compiler_on_method(CompileTask*)+0xb58 (compileBroker.cpp:2323) V [libjvm.so+0xb586c8] CompileBroker::compiler_thread_loop()+0x578 (compileBroker.cpp:1967) V [libjvm.so+0x10abd0b] JavaThread::thread_main_inner()+0x13b (javaThread.cpp:773) V [libjvm.so+0x1b11f26] Thread::call_run()+0xb6 (thread.cpp:243) V [libjvm.so+0x178c718] thread_native_entry(Thread*)+0x128 (os_linux.cpp:868) # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/workspace/open/src/hotspot/share/opto/phaseX.cpp:784), pid=2175071, tid=2175089 # assert(no_dead_loop) failed: dead loop detected # # JRE version: Java(TM) SE Runtime Environment (26.0) (fastdebug build 26-internal-2025-07-02-0711056.tobias.hartmann.jdk4) # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 26-internal-2025-07-02-0711056.tobias.hartmann.jdk4, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64) # Problematic frame: # V [libjvm.so+0x180d285] PhaseGVN::dead_loop_check(Node*) [clone .part.0]+0x1d5 Current CompileTask: C2:4914 2051 !b 4 com.sun.beans.introspect.MethodInfo::get (273 bytes) Stack: [0x00007fe603f00000,0x00007fe604000000], sp=0x00007fe603ffaef0, free space=1003k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x180d285] PhaseGVN::dead_loop_check(Node*) [clone 
.part.0]+0x1d5 (phaseX.cpp:784) V [libjvm.so+0x181c309] PhaseIterGVN::transform_old(Node*)+0x529 (phaseX.cpp:767) V [libjvm.so+0x1820505] PhaseIterGVN::optimize()+0xc5 (phaseX.cpp:1054) V [libjvm.so+0xb414ba] Compile::inline_incrementally_cleanup(PhaseIterGVN&)+0x2ca (compile.cpp:2151) V [libjvm.so+0xb41ed6] Compile::inline_incrementally(PhaseIterGVN&)+0x416 (compile.cpp:2201) V [libjvm.so+0xb447ae] Compile::Optimize()+0x47e (compile.cpp:2329) V [libjvm.so+0xb480d3] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1ec3 (compile.cpp:858) V [libjvm.so+0x96d157] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x467 (c2compiler.cpp:141) V [libjvm.so+0xb574f8] CompileBroker::invoke_compiler_on_method(CompileTask*)+0xb58 (compileBroker.cpp:2323) V [libjvm.so+0xb586c8] CompileBroker::compiler_thread_loop()+0x578 (compileBroker.cpp:1967) V [libjvm.so+0x10abd0b] JavaThread::thread_main_inner()+0x13b (javaThread.cpp:773) V [libjvm.so+0x1b11f26] Thread::call_run()+0xb6 (thread.cpp:243) V [libjvm.so+0x178c718] thread_native_entry(Thread*)+0x128 (os_linux.cpp:868) ------------- PR Comment: https://git.openjdk.org/jdk/pull/26068#issuecomment-3031160931 From bmaillard at openjdk.org Thu Jul 3 07:30:48 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Thu, 3 Jul 2025 07:30:48 GMT Subject: Integrated: 8359602: Ideal optimizations depending on input type are missed because of missing notification mechanism from CCP In-Reply-To: <-_MqCH6QmE-o_d7c9-aet-Cq-ptZJ6CZV6rodpDNWq0=.173e6f7a-3cfe-4791-8253-36e06d892069@github.com> References: <-_MqCH6QmE-o_d7c9-aet-Cq-ptZJ6CZV6rodpDNWq0=.173e6f7a-3cfe-4791-8253-36e06d892069@github.com> Message-ID: On Fri, 27 Jun 2025 10:59:57 GMT, Beno?t Maillard wrote: > This PR prevents some missed ideal optimizations in IGVN by notifying users of type refinements made during CCP, addressing a missed optimization that caused a verification failure with `-XX:VerifyIterativeGVN=1110`. > > ### Context > During the compilation of the input program (obtained from the fuzzer, then simplified and added as a test) by C2, we end up with node `591 ModI` that takes `138 Phi` as its divisor input. An existing `Ideal` optimization is to get rid of the control input of a `ModINode` when we can prove that the divisor is never `0`. > > In this specific case, the type of the `PhiNode` gets refined during CCP, but the refinement fails to propagate to its users for the IGVN phase and the ideal optimization for the `ModINode` never happens. This results in a missed optimization and hits an assert in the verification phase of IGVN (when using `-XX:VerifyIterativeGVN=1110`). > > ![IGV screenshot](https://github.com/user-attachments/assets/5dee1ae6-9146-4115-922d-df33b7ccbd37) > > ### Detailed Analysis > > In `PhaseCCP::analyze`, we call `Value` for the `PhiNode`, which > results in a type refinement: the range gets restricted to `int:-13957..-1191`. > > ```c++ > // Pull from worklist; compute new value; push changes out. > // This loop is the meat of CCP. > while (worklist.size() != 0) { > Node* n = fetch_next_node(worklist); > DEBUG_ONLY(worklist_verify.push(n);) > if (n->is_SafePoint()) { > // Make sure safepoints are processed by PhaseCCP::transform even if they are > // not reachable from the bottom. Otherwise, infinite loops would be removed. 
> _root_and_safepoints.push(n); > } > const Type* new_type = n->Value(this); > if (new_type != type(n)) { > DEBUG_ONLY(verify_type(n, new_type, type(n));) > dump_type_and_node(n, new_type); > set_type(n, new_type); > push_child_nodes_to_worklist(worklist, n); > } > if (KillPathsReachableByDeadTypeNode && n->is_Type() && new_type == Type::TOP) { > // Keep track of Type nodes to kill CFG paths that use Type > // nodes that become dead. > _maybe_top_type_nodes.push(n); > } > } > DEBUG_ONLY(verify_analyze(worklist_verify);) > > > At the end of `PhaseCCP::analyze`, we obtain the following types in the side table: > - `int` for node `591` (`ModINode`) > - `int:-13957..-1191` for node `138` (`PhiNode`) > > If we call `find_node(138)->bottom_type()`, we get: > - `int` for both nodes > > There is no progress on the type of `ModINode` during CCP, because `ModINode::Value` > is not able to... This pull request has now been integrated. Changeset: c75df634 Author: Beno?t Maillard Committer: Emanuel Peter URL: https://git.openjdk.org/jdk/commit/c75df634be9a0073fa246d42e5c362a09f1734f3 Stats: 61 lines in 2 files changed: 61 ins; 0 del; 0 mod 8359602: Ideal optimizations depending on input type are missed because of missing notification mechanism from CCP Reviewed-by: epeter, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/26017 From jbhateja at openjdk.org Thu Jul 3 08:06:43 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 3 Jul 2025 08:06:43 GMT Subject: RFR: 8361037: [ubsan] compiler/c2/irTests/TestFloat16ScalarOperations division by 0 [v3] In-Reply-To: References: <6cWhCvx8g-Gx4VoBHW1wA7atsa_Eq5wBhkDolUbP_X0=.31f8e688-7401-4f81-9b50-46b1997e96b5@github.com> Message-ID: On Wed, 2 Jul 2025 23:02:37 GMT, Sandhya Viswanathan wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Adding comments > > Looks good to me. It will be good to get second review. Thanks @sviswa7 and @mhaessig ------------- PR Comment: https://git.openjdk.org/jdk/pull/26062#issuecomment-3031285321 From jbhateja at openjdk.org Thu Jul 3 08:06:44 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 3 Jul 2025 08:06:44 GMT Subject: Integrated: 8361037: [ubsan] compiler/c2/irTests/TestFloat16ScalarOperations division by 0 In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 10:08:20 GMT, Jatin Bhateja wrote: > Floating point division by zero is undefined per the C and C++ standards, but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing either an infinity or NaN value. > > While Java semantics defined in section 15.17.2 "Division Operator" of JLS-24 are well-defined for these constant-folding scenarios > > This bug fix patch fixes division by 0 error reported after integration of [JDK-8352635.](https://bugs.openjdk.org/browse/JDK-8352635) > Kindly review and share your feedback. > > Best Regards, > Jatin This pull request has now been integrated. 
Changeset: 2f683fdc Author: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/2f683fdc4a8f9c227e878b0d7fca645fc8abe1b6 Stats: 23 lines in 1 file changed: 23 ins; 0 del; 0 mod 8361037: [ubsan] compiler/c2/irTests/TestFloat16ScalarOperations division by 0 Reviewed-by: mhaessig, sviswanathan ------------- PR: https://git.openjdk.org/jdk/pull/26062 From bmaillard at openjdk.org Thu Jul 3 08:09:41 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Thu, 3 Jul 2025 08:09:41 GMT Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v3] In-Reply-To: References: Message-ID: On Mon, 23 Jun 2025 12:39:23 GMT, Marc Chevalier wrote: >> A first part toward a better support of pure functions, but this time, with guidance from @iwanowww. >> >> ## Pure Functions >> >> Pure functions (considered here) are functions that have no side effects, no effect on the control flow (no exception or such), cannot deopt etc.. It's really a function that you can execute anywhere, with whichever arguments without effect other than wasting time. Integer division is not pure as dividing by zero is throwing. But many floating point functions will just return `NaN` or `+/-infinity` in problematic cases. >> >> ## Scope >> >> We are not going all powerful for now! It's mostly about identifying some pure functions and being able to remove them if the result is unused. Some other things are not part of this PR, on purpose. Especially, this PR doesn't propose a way to move pure calls around. The reason is that pure calls are later expanded into regular calls, which require a control input. To be able to do the expansion, we just keep the control in the pure call as well. >> >> ## Implementation Overview >> >> We created here some new node kind for pure calls, inheriting leaf calls, that are expanded into regular leaf calls during final graph reshaping. The possibility to support pure call directly in AD file is left open. >> >> This PR also introduces `TupleNode` (largely based on an original idea/implem of @iwanowww), that just tie multiple input together and play well with `ProjNode`: the n-th projection of a `TupleNode` is the n-th input of the tuple. This is a convenient way to skip and remove nodes from the graph while delegating the difficulty of the surgery to the trusted IGVN's implementation. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > mostly comments src/hotspot/share/opto/parse2.cpp line 1100: > 1098: Node* Parse::floating_point_mod(Node* a, Node* b, BasicType type) { > 1099: assert(type == BasicType::T_FLOAT || type == BasicType::T_DOUBLE, "only float and double are floating points"); > 1100: CallLeafPureNode* mod = type == BasicType::T_DOUBLE ? static_cast(new ModDNode(C, a, b)) : new ModFNode(C, a, b); May I ask why we only need the `static_cast` for the `ModDNode` here? 
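For context, a minimal standalone sketch of what I assume the cast is working around (hypothetical `Base`/`DerivedA`/`DerivedB` types, not the actual node classes):

```c++
struct Base            { virtual ~Base() = default; };
struct DerivedA : Base { };
struct DerivedB : Base { };

Base* make(bool flag) {
    // Ill-formed: the two branches have unrelated pointer types (DerivedA* vs.
    // DerivedB*), so the conditional expression has no common type.
    //   return flag ? new DerivedA() : new DerivedB();

    // OK: casting one branch to the common base gives the ternary a result
    // type, and the other branch then converts to Base* implicitly.
    return flag ? static_cast<Base*>(new DerivedA()) : new DerivedB();
}
```

Is the common-type requirement of the conditional operator the only reason, or is there something more subtle going on?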
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25760#discussion_r2180229177 From eastigeevich at openjdk.org Thu Jul 3 08:18:56 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Thu, 3 Jul 2025 08:18:56 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v2] In-Reply-To: References: Message-ID: <6CyXvRWJLHBSZxw6E0TJPva7X2RoqBZjE5b0q4oqVas=.b9a1e93d-5209-4cb0-b9b0-b1fac2e696e1@github.com> On Tue, 1 Jul 2025 16:05:07 GMT, Evgeny Astigeevich wrote: >> Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. >> >> This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. >> >> Tested on Linux and MacOS with and without hsdis: >> - Fastdebug: test passed >> - Slowdebug: test passed. >> - Release: test passed. > > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Simplify requirement for debug build I have rewritten the test not to use debug info at all. The test works with instructions instead. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3031319871 From eastigeevich at openjdk.org Thu Jul 3 08:18:56 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Thu, 3 Jul 2025 08:18:56 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v3] In-Reply-To: References: Message-ID: > Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. > > This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. > > Tested on Linux and MacOS with and without hsdis: > - Fastdebug: test passed > - Slowdebug: test passed. > - Release: test passed. 
Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision:

  Reimplement checking algo without using debug info

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/26072/files
  - new: https://git.openjdk.org/jdk/pull/26072/files/e91036bc..0b3320e6

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=01-02

Stats: 139 lines in 1 file changed: 49 ins; 66 del; 24 mod
Patch: https://git.openjdk.org/jdk/pull/26072.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/26072/head:pull/26072

PR: https://git.openjdk.org/jdk/pull/26072

From mchevalier at openjdk.org  Thu Jul 3 08:21:41 2025
From: mchevalier at openjdk.org (Marc Chevalier)
Date: Thu, 3 Jul 2025 08:21:41 GMT
Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v3]
In-Reply-To: 
References: 
Message-ID: 

On Wed, 2 Jul 2025 14:35:26 GMT, Benoît Maillard wrote:

>> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   mostly comments
>
> src/hotspot/share/opto/parse2.cpp line 1100:
> 
>> 1098: Node* Parse::floating_point_mod(Node* a, Node* b, BasicType type) {
>> 1099:   assert(type == BasicType::T_FLOAT || type == BasicType::T_DOUBLE, "only float and double are floating points");
>> 1100:   CallLeafPureNode* mod = type == BasicType::T_DOUBLE ? static_cast<CallLeafPureNode*>(new ModDNode(C, a, b)) : new ModFNode(C, a, b);
> 
> May I ask why we only need the `static_cast` for the `ModDNode` here?

It's C/C++ being annoying here: both branches of the ternary must have the same type, or something compatible. If I remove the cast:

    error: conditional expression between distinct pointer types 'ModDNode*' and 'ModFNode*' lacks a cast

With the cast, C++ can convert the `ModFNode*` into a `CallLeafPureNode*` just fine. I didn't invent the cast, it was here before, but good to question it.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/25760#discussion_r2182166387

From alanb at openjdk.org  Thu Jul 3 08:39:40 2025
From: alanb at openjdk.org (Alan Bateman)
Date: Thu, 3 Jul 2025 08:39:40 GMT
Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining
In-Reply-To: 
References: 
Message-ID: 

On Sun, 29 Jun 2025 15:26:14 GMT, Richard Reingruber wrote:

> This PR adds CompileCommands to the test DumpThreadsWithEliminatedLock.java to force inlining of java/lang/String*.* methods. This will make inlining more stable to allow for the expected lock elimination based on c2 escape analysis.
> 
> Forcing inlining of java/lang/StringBuffer.* wasn't sufficient on x86_64. With that the test still failed with TieredCompilation disabled.
> 
> Testing: x86_64, ppc64 manually. Other major platforms as part of our CI testing.
> 
> Failed inlining on x86_64 with TieredCompilation disabled:
> 
> 
> make test TEST=com/sun/management/HotSpotDiagnosticMXBean/DumpThreadsWithEliminatedLock.java TEST_VM_OPTS="-XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=PrintInlining,DumpThreadsWithEliminatedLock.*" JTREG=TIMEOUT_FACTOR=0.1
> 
> [...]
> > STDOUT: > CompileCommand: PrintInlining DumpThreadsWithEliminatedLock.* bool PrintInlining = true > @ 1 java.util.concurrent.atomic.AtomicBoolean::get (13 bytes) inline (hot) > @ 11 java.lang.StringBuffer:: (7 bytes) inline (hot) late inline succeeded (string method) > @ 3 java.lang.AbstractStringBuilder:: (39 bytes) inline (hot) > @ 1 java.lang.Object:: (1 bytes) inline (hot) > @ 16 java.lang.System::currentTimeMillis (0 bytes) (intrinsic) > s @ 19 java.lang.StringBuffer::append (13 bytes) failed to inline: already compiled into a big method > s @ 24 java.lang.StringBuffer::toString (44 bytes) inline (hot) late inline succeeded (string method) > s @ 1 java.lang.StringBuffer::length (5 bytes) accessor > @ 24 java.lang.String:: (98 bytes) failed to inline: already compiled into a big method > @ 30 java.util.concurrent.atomic.AtomicReference::set (6 bytes) accessor > 2025-07-02T09:25:53.396634900Z Attempt 1, found: false > 2025-07-02T09:25:53.415673072Z Attempt 2, found: false > 2025-07-02T09:25:53.418876867Z Attempt 3, found: false > > [...] Thanks for improving this, this test was intended unstable. It might be that it could be updated to work with debug or -Xcomp too, execution times would need to be checked out. ------------- Marked as reviewed by alanb (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26033#pullrequestreview-2982264097 From mdoerr at openjdk.org Thu Jul 3 08:55:47 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 3 Jul 2025 08:55:47 GMT Subject: [jdk25] RFR: 8361259: JDK25: Backout JDK-8258229 In-Reply-To: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> References: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> Message-ID: <9gtw9iF8JY7RV3rnUau07YX5UfBJD5phY9yq_q16imE=.08ef8dc9-3ee5-4ea0-a5ea-661b5f12f9ed@github.com> On Wed, 2 Jul 2025 10:54:13 GMT, Martin Doerr wrote: > This is a backout of [JDK-8258229](https://bugs.openjdk.org/browse/JDK-8258229) for JDK25 only. The problematic code has already been removed by [JDK-8358821](https://bugs.openjdk.org/browse/JDK-8358821) in JDK26. > > The backout is clean for the C++ code, but the test backout includes the backout of the follow-up change [JDK-8356310](https://bugs.openjdk.org/browse/JDK-8356310). > > Rationale: Minimize risk for JDK25. We should use the better fix JDK-8358821 in the long term. However, that one should get some more stabilization time before backporting it. Also see JBS issue. > > Proposed long term solution: Backport JDK-8358821 to jdk25u and revert this change again after an appropriate time. > > Short term: The issue solved by JDK-8258229 is not critical. It should be ok to postpone the fix to jdk25u. Tests are also green on our side. Let's ship it! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26091#issuecomment-3031435423 From mdoerr at openjdk.org Thu Jul 3 08:55:47 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 3 Jul 2025 08:55:47 GMT Subject: [jdk25] Integrated: 8361259: JDK25: Backout JDK-8258229 In-Reply-To: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> References: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> Message-ID: On Wed, 2 Jul 2025 10:54:13 GMT, Martin Doerr wrote: > This is a backout of [JDK-8258229](https://bugs.openjdk.org/browse/JDK-8258229) for JDK25 only. 
The problematic code has already been removed by [JDK-8358821](https://bugs.openjdk.org/browse/JDK-8358821) in JDK26. > > The backout is clean for the C++ code, but the test backout includes the backout of the follow-up change [JDK-8356310](https://bugs.openjdk.org/browse/JDK-8356310). > > Rationale: Minimize risk for JDK25. We should use the better fix JDK-8358821 in the long term. However, that one should get some more stabilization time before backporting it. Also see JBS issue. > > Proposed long term solution: Backport JDK-8358821 to jdk25u and revert this change again after an appropriate time. > > Short term: The issue solved by JDK-8258229 is not critical. It should be ok to postpone the fix to jdk25u. This pull request has now been integrated. Changeset: 993215f3 Author: Martin Doerr URL: https://git.openjdk.org/jdk/commit/993215f3dd7aba221da8c901117a8ff3f0ccb675 Stats: 93 lines in 2 files changed: 0 ins; 93 del; 0 mod 8361259: JDK25: Backout JDK-8258229 Reviewed-by: mhaessig, thartmann, dlong ------------- PR: https://git.openjdk.org/jdk/pull/26091 From mdoerr at openjdk.org Thu Jul 3 09:58:47 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 3 Jul 2025 09:58:47 GMT Subject: [jdk25] RFR: 8361259: JDK25: Backout JDK-8258229 In-Reply-To: References: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> Message-ID: <7x65TpFJJJ2dTdjZq__12fVXAlY2Ta7HYOUc17Oe0zQ=.8ed717d7-c89e-4a1b-ad12-08cabceadf28@github.com> On Thu, 3 Jul 2025 02:33:23 GMT, Dean Long wrote: > > > Makes sense, but according to the Developers' Guide, we can't do that because "A Bug or Enhancement with resolution Fixed is required to have a corresponding changeset in one of the OpenJDK repositories." > > > > > > [cf75f1f](https://github.com/openjdk/jdk/commit/cf75f1f9c6d2bc70c7133cb81c73a0ce0946dff9) is a corresponding changset. We can link it. > > So two bugs would reference the same changeset, but the changeset only names 8358821? It might be better to close 8357017 as a duplicate instead of as Fixed. I've closed it as duplicate and added comments to the issues. Do we need anything else like a reminder that we want to consider [JDK-8358821](https://bugs.openjdk.org/browse/JDK-8358821) backport? Is there a label for that? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26091#issuecomment-3031647737 From mablakatov at openjdk.org Thu Jul 3 10:01:36 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Thu, 3 Jul 2025 10:01:36 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v7] In-Reply-To: References: Message-ID: > Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. > > Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still. > > The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reducion micro-benchmarks. 
> > Benchmarks results: > > Neoverse-V1 (SVE 256-bit) > > Benchmark (size) Mode master PR Units > ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms > ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms > IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms > LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms > FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms > DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms > > > Fujitsu A64FX (SVE 512-bit): > > Benchmark (size) Mode master PR Units > ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms > ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms > IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms > LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms > FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms > DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms Mikhail Ablakatov has updated the pull request incrementally with two additional commits since the last revision: - Compare VL against MaxVectorSize instead of FloatRegister::sve_vl_max - Use a dedicated ptrue predicate register This shifts MulReduction performance on Neoverse V1 a bit. Here Before if before this specific commit (ebad6dd37e332da44222c50cd17c69f3ff3f0635) and After is this commit. | Benchmark | Before (ops/ms) | After (ops/ms) | Diff (%) | | ------------------------ | --------------- | -------------- | -------- | | ByteMaxVector.MULLanes | 9883.151 | 9093.557 | -7.99% | | DoubleMaxVector.MULLanes | 2712.674 | 2607.367 | -3.89% | | FloatMaxVector.MULLanes | 3388.811 | 3291.429 | -2.88% | | IntMaxVector.MULLanes | 4765.554 | 5031.741 | +5.58% | | LongMaxVector.MULLanes | 2685.228 | 2896.445 | +7.88% | | ShortMaxVector.MULLanes | 5128.185 | 5197.656 | +1.35% | ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23181/files - new: https://git.openjdk.org/jdk/pull/23181/files/ebad6dd3..d35f1089 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=05-06 Stats: 69 lines in 4 files changed: 12 ins; 17 del; 40 mod Patch: https://git.openjdk.org/jdk/pull/23181.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23181/head:pull/23181 PR: https://git.openjdk.org/jdk/pull/23181 From jbhateja at openjdk.org Thu Jul 3 10:06:44 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 3 Jul 2025 10:06:44 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v10] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 08:26:00 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. 
Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments test/hotspot/jtreg/compiler/vectorapi/TestSelectFromTwoVectorOp.java line 234: > 232: > 233: @Test > 234: @IR(counts = {IRNode.SELECT_FROM_TWO_VECTOR_VS, IRNode.VECTOR_SIZE_8, ">0"}, Hi @Bhavana-Kilambi , Kindly also include x86-specific feature checks in IR rule for this test. You can directly integrate attached patch. [select_from_ir_feature.txt](https://github.com/user-attachments/files/21034639/select_from_ir_feature.txt) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2182389060 From mablakatov at openjdk.org Thu Jul 3 10:26:45 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Thu, 3 Jul 2025 10:26:45 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v4] In-Reply-To: References: <3sWLk_sAMLtcvRUjXk9hYe-K2MBQl9fH2Qg0MF7lwDk=.b8867d51-e822-43c0-93ab-58228c6eb1d5@github.com> <19rf4A0bxc4BstRmLivGkoCOm7Qa7YD6z1VJHJivCtg=.4a643c7b-4e79-4f37-b230-7231df3c68a8@github.com> Message-ID: On Wed, 2 Jul 2025 01:42:36 GMT, Xiaohong Gong wrote: >> Thanks! For some reason I thought that we don't have a dedicated predicate register for that. > > We can directly use `ptrue` here which maps to `p7` and has been preserved and initialized as all true. Done, although this has shifter the performance a bit: | Benchmark | Before (ops/ms) | After (ops/ms) | Diff (%) | | ------------------------ | --------------- | -------------- | -------- | | ByteMaxVector.MULLanes | 9883.151 | 9093.557 | -7.99% | | DoubleMaxVector.MULLanes | 2712.674 | 2607.367 | -3.89% | | FloatMaxVector.MULLanes | 3388.811 | 3291.429 | -2.88% | | IntMaxVector.MULLanes | 4765.554 | 5031.741 | +5.58% | | LongMaxVector.MULLanes | 2685.228 | 2896.445 | +7.88% | | ShortMaxVector.MULLanes | 5128.185 | 5197.656 | +1.35% | On average, the results didn't get worse. I suggest to merge the updated version as is as the shift seem to be related to micro-architectural effects not directly related to this PR and overall the PR still improves the performance by an order of magnitude (please reference https://github.com/openjdk/jdk/pull/23181#issuecomment-3018988067 for performance numbers before the PR) . I intent to closer investigate the reasons behind this later. 
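For readers who want to connect these MULLanes figures to source code, here is a minimal, self-contained sketch of the kind of multiply-reduction kernel such Vector API micro-benchmarks exercise. This is an illustrative example, not the panama-vector benchmark itself; the species and array size are assumptions chosen to mirror the MaxVector/1024 configuration above.

```
import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Run with: java --add-modules=jdk.incubator.vector MulReduceSketch.java
public class MulReduceSketch {
    static final VectorSpecies<Long> SPECIES = LongVector.SPECIES_MAX;

    // Multiply-reduce an array; reduceLanes(MUL) is roughly the operation
    // that the MulReduction backend work above targets on SVE.
    static long mulReduce(long[] a) {
        long acc = 1L;
        int i = 0;
        for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
            acc *= LongVector.fromArray(SPECIES, a, i)
                             .reduceLanes(VectorOperators.MUL);
        }
        for (; i < a.length; i++) {   // scalar tail
            acc *= a[i];
        }
        return acc;
    }

    public static void main(String[] args) {
        long[] data = new long[1024];
        java.util.Arrays.fill(data, 3L);
        System.out.println(mulReduce(data));
    }
}
```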
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2182426692 From galder at openjdk.org Thu Jul 3 11:17:41 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Thu, 3 Jul 2025 11:17:41 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v2] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 12:02:07 GMT, Aleksey Shipilev wrote: >> We routinely CTW 3rd party JARs to make sure our compilers work. By the nature of the JARs, they have dependencies on other JARs, and CTW runner frequently warns out with NCDFE. It does so very crudely, missing opportunities to compile the methods that _do not_ trigger NCDFEs. CTW should be made more tolerant to this. I think the normal "modules" CTW runs into the similar problem, but on a lesser scale, as we do not have a very hairy dependency graph within JDK. >> >> The CTW logs are also fairly noisy with full exception traces when NCDFE is semi-expected. This PR does _not_ print exception stack traces in these cases, only "NOTE"-s about it. This makes the log fairly clean and more understandable. >> >> Motivational scope improvement compiling a sample 3rd party JAR (cassandra-2.1.4.0.jar): >> >> >> Before: Done (2487 classes, 9866 methods, 24584 ms) >> After: Done (2487 classes, 10074 methods, 24150 ms) ; +2% more methods >> >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Merge branch 'master' into JDK-8361255-ctw-ncdfe > - Move clinit compile back > - Initial > - Fix Changes requested by galder (Author). test/hotspot/jtreg/testlibrary/ctw/src/sun/hotspot/tools/ctw/Compiler.java line 104: > 102: constructors = aClass.getDeclaredConstructors(); > 103: } catch (NoClassDefFoundError e) { > 104: CompileTheWorld.OUT.println(String.format("[%d]\t%s\tNOTE unable to get constructors : %s", Nitpick really but why not call `CompileTheWorld.OUT.printf(...` instead of `CompileTheWorld.OUT.println(String.format(...`? ------------- PR Review: https://git.openjdk.org/jdk/pull/26090#pullrequestreview-2982769212 PR Review Comment: https://git.openjdk.org/jdk/pull/26090#discussion_r2182520478 From mablakatov at openjdk.org Thu Jul 3 11:47:44 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Thu, 3 Jul 2025 11:47:44 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v3] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 16:22:42 GMT, Mikhail Ablakatov wrote: >> That would be the operations with partial vector size valid. For such cases, we will generate a mask in IR level, and a `VectorBlend` will be generated for this reduction case. Otherwise the result will be incorrect. So the vector size should be equal to MaxVectorSize theoretically. > > Thank you for elaborating on this :) Done, thanks! 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2182576732 From dbriemann at openjdk.org Thu Jul 3 12:35:55 2025 From: dbriemann at openjdk.org (David Briemann) Date: Thu, 3 Jul 2025 12:35:55 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI Message-ID: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Implement more nodes for ppc that exist on other platforms. ------------- Commit messages: - 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI Changes: https://git.openjdk.org/jdk/pull/26115/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26115&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361353 Stats: 87 lines in 4 files changed: 86 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26115.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26115/head:pull/26115 PR: https://git.openjdk.org/jdk/pull/26115 From dnsimon at openjdk.org Thu Jul 3 13:04:19 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 3 Jul 2025 13:04:19 GMT Subject: RFR: 8361355: Negative cases of Annotated.getAnnotationData implementations are broken Message-ID: This PR fixes bugs in the implementation of `jdk.vm.ci.meta.Annotated.getAnnotationData`: 1. Calling `getAnnotatedData(annotationType)` fails with an ArrayIndexOutOfBoundsException instead of returning null when the receiver type is not annotated by `annotationType`. 2. Calling either of the `getAnnotatedData` methods with an `annotationType` value that does not represent an annotation interface silently succeeds when the receiver type does not (or can not) have any annotations (e.g. array and primitive types). ------------- Commit messages: - fixed negative cases in getAnnotationData Changes: https://git.openjdk.org/jdk/pull/26116/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26116&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361355 Stats: 100 lines in 7 files changed: 89 ins; 1 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/26116.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26116/head:pull/26116 PR: https://git.openjdk.org/jdk/pull/26116 From shade at openjdk.org Thu Jul 3 13:33:23 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 3 Jul 2025 13:33:23 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v3] In-Reply-To: References: Message-ID: > We routinely CTW 3rd party JARs to make sure our compilers work. By the nature of the JARs, they have dependencies on other JARs, and CTW runner frequently warns out with NCDFE. It does so very crudely, missing opportunities to compile the methods that _do not_ trigger NCDFEs. CTW should be made more tolerant to this. I think the normal "modules" CTW runs into the similar problem, but on a lesser scale, as we do not have a very hairy dependency graph within JDK. > > The CTW logs are also fairly noisy with full exception traces when NCDFE is semi-expected. This PR does _not_ print exception stack traces in these cases, only "NOTE"-s about it. This makes the log fairly clean and more understandable. 
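A rough, self-contained sketch of the tolerant pattern being described (illustrative names only; the actual change lives in the CTW test library's Compiler.java):

```
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;

class TolerantLookup {
    // A missing transitive dependency should only skip what it affects,
    // not abort compilation of the whole class.
    static Method[] methodsOrEmpty(Class<?> aClass) {
        try {
            return aClass.getDeclaredMethods();
        } catch (NoClassDefFoundError e) {
            System.out.printf("NOTE unable to get methods : %s%n", e);
            return new Method[0];
        }
    }

    static Constructor<?>[] constructorsOrEmpty(Class<?> aClass) {
        try {
            return aClass.getDeclaredConstructors();
        } catch (NoClassDefFoundError e) {
            System.out.printf("NOTE unable to get constructors : %s%n", e);
            return new Constructor<?>[0];
        }
    }
}
```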
> > Motivational scope improvement compiling a sample 3rd party JAR (cassandra-2.1.4.0.jar): > > > Before: Done (2487 classes, 9866 methods, 24584 ms) > After: Done (2487 classes, 10074 methods, 24150 ms) ; +2% more methods > > > Additional testing: > - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: Just use printf directly ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26090/files - new: https://git.openjdk.org/jdk/pull/26090/files/9d41f80a..04fd5e50 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26090&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26090&range=01-02 Stats: 14 lines in 2 files changed: 0 ins; 0 del; 14 mod Patch: https://git.openjdk.org/jdk/pull/26090.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26090/head:pull/26090 PR: https://git.openjdk.org/jdk/pull/26090 From shade at openjdk.org Thu Jul 3 13:33:24 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 3 Jul 2025 13:33:24 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v2] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 11:14:41 GMT, Galder Zamarre?o wrote: >> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: >> >> - Merge branch 'master' into JDK-8361255-ctw-ncdfe >> - Move clinit compile back >> - Initial >> - Fix > > test/hotspot/jtreg/testlibrary/ctw/src/sun/hotspot/tools/ctw/Compiler.java line 104: > >> 102: constructors = aClass.getDeclaredConstructors(); >> 103: } catch (NoClassDefFoundError e) { >> 104: CompileTheWorld.OUT.println(String.format("[%d]\t%s\tNOTE unable to get constructors : %s", > > Nitpick really but why not call `CompileTheWorld.OUT.printf(...` instead of `CompileTheWorld.OUT.println(String.format(...`? Mostly because it was the style of the surrounding code. But I don't see why not use `printf` directly indeed, done in new commit. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26090#discussion_r2182798874 From dnsimon at openjdk.org Thu Jul 3 14:13:23 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 3 Jul 2025 14:13:23 GMT Subject: RFR: 8361355: Negative cases of Annotated.getAnnotationData implementations are broken [v2] In-Reply-To: References: Message-ID: > This PR fixes bugs in the implementation of `jdk.vm.ci.meta.Annotated.getAnnotationData`: > 1. Calling `getAnnotatedData(annotationType)` fails with an ArrayIndexOutOfBoundsException instead of returning null when the receiver type is not annotated by `annotationType`. > 2. Calling either of the `getAnnotatedData` methods with an `annotationType` value that does not represent an annotation interface silently succeeds when the receiver type does not (or can not) have any annotations (e.g. array and primitive types). Doug Simon has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. 
The pull request contains one new commit since the last revision: fixed negative cases in getAnnotationData ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26116/files - new: https://git.openjdk.org/jdk/pull/26116/files/86b41636..b25684f7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26116&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26116&range=00-01 Stats: 11 lines in 1 file changed: 5 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/26116.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26116/head:pull/26116 PR: https://git.openjdk.org/jdk/pull/26116 From mdoerr at openjdk.org Thu Jul 3 14:27:42 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 3 Jul 2025 14:27:42 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI In-Reply-To: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: <0lfxjYCRp6xFM8c_RDhbLEtbwM5J3huFxjcOqcWVykU=.908af2c8-2a2d-4719-b598-45b716ab8658@github.com> On Thu, 3 Jul 2025 12:30:51 GMT, David Briemann wrote: > Implement more nodes for ppc that exist on other platforms. Thanks for implementing these nodes! The new instruction needs a Power9 check. Otherwise, LGTM. src/hotspot/cpu/ppc/assembler_ppc.hpp line 2376: > 2374: inline void vctzw( VectorRegister d, VectorRegister b); > 2375: inline void vctzd( VectorRegister d, VectorRegister b); > 2376: inline void vnegw( VectorRegister d, VectorRegister b); A Power9 comment would be helpful to prevent wrong usage. src/hotspot/cpu/ppc/ppc.ad line 2196: > 2194: case Op_AbsVF: > 2195: case Op_AbsVD: > 2196: case Op_NegVI: vnegw requires Power9 (`PowerArchitecturePPC64 >= 9`). src/hotspot/cpu/ppc/ppc.ad line 13583: > 13581: > 13582: instruct vnegI_reg(vecX dst, vecX src) %{ > 13583: match(Set dst (NegVI src)); Should use a predicate for Power9. ------------- PR Review: https://git.openjdk.org/jdk/pull/26115#pullrequestreview-2983369466 PR Review Comment: https://git.openjdk.org/jdk/pull/26115#discussion_r2182917169 PR Review Comment: https://git.openjdk.org/jdk/pull/26115#discussion_r2182910035 PR Review Comment: https://git.openjdk.org/jdk/pull/26115#discussion_r2182926525 From rrich at openjdk.org Thu Jul 3 14:38:40 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Thu, 3 Jul 2025 14:38:40 GMT Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining In-Reply-To: References: Message-ID: <0p61J0DPfyHsen3r__V82eEZSPYaT9rZleHBtanKaRc=.c5f6992f-a7fe-4c95-bdcb-2887c3dbde21@github.com> On Thu, 3 Jul 2025 08:36:53 GMT, Alan Bateman wrote: > It might be that it could be updated to work with debug or -Xcomp too, execution times would need to be checked out. I found that the runtime of each test is ~300ms with a release build and ~11s with a fastdebug build on x86_64 and ppc64. If you like I can remove the requirement within this pr and do some more testing. -Xcomp doesn't seem to work. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26033#issuecomment-3032511575 From hgreule at openjdk.org Thu Jul 3 14:55:44 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Thu, 3 Jul 2025 14:55:44 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v2] In-Reply-To: References: Message-ID: On Thu, 26 Jun 2025 07:55:23 GMT, Hannes Greule wrote: >> Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. >> >> Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. >> >> I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. >> >> Please review. Thanks. > > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > remove classfile version @iwanowww @eme64 as you reviewed the original change, could you have a look at this? Thank you very much. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25988#issuecomment-3032565530 From never at openjdk.org Thu Jul 3 14:59:39 2025 From: never at openjdk.org (Tom Rodriguez) Date: Thu, 3 Jul 2025 14:59:39 GMT Subject: RFR: 8361355: Negative cases of Annotated.getAnnotationData implementations are broken [v2] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 14:13:23 GMT, Doug Simon wrote: >> This PR fixes bugs in the implementation of `jdk.vm.ci.meta.Annotated.getAnnotationData`: >> 1. Calling `getAnnotatedData(annotationType)` fails with an ArrayIndexOutOfBoundsException instead of returning null when the receiver type is not annotated by `annotationType`. >> 2. Calling either of the `getAnnotatedData` methods with an `annotationType` value that does not represent an annotation interface silently succeeds when the receiver type does not (or can not) have any annotations (e.g. array and primitive types). > > Doug Simon has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > fixed negative cases in getAnnotationData Looks good. ------------- Marked as reviewed by never (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26116#pullrequestreview-2983525519 From vpaprotski at openjdk.org Thu Jul 3 15:14:42 2025 From: vpaprotski at openjdk.org (Volodymyr Paprotski) Date: Thu, 3 Jul 2025 15:14:42 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs In-Reply-To: References: Message-ID: <3R2flcCvwCbIMgCJqOVnrUXgAZJsi9Ja2r4is2tCnLg=.cab9d74a-7998-466c-9d24-8672f3f8883b@github.com> On Wed, 2 Jul 2025 23:28:42 GMT, Srinivas Vamsi Parasa wrote: >> @vamsi-parasa, It's better to make this as a subclass of MacroAssembler in src/hotspot/cpu/x86/macroAssembler_x86.hpp and pass Tracker as an argument to push / pop for a cleaner interface. > > Hi Jatin (@jatin-bhateja) and Vlad (@vpaprotsk), > > There's one more issue to be considered. The C++ PushPopTracker code will be run during the stub generation time. There are code bocks which do a single push onto the stack but due to multiple exit paths, there will be multiple pops as illustrated below. 
Will this reference counting approach not fail in such a scenario as the stub code is generated all at once during the stub generation phase? > > > #begin stack frame > push(r21) > > #exit condition 1 > pop(r21) > > # exit condition 2 > pop(r21) Now that I had my fun writing an array-backed stack.. (and with David's comment too..) I can admit that the point of the entire C++ Tracker class is to 'just' add an assert; doesn't actually functionally add to the original code, but does add better JIT/stub compile-time checking. @vamsi-parasa you are right.. if there are ifs and multiple exit paths in the assembler itself.. the Tracker wont be able to catch it (multiple exits paths in the generator are just fine though); I was thinking about this problem too last night... a hack/'solution' would be to disable such checking with a default flag in the constructor... 'fairly trivial' but just adds to the complexity even more. And the assert was the point of the class to begin with... I do think such stubs are rare? There is some value in improved checking, but enough? Writing stubs is already an 'you should know assembler very well' thing so those checks only improve things marginally overall? As David says, its for the compiler folks to decide :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2183043350 From shade at openjdk.org Thu Jul 3 16:23:39 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 3 Jul 2025 16:23:39 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode In-Reply-To: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> Message-ID: On Thu, 3 Jul 2025 01:59:55 GMT, hanguanqiang wrote: > This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode > > Problem? > When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. > > Root Cause? > Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. > > Fix > Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. IMO, there is no point in fixing `-GenerateSynchronizationCode`, and instead we should just remove the flag. I propose we do this under the umbrella of this bug, just rename it to something like `Purge GenerateSynchronizationCode flag`. It is `develop`, so we don't even need a compatibility review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26108#issuecomment-3032852570 From lmesnik at openjdk.org Thu Jul 3 16:59:40 2025 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Thu, 3 Jul 2025 16:59:40 GMT Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining In-Reply-To: References: Message-ID: On Sun, 29 Jun 2025 15:26:14 GMT, Richard Reingruber wrote: > This PR adds CompileCommands to the test DumpThreadsWithEliminatedLock.java to force inlining of java/lang/String*.* methods. 
This will make inlining more stable to allow for the expected lock elimination based on c2 escape analysis. > > Forcing inlining of java/lang/StringBuffer.* wasn't sufficient on x86_64. With that the test still failed with TieredCompilation disabled. > > Testing: x86_64, ppc64 manually. Other major platforms as part of our CI testing. > > Failed inlining on x86_64 with TieredCompilation disabled: > > > make test TEST=com/sun/management/HotSpotDiagnosticMXBean/DumpThreadsWithEliminatedLock.java TEST_VM_OPTS="-XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=PrintInlining,DumpThreadsWithEliminatedLock.*" JTREG=TIMEOUT_FACTOR=0.1 > > [...] > > STDOUT: > CompileCommand: PrintInlining DumpThreadsWithEliminatedLock.* bool PrintInlining = true > @ 1 java.util.concurrent.atomic.AtomicBoolean::get (13 bytes) inline (hot) > @ 11 java.lang.StringBuffer:: (7 bytes) inline (hot) late inline succeeded (string method) > @ 3 java.lang.AbstractStringBuilder:: (39 bytes) inline (hot) > @ 1 java.lang.Object:: (1 bytes) inline (hot) > @ 16 java.lang.System::currentTimeMillis (0 bytes) (intrinsic) > s @ 19 java.lang.StringBuffer::append (13 bytes) failed to inline: already compiled into a big method > s @ 24 java.lang.StringBuffer::toString (44 bytes) inline (hot) late inline succeeded (string method) > s @ 1 java.lang.StringBuffer::length (5 bytes) accessor > @ 24 java.lang.String:: (98 bytes) failed to inline: already compiled into a big method > @ 30 java.util.concurrent.atomic.AtomicReference::set (6 bytes) accessor > 2025-07-02T09:25:53.396634900Z Attempt 1, found: false > 2025-07-02T09:25:53.415673072Z Attempt 2, found: false > 2025-07-02T09:25:53.418876867Z Attempt 3, found: false > > [...] Marked as reviewed by lmesnik (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26033#pullrequestreview-2983913131 From enikitin at openjdk.org Thu Jul 3 17:01:48 2025 From: enikitin at openjdk.org (Evgeny Nikitin) Date: Thu, 3 Jul 2025 17:01:48 GMT Subject: Integrated: 8357739: [jittester] disable the hashCode method In-Reply-To: References: Message-ID: On Tue, 17 Jun 2025 19:49:34 GMT, Evgeny Nikitin wrote: > JITTester often uses the `hasCode` method (in fact, in almost every generated test). Given that the method can be unstable between runs or in interpreted vs compiled runs, it can create false-positives. > > This PR fixes the issue by adding support for method templates similar to the ones used in CompilerCommands). All of those exclude templates match (and exclude) `String.indexOf(String)`, for example: > > java/lang/::*(Ljava/lang/String;I) > *String::indexOf(*) > java/lang/*::indexOf > > > Additionally, the PR adds support for comments (starting from '#') and empty lines in the excludes file. This pull request has now been integrated. 
Changeset: a2315ddd Author: Evgeny Nikitin Committer: Leonid Mesnik URL: https://git.openjdk.org/jdk/commit/a2315ddd2a343ed594dd1b0b3d0dc5b3a71f509b Stats: 556 lines in 4 files changed: 402 ins; 121 del; 33 mod 8357739: [jittester] disable the hashCode method Reviewed-by: lmesnik ------------- PR: https://git.openjdk.org/jdk/pull/25859 From dnsimon at openjdk.org Thu Jul 3 17:30:56 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Thu, 3 Jul 2025 17:30:56 GMT Subject: RFR: 8361355: Negative cases of Annotated.getAnnotationData implementations are broken [v3] In-Reply-To: References: Message-ID: <83aGkzmp5J7JllBsWK5ZzwZAa4GVsNk5VjmkH0O3FjE=.2507d7ce-65df-4121-acdf-35125d530d39@github.com> > This PR fixes bugs in the implementation of `jdk.vm.ci.meta.Annotated.getAnnotationData`: > 1. Calling `getAnnotatedData(annotationType)` fails with an ArrayIndexOutOfBoundsException instead of returning null when the receiver type is not annotated by `annotationType`. > 2. Calling either of the `getAnnotatedData` methods with an `annotationType` value that does not represent an annotation interface silently succeeds when the receiver type does not (or can not) have any annotations (e.g. array and primitive types). Doug Simon has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge tag 'jdk-26+4' into JDK-8361355 Added tag jdk-26+4 for changeset 1ca008fd - fixed negative cases in getAnnotationData ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26116/files - new: https://git.openjdk.org/jdk/pull/26116/files/b25684f7..ec161d59 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26116&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26116&range=01-02 Stats: 6616 lines in 362 files changed: 3437 ins; 1484 del; 1695 mod Patch: https://git.openjdk.org/jdk/pull/26116.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26116/head:pull/26116 PR: https://git.openjdk.org/jdk/pull/26116 From alanb at openjdk.org Thu Jul 3 17:59:44 2025 From: alanb at openjdk.org (Alan Bateman) Date: Thu, 3 Jul 2025 17:59:44 GMT Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining In-Reply-To: <0p61J0DPfyHsen3r__V82eEZSPYaT9rZleHBtanKaRc=.c5f6992f-a7fe-4c95-bdcb-2887c3dbde21@github.com> References: <0p61J0DPfyHsen3r__V82eEZSPYaT9rZleHBtanKaRc=.c5f6992f-a7fe-4c95-bdcb-2887c3dbde21@github.com> Message-ID: On Thu, 3 Jul 2025 14:36:15 GMT, Richard Reingruber wrote: > I found that the runtime of each test is ~300ms with a release build and ~11s with a fastdebug build on x86_64 and ppc64. If you like I can remove the requirement within this pr and do some more testing. -Xcomp doesn't seem to work. I think that would be useful, thank you. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26033#issuecomment-3033099720 From dlunden at openjdk.org Thu Jul 3 18:18:49 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Thu, 3 Jul 2025 18:18:49 GMT Subject: RFR: 8360701: Add bailout when the register allocator interference graph grows unreasonably large Message-ID: The changeset for JDK-8325467 (https://git.openjdk.org/jdk/pull/20404) enables compilation of methods with many parameters, which C2 previously bailed out on. 
As a side effect, the tests `BigArityTest.java`, `TestCatchExceptionWithVarargs.java`, and `VarargsArrayTest.java` compile more methods than before, and additionally these methods are designed, for stress testing purposes, to have a large number of parameters (at or close to the maximum of 255 parameters allowed by the JVM spec). Compiling such methods takes a very long time and >99% of the time is spent in the C2 phase Coalesce 2 (part of register allocation). The problem is that the interference graph becomes huge after the initial round of spilling (just before Coalesce 2), and that we do not check for this and bail out if necessary. We do already bail out if the number of IR nodes grows too large, but the interference graph can become huge even if we have a small number of nodes. In fact, the interference graph may (in the worst case) have a size that is quadratic in the number of nodes. In the problematic tests, we have interference graphs with approximately 100 000 nodes and over 55 000 000 (!) IFG edges. For comparison, the IFG edge count in worst-case realistic scenarios caps out at around 40 000 nodes and 800 000 edges. For example, see the scatter matrix below from running the DaCapo benchmark. It displays, for each time an IFG was built, the number of current IR nodes, the number of live ranges (the actual nodes in the IFG), and the number of IFG edges. ![dacapo](https://github.com/user-attachments/assets/7a070768-50da-42e4-b5ed-9958e1362673)

### Changeset
- Add a new diagnostic flag `IFGEdgesLimit` and bail out whenever we reach the number of edges specified by the flag during IFG construction. The default is a very generous 10 000 000 edges, which still filters out the most degenerate compilations we have seen.
- Add tracking of edges in `PhaseIFG` to permit the new flag.

It is worth noting that it is perhaps preferable to use a lower default than 10 000 000 edges. For example, in standard benchmarks such as DaCapo (see the scatter matrix above), Renaissance, SPECjvm, and SPECjbb, we never go over 1 000 000 edges (I verified this). The reason I went with the generous 10 000 000 limit is that I saw a fair amount of bailouts in testing with the flag set at 1 000 000 edges. Such bailouts are likely motivated, but I do not want to take any chances. Even at 10 000 000 edges, a few tests still hit the limit with certain JVM flag combinations:
- `applications/ctw/modules/java_base.java`
- `compiler/codegen/TestAntiDependenciesHighMemUsage2.java`
- `compiler/loopopts/superword/TestAlignVectorFuzzer.java`

### Testing
- [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/16047279249)
- `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64.
- C2 compilation speed benchmarking on DaCapo. Compilation speed is unaffected.
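Assuming `IFGEdgesLimit` follows the usual conventions for diagnostic flags, experimenting with a lower limit on a product build would presumably look like `java -XX:+UnlockDiagnosticVMOptions -XX:IFGEdgesLimit=1000000 ...`; when the limit is hit, C2 gives up on the method in the same way as the existing node-count bailout, rather than spending unbounded time in register allocation.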
------------- Commit messages: - Bail out if too many IFG edges Changes: https://git.openjdk.org/jdk/pull/26118/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26118&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8360701 Stats: 38 lines in 4 files changed: 37 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26118.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26118/head:pull/26118 PR: https://git.openjdk.org/jdk/pull/26118 From dlong at openjdk.org Thu Jul 3 20:43:44 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 3 Jul 2025 20:43:44 GMT Subject: [jdk25] RFR: 8361259: JDK25: Backout JDK-8258229 In-Reply-To: <7x65TpFJJJ2dTdjZq__12fVXAlY2Ta7HYOUc17Oe0zQ=.8ed717d7-c89e-4a1b-ad12-08cabceadf28@github.com> References: <-pGWrWHOZmgVzC7b3zYHIZDWjYin33X0ZEgj9GafA7E=.cea79e3f-9242-4e6a-95b0-7dad8b212b90@github.com> <7x65TpFJJJ2dTdjZq__12fVXAlY2Ta7HYOUc17Oe0zQ=.8ed717d7-c89e-4a1b-ad12-08cabceadf28@github.com> Message-ID: On Thu, 3 Jul 2025 09:55:51 GMT, Martin Doerr wrote: > I've closed it as duplicate and added comments to the issues. Thanks! > Do we need anything else like a reminder that we want to consider [JDK-8358821](https://bugs.openjdk.org/browse/ JDK-8358821) backport? Is there a label for that? The Developers' Guide says you can add a (Rel)-bp label to suggest a backport, so that would be "25-bp" for jdk25. If we definitely want to backport to particular release then we could create the Backport issue now as a placeholder. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26091#issuecomment-3033572554 From kvn at openjdk.org Thu Jul 3 22:53:40 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 3 Jul 2025 22:53:40 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 13:33:23 GMT, Aleksey Shipilev wrote: >> We routinely CTW 3rd party JARs to make sure our compilers work. By the nature of the JARs, they have dependencies on other JARs, and CTW runner frequently warns out with NCDFE. It does so very crudely, missing opportunities to compile the methods that _do not_ trigger NCDFEs. CTW should be made more tolerant to this. I think the normal "modules" CTW runs into the similar problem, but on a lesser scale, as we do not have a very hairy dependency graph within JDK. >> >> The CTW logs are also fairly noisy with full exception traces when NCDFE is semi-expected. This PR does _not_ print exception stack traces in these cases, only "NOTE"-s about it. This makes the log fairly clean and more understandable. >> >> Motivational scope improvement compiling a sample 3rd party JAR (cassandra-2.1.4.0.jar): >> >> >> Before: Done (2487 classes, 9866 methods, 24584 ms) >> After: Done (2487 classes, 10074 methods, 24150 ms) ; +2% more methods >> >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Just use printf directly test/hotspot/jtreg/testlibrary/ctw/src/sun/hotspot/tools/ctw/Compiler.java line 89: > 87: UNSAFE.ensureClassInitialized(aClass); > 88: } catch (NoClassDefFoundError e) { > 89: CompileTheWorld.OUT.printf("[%d]\t%s\tNOTE unable to init class : %s%n", Do you mean `\n` here and in all other outputs? `%n` needs local variable to store size of output. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26090#discussion_r2183886728 From kvn at openjdk.org Thu Jul 3 23:16:39 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 3 Jul 2025 23:16:39 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode In-Reply-To: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> Message-ID: <9wyR3KHZTWl-cf7rOq7ryEiP4e2AsxCyrylrfcWnKfM=.adb77f9b-213d-4b07-8362-aa8e5601f527@github.com> On Thu, 3 Jul 2025 01:59:55 GMT, hanguanqiang wrote: > This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode > > Problem? > When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. > > Root Cause? > Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. > > Fix > Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. I agree with removal of this flag. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26108#issuecomment-3033917992 From duke at openjdk.org Fri Jul 4 00:27:39 2025 From: duke at openjdk.org (hanguanqiang) Date: Fri, 4 Jul 2025 00:27:39 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode In-Reply-To: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> Message-ID: On Thu, 3 Jul 2025 01:59:55 GMT, hanguanqiang wrote: > This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode > > Problem? > When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. > > Root Cause? > Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. > > Fix > Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. Thank you all for the helpful feedback! I also think the GenerateSynchronizationCode flag is not particularly useful and can be removed. I will update this patch accordingly to eliminate the flag and simplify the related code. 
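For context, the code path under discussion is C2's parsing of a monitorexit bytecode. A minimal sketch of the kind of Java source that reaches Parse::do_monitor_exit() is below; nothing about it is specific to this PR, it is simply any synchronized region that gets hot enough to be C2-compiled, and since `GenerateSynchronizationCode` is a develop flag, actually flipping it requires a debug build.

```
public class SyncSketch {
    private static final Object LOCK = new Object();
    private static long counter;

    // javac emits monitorenter/monitorexit for this block; C2 handles the
    // monitorexit in Parse::do_monitor_exit() while parsing the method.
    static void bump() {
        synchronized (LOCK) {
            counter++;
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 200_000; i++) {
            bump();               // warm up so the method gets JIT-compiled
        }
        System.out.println(counter);
    }
}
```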
------------- PR Comment: https://git.openjdk.org/jdk/pull/26108#issuecomment-3034005241 From duke at openjdk.org Fri Jul 4 01:15:13 2025 From: duke at openjdk.org (hanguanqiang) Date: Fri, 4 Jul 2025 01:15:13 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode [v2] In-Reply-To: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> Message-ID: > This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode > > Problem? > When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. > > Root Cause? > Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. > > Fix > Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. hanguanqiang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: - remove the unused flag(GenerateSynchronizationCode) - 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode Problem? When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. Root Cause? Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. Fix Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. ------------- Changes: https://git.openjdk.org/jdk/pull/26108/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26108&range=01 Stats: 34 lines in 7 files changed: 10 ins; 16 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/26108.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26108/head:pull/26108 PR: https://git.openjdk.org/jdk/pull/26108 From duke at openjdk.org Fri Jul 4 01:26:42 2025 From: duke at openjdk.org (hanguanqiang) Date: Fri, 4 Jul 2025 01:26:42 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode [v3] In-Reply-To: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> Message-ID: > This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode > > Problem? > When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. > > Root Cause? 
> Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. > > Fix > Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. hanguanqiang has updated the pull request incrementally with one additional commit since the last revision: remove trailing whitespace remove trailing whitespace ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26108/files - new: https://git.openjdk.org/jdk/pull/26108/files/972f324b..d01533e1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26108&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26108&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26108.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26108/head:pull/26108 PR: https://git.openjdk.org/jdk/pull/26108 From duke at openjdk.org Fri Jul 4 01:30:22 2025 From: duke at openjdk.org (hanguanqiang) Date: Fri, 4 Jul 2025 01:30:22 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode [v4] In-Reply-To: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> Message-ID: > This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode > > Problem? > When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. > > Root Cause? > Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. > > Fix > Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. hanguanqiang has updated the pull request incrementally with one additional commit since the last revision: Delete .gitpod.yml ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26108/files - new: https://git.openjdk.org/jdk/pull/26108/files/d01533e1..1d6e8f5c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26108&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26108&range=02-03 Stats: 10 lines in 1 file changed: 0 ins; 10 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26108.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26108/head:pull/26108 PR: https://git.openjdk.org/jdk/pull/26108 From duke at openjdk.org Fri Jul 4 01:34:45 2025 From: duke at openjdk.org (hanguanqiang) Date: Fri, 4 Jul 2025 01:34:45 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode [v4] In-Reply-To: References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> Message-ID: On Thu, 3 Jul 2025 04:40:33 GMT, David Holmes wrote: >> hanguanqiang has updated the pull request incrementally with one additional commit since the last revision: >> >> Delete .gitpod.yml > > The patch seems reasonable from a backporting perspective. 
Though it does beg the question as to why `do_monitor_enter` does not need the same fix. I suspect this is a very old flag and the code has bit-rotted somewhat. A question for the compiler folk: does `GenerateSynchronizationCode` still have any use or should it be scrapped? > > Thanks @dholmes-ora @dean-long @shipilev @vnkozlov Thanks for the previous reviews! I?ve updated the patch according to the suggestions. When you have a moment, could you please take another look? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26108#issuecomment-3034151673 From xgong at openjdk.org Fri Jul 4 01:37:42 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 4 Jul 2025 01:37:42 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> Message-ID: On Thu, 3 Jul 2025 06:10:28 GMT, Xiaohong Gong wrote: >> ### Background >> On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. >> >> For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. >> >> To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. >> >> ### Impact Analysis >> #### 1. Vector types >> Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. >> >> #### 2. Vector API >> No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. >> >> #### 3. Auto-vectorization >> Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. >> >> #### 4. Codegen of vector nodes >> NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. >> >> Details: >> - Lanewise vector operations are unaffected as explained above. >> - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). >> - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. 
Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, addin... > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Refine the comment in ad file Hi @theRealAph , the review comments have been addressed. Would you mind taking another look please? Thank you so much! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3034155586 From xgong at openjdk.org Fri Jul 4 02:03:43 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 4 Jul 2025 02:03:43 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v4] In-Reply-To: References: <3sWLk_sAMLtcvRUjXk9hYe-K2MBQl9fH2Qg0MF7lwDk=.b8867d51-e822-43c0-93ab-58228c6eb1d5@github.com> <19rf4A0bxc4BstRmLivGkoCOm7Qa7YD6z1VJHJivCtg=.4a643c7b-4e79-4f37-b230-7231df3c68a8@github.com> Message-ID: On Thu, 3 Jul 2025 10:24:02 GMT, Mikhail Ablakatov wrote: >> We can directly use `ptrue` here which maps to `p7` and has been preserved and initialized as all true. > > Done, although this has shifter the performance a bit: > > > | Benchmark | Before (ops/ms) | After (ops/ms) | Diff (%) | > | ------------------------ | --------------- | -------------- | -------- | > | ByteMaxVector.MULLanes | 9883.151 | 9093.557 | -7.99% | > | DoubleMaxVector.MULLanes | 2712.674 | 2607.367 | -3.89% | > | FloatMaxVector.MULLanes | 3388.811 | 3291.429 | -2.88% | > | IntMaxVector.MULLanes | 4765.554 | 5031.741 | +5.58% | > | LongMaxVector.MULLanes | 2685.228 | 2896.445 | +7.88% | > | ShortMaxVector.MULLanes | 5128.185 | 5197.656 | +1.35% | > > > On average, the results didn't get worse. I suggest to merge the updated version as is as the shift seem to be related to micro-architectural effects not directly related to this PR and overall the PR still improves the performance by an order of magnitude (please reference https://github.com/openjdk/jdk/pull/23181#issuecomment-3018988067 for performance numbers before the PR) . I intent to closer investigate the reasons behind this later. I'm fine with the latest version because it saves the mask generation and a predicate temp register. The minor regressions are fine to me. BTW, Not sure whether the masked operation with partial lanes is more efficient compared with all lane computations. This maybe the HW micro-architecture implementation related issues. I didn't have an investigation for this before. Additionally, currently all the lanewise operations (e.g. `MulV/AddV/...`) with partial vector size are all implemented with `ptrue`. I agree with keeping it as it is, and taking an investigation for this later. Thanks for your updating! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2184132213 From duke at openjdk.org Fri Jul 4 02:55:27 2025 From: duke at openjdk.org (hanguanqiang) Date: Fri, 4 Jul 2025 02:55:27 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp Message-ID: When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. 
Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this flag is disabled. ------------- Commit messages: - 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp Changes: https://git.openjdk.org/jdk/pull/26125/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26125&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361140 Stats: 3 lines in 1 file changed: 3 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26125.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26125/head:pull/26125 PR: https://git.openjdk.org/jdk/pull/26125 From fyang at openjdk.org Fri Jul 4 05:27:39 2025 From: fyang at openjdk.org (Fei Yang) Date: Fri, 4 Jul 2025 05:27:39 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v2] In-Reply-To: References: Message-ID: On Thu, 26 Jun 2025 14:27:21 GMT, Feilong Jiang wrote: >> Hi, please consider. >> [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. >> The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. >> If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. >> This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. >> We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. >> >> This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. >> The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. >> >> Test on linux-riscv64: >> - [x] Tier1-3 >> >> JMH data on P550 SBC for reference (w/o and w/ the patch): >> >> Before: >> >> Without COH: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op >> ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op >> ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op >> ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op >> ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op >> ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op >> ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op >> ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op >> ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op >> ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op >> ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op >> ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op >> >> ------------------------------------------------------------------------- >> With COH: >> >> Benchmark (size) Mode Cnt Score Error Un... > > Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: > > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses > - riscv: fix c1 primitive array clone intrinsic regression src/hotspot/share/c1/c1_Compiler.cpp line 240: > 238: #endif > 239: case vmIntrinsics::_getObjectSize: > 240: #if defined(X86) || defined(AARCH64) || defined(S390) || defined(RISCV64) || defined(PPC64) PS: The change of macro `RISCV` seems unrelated to this PR? Seem better to go with another PR. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2184446825 From shade at openjdk.org Fri Jul 4 06:04:38 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 4 Jul 2025 06:04:38 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 22:51:02 GMT, Vladimir Kozlov wrote: >> Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: >> >> Just use printf directly > > test/hotspot/jtreg/testlibrary/ctw/src/sun/hotspot/tools/ctw/Compiler.java line 89: > >> 87: UNSAFE.ensureClassInitialized(aClass); >> 88: } catch (NoClassDefFoundError e) { >> 89: CompileTheWorld.OUT.printf("[%d]\t%s\tNOTE unable to init class : %s%n", > > Do you mean `\n` here and in all other outputs? `%n` needs local variable to store size of output. I meant `%n` :) You are probably thinking about C printf? In Java [formatters](https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html), `%n` is the "platform-specific line separator". It is more compatible than just `\n`, which runs into platform-specific `CR` vs `LF` vs `CRLF` line separator mess. See: jshell> System.out.printf("Hello\nthere,\nVladimir!\n") Hello there, Vladimir! $6 ==> java.io.PrintStream at 34c45dca jshell> System.out.printf("Hello%nthere,%nVladimir!%n") Hello there, Vladimir! $7 ==> java.io.PrintStream at 34c45dca ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26090#discussion_r2184484564 From jbhateja at openjdk.org Fri Jul 4 06:05:41 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 4 Jul 2025 06:05:41 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: References: Message-ID: <_RERljqu_FG7ZyneAk7Thd-9TwED18pQpEBz_i105fY=.b8948a23-273a-49f6-b9cb-6b611a5eedc6@github.com> On Thu, 3 Jul 2025 07:10:22 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is relative smaller than that of `fromLong`. This patch does the conversion for these cases if `l` is a compile time constant. >> >> And this conversion also enables further optimizations that recognize maskAll patterns, see [1]. >> >> Some JTReg test cases are added to ensure the optimization is effective. >> >> I tried many different ways to write a JMH benchmark, but failed. Since the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific compile-time constant, the statement will be hoisted out of the loop. If we don't use a loop, the hotspot will become other instructions, and no obvious performance change was observed. However, combined with the optimization of [1], we can observe a performance improvement of about 7% on both aarch64 and x64. 
>> >> The patch was tested on both aarch64 and x64, all of tier1 tier2 and tier3 tests passed. >> >> [1] https://github.com/openjdk/jdk/pull/24674 > > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Simplify the test code src/hotspot/share/opto/vectorIntrinsics.cpp line 707: > 705: elem_bt = converted_elem_bt; > 706: bits = gvn().longcon((bits_type->get_con() & 1L) == 0L ? 0L : -1L); > 707: } else if (!arch_supports_vector(opc, num_elem, elem_bt, checkFlags, true /*has_scalar_args*/)) { I think it's appropriate to make this change as part of VectorLongToMaskNode::Ideal routine to give the opportunity for this transformation during the Iterative GVN pass. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2184478552 From epeter at openjdk.org Fri Jul 4 06:13:45 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 4 Jul 2025 06:13:45 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v2] In-Reply-To: References: Message-ID: On Thu, 26 Jun 2025 07:55:23 GMT, Hannes Greule wrote: >> Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. >> >> Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. >> >> I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. >> >> Please review. Thanks. > > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > remove classfile version Just a drive-by comment. Won't have time for a full review for a few weeks. test/hotspot/jtreg/compiler/c2/gvn/ReverseBytesConstantsTests.java line 23: > 21: * questions. > 22: */ > 23: package compiler.c2.gvn; Why did you remove the package? You can add the `jasm` file to the package too, I think that should work, no? ------------- PR Review: https://git.openjdk.org/jdk/pull/25988#pullrequestreview-2985704166 PR Review Comment: https://git.openjdk.org/jdk/pull/25988#discussion_r2184491532 From thartmann at openjdk.org Fri Jul 4 06:15:40 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 4 Jul 2025 06:15:40 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode [v4] In-Reply-To: References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> Message-ID: On Fri, 4 Jul 2025 01:30:22 GMT, hanguanqiang wrote: >> This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode >> >> Problem? >> When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. >> >> Root Cause? >> Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. >> >> Fix >> Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. 
> > hanguanqiang has updated the pull request incrementally with one additional commit since the last revision: > > Delete .gitpod.yml Right, my intention when filing this bug was to remove the flag: https://bugs.openjdk.org/browse/JDK-8358568?focusedId=14786499&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14786499 I should have made that more explicit. Removal of this flag looks good to me. Changes requested by thartmann (Reviewer). src/hotspot/share/opto/callnode.cpp line 1456: > 1454: Node* top = Compile::current()->top(); > 1455: ins_req(nextmon, top); > 1456: ins_req(nextmon, top); Wait, this is wrong. The monitor inputs should not be set to top. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26108#pullrequestreview-2985715983 PR Review: https://git.openjdk.org/jdk/pull/26108#pullrequestreview-2985718643 PR Review Comment: https://git.openjdk.org/jdk/pull/26108#discussion_r2184500795 From jbhateja at openjdk.org Fri Jul 4 06:21:40 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 4 Jul 2025 06:21:40 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 07:10:22 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is relative smaller than that of `fromLong`. This patch does the conversion for these cases if `l` is a compile time constant. >> >> And this conversion also enables further optimizations that recognize maskAll patterns, see [1]. >> >> Some JTReg test cases are added to ensure the optimization is effective. >> >> I tried many different ways to write a JMH benchmark, but failed. Since the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific compile-time constant, the statement will be hoisted out of the loop. If we don't use a loop, the hotspot will become other instructions, and no obvious performance change was observed. However, combined with the optimization of [1], we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> The patch was tested on both aarch64 and x64, all of tier1 tier2 and tier3 tests passed. >> >> [1] https://github.com/openjdk/jdk/pull/24674 > > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Simplify the test code Can you kindly include a micro with this patch? ``` public static final VectorSpecies FSP = FloatVector.SPECIES_512; public static long micro1(long a) { long mask = Math.min(-1, Math.max(-1, a)); return VectorMask.fromLong(FSP, mask).toLong(); } public static long micro2() { return FSP.maskAll(true).toLong(); } Your patch now removes L2M and M2L IR nodes. Baseline:- SPR2>java --add-modules=jdk.incubator.vector -Xbatch -XX:CompileCommand=PrintIdealPhase,test_mask_all::micro1,BEFORE_MATCHING -XX:-TieredCompilation -cp . 
test_mask_all 0 AFTER: BEFORE_MATCHING 65 ConL === 0 [[ 377 ]] #long:65535 369 Return === 5 6 7 8 9 returns 399 [[ 0 ]] 377 VectorLongToMask === _ 65 [[ 398 ]] #vectormask !jvms: VectorMask::fromLong @ bci:39 (line 243) test_mask_all::micro1 @ bci:18 (line 9) 398 VectorMaskCast === _ 377 [[ 399 ]] #vectormask !jvms: Float512Vector$Float512Mask::toLong @ bci:35 (line 765) test_mask_all::micro1 @ bci:21 (line 9) 399 VectorMaskToLong === _ 398 [[ 369 ]] #long !jvms: Float512Vector$Float512Mask::toLong @ bci:35 (line 765) test_mask_all::micro1 @ bci:21 (line 9) [time] 5 ms [res] 1310700000000 With patch:- XX:CompileCommand=PrintIdealPhase,test_mask_all::micro1,BEFORE_MATCHING -XX:-TieredCompilation -cp . test_mask_all 0 CompileCommand: PrintIdealPhase test_mask_all.micro1 const char* PrintIdealPhase = 'BEFORE_MATCHING' WARNING: Using incubator modules: jdk.incubator.vector AFTER: BEFORE_MATCHING 65 ConL === 0 [[ 369 ]] #long:65535 369 Return === 5 6 7 8 9 returns 65 [[ 0 ]] [time] 3 ms [res] 1310700000000 ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3034669174 From yongheng_hgq at 126.com Fri Jul 4 06:27:11 2025 From: yongheng_hgq at 126.com (h) Date: Fri, 4 Jul 2025 14:27:11 +0800 (CST) Subject: RFR: 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache Message-ID: <2dfbc1de.4cd5.197d41e17ac.Coremail.yongheng_hgq@126.com> Hi all, The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is confusing and does not reflect the current implementation. This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. Commit messages: - 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache Changes: https://github.com/openjdk/jdk/pull/26114/files Webrev: https://openjdk.github.io/cr/?repo=jdk&pr=26114&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8344548 Patch: https://git.openjdk.org/jdk/pull/26114.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26114/head:pull/26114 PR: https://github.com/openjdk/jdk/pull/26114 BR -------------- next part -------------- An HTML attachment was scrubbed... URL: From hgreule at openjdk.org Fri Jul 4 06:36:44 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Fri, 4 Jul 2025 06:36:44 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v2] In-Reply-To: References: Message-ID: On Fri, 4 Jul 2025 06:04:42 GMT, Emanuel Peter wrote: >> Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: >> >> remove classfile version > > test/hotspot/jtreg/compiler/c2/gvn/ReverseBytesConstantsTests.java line 23: > >> 21: * questions. >> 22: */ >> 23: package compiler.c2.gvn; > > Why did you remove the package? You can add the `jasm` file to the package too, I think that should work, no? It seems like most files in the gvn folder don't have a package declaration, that's why I thought adjusting this way is fine. But I can also add it back and put the jasm file in the package too. 
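For context on the jasm discussion above: plain Java source cannot pass an untruncated int to the char/short reverseBytes overloads, because javac inserts a narrowing conversion at the call site; that is why the test goes through jasm in the first place. A small illustrative snippet, not taken from the actual test:

```
// javac narrows the int before the call, so the upper bytes never reach
// Character.reverseBytes(); only hand-written bytecode (e.g. jasm) can feed
// the wide constant straight into the (char) parameter, which is the
// constant-folding situation the fix above is exercised with.
int wide = 0x11223344;
char c = (char) wide;                  // narrowed by javac: c == 0x3344
char swapped = Character.reverseBytes(c);
System.out.println(Integer.toHexString(swapped));   // prints "4433"
```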
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25988#discussion_r2184531892 From duke at openjdk.org Fri Jul 4 06:43:02 2025 From: duke at openjdk.org (hanguanqiang) Date: Fri, 4 Jul 2025 06:43:02 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode [v5] In-Reply-To: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> Message-ID: <8h5rro_3Sv9GLk3WSTstiTPXEj1dHPPHgCKowe-CIjk=.e15f5800-732f-4f25-81f8-ae86724bec2e@github.com> > This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode > > Problem? > When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. > > Root Cause? > Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. > > Fix > Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. hanguanqiang has updated the pull request incrementally with one additional commit since the last revision: correct an error correct an error ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26108/files - new: https://git.openjdk.org/jdk/pull/26108/files/1d6e8f5c..6ebc2ecb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26108&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26108&range=03-04 Stats: 3 lines in 1 file changed: 0 ins; 1 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26108.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26108/head:pull/26108 PR: https://git.openjdk.org/jdk/pull/26108 From duke at openjdk.org Fri Jul 4 06:47:39 2025 From: duke at openjdk.org (hanguanqiang) Date: Fri, 4 Jul 2025 06:47:39 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode [v4] In-Reply-To: References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> Message-ID: <2sjVycRIgOfB6aRtJMfVYVOB3iDnmD97Y-DbbjzupU8=.23cd6fc6-36c9-4d8b-8d15-f74a05c3cfe8@github.com> On Fri, 4 Jul 2025 06:12:44 GMT, Tobias Hartmann wrote: >> hanguanqiang has updated the pull request incrementally with one additional commit since the last revision: >> >> Delete .gitpod.yml > > src/hotspot/share/opto/callnode.cpp line 1456: > >> 1454: Node* top = Compile::current()->top(); >> 1455: ins_req(nextmon, top); >> 1456: ins_req(nextmon, top); > > Wait, this is wrong. The monitor inputs should not be set to top. @TobiHartmann Thank you for pointing out the issue ? I?ve made the correction as suggested. Could you please take another look when you have time? 
Thanks again for your review and feedback ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26108#discussion_r2184546780 From dnsimon at openjdk.org Fri Jul 4 07:39:45 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Fri, 4 Jul 2025 07:39:45 GMT Subject: RFR: 8361355: Negative cases of Annotated.getAnnotationData implementations are broken [v3] In-Reply-To: <83aGkzmp5J7JllBsWK5ZzwZAa4GVsNk5VjmkH0O3FjE=.2507d7ce-65df-4121-acdf-35125d530d39@github.com> References: <83aGkzmp5J7JllBsWK5ZzwZAa4GVsNk5VjmkH0O3FjE=.2507d7ce-65df-4121-acdf-35125d530d39@github.com> Message-ID: On Thu, 3 Jul 2025 17:30:56 GMT, Doug Simon wrote: >> This PR fixes bugs in the implementation of `jdk.vm.ci.meta.Annotated.getAnnotationData`: >> 1. Calling `getAnnotatedData(annotationType)` fails with an ArrayIndexOutOfBoundsException instead of returning null when the receiver type is not annotated by `annotationType`. >> 2. Calling either of the `getAnnotatedData` methods with an `annotationType` value that does not represent an annotation interface silently succeeds when the receiver type does not (or can not) have any annotations (e.g. array and primitive types). > > Doug Simon has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge tag 'jdk-26+4' into JDK-8361355 > > Added tag jdk-26+4 for changeset 1ca008fd > - fixed negative cases in getAnnotationData Thanks for the review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26116#issuecomment-3034837278 From dnsimon at openjdk.org Fri Jul 4 07:39:46 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Fri, 4 Jul 2025 07:39:46 GMT Subject: Integrated: 8361355: Negative cases of Annotated.getAnnotationData implementations are broken In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 12:52:10 GMT, Doug Simon wrote: > This PR fixes bugs in the implementation of `jdk.vm.ci.meta.Annotated.getAnnotationData`: > 1. Calling `getAnnotatedData(annotationType)` fails with an ArrayIndexOutOfBoundsException instead of returning null when the receiver type is not annotated by `annotationType`. > 2. Calling either of the `getAnnotatedData` methods with an `annotationType` value that does not represent an annotation interface silently succeeds when the receiver type does not (or can not) have any annotations (e.g. array and primitive types). This pull request has now been integrated. Changeset: 5cf349c3 Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/5cf349c3b08324e994a4143dcc34a59fd81323f9 Stats: 111 lines in 7 files changed: 94 ins; 1 del; 16 mod 8361355: Negative cases of Annotated.getAnnotationData implementations are broken Reviewed-by: never ------------- PR: https://git.openjdk.org/jdk/pull/26116 From shade at openjdk.org Fri Jul 4 08:09:53 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 4 Jul 2025 08:09:53 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 13:33:23 GMT, Aleksey Shipilev wrote: >> We routinely CTW 3rd party JARs to make sure our compilers work. By the nature of the JARs, they have dependencies on other JARs, and CTW runner frequently warns out with NCDFE. It does so very crudely, missing opportunities to compile the methods that _do not_ trigger NCDFEs. CTW should be made more tolerant to this. 
I think the normal "modules" CTW runs into the similar problem, but on a lesser scale, as we do not have a very hairy dependency graph within JDK. >> >> The CTW logs are also fairly noisy with full exception traces when NCDFE is semi-expected. This PR does _not_ print exception stack traces in these cases, only "NOTE"-s about it. This makes the log fairly clean and more understandable. >> >> Motivational scope improvement compiling a sample 3rd party JAR (cassandra-2.1.4.0.jar): >> >> >> Before: Done (2487 classes, 9866 methods, 24584 ms) >> After: Done (2487 classes, 10074 methods, 24150 ms) ; +2% more methods >> >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Just use printf directly @TobiHartmann -- do you want to run this through CTW testing as well, to see if there are any new failures? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26090#issuecomment-3034913103 From rrich at openjdk.org Fri Jul 4 08:14:19 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Fri, 4 Jul 2025 08:14:19 GMT Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining [v2] In-Reply-To: References: Message-ID: <5IVOmLIL__iwmhFaKEP80YNP8kIN94owhC3SIU8ZF4U=.ab4061f3-e62f-4586-9118-7f84f246078d@github.com> > This PR adds CompileCommands to the test DumpThreadsWithEliminatedLock.java to force inlining of java/lang/String*.* methods. This will make inlining more stable to allow for the expected lock elimination based on c2 escape analysis. > > Forcing inlining of java/lang/StringBuffer.* wasn't sufficient on x86_64. With that the test still failed with TieredCompilation disabled. > > Testing: x86_64, ppc64 manually. Other major platforms as part of our CI testing. > > Failed inlining on x86_64 with TieredCompilation disabled: > > > make test TEST=com/sun/management/HotSpotDiagnosticMXBean/DumpThreadsWithEliminatedLock.java TEST_VM_OPTS="-XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=PrintInlining,DumpThreadsWithEliminatedLock.*" JTREG=TIMEOUT_FACTOR=0.1 > > [...] > > STDOUT: > CompileCommand: PrintInlining DumpThreadsWithEliminatedLock.* bool PrintInlining = true > @ 1 java.util.concurrent.atomic.AtomicBoolean::get (13 bytes) inline (hot) > @ 11 java.lang.StringBuffer:: (7 bytes) inline (hot) late inline succeeded (string method) > @ 3 java.lang.AbstractStringBuilder:: (39 bytes) inline (hot) > @ 1 java.lang.Object:: (1 bytes) inline (hot) > @ 16 java.lang.System::currentTimeMillis (0 bytes) (intrinsic) > s @ 19 java.lang.StringBuffer::append (13 bytes) failed to inline: already compiled into a big method > s @ 24 java.lang.StringBuffer::toString (44 bytes) inline (hot) late inline succeeded (string method) > s @ 1 java.lang.StringBuffer::length (5 bytes) accessor > @ 24 java.lang.String:: (98 bytes) failed to inline: already compiled into a big method > @ 30 java.util.concurrent.atomic.AtomicReference::set (6 bytes) accessor > 2025-07-02T09:25:53.396634900Z Attempt 1, found: false > 2025-07-02T09:25:53.415673072Z Attempt 2, found: false > 2025-07-02T09:25:53.418876867Z Attempt 3, found: false > > [...] 
Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: Allow vm.debug ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26033/files - new: https://git.openjdk.org/jdk/pull/26033/files/8561d522..a43e54db Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26033&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26033&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26033.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26033/head:pull/26033 PR: https://git.openjdk.org/jdk/pull/26033 From rrich at openjdk.org Fri Jul 4 08:14:20 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Fri, 4 Jul 2025 08:14:20 GMT Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining [v2] In-Reply-To: References: <0p61J0DPfyHsen3r__V82eEZSPYaT9rZleHBtanKaRc=.c5f6992f-a7fe-4c95-bdcb-2887c3dbde21@github.com> Message-ID: On Thu, 3 Jul 2025 17:57:27 GMT, Alan Bateman wrote: > > I found that the runtime of each test is ~300ms with a release build and ~11s with a fastdebug build on x86_64 and ppc64. If you like I can remove the requirement within this pr and do some more testing. -Xcomp doesn't seem to work. > > I think that would be useful, thank you. I've removed the `!vm.debug` requirement. I'll await our local testing of the pr on a wider range of platforms. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26033#issuecomment-3034923279 From dbriemann at openjdk.org Fri Jul 4 08:16:59 2025 From: dbriemann at openjdk.org (David Briemann) Date: Fri, 4 Jul 2025 08:16:59 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v2] In-Reply-To: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: > Implement more nodes for ppc that exist on other platforms. 
David Briemann has updated the pull request incrementally with one additional commit since the last revision: add >= power9 check for NegVI ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26115/files - new: https://git.openjdk.org/jdk/pull/26115/files/d19e627d..00f37d7a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26115&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26115&range=00-01 Stats: 5 lines in 2 files changed: 4 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26115.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26115/head:pull/26115 PR: https://git.openjdk.org/jdk/pull/26115 From dbriemann at openjdk.org Fri Jul 4 08:16:59 2025 From: dbriemann at openjdk.org (David Briemann) Date: Fri, 4 Jul 2025 08:16:59 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v2] In-Reply-To: <0lfxjYCRp6xFM8c_RDhbLEtbwM5J3huFxjcOqcWVykU=.908af2c8-2a2d-4719-b598-45b716ab8658@github.com> References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> <0lfxjYCRp6xFM8c_RDhbLEtbwM5J3huFxjcOqcWVykU=.908af2c8-2a2d-4719-b598-45b716ab8658@github.com> Message-ID: On Thu, 3 Jul 2025 14:14:57 GMT, Martin Doerr wrote: >> David Briemann has updated the pull request incrementally with one additional commit since the last revision: >> >> add >= power9 check for NegVI > > src/hotspot/cpu/ppc/ppc.ad line 2196: > >> 2194: case Op_AbsVF: >> 2195: case Op_AbsVD: >> 2196: case Op_NegVI: > > vnegw requires Power9 (`PowerArchitecturePPC64 >= 9`). Thanks for catching that. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26115#discussion_r2184700391 From shade at openjdk.org Fri Jul 4 09:08:19 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 4 Jul 2025 09:08:19 GMT Subject: RFR: 8357473: Compilation spike leaves many CompileTasks in free list [v5] In-Reply-To: References: Message-ID: > See bug for more discussion. > > This PR implements the "all the way" solution by removing the free list completely. It complements https://github.com/openjdk/jdk/pull/25364, and can go either first, or second. We will remerge the other one once either integrates. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `compiler` > - [x] Linux AArch64 server fastdebug, `all` Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits: - Merge branch 'master' into JDK-8357473-compile-task-free-list - Merge branch 'master' into JDK-8357473-compile-task-free-list - Merge branch 'master' into JDK-8357473-compile-task-free-list - Merge branch 'master' into JDK-8357473-compile-task-free-list - Also free the lock! 
- Comments and indenting - Basic deletion ------------- Changes: https://git.openjdk.org/jdk/pull/25409/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=25409&range=04 Stats: 134 lines in 6 files changed: 27 ins; 68 del; 39 mod Patch: https://git.openjdk.org/jdk/pull/25409.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25409/head:pull/25409 PR: https://git.openjdk.org/jdk/pull/25409 From aph at openjdk.org Fri Jul 4 09:14:39 2025 From: aph at openjdk.org (Andrew Haley) Date: Fri, 4 Jul 2025 09:14:39 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> Message-ID: On Thu, 3 Jul 2025 06:10:28 GMT, Xiaohong Gong wrote: >> ### Background >> On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. >> >> For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. >> >> To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. >> >> ### Impact Analysis >> #### 1. Vector types >> Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. >> >> #### 2. Vector API >> No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. >> >> #### 3. Auto-vectorization >> Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. >> >> #### 4. Codegen of vector nodes >> NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. >> >> Details: >> - Lanewise vector operations are unaffected as explained above. >> - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). >> - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, addin... 
> > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Refine the comment in ad file This looks good. Thanks. ------------- Marked as reviewed by aph (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26057#pullrequestreview-2986219682 From xgong at openjdk.org Fri Jul 4 09:17:40 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 4 Jul 2025 09:17:40 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> Message-ID: On Fri, 4 Jul 2025 09:11:40 GMT, Andrew Haley wrote: > This looks good. Thanks. Thanks so much for your review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3035115512 From shade at openjdk.org Fri Jul 4 09:29:13 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 4 Jul 2025 09:29:13 GMT Subject: RFR: 8361397: Rework CompileLog list synchronization Message-ID: <12Yp6QmpXqG-1UXTS8VveJ4yDNnDEGFV2q3_vRc3lF0=.4ccf05e2-9249-4b55-b48f-4f7fc17bef65@github.com> I want to remove `CompileTaskAlloc_lock` completely with [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473), and for that we need to fix a stray use of that lock in CompileLog list linkage. We can rewrite that part to atomics. Additional testing: - [ ] Linux x86_64 server fastdebug, `compiler` ------------- Commit messages: - Fix Changes: https://git.openjdk.org/jdk/pull/26127/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26127&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361397 Stats: 11 lines in 2 files changed: 4 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/26127.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26127/head:pull/26127 PR: https://git.openjdk.org/jdk/pull/26127 From shade at openjdk.org Fri Jul 4 09:29:41 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 4 Jul 2025 09:29:41 GMT Subject: RFR: 8358568: C2 compilation hits "must have a monitor" assert with -XX:-GenerateSynchronizationCode [v5] In-Reply-To: <8h5rro_3Sv9GLk3WSTstiTPXEj1dHPPHgCKowe-CIjk=.e15f5800-732f-4f25-81f8-ae86724bec2e@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> <8h5rro_3Sv9GLk3WSTstiTPXEj1dHPPHgCKowe-CIjk=.e15f5800-732f-4f25-81f8-ae86724bec2e@github.com> Message-ID: On Fri, 4 Jul 2025 06:43:02 GMT, hanguanqiang wrote: >> This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode >> >> Problem? >> When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. >> >> Root Cause? >> Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. >> >> Fix >> Add a check in do_monitor_exit() to skip monitor unlocking if GenerateSynchronizationCode is false, avoiding invalid monitor access and preventing the crash. > > hanguanqiang has updated the pull request incrementally with one additional commit since the last revision: > > correct an error > > correct an error I renamed the JBS bug, match the PR title, please. Also, go to https://github.com/hgqxjj/jdk/actions -- and enable the workflows. 
We need to have a clean GHA run before we can integrate. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26108#issuecomment-3035170988 From mhaessig at openjdk.org Fri Jul 4 09:43:40 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Fri, 4 Jul 2025 09:43:40 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v2] In-Reply-To: References: Message-ID: On Thu, 26 Jun 2025 07:55:23 GMT, Hannes Greule wrote: >> Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. >> >> Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. >> >> I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. >> >> Please review. Thanks. > > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > remove classfile version @SirYwell, thank you for fixing this. It looks good overall, but it would be good to add the package. I think we do this for all new tests. I kicked off some testing and will let you know about the results. src/hotspot/share/opto/subnode.cpp line 2031: > 2029: case Op_ReverseBytesUS: return TypeInt::make(byteswap(static_cast(con->is_int()->get_con()))); > 2030: case Op_ReverseBytesI: return TypeInt::make(byteswap(con->is_int()->get_con())); > 2031: case Op_ReverseBytesL: return TypeLong::make(byteswap(con->is_long()->get_con())); Why are you dropping the `checked_cast` here? Were they just an abundance of caution before? ------------- Changes requested by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/25988#pullrequestreview-2986310035 PR Review Comment: https://git.openjdk.org/jdk/pull/25988#discussion_r2184863934 From mdoerr at openjdk.org Fri Jul 4 10:11:39 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Fri, 4 Jul 2025 10:11:39 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v2] In-Reply-To: References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: On Fri, 4 Jul 2025 08:16:59 GMT, David Briemann wrote: >> Implement more nodes for ppc that exist on other platforms. > > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > add >= power9 check for NegVI I suggest removing the NegVI again. test/hotspot/jtreg/compiler/intrinsics/TestCompareUnsigned.java line 34: > 32: * @bug 8283726 8287925 > 33: * @requires os.arch=="amd64" | os.arch=="x86_64" | os.arch=="aarch64" | os.arch=="riscv64" | os.arch=="ppc64" | os.arch=="ppc64le" > 34: The test expects "CmpU3" for integers to be available. Can you implement that, too, please? ------------- Changes requested by mdoerr (Reviewer). 
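For readers who want the Java-level view of the nodes requested here: CmpU3/CmpUL3 are the C2 nodes produced for the intrinsified three-way unsigned compares, and UMulHiL (as I understand it) backs the unsigned high-order multiply. A small stand-alone sketch of code that exercises them; class and method names are illustrative only:

```
public class UnsignedOpsExample {
    static int  cmpU3(int a, int b)     { return Integer.compareUnsigned(a, b); }   // CmpU3
    static int  cmpUL3(long a, long b)  { return Long.compareUnsigned(a, b); }      // CmpUL3
    static long umulHiL(long a, long b) { return Math.unsignedMultiplyHigh(a, b); } // UMulHiL

    public static void main(String[] args) {
        System.out.println(cmpU3(-1, 1) > 0);    // true: 0xFFFFFFFF is the largest unsigned int
        System.out.println(cmpUL3(1L, 2L) < 0);  // true
        System.out.println(umulHiL(-1L, -1L));   // -2: high 64 bits of (2^64 - 1)^2
    }
}
```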
PR Review: https://git.openjdk.org/jdk/pull/26115#pullrequestreview-2986428202 PR Review Comment: https://git.openjdk.org/jdk/pull/26115#discussion_r2184925416 From mdoerr at openjdk.org Fri Jul 4 10:11:40 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Fri, 4 Jul 2025 10:11:40 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v2] In-Reply-To: References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> <0lfxjYCRp6xFM8c_RDhbLEtbwM5J3huFxjcOqcWVykU=.908af2c8-2a2d-4719-b598-45b716ab8658@github.com> Message-ID: On Fri, 4 Jul 2025 08:14:11 GMT, David Briemann wrote: >> src/hotspot/cpu/ppc/ppc.ad line 2196: >> >>> 2194: case Op_AbsVF: >>> 2195: case Op_AbsVD: >>> 2196: case Op_NegVI: >> >> vnegw requires Power9 (`PowerArchitecturePPC64 >= 9`). > > Thanks for catching that. I think we'd need to check that here, too. Otherwise we'd get "bad AD file" errors. However, there's another problem: vnegw computes the one?s-complement for each element, but we'd need two?s-complement. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26115#discussion_r2184936783 From shade at openjdk.org Fri Jul 4 10:18:41 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 4 Jul 2025 10:18:41 GMT Subject: RFR: 8358568: Purge obsolete/broken GenerateSynchronizationCode flag [v5] In-Reply-To: <8h5rro_3Sv9GLk3WSTstiTPXEj1dHPPHgCKowe-CIjk=.e15f5800-732f-4f25-81f8-ae86724bec2e@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> <8h5rro_3Sv9GLk3WSTstiTPXEj1dHPPHgCKowe-CIjk=.e15f5800-732f-4f25-81f8-ae86724bec2e@github.com> Message-ID: On Fri, 4 Jul 2025 06:43:02 GMT, hanguanqiang wrote: >> This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode >> >> Problem? >> When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. >> >> Root Cause? >> Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. >> >> Fix >> Purge obsolete/broken GenerateSynchronizationCode flag > > hanguanqiang has updated the pull request incrementally with one additional commit since the last revision: > > correct an error > > correct an error Looks good to me. ------------- Marked as reviewed by shade (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26108#pullrequestreview-2986479330 From alanb at openjdk.org Fri Jul 4 10:27:39 2025 From: alanb at openjdk.org (Alan Bateman) Date: Fri, 4 Jul 2025 10:27:39 GMT Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining [v2] In-Reply-To: <5IVOmLIL__iwmhFaKEP80YNP8kIN94owhC3SIU8ZF4U=.ab4061f3-e62f-4586-9118-7f84f246078d@github.com> References: <5IVOmLIL__iwmhFaKEP80YNP8kIN94owhC3SIU8ZF4U=.ab4061f3-e62f-4586-9118-7f84f246078d@github.com> Message-ID: On Fri, 4 Jul 2025 08:14:19 GMT, Richard Reingruber wrote: >> This PR adds CompileCommands to the test DumpThreadsWithEliminatedLock.java to force inlining of java/lang/String*.* methods. This will make inlining more stable to allow for the expected lock elimination based on c2 escape analysis. >> >> Forcing inlining of java/lang/StringBuffer.* wasn't sufficient on x86_64. 
With that the test still failed with TieredCompilation disabled. >> >> Testing: x86_64, ppc64 manually. Other major platforms as part of our CI testing. >> >> Failed inlining on x86_64 with TieredCompilation disabled: >> >> >> make test TEST=com/sun/management/HotSpotDiagnosticMXBean/DumpThreadsWithEliminatedLock.java TEST_VM_OPTS="-XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=PrintInlining,DumpThreadsWithEliminatedLock.*" JTREG=TIMEOUT_FACTOR=0.1 >> >> [...] >> >> STDOUT: >> CompileCommand: PrintInlining DumpThreadsWithEliminatedLock.* bool PrintInlining = true >> @ 1 java.util.concurrent.atomic.AtomicBoolean::get (13 bytes) inline (hot) >> @ 11 java.lang.StringBuffer:: (7 bytes) inline (hot) late inline succeeded (string method) >> @ 3 java.lang.AbstractStringBuilder:: (39 bytes) inline (hot) >> @ 1 java.lang.Object:: (1 bytes) inline (hot) >> @ 16 java.lang.System::currentTimeMillis (0 bytes) (intrinsic) >> s @ 19 java.lang.StringBuffer::append (13 bytes) failed to inline: already compiled into a big method >> s @ 24 java.lang.StringBuffer::toString (44 bytes) inline (hot) late inline succeeded (string method) >> s @ 1 java.lang.StringBuffer::length (5 bytes) accessor >> @ 24 java.lang.String:: (98 bytes) failed to inline: already compiled into a big method >> @ 30 java.util.concurrent.atomic.AtomicReference::set (6 bytes) accessor >> 2025-07-02T09:25:53.396634900Z Attempt 1, found: false >> 2025-07-02T09:25:53.415673072Z Attempt 2, found: false >> 2025-07-02T09:25:53.418876867Z Attempt 3, found: false >> >> [...] > > Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: > > Allow vm.debug Marked as reviewed by alanb (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26033#pullrequestreview-2986518539 From duke at openjdk.org Fri Jul 4 10:56:42 2025 From: duke at openjdk.org (erifan) Date: Fri, 4 Jul 2025 10:56:42 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: <_RERljqu_FG7ZyneAk7Thd-9TwED18pQpEBz_i105fY=.b8948a23-273a-49f6-b9cb-6b611a5eedc6@github.com> References: <_RERljqu_FG7ZyneAk7Thd-9TwED18pQpEBz_i105fY=.b8948a23-273a-49f6-b9cb-6b611a5eedc6@github.com> Message-ID: <6SXA9ZrXBDhZLyXP3lXbkpl4dl3iocvDpzPrUpIQOl8=.9b025be2-848b-4b78-a5e4-929cb7e9f798@github.com> On Fri, 4 Jul 2025 05:53:41 GMT, Jatin Bhateja wrote: >> erifan has updated the pull request incrementally with one additional commit since the last revision: >> >> Simplify the test code > > src/hotspot/share/opto/vectorIntrinsics.cpp line 707: > >> 705: elem_bt = converted_elem_bt; >> 706: bits = gvn().longcon((bits_type->get_con() & 1L) == 0L ? 0L : -1L); >> 707: } else if (!arch_supports_vector(opc, num_elem, elem_bt, checkFlags, true /*has_scalar_args*/)) { > > I think it's appropriate to make this change as part of VectorLongToMaskNode::Ideal routine to give the opportunity for this transformation during the Iterative GVN pass. Originally I also tried to implement it in IGVN, but later changed it to Intrinsic. For two reasons: 1. Implementing in intrinsic is relatively simpler and has better performance because it saves the process of generating `VectorLongToMaskNode`. 2. Implementing in intrinsic can support more cases. Because some architectures (such as aarch64 `NEON`) currently do not support the generation of `VectorLongToMaskNode,` but support `MaskAll` or `Replicate` nodes, if implemented in IGVN, then this optimization doesn't work for NEON. 
But implementing in Intrinsic can cover such cases. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2185045860 From dbriemann at openjdk.org Fri Jul 4 10:58:56 2025 From: dbriemann at openjdk.org (David Briemann) Date: Fri, 4 Jul 2025 10:58:56 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v3] In-Reply-To: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: > Implement more nodes for ppc that exist on other platforms. David Briemann has updated the pull request incrementally with one additional commit since the last revision: add CmpU3, ppc9 check in match_rule_supported ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26115/files - new: https://git.openjdk.org/jdk/pull/26115/files/00f37d7a..6d05a728 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26115&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26115&range=01-02 Stats: 17 lines in 1 file changed: 17 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26115.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26115/head:pull/26115 PR: https://git.openjdk.org/jdk/pull/26115 From thartmann at openjdk.org Fri Jul 4 11:05:43 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 4 Jul 2025 11:05:43 GMT Subject: RFR: 8358568: Purge obsolete/broken GenerateSynchronizationCode flag [v5] In-Reply-To: <8h5rro_3Sv9GLk3WSTstiTPXEj1dHPPHgCKowe-CIjk=.e15f5800-732f-4f25-81f8-ae86724bec2e@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> <8h5rro_3Sv9GLk3WSTstiTPXEj1dHPPHgCKowe-CIjk=.e15f5800-732f-4f25-81f8-ae86724bec2e@github.com> Message-ID: On Fri, 4 Jul 2025 06:43:02 GMT, hanguanqiang wrote: >> This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode >> >> Problem? >> When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. >> >> Root Cause? >> Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. >> >> Fix >> Purge obsolete/broken GenerateSynchronizationCode flag > > hanguanqiang has updated the pull request incrementally with one additional commit since the last revision: > > correct an error > > correct an error Looks good. ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26108#pullrequestreview-2986682348 From duke at openjdk.org Fri Jul 4 11:10:40 2025 From: duke at openjdk.org (erifan) Date: Fri, 4 Jul 2025 11:10:40 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: References: Message-ID: On Fri, 4 Jul 2025 06:18:02 GMT, Jatin Bhateja wrote: > public static final VectorSpecies FSP = FloatVector.SPECIES_512; public static long micro1(long a) { long mask = Math.min(-1, Math.max(-1, a)); return VectorMask.fromLong(FSP, mask).toLong(); } public static long micro2() { return FSP.maskAll(true).toLong(); } With this JMH method we can not see obvious performance improvement, because the hot spots are other instructions. Adding a loop is better. @Benchmark public long micro_3() { long result = 0; for (int i = 0; i < ITERATION; i++) { long mask = Math.min(-1, Math.max(-1, result)); result += VectorMask.fromLong(FSP, mask).toLong(); } return result; } But if it is not a floating point type, there will be no obvious performance improvement. Because the pattern `VectorMaskToLong(VectorLongToMask (l))` for integer types has been implemented, and `VectorMaskToLong(VectorMaskCast (VectorLongToMask (l)))` for floating-point types is not implemented. So if we add JMH benchmarks for this optimization, we can only see good performance gain from floating point types. So do you think it is necessary? @jatin-bhateja Thanks for your review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3035646085 From dbriemann at openjdk.org Fri Jul 4 11:22:59 2025 From: dbriemann at openjdk.org (David Briemann) Date: Fri, 4 Jul 2025 11:22:59 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v4] In-Reply-To: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: > Implement more nodes for ppc that exist on other platforms. David Briemann has updated the pull request incrementally with one additional commit since the last revision: fix typo ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26115/files - new: https://git.openjdk.org/jdk/pull/26115/files/6d05a728..a7c9f6be Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26115&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26115&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26115.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26115/head:pull/26115 PR: https://git.openjdk.org/jdk/pull/26115 From dbriemann at openjdk.org Fri Jul 4 11:31:24 2025 From: dbriemann at openjdk.org (David Briemann) Date: Fri, 4 Jul 2025 11:31:24 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v5] In-Reply-To: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: > Implement more nodes for ppc that exist on other platforms. 
David Briemann has updated the pull request incrementally with one additional commit since the last revision: adjust parameter types ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26115/files - new: https://git.openjdk.org/jdk/pull/26115/files/a7c9f6be..ebb27c9c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26115&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26115&range=03-04 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26115.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26115/head:pull/26115 PR: https://git.openjdk.org/jdk/pull/26115 From dlunden at openjdk.org Fri Jul 4 11:52:52 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Fri, 4 Jul 2025 11:52:52 GMT Subject: RFR: 8325467: Support methods with many arguments in C2 [v23] In-Reply-To: References: Message-ID: <-ZInU3fxIuRKYX9cUOJBCIq8gUHruo0qINmgjKWT_Dg=.aa7af726-2d75-4cbe-a113-d7ed396e19ed@github.com> On Mon, 23 Jun 2025 14:31:24 GMT, Daniel Lund?n wrote: >> If a method has a large number of parameters, we currently bail out from C2 compilation. >> >> ### Changeset >> >> Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes. >> >> Changes: >> - Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current `RegMask` interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. >> - Generalize the "chunk"-logic from `PhaseChaitin::Select()` to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the `PhaseChaitin::Select()` code more readable. >> - Remove all `can_represent` checks and bailouts. >> - Performance tuning. A particularly important change is the early-exit optimization in `RegMask::overlap`, used in the performance-sensitive method `PhaseChaitin::interfere_with_live`. >> - Add a new test case `TestManyMethodArguments.java` and extend an old test `TestNestedSynchronize.java`. >> >> ### Testing >> >> - [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/10178060450) >> - `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. >> - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement. >> - Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular `PhaseChaitin::remove_bound_register_from_interfering_live_ranges`). 
I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, no... > > Daniel Lund?n has updated the pull request incrementally with one additional commit since the last revision: > > Add clarifying comments at definitions of register mask sizes For reference, here is now the changeset adding an IFG bailout: https://github.com/openjdk/jdk/pull/26118 ------------- PR Comment: https://git.openjdk.org/jdk/pull/20404#issuecomment-3035850032 From mdoerr at openjdk.org Fri Jul 4 12:00:46 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Fri, 4 Jul 2025 12:00:46 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v5] In-Reply-To: References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: On Fri, 4 Jul 2025 11:31:24 GMT, David Briemann wrote: >> Implement more nodes for ppc that exist on other platforms. > > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > adjust parameter types This looks correct, now. I only have a minor suggestion. src/hotspot/cpu/ppc/ppc.ad line 13599: > 13597: %} > 13598: > 13599: instruct vnegI_reg(vecX dst, vecX src) %{ Maybe call it vneg4I? That would be more consistent with the other nodes. src/hotspot/cpu/ppc/ppc.ad line 13601: > 13599: instruct vnegI_reg(vecX dst, vecX src) %{ > 13600: match(Set dst (NegVI src)); > 13601: predicate(PowerArchitecturePPC64 >= 9); We could also for check n->as_Vector()->length() == 4 or type int. ------------- Marked as reviewed by mdoerr (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26115#pullrequestreview-2986896415 PR Review Comment: https://git.openjdk.org/jdk/pull/26115#discussion_r2185175993 PR Review Comment: https://git.openjdk.org/jdk/pull/26115#discussion_r2185177259 From jbhateja at openjdk.org Fri Jul 4 12:03:39 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 4 Jul 2025 12:03:39 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: <6SXA9ZrXBDhZLyXP3lXbkpl4dl3iocvDpzPrUpIQOl8=.9b025be2-848b-4b78-a5e4-929cb7e9f798@github.com> References: <_RERljqu_FG7ZyneAk7Thd-9TwED18pQpEBz_i105fY=.b8948a23-273a-49f6-b9cb-6b611a5eedc6@github.com> <6SXA9ZrXBDhZLyXP3lXbkpl4dl3iocvDpzPrUpIQOl8=.9b025be2-848b-4b78-a5e4-929cb7e9f798@github.com> Message-ID: On Fri, 4 Jul 2025 10:53:55 GMT, erifan wrote: >> src/hotspot/share/opto/vectorIntrinsics.cpp line 707: >> >>> 705: elem_bt = converted_elem_bt; >>> 706: bits = gvn().longcon((bits_type->get_con() & 1L) == 0L ? 0L : -1L); >>> 707: } else if (!arch_supports_vector(opc, num_elem, elem_bt, checkFlags, true /*has_scalar_args*/)) { >> >> I think it's appropriate to make this change as part of VectorLongToMaskNode::Ideal routine to give the opportunity for this transformation during the Iterative GVN pass. > > Originally I also tried to implement it in IGVN, but later changed it to Intrinsic. For two reasons: > > 1. Implementing in intrinsic is relatively simpler and has better performance because it saves the process of generating `VectorLongToMaskNode`. > 2. Implementing in intrinsic can support more cases. Because some architectures (such as aarch64 `NEON`) currently do not support the generation of `VectorLongToMaskNode,` but support `MaskAll` or `Replicate` nodes, if implemented in IGVN, then this optimization doesn't work for NEON. 
But implementing in Intrinsic can cover such cases. Hi @erifan , A few follow-up queries >> Implementing in intrinsic is relatively simpler and has better performance because it saves the process of generating VectorLongToMaskNode. What if during iterative GVN a constant -1 seeps through IR graph and gets connected to the input of VectorLongToMaskNode, you won't be able to create maskAll true in that case? >> Implementing intrinsic can support more cases. Because some architectures (such as aarch64 NEON) currently do not support the generation of VectorLongToMaskNode, but support MaskAll or Replicate nodes, if implemented in IGVN, then this optimization doesn't work for NEON. But implementing in Intrinsic can cover such cases. Do you see any advantage of doing this at intrinsic layer over entirely handling it in Java implimentation by simply modifying the opcode of fromBitsCoerced to MODE_BROADCAST from existing MODE_BITS_COERCED_LONG_TO_MASK for 0 or -1 input. https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorMask.java#L243 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2185179706 From jbhateja at openjdk.org Fri Jul 4 12:06:38 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 4 Jul 2025 12:06:38 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 07:10:22 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is relative smaller than that of `fromLong`. This patch does the conversion for these cases if `l` is a compile time constant. >> >> And this conversion also enables further optimizations that recognize maskAll patterns, see [1]. >> >> Some JTReg test cases are added to ensure the optimization is effective. >> >> I tried many different ways to write a JMH benchmark, but failed. Since the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific compile-time constant, the statement will be hoisted out of the loop. If we don't use a loop, the hotspot will become other instructions, and no obvious performance change was observed. However, combined with the optimization of [1], we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> The patch was tested on both aarch64 and x64, all of tier1 tier2 and tier3 tests passed. >> >> [1] https://github.com/openjdk/jdk/pull/24674 > > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Simplify the test code > > public static final VectorSpecies FSP = FloatVector.SPECIES_512; > > public static long micro1(long a) { > > long mask = Math.min(-1, Math.max(-1, a)); > > return VectorMask.fromLong(FSP, mask).toLong(); > > } > > public static long micro2() { > > return FSP.maskAll(true).toLong(); > > } > > With this JMH method we can not see obvious performance improvement, because the hot spots are other instructions. Adding a loop is better. There is no hard and fast rule for the inclusion of a loop in a JMH micro in that case? 
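To make the benchmark shape under discussion concrete, here is a self-contained sketch assembled from the fragments above. The loop body and the Math.min/Math.max trick for keeping the mask a compile-time constant are taken from the thread; the species choice, the ITERATION value and the JMH boilerplate are assumptions:

```
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@Fork(value = 1, jvmArgsAppend = {"--add-modules=jdk.incubator.vector"})
public class FromLongMaskAllBench {
    static final VectorSpecies<Float> FSP = FloatVector.SPECIES_PREFERRED;
    static final int ITERATION = 1024;

    @Benchmark
    public long fromLongAllSet() {
        long result = 0;
        for (int i = 0; i < ITERATION; i++) {
            // intended to fold to the constant -1, so fromLong() can be
            // strength-reduced to maskAll(true) by the optimization above
            long mask = Math.min(-1, Math.max(-1, result));
            result += VectorMask.fromLong(FSP, mask).toLong();
        }
        return result;
    }

    @Benchmark
    public long maskAllTrue() {
        long result = 0;
        for (int i = 0; i < ITERATION; i++) {
            result += FSP.maskAll(true).toLong();
        }
        return result;
    }
}
```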
------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3035920476 From kevinw at openjdk.org Fri Jul 4 12:18:42 2025 From: kevinw at openjdk.org (Kevin Walls) Date: Fri, 4 Jul 2025 12:18:42 GMT Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining [v2] In-Reply-To: <5IVOmLIL__iwmhFaKEP80YNP8kIN94owhC3SIU8ZF4U=.ab4061f3-e62f-4586-9118-7f84f246078d@github.com> References: <5IVOmLIL__iwmhFaKEP80YNP8kIN94owhC3SIU8ZF4U=.ab4061f3-e62f-4586-9118-7f84f246078d@github.com> Message-ID: On Fri, 4 Jul 2025 08:14:19 GMT, Richard Reingruber wrote: >> This PR adds CompileCommands to the test DumpThreadsWithEliminatedLock.java to force inlining of java/lang/String*.* methods. This will make inlining more stable to allow for the expected lock elimination based on c2 escape analysis. >> >> Forcing inlining of java/lang/StringBuffer.* wasn't sufficient on x86_64. With that the test still failed with TieredCompilation disabled. >> >> Testing: x86_64, ppc64 manually. Other major platforms as part of our CI testing. >> >> Failed inlining on x86_64 with TieredCompilation disabled: >> >> >> make test TEST=com/sun/management/HotSpotDiagnosticMXBean/DumpThreadsWithEliminatedLock.java TEST_VM_OPTS="-XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=PrintInlining,DumpThreadsWithEliminatedLock.*" JTREG=TIMEOUT_FACTOR=0.1 >> >> [...] >> >> STDOUT: >> CompileCommand: PrintInlining DumpThreadsWithEliminatedLock.* bool PrintInlining = true >> @ 1 java.util.concurrent.atomic.AtomicBoolean::get (13 bytes) inline (hot) >> @ 11 java.lang.StringBuffer:: (7 bytes) inline (hot) late inline succeeded (string method) >> @ 3 java.lang.AbstractStringBuilder:: (39 bytes) inline (hot) >> @ 1 java.lang.Object:: (1 bytes) inline (hot) >> @ 16 java.lang.System::currentTimeMillis (0 bytes) (intrinsic) >> s @ 19 java.lang.StringBuffer::append (13 bytes) failed to inline: already compiled into a big method >> s @ 24 java.lang.StringBuffer::toString (44 bytes) inline (hot) late inline succeeded (string method) >> s @ 1 java.lang.StringBuffer::length (5 bytes) accessor >> @ 24 java.lang.String:: (98 bytes) failed to inline: already compiled into a big method >> @ 30 java.util.concurrent.atomic.AtomicReference::set (6 bytes) accessor >> 2025-07-02T09:25:53.396634900Z Attempt 1, found: false >> 2025-07-02T09:25:53.415673072Z Attempt 2, found: false >> 2025-07-02T09:25:53.418876867Z Attempt 3, found: false >> >> [...] > > Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: > > Allow vm.debug About the test and debug mode, we had that kind of conversation in https://github.com/openjdk/jdk/pull/25958 Windows and Macosx were likely to timeout in debug builds, Linux was OK for me. Not sure if the inlining requests here affect that much, will be interesting to see. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26033#issuecomment-3035981114 From mhaessig at openjdk.org Fri Jul 4 13:14:39 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Fri, 4 Jul 2025 13:14:39 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v2] In-Reply-To: References: Message-ID: On Thu, 26 Jun 2025 07:55:23 GMT, Hannes Greule wrote: >> Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. 
By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. >> >> Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. >> >> I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. >> >> Please review. Thanks. > > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > remove classfile version You forgot to add the new tests to the array of tests in `@Run`: stderr: Exception in thread "main" compiler.lib.ir_framework.shared.TestRunException: Test Failures (1) ----------------- Custom Run Test: @Run: runMethod - @Tests: {testI1,testI2,testI3,testL1,testL2,testL3,testS1,testS2,testS3,testUS1,testUS2,testUS3}: compiler.lib.ir_framework.shared.TestRunException: There was an error while invoking @Run method public void ReverseBytesConstantsTests.runMethod() at compiler.lib.ir_framework.test.CustomRunTest.invokeTest(CustomRunTest.java:162) at compiler.lib.ir_framework.test.AbstractTest.run(AbstractTest.java:100) at compiler.lib.ir_framework.test.CustomRunTest.run(CustomRunTest.java:89) at compiler.lib.ir_framework.test.TestVM.runTests(TestVM.java:865) at compiler.lib.ir_framework.test.TestVM.start(TestVM.java:255) at compiler.lib.ir_framework.test.TestVM.main(TestVM.java:168) Caused by: java.lang.reflect.InvocationTargetException at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:119) at java.base/java.lang.reflect.Method.invoke(Method.java:565) at compiler.lib.ir_framework.test.CustomRunTest.invokeTest(CustomRunTest.java:159) ... 5 more Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -24674 out of bounds for length 128 at java.base/java.lang.Character.valueOf(Character.java:9284) at ReverseBytesConstantsTests.assertResultUS(ReverseBytesConstantsTests.java:102) at ReverseBytesConstantsTests.runMethod(ReverseBytesConstantsTests.java:66) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) ... 7 more at compiler.lib.ir_framework.test.TestVM.runTests(TestVM.java:901) at compiler.lib.ir_framework.test.TestVM.start(TestVM.java:255) at compiler.lib.ir_framework.test.TestVM.main(TestVM.java:168) ------------- PR Comment: https://git.openjdk.org/jdk/pull/25988#issuecomment-3036245144 From duke at openjdk.org Fri Jul 4 13:23:41 2025 From: duke at openjdk.org (hanguanqiang) Date: Fri, 4 Jul 2025 13:23:41 GMT Subject: RFR: 8358568: Purge obsolete/broken GenerateSynchronizationCode flag [v5] In-Reply-To: References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> <8h5rro_3Sv9GLk3WSTstiTPXEj1dHPPHgCKowe-CIjk=.e15f5800-732f-4f25-81f8-ae86724bec2e@github.com> Message-ID: On Fri, 4 Jul 2025 09:27:31 GMT, Aleksey Shipilev wrote: >> hanguanqiang has updated the pull request incrementally with one additional commit since the last revision: >> >> correct an error >> >> correct an error > > I renamed the JBS bug, match the PR title, please. Also, go to https://github.com/hgqxjj/jdk/actions -- and enable the workflows. We need to have a clean GHA run before we can integrate. 
@shipilev @TobiHartmann Many thanks to both of you for reviewing ------------- PR Comment: https://git.openjdk.org/jdk/pull/26108#issuecomment-3036267585 From shade at openjdk.org Fri Jul 4 13:35:41 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 4 Jul 2025 13:35:41 GMT Subject: RFR: 8357473: Compilation spike leaves many CompileTasks in free list [v5] In-Reply-To: References: Message-ID: <8tf_dPZ9hexTA0unaFgAzyRqMW42z1lSRasRxySLlMU=.5cf326d4-894a-4f67-a9eb-f0c76e1bc3a9@github.com> On Fri, 4 Jul 2025 09:08:19 GMT, Aleksey Shipilev wrote: >> See bug for more discussion. >> >> This PR implements the "all the way" solution by removing the free list completely. It complements https://github.com/openjdk/jdk/pull/25364, and can go either first, or second. We will remerge the other one once either integrates. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `compiler` >> - [x] Linux AArch64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits: > > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Also free the lock! > - Comments and indenting > - Basic deletion I would like to ditch the `CompileTaskAlloc_lock` completely, but that needs https://github.com/openjdk/jdk/pull/26127 to be done first. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25409#issuecomment-3036297988 From eastigeevich at openjdk.org Fri Jul 4 13:59:39 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Fri, 4 Jul 2025 13:59:39 GMT Subject: RFR: 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache In-Reply-To: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> References: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> Message-ID: On Thu, 3 Jul 2025 11:29:02 GMT, hanguanqiang wrote: > The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is confusing and does not reflect the current implementation. > > This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. lgtm ------------- Marked as reviewed by eastigeevich (Committer). PR Review: https://git.openjdk.org/jdk/pull/26114#pullrequestreview-2987356786 From duke at openjdk.org Fri Jul 4 14:13:27 2025 From: duke at openjdk.org (hanguanqiang) Date: Fri, 4 Jul 2025 14:13:27 GMT Subject: RFR: 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache [v2] In-Reply-To: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> References: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> Message-ID: > The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is confusing and does not reflect the current implementation. 
> > This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. hanguanqiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - correct a compile error - Merge remote-tracking branch 'upstream/master' into 8344548 - 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is confusing and does not reflect the current implementation. This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26114/files - new: https://git.openjdk.org/jdk/pull/26114/files/698a3f28..cb1b2c60 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26114&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26114&range=00-01 Stats: 3295 lines in 114 files changed: 2197 ins; 812 del; 286 mod Patch: https://git.openjdk.org/jdk/pull/26114.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26114/head:pull/26114 PR: https://git.openjdk.org/jdk/pull/26114 From duke at openjdk.org Fri Jul 4 14:21:38 2025 From: duke at openjdk.org (hanguanqiang) Date: Fri, 4 Jul 2025 14:21:38 GMT Subject: RFR: 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache [v2] In-Reply-To: References: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> Message-ID: <-bUAuPwNRRbf6d7qs2AJErsIlLJQbu9Hl0_ReKdUZ7A=.8473414f-b98b-4ca7-bf91-aca4ab0ccca5@github.com> On Fri, 4 Jul 2025 13:57:01 GMT, Evgeny Astigeevich wrote: >> hanguanqiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - correct a compile error >> - Merge remote-tracking branch 'upstream/master' into 8344548 >> - 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache >> >> The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is >> confusing and does not reflect the current implementation. >> >> This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. 
> > lgtm Many thanks to you @eastig for reviewing ------------- PR Comment: https://git.openjdk.org/jdk/pull/26114#issuecomment-3036461841 From duke at openjdk.org Fri Jul 4 16:28:38 2025 From: duke at openjdk.org (Samuel Chee) Date: Fri, 4 Jul 2025 16:28:38 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet In-Reply-To: References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: On Fri, 27 Jun 2025 12:54:07 GMT, Andrew Haley wrote: >> I can double check with the herd7 simulator, but since the `casal` will always produce an acquire, to me it seems impossible that a load can be moved before the `casal` due to the acquire within the `casal`. >> >> Clause 9 of before-barrier-ordering in the Arm Architecture reference manual also supports this. > >> Clause 9 of before-barrier-ordering in the Arm Architecture reference manual also supports this. > > Which clause is that? Hi @theRealAph, The clause can be found here, the last bullet point on this page - https://mozilla.github.io/pdf.js/web/viewer.html?file=https://documentation-service.arm.com/static/6839d7585475b403d943b4dc#page=255&pagemode=none Also, we have come up with two herd7 tests which should hopefully prove it to be alright. { x=0; y=0; 0:X1=x; 0:X3=y; 1:X1=x; 1:X3=y; } P0 | P1 ; MOV W0,#1 | MOV W0, #1 ; MOV W2,#2 | MOV W2, #2 ; CASAL W0, W2, [X1] | ; LDR W4,[X3] | STR W0, [X3] ; | DMB ISH ; | STR W2, [X1] ; exists (0:X0=2 /\ 0:X4=0) Here, the stores by P1 are happening in order: y = 1; x = 2; and the reads in P0 are happening by CASAL first - from x and then by LDR - from y. The constraint checked is that CASAL can't read 2 from x if LDR read 0 from y - the constraint should be fulfilled unless the reads are reordered. And { x = 1; y = 1; 0: X1=x; 0:X3=y; 1: X3=x; 1:X1=y; } P0 | P1 ; MOV W0, #1 | MOV W0, #1; MOV W2, #2 | MOV W2, #2; CASAL W0, W2, [X1] | CASAL W0, W2, [X1]; LDR W4, [X3] | LDR W4, [X3]; exists (0:X4=1 /\ 1: X4=1) Here, both X4's being equal to 1 is disallowed, as that would indicate that one of the ldrs was reordered before the CASAL. As the CASALs will always succeed by default, meaning at least one of the LDRs will load a non-1 value into W4. Hence (0:X4=1 /\ 1: X4=1) can only ever occur if an ldr gets ordered before the CASAL. Hope this helps :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/26000#issuecomment-3036824254 From duke at openjdk.org Sat Jul 5 00:14:39 2025 From: duke at openjdk.org (hanguanqiang) Date: Sat, 5 Jul 2025 00:14:39 GMT Subject: RFR: 8358568: Purge obsolete/broken GenerateSynchronizationCode flag [v5] In-Reply-To: References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> <8h5rro_3Sv9GLk3WSTstiTPXEj1dHPPHgCKowe-CIjk=.e15f5800-732f-4f25-81f8-ae86724bec2e@github.com> Message-ID: On Fri, 4 Jul 2025 09:27:31 GMT, Aleksey Shipilev wrote: >> hanguanqiang has updated the pull request incrementally with one additional commit since the last revision: >> >> correct an error >> >> correct an error > > I renamed the JBS bug, match the PR title, please. Also, go to https://github.com/hgqxjj/jdk/actions -- and enable the workflows. We need to have a clean GHA run before we can integrate. @shipilev @TobiHartmann The PR is ready to be integrated, but I don't have the necessary permissions yet. Could you help with the integration? Thanks again!
------------- PR Comment: https://git.openjdk.org/jdk/pull/26108#issuecomment-3037460059 From shade at openjdk.org Sat Jul 5 05:47:51 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Sat, 5 Jul 2025 05:47:51 GMT Subject: RFR: 8358568: Purge obsolete/broken GenerateSynchronizationCode flag [v5] In-Reply-To: References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> <8h5rro_3Sv9GLk3WSTstiTPXEj1dHPPHgCKowe-CIjk=.e15f5800-732f-4f25-81f8-ae86724bec2e@github.com> Message-ID: On Fri, 4 Jul 2025 09:27:31 GMT, Aleksey Shipilev wrote: >> hanguanqiang has updated the pull request incrementally with one additional commit since the last revision: >> >> correct an error >> >> correct an error > > I renamed the JBS bug, match the PR title, please. Also, go to https://github.com/hgqxjj/jdk/actions -- and enable the workflows. We need to have a clean GHA run before we can integrate. > @shipilev @TobiHartmann The PR is ready to be integrated, but I don?t have the necessary permissions yet. Could you help with the integration? Thanks again ! See what bots say here: https://github.com/openjdk/jdk/pull/26108#issuecomment-3030230202 -- you need to issue `/integrate` command, and someone would sponsor. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26108#issuecomment-3038183835 From duke at openjdk.org Sat Jul 5 06:34:51 2025 From: duke at openjdk.org (duke) Date: Sat, 5 Jul 2025 06:34:51 GMT Subject: RFR: 8358568: Purge obsolete/broken GenerateSynchronizationCode flag [v5] In-Reply-To: <8h5rro_3Sv9GLk3WSTstiTPXEj1dHPPHgCKowe-CIjk=.e15f5800-732f-4f25-81f8-ae86724bec2e@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> <8h5rro_3Sv9GLk3WSTstiTPXEj1dHPPHgCKowe-CIjk=.e15f5800-732f-4f25-81f8-ae86724bec2e@github.com> Message-ID: <7KfCIDJiq0UA0EcFyAiEyqPtShKrl6N-295Bu0DEI7E=.ffa395b6-779e-4a98-b933-854861b2a6b5@github.com> On Fri, 4 Jul 2025 06:43:02 GMT, hanguanqiang wrote: >> This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode >> >> Problem? >> When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. >> >> Root Cause? >> Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. >> >> Fix >> Purge obsolete/broken GenerateSynchronizationCode flag > > hanguanqiang has updated the pull request incrementally with one additional commit since the last revision: > > correct an error > > correct an error @hgqxjj Your change (at version 6ebc2ecb7b41da558a26400461b2e8084e915c3d) is now ready to be sponsored by a Committer. 
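For readers following along, the scenario described in the quoted report can be pictured with a trivial synchronized method like the sketch below; the class and method names are invented for illustration, and the flag mentioned in the report is exactly the one this change removes:

    public class SyncSketch {
        private int counter;

        // A synchronized method compiles to a monitorenter/monitorexit pair;
        // parsing the monitorexit is where do_monitor_exit() is involved.
        public synchronized void increment() {
            counter++;
        }

        public static void main(String[] args) {
            SyncSketch s = new SyncSketch();
            for (int i = 0; i < 200_000; i++) {
                s.increment();   // enough iterations for the method to get JIT-compiled
            }
            System.out.println(s.counter);
        }
    }

Running a workload of this shape with the now-removed -XX:-GenerateSynchronizationCode option is the kind of setup that used to hit the assertion.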
------------- PR Comment: https://git.openjdk.org/jdk/pull/26108#issuecomment-3038269276 From duke at openjdk.org Sat Jul 5 06:40:39 2025 From: duke at openjdk.org (hanguanqiang) Date: Sat, 5 Jul 2025 06:40:39 GMT Subject: RFR: 8358568: Purge obsolete/broken GenerateSynchronizationCode flag [v5] In-Reply-To: References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> <8h5rro_3Sv9GLk3WSTstiTPXEj1dHPPHgCKowe-CIjk=.e15f5800-732f-4f25-81f8-ae86724bec2e@github.com> Message-ID: <_BqZWLEjPgBAb82HtarIENSKj5AuzFbILzidygCRm38=.7e18705d-aa5f-4e07-bcc1-496f75848441@github.com> On Sat, 5 Jul 2025 05:45:24 GMT, Aleksey Shipilev wrote: >> I renamed the JBS bug, match the PR title, please. Also, go to https://github.com/hgqxjj/jdk/actions -- and enable the workflows. We need to have a clean GHA run before we can integrate. > >> @shipilev @TobiHartmann The PR is ready to be integrated, but I don?t have the necessary permissions yet. Could you help with the integration? Thanks again ! > > See what bots say here: https://github.com/openjdk/jdk/pull/26108#issuecomment-3030230202 -- you need to issue `/integrate` command, and someone would sponsor. @shipilev Thanks for the reminder?i already issue /integrate , please help sponsor this change , really appreciate ! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26108#issuecomment-3038287916 From dnsimon at openjdk.org Sat Jul 5 10:31:37 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Sat, 5 Jul 2025 10:31:37 GMT Subject: RFR: 8361417: JVMCI getModifiers incorrect for inner classes Message-ID: The result of `ResolvedJavaType.getModifiers()` should always have been the same as `Class.getModifiers()`. This is currently not the case for inner classes. Instead, the value is derived from `Klass::_access_flags` where as it should be derived from the `InnerClasses` attribute (as it is for `Class`). This PR aligns `ResolvedJavaType.getModifiers()` with `Class.getModifiers()`. ------------- Commit messages: - fix getModifiers() for inner classes Changes: https://git.openjdk.org/jdk/pull/26135/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26135&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361417 Stats: 71 lines in 7 files changed: 36 ins; 20 del; 15 mod Patch: https://git.openjdk.org/jdk/pull/26135.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26135/head:pull/26135 PR: https://git.openjdk.org/jdk/pull/26135 From fgao at openjdk.org Sat Jul 5 14:04:39 2025 From: fgao at openjdk.org (Fei Gao) Date: Sat, 5 Jul 2025 14:04:39 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> Message-ID: On Fri, 4 Jul 2025 09:15:14 GMT, Xiaohong Gong wrote: >> This looks good. Thanks. > >> This looks good. Thanks. > > Thanks so much for your review! Hi @XiaohongGong, thanks for your work! Shall we also relax the IR check condition in the following cases for `aarch64` and `x86`? 
https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L254-L258 https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L376-L380 https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3038978749 From fgao at openjdk.org Sat Jul 5 15:11:40 2025 From: fgao at openjdk.org (Fei Gao) Date: Sat, 5 Jul 2025 15:11:40 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> Message-ID: <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> On Thu, 3 Jul 2025 06:10:28 GMT, Xiaohong Gong wrote: >> ### Background >> On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. >> >> For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. >> >> To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. >> >> ### Impact Analysis >> #### 1. Vector types >> Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. >> >> #### 2. Vector API >> No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. >> >> #### 3. Auto-vectorization >> Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. >> >> #### 4. Codegen of vector nodes >> NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. >> >> Details: >> - Lanewise vector operations are unaffected as explained above. >> - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). >> - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. 
Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, addin... > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Refine the comment in ad file Have you measured the performance of this micro-benchmark on NEON machine? https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java#L251-L256 We added an limitation only for `int` before: https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L131-L134 Perhaps we also need to impose a similar limitation on `short` if the same regression occurs. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3039090274 From fjiang at openjdk.org Sun Jul 6 13:22:47 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Sun, 6 Jul 2025 13:22:47 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v3] In-Reply-To: References: Message-ID: > Hi, please consider. > [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. > The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. > If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. > This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. > We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. > > This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. > The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. > > Test on linux-riscv64: > - [x] Tier1-3 > > JMH data on P550 SBC for reference (w/o and w/ the patch): > > Before: > > Without COH: > > Benchmark (size) Mode Cnt Score Error Units > ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op > ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op > ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op > ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op > ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op > ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op > ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op > ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op > ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op > ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op > ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op > ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op > ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op > ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op > ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op > ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op > > ------------------------------------------------------------------------- > With COH: > > Benchmark (size) Mode Cnt Score Error Units > ArrayClone.byteArraycopy 0 avgt 15 50.667 ? 0.622 ns/op > Arra... 
Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - Revert RISCV Macro modification - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses - riscv: fix c1 primitive array clone intrinsic regression ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25976/files - new: https://git.openjdk.org/jdk/pull/25976/files/be980424..3a502f84 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25976&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25976&range=01-02 Stats: 10623 lines in 438 files changed: 7013 ins; 1860 del; 1750 mod Patch: https://git.openjdk.org/jdk/pull/25976.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25976/head:pull/25976 PR: https://git.openjdk.org/jdk/pull/25976 From fjiang at openjdk.org Sun Jul 6 13:22:49 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Sun, 6 Jul 2025 13:22:49 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v2] In-Reply-To: References: Message-ID: <7KGEqkzMGveZ_lLtIcC0YwwHqmUri7L3_v7J6aVLmQM=.089fc97c-f09a-4220-87cc-a30d6dd10536@github.com> On Fri, 4 Jul 2025 05:25:08 GMT, Fei Yang wrote: >> Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses >> - riscv: fix c1 primitive array clone intrinsic regression > > src/hotspot/share/c1/c1_Compiler.cpp line 240: > >> 238: #endif >> 239: case vmIntrinsics::_getObjectSize: >> 240: #if defined(X86) || defined(AARCH64) || defined(S390) || defined(RISCV64) || defined(PPC64) > > PS: The change of macro `RISCV` seems unrelated to this PR? Seem better to go with another PR. Reverted. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2188269110 From xgong at openjdk.org Mon Jul 7 02:07:50 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 7 Jul 2025 02:07:50 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: References: <7GrGfBF_v8F0v02sRHC78ofMZwpMdzQZaHeYlNvi_N0=.93defb9e-ca9b-41b4-8722-1746692e2316@github.com> Message-ID: On Wed, 2 Jul 2025 08:24:22 GMT, Emanuel Peter wrote: >>> Agree with Paul, these are minor regressions. Let us proceed with this patch. >> >> Thanks so much for your review @sviswa7 ! > > @XiaohongGong I quickly scanned the patch, it looks good to me too. I'm submitting some internal testing now, to make sure our extended testing does not break on integration. Should take about 24h. Hi @eme64 , may I ask how the testing is going on? Can we move on and integrate this patch now? Thanks a lot! 
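For reference, a minimal use of the subword gather API that this refactoring targets looks roughly like the sketch below (illustrative only; the species, array size and index pattern are arbitrary choices, not taken from the tests, and it assumes the jdk.incubator.vector module):

    import jdk.incubator.vector.ByteVector;
    import jdk.incubator.vector.VectorSpecies;

    public class SubwordGatherSketch {
        static final VectorSpecies<Byte> BSP = ByteVector.SPECIES_128;

        public static void main(String[] args) {
            byte[] data = new byte[64];
            for (int i = 0; i < data.length; i++) {
                data[i] = (byte) i;
            }
            // One int index per lane, applied relative to the base offset.
            int[] indexMap = new int[BSP.length()];
            for (int i = 0; i < indexMap.length; i++) {
                indexMap[i] = indexMap.length - 1 - i;   // gather the lanes in reverse order
            }
            // Gather load: lane i reads data[offset + indexMap[mapOffset + i]].
            ByteVector v = ByteVector.fromArray(BSP, data, 0, indexMap, 0);
            System.out.println(v);
        }
    }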
------------- PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3043273111 From dzhang at openjdk.org Mon Jul 7 02:35:11 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 7 Jul 2025 02:35:11 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call Message-ID: Hi, please consider this code cleanup change for native call. This removes the address parameter for NativeCall::reloc_destination and NativeFarCall::reloc_destination. This also removes several unnecessary code blob related runtime checks turning them into assertions. ### Testing * [x] hs:tier1 - hs:tier3 tested with linux-riscv64 fastdebug build ------------- Commit messages: - 8361449: RISC-V: Code cleanup for native call Changes: https://git.openjdk.org/jdk/pull/26150/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26150&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361449 Stats: 39 lines in 3 files changed: 5 ins; 12 del; 22 mod Patch: https://git.openjdk.org/jdk/pull/26150.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26150/head:pull/26150 PR: https://git.openjdk.org/jdk/pull/26150 From dzhang at openjdk.org Mon Jul 7 03:05:25 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 7 Jul 2025 03:05:25 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v2] In-Reply-To: References: Message-ID: > Hi, please consider this code cleanup change for native call. > > This removes the address parameter for NativeCall::reloc_destination and NativeFarCall::reloc_destination. > This also removes several unnecessary code blob related runtime checks turning them into assertions. > > ### Testing > * [x] hs:tier1 - hs:tier3 tested with linux-riscv64 fastdebug build Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Remove outdated comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26150/files - new: https://git.openjdk.org/jdk/pull/26150/files/4054dac0..d7ff8e53 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26150&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26150&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26150.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26150/head:pull/26150 PR: https://git.openjdk.org/jdk/pull/26150 From duke at openjdk.org Mon Jul 7 03:46:38 2025 From: duke at openjdk.org (erifan) Date: Mon, 7 Jul 2025 03:46:38 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: References: Message-ID: <3cr8Njt2flaQXy5sjOVOlhI9XDkEesagnYLwzCmgkoI=.089494aa-d622-47db-8d23-c9637519028c@github.com> On Fri, 4 Jul 2025 12:04:06 GMT, Jatin Bhateja wrote: > > > public static final VectorSpecies FSP = FloatVector.SPECIES_512; > > > public static long micro1(long a) { > > > long mask = Math.min(-1, Math.max(-1, a)); > > > return VectorMask.fromLong(FSP, mask).toLong(); > > > } > > > public static long micro2() { > > > return FSP.maskAll(true).toLong(); > > > } > > > > > > With this JMH method we can not see obvious performance improvement, because the hot spots are other instructions. Adding a loop is better. > > There is no hard and fast rule for the inclusion of a loop in a JMH micro in that case? You mean adding a loop is not a block, right ? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3043388905 From duke at openjdk.org Mon Jul 7 03:46:39 2025 From: duke at openjdk.org (erifan) Date: Mon, 7 Jul 2025 03:46:39 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: References: <_RERljqu_FG7ZyneAk7Thd-9TwED18pQpEBz_i105fY=.b8948a23-273a-49f6-b9cb-6b611a5eedc6@github.com> <6SXA9ZrXBDhZLyXP3lXbkpl4dl3iocvDpzPrUpIQOl8=.9b025be2-848b-4b78-a5e4-929cb7e9f798@github.com> Message-ID: On Fri, 4 Jul 2025 11:59:23 GMT, Jatin Bhateja wrote: > What if during iterative GVN a constant -1 seeps through IR graph and gets connected to the input of VectorLongToMaskNode, you won't be able to create maskAll true in that case? Yes, this PR doesn't support this case. Maybe we should do this optimization in `ideal`. If `VectorLongToMask` is not supported, then try to convert it to `maskAll` or `Replicate` in intrinsic. > Do you see any advantage of doing this at intrinsic layer over entirely handling it in Java implimentation by simply modifying the opcode of fromBitsCoerced to MODE_BROADCAST from existing MODE_BITS_COERCED_LONG_TO_MASK for 0 or -1 input. I had tried this method and gave it up, because it has up to 34% performance regression for specific cases on x64. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2188903923 From duke at openjdk.org Mon Jul 7 05:25:45 2025 From: duke at openjdk.org (guanqiang han) Date: Mon, 7 Jul 2025 05:25:45 GMT Subject: Integrated: 8358568: Purge obsolete/broken GenerateSynchronizationCode flag In-Reply-To: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> References: <3V2zC8zIDHEUvZMM8ibpoeRjd8FOEjuQDOSzWrQKsZc=.d652b38e-f890-4d78-b4b7-7e4d3a9f3bde@github.com> Message-ID: On Thu, 3 Jul 2025 01:59:55 GMT, guanqiang han wrote: > This PR fixes JDK-8358568, a JVM crash triggered when running with -XX:-GenerateSynchronizationCode > > Problem? > When synchronization code generation is disabled by -XX:-GenerateSynchronizationCode, the compiler?s do_monitor_exit() method still tries to access monitor objects without checking if any monitors exist.This causes an assertion failure and JVM crash. > > Root Cause? > Parse::do_monitor_exit() calls shared_unlock() using monitor info unconditionally,but with GenerateSynchronizationCode disabled, no monitor info is available, leading to invalid access. > > Fix > Purge obsolete/broken GenerateSynchronizationCode flag This pull request has now been integrated. Changeset: 45300dd1 Author: hanguanqiang Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/45300dd1234c9aa92d6b82f1ef2b05b949b1ea9f Stats: 23 lines in 6 files changed: 0 ins; 17 del; 6 mod 8358568: Purge obsolete/broken GenerateSynchronizationCode flag Reviewed-by: thartmann, shade ------------- PR: https://git.openjdk.org/jdk/pull/26108 From epeter at openjdk.org Mon Jul 7 06:04:45 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 7 Jul 2025 06:04:45 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: References: Message-ID: On Wed, 25 Jun 2025 09:16:48 GMT, Xiaohong Gong wrote: >> JDK-8318650 introduced hotspot intrinsification of subword gather load APIs for X86 platforms [1]. 
However, the current implementation is not optimal for AArch64 SVE platform, which natively supports vector instructions for subword gather load operations using an int vector for indices (see [2][3]). >> >> Two key areas require improvement: >> 1. At the Java level, vector indices generated for range validation could be reused for the subsequent gather load operation on architectures with native vector instructions like AArch64 SVE. However, the current implementation prevents compiler reuse of these index vectors due to divergent control flow, potentially impacting performance. >> 2. At the compiler IR level, the additional `offset` input for `LoadVectorGather`/`LoadVectorGatherMasked` with subword types increases IR complexity and complicates backend implementation. Furthermore, generating `add` instructions before each memory access negatively impacts performance. >> >> This patch refactors the implementation at both the Java level and compiler mid-end to improve efficiency and maintainability across different architectures. >> >> Main changes: >> 1. Java-side API refactoring: >> - Explicitly passes generated index vectors to hotspot, eliminating duplicate index vectors for gather load instructions on >> architectures like AArch64. >> 2. C2 compiler IR refactoring: >> - Refactors `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword types by removing the memory offset input and incorporating it into the memory base `addr` at the IR level. This simplifies backend implementation, reduces add operations, and unifies the IR across all types. >> 3. Backend changes: >> - Streamlines X86 implementation of subword gather operations following the removal of the offset input from the IR level. >> >> Performance: >> The performance of the relative JMH improves up to 27% on a X86 AVX512 system. Please see the data below: >> >> Benchmark Mode Cnt Unit SIZE Before After Gain >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 64 53682.012 52650.325 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 256 14484.252 14255.156 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 1024 3664.900 3595.615 0.98 >> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 4096 908.31... > > Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: > > - Address review comments > - Merge 'jdk:master' into JDK-8355563 > - 8355563: VectorAPI: Refactor current implementation of subword gather load API @XiaohongGong Thanks for putting in the work! Tests pass, and patch looks reasonable. ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25138#pullrequestreview-2992282121 From epeter at openjdk.org Mon Jul 7 06:21:48 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Mon, 7 Jul 2025 06:21:48 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v9] In-Reply-To: References: Message-ID: On Wed, 25 Jun 2025 10:08:23 GMT, erifan wrote: >> This patch optimizes the following patterns: >> For integer types: >> >> (XorV (VectorMaskCmp src1 src2 cond) (Replicate -1)) >> => (VectorMaskCmp src1 src2 ncond) >> (XorVMask (VectorMaskCmp src1 src2 cond) (MaskAll m1)) >> => (VectorMaskCmp src1 src2 ncond) >> >> cond can be eq, ne, le, ge, lt, gt, ule, uge, ult and ugt, ncond is the negative comparison of cond. 
>> >> For float and double types: >> >> (XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1)) >> => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) >> (XorVMask (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (MaskAll m1)) >> => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) >> >> cond can be eq or ne. >> >> Benchmarks on Nvidia Grace machine with 128-bit SVE2: With option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Score Error After Score Error Uplift >> testCompareEQMaskNotByte ops/s 7912127.225 2677.289518 10266136.26 8955.008548 1.29 >> testCompareEQMaskNotDouble ops/s 884737.6799 446.963779 1179760.772 448.031844 1.33 >> testCompareEQMaskNotFloat ops/s 1765045.787 682.332214 2359520.803 896.305743 1.33 >> testCompareEQMaskNotInt ops/s 1787221.411 977.743935 2353952.519 960.069976 1.31 >> testCompareEQMaskNotLong ops/s 895297.1974 673.44808 1178449.02 323.804205 1.31 >> testCompareEQMaskNotShort ops/s 3339987.002 3415.2226 4712761.965 2110.862053 1.41 >> testCompareGEMaskNotByte ops/s 7907615.16 4094.243652 10251646.9 9486.699831 1.29 >> testCompareGEMaskNotInt ops/s 1683738.958 4233.813092 2352855.205 1251.952546 1.39 >> testCompareGEMaskNotLong ops/s 854496.1561 8594.598885 1177811.493 521.1229 1.37 >> testCompareGEMaskNotShort ops/s 3341860.309 1578.975338 4714008.434 1681.10365 1.41 >> testCompareGTMaskNotByte ops/s 7910823.674 2993.367032 10245063.58 9774.75138 1.29 >> testCompareGTMaskNotInt ops/s 1673393.928 3153.099431 2353654.521 1190.848583 1.4 >> testCompareGTMaskNotLong ops/s 849405.9159 2432.858159 1177952.041 359.96413 1.38 >> testCompareGTMaskNotShort ops/s 3339509.141 3339.976585 4711442.496 2673.364893 1.41 >> testCompareLEMaskNotByte ops/s 7911340.004 3114.69191 10231626.5 27134.20035 1.29 >> testCompareLEMaskNotInt ops/s 1675812.113 1340.969885 2353255.341 1452.4522 1.4 >> testCompareLEMaskNotLong ops/s 848862.8036 6564.841731 1177763.623 539.290106 1.38 >> testCompareLEMaskNotShort ops/s 3324951.54 2380.29473 4712116.251 1544.559684 1.41 >> testCompareLTMaskNotByte ops/s 7910390.844 2630.861436 10239567.69 6487.441672 1.29 >> testCompareLTMaskNotInt ops/s 16721... > > erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 15 additional commits since the last revision: > > - Address more comments > > ATT. > - Merge branch 'master' into JDK-8354242 > - Support negating unsigned comparison for BoolTest::mask > > Added a static method `negate_mask(mask btm)` into BoolTest class to > negate both signed and unsigned comparison. > - Addressed some review comments > - Merge branch 'master' into JDK-8354242 > - Refactor the JTReg tests for compare.xor(maskAll) > > Also made a bit change to support pattern `VectorMask.fromLong()`. > - Merge branch 'master' into JDK-8354242 > - Refactor code > > Add a new function XorVNode::Ideal_XorV_VectorMaskCmp to do this > optimization, making the code more modular. > - Merge branch 'master' into JDK-8354242 > - Update the jtreg test > - ... and 5 more: https://git.openjdk.org/jdk/compare/78e42324...5ebdc572 src/hotspot/share/opto/vectornode.cpp line 2243: > 2241: !VectorNode::is_all_ones_vector(in2)) { > 2242: return nullptr; > 2243: } Suggestion: if (in1->Opcode() != Op_VectorMaskCmp || in1->outcnt() != 1 || !(in1->as_VectorMaskCmp())->predicate_can_be_negated() || !VectorNode::is_all_ones_vector(in2)) { return nullptr; } Indentation for clarity. 
Currently, you are exiting if one of these is the case: 1. Not `MaskCmp` 2. More than one use 3. predicate cannot be negated AND the vector is all ones. Can you explain this condition? Do you have tests for cases: - predicate negatable and vector not all ones - predicate not negatable and vector not all ones - predicate negatable and vector all ones - predicate not negatable and vector all ones Why do you guard against `VectorNode::is_all_ones_vector(in2)` at all? The condition for 3. is easy to get wrong, so good testing is important here :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2189075462 From xgong at openjdk.org Mon Jul 7 06:55:46 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 7 Jul 2025 06:55:46 GMT Subject: RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2] In-Reply-To: References: <7GrGfBF_v8F0v02sRHC78ofMZwpMdzQZaHeYlNvi_N0=.93defb9e-ca9b-41b4-8722-1746692e2316@github.com> Message-ID: On Mon, 7 Jul 2025 02:05:06 GMT, Xiaohong Gong wrote: >> @XiaohongGong I quickly scanned the patch, it looks good to me too. I'm submitting some internal testing now, to make sure our extended testing does not break on integration. Should take about 24h. > > Hi @eme64, may I ask how the testing is going on? Can we move on and integrate this patch now? Thanks a lot! > @XiaohongGong Thanks for putting in the work! > > Tests pass, and patch looks reasonable. Thanks so much for your review and test! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3043679621 From xgong at openjdk.org Mon Jul 7 06:55:47 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 7 Jul 2025 06:55:47 GMT Subject: Integrated: 8355563: VectorAPI: Refactor current implementation of subword gather load API In-Reply-To: References: Message-ID: On Fri, 9 May 2025 07:35:41 GMT, Xiaohong Gong wrote: > JDK-8318650 introduced hotspot intrinsification of subword gather load APIs for X86 platforms [1]. However, the current implementation is not optimal for AArch64 SVE platform, which natively supports vector instructions for subword gather load operations using an int vector for indices (see [2][3]). > > Two key areas require improvement: > 1. At the Java level, vector indices generated for range validation could be reused for the subsequent gather load operation on architectures with native vector instructions like AArch64 SVE. However, the current implementation prevents compiler reuse of these index vectors due to divergent control flow, potentially impacting performance. > 2. At the compiler IR level, the additional `offset` input for `LoadVectorGather`/`LoadVectorGatherMasked` with subword types increases IR complexity and complicates backend implementation. Furthermore, generating `add` instructions before each memory access negatively impacts performance. > > This patch refactors the implementation at both the Java level and compiler mid-end to improve efficiency and maintainability across different architectures. > > Main changes: > 1. Java-side API refactoring: > - Explicitly passes generated index vectors to hotspot, eliminating duplicate index vectors for gather load instructions on > architectures like AArch64. > 2. C2 compiler IR refactoring: > - Refactors `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword types by removing the memory offset input and incorporating it into the memory base `addr` at the IR level.
This simplifies backend implementation, reduces add operations, and unifies the IR across all types. > 3. Backend changes: > - Streamlines X86 implementation of subword gather operations following the removal of the offset input from the IR level. > > Performance: > The performance of the relative JMH improves up to 27% on a X86 AVX512 system. Please see the data below: > > Benchmark Mode Cnt Unit SIZE Before After Gain > GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 64 53682.012 52650.325 0.98 > GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 256 14484.252 14255.156 0.98 > GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 1024 3664.900 3595.615 0.98 > GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 4096 908.312 935.269 1.02 > GatherOperationsBenchmark.micr... This pull request has now been integrated. Changeset: d75ea7e6 Author: Xiaohong Gong URL: https://git.openjdk.org/jdk/commit/d75ea7e67951275fe27f1e137c961f39d779a046 Stats: 450 lines in 15 files changed: 109 ins; 176 del; 165 mod 8355563: VectorAPI: Refactor current implementation of subword gather load API Reviewed-by: epeter, psandoz, sviswanathan, jbhateja ------------- PR: https://git.openjdk.org/jdk/pull/25138 From xgong at openjdk.org Mon Jul 7 07:01:39 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 7 Jul 2025 07:01:39 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> Message-ID: <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> On Sat, 5 Jul 2025 15:08:35 GMT, Fei Gao wrote: > Have you measured the performance of this micro-benchmark on NEON machine? > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java#L251-L256 > > We added an limitation only for `int` before: > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L131-L134 > > Perhaps we also need to impose a similar limitation on `short` if the same regression occurs. Good catch, and thanks so much for your input @fg1417 ! I will test the performance and disable auto-vectorization for double to short casting if the performance has regression. > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. 
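To make the conversion shape concrete (a plain sketch, not one of the jtreg cases linked above; class and method names are invented), the kind of loop that benefits from the relaxed 32-bit minimum for short vectors is a widening conversion such as:

    public class ShortWideningSketch {
        // With 128-bit vectors, each double lane pairs up with a 2-element
        // short sub-vector, i.e. a 32-bit short vector, which is why the
        // relaxed minimum vector length matters for this conversion.
        public static void convert(short[] src, double[] dst) {
            for (int i = 0; i < src.length; i++) {
                dst[i] = src[i];
            }
        }

        public static void main(String[] args) {
            short[] src = new short[1024];
            double[] dst = new double[1024];
            for (int i = 0; i < src.length; i++) {
                src[i] = (short) (i - 512);
            }
            convert(src, dst);
            System.out.println(dst[0] + " .. " + dst[1023]);
        }
    }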
------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3043711086 From dbriemann at openjdk.org Mon Jul 7 07:30:19 2025 From: dbriemann at openjdk.org (David Briemann) Date: Mon, 7 Jul 2025 07:30:19 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v6] In-Reply-To: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: > Implement more nodes for ppc that exist on other platforms. David Briemann has updated the pull request incrementally with one additional commit since the last revision: rename instruction, add extra predicate cond for type int ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26115/files - new: https://git.openjdk.org/jdk/pull/26115/files/ebb27c9c..b65400a9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26115&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26115&range=04-05 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26115.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26115/head:pull/26115 PR: https://git.openjdk.org/jdk/pull/26115 From rrich at openjdk.org Mon Jul 7 07:44:39 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Mon, 7 Jul 2025 07:44:39 GMT Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining [v2] In-Reply-To: <5IVOmLIL__iwmhFaKEP80YNP8kIN94owhC3SIU8ZF4U=.ab4061f3-e62f-4586-9118-7f84f246078d@github.com> References: <5IVOmLIL__iwmhFaKEP80YNP8kIN94owhC3SIU8ZF4U=.ab4061f3-e62f-4586-9118-7f84f246078d@github.com> Message-ID: On Fri, 4 Jul 2025 08:14:19 GMT, Richard Reingruber wrote: >> This PR adds CompileCommands to the test DumpThreadsWithEliminatedLock.java to force inlining of java/lang/String*.* methods. This will make inlining more stable to allow for the expected lock elimination based on c2 escape analysis. >> >> Forcing inlining of java/lang/StringBuffer.* wasn't sufficient on x86_64. With that the test still failed with TieredCompilation disabled. >> >> Testing: x86_64, ppc64 manually. Other major platforms as part of our CI testing. >> >> Failed inlining on x86_64 with TieredCompilation disabled: >> >> >> make test TEST=com/sun/management/HotSpotDiagnosticMXBean/DumpThreadsWithEliminatedLock.java TEST_VM_OPTS="-XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=PrintInlining,DumpThreadsWithEliminatedLock.*" JTREG=TIMEOUT_FACTOR=0.1 >> >> [...] 
>> >> STDOUT: >> CompileCommand: PrintInlining DumpThreadsWithEliminatedLock.* bool PrintInlining = true >> @ 1 java.util.concurrent.atomic.AtomicBoolean::get (13 bytes) inline (hot) >> @ 11 java.lang.StringBuffer:: (7 bytes) inline (hot) late inline succeeded (string method) >> @ 3 java.lang.AbstractStringBuilder:: (39 bytes) inline (hot) >> @ 1 java.lang.Object:: (1 bytes) inline (hot) >> @ 16 java.lang.System::currentTimeMillis (0 bytes) (intrinsic) >> s @ 19 java.lang.StringBuffer::append (13 bytes) failed to inline: already compiled into a big method >> s @ 24 java.lang.StringBuffer::toString (44 bytes) inline (hot) late inline succeeded (string method) >> s @ 1 java.lang.StringBuffer::length (5 bytes) accessor >> @ 24 java.lang.String:: (98 bytes) failed to inline: already compiled into a big method >> @ 30 java.util.concurrent.atomic.AtomicReference::set (6 bytes) accessor >> 2025-07-02T09:25:53.396634900Z Attempt 1, found: false >> 2025-07-02T09:25:53.415673072Z Attempt 2, found: false >> 2025-07-02T09:25:53.418876867Z Attempt 3, found: false >> >> [...] > > Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: > > Allow vm.debug > About the test and debug mode, we had that kind of conversation in #25958 Windows and Macosx were likely to timeout in debug builds, Linux was OK for me. Not sure if the inlining requests here affect that much, will be interesting to see. You got timeouts likely because the test searched for eliminated locking in the thread dumps in an endless loop but lock elimination has failed (very early) due to unfavorable inlining. Inlining depends on timing because jit compilation runs asynchronously in the background. It affects inlining if a call target is already compiled into a big nmethod (see `failed to inline: already compiled into a big method` above). Calls critical for lock elimination will still be inlined because of the `CompileCommand` inlining requests. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26033#issuecomment-3043813314 From rrich at openjdk.org Mon Jul 7 07:44:40 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Mon, 7 Jul 2025 07:44:40 GMT Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining [v2] In-Reply-To: References: <0p61J0DPfyHsen3r__V82eEZSPYaT9rZleHBtanKaRc=.c5f6992f-a7fe-4c95-bdcb-2887c3dbde21@github.com> Message-ID: On Fri, 4 Jul 2025 08:11:12 GMT, Richard Reingruber wrote: > I've removed the `!vm.debug` requirement. I'll await our local testing of the pr on a wider range of platforms. Local testing was good. 
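As a minimal picture of what the test is looking for (names invented, not the actual test code), a lock that C2 can eliminate once the relevant calls are inlined is synchronization on an object that never escapes the compiled method:

    public class EliminatedLockSketch {
        public static int build(long value) {
            // The StringBuffer stays local to this method. Once its append and
            // toString calls are inlined, escape analysis can prove it does not
            // escape, and C2 may elide the StringBuffer's internal locking,
            // which is the "eliminated lock" the thread dump is expected to show.
            StringBuffer sb = new StringBuffer();
            sb.append(value);
            return sb.toString().length();
        }

        public static void main(String[] args) {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += build(i);   // warm up so build() gets C2-compiled
            }
            System.out.println(sum);
        }
    }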
------------- PR Comment: https://git.openjdk.org/jdk/pull/26033#issuecomment-3043828120 From thartmann at openjdk.org Mon Jul 7 07:45:46 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 7 Jul 2025 07:45:46 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v2] In-Reply-To: <1cFRkcs5JmgnbWEIaEoT8I9RiUtNxgKieAdkSB2Fgmc=.1d97b5c4-b6ef-43c6-b721-1e52eee19d3a@github.com> References: <1cFRkcs5JmgnbWEIaEoT8I9RiUtNxgKieAdkSB2Fgmc=.1d97b5c4-b6ef-43c6-b721-1e52eee19d3a@github.com> Message-ID: On Tue, 1 Jul 2025 16:14:00 GMT, Marc Chevalier wrote: >> When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 >> >> This is enforced by restoring the old state, like in >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 >> >> That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: >> >> ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) >> >> >> Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. >> >> Another situation is somewhat worse, when happening during parsing. It can lead to such cases: >> >> ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) >> >> The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? >> >> This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 >> >> And here there is the challenge: >> - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) >> ... > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Remove useless loop Nice analysis! In general, the fix looks good to me. I added a few comments / suggestions. src/hotspot/share/opto/library_call.cpp line 1732: > 1730: return false; > 1731: } > 1732: destruct_map_clone(old_state.map); I think `destruct_map_clone` could be refactored to take a `SavedState`. 
src/hotspot/share/opto/library_call.cpp line 2376: > 2374: state.map = clone_map(); > 2375: for (DUIterator_Fast imax, i = control()->fast_outs(imax); i < imax; i++) { > 2376: Node* out = control()->fast_out(i); Could we have a similar issue with non-control users? For example, couldn't we also have stray memory users after bailout? src/hotspot/share/opto/library_call.cpp line 2393: > 2391: Node* out = control()->fast_out(i); > 2392: if (out->is_CFG() && out->in(0) == control() && out != map() && !state.ctrl_succ.member(out)) { > 2393: out->set_req(0, C->top()); Could `out` already be in the GVN hash ("remove node from hash table before modifying it")? src/hotspot/share/opto/library_call.hpp line 129: > 127: virtual int reexecute_sp() { return _reexecute_sp; } > 128: > 129: struct SavedState { Please add a comment describing what it's used for. test/hotspot/jtreg/compiler/intrinsics/VectorIntoArrayInvalidControlFlow.java line 2: > 1: /* > 2: * Copyright (c) 2021, Oracle and/or its affiliates. All rights reserved. Suggestion: * Copyright (c) 2025, Oracle and/or its affiliates. All rights reserved. test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java line 104: > 102: public static final String START = "(\\d+(\\s){2}("; > 103: public static final String MID = ".*)+(\\s){2}===.*"; > 104: public static final String END = ")"; I don't like exposing these outside the IR framework but then again I don't really have an idea on how to check the "graph should not have both nodes" invariant. Maybe we should extend the `counts` annotation to support something like `@IR(counts = {IRNode.CallStaticJava, IRNode.OpaqueNotNull, "<= 1"} [...]`? ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25936#pullrequestreview-2992473824 PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2189175998 PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2189211960 PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2189198041 PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2189172691 PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2189212910 PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2189244934 From chagedorn at openjdk.org Mon Jul 7 08:01:45 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 7 Jul 2025 08:01:45 GMT Subject: RFR: 8342941: IGV: Add new graph dumps for post loop, empty loop removal, and one iteration removal [v2] In-Reply-To: References: Message-ID: On Thu, 12 Jun 2025 22:47:43 GMT, Saranya Natarajan wrote: >> This changeset adds BEFORE/AFTER graph dumps for creating a post loop (`insert_post_loop()`), removing an empty loop (`do_remove_empty_loop()`), and removing a one iteration loop (`do_one_iteration_loop()`). >> >> Changes: >> - Added `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` for dumping graphs before and after `insert_post_loop()`. >> - Added `BEFORE_REMOVE_EMPTY_LOOP` and `AFTER_REMOVE_EMPTY_LOOP` for dumping graphs before and after `do_remove_empty_loop()`. >> - Added `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` for dumping graphs before and after `do_one_iteration_loop()`. >> >> Below are sample screenshots (IGV print level 4 ) mainly showing the new phase . >> 1. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` >> ![image](https://github.com/user-attachments/assets/1661cede-5d70-4e0d-abec-3d091c7675c8) >> 2. 
`BEFORE_POST_LOOP` and `AFTER_POST_LOOP` with SuperWordLoopUnrollAnalysis enabled >> ![image](https://github.com/user-attachments/assets/6a22e6f0-4e6c-4e9d-8b6b-2bf75fac783d) >> 3.` BEFORE_REMOVE_EMPTY_LOOP `and `AFTER_REMOVE_EMPTY_LOOP` >> ![image](https://github.com/user-attachments/assets/3281f00b-575e-4604-83dd-831037d8dd47) >> 4. `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` >> ![image](https://github.com/user-attachments/assets/efddbc9a-64f7-403d-acfe-330d75a00911) >> >> Question to reviewers: >> Are the new compiler phases OK, or should we change anything? >> >> Testing: >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. >> Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > Addressing review comments Otherwise, looks good, thanks! You should merge in latest master to resolve the conflicts. src/hotspot/share/opto/phasetype.hpp line 79: > 77: flags(AFTER_REMOVE_EMPTY_LOOP, "After Remove Empty Loops") \ > 78: flags(BEFORE_ONE_ITERATION_LOOP, "Before Replacing One Iteration Loops") \ > 79: flags(AFTER_ONE_INTERATION_LOOP, "After Replacing One Iteration Loops") \ There is a typo for `AFTER_ONE_INTERATION_LOOP` -> `ITERATION` Nit: We only apply it for one loop and thus you can remove the trailing `s`. Suggestion: flags(BEFORE_POST_LOOP, "Before Post Loop") \ flags(AFTER_POST_LOOP, "After Post Loop") \ flags(BEFORE_REMOVE_EMPTY_LOOP, "Before Remove Empty Loop") \ flags(AFTER_REMOVE_EMPTY_LOOP, "After Remove Empty Loop") \ flags(BEFORE_ONE_ITERATION_LOOP, "Before Replacing One Iteration Loop") \ flags(AFTER_ONE_ITERATION_LOOP, "After Replacing One Iteration Loop") \ ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25756#pullrequestreview-2992625351 PR Review Comment: https://git.openjdk.org/jdk/pull/25756#discussion_r2189277869 From duke at openjdk.org Mon Jul 7 08:19:42 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Mon, 7 Jul 2025 08:19:42 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v8] In-Reply-To: <5e1o1xtN0ZdQZGJi2aVmgCEApW625koeE9F53VhDi5E=.2390045d-844e-4800-8d4b-075a2a3a8793@github.com> References: <5e1o1xtN0ZdQZGJi2aVmgCEApW625koeE9F53VhDi5E=.2390045d-844e-4800-8d4b-075a2a3a8793@github.com> Message-ID: On Mon, 5 May 2025 10:17:27 GMT, Yuri Gaevsky wrote: >> The patch adds possibility to use RVV instructions for faster vectorizedHashCode calculations on RVV v1.0.0 capable hardware. >> >> Testing: hotspot/jtreg/compiler/ under QEMU-8.1 with RVV v1.0.0. > > Yuri Gaevsky has updated the pull request incrementally with one additional commit since the last revision: > > change slli+add sequence to shadd . ------------- PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-3043933851 From dzhang at openjdk.org Mon Jul 7 08:28:41 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 7 Jul 2025 08:28:41 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v2] In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 03:05:25 GMT, Dingli Zhang wrote: >> Hi, please consider this code cleanup change for native call. >> >> This removes the address parameter for NativeCall::reloc_destination and NativeFarCall::reloc_destination. 
>> This also removes several unnecessary code blob related runtime checks turning them into assertions. >> >> ### Testing >> * [x] hs:tier1 - hs:tier3 tested with linux-riscv64 fastdebug build > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Remove outdated comments Hi @robehn , could you help to review this patch? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26150#issuecomment-3043958394 From bkilambi at openjdk.org Mon Jul 7 08:31:48 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 7 Jul 2025 08:31:48 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v10] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 08:26:00 GMT, Hao Sun wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Addressed review comments > > src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 5159: > >> 5157: // consecutive. The match rules for SelectFromTwoVector reserve two consecutive vector registers >> 5158: // for src1 and src2. >> 5159: // Four combinations of vector registers each for vselect_from_two_vectors_HS_Neon and > > I suppose the function names are changed now. Should use `select_from_two_vectors_Neon` and `select_from_two_vectors_SVE` instead. Hi @shqking , the match rule names still begin with `vselect_from_two_vectors_Neon_*`. `select_from_two_vectors_Neon` and `select_from_two_vectors_SVE` are routines in the MacroAssembler. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2189356707 From thartmann at openjdk.org Mon Jul 7 08:35:50 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 7 Jul 2025 08:35:50 GMT Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v3] In-Reply-To: References: Message-ID: On Mon, 23 Jun 2025 12:39:23 GMT, Marc Chevalier wrote: >> A first part toward a better support of pure functions, but this time, with guidance from @iwanowww. >> >> ## Pure Functions >> >> Pure functions (considered here) are functions that have no side effects, no effect on the control flow (no exception or such), cannot deopt etc.. It's really a function that you can execute anywhere, with whichever arguments without effect other than wasting time. Integer division is not pure as dividing by zero is throwing. But many floating point functions will just return `NaN` or `+/-infinity` in problematic cases. >> >> ## Scope >> >> We are not going all powerful for now! It's mostly about identifying some pure functions and being able to remove them if the result is unused. Some other things are not part of this PR, on purpose. Especially, this PR doesn't propose a way to move pure calls around. The reason is that pure calls are later expanded into regular calls, which require a control input. To be able to do the expansion, we just keep the control in the pure call as well. >> >> ## Implementation Overview >> >> We created here some new node kind for pure calls, inheriting leaf calls, that are expanded into regular leaf calls during final graph reshaping. The possibility to support pure call directly in AD file is left open. >> >> This PR also introduces `TupleNode` (largely based on an original idea/implem of @iwanowww), that just tie multiple input together and play well with `ProjNode`: the n-th projection of a `TupleNode` is the n-th input of the tuple. 
This is a convenient way to skip and remove nodes from the graph while delegating the difficulty of the surgery to the trusted IGVN's implementation. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > mostly comments Thanks for digging into this Marc. The changes look good to me. I just added a few minor comments / questions. > this PR doesn't propose a way to move pure calls around Should we have a separate RFE for that? src/hotspot/share/opto/divnode.cpp line 1522: > 1520: Node* super = CallLeafPureNode::Ideal(phase, can_reshape); > 1521: if (super != nullptr) { > 1522: return super; Can't we just do `return CallLeafPureNode::Ideal(phase, can_reshape);` at the end of `ModFNode::Ideal` instead of `return nullptr`? That's what we usually do in C2, for example in `CallStaticJavaNode::Ideal` -> `CallNode::Ideal`. Feels more natural to me and would avoid the `super != nullptr` check. Also for the other `Ideal` methods that you modified. src/hotspot/share/opto/divnode.cpp line 1528: > 1526: bool not_dead = proj_out_or_null(TypeFunc::Control) != nullptr; > 1527: if (result_is_unused && not_dead) { > 1528: return replace_with_con(igvn, TypeF::make(0.)); Can we replace all the other usages of `ModFloatingNode::replace_with_con` by `TupleNode` and get rid of that method? src/hotspot/share/opto/graphKit.cpp line 1916: > 1914: if (call->is_CallLeafPure()) { > 1915: // Pure function have only control (for now) and data output, in particular > 1916: // the don't touch the memory, so we don't want a memory proj that is set after. Suggestion: // they don't touch the memory, so we don't want a memory proj that is set after. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25760#pullrequestreview-2992602888 PR Review Comment: https://git.openjdk.org/jdk/pull/25760#discussion_r2189268498 PR Review Comment: https://git.openjdk.org/jdk/pull/25760#discussion_r2189357774 PR Review Comment: https://git.openjdk.org/jdk/pull/25760#discussion_r2189261824 From mdoerr at openjdk.org Mon Jul 7 08:39:44 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Mon, 7 Jul 2025 08:39:44 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v6] In-Reply-To: References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: <0Y-EMD27kUQdvljb8SUfAX09BZmFrhUAEMpA6aHsiEI=.b21bf4dd-ca36-405c-a8ab-044c0cb35749@github.com> On Mon, 7 Jul 2025 07:30:19 GMT, David Briemann wrote: >> Implement more nodes for ppc that exist on other platforms. > > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > rename instruction, add extra predicate cond for type int Thanks! ------------- Marked as reviewed by mdoerr (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26115#pullrequestreview-2992762483 From bkilambi at openjdk.org Mon Jul 7 08:40:42 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 7 Jul 2025 08:40:42 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v10] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 10:04:26 GMT, Jatin Bhateja wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Addressed review comments > > test/hotspot/jtreg/compiler/vectorapi/TestSelectFromTwoVectorOp.java line 234: > >> 232: >> 233: @Test >> 234: @IR(counts = {IRNode.SELECT_FROM_TWO_VECTOR_VS, IRNode.VECTOR_SIZE_8, ">0"}, > > Hi @Bhavana-Kilambi , > Kindly also include x86-specific feature checks in IR rules for this test. > > You can directly integrate attached patch. > > [select_from_ir_feature.txt](https://github.com/user-attachments/files/21034639/select_from_ir_feature.txt) Thank you @jatin-bhateja . Will do that in my next patch with other changes. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2189374185 From shade at openjdk.org Mon Jul 7 09:07:32 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 7 Jul 2025 09:07:32 GMT Subject: RFR: 8361397: Rework CompileLog list synchronization [v2] In-Reply-To: <12Yp6QmpXqG-1UXTS8VveJ4yDNnDEGFV2q3_vRc3lF0=.4ccf05e2-9249-4b55-b48f-4f7fc17bef65@github.com> References: <12Yp6QmpXqG-1UXTS8VveJ4yDNnDEGFV2q3_vRc3lF0=.4ccf05e2-9249-4b55-b48f-4f7fc17bef65@github.com> Message-ID: > I want to remove `CompileTaskAlloc_lock` completely with [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473), and for that we need to fix a stray use of that lock in CompileLog list linkage. We can rewrite that part to atomics. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `compiler` Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains two additional commits since the last revision: - Merge branch 'master' into JDK-8361397-compilelog-list - Fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26127/files - new: https://git.openjdk.org/jdk/pull/26127/files/4df91936..3eec06a4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26127&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26127&range=00-01 Stats: 1617 lines in 64 files changed: 790 ins; 601 del; 226 mod Patch: https://git.openjdk.org/jdk/pull/26127.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26127/head:pull/26127 PR: https://git.openjdk.org/jdk/pull/26127 From jbhateja at openjdk.org Mon Jul 7 09:07:40 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 7 Jul 2025 09:07:40 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: <3cr8Njt2flaQXy5sjOVOlhI9XDkEesagnYLwzCmgkoI=.089494aa-d622-47db-8d23-c9637519028c@github.com> References: <3cr8Njt2flaQXy5sjOVOlhI9XDkEesagnYLwzCmgkoI=.089494aa-d622-47db-8d23-c9637519028c@github.com> Message-ID: On Mon, 7 Jul 2025 03:43:44 GMT, erifan wrote: > > > > public static final VectorSpecies FSP = FloatVector.SPECIES_512; > > > > public static long micro1(long a) { > > > > long mask = Math.min(-1, Math.max(-1, a)); > > > > return VectorMask.fromLong(FSP, mask).toLong(); > > > > } > > > > public static long micro2() { > > > > return FSP.maskAll(true).toLong(); > > > > } > > > > > > > > > With this JMH method we can not see obvious performance improvement, because the hot spots are other instructions. Adding a loop is better. > > > > > > There is no hard and fast rule for the inclusion of a loop in a JMH micro in that case? > > You mean adding a loop is not a block, right ? Yes. If you see gains without loop go for it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3044086773 From jbhateja at openjdk.org Mon Jul 7 09:11:40 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 7 Jul 2025 09:11:40 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: References: <_RERljqu_FG7ZyneAk7Thd-9TwED18pQpEBz_i105fY=.b8948a23-273a-49f6-b9cb-6b611a5eedc6@github.com> <6SXA9ZrXBDhZLyXP3lXbkpl4dl3iocvDpzPrUpIQOl8=.9b025be2-848b-4b78-a5e4-929cb7e9f798@github.com> Message-ID: On Mon, 7 Jul 2025 03:42:27 GMT, erifan wrote: > > What if during iterative GVN a constant -1 seeps through IR graph and gets connected to the input of VectorLongToMaskNode, you won't be able to create maskAll true in that case? > > Yes, this PR doesn't support this case. Maybe we should do this optimization in `ideal`. If `VectorLongToMask` is not supported, then try to convert it to `maskAll` or `Replicate` in intrinsic. > I would suggest extending VectorLongToMaskNode::Ideal for completeness of the solution. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2189445358 From duke at openjdk.org Mon Jul 7 09:35:47 2025 From: duke at openjdk.org (erifan) Date: Mon, 7 Jul 2025 09:35:47 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v9] In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 06:19:15 GMT, Emanuel Peter wrote: >> erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 15 additional commits since the last revision: >> >> - Address more comments >> >> ATT. >> - Merge branch 'master' into JDK-8354242 >> - Support negating unsigned comparison for BoolTest::mask >> >> Added a static method `negate_mask(mask btm)` into BoolTest class to >> negate both signed and unsigned comparison. >> - Addressed some review comments >> - Merge branch 'master' into JDK-8354242 >> - Refactor the JTReg tests for compare.xor(maskAll) >> >> Also made a bit change to support pattern `VectorMask.fromLong()`. >> - Merge branch 'master' into JDK-8354242 >> - Refactor code >> >> Add a new function XorVNode::Ideal_XorV_VectorMaskCmp to do this >> optimization, making the code more modular. >> - Merge branch 'master' into JDK-8354242 >> - Update the jtreg test >> - ... and 5 more: https://git.openjdk.org/jdk/compare/8e600a2f...5ebdc572 > > src/hotspot/share/opto/vectornode.cpp line 2243: > >> 2241: !VectorNode::is_all_ones_vector(in2)) { >> 2242: return nullptr; >> 2243: } > > Suggestion: > > if (in1->Opcode() != Op_VectorMaskCmp || > in1->outcnt() != 1 || > !(in1->as_VectorMaskCmp())->predicate_can_be_negated() || > !VectorNode::is_all_ones_vector(in2)) { > return nullptr; > } > > Indentation for clarity. > > Currently, you exiting if one of these is the case: > 1. Not `MaskCmp` > 2. More than one use > 3. predicate cannot be negated AND the vector is all ones. Can you explain this condition? Do you have tests for cases: > - predicate negatable and vector not all ones > - predircate not negatable and vector not all ones > - predicate negatable and vector all ones > - predicate not negatable and vectors all ones > > Why do you guard against `VectorNode::is_all_ones_vector(in2)` at all? > > The condition for 3. is easy to get wrong, so good testing is important here :) The current testing status for the conditions you listed: > 1. Not MaskCmp. **No test for it, tested locally**, Because I think this condition is too straightforward. > 2. More than one use. **Tested**, see `VectorMaskCompareNotTest.java line 1118`. > predicate negatable and vector not all ones. **Tested**, see `VectorMaskCompareNotTest.java line 1126`. > predicate not negatable and vector not all ones. **No test for it**, because we have tests for `predicate not negatable` or `vector not all ones`. If either is `false`, return nullptr. > predicate negatable and vector all ones. **A lot of tests for it**. For example `VectorMaskCompareNotTest.java line 1014`. > predicate not negatable and vectors all ones. **Tested**, see `VectorMaskCompareNotTest.java line 1222`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2189495935 From bkilambi at openjdk.org Mon Jul 7 10:27:39 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 7 Jul 2025 10:27:39 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> Message-ID: On Mon, 7 Jul 2025 06:59:20 GMT, Xiaohong Gong wrote: >> Have you measured the performance of this micro-benchmark on NEON machine? 
>> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java#L251-L256 >> >> We added an limitation only for `int` before: >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L131-L134 >> >> Perhaps we also need to impose a similar limitation on `short` if the same regression occurs. > >> Have you measured the performance of this micro-benchmark on NEON machine? >> >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java#L251-L256 >> >> We added an limitation only for `int` before: >> >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L131-L134 >> >> Perhaps we also need to impose a similar limitation on `short` if the same regression occurs. > > Good catch, and thanks so much for your input @fg1417 ! I will test the performance and disable auto-vectorization for double to short casting if the performance has regression. > >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 > > Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. Hi @XiaohongGong, is there any way we can implement 2HF -> 2S and 2S -> 2HF in these match rules ? https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4697 https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4679 The `fcvtn` and `fcvtl` instructions do not support these arrangements. I was wondering if there is any other way we can implement these by any chance? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3044358446 From mhaessig at openjdk.org Mon Jul 7 10:29:54 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 7 Jul 2025 10:29:54 GMT Subject: RFR: 8360175: C2 crash: assert(edge_from_to(prior_use,n)) failed: before block local scheduling Message-ID: The triggered assert is part of the schedule verification code that runs just before machine code is emitted. The debug output showed that a `leaPCompressedOopOffset` node was causing the assert, which suggested the peephole optimization introduced in #25471 as the cause. The failure proved quite difficult to reproduce. It failed more often on Windows and required `-XX:+UseKNLSetting` (forces code generation for Intel's Knights Landing platform), which forces `-XX:+OptoScheduling`. The root-cause is a subtle bug in the rewiring of the base edge of `leaP*` nodes in the `remove_redundant_lea` peephole. When the peephole removed a `decodeHeapOop_not_null` including a spill, it did not set the base edge of the `leaP*` node to the same node as the address edge, which is the intent of the peephole, but to the parent node of the spill. That is not catastrophic in most cases, but might reference another register slot, which causes this assert. 
Concretely, we see the following graph

    MemToRegSpillCopy
     |   |
     |   MemToRegSpillCopy
     |   |
     |   DefinitionSpillCopy
     |   |
     |   decodeHeapOop_not_null
     |   |
    leaPCompressedHeapOop

gets rewired to

    MemToRegSpillCopy
     |   |
     |   DefinitionSpillCopy
     |   |
    leaPCompressedHeapOop

instead of

    MemToRegSpillCopy
         |
    DefinitionSpillCopy
        /   \
    leaPCompressedHeapOop

This PR fixes this by always setting the base edge of the `leaP*` node to the same node as the address edge. Unfortunately, I was not able to construct a regression test because of the difficulty of reproducing the bug.

# Testing
- [ ] Github Actions
- [x] tier1,tier2 plus internal testing on all Oracle supported platforms
- [x] tier3,tier4,tier5 plus internal testing on Linux and Windows x64
- [ ] Runthese8H on `windows-x64-debug` (test that reliably produced the failure addressed in this PR)
PR Review: https://git.openjdk.org/jdk/pull/26046#pullrequestreview-2993334779 From chagedorn at openjdk.org Mon Jul 7 12:16:40 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 7 Jul 2025 12:16:40 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp In-Reply-To: References: Message-ID: On Fri, 4 Jul 2025 02:49:27 GMT, guanqiang han wrote: > When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. > > This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this flag is disabled. Changes requested by chagedorn (Reviewer). src/hotspot/share/opto/escape.cpp line 981: > 979: if (!OptimizePtrCompare) { > 980: return; > 981: } Thanks for working on this! IIUC, having the bailout here will fail to reduce the phi which could be unexpected. Shouldn't we just return `UNKNOWN` from within `ConnectionGraph::optimize_ptr_compare()` when we run without `OptimizePtrCompare`? On a separate note, can you also add a regression test? Maybe you can also just add a run with `-XX:-OptimizePtrCompare` - maybe together with `-XX:+VerifyReduceAllocationMerges` for more verification - to `compiler/c2/irTests/scalarReplacement/AllocationMergesTests.java`. @JohnTortugo you might also want to have a look at this. ------------- PR Review: https://git.openjdk.org/jdk/pull/26125#pullrequestreview-2993546963 PR Review Comment: https://git.openjdk.org/jdk/pull/26125#discussion_r2189878464 From duke at openjdk.org Mon Jul 7 13:07:55 2025 From: duke at openjdk.org (Andrej =?UTF-8?B?UGXEjWltw7p0aA==?=) Date: Mon, 7 Jul 2025 13:07:55 GMT Subject: RFR: 8357689: Refactor JVMCI to enable replay compilation in Graal Message-ID: This PR introduces a few minor JVMCI refactorings to make replay compilation possible in the Graal compiler. ------------- Commit messages: - Remove an unused import. - Use a pattern variable. - use list when applicable - JVMCI refactorings to enable replay compilation in Graal. Changes: https://git.openjdk.org/jdk/pull/25433/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=25433&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8357689 Stats: 434 lines in 21 files changed: 310 ins; 8 del; 116 mod Patch: https://git.openjdk.org/jdk/pull/25433.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25433/head:pull/25433 PR: https://git.openjdk.org/jdk/pull/25433 From duke at openjdk.org Mon Jul 7 13:07:56 2025 From: duke at openjdk.org (Andrej =?UTF-8?B?UGXEjWltw7p0aA==?=) Date: Mon, 7 Jul 2025 13:07:56 GMT Subject: RFR: 8357689: Refactor JVMCI to enable replay compilation in Graal In-Reply-To: References: Message-ID: On Tue, 27 May 2025 12:10:29 GMT, Doug Simon wrote: >> This PR introduces a few minor JVMCI refactorings to make replay compilation possible in the Graal compiler. 
> > src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/code/BytecodeFrame.java line 397: > >> 395: * @return a copy of the slot kinds array >> 396: */ >> 397: public JavaKind[] getSlotKinds() { > > Keep in mind that `slotKinds` is being [converted](https://github.com/openjdk/jdk/pull/25442/files#diff-43834727ed7dcd5128c10238ba56963c7d8feb66578b036c75dcf734bfa2ec92R80) to a List. In that context, can we return the list without making a copy? Or is the caller expected to be able to mutate the return value? Applied the changes provided by @mur47x111. > src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/code/VirtualObject.java line 325: > >> 323: * values have not been initialized. >> 324: */ >> 325: public JavaKind[] getSlotKinds() { > > Same comments as for BytecodeFrame.getSlotKinds. This applies to all other non-primitive array return values added by this PR. I left `Object[]` as the return value of `EncodedSpeculationReason#getReason` since the array could contain null elements, preventing the use of an immutable list like everywhere else. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25433#discussion_r2189993024 PR Review Comment: https://git.openjdk.org/jdk/pull/25433#discussion_r2190003945 From dnsimon at openjdk.org Mon Jul 7 13:07:56 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 7 Jul 2025 13:07:56 GMT Subject: RFR: 8357689: Refactor JVMCI to enable replay compilation in Graal In-Reply-To: References: Message-ID: On Sat, 24 May 2025 16:49:23 GMT, Andrej Pe?im?th wrote: > This PR introduces a few minor JVMCI refactorings to make replay compilation possible in the Graal compiler. src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/code/BytecodeFrame.java line 397: > 395: * @return a copy of the slot kinds array > 396: */ > 397: public JavaKind[] getSlotKinds() { Keep in mind that `slotKinds` is being [converted](https://github.com/openjdk/jdk/pull/25442/files#diff-43834727ed7dcd5128c10238ba56963c7d8feb66578b036c75dcf734bfa2ec92R80) to a List. In that context, can we return the list without making a copy? Or is the caller expected to be able to mutate the return value? src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/code/VirtualObject.java line 293: > 291: return true; > 292: } > 293: if (o instanceof VirtualObject) { Rename `l` to `that` and use pattern instanceof: Suggestion: if (o instanceof VirtualObject that) { src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/code/VirtualObject.java line 325: > 323: * values have not been initialized. > 324: */ > 325: public JavaKind[] getSlotKinds() { Same comments as for BytecodeFrame.getSlotKinds. This applies to all other non-primitive array return values added by this PR. src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/hotspot/HotSpotCompiledCode.java line 225: > 223: * Returns a copy of the array of {@link ResolvedJavaMethod} objects representing the methods > 224: * whose bytecodes were used as input to the compilation. If the compilation did not record > 225: * method dependencies, this method returns {@code null}. Otherwise, the first element of the null -> empty list? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25433#discussion_r2109017920 PR Review Comment: https://git.openjdk.org/jdk/pull/25433#discussion_r2109025913 PR Review Comment: https://git.openjdk.org/jdk/pull/25433#discussion_r2109034172 PR Review Comment: https://git.openjdk.org/jdk/pull/25433#discussion_r2109040017 From rrich at openjdk.org Mon Jul 7 13:23:46 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Mon, 7 Jul 2025 13:23:46 GMT Subject: RFR: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining [v2] In-Reply-To: <5IVOmLIL__iwmhFaKEP80YNP8kIN94owhC3SIU8ZF4U=.ab4061f3-e62f-4586-9118-7f84f246078d@github.com> References: <5IVOmLIL__iwmhFaKEP80YNP8kIN94owhC3SIU8ZF4U=.ab4061f3-e62f-4586-9118-7f84f246078d@github.com> Message-ID: On Fri, 4 Jul 2025 08:14:19 GMT, Richard Reingruber wrote: >> This PR adds CompileCommands to the test DumpThreadsWithEliminatedLock.java to force inlining of java/lang/String*.* methods. This will make inlining more stable to allow for the expected lock elimination based on c2 escape analysis. >> >> Forcing inlining of java/lang/StringBuffer.* wasn't sufficient on x86_64. With that the test still failed with TieredCompilation disabled. >> >> Testing: x86_64, ppc64 manually. Other major platforms as part of our CI testing. >> >> Failed inlining on x86_64 with TieredCompilation disabled: >> >> >> make test TEST=com/sun/management/HotSpotDiagnosticMXBean/DumpThreadsWithEliminatedLock.java TEST_VM_OPTS="-XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=PrintInlining,DumpThreadsWithEliminatedLock.*" JTREG=TIMEOUT_FACTOR=0.1 >> >> [...] >> >> STDOUT: >> CompileCommand: PrintInlining DumpThreadsWithEliminatedLock.* bool PrintInlining = true >> @ 1 java.util.concurrent.atomic.AtomicBoolean::get (13 bytes) inline (hot) >> @ 11 java.lang.StringBuffer:: (7 bytes) inline (hot) late inline succeeded (string method) >> @ 3 java.lang.AbstractStringBuilder:: (39 bytes) inline (hot) >> @ 1 java.lang.Object:: (1 bytes) inline (hot) >> @ 16 java.lang.System::currentTimeMillis (0 bytes) (intrinsic) >> s @ 19 java.lang.StringBuffer::append (13 bytes) failed to inline: already compiled into a big method >> s @ 24 java.lang.StringBuffer::toString (44 bytes) inline (hot) late inline succeeded (string method) >> s @ 1 java.lang.StringBuffer::length (5 bytes) accessor >> @ 24 java.lang.String:: (98 bytes) failed to inline: already compiled into a big method >> @ 30 java.util.concurrent.atomic.AtomicReference::set (6 bytes) accessor >> 2025-07-02T09:25:53.396634900Z Attempt 1, found: false >> 2025-07-02T09:25:53.415673072Z Attempt 2, found: false >> 2025-07-02T09:25:53.418876867Z Attempt 3, found: false >> >> [...] > > Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision: > > Allow vm.debug Thanks for the reviews and feedback. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26033#issuecomment-3045085386 From rrich at openjdk.org Mon Jul 7 13:23:47 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Mon, 7 Jul 2025 13:23:47 GMT Subject: Integrated: 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining In-Reply-To: References: Message-ID: On Sun, 29 Jun 2025 15:26:14 GMT, Richard Reingruber wrote: > This PR adds CompileCommands to the test DumpThreadsWithEliminatedLock.java to force inlining of java/lang/String*.* methods. 
This will make inlining more stable to allow for the expected lock elimination based on c2 escape analysis. > > Forcing inlining of java/lang/StringBuffer.* wasn't sufficient on x86_64. With that the test still failed with TieredCompilation disabled. > > Testing: x86_64, ppc64 manually. Other major platforms as part of our CI testing. > > Failed inlining on x86_64 with TieredCompilation disabled: > > > make test TEST=com/sun/management/HotSpotDiagnosticMXBean/DumpThreadsWithEliminatedLock.java TEST_VM_OPTS="-XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=PrintInlining,DumpThreadsWithEliminatedLock.*" JTREG=TIMEOUT_FACTOR=0.1 > > [...] > > STDOUT: > CompileCommand: PrintInlining DumpThreadsWithEliminatedLock.* bool PrintInlining = true > @ 1 java.util.concurrent.atomic.AtomicBoolean::get (13 bytes) inline (hot) > @ 11 java.lang.StringBuffer:: (7 bytes) inline (hot) late inline succeeded (string method) > @ 3 java.lang.AbstractStringBuilder:: (39 bytes) inline (hot) > @ 1 java.lang.Object:: (1 bytes) inline (hot) > @ 16 java.lang.System::currentTimeMillis (0 bytes) (intrinsic) > s @ 19 java.lang.StringBuffer::append (13 bytes) failed to inline: already compiled into a big method > s @ 24 java.lang.StringBuffer::toString (44 bytes) inline (hot) late inline succeeded (string method) > s @ 1 java.lang.StringBuffer::length (5 bytes) accessor > @ 24 java.lang.String:: (98 bytes) failed to inline: already compiled into a big method > @ 30 java.util.concurrent.atomic.AtomicReference::set (6 bytes) accessor > 2025-07-02T09:25:53.396634900Z Attempt 1, found: false > 2025-07-02T09:25:53.415673072Z Attempt 2, found: false > 2025-07-02T09:25:53.418876867Z Attempt 3, found: false > > [...] This pull request has now been integrated. Changeset: fea73c1d Author: Richard Reingruber URL: https://git.openjdk.org/jdk/commit/fea73c1d40441561a246f2a09a739367cfc197ea Stats: 5 lines in 1 file changed: 0 ins; 0 del; 5 mod 8360599: [TESTBUG] DumpThreadsWithEliminatedLock.java fails because of unstable inlining Reviewed-by: alanb, mdoerr, lmesnik ------------- PR: https://git.openjdk.org/jdk/pull/26033 From fgao at openjdk.org Mon Jul 7 13:27:40 2025 From: fgao at openjdk.org (Fei Gao) Date: Mon, 7 Jul 2025 13:27:40 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> Message-ID: On Mon, 7 Jul 2025 06:59:20 GMT, Xiaohong Gong wrote: > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 > > Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. Hi @XiaohongGong thanks for asking. In many auto-vectorization cases involving `char`, the vector elements are represented using `T_SHORT` as the `BasicType`, rather than `T_CHAR`. 
This is because, in Java, operands of subword types are always promoted to `int` before any arithmetic operation. As a result, when handling a node like `ConvD2I`, we don?t initially know its actual subword type. Later, the SuperWord phase propagates a narrowed integer type backward to help determine the correct subword type. See: https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2551-L2558 Since SuperWord assigns `T_SHORT` to `StoreC` early on https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2646-L2650 the entire propagation chain tends to use `T_SHORT` as well. This applies to most operations, with the exception of a few like `RShiftI`, `Abs`, and `ReverseBytesI`, which are handled separately. So your change already benefits many char-related vectorization cases like `convertDoubleToChar` above. That?s why we can safely relax the IR condition mentioned earlier. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3045113900 From chagedorn at openjdk.org Mon Jul 7 13:38:55 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 7 Jul 2025 13:38:55 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v34] In-Reply-To: References: Message-ID: On Thu, 5 Jun 2025 08:27:47 GMT, Roland Westrelin wrote: >> To optimize a long counted loop and long range checks in a long or int >> counted loop, the loop is turned into a loop nest. When the loop has >> few iterations, the overhead of having an outer loop whose backedge is >> never taken, has a measurable cost. Furthermore, creating the loop >> nest usually causes one iteration of the loop to be peeled so >> predicates can be set up. If the loop is short running, then it's an >> extra iteration that's run with range checks (compared to an int >> counted loop with int range checks). >> >> This change doesn't create a loop nest when: >> >> 1- it can be determined statically at loop nest creation time that the >> loop runs for a short enough number of iterations >> >> 2- profiling reports that the loop runs for no more than ShortLoopIter >> iterations (1000 by default). >> >> For 2-, a guard is added which is implemented as yet another predicate. >> >> While this change is in principle simple, I ran into a few >> implementation issues: >> >> - while c2 has a way to compute the number of iterations of an int >> counted loop, it doesn't have that for long counted loop. The >> existing logic for int counted loops promotes values to long to >> avoid overflows. I reworked it so it now works for both long and int >> counted loops. >> >> - I added a new deoptimization reason (Reason_short_running_loop) for >> the new predicate. Given the number of iterations is narrowed down >> by the predicate, the limit of the loop after transformation is a >> cast node that's control dependent on the short running loop >> predicate. Because once the counted loop is transformed, it is >> likely that range check predicates will be inserted and they will >> depend on the limit, the short running loop predicate has to be the >> one that's further away from the loop entry. 
Now it is also possible >> that the limit before transformation depends on a predicate >> (TestShortRunningLongCountedLoopPredicatesClone is an example), we >> can have: new predicates inserted after the transformation that >> depend on the casted limit that itself depend on old predicates >> added before the transformation. To solve this cicular dependency, >> parse and assert predicates are cloned between the old predicates >> and the loop head. The cloned short running loop parse predicate is >> the one that's used to insert the short running loop predicate. >> >> - In the case of a long counted loop, the loop is transformed into a >> regular loop with a ... > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 94 commits: > > - small fix > - Merge branch 'master' into JDK-8342692 > - review > - review > - Update test/micro/org/openjdk/bench/java/lang/foreign/HeapMismatchManualLoopTest.java > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningLongCountedLoopScaleOverflow.java > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningLongCountedLoopPredicatesClone.java > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningLongCountedLoop.java > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningIntLoopWithLongChecksPredicates.java > > Co-authored-by: Christian Hagedorn > - Update src/hotspot/share/opto/loopnode.cpp > > Co-authored-by: Christian Hagedorn > - ... and 84 more: https://git.openjdk.org/jdk/compare/faf19abd...fd19ee84 I quickly ran through Emanuel's review comments. I think they all have been addressed. Added some follow-up suggestions on top but otherwise, it still looks good to me. I guess since Emanuel is short on time, it would be good to have another review from someone. src/hotspot/share/opto/c2_globals.hpp line 868: > 866: product(bool, ShortRunningLongLoop, true, DIAGNOSTIC, \ > 867: "long counted loop/long range checks: don't create loop nest if " \ > 868: "loop runs for small enough number of iterations. Long loop is" \ Suggestion: "loop runs for small enough number of iterations. Long loop is " \ src/hotspot/share/opto/loopnode.cpp line 1125: > 1123: } > 1124: > 1125: class NodeInSingleLoopBody : public NodeInLoopBody { After the suggested update by Emanuel to have a more general name, you could also consider moving it to `predicates.hpp` to the other implementing classes of `NodeInLoopBody`. src/hotspot/share/opto/loopnode.cpp line 1147: > 1145: CloneShortLoopPredicateVisitor(LoopNode* target_loop_head, > 1146: const NodeInSingleLoopBody& node_in_loop_body, > 1147: PhaseIdealLoop* phase) Suggestion: CloneShortLoopPredicateVisitor(LoopNode* target_loop_head, const NodeInSingleLoopBody& node_in_loop_body, PhaseIdealLoop* phase) test/hotspot/jtreg/compiler/longcountedloops/TestStressShortRunningLongCountedLoop.java line 35: > 33: * @build jdk.test.whitebox.WhiteBox > 34: * @run driver jdk.test.lib.helpers.ClassFileInstaller jdk.test.whitebox.WhiteBox > 35: * @run main/othervm -Xbootclasspath/a:. -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI compiler.longcountedloops.TestStressShortRunningLongCountedLoop Probably a copy-paste error: You do not seem to be using the WhiteBox API for this test. 
You can then just use * @run driver compiler.longcountedloops.TestStressShortRunningLongCountedLoop instead. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/21630#pullrequestreview-2993768522 PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2190022389 PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2190056593 PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2190094320 PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2190128934 From eastigeevich at openjdk.org Mon Jul 7 15:07:55 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Mon, 7 Jul 2025 15:07:55 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v2] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 17:11:19 GMT, Aleksey Shipilev wrote: >> Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: >> >> Simplify requirement for debug build > > OK, are you able to bisect which change? This fix to only do debug VM needs to be correctly linked to the actual cause, IMO. @shipilev, @theRealAph Any comments on the new version? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3045552156 From fjiang at openjdk.org Mon Jul 7 15:08:40 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Mon, 7 Jul 2025 15:08:40 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v2] In-Reply-To: <7KGEqkzMGveZ_lLtIcC0YwwHqmUri7L3_v7J6aVLmQM=.089fc97c-f09a-4220-87cc-a30d6dd10536@github.com> References: <7KGEqkzMGveZ_lLtIcC0YwwHqmUri7L3_v7J6aVLmQM=.089fc97c-f09a-4220-87cc-a30d6dd10536@github.com> Message-ID: On Sun, 6 Jul 2025 13:18:06 GMT, Feilong Jiang wrote: >> src/hotspot/share/c1/c1_Compiler.cpp line 240: >> >>> 238: #endif >>> 239: case vmIntrinsics::_getObjectSize: >>> 240: #if defined(X86) || defined(AARCH64) || defined(S390) || defined(RISCV64) || defined(PPC64) >> >> PS: The change of macro `RISCV` seems unrelated to this PR? Seem better to go with another PR. > > Reverted. Here is the seperate PR: https://github.com/openjdk/jdk/pull/26161 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2190358908 From fjiang at openjdk.org Mon Jul 7 15:09:18 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Mon, 7 Jul 2025 15:09:18 GMT Subject: RFR: 8361504: RISC-V: Make C1 clone intrinsic platform guard more specific Message-ID: <16DUz5Iytmw9i7wAxTx_oU4eeJBCsOI_15qzFP6M4GU=.8a5305a3-2b8d-4b9d-957a-430600bff4b4@github.com> Hi all. Please review this trivial patch, which changes the C1 primitive array clone intrinsic RISCV platform guard into RISCV64. As we only support RISCV64 for now. 
------------- Commit messages: - RISC-V: Make C1 clone intrinsic macro guard more accurate Changes: https://git.openjdk.org/jdk/pull/26161/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26161&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361504 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26161.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26161/head:pull/26161 PR: https://git.openjdk.org/jdk/pull/26161 From shade at openjdk.org Mon Jul 7 15:19:38 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 7 Jul 2025 15:19:38 GMT Subject: RFR: 8361397: Rework CompileLog list synchronization [v2] In-Reply-To: References: <12Yp6QmpXqG-1UXTS8VveJ4yDNnDEGFV2q3_vRc3lF0=.4ccf05e2-9249-4b55-b48f-4f7fc17bef65@github.com> Message-ID: On Mon, 7 Jul 2025 09:07:32 GMT, Aleksey Shipilev wrote: >> I want to remove `CompileTaskAlloc_lock` completely with [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473), and for that we need to fix a stray use of that lock in CompileLog list linkage. We can rewrite that part to atomics. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `compiler` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' into JDK-8361397-compilelog-list > - Fix Still looking for reviewers :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/26127#issuecomment-3045590870 From duke at openjdk.org Mon Jul 7 15:41:41 2025 From: duke at openjdk.org (guanqiang han) Date: Mon, 7 Jul 2025 15:41:41 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 12:11:46 GMT, Christian Hagedorn wrote: >> When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. >> >> This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this flag is disabled. > > src/hotspot/share/opto/escape.cpp line 981: > >> 979: if (!OptimizePtrCompare) { >> 980: return; >> 981: } > > Thanks for working on this! IIUC, having the bailout here will fail to reduce the phi which could be unexpected. Shouldn't we just return `UNKNOWN` from within `ConnectionGraph::optimize_ptr_compare()` when we run without `OptimizePtrCompare`? > > On a separate note, can you also add a regression test? Maybe you can also just add a run with `-XX:-OptimizePtrCompare` - maybe together with `-XX:+VerifyReduceAllocationMerges` for more verification - to `compiler/c2/irTests/scalarReplacement/AllocationMergesTests.java`. > > @JohnTortugo you might also want to have a look at this. Thanks a lot for your suggestion! I took a closer look at the code, and I now fully agree that your approach is the better one. 
Returning UNKNOWN from optimize_ptr_compare() when OptimizePtrCompare is disabled makes the behavior more consistent and avoids skipping reduce_phi_on_cmp() entirely, which could lead to unexpected results or missed optimization opportunities. I appreciate your feedback and will move forward with this approach. Thanks again! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26125#discussion_r2190428742 From shade at openjdk.org Mon Jul 7 15:43:41 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 7 Jul 2025 15:43:41 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 08:18:56 GMT, Evgeny Astigeevich wrote: >> Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. >> >> This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. >> >> Tested on Linux and MacOS with and without hsdis: >> - Fastdebug: test passed >> - Slowdebug: test passed. >> - Release: test passed. > > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Reimplement checking algo without using debug info test/hotspot/jtreg/compiler/onSpinWait/TestOnSpinWaitAArch64.java line 55: > 53: private static String retInst = "ret"; > 54: private static String neededAddInst = "addsp,sp,#0x20"; > 55: private static String neededLdpInst = "ldpx29,x30,[sp,#16]"; Move these default inits down to `analyzer.contains("[MachCode]")` block, since it looks like it selects between two options based on `[MachCode]` presence. Something like: boolean disassembly = analyzer.contains("[MachCode]"); retInst = disassembly ? "ret" : "c0035fd6"; ... test/hotspot/jtreg/compiler/onSpinWait/TestOnSpinWaitAArch64.java line 179: > 177: String s = instrReverseIter.previous(); > 178: instrReverseIter.next(); > 179: if (instrReverseIter.previous().startsWith(neededAddInst)) { Multiple issues here: - Confusing: what's the use of `s`, did you mean to use it for `startsWith`? - Indenting is off ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2190397094 PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2190415085 From shade at openjdk.org Mon Jul 7 15:43:42 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 7 Jul 2025 15:43:42 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v3] In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 15:31:49 GMT, Aleksey Shipilev wrote: >> Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: >> >> Reimplement checking algo without using debug info > > test/hotspot/jtreg/compiler/onSpinWait/TestOnSpinWaitAArch64.java line 179: > >> 177: String s = instrReverseIter.previous(); >> 178: instrReverseIter.next(); >> 179: if (instrReverseIter.previous().startsWith(neededAddInst)) { > > Multiple issues here: > - Confusing: what's the use of `s`, did you mean to use it for `startsWith`? 
> - Indenting is off Overall, I think searching for this multi-instruction stencil gets fairly hairy with iterators. Would a more straight-forward looping work? int found = 0; for (int c = 0; c < instrs.size() - 2; c++) { if (instrs.get(c).startsWith(spinWaitInst) && instrs.get(c+1).startsWith(neededLdpInst) && instrs.get(c+2).startsWith(neededAddInst)) { found++; } } ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2190431170 From shade at openjdk.org Mon Jul 7 15:47:44 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 7 Jul 2025 15:47:44 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v2] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 08:49:45 GMT, Andrew Haley wrote: >>> OK, are you able to bisect which change? This fix to only do debug VM needs to be correctly linked to the actual cause, IMO. >> >> >> >>> > It looks like `XX:+PrintAssembly` prints out debug info in release builds but `XX:CompileCommand=print` does not. I am switching back to `XX:+PrintAssembly`. >>> >>> That's not great. What info do you need, exactly? >> >> >> # {method} {0x0000ffff50400378} 'test' '()V' in 'compiler/onSpinWait/TestOnSpinWaitAArch64$Launcher' >> # [sp+0x20] (sp of caller) >> 0x0000ffff985731c0: ff83 00d1 | fd7b 01a9 | 2803 0018 | 8923 40b9 | 1f01 09eb >> >> 0x0000ffff985731d4: ;*synchronization entry >> ; - compiler.onSpinWait.TestOnSpinWaitAArch64$Launcher::test at -1 (line 224) >> 0x0000ffff985731d4: 2102 0054 | 1f20 03d5 | 1f20 03d5 | 1f20 03d5 | 1f20 03d5 | 1f20 03d5 | 1f20 03d5 >> >> 0x0000ffff985731f0: ;*invokestatic onSpinWait {reexecute=0 rethrow=0 return_oop=0} >> ; - compiler.onSpinWait.TestOnSpinWaitAArch64$Launcher::test at 0 (line 224) >> 0x0000ffff985731f0: 1f20 03d5 | fd7b 41a9 | ff83 0091 >> >> 0x0000ffff985731fc: ; {poll_return} >> 0x0000ffff985731fc: 8817 40f9 | ff63 28eb | 4800 0054 | c003 5fd6 >> >> 0x0000ffff9857320c: ; {internal_word} >> 0x0000ffff9857320c: 88ff ff10 | 88a3 02f9 >> >> 0x0000ffff98573214: ; {runtime_call SafepointBlob} >> 0x0000ffff98573214: 5bc3 fe17 >> >> 0x0000ffff98573218: ; {runtime_call Stub::method_entry_barrier} >> 0x0000ffff98573218: 0850 96d2 | 480a b3f2 | e8ff dff2 | 0001 3fd6 | ecff ff17 >> >> >> The test searches for `- compiler.onSpinWait.TestOnSpinWaitAArch64$Launcher::test at 0` and `invokestatic onSpinWait`. They identify the place where to search instructions. >> >> Assembly from all builds always has `{poll_return}`. I can use it as a search point. > >> ``` >> >> ``` >> >> >> >> >> >> The test searches for `- compiler.onSpinWait.TestOnSpinWaitAArch64$Launcher::test at 0` and `invokestatic onSpinWait`. They identify the place where to search instructions. > > That's not great. C2 is free to move stuff around, so it's not certain this test will keep working. If you just want to make sure that the pattern is used, a block_comment() would be more reliable. Then again, it does not address a central point: the test keeps relying on particular instruction scheduling, which is not reliable. Why do we even need to search for ldp+add anchor? Can we just blindly search for spin-wait-looking instructions? I would expect `block_comment` to be even more reliable, as @theRealAph suggested. 
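To make the block_comment idea concrete, here is a hypothetical sketch of what bracketing the spin-wait sequence could look like in the AArch64 back end. The method name and structure are assumptions for illustration only, not the actual change; block_comment markers show up as comment lines in the PrintAssembly output, which a test can anchor on:

```c++
// Emit markers around the spin-wait instructions so a test can anchor on them
// instead of on how C2 happens to schedule the surrounding ldp/add code.
void MacroAssembler::spin_wait() {
  block_comment("spin_wait {");
  // ... emit the configured isb/yield/nop sequence here ...
  block_comment("} spin_wait");
}
```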
------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3045677596 From cslucas at openjdk.org Mon Jul 7 16:30:39 2025 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Mon, 7 Jul 2025 16:30:39 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 12:11:46 GMT, Christian Hagedorn wrote: >> When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. >> >> This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this flag is disabled. > > src/hotspot/share/opto/escape.cpp line 981: > >> 979: if (!OptimizePtrCompare) { >> 980: return; >> 981: } > > Thanks for working on this! IIUC, having the bailout here will fail to reduce the phi which could be unexpected. Shouldn't we just return `UNKNOWN` from within `ConnectionGraph::optimize_ptr_compare()` when we run without `OptimizePtrCompare`? > > On a separate note, can you also add a regression test? Maybe you can also just add a run with `-XX:-OptimizePtrCompare` - maybe together with `-XX:+VerifyReduceAllocationMerges` for more verification - to `compiler/c2/irTests/scalarReplacement/AllocationMergesTests.java`. > > @JohnTortugo you might also want to have a look at this. Thanks for the ping @chhagedorn and I fully agree with your comment. Actually, that's the correct way to do this. Thank you for fixing this @hgqxjj . ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26125#discussion_r2190566267 From eastigeevich at openjdk.org Mon Jul 7 18:19:54 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Mon, 7 Jul 2025 18:19:54 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v4] In-Reply-To: References: Message-ID: > Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. > > This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. > > Tested on Linux and MacOS with and without hsdis: > - Fastdebug: test passed > - Slowdebug: test passed. > - Release: test passed. 
Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: Implement using block_comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26072/files - new: https://git.openjdk.org/jdk/pull/26072/files/0b3320e6..2a209213 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=02-03 Stats: 60 lines in 2 files changed: 9 ins; 21 del; 30 mod Patch: https://git.openjdk.org/jdk/pull/26072.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26072/head:pull/26072 PR: https://git.openjdk.org/jdk/pull/26072 From eastigeevich at openjdk.org Mon Jul 7 18:19:54 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Mon, 7 Jul 2025 18:19:54 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v2] In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 15:45:16 GMT, Aleksey Shipilev wrote: > I would expect `block_comment` to be even more reliable I have rewritten the test to use `block_comment`. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3046123161 From tschatzl at openjdk.org Mon Jul 7 18:57:41 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Mon, 7 Jul 2025 18:57:41 GMT Subject: RFR: 8350621: Code cache stops scheduling GC In-Reply-To: References: Message-ID: On Tue, 24 Jun 2025 08:52:12 GMT, Thomas Schatzl wrote: > > > > However, the current logic that a young-gc can cancel a full-gc (`_codecache_GC_aggressive` in this case) also seems surprising. > > That's a different issue. Actually most likely this is the issue for Parallel GC; that code is present only in older JDK versions before 25 (however other reasons like the `GCLocker` may also prevent these GCs), i.e. there should be no such issue in JDK 25 for Parallel GC. The situation for Parallel GC is different for earlier versions, i.e. for backporting: it would require the changes for [JDK-8192647](https://bugs.openjdk.org/browse/JDK-8192647) and at least one other fix. There needs to be a cost/benefit analysis as these are rather intrusive changes. @ajacob: > I considered a few different options before making this change: > > 1. Always call Universe::heap()->collect(...) without making any check (the GC impl should handle the situation) > 2. Fix all GCs implementation to ensure _unloading_threshold_gc_requested gets back to false at some point (probably what is supposed to happen today) > 3. Change CollectedHeap::collect to return a bool instead of void to indicate if GC was run or scheduled I had a spin at the (imo correct) fix for 2 - fix G1 `collect()` logic. Here's a diff: https://github.com/openjdk/jdk/compare/master...tschatzl:jdk:submit/8350621-code-cache-mgmt-hang?expand=1 What do you think? Thanks, Thomas ------------- PR Comment: https://git.openjdk.org/jdk/pull/23656#issuecomment-3035131232 From duke at openjdk.org Mon Jul 7 18:57:42 2025 From: duke at openjdk.org (Alexandre Jacob) Date: Mon, 7 Jul 2025 18:57:42 GMT Subject: RFR: 8350621: Code cache stops scheduling GC In-Reply-To: References: Message-ID: On Sun, 16 Feb 2025 18:39:29 GMT, Alexandre Jacob wrote: > The purpose of this PR is to fix a bug where we can end up in a situation where the GC is not scheduled anymore by `CodeCache`.
> > This situation is possible because the `_unloading_threshold_gc_requested` flag is set to `true` when triggering the GC and we expect the GC to call `CodeCache::on_gc_marking_cycle_finish` which in turn will call `CodeCache::update_cold_gc_count`, which will reset the flag `_unloading_threshold_gc_requested` allowing further GC scheduling. > > Unfortunately this can't work properly under certain circumstances. > For example, if using G1GC, calling `G1CollectedHeap::collect` does no give the guarantee that the GC will actually run as it can be already running (see [here](https://github.com/openjdk/jdk/blob/7d11418c820b46926a25907766d16083a4b349de/src/hotspot/share/gc/g1/g1CollectedHeap.cpp#L1763)). > > I have observed this behavior on JVM in version 21 that were migrated recently from java 17. > Those JVMs have some pressure on code cache and quite a large heap in comparison to allocation rate, which means that objects are mostly GC'd by young collections and full GC take a long time to happen. > > I have been able to reproduce this issue with ParallelGC and G1GC, and I imagine that other GC can be impacted as well. > > In order to reproduce this issue, I found a very simple and convenient way: > > > public class CodeCacheMain { > public static void main(String[] args) throws InterruptedException { > while (true) { > Thread.sleep(100); > } > } > } > > > Run this simple app with the following JVM flags: > > > -Xlog:gc*=info,codecache=info -Xmx512m -XX:ReservedCodeCacheSize=2496k -XX:StartAggressiveSweepingAt=15 > > > - 512m for the heap just to clarify the intent that we don't want to be bothered by a full GC > - low `ReservedCodeCacheSize` to put pressure on code cache quickly > - `StartAggressiveSweepingAt` can be set to 20 or 15 for faster bug reproduction > > Itself, the program will hardly get pressure on code cache, but the good news is that it is sufficient to attach a jconsole on it which will: > - allows us to monitor code cache > - indirectly generate activity on the code cache, just what we need to reproduce the bug > > Some logs related to code cache will show up at some point with GC activity: > > > [648.733s][info][codecache ] Triggering aggressive GC due to having only 14.970% free memory > > > And then it will stop and we'll end up with the following message: > > > [672.714s][info][codecache ] Code cache is full - disabling compilation > > > L... Hello, I'm sorry I didn't get back to you sooner on this PR. Indeed I considered the first option (do not try to prevent calls to `Universe::heap()->collect(...)`) but wanted to have something more elaborated instead. @tschatzl I like your proposal of fixing the GC implementation directly, as mentioned in my PR description it was my favorite option but because I found that this bug existed for at least Parallel GC and G1 I wanted to have something in CodeCache directly to ensure we never have an issue related to GC implementation. I had a look at your commit and feel like it is the good direction for G1. 
Thank you for having a look at it ------------- PR Comment: https://git.openjdk.org/jdk/pull/23656#issuecomment-3046216533 From eastigeevich at openjdk.org Mon Jul 7 20:57:55 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Mon, 7 Jul 2025 20:57:55 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v5] In-Reply-To: References: Message-ID: > Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. > > This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. > > Tested on Linux and MacOS with and without hsdis: > - Fastdebug: test passed > - Slowdebug: test passed. > - Release: test passed. Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: Fix whitespace error ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26072/files - new: https://git.openjdk.org/jdk/pull/26072/files/2a209213..e3163c9f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=03-04 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26072.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26072/head:pull/26072 PR: https://git.openjdk.org/jdk/pull/26072 From snatarajan at openjdk.org Mon Jul 7 23:04:51 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Mon, 7 Jul 2025 23:04:51 GMT Subject: RFR: 8342941: IGV: Add new graph dumps for post loop, empty loop removal, and one iteration removal [v3] In-Reply-To: References: Message-ID: <_p5Jj77u1VyyW0eVneXqeNjmngTvSvFi94_FALv6swk=.d4e5aec1-dd73-48ed-8d7f-3080207be763@github.com> > This changeset adds BEFORE/AFTER graph dumps for creating a post loop (`insert_post_loop()`), removing an empty loop (`do_remove_empty_loop()`), and removing a one iteration loop (`do_one_iteration_loop()`). > > Changes: > - Added `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` for dumping graphs before and after `insert_post_loop()`. > - Added `BEFORE_REMOVE_EMPTY_LOOP` and `AFTER_REMOVE_EMPTY_LOOP` for dumping graphs before and after `do_remove_empty_loop()`. > - Added `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` for dumping graphs before and after `do_one_iteration_loop()`. > > Below are sample screenshots (IGV print level 4 ) mainly showing the new phase . > 1. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` > ![image](https://github.com/user-attachments/assets/1661cede-5d70-4e0d-abec-3d091c7675c8) > 2. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` with SuperWordLoopUnrollAnalysis enabled > ![image](https://github.com/user-attachments/assets/6a22e6f0-4e6c-4e9d-8b6b-2bf75fac783d) > 3.` BEFORE_REMOVE_EMPTY_LOOP `and `AFTER_REMOVE_EMPTY_LOOP` > ![image](https://github.com/user-attachments/assets/3281f00b-575e-4604-83dd-831037d8dd47) > 4. `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` > ![image](https://github.com/user-attachments/assets/efddbc9a-64f7-403d-acfe-330d75a00911) > > Question to reviewers: > Are the new compiler phases OK, or should we change anything? 
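As a rough sketch of how such dump points are usually wired in (the enum constant names below are assumed from the phase names listed above, and the exact call sites live in the loop-optimization code, so treat this as an outline only):

```c++
// Bracket the transformation with IGV dump points, mirroring other loop-opts phases
// (argument names illustrative; print level 4 matches the screenshots above).
C->print_method(PHASE_BEFORE_REMOVE_EMPTY_LOOP, 4, cl);
// ... remove the empty loop ...
C->print_method(PHASE_AFTER_REMOVE_EMPTY_LOOP, 4, cl);
```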
> > Testing: > GitHub Actions > tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. > Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) Saranya Natarajan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: - fix 2 of review - Merge master - Addressing review comments - Initial Fix ------------- Changes: https://git.openjdk.org/jdk/pull/25756/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=25756&range=02 Stats: 19 lines in 3 files changed: 19 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/25756.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25756/head:pull/25756 PR: https://git.openjdk.org/jdk/pull/25756 From kvn at openjdk.org Mon Jul 7 23:31:42 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 7 Jul 2025 23:31:42 GMT Subject: RFR: 8361397: Rework CompileLog list synchronization [v2] In-Reply-To: References: <12Yp6QmpXqG-1UXTS8VveJ4yDNnDEGFV2q3_vRc3lF0=.4ccf05e2-9249-4b55-b48f-4f7fc17bef65@github.com> Message-ID: On Mon, 7 Jul 2025 09:07:32 GMT, Aleksey Shipilev wrote: >> I want to remove `CompileTaskAlloc_lock` completely with [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473), and for that we need to fix a stray use of that lock in CompileLog list linkage. We can rewrite that part to atomics. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `compiler` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' into JDK-8361397-compilelog-list > - Fix Okay ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26127#pullrequestreview-2995538169 From kvn at openjdk.org Mon Jul 7 23:37:40 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 7 Jul 2025 23:37:40 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 13:33:23 GMT, Aleksey Shipilev wrote: >> We routinely CTW 3rd party JARs to make sure our compilers work. By the nature of the JARs, they have dependencies on other JARs, and CTW runner frequently warns out with NCDFE. It does so very crudely, missing opportunities to compile the methods that _do not_ trigger NCDFEs. CTW should be made more tolerant to this. I think the normal "modules" CTW runs into the similar problem, but on a lesser scale, as we do not have a very hairy dependency graph within JDK. >> >> The CTW logs are also fairly noisy with full exception traces when NCDFE is semi-expected. This PR does _not_ print exception stack traces in these cases, only "NOTE"-s about it. This makes the log fairly clean and more understandable. >> >> Motivational scope improvement compiling a sample 3rd party JAR (cassandra-2.1.4.0.jar): >> >> >> Before: Done (2487 classes, 9866 methods, 24584 ms) >> After: Done (2487 classes, 10074 methods, 24150 ms) ; +2% more methods >> >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Just use printf directly Good. ------------- Marked as reviewed by kvn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26090#pullrequestreview-2995553854 From kvn at openjdk.org Mon Jul 7 23:37:41 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 7 Jul 2025 23:37:41 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v3] In-Reply-To: References: Message-ID: On Fri, 4 Jul 2025 05:58:51 GMT, Aleksey Shipilev wrote: >> test/hotspot/jtreg/testlibrary/ctw/src/sun/hotspot/tools/ctw/Compiler.java line 89: >> >>> 87: UNSAFE.ensureClassInitialized(aClass); >>> 88: } catch (NoClassDefFoundError e) { >>> 89: CompileTheWorld.OUT.printf("[%d]\t%s\tNOTE unable to init class : %s%n", >> >> Do you mean `\n` here and in all other outputs? `%n` needs local variable to store size of output. > > I meant `%n` :) > > You are probably thinking about C printf? In Java [formatters](https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html), `%n` is the "platform-specific line separator". It is more compatible than just `\n`, which runs into platform-specific `CR` vs `LF` vs `CRLF` line separator mess. > > See: > > > jshell> System.out.printf("Hello\nthere,\nVladimir!\n") > Hello > there, > Vladimir! > $6 ==> java.io.PrintStream at 34c45dca > > jshell> System.out.printf("Hello%nthere,%nVladimir!%n") > Hello > there, > Vladimir! > $7 ==> java.io.PrintStream at 34c45dca Now you know that I am not expert Java programmer :( ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26090#discussion_r2191193030 From kvn at openjdk.org Mon Jul 7 23:52:48 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 7 Jul 2025 23:52:48 GMT Subject: RFR: 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() Message-ID: `CodeBlob::relocation_size()` is calculated as `(_mutable_data + _relocation_size - _mutable_data)`. `CodeBlob::relocation_size()` is called during AOT code loading before we allocate space for mutable data (the size is used to find how big space should be allocated). The expression at that point is `(NULL + _relocation_size - NULL)` which returns correct result. But we should just return `_relocation_size` which is recorded anyway in AOT data. Added missing `_mutable_data = blob_end();` initialization when we restore AOT code blob. Fixed embarrassing typo in asserts. Tested: tier1-6,8,10,xcomp,stress ------------- Commit messages: - 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() Changes: https://git.openjdk.org/jdk/pull/26175/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26175&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8360942 Stats: 7 lines in 2 files changed: 4 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/26175.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26175/head:pull/26175 PR: https://git.openjdk.org/jdk/pull/26175 From kvn at openjdk.org Mon Jul 7 23:55:38 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 7 Jul 2025 23:55:38 GMT Subject: RFR: 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 23:48:03 GMT, Vladimir Kozlov wrote: > `CodeBlob::relocation_size()` is calculated as `(_mutable_data + _relocation_size - _mutable_data)`. `CodeBlob::relocation_size()` is called during AOT code loading before we allocate space for mutable data (the size is used to find how big space should be allocated). 
The expression at that point is `(NULL + _relocation_size - NULL)` which returns correct result. But we should just return `_relocation_size` which is recorded anyway in AOT data. > > Added missing `_mutable_data = blob_end();` initialization when we restore AOT code blob. > > Fixed embarrassing typo in asserts. > > Tested: tier1-6,8,10,xcomp,stress @mbaesken, please verify that is passing ubsan testing now. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26175#issuecomment-3046865545 From fyang at openjdk.org Tue Jul 8 00:42:38 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 8 Jul 2025 00:42:38 GMT Subject: RFR: 8361504: RISC-V: Make C1 clone intrinsic platform guard more specific In-Reply-To: <16DUz5Iytmw9i7wAxTx_oU4eeJBCsOI_15qzFP6M4GU=.8a5305a3-2b8d-4b9d-957a-430600bff4b4@github.com> References: <16DUz5Iytmw9i7wAxTx_oU4eeJBCsOI_15qzFP6M4GU=.8a5305a3-2b8d-4b9d-957a-430600bff4b4@github.com> Message-ID: On Mon, 7 Jul 2025 15:03:52 GMT, Feilong Jiang wrote: > Hi all. > Please review this trivial patch, which changes the C1 primitive array clone intrinsic RISCV platform guard into RISCV64. As we only support RISCV64 for now. Thanks. ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26161#pullrequestreview-2995642785 From kvn at openjdk.org Tue Jul 8 00:43:38 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 8 Jul 2025 00:43:38 GMT Subject: RFR: 8360175: C2 crash: assert(edge_from_to(prior_use, n)) failed: before block local scheduling In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 09:47:09 GMT, Manuel H?ssig wrote: > The triggered assert is part of the schedule verification code that runs just before machine code is emitted. The debug output showed that a `leaPCompressedOopOffset` node was causing the assert, which suggested the peephole optimization introduced in #25471 as the cause. The failure proved quite difficult to reproduce. It failed more often on Windows and required `-XX:+UseKNLSetting` (forces code generation for Intel's Knights Landing platform), which forces `-XX:+OptoScheduling`. > > The root-cause is a subtle bug in the rewiring of the base edge of `leaP*` nodes in the `remove_redundant_lea` peephole. When the peephole removed a `decodeHeapOop_not_null` including a spill, it did not set the base edge of the `leaP*` node to the same node as the address edge, which is the intent of the peephole, but to the parent node of the spill. That is not catastrophic in most cases, but might reference another register slot, which causes this assert. Concretely, we see the following graph > > MemToRegSpillCopy > | | > | MemToRegSpillCopy > | | > DefiniinoSpillCopy | > | | > | decodeHeapOop_not_null > | | > leaPCompressedHeapOop > > gets rewired to > > MemToRegSpillCopy > | | > DefinitionSpillCopy | > | | > leaPCompressedHeapOop > > instead of > > MemToRegSpillCopy > | > DefinitionSpillCopy > / \ > leaPCompressedHeapOop > > > This PR fixes this by always setting the base edge of the `leaP*` node to the same node as the address edge. Unfortunately, I was not able to construct a regression test because of the difficulty of reproducing the bug. > > # Testing > > - [ ] Github Actions > - [x] tier1,tier2 plus internal testing on all Oracle supported platforms > - [x] tier3,tier4,tier5 plus internal testing on Linux and Windows x64 > - [ ] Runthese8H on `windows-x64-debug` (test that reliably produced the failure addressed in this PR) Seems fine. ------------- Marked as reviewed by kvn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26157#pullrequestreview-2995643836 From gcao at openjdk.org Tue Jul 8 01:46:37 2025 From: gcao at openjdk.org (Gui Cao) Date: Tue, 8 Jul 2025 01:46:37 GMT Subject: RFR: 8361504: RISC-V: Make C1 clone intrinsic platform guard more specific In-Reply-To: <16DUz5Iytmw9i7wAxTx_oU4eeJBCsOI_15qzFP6M4GU=.8a5305a3-2b8d-4b9d-957a-430600bff4b4@github.com> References: <16DUz5Iytmw9i7wAxTx_oU4eeJBCsOI_15qzFP6M4GU=.8a5305a3-2b8d-4b9d-957a-430600bff4b4@github.com> Message-ID: On Mon, 7 Jul 2025 15:03:52 GMT, Feilong Jiang wrote: > Hi all. > Please review this trivial patch, which changes the C1 primitive array clone intrinsic RISCV platform guard into RISCV64. As we only support RISCV64 for now. Thanks, Looks good to me. ------------- Marked as reviewed by gcao (Author). PR Review: https://git.openjdk.org/jdk/pull/26161#pullrequestreview-2995725907 From xgong at openjdk.org Tue Jul 8 01:55:41 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 8 Jul 2025 01:55:41 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> Message-ID: On Mon, 7 Jul 2025 06:59:20 GMT, Xiaohong Gong wrote: >> Have you measured the performance of this micro-benchmark on NEON machine? >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java#L251-L256 >> >> We added an limitation only for `int` before: >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L131-L134 >> >> Perhaps we also need to impose a similar limitation on `short` if the same regression occurs. > >> Have you measured the performance of this micro-benchmark on NEON machine? >> >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java#L251-L256 >> >> We added an limitation only for `int` before: >> >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L131-L134 >> >> Perhaps we also need to impose a similar limitation on `short` if the same regression occurs. > > Good catch, and thanks so much for your input @fg1417 ! I will test the performance and disable auto-vectorization for double to short casting if the performance has regression. > >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 > > Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. > Hi @XiaohongGong, is there any way we can implement 2HF -> 2S and 2S -> 2HF in these match rules ? 
> > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4697 > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4679 > > The `fcvtn` and `fcvtl` instructions do not support these arrangements. I was wondering if there is any other way we can implement these by any chance? Do you mean `2HF -> 2F` and `2F -> 2HF` ? Yes, it does not support the 32-bit arrangements. Vector conversion is a kind of lanewise vector operation. For such cases, we usually use the same arrangements with 64-bit vector size for 32-bit ones. That means we can reuse the `T4H` and `T4S` to implement it. Hence, current match rules can cover the conversions between `2HF` and `2F`. Consider there is no such conversion cases in Vector API, I didn't change the comment in the match rules. I think this may benefit auto-vectorization. Currently, do we have cases that can match these rules with SLP? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3047091009 From xgong at openjdk.org Tue Jul 8 01:58:45 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 8 Jul 2025 01:58:45 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> Message-ID: <0vdCJFYxCI6hFnTL6rm3oKQcPuuIR2EbuyAOa0muqcw=.d5c249cb-9bf0-415d-ab22-de7387d8d8d1@github.com> On Mon, 7 Jul 2025 13:23:15 GMT, Fei Gao wrote: > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 > > > > > > Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. > > Hi @XiaohongGong thanks for asking. In many auto-vectorization cases involving `char`, the vector elements are represented using `T_SHORT` as the `BasicType`, rather than `T_CHAR`. > > This is because, in Java, operands of subword types are always promoted to `int` before any arithmetic operation. As a result, when handling a node like `ConvD2I`, we don?t initially know its actual subword type. Later, the SuperWord phase propagates a narrowed integer type backward to help determine the correct subword type. See: > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2551-L2558 > > Since SuperWord assigns `T_SHORT` to `StoreC` early on > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2646-L2650 > > > the entire propagation chain tends to use `T_SHORT` as well. > This applies to most operations, with the exception of a few like `RShiftI`, `Abs`, and `ReverseBytesI`, which are handled separately. > > So your change already benefits many char-related vectorization cases like `convertDoubleToChar` above. That?s why we can safely relax the IR condition mentioned earlier. Thanks for your input! It's really helpful to me. 
Does this mean it always use `T_SHORT` for char vectors in SLP? If so, it's safe that we do not need to consider `T_CHAR` in vector IRs in backend? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3047094924 From kvn at openjdk.org Tue Jul 8 02:06:39 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 8 Jul 2025 02:06:39 GMT Subject: RFR: 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache [v2] In-Reply-To: References: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> Message-ID: On Fri, 4 Jul 2025 14:13:27 GMT, guanqiang han wrote: >> The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is confusing and does not reflect the current implementation. >> >> This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. > > guanqiang han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - correct a compile error > - Merge remote-tracking branch 'upstream/master' into 8344548 > - 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache > > The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is > confusing and does not reflect the current implementation. > > This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. src/hotspot/share/runtime/globals.hpp line 1573: > 1571: \ > 1572: product(uintx, StartAggressiveSweepingAt, 10, \ > 1573: "Start aggressive sweeping if X[%] of the total code cache is free.")\ I suggest : "Start aggressive sweeping if less than X[%] of the total code cache is free.") ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26114#discussion_r2191326749 From dzhang at openjdk.org Tue Jul 8 02:35:28 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Tue, 8 Jul 2025 02:35:28 GMT Subject: RFR: 8361532: RISC-V: Several vector tests fail after JDK-8354383 Message-ID: <-DfrHsd_D9lqbcRmNvF67dHOBaoxGQWUwTeUIa1IvfA=.d3f8182e-c581-4b88-89dd-f5f0781e7b67@github.com> Hi all, Please take a look and review this PR, thanks! After [JDK-8354383](https://bugs.openjdk.org/browse/JDK-8354383) , several test cases fail when fastdebug with RVV. The reason for the error is that riscv lacks CastVV with dst as the mask register. This PR adds the corresponding matching rules. 
### Testing qemu-system with RVV: * [x] Run jdk_vector (fastdebug) * [x] Run compiler/vectorapi (fastdebug) ------------- Commit messages: - 8361532: RISC-V: Several vector tests fail after JDK-8354383 Changes: https://git.openjdk.org/jdk/pull/26178/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26178&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361532 Stats: 11 lines in 1 file changed: 11 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26178.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26178/head:pull/26178 PR: https://git.openjdk.org/jdk/pull/26178 From duke at openjdk.org Tue Jul 8 02:37:39 2025 From: duke at openjdk.org (guanqiang han) Date: Tue, 8 Jul 2025 02:37:39 GMT Subject: RFR: 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache [v2] In-Reply-To: References: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> Message-ID: On Tue, 8 Jul 2025 02:04:01 GMT, Vladimir Kozlov wrote: >> guanqiang han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - correct a compile error >> - Merge remote-tracking branch 'upstream/master' into 8344548 >> - 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache >> >> The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is >> confusing and does not reflect the current implementation. >> >> This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. > > src/hotspot/share/runtime/globals.hpp line 1573: > >> 1571: \ >> 1572: product(uintx, StartAggressiveSweepingAt, 10, \ >> 1573: "Start aggressive sweeping if X[%] of the total code cache is free.")\ > > I suggest : "Start aggressive sweeping if less than X[%] of the total code cache is free.") Thank you very much for your valuable feedback. Your description is much clearer and more precise. I will update my PR accordingly. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26114#discussion_r2191354527 From fyang at openjdk.org Tue Jul 8 02:51:37 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 8 Jul 2025 02:51:37 GMT Subject: RFR: 8361532: RISC-V: Several vector tests fail after JDK-8354383 In-Reply-To: <-DfrHsd_D9lqbcRmNvF67dHOBaoxGQWUwTeUIa1IvfA=.d3f8182e-c581-4b88-89dd-f5f0781e7b67@github.com> References: <-DfrHsd_D9lqbcRmNvF67dHOBaoxGQWUwTeUIa1IvfA=.d3f8182e-c581-4b88-89dd-f5f0781e7b67@github.com> Message-ID: On Tue, 8 Jul 2025 02:30:27 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > After [JDK-8354383](https://bugs.openjdk.org/browse/JDK-8354383) , several test cases fail when fastdebug with RVV. > The reason for the error is that riscv lacks CastVV with dst as the mask register. > This PR adds the corresponding matching rules. > > ### Testing > qemu-system with RVV: > * [x] Run jdk_vector (fastdebug) > * [x] Run compiler/vectorapi (fastdebug) Looks fine. Thanks. ------------- Marked as reviewed by fyang (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26178#pullrequestreview-2995814548 From fjiang at openjdk.org Tue Jul 8 03:16:38 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Tue, 8 Jul 2025 03:16:38 GMT Subject: RFR: 8361532: RISC-V: Several vector tests fail after JDK-8354383 In-Reply-To: <-DfrHsd_D9lqbcRmNvF67dHOBaoxGQWUwTeUIa1IvfA=.d3f8182e-c581-4b88-89dd-f5f0781e7b67@github.com> References: <-DfrHsd_D9lqbcRmNvF67dHOBaoxGQWUwTeUIa1IvfA=.d3f8182e-c581-4b88-89dd-f5f0781e7b67@github.com> Message-ID: <-9M65xCWu1KZZXcZDPhyRy3XkckmplWjlFuOTXmVHC8=.a4370b94-3ba5-499f-9eaf-f9ca66c18266@github.com> On Tue, 8 Jul 2025 02:30:27 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > After [JDK-8354383](https://bugs.openjdk.org/browse/JDK-8354383) , several test cases fail when fastdebug with RVV. > The reason for the error is that riscv lacks CastVV with dst as the mask register. > This PR adds the corresponding matching rules. > > ### Testing > qemu-system with RVV: > * [x] Run jdk_vector (fastdebug) > * [x] Run compiler/vectorapi (fastdebug) Looks good, thanks! ------------- Marked as reviewed by fjiang (Committer). PR Review: https://git.openjdk.org/jdk/pull/26178#pullrequestreview-2995842397 From duke at openjdk.org Tue Jul 8 03:26:18 2025 From: duke at openjdk.org (guanqiang han) Date: Tue, 8 Jul 2025 03:26:18 GMT Subject: RFR: 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache [v3] In-Reply-To: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> References: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> Message-ID: > The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is confusing and does not reflect the current implementation. > > This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. guanqiang han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: - make description more precise - Merge remote-tracking branch 'upstream/master' into 8344548 - correct a compile error - Merge remote-tracking branch 'upstream/master' into 8344548 - 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is confusing and does not reflect the current implementation. This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. 
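For reference, with the suggested wording applied, the definition in src/hotspot/share/runtime/globals.hpp would read roughly as follows (the continuation backslashes of the flag macro table are omitted here):

```c++
product(uintx, StartAggressiveSweepingAt, 10,
        "Start aggressive sweeping if less than X[%] of the total "
        "code cache is free.")
```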
------------- Changes: - all: https://git.openjdk.org/jdk/pull/26114/files - new: https://git.openjdk.org/jdk/pull/26114/files/cb1b2c60..6b82418e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26114&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26114&range=01-02 Stats: 2179 lines in 96 files changed: 973 ins; 663 del; 543 mod Patch: https://git.openjdk.org/jdk/pull/26114.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26114/head:pull/26114 PR: https://git.openjdk.org/jdk/pull/26114 From duke at openjdk.org Tue Jul 8 04:03:38 2025 From: duke at openjdk.org (guanqiang han) Date: Tue, 8 Jul 2025 04:03:38 GMT Subject: RFR: 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache [v2] In-Reply-To: References: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> Message-ID: On Tue, 8 Jul 2025 02:35:05 GMT, guanqiang han wrote: >> src/hotspot/share/runtime/globals.hpp line 1573: >> >>> 1571: \ >>> 1572: product(uintx, StartAggressiveSweepingAt, 10, \ >>> 1573: "Start aggressive sweeping if X[%] of the total code cache is free.")\ >> >> I suggest : "Start aggressive sweeping if less than X[%] of the total code cache is free.") > > Thank you very much for your valuable feedback. Your description is much clearer and more precise. I will update my PR accordingly. I've updated the PR based on your feedback. Please kindly take another look when convenient. Thanks a lot. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26114#discussion_r2191425052 From duke at openjdk.org Tue Jul 8 06:00:38 2025 From: duke at openjdk.org (erifan) Date: Tue, 8 Jul 2025 06:00:38 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v10] In-Reply-To: References: Message-ID: > This patch optimizes the following patterns: > For integer types: > > (XorV (VectorMaskCmp src1 src2 cond) (Replicate -1)) > => (VectorMaskCmp src1 src2 ncond) > (XorVMask (VectorMaskCmp src1 src2 cond) (MaskAll m1)) > => (VectorMaskCmp src1 src2 ncond) > > cond can be eq, ne, le, ge, lt, gt, ule, uge, ult and ugt, ncond is the negative comparison of cond. > > For float and double types: > > (XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1)) > => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) > (XorVMask (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (MaskAll m1)) > => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) > > cond can be eq or ne.
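A rough sketch of the rewrite in C2 terms follows; the helper name and BoolTest::negate_mask come from the commit list below, while the body here is only an outline of the idea, not the actual implementation:

```c++
// (XorV (VectorMaskCmp a b cond) (Replicate -1)) ==> (VectorMaskCmp a b ncond)
Node* XorVNode::Ideal_XorV_VectorMaskCmp(PhaseGVN* phase) {
  Node* in1 = in(1);
  Node* in2 = in(2);
  if (in1->Opcode() == Op_VectorMaskCmp && in1->outcnt() == 1 &&
      VectorNode::is_all_ones_vector(in2)) {
    // xor with an all-ones vector negates the mask, so flip the comparison
    // predicate via BoolTest::negate_mask(cond) and drop the xor node.
  }
  return nullptr;
}
```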
> > Benchmarks on Nvidia Grace machine with 128-bit SVE2: With option `-XX:UseSVE=2`: > > Benchmark Unit Before Score Error After Score Error Uplift > testCompareEQMaskNotByte ops/s 7912127.225 2677.289518 10266136.26 8955.008548 1.29 > testCompareEQMaskNotDouble ops/s 884737.6799 446.963779 1179760.772 448.031844 1.33 > testCompareEQMaskNotFloat ops/s 1765045.787 682.332214 2359520.803 896.305743 1.33 > testCompareEQMaskNotInt ops/s 1787221.411 977.743935 2353952.519 960.069976 1.31 > testCompareEQMaskNotLong ops/s 895297.1974 673.44808 1178449.02 323.804205 1.31 > testCompareEQMaskNotShort ops/s 3339987.002 3415.2226 4712761.965 2110.862053 1.41 > testCompareGEMaskNotByte ops/s 7907615.16 4094.243652 10251646.9 9486.699831 1.29 > testCompareGEMaskNotInt ops/s 1683738.958 4233.813092 2352855.205 1251.952546 1.39 > testCompareGEMaskNotLong ops/s 854496.1561 8594.598885 1177811.493 521.1229 1.37 > testCompareGEMaskNotShort ops/s 3341860.309 1578.975338 4714008.434 1681.10365 1.41 > testCompareGTMaskNotByte ops/s 7910823.674 2993.367032 10245063.58 9774.75138 1.29 > testCompareGTMaskNotInt ops/s 1673393.928 3153.099431 2353654.521 1190.848583 1.4 > testCompareGTMaskNotLong ops/s 849405.9159 2432.858159 1177952.041 359.96413 1.38 > testCompareGTMaskNotShort ops/s 3339509.141 3339.976585 4711442.496 2673.364893 1.41 > testCompareLEMaskNotByte ops/s 7911340.004 3114.69191 10231626.5 27134.20035 1.29 > testCompareLEMaskNotInt ops/s 1675812.113 1340.969885 2353255.341 1452.4522 1.4 > testCompareLEMaskNotLong ops/s 848862.8036 6564.841731 1177763.623 539.290106 1.38 > testCompareLEMaskNotShort ops/s 3324951.54 2380.29473 4712116.251 1544.559684 1.41 > testCompareLTMaskNotByte ops/s 7910390.844 2630.861436 10239567.69 6487.441672 1.29 > testCompareLTMaskNotInt ops/s 1672180.09 995.238142 2353757.863 853.774734 1.4 > testCompareLTMaskNotLong ops/s 856502.26... erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 17 additional commits since the last revision: - Align indentation - Merge branch 'master' into JDK-8354242 - Address more comments ATT. - Merge branch 'master' into JDK-8354242 - Support negating unsigned comparison for BoolTest::mask Added a static method `negate_mask(mask btm)` into BoolTest class to negate both signed and unsigned comparison. - Addressed some review comments - Merge branch 'master' into JDK-8354242 - Refactor the JTReg tests for compare.xor(maskAll) Also made a bit change to support pattern `VectorMask.fromLong()`. - Merge branch 'master' into JDK-8354242 - Refactor code Add a new function XorVNode::Ideal_XorV_VectorMaskCmp to do this optimization, making the code more modular. - ... 
and 7 more: https://git.openjdk.org/jdk/compare/c2a2adc8...db78dc43 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24674/files - new: https://git.openjdk.org/jdk/pull/24674/files/5ebdc572..db78dc43 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24674&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24674&range=08-09 Stats: 9269 lines in 462 files changed: 4528 ins; 2873 del; 1868 mod Patch: https://git.openjdk.org/jdk/pull/24674.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24674/head:pull/24674 PR: https://git.openjdk.org/jdk/pull/24674 From duke at openjdk.org Tue Jul 8 06:00:39 2025 From: duke at openjdk.org (erifan) Date: Tue, 8 Jul 2025 06:00:39 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v9] In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 09:32:42 GMT, erifan wrote: >> src/hotspot/share/opto/vectornode.cpp line 2243: >> >>> 2241: !VectorNode::is_all_ones_vector(in2)) { >>> 2242: return nullptr; >>> 2243: } >> >> Suggestion: >> >> if (in1->Opcode() != Op_VectorMaskCmp || >> in1->outcnt() != 1 || >> !(in1->as_VectorMaskCmp())->predicate_can_be_negated() || >> !VectorNode::is_all_ones_vector(in2)) { >> return nullptr; >> } >> >> Indentation for clarity. >> >> Currently, you exiting if one of these is the case: >> 1. Not `MaskCmp` >> 2. More than one use >> 3. predicate cannot be negated AND the vector is all ones. Can you explain this condition? Do you have tests for cases: >> - predicate negatable and vector not all ones >> - predircate not negatable and vector not all ones >> - predicate negatable and vector all ones >> - predicate not negatable and vectors all ones >> >> Why do you guard against `VectorNode::is_all_ones_vector(in2)` at all? >> >> The condition for 3. is easy to get wrong, so good testing is important here :) > > The current testing status for the conditions you listed: >> 1. Not MaskCmp. > > **No test for it, tested locally**, Because I think this condition is too straightforward. > >> 2. More than one use. > > **Tested**, see `VectorMaskCompareNotTest.java line 1118`. > >> predicate negatable and vector not all ones. > > **Tested**, see `VectorMaskCompareNotTest.java line 1126`. > >> predicate not negatable and vector not all ones. > > **No test for it**, because we have tests for `predicate not negatable` or `vector not all ones`. If either is `false`, return nullptr. > >> predicate negatable and vector all ones. > > **A lot of tests for it**. For example `VectorMaskCompareNotTest.java line 1014`. > >> predicate not negatable and vectors all ones. > > **Tested**, see `VectorMaskCompareNotTest.java line 1222`. > Indentation for clarity. Done. I think we have enough negative tests. 
Please take a look at this PR, thanks~ ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2191550171 From chagedorn at openjdk.org Tue Jul 8 06:15:42 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 8 Jul 2025 06:15:42 GMT Subject: RFR: 8342941: IGV: Add new graph dumps for post loop, empty loop removal, and one iteration removal [v3] In-Reply-To: <_p5Jj77u1VyyW0eVneXqeNjmngTvSvFi94_FALv6swk=.d4e5aec1-dd73-48ed-8d7f-3080207be763@github.com> References: <_p5Jj77u1VyyW0eVneXqeNjmngTvSvFi94_FALv6swk=.d4e5aec1-dd73-48ed-8d7f-3080207be763@github.com> Message-ID: On Mon, 7 Jul 2025 23:04:51 GMT, Saranya Natarajan wrote: >> This changeset adds BEFORE/AFTER graph dumps for creating a post loop (`insert_post_loop()`), removing an empty loop (`do_remove_empty_loop()`), and removing a one iteration loop (`do_one_iteration_loop()`). >> >> Changes: >> - Added `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` for dumping graphs before and after `insert_post_loop()`. >> - Added `BEFORE_REMOVE_EMPTY_LOOP` and `AFTER_REMOVE_EMPTY_LOOP` for dumping graphs before and after `do_remove_empty_loop()`. >> - Added `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` for dumping graphs before and after `do_one_iteration_loop()`. >> >> Below are sample screenshots (IGV print level 4 ) mainly showing the new phase . >> 1. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` >> ![image](https://github.com/user-attachments/assets/1661cede-5d70-4e0d-abec-3d091c7675c8) >> 2. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` with SuperWordLoopUnrollAnalysis enabled >> ![image](https://github.com/user-attachments/assets/6a22e6f0-4e6c-4e9d-8b6b-2bf75fac783d) >> 3.` BEFORE_REMOVE_EMPTY_LOOP `and `AFTER_REMOVE_EMPTY_LOOP` >> ![image](https://github.com/user-attachments/assets/3281f00b-575e-4604-83dd-831037d8dd47) >> 4. `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` >> ![image](https://github.com/user-attachments/assets/efddbc9a-64f7-403d-acfe-330d75a00911) >> >> Question to reviewers: >> Are the new compiler phases OK, or should we change anything? >> >> Testing: >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. >> Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) > > Saranya Natarajan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: > > - fix 2 of review > - Merge master > - Addressing review comments > - Initial Fix Update looks good, thanks! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25756#pullrequestreview-2996119957 From kvn at openjdk.org Tue Jul 8 07:20:40 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 8 Jul 2025 07:20:40 GMT Subject: RFR: 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache [v3] In-Reply-To: References: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> Message-ID: <3cbbO8fFBbiVaNQXLtDWg29AFsf-_CqFMSJulIz4QUw=.3a4318bd-ae20-4082-abbd-7828450c50d3@github.com> On Tue, 8 Jul 2025 03:26:18 GMT, guanqiang han wrote: >> The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. 
The previous description referenced segmented vs non-segmented code cache, which is confusing and does not reflect the current implementation. >> >> This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. > > guanqiang han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - make description more precise > - Merge remote-tracking branch 'upstream/master' into 8344548 > - correct a compile error > - Merge remote-tracking branch 'upstream/master' into 8344548 > - 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache > > The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is > confusing and does not reflect the current implementation. > > This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26114#pullrequestreview-2996300301 From duke at openjdk.org Tue Jul 8 07:25:44 2025 From: duke at openjdk.org (duke) Date: Tue, 8 Jul 2025 07:25:44 GMT Subject: RFR: 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache [v3] In-Reply-To: References: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> Message-ID: On Tue, 8 Jul 2025 03:26:18 GMT, guanqiang han wrote: >> The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is confusing and does not reflect the current implementation. >> >> This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. > > guanqiang han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - make description more precise > - Merge remote-tracking branch 'upstream/master' into 8344548 > - correct a compile error > - Merge remote-tracking branch 'upstream/master' into 8344548 > - 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache > > The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is > confusing and does not reflect the current implementation. > > This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. @hgqxjj Your change (at version 6b82418e4f4cdd40dd764d9657c27c6d08e5752e) is now ready to be sponsored by a Committer. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26114#issuecomment-3047687262 From chagedorn at openjdk.org Tue Jul 8 07:29:41 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 8 Jul 2025 07:29:41 GMT Subject: RFR: 8357473: Compilation spike leaves many CompileTasks in free list [v5] In-Reply-To: References: Message-ID: On Fri, 4 Jul 2025 09:08:19 GMT, Aleksey Shipilev wrote: >> See bug for more discussion. >> >> This PR implements the "all the way" solution by removing the free list completely. It complements https://github.com/openjdk/jdk/pull/25364, and can go either first, or second. We will remerge the other one once either integrates. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `compiler` >> - [x] Linux AArch64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits: > > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Also free the lock! > - Comments and indenting > - Basic deletion Nice catch! Small nit, otherwise, looks good to me, too. src/hotspot/share/compiler/compileTask.cpp line 84: > 82: > 83: CompileTask::~CompileTask() { > 84: if ((_method_holder != nullptr && JNIHandles::is_weak_global_handle(_method_holder))) { While moving the code, you can probably remove one pair of parentheses here: Suggestion: if (_method_holder != nullptr && JNIHandles::is_weak_global_handle(_method_holder)) { ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25409#pullrequestreview-2996298093 PR Review Comment: https://git.openjdk.org/jdk/pull/25409#discussion_r2191683566 From hgreule at openjdk.org Tue Jul 8 07:40:02 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Tue, 8 Jul 2025 07:40:02 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v3] In-Reply-To: References: Message-ID: > Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. > > Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. > > I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. > > Please review. Thanks. 
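To make the folding rule concrete, a small Java sketch of the semantics described above: truncate to the subword type first, then swap its bytes. The class name and the sample constant are made up for illustration; only `Short.reverseBytes` and the narrowing cast are standard Java.

```java
public class ReverseBytesFoldSketch {
    // What a constant fold of a short-typed reverseBytes has to compute:
    // drop the upper 16 bits of the (possibly wider) input, then swap the
    // remaining two bytes. A char-typed input works the same way.
    static short foldShort(int wideInput) {
        short truncated = (short) wideInput;   // ignore the upper bytes
        return Short.reverseBytes(truncated);  // swap the low two bytes
    }

    public static void main(String[] args) {
        // 0x0001_1234 truncates to 0x1234, whose byte swap is 0x3412.
        System.out.println(Integer.toHexString(foldShort(0x0001_1234) & 0xFFFF));
    }
}
```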
Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: re-add package, add methods to Run annotation ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25988/files - new: https://git.openjdk.org/jdk/pull/25988/files/6822cca0..f352726e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25988&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25988&range=01-02 Stats: 9 lines in 2 files changed: 3 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/25988.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25988/head:pull/25988 PR: https://git.openjdk.org/jdk/pull/25988 From hgreule at openjdk.org Tue Jul 8 07:40:02 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Tue, 8 Jul 2025 07:40:02 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v2] In-Reply-To: References: Message-ID: On Fri, 4 Jul 2025 09:34:18 GMT, Manuel H?ssig wrote: >> Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: >> >> remove classfile version > > src/hotspot/share/opto/subnode.cpp line 2031: > >> 2029: case Op_ReverseBytesUS: return TypeInt::make(byteswap(static_cast(con->is_int()->get_con()))); >> 2030: case Op_ReverseBytesI: return TypeInt::make(byteswap(con->is_int()->get_con())); >> 2031: case Op_ReverseBytesL: return TypeLong::make(byteswap(con->is_long()->get_con())); > > Why are you dropping the `checked_cast` here? Were they just an abundance of caution before? This was basically from copy-pasting, but the cast was from jint to jint and jlong to jlong respectively. With the other checked_casts removed, it looks confusing to keep it there I think. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25988#discussion_r2191721589 From chagedorn at openjdk.org Tue Jul 8 07:50:39 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 8 Jul 2025 07:50:39 GMT Subject: RFR: 8361397: Rework CompileLog list synchronization [v2] In-Reply-To: References: <12Yp6QmpXqG-1UXTS8VveJ4yDNnDEGFV2q3_vRc3lF0=.4ccf05e2-9249-4b55-b48f-4f7fc17bef65@github.com> Message-ID: On Mon, 7 Jul 2025 09:07:32 GMT, Aleksey Shipilev wrote: >> I want to remove `CompileTaskAlloc_lock` completely with [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473), and for that we need to fix a stray use of that lock in CompileLog list linkage. We can rewrite that part to atomics. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `compiler` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' into JDK-8361397-compilelog-list > - Fix Looks good to me, too! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26127#pullrequestreview-2996405511 From tschatzl at openjdk.org Tue Jul 8 07:50:45 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 8 Jul 2025 07:50:45 GMT Subject: RFR: 8350621: Code cache stops scheduling GC In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 18:54:55 GMT, Alexandre Jacob wrote: > Hello, I'm sorry I didn't get back to you sooner on this PR. > No worries, it should be rather me to not get to this earlier.... 
> Indeed I considered the first option (do not try to prevent calls to `Universe::heap()->collect(...)`) but wanted to have something more elaborated instead. > > @tschatzl I like your proposal of fixing the GC implementation directly, as mentioned in my PR description it was my favorite option but because I found that this bug existed for at least Parallel GC and G1 I wanted to have something in CodeCache directly to ensure we never have an issue related to GC implementation. I had a look at your commit and feel like it is the good direction for G1. Thank you for having a look at it First, I assume you verified my change ;) How do we proceed from here? Do you want to reuse this PR or should we (I, you?) open a new one for the new suggestion? What do you prefer? I am fine with either option. Thanks, Thomas ------------- PR Comment: https://git.openjdk.org/jdk/pull/23656#issuecomment-3047757426 From roland at openjdk.org Tue Jul 8 08:03:35 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 8 Jul 2025 08:03:35 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v35] In-Reply-To: References: Message-ID: > To optimize a long counted loop and long range checks in a long or int > counted loop, the loop is turned into a loop nest. When the loop has > few iterations, the overhead of having an outer loop whose backedge is > never taken, has a measurable cost. Furthermore, creating the loop > nest usually causes one iteration of the loop to be peeled so > predicates can be set up. If the loop is short running, then it's an > extra iteration that's run with range checks (compared to an int > counted loop with int range checks). > > This change doesn't create a loop nest when: > > 1- it can be determined statically at loop nest creation time that the > loop runs for a short enough number of iterations > > 2- profiling reports that the loop runs for no more than ShortLoopIter > iterations (1000 by default). > > For 2-, a guard is added which is implemented as yet another predicate. > > While this change is in principle simple, I ran into a few > implementation issues: > > - while c2 has a way to compute the number of iterations of an int > counted loop, it doesn't have that for long counted loop. The > existing logic for int counted loops promotes values to long to > avoid overflows. I reworked it so it now works for both long and int > counted loops. > > - I added a new deoptimization reason (Reason_short_running_loop) for > the new predicate. Given the number of iterations is narrowed down > by the predicate, the limit of the loop after transformation is a > cast node that's control dependent on the short running loop > predicate. Because once the counted loop is transformed, it is > likely that range check predicates will be inserted and they will > depend on the limit, the short running loop predicate has to be the > one that's further away from the loop entry. Now it is also possible > that the limit before transformation depends on a predicate > (TestShortRunningLongCountedLoopPredicatesClone is an example), we > can have: new predicates inserted after the transformation that > depend on the casted limit that itself depend on old predicates > added before the transformation. To solve this cicular dependency, > parse and assert predicates are cloned between the old predicates > and the loop head. The cloned short running loop parse predicate is > the one that's used to insert the short running loop predicate. 
> > - In the case of a long counted loop, the loop is transformed into a > regular loop with a new limit and transformed range checks that's > later turned into an in counted loop. The int ... Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/share/opto/c2_globals.hpp Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/21630/files - new: https://git.openjdk.org/jdk/pull/21630/files/fd19ee84..f3ca08de Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=21630&range=34 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=21630&range=33-34 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/21630.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/21630/head:pull/21630 PR: https://git.openjdk.org/jdk/pull/21630 From chagedorn at openjdk.org Tue Jul 8 08:05:42 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 8 Jul 2025 08:05:42 GMT Subject: RFR: 8360175: C2 crash: assert(edge_from_to(prior_use, n)) failed: before block local scheduling In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 09:47:09 GMT, Manuel H?ssig wrote: > The triggered assert is part of the schedule verification code that runs just before machine code is emitted. The debug output showed that a `leaPCompressedOopOffset` node was causing the assert, which suggested the peephole optimization introduced in #25471 as the cause. The failure proved quite difficult to reproduce. It failed more often on Windows and required `-XX:+UseKNLSetting` (forces code generation for Intel's Knights Landing platform), which forces `-XX:+OptoScheduling`. > > The root-cause is a subtle bug in the rewiring of the base edge of `leaP*` nodes in the `remove_redundant_lea` peephole. When the peephole removed a `decodeHeapOop_not_null` including a spill, it did not set the base edge of the `leaP*` node to the same node as the address edge, which is the intent of the peephole, but to the parent node of the spill. That is not catastrophic in most cases, but might reference another register slot, which causes this assert. Concretely, we see the following graph > > MemToRegSpillCopy > | | > | MemToRegSpillCopy > | | > DefiniinoSpillCopy | > | | > | decodeHeapOop_not_null > | | > leaPCompressedHeapOop > > gets rewired to > > MemToRegSpillCopy > | | > DefinitionSpillCopy | > | | > leaPCompressedHeapOop > > instead of > > MemToRegSpillCopy > | > DefinitionSpillCopy > / \ > leaPCompressedHeapOop > > > This PR fixes this by always setting the base edge of the `leaP*` node to the same node as the address edge. Unfortunately, I was not able to construct a regression test because of the difficulty of reproducing the bug. > > # Testing > > - [ ] Github Actions > - [x] tier1,tier2 plus internal testing on all Oracle supported platforms > - [x] tier3,tier4,tier5 plus internal testing on Linux and Windows x64 > - [ ] Runthese8H on `windows-x64-debug` (test that reliably produced the failure addressed in this PR) Marked as reviewed by chagedorn (Reviewer). src/hotspot/cpu/x86/peephole_x86_64.cpp line 349: > 347: Node* dependant_lea = decode->fast_out(i); > 348: if (dependant_lea->is_Mach() && dependant_lea->as_Mach()->ideal_Opcode() == Op_AddP) { > 349: dependant_lea->set_req(AddPNode::Base, lea_derived_oop->in(AddPNode::Address)); The fix looks reasonable to me, too. No worries about the regression test, thanks for trying! A small question: Why don't we use `lea_address`? 
Another thing I've noticed while browsing the code: `ra_` and `new_root` seem to be unused and could be removed (could probably also be squeezed into this PR here instead of creating a new issue just for that). ------------- PR Review: https://git.openjdk.org/jdk/pull/26157#pullrequestreview-2996452308 PR Review Comment: https://git.openjdk.org/jdk/pull/26157#discussion_r2191778511 From hgreule at openjdk.org Tue Jul 8 08:06:40 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Tue, 8 Jul 2025 08:06:40 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v2] In-Reply-To: References: Message-ID: On Fri, 4 Jul 2025 13:12:30 GMT, Manuel H?ssig wrote: >> Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: >> >> remove classfile version > > You forgot to add the new tests to the array of tests in `@Run`: > > stderr: Exception in thread "main" compiler.lib.ir_framework.shared.TestRunException: > > Test Failures (1) > ----------------- > Custom Run Test: @Run: runMethod - @Tests: {testI1,testI2,testI3,testL1,testL2,testL3,testS1,testS2,testS3,testUS1,testUS2,testUS3}: > compiler.lib.ir_framework.shared.TestRunException: There was an error while invoking @Run method public void ReverseBytesConstantsTests.runMethod() > at compiler.lib.ir_framework.test.CustomRunTest.invokeTest(CustomRunTest.java:162) > at compiler.lib.ir_framework.test.AbstractTest.run(AbstractTest.java:100) > at compiler.lib.ir_framework.test.CustomRunTest.run(CustomRunTest.java:89) > at compiler.lib.ir_framework.test.TestVM.runTests(TestVM.java:865) > at compiler.lib.ir_framework.test.TestVM.start(TestVM.java:255) > at compiler.lib.ir_framework.test.TestVM.main(TestVM.java:168) > Caused by: java.lang.reflect.InvocationTargetException > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:119) > at java.base/java.lang.reflect.Method.invoke(Method.java:565) > at compiler.lib.ir_framework.test.CustomRunTest.invokeTest(CustomRunTest.java:159) > ... 5 more > Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -24674 out of bounds for length 128 > at java.base/java.lang.Character.valueOf(Character.java:9284) > at ReverseBytesConstantsTests.assertResultUS(ReverseBytesConstantsTests.java:102) > at ReverseBytesConstantsTests.runMethod(ReverseBytesConstantsTests.java:66) > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > ... 7 more > at compiler.lib.ir_framework.test.TestVM.runTests(TestVM.java:901) > at compiler.lib.ir_framework.test.TestVM.start(TestVM.java:255) > at compiler.lib.ir_framework.test.TestVM.main(TestVM.java:168) Thanks @mhaessig. It seems like the methods don't need to be added there, as other methods were missing too. I updated the list nonetheless. Regarding the exception, I'm not sure what the expectations and guarantees are here. Does Java code have to expect inputs that are illegal in Java but legal in bytecode? I can work around this by casting the result to int explicitly (to compare `Integer` objects instead), but I feel like this is a deeper problem. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/25988#issuecomment-3047806943 From bkilambi at openjdk.org Tue Jul 8 08:16:39 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 8 Jul 2025 08:16:39 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: <0vdCJFYxCI6hFnTL6rm3oKQcPuuIR2EbuyAOa0muqcw=.d5c249cb-9bf0-415d-ab22-de7387d8d8d1@github.com> References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> <0vdCJFYxCI6hFnTL6rm3oKQcPuuIR2EbuyAOa0muqcw=.d5c249cb-9bf0-415d-ab22-de7387d8d8d1@github.com> Message-ID: On Tue, 8 Jul 2025 01:55:55 GMT, Xiaohong Gong wrote: >>> > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 >>> >>> Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. >> >> Hi @XiaohongGong thanks for asking. In many auto-vectorization cases involving `char`, the vector elements are represented using `T_SHORT` as the `BasicType`, rather than `T_CHAR`. >> >> This is because, in Java, operands of subword types are always promoted to `int` before any arithmetic operation. As a result, when handling a node like `ConvD2I`, we don?t initially know its actual subword type. Later, the SuperWord phase propagates a narrowed integer type backward to help determine the correct subword type. See: >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2551-L2558 >> >> Since SuperWord assigns `T_SHORT` to `StoreC` early on https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2646-L2650 >> the entire propagation chain tends to use `T_SHORT` as well. >> >> This applies to most operations, with the exception of a few like `RShiftI`, `Abs`, and `ReverseBytesI`, which are handled separately. >> >> So your change already benefits many char-related vectorization cases like `convertDoubleToChar` above. That?s why we can safely relax the IR condition mentioned earlier. > >> > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 >> > >> > >> > Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. >> >> Hi @XiaohongGong thanks for asking. In many auto-vectorization cases involving `char`, the vector elements are represented using `T_SHORT` as the `BasicType`, rather than `T_CHAR`. >> >> This is because, in Java, operands of subword types are always promoted to `int` before any arithmetic operation. As a result, when handling a node like `ConvD2I`, we don?t initially know its actual subword type. 
Later, the SuperWord phase propagates a narrowed integer type backward to help determine the correct subword type. See: >> >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2551-L2558 >> >> Since SuperWord assigns `T_SHORT` to `StoreC` early on >> >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2646-L2650 >> >> >> the entire propagation chain tends to use `T_SHORT` as well. >> This applies to most operations, with the exception of a few like `RShiftI`, `Abs`, and `ReverseBytesI`, which are handled separately. >> >> So your change already benefits many char-related vectorization cases like `convertDoubleToChar` above. That?s why we can safely relax the IR condition mentioned earlier. > > Thanks for your input! It's really helpful to me. Does this mean it always use `T_SHORT` for char vectors in SLP? If so, it's safe that we do not need to consider `T_CHAR` in vector IRs in backend? > > Hi @XiaohongGong, is there any way we can implement 2HF -> 2S and 2S -> 2HF in these match rules ? > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4697 > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4679 > > > > The `fcvtn` and `fcvtl` instructions do not support these arrangements. I was wondering if there is any other way we can implement these by any chance? > > Do you mean `2HF -> 2F` and `2F -> 2HF` ? > > Yes, it does not support the 32-bit arrangements. Vector conversion is a kind of lanewise vector operation. For such cases, we usually use the same arrangements with 64-bit vector size for 32-bit ones. That means we can reuse the `T4H` and `T4S` to implement it. Hence, current match rules can cover the conversions between `2HF` and `2F`. > > Consider there is no such conversion cases in Vector API, I didn't change the comment in the match rules. I think this may benefit auto-vectorization. Currently, do we have cases that can match these rules with SLP? Sorry yes I meant 2HF <-> 2F. Yes, currently there are no such cases in VectorAPI as we do not support Float16 Vectors yet but this will benefit autovectorization cases. I think in this case this may also benefit 2D <-> 2HF as well (eventually we might add support for D <-> HF as well). Yes we have some JTREG tests that match these rules currently like - `test/hotspot/jtreg/compiler/vectorization/TestFloat16VectorConvChain.java`, `test/hotspot/jtreg/compiler/vectorization/TestFloatConversionsVector.java`. 
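For context, a minimal Java loop of the kind those JTREG tests exercise; under auto-vectorization it lowers to the float-to-half and half-to-float vector casts discussed above. The class, method and array names are illustrative, `dst` is assumed to be at least as long as `src`, and only `Float.floatToFloat16`/`Float.float16ToFloat` are real JDK APIs.

```java
public class Float16ConvSketch {
    // float -> IEEE 754 binary16 bits; the 2-lane (32-bit) half-precision
    // shapes discussed above correspond to converting two floats at a time.
    static void floatToHalf(float[] src, short[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Float.floatToFloat16(src[i]);
        }
    }

    // binary16 bits -> float
    static void halfToFloat(short[] src, float[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Float.float16ToFloat(src[i]);
        }
    }
}
```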
------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3047838866 From fgao at openjdk.org Tue Jul 8 08:21:41 2025 From: fgao at openjdk.org (Fei Gao) Date: Tue, 8 Jul 2025 08:21:41 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: <0vdCJFYxCI6hFnTL6rm3oKQcPuuIR2EbuyAOa0muqcw=.d5c249cb-9bf0-415d-ab22-de7387d8d8d1@github.com> References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> <0vdCJFYxCI6hFnTL6rm3oKQcPuuIR2EbuyAOa0muqcw=.d5c249cb-9bf0-415d-ab22-de7387d8d8d1@github.com> Message-ID: On Tue, 8 Jul 2025 01:55:55 GMT, Xiaohong Gong wrote: > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 > > > > > > > > > Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. > > > > > > Hi @XiaohongGong thanks for asking. In many auto-vectorization cases involving `char`, the vector elements are represented using `T_SHORT` as the `BasicType`, rather than `T_CHAR`. > > This is because, in Java, operands of subword types are always promoted to `int` before any arithmetic operation. As a result, when handling a node like `ConvD2I`, we don?t initially know its actual subword type. Later, the SuperWord phase propagates a narrowed integer type backward to help determine the correct subword type. See: > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2551-L2558 > > > > Since SuperWord assigns `T_SHORT` to `StoreC` early on > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2646-L2650 > > > > the entire propagation chain tends to use `T_SHORT` as well. > > This applies to most operations, with the exception of a few like `RShiftI`, `Abs`, and `ReverseBytesI`, which are handled separately. > > So your change already benefits many char-related vectorization cases like `convertDoubleToChar` above. That?s why we can safely relax the IR condition mentioned earlier. > > Thanks for your input! It's really helpful to me. Does this mean it always use `T_SHORT` for char vectors in SLP? If so, it's safe that we do not need to consider `T_CHAR` in vector IRs in backend? No, we don't always use `T_SHORT` for char vectors. As mentioned earlier, for operations like `RShiftI`, `Abs`, and `ReverseBytesI`, the compiler needs to preserve the higher-order bits of the first operand. Therefore, SuperWord still needs to assign them precise subword types. 
See: https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2583-L2589 ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3047853525 From shade at openjdk.org Tue Jul 8 08:25:46 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 8 Jul 2025 08:25:46 GMT Subject: RFR: 8361397: Rework CompileLog list synchronization [v2] In-Reply-To: References: <12Yp6QmpXqG-1UXTS8VveJ4yDNnDEGFV2q3_vRc3lF0=.4ccf05e2-9249-4b55-b48f-4f7fc17bef65@github.com> Message-ID: <2YQBzDWuLDUUS78gQiO780XFkZ3Hy0zbD1SKKBBlJ48=.5f3a09e7-b667-4e6d-9ead-59eb4c7cb87c@github.com> On Mon, 7 Jul 2025 09:07:32 GMT, Aleksey Shipilev wrote: >> I want to remove `CompileTaskAlloc_lock` completely with [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473), and for that we need to fix a stray use of that lock in CompileLog list linkage. We can rewrite that part to atomics. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `compiler` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' into JDK-8361397-compilelog-list > - Fix Thank you! Here goes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26127#issuecomment-3047865696 From shade at openjdk.org Tue Jul 8 08:25:46 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 8 Jul 2025 08:25:46 GMT Subject: Integrated: 8361397: Rework CompileLog list synchronization In-Reply-To: <12Yp6QmpXqG-1UXTS8VveJ4yDNnDEGFV2q3_vRc3lF0=.4ccf05e2-9249-4b55-b48f-4f7fc17bef65@github.com> References: <12Yp6QmpXqG-1UXTS8VveJ4yDNnDEGFV2q3_vRc3lF0=.4ccf05e2-9249-4b55-b48f-4f7fc17bef65@github.com> Message-ID: On Fri, 4 Jul 2025 09:23:28 GMT, Aleksey Shipilev wrote: > I want to remove `CompileTaskAlloc_lock` completely with [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473), and for that we need to fix a stray use of that lock in CompileLog list linkage. We can rewrite that part to atomics. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `compiler` This pull request has now been integrated. Changeset: 7b255b8a Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/7b255b8a625ce1eda1ec6242b8e438691f6cc845 Stats: 11 lines in 2 files changed: 4 ins; 0 del; 7 mod 8361397: Rework CompileLog list synchronization Reviewed-by: kvn, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/26127 From aph at openjdk.org Tue Jul 8 08:31:46 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 8 Jul 2025 08:31:46 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v4] In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 18:19:54 GMT, Evgeny Astigeevich wrote: >> Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. >> >> This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. >> >> Tested on Linux and MacOS with and without hsdis: >> - Fastdebug: test passed >> - Slowdebug: test passed. 
>> - Release: test passed. > > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Implement using block_comment src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 6806: > 6804: > 6805: void MacroAssembler::spin_wait() { > 6806: block_comment("spin_wait"); You need something unique. e.g. Suggestion: block_comment("spin_wait_ohthoo8H"); ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2191835912 From duke at openjdk.org Tue Jul 8 08:41:59 2025 From: duke at openjdk.org (Andrej =?UTF-8?B?UGXEjWltw7p0aA==?=) Date: Tue, 8 Jul 2025 08:41:59 GMT Subject: RFR: 8357689: Refactor JVMCI to enable replay compilation in Graal [v2] In-Reply-To: References: Message-ID: > This PR introduces a few minor JVMCI refactorings to make replay compilation possible in the Graal compiler. Andrej Pe?im?th has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: JVMCI refactorings to enable replay compilation in Graal. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25433/files - new: https://git.openjdk.org/jdk/pull/25433/files/c12f507c..c6c3bb62 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25433&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25433&range=00-01 Stats: 204 lines in 10 files changed: 3 ins; 88 del; 113 mod Patch: https://git.openjdk.org/jdk/pull/25433.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25433/head:pull/25433 PR: https://git.openjdk.org/jdk/pull/25433 From duke at openjdk.org Tue Jul 8 08:41:59 2025 From: duke at openjdk.org (Andrej =?UTF-8?B?UGXEjWltw7p0aA==?=) Date: Tue, 8 Jul 2025 08:41:59 GMT Subject: RFR: 8357689: Refactor JVMCI to enable replay compilation in Graal [v2] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 08:37:53 GMT, Andrej Pe?im?th wrote: >> This PR introduces a few minor JVMCI refactorings to make replay compilation possible in the Graal compiler. > > Andrej Pe?im?th has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > JVMCI refactorings to enable replay compilation in Graal. src/jdk.internal.vm.ci/share/classes/jdk/vm/ci/code/BytecodeFrame.java line 227: > 225: this.duringCall = duringCall; > 226: this.values = values; > 227: this.slotKinds = listFromTrustedArray(slotKinds); `ImmutableCollections#listFromTrustedArray` asserts that the array class is `Object[].class`, but the type of `slotKinds` is `JavaKind[]` - so these refactorings do not work. 
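A tiny, self-contained illustration of the point (plain Java, nothing JVMCI-specific; the class and variable names are made up): the runtime class of an array with a non-Object component type is not `Object[].class`, which is exactly what a trusted-array path asserts.

```java
public class ArrayClassSketch {
    public static void main(String[] args) {
        String[] typed = { "a", "b" };      // runtime class is String[]
        Object[] untyped = { "a", "b" };    // runtime class is Object[]

        System.out.println(typed.getClass() == Object[].class);   // false
        System.out.println(untyped.getClass() == Object[].class); // true

        // By the same token a JavaKind[] fails an Object[].class assertion
        // even though every element is an Object; only a genuine Object[]
        // (or a copy into one) passes such a check.
    }
}
```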
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25433#discussion_r2190153203 From shade at openjdk.org Tue Jul 8 08:42:48 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 8 Jul 2025 08:42:48 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v4] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 08:28:33 GMT, Andrew Haley wrote: >> Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: >> >> Implement using block_comment > > src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 6806: > >> 6804: >> 6805: void MacroAssembler::spin_wait() { >> 6806: block_comment("spin_wait"); > > You need something unique. e.g. > Suggestion: > > block_comment("spin_wait_ohthoo8H"); Erm, I don't see why? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2191861699 From roland at openjdk.org Tue Jul 8 08:43:31 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 8 Jul 2025 08:43:31 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v34] In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 13:36:31 GMT, Christian Hagedorn wrote: > I quickly ran through Emanuel's review comments. I think they all have been addressed. Added some follow-up suggestions on top but otherwise, it still looks good to me. Thanks for taking another look at this. I made the changes you recommended. ------------- PR Comment: https://git.openjdk.org/jdk/pull/21630#issuecomment-3047926590 From roland at openjdk.org Tue Jul 8 08:43:31 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 8 Jul 2025 08:43:31 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v36] In-Reply-To: References: Message-ID: > To optimize a long counted loop and long range checks in a long or int > counted loop, the loop is turned into a loop nest. When the loop has > few iterations, the overhead of having an outer loop whose backedge is > never taken, has a measurable cost. Furthermore, creating the loop > nest usually causes one iteration of the loop to be peeled so > predicates can be set up. If the loop is short running, then it's an > extra iteration that's run with range checks (compared to an int > counted loop with int range checks). > > This change doesn't create a loop nest when: > > 1- it can be determined statically at loop nest creation time that the > loop runs for a short enough number of iterations > > 2- profiling reports that the loop runs for no more than ShortLoopIter > iterations (1000 by default). > > For 2-, a guard is added which is implemented as yet another predicate. > > While this change is in principle simple, I ran into a few > implementation issues: > > - while c2 has a way to compute the number of iterations of an int > counted loop, it doesn't have that for long counted loop. The > existing logic for int counted loops promotes values to long to > avoid overflows. I reworked it so it now works for both long and int > counted loops. > > - I added a new deoptimization reason (Reason_short_running_loop) for > the new predicate. Given the number of iterations is narrowed down > by the predicate, the limit of the loop after transformation is a > cast node that's control dependent on the short running loop > predicate. 
Because once the counted loop is transformed, it is > likely that range check predicates will be inserted and they will > depend on the limit, the short running loop predicate has to be the > one that's further away from the loop entry. Now it is also possible > that the limit before transformation depends on a predicate > (TestShortRunningLongCountedLoopPredicatesClone is an example), we > can have: new predicates inserted after the transformation that > depend on the casted limit that itself depend on old predicates > added before the transformation. To solve this cicular dependency, > parse and assert predicates are cloned between the old predicates > and the loop head. The cloned short running loop parse predicate is > the one that's used to insert the short running loop predicate. > > - In the case of a long counted loop, the loop is transformed into a > regular loop with a new limit and transformed range checks that's > later turned into an in counted loop. The int ... Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 97 commits: - review - Merge branch 'master' into JDK-8342692 - Update src/hotspot/share/opto/c2_globals.hpp Co-authored-by: Christian Hagedorn - small fix - Merge branch 'master' into JDK-8342692 - review - review - Update test/micro/org/openjdk/bench/java/lang/foreign/HeapMismatchManualLoopTest.java Co-authored-by: Christian Hagedorn - Update test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningLongCountedLoopScaleOverflow.java Co-authored-by: Christian Hagedorn - Update test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningLongCountedLoopPredicatesClone.java Co-authored-by: Christian Hagedorn - ... and 87 more: https://git.openjdk.org/jdk/compare/310ef856...bb69cc02 ------------- Changes: https://git.openjdk.org/jdk/pull/21630/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=21630&range=35 Stats: 1619 lines in 26 files changed: 1541 ins; 22 del; 56 mod Patch: https://git.openjdk.org/jdk/pull/21630.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/21630/head:pull/21630 PR: https://git.openjdk.org/jdk/pull/21630 From aph at openjdk.org Tue Jul 8 08:48:39 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 8 Jul 2025 08:48:39 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v4] In-Reply-To: References: Message-ID: <3-UMhQGbb3psk7_pn0BC1SJNrKtOfZWphO_V9d9Bqz8=.09589b1a-fb0d-4150-95f4-7565d32ed7b1@github.com> On Tue, 8 Jul 2025 08:39:34 GMT, Aleksey Shipilev wrote: > Erm, I don't see why? To make it unique? I don't see why you don't see that we need to ensure that the string is unique. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2191876244 From epeter at openjdk.org Tue Jul 8 08:59:44 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 8 Jul 2025 08:59:44 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v9] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 05:56:58 GMT, erifan wrote: >> The current testing status for the conditions you listed: >>> 1. Not MaskCmp. >> >> **No test for it, tested locally**, Because I think this condition is too straightforward. >> >>> 2. More than one use. >> >> **Tested**, see `VectorMaskCompareNotTest.java line 1118`. >> >>> predicate negatable and vector not all ones. >> >> **Tested**, see `VectorMaskCompareNotTest.java line 1126`. >> >>> predicate not negatable and vector not all ones. 
>> >> **No test for it**, because we have tests for `predicate not negatable` or `vector not all ones`. If either is `false`, return nullptr. >> >>> predicate negatable and vector all ones. >> >> **A lot of tests for it**. For example `VectorMaskCompareNotTest.java line 1014`. >> >>> predicate not negatable and vectors all ones. >> >> **Tested**, see `VectorMaskCompareNotTest.java line 1222`. > >> Indentation for clarity. > > Done. > > I think we have enough negative tests. Please take a look at this PR, thanks~ Thanks for your answers @erifan ! Can you please answer these as well? > predicate cannot be negated AND the vector is all ones. Can you explain this condition? A code comment would be helpful for this case. I'm a little bit struggling to understand the bracket/negation here. > Why do you guard against VectorNode::is_all_ones_vector(in2) at all? Is this necessary? Why? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2191900155 From shade at openjdk.org Tue Jul 8 08:59:53 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 8 Jul 2025 08:59:53 GMT Subject: RFR: 8357473: Compilation spike leaves many CompileTasks in free list [v6] In-Reply-To: References: Message-ID: > See bug for more discussion. > > This PR implements the "all the way" solution by removing the free list completely. It complements https://github.com/openjdk/jdk/pull/25364, and can go either first, or second. We will remerge the other one once either integrates. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `compiler` > - [x] Linux AArch64 server fastdebug, `all` Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 11 commits: - Task count atomic can be relaxed - Minor touchup in ~CompileTask - Purge CompileTaskAlloc_lock completely - Merge branch 'master' into JDK-8357473-compile-task-free-list - Merge branch 'master' into JDK-8357473-compile-task-free-list - Merge branch 'master' into JDK-8357473-compile-task-free-list - Merge branch 'master' into JDK-8357473-compile-task-free-list - Merge branch 'master' into JDK-8357473-compile-task-free-list - Also free the lock! - Comments and indenting - ... and 1 more: https://git.openjdk.org/jdk/compare/7b255b8a...684f83b7 ------------- Changes: https://git.openjdk.org/jdk/pull/25409/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=25409&range=05 Stats: 134 lines in 6 files changed: 27 ins; 71 del; 36 mod Patch: https://git.openjdk.org/jdk/pull/25409.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25409/head:pull/25409 PR: https://git.openjdk.org/jdk/pull/25409 From shade at openjdk.org Tue Jul 8 08:59:54 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 8 Jul 2025 08:59:54 GMT Subject: RFR: 8357473: Compilation spike leaves many CompileTasks in free list [v5] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 07:17:20 GMT, Christian Hagedorn wrote: >> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits: >> >> - Merge branch 'master' into JDK-8357473-compile-task-free-list >> - Merge branch 'master' into JDK-8357473-compile-task-free-list >> - Merge branch 'master' into JDK-8357473-compile-task-free-list >> - Merge branch 'master' into JDK-8357473-compile-task-free-list >> - Also free the lock! 
>> - Comments and indenting >> - Basic deletion > > src/hotspot/share/compiler/compileTask.cpp line 84: > >> 82: >> 83: CompileTask::~CompileTask() { >> 84: if ((_method_holder != nullptr && JNIHandles::is_weak_global_handle(_method_holder))) { > > While moving the code, you can probably remove one pair of parentheses here: > Suggestion: > > if (_method_holder != nullptr && JNIHandles::is_weak_global_handle(_method_holder)) { Sure, done. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25409#discussion_r2191898167 From xgong at openjdk.org Tue Jul 8 09:03:42 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 8 Jul 2025 09:03:42 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> <0vdCJFYxCI6hFnTL6rm3oKQcPuuIR2EbuyAOa0muqcw=.d5c249cb-9bf0-415d-ab22-de7387d8d8d1@github.com> Message-ID: On Tue, 8 Jul 2025 08:18:57 GMT, Fei Gao wrote: > > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 > > > > > > > > > > > > Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. > > > > > > > > > Hi @XiaohongGong thanks for asking. In many auto-vectorization cases involving `char`, the vector elements are represented using `T_SHORT` as the `BasicType`, rather than `T_CHAR`. > > > This is because, in Java, operands of subword types are always promoted to `int` before any arithmetic operation. As a result, when handling a node like `ConvD2I`, we don?t initially know its actual subword type. Later, the SuperWord phase propagates a narrowed integer type backward to help determine the correct subword type. See: > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2551-L2558 > > > > > > Since SuperWord assigns `T_SHORT` to `StoreC` early on > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2646-L2650 > > > > > > the entire propagation chain tends to use `T_SHORT` as well. > > > This applies to most operations, with the exception of a few like `RShiftI`, `Abs`, and `ReverseBytesI`, which are handled separately. > > > So your change already benefits many char-related vectorization cases like `convertDoubleToChar` above. That?s why we can safely relax the IR condition mentioned earlier. > > > > > > Thanks for your input! It's really helpful to me. Does this mean it always use `T_SHORT` for char vectors in SLP? If so, it's safe that we do not need to consider `T_CHAR` in vector IRs in backend? > > No, we don't always use `T_SHORT` for char vectors. As mentioned earlier, for operations like `RShiftI`, `Abs`, and `ReverseBytesI`, the compiler needs to preserve the higher-order bits of the first operand. Therefore, SuperWord still needs to assign them precise subword types. 
See: > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2583-L2589 Yes, I see. Thanks! What I mean is for cases that SLP will use the subword types, it will be actually `T_SHORT` for `T_CHAR` then? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3047998283 From xgong at openjdk.org Tue Jul 8 09:09:43 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 8 Jul 2025 09:09:43 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> <0vdCJFYxCI6hFnTL6rm3oKQcPuuIR2EbuyAOa0muqcw=.d5c249cb-9bf0-415d-ab22-de7387d8d8d1@github.com> Message-ID: <1bwEx4HAqqmfw9DrslwrZH1cYfIKi-5p9AgelJrIB94=.f46dd942-2102-4fd3-adfd-7f7ec3c3dbc0@github.com> On Tue, 8 Jul 2025 09:00:53 GMT, Xiaohong Gong wrote: >>> > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 >>> > > >>> > > >>> > > Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. >>> > >>> > >>> > Hi @XiaohongGong thanks for asking. In many auto-vectorization cases involving `char`, the vector elements are represented using `T_SHORT` as the `BasicType`, rather than `T_CHAR`. >>> > This is because, in Java, operands of subword types are always promoted to `int` before any arithmetic operation. As a result, when handling a node like `ConvD2I`, we don?t initially know its actual subword type. Later, the SuperWord phase propagates a narrowed integer type backward to help determine the correct subword type. See: >>> > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2551-L2558 >>> > >>> > Since SuperWord assigns `T_SHORT` to `StoreC` early on >>> > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2646-L2650 >>> > >>> > the entire propagation chain tends to use `T_SHORT` as well. >>> > This applies to most operations, with the exception of a few like `RShiftI`, `Abs`, and `ReverseBytesI`, which are handled separately. >>> > So your change already benefits many char-related vectorization cases like `convertDoubleToChar` above. That?s why we can safely relax the IR condition mentioned earlier. >>> >>> Thanks for your input! It's really helpful to me. Does this mean it always use `T_SHORT` for char vectors in SLP? If so, it's safe that we do not need to consider `T_CHAR` in vector IRs in backend? >> >> No, we don't always use `T_SHORT` for char vectors. As mentioned earlier, for operations like `RShiftI`, `Abs`, and `ReverseBytesI`, the compiler needs to preserve the higher-order bits of the first operand. Therefore, SuperWord still needs to assign them precise subword types. 
See: >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2583-L2589 > >> > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 >> > > > >> > > > >> > > > Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. >> > > >> > > >> > > Hi @XiaohongGong thanks for asking. In many auto-vectorization cases involving `char`, the vector elements are represented using `T_SHORT` as the `BasicType`, rather than `T_CHAR`. >> > > This is because, in Java, operands of subword types are always promoted to `int` before any arithmetic operation. As a result, when handling a node like `ConvD2I`, we don?t initially know its actual subword type. Later, the SuperWord phase propagates a narrowed integer type backward to help determine the correct subword type. See: >> > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2551-L2558 >> > > >> > > Since SuperWord assigns `T_SHORT` to `StoreC` early on >> > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2646-L2650 >> > > >> > > the entire propagation chain tends to use `T_SHORT` as well. >> > > This applies to most operations, with the exception of a few like `RShiftI`, `Abs`, and `ReverseBytesI`, which are handled separately. >> > > So your change already benefits many char-related vectorization cases like `convertDoubleToChar` above. That?s why we can safely relax the IR condition mentioned earlier. >> > >> > >> > Thanks for your input! It's really helpful to me. Does this mean it always use `T_SHORT` for char vectors in SLP? If so, it's safe that we do not need to consider `T_CHAR` in vector IRs in backend? >> >> No, we don't always use `T_SHORT` for char vectors. As mentioned earlier, for operations like `RShiftI`, `Abs`, and `ReverseBytesI`, the compiler needs to preserve the higher-order bits of the first operand. Therefore, SuperWord still needs to assign them precise subword types. See: >> >> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2583-L2589 > > Yes, I see. Thanks! What I mean is for cases that SLP will use the subword types, it will be actu... > > > Hi @XiaohongGong, is there any way we can implement 2HF -> 2S and 2S -> 2HF in these match rules ? > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4697 > > > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4679 > > > > > > The `fcvtn` and `fcvtl` instructions do not support these arrangements. I was wondering if there is any other way we can implement these by any chance? > > > > > > Do you mean `2HF -> 2F` and `2F -> 2HF` ? > > Yes, it does not support the 32-bit arrangements. Vector conversion is a kind of lanewise vector operation. For such cases, we usually use the same arrangements with 64-bit vector size for 32-bit ones. That means we can reuse the `T4H` and `T4S` to implement it. 
Hence, current match rules can cover the conversions between `2HF` and `2F`. > > Consider there is no such conversion cases in Vector API, I didn't change the comment in the match rules. I think this may benefit auto-vectorization. Currently, do we have cases that can match these rules with SLP? > > Sorry yes I meant 2HF <-> 2F. Yes, currently there are no such cases in VectorAPI as we do not support Float16 Vectors yet but this will benefit autovectorization cases. I think in this case this may also benefit 2D <-> 2HF as well (eventually we might add support for D <-> HF as well). Yes we have some JTREG tests that match these rules currently like - `test/hotspot/jtreg/compiler/vectorization/TestFloat16VectorConvChain.java`, `test/hotspot/jtreg/compiler/vectorization/TestFloatConversionsVector.java`. Thanks! So per my understanding, things that I just need is updating comment (e.g. `// 4HF to 4F`) of rules like `vcvtHFtoF`, right? For conversions between double and HF, we do not need any new rules as it will be actually `double -> float -> HF`, right? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3048019109 From shade at openjdk.org Tue Jul 8 09:11:50 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 8 Jul 2025 09:11:50 GMT Subject: RFR: 8357473: Compilation spike leaves many CompileTasks in free list [v6] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 08:59:53 GMT, Aleksey Shipilev wrote: >> See bug for more discussion. >> >> This PR implements the "all the way" solution by removing the free list completely. It complements https://github.com/openjdk/jdk/pull/25364, and can go either first, or second. We will remerge the other one once either integrates. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `compiler` >> - [x] Linux AArch64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 11 commits: > > - Task count atomic can be relaxed > - Minor touchup in ~CompileTask > - Purge CompileTaskAlloc_lock completely > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Also free the lock! > - Comments and indenting > - ... and 1 more: https://git.openjdk.org/jdk/compare/7b255b8a...684f83b7 Also realized the Atomic can be relaxed, since nothing rides on its memory consistency effects. I am retesting to make sure. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25409#issuecomment-3048024663 From eastigeevich at openjdk.org Tue Jul 8 09:18:39 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 8 Jul 2025 09:18:39 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v4] In-Reply-To: <3-UMhQGbb3psk7_pn0BC1SJNrKtOfZWphO_V9d9Bqz8=.09589b1a-fb0d-4150-95f4-7565d32ed7b1@github.com> References: <3-UMhQGbb3psk7_pn0BC1SJNrKtOfZWphO_V9d9Bqz8=.09589b1a-fb0d-4150-95f4-7565d32ed7b1@github.com> Message-ID: <0rQQG6eMxxtYRbrUUwSd6xABIwCuYg2Yf-5RJXb6U78=.794c8979-07ba-4a93-9166-9faa3827f77e@github.com> On Tue, 8 Jul 2025 08:46:26 GMT, Andrew Haley wrote: >> Erm, I don't see why? > >> Erm, I don't see why? > > To make it unique? 
I don't see why you don't see that we need to ensure that the string is unique. @theRealAph, what do you think to have it in the format: `spin_wait_%INST-COUNT%_%INST_NAME%`? For example: `spin_wait_1_isb` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2191937598 From shade at openjdk.org Tue Jul 8 09:18:40 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 8 Jul 2025 09:18:40 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v4] In-Reply-To: <0rQQG6eMxxtYRbrUUwSd6xABIwCuYg2Yf-5RJXb6U78=.794c8979-07ba-4a93-9166-9faa3827f77e@github.com> References: <3-UMhQGbb3psk7_pn0BC1SJNrKtOfZWphO_V9d9Bqz8=.09589b1a-fb0d-4150-95f4-7565d32ed7b1@github.com> <0rQQG6eMxxtYRbrUUwSd6xABIwCuYg2Yf-5RJXb6U78=.794c8979-07ba-4a93-9166-9faa3827f77e@github.com> Message-ID: <196sEl_APNWLhyHVGJTQ0yrudbjUxu-qCMOI9siz7FM=.7ebf304d-c4f3-4406-bfd1-85770397240b@github.com> On Tue, 8 Jul 2025 09:13:53 GMT, Evgeny Astigeevich wrote: >>> Erm, I don't see why? >> >> To make it unique? I don't see why you don't see that we need to ensure that the string is unique. > > @theRealAph, what do you think to have it in the format: `spin_wait_%INST-COUNT%_%INST_NAME%`? > For example: `spin_wait_1_isb` But this is a generic macroAssembler block comment. It does not make sense to me to have a block comment that looks like a memory corrupted string, just to satisfy a single test. There should be a middle-ground here, e.g. `spin_wait {`, which looks reasonably enough as the block comment, and not anything else. Would also match nicely when we emit the closing `}` at the end of the instruction block. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2191941126 From eastigeevich at openjdk.org Tue Jul 8 09:23:44 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 8 Jul 2025 09:23:44 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v4] In-Reply-To: <196sEl_APNWLhyHVGJTQ0yrudbjUxu-qCMOI9siz7FM=.7ebf304d-c4f3-4406-bfd1-85770397240b@github.com> References: <3-UMhQGbb3psk7_pn0BC1SJNrKtOfZWphO_V9d9Bqz8=.09589b1a-fb0d-4150-95f4-7565d32ed7b1@github.com> <0rQQG6eMxxtYRbrUUwSd6xABIwCuYg2Yf-5RJXb6U78=.794c8979-07ba-4a93-9166-9faa3827f77e@github.com> <196sEl_APNWLhyHVGJTQ0yrudbjUxu-qCMOI9siz7FM=.7ebf304d-c4f3-4406-bfd1-85770397240b@github.com> Message-ID: On Tue, 8 Jul 2025 09:15:32 GMT, Aleksey Shipilev wrote: >> @theRealAph, what do you think to have it in the format: `spin_wait_%INST-COUNT%_%INST_NAME%`? >> For example: `spin_wait_1_isb` > > But this is a generic macroAssembler block comment. It does not make sense to me to have a block comment that looks like a memory corrupted string, just to satisfy a single test. There should be a middle-ground here, e.g. `spin_wait {`, which looks reasonably enough as the block comment, and not anything else. Would also match nicely when we emit the closing `}` at the end of the instruction block. I like the idea of having `{`, `}`.
IMO this should be informative: ;; spin_wait_3_nop { nop nop nop ;; } ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2191957430 From adinn at openjdk.org Tue Jul 8 09:28:42 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 8 Jul 2025 09:28:42 GMT Subject: RFR: 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 23:48:03 GMT, Vladimir Kozlov wrote: > `CodeBlob::relocation_size()` is calculated as `(_mutable_data + _relocation_size - _mutable_data)`. `CodeBlob::relocation_size()` is called during AOT code loading before we allocate space for mutable data (the size is used to find how big space should be allocated). The expression at that point is `(NULL + _relocation_size - NULL)` which returns correct result. But we should just return `_relocation_size` which is recorded anyway in AOT data. > > Added missing `_mutable_data = blob_end();` initialization when we restore AOT code blob. > > Fixed embarrassing typo in asserts. > > Tested: tier1-6,8,10,xcomp,stress Looks good. ------------- Marked as reviewed by adinn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26175#pullrequestreview-2996756136 From shade at openjdk.org Tue Jul 8 09:38:42 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 8 Jul 2025 09:38:42 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v4] In-Reply-To: References: <3-UMhQGbb3psk7_pn0BC1SJNrKtOfZWphO_V9d9Bqz8=.09589b1a-fb0d-4150-95f4-7565d32ed7b1@github.com> <0rQQG6eMxxtYRbrUUwSd6xABIwCuYg2Yf-5RJXb6U78=.794c8979-07ba-4a93-9166-9faa3827f77e@github.com> <196sEl_APNWLhyHVGJTQ0yrudbjUxu-qCMOI9siz7FM=.7ebf304d-c4f3-4406-bfd1-85770397240b@github.com> Message-ID: On Tue, 8 Jul 2025 09:21:03 GMT, Evgeny Astigeevich wrote: >> But this is a generic macroAssembler block comment. It does not make sense to me to have a block comment that looks like a memory corrupted string, just to satisfy a single test. There should be a middle-ground here, e.g. `spin_wait {`, which looks reasonably enough as the block comment, and not anything else. Would also match nicely when we emit the closing `}` at the end of the instruction block. > > I like the idea of having `{`, `}`. > IMO this should be informative: > > ;; spin_wait_3_nop { > nop > nop > nop > ;; } Putting `_3_nop` might be convenient, but I think that would introduce too much hassle in macroAssembler? Just this looks okay to me: ;; spin_wait { nop nop nop ;; } ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2191990275 From duke at openjdk.org Tue Jul 8 09:51:45 2025 From: duke at openjdk.org (erifan) Date: Tue, 8 Jul 2025 09:51:45 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v9] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 08:56:44 GMT, Emanuel Peter wrote: > predicate cannot be negated AND the vector is all ones. Can you explain this condition? Ok, I'll add a comment for it. > Why do you guard against VectorNode::is_all_ones_vector(in2) at all? Because one of the nodes in the supported patterns by this PR needs to be `MaskAll` or `Replicate`. And this function `VectorNode::is_all_ones_vector` just meets our check for `MaskAll` and `Replicate`. Actually I don't quite understand your question. I have two understandings: 1. 
Not all nodes that `VectorNode::is_all_ones_vector` returns true are `MaskAll` or `Replicate`, but other nodes that do not meet the conditions. 2. Here, it does not need to be a vector with every bit 1, it only needs to be an `all true` mask. Which one do you mean? Or something else? Thanks~ ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2192019309 From duke at openjdk.org Tue Jul 8 10:23:43 2025 From: duke at openjdk.org (erifan) Date: Tue, 8 Jul 2025 10:23:43 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v9] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 09:49:30 GMT, erifan wrote: >> Thanks for your answers @erifan ! >> >> Can you please answer these as well? >> >>> predicate cannot be negated AND the vector is all ones. Can you explain this condition? >> >> A code comment would be helpful for this case. I'm a little bit struggling to understand the bracket/negation here. >> >>> Why do you guard against VectorNode::is_all_ones_vector(in2) at all? >> >> Is this necessary? Why? > >> predicate cannot be negated AND the vector is all ones. Can you explain this condition? > > Ok, I'll add a comment for it. > >> Why do you guard against VectorNode::is_all_ones_vector(in2) at all? > > Because one of the nodes in the supported patterns by this PR needs to be `MaskAll` or `Replicate`. And this function `VectorNode::is_all_ones_vector` just meets our check for `MaskAll` and `Replicate`. Actually I don't quite understand your question. I have two understandings: > 1. Not all nodes that `VectorNode::is_all_ones_vector` returns true are `MaskAll` or `Replicate`, but other nodes that do not meet the conditions. > 2. Here, it does not need to be a vector with every bit 1, it only needs to be an `all true` mask. > > Which one do you mean? Or something else? Thanks~ The purpose of this PR is optimizing the following kinds of patterns: XXXVector va, vb; va.compare(EQ, vb).not() And the generated IR of `va.compare(EQ, vb).not()` is `(XorVMask (VectorMaskCmp va vb EQ) (MaskAll -1))`. On platforms like aarch64 NEON, `MaskAll` is `Replicate`. And `MaskAll` and `Replicate` are both all ones vectors, so we do this check `VectorNode::is_all_ones_vector(in2)` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2192087590 From fgao at openjdk.org Tue Jul 8 10:36:50 2025 From: fgao at openjdk.org (Fei Gao) Date: Tue, 8 Jul 2025 10:36:50 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> <0vdCJFYxCI6hFnTL6rm3oKQcPuuIR2EbuyAOa0muqcw=.d5c249cb-9bf0-415d-ab22-de7387d8d8d1@github.com> Message-ID: On Tue, 8 Jul 2025 09:00:53 GMT, Xiaohong Gong wrote: > > > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 > > > > > > > > > > > > > > > Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. 
> > > > > > > > > > > > Hi @XiaohongGong thanks for asking. In many auto-vectorization cases involving `char`, the vector elements are represented using `T_SHORT` as the `BasicType`, rather than `T_CHAR`. > > > > This is because, in Java, operands of subword types are always promoted to `int` before any arithmetic operation. As a result, when handling a node like `ConvD2I`, we don't initially know its actual subword type. Later, the SuperWord phase propagates a narrowed integer type backward to help determine the correct subword type. See: > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2551-L2558 > > > > > > > > Since SuperWord assigns `T_SHORT` to `StoreC` early on > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2646-L2650 > > > > > > > > the entire propagation chain tends to use `T_SHORT` as well. > > > > This applies to most operations, with the exception of a few like `RShiftI`, `Abs`, and `ReverseBytesI`, which are handled separately. > > > > So your change already benefits many char-related vectorization cases like `convertDoubleToChar` above. That's why we can safely relax the IR condition mentioned earlier. > > > > > > > > > Thanks for your input! It's really helpful to me. Does this mean it always uses `T_SHORT` for char vectors in SLP? If so, it's safe that we do not need to consider `T_CHAR` in vector IRs in backend? > > > > > > No, we don't always use `T_SHORT` for char vectors. As mentioned earlier, for operations like `RShiftI`, `Abs`, and `ReverseBytesI`, the compiler needs to preserve the higher-order bits of the first operand. Therefore, SuperWord still needs to assign them precise subword types. See: > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2583-L2589 > > Yes, I see. Thanks! What I mean is for cases that SLP will use the subword types, it will be actually `T_SHORT` for `T_CHAR` then? From my side, the cases where SLP uses subword types can be roughly categorized into two groups: 1. Cases where the compiler doesn't need to preserve the higher-order bits; in these, SuperWord will use `T_SHORT` instead of `T_CHAR`. 2. Cases where the compiler does need to preserve the higher-order bits, like `RShiftI`, `Abs`, and `ReverseBytesI`; in these, `T_CHAR` is still used. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3048334299 From bkilambi at openjdk.org Tue Jul 8 10:57:40 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 8 Jul 2025 10:57:40 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: <1bwEx4HAqqmfw9DrslwrZH1cYfIKi-5p9AgelJrIB94=.f46dd942-2102-4fd3-adfd-7f7ec3c3dbc0@github.com> References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> <0vdCJFYxCI6hFnTL6rm3oKQcPuuIR2EbuyAOa0muqcw=.d5c249cb-9bf0-415d-ab22-de7387d8d8d1@github.com> <1bwEx4HAqqmfw9DrslwrZH1cYfIKi-5p9AgelJrIB94=.f46dd942-2102-4fd3-adfd-7f7ec3c3dbc0@github.com> Message-ID: On Tue, 8 Jul 2025 09:07:00 GMT, Xiaohong Gong wrote: > > > > Hi @XiaohongGong, is there any way we can implement 2HF -> 2S and 2S -> 2HF in these match rules ?
> > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4697 > > > > > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4679 > > > > > > > > The `fcvtn` and `fcvtl` instructions do not support these arrangements. I was wondering if there is any other way we can implement these by any chance? > > > > > > > > > Do you mean `2HF -> 2F` and `2F -> 2HF` ? > > > Yes, it does not support the 32-bit arrangements. Vector conversion is a kind of lanewise vector operation. For such cases, we usually use the same arrangements with 64-bit vector size for 32-bit ones. That means we can reuse the `T4H` and `T4S` to implement it. Hence, current match rules can cover the conversions between `2HF` and `2F`. > > > Consider there is no such conversion cases in Vector API, I didn't change the comment in the match rules. I think this may benefit auto-vectorization. Currently, do we have cases that can match these rules with SLP? > > > > > > Sorry yes I meant 2HF <-> 2F. Yes, currently there are no such cases in VectorAPI as we do not support Float16 Vectors yet but this will benefit autovectorization cases. I think in this case this may also benefit 2D <-> 2HF as well (eventually we might add support for D <-> HF as well). Yes we have some JTREG tests that match these rules currently like - `test/hotspot/jtreg/compiler/vectorization/TestFloat16VectorConvChain.java`, `test/hotspot/jtreg/compiler/vectorization/TestFloatConversionsVector.java`. > > Thanks! So per my understanding, things that I just need is updating comment (e.g. `// 4HF to 4F`) of rules like `vcvtHFtoF`, right? For conversions between double and HF, we do not need any new rules as it will be actually `double -> float -> HF`, right? Yes please and also for `4F to 4HF` case for `vcvtF2HF`. Thanks! As for the double to half float conversion - yes with the current infrastructure it would be ConvD2F -> ConvF2HF which will be autovectorized to generate corresponding vector nodes. Sooner or later, support for ConvD2HF (and its vectorized version) might be added upstream (support already available in `lworld+fp16` branch of Valhalla here - https://github.com/openjdk/valhalla/blob/0ed65b9a63405e950c411835120f0f36e326aaaa/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4535). You do not have to add any new rules now for this case. I was just hinting at possible D<->HF implementation in the future. As the max vector length was 64bits, I did not add any implementation for Neon vcvtD2HF or vcvtHF2D in Valhalla. Maybe we can do two `fcvtl/fcvtn` to convert D to F and then F to HF for this specific case but we can think about that later :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3048404345 From duke at openjdk.org Tue Jul 8 11:05:49 2025 From: duke at openjdk.org (duke) Date: Tue, 8 Jul 2025 11:05:49 GMT Subject: Withdrawn: 8354282: C2: more crashes in compiled code because of dependency on removed range check CastIIs In-Reply-To: References: Message-ID: On Thu, 10 Apr 2025 15:15:54 GMT, Roland Westrelin wrote: > This is a variant of 8332827. In 8332827, an array access becomes > dependent on a range check `CastII` for another array access. When, > after loop opts are over, that RC `CastII` was removed, the array > access could float and an out of bound access happened. With the fix > for 8332827, RC `CastII`s are no longer removed. 
> > With this one what happens is that some transformations applied after > loop opts are over widen the type of the RC `CastII`. As a result, the > type of the RC `CastII` is no longer narrower than that of its input, > the `CastII` is removed and the dependency is lost. > > There are 2 transformations that cause this to happen: > > - after loop opts are over, the type of the `CastII` nodes is widened > so nodes that have the same inputs but a slightly different type can > common. > > - When pushing a `CastII` through an `Add`, if the type of both inputs > of the `Add` is non constant, then we end up widening the type > (the resulting `Add` has a type that's wider than that of the > initial `CastII`). > > There are already 3 types of `Cast` nodes depending on the > optimizations that are allowed. Either the `Cast` is floating > (`depends_only_test()` returns `true`) or pinned. Either the `Cast` > can be removed if it no longer narrows the type of its input or > not. We already have variants of the `CastII`: > > - if the Cast can float and be removed when it doesn't narrow the type > of its input. > > - if the Cast is pinned and be removed when it doesn't narrow the type > of its input. > > - if the Cast is pinned and can't be removed when it doesn't narrow > the type of its input. > > What we need here, I think, is the 4th combination: > > - if the Cast can float and can't be removed when it doesn't narrow > the type of its input. > > Anyway, things are becoming confusing with all these different > variants named in ways that don't always help figure out what > constraints one of them operates under. So I refactored this and that's > the biggest part of this change. The fix consists in marking `Cast` > nodes when their type is widened in a way that prevents them from being > optimized out. > > Tobias ran performance testing with a slightly different version of > this change and there was no regression. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/24575 From rrich at openjdk.org Tue Jul 8 11:31:46 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 8 Jul 2025 11:31:46 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v6] In-Reply-To: References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: On Mon, 7 Jul 2025 07:30:19 GMT, David Briemann wrote: >> Implement more nodes for ppc that exist on other platforms. > > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > rename instruction, add extra predicate cond for type int src/hotspot/cpu/ppc/ppc.ad line 13601: > 13599: instruct vneg4I_reg(vecX dst, vecX src) %{ > 13600: match(Set dst (NegVI src)); > 13601: predicate(PowerArchitecturePPC64 >= 9 && Matcher::vector_element_basic_type(n) == T_INT); Why not also for `T_LONG` (using `vnegd`)?
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26115#discussion_r2192238645 From epeter at openjdk.org Tue Jul 8 11:45:51 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 8 Jul 2025 11:45:51 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v10] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 06:00:38 GMT, erifan wrote: >> This patch optimizes the following patterns: >> For integer types: >> >> (XorV (VectorMaskCmp src1 src2 cond) (Replicate -1)) >> => (VectorMaskCmp src1 src2 ncond) >> (XorVMask (VectorMaskCmp src1 src2 cond) (MaskAll m1)) >> => (VectorMaskCmp src1 src2 ncond) >> >> cond can be eq, ne, le, ge, lt, gt, ule, uge, ult and ugt, ncond is the negative comparison of cond. >> >> For float and double types: >> >> (XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1)) >> => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) >> (XorVMask (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (MaskAll m1)) >> => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) >> >> cond can be eq or ne. >> >> Benchmarks on Nvidia Grace machine with 128-bit SVE2: With option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Score Error After Score Error Uplift >> testCompareEQMaskNotByte ops/s 7912127.225 2677.289518 10266136.26 8955.008548 1.29 >> testCompareEQMaskNotDouble ops/s 884737.6799 446.963779 1179760.772 448.031844 1.33 >> testCompareEQMaskNotFloat ops/s 1765045.787 682.332214 2359520.803 896.305743 1.33 >> testCompareEQMaskNotInt ops/s 1787221.411 977.743935 2353952.519 960.069976 1.31 >> testCompareEQMaskNotLong ops/s 895297.1974 673.44808 1178449.02 323.804205 1.31 >> testCompareEQMaskNotShort ops/s 3339987.002 3415.2226 4712761.965 2110.862053 1.41 >> testCompareGEMaskNotByte ops/s 7907615.16 4094.243652 10251646.9 9486.699831 1.29 >> testCompareGEMaskNotInt ops/s 1683738.958 4233.813092 2352855.205 1251.952546 1.39 >> testCompareGEMaskNotLong ops/s 854496.1561 8594.598885 1177811.493 521.1229 1.37 >> testCompareGEMaskNotShort ops/s 3341860.309 1578.975338 4714008.434 1681.10365 1.41 >> testCompareGTMaskNotByte ops/s 7910823.674 2993.367032 10245063.58 9774.75138 1.29 >> testCompareGTMaskNotInt ops/s 1673393.928 3153.099431 2353654.521 1190.848583 1.4 >> testCompareGTMaskNotLong ops/s 849405.9159 2432.858159 1177952.041 359.96413 1.38 >> testCompareGTMaskNotShort ops/s 3339509.141 3339.976585 4711442.496 2673.364893 1.41 >> testCompareLEMaskNotByte ops/s 7911340.004 3114.69191 10231626.5 27134.20035 1.29 >> testCompareLEMaskNotInt ops/s 1675812.113 1340.969885 2353255.341 1452.4522 1.4 >> testCompareLEMaskNotLong ops/s 848862.8036 6564.841731 1177763.623 539.290106 1.38 >> testCompareLEMaskNotShort ops/s 3324951.54 2380.29473 4712116.251 1544.559684 1.41 >> testCompareLTMaskNotByte ops/s 7910390.844 2630.861436 10239567.69 6487.441672 1.29 >> testCompareLTMaskNotInt ops/s 16721... > > erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 17 additional commits since the last revision: > > - Align indentation > - Merge branch 'master' into JDK-8354242 > - Address more comments > > ATT. > - Merge branch 'master' into JDK-8354242 > - Support negating unsigned comparison for BoolTest::mask > > Added a static method `negate_mask(mask btm)` into BoolTest class to > negate both signed and unsigned comparison. 
> - Addressed some review comments > - Merge branch 'master' into JDK-8354242 > - Refactor the JTReg tests for compare.xor(maskAll) > > Also made a bit change to support pattern `VectorMask.fromLong()`. > - Merge branch 'master' into JDK-8354242 > - Refactor code > > Add a new function XorVNode::Ideal_XorV_VectorMaskCmp to do this > optimization, making the code more modular. > - ... and 7 more: https://git.openjdk.org/jdk/compare/f1740382...db78dc43 src/hotspot/share/opto/vectornode.cpp line 2241: > 2239: in1->outcnt() != 1 || > 2240: !(in1->as_VectorMaskCmp())->predicate_can_be_negated() || > 2241: !VectorNode::is_all_ones_vector(in2)) { Suggestion: !in1->as_VectorMaskCmp()->predicate_can_be_negated() || !VectorNode::is_all_ones_vector(in2)) { Remove the indentation again, and the superfluous brackets too ;) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2192263236 From epeter at openjdk.org Tue Jul 8 11:45:52 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 8 Jul 2025 11:45:52 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v9] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 10:21:03 GMT, erifan wrote: >>> predicate cannot be negated AND the vector is all ones. Can you explain this condition? >> >> Ok, I'll add a comment for it. >> >>> Why do you guard against VectorNode::is_all_ones_vector(in2) at all? >> >> Because one of the nodes in the supported patterns by this PR needs to be `MaskAll` or `Replicate`. And this function `VectorNode::is_all_ones_vector` just meets our check for `MaskAll` and `Replicate`. Actually I don't quite understand your question. I have two understandings: >> 1. Not all nodes that `VectorNode::is_all_ones_vector` returns true are `MaskAll` or `Replicate`, but other nodes that do not meet the conditions. >> 2. Here, it does not need to be a vector with every bit 1, it only needs to be an `all true` mask. >> >> Which one do you mean? Or something else? Thanks~ > > The purpose of this PR is optimizing the following kinds of patterns: > > XXXVector va, vb; > va.compare(EQ, vb).not() > > And the generated IR of `va.compare(EQ, vb).not()` is `(XorVMask (VectorMaskCmp va vb EQ) (MaskAll -1))`. On platforms like aarch64 NEON, `MaskAll` is `Replicate`. And `MaskAll` and `Replicate` are both all ones vectors, so we do this check `VectorNode::is_all_ones_vector(in2)` Oh wow, my bad. I misunderstood the brackets! Instead of: !(in1->as_VectorMaskCmp())->predicate_can_be_negated() || !VectorNode::is_all_ones_vector(in2)) { I read: !(in1->as_VectorMaskCmp()->predicate_can_be_negated() || !VectorNode::is_all_ones_vector(in2))) { That confused me a lot... absolutely my bad. Well actually then my indentation suggestion was terrible! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2192261081 From epeter at openjdk.org Tue Jul 8 11:45:52 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Tue, 8 Jul 2025 11:45:52 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v9] In-Reply-To: References: Message-ID: <2UzxnawLUtlwIr5aaEdTfn4OEMt_z1HTfAaDBHCeZFU=.a70d1360-574a-4ca9-adae-7dec030ed2b7@github.com> On Tue, 8 Jul 2025 11:41:01 GMT, Emanuel Peter wrote: >> The purpose of this PR is optimizing the following kinds of patterns: >> >> XXXVector va, vb; >> va.compare(EQ, vb).not() >> >> And the generated IR of `va.compare(EQ, vb).not()` is `(XorVMask (VectorMaskCmp va vb EQ) (MaskAll -1))`. 
On platforms like aarch64 NEON, `MaskAll` is `Replicate`. And `MaskAll` and `Replicate` are both all ones vectors, so we do this check `VectorNode::is_all_ones_vector(in2)` > > Oh wow, my bad. I misunderstood the brackets! > > Instead of: > > !(in1->as_VectorMaskCmp())->predicate_can_be_negated() || > !VectorNode::is_all_ones_vector(in2)) { > > I read: > > !(in1->as_VectorMaskCmp()->predicate_can_be_negated() || > !VectorNode::is_all_ones_vector(in2))) { > > That confused me a lot... absolutely my bad. > > Well actually then my indentation suggestion was terrible! I made a new suggestion below. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2192263852 From dbriemann at openjdk.org Tue Jul 8 11:49:39 2025 From: dbriemann at openjdk.org (David Briemann) Date: Tue, 8 Jul 2025 11:49:39 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v6] In-Reply-To: References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: On Tue, 8 Jul 2025 11:29:20 GMT, Richard Reingruber wrote: >> David Briemann has updated the pull request incrementally with one additional commit since the last revision: >> >> rename instruction, add extra predicate cond for type int > > src/hotspot/cpu/ppc/ppc.ad line 13601: > >> 13599: instruct vneg4I_reg(vecX dst, vecX src) %{ >> 13600: match(Set dst (NegVI src)); >> 13601: predicate(PowerArchitecturePPC64 >= 9 && Matcher::vector_element_basic_type(n) == T_INT); > > Why not also for `T_LONG` (using `vnegd`)? Because we have made the experience that the vector instructions for longs are very slow on PPC. So far all of them I tried or implemented were slower than the non-vectorized alternative. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26115#discussion_r2192276433 From duke at openjdk.org Tue Jul 8 11:52:49 2025 From: duke at openjdk.org (Guanqiang Han) Date: Tue, 8 Jul 2025 11:52:49 GMT Subject: Integrated: 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache In-Reply-To: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> References: <4Kb1CzIxoBR4DXR9htBr3NINCgUup9coKCNFurAi93c=.253f5490-2263-4b3d-b921-2737ead6bb0a@github.com> Message-ID: On Thu, 3 Jul 2025 11:29:02 GMT, Guanqiang Han wrote: > The flag StartAggressiveSweepingAt triggers aggressive code cache sweeping based on the percentage of free space in the entire code cache. The previous description referenced segmented vs non-segmented code cache, which is confusing and does not reflect the current implementation. > > This patch updates the flag description to clearly state that the threshold is based on the total code cache free percentage, regardless of segmentation. This pull request has now been integrated. 
Changeset: 27e6a4d2 Author: han gq Committer: Evgeny Astigeevich URL: https://git.openjdk.org/jdk/commit/27e6a4d2f7a4bdd12408e518e86aeb623f1c41bc Stats: 3 lines in 1 file changed: 0 ins; 2 del; 1 mod 8344548: Incorrect StartAggressiveSweepingAt doc for segmented code cache Reviewed-by: kvn, eastigeevich ------------- PR: https://git.openjdk.org/jdk/pull/26114 From mbaesken at openjdk.org Tue Jul 8 11:53:41 2025 From: mbaesken at openjdk.org (Matthias Baesken) Date: Tue, 8 Jul 2025 11:53:41 GMT Subject: RFR: 8361040: compiler/codegen/TestRedundantLea.java#StringInflate fails with failed IR rules In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 14:44:03 GMT, Manuel H?ssig wrote: > `TestRedundantLea.java#StringInflate` failed on Alpine Linux because fewer `DecodeHeapOop_not_null`s than expected are generated even though the expected reduction is still present. This PR fixes this. > > Unfortunately, this fix makes the test less precise. I filed [JDK-8361045](https://bugs.openjdk.org/browse/JDK-8361045) to fix this when the IR-framework allows for it. > > Testing: > - [x] Github Actions > - [x] tier1, tier2 plus Oracle internal testing > - [x] `TestRedundantLea.java` on Alpine Linux Marked as reviewed by mbaesken (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26046#pullrequestreview-2997284362 From rrich at openjdk.org Tue Jul 8 11:54:44 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 8 Jul 2025 11:54:44 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v6] In-Reply-To: References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: On Tue, 8 Jul 2025 11:47:28 GMT, David Briemann wrote: >> src/hotspot/cpu/ppc/ppc.ad line 13601: >> >>> 13599: instruct vneg4I_reg(vecX dst, vecX src) %{ >>> 13600: match(Set dst (NegVI src)); >>> 13601: predicate(PowerArchitecturePPC64 >= 9 && Matcher::vector_element_basic_type(n) == T_INT); >> >> Why not also for `T_LONG` (using `vnegd`)? > > Because we have made the experience that the vector instructions for longs are very slow on PPC. So far all of them I tried or implemented were slower than the non-vectorized alternative. Ok I remember that. Why have you implemented `VMINU` and `VMAXU` for `T_LONG` then? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26115#discussion_r2192288024 From eastigeevich at openjdk.org Tue Jul 8 12:06:48 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 8 Jul 2025 12:06:48 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v4] In-Reply-To: <3-UMhQGbb3psk7_pn0BC1SJNrKtOfZWphO_V9d9Bqz8=.09589b1a-fb0d-4150-95f4-7565d32ed7b1@github.com> References: <3-UMhQGbb3psk7_pn0BC1SJNrKtOfZWphO_V9d9Bqz8=.09589b1a-fb0d-4150-95f4-7565d32ed7b1@github.com> Message-ID: On Tue, 8 Jul 2025 08:46:26 GMT, Andrew Haley wrote: >> Erm, I don't see why? > >> Erm, I don't see why? > > To make it unique? I don't see why you don't see that we need to ensure that the string is unique. @theRealAph, are you okay with the latest variant Aleksey proposes? 
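(Editorial aside, not part of the review thread above: a minimal Java sketch of how a checker could consume the bracketed ";; spin_wait {" ... ";; }" form shown earlier in this digest. The marker strings are taken from Aleksey's example; the method and its matching logic are assumptions for illustration only, not the actual TestOnSpinWaitAArch64.java code.)

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpinWaitBlockScan {
    // Counts the lines emitted between ";; spin_wait {" and ";; }" in a
    // PrintAssembly dump; returns 0 when no such block is present.
    static int countSpinWaitLines(String disassembly) {
        Matcher m = Pattern.compile(";; spin_wait \\{(.*?);; \\}", Pattern.DOTALL)
                           .matcher(disassembly);
        if (!m.find()) {
            return 0;
        }
        String body = m.group(1).trim();
        return body.isEmpty() ? 0 : body.split("\\R").length;
    }
}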
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2192323717 From rrich at openjdk.org Tue Jul 8 12:06:48 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 8 Jul 2025 12:06:48 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v6] In-Reply-To: References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: On Mon, 7 Jul 2025 07:30:19 GMT, David Briemann wrote: >> Implement more nodes for ppc that exist on other platforms. > > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > rename instruction, add extra predicate cond for type int Looks good. Cheers, Richard. ------------- Marked as reviewed by rrich (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26115#pullrequestreview-2997336475 From dbriemann at openjdk.org Tue Jul 8 12:06:49 2025 From: dbriemann at openjdk.org (David Briemann) Date: Tue, 8 Jul 2025 12:06:49 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v6] In-Reply-To: References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: On Mon, 7 Jul 2025 07:30:19 GMT, David Briemann wrote: >> Implement more nodes for ppc that exist on other platforms. > > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > rename instruction, add extra predicate cond for type int Thank you both for your reviews! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26115#issuecomment-3048646363 From duke at openjdk.org Tue Jul 8 12:06:49 2025 From: duke at openjdk.org (duke) Date: Tue, 8 Jul 2025 12:06:49 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v6] In-Reply-To: References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: <7MQio5zK35lI1RvCGMxNQ-Zl5z9v4IhHOklgMAijfr4=.4b056bf4-74a3-43e2-8c2d-21a69d0a980c@github.com> On Mon, 7 Jul 2025 07:30:19 GMT, David Briemann wrote: >> Implement more nodes for ppc that exist on other platforms. > > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > rename instruction, add extra predicate cond for type int @dbriemann Your change (at version b65400a92dee5db285065e415067c0a33885b1b7) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26115#issuecomment-3048649313 From dbriemann at openjdk.org Tue Jul 8 12:06:50 2025 From: dbriemann at openjdk.org (David Briemann) Date: Tue, 8 Jul 2025 12:06:50 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v6] In-Reply-To: References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: On Tue, 8 Jul 2025 11:51:36 GMT, Richard Reingruber wrote: >> Because we have made the experience that the vector instructions for longs are very slow on PPC. So far all of them I tried or implemented were slower than the non-vectorized alternative. > > Ok I remember that. Why have you implemented `VMINU` and `VMAXU` for `T_LONG` then? I tried several nodes/instructions also for longs but most of them performed worse and I removed them. `VMINU` and `VMAXU` did at least perform the same or slightly better than the non-vectorized variant. 
So @TheRealMDoerr and I decided to keep them. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26115#discussion_r2192314419 From rrich at openjdk.org Tue Jul 8 12:06:50 2025 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 8 Jul 2025 12:06:50 GMT Subject: RFR: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI [v6] In-Reply-To: References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: On Tue, 8 Jul 2025 12:00:18 GMT, David Briemann wrote: >> Ok I remember that. Why have you implemented `VMINU` and `VMAXU` for `T_LONG` then? > > I tried several nodes/instructions also for longs but most of them performed worse and I removed them. > `VMINU` and `VMAXU` did at least perform the same or slightly better than the non-vectorized variant. So @TheRealMDoerr and I decided to keep them. I see. Thanks! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26115#discussion_r2192320376 From mbaesken at openjdk.org Tue Jul 8 12:34:38 2025 From: mbaesken at openjdk.org (Matthias Baesken) Date: Tue, 8 Jul 2025 12:34:38 GMT Subject: RFR: 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() In-Reply-To: References: Message-ID: <7-NO8JQHTXeQkWZ3jeGmH99czVhlluiU5xvhJSyMve4=.5d6424e0-c2af-45fe-89d9-599c1bd4a2fe@github.com> On Mon, 7 Jul 2025 23:48:03 GMT, Vladimir Kozlov wrote: > `CodeBlob::relocation_size()` is calculated as `(_mutable_data + _relocation_size - _mutable_data)`. `CodeBlob::relocation_size()` is called during AOT code loading before we allocate space for mutable data (the size is used to find how big space should be allocated). The expression at that point is `(NULL + _relocation_size - NULL)` which returns correct result. But we should just return `_relocation_size` which is recorded anyway in AOT data. > > Added missing `_mutable_data = blob_end();` initialization when we restore AOT code blob. > > Fixed embarrassing typo in asserts. > > Tested: tier1-6,8,10,xcomp,stress Marked as reviewed by mbaesken (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26175#pullrequestreview-2997443455 From mbaesken at openjdk.org Tue Jul 8 12:34:38 2025 From: mbaesken at openjdk.org (Matthias Baesken) Date: Tue, 8 Jul 2025 12:34:38 GMT Subject: RFR: 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 23:53:00 GMT, Vladimir Kozlov wrote: > please verify that is passing ubsan testing now. Hi Vladimir, I do not see the mentioned ubsan error (codeBlob.hpp:235:97: runtime error: applying non-zero offset 16 to null pointer) after your patch was added! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26175#issuecomment-3048761849 From mhaessig at openjdk.org Tue Jul 8 12:47:45 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 8 Jul 2025 12:47:45 GMT Subject: RFR: 8361040: compiler/codegen/TestRedundantLea.java#StringInflate fails with failed IR rules In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 12:42:05 GMT, Matthias Baesken wrote: >> `TestRedundantLea.java#StringInflate` failed on Alpine Linux because fewer `DecodeHeapOop_not_null`s than expected are generated even though the expected reduction is still present. This PR fixes this. >> >> Unfortunately, this fix makes the test less precise. 
I filed [JDK-8361045](https://bugs.openjdk.org/browse/JDK-8361045) to fix this when the IR-framework allows for it. >> >> Testing: >> - [x] Github Actions >> - [x] tier1, tier2 plus Oracle internal testing >> - [x] `TestRedundantLea.java` on Alpine Linux > > With your patch included, the issue is gone on our Linux Alpine test machine. Thank you for your reviews @MBaesken and @chhagedorn! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26046#issuecomment-3048800987 From mhaessig at openjdk.org Tue Jul 8 12:47:46 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 8 Jul 2025 12:47:46 GMT Subject: Integrated: 8361040: compiler/codegen/TestRedundantLea.java#StringInflate fails with failed IR rules In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 14:44:03 GMT, Manuel H?ssig wrote: > `TestRedundantLea.java#StringInflate` failed on Alpine Linux because fewer `DecodeHeapOop_not_null`s than expected are generated even though the expected reduction is still present. This PR fixes this. > > Unfortunately, this fix makes the test less precise. I filed [JDK-8361045](https://bugs.openjdk.org/browse/JDK-8361045) to fix this when the IR-framework allows for it. > > Testing: > - [x] Github Actions > - [x] tier1, tier2 plus Oracle internal testing > - [x] `TestRedundantLea.java` on Alpine Linux This pull request has now been integrated. Changeset: 2349304b Author: Manuel H?ssig URL: https://git.openjdk.org/jdk/commit/2349304bb108adb0d5d095e8212d36d99132b6bb Stats: 12 lines in 1 file changed: 2 ins; 6 del; 4 mod 8361040: compiler/codegen/TestRedundantLea.java#StringInflate fails with failed IR rules Co-authored-by: Matthias Baesken Reviewed-by: chagedorn, mbaesken ------------- PR: https://git.openjdk.org/jdk/pull/26046 From dbriemann at openjdk.org Tue Jul 8 13:01:58 2025 From: dbriemann at openjdk.org (David Briemann) Date: Tue, 8 Jul 2025 13:01:58 GMT Subject: Integrated: 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI In-Reply-To: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> References: <37e56JLghJ5HzAPPnkYlyhlvFbgpBURRO5zpHMg8_B8=.8dfc8729-ea36-4643-bbbf-e6330fbf11c7@github.com> Message-ID: <9OhN2EpxwEEHhgRs44pPmntXwtUBxipTH0CFbXOAX70=.4a712174-ec7f-4d1e-9a22-38260daf8950@github.com> On Thu, 3 Jul 2025 12:30:51 GMT, David Briemann wrote: > Implement more nodes for ppc that exist on other platforms. This pull request has now been integrated. Changeset: 5c67e3d6 Author: David Briemann Committer: Martin Doerr URL: https://git.openjdk.org/jdk/commit/5c67e3d6e573e5e1fc23f16b61e51fda7b3dd307 Stats: 107 lines in 4 files changed: 106 ins; 0 del; 1 mod 8361353: [PPC64] C2: Add nodes UMulHiL, CmpUL3, UMinV, UMaxV, NegVI Reviewed-by: mdoerr, rrich ------------- PR: https://git.openjdk.org/jdk/pull/26115 From mhaessig at openjdk.org Tue Jul 8 13:03:39 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 8 Jul 2025 13:03:39 GMT Subject: RFR: 8360175: C2 crash: assert(edge_from_to(prior_use, n)) failed: before block local scheduling In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 08:02:14 GMT, Christian Hagedorn wrote: >> The triggered assert is part of the schedule verification code that runs just before machine code is emitted. The debug output showed that a `leaPCompressedOopOffset` node was causing the assert, which suggested the peephole optimization introduced in #25471 as the cause. The failure proved quite difficult to reproduce. 
It failed more often on Windows and required `-XX:+UseKNLSetting` (forces code generation for Intel's Knights Landing platform), which forces `-XX:+OptoScheduling`. >> >> The root cause is a subtle bug in the rewiring of the base edge of `leaP*` nodes in the `remove_redundant_lea` peephole. When the peephole removed a `decodeHeapOop_not_null` including a spill, it did not set the base edge of the `leaP*` node to the same node as the address edge, which is the intent of the peephole, but to the parent node of the spill. That is not catastrophic in most cases, but might reference another register slot, which causes this assert. Concretely, we see the following graph >> >> MemToRegSpillCopy >> | | >> | MemToRegSpillCopy >> | | >> DefinitionSpillCopy | >> | | >> | decodeHeapOop_not_null >> | | >> leaPCompressedHeapOop >> >> gets rewired to >> >> MemToRegSpillCopy >> | | >> DefinitionSpillCopy | >> | | >> leaPCompressedHeapOop >> >> instead of >> >> MemToRegSpillCopy >> | >> DefinitionSpillCopy >> / \ >> leaPCompressedHeapOop >> >> >> This PR fixes this by always setting the base edge of the `leaP*` node to the same node as the address edge. Unfortunately, I was not able to construct a regression test because of the difficulty of reproducing the bug. >> >> # Testing >> >> - [ ] Github Actions >> - [x] tier1,tier2 plus internal testing on all Oracle supported platforms >> - [x] tier3,tier4,tier5 plus internal testing on Linux and Windows x64 >> - [ ] Runthese8H on `windows-x64-debug` (test that reliably produced the failure addressed in this PR) > > src/hotspot/cpu/x86/peephole_x86_64.cpp line 349: > >> 347: Node* dependant_lea = decode->fast_out(i); >> 348: if (dependant_lea->is_Mach() && dependant_lea->as_Mach()->ideal_Opcode() == Op_AddP) { >> 349: dependant_lea->set_req(AddPNode::Base, lea_derived_oop->in(AddPNode::Address)); > > The fix looks reasonable to me, too. No worries about the regression test, thanks for trying! A small question: Why don't we use `lea_address`? > > Another thing I've noticed while browsing the code: `ra_` and `new_root` seem to be unused and could be removed (could probably also be squeezed into this PR here instead of creating a new issue just for that). We cannot use `lea_address` because in case of a spill that also gets moved up one node to check if the lea and the decode point to the same grandparent. > Another thing I've noticed while browsing the code: ra_ and new_root seem to be unused and could be removed. These arguments come from the machinery that calls this out of the matcher. I am not too familiar with it, so my working assumption so far has been to keep the signature the same as the other peepholes, which seems logical since it is called by generated code. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26157#discussion_r2192462595 From mchevalier at openjdk.org Tue Jul 8 13:38:26 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 8 Jul 2025 13:38:26 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v3] In-Reply-To: References: Message-ID: > When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 > > This is enforced by restoring the old state, like in > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 > > That is good, but not sufficient.
First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: > > ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) > > > Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. > > Another situation is somewhat worse, when happening during parsing. It can lead to such cases: > > ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) > > The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? > > This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 > > And here there is the challenge: > - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) > - we can't really change the pointer, just the content > -... Marc Chevalier has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains nine commits: - Merge remote-tracking branch 'origin/master' into fix/too-many-ctrl-successor-after-intrinsic - Address comments - Remove useless loop - whoops Forgot to remove a bit, and restore sp - Urgh - Adapt test - Re-try - Fix test - Trying something ------------- Changes: https://git.openjdk.org/jdk/pull/25936/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=25936&range=02 Stats: 349 lines in 7 files changed: 230 ins; 50 del; 69 mod Patch: https://git.openjdk.org/jdk/pull/25936.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25936/head:pull/25936 PR: https://git.openjdk.org/jdk/pull/25936 From mchevalier at openjdk.org Tue Jul 8 13:38:26 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 8 Jul 2025 13:38:26 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v2] In-Reply-To: References: <1cFRkcs5JmgnbWEIaEoT8I9RiUtNxgKieAdkSB2Fgmc=.1d97b5c4-b6ef-43c6-b721-1e52eee19d3a@github.com> Message-ID: On Mon, 7 Jul 2025 07:28:37 GMT, Tobias Hartmann wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove useless loop > > src/hotspot/share/opto/library_call.cpp line 2376: > >> 2374: state.map = clone_map(); >> 2375: for (DUIterator_Fast imax, i = control()->fast_outs(imax); i < imax; i++) { >> 2376: Node* out = control()->fast_out(i); > > Could we have a similar issue with non-control users? For example, couldn't we also have stray memory users after bailout? We could, but it should be relatively harmless. Control is more annoying to have more than one successor. > src/hotspot/share/opto/library_call.cpp line 2393: > >> 2391: Node* out = control()->fast_out(i); >> 2392: if (out->is_CFG() && out->in(0) == control() && out != map() && !state.ctrl_succ.member(out)) { >> 2393: out->set_req(0, C->top()); > > Could `out` already be in the GVN hash ("remove node from hash table before modifying it")? I've added it, since indeed, it could. As far as I understand, I was just lucky in the situation where it happens, but there is no reason it would always hold. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2192553330 PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2192556138 From mchevalier at openjdk.org Tue Jul 8 13:44:45 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 8 Jul 2025 13:44:45 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v2] In-Reply-To: References: <1cFRkcs5JmgnbWEIaEoT8I9RiUtNxgKieAdkSB2Fgmc=.1d97b5c4-b6ef-43c6-b721-1e52eee19d3a@github.com> Message-ID: <053aI6bIhjDWZgWQSx_KRGcDzAxg2DF86WNVBco2DkA=.7ccf19bf-3fa7-471d-929c-d20f01006235@github.com> On Mon, 7 Jul 2025 07:18:11 GMT, Tobias Hartmann wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove useless loop > > src/hotspot/share/opto/library_call.cpp line 1732: > >> 1730: return false; >> 1731: } >> 1732: destruct_map_clone(old_state.map); > > I think `destruct_map_clone` could be refactored to take a `SavedState`. I've made an override of destruct_map_clone taking a SavedState (and delegating to the existing one) rather than changing the existing one for the following reasons: - `destruct_map_clone` is in `GraphKit` so doesn't know about `SavedState`. Either I'd need to bring `SavedState` to the base class (useless visibility) (or something with forward declarations...) 
or move `destruct_map_clone` to the derived class `LibraryCallKit` - `destruct_map_clone` makes sense to have next to `clone_map`. But `clone_map` is used also in `GraphKit`, so not possible to move to the derived class - The existing `destruct_map_clone` doesn't need a `SavedState` and makes sense without. Requiring more information just make it less usable, but it's fine to have a thin adapter that one can by-pass if one has a `SafePointNode` and not a whole `SavedState`. > test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java line 104: > >> 102: public static final String START = "(\\d+(\\s){2}("; >> 103: public static final String MID = ".*)+(\\s){2}===.*"; >> 104: public static final String END = ")"; > > I don't like exposing these outside the IR framework but then again I don't really have an idea on how to check the "graph should not have both nodes" invariant. Maybe we should extend the `counts` annotation to support something like `@IR(counts = {IRNode.CallStaticJava, IRNode.OpaqueNotNull, "<= 1"} [...]`? As discussed, I now have a JBS issue about this, and I factored the duplicated regex into a single final String with name and comment. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2192570629 PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2192574207 From mchevalier at openjdk.org Tue Jul 8 13:53:25 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 8 Jul 2025 13:53:25 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v4] In-Reply-To: References: Message-ID: > When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 > > This is enforced by restoring the old state, like in > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 > > That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: > > ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) > > > Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. > > Another situation is somewhat worse, when happening during parsing. It can lead to such cases: > > ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) > > The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? > > This is because there is more than one control, or one map. 
In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 > > And here there is the challenge: > - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) > - we can't really change the pointer, just the content > -... Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: Somehow intellij doesn't remove empty indented line ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25936/files - new: https://git.openjdk.org/jdk/pull/25936/files/e57cf47f..09b24ec4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25936&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25936&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/25936.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25936/head:pull/25936 PR: https://git.openjdk.org/jdk/pull/25936 From mchevalier at openjdk.org Tue Jul 8 13:56:39 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 8 Jul 2025 13:56:39 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v4] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 13:53:25 GMT, Marc Chevalier wrote: >> When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 >> >> This is enforced by restoring the old state, like in >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 >> >> That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: >> >> ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) >> >> >> Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. >> >> Another situation is somewhat worse, when happening during parsing. It can lead to such cases: >> >> ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) >> >> The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? >> >> This is because there is more than one control, or one map. 
In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 >> >> And here there is the challenge: >> - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) >> ... > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Somehow intellij doesn't remove empty indented line I've addressed the comments, ready for a second pass! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25936#issuecomment-3049067672 From mchevalier at openjdk.org Tue Jul 8 14:26:28 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 8 Jul 2025 14:26:28 GMT Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v4] In-Reply-To: References: Message-ID: > A first part toward a better support of pure functions, but this time, with guidance from @iwanowww. > > ## Pure Functions > > Pure functions (considered here) are functions that have no side effects, no effect on the control flow (no exception or such), cannot deopt etc.. It's really a function that you can execute anywhere, with whichever arguments without effect other than wasting time. Integer division is not pure as dividing by zero is throwing. But many floating point functions will just return `NaN` or `+/-infinity` in problematic cases. > > ## Scope > > We are not going all powerful for now! It's mostly about identifying some pure functions and being able to remove them if the result is unused. Some other things are not part of this PR, on purpose. Especially, this PR doesn't propose a way to move pure calls around. The reason is that pure calls are later expanded into regular calls, which require a control input. To be able to do the expansion, we just keep the control in the pure call as well. > > ## Implementation Overview > > We created here some new node kind for pure calls, inheriting leaf calls, that are expanded into regular leaf calls during final graph reshaping. The possibility to support pure call directly in AD file is left open. > > This PR also introduces `TupleNode` (largely based on an original idea/implem of @iwanowww), that just tie multiple input together and play well with `ProjNode`: the n-th projection of a `TupleNode` is the n-th input of the tuple. This is a convenient way to skip and remove nodes from the graph while delegating the difficulty of the surgery to the trusted IGVN's implementation. 
> > Thanks, > Marc Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: typo ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25760/files - new: https://git.openjdk.org/jdk/pull/25760/files/7028b561..7f18c9f6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25760&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25760&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/25760.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25760/head:pull/25760 PR: https://git.openjdk.org/jdk/pull/25760 From mchevalier at openjdk.org Tue Jul 8 14:26:29 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 8 Jul 2025 14:26:29 GMT Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v3] In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 07:53:24 GMT, Tobias Hartmann wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> mostly comments > > src/hotspot/share/opto/divnode.cpp line 1522: > >> 1520: Node* super = CallLeafPureNode::Ideal(phase, can_reshape); >> 1521: if (super != nullptr) { >> 1522: return super; > > Can't we just do `return CallLeafPureNode::Ideal(phase, can_reshape);` at the end of `ModFNode::Ideal` instead of `return nullptr`? That's what we usually do in C2, for example in `CallStaticJavaNode::Ideal` -> `CallNode::Ideal`. Feels more natural to me and would avoid the `super != nullptr` check. Also for the other `Ideal` methods that you modified. We could but it's not that direct. `ModFNode::Ideal` has 6 `return`s (without mine): - 3 are `return replace_with_con(...);` which in their turn return `nullptr` but after making changes in the graph. - 2 are `return nullptr;` - 1 is actually returning a node. And especially the final one is return replace_with_con(igvn, TypeF::make(jfloat_cast(xr))); If we change `replace_with_con` to actually return a `TupleNode` to do the job, we still have 2 places where to call the base class' `Ideal`. So I'm not sure how much better it would be to duplicate the call. It also adds a maintenance burden: if one adds another case where we don't want to make changes, one needs to add another call to `CallLeafPureNode::Ideal`. I think it's because of the structure of this function: rather than selecting cases where we want to do something and reaching the end with only the leftover cases, we select the cases we don't want to continue with, and we return early, making more cases where we should call the super method. I'll try something. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25760#discussion_r2192664653 From mchevalier at openjdk.org Tue Jul 8 14:26:29 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 8 Jul 2025 14:26:29 GMT Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v4] In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 08:30:00 GMT, Tobias Hartmann wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> typo > > src/hotspot/share/opto/divnode.cpp line 1528: > >> 1526: bool not_dead = proj_out_or_null(TypeFunc::Control) != nullptr; >> 1527: if (result_is_unused && not_dead) { >> 1528: return replace_with_con(igvn, TypeF::make(0.)); > > Can we replace all the other usages of `ModFloatingNode::replace_with_con` by `TupleNode` and get rid of that method? 
Yes, that sounds like a good idea. One still need to build the right `TupleNode`, which takes a bit of code. So in detail, I'd rather replace `replace_with_con` with a function returning the right `TupleNode` to be as concise on the call-site. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25760#discussion_r2192669986 From thartmann at openjdk.org Tue Jul 8 15:13:40 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 8 Jul 2025 15:13:40 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v2] In-Reply-To: References: <1cFRkcs5JmgnbWEIaEoT8I9RiUtNxgKieAdkSB2Fgmc=.1d97b5c4-b6ef-43c6-b721-1e52eee19d3a@github.com> Message-ID: On Tue, 8 Jul 2025 13:35:27 GMT, Marc Chevalier wrote: >> src/hotspot/share/opto/library_call.cpp line 2393: >> >>> 2391: Node* out = control()->fast_out(i); >>> 2392: if (out->is_CFG() && out->in(0) == control() && out != map() && !state.ctrl_succ.member(out)) { >>> 2393: out->set_req(0, C->top()); >> >> Could `out` already be in the GVN hash ("remove node from hash table before modifying it")? > > I've added it, since indeed, it could. As far as I understand, I was just lucky in the situation where it happens, but there is no reason it would always hold. I think you also need to re-insert it via `hash_find_insert`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2192795199 From thartmann at openjdk.org Tue Jul 8 15:16:42 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 8 Jul 2025 15:16:42 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v4] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 13:53:25 GMT, Marc Chevalier wrote: >> When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 >> >> This is enforced by restoring the old state, like in >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 >> >> That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: >> >> ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) >> >> >> Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. >> >> Another situation is somewhat worse, when happening during parsing. It can lead to such cases: >> >> ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) >> >> The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? 
>> >> This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 >> >> And here there is the challenge: >> - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) >> ... > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Somehow intellij doesn't remove empty indented line src/hotspot/share/opto/library_call.cpp line 2507: > 2505: > 2506: if (alias_type->adr_type() == TypeInstPtr::KLASS || > 2507: alias_type->adr_type() == TypeAryPtr::RANGE) { The indentation is off here. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2192803477 From thartmann at openjdk.org Tue Jul 8 15:16:44 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 8 Jul 2025 15:16:44 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v2] In-Reply-To: <053aI6bIhjDWZgWQSx_KRGcDzAxg2DF86WNVBco2DkA=.7ccf19bf-3fa7-471d-929c-d20f01006235@github.com> References: <1cFRkcs5JmgnbWEIaEoT8I9RiUtNxgKieAdkSB2Fgmc=.1d97b5c4-b6ef-43c6-b721-1e52eee19d3a@github.com> <053aI6bIhjDWZgWQSx_KRGcDzAxg2DF86WNVBco2DkA=.7ccf19bf-3fa7-471d-929c-d20f01006235@github.com> Message-ID: <5ezM7iZKoLDPycAuRWgVXAIOQ2d9ukvK28kxE7dklDE=.1740e1e7-5482-4239-99bf-75edecd41e6a@github.com> On Tue, 8 Jul 2025 13:42:14 GMT, Marc Chevalier wrote: >> test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java line 104: >> >>> 102: public static final String START = "(\\d+(\\s){2}("; >>> 103: public static final String MID = ".*)+(\\s){2}===.*"; >>> 104: public static final String END = ")"; >> >> I don't like exposing these outside the IR framework but then again I don't really have an idea on how to check the "graph should not have both nodes" invariant. Maybe we should extend the `counts` annotation to support something like `@IR(counts = {IRNode.CallStaticJava, IRNode.OpaqueNotNull, "<= 1"} [...]`? > > As discussed, I now have a JBS issue about this, and I factored the duplicated regex into a single final String with name and comment. Thanks, that looks good! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2192804630 From thartmann at openjdk.org Tue Jul 8 15:19:43 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 8 Jul 2025 15:19:43 GMT Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v3] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 14:19:23 GMT, Marc Chevalier wrote: >> src/hotspot/share/opto/divnode.cpp line 1522: >> >>> 1520: Node* super = CallLeafPureNode::Ideal(phase, can_reshape); >>> 1521: if (super != nullptr) { >>> 1522: return super; >> >> Can't we just do `return CallLeafPureNode::Ideal(phase, can_reshape);` at the end of `ModFNode::Ideal` instead of `return nullptr`? That's what we usually do in C2, for example in `CallStaticJavaNode::Ideal` -> `CallNode::Ideal`. Feels more natural to me and would avoid the `super != nullptr` check. Also for the other `Ideal` methods that you modified. 
> > We could but it's not that direct. `ModFNode::Ideal` has 6 `return`s (without mine): > - 3 are `return replace_with_con(...);` which in their turn return `nullptr` but after making changes in the graph. > - 2 are `return nullptr;` > - 1 is actually returning a node. > And especially the final one is > > return replace_with_con(igvn, TypeF::make(jfloat_cast(xr))); > > If we change `replace_with_con` to actually return a `TupleNode` to do the job, we still have 2 places where to call the base class' `Ideal`. So I'm not sure how much better it would be to duplicate the call. It also adds a maintenance burden: if one adds another case where we don't want to make changes, one needs to add another call to `CallLeafPureNode::Ideal`. I think it's because of the structure of this function: rather than selecting cases where we want to do something and reaching the end with only the leftover cases, we select the cases we don't want to continue with, and we return early, making more cases where we should call the super method. > > I'll try something. Ah, makes sense. Feel free to leave as-is then. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25760#discussion_r2192809070 From thartmann at openjdk.org Tue Jul 8 15:19:44 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 8 Jul 2025 15:19:44 GMT Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v4] In-Reply-To: References: Message-ID: <7CvQqj41V2ysWMJACPif3jOCSF9xFypACLH9DiKYnc0=.aad8e7fe-56ba-44a3-bc31-187618af35d6@github.com> On Tue, 8 Jul 2025 14:21:20 GMT, Marc Chevalier wrote: >> src/hotspot/share/opto/divnode.cpp line 1528: >> >>> 1526: bool not_dead = proj_out_or_null(TypeFunc::Control) != nullptr; >>> 1527: if (result_is_unused && not_dead) { >>> 1528: return replace_with_con(igvn, TypeF::make(0.)); >> >> Can we replace all the other usages of `ModFloatingNode::replace_with_con` by `TupleNode` and get rid of that method? > > Yes, that sounds like a good idea. One still need to build the right `TupleNode`, which takes a bit of code. So in detail, I'd rather replace `replace_with_con` with a function returning the right `TupleNode` to be as concise on the call-site. Yes, that sounds good. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25760#discussion_r2192809839 From mchevalier at openjdk.org Tue Jul 8 15:28:39 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 8 Jul 2025 15:28:39 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v2] In-Reply-To: References: <1cFRkcs5JmgnbWEIaEoT8I9RiUtNxgKieAdkSB2Fgmc=.1d97b5c4-b6ef-43c6-b721-1e52eee19d3a@github.com> Message-ID: On Tue, 8 Jul 2025 15:11:09 GMT, Tobias Hartmann wrote: >> I've added it, since indeed, it could. As far as I understand, I was just lucky in the situation where it happens, but there is no reason it would always hold. > > I think you also need to re-insert it via `hash_find_insert`. // Some Ideal and other transforms delete --> modify --> insert values yes, indeed. 
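For readers following the delete -> modify -> re-insert point above: the same invariant can be shown with a small self-contained C++ toy (an unordered_map keyed by a node's opcode and inputs, not HotSpot's PhaseIterGVN). Mutating a node while it is still filed under its old structural hash makes later lookups and erasure unreliable, so the entry is removed first and re-inserted afterwards.

#include <cstdio>
#include <functional>
#include <unordered_map>
#include <vector>

// Toy "node": identified by an opcode and its inputs.
struct Node {
    int op;
    std::vector<Node*> in;
};

// Structural hash/equality over (op, inputs), like a value-numbering table.
struct NodeHash {
    size_t operator()(const Node* n) const {
        size_t h = std::hash<int>()(n->op);
        for (Node* i : n->in) h = h * 31 + std::hash<const void*>()(i);
        return h;
    }
};
struct NodeEq {
    bool operator()(const Node* a, const Node* b) const {
        return a->op == b->op && a->in == b->in;
    }
};

int main() {
    std::unordered_map<const Node*, int, NodeHash, NodeEq> vn_table;
    Node x{1, {}}, y{2, {}};
    Node add{10, {&x, &y}};
    vn_table[&add] = 42;      // the node is filed under its current (op, inputs)

    // Wrong: mutating 'add' here would leave it filed under a stale hash.
    // Right: delete -> modify -> re-insert, as in the review discussion.
    vn_table.erase(&add);
    add.in[1] = &x;
    vn_table[&add] = 42;

    std::printf("entries: %zu\n", vn_table.size());
    return 0;
}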
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2192831829 From eastigeevich at openjdk.org Tue Jul 8 15:40:53 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Tue, 8 Jul 2025 15:40:53 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v34] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 22:24:07 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: > > Update justification for skipping CallRelocation src/hotspot/share/code/nmethod.cpp line 1549: > 1547: #ifdef USE_TRAMPOLINE_STUB_FIX_OWNER > 1548: // Direct calls may no longer be in range and the use of a trampoline may now be required. > 1549: // Instead allow the trapoline relocations to update their owner and perform the necessary checks. > ... trapoline trampoline ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2192857245 From shade at openjdk.org Tue Jul 8 17:04:45 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 8 Jul 2025 17:04:45 GMT Subject: RFR: 8357473: Compilation spike leaves many CompileTasks in free list [v6] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 09:08:27 GMT, Aleksey Shipilev wrote: > Also realized the Atomic can be relaxed, since nothing rides on its memory consistency effects. I am retesting to make sure. Linux AArch64 server fastdebug `all` passes without new failures. Ready for re-review / more testing. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25409#issuecomment-3049680821 From dnsimon at openjdk.org Tue Jul 8 17:13:41 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 8 Jul 2025 17:13:41 GMT Subject: RFR: 8357689: Refactor JVMCI to enable replay compilation in Graal [v2] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 08:41:59 GMT, Andrej Pečimúth wrote: >> This PR introduces a few minor JVMCI refactorings to make replay compilation possible in the Graal compiler. > > Andrej Pečimúth has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > JVMCI refactorings to enable replay compilation in Graal. Looks good. ------------- Marked as reviewed by dnsimon (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/25433#pullrequestreview-2998482956 From shade at openjdk.org Tue Jul 8 17:17:41 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 8 Jul 2025 17:17:41 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 13:33:23 GMT, Aleksey Shipilev wrote: >> We routinely CTW 3rd party JARs to make sure our compilers work. By the nature of the JARs, they have dependencies on other JARs, and CTW runner frequently warns out with NCDFE. It does so very crudely, missing opportunities to compile the methods that _do not_ trigger NCDFEs. CTW should be made more tolerant to this. I think the normal "modules" CTW runs into the similar problem, but on a lesser scale, as we do not have a very hairy dependency graph within JDK. >> >> The CTW logs are also fairly noisy with full exception traces when NCDFE is semi-expected. This PR does _not_ print exception stack traces in these cases, only "NOTE"-s about it. This makes the log fairly clean and more understandable. >> >> Motivational scope improvement compiling a sample 3rd party JAR (cassandra-2.1.4.0.jar): >> >> >> Before: Done (2487 classes, 9866 methods, 24584 ms) >> After: Done (2487 classes, 10074 methods, 24150 ms) ; +2% more methods >> >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Just use printf directly Any other reviews needed/wanted? In particular, maybe @TobiHartmann wants to test it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26090#issuecomment-3049715940 From duke at openjdk.org Tue Jul 8 17:37:57 2025 From: duke at openjdk.org (Andrej Pecimuth) Date: Tue, 8 Jul 2025 17:37:57 GMT Subject: RFR: 8357689: Refactor JVMCI to enable replay compilation in Graal [v3] In-Reply-To: References: Message-ID: > This PR introduces a few minor JVMCI refactorings to make replay compilation possible in the Graal compiler. Andrej Pecimuth has updated the pull request incrementally with one additional commit since the last revision: Remove unnecessary public modifier. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25433/files - new: https://git.openjdk.org/jdk/pull/25433/files/c6c3bb62..1b845fa9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25433&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25433&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/25433.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25433/head:pull/25433 PR: https://git.openjdk.org/jdk/pull/25433 From dnsimon at openjdk.org Tue Jul 8 17:50:42 2025 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 8 Jul 2025 17:50:42 GMT Subject: RFR: 8357689: Refactor JVMCI to enable replay compilation in Graal [v3] In-Reply-To: References: Message-ID: <0ZpzR9oc9eKbKHiDI6XTpi5tN28FhHq4ywHrF4nxSzI=.314d1794-a540-45db-8ab1-26eeb65b77f5@github.com> On Tue, 8 Jul 2025 17:37:57 GMT, Andrej Pecimuth wrote: >> This PR introduces a few minor JVMCI refactorings to make replay compilation possible in the Graal compiler. > > Andrej Pecimuth has updated the pull request incrementally with one additional commit since the last revision: > > Remove unnecessary public modifier. Still good. ------------- Marked as reviewed by dnsimon (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/25433#pullrequestreview-2998590011 From tschatzl at openjdk.org Tue Jul 8 19:23:47 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Tue, 8 Jul 2025 19:23:47 GMT Subject: RFR: 8350621: Code cache stops scheduling GC In-Reply-To: References: Message-ID: <6XGOkpJ4gQxTjKwvm4VRqo-oqdNdAO4_yMdf9t4U7Tg=.e47311a8-9db6-402e-849e-10e2b1664ad9@github.com> On Sun, 16 Feb 2025 18:39:29 GMT, Alexandre Jacob wrote: > The purpose of this PR is to fix a bug where we can end up in a situation where the GC is not scheduled anymore by `CodeCache`. > > This situation is possible because the `_unloading_threshold_gc_requested` flag is set to `true` when triggering the GC and we expect the GC to call `CodeCache::on_gc_marking_cycle_finish` which in turn will call `CodeCache::update_cold_gc_count`, which will reset the flag `_unloading_threshold_gc_requested` allowing further GC scheduling. > > Unfortunately this can't work properly under certain circumstances. > For example, if using G1GC, calling `G1CollectedHeap::collect` does no give the guarantee that the GC will actually run as it can be already running (see [here](https://github.com/openjdk/jdk/blob/7d11418c820b46926a25907766d16083a4b349de/src/hotspot/share/gc/g1/g1CollectedHeap.cpp#L1763)). > > I have observed this behavior on JVM in version 21 that were migrated recently from java 17. > Those JVMs have some pressure on code cache and quite a large heap in comparison to allocation rate, which means that objects are mostly GC'd by young collections and full GC take a long time to happen. > > I have been able to reproduce this issue with ParallelGC and G1GC, and I imagine that other GC can be impacted as well. > > In order to reproduce this issue, I found a very simple and convenient way: > > > public class CodeCacheMain { > public static void main(String[] args) throws InterruptedException { > while (true) { > Thread.sleep(100); > } > } > } > > > Run this simple app with the following JVM flags: > > > -Xlog:gc*=info,codecache=info -Xmx512m -XX:ReservedCodeCacheSize=2496k -XX:StartAggressiveSweepingAt=15 > > > - 512m for the heap just to clarify the intent that we don't want to be bothered by a full GC > - low `ReservedCodeCacheSize` to put pressure on code cache quickly > - `StartAggressiveSweepingAt` can be set to 20 or 15 for faster bug reproduction > > Itself, the program will hardly get pressure on code cache, but the good news is that it is sufficient to attach a jconsole on it which will: > - allows us to monitor code cache > - indirectly generate activity on the code cache, just what we need to reproduce the bug > > Some logs related to code cache will show up at some point with GC activity: > > > [648.733s][info][codecache ] Triggering aggressive GC due to having only 14.970% free memory > > > And then it will stop and we'll end up with the following message: > > > [672.714s][info][codecache ] Code cache is full - disabling compilation > > > L... We discussed this question internally bit, and the consensus has been to use a new PR to avoid confusion due to two different approaches being discussed in the same thread. Would you mind closing this one out and I'll create a new PR? 
Thanks, Thomas ------------- PR Comment: https://git.openjdk.org/jdk/pull/23656#issuecomment-3049289764 From duke at openjdk.org Tue Jul 8 19:23:48 2025 From: duke at openjdk.org (Alexandre Jacob) Date: Tue, 8 Jul 2025 19:23:48 GMT Subject: RFR: 8350621: Code cache stops scheduling GC In-Reply-To: References: Message-ID: On Sun, 16 Feb 2025 18:39:29 GMT, Alexandre Jacob wrote: > The purpose of this PR is to fix a bug where we can end up in a situation where the GC is not scheduled anymore by `CodeCache`. > > This situation is possible because the `_unloading_threshold_gc_requested` flag is set to `true` when triggering the GC and we expect the GC to call `CodeCache::on_gc_marking_cycle_finish` which in turn will call `CodeCache::update_cold_gc_count`, which will reset the flag `_unloading_threshold_gc_requested` allowing further GC scheduling. > > Unfortunately this can't work properly under certain circumstances. > For example, if using G1GC, calling `G1CollectedHeap::collect` does no give the guarantee that the GC will actually run as it can be already running (see [here](https://github.com/openjdk/jdk/blob/7d11418c820b46926a25907766d16083a4b349de/src/hotspot/share/gc/g1/g1CollectedHeap.cpp#L1763)). > > I have observed this behavior on JVM in version 21 that were migrated recently from java 17. > Those JVMs have some pressure on code cache and quite a large heap in comparison to allocation rate, which means that objects are mostly GC'd by young collections and full GC take a long time to happen. > > I have been able to reproduce this issue with ParallelGC and G1GC, and I imagine that other GC can be impacted as well. > > In order to reproduce this issue, I found a very simple and convenient way: > > > public class CodeCacheMain { > public static void main(String[] args) throws InterruptedException { > while (true) { > Thread.sleep(100); > } > } > } > > > Run this simple app with the following JVM flags: > > > -Xlog:gc*=info,codecache=info -Xmx512m -XX:ReservedCodeCacheSize=2496k -XX:StartAggressiveSweepingAt=15 > > > - 512m for the heap just to clarify the intent that we don't want to be bothered by a full GC > - low `ReservedCodeCacheSize` to put pressure on code cache quickly > - `StartAggressiveSweepingAt` can be set to 20 or 15 for faster bug reproduction > > Itself, the program will hardly get pressure on code cache, but the good news is that it is sufficient to attach a jconsole on it which will: > - allows us to monitor code cache > - indirectly generate activity on the code cache, just what we need to reproduce the bug > > Some logs related to code cache will show up at some point with GC activity: > > > [648.733s][info][codecache ] Triggering aggressive GC due to having only 14.970% free memory > > > And then it will stop and we'll end up with the following message: > > > [672.714s][info][codecache ] Code cache is full - disabling compilation > > > L... Sure I can close this PR, this makes things simpler for everybody I guess! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23656#issuecomment-3050053298 From duke at openjdk.org Tue Jul 8 19:23:48 2025 From: duke at openjdk.org (Alexandre Jacob) Date: Tue, 8 Jul 2025 19:23:48 GMT Subject: Withdrawn: 8350621: Code cache stops scheduling GC In-Reply-To: References: Message-ID: On Sun, 16 Feb 2025 18:39:29 GMT, Alexandre Jacob wrote: > The purpose of this PR is to fix a bug where we can end up in a situation where the GC is not scheduled anymore by `CodeCache`. 
> > This situation is possible because the `_unloading_threshold_gc_requested` flag is set to `true` when triggering the GC and we expect the GC to call `CodeCache::on_gc_marking_cycle_finish` which in turn will call `CodeCache::update_cold_gc_count`, which will reset the flag `_unloading_threshold_gc_requested` allowing further GC scheduling. > > Unfortunately this can't work properly under certain circumstances. > For example, if using G1GC, calling `G1CollectedHeap::collect` does no give the guarantee that the GC will actually run as it can be already running (see [here](https://github.com/openjdk/jdk/blob/7d11418c820b46926a25907766d16083a4b349de/src/hotspot/share/gc/g1/g1CollectedHeap.cpp#L1763)). > > I have observed this behavior on JVM in version 21 that were migrated recently from java 17. > Those JVMs have some pressure on code cache and quite a large heap in comparison to allocation rate, which means that objects are mostly GC'd by young collections and full GC take a long time to happen. > > I have been able to reproduce this issue with ParallelGC and G1GC, and I imagine that other GC can be impacted as well. > > In order to reproduce this issue, I found a very simple and convenient way: > > > public class CodeCacheMain { > public static void main(String[] args) throws InterruptedException { > while (true) { > Thread.sleep(100); > } > } > } > > > Run this simple app with the following JVM flags: > > > -Xlog:gc*=info,codecache=info -Xmx512m -XX:ReservedCodeCacheSize=2496k -XX:StartAggressiveSweepingAt=15 > > > - 512m for the heap just to clarify the intent that we don't want to be bothered by a full GC > - low `ReservedCodeCacheSize` to put pressure on code cache quickly > - `StartAggressiveSweepingAt` can be set to 20 or 15 for faster bug reproduction > > Itself, the program will hardly get pressure on code cache, but the good news is that it is sufficient to attach a jconsole on it which will: > - allows us to monitor code cache > - indirectly generate activity on the code cache, just what we need to reproduce the bug > > Some logs related to code cache will show up at some point with GC activity: > > > [648.733s][info][codecache ] Triggering aggressive GC due to having only 14.970% free memory > > > And then it will stop and we'll end up with the following message: > > > [672.714s][info][codecache ] Code cache is full - disabling compilation > > > L... This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/23656 From kvn at openjdk.org Tue Jul 8 19:37:46 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 8 Jul 2025 19:37:46 GMT Subject: RFR: 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 12:31:04 GMT, Matthias Baesken wrote: >> @mbaesken, please verify that is passing ubsan testing now. > >> please verify that is passing ubsan testing now. > > Hi Vladimir, I do not see the mentioned ubsan error (codeBlob.hpp:235:97: runtime error: applying non-zero offset 16 to null pointer) after your patch was added! Thank you, @MBaesken, for review and testing. Thank you, @adinn, for review. 
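For context on the ubsan message quoted above: adding a non-zero offset to a null pointer is undefined behaviour in C++ even if the offset is subtracted again, which is why computing a size via pointer arithmetic on `_mutable_data` before it is allocated trips the sanitizer. The sketch below is illustrative only (a made-up struct, not the real CodeBlob), showing the problematic pattern and the fix of returning the recorded size directly.

#include <cstddef>
#include <cstdio>

struct Blob {
  char* _mutable_data = nullptr;   // not yet allocated at this point
  int   _relocation_size = 16;

  // UB: with _mutable_data == nullptr, (_mutable_data + _relocation_size)
  // applies a non-zero offset to a null pointer, which is what ubsan flags,
  // even though the arithmetic later cancels out.
  std::ptrdiff_t relocation_size_via_pointers() const {
    return (_mutable_data + _relocation_size) - _mutable_data;
  }

  // Fix: just return the recorded size; no pointer arithmetic needed.
  int relocation_size() const { return _relocation_size; }
};

int main() {
  Blob b;
  std::printf("%d\n", b.relocation_size());
  return 0;
}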
------------- PR Comment: https://git.openjdk.org/jdk/pull/26175#issuecomment-3050088171 From kvn at openjdk.org Tue Jul 8 19:37:47 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 8 Jul 2025 19:37:47 GMT Subject: Integrated: 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 23:48:03 GMT, Vladimir Kozlov wrote: > `CodeBlob::relocation_size()` is calculated as `(_mutable_data + _relocation_size - _mutable_data)`. `CodeBlob::relocation_size()` is called during AOT code loading before we allocate space for mutable data (the size is used to find how big space should be allocated). The expression at that point is `(NULL + _relocation_size - NULL)` which returns the correct result. But we should just return `_relocation_size` which is recorded anyway in AOT data. > > Added missing `_mutable_data = blob_end();` initialization when we restore AOT code blob. > > Fixed embarrassing typo in asserts. > > Tested: tier1-6,8,10,xcomp,stress This pull request has now been integrated. Changeset: dedcce04 Author: Vladimir Kozlov URL: https://git.openjdk.org/jdk/commit/dedcce045013b3ff84f5ef8857e1a83f0c09f9ad Stats: 7 lines in 2 files changed: 4 ins; 0 del; 3 mod 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() Reviewed-by: adinn, mbaesken ------------- PR: https://git.openjdk.org/jdk/pull/26175 From chagedorn at openjdk.org Tue Jul 8 19:52:42 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 8 Jul 2025 19:52:42 GMT Subject: RFR: 8360175: C2 crash: assert(edge_from_to(prior_use, n)) failed: before block local scheduling In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 13:00:45 GMT, Manuel Hässig wrote: >> src/hotspot/cpu/x86/peephole_x86_64.cpp line 349: >> >>> 347: Node* dependant_lea = decode->fast_out(i); >>> 348: if (dependant_lea->is_Mach() && dependant_lea->as_Mach()->ideal_Opcode() == Op_AddP) { >>> 349: dependant_lea->set_req(AddPNode::Base, lea_derived_oop->in(AddPNode::Address)); >> >> The fix looks reasonable to me, too. No worries about the regression test, thanks for trying! A small question: Why don't we use `lea_address`? >> >> Another thing I've noticed while browsing the code: `ra_` and `new_root` seem to be unused and could be removed (could probably also be squeezed into this PR here instead of creating a new issue just for that). > > We cannot use `lea_address` because in case of a spill that also gets moved up one node to check if the lea and the decode point to the same grandparent. > >> Another thing I've noticed while browsing the code: ra_ and new_root seem to be unused and could be removed. > > These arguments come from the machinery that calls this out of the matcher. I am not too familiar with it, so my working assumption so far has been to keep the signature the same as the other peepholes, which seems logical since it is called by generated code. Thanks for the explanation, makes sense. > These arguments come from the machinery that calls this out of the matcher. I am not too familiar with it, so my working assumption so far has been to keep the signature the same as the other peepholes, which seems logical since it is called by generated code. Could be the case - I'm not too familiar with it, either. Anyway, the fix looks good!
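As a side note on the unused `ra_`/`new_root` arguments: keeping them is the usual cost of dispatching through a common, generated call path where every handler must share one signature. A tiny standalone C++ illustration of that constraint (nothing below is taken from the matcher or peephole code):

#include <cstdio>

// All handlers share one signature so they can sit in the same dispatch
// table, even if a particular handler ignores some arguments.
using Handler = int (*)(int block, int index);

static int uses_both(int block, int index) { return block + index; }
static int uses_none(int, int)             { return 0; }

static Handler table[] = { uses_both, uses_none };

int main() {
  for (Handler h : table) std::printf("%d\n", h(3, 4));
  return 0;
}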
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26157#discussion_r2193326173 From duke at openjdk.org Tue Jul 8 20:03:17 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Tue, 8 Jul 2025 20:03:17 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v35] In-Reply-To: References: Message-ID: > This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). > > When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. > > This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality > > Additional Testing: > - [ ] Linux x64 fastdebug all > - [ ] Linux aarch64 fastdebug all > - [ ] ... Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 95 commits: - Typo - Merge branch 'master' into JDK-8316694-Final - Update justification for skipping CallRelocation - Enclose ImmutableDataReferencesCounterSize in parentheses - Let trampolines fix their owners - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final - Update how call sites are fixed - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final - Fix pointer printing - Use set_destination_mt_safe - ... and 85 more: https://git.openjdk.org/jdk/compare/117f0b40...66d73c16 ------------- Changes: https://git.openjdk.org/jdk/pull/23573/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=34 Stats: 1617 lines in 25 files changed: 1579 ins; 2 del; 36 mod Patch: https://git.openjdk.org/jdk/pull/23573.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23573/head:pull/23573 PR: https://git.openjdk.org/jdk/pull/23573 From duke at openjdk.org Tue Jul 8 20:11:51 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Tue, 8 Jul 2025 20:11:51 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 08:41:37 GMT, Andrew Haley wrote: >> Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 90 commits: >> >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Update how call sites are fixed >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Fix pointer printing >> - Use set_destination_mt_safe >> - Print address as pointer >> - Use new _metadata_size instead of _jvmci_data_size >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Only check branch distance for aarch64 and riscv >> - Move far branch fix to fix_relocation_after_move >> - ... and 80 more: https://git.openjdk.org/jdk/compare/f799cf18...70e4164e > > src/hotspot/cpu/aarch64/relocInfo_aarch64.cpp line 117: > >> 115: } >> 116: >> 117: void poll_Relocation::fix_relocation_after_move(const CodeBuffer* src, CodeBuffer* dest, bool is_nmethod_relocation) { > > Suggestion: > > void poll_Relocation::fix_relocation_after_move(const CodeBuffer* src, CodeBuffer* dest, bool) { This change was removed. Same reason as above. 
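For readers new to the PR summarized above, the lifecycle it describes (deep-copy the blob, point callers at the copy, leave the old blob to be reclaimed later) can be sketched with a toy C++ model; all names below are made up and nothing reflects the real nmethod layout or GC interaction.

#include <atomic>
#include <cstdio>
#include <cstring>
#include <vector>

// Toy model: relocate by deep copy, publish the copy, retire the old blob.
struct Code {
  char body[16];
};

std::atomic<Code*> entry_point{nullptr};
std::vector<Code*> retired;              // reclaimed later, not freed in place

Code* relocate(Code* old_code) {
  Code* copy = new Code(*old_code);             // deep copy to the new location
  Code* prev = entry_point.exchange(copy);      // callers now resolve to the copy
  retired.push_back(prev);                      // old blob stays valid until cleanup
  return copy;
}

int main() {
  Code* original = new Code();
  std::strcpy(original->body, "stub");
  entry_point.store(original);
  relocate(original);
  std::printf("retired blobs: %zu\n", retired.size());
  for (Code* c : retired) delete c;
  delete entry_point.load();
  return 0;
}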
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2193354958 From kvn at openjdk.org Tue Jul 8 20:25:45 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 8 Jul 2025 20:25:45 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v4] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 13:53:25 GMT, Marc Chevalier wrote: >> When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 >> >> This is enforced by restoring the old state, like in >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 >> >> That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: >> >> ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) >> >> >> Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. >> >> Another situation is somewhat worse, when happening during parsing. It can lead to such cases: >> >> ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) >> >> The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? >> >> This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 >> >> And here there is the challenge: >> - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) >> ... > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Somehow intellij doesn't remove empty indented line src/hotspot/share/opto/library_call.hpp line 150: > 148: void restore_state(const SavedState&); > 149: void destruct_map_clone(const SavedState& sfp); > 150: Can this be a class instead of struct? These methods could be members. Initialization can be done through constructor. The destructor can do restoration by default unless `destruct_map_clone()` was called before. I don't like name `destruct_map_clone()` for this. How about `SavedState::remove()` or something. 
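A rough standalone sketch of the class-with-destructor idea suggested above: the constructor snapshots the state, the destructor restores it by default, and a caller that wants to keep the mutated state dismisses the guard. The `Kit` type and the method names are invented for illustration and are not taken from LibraryCallKit.

#include <cstdio>

struct Kit {
  int control = 0;
};

class SavedState {
  Kit& _kit;
  int  _saved_control;
  bool _armed = true;
public:
  explicit SavedState(Kit& kit) : _kit(kit), _saved_control(kit.control) {}
  void discard() { _armed = false; }          // caller decides to keep the new state
  ~SavedState() { if (_armed) _kit.control = _saved_control; }
};

int main() {
  Kit kit;
  {
    SavedState guard(kit);
    kit.control = 42;   // the intrinsic body mutates the kit...
  }                     // ...bailout path: the destructor restores control to 0
  std::printf("control = %d\n", kit.control);
  return 0;
}

The appeal of the RAII form is that the bailout path cannot forget to restore: leaving the scope without dismissing the guard puts the state back automatically.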
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2193378458 From kvn at openjdk.org Tue Jul 8 20:31:40 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 8 Jul 2025 20:31:40 GMT Subject: RFR: 8357473: Compilation spike leaves many CompileTasks in free list [v6] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 08:59:53 GMT, Aleksey Shipilev wrote: >> See bug for more discussion. >> >> This PR implements the "all the way" solution by removing the free list completely. It complements https://github.com/openjdk/jdk/pull/25364, and can go either first, or second. We will remerge the other one once either integrates. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `compiler` >> - [x] Linux AArch64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 11 commits: > > - Task count atomic can be relaxed > - Minor touchup in ~CompileTask > - Purge CompileTaskAlloc_lock completely > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Also free the lock! > - Comments and indenting > - ... and 1 more: https://git.openjdk.org/jdk/compare/7b255b8a...684f83b7 re-approved ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25409#pullrequestreview-2999020514 From sparasa at openjdk.org Tue Jul 8 22:44:55 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Tue, 8 Jul 2025 22:44:55 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: References: Message-ID: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> > The goal of this PR is to enhance the existing x86 assembly stubs using PUSH and POP instructions with paired PUSHP/POPP instructions which are part of Intel APX technology. > > In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0?R31). Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs. > > Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures. 
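A hypothetical sketch of how such paired instructions are typically exposed to call sites: one helper that emits the paired encoding when the CPU reports APX support and falls back to legacy PUSH/POP otherwise, so call sites stay unchanged. This is not the HotSpot MacroAssembler; the class, names, and output below are illustrative only.

#include <cstdio>

struct ToyAssembler {
  bool supports_apx;

  void paired_push(const char* reg) {
    std::printf("%s %s\n", supports_apx ? "pushp" : "push", reg);
  }
  void paired_pop(const char* reg) {
    std::printf("%s %s\n", supports_apx ? "popp" : "pop", reg);
  }
};

int main() {
  ToyAssembler masm{true};
  masm.paired_push("rbx");   // paired pushes...
  masm.paired_push("rbp");
  masm.paired_pop("rbp");    // ...popped in reverse order, keeping the pairing
  masm.paired_pop("rbx");
  return 0;
}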
Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: rename to paired_push and paired_pop ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25889/files - new: https://git.openjdk.org/jdk/pull/25889/files/8ea6d6ce..24e6da2c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25889&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25889&range=00-01 Stats: 370 lines in 23 files changed: 0 ins; 1 del; 369 mod Patch: https://git.openjdk.org/jdk/pull/25889.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25889/head:pull/25889 PR: https://git.openjdk.org/jdk/pull/25889 From sparasa at openjdk.org Tue Jul 8 22:44:55 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Tue, 8 Jul 2025 22:44:55 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: References: Message-ID: <1qCd44HFg1tS1jUP-FIuzjUX8ePGDR2ViSFmhHsc_yE=.82e01450-c341-4cfa-909e-91f7be1d70a3@github.com> On Thu, 3 Jul 2025 04:53:15 GMT, David Holmes wrote: >> Just a drive-by comment as this isn't code I normally have much to do with but to me it would look a lot cleaner to define `push_paired`/`pop_paired` (maybe abbreviating directly to `pushp`/`popp`?) rather than passing the boolean. > >> @dholmes-ora would you mind sharing your opinion? We seem to be making things more complicated, but hopefully in a good way? > > Seems very complicated to me. Really this is for compiler folk to discuss. And as noted above this "tracker" class only helps where the push/pop are paired in the same scope. Personally I think a "pushp" that is defined to be a "push-paired" when available, else a regular "push", would suffice in terms of API design. But again this is for compiler folk to determine. > Like @dholmes-ora, I also prefer a new function (in MacroAssembler) instead of flags. Though I like the names `paired_push`/`paired_pop`.. > Please see the updated code with `paired_push`/`paired_pop`. Thanks for the suggestions! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25889#issuecomment-3050476284 From duke at openjdk.org Wed Jul 9 00:32:48 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Wed, 9 Jul 2025 00:32:48 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 22:21:26 GMT, Chad Rakoczy wrote: >> src/hotspot/cpu/aarch64/relocInfo_aarch64.cpp line 84: >> >>> 82: if (NativeCall::is_call_at(addr())) { >>> 83: NativeCall* call = nativeCall_at(addr()); >>> 84: if (be_safe) { >> >> Why is this change necessary? > > The original motivation was to address far call sites. After relocation, some calls that previously didn't require a trampoline might now need one, hence the introduction of the `be_safe` parameter. However, upon further review, this change is unnecessary. The method `trampoline_stub_Relocation::fix_relocation_after_move` already updates the owner and contains the logic to determine whether a direct call can be performed. Therefore, we can skip invoking `CallRelocation::fix_relocation_after_move` for calls that use trampolines, as all required adjustments will be handled correctly by the trampoline relocations. ([Reference](https://github.com/chadrako/jdk/blob/0f4ff9646d1f7f43214c5ccd4bbe572fffd08d16/src/hotspot/share/code/nmethod.cpp#L1547-L1556)) @dean-long What are your thoughts on this solution? 
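The range concern behind "may now need a trampoline" boils down to a reachability check: an AArch64 direct branch immediate reaches roughly +/-128 MB from the call site, so a callee that was reachable before the blob moved may not be afterwards. A standalone C++ sketch of that check (the addresses are arbitrary and nothing here mirrors the relocation code):

#include <cstdint>
#include <cstdio>

constexpr int64_t kBranchRange = 128 * 1024 * 1024;   // +/-128 MB for B/BL

bool reachable_directly(uint64_t call_site, uint64_t target) {
  int64_t distance = (int64_t)(target - call_site);
  return distance < kBranchRange && distance >= -kBranchRange;
}

int main() {
  uint64_t callee = 0x10000000;
  std::printf("before move: %s\n", reachable_directly(0x11000000, callee) ? "direct" : "trampoline");
  std::printf("after move:  %s\n", reachable_directly(0x21000000, callee) ? "direct" : "trampoline");
  return 0;
}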
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2193667769 From fjiang at openjdk.org Wed Jul 9 01:22:53 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Wed, 9 Jul 2025 01:22:53 GMT Subject: RFR: 8361504: RISC-V: Make C1 clone intrinsic platform guard more specific In-Reply-To: References: <16DUz5Iytmw9i7wAxTx_oU4eeJBCsOI_15qzFP6M4GU=.8a5305a3-2b8d-4b9d-957a-430600bff4b4@github.com> Message-ID: On Tue, 8 Jul 2025 00:39:39 GMT, Fei Yang wrote: >> Hi all. >> Please review this trivial patch, which changes the C1 primitive array clone intrinsic RISCV platform guard into RISCV64. As we only support RISCV64 for now. > > Thanks. @RealFYang @zifeihan -- Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26161#issuecomment-3050732452 From fjiang at openjdk.org Wed Jul 9 01:22:53 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Wed, 9 Jul 2025 01:22:53 GMT Subject: Integrated: 8361504: RISC-V: Make C1 clone intrinsic platform guard more specific In-Reply-To: <16DUz5Iytmw9i7wAxTx_oU4eeJBCsOI_15qzFP6M4GU=.8a5305a3-2b8d-4b9d-957a-430600bff4b4@github.com> References: <16DUz5Iytmw9i7wAxTx_oU4eeJBCsOI_15qzFP6M4GU=.8a5305a3-2b8d-4b9d-957a-430600bff4b4@github.com> Message-ID: On Mon, 7 Jul 2025 15:03:52 GMT, Feilong Jiang wrote: > Hi all. > Please review this trivial patch, which changes the C1 primitive array clone intrinsic RISCV platform guard into RISCV64. As we only support RISCV64 for now. This pull request has now been integrated. Changeset: 54e37629 Author: Feilong Jiang URL: https://git.openjdk.org/jdk/commit/54e37629f63eae7800415fa22684e6b3df3648ec Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod 8361504: RISC-V: Make C1 clone intrinsic platform guard more specific Reviewed-by: fyang, gcao ------------- PR: https://git.openjdk.org/jdk/pull/26161 From xgong at openjdk.org Wed Jul 9 01:23:43 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 9 Jul 2025 01:23:43 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: > ### Background > On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. > > For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. > > To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. > > ### Impact Analysis > #### 1. Vector types > Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. > > #### 2. Vector API > No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. > > #### 3. 
Auto-vectorization > Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. > > #### 4. Codegen of vector nodes > NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. > > Details: > - Lanewise vector operations are unaffected as explained above. > - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). > - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, adding an explicit vector size check in `match_rule_s... Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: Disable auto-vectorization of double to short conversion for NEON and update tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26057/files - new: https://git.openjdk.org/jdk/pull/26057/files/dfda42a3..7fdc357a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26057&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26057&range=02-03 Stats: 53 lines in 4 files changed: 35 ins; 0 del; 18 mod Patch: https://git.openjdk.org/jdk/pull/26057.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26057/head:pull/26057 PR: https://git.openjdk.org/jdk/pull/26057 From xgong at openjdk.org Wed Jul 9 01:23:44 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 9 Jul 2025 01:23:44 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> <0vdCJFYxCI6hFnTL6rm3oKQcPuuIR2EbuyAOa0muqcw=.d5c249cb-9bf0-415d-ab22-de7387d8d8d1@github.com> Message-ID: On Tue, 8 Jul 2025 10:33:50 GMT, Fei Gao wrote: > > > > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 > > > > > > > > > > > > > > > > > > Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. > > > > > > > > > > > > > > > Hi @XiaohongGong thanks for asking. In many auto-vectorization cases involving `char`, the vector elements are represented using `T_SHORT` as the `BasicType`, rather than `T_CHAR`. > > > > > This is because, in Java, operands of subword types are always promoted to `int` before any arithmetic operation. As a result, when handling a node like `ConvD2I`, we don?t initially know its actual subword type. 
Later, the SuperWord phase propagates a narrowed integer type backward to help determine the correct subword type. See: > > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2551-L2558 > > > > > > > > > > Since SuperWord assigns `T_SHORT` to `StoreC` early on > > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2646-L2650 > > > > > > > > > > the entire propagation chain tends to use `T_SHORT` as well. > > > > > This applies to most operations, with the exception of a few like `RShiftI`, `Abs`, and `ReverseBytesI`, which are handled separately. > > > > > So your change already benefits many char-related vectorization cases like `convertDoubleToChar` above. That?s why we can safely relax the IR condition mentioned earlier. > > > > > > > > > > > > Thanks for your input! It's really helpful to me. Does this mean it always use `T_SHORT` for char vectors in SLP? If so, it's safe that we do not need to consider `T_CHAR` in vector IRs in backend? > > > > > > > > > No, we don't always use `T_SHORT` for char vectors. As mentioned earlier, for operations like `RShiftI`, `Abs`, and `ReverseBytesI`, the compiler needs to preserve the higher-order bits of the first operand. Therefore, SuperWord still needs to assign them precise subword types. See: > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2583-L2589 > > > > > > Yes, I see. Thanks! What I mean is for cases that SLP will use the subword types, it will be actually `T_SHORT` for `T_CHAR` then? > > From my side, the cases where SLP uses subword types can be roughly categorized into two groups: > > 1. Cases where the compiler doesn?t need to preserve the higher-order bits ? in these, SuperWord will use `T_SHORT` instead of `T_CHAR`. > 2. Cases where the compiler does need to preserve the higher-order bits, like `RShiftI`, `Abs`, and `ReverseBytesI` ? in these, `T_CHAR` is still used. Thanks for your explanation! I'v updated the ad file and jtreg tests in latest commit. Could you please take a look at again? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3050732618 From xgong at openjdk.org Wed Jul 9 01:23:43 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 9 Jul 2025 01:23:43 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: <1bwEx4HAqqmfw9DrslwrZH1cYfIKi-5p9AgelJrIB94=.f46dd942-2102-4fd3-adfd-7f7ec3c3dbc0@github.com> References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> <0vdCJFYxCI6hFnTL6rm3oKQcPuuIR2EbuyAOa0muqcw=.d5c249cb-9bf0-415d-ab22-de7387d8d8d1@github.com> <1bwEx4HAqqmfw9DrslwrZH1cYfIKi-5p9AgelJrIB94=.f46dd942-2102-4fd3-adfd-7f7ec3c3dbc0@github.com> Message-ID: On Tue, 8 Jul 2025 09:07:00 GMT, Xiaohong Gong wrote: >>> > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 >>> > > > >>> > > > >>> > > > Actually I didn't change the min vector size for `char` vectors in this patch. 
Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. >>> > > >>> > > >>> > > Hi @XiaohongGong thanks for asking. In many auto-vectorization cases involving `char`, the vector elements are represented using `T_SHORT` as the `BasicType`, rather than `T_CHAR`. >>> > > This is because, in Java, operands of subword types are always promoted to `int` before any arithmetic operation. As a result, when handling a node like `ConvD2I`, we don?t initially know its actual subword type. Later, the SuperWord phase propagates a narrowed integer type backward to help determine the correct subword type. See: >>> > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2551-L2558 >>> > > >>> > > Since SuperWord assigns `T_SHORT` to `StoreC` early on >>> > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2646-L2650 >>> > > >>> > > the entire propagation chain tends to use `T_SHORT` as well. >>> > > This applies to most operations, with the exception of a few like `RShiftI`, `Abs`, and `ReverseBytesI`, which are handled separately. >>> > > So your change already benefits many char-related vectorization cases like `convertDoubleToChar` above. That?s why we can safely relax the IR condition mentioned earlier. >>> > >>> > >>> > Thanks for your input! It's really helpful to me. Does this mean it always use `T_SHORT` for char vectors in SLP? If so, it's safe that we do not need to consider `T_CHAR` in vector IRs in backend? >>> >>> No, we don't always use `T_SHORT` for char vectors. As mentioned earlier, for operations like `RShiftI`, `Abs`, and `ReverseBytesI`, the compiler needs to preserve the higher-order bits of the first operand. Therefore, SuperWord still needs to assign them precise subword types. See: >>> >>> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2583-L2589 >> >> Yes, I see. Thanks! What I mean is for cases that SLP will use the sub... > >> > > Hi @XiaohongGong, is there any way we can implement 2HF -> 2S and 2S -> 2HF in these match rules ? >> > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4697 >> > > >> > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4679 >> > > >> > > The `fcvtn` and `fcvtl` instructions do not support these arrangements. I was wondering if there is any other way we can implement these by any chance? >> > >> > >> > Do you mean `2HF -> 2F` and `2F -> 2HF` ? >> > Yes, it does not support the 32-bit arrangements. Vector conversion is a kind of lanewise vector operation. For such cases, we usually use the same arrangements with 64-bit vector size for 32-bit ones. That means we can reuse the `T4H` and `T4S` to implement it. Hence, current match rules can cover the conversions between `2HF` and `2F`. >> > Consider there is no such conversion cases in Vector API, I didn't change the comment in the match rules. I think this may benefit auto-vectorization. Currently, do we have cases that can match these rules with SLP? >> >> Sorry yes I meant 2HF <-> 2F. 
Yes, currently there are no such cases in VectorAPI as we do not support Float16 Vectors yet but this will benefit autovectorization cases. I think in this case this may also benefit 2D <-> 2HF as well (eventually we might add support for D <-> HF as well). Yes we have some JTREG tests that match these rules currently like - `test/hotspot/jtreg/compiler/vectorization/TestFloat16VectorConvChain.java`, `test/hotspot/jtreg/compiler/vectorization/TestFloatConversionsVector.java`. > > Thanks! So per my understanding, things that I just need is updating comment (e.g. `// 4HF to 4F`) of rules like `vcvtHFtoF`, right? For conversions between double and HF, we do not need any new rules as it will be actually `double -> float -> HF`, right? > > > > > Hi @XiaohongGong, is there any way we can implement 2HF -> 2S and 2S -> 2HF in these match rules ? > > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4697 > > > > > > > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4679 > > > > > > > > > > The `fcvtn` and `fcvtl` instructions do not support these arrangements. I was wondering if there is any other way we can implement these by any chance? > > > > > > > > > > > > Do you mean `2HF -> 2F` and `2F -> 2HF` ? > > > > Yes, it does not support the 32-bit arrangements. Vector conversion is a kind of lanewise vector operation. For such cases, we usually use the same arrangements with 64-bit vector size for 32-bit ones. That means we can reuse the `T4H` and `T4S` to implement it. Hence, current match rules can cover the conversions between `2HF` and `2F`. > > > > Consider there is no such conversion cases in Vector API, I didn't change the comment in the match rules. I think this may benefit auto-vectorization. Currently, do we have cases that can match these rules with SLP? > > > > > > > > > Sorry yes I meant 2HF <-> 2F. Yes, currently there are no such cases in VectorAPI as we do not support Float16 Vectors yet but this will benefit autovectorization cases. I think in this case this may also benefit 2D <-> 2HF as well (eventually we might add support for D <-> HF as well). Yes we have some JTREG tests that match these rules currently like - `test/hotspot/jtreg/compiler/vectorization/TestFloat16VectorConvChain.java`, `test/hotspot/jtreg/compiler/vectorization/TestFloatConversionsVector.java`. > > > > > > Thanks! So per my understanding, things that I just need is updating comment (e.g. `// 4HF to 4F`) of rules like `vcvtHFtoF`, right? For conversions between double and HF, we do not need any new rules as it will be actually `double -> float -> HF`, right? > > Yes please and also for `4F to 4HF` case for `vcvtF2HF`. Thanks! > > As for the double to half float conversion - yes with the current infrastructure it would be ConvD2F -> ConvF2HF which will be autovectorized to generate corresponding vector nodes. Sooner or later, support for ConvD2HF (and its vectorized version) might be added upstream (support already available in `lworld+fp16` branch of Valhalla here - https://github.com/openjdk/valhalla/blob/0ed65b9a63405e950c411835120f0f36e326aaaa/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4535). You do not have to add any new rules now for this case. I was just hinting at possible D<->HF implementation in the future. As the max vector length was 64bits, I did not add any implementation for Neon vcvtD2HF or vcvtHF2D in Valhalla. 
Maybe we can do two `fcvtl/fcvtn` to convert D to F and then F to HF for this specific case but we can think about that later :) Make sense to me. The latest change has been updated together with the relative jtreg tests. Would you mind taking another look at it? Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3050730818 From xgong at openjdk.org Wed Jul 9 01:26:44 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 9 Jul 2025 01:26:44 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> <5H0NP8vFqCDf1JgHIDee3WrYRbJ6koj5wQsxEGTW8nI=.87d74c6a-54b3-45cc-a972-c4350d5e2acf@github.com> <0XcbEZkrW7fvJhPwQPP1UtT9aC3_OnT7sjoiHo0fOuQ=.1ec5ae91-55cb-4d8b-9e91-44ec02e63747@github.com> <0vdCJFYxCI6hFnTL6rm3oKQcPuuIR2EbuyAOa0muqcw=.d5c249cb-9bf0-415d-ab22-de7387d8d8d1@github.com> Message-ID: <1Zh7gvEryldv1xZZWYETiDjUzK_i1ea4H6U1EFEoeZU=.ef8296ca-f14f-4f4a-98c0-3b6b6d0721ed@github.com> On Tue, 8 Jul 2025 10:33:50 GMT, Fei Gao wrote: >>> > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 >>> > > > >>> > > > >>> > > > Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. >>> > > >>> > > >>> > > Hi @XiaohongGong thanks for asking. In many auto-vectorization cases involving `char`, the vector elements are represented using `T_SHORT` as the `BasicType`, rather than `T_CHAR`. >>> > > This is because, in Java, operands of subword types are always promoted to `int` before any arithmetic operation. As a result, when handling a node like `ConvD2I`, we don?t initially know its actual subword type. Later, the SuperWord phase propagates a narrowed integer type backward to help determine the correct subword type. See: >>> > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2551-L2558 >>> > > >>> > > Since SuperWord assigns `T_SHORT` to `StoreC` early on >>> > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2646-L2650 >>> > > >>> > > the entire propagation chain tends to use `T_SHORT` as well. >>> > > This applies to most operations, with the exception of a few like `RShiftI`, `Abs`, and `ReverseBytesI`, which are handled separately. >>> > > So your change already benefits many char-related vectorization cases like `convertDoubleToChar` above. That?s why we can safely relax the IR condition mentioned earlier. >>> > >>> > >>> > Thanks for your input! It's really helpful to me. Does this mean it always use `T_SHORT` for char vectors in SLP? If so, it's safe that we do not need to consider `T_CHAR` in vector IRs in backend? >>> >>> No, we don't always use `T_SHORT` for char vectors. As mentioned earlier, for operations like `RShiftI`, `Abs`, and `ReverseBytesI`, the compiler needs to preserve the higher-order bits of the first operand. Therefore, SuperWord still needs to assign them precise subword types. 
See: >>> >>> https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2583-L2589 >> >> Yes, I see. Thanks! What I mean is for cases that SLP will use the sub... > >> > > > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java#L388-L392 >> > > > > >> > > > > >> > > > > Actually I didn't change the min vector size for `char` vectors in this patch. Relaxing `short` vectors to 32-bit is to support the vector cast for Vector API, and there is no `char` species in it. Do you think it's better to do the same change for `char` as well? This will just benefit auto-vectorization. >> > > > >> > > > >> > > > Hi @XiaohongGong thanks for asking. In many auto-vectorization cases involving `char`, the vector elements are represented using `T_SHORT` as the `BasicType`, rather than `T_CHAR`. >> > > > This is because, in Java, operands of subword types are always promoted to `int` before any arithmetic operation. As a result, when handling a node like `ConvD2I`, we don?t initially know its actual subword type. Later, the SuperWord phase propagates a narrowed integer type backward to help determine the correct subword type. See: >> > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2551-L2558 >> > > > >> > > > Since SuperWord assigns `T_SHORT` to `StoreC` early on >> > > > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2646-L2650 >> > > > >> > > > the entire propagation chain tends to use `T_SHORT` as well. >> > > > This applies to most operations, with the exception of a few like `RShiftI`, `Abs`, and `ReverseBytesI`, which are handled separately. >> > > > So your change already benefits many char-related vectorization cases like `convertDoubleToChar` above. That?s why we can safely relax the IR condition mentioned earlier. >> > > >> > > >> > > Thanks for your input! It's really helpful to me. Does this mean it always use `T_SHORT` for char vectors in SLP? If so, it's safe that we do not need to consider `T_CHAR` in vector IRs in backend? >> > >> > >> > No, we don't always use `T_SHORT` for char vectors. As mentioned earlier, for operations like `RShiftI`, `Abs`, and `ReverseBytesI`, the compiler needs to preserve the higher-order bits of the first operand. Therefore, SuperWord still needs to assign them precise subword types. See: >> > https://github.com/openjdk/jdk/blob/f2d2eef988c57cc9f6194a8fd5b2b422035ee68f/src/hotspot/share/opto/superword.cpp#L2583-L2589 >> >> Yes, I see. Thanks! What I mean is for cases th... @fg1417 , there is performance regression of `D -> S` on NEON for SLP. I'v disabled the case in latest change. 
And here is the performance data of JMH `TypeVectorOperations` on Grace (the 128-bit SVE machine) and N1 (NEON) respectively: Grace: Benchmark COUNT Mode Unit Before After Ratio TypeVectorOperationsSuperWord.convertD2S 512 avgt ns/op 155.667433 123.222497 1.26 TypeVectorOperationsSuperWord.convertD2S 2048 avgt ns/op 622.262384 489.336020 1.27 TypeVectorOperationsSuperWord.convertL2S 512 avgt ns/op 93.173939 63.557134 1.46 TypeVectorOperationsSuperWord.convertL2S 2048 avgt ns/op 365.287938 239.726941 1.52 TypeVectorOperationsSuperWord.convertS2D 512 avgt ns/op 157.096344 147.560047 1.06 TypeVectorOperationsSuperWord.convertS2D 2048 avgt ns/op 627.039963 614.748559 1.01 TypeVectorOperationsSuperWord.convertS2L 512 avgt ns/op 111.752970 108.629240 1.02 TypeVectorOperationsSuperWord.convertS2L 2048 avgt ns/op 441.312737 441.088523 1.00 N1: Benchmark COUNT Mode Unit Before After Ratio TypeVectorOperationsSuperWord.convertD2S 512 avgt ns/op 215.353528 214.769884 1.00 TypeVectorOperationsSuperWord.convertD2S 2048 avgt ns/op 958.428871 952.922855 1.00 TypeVectorOperationsSuperWord.convertL2S 512 avgt ns/op 158.000190 142.647209 1.10 TypeVectorOperationsSuperWord.convertL2S 2048 avgt ns/op 612.525835 532.023419 1.15 TypeVectorOperationsSuperWord.convertS2D 512 avgt ns/op 209.993363 210.466401 0.99 TypeVectorOperationsSuperWord.convertS2D 2048 avgt ns/op 819.181052 803.601170 1.01 TypeVectorOperationsSuperWord.convertS2L 512 avgt ns/op 217.848273 182.680450 1.19 TypeVectorOperationsSuperWord.convertS2L 2048 avgt ns/op 858.031089 695.502377 1.23 ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3050738693 From dzhang at openjdk.org Wed Jul 9 02:47:40 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 9 Jul 2025 02:47:40 GMT Subject: RFR: 8361532: RISC-V: Several vector tests fail after JDK-8354383 In-Reply-To: <-DfrHsd_D9lqbcRmNvF67dHOBaoxGQWUwTeUIa1IvfA=.d3f8182e-c581-4b88-89dd-f5f0781e7b67@github.com> References: <-DfrHsd_D9lqbcRmNvF67dHOBaoxGQWUwTeUIa1IvfA=.d3f8182e-c581-4b88-89dd-f5f0781e7b67@github.com> Message-ID: On Tue, 8 Jul 2025 02:30:27 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > After [JDK-8354383](https://bugs.openjdk.org/browse/JDK-8354383) , several test cases fail when fastdebug with RVV. > The reason for the error is that riscv lacks CastVV with dst as the mask register. > This PR adds the corresponding matching rules. > > ### Testing > qemu-system with RVV: > * [x] Run jdk_vector (fastdebug) > * [x] Run compiler/vectorapi (fastdebug) Thanks all for the review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26178#issuecomment-3050924135 From duke at openjdk.org Wed Jul 9 02:47:40 2025 From: duke at openjdk.org (duke) Date: Wed, 9 Jul 2025 02:47:40 GMT Subject: RFR: 8361532: RISC-V: Several vector tests fail after JDK-8354383 In-Reply-To: <-DfrHsd_D9lqbcRmNvF67dHOBaoxGQWUwTeUIa1IvfA=.d3f8182e-c581-4b88-89dd-f5f0781e7b67@github.com> References: <-DfrHsd_D9lqbcRmNvF67dHOBaoxGQWUwTeUIa1IvfA=.d3f8182e-c581-4b88-89dd-f5f0781e7b67@github.com> Message-ID: On Tue, 8 Jul 2025 02:30:27 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > After [JDK-8354383](https://bugs.openjdk.org/browse/JDK-8354383) , several test cases fail when fastdebug with RVV. > The reason for the error is that riscv lacks CastVV with dst as the mask register. > This PR adds the corresponding matching rules. 
> > ### Testing > qemu-system with RVV: > * [x] Run jdk_vector (fastdebug) > * [x] Run compiler/vectorapi (fastdebug) @DingliZhang Your change (at version 58edbc93bd5a951c3863f2666827dd075b96dce5) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26178#issuecomment-3050924709 From gcao at openjdk.org Wed Jul 9 03:32:42 2025 From: gcao at openjdk.org (Gui Cao) Date: Wed, 9 Jul 2025 03:32:42 GMT Subject: RFR: 8361532: RISC-V: Several vector tests fail after JDK-8354383 In-Reply-To: <-DfrHsd_D9lqbcRmNvF67dHOBaoxGQWUwTeUIa1IvfA=.d3f8182e-c581-4b88-89dd-f5f0781e7b67@github.com> References: <-DfrHsd_D9lqbcRmNvF67dHOBaoxGQWUwTeUIa1IvfA=.d3f8182e-c581-4b88-89dd-f5f0781e7b67@github.com> Message-ID: On Tue, 8 Jul 2025 02:30:27 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > After [JDK-8354383](https://bugs.openjdk.org/browse/JDK-8354383) , several test cases fail when fastdebug with RVV. > The reason for the error is that riscv lacks CastVV with dst as the mask register. > This PR adds the corresponding matching rules. > > ### Testing > qemu-system with RVV: > * [x] Run jdk_vector (fastdebug) > * [x] Run compiler/vectorapi (fastdebug) Marked as reviewed by gcao (Author). ------------- PR Review: https://git.openjdk.org/jdk/pull/26178#pullrequestreview-2999838017 From dzhang at openjdk.org Wed Jul 9 06:00:48 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 9 Jul 2025 06:00:48 GMT Subject: Integrated: 8361532: RISC-V: Several vector tests fail after JDK-8354383 In-Reply-To: <-DfrHsd_D9lqbcRmNvF67dHOBaoxGQWUwTeUIa1IvfA=.d3f8182e-c581-4b88-89dd-f5f0781e7b67@github.com> References: <-DfrHsd_D9lqbcRmNvF67dHOBaoxGQWUwTeUIa1IvfA=.d3f8182e-c581-4b88-89dd-f5f0781e7b67@github.com> Message-ID: <1963-CTLhk27cAAMMn37-PBZBOTBxRyhhWLANhipOIM=.c4daa9ad-5c58-4ee4-8588-881e3655c4d8@github.com> On Tue, 8 Jul 2025 02:30:27 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > After [JDK-8354383](https://bugs.openjdk.org/browse/JDK-8354383) , several test cases fail when fastdebug with RVV. > The reason for the error is that riscv lacks CastVV with dst as the mask register. > This PR adds the corresponding matching rules. > > ### Testing > qemu-system with RVV: > * [x] Run jdk_vector (fastdebug) > * [x] Run compiler/vectorapi (fastdebug) This pull request has now been integrated. Changeset: e0245682 Author: Dingli Zhang Committer: Fei Yang URL: https://git.openjdk.org/jdk/commit/e0245682c8d5a0daae055045c81248c12fb23c09 Stats: 11 lines in 1 file changed: 11 ins; 0 del; 0 mod 8361532: RISC-V: Several vector tests fail after JDK-8354383 Reviewed-by: fyang, fjiang, gcao ------------- PR: https://git.openjdk.org/jdk/pull/26178 From duke at openjdk.org Wed Jul 9 06:08:33 2025 From: duke at openjdk.org (erifan) Date: Wed, 9 Jul 2025 06:08:33 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v11] In-Reply-To: References: Message-ID: > This patch optimizes the following patterns: > For integer types: > > (XorV (VectorMaskCmp src1 src2 cond) (Replicate -1)) > => (VectorMaskCmp src1 src2 ncond) > (XorVMask (VectorMaskCmp src1 src2 cond) (MaskAll m1)) > => (VectorMaskCmp src1 src2 ncond) > > cond can be eq, ne, le, ge, lt, gt, ule, uge, ult and ugt, ncond is the negative comparison of cond. 
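[Editorial illustration, not part of the original message: the integer-type pattern above typically comes from a Vector API compare followed by a mask negation; the float and double variant of the pattern continues below. A minimal sketch, assuming IntVector.SPECIES_128 and the incubating jdk.incubator.vector API; class and method names are invented for the example.]

import jdk.incubator.vector.*;

public class CompareNotSketch {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;

    // compare() produces a VectorMaskCmp; not() negates the mask, which is
    // commonly implemented as an XOR with an all-true mask, i.e. the
    // (XorV (VectorMaskCmp src1 src2 cond) (Replicate -1)) shape described above.
    static VectorMask<Integer> notLessThan(int[] a, int[] b, int i) {
        IntVector va = IntVector.fromArray(SPECIES, a, i);
        IntVector vb = IntVector.fromArray(SPECIES, b, i);
        return va.compare(VectorOperators.LT, vb).not();
    }
}

[With the rewrite described above, such code can be compiled to a single compare with the negated condition (here GE) instead of a compare plus an explicit mask inversion.]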
> > For float and double types: > > (XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1)) > => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) > (XorVMask (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (MaskAll m1)) > => (VectorMaskCast (VectorMaskCmp src1 src2 ncond)) > > cond can be eq or ne. > > Benchmarks on Nvidia Grace machine with 128-bit SVE2: With option `-XX:UseSVE=2`: > > Benchmark Unit Before Score Error After Score Error Uplift > testCompareEQMaskNotByte ops/s 7912127.225 2677.289518 10266136.26 8955.008548 1.29 > testCompareEQMaskNotDouble ops/s 884737.6799 446.963779 1179760.772 448.031844 1.33 > testCompareEQMaskNotFloat ops/s 1765045.787 682.332214 2359520.803 896.305743 1.33 > testCompareEQMaskNotInt ops/s 1787221.411 977.743935 2353952.519 960.069976 1.31 > testCompareEQMaskNotLong ops/s 895297.1974 673.44808 1178449.02 323.804205 1.31 > testCompareEQMaskNotShort ops/s 3339987.002 3415.2226 4712761.965 2110.862053 1.41 > testCompareGEMaskNotByte ops/s 7907615.16 4094.243652 10251646.9 9486.699831 1.29 > testCompareGEMaskNotInt ops/s 1683738.958 4233.813092 2352855.205 1251.952546 1.39 > testCompareGEMaskNotLong ops/s 854496.1561 8594.598885 1177811.493 521.1229 1.37 > testCompareGEMaskNotShort ops/s 3341860.309 1578.975338 4714008.434 1681.10365 1.41 > testCompareGTMaskNotByte ops/s 7910823.674 2993.367032 10245063.58 9774.75138 1.29 > testCompareGTMaskNotInt ops/s 1673393.928 3153.099431 2353654.521 1190.848583 1.4 > testCompareGTMaskNotLong ops/s 849405.9159 2432.858159 1177952.041 359.96413 1.38 > testCompareGTMaskNotShort ops/s 3339509.141 3339.976585 4711442.496 2673.364893 1.41 > testCompareLEMaskNotByte ops/s 7911340.004 3114.69191 10231626.5 27134.20035 1.29 > testCompareLEMaskNotInt ops/s 1675812.113 1340.969885 2353255.341 1452.4522 1.4 > testCompareLEMaskNotLong ops/s 848862.8036 6564.841731 1177763.623 539.290106 1.38 > testCompareLEMaskNotShort ops/s 3324951.54 2380.29473 4712116.251 1544.559684 1.41 > testCompareLTMaskNotByte ops/s 7910390.844 2630.861436 10239567.69 6487.441672 1.29 > testCompareLTMaskNotInt ops/s 1672180.09 995.238142 2353757.863 853.774734 1.4 > testCompareLTMaskNotLong ops/s 856502.26... erifan has updated the pull request incrementally with one additional commit since the last revision: Update the code comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24674/files - new: https://git.openjdk.org/jdk/pull/24674/files/db78dc43..04142a19 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24674&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24674&range=09-10 Stats: 6 lines in 1 file changed: 3 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/24674.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24674/head:pull/24674 PR: https://git.openjdk.org/jdk/pull/24674 From duke at openjdk.org Wed Jul 9 06:18:48 2025 From: duke at openjdk.org (erifan) Date: Wed, 9 Jul 2025 06:18:48 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v10] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 11:42:02 GMT, Emanuel Peter wrote: >> erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 17 additional commits since the last revision: >> >> - Align indentation >> - Merge branch 'master' into JDK-8354242 >> - Address more comments >> >> ATT. 
>> - Merge branch 'master' into JDK-8354242 >> - Support negating unsigned comparison for BoolTest::mask >> >> Added a static method `negate_mask(mask btm)` into BoolTest class to >> negate both signed and unsigned comparison. >> - Addressed some review comments >> - Merge branch 'master' into JDK-8354242 >> - Refactor the JTReg tests for compare.xor(maskAll) >> >> Also made a bit change to support pattern `VectorMask.fromLong()`. >> - Merge branch 'master' into JDK-8354242 >> - Refactor code >> >> Add a new function XorVNode::Ideal_XorV_VectorMaskCmp to do this >> optimization, making the code more modular. >> - ... and 7 more: https://git.openjdk.org/jdk/compare/04bd77d0...db78dc43 > > src/hotspot/share/opto/vectornode.cpp line 2241: > >> 2239: in1->outcnt() != 1 || >> 2240: !(in1->as_VectorMaskCmp())->predicate_can_be_negated() || >> 2241: !VectorNode::is_all_ones_vector(in2)) { > > Suggestion: > > !in1->as_VectorMaskCmp()->predicate_can_be_negated() || > !VectorNode::is_all_ones_vector(in2)) { > > Remove the indentation again, and the superfluous brackets too ;) Done ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2194130835 From duke at openjdk.org Wed Jul 9 06:18:48 2025 From: duke at openjdk.org (erifan) Date: Wed, 9 Jul 2025 06:18:48 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v9] In-Reply-To: <2UzxnawLUtlwIr5aaEdTfn4OEMt_z1HTfAaDBHCeZFU=.a70d1360-574a-4ca9-adae-7dec030ed2b7@github.com> References: <2UzxnawLUtlwIr5aaEdTfn4OEMt_z1HTfAaDBHCeZFU=.a70d1360-574a-4ca9-adae-7dec030ed2b7@github.com> Message-ID: On Tue, 8 Jul 2025 11:42:18 GMT, Emanuel Peter wrote: >> Oh wow, my bad. I misunderstood the brackets! >> >> Instead of: >> >> !(in1->as_VectorMaskCmp())->predicate_can_be_negated() || >> !VectorNode::is_all_ones_vector(in2)) { >> >> I read: >> >> !(in1->as_VectorMaskCmp()->predicate_can_be_negated() || >> !VectorNode::is_all_ones_vector(in2))) { >> >> That confused me a lot... absolutely my bad. >> >> Well actually then my indentation suggestion was terrible! > > I made a new suggestion below. > A code comment would be helpful for this case. I updated the comment above the code a bit. As for why predicate need to be negatable, it's straightforward, the key of this optimization is to change predicate condition into negative predicate condition. And in `predicate_can_be_negated`, there's a comment explaining when predicate can't be negated. > I made a new suggestion below. Done. > That confused me a lot... absolutely my bad. Well actually then my indentation suggestion was terrible! No problem. I'm a newbie in the JDK community, so generally I think your suggestions are valuable.Thanks for your review! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24674#discussion_r2194130234 From thartmann at openjdk.org Wed Jul 9 06:58:38 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 9 Jul 2025 06:58:38 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 13:33:23 GMT, Aleksey Shipilev wrote: >> We routinely CTW 3rd party JARs to make sure our compilers work. By the nature of the JARs, they have dependencies on other JARs, and CTW runner frequently warns out with NCDFE. It does so very crudely, missing opportunities to compile the methods that _do not_ trigger NCDFEs. CTW should be made more tolerant to this. 
I think the normal "modules" CTW runs into a similar problem, but on a lesser scale, as we do not have a very hairy dependency graph within JDK. >> >> The CTW logs are also fairly noisy with full exception traces when NCDFE is semi-expected. This PR does _not_ print exception stack traces in these cases, only "NOTE"-s about it. This makes the log fairly clean and more understandable. >> >> Motivational scope improvement compiling a sample 3rd party JAR (cassandra-2.1.4.0.jar): >> >> >> Before: Done (2487 classes, 9866 methods, 24584 ms) >> After: Done (2487 classes, 10074 methods, 24150 ms) ; +2% more methods >> >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Just use printf directly Sure, I'll run testing and report back. Sorry for the delay. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26090#issuecomment-3051401437 From duke at openjdk.org Wed Jul 9 07:29:39 2025 From: duke at openjdk.org (duke) Date: Wed, 9 Jul 2025 07:29:39 GMT Subject: RFR: 8357689: Refactor JVMCI to enable replay compilation in Graal [v3] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 17:37:57 GMT, Andrej Pecimuth wrote: >> This PR introduces a few minor JVMCI refactorings to make replay compilation possible in the Graal compiler. > > Andrej Pecimuth has updated the pull request incrementally with one additional commit since the last revision: > > Remove unnecessary public modifier. @pecimuth Your change (at version 1b845fa92383367026d8072ba5a9525ded15dccb) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25433#issuecomment-3051484038 From tschatzl at openjdk.org Wed Jul 9 07:46:53 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Wed, 9 Jul 2025 07:46:53 GMT Subject: RFR: 8350621: Code cache stops scheduling GC In-Reply-To: References: Message-ID: On Sun, 16 Feb 2025 18:39:29 GMT, Alexandre Jacob wrote: > The purpose of this PR is to fix a bug where we can end up in a situation where the GC is not scheduled anymore by `CodeCache`. > > This situation is possible because the `_unloading_threshold_gc_requested` flag is set to `true` when triggering the GC and we expect the GC to call `CodeCache::on_gc_marking_cycle_finish` which in turn will call `CodeCache::update_cold_gc_count`, which will reset the flag `_unloading_threshold_gc_requested` allowing further GC scheduling. > > Unfortunately this can't work properly under certain circumstances. > For example, if using G1GC, calling `G1CollectedHeap::collect` does not give the guarantee that the GC will actually run as it can be already running (see [here](https://github.com/openjdk/jdk/blob/7d11418c820b46926a25907766d16083a4b349de/src/hotspot/share/gc/g1/g1CollectedHeap.cpp#L1763)). > > I have observed this behavior on JVMs in version 21 that were migrated recently from Java 17. > Those JVMs have some pressure on code cache and quite a large heap in comparison to allocation rate, which means that objects are mostly GC'd by young collections and full GCs take a long time to happen. > > I have been able to reproduce this issue with ParallelGC and G1GC, and I imagine that other GCs can be impacted as well.
> > In order to reproduce this issue, I found a very simple and convenient way: > > > public class CodeCacheMain { > public static void main(String[] args) throws InterruptedException { > while (true) { > Thread.sleep(100); > } > } > } > > > Run this simple app with the following JVM flags: > > > -Xlog:gc*=info,codecache=info -Xmx512m -XX:ReservedCodeCacheSize=2496k -XX:StartAggressiveSweepingAt=15 > > > - 512m for the heap just to clarify the intent that we don't want to be bothered by a full GC > - low `ReservedCodeCacheSize` to put pressure on code cache quickly > - `StartAggressiveSweepingAt` can be set to 20 or 15 for faster bug reproduction > > Itself, the program will hardly get pressure on code cache, but the good news is that it is sufficient to attach a jconsole on it which will: > - allows us to monitor code cache > - indirectly generate activity on the code cache, just what we need to reproduce the bug > > Some logs related to code cache will show up at some point with GC activity: > > > [648.733s][info][codecache ] Triggering aggressive GC due to having only 14.970% free memory > > > And then it will stop and we'll end up with the following message: > > > [672.714s][info][codecache ] Code cache is full - disabling compilation > > > L... Thank you. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23656#issuecomment-3051539837 From duke at openjdk.org Wed Jul 9 08:22:44 2025 From: duke at openjdk.org (Andrej Pecimuth) Date: Wed, 9 Jul 2025 08:22:44 GMT Subject: Integrated: 8357689: Refactor JVMCI to enable replay compilation in Graal In-Reply-To: References: Message-ID: On Sat, 24 May 2025 16:49:23 GMT, Andrej Pecimuth wrote: > This PR introduces a few minor JVMCI refactorings to make replay compilation possible in the Graal compiler. This pull request has now been integrated. Changeset: 963b83fc Author: Andrej Pecimuth Committer: Doug Simon URL: https://git.openjdk.org/jdk/commit/963b83fcf158d273e9433b6845380184b3ad0de5 Stats: 258 lines in 16 files changed: 224 ins; 7 del; 27 mod 8357689: Refactor JVMCI to enable replay compilation in Graal Reviewed-by: dnsimon ------------- PR: https://git.openjdk.org/jdk/pull/25433 From aph at openjdk.org Wed Jul 9 08:30:42 2025 From: aph at openjdk.org (Andrew Haley) Date: Wed, 9 Jul 2025 08:30:42 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v4] In-Reply-To: References: <3-UMhQGbb3psk7_pn0BC1SJNrKtOfZWphO_V9d9Bqz8=.09589b1a-fb0d-4150-95f4-7565d32ed7b1@github.com> Message-ID: On Tue, 8 Jul 2025 12:03:33 GMT, Evgeny Astigeevich wrote: >>> Erm, I don't see why? >> >> To make it unique? I don't see why you don't see that we need to ensure that the string is unique. > > @theRealAph, are you okay with the latest variant Aleksey proposes? > But this is a generic macroAssembler block comment. It does not make sense to me to have a block comment that looks like a memory corrupted string, just to satisfy a single test. There should be a middle-ground here, e.g. `spin_wait {`, which looks reasonably enough as the block comment, and not anything else. Would also match nicely when we emit the closing `}` at the end of the instruction block. 
OK ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2194396280 From fgao at openjdk.org Wed Jul 9 09:01:49 2025 From: fgao at openjdk.org (Fei Gao) Date: Wed, 9 Jul 2025 09:01:49 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 01:23:43 GMT, Xiaohong Gong wrote: >> ### Background >> On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. >> >> For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. >> >> To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. >> >> ### Impact Analysis >> #### 1. Vector types >> Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. >> >> #### 2. Vector API >> No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. >> >> #### 3. Auto-vectorization >> Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. >> >> #### 4. Codegen of vector nodes >> NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. >> >> Details: >> - Lanewise vector operations are unaffected as explained above. >> - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). >> - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, addin... > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Disable auto-vectorization of double to short conversion for NEON and update tests @XiaohongGong Thanks for testing it, and also for your update ? much appreciated! ------------- Marked as reviewed by fgao (Committer). 
PR Review: https://git.openjdk.org/jdk/pull/26057#pullrequestreview-3000660649 From xgong at openjdk.org Wed Jul 9 09:09:41 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 9 Jul 2025 09:09:41 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 01:23:43 GMT, Xiaohong Gong wrote: >> ### Background >> On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. >> >> For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. >> >> To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. >> >> ### Impact Analysis >> #### 1. Vector types >> Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. >> >> #### 2. Vector API >> No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. >> >> #### 3. Auto-vectorization >> Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. >> >> #### 4. Codegen of vector nodes >> NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. >> >> Details: >> - Lanewise vector operations are unaffected as explained above. >> - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). >> - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, addin... > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Disable auto-vectorization of double to short conversion for NEON and update tests Hi @eme64 , could you please help take a look at this patch especially the test part since most of the tests are SLP related? It will be helpful if you could also help trigger a testing for it. Thanks for your time! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3051800402 From epeter at openjdk.org Wed Jul 9 09:16:44 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Wed, 9 Jul 2025 09:16:44 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 09:06:44 GMT, Xiaohong Gong wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: >> >> Disable auto-vectorization of double to short conversion for NEON and update tests > > Hi @eme64 , could you please help take a look at this patch especially the test part since most of the tests are SLP related? It will be helpful if you could also help trigger a testing for it. Thanks for your time! @XiaohongGong I would love to review and test, but I'm about to go on vacation and will only be back in August. I've pinged some others internally, and hope someone will pick this up! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3051824198 From chagedorn at openjdk.org Wed Jul 9 09:18:53 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 9 Jul 2025 09:18:53 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v36] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 08:43:31 GMT, Roland Westrelin wrote: >> To optimize a long counted loop and long range checks in a long or int >> counted loop, the loop is turned into a loop nest. When the loop has >> few iterations, the overhead of having an outer loop whose backedge is >> never taken, has a measurable cost. Furthermore, creating the loop >> nest usually causes one iteration of the loop to be peeled so >> predicates can be set up. If the loop is short running, then it's an >> extra iteration that's run with range checks (compared to an int >> counted loop with int range checks). >> >> This change doesn't create a loop nest when: >> >> 1- it can be determined statically at loop nest creation time that the >> loop runs for a short enough number of iterations >> >> 2- profiling reports that the loop runs for no more than ShortLoopIter >> iterations (1000 by default). >> >> For 2-, a guard is added which is implemented as yet another predicate. >> >> While this change is in principle simple, I ran into a few >> implementation issues: >> >> - while c2 has a way to compute the number of iterations of an int >> counted loop, it doesn't have that for long counted loop. The >> existing logic for int counted loops promotes values to long to >> avoid overflows. I reworked it so it now works for both long and int >> counted loops. >> >> - I added a new deoptimization reason (Reason_short_running_loop) for >> the new predicate. Given the number of iterations is narrowed down >> by the predicate, the limit of the loop after transformation is a >> cast node that's control dependent on the short running loop >> predicate. Because once the counted loop is transformed, it is >> likely that range check predicates will be inserted and they will >> depend on the limit, the short running loop predicate has to be the >> one that's further away from the loop entry. 
Now it is also possible >> that the limit before transformation depends on a predicate >> (TestShortRunningLongCountedLoopPredicatesClone is an example), we >> can have: new predicates inserted after the transformation that >> depend on the casted limit that itself depend on old predicates >> added before the transformation. To solve this cicular dependency, >> parse and assert predicates are cloned between the old predicates >> and the loop head. The cloned short running loop parse predicate is >> the one that's used to insert the short running loop predicate. >> >> - In the case of a long counted loop, the loop is transformed into a >> regular loop with a ... > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 97 commits: > > - review > - Merge branch 'master' into JDK-8342692 > - Update src/hotspot/share/opto/c2_globals.hpp > > Co-authored-by: Christian Hagedorn > - small fix > - Merge branch 'master' into JDK-8342692 > - review > - review > - Update test/micro/org/openjdk/bench/java/lang/foreign/HeapMismatchManualLoopTest.java > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningLongCountedLoopScaleOverflow.java > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningLongCountedLoopPredicatesClone.java > > Co-authored-by: Christian Hagedorn > - ... and 87 more: https://git.openjdk.org/jdk/compare/310ef856...bb69cc02 Thanks for the update, still good! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/21630#pullrequestreview-3000719709 From xgong at openjdk.org Wed Jul 9 09:19:42 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 9 Jul 2025 09:19:42 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 09:06:44 GMT, Xiaohong Gong wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: >> >> Disable auto-vectorization of double to short conversion for NEON and update tests > > Hi @eme64 , could you please help take a look at this patch especially the test part since most of the tests are SLP related? It will be helpful if you could also help trigger a testing for it. Thanks for your time! > @XiaohongGong I would love to review and test, but I'm about to go on vacation and will only be back in August. I've pinged some others internally, and hope someone will pick this up! Thanks a lot for the help! Sounds good to me and have a good holiday! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3051835903 From fjiang at openjdk.org Wed Jul 9 10:07:31 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Wed, 9 Jul 2025 10:07:31 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v4] In-Reply-To: References: Message-ID: <30vvzTU6W2p0YpB8Z9bSfO9ajO_fHh79q9cX1G3gz3k=.521b26d7-b606-4fdc-bdcf-41fd6c4891cc@github.com> > Hi, please consider. > [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. > The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. 
> If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. > This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. > We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. > > This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. > The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. > > Test on linux-riscv64: > - [x] Tier1-3 > > JMH data on P550 SBC for reference (w/o and w/ the patch): > > Before: > > Without COH: > > Benchmark (size) Mode Cnt Score Error Units > ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op > ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op > ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op > ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op > ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op > ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op > ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op > ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op > ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op > ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op > ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op > ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op > ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op > ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op > ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op > ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op > > ------------------------------------------------------------------------- > With COH: > > Benchmark (size) Mode Cnt Score Error Units > ArrayClone.byteArraycopy 0 avgt 15 50.667 ? 0.622 ns/op > Arra... Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains six additional commits since the last revision: - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - Revert RISCV Macro modification - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses - riscv: fix c1 primitive array clone intrinsic regression ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25976/files - new: https://git.openjdk.org/jdk/pull/25976/files/3a502f84..ca628e16 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25976&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25976&range=02-03 Stats: 8585 lines in 322 files changed: 4290 ins; 1434 del; 2861 mod Patch: https://git.openjdk.org/jdk/pull/25976.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25976/head:pull/25976 PR: https://git.openjdk.org/jdk/pull/25976 From chagedorn at openjdk.org Wed Jul 9 10:37:52 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 9 Jul 2025 10:37:52 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v36] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 08:43:31 GMT, Roland Westrelin wrote: >> To optimize a long counted loop and long range checks in a long or int >> counted loop, the loop is turned into a loop nest. When the loop has >> few iterations, the overhead of having an outer loop whose backedge is >> never taken, has a measurable cost. Furthermore, creating the loop >> nest usually causes one iteration of the loop to be peeled so >> predicates can be set up. If the loop is short running, then it's an >> extra iteration that's run with range checks (compared to an int >> counted loop with int range checks). >> >> This change doesn't create a loop nest when: >> >> 1- it can be determined statically at loop nest creation time that the >> loop runs for a short enough number of iterations >> >> 2- profiling reports that the loop runs for no more than ShortLoopIter >> iterations (1000 by default). >> >> For 2-, a guard is added which is implemented as yet another predicate. >> >> While this change is in principle simple, I ran into a few >> implementation issues: >> >> - while c2 has a way to compute the number of iterations of an int >> counted loop, it doesn't have that for long counted loop. The >> existing logic for int counted loops promotes values to long to >> avoid overflows. I reworked it so it now works for both long and int >> counted loops. >> >> - I added a new deoptimization reason (Reason_short_running_loop) for >> the new predicate. Given the number of iterations is narrowed down >> by the predicate, the limit of the loop after transformation is a >> cast node that's control dependent on the short running loop >> predicate. Because once the counted loop is transformed, it is >> likely that range check predicates will be inserted and they will >> depend on the limit, the short running loop predicate has to be the >> one that's further away from the loop entry. Now it is also possible >> that the limit before transformation depends on a predicate >> (TestShortRunningLongCountedLoopPredicatesClone is an example), we >> can have: new predicates inserted after the transformation that >> depend on the casted limit that itself depend on old predicates >> added before the transformation. 
To solve this cicular dependency, >> parse and assert predicates are cloned between the old predicates >> and the loop head. The cloned short running loop parse predicate is >> the one that's used to insert the short running loop predicate. >> >> - In the case of a long counted loop, the loop is transformed into a >> regular loop with a ... > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 97 commits: > > - review > - Merge branch 'master' into JDK-8342692 > - Update src/hotspot/share/opto/c2_globals.hpp > > Co-authored-by: Christian Hagedorn > - small fix > - Merge branch 'master' into JDK-8342692 > - review > - review > - Update test/micro/org/openjdk/bench/java/lang/foreign/HeapMismatchManualLoopTest.java > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningLongCountedLoopScaleOverflow.java > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningLongCountedLoopPredicatesClone.java > > Co-authored-by: Christian Hagedorn > - ... and 87 more: https://git.openjdk.org/jdk/compare/310ef856...bb69cc02 I gave your latest patch another spin in our testing. It's still running but it already found some issues: - SA tests (see separate comment) - `#include` order problem (see separate comment) - Various `jdk/incubator/vector/*` tests are failing, for example `Byte128VectorLoadStoreTests.java`: Additional VM flags: `-XX:UseAVX=2` (it also reproduces with 0 and 1 so far) # # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/opt/mach5/mesos/work_dir/slaves/d2398cde-9325-49c3-b030-8961a4f0a253-S650407/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/05605dc0-bf5e-434a-82b5-65af69c62ec6/runs/591d89b1-11c0-415e-b2ce-4c0a13ce80f8/workspace/open/src/hotspot/share/opto/vectorization.cpp:141), pid=704535, tid=704555 # assert(_cl->is_multiversion_fast_loop() == (_multiversioning_fast_proj != nullptr)) failed: must find the multiversion selector IFF loop is a multiversion fast loop Current CompileTask: C2:7789 1280 jdk.incubator.vector.ByteVector::ldLongOp (48 bytes) Stack: [0x00007f9ef7cfe000,0x00007f9ef7dfe000], sp=0x00007f9ef7df8560, free space=1001k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x1bcb7a4] VLoop::check_preconditions_helper() [clone .part.0]+0x824 (vectorization.cpp:141) V [libjvm.so+0x1bcba31] VLoop::check_preconditions()+0x41 (vectorization.cpp:41) V [libjvm.so+0x1573ea1] PhaseIdealLoop::auto_vectorize(IdealLoopTree*, VSharedData&)+0x241 (loopopts.cpp:4449) V [libjvm.so+0x155274d] PhaseIdealLoop::build_and_optimize()+0xfdd (loopnode.cpp:5270) [...] src/hotspot/share/opto/castnode.cpp line 35: > 33: #include "opto/subnode.hpp" > 34: #include "opto/type.hpp" > 35: #include "opto/loopnode.hpp" The new unsorted include now causes `sources/TestIncludesAreSorted.java` to fail. 
------------- PR Review: https://git.openjdk.org/jdk/pull/21630#pullrequestreview-3000973554 PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2194660384 From chagedorn at openjdk.org Wed Jul 9 10:37:52 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 9 Jul 2025 10:37:52 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v16] In-Reply-To: References: Message-ID: On Thu, 15 May 2025 10:30:13 GMT, Christian Hagedorn wrote: >> Otherwise, this assert: >> >> >> assert((1 << _reason_bits) >= Reason_LIMIT, "enough bits"); >> >> >> fails. Rather than tweak the allocation of bits to `_action_bits`, `_reason_bits`, `_debug_id_bits`, to extend `_reason_bits`, I thought it was simpler to have c2 and graal share the encoding of a reason given graal doesn't use the new `Reason_short_running_long_loop` and c2 doesn't use the jvmci specific `Reason_aliasing`. > > Makes sense, thanks for the explanation! I think this sharing is now causing problems with SA tests, for example with `serviceability/sa/TestPrintMdo.java`: stderr: [Exception in thread "main" java.lang.InternalError: duplicate reasons: aliasing short_running_long_loop at jdk.hotspot.agent/sun.jvm.hotspot.oops.MethodData.initialize(MethodData.java:181) at jdk.hotspot.agent/sun.jvm.hotspot.oops.MethodData$1.update(MethodData.java:128) at jdk.hotspot.agent/sun.jvm.hotspot.runtime.VM.registerVMInitializedObserver(VM.java:569) at jdk.hotspot.agent/sun.jvm.hotspot.oops.MethodData.(MethodData.java:126) [...] ] exitValue = 1 java.lang.RuntimeException: Test ERROR java.lang.RuntimeException: Expected to get exit value of [0], exit value is: [1] at TestPrintMdo.main(TestPrintMdo.java:66) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) at java.base/java.lang.reflect.Method.invoke(Method.java:565) at com.sun.javatest.regtest.agent.MainWrapper$MainTask.run(MainWrapper.java:138) at java.base/java.lang.Thread.run(Thread.java:1474) Caused by: java.lang.RuntimeException: Expected to get exit value of [0], exit value is: [1] at jdk.test.lib.process.OutputAnalyzer.shouldHaveExitValue(OutputAnalyzer.java:522) at ClhsdbLauncher.runCmd(ClhsdbLauncher.java:148) at ClhsdbLauncher.run(ClhsdbLauncher.java:212) at TestPrintMdo.main(TestPrintMdo.java:62) ... 4 more ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2194657689 From bkilambi at openjdk.org Wed Jul 9 10:45:46 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 9 Jul 2025 10:45:46 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 01:23:43 GMT, Xiaohong Gong wrote: >> ### Background >> On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. >> >> For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. 
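A small, self-contained example of that conversion shape, written against the public Vector API, is given below. It is illustrative only and not part of the patch; it needs --add-modules jdk.incubator.vector to compile and run.

    import jdk.incubator.vector.LongVector;
    import jdk.incubator.vector.ShortVector;
    import jdk.incubator.vector.VectorOperators;
    import jdk.incubator.vector.VectorSpecies;

    public class ShortToLong128 {
        static final VectorSpecies<Short> S128 = ShortVector.SPECIES_128;
        static final VectorSpecies<Long>  L128 = LongVector.SPECIES_128;

        public static void main(String[] args) {
            short[] src = {1, 2, 3, 4, 5, 6, 7, 8};
            long[] dst = new long[L128.length()];          // 2 lanes for a 128-bit long species

            ShortVector sv = ShortVector.fromArray(S128, src, 0);
            // Part 0 widens the lowest 2 short lanes into a 128-bit long vector;
            // this is where a "2 x short" (32-bit) intermediate vector shows up.
            LongVector lv = (LongVector) sv.convertShape(VectorOperators.S2L, L128, 0);
            lv.intoArray(dst, 0);

            System.out.println(java.util.Arrays.toString(dst)); // [1, 2]
        }
    }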
>> >> To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. >> >> ### Impact Analysis >> #### 1. Vector types >> Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. >> >> #### 2. Vector API >> No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. >> >> #### 3. Auto-vectorization >> Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. >> >> #### 4. Codegen of vector nodes >> NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. >> >> Details: >> - Lanewise vector operations are unaffected as explained above. >> - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). >> - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, addin... > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Disable auto-vectorization of double to short conversion for NEON and update tests Thanks for making the changes. Looks good to me. ------------- Marked as reviewed by bkilambi (Author). PR Review: https://git.openjdk.org/jdk/pull/26057#pullrequestreview-3001022247 From bkilambi at openjdk.org Wed Jul 9 11:08:59 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 9 Jul 2025 11:08:59 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v11] In-Reply-To: References: Message-ID: > This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. > > It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. > > For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. > > For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. > > This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. 
> > Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - > > > Benchmark (size) Mode Cnt Gain > SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 > SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 > SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 > SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 > SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 > SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 > SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 > SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 > SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 > SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 > > > Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: x86_64: JTREG test update for x86. The patch is contributed by @jatin-bhateja ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23570/files - new: https://git.openjdk.org/jdk/pull/23570/files/e86d55df..cec6f148 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=09-10 Stats: 46 lines in 3 files changed: 31 ins; 0 del; 15 mod Patch: https://git.openjdk.org/jdk/pull/23570.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23570/head:pull/23570 PR: https://git.openjdk.org/jdk/pull/23570 From dlunden at openjdk.org Wed Jul 9 11:11:51 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Wed, 9 Jul 2025 11:11:51 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 09:17:13 GMT, Xiaohong Gong wrote: >> Hi @eme64 , could you please help take a look at this patch especially the test part since most of the tests are SLP related? It will be helpful if you could also help trigger a testing for it. Thanks for your time! > >> @XiaohongGong I would love to review and test, but I'm about to go on vacation and will only be back in August. I've pinged some others internally, and hope someone will pick this up! > > Thanks a lot for the help! Sounds good to me and have a good holiday! @XiaohongGong I'll run some tests and have a look at the changes as well (@eme64 asked me). I'll get back to you shortly! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3052222395 From bkilambi at openjdk.org Wed Jul 9 11:22:58 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 9 Jul 2025 11:22:58 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v12] In-Reply-To: References: Message-ID: > This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. > > It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. > > For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. > > For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. 
For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. > > This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. > > Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - > > > Benchmark (size) Mode Cnt Gain > SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 > SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 > SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 > SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 > SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 > SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 > SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 > SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 > SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 > SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 > > > Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: Addressed review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23570/files - new: https://git.openjdk.org/jdk/pull/23570/files/cec6f148..8025db0c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=10-11 Stats: 61 lines in 5 files changed: 14 ins; 9 del; 38 mod Patch: https://git.openjdk.org/jdk/pull/23570.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23570/head:pull/23570 PR: https://git.openjdk.org/jdk/pull/23570 From bkilambi at openjdk.org Wed Jul 9 11:40:29 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 9 Jul 2025 11:40:29 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v13] In-Reply-To: References: Message-ID: > This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. > > It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. > > For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. > > For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. > > This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. 
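For context, here is a usage sketch of the two-vector selectFrom API that the intrinsic implements. It is illustrative only: it assumes a JDK whose jdk.incubator.vector module already exposes the two-vector selectFrom overload, and it needs --add-modules jdk.incubator.vector.

    import jdk.incubator.vector.ByteVector;
    import jdk.incubator.vector.VectorSpecies;

    public class SelectFromTwoVectors {
        static final VectorSpecies<Byte> B128 = ByteVector.SPECIES_128;

        public static void main(String[] args) {
            int vlen = B128.length();                       // 16 byte lanes at 128 bits
            byte[] lo = new byte[vlen];
            byte[] hi = new byte[vlen];
            byte[] idx = new byte[vlen];
            for (int i = 0; i < vlen; i++) {
                lo[i] = (byte) i;                           // 0..15
                hi[i] = (byte) (i + vlen);                  // 16..31
                idx[i] = (byte) ((i * 7) % (2 * vlen));     // indexes into the 32-lane pair
            }
            ByteVector v1 = ByteVector.fromArray(B128, lo, 0);
            ByteVector v2 = ByteVector.fromArray(B128, hi, 0);
            ByteVector sel = ByteVector.fromArray(B128, idx, 0);

            // Lanes of 'sel' in [0, vlen) pick from v1, lanes in [vlen, 2*vlen) pick from v2.
            ByteVector r = (ByteVector) sel.selectFrom(v1, v2);
            System.out.println(java.util.Arrays.toString(r.toArray()));
        }
    }

The index split between the two source vectors is what maps naturally onto the two-table tbl lookup described above.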
> > Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - > > > Benchmark (size) Mode Cnt Gain > SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 > SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 > SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 > SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 > SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 > SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 > SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 > SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 > SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 > SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 > > > Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: Change match rule names to lowercase ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23570/files - new: https://git.openjdk.org/jdk/pull/23570/files/8025db0c..34566e7d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=11-12 Stats: 14 lines in 2 files changed: 0 ins; 0 del; 14 mod Patch: https://git.openjdk.org/jdk/pull/23570.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23570/head:pull/23570 PR: https://git.openjdk.org/jdk/pull/23570 From thartmann at openjdk.org Wed Jul 9 12:21:41 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 9 Jul 2025 12:21:41 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 13:33:23 GMT, Aleksey Shipilev wrote: >> We routinely CTW 3rd party JARs to make sure our compilers work. By the nature of the JARs, they have dependencies on other JARs, and CTW runner frequently warns out with NCDFE. It does so very crudely, missing opportunities to compile the methods that _do not_ trigger NCDFEs. CTW should be made more tolerant to this. I think the normal "modules" CTW runs into the similar problem, but on a lesser scale, as we do not have a very hairy dependency graph within JDK. >> >> The CTW logs are also fairly noisy with full exception traces when NCDFE is semi-expected. This PR does _not_ print exception stack traces in these cases, only "NOTE"-s about it. This makes the log fairly clean and more understandable. >> >> Motivational scope improvement compiling a sample 3rd party JAR (cassandra-2.1.4.0.jar): >> >> >> Before: Done (2487 classes, 9866 methods, 24584 ms) >> After: Done (2487 classes, 10074 methods, 24150 ms) ; +2% more methods >> >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Just use printf directly Marked as reviewed by thartmann (Reviewer). All tests passed. 
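The per-method tolerance described in this change can be pictured with a short Java sketch. The names here are hypothetical and this is not the actual CTW runner code; it only shows the idea of catching NoClassDefFoundError around a single method instead of abandoning the whole class.

    import java.lang.reflect.Method;

    public class TolerantCompileSketch {
        // Stand-in for whatever actually triggers compilation (e.g. a WhiteBox hook).
        interface MethodCompiler {
            void compile(Method m);
        }

        static int compileAllMethods(Class<?> klass, MethodCompiler compiler) {
            Method[] methods;
            try {
                methods = klass.getDeclaredMethods();       // may itself throw NCDFE
            } catch (NoClassDefFoundError e) {
                System.out.printf("NOTE: skipping all of %s: %s%n", klass.getName(), e);
                return 0;
            }
            int compiled = 0;
            for (Method m : methods) {
                try {
                    compiler.compile(m);
                    compiled++;
                } catch (NoClassDefFoundError e) {
                    // Only this method depends on the missing class: note it and keep
                    // going, so the remaining methods still get compiled.
                    System.out.printf("NOTE: skipping %s::%s: %s%n",
                                      klass.getName(), m.getName(), e);
                }
            }
            return compiled;
        }
    }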
------------- PR Review: https://git.openjdk.org/jdk/pull/26090#pullrequestreview-3001324494 PR Comment: https://git.openjdk.org/jdk/pull/26090#issuecomment-3052446627 From mchevalier at openjdk.org Wed Jul 9 12:36:31 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Wed, 9 Jul 2025 12:36:31 GMT Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v5] In-Reply-To: References: Message-ID: > A first part toward a better support of pure functions, but this time, with guidance from @iwanowww. > > ## Pure Functions > > Pure functions (considered here) are functions that have no side effects, no effect on the control flow (no exception or such), cannot deopt etc.. It's really a function that you can execute anywhere, with whichever arguments without effect other than wasting time. Integer division is not pure as dividing by zero is throwing. But many floating point functions will just return `NaN` or `+/-infinity` in problematic cases. > > ## Scope > > We are not going all powerful for now! It's mostly about identifying some pure functions and being able to remove them if the result is unused. Some other things are not part of this PR, on purpose. Especially, this PR doesn't propose a way to move pure calls around. The reason is that pure calls are later expanded into regular calls, which require a control input. To be able to do the expansion, we just keep the control in the pure call as well. > > ## Implementation Overview > > We created here some new node kind for pure calls, inheriting leaf calls, that are expanded into regular leaf calls during final graph reshaping. The possibility to support pure call directly in AD file is left open. > > This PR also introduces `TupleNode` (largely based on an original idea/implem of @iwanowww), that just tie multiple input together and play well with `ProjNode`: the n-th projection of a `TupleNode` is the n-th input of the tuple. This is a convenient way to skip and remove nodes from the graph while delegating the difficulty of the surgery to the trusted IGVN's implementation. > > Thanks, > Marc Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: Tentative to address Tobias' comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25760/files - new: https://git.openjdk.org/jdk/pull/25760/files/7f18c9f6..7553e307 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25760&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25760&range=03-04 Stats: 118 lines in 2 files changed: 29 ins; 38 del; 51 mod Patch: https://git.openjdk.org/jdk/pull/25760.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25760/head:pull/25760 PR: https://git.openjdk.org/jdk/pull/25760 From mchevalier at openjdk.org Wed Jul 9 12:36:32 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Wed, 9 Jul 2025 12:36:32 GMT Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v3] In-Reply-To: References: Message-ID: <785hwTKUD5k99LD_yX2GOLlWWk0-6fsQ3piigA_FIhs=.ed24d7c7-9ea6-4815-a738-1f61bade24e4@github.com> On Tue, 8 Jul 2025 15:16:16 GMT, Tobias Hartmann wrote: >> We could but it's not that direct. `ModFNode::Ideal` has 6 `return`s (without mine): >> - 3 are `return replace_with_con(...);` which in their turn return `nullptr` but after making changes in the graph. >> - 2 are `return nullptr;` >> - 1 is actually returning a node. 
>> And especially the final one is >> >> return replace_with_con(igvn, TypeF::make(jfloat_cast(xr))); >> >> If we change `replace_with_con` to actually return a `TupleNode` to do the job, we still have 2 places where to call the base class' `Ideal`. So I'm not sure how much better it would be to duplicate the call. It also adds a maintenance burden: if one adds another case where we don't want to make changes, one needs to add another call to `CallLeafPureNode::Ideal`. I think it's because of the structure of this function: rather than selecting cases where we want to do something and reaching the end with only the leftover cases, we select the cases we don't want to continue with, and we return early, making more cases where we should call the super method. >> >> I'll try something. > > Ah, makes sense. Feel free to leave as-is then. There, you can see the thing I've tried. It changes a bit more code, but overall, I think it makes it clearer and address your comment. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25760#discussion_r2194912203 From mhaessig at openjdk.org Wed Jul 9 12:36:46 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Wed, 9 Jul 2025 12:36:46 GMT Subject: RFR: 8360175: C2 crash: assert(edge_from_to(prior_use, n)) failed: before block local scheduling In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 00:40:39 GMT, Vladimir Kozlov wrote: >> The triggered assert is part of the schedule verification code that runs just before machine code is emitted. The debug output showed that a `leaPCompressedOopOffset` node was causing the assert, which suggested the peephole optimization introduced in #25471 as the cause. The failure proved quite difficult to reproduce. It failed more often on Windows and required `-XX:+UseKNLSetting` (forces code generation for Intel's Knights Landing platform), which forces `-XX:+OptoScheduling`. >> >> The root-cause is a subtle bug in the rewiring of the base edge of `leaP*` nodes in the `remove_redundant_lea` peephole. When the peephole removed a `decodeHeapOop_not_null` including a spill, it did not set the base edge of the `leaP*` node to the same node as the address edge, which is the intent of the peephole, but to the parent node of the spill. That is not catastrophic in most cases, but might reference another register slot, which causes this assert. Concretely, we see the following graph >> >> MemToRegSpillCopy >> | | >> | MemToRegSpillCopy >> | | >> DefiniinoSpillCopy | >> | | >> | decodeHeapOop_not_null >> | | >> leaPCompressedHeapOop >> >> gets rewired to >> >> MemToRegSpillCopy >> | | >> DefinitionSpillCopy | >> | | >> leaPCompressedHeapOop >> >> instead of >> >> MemToRegSpillCopy >> | >> DefinitionSpillCopy >> / \ >> leaPCompressedHeapOop >> >> >> This PR fixes this by always setting the base edge of the `leaP*` node to the same node as the address edge. Unfortunately, I was not able to construct a regression test because of the difficulty of reproducing the bug. >> >> # Testing >> >> - [x] Github Actions >> - [x] tier1,tier2 plus internal testing on all Oracle supported platforms >> - [x] tier3,tier4,tier5 plus internal testing on Linux and Windows x64 >> - [x] Runthese8H on `windows-x64-debug` (test that reliably produced the failure addressed in this PR) > > Seems fine. Thank you for your reviews, @vnkozlov and @chhagedorn! 
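Returning to the pure-call discussion above, the case being optimized is easy to picture in Java. This is illustrative only: a floating-point remainder whose result is never used, the kind of side-effect-free computation whose backing runtime call the change allows C2 to drop.

    public class UnusedPureCall {
        static float f(float x, float y) {
            float unused = x % y;   // pure computation: no side effects, cannot trap
            return x + y;           // 'unused' is dead, so the call behind it can be removed
        }

        public static void main(String[] args) {
            System.out.println(f(10.5f, 3.0f));
        }
    }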
------------- PR Comment: https://git.openjdk.org/jdk/pull/26157#issuecomment-3052498669 From mhaessig at openjdk.org Wed Jul 9 12:36:47 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Wed, 9 Jul 2025 12:36:47 GMT Subject: Integrated: 8360175: C2 crash: assert(edge_from_to(prior_use,n)) failed: before block local scheduling In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 09:47:09 GMT, Manuel H?ssig wrote: > The triggered assert is part of the schedule verification code that runs just before machine code is emitted. The debug output showed that a `leaPCompressedOopOffset` node was causing the assert, which suggested the peephole optimization introduced in #25471 as the cause. The failure proved quite difficult to reproduce. It failed more often on Windows and required `-XX:+UseKNLSetting` (forces code generation for Intel's Knights Landing platform), which forces `-XX:+OptoScheduling`. > > The root-cause is a subtle bug in the rewiring of the base edge of `leaP*` nodes in the `remove_redundant_lea` peephole. When the peephole removed a `decodeHeapOop_not_null` including a spill, it did not set the base edge of the `leaP*` node to the same node as the address edge, which is the intent of the peephole, but to the parent node of the spill. That is not catastrophic in most cases, but might reference another register slot, which causes this assert. Concretely, we see the following graph > > MemToRegSpillCopy > | | > | MemToRegSpillCopy > | | > DefiniinoSpillCopy | > | | > | decodeHeapOop_not_null > | | > leaPCompressedHeapOop > > gets rewired to > > MemToRegSpillCopy > | | > DefinitionSpillCopy | > | | > leaPCompressedHeapOop > > instead of > > MemToRegSpillCopy > | > DefinitionSpillCopy > / \ > leaPCompressedHeapOop > > > This PR fixes this by always setting the base edge of the `leaP*` node to the same node as the address edge. Unfortunately, I was not able to construct a regression test because of the difficulty of reproducing the bug. > > # Testing > > - [x] Github Actions > - [x] tier1,tier2 plus internal testing on all Oracle supported platforms > - [x] tier3,tier4,tier5 plus internal testing on Linux and Windows x64 > - [x] Runthese8H on `windows-x64-debug` (test that reliably produced the failure addressed in this PR) This pull request has now been integrated. Changeset: db4b4a5b Author: Manuel H?ssig URL: https://git.openjdk.org/jdk/commit/db4b4a5b35a7664ddafed2817703ffd36a921fee Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8360175: C2 crash: assert(edge_from_to(prior_use,n)) failed: before block local scheduling Reviewed-by: kvn, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/26157 From mchevalier at openjdk.org Wed Jul 9 12:38:26 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Wed, 9 Jul 2025 12:38:26 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v5] In-Reply-To: References: Message-ID: > When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 > > This is enforced by restoring the old state, like in > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 > > That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. 
For instance, we can obtain something such as: > > ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) > > > Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. > > Another situation is somewhat worse, when happening during parsing. It can lead to such cases: > > ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) > > The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? > > This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 > > And here there is the challenge: > - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) > - we can't really change the pointer, just the content > -... Marc Chevalier has updated the pull request incrementally with two additional commits since the last revision: - Tentative addressing Vladimir's comments - Re-insert ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25936/files - new: https://git.openjdk.org/jdk/pull/25936/files/09b24ec4..59133778 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25936&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25936&range=03-04 Stats: 83 lines in 3 files changed: 12 ins; 29 del; 42 mod Patch: https://git.openjdk.org/jdk/pull/25936.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25936/head:pull/25936 PR: https://git.openjdk.org/jdk/pull/25936 From mchevalier at openjdk.org Wed Jul 9 12:40:42 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Wed, 9 Jul 2025 12:40:42 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v4] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 20:23:08 GMT, Vladimir Kozlov wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> Somehow intellij doesn't remove empty indented line > > src/hotspot/share/opto/library_call.hpp line 150: > >> 148: void restore_state(const SavedState&); >> 149: void destruct_map_clone(const SavedState& sfp); >> 150: > > Can this be a class instead of struct? These methods could be members. Initialization can be done through constructor. The destructor can do restoration by default unless `destruct_map_clone()` was called before. 
> I don't like name `destruct_map_clone()` for this. How about `SavedState::remove()` or something. I like it. Since intrinsic implementations have mostly bailing out returns, and few success paths, it's nice to say when we are good, rather than every path that ends with bailing out. I've called the member function `discard`. It gives as `old_state.discard()`, which reads well, I think. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2194921677 From shade at openjdk.org Wed Jul 9 12:45:47 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 9 Jul 2025 12:45:47 GMT Subject: RFR: 8361255: CTW: Tolerate more NCDFE problems [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 13:33:23 GMT, Aleksey Shipilev wrote: >> We routinely CTW 3rd party JARs to make sure our compilers work. By the nature of the JARs, they have dependencies on other JARs, and CTW runner frequently warns out with NCDFE. It does so very crudely, missing opportunities to compile the methods that _do not_ trigger NCDFEs. CTW should be made more tolerant to this. I think the normal "modules" CTW runs into the similar problem, but on a lesser scale, as we do not have a very hairy dependency graph within JDK. >> >> The CTW logs are also fairly noisy with full exception traces when NCDFE is semi-expected. This PR does _not_ print exception stack traces in these cases, only "NOTE"-s about it. This makes the log fairly clean and more understandable. >> >> Motivational scope improvement compiling a sample 3rd party JAR (cassandra-2.1.4.0.jar): >> >> >> Before: Done (2487 classes, 9866 methods, 24584 ms) >> After: Done (2487 classes, 10074 methods, 24150 ms) ; +2% more methods >> >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Just use printf directly Thank you! Here goes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26090#issuecomment-3052522900 From shade at openjdk.org Wed Jul 9 12:45:48 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 9 Jul 2025 12:45:48 GMT Subject: Integrated: 8361255: CTW: Tolerate more NCDFE problems In-Reply-To: References: Message-ID: <-DDewlGSacq_GgjqWaecDzvaMrI97_wOvRZXdlYAcTI=.2653dc31-b006-459e-a956-040517e1e040@github.com> On Wed, 2 Jul 2025 10:14:36 GMT, Aleksey Shipilev wrote: > We routinely CTW 3rd party JARs to make sure our compilers work. By the nature of the JARs, they have dependencies on other JARs, and CTW runner frequently warns out with NCDFE. It does so very crudely, missing opportunities to compile the methods that _do not_ trigger NCDFEs. CTW should be made more tolerant to this. I think the normal "modules" CTW runs into the similar problem, but on a lesser scale, as we do not have a very hairy dependency graph within JDK. > > The CTW logs are also fairly noisy with full exception traces when NCDFE is semi-expected. This PR does _not_ print exception stack traces in these cases, only "NOTE"-s about it. This makes the log fairly clean and more understandable. > > Motivational scope improvement compiling a sample 3rd party JAR (cassandra-2.1.4.0.jar): > > > Before: Done (2487 classes, 9866 methods, 24584 ms) > After: Done (2487 classes, 10074 methods, 24150 ms) ; +2% more methods > > > Additional testing: > - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` This pull request has now been integrated. 
Changeset: a201be85 Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/a201be8555c57f07b86f470df4699e1b9dd6bd3c Stats: 45 lines in 2 files changed: 35 ins; 0 del; 10 mod 8361255: CTW: Tolerate more NCDFE problems Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/26090 From thartmann at openjdk.org Wed Jul 9 12:47:41 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 9 Jul 2025 12:47:41 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v5] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 12:38:26 GMT, Marc Chevalier wrote: >> When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 >> >> This is enforced by restoring the old state, like in >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 >> >> That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: >> >> ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) >> >> >> Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. >> >> Another situation is somewhat worse, when happening during parsing. It can lead to such cases: >> >> ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) >> >> The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? >> >> This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 >> >> And here there is the challenge: >> - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) >> ... > > Marc Chevalier has updated the pull request incrementally with two additional commits since the last revision: > > - Tentative addressing Vladimir's comments > - Re-insert Nice! Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/25936#pullrequestreview-3001409938 From thartmann at openjdk.org Wed Jul 9 13:11:44 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 9 Jul 2025 13:11:44 GMT Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v5] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 12:36:31 GMT, Marc Chevalier wrote: >> A first part toward a better support of pure functions, but this time, with guidance from @iwanowww. >> >> ## Pure Functions >> >> Pure functions (considered here) are functions that have no side effects, no effect on the control flow (no exception or such), cannot deopt etc.. It's really a function that you can execute anywhere, with whichever arguments without effect other than wasting time. Integer division is not pure as dividing by zero is throwing. But many floating point functions will just return `NaN` or `+/-infinity` in problematic cases. >> >> ## Scope >> >> We are not going all powerful for now! It's mostly about identifying some pure functions and being able to remove them if the result is unused. Some other things are not part of this PR, on purpose. Especially, this PR doesn't propose a way to move pure calls around. The reason is that pure calls are later expanded into regular calls, which require a control input. To be able to do the expansion, we just keep the control in the pure call as well. >> >> ## Implementation Overview >> >> We created here some new node kind for pure calls, inheriting leaf calls, that are expanded into regular leaf calls during final graph reshaping. The possibility to support pure call directly in AD file is left open. >> >> This PR also introduces `TupleNode` (largely based on an original idea/implem of @iwanowww), that just tie multiple input together and play well with `ProjNode`: the n-th projection of a `TupleNode` is the n-th input of the tuple. This is a convenient way to skip and remove nodes from the graph while delegating the difficulty of the surgery to the trusted IGVN's implementation. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Tentative to address Tobias' comments Thanks for making these changes, I like that version more. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25760#pullrequestreview-3001503959 From eastigeevich at openjdk.org Wed Jul 9 13:24:59 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Wed, 9 Jul 2025 13:24:59 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v6] In-Reply-To: References: Message-ID: > Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. > > This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. > > Tested on Linux and MacOS with and without hsdis: > - Fastdebug: test passed > - Slowdebug: test passed. > - Release: test passed. 
Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: Update spin_wait block comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26072/files - new: https://git.openjdk.org/jdk/pull/26072/files/e3163c9f..6d60fb42 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=04-05 Stats: 11 lines in 2 files changed: 5 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/26072.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26072/head:pull/26072 PR: https://git.openjdk.org/jdk/pull/26072 From eastigeevich at openjdk.org Wed Jul 9 13:24:59 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Wed, 9 Jul 2025 13:24:59 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v4] In-Reply-To: References: <3-UMhQGbb3psk7_pn0BC1SJNrKtOfZWphO_V9d9Bqz8=.09589b1a-fb0d-4150-95f4-7565d32ed7b1@github.com> Message-ID: <_Skd1OxarrQAHIh5P0SK7IYYVAG5nRvbAxRSX8Uqldw=.7eee567a-f242-4987-b50b-f305ad001b32@github.com> On Wed, 9 Jul 2025 08:27:46 GMT, Andrew Haley wrote: >> @theRealAph, are you okay with the latest variant Aleksey proposes? > >> But this is a generic macroAssembler block comment. It does not make sense to me to have a block comment that looks like a memory corrupted string, just to satisfy a single test. There should be a middle-ground here, e.g. `spin_wait {`, which looks reasonably enough as the block comment, and not anything else. Would also match nicely when we emit the closing `}` at the end of the instruction block. > > OK I update the PR to use `spin_wait { ... }` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2195023640 From fyang at openjdk.org Wed Jul 9 13:51:41 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 9 Jul 2025 13:51:41 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v4] In-Reply-To: <30vvzTU6W2p0YpB8Z9bSfO9ajO_fHh79q9cX1G3gz3k=.521b26d7-b606-4fdc-bdcf-41fd6c4891cc@github.com> References: <30vvzTU6W2p0YpB8Z9bSfO9ajO_fHh79q9cX1G3gz3k=.521b26d7-b606-4fdc-bdcf-41fd6c4891cc@github.com> Message-ID: On Wed, 9 Jul 2025 10:07:31 GMT, Feilong Jiang wrote: >> Hi, please consider. >> [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. >> The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. >> If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. >> This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. >> We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. >> >> This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. >> The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. >> >> Test on linux-riscv64: >> - [x] Tier1-3 >> >> JMH data on P550 SBC for reference (w/o and w/ the patch): >> >> Before: >> >> Without COH: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 
0.379 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op >> ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op >> ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op >> ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op >> ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op >> ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op >> ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op >> ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op >> ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op >> ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op >> ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op >> ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op >> >> ------------------------------------------------------------------------- >> With COH: >> >> Benchmark (size) Mode Cnt Score Error Un... > > Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - Revert RISCV Macro modification > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses > - riscv: fix c1 primitive array clone intrinsic regression Thanks for the update. Still looks reasonable to me. ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25976#pullrequestreview-3001655252 From shade at openjdk.org Wed Jul 9 14:51:44 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 9 Jul 2025 14:51:44 GMT Subject: RFR: 8357473: Compilation spike leaves many CompileTasks in free list [v6] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 08:59:53 GMT, Aleksey Shipilev wrote: >> See bug for more discussion. >> >> This PR implements the "all the way" solution by removing the free list completely. It complements https://github.com/openjdk/jdk/pull/25364, and can go either first, or second. We will remerge the other one once either integrates. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `compiler` >> - [x] Linux AArch64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 11 commits: > > - Task count atomic can be relaxed > - Minor touchup in ~CompileTask > - Purge CompileTaskAlloc_lock completely > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Merge branch 'master' into JDK-8357473-compile-task-free-list > - Also free the lock! > - Comments and indenting > - ... and 1 more: https://git.openjdk.org/jdk/compare/7b255b8a...684f83b7 Thank you! Here we go. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/25409#issuecomment-3052951178 From shade at openjdk.org Wed Jul 9 14:51:45 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 9 Jul 2025 14:51:45 GMT Subject: Integrated: 8357473: Compilation spike leaves many CompileTasks in free list In-Reply-To: References: Message-ID: On Fri, 23 May 2025 09:42:17 GMT, Aleksey Shipilev wrote: > See bug for more discussion. > > This PR implements the "all the way" solution by removing the free list completely. It complements https://github.com/openjdk/jdk/pull/25364, and can go either first, or second. We will remerge the other one once either integrates. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `compiler` > - [x] Linux AArch64 server fastdebug, `all` This pull request has now been integrated. Changeset: a41d3507 Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/a41d35073ee6da0dde4dd731c1ab4c25245d075a Stats: 134 lines in 6 files changed: 27 ins; 71 del; 36 mod 8357473: Compilation spike leaves many CompileTasks in free list Reviewed-by: kvn, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/25409 From eastigeevich at openjdk.org Wed Jul 9 15:54:41 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Wed, 9 Jul 2025 15:54:41 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v6] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 13:24:59 GMT, Evgeny Astigeevich wrote: >> Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. >> >> This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. >> >> Tested on Linux and MacOS with and without hsdis: >> - Fastdebug: test passed >> - Slowdebug: test passed. >> - Release: test passed. > > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Update spin_wait block comment Two MacOS tests failed due to time out. This is not related to my change. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3053162368 From eastigeevich at openjdk.org Wed Jul 9 15:58:02 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Wed, 9 Jul 2025 15:58:02 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v7] In-Reply-To: References: Message-ID: <7dMf98HrEGETw9M_cdzOF8Mmc3hVC7kdm-oiHzXImok=.8b7aedf3-b581-4f0e-b636-b3409f3f5d33@github.com> > Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. > > This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. > > Tested on Linux and MacOS with and without hsdis: > - Fastdebug: test passed > - Slowdebug: test passed. > - Release: test passed. 
Evgeny Astigeevich has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision: - Merge branch 'master' into JDK-8360936 - Update spin_wait block comment - Fix whitespace error - Implement using block_comment - Reimplement checking algo without using debug info - Simplify requirement for debug build - 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26072/files - new: https://git.openjdk.org/jdk/pull/26072/files/6d60fb42..8554242b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=05-06 Stats: 14617 lines in 539 files changed: 8622 ins; 2314 del; 3681 mod Patch: https://git.openjdk.org/jdk/pull/26072.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26072/head:pull/26072 PR: https://git.openjdk.org/jdk/pull/26072 From shade at openjdk.org Wed Jul 9 15:59:03 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 9 Jul 2025 15:59:03 GMT Subject: RFR: 8231269: CompileTask::is_unloaded is slow due to JNIHandles type checks [v22] In-Reply-To: References: Message-ID: > [JDK-8163511](https://bugs.openjdk.org/browse/JDK-8163511) made the `CompileTask` improvement to avoid blocking class unloading if a relevant compile task is in queue. Current code does a sleight-of-hand to make sure the the `method*` in `CompileTask` are still valid before using them. Still a noble goal, so we keep trying to do this. > > The code tries to switch weak JNI handle with a strong one when it wants to capture the holder to block unloading. Since we are reusing the same field, we have to do type checks like `JNIHandles::is_weak_global_handle(_method_holder)`. Unfortunately, that type-check goes all the way to `OopStorage` allocation code to verify the handle is really allocated in the relevant `OopStorage`. This takes internal `OopStorage` locks, and thus is slow. > > This issue is clearly visible in Leyden, when there are lots of `CompileTask`-s in the queue, dumped by AOT code loader. It also does not help that `CompileTask::select_task` is effectively quadratic in number of methods in queue, so we end up calling `CompileTask::is_unloaded` very often. > > It is possible to mitigate this issue by splitting the related fields into weak and strong ones. But as Kim mentions in the bug, we should not be using JNI handles here at all, and instead go directly for relevant `OopStorage`-s. This is what this PR does, among other things that should hopefully make the whole mechanics clearer. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `compiler/classUnloading`, 100x still passes; these tests are sensitive to bugs in this code > - [x] Linux x86_64 server fastdebug, `all` > - [x] Linux AArch64 server fastdebug, `all` Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 37 commits: - Merge branch 'master' into JDK-8231269-compile-task-weaks - Merge branch 'master' into JDK-8231269-compile-task-weaks - Switch to mutable - Merge branch 'master' into JDK-8231269-compile-task-weaks - More touchups - Spin lock induces false sharing - Merge branch 'master' into JDK-8231269-compile-task-weaks - Merge branch 'master' into JDK-8231269-compile-task-weaks - Rename CompilerTask::is_unloaded back to avoid losing comment context - Simplify select_for_compilation - ... and 27 more: https://git.openjdk.org/jdk/compare/a41d3507...d5a8a27d ------------- Changes: https://git.openjdk.org/jdk/pull/24018/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24018&range=21 Stats: 430 lines in 13 files changed: 389 ins; 21 del; 20 mod Patch: https://git.openjdk.org/jdk/pull/24018.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24018/head:pull/24018 PR: https://git.openjdk.org/jdk/pull/24018 From kvn at openjdk.org Wed Jul 9 16:36:39 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 9 Jul 2025 16:36:39 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v5] In-Reply-To: References: Message-ID: <06SLl1SEu1oWp3YWh9xR3LbQR5rDGs6thA5UAS_kMtk=.5fb6e090-7ac6-4e99-9700-b767f2a08348@github.com> On Wed, 9 Jul 2025 12:38:26 GMT, Marc Chevalier wrote: >> When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 >> >> This is enforced by restoring the old state, like in >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 >> >> That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: >> >> ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) >> >> >> Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. >> >> Another situation is somewhat worse, when happening during parsing. It can lead to such cases: >> >> ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) >> >> The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? >> >> This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. 
But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 >> >> And here there is the challenge: >> - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) >> ... > > Marc Chevalier has updated the pull request incrementally with two additional commits since the last revision: > > - Tentative addressing Vladimir's comments > - Re-insert src/hotspot/share/opto/library_call.hpp line 147: > 145: SafePointNode* _map; > 146: Unique_Node_List _ctrl_succ; > 147: bool discarded = false; `discarded` is not static field. I suggest to initialize it in constructor. And use `_` prefix. Otherwise changes are good. Thank you for taking my suggestion. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2195470155 From shade at openjdk.org Wed Jul 9 17:07:44 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 9 Jul 2025 17:07:44 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v7] In-Reply-To: <7dMf98HrEGETw9M_cdzOF8Mmc3hVC7kdm-oiHzXImok=.8b7aedf3-b581-4f0e-b636-b3409f3f5d33@github.com> References: <7dMf98HrEGETw9M_cdzOF8Mmc3hVC7kdm-oiHzXImok=.8b7aedf3-b581-4f0e-b636-b3409f3f5d33@github.com> Message-ID: On Wed, 9 Jul 2025 15:58:02 GMT, Evgeny Astigeevich wrote: >> Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. >> >> This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. >> >> Tested on Linux and MacOS with and without hsdis: >> - Fastdebug: test passed >> - Slowdebug: test passed. >> - Release: test passed. > > Evgeny Astigeevich has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision: > > - Merge branch 'master' into JDK-8360936 > - Update spin_wait block comment > - Fix whitespace error > - Implement using block_comment > - Reimplement checking algo without using debug info > - Simplify requirement for debug build > - 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 Good patch. I propose a few cosmetics: [8360936-cosmetics-1.patch.txt](https://github.com/user-attachments/files/21147228/8360936-cosmetics-1.patch.txt) -- easier to express them as patch. Untested, see if it makes sense? 
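As an aside on the `SavedState` fields quoted above from library_call.hpp (lines 145-147): below is a minimal, self-contained sketch of the style being suggested -- an underscore-prefixed `_discarded` member initialized in the constructor instead of with an in-class default. The accessors and the overall class shape here are assumptions for illustration only; this is not the actual HotSpot code.

    class SafePointNode;            // stand-in forward declaration, for illustration only

    class SavedState {
     private:
      SafePointNode* _map;          // as quoted from library_call.hpp above
      bool           _discarded;    // was: "bool discarded = false;"
     public:
      explicit SavedState(SafePointNode* map)
        : _map(map), _discarded(false) {}   // initialized in the constructor, as suggested
      void discard()             { _discarded = true; }
      bool is_discarded() const  { return _discarded; }
      SafePointNode* map() const { return _map; }
    };

Initializing the flag in the constructor keeps all members in one initializer list, and the underscore prefix matches the neighbouring `_map` and `_ctrl_succ` fields.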
------------- PR Review: https://git.openjdk.org/jdk/pull/26072#pullrequestreview-3002334781 From kvn at openjdk.org Wed Jul 9 17:59:51 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 9 Jul 2025 17:59:51 GMT Subject: [jdk25] RFR: 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() Message-ID: Hi all, This pull request contains a backport of commit [dedcce04](https://github.com/openjdk/jdk/commit/dedcce045013b3ff84f5ef8857e1a83f0c09f9ad) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. The commit being backported was authored by Vladimir Kozlov on 8 Jul 2025 and was reviewed by Andrew Dinn and Matthias Baesken. Thanks! ------------- Commit messages: - Backport dedcce045013b3ff84f5ef8857e1a83f0c09f9ad Changes: https://git.openjdk.org/jdk/pull/26223/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26223&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8360942 Stats: 7 lines in 2 files changed: 4 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/26223.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26223/head:pull/26223 PR: https://git.openjdk.org/jdk/pull/26223 From shade at openjdk.org Wed Jul 9 18:03:39 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 9 Jul 2025 18:03:39 GMT Subject: [jdk25] RFR: 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() In-Reply-To: References: Message-ID: <_O_7iijXBZbsQIvNrGqgO5usw1YHeqaG4WfzIKXR5_c=.2f13a8ad-354c-4eb3-88e3-db6e1d660391@github.com> On Wed, 9 Jul 2025 17:55:31 GMT, Vladimir Kozlov wrote: > Hi all, > > This pull request contains a backport of commit [dedcce04](https://github.com/openjdk/jdk/commit/dedcce045013b3ff84f5ef8857e1a83f0c09f9ad) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Vladimir Kozlov on 8 Jul 2025 and was reviewed by Andrew Dinn and Matthias Baesken. > > Thanks! Marked as reviewed by shade (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26223#pullrequestreview-3002595613 From kbarrett at openjdk.org Wed Jul 9 19:15:44 2025 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 9 Jul 2025 19:15:44 GMT Subject: RFR: 8231269: CompileTask::is_unloaded is slow due to JNIHandles type checks [v19] In-Reply-To: References: Message-ID: On Mon, 26 May 2025 18:58:42 GMT, Aleksey Shipilev wrote: > > Not sure what our opinion is w.r.t. `mutable`, but how do we feel about typing the spin lock as `mutable` and keep `is_safe()` and `method*()` const. > > I like this a lot! Dropping `const` just to satisfy spin lock (an implementation detail) felt really awkward. New version uses `mutable`. Just a drive-by reply. `mutable` is a C++98 (and before, I think) feature, with many uses in HotSpot. Using it here seems fine to me. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24018#issuecomment-3053728810 From kbarrett at openjdk.org Wed Jul 9 19:29:45 2025 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 9 Jul 2025 19:29:45 GMT Subject: RFR: 8231269: CompileTask::is_unloaded is slow due to JNIHandles type checks [v22] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 15:59:03 GMT, Aleksey Shipilev wrote: >> [JDK-8163511](https://bugs.openjdk.org/browse/JDK-8163511) made the `CompileTask` improvement to avoid blocking class unloading if a relevant compile task is in queue. 
Current code does a sleight-of-hand to make sure the the `method*` in `CompileTask` are still valid before using them. Still a noble goal, so we keep trying to do this. >> >> The code tries to switch weak JNI handle with a strong one when it wants to capture the holder to block unloading. Since we are reusing the same field, we have to do type checks like `JNIHandles::is_weak_global_handle(_method_holder)`. Unfortunately, that type-check goes all the way to `OopStorage` allocation code to verify the handle is really allocated in the relevant `OopStorage`. This takes internal `OopStorage` locks, and thus is slow. >> >> This issue is clearly visible in Leyden, when there are lots of `CompileTask`-s in the queue, dumped by AOT code loader. It also does not help that `CompileTask::select_task` is effectively quadratic in number of methods in queue, so we end up calling `CompileTask::is_unloaded` very often. >> >> It is possible to mitigate this issue by splitting the related fields into weak and strong ones. But as Kim mentions in the bug, we should not be using JNI handles here at all, and instead go directly for relevant `OopStorage`-s. This is what this PR does, among other things that should hopefully make the whole mechanics clearer. >> >> Additional testing: >> - [x] Linux x86_64 server fastdebug, `compiler/classUnloading`, 100x still passes; these tests are sensitive to bugs in this code >> - [x] Linux x86_64 server fastdebug, `all` >> - [x] Linux AArch64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 37 commits: > > - Merge branch 'master' into JDK-8231269-compile-task-weaks > - Merge branch 'master' into JDK-8231269-compile-task-weaks > - Switch to mutable > - Merge branch 'master' into JDK-8231269-compile-task-weaks > - More touchups > - Spin lock induces false sharing > - Merge branch 'master' into JDK-8231269-compile-task-weaks > - Merge branch 'master' into JDK-8231269-compile-task-weaks > - Rename CompilerTask::is_unloaded back to avoid losing comment context > - Simplify select_for_compilation > - ... and 27 more: https://git.openjdk.org/jdk/compare/a41d3507...d5a8a27d src/hotspot/share/oops/unloadableMethodHandle.hpp line 81: > 79: friend class VMStructs; > 80: private: > 81: enum State { Not really a review, just a drive-by comment. I think the only argument against using an enum class here is the lack of C++20's "using enums" feature: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1099r5.html Personally I'd prefer to just make it an enum class and scope the references. YMMV. Also, someday we should try to come to some consensus about the naming of constants. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24018#discussion_r2195817179 From kbarrett at openjdk.org Wed Jul 9 19:32:45 2025 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 9 Jul 2025 19:32:45 GMT Subject: RFR: 8231269: CompileTask::is_unloaded is slow due to JNIHandles type checks [v11] In-Reply-To: References: Message-ID: On Wed, 7 May 2025 20:31:21 GMT, Coleen Phillimore wrote: > This is a cleaner way to do this. I believe it's what we discussed with Kim. He can confirm. Yes, I think this looks like the sort of thing I had in mind when we were discussing it back whenever that was. 
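To make two style points from this thread concrete -- the scoped-enum suggestion for `UnloadableMethodHandle::State` quoted above, and the earlier question of typing the spin lock as `mutable` so that `is_safe()` can stay const -- here is a rough, self-contained sketch. Apart from the class name, `State` and `is_safe()`, every name and member below (the enumerators, the lock type, `switch_to_strong()`) is invented for illustration and does not reflect the real class.

    #include <atomic>

    class UnloadableMethodHandle {
     private:
      // With a scoped enum, every reference must be qualified (State::kWeak),
      // which is the qualification that C++20's "using enum" would shorten.
      enum class State { kWeak, kStrong };   // enumerator names are assumptions

      State _state;
      // 'mutable' lets a const query take the spin lock without giving up
      // const on the accessor itself.
      mutable std::atomic<bool> _lock{false};

     public:
      UnloadableMethodHandle() : _state(State::kWeak) {}

      bool is_safe() const {
        while (_lock.exchange(true, std::memory_order_acquire)) { /* spin */ }
        bool safe = (_state == State::kStrong);
        _lock.store(false, std::memory_order_release);
        return safe;
      }

      void switch_to_strong() { _state = State::kStrong; }
    };

With the scoped enum, call sites have to spell out `State::kStrong` rather than a bare constant, and the `mutable` lock keeps the read-side API const without any casting.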
------------- PR Comment: https://git.openjdk.org/jdk/pull/24018#issuecomment-3053768403 From eastigeevich at openjdk.org Wed Jul 9 19:54:40 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Wed, 9 Jul 2025 19:54:40 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v7] In-Reply-To: References: <7dMf98HrEGETw9M_cdzOF8Mmc3hVC7kdm-oiHzXImok=.8b7aedf3-b581-4f0e-b636-b3409f3f5d33@github.com> Message-ID: <4xSNmbHsRlDFZHA09hEeaMqLooxhpblZzRSf2F0RTF8=.f1c45af3-5527-445e-9b91-4180c0c08fad@github.com> On Wed, 9 Jul 2025 17:05:05 GMT, Aleksey Shipilev wrote: > Good patch. I propose a few cosmetics: [8360936-cosmetics-1.patch.txt](https://github.com/user-attachments/files/21147228/8360936-cosmetics-1.patch.txt) -- easier to express them as patch. Untested, see if it makes sense? Thank you! It make sense. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3053816693 From eastigeevich at openjdk.org Wed Jul 9 20:13:57 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Wed, 9 Jul 2025 20:13:57 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v8] In-Reply-To: References: Message-ID: > Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. > > This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. > > Tested on Linux and MacOS with and without hsdis: > - Fastdebug: test passed > - Slowdebug: test passed. > - Release: test passed. Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: Apply 8360936-cosmetics-1.patch.txt from PR ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26072/files - new: https://git.openjdk.org/jdk/pull/26072/files/8554242b..1b4d81be Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=06-07 Stats: 75 lines in 1 file changed: 25 ins; 38 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/26072.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26072/head:pull/26072 PR: https://git.openjdk.org/jdk/pull/26072 From eastigeevich at openjdk.org Wed Jul 9 20:13:57 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Wed, 9 Jul 2025 20:13:57 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v7] In-Reply-To: <4xSNmbHsRlDFZHA09hEeaMqLooxhpblZzRSf2F0RTF8=.f1c45af3-5527-445e-9b91-4180c0c08fad@github.com> References: <7dMf98HrEGETw9M_cdzOF8Mmc3hVC7kdm-oiHzXImok=.8b7aedf3-b581-4f0e-b636-b3409f3f5d33@github.com> <4xSNmbHsRlDFZHA09hEeaMqLooxhpblZzRSf2F0RTF8=.f1c45af3-5527-445e-9b91-4180c0c08fad@github.com> Message-ID: On Wed, 9 Jul 2025 19:52:11 GMT, Evgeny Astigeevich wrote: > > Good patch. I propose a few cosmetics: [8360936-cosmetics-1.patch.txt](https://github.com/user-attachments/files/21147228/8360936-cosmetics-1.patch.txt) -- easier to express them as patch. Untested, see if it makes sense? > > Thank you! It make sense. I tested it. It works as expected. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26072#issuecomment-3053865689 From dlong at openjdk.org Thu Jul 10 00:09:50 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 10 Jul 2025 00:09:50 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 00:29:58 GMT, Chad Rakoczy wrote: >> The original motivation was to address far call sites. After relocation, some calls that previously didn't require a trampoline might now need one, hence the introduction of the `be_safe` parameter. However, upon further review, this change is unnecessary. The method `trampoline_stub_Relocation::fix_relocation_after_move` already updates the owner and contains the logic to determine whether a direct call can be performed. Therefore, we can skip invoking `CallRelocation::fix_relocation_after_move` for calls that use trampolines, as all required adjustments will be handled correctly by the trampoline relocations. ([Reference](https://github.com/chadrako/jdk/blob/0f4ff9646d1f7f43214c5ccd4bbe572fffd08d16/src/hotspot/share/code/nmethod.cpp#L1547-L1556)) > > @dean-long What are your thoughts on this solution? The logic looks fine, but I don't think it belongs in shared code. Why not have a new fix_relocation_after_xxx() that is platform-specific? For most platforms it can just delegate to fix_relocation_after_move(). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2196224362 From xgong at openjdk.org Thu Jul 10 01:42:40 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 10 Jul 2025 01:42:40 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 09:17:13 GMT, Xiaohong Gong wrote: >> Hi @eme64 , could you please help take a look at this patch especially the test part since most of the tests are SLP related? It will be helpful if you could also help trigger a testing for it. Thanks for your time! > >> @XiaohongGong I would love to review and test, but I'm about to go on vacation and will only be back in August. I've pinged some others internally, and hope someone will pick this up! > > Thanks a lot for the help! Sounds good to me and have a good holiday! > @XiaohongGong I'll run some tests and have a look at the changes as well (@eme64 asked me). I'll get back to you shortly! Thanks so much for your help! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3054907351 From xgong at openjdk.org Thu Jul 10 01:42:41 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 10 Jul 2025 01:42:41 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 10:43:07 GMT, Bhavana Kilambi wrote: > Thanks for making the changes. Looks good to me. Thanks a lot for your review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3054908101 From never at openjdk.org Thu Jul 10 01:43:58 2025 From: never at openjdk.org (Tom Rodriguez) Date: Thu, 10 Jul 2025 01:43:58 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v35] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 20:03:17 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). 
It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 95 commits: > > - Typo > - Merge branch 'master' into JDK-8316694-Final > - Update justification for skipping CallRelocation > - Enclose ImmutableDataReferencesCounterSize in parentheses > - Let trampolines fix their owners > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Update how call sites are fixed > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Fix pointer printing > - Use set_destination_mt_safe > - ... and 85 more: https://git.openjdk.org/jdk/compare/117f0b40...66d73c16 Marked as reviewed by never (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23573#pullrequestreview-3003593036 From xgong at openjdk.org Thu Jul 10 02:04:54 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 10 Jul 2025 02:04:54 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v13] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 11:40:29 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. 
>> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Change match rule names to lowercase src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2886: > 2884: if (bt == T_BYTE) { > 2885: if (isQ) { > 2886: assert(UseSVE <= 1, "sve must be <= 1"); This assertion is not necessary as there is the same assertion in above line-2866? src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2919: > 2917: ins(tmp, D, src2, 1, 0); > 2918: tbl(dst, size1, tmp, 1, dst); > 2919: } Is it better than we wrap this part as a help function, because the code is much the same with line2885-2898? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2196316568 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2196340944 From duke at openjdk.org Thu Jul 10 02:11:21 2025 From: duke at openjdk.org (Guanqiang Han) Date: Thu, 10 Jul 2025 02:11:21 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp [v2] In-Reply-To: References: Message-ID: > When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. > > This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this flag is disabled. Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - update modification and add regression test - Merge remote-tracking branch 'upstream/master' into 8361140 - 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. 
Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this support is disabled. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26125/files - new: https://git.openjdk.org/jdk/pull/26125/files/f118400d..2feca6a8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26125&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26125&range=00-01 Stats: 10836 lines in 397 files changed: 5987 ins; 1650 del; 3199 mod Patch: https://git.openjdk.org/jdk/pull/26125.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26125/head:pull/26125 PR: https://git.openjdk.org/jdk/pull/26125 From amitkumar at openjdk.org Thu Jul 10 03:17:18 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Thu, 10 Jul 2025 03:17:18 GMT Subject: RFR: 8361536: [s390x] Saving return_pc at wrong offset Message-ID: Fixes the bug where return pc was stored at a wrong offset, which causes issue with java abi. Issue appeared in #26004, see the comment: https://github.com/openjdk/jdk/pull/26004#issuecomment-3017928879. ------------- Commit messages: - Revert "save another 8 bytes" - save another 8 bytes - fix Changes: https://git.openjdk.org/jdk/pull/26209/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26209&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361536 Stats: 20 lines in 1 file changed: 2 ins; 0 del; 18 mod Patch: https://git.openjdk.org/jdk/pull/26209.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26209/head:pull/26209 PR: https://git.openjdk.org/jdk/pull/26209 From lucy at openjdk.org Thu Jul 10 03:17:18 2025 From: lucy at openjdk.org (Lutz Schmidt) Date: Thu, 10 Jul 2025 03:17:18 GMT Subject: RFR: 8361536: [s390x] Saving return_pc at wrong offset In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 05:24:38 GMT, Amit Kumar wrote: > Fixes the bug where return pc was stored at a wrong offset, which causes issue with java abi. > > Issue appeared in #26004, see the comment: https://github.com/openjdk/jdk/pull/26004#issuecomment-3017928879. LGTM. Thanks for fixing. ------------- Marked as reviewed by lucy (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26209#pullrequestreview-3000979208 From mdoerr at openjdk.org Thu Jul 10 03:17:18 2025 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 10 Jul 2025 03:17:18 GMT Subject: RFR: 8361536: [s390x] Saving return_pc at wrong offset In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 05:24:38 GMT, Amit Kumar wrote: > Fixes the bug where return pc was stored at a wrong offset, which causes issue with java abi. > > Issue appeared in #26004, see the comment: https://github.com/openjdk/jdk/pull/26004#issuecomment-3017928879. We're not really saving space. We just use less of the caller allocated stack space which is still as large as before. But the change looks good and should make the stack walking code happy, because it can find the return_pc where it is expected, now. ------------- Marked as reviewed by mdoerr (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26209#pullrequestreview-3001829975 From amitkumar at openjdk.org Thu Jul 10 03:17:18 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Thu, 10 Jul 2025 03:17:18 GMT Subject: RFR: 8361536: [s390x] Saving return_pc at wrong offset In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 05:24:38 GMT, Amit Kumar wrote: > Fixes the bug where return pc was stored at a wrong offset, which causes issue with java abi. 
> 
> Issue appeared in #26004, see the comment: https://github.com/openjdk/jdk/pull/26004#issuecomment-3017928879.

Fast debug build was fine, but release build crashed with this error:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000003fffd16b19e, pid=281849, tid=281855
#
# JRE version: OpenJDK Runtime Environment (26.0) (build 26-internal-adhoc.amit.jdk)
# Java VM: OpenJDK 64-Bit Server VM (26-internal-adhoc.amit.jdk, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-s390x)
# Problematic frame:
# V  [libjvm.so+0x66b19e]  HandleMark::~HandleMark()+0x1e
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -F%F -- %E" (or dumping to /home/amit/jdk/core.281849)
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#

stack trace:

Stack: [0x000003fffc900000,0x000003fffca00000],  sp=0x000003fffc9fca40,  free space=1010k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x66b19e]  HandleMark::~HandleMark()+0x1e  (handles.inline.hpp:88)
V  [libjvm.so+0xc1a038]  Threads::create_vm(JavaVMInitArgs*, bool*)+0x528  (threads.cpp:905)
V  [libjvm.so+0x799e9a]  JNI_CreateJavaVM+0x7a  (jni.cpp:3589)
C  [libjli.so+0x40e0]  JavaMain+0xa0  (java.c:1506)
C  [libjli.so+0x8170]  ThreadJavaMain+0x20  (java_md.c:646)

This commit (https://github.com/openjdk/jdk/pull/26209/commits/e945e0460832cf25dbbaba351b89c1cade4fefa1) seems to be faulty.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26209#issuecomment-3052219868

From xgong at openjdk.org Thu Jul 10 03:18:45 2025
From: xgong at openjdk.org (Xiaohong Gong)
Date: Thu, 10 Jul 2025 03:18:45 GMT
Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v13]
In-Reply-To: 
References: 
Message-ID: 

On Thu, 10 Jul 2025 01:58:23 GMT, Xiaohong Gong wrote:

>> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Change match rule names to lowercase
>
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2919:
>
>> 2917:     ins(tmp, D, src2, 1, 0);
>> 2918:     tbl(dst, size1, tmp, 1, dst);
>> 2919:   }
>
> Is it better than we wrap this part as a help function, because the code is much the same with line2885-2898?

These two functions can be refined more clearly. Following is my version:

void C2_MacroAssembler::select_from_two_vectors_neon(FloatRegister dst, FloatRegister src1,
                                                     FloatRegister src2, FloatRegister index,
                                                     FloatRegister tmp, bool isQ) {
  assert_different_registers(dst, src1, src2, tmp);
  assert(bt != T_DOUBLE && bt != T_LONG, "unsupported basic type");
  if (isQ) {
    assert(UseSVE <= 1, "sve must be <= 1");
    // If the vector length is 16B, then use the Neon "tbl" instruction with two vector table
    tbl(dst, size1, src1, 2, index);
  } else { // vector length == 8
    assert(UseSVE == 0, "must be Neon only");
    // We need to fit both the source vectors (src1, src2) in a 128-bit register because the
    // Neon "tbl" instruction supports only looking up 16B vectors. We then use the Neon "tbl"
    // instruction with one vector lookup
    ins(tmp, D, src1, 0, 0);
    ins(tmp, D, src2, 1, 0);
    tbl(dst, size1, tmp, 1, index);
  }
}

void C2_MacroAssembler::select_from_two_vectors_sve(FloatRegister dst, FloatRegister src1,
                                                    FloatRegister src2, FloatRegister index,
                                                    FloatRegister tmp, BasicType bt,
                                                    unsigned length_in_bytes) {
  assert_different_registers(dst, src1, src2, index, tmp);
  SIMD_RegVariant T = elemType_to_regVariant(bt);
  if (length_in_bytes == 8) {
    assert(UseSVE >= 1, "sve must be >= 1");
    ins(tmp, D, src1, 0, 0);
    ins(tmp, D, src2, 1, 0);
    sve_tbl(dst, T, tmp, index);
  } else { // UseSVE == 2 and vector_length_in_bytes > 8
    assert(UseSVE == 2, "must be sve2");
    sve_tbl(dst, T, src1, src2, index);
  }
}

void C2_MacroAssembler::select_from_two_vectors(FloatRegister dst, FloatRegister src1,
                                                FloatRegister src2, FloatRegister index,
                                                FloatRegister tmp, BasicType bt,
                                                unsigned length_in_bytes) {
  assert_different_registers(dst, src1, src2, index, tmp);
  if (UseSVE == 2 || (UseSVE == 1 && length_in_bytes == 8)) {
    select_from_two_vectors_sve(dst, src1, src2, index, tmp, bt, length_in_bytes);
    return;
  }

  // The only BasicTypes that can reach here are T_SHORT, T_BYTE, T_INT and T_FLOAT
  assert(bt != T_DOUBLE && bt != T_LONG, "unsupported basic type");
  assert(length_in_bytes <= 16, "length_in_bytes must be <= 16");

  SIMD_Arrangement size1 = isQ ? T16B : T8B;
  SIMD_Arrangement size2 = esize2arrangement((uint)type2aelembytes(bt), isQ);

  // Neon "tbl" instruction only supports byte tables, so we need to look at chunks of
  // 2B for selecting shorts or chunks of 4B for selecting ints/floats from the table.
  // The index values in "index" register are in the range of [0, 2 * NUM_ELEM) where NUM_ELEM
  // is the number of elements that can fit in a vector. For ex. for T_SHORT with 64-bit vector length,
  // the indices can range from [0, 8).
  // As an example with 64-bit vector length and T_SHORT type - let index = [2, 5, 1, 0]
  // Move a constant 0x02 in every byte of tmp - tmp = [0x0202, 0x0202, 0x0202, 0x0202]
  // Multiply index vector with tmp to yield - dst = [0x0404, 0x0a0a, 0x0202, 0x0000]
  // Move a constant 0x0100 in every 2B of tmp - tmp = [0x0100, 0x0100, 0x0100, 0x0100]
  // Add the multiplied result to the vector in tmp to obtain the byte level
  // offsets - dst = [0x0504, 0x0b0a, 0x0302, 0x0100]
  // Use these offsets in the "tbl" instruction to select chunks of 2B.
  if (bt == T_BYTE) {
    select_from_two_vectors_neon(dst, src1, src2, index, tmp, isQ);
  } else {
    int elem_size = (bt == T_SHORT) ? 2 : 4;
    uint64_t tbl_offset = (bt == T_SHORT) ? 0x0100u : 0x03020100u;
    mov(tmp, size1, elem_size);
    mulv(dst, size2, index, tmp);
    mov(tmp, size2, tbl_offset);
    addv(dst, size1, dst, tmp); // "dst" now contains the processed index elements
                                // to select a set of 2B/4B
    select_from_two_vectors_neon(dst, src1, src2, dst, tmp, isQ);
  }
}

1) Current match rules of `vselect_from_two_vectors_neon_..` and `vselect_from_two_vectors_sve_...` can be combined by calling the same function `select_from_two_vectors()`, as the registers are totally the same. This can save half of the newly added rules.

2) `select_from_two_vectors_sve` and `select_from_two_vectors_neon` can be two helper functions which should be `private` to `C2_MacroAssembler`.

3) There are some cases that do not need the `tmp` register:
   - UseSVE <= 1 && bt == T_BYTE && length_in_bytes == 16
   - UseSVE == 2 && length_in_bytes == MaxVectorSize

   For these cases, maybe we have to separate the rules from those that need the `tmp` register. This can save a float register.
If this will make the code more complex and unreadable, I'm also fine with noting spliting them. WDYT? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2196420133 From amitkumar at openjdk.org Thu Jul 10 03:38:40 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Thu, 10 Jul 2025 03:38:40 GMT Subject: RFR: 8361536: [s390x] Saving return_pc at wrong offset In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 14:35:30 GMT, Martin Doerr wrote: >> Fixes the bug where return pc was stored at a wrong offset, which causes issue with java abi. >> >> Issue appeared in #26004, see the comment: https://github.com/openjdk/jdk/pull/26004#issuecomment-3017928879. > > We're not really saving space. We just use less of the caller allocated stack space which is still as large as before. > But the change looks good and should make the stack walking code happy, because it can find the return_pc where it is expected, now. @TheRealMDoerr I need reapproval. Can you provide one ? Thanks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26209#issuecomment-3055233107 From amitkumar at openjdk.org Thu Jul 10 03:40:43 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Thu, 10 Jul 2025 03:40:43 GMT Subject: RFR: 8358756: [s390x] Test StartupOutput.java crash due to CodeCache size [v2] In-Reply-To: <5EregFPpep4Y8cL1v_GnvR6vq415jVK-u_6MuCPNfm4=.8b2ac0ec-47fe-4f50-9815-a61ea400f58f@github.com> References: <5EregFPpep4Y8cL1v_GnvR6vq415jVK-u_6MuCPNfm4=.8b2ac0ec-47fe-4f50-9815-a61ea400f58f@github.com> Message-ID: On Tue, 17 Jun 2025 15:09:42 GMT, Damon Fenacci wrote: >>> Thanks @offamitkumar. The idea behind the [PR](https://github.com/openjdk/jdk/pull/23630) that changed this is that it would check randomly around the amount of code cache that would be just enough for the compilers to start (or not). So, before that PR it would sometimes crash instead of terminating gently. Does adding `800k` to the initial code cache for s390 do that? Did you try before that [PR](https://github.com/openjdk/jdk/pull/23630) (or temporarily reverting it) to see if it crashes? >> >> >> Just for my understanding. Even if test passes we still want to see this warning: >> >> [warning][codecache] CodeCache is full. Compiler has been disabled. >> >> >> Before the PR, I don't test crashing or even producing this warning. Even with my changes same behaviour is going on. > >> Just for my understanding. Even if test passes we still want to see this warning: >> >> ``` >> [warning][codecache] CodeCache is full. Compiler has been disabled. >> ``` > > The test passes with and without that message. When the randomly chosen amount of code cache is not enough to start the compiler(s) it should print that message, when it is enough to start both compilers, you don't see that message. > The important thing is that there is no crash when compilers are trying to reserve code cache (they should be just shut down). @dafedafe any further comments on this one :-) ------------- PR Comment: https://git.openjdk.org/jdk/pull/25741#issuecomment-3055235851 From duke at openjdk.org Thu Jul 10 05:09:39 2025 From: duke at openjdk.org (Guanqiang Han) Date: Thu, 10 Jul 2025 05:09:39 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp [v2] In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 12:11:46 GMT, Christian Hagedorn wrote: >> Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. 
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - update modification and add regression test >> - Merge remote-tracking branch 'upstream/master' into 8361140 >> - 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp >> >> When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. >> >> This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this support is disabled. > > src/hotspot/share/opto/escape.cpp line 981: > >> 979: if (!OptimizePtrCompare) { >> 980: return; >> 981: } > > Thanks for working on this! IIUC, having the bailout here will fail to reduce the phi which could be unexpected. Shouldn't we just return `UNKNOWN` from within `ConnectionGraph::optimize_ptr_compare()` when we run without `OptimizePtrCompare`? > > On a separate note, can you also add a regression test? Maybe you can also just add a run with `-XX:-OptimizePtrCompare` - maybe together with `-XX:+VerifyReduceAllocationMerges` for more verification - to `compiler/c2/irTests/scalarReplacement/AllocationMergesTests.java`. > > @JohnTortugo you might also want to have a look at this. hi @chhagedorn , I already update PR and add regression test. Please take another look when you have time . Thanks a lot. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26125#discussion_r2196583002 From shade at openjdk.org Thu Jul 10 06:11:41 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 10 Jul 2025 06:11:41 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v8] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 20:13:57 GMT, Evgeny Astigeevich wrote: >> Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. >> >> This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. >> >> Tested on Linux and MacOS with and without hsdis: >> - Fastdebug: test passed >> - Slowdebug: test passed. >> - Release: test passed. > > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Apply 8360936-cosmetics-1.patch.txt from PR Looks good to me, thanks. ------------- Marked as reviewed by shade (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26072#pullrequestreview-3004074308 From mchevalier at openjdk.org Thu Jul 10 06:13:27 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 10 Jul 2025 06:13:27 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v6] In-Reply-To: References: Message-ID: <-rnlrm6PHRZeO1izbXh5nOrm368YKrsFft1u6SHXzWA=.9c8e6646-0b72-4705-895e-f795f74f3906@github.com> > When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 > > This is enforced by restoring the old state, like in > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 > > That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: > > ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) > > > Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. > > Another situation is somewhat worse, when happening during parsing. It can lead to such cases: > > ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) > > The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? > > This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 > > And here there is the challenge: > - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) > - we can't really change the pointer, just the content > -... 
Marc Chevalier has updated the pull request incrementally with two additional commits since the last revision: - Forgot to destruct_map_clone - +'_' and ctor init ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25936/files - new: https://git.openjdk.org/jdk/pull/25936/files/59133778..be5c0241 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25936&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25936&range=04-05 Stats: 6 lines in 2 files changed: 2 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/25936.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25936/head:pull/25936 PR: https://git.openjdk.org/jdk/pull/25936 From mchevalier at openjdk.org Thu Jul 10 06:13:27 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 10 Jul 2025 06:13:27 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v5] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 12:38:26 GMT, Marc Chevalier wrote: >> When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 >> >> This is enforced by restoring the old state, like in >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 >> >> That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: >> >> ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) >> >> >> Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. >> >> Another situation is somewhat worse, when happening during parsing. It can lead to such cases: >> >> ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) >> >> The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? >> >> This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 >> >> And here there is the challenge: >> - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) >> ... 
> > Marc Chevalier has updated the pull request incrementally with two additional commits since the last revision: > > - Tentative addressing Vladimir's comments > - Re-insert Turns out in this `SavedState` tiny refactoring, I removed the underlying call to `destruct_map_clone`. It's probably benign up to memory consumption, and it made no test fail. Nevertheless, it's back. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25936#issuecomment-3055733721 From mchevalier at openjdk.org Thu Jul 10 06:13:28 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 10 Jul 2025 06:13:28 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v5] In-Reply-To: <06SLl1SEu1oWp3YWh9xR3LbQR5rDGs6thA5UAS_kMtk=.5fb6e090-7ac6-4e99-9700-b767f2a08348@github.com> References: <06SLl1SEu1oWp3YWh9xR3LbQR5rDGs6thA5UAS_kMtk=.5fb6e090-7ac6-4e99-9700-b767f2a08348@github.com> Message-ID: <4cs83TE2oMQrKZJBJy72dLKotjXgyh2WLheJyTPvvSM=.e22e5aea-befe-4857-86b3-a86ef6e9a57a@github.com> On Wed, 9 Jul 2025 16:34:15 GMT, Vladimir Kozlov wrote: >> Marc Chevalier has updated the pull request incrementally with two additional commits since the last revision: >> >> - Tentative addressing Vladimir's comments >> - Re-insert > > src/hotspot/share/opto/library_call.hpp line 147: > >> 145: SafePointNode* _map; >> 146: Unique_Node_List _ctrl_succ; >> 147: bool discarded = false; > > `discarded` is not static field. I suggest to initialize it in constructor. And use `_` prefix. > > Otherwise changes are good. Thank you for taking my suggestion. Done! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25936#discussion_r2196677989 From dfenacci at openjdk.org Thu Jul 10 06:38:39 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Thu, 10 Jul 2025 06:38:39 GMT Subject: RFR: 8358756: [s390x] Test StartupOutput.java crash due to CodeCache size [v2] In-Reply-To: References: <5EregFPpep4Y8cL1v_GnvR6vq415jVK-u_6MuCPNfm4=.8b2ac0ec-47fe-4f50-9815-a61ea400f58f@github.com> Message-ID: On Thu, 10 Jul 2025 03:37:37 GMT, Amit Kumar wrote: >>> Just for my understanding. Even if test passes we still want to see this warning: >>> >>> ``` >>> [warning][codecache] CodeCache is full. Compiler has been disabled. >>> ``` >> >> The test passes with and without that message. When the randomly chosen amount of code cache is not enough to start the compiler(s) it should print that message, when it is enough to start both compilers, you don't see that message. >> The important thing is that there is no crash when compilers are trying to reserve code cache (they should be just shut down). > > @dafedafe any further comments on this one :-) Sorry for the delay @offamitkumar, I left it a bit on the side... > What I wanted to verify with above expected crash is that current number are not enough for the compilers and we saw the output is containing that "Codecache is full" message. I think that if you want to check for the "Codecache is full..." message you should probably add a new test (as part of this regression test class) instead of changing this one as the purpose here is really to check for no crashes. 
-------------

PR Comment: https://git.openjdk.org/jdk/pull/25741#issuecomment-3055825472

From xgong at openjdk.org Thu Jul 10 07:10:23 2025
From: xgong at openjdk.org (Xiaohong Gong)
Date: Thu, 10 Jul 2025 07:10:23 GMT
Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation
Message-ID: 

This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for the AArch64 SVE platform.

### Background

Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register.

### Implementation

#### Challenges

Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints.

For a 512-bit SVE machine, loading a `byte` vector with different vector species requires different approaches:
- SPECIES_64: Single operation with mask (8 elements, 256-bit)
- SPECIES_128: Single operation, full register (16 elements, 512-bit)
- SPECIES_256: Two operations + merge (32 elements, 1024-bit)
- SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit)

Use `ByteVector.SPECIES_512` as an example:
- It contains 64 elements. So the index vector size should be `64 * 32` bits, which is 4 times the SVE vector register size.
- It requires 4 vector gather-loads to finish the whole operation.

  byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...]
  int[] idx = [0, 1, 2, 3, ..., 63, ...]

  4 gather-load:
  idx_v1 = [15 14 13 ... 1 0]    gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa]
  idx_v2 = [31 30 29 ... 17 16]  gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb]
  idx_v3 = [47 46 45 ... 33 32]  gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc]
  idx_v4 = [63 62 61 ... 49 48]  gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd]

  merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa]

#### Solution

The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end. Here are the main changes:
- Enhanced IR generation with architecture-specific patterns based on the `gather_scatter_needs_vector_index()` matcher.
- Added `VectorSliceNode` for result merging.
- Added `VectorMaskWidenNode` for mask splitting and type conversion for masked gather-load.
- Implemented SVE match rules for subword gather operations.
- Added comprehensive IR tests for verification.

### Testing:
- Passed hotspot::tier1/2/3, jdk::tier1/2/3 tests
- No regressions found

### Performance:

The performance of the corresponding JMH benchmarks improves 3-11x on an NVIDIA GRACE CPU, which is a 128-bit SVE2 architecture.
Following is the performance data: Benchmark SIZE Mode Cnt Unit Before After Gain GatherOperationsBenchmark.microByteGather128 64 thrpt 30 ops/ms 13500.891 46721.307 3.46 GatherOperationsBenchmark.microByteGather128 256 thrpt 30 ops/ms 3378.186 12321.847 3.64 GatherOperationsBenchmark.microByteGather128 1024 thrpt 30 ops/ms 844.871 3144.217 3.72 GatherOperationsBenchmark.microByteGather128 4096 thrpt 30 ops/ms 211.386 783.337 3.70 GatherOperationsBenchmark.microByteGather128_MASK 64 thrpt 30 ops/ms 10605.664 46124.957 4.34 GatherOperationsBenchmark.microByteGather128_MASK 256 thrpt 30 ops/ms 2668.531 12292.350 4.60 GatherOperationsBenchmark.microByteGather128_MASK 1024 thrpt 30 ops/ms 676.218 3074.224 4.54 GatherOperationsBenchmark.microByteGather128_MASK 4096 thrpt 30 ops/ms 169.402 817.227 4.82 GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF 64 thrpt 30 ops/ms 10615.723 46122.380 4.34 GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF 256 thrpt 30 ops/ms 2671.931 12222.473 4.57 GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF 1024 thrpt 30 ops/ms 678.437 3091.970 4.55 GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF 4096 thrpt 30 ops/ms 170.310 813.967 4.77 GatherOperationsBenchmark.microByteGather128_NZ_OFF 64 thrpt 30 ops/ms 13524.671 47223.082 3.49 GatherOperationsBenchmark.microByteGather128_NZ_OFF 256 thrpt 30 ops/ms 3411.813 12343.308 3.61 GatherOperationsBenchmark.microByteGather128_NZ_OFF 1024 thrpt 30 ops/ms 847.919 3129.065 3.69 GatherOperationsBenchmark.microByteGather128_NZ_OFF 4096 thrpt 30 ops/ms 212.790 787.953 3.70 GatherOperationsBenchmark.microByteGather64 64 thrpt 30 ops/ms 8717.294 48176.937 5.52 GatherOperationsBenchmark.microByteGather64 256 thrpt 30 ops/ms 2184.345 12347.113 5.65 GatherOperationsBenchmark.microByteGather64 1024 thrpt 30 ops/ms 546.093 3070.851 5.62 GatherOperationsBenchmark.microByteGather64 4096 thrpt 30 ops/ms 136.724 767.656 5.61 GatherOperationsBenchmark.microByteGather64_MASK 64 thrpt 30 ops/ms 6576.504 48588.806 7.38 GatherOperationsBenchmark.microByteGather64_MASK 256 thrpt 30 ops/ms 1653.073 12341.291 7.46 GatherOperationsBenchmark.microByteGather64_MASK 1024 thrpt 30 ops/ms 416.590 3070.680 7.37 GatherOperationsBenchmark.microByteGather64_MASK 4096 thrpt 30 ops/ms 105.743 767.790 7.26 GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF 64 thrpt 30 ops/ms 6628.974 48628.463 7.33 GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF 256 thrpt 30 ops/ms 1676.767 12338.116 7.35 GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF 1024 thrpt 30 ops/ms 422.612 3070.987 7.26 GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF 4096 thrpt 30 ops/ms 105.033 767.563 7.30 GatherOperationsBenchmark.microByteGather64_NZ_OFF 64 thrpt 30 ops/ms 8754.635 48525.395 5.54 GatherOperationsBenchmark.microByteGather64_NZ_OFF 256 thrpt 30 ops/ms 2182.044 12338.096 5.65 GatherOperationsBenchmark.microByteGather64_NZ_OFF 1024 thrpt 30 ops/ms 547.353 3071.666 5.61 GatherOperationsBenchmark.microByteGather64_NZ_OFF 4096 thrpt 30 ops/ms 137.853 767.745 5.56 GatherOperationsBenchmark.microShortGather128 64 thrpt 30 ops/ms 8713.480 37696.121 4.32 GatherOperationsBenchmark.microShortGather128 256 thrpt 30 ops/ms 2189.636 9479.710 4.32 GatherOperationsBenchmark.microShortGather128 1024 thrpt 30 ops/ms 545.435 2378.492 4.36 GatherOperationsBenchmark.microShortGather128 4096 thrpt 30 ops/ms 136.213 595.504 4.37 GatherOperationsBenchmark.microShortGather128_MASK 64 thrpt 30 ops/ms 6665.844 37765.315 5.66 
GatherOperationsBenchmark.microShortGather128_MASK 256 thrpt 30 ops/ms 1673.950 9482.207 5.66 GatherOperationsBenchmark.microShortGather128_MASK 1024 thrpt 30 ops/ms 420.628 2378.813 5.65 GatherOperationsBenchmark.microShortGather128_MASK 4096 thrpt 30 ops/ms 105.128 595.412 5.66 GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 64 thrpt 30 ops/ms 6699.594 37698.398 5.62 GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 256 thrpt 30 ops/ms 1682.128 9480.355 5.63 GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 1024 thrpt 30 ops/ms 421.942 2380.449 5.64 GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 4096 thrpt 30 ops/ms 106.587 595.560 5.58 GatherOperationsBenchmark.microShortGather128_NZ_OFF 64 thrpt 30 ops/ms 8788.830 37709.493 4.29 GatherOperationsBenchmark.microShortGather128_NZ_OFF 256 thrpt 30 ops/ms 2199.706 9485.769 4.31 GatherOperationsBenchmark.microShortGather128_NZ_OFF 1024 thrpt 30 ops/ms 548.309 2380.494 4.34 GatherOperationsBenchmark.microShortGather128_NZ_OFF 4096 thrpt 30 ops/ms 137.434 595.448 4.33 GatherOperationsBenchmark.microShortGather64 64 thrpt 30 ops/ms 5296.860 37797.813 7.13 GatherOperationsBenchmark.microShortGather64 256 thrpt 30 ops/ms 1321.738 9602.510 7.26 GatherOperationsBenchmark.microShortGather64 1024 thrpt 30 ops/ms 330.520 2404.013 7.27 GatherOperationsBenchmark.microShortGather64 4096 thrpt 30 ops/ms 82.149 602.956 7.33 GatherOperationsBenchmark.microShortGather64_MASK 64 thrpt 30 ops/ms 3458.968 37851.452 10.94 GatherOperationsBenchmark.microShortGather64_MASK 256 thrpt 30 ops/ms 879.143 9616.554 10.93 GatherOperationsBenchmark.microShortGather64_MASK 1024 thrpt 30 ops/ms 220.256 2408.851 10.93 GatherOperationsBenchmark.microShortGather64_MASK 4096 thrpt 30 ops/ms 54.947 603.251 10.97 GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF 64 thrpt 30 ops/ms 3521.856 37736.119 10.71 GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF 256 thrpt 30 ops/ms 881.456 9602.649 10.89 GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF 1024 thrpt 30 ops/ms 220.122 2409.030 10.94 GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF 4096 thrpt 30 ops/ms 55.845 603.126 10.79 GatherOperationsBenchmark.microShortGather64_NZ_OFF 64 thrpt 30 ops/ms 5279.815 37698.023 7.14 GatherOperationsBenchmark.microShortGather64_NZ_OFF 256 thrpt 30 ops/ms 1307.935 9601.551 7.34 GatherOperationsBenchmark.microShortGather64_NZ_OFF 1024 thrpt 30 ops/ms 329.707 2409.962 7.30 GatherOperationsBenchmark.microShortGather64_NZ_OFF 4096 thrpt 30 ops/ms 82.092 603.380 7.35 [1] https://bugs.openjdk.org/browse/JDK-8355563 [2] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1B--scalar-plus-vector-Gather-load-unsigned-bytes-to-vector--vector-index--?lang=en [3] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1H--scalar-plus-vector---Gather-load-unsigned-halfwords-to-vector--vector-index--?lang=en ------------- Commit messages: - 8351623: VectorAPI: Add SVE implementation of subword gather load operation Changes: https://git.openjdk.org/jdk/pull/26236/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26236&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8351623 Stats: 972 lines in 22 files changed: 841 ins; 12 del; 119 mod Patch: https://git.openjdk.org/jdk/pull/26236.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26236/head:pull/26236 PR: https://git.openjdk.org/jdk/pull/26236 From thartmann at openjdk.org Thu Jul 10 07:17:43 2025 From: thartmann at openjdk.org (Tobias 
Hartmann) Date: Thu, 10 Jul 2025 07:17:43 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v6] In-Reply-To: <-rnlrm6PHRZeO1izbXh5nOrm368YKrsFft1u6SHXzWA=.9c8e6646-0b72-4705-895e-f795f74f3906@github.com> References: <-rnlrm6PHRZeO1izbXh5nOrm368YKrsFft1u6SHXzWA=.9c8e6646-0b72-4705-895e-f795f74f3906@github.com> Message-ID: On Thu, 10 Jul 2025 06:13:27 GMT, Marc Chevalier wrote: >> When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 >> >> This is enforced by restoring the old state, like in >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 >> >> That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: >> >> ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) >> >> >> Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. >> >> Another situation is somewhat worse, when happening during parsing. It can lead to such cases: >> >> ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) >> >> The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? >> >> This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 >> >> And here there is the challenge: >> - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) >> ... > > Marc Chevalier has updated the pull request incrementally with two additional commits since the last revision: > > - Forgot to destruct_map_clone > - +'_' and ctor init Marked as reviewed by thartmann (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/25936#pullrequestreview-3004286462 From duke at openjdk.org Thu Jul 10 08:08:43 2025 From: duke at openjdk.org (erifan) Date: Thu, 10 Jul 2025 08:08:43 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: References: <_RERljqu_FG7ZyneAk7Thd-9TwED18pQpEBz_i105fY=.b8948a23-273a-49f6-b9cb-6b611a5eedc6@github.com> <6SXA9ZrXBDhZLyXP3lXbkpl4dl3iocvDpzPrUpIQOl8=.9b025be2-848b-4b78-a5e4-929cb7e9f798@github.com> Message-ID: <7QVWVj5vpSB42THa2rx-oxMqhH76qMZ5MBJjindRiLo=.b825076a-aa9c-4b86-94b6-0a593f2240ac@github.com> On Mon, 7 Jul 2025 09:08:37 GMT, Jatin Bhateja wrote: >>> What if during iterative GVN a constant -1 seeps through IR graph and gets connected to the input of VectorLongToMaskNode, you won't be able to create maskAll true in that case? >> >> Yes, this PR doesn't support this case. Maybe we should do this optimization in `ideal`. If `VectorLongToMask` is not supported, then try to convert it to `maskAll` or `Replicate` in intrinsic. >> >>> Do you see any advantage of doing this at intrinsic layer over entirely handling it in Java implimentation by simply modifying the opcode of fromBitsCoerced to MODE_BROADCAST from existing MODE_BITS_COERCED_LONG_TO_MASK for 0 or -1 input. >> >> I had tried this method and gave it up, because it has up to 34% performance regression for specific cases on x64. > >> > What if during iterative GVN a constant -1 seeps through IR graph and gets connected to the input of VectorLongToMaskNode, you won't be able to create maskAll true in that case? >> >> Yes, this PR doesn't support this case. Maybe we should do this optimization in `ideal`. If `VectorLongToMask` is not supported, then try to convert it to `maskAll` or `Replicate` in intrinsic. >> > > I would suggest extending VectorLongToMaskNode::Ideal for completeness of the solution. OK. But in order to cover various cases, the implementation may be a bit troublesome. The solution I thought of is to **check whether the architecture supports VectorLongToMask, MaskAll and Replicate in `LibraryCallKit::inline_vector_frombits_coerced`. If it does, generate VectorLongToMask, and then convert it to MaskAll or Replicate in IGVN**. This is similar to the current implementation of vector rotate. At the same time, this conversion may affect some other optimizations, such as `VectorMaskToLong(VectorLongToMask (x)) => x` and `VectorStoreMask(VectorLoadMask (x)) => x`. So we also need to fix these effects. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2196930141 From amitkumar at openjdk.org Thu Jul 10 08:29:39 2025 From: amitkumar at openjdk.org (Amit Kumar) Date: Thu, 10 Jul 2025 08:29:39 GMT Subject: RFR: 8358756: [s390x] Test StartupOutput.java crash due to CodeCache size [v2] In-Reply-To: References: <5EregFPpep4Y8cL1v_GnvR6vq415jVK-u_6MuCPNfm4=.8b2ac0ec-47fe-4f50-9815-a61ea400f58f@github.com> Message-ID: On Thu, 10 Jul 2025 03:37:37 GMT, Amit Kumar wrote: >>> Just for my understanding. Even if test passes we still want to see this warning: >>> >>> ``` >>> [warning][codecache] CodeCache is full. Compiler has been disabled. >>> ``` >> >> The test passes with and without that message. When the randomly chosen amount of code cache is not enough to start the compiler(s) it should print that message, when it is enough to start both compilers, you don't see that message. 
>> The important thing is that there is no crash when compilers are trying to reserve code cache (they should be just shut down). > > @dafedafe any further comments on this one :-) > Sorry for the delay @offamitkumar, I left it a bit on the side... > > > What I wanted to verify with above expected crash is that current number are not enough for the compilers and we saw the output is containing that "Codecache is full" message. > > I think that if you want to check for the "Codecache is full..." message you should probably add a new test (as part of this regression test class) instead of changing this one as the purpose here is really to check for no crashes. I think current modification is enough for us as well. Those message tweaks were just for my own curiosity and not required any further. Are you fine with the current changes ? ------------- PR Comment: https://git.openjdk.org/jdk/pull/25741#issuecomment-3056293756 From dzhang at openjdk.org Thu Jul 10 08:36:12 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 10 Jul 2025 08:36:12 GMT Subject: RFR: 8361829: [TESTBUG] RISC-V: compiler/vectorization/runner/BasicIntOpTest.java fails when using RVV without using zvbb Message-ID: Hi all, Please take a look and review this PR, thanks! After JDK-8355293 , compiler/vectorization/runner/BasicIntOpTest.java fails when using RVV without using zvbb. The reason for the error is that PopCountVI on RISC-V requires zvbb, not justrvv. ### Test - [x] Run compiler/vectorization/runner/BasicIntOpTest.java on k1 - [x] Run compiler/vectorization/runner/BasicIntOpTest.java on qemu-system (enable RVV) w/ and w/o zvbb ------------- Commit messages: - 8361829: [TESTBUG] RISC-V: compiler/vectorization/runner/BasicIntOpTest.java fails when using RVV without using zvbb Changes: https://git.openjdk.org/jdk/pull/26238/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26238&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361829 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26238.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26238/head:pull/26238 PR: https://git.openjdk.org/jdk/pull/26238 From shade at openjdk.org Thu Jul 10 08:38:39 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 10 Jul 2025 08:38:39 GMT Subject: RFR: 8358580: Rethink how classes are kept alive in training data In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 02:37:05 GMT, Igor Veresov wrote: > Use OopStorage directly instead of JNI handles. Note that we never destroy TrainingData objects, so we don't need to concern ourselves with freeing the OopStorage entries. Also, keeping the klasses alive is only necessary during the training run. During the replay the klasses TD objects refer to are always alive. src/hotspot/share/oops/trainingData.cpp line 437: > 435: KlassTrainingData::KlassTrainingData(InstanceKlass* klass) : TrainingData(klass) { > 436: assert(klass != nullptr, ""); > 437: oop* handle = oop_storage()->allocate(); I don't think you are supposed to allocate from `OopStorage` directly, that's the job for various `Handle`-s. Also, capturing the `java_mirror` does not really block the unloading, see: // Loading the java_mirror does not keep its holder alive. See Klass::keep_alive(). 
inline oop Klass::java_mirror() const { return _java_mirror.resolve(); } So the idiomatic way would be: _holder_mirror = OopHandle(Universe::vm_global(), klass->klass_holder()); ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26233#discussion_r2196997605 From shade at openjdk.org Thu Jul 10 08:44:45 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 10 Jul 2025 08:44:45 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v8] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 20:13:57 GMT, Evgeny Astigeevich wrote: >> Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. >> >> This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. >> >> Tested on Linux and MacOS with and without hsdis: >> - Fastdebug: test passed >> - Slowdebug: test passed. >> - Release: test passed. > > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Apply 8360936-cosmetics-1.patch.txt from PR Marked as reviewed by shade (Reviewer). test/hotspot/jtreg/compiler/onSpinWait/TestOnSpinWaitAArch64.java line 151: > 149: while (iter.hasNext()) { > 150: String line = iter.next().trim(); > 151: if (line.startsWith(";;}")) { Oh, apologies, I made a little mistake here when playing around with trims. Should be: Suggestion: if (line.startsWith(";; }")) { The test probably does not fail because it meets no instructions beyond the spin_wait block. Cleaner to fix it anyway. ------------- PR Review: https://git.openjdk.org/jdk/pull/26072#pullrequestreview-3004580240 PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2197012939 From dzhang at openjdk.org Thu Jul 10 09:23:11 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 10 Jul 2025 09:23:11 GMT Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors Message-ID: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> Follow up [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419), RVV supports all vector type conversion APIs in the Vector API. So we only need to relax the length limit of the short type to achieve a significant improvement in JMH performance for converting between short and wider types. ### Test qemu-system UseRVV: * [x] Run jdk_vector (fastdebug) * [x] Run compiler/vectorapi (fastdebug) ### Performance Following shows the performance improvement of relative VectorAPI JMHs on k1 (256-bit RVV): Benchmark (SIZE) Mode Units Before After Gain VectorFPtoIntCastOperations.microDouble128ToShort128 512 thrpt ops/ms 52.280 840.112 16.07 VectorFPtoIntCastOperations.microDouble128ToShort128 1024 thrpt ops/ms 28.156 429.322 15.25 VectorFPtoIntCastOperations.microFloat64ToShort64 512 thrpt ops/ms 14.242 479.509 33.67 VectorFPtoIntCastOperations.microFloat64ToShort64 1024 thrpt ops/ms 6.906 242.690 35.14 PS: `VectorFPtoIntCastOperations.microFloat64ToShort64` is added by [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419). 
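For reference, below is a minimal sketch of the kind of float-to-short narrowing such benchmarks exercise with the Vector API. The species choices, class name and loop shape are illustrative assumptions, not the actual JMH benchmark code:

import jdk.incubator.vector.*;

public class FloatToShortConvertSketch {
    static final VectorSpecies<Float> F64 = FloatVector.SPECIES_64;
    static final VectorSpecies<Short> S64 = ShortVector.SPECIES_64;

    // Narrow float lanes to short lanes; this is the pattern that benefits once
    // the minimum short-vector length is relaxed on RVV.
    static void convert(float[] src, short[] dst) {
        int i = 0;
        int bound = F64.loopBound(src.length);
        VectorMask<Short> low = S64.indexInRange(0, F64.length()); // only the low lanes hold converted data
        for (; i < bound; i += F64.length()) {
            FloatVector fv = FloatVector.fromArray(F64, src, i);
            ShortVector sv = (ShortVector) fv.convertShape(VectorOperators.F2S, S64, 0);
            sv.intoArray(dst, i, low);
        }
        for (; i < src.length; i++) {
            dst[i] = (short) src[i]; // scalar tail
        }
    }
}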
------------- Commit messages: - 8361836: RISC-V: Relax min vector length to 32-bit for short vectors Changes: https://git.openjdk.org/jdk/pull/26239/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26239&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361836 Stats: 18 lines in 2 files changed: 18 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26239.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26239/head:pull/26239 PR: https://git.openjdk.org/jdk/pull/26239 From dzhang at openjdk.org Thu Jul 10 09:26:53 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 10 Jul 2025 09:26:53 GMT Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors [v2] In-Reply-To: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> Message-ID: <5PCmTwnensUBsUNqVlxUuK6L2nDHIOqek7KEH5r_h_M=.9a05eebc-f3ba-4b0e-b0e0-76e89661c89d@github.com> > Follow up [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419), RVV supports all vector type conversion APIs in the Vector API. > So we only need to relax the length limit of the short type to achieve a significant improvement in JMH performance for converting between short and wider types. > > ### Test > qemu-system UseRVV: > * [x] Run jdk_vector (fastdebug) > * [x] Run compiler/vectorapi (fastdebug) > > ### Performance > Following shows the performance improvement of relative VectorAPI JMHs on k1 (256-bit RVV): > > > Benchmark (SIZE) Mode Units Before After Gain > VectorFPtoIntCastOperations.microDouble128ToShort128 512 thrpt ops/ms 52.280 840.112 16.07 > VectorFPtoIntCastOperations.microDouble128ToShort128 1024 thrpt ops/ms 28.156 429.322 15.25 > VectorFPtoIntCastOperations.microFloat64ToShort64 512 thrpt ops/ms 14.242 479.509 33.67 > VectorFPtoIntCastOperations.microFloat64ToShort64 1024 thrpt ops/ms 6.906 242.690 35.14 > > PS: `VectorFPtoIntCastOperations.microFloat64ToShort64` is added by [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419). Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Adjust the position of comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26239/files - new: https://git.openjdk.org/jdk/pull/26239/files/06598543..0773a366 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26239&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26239&range=00-01 Stats: 2 lines in 1 file changed: 1 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26239.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26239/head:pull/26239 PR: https://git.openjdk.org/jdk/pull/26239 From bkilambi at openjdk.org Thu Jul 10 09:55:44 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Thu, 10 Jul 2025 09:55:44 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v13] In-Reply-To: References: Message-ID: <75qy49Hm0CDAirFRCqKVrLS_QKt6J-p4c1vryUBHCE8=.b787f304-6444-4e16-acaa-049ea4be2670@github.com> On Thu, 10 Jul 2025 03:15:24 GMT, Xiaohong Gong wrote: >> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2919: >> >>> 2917: ins(tmp, D, src2, 1, 0); >>> 2918: tbl(dst, size1, tmp, 1, dst); >>> 2919: } >> >> Is it better than we wrap this part as a help function, because the code is much the same with line2885-2898? > > These two functions can be refined more clearly. 
Following is my version: > > void C2_MacroAssembler::select_from_two_vectors_neon(FloatRegister dst, FloatRegister src1, > FloatRegister src2, FloatRegister index, > FloatRegister tmp, unsigned length_in_bytes) { > assert_different_registers(dst, src1, src2, tmp); > SIMD_Arrangement size = length_in_bytes == 16 ? T16B : T8B; > > if (length_in_bytes == 16) { > assert(UseSVE <= 1, "sve must be <= 1"); > // If the vector length is 16B, then use the Neon "tbl" instruction with two vector table > tbl(dst, size, src1, 2, index); > } else { // vector length == 8 > assert(UseSVE == 0, "must be Neon only"); > // We need to fit both the source vectors (src1, src2) in a 128-bit register because the > // Neon "tbl" instruction supports only looking up 16B vectors. We then use the Neon "tbl" > // instruction with one vector lookup > ins(tmp, D, src1, 0, 0); > ins(tmp, D, src2, 1, 0); > tbl(dst, size, tmp, 1, index); > } > } > > void C2_MacroAssembler::select_from_two_vectors_sve(FloatRegister dst, FloatRegister src1, > FloatRegister src2, FloatRegister index, > FloatRegister tmp, BasicType bt, > unsigned length_in_bytes) { > assert_different_registers(dst, src1, src2, index, tmp); > SIMD_RegVariant T = elemType_to_regVariant(bt); > if (length_in_bytes == 8) { > assert(UseSVE >= 1, "must be"); > ins(tmp, D, src1, 0, 0); > ins(tmp, D, src2, 1, 0); > sve_tbl(dst, T, tmp, index); > } else { > assert(UseSVE == 2 && length_in_bytes == MaxVectorSize, "must be"); > sve_tbl(dst, T, src1, src2, index); > } > } > > void C2_MacroAssembler::select_from_two_vectors(FloatRegister dst, FloatRegister src1, > FloatRegister src2, FloatRegister index, > FloatRegister tmp, BasicType bt, > unsigned length_in_bytes) { > > assert_different_registers(dst, src1, src2, index, tmp); > > if (UseSVE == 2 || (UseSVE == 1 && length_in_bytes == 8)) { > select_from_two_vectors_sve(dst, src1, src2, index, tmp, bt, length_in_bytes); > return; > } > > // The only BasicTypes that can reach here are T_SHORT, T_BYTE, T_INT and T_FLOAT > assert(bt != T_DOUBLE ... Thanks a lot for your suggestion @XiaohongGong . I will try this suggestion and see how it looks and get back. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2197182232 From mchevalier at openjdk.org Thu Jul 10 10:48:43 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 10 Jul 2025 10:48:43 GMT Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v5] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 12:36:31 GMT, Marc Chevalier wrote: >> A first part toward a better support of pure functions, but this time, with guidance from @iwanowww. >> >> ## Pure Functions >> >> Pure functions (considered here) are functions that have no side effects, no effect on the control flow (no exception or such), cannot deopt etc.. It's really a function that you can execute anywhere, with whichever arguments without effect other than wasting time. Integer division is not pure as dividing by zero is throwing. But many floating point functions will just return `NaN` or `+/-infinity` in problematic cases. >> >> ## Scope >> >> We are not going all powerful for now! It's mostly about identifying some pure functions and being able to remove them if the result is unused. Some other things are not part of this PR, on purpose. Especially, this PR doesn't propose a way to move pure calls around. The reason is that pure calls are later expanded into regular calls, which require a control input. 
To be able to do the expansion, we just keep the control in the pure call as well. >> >> ## Implementation Overview >> >> We created here some new node kind for pure calls, inheriting leaf calls, that are expanded into regular leaf calls during final graph reshaping. The possibility to support pure call directly in AD file is left open. >> >> This PR also introduces `TupleNode` (largely based on an original idea/implem of @iwanowww), that just tie multiple input together and play well with `ProjNode`: the n-th projection of a `TupleNode` is the n-th input of the tuple. This is a convenient way to skip and remove nodes from the graph while delegating the difficulty of the surgery to the trusted IGVN's implementation. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Tentative to address Tobias' comments @iwanowww would you like to take a look at it, since you have quite some context already? ------------- PR Comment: https://git.openjdk.org/jdk/pull/25760#issuecomment-3056888461 From eastigeevich at openjdk.org Thu Jul 10 11:21:25 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Thu, 10 Jul 2025 11:21:25 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v9] In-Reply-To: References: Message-ID: > Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. > > This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. > > Tested on Linux and MacOS with and without hsdis: > - Fastdebug: test passed > - Slowdebug: test passed. > - Release: test passed. 
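For illustration only, a rough sketch of the parsing approach: scan the CompileCommand=print output, find the spin-wait block, and count instruction lines inside it. The marker strings, class name and mnemonics below are assumptions, not the exact code of the test:

import java.util.List;

class SpinWaitInstructionCounter {
    // Count occurrences of a mnemonic inside the ";; spin_wait { ... ;; }" block
    // printed for the tested method.
    static int countInBlock(List<String> output, String mnemonic) {
        boolean inBlock = false;
        int count = 0;
        for (String raw : output) {
            String line = raw.trim();
            if (line.contains("spin_wait {")) {
                inBlock = true;
            } else if (inBlock && line.startsWith(";; }")) {
                break; // end of the spin-wait block
            } else if (inBlock && line.contains(mnemonic)) {
                count++; // e.g. "isb", "yield" or "nop"
            }
        }
        return count;
    }
}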
Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: Fix detection of block comment end; Use specilized lambda function to count instructions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26072/files - new: https://git.openjdk.org/jdk/pull/26072/files/1b4d81be..92a20a20 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=07-08 Stats: 26 lines in 1 file changed: 18 ins; 4 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/26072.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26072/head:pull/26072 PR: https://git.openjdk.org/jdk/pull/26072 From eastigeevich at openjdk.org Thu Jul 10 11:24:42 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Thu, 10 Jul 2025 11:24:42 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v8] In-Reply-To: References: Message-ID: <-7KCidOhy3DWOczvBPEZ195-QLvz451RC7QmAdiGAlQ=.ad0c0584-4b0b-4506-8944-3a4d1113fbbd@github.com> On Thu, 10 Jul 2025 08:41:58 GMT, Aleksey Shipilev wrote: >> Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: >> >> Apply 8360936-cosmetics-1.patch.txt from PR > > test/hotspot/jtreg/compiler/onSpinWait/TestOnSpinWaitAArch64.java line 151: > >> 149: while (iter.hasNext()) { >> 150: String line = iter.next().trim(); >> 151: if (line.startsWith(";;}")) { > > Oh, apologies, I made a little mistake here when playing around with trims. Should be: > Suggestion: > > if (line.startsWith(";; }")) { > > > The test probably does not fail because it meets no instructions beyond the spin_wait block. Cleaner to fix it anyway. I fixed this. I have also added a specialized lambda function to count expected instructions: - for disassembled code, just check a line. - for hex code, split and count. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2197435713 From fyang at openjdk.org Thu Jul 10 11:39:38 2025 From: fyang at openjdk.org (Fei Yang) Date: Thu, 10 Jul 2025 11:39:38 GMT Subject: RFR: [TESTBUG] RISC-V: compiler/vectorization/runner/BasicIntOpTest.java fails with RVV but not Zvbb In-Reply-To: References: Message-ID: <97uBEuOHsVxLaR0PVkzcvVvUFlhBpOcYzjC_e8xc77k=.47d7bcac-d228-409a-8b76-9c56b6e0d74c@github.com> On Thu, 10 Jul 2025 08:31:53 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > After JDK-8355293 , compiler/vectorization/runner/BasicIntOpTest.java failswith RVV but not Zvbb. > The reason for the error is that PopCountVI on RISC-V requires zvbb, not just rvv. > > ### Test > - [x] Run compiler/vectorization/runner/BasicIntOpTest.java on k1 > - [x] Run compiler/vectorization/runner/BasicIntOpTest.java on qemu-system (enable RVV) w/ and w/o zvbb Thanks. Looks reasonable. ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26238#pullrequestreview-3005287870 From coleenp at openjdk.org Thu Jul 10 11:54:39 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 10 Jul 2025 11:54:39 GMT Subject: RFR: 8358580: Rethink how classes are kept alive in training data In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 08:35:48 GMT, Aleksey Shipilev wrote: >> Use OopStorage directly instead of JNI handles. Note that we never destroy TrainingData objects, so we don't need to concern ourselves with freeing the OopStorage entries. 
Also, keeping the klasses alive is only necessary during the training run. During the replay the klasses TD objects refer to are always alive. > > src/hotspot/share/oops/trainingData.cpp line 437: > >> 435: KlassTrainingData::KlassTrainingData(InstanceKlass* klass) : TrainingData(klass) { >> 436: assert(klass != nullptr, ""); >> 437: oop* handle = oop_storage()->allocate(); > > I don't think you are supposed to allocate from `OopStorage` directly, that's the job for various `Handle`-s. Also, capturing the `java_mirror` does not really block the unloading, see: > > > // Loading the java_mirror does not keep its holder alive. See Klass::keep_alive(). > inline oop Klass::java_mirror() const { > return _java_mirror.resolve(); > } > > > So the idiomatic way would be: > > > _holder_mirror = OopHandle(Universe::vm_global(), klass->klass_holder()); What a confusing comment, but luckily it points to Klass::keep_alive() for context. Yes, please don't allocate an OopStorage handle directly. Then the OopHandle constructor will check for native oom. Otherwise this seems okay and better than using jni. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26233#discussion_r2197505170 From shade at openjdk.org Thu Jul 10 12:27:43 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 10 Jul 2025 12:27:43 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v8] In-Reply-To: <-7KCidOhy3DWOczvBPEZ195-QLvz451RC7QmAdiGAlQ=.ad0c0584-4b0b-4506-8944-3a4d1113fbbd@github.com> References: <-7KCidOhy3DWOczvBPEZ195-QLvz451RC7QmAdiGAlQ=.ad0c0584-4b0b-4506-8944-3a4d1113fbbd@github.com> Message-ID: On Thu, 10 Jul 2025 11:21:55 GMT, Evgeny Astigeevich wrote: >> test/hotspot/jtreg/compiler/onSpinWait/TestOnSpinWaitAArch64.java line 151: >> >>> 149: while (iter.hasNext()) { >>> 150: String line = iter.next().trim(); >>> 151: if (line.startsWith(";;}")) { >> >> Oh, apologies, I made a little mistake here when playing around with trims. Should be: >> Suggestion: >> >> if (line.startsWith(";; }")) { >> >> >> The test probably does not fail because it meets no instructions beyond the spin_wait block. Cleaner to fix it anyway. > > I fixed this. > I have also added a specialized lambda function to count expected instructions: > - for disassembled code, just check a line. > - for hex code, split and count. Not sure lambdas make this cleaner, TBH. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2197580478 From chagedorn at openjdk.org Thu Jul 10 12:53:45 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 10 Jul 2025 12:53:45 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp [v2] In-Reply-To: References: Message-ID: <12wwp9Vw7IZOUSXfONjmvyj3cr1YaX85XdJZvGboUUs=.0c17c79f-74c3-4e7d-98a4-a8f68bb37b8f@github.com> On Thu, 10 Jul 2025 02:11:21 GMT, Guanqiang Han wrote: >> When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. >> >> This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this flag is disabled. 
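As an illustration of the shape of Java code this path sees, here is a hypothetical sketch (not the regression test added in the PR) of a pointer compare on a Phi that merges two allocations:

class PtrCompareOnPhiSketch {
    static class Box {
        int value;
        Box(int value) { this.value = value; }
    }

    // 'merged' is a Phi of two allocations; 'merged == a' becomes a CmpP on that
    // Phi, which is what reduce_phi_on_cmp tries to split per allocation.
    static boolean test(boolean cond) {
        Box a = new Box(1);
        Box b = new Box(2);
        Box merged = cond ? a : b;
        return merged == a;
    }
}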
> > Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - update modification and add regression test > - Merge remote-tracking branch 'upstream/master' into 8361140 > - 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp > > When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. > > This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this support is disabled. Thanks for the update! I have some follow-up comments. src/hotspot/share/opto/escape.cpp line 3280: > 3278: const TypeInt* EQ = TypeInt::CC_EQ; // [0] == ZERO > 3279: const TypeInt* NE = TypeInt::CC_GT; // [1] == ONE > 3280: const TypeInt* UNKNOWN = TypeInt::CC; // [-1, 0,1] I suggest to move the `UNKNOWN` definition up and then use `UNKNOWN` as return value which also serves as documentation. test/hotspot/jtreg/compiler/c2/TestReducePhiOnCmpWithNoOptPtrCompare.java line 29: > 27: * @summary Test ConnectionGraph::reduce_phi_on_cmp when OptimizePtrCompare is disabled > 28: * @library /test/lib / > 29: * @requires vm.debug == true `OptimizePtrCompare` is a product flag. Thus, you do not need this `requires`. Suggestion: test/hotspot/jtreg/compiler/c2/TestReducePhiOnCmpWithNoOptPtrCompare.java line 30: > 28: * @library /test/lib / > 29: * @requires vm.debug == true > 30: * @requires vm.compiler2.enabled I suggest to also remove this line and additionally pass `-XX:+IgnoreUnrecognizedVMOptions` as flag to the IR framework. Suggestion: test/hotspot/jtreg/compiler/c2/TestReducePhiOnCmpWithNoOptPtrCompare.java line 46: > 44: TestFramework framework = new TestFramework(); > 45: Scenario scenario0 = new Scenario(0, "-XX:-OptimizePtrCompare"); > 46: framework.addScenarios(scenario0).start(); Since you only use one setting, you can directly use `TestFramework.runWithFlags()`. Can you also add `-XX:+VerifyReduceAllocationMerges` for additional verification? I also suggest to add a copy of scenario 0 at `AllocationMergesTests` and add `-XX:-OptimizePtrCompare` and `-XX:+VerifyReduceAllocationMerges` to the scenario. test/hotspot/jtreg/compiler/c2/TestReducePhiOnCmpWithNoOptPtrCompare.java line 53: > 51: invocations++; > 52: Random random = info.getRandom(); > 53: boolean cond = invocations % 2 == 0; Why don't you just use `random.nextBoolean()`? 
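Put together, the suggested test setup could look roughly like this; the exact flag set and driver structure are only a sketch of the review suggestions above:

import compiler.lib.ir_framework.TestFramework;

public class TestReducePhiOnCmpWithNoOptPtrCompare {
    public static void main(String[] args) {
        // A single flag combination, so no Scenario is needed; extra verification
        // comes from VerifyReduceAllocationMerges as suggested above.
        TestFramework.runWithFlags("-XX:+IgnoreUnrecognizedVMOptions",
                                   "-XX:-OptimizePtrCompare",
                                   "-XX:+VerifyReduceAllocationMerges");
    }
}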
------------- PR Review: https://git.openjdk.org/jdk/pull/26125#pullrequestreview-3004309866 PR Review Comment: https://git.openjdk.org/jdk/pull/26125#discussion_r2197647931 PR Review Comment: https://git.openjdk.org/jdk/pull/26125#discussion_r2196840228 PR Review Comment: https://git.openjdk.org/jdk/pull/26125#discussion_r2196840831 PR Review Comment: https://git.openjdk.org/jdk/pull/26125#discussion_r2197643267 PR Review Comment: https://git.openjdk.org/jdk/pull/26125#discussion_r2197645377 From eastigeevich at openjdk.org Thu Jul 10 12:58:23 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Thu, 10 Jul 2025 12:58:23 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v10] In-Reply-To: References: Message-ID: > Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. > > This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. > > Tested on Linux and MacOS with and without hsdis: > - Fastdebug: test passed > - Slowdebug: test passed. > - Release: test passed. Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: Remove lambda ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26072/files - new: https://git.openjdk.org/jdk/pull/26072/files/92a20a20..22644d5f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26072&range=08-09 Stats: 23 lines in 1 file changed: 4 ins; 17 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26072.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26072/head:pull/26072 PR: https://git.openjdk.org/jdk/pull/26072 From eastigeevich at openjdk.org Thu Jul 10 12:58:23 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Thu, 10 Jul 2025 12:58:23 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v8] In-Reply-To: References: <-7KCidOhy3DWOczvBPEZ195-QLvz451RC7QmAdiGAlQ=.ad0c0584-4b0b-4506-8944-3a4d1113fbbd@github.com> Message-ID: On Thu, 10 Jul 2025 12:24:51 GMT, Aleksey Shipilev wrote: >> I fixed this. >> I have also added a specialized lambda function to count expected instructions: >> - for disassembled code, just check a line. >> - for hex code, split and count. > > Not sure lambdas make this cleaner, TBH. Ok, the lambda is removed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26072#discussion_r2197659251 From shade at openjdk.org Thu Jul 10 13:02:07 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 10 Jul 2025 13:02:07 GMT Subject: RFR: 8231269: CompileTask::is_unloaded is slow due to JNIHandles type checks [v23] In-Reply-To: References: Message-ID: > [JDK-8163511](https://bugs.openjdk.org/browse/JDK-8163511) made the `CompileTask` improvement to avoid blocking class unloading if a relevant compile task is in queue. Current code does a sleight-of-hand to make sure the the `method*` in `CompileTask` are still valid before using them. Still a noble goal, so we keep trying to do this. 
> > The code tries to switch weak JNI handle with a strong one when it wants to capture the holder to block unloading. Since we are reusing the same field, we have to do type checks like `JNIHandles::is_weak_global_handle(_method_holder)`. Unfortunately, that type-check goes all the way to `OopStorage` allocation code to verify the handle is really allocated in the relevant `OopStorage`. This takes internal `OopStorage` locks, and thus is slow. > > This issue is clearly visible in Leyden, when there are lots of `CompileTask`-s in the queue, dumped by AOT code loader. It also does not help that `CompileTask::select_task` is effectively quadratic in number of methods in queue, so we end up calling `CompileTask::is_unloaded` very often. > > It is possible to mitigate this issue by splitting the related fields into weak and strong ones. But as Kim mentions in the bug, we should not be using JNI handles here at all, and instead go directly for relevant `OopStorage`-s. This is what this PR does, among other things that should hopefully make the whole mechanics clearer. > > Additional testing: > - [x] Linux x86_64 server fastdebug, `compiler/classUnloading`, 100x still passes; these tests are sensitive to bugs in this code > - [x] Linux x86_64 server fastdebug, `all` > - [x] Linux AArch64 server fastdebug, `all` Aleksey Shipilev has updated the pull request incrementally with six additional commits since the last revision: - Docs touchup - Use enum class - Further simplify the API - Tune up for release builds - Move release() to destructor - Deal with things without spinlocks ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24018/files - new: https://git.openjdk.org/jdk/pull/24018/files/d5a8a27d..b27c0633 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24018&range=22 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24018&range=21-22 Stats: 162 lines in 3 files changed: 35 ins; 94 del; 33 mod Patch: https://git.openjdk.org/jdk/pull/24018.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24018/head:pull/24018 PR: https://git.openjdk.org/jdk/pull/24018 From shade at openjdk.org Thu Jul 10 13:02:09 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 10 Jul 2025 13:02:09 GMT Subject: RFR: 8231269: CompileTask::is_unloaded is slow due to JNIHandles type checks [v22] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 19:26:41 GMT, Kim Barrett wrote: >> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 37 commits: >> >> - Merge branch 'master' into JDK-8231269-compile-task-weaks >> - Merge branch 'master' into JDK-8231269-compile-task-weaks >> - Switch to mutable >> - Merge branch 'master' into JDK-8231269-compile-task-weaks >> - More touchups >> - Spin lock induces false sharing >> - Merge branch 'master' into JDK-8231269-compile-task-weaks >> - Merge branch 'master' into JDK-8231269-compile-task-weaks >> - Rename CompilerTask::is_unloaded back to avoid losing comment context >> - Simplify select_for_compilation >> - ... and 27 more: https://git.openjdk.org/jdk/compare/a41d3507...d5a8a27d > > src/hotspot/share/oops/unloadableMethodHandle.hpp line 81: > >> 79: friend class VMStructs; >> 80: private: >> 81: enum State { > > Not really a review, just a drive-by comment. 
> I think the only argument against using an enum class here is the lack of C++20's "using enums" > feature: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1099r5.html > Personally I'd prefer to just make it an enum class and scope the references. YMMV. > > Also, someday we should try to come to some consensus about the naming of constants. I don't mind converting this to enum class, done in new commit. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24018#discussion_r2197669086 From shade at openjdk.org Thu Jul 10 13:08:39 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 10 Jul 2025 13:08:39 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v10] In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 12:58:23 GMT, Evgeny Astigeevich wrote: >> Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. >> >> This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. >> >> Tested on Linux and MacOS with and without hsdis: >> - Fastdebug: test passed >> - Slowdebug: test passed. >> - Release: test passed. > > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Remove lambda Marked as reviewed by shade (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26072#pullrequestreview-3005616214 From mablakatov at openjdk.org Thu Jul 10 13:53:25 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Thu, 10 Jul 2025 13:53:25 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: References: Message-ID: > Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. > > Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still. > > The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reducion micro-benchmarks. 
> > Benchmarks results: > > Neoverse-V1 (SVE 256-bit) > > Benchmark (size) Mode master PR Units > ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms > ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms > IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms > LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms > FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms > DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms > > > Fujitsu A64FX (SVE 512-bit): > > Benchmark (size) Mode master PR Units > ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms > ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms > IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms > LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms > FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms > DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: remove the strictly-ordered FP implementation as unused ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23181/files - new: https://git.openjdk.org/jdk/pull/23181/files/d35f1089..4593a5d7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=06-07 Stats: 119 lines in 4 files changed: 8 ins; 105 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/23181.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23181/head:pull/23181 PR: https://git.openjdk.org/jdk/pull/23181 From mablakatov at openjdk.org Thu Jul 10 14:09:42 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Thu, 10 Jul 2025 14:09:42 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 05:53:44 GMT, Xiaohong Gong wrote: >> @mikabl-arm @XiaohongGong I'm a little busy these weeks before going on vacation, so I won't have time to look into this more deeply. >> >> However, I do plan to remove the auto-vectorization restrictions for simple reductions. >> https://bugs.openjdk.org/browse/JDK-8307516 >> >> You can already now disable the (bad) reduction heuristic, using `AutoVectorizationOverrideProfitability`. >> https://bugs.openjdk.org/browse/JDK-8357530 >> I published benchmark results there: >> https://github.com/openjdk/jdk/pull/25387 >> You can see that enabling simple reductions is in most cases actually profitable now. But float/double add and mul have strict reduction order, and that usually prevents vectorization from being profitable. The strict-order vector reduction is quite expensive, and it only becomes beneficial if there is a lot of other code in the loop that can be vectorized. Soon, I plan to add a cost-model, so that we can predict if vectorization is profitable. >> >> It would also be nice to actually find a benchmark where float add/mul reductions lead to a speedup with vectorization. So far I have not seen any example in my benchmarks: https://github.com/openjdk/jdk/pull/25387 If you find any such example, please let me know ;) >> >> I don't have access to any SVE machines, so I cannot help you there, unfortunately. >> >> Is this helpful to you? > >> You can see that enabling simple reductions is in most cases actually profitable now. But float/double add and mul have strict reduction order, and that usually prevents vectorization from being profitable. 
The strict-order vector reduction is quite expensive, and it only becomes beneficial if there is a lot of other code in the loop that can be vectorized. Soon, I plan to add a cost-model, so that we can predict if vectorization is profitable. >> >> It would also be nice to actually find a benchmark where float add/mul reductions lead to a speedup with vectorization. So far I have not seen any example in my benchmarks: https://github.com/openjdk/jdk/pull/25387 If you find any such example, please let me know ;) >> >> I don't have access to any SVE machines, so I cannot help you there, unfortunately. >> >>Is this helpful to you? > > Thanks for your input @eme64 ! It's really helpful to me. And it would be the right direction that using the cost model to guide whether vectorizing FP mul reduction is profitable or not. With this, I think the backend check of auto-vectorization for such operations can be removed safely. We can relay on the SLP's analysis. > > BTW, the current profitability heuristics can provide help on disabling auto-vectorization for the simple cases while enabling the complex ones. This is also helpful to us. > > I tested the performance of `VectorReduction2` with/without auto-vectorization for FP mul reductions on my SVE 128-bit machine. The performance difference is not very significant for both `floatMulSimple` and `floatMulBig`. But I guess the performance change would be different with auto-vectorization on HWs with larger vector size. As we do not have the SVE machines with larger vector size as well, we may need help from @mikabl-arm ! If the performance of `floatMulBig` is improved with auto-vectorization, I think we can remove the limitation of such reductions for auto-vectorization on AArch64. @XiaohongGong , @shqking , @eme64 , Thank you all for the insightful and detailed comments! I really appreciate the effort to explore the performance implications of auto-vectorization cases. I agree it would be helpful if @fg1417 could join this discussion. However, before diving deeper, I?d like to clarify the problem statement as we see it. I've also updated the JBS ticket accordingly, and I?m citing the key part here for visibility: > To clarify, the goal of this ticket is to improve the performance of mul reduction VectorAPI operations on SVE-capable platforms with vector lengths greater than 128 bits (e.g., Neoverse V1). The core issue is that these APIs are not being lowered to any AArch64 implementation at all on such platforms. Instead, the fallback Java implementation is used. This PR does **not** target improvements in auto-vectorization. In the context of auto-vectorization, the scope of this PR is limited to maintaining correctness and avoiding regressions. @shqking , regarding the case-2 that you highlighted - I believe this change is incidental. Prior to the patch, `Matcher::match_rule_supported_auto_vectorization()` returned false for NEON platforms (as expected) and true for 128-bit SVE. This behavior is misleading because HotSpot currently uses the **same scalar mul reduction implementation** for both NEON and SVE platforms. Since this implementation is unprofitable on both, it should have been disabled across the board. @fg1417, please correct me if I?m mistaken. This PR cannot leave `Matcher::match_rule_supported_auto_vectorization()` unchanged. If we do, HotSpot will select the strictly-ordered FP vector reduction implementation, which is not performant. A more efficient SVE-based implementation can't be used due to the strict ordering requirement. 
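To make the ordering point concrete, here is a small illustrative comparison (not one of the benchmarks quoted in this thread): the scalar loop fixes the multiplication order, which an auto-vectorizer must preserve, while the Vector API reduction is free to reassociate across lanes.

import jdk.incubator.vector.*;

class MulReductionOrderSketch {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Strictly ordered: a[0] * a[1] * ... * a[n-1], in exactly that order.
    static float scalarMul(float[] a) {
        float r = 1.0f;
        for (float v : a) {
            r *= v;
        }
        return r;
    }

    // Lane-wise accumulation plus reduceLanes(MUL): the cross-lane order is
    // unspecified, so the result may differ from the scalar loop in the last bits.
    static float vectorMul(float[] a) {
        FloatVector acc = FloatVector.broadcast(SPECIES, 1.0f);
        int i = 0;
        for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
            acc = acc.mul(FloatVector.fromArray(SPECIES, a, i));
        }
        float r = acc.reduceLanes(VectorOperators.MUL);
        for (; i < a.length; i++) {
            r *= a[i];
        }
        return r;
    }
}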
@XiaohongGong , > But I guess the performance change would be different with auto-vectorization on HWs with larger vector size. As we do not have the SVE machines with larger vector size as well, we may need help from @mikabl-arm ! Here are performance numbers for Neoverse V1 with the auto-vectorization restriction in `Matcher::match_rule_supported_auto_vectorization()` lifted (`After`). The linear strictly-ordered SVE implementation matched this way was later removed by https://github.com/openjdk/jdk/pull/23181/commits/4593a5d717024df01769625993c2b769d8dde311. | Benchmark | Before (ns/op) | After (ns/op) | Diff (%) | |:-----------------------------------------------|-----------------:|----------------:|:-----------| | VectorReduction.WithSuperword.mulRedD | 401.679 | 401.704 | ~ | | VectorReduction2.WithSuperword.doubleMulBig | 2365.554 | 7294.706 | +208.37% | | VectorReduction2.WithSuperword.doubleMulSimple | 2321.154 | 2321.207 | ~ | | VectorReduction2.WithSuperword.floatMulBig | 2356.006 | 2648.334 | +12.41% | | VectorReduction2.WithSuperword.floatMulSimple | 2321.018 | 2321.135 | ~ | Given that: - this PR focuses on VectorAPI and **not** on auto-vectorization, - and it does **not** introduce regressions in auto-vectorization performance, I suggest: - continuing the discussion on auto-vectorization separately on hotspot-dev, including @fg1417 in the loop; - moving forward with resolving the remaining VectorAPI issues and merging this PR. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-3057612901 From mchevalier at openjdk.org Thu Jul 10 14:24:21 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 10 Jul 2025 14:24:21 GMT Subject: RFR: 8361494: [IR Framework] Escape too much in replacement of placeholder Message-ID: In `RawIRNode::regex`, the call to `String::replaceAll` doesn't quote the replace string. Meaning that in the IR rule @IR(failOn = {IRNode.ALLOC_OF, "\\w+"}) The interpreted `\w` is interpreted as a group reference, and we get java.lang.IllegalArgumentException: Illegal group reference so we should write instead @IR(failOn = {IRNode.ALLOC_OF, "\\\\w+"}) To mean the interpreted string `\\w`, to mean an escaped single backslash. Same goes with `$` (used for nested classes). Since we don't want to refer to groups (and anyway, there are not in `IRNode.IS_REPLACED`), we just quote the replacement string with `java.util.regex.Matcher.quoteReplacement` to make it more usable. Note that you would still need to write `\$` since the `$` is the end of string regex, and needs to be escaped at the regex level (and not at the string, so it's not `$`, since `$` is not a special character). Before the fix, it should be `\\\$`. Phew! Regexes are bad enough, let's not escape them manually twice! In `test/hotspot/jtreg/compiler/c2/TestMergeStores.java`, that makes us save 1344 backslashes. 
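For illustration, a standalone example of the effect; the template string below is made up and is not the framework's actual regex:

import java.util.regex.Matcher;

class QuoteReplacementDemo {
    public static void main(String[] args) {
        String template = "allocation of IS_REPLACED";
        String userPostfix = "Outer$Inner";   // '$' shows up in nested class names

        // Unquoted: '$' in the replacement is parsed as a group reference and
        // replaceAll() throws IllegalArgumentException: Illegal group reference.
        // template.replaceAll("IS_REPLACED", userPostfix);

        // Quoted: '\' and '$' in the replacement are taken literally.
        String quoted = template.replaceAll("IS_REPLACED",
                                            Matcher.quoteReplacement(userPostfix));
        System.out.println(quoted);   // allocation of Outer$Inner
    }
}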
Thanks, Marc ------------- Commit messages: - Fix test/hotspot/jtreg/compiler/c2/TestMergeStores.java - quoteReplacement Changes: https://git.openjdk.org/jdk/pull/26243/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26243&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361494 Stats: 238 lines in 3 files changed: 13 ins; 0 del; 225 mod Patch: https://git.openjdk.org/jdk/pull/26243.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26243/head:pull/26243 PR: https://git.openjdk.org/jdk/pull/26243 From fjiang at openjdk.org Thu Jul 10 14:26:40 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Thu, 10 Jul 2025 14:26:40 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v2] In-Reply-To: References: Message-ID: On Fri, 27 Jun 2025 12:43:30 GMT, Galder Zamarre?o wrote: > I can't really review it since I'm not familiar with neither riscv, nor the flag nor the COH logic. Thank you! Hi @dean-long, @rwestrel, could you help to take a look at this C1 related change? TIA. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25976#issuecomment-3057674572 From duke at openjdk.org Thu Jul 10 14:32:57 2025 From: duke at openjdk.org (Guanqiang Han) Date: Thu, 10 Jul 2025 14:32:57 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp [v3] In-Reply-To: References: Message-ID: > When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. > > This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this flag is disabled. Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: - update regression test - Merge remote-tracking branch 'upstream/master' into 8361140 - update modification and add regression test - Merge remote-tracking branch 'upstream/master' into 8361140 - 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this support is disabled. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/26125/files - new: https://git.openjdk.org/jdk/pull/26125/files/2feca6a8..fd6f90f5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26125&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26125&range=01-02 Stats: 1158 lines in 25 files changed: 452 ins; 663 del; 43 mod Patch: https://git.openjdk.org/jdk/pull/26125.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26125/head:pull/26125 PR: https://git.openjdk.org/jdk/pull/26125 From duke at openjdk.org Thu Jul 10 14:40:40 2025 From: duke at openjdk.org (Guanqiang Han) Date: Thu, 10 Jul 2025 14:40:40 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp [v2] In-Reply-To: <12wwp9Vw7IZOUSXfONjmvyj3cr1YaX85XdJZvGboUUs=.0c17c79f-74c3-4e7d-98a4-a8f68bb37b8f@github.com> References: <12wwp9Vw7IZOUSXfONjmvyj3cr1YaX85XdJZvGboUUs=.0c17c79f-74c3-4e7d-98a4-a8f68bb37b8f@github.com> Message-ID: On Thu, 10 Jul 2025 12:50:03 GMT, Christian Hagedorn wrote: >> Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - update modification and add regression test >> - Merge remote-tracking branch 'upstream/master' into 8361140 >> - 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp >> >> When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. >> >> This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this support is disabled. > > Thanks for the update! I have some follow-up comments. hi @chhagedorn ?thanks a lot for your detailed and thoughtful review. I'm still learning more about this area, and your feedback has been a great help. I've updated the PR based on your suggestions . Please feel free to let me know if anything else needs improvement. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26125#issuecomment-3057725959 From iveresov at openjdk.org Thu Jul 10 14:45:38 2025 From: iveresov at openjdk.org (Igor Veresov) Date: Thu, 10 Jul 2025 14:45:38 GMT Subject: RFR: 8358580: Rethink how classes are kept alive in training data In-Reply-To: References: Message-ID: <5j7Yko4SxyZxJkDlO9itXMmbLK9W7Mz4b0IYQMplNKA=.3aace3f2-7233-4c5f-8b85-2f4d3b9458fa@github.com> On Thu, 10 Jul 2025 11:51:34 GMT, Coleen Phillimore wrote: >> src/hotspot/share/oops/trainingData.cpp line 437: >> >>> 435: KlassTrainingData::KlassTrainingData(InstanceKlass* klass) : TrainingData(klass) { >>> 436: assert(klass != nullptr, ""); >>> 437: oop* handle = oop_storage()->allocate(); >> >> I don't think you are supposed to allocate from `OopStorage` directly, that's the job for various `Handle`-s. Also, capturing the `java_mirror` does not really block the unloading, see: >> >> >> // Loading the java_mirror does not keep its holder alive. See Klass::keep_alive(). 
>> inline oop Klass::java_mirror() const {
>>   return _java_mirror.resolve();
>> }
>>
>> So the idiomatic way would be:
>>
>> _holder_mirror = OopHandle(Universe::vm_global(), klass->klass_holder());
>
> What a confusing comment, but luckily it points to Klass::keep_alive() for context. Yes, please don't allocate an OopStorage handle directly. Then the OopHandle constructor will check for native OOM.
>
> Otherwise this seems okay and better than using JNI.

@coleenp I kind of like the fact that I can get rid of a field in KTD that previously stored a handle. Perhaps that benefit justifies a direct allocation?

@shipilev I don't understand your comment about loading the mirror. I'm not merely loading it, I'm registering it as a root. It is also all happening in a safepoint-free context.

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26233#discussion_r2197939820

From mhaessig at openjdk.org Thu Jul 10 14:54:39 2025 From: mhaessig at openjdk.org (Manuel Hässig) Date: Thu, 10 Jul 2025 14:54:39 GMT Subject: RFR: 8361494: [IR Framework] Escape too much in replacement of placeholder In-Reply-To: References: Message-ID: <3BYjYX_c2LSt2wepKGpXhRC85bJfz2wAmJhCMY2CMb0=.ed132047-cf0a-40a6-8a3e-69973d0bbaf1@github.com>

On Thu, 10 Jul 2025 12:54:55 GMT, Marc Chevalier wrote:

> In `RawIRNode::regex`, the call to `String::replaceAll` doesn't quote the replace string.
>
> Meaning that in the IR rule
>
>     @IR(failOn = {IRNode.ALLOC_OF, "\\w+"})
>
> the `\w` is interpreted as a group reference, and we get
>
>     java.lang.IllegalArgumentException: Illegal group reference
>
> so we should write instead
>
>     @IR(failOn = {IRNode.ALLOC_OF, "\\\\w+"})
>
> to mean the interpreted string `\\w`, i.e. an escaped single backslash. Same goes with `$` (used for nested classes).
>
> Since we don't want to refer to groups (and anyway, there are none in `IRNode.IS_REPLACED`), we just quote the replacement string with `java.util.regex.Matcher.quoteReplacement` to make it more usable.
>
> Note that you would still need to write `\$` since `$` is the end-of-string regex and needs to be escaped at the regex level (not at the string level, where `$` is not a special character). Before the fix, it had to be `\\\$`. Phew! Regexes are bad enough, let's not escape them manually twice!
>
> In `test/hotspot/jtreg/compiler/c2/TestMergeStores.java`, that saves us 1344 backslashes.
>
> Thanks,
> Marc

Thank you for implementing this improvement @marc-chevalier! Nice to see this paper cut getting eliminated. I have a small suggestion, but this looks good to me regardless.

test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/irrule/checkattribute/parsing/RawIRNode.java line 65:

> 63:     nodeRegex = regexForVectorIRNode(nodeRegex, vmInfo, bound);
> 64: } else if (userPostfix.isValid()) {
> 65:     nodeRegex = nodeRegex.replaceAll(IRNode.IS_REPLACED, java.util.regex.Matcher.quoteReplacement(userPostfix.value()));

Perhaps you might want to `import java.util.regex.Matcher` to make it a bit more concise?

------------- Marked as reviewed by mhaessig (Committer).
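For readers unfamiliar with the replacement-string rules discussed above, here is a small, self-contained illustration (my own example, not code from the patch; the class name, the `template` string and the `Outer$Inner` postfix are made up):

    import java.util.regex.Matcher;

    public class QuoteReplacementDemo {
        public static void main(String[] args) {
            String template = "ALLOC_OF IS_REPLACED";   // stand-in for an IRNode regex template
            String userPostfix = "Outer$Inner";         // e.g. a nested-class name in an IR rule

            // Unquoted: '$I' in the replacement is parsed as a group reference, so
            // replaceAll() throws java.lang.IllegalArgumentException: Illegal group reference.
            try {
                template.replaceAll("IS_REPLACED", userPostfix);
            } catch (IllegalArgumentException e) {
                System.out.println("unquoted: " + e.getMessage());
            }

            // Quoted: the replacement is taken literally, no manual double-escaping needed.
            String result = template.replaceAll("IS_REPLACED", Matcher.quoteReplacement(userPostfix));
            System.out.println("quoted:   " + result);  // ALLOC_OF Outer$Inner
        }
    }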
PR Review: https://git.openjdk.org/jdk/pull/26243#pullrequestreview-3006077669 PR Review Comment: https://git.openjdk.org/jdk/pull/26243#discussion_r2197957431 From thartmann at openjdk.org Thu Jul 10 14:57:41 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 10 Jul 2025 14:57:41 GMT Subject: [jdk25] RFR: 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 17:55:31 GMT, Vladimir Kozlov wrote: > Hi all, > > This pull request contains a backport of commit [dedcce04](https://github.com/openjdk/jdk/commit/dedcce045013b3ff84f5ef8857e1a83f0c09f9ad) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Vladimir Kozlov on 8 Jul 2025 and was reviewed by Andrew Dinn and Matthias Baesken. > > Thanks! Marked as reviewed by thartmann (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26223#pullrequestreview-3006114062 From bkilambi at openjdk.org Thu Jul 10 15:22:44 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Thu, 10 Jul 2025 15:22:44 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v10] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 10:04:26 GMT, Jatin Bhateja wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Addressed review comments > > test/hotspot/jtreg/compiler/vectorapi/TestSelectFromTwoVectorOp.java line 234: > >> 232: >> 233: @Test >> 234: @IR(counts = {IRNode.SELECT_FROM_TWO_VECTOR_VS, IRNode.VECTOR_SIZE_8, ">0"}, > > Hi @Bhavana-Kilambi , > Kindly also include x86-specific feature checks in IR rules for this test. > > You can directly integrate attached patch. > > [select_from_ir_feature.txt](https://github.com/user-attachments/files/21034639/select_from_ir_feature.txt) Hi @jatin-bhateja , have you tested `jdk/incubator/vector` tests with your patch on x86? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2198044066 From fgao at openjdk.org Thu Jul 10 15:52:45 2025 From: fgao at openjdk.org (Fei Gao) Date: Thu, 10 Jul 2025 15:52:45 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v6] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 04:44:35 GMT, Hao Sun wrote: > Background: case-1 was set off after @fg1417 's patch [8275275: AArch64: Fix performance regression after auto-vectorization on NEON](https://github.com/openjdk/jdk/pull/10175). But case-2 was not touched. We are not sure about the reason. There was no 128b SVE machine then? Or there was some limitation of SLP on **reduction**? > > **Limitation** of SLP as mentioned in @fg1417 's patch > > > Because superword doesn't vectorize reductions unconnected with other vector packs, > > Performance data in this PR on case-2: From your provided [test data](https://github.com/openjdk/jdk/pull/23181#issuecomment-3018988067) on `Neoverse V2 (SVE 128-bit). Auto-vectorization section`, there is no obvious performance change on FP Mul Reduction benchmarks `(float|double)Mul(Big|Simple)`. As we checked the generated code of `floatMul(Big|Simple)` on Nvidia Grace machine(128b SVE2), we found that before this PR: > > * `floatMulBig` is vectorized. > * `floatMulSimple` is not vectorized because SLP determines that there is no profit. > > Discussion: should we enable case-1 and case-2? > > * if the SLP limitation on reductions is fixed? 
> * If there is no such limitation, we may consider enabling case-1 and case-2 because a) there is perf regression at least based on current performance results and b) it may provide more auto-vectorization opportunities for other packs inside the loop.
>
> It would be appreciated if @eme64 or @fg1417 could provide more inputs.

@shqking Sorry for joining the discussion a bit late. The patch [8275275: AArch64: Fix performance regression after auto-vectorization on NEON](https://github.com/openjdk/jdk/pull/10175) was intended to fix a regression on `NEON` machines while keeping the behaviour unchanged on `sve` machines, which may be a source of confusion now.

The reason I mentioned this SLP limitation in my previous patch was to clarify why the benchmark cases were written the way they were, and why I chose more complex cases instead of simpler reductions like `floatMulSimple`. The rationale was that if a case like `floatMulBig` doesn't show any performance gain, then a simpler case like `floatMulSimple` is even less likely to benefit. In general, more complex reduction cases are more likely to benefit from auto-vectorization.

@XiaohongGong thanks for testing on the `128-bit sve` machine. Since the performance difference is not significant for both `floatMulSimple` and `floatMulBig` with/without auto-vectorization, and there is a performance drop with auto-vectorization on the `256-bit sve` machine reported by @mikabl-arm, it seems reasonable that it should also be disabled on SVE. I'm looking forward to having a cost model in place, so we can safely remove these restrictions and enable SLP to handle these scenarios more flexibly.

------------- PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-3058018414

From mablakatov at openjdk.org Thu Jul 10 15:55:51 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Thu, 10 Jul 2025 15:55:51 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v4] In-Reply-To: References: <3sWLk_sAMLtcvRUjXk9hYe-K2MBQl9fH2Qg0MF7lwDk=.b8867d51-e822-43c0-93ab-58228c6eb1d5@github.com> Message-ID:

On Wed, 2 Jul 2025 01:46:18 GMT, Xiaohong Gong wrote:

>> Well, we don't match it right now for auto-vectorization as it isn't worth it performance-wise. This might change for future implementations of SVE(2). I'd still prefer to keep it so the set of instructions is complete.
>
> Removing is fine to me, as actually we do not have the case to test the correctness. Or maybe you could just do some changes locally (e.g. removing the `requires_strict_order` predication and the un-strict-order rule), and test it with VectorAPI cases?

Done: https://github.com/openjdk/jdk/commit/4593a5d717024df01769625993c2b769d8dde311

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2198125027

From duke at openjdk.org Thu Jul 10 15:59:51 2025 From: duke at openjdk.org (Samuel Chee) Date: Thu, 10 Jul 2025 15:59:51 GMT Subject: RFR: 8361890: AArch64: Removal of redundant dmb from C1 AtomicLong methods Message-ID: <60YMRP6cNslwEeVX2TWmnMYdO872xGaeShKMEj0dWGY=.2f4f504f-93d1-4bab-b721-e5c964f4c465@github.com>

The current C1 implementation of AtomicLong methods that either add or exchange (such as getAndAdd) emits a ldaddal or swpal respectively when using LSE, followed immediately by a dmb. Since ldaddal/swpal have both acquire and release semantics, this provides similar ordering guarantees to a dmb.full, so the dmb here is redundant and can be removed.
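An illustrative sketch of the redundant pairing (registers and operands below are invented for the example; this is not the emitted code quoted verbatim):

    ; Before: LSE atomic followed by a full barrier
    ldaddal x1, x2, [x0]    ; atomic add with acquire+release semantics
    dmb     ish             ; full barrier -- redundant, ldaddal already orders accesses
    ; After: the trailing dmb is simply no longer emitted
    ldaddal x1, x2, [x0]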
This is due to both clause 7 and clause 11 of the definition of Barrier-ordered-before in B2.3.7 of the DDI0487 L.a Arm Architecture Reference Manual for A-profile architecture being satisfied by the existence of a ldaddal/swpal which ensures such memory ordering guarantees. ------------- Commit messages: - 8361890: AArch64: Removal of redundant dmb from C1 AtomicLong methods Changes: https://git.openjdk.org/jdk/pull/26245/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26245&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361890 Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26245.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26245/head:pull/26245 PR: https://git.openjdk.org/jdk/pull/26245 From coleenp at openjdk.org Thu Jul 10 16:50:45 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 10 Jul 2025 16:50:45 GMT Subject: RFR: 8358580: Rethink how classes are kept alive in training data In-Reply-To: <5j7Yko4SxyZxJkDlO9itXMmbLK9W7Mz4b0IYQMplNKA=.3aace3f2-7233-4c5f-8b85-2f4d3b9458fa@github.com> References: <5j7Yko4SxyZxJkDlO9itXMmbLK9W7Mz4b0IYQMplNKA=.3aace3f2-7233-4c5f-8b85-2f4d3b9458fa@github.com> Message-ID: On Thu, 10 Jul 2025 14:42:47 GMT, Igor Veresov wrote: >> What a confusing comment, but luckily it points to Klass::keep_alive() for context. Yes, please don't allocate an OopStorage handle directly. Then the OopHandle constructor will check for native oom. >> >> Otherwise this seems okay and better than using jni. > > @coleenp I kind of like the fact that I can get rid of a field in KTD that previously stored a handle. Perhaps that benefit justifies a direct allocation? > > @shipilev I don't understand your comment about loading the mirror. I'm not merely loading it, I'm registering it as a root. It is also all happening in a safepoint-free context. Then just have OopHandle handle = OopHandle(Universe::vm_global(), klass->klass_holder()); or similar. You don't have to store it unless you're going to release it. But footprint seems unimportant in this case. Also, maybe OopStorage::allocate() should just be friends with OopHandle. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26233#discussion_r2198233387 From kvn at openjdk.org Thu Jul 10 17:04:40 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 10 Jul 2025 17:04:40 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v6] In-Reply-To: <-rnlrm6PHRZeO1izbXh5nOrm368YKrsFft1u6SHXzWA=.9c8e6646-0b72-4705-895e-f795f74f3906@github.com> References: <-rnlrm6PHRZeO1izbXh5nOrm368YKrsFft1u6SHXzWA=.9c8e6646-0b72-4705-895e-f795f74f3906@github.com> Message-ID: On Thu, 10 Jul 2025 06:13:27 GMT, Marc Chevalier wrote: >> When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 >> >> This is enforced by restoring the old state, like in >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 >> >> That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. 
For instance, we can obtain something such as: >> >> ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) >> >> >> Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. >> >> Another situation is somewhat worse, when happening during parsing. It can lead to such cases: >> >> ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) >> >> The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? >> >> This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 >> >> And here there is the challenge: >> - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) >> ... > > Marc Chevalier has updated the pull request incrementally with two additional commits since the last revision: > > - Forgot to destruct_map_clone > - +'_' and ctor init Looks good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25936#pullrequestreview-3006571921 From iveresov at openjdk.org Thu Jul 10 17:05:40 2025 From: iveresov at openjdk.org (Igor Veresov) Date: Thu, 10 Jul 2025 17:05:40 GMT Subject: RFR: 8358580: Rethink how classes are kept alive in training data In-Reply-To: References: <5j7Yko4SxyZxJkDlO9itXMmbLK9W7Mz4b0IYQMplNKA=.3aace3f2-7233-4c5f-8b85-2f4d3b9458fa@github.com> Message-ID: On Thu, 10 Jul 2025 16:48:17 GMT, Coleen Phillimore wrote: >> @coleenp I kind of like the fact that I can get rid of a field in KTD that previously stored a handle. Perhaps that benefit justifies a direct allocation? >> >> @shipilev I don't understand your comment about loading the mirror. I'm not merely loading it, I'm registering it as a root. It is also all happening in a safepoint-free context. > > Then just have OopHandle handle = OopHandle(Universe::vm_global(), klass->klass_holder()); > > or similar. You don't have to store it unless you're going to release it. But footprint seems unimportant in this case. Also, maybe OopStorage::allocate() should just be friends with OopHandle. > > Edit: looks like it's also used by string deduplication but this shouldn't call OopStorage::allocate directly. Ah, sorry, I was somehow under the impression that OopHandle would have a destructor that release the handle. 
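As a side note to the exchange above, a tiny sketch of the OopHandle lifecycle being discussed (illustration only; `obj` is a placeholder and error handling is omitted):

    // OopHandle is a thin wrapper around an OopStorage slot. It has no destructor
    // that frees the slot; releasing is an explicit call.
    OopHandle h(Universe::vm_global(), obj);   // allocates a slot in vm_global storage
    oop o = h.resolve();                       // load the object through the handle
    h.release(Universe::vm_global());          // must be called explicitly to free the slot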
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26233#discussion_r2198257057 From kvn at openjdk.org Thu Jul 10 17:07:46 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 10 Jul 2025 17:07:46 GMT Subject: [jdk25] RFR: 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 17:55:31 GMT, Vladimir Kozlov wrote: > Hi all, > > This pull request contains a backport of commit [dedcce04](https://github.com/openjdk/jdk/commit/dedcce045013b3ff84f5ef8857e1a83f0c09f9ad) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Vladimir Kozlov on 8 Jul 2025 and was reviewed by Andrew Dinn and Matthias Baesken. > > Thanks! Thank you, Aleksey and Tobias, for reviews. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26223#issuecomment-3058254842 From kvn at openjdk.org Thu Jul 10 17:07:47 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 10 Jul 2025 17:07:47 GMT Subject: [jdk25] Integrated: 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 17:55:31 GMT, Vladimir Kozlov wrote: > Hi all, > > This pull request contains a backport of commit [dedcce04](https://github.com/openjdk/jdk/commit/dedcce045013b3ff84f5ef8857e1a83f0c09f9ad) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Vladimir Kozlov on 8 Jul 2025 and was reviewed by Andrew Dinn and Matthias Baesken. > > Thanks! This pull request has now been integrated. Changeset: e92f387a Author: Vladimir Kozlov URL: https://git.openjdk.org/jdk/commit/e92f387ab5db8245778c19a35f08079dfa46453c Stats: 7 lines in 2 files changed: 4 ins; 0 del; 3 mod 8360942: [ubsan] aotCache tests trigger runtime error: applying non-zero offset 16 to null pointer in CodeBlob::relocation_end() Reviewed-by: shade, thartmann Backport-of: dedcce045013b3ff84f5ef8857e1a83f0c09f9ad ------------- PR: https://git.openjdk.org/jdk/pull/26223 From iveresov at openjdk.org Thu Jul 10 17:10:22 2025 From: iveresov at openjdk.org (Igor Veresov) Date: Thu, 10 Jul 2025 17:10:22 GMT Subject: RFR: 8358580: Rethink how classes are kept alive in training data [v2] In-Reply-To: References: Message-ID: > Use OopStorage directly instead of JNI handles. Note that we never destroy TrainingData objects, so we don't need to concern ourselves with freeing the OopStorage entries. Also, keeping the klasses alive is only necessary during the training run. During the replay the klasses TD objects refer to are always alive. 
Igor Veresov has updated the pull request incrementally with one additional commit since the last revision: Address review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26233/files - new: https://git.openjdk.org/jdk/pull/26233/files/e262230b..80d33dab Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26233&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26233&range=00-01 Stats: 10 lines in 2 files changed: 0 ins; 6 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/26233.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26233/head:pull/26233 PR: https://git.openjdk.org/jdk/pull/26233 From coleenp at openjdk.org Thu Jul 10 17:14:38 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 10 Jul 2025 17:14:38 GMT Subject: RFR: 8358580: Rethink how classes are kept alive in training data [v2] In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 17:10:22 GMT, Igor Veresov wrote: >> Use OopStorage directly instead of JNI handles. Note that we never destroy TrainingData objects, so we don't need to concern ourselves with freeing the OopStorage entries. Also, keeping the klasses alive is only necessary during the training run. During the replay the klasses TD objects refer to are always alive. > > Igor Veresov has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments Looks good. thanks for the comment. ------------- Marked as reviewed by coleenp (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26233#pullrequestreview-3006599873 From cslucas at openjdk.org Thu Jul 10 17:21:40 2025 From: cslucas at openjdk.org (Cesar Soares Lucas) Date: Thu, 10 Jul 2025 17:21:40 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp [v3] In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 14:32:57 GMT, Guanqiang Han wrote: >> When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. >> >> This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this flag is disabled. > > Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - update regression test > - Merge remote-tracking branch 'upstream/master' into 8361140 > - update modification and add regression test > - Merge remote-tracking branch 'upstream/master' into 8361140 > - 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp > > When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. > > This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. 
Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this support is disabled. Marked as reviewed by cslucas (Committer). test/hotspot/jtreg/compiler/c2/TestReducePhiOnCmpWithNoOptPtrCompare.java line 47: > 45: @Run(test = {"testReducePhiOnCmp_C2"}) > 46: public void runner(RunInfo info) { > 47: invocations++; I don't think you need this variable anymroe. ------------- PR Review: https://git.openjdk.org/jdk/pull/26125#pullrequestreview-3006614681 PR Review Comment: https://git.openjdk.org/jdk/pull/26125#discussion_r2198280733 From bulasevich at openjdk.org Thu Jul 10 17:46:20 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 10 Jul 2025 17:46:20 GMT Subject: [jdk25] RFR: 8358183: [JVMCI] crash accessing nmethod::jvmci_name in CodeCache::aggregate Message-ID: <2-vD19USVejKcQe9AuVa1tq9RCGdspEJ1JVbt5BVI_4=.8e45bdfc-8142-4408-b960-6b95aa338e53@github.com> This is the backport of the JVMCI metadata crash fix. Issue: When flushing nmethods via CodeBlob::purge(), the JVMCI metadata was freed (mutable_data) but its size fields remained non-zero. As a result, invoking heap analytics via jcmd Compiler.CodeHeap_Analytics still walks the purged metadata and calls jvmci_name() on arbitrary memory, leading to intermittent crashes Fix: Extend CodeBlob::purge() to zero out the _mutable_data_size, _relocation_size, and _metadata_size fields so that after a purge jvmci_data_size() returns 0 and CompileBroker::print_heapinfo() skips any JVMCI metadata ------------- Commit messages: - Backport 74822ce12acaf9816aa49b75ab5817ced3710776 Changes: https://git.openjdk.org/jdk/pull/26248/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26248&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8358183 Stats: 3 lines in 2 files changed: 3 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26248.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26248/head:pull/26248 PR: https://git.openjdk.org/jdk/pull/26248 From dlunden at openjdk.org Thu Jul 10 18:25:43 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Thu, 10 Jul 2025 18:25:43 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: <375qzJdzdKWky3-EHgnSiksFYbJIPvfR27xzUTF6vRA=.9eb63509-1ec6-4cc6-bff0-782290866a4d@github.com> On Wed, 9 Jul 2025 01:23:43 GMT, Xiaohong Gong wrote: >> ### Background >> On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. >> >> For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. >> >> To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. >> >> ### Impact Analysis >> #### 1. Vector types >> Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. >> >> #### 2. 
Vector API
>> No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length.
>>
>> #### 3. Auto-vectorization
>> Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks.
>>
>> #### 4. Codegen of vector nodes
>> NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored.
>>
>> Details:
>> - Lanewise vector operations are unaffected as explained above.
>> - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE).
>> - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, addin...
>
> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision:
>
>   Disable auto-vectorization of double to short conversion for NEON and update tests

@XiaohongGong The code changes look sane, although, for the record, I'm not that familiar with this part of HotSpot. Testing also looks good, details below.

### Testing

- [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/16165935815)
- `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64.
- Performance testing on DaCapo, Renaissance, SPECjbb, and SPECjvm on Linux x64 and macOS aarch64. No observable improvements nor regressions.

------------- Marked as reviewed by dlunden (Committer). PR Review: https://git.openjdk.org/jdk/pull/26057#pullrequestreview-3006812696

From yadongwang at openjdk.org Thu Jul 10 18:33:51 2025 From: yadongwang at openjdk.org (Yadong Wang) Date: Thu, 10 Jul 2025 18:33:51 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding Message-ID:

The bug is that the predicate rule of immByteMapBase causes a ConP node for an oop to incorrectly match as byte_map_base when the placeholder JNI handle happens to be allocated at the address of byte_map_base.

C2 uses JNI handles as placeholders to encode constant oops, and one such handle may be located at the address of byte_map_base, which is not memory reserved by CardTable. This is possible because JNIHandleBlocks are allocated by malloc.

    // The assembler store_check code will do an unsigned shift of the oop,
    // then add it to _byte_map_base, i.e.
    //
    //   _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift)
    _byte_map = (CardValue*) rs.base();
    _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift);

In the aarch64 port, C2 will incorrectly match a ConP for an oop to the ConP for byte_map_base via the immByteMapBase operand.
    // Card Table Byte Map Base
    operand immByteMapBase()
    %{
      // Get base of card map
      predicate((jbyte*)n->get_ptr() ==
                ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base);
      match(ConP);

      op_cost(0);
      format %{ %}
      interface(CONST_INTER);
    %}

    // Load Byte Map Base Constant
    instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con)
    %{
      match(Set dst con);

      ins_cost(INSN_COST);
      format %{ "adr $dst, $con\t# Byte Map Base" %}

      ins_encode(aarch64_enc_mov_byte_map_base(dst, con));

      ins_pipe(ialu_imm);
    %}

As shown below, typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address:

    0xffff25caf08c: ldaxr x8, [x11]
    0xffff25caf090: cmp x10, x8
    0xffff25caf094: b.ne 0xffff25caf0a0 // b.any
    0xffff25caf098: stlxr w8, x28, [x11]
    0xffff25caf09c: cbnz w8, 0xffff25caf08c
    0xffff25caf0a0: orr x11, xzr, #0x3
    0xffff25caf0a4: str x11, [x13]
    0xffff25caf0a8: b.eq 0xffff25caef80 // b.none
    0xffff25caf0ac: str x14, [sp]
    0xffff25caf0b0: add x2, sp, #0x20
    0xffff25caf0b4: adrp x1, 0xffff21730000
    0xffff25caf0b8: bl 0xffff256fffc0
    0xffff25caf0bc: ldr x14, [sp]
    0xffff25caf0c0: b 0xffff25caef80
    0xffff25caf0c4: add x13, sp, #0x20
    0xffff25caf0c8: adrp x12, 0xffff21730000
    0xffff25caf0cc: ldr x10, [x13]
    0xffff25caf0d0: cmp x10, xzr
    0xffff25caf0d4: b.eq 0xffff25caf130 // b.none
    0xffff25caf0d8: ldr x11, [x12]
    0xffff25caf0dc: tbnz w10, #1, 0xffff25caf0f8
    0xffff25caf0e0: ldxr x11, [x12]
    0xffff25caf0e4: cmp x13, x11
    0xffff25caf0e8: b.ne 0xffff25caf130 // b.any
    0xffff25caf0ec: stlxr w11, x10, [x12]
    0xffff25caf0f0: cbz w11, 0xffff25caf130
    0xffff25caf0f4: b 0xffff25caf0e0

For details see https://mail.openjdk.org/pipermail/aarch64-port-dev/2025-July/016021.html.

-------------

Commit messages:
 - 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding

Changes: https://git.openjdk.org/jdk/pull/26249/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26249&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8361892
Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod
Patch: https://git.openjdk.org/jdk/pull/26249.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/26249/head:pull/26249
PR: https://git.openjdk.org/jdk/pull/26249
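Purely as an illustration of the kind of one-line guard the PR describes (this is my own sketch, not the actual change in aarch64.ad, and the class names follow the older code quoted above): the predicate could refuse the match whenever the ConP carries a constant oop, so an oop whose JNI-handle address happens to equal byte_map_base can never be encoded as the card table base.

    operand immByteMapBase()
    %{
      // Sketch only: never treat a constant oop as the card table base,
      // even if the addresses coincide.
      predicate(n->bottom_type()->isa_oopptr() == nullptr &&
                (jbyte*)n->get_ptr() ==
                ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base);
      match(ConP);
      // ... op_cost, format, interface unchanged ...
    %}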
> > // Card Table Byte Map Base > operand immByteMapBase() > %{ > // Get base of card map > predicate((jbyte*)n->get_ptr() == > ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); > match(ConP); > > op_cost(0); > format %{ %} > interface(CONST_INTER); > %} > > // Load Byte Map Base Constant > instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) > %{ > match(Set dst con); > > ins_cost(INSN_COST); > format %{ "adr $dst, $con\t# Byte Map Base" %} > > ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); > > ins_pipe(ialu_imm); > %} > > As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: > 0xffff25caf08c: ldaxr x8, [x11] > 0xffff25caf090: cmp x10, x8 > 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any > 0xffff25caf098: stlxr w8, x28, [x11] > 0xffff25caf09c: cbnz w8, 0xffff25caf08c > 0xffff25caf0a0: orr x11, xzr, #0x3 > 0xffff25caf0a4: str x11, [x13] > 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none > 0xffff25caf0ac: str x14, [sp] > 0xffff25caf0b0: add x2, sp, #0x20 > 0xffff25caf0b4: adrp x1, 0xffff21730000 > 0xffff25caf0b8: bl 0xffff256fffc0 > 0xffff25caf0bc: ldr x14, [sp] > 0xffff25caf0c0: b 0xffff25caef80 > 0xffff25caf0c4: add x13, sp, #0x20 > 0xffff25caf0c8: adrp x12, 0xffff21730000 > 0xffff25caf0cc: ldr x10, [x13] > 0xffff25caf0d0: cmp x10, xzr > 0xffff25caf0d4: b.eq 0xffff25caf130 // b.none > 0xffff25caf0d8: ldr x11, [x12] > 0xffff25caf0dc: tbnz w10, #1, 0xffff25caf0f... @theRealAph @adinn Could you help review it? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3058497814 From dlunden at openjdk.org Thu Jul 10 19:10:43 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Thu, 10 Jul 2025 19:10:43 GMT Subject: RFR: 8342941: IGV: Add new graph dumps for post loop, empty loop removal, and one iteration removal [v3] In-Reply-To: <_p5Jj77u1VyyW0eVneXqeNjmngTvSvFi94_FALv6swk=.d4e5aec1-dd73-48ed-8d7f-3080207be763@github.com> References: <_p5Jj77u1VyyW0eVneXqeNjmngTvSvFi94_FALv6swk=.d4e5aec1-dd73-48ed-8d7f-3080207be763@github.com> Message-ID: <-qvrPep0_75olkxXj9BT74oMIHTfxwgshrHnqQC9BuU=.501e3840-2b5d-4c7c-b2fe-891a167c66d8@github.com> On Mon, 7 Jul 2025 23:04:51 GMT, Saranya Natarajan wrote: >> This changeset adds BEFORE/AFTER graph dumps for creating a post loop (`insert_post_loop()`), removing an empty loop (`do_remove_empty_loop()`), and removing a one iteration loop (`do_one_iteration_loop()`). >> >> Changes: >> - Added `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` for dumping graphs before and after `insert_post_loop()`. >> - Added `BEFORE_REMOVE_EMPTY_LOOP` and `AFTER_REMOVE_EMPTY_LOOP` for dumping graphs before and after `do_remove_empty_loop()`. >> - Added `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` for dumping graphs before and after `do_one_iteration_loop()`. >> >> Below are sample screenshots (IGV print level 4 ) mainly showing the new phase . >> 1. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` >> ![image](https://github.com/user-attachments/assets/1661cede-5d70-4e0d-abec-3d091c7675c8) >> 2. 
`BEFORE_POST_LOOP` and `AFTER_POST_LOOP` with SuperWordLoopUnrollAnalysis enabled >> ![image](https://github.com/user-attachments/assets/6a22e6f0-4e6c-4e9d-8b6b-2bf75fac783d) >> 3.` BEFORE_REMOVE_EMPTY_LOOP `and `AFTER_REMOVE_EMPTY_LOOP` >> ![image](https://github.com/user-attachments/assets/3281f00b-575e-4604-83dd-831037d8dd47) >> 4. `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` >> ![image](https://github.com/user-attachments/assets/efddbc9a-64f7-403d-acfe-330d75a00911) >> >> Question to reviewers: >> Are the new compiler phases OK, or should we change anything? >> >> Testing: >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. >> Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) > > Saranya Natarajan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: > > - fix 2 of review > - Merge master > - Addressing review comments > - Initial Fix Looks good! Just one (very) minor comment. src/hotspot/share/opto/phasetype.hpp line 83: > 81: flags(AFTER_REMOVE_EMPTY_LOOP, "After Remove Empty Loop") \ > 82: flags(BEFORE_ONE_ITERATION_LOOP, "Before Replacing One Iteration Loop") \ > 83: flags(AFTER_ONE_ITERATION_LOOP, "After Replacing One Iteration Loop") \ Very much a nit, but I think this should be "One-Iteration Loop". Or, is it in fact one _iteration loop_ (as it reads now)? Looking at the code, I think it is the former. @chhagedorn can maybe clarify? This is not specific to your changeset, but also appears in existing source code comments. Maybe a good opportunity to clean this up everywhere? Also, maybe "Replacing" should be "Replace"? Seems to better fit the style used for other phase names. ------------- Marked as reviewed by dlunden (Committer). PR Review: https://git.openjdk.org/jdk/pull/25756#pullrequestreview-3006984528 PR Review Comment: https://git.openjdk.org/jdk/pull/25756#discussion_r2198498500 From aph at openjdk.org Thu Jul 10 21:45:47 2025 From: aph at openjdk.org (Andrew Haley) Date: Thu, 10 Jul 2025 21:45:47 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 18:30:18 GMT, Yadong Wang wrote: > @theRealAph @adinn Could you help review it? Can't you just delete the special handling of byte_map_base? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3059199408 From iveresov at openjdk.org Thu Jul 10 22:41:39 2025 From: iveresov at openjdk.org (Igor Veresov) Date: Thu, 10 Jul 2025 22:41:39 GMT Subject: RFR: 8358580: Rethink how classes are kept alive in training data [v2] In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 17:10:22 GMT, Igor Veresov wrote: >> Use OopStorage directly instead of JNI handles. Note that we never destroy TrainingData objects, so we don't need to concern ourselves with freeing the OopStorage entries. Also, keeping the klasses alive is only necessary during the training run. During the replay the klasses TD objects refer to are always alive. > > Igor Veresov has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments Testing is good. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26233#issuecomment-3059360956 From dlong at openjdk.org Thu Jul 10 22:45:42 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 10 Jul 2025 22:45:42 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v4] In-Reply-To: <30vvzTU6W2p0YpB8Z9bSfO9ajO_fHh79q9cX1G3gz3k=.521b26d7-b606-4fdc-bdcf-41fd6c4891cc@github.com> References: <30vvzTU6W2p0YpB8Z9bSfO9ajO_fHh79q9cX1G3gz3k=.521b26d7-b606-4fdc-bdcf-41fd6c4891cc@github.com> Message-ID: On Wed, 9 Jul 2025 10:07:31 GMT, Feilong Jiang wrote: >> Hi, please consider. >> [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. >> The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. >> If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. >> This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. >> We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. >> >> This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. >> The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. >> >> Test on linux-riscv64: >> - [x] Tier1-3 >> >> JMH data on P550 SBC for reference (w/o and w/ the patch): >> >> Before: >> >> Without COH: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op >> ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op >> ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op >> ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op >> ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op >> ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op >> ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op >> ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op >> ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op >> ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op >> ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op >> ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op >> >> ------------------------------------------------------------------------- >> With COH: >> >> Benchmark (size) Mode Cnt Score Error Un... > > Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains six additional commits since the last revision: > > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - Revert RISCV Macro modification > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses > - riscv: fix c1 primitive array clone intrinsic regression src/hotspot/cpu/riscv/c1_LIRGenerator_riscv.cpp line 775: > 773: arraycopy_helper(x, &flags, &expected_type); > 774: if (x->check_flag(Instruction::OmitChecksFlag)) { > 775: flags = (flags & LIR_OpArrayCopy::unaligned); Should be LIR_OpArrayCopy::unaligned|LIR_OpArrayCopy::overlapping? See below. src/hotspot/share/c1/c1_LIR.cpp line 353: > 351: , _expected_type(expected_type) > 352: , _flags(flags) { > 353: #if defined(X86) || defined(AARCH64) || defined(S390) || defined(RISCV64) || defined(PPC64) Do we still need this #if? It would be nice if we can eventually remove it, but I guess arm32 support is missing. src/hotspot/share/c1/c1_LIR.cpp line 354: > 352: , _flags(flags) { > 353: #if defined(X86) || defined(AARCH64) || defined(S390) || defined(RISCV64) || defined(PPC64) > 354: if (expected_type != nullptr && ((flags & ~LIR_OpArrayCopy::unaligned) == 0)) { I was concerned that this is platform-specific, but I checked and all platforms can handle unaligned or overlapping w/o using the stub. So maybe this should be using LIR_OpArrayCopy::unaligned|LIR_OpArrayCopy::overlapping? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2198916477 PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2198915263 PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2198911343 From duke at openjdk.org Fri Jul 11 00:37:10 2025 From: duke at openjdk.org (Guanqiang Han) Date: Fri, 11 Jul 2025 00:37:10 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp [v4] In-Reply-To: References: Message-ID: > When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. > > This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this flag is disabled. Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision: - Remove the unused variable - Merge remote-tracking branch 'upstream/master' into 8361140 - update regression test - Merge remote-tracking branch 'upstream/master' into 8361140 - update modification and add regression test - Merge remote-tracking branch 'upstream/master' into 8361140 - 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. 
This violates the intended usage of the flag and leads to unexpected crashes. This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this support is disabled. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26125/files - new: https://git.openjdk.org/jdk/pull/26125/files/fd6f90f5..0e9aa956 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26125&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26125&range=02-03 Stats: 377 lines in 16 files changed: 113 ins; 179 del; 85 mod Patch: https://git.openjdk.org/jdk/pull/26125.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26125/head:pull/26125 PR: https://git.openjdk.org/jdk/pull/26125 From duke at openjdk.org Fri Jul 11 00:40:39 2025 From: duke at openjdk.org (Guanqiang Han) Date: Fri, 11 Jul 2025 00:40:39 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp [v3] In-Reply-To: References: Message-ID: <1o9SmpZlm3T6YV6qDof8tAyDgDY1_f1vsgu-9IeC3jU=.4dfbea5d-a66b-4332-8edc-e7a6a0d6871b@github.com> On Thu, 10 Jul 2025 17:17:04 GMT, Cesar Soares Lucas wrote: >> Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: >> >> - update regression test >> - Merge remote-tracking branch 'upstream/master' into 8361140 >> - update modification and add regression test >> - Merge remote-tracking branch 'upstream/master' into 8361140 >> - 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp >> >> When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. >> >> This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this support is disabled. > > test/hotspot/jtreg/compiler/c2/TestReducePhiOnCmpWithNoOptPtrCompare.java line 47: > >> 45: @Run(test = {"testReducePhiOnCmp_C2"}) >> 46: public void runner(RunInfo info) { >> 47: invocations++; > > I don't think you need this variable anymroe. hi @JohnTortugo , Thanks for the feedback! I overlooked a small detail, I've fixed it now. Let me know if there's anything else. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26125#discussion_r2199100535 From xgong at openjdk.org Fri Jul 11 01:29:44 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 11 Jul 2025 01:29:44 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 05:53:44 GMT, Xiaohong Gong wrote: >> @mikabl-arm @XiaohongGong I'm a little busy these weeks before going on vacation, so I won't have time to look into this more deeply. >> >> However, I do plan to remove the auto-vectorization restrictions for simple reductions. 
>> https://bugs.openjdk.org/browse/JDK-8307516 >> >> You can already now disable the (bad) reduction heuristic, using `AutoVectorizationOverrideProfitability`. >> https://bugs.openjdk.org/browse/JDK-8357530 >> I published benchmark results there: >> https://github.com/openjdk/jdk/pull/25387 >> You can see that enabling simple reductions is in most cases actually profitable now. But float/double add and mul have strict reduction order, and that usually prevents vectorization from being profitable. The strict-order vector reduction is quite expensive, and it only becomes beneficial if there is a lot of other code in the loop that can be vectorized. Soon, I plan to add a cost-model, so that we can predict if vectorization is profitable. >> >> It would also be nice to actually find a benchmark where float add/mul reductions lead to a speedup with vectorization. So far I have not seen any example in my benchmarks: https://github.com/openjdk/jdk/pull/25387 If you find any such example, please let me know ;) >> >> I don't have access to any SVE machines, so I cannot help you there, unfortunately. >> >> Is this helpful to you? > >> You can see that enabling simple reductions is in most cases actually profitable now. But float/double add and mul have strict reduction order, and that usually prevents vectorization from being profitable. The strict-order vector reduction is quite expensive, and it only becomes beneficial if there is a lot of other code in the loop that can be vectorized. Soon, I plan to add a cost-model, so that we can predict if vectorization is profitable. >> >> It would also be nice to actually find a benchmark where float add/mul reductions lead to a speedup with vectorization. So far I have not seen any example in my benchmarks: https://github.com/openjdk/jdk/pull/25387 If you find any such example, please let me know ;) >> >> I don't have access to any SVE machines, so I cannot help you there, unfortunately. >> >>Is this helpful to you? > > Thanks for your input @eme64 ! It's really helpful to me. And it would be the right direction that using the cost model to guide whether vectorizing FP mul reduction is profitable or not. With this, I think the backend check of auto-vectorization for such operations can be removed safely. We can relay on the SLP's analysis. > > BTW, the current profitability heuristics can provide help on disabling auto-vectorization for the simple cases while enabling the complex ones. This is also helpful to us. > > I tested the performance of `VectorReduction2` with/without auto-vectorization for FP mul reductions on my SVE 128-bit machine. The performance difference is not very significant for both `floatMulSimple` and `floatMulBig`. But I guess the performance change would be different with auto-vectorization on HWs with larger vector size. As we do not have the SVE machines with larger vector size as well, we may need help from @mikabl-arm ! If the performance of `floatMulBig` is improved with auto-vectorization, I think we can remove the limitation of such reductions for auto-vectorization on AArch64. > @XiaohongGong , @shqking , @eme64 , > > Thank you all for the insightful and detailed comments! I really appreciate the effort to explore the performance implications of auto-vectorization cases. I agree it would be helpful if @fg1417 could join this discussion. However, before diving deeper, I?d like to clarify the problem statement as we see it. 
I've also updated the JBS ticket accordingly, and I?m citing the key part here for visibility: > > > To clarify, the goal of this ticket is to improve the performance of mul reduction VectorAPI operations on SVE-capable platforms with vector lengths greater than 128 bits (e.g., Neoverse V1). The core issue is that these APIs are not being lowered to any AArch64 implementation at all on such platforms. Instead, the fallback Java implementation is used. > > This PR does **not** target improvements in auto-vectorization. In the context of auto-vectorization, the scope of this PR is limited to maintaining correctness and avoiding regressions. > > @shqking , regarding the case-2 that you highlighted - I believe this change is incidental. Prior to the patch, `Matcher::match_rule_supported_auto_vectorization()` returned false for NEON platforms (as expected) and true for 128-bit SVE. This behavior is misleading because HotSpot currently uses the **same scalar mul reduction implementation** for both NEON and SVE platforms. Since this implementation is unprofitable on both, it should have been disabled across the board. @fg1417, please correct me if I?m mistaken. > > This PR cannot leave `Matcher::match_rule_supported_auto_vectorization()` unchanged. If we do, HotSpot will select the strictly-ordered FP vector reduction implementation, which is not performant. A more efficient SVE-based implementation can't be used due to the strict ordering requirement. > > @XiaohongGong , > > > But I guess the performance change would be different with auto-vectorization on HWs with larger vector size. As we do not have the SVE machines with larger vector size as well, we may need help from @mikabl-arm ! > > Here are performance numbers for Neoverse V1 with the auto-vectorization restriction in `Matcher::match_rule_supported_auto_vectorization()` lifted (`After`). The linear strictly-ordered SVE implementation matched this way was later removed by [4593a5d](https://github.com/openjdk/jdk/commit/4593a5d717024df01769625993c2b769d8dde311). > > ``` > | Benchmark | Before (ns/op) | After (ns/op) | Diff (%) | > |:-----------------------------------------------|-----------------:|----------------:|:-----------| > | VectorReduction.WithSuperword.mulRedD | 401.679 | 401.704 | ~ | > | VectorReduction2.WithSuperword.doubleMulBig | 2365.554 | 7294.706 | +208.37% | > | VectorReduction2.WithSuperword.doubleMulSimple | 2321.154 | 2321.207 | ~ | > | VectorReduction2.WithSuperword.floatMulBig | 2356.006 | 2648.334 | +12.41% | > | VectorReduction2.WithSuperword.floatMulSimple | 2321.018 | 2321.135 | ~ | > ``` > > Given that: > > * this PR focuses on VectorAPI and **not** on auto-vectorization, > * and it does **not** introduce regressions in auto-vectorization performance, > > I suggest: > > * continuing the discussion on auto-vectorization separately on hotspot-dev, including @fg1417 in the loop; > * moving forward with resolving the remaining VectorAPI issues and merging this PR. I'm fine with removing the strict-ordered rules and disable these operations for SLP since it does not benefit performance. Thanks for your testing and updating! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-3059883936 From yadongwang at openjdk.org Fri Jul 11 01:34:38 2025 From: yadongwang at openjdk.org (Yadong Wang) Date: Fri, 11 Jul 2025 01:34:38 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 21:43:31 GMT, Andrew Haley wrote: > > @theRealAph @adinn Could you help review it? > > Can't you just delete the special handling of byte_map_base? It's fine to just delete immByteMapBase, and then ConP for byte_map_base will match immP in aarch64_enc_mov_p, and using adrp+add if valid address and using mov if unvalid. enc_class aarch64_enc_mov_p(iRegP dst, immP src) %{ Register dst_reg = as_Register($dst$$reg); address con = (address)$src$$constant; if (con == nullptr || con == (address)1) { ShouldNotReachHere(); } else { relocInfo::relocType rtype = $src->constant_reloc(); if (rtype == relocInfo::oop_type) { __ movoop(dst_reg, (jobject)con); } else if (rtype == relocInfo::metadata_type) { __ mov_metadata(dst_reg, (Metadata*)con); } else { assert(rtype == relocInfo::none, "unexpected reloc type"); if (! __ is_valid_AArch64_address(con) || con < (address)(uintptr_t)os::vm_page_size()) { __ mov(dst_reg, con); } else { uint64_t offset; __ adrp(dst_reg, con, offset); __ add(dst_reg, dst_reg, offset); } } } %} I can choose to delete it directly, but do you think the current changes in pr are safe? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3059891654 From xgong at openjdk.org Fri Jul 11 01:42:44 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 11 Jul 2025 01:42:44 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: References: Message-ID: <6H9X-NXKOGd9BZVhTDiKNf7OO2KQTciRKGnXY-5C9yA=.e25f9e69-44c2-48d1-b4e3-cb8f1af79546@github.com> On Thu, 10 Jul 2025 13:53:25 GMT, Mikhail Ablakatov wrote: >> Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. >> >> Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still. >> >> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reducion micro-benchmarks. 
>> >> Benchmarks results: >> >> Neoverse-V1 (SVE 256-bit) >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms >> >> >> Fujitsu A64FX (SVE 512-bit): >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms > > Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: > > remove the strictly-ordered FP implementation as unused src/hotspot/share/opto/loopopts.cpp line 4715: > 4713: Node* last_accumulator = phi->in(2); > 4714: Node* post_loop_reduction = ReductionNode::make(sopc, nullptr, init, last_accumulator, bt, > 4715: /* requires_strict_order */ false); Why do you change this? Before it requires strict order, but now it is false. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2199260581 From xgong at openjdk.org Fri Jul 11 01:45:57 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 11 Jul 2025 01:45:57 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: <5__MowepmHboqoXdzv0AEbzawJobhlsMoAHjAmYrCno=.b42d0d72-54bd-4ccc-a29b-98877bbedcbb@github.com> On Thu, 10 Jul 2025 01:40:06 GMT, Xiaohong Gong wrote: >> Thanks for making the changes. Looks good to me. > >> Thanks for making the changes. Looks good to me. > > Thanks a lot for your review! > @XiaohongGong The code changes look sane, although, for the record, I'm not that familiar with this part of HotSpot. Testing also looks good, details below. > > ### Testing > * [GitHub Actions](https://github.com/dlunde/jdk/actions/runs/16165935815) > * `tier1` to `tier4` (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. > * Performance testing on DaCapo, Renaissance, SPECjbb, and SPECjvm on Linux x64 and macOS aarch64. No observable improvements nor regressions. Great! Thanks a lot for your testing and review~ ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3059921607 From xgong at openjdk.org Fri Jul 11 01:50:40 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 11 Jul 2025 01:50:40 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: References: <7VTRz_XqYBSdQ54n7ADzTzYADZjDbgBw6XuW0jehSLI=.24d18b87-4553-47ab-8065-d92fbb5700ae@github.com> Message-ID: On Fri, 4 Jul 2025 09:11:40 GMT, Andrew Haley wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: >> >> Refine the comment in ad file > > This looks good. Thanks. Hi @theRealAph , would you mind taking another look at the latest change? It needs an approval from a reviewer. Thanks a lot in advance! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3059929091 From dlong at openjdk.org Fri Jul 11 01:56:38 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 11 Jul 2025 01:56:38 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: <4cyCh_dwsH1VWmffOFcX-AYYvYu0kmrlb-OYzJYl9lw=.cd910298-6863-42fd-ba31-e9b6c60b7887@github.com> On Thu, 10 Jul 2025 18:29:14 GMT, Yadong Wang wrote: > The bug is that the predicate rule of immByteMapBase would cause a ConP Node for oop incorrect matching with byte_map_base when the placeholder jni handle address was just allocated to the address of byte_map_base. > > C2 uses JNI handles as placeholders to encoding constant oops, and one of some handle maybe locate at the address of byte_map_base, which is not memory reserved by CardTable. It's possible because JNIHandleBlocks are allocated by malloc. > > // The assembler store_check code will do an unsigned shift of the oop, > // then add it to _byte_map_base, i.e. > // > // _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift) > _byte_map = (CardValue*) rs.base(); > _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift); > > In aarch64 port, C2 will incorrectly match ConP for oop to ConP for byte_map_base by the immByteMapBase operand. > > // Card Table Byte Map Base > operand immByteMapBase() > %{ > // Get base of card map > predicate((jbyte*)n->get_ptr() == > ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); > match(ConP); > > op_cost(0); > format %{ %} > interface(CONST_INTER); > %} > > // Load Byte Map Base Constant > instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) > %{ > match(Set dst con); > > ins_cost(INSN_COST); > format %{ "adr $dst, $con\t# Byte Map Base" %} > > ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); > > ins_pipe(ialu_imm); > %} > > As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: > 0xffff25caf08c: ldaxr x8, [x11] > 0xffff25caf090: cmp x10, x8 > 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any > 0xffff25caf098: stlxr w8, x28, [x11] > 0xffff25caf09c: cbnz w8, 0xffff25caf08c > 0xffff25caf0a0: orr x11, xzr, #0x3 > 0xffff25caf0a4: str x11, [x13] > 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none > 0xffff25caf0ac: str x14, [sp] > 0xffff25caf0b0: add x2, sp, #0x20 > 0xffff25caf0b4: adrp x1, 0xffff21730000 > 0xffff25caf0b8: bl 0xffff256fffc0 > 0xffff25caf0bc: ldr x14, [sp] > 0xffff25caf0c0: b 0xffff25caef80 > 0xffff25caf0c4: add x13, sp, #0x20 > 0xffff25caf0c8: adrp x12, 0xffff21730000 > 0xffff25caf0cc: ldr x10, [x13] > 0xffff25caf0d0: cmp x10, xzr > 0xffff25caf0d4: b.eq 0xffff25caf130 // b.none > 0xffff25caf0d8: ldr x11, [x12] > 0xffff25caf0dc: tbnz w10, #1, 0xffff25caf0f... It looks like riscv.ad has the same problem. I think Andrew's suggestion is safe. We will emit the correct code based on relocType. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3059937033 From haosun at openjdk.org Fri Jul 11 02:04:42 2025 From: haosun at openjdk.org (Hao Sun) Date: Fri, 11 Jul 2025 02:04:42 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 13:53:25 GMT, Mikhail Ablakatov wrote: >> Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. >> >> Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still. >> >> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reducion micro-benchmarks. >> >> Benchmarks results: >> >> Neoverse-V1 (SVE 256-bit) >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms >> >> >> Fujitsu A64FX (SVE 512-bit): >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms > > Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: > > remove the strictly-ordered FP implementation as unused src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1995: > 1993: // Vector reduction multiply for integral type with ASIMD instructions. > 1994: // Note: temporary registers vtmp1 and vtmp2 are not used in some cases. > 1995: // Note: vsrc and vtmp2 may match. I left a comment in this "resolved comment thread" several days ago. See https://github.com/openjdk/jdk/pull/23181/files#r2179185158. It might be overlooked since the whole conversation was marked as resolved already. I personally think we should not allow `vsrc` and `vtmp2` to match. 
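As background for readers who are not following the whole PR: the Java-level operation whose AArch64 lowering is discussed in this thread is a Vector API mul reduction. The snippet below is only a usage sketch under the incubating jdk.incubator.vector API (run with --add-modules jdk.incubator.vector); it is not the MULLanes micro-benchmark itself, and the class and method names are made up for illustration.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Usage sketch only; names are illustrative, not taken from the benchmarks.
public class MulReductionSketch {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Multiplies all elements of a[] together; reduceLanes(MUL) is the operation
    // that ends up in the reduce_mul_* routines discussed above.
    static float mulAll(float[] a) {
        float acc = 1.0f;
        int i = 0;
        int bound = SPECIES.loopBound(a.length);
        for (; i < bound; i += SPECIES.length()) {
            FloatVector v = FloatVector.fromArray(SPECIES, a, i);
            acc *= v.reduceLanes(VectorOperators.MUL);
        }
        for (; i < a.length; i++) {  // scalar tail
            acc *= a[i];
        }
        return acc;
    }
}
```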
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2199290187 From haosun at openjdk.org Fri Jul 11 02:04:43 2025 From: haosun at openjdk.org (Hao Sun) Date: Fri, 11 Jul 2025 02:04:43 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: <6H9X-NXKOGd9BZVhTDiKNf7OO2KQTciRKGnXY-5C9yA=.e25f9e69-44c2-48d1-b4e3-cb8f1af79546@github.com> References: <6H9X-NXKOGd9BZVhTDiKNf7OO2KQTciRKGnXY-5C9yA=.e25f9e69-44c2-48d1-b4e3-cb8f1af79546@github.com> Message-ID: <_gHaFQTNq2bApeWAE88cWxcNULRDqndSSo3hrY31FgI=.132b7c24-7205-4877-9b95-3d9d13ac7ec8@github.com> On Fri, 11 Jul 2025 01:39:11 GMT, Xiaohong Gong wrote: >> Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: >> >> remove the strictly-ordered FP implementation as unused > > src/hotspot/share/opto/loopopts.cpp line 4715: > >> 4713: Node* last_accumulator = phi->in(2); >> 4714: Node* post_loop_reduction = ReductionNode::make(sopc, nullptr, init, last_accumulator, bt, >> 4715: /* requires_strict_order */ false); > > Why do you change this? Before it requires strict order, but now it is false. IIUC, it's a correction here. As noted by this function name `move_unordered_reduction_out_of_loop()` and the comment before this function, **unordered reduction** is expected to be generated. Hence, we should specify `/* requires_strict_order */ false` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2199274046 From haosun at openjdk.org Fri Jul 11 02:07:41 2025 From: haosun at openjdk.org (Hao Sun) Date: Fri, 11 Jul 2025 02:07:41 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v3] In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 05:23:38 GMT, Emanuel Peter wrote: >> This patch improves of mul reduction VectorAPIs on SVE targets with 256b or wider vectors. This comment also provides performance numbers for NEON / SVE 128b platforms that aren't expected to benefit from these implementations and for auto-vectorization benchmarks. >> >> ### Neoverse N1 (NEON) >> >>
>> >> Auto-vectorization >> >> | Benchmark | Before | After | Units | Diff | >> |---------------------------|----------|----------|-------|------| >> | mulRedD | 739.699 | 740.884 | ns/op | ~ | >> | byteAddBig | 2670.248 | 2670.562 | ns/op | ~ | >> | byteAddSimple | 1639.796 | 1639.940 | ns/op | ~ | >> | byteMulBig | 2707.900 | 2708.063 | ns/op | ~ | >> | byteMulSimple | 2452.939 | 2452.906 | ns/op | ~ | >> | charAddBig | 2772.363 | 2772.269 | ns/op | ~ | >> | charAddSimple | 1639.867 | 1639.751 | ns/op | ~ | >> | charMulBig | 2796.533 | 2796.375 | ns/op | ~ | >> | charMulSimple | 2453.034 | 2453.004 | ns/op | ~ | >> | doubleAddBig | 2943.613 | 2936.897 | ns/op | ~ | >> | doubleAddSimple | 1635.031 | 1634.797 | ns/op | ~ | >> | doubleMulBig | 3001.937 | 3003.240 | ns/op | ~ | >> | doubleMulSimple | 2448.154 | 2448.117 | ns/op | ~ | >> | floatAddBig | 2963.086 | 2962.215 | ns/op | ~ | >> | floatAddSimple | 1634.987 | 1634.798 | ns/op | ~ | >> | floatMulBig | 3022.442 | 3021.356 | ns/op | ~ | >> | floatMulSimple | 2447.976 | 2448.091 | ns/op | ~ | >> | intAddBig | 832.346 | 832.382 | ns/op | ~ | >> | intAddSimple | 841.276 | 841.287 | ns/op | ~ | >> | intMulBig | 1245.155 | 1245.095 | ns/op | ~ | >> | intMulSimple | 1638.762 | 1638.826 | ns/op | ~ | >> | longAddBig | 4924.541 | 4924.328 | ns/op | ~ | >> | longAddSimple | 841.623 | 841.625 | ns/op | ~ | >> | longMulBig | 9848.954 | 9848.807 | ns/op | ~ | >> | longMulSimple | 3427.169 | 3427.279 | ns/op | ~ | >> | shortAddBig | 2670.027 | 2670.345 | ns/op | ~ | >> | shortAddSimple | 1639.869 | 1639.876 | ns/op | ~ | >> | shortMulBig | 2750.812 | 2750.562 | ns/op | ~ | >> | shortMulSimple | 2453.030 | 2452.937 | ns/op | ~ | >> >>... > > @mikabl-arm @XiaohongGong I'm a little busy these weeks before going on vacation, so I won't have time to look into this more deeply. > > However, I do plan to remove the auto-vectorization restrictions for simple reductions. > https://bugs.openjdk.org/browse/JDK-8307516 > > You can already now disable the (bad) reduction heuristic, using `AutoVectorizationOverrideProfitability`. > https://bugs.openjdk.org/browse/JDK-8357530 > I published benchmark results there: > https://github.com/openjdk/jdk/pull/25387 > You can see that enabling simple reductions is in most cases actually profitable now. But float/double add and mul have strict reduction order, and that usually prevents vectorization from being profitable. The strict-order vector reduction is quite expensive, and it only becomes beneficial if there is a lot of other code in the loop that can be vectorized. Soon, I plan to add a cost-model, so that we can predict if vectorization is profitable. > > It would also be nice to actually find a benchmark where float add/mul reductions lead to a speedup with vectorization. So far I have not seen any example in my benchmarks: https://github.com/openjdk/jdk/pull/25387 If you find any such example, please let me know ;) > > I don't have access to any SVE machines, so I cannot help you there, unfortunately. > > Is this helpful to you? @eme64 Thanks for your input. It's very helpful to us. @fg1417 Thanks for your clarification on `case-2` as I mentioned earlier. @mikabl-arm Thanks for your providing the performance data on Neoverse-V1 machine. 
> Given that: > > * this PR focuses on VectorAPI and **not** on auto-vectorization, > * and it does **not** introduce regressions in auto-vectorization performance, > > I suggest: > > * continuing the discussion on auto-vectorization separately on hotspot-dev, including @fg1417 in the loop; > * moving forward with resolving the remaining VectorAPI issues and merging this PR. I agree with your suggestion. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-3059975539 From yadongwang at openjdk.org Fri Jul 11 02:25:38 2025 From: yadongwang at openjdk.org (Yadong Wang) Date: Fri, 11 Jul 2025 02:25:38 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: <4cyCh_dwsH1VWmffOFcX-AYYvYu0kmrlb-OYzJYl9lw=.cd910298-6863-42fd-ba31-e9b6c60b7887@github.com> References: <4cyCh_dwsH1VWmffOFcX-AYYvYu0kmrlb-OYzJYl9lw=.cd910298-6863-42fd-ba31-e9b6c60b7887@github.com> Message-ID: On Fri, 11 Jul 2025 01:54:00 GMT, Dean Long wrote: > It looks like riscv.ad has the same problem. I think Andrew's suggestion is safe. We will emit the correct code based on relocType. Yes, it maybe a better solution for jdk main line, because immPollPage was remove in https://bugs.openjdk.org/browse/JDK-8220051. But how about jdk8u backport? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3060031013 From fjiang at openjdk.org Fri Jul 11 02:30:36 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Fri, 11 Jul 2025 02:30:36 GMT Subject: RFR: 8361829: [TESTBUG] RISC-V: compiler/vectorization/runner/BasicIntOpTest.java fails with RVV but not Zvbb In-Reply-To: References: Message-ID: <6TiCRs3q6fScF7OgINGrqs4T_fpvleuVZ76EJz9hYGQ=.13d9810c-0941-454e-888d-06ea10075ce7@github.com> On Thu, 10 Jul 2025 08:31:53 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > After JDK-8355293 , compiler/vectorization/runner/BasicIntOpTest.java failswith RVV but not Zvbb. > The reason for the error is that `PopCountVI` on RISC-V requires `Zvbb`, not just `RVV`. > > ### Test > - [x] Run compiler/vectorization/runner/BasicIntOpTest.java on k1 > - [x] Run compiler/vectorization/runner/BasicIntOpTest.java on qemu-system (enable RVV) w/ and w/o zvbb Marked as reviewed by fjiang (Committer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26238#pullrequestreview-3008247649 From haosun at openjdk.org Fri Jul 11 02:31:43 2025 From: haosun at openjdk.org (Hao Sun) Date: Fri, 11 Jul 2025 02:31:43 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v13] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 11:40:29 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. 
>> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Change match rule names to lowercase src/hotspot/cpu/aarch64/aarch64_vector.ad line 7189: > 7187: effect(TEMP_DEF dst, TEMP tmp); > 7188: match(Set dst (SelectFromTwoVector (Binary index src1) src2)); > 7189: format %{ "vselect_from_two_vectors_Neon_10_11 $dst, $src1, $src2, $index\t# vector (8B/16B/4S/8S/2I/4I/2F/4F). KILL $tmp" %} nit: here and several other sites. We also need use lower cases in the `format` clause. Suggestion: format %{ "vselect_from_two_vectors_neon_10_11 $dst, $src1, $src2, $index\t# vector (8B/16B/4S/8S/2I/4I/2F/4F). KILL $tmp" %} ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2199340543 From dzhang at openjdk.org Fri Jul 11 02:36:44 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Fri, 11 Jul 2025 02:36:44 GMT Subject: RFR: 8361829: [TESTBUG] RISC-V: compiler/vectorization/runner/BasicIntOpTest.java fails with RVV but not Zvbb In-Reply-To: References: Message-ID: <4CYPB69Qkgv9HGGCHVUKYq1EmU_ue5vN4ZugZfEVRic=.2b192ddc-c47f-442f-91c8-88fb7e804840@github.com> On Thu, 10 Jul 2025 08:31:53 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > After [JDK-8355293](https://bugs.openjdk.org/browse/JDK-8355293) , compiler/vectorization/runner/BasicIntOpTest.java fails with RVV but not Zvbb. > The reason for the error is that `PopCountVI` on RISC-V requires `Zvbb`, not just `RVV`. > > ### Test > - [x] Run compiler/vectorization/runner/BasicIntOpTest.java on k1 > - [x] Run compiler/vectorization/runner/BasicIntOpTest.java on qemu-system (enable RVV) w/ and w/o zvbb Thanks all for the review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26238#issuecomment-3060087390 From duke at openjdk.org Fri Jul 11 02:36:45 2025 From: duke at openjdk.org (duke) Date: Fri, 11 Jul 2025 02:36:45 GMT Subject: RFR: 8361829: [TESTBUG] RISC-V: compiler/vectorization/runner/BasicIntOpTest.java fails with RVV but not Zvbb In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 08:31:53 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! 
> > After [JDK-8355293](https://bugs.openjdk.org/browse/JDK-8355293) , compiler/vectorization/runner/BasicIntOpTest.java fails with RVV but not Zvbb. > The reason for the error is that `PopCountVI` on RISC-V requires `Zvbb`, not just `RVV`. > > ### Test > - [x] Run compiler/vectorization/runner/BasicIntOpTest.java on k1 > - [x] Run compiler/vectorization/runner/BasicIntOpTest.java on qemu-system (enable RVV) w/ and w/o zvbb @DingliZhang Your change (at version f33d319a7eea937a2f2baa1aef7a4b072c93eb2a) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26238#issuecomment-3060091029 From dzhang at openjdk.org Fri Jul 11 02:43:42 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Fri, 11 Jul 2025 02:43:42 GMT Subject: Integrated: 8361829: [TESTBUG] RISC-V: compiler/vectorization/runner/BasicIntOpTest.java fails with RVV but not Zvbb In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 08:31:53 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > After [JDK-8355293](https://bugs.openjdk.org/browse/JDK-8355293) , compiler/vectorization/runner/BasicIntOpTest.java fails with RVV but not Zvbb. > The reason for the error is that `PopCountVI` on RISC-V requires `Zvbb`, not just `RVV`. > > ### Test > - [x] Run compiler/vectorization/runner/BasicIntOpTest.java on k1 > - [x] Run compiler/vectorization/runner/BasicIntOpTest.java on qemu-system (enable RVV) w/ and w/o zvbb This pull request has now been integrated. Changeset: 2e7e272d Author: Dingli Zhang Committer: Feilong Jiang URL: https://git.openjdk.org/jdk/commit/2e7e272d7b5273bae8684095bcda2a9c8bd21dc8 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8361829: [TESTBUG] RISC-V: compiler/vectorization/runner/BasicIntOpTest.java fails with RVV but not Zvbb Reviewed-by: fyang, fjiang ------------- PR: https://git.openjdk.org/jdk/pull/26238 From dlong at openjdk.org Fri Jul 11 03:41:37 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 11 Jul 2025 03:41:37 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 18:29:14 GMT, Yadong Wang wrote: > The bug is that the predicate rule of immByteMapBase would cause a ConP Node for oop incorrect matching with byte_map_base when the placeholder jni handle address was just allocated to the address of byte_map_base. > > C2 uses JNI handles as placeholders to encoding constant oops, and one of some handle maybe locate at the address of byte_map_base, which is not memory reserved by CardTable. It's possible because JNIHandleBlocks are allocated by malloc. > > // The assembler store_check code will do an unsigned shift of the oop, > // then add it to _byte_map_base, i.e. > // > // _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift) > _byte_map = (CardValue*) rs.base(); > _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift); > > In aarch64 port, C2 will incorrectly match ConP for oop to ConP for byte_map_base by the immByteMapBase operand. 
> > // Card Table Byte Map Base > operand immByteMapBase() > %{ > // Get base of card map > predicate((jbyte*)n->get_ptr() == > ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); > match(ConP); > > op_cost(0); > format %{ %} > interface(CONST_INTER); > %} > > // Load Byte Map Base Constant > instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) > %{ > match(Set dst con); > > ins_cost(INSN_COST); > format %{ "adr $dst, $con\t# Byte Map Base" %} > > ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); > > ins_pipe(ialu_imm); > %} > > As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: > 0xffff25caf08c: ldaxr x8, [x11] > 0xffff25caf090: cmp x10, x8 > 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any > 0xffff25caf098: stlxr w8, x28, [x11] > 0xffff25caf09c: cbnz w8, 0xffff25caf08c > 0xffff25caf0a0: orr x11, xzr, #0x3 > 0xffff25caf0a4: str x11, [x13] > 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none > 0xffff25caf0ac: str x14, [sp] > 0xffff25caf0b0: add x2, sp, #0x20 > 0xffff25caf0b4: adrp x1, 0xffff21730000 > 0xffff25caf0b8: bl 0xffff256fffc0 > 0xffff25caf0bc: ldr x14, [sp] > 0xffff25caf0c0: b 0xffff25caef80 > 0xffff25caf0c4: add x13, sp, #0x20 > 0xffff25caf0c8: adrp x12, 0xffff21730000 > 0xffff25caf0cc: ldr x10, [x13] > 0xffff25caf0d0: cmp x10, xzr > 0xffff25caf0d4: b.eq 0xffff25caf130 // b.none > 0xffff25caf0d8: ldr x11, [x12] > 0xffff25caf0dc: tbnz w10, #1, 0xffff25caf0f... Let me see if I can unpack your question. Yes jdk8u still uses immPollPage, which is similar to immByteMapBase, but I don't think it has the same problem, because I don't see how an oop could have the same address as the polling page. If we fix riscv.ad now, there won't be a clean backport to jdk8u because the riscv port doesn't exist. For aarch64, either of the two proposed fixes could be backported to jdk8u, correct? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3060262075 From yadongwang at openjdk.org Fri Jul 11 04:53:37 2025 From: yadongwang at openjdk.org (Yadong Wang) Date: Fri, 11 Jul 2025 04:53:37 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 03:39:05 GMT, Dean Long wrote: > Let me see if I can unpack your question. Yes jdk8u still uses immPollPage, which is similar to immByteMapBase, but I don't think it has the same problem, because I don't see how an oop could have the same address as the polling page. If we fix riscv.ad now, there won't be a clean backport to jdk8u because the riscv port doesn't exist. For aarch64, either of the two proposed fixes could be backported to jdk8u, correct? Yes, but byte_map_base can be same address as the polling page in jdk8u, in both 2 proposed fixes. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3060506873 From haosun at openjdk.org Fri Jul 11 05:02:40 2025 From: haosun at openjdk.org (Hao Sun) Date: Fri, 11 Jul 2025 05:02:40 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 18:29:14 GMT, Yadong Wang wrote: > The bug is that the predicate rule of immByteMapBase would cause a ConP Node for oop incorrect matching with byte_map_base when the placeholder jni handle address was just allocated to the address of byte_map_base. > > C2 uses JNI handles as placeholders to encoding constant oops, and one of some handle maybe locate at the address of byte_map_base, which is not memory reserved by CardTable. It's possible because JNIHandleBlocks are allocated by malloc. > > // The assembler store_check code will do an unsigned shift of the oop, > // then add it to _byte_map_base, i.e. > // > // _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift) > _byte_map = (CardValue*) rs.base(); > _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift); > > In aarch64 port, C2 will incorrectly match ConP for oop to ConP for byte_map_base by the immByteMapBase operand. > > // Card Table Byte Map Base > operand immByteMapBase() > %{ > // Get base of card map > predicate((jbyte*)n->get_ptr() == > ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); > match(ConP); > > op_cost(0); > format %{ %} > interface(CONST_INTER); > %} > > // Load Byte Map Base Constant > instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) > %{ > match(Set dst con); > > ins_cost(INSN_COST); > format %{ "adr $dst, $con\t# Byte Map Base" %} > > ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); > > ins_pipe(ialu_imm); > %} > > As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: > 0xffff25caf08c: ldaxr x8, [x11] > 0xffff25caf090: cmp x10, x8 > 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any > 0xffff25caf098: stlxr w8, x28, [x11] > 0xffff25caf09c: cbnz w8, 0xffff25caf08c > 0xffff25caf0a0: orr x11, xzr, #0x3 > 0xffff25caf0a4: str x11, [x13] > 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none > 0xffff25caf0ac: str x14, [sp] > 0xffff25caf0b0: add x2, sp, #0x20 > 0xffff25caf0b4: adrp x1, 0xffff21730000 > 0xffff25caf0b8: bl 0xffff256fffc0 > 0xffff25caf0bc: ldr x14, [sp] > 0xffff25caf0c0: b 0xffff25caef80 > 0xffff25caf0c4: add x13, sp, #0x20 > 0xffff25caf0c8: adrp x12, 0xffff21730000 > 0xffff25caf0cc: ldr x10, [x13] > 0xffff25caf0d0: cmp x10, xzr > 0xffff25caf0d4: b.eq 0xffff25caf130 // b.none > 0xffff25caf0d8: ldr x11, [x12] > 0xffff25caf0dc: tbnz w10, #1, 0xffff25caf0f... Hi. Not a code review. Our internal CI reported a jtreg failure `test/hotspot/jtreg/sources/TestNoNULL.java` with this PR on Nvidia Grace machine. - How to reproduce the failure: build one fastdebug JDK and run `make test TEST="test/hotspot/jtreg/sources/TestNoNULL.java"` - The snippet of the error log STDERR: Error: 'NULL' found in /tmp/src/hotspot/cpu/aarch64/aarch64.ad at line 4563: n->get_ptr_type()->isa_rawptr() != NULL && java.lang.RuntimeException: Test found 1 usages of 'NULL' in source files. See errors above. 
at TestNoNULL.main(TestNoNULL.java:84) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) at java.base/java.lang.reflect.Method.invoke(Method.java:565) at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) at java.base/java.lang.Thread.run(Thread.java:1474) JavaTest Message: Test threw exception: java.lang.RuntimeException JavaTest Message: shutting down test TEST RESULT: Failed. Execution failed: `main' threw exception: java.lang.RuntimeException: Test found 1 usages of 'NULL' in source files. See errors above. It seems that the newly added assertion by this PR is hit in this jtreg case. Could you help take a look at this issue? Thanks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3060528852 From mablakatov at openjdk.org Fri Jul 11 05:50:53 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Fri, 11 Jul 2025 05:50:53 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: <_gHaFQTNq2bApeWAE88cWxcNULRDqndSSo3hrY31FgI=.132b7c24-7205-4877-9b95-3d9d13ac7ec8@github.com> References: <6H9X-NXKOGd9BZVhTDiKNf7OO2KQTciRKGnXY-5C9yA=.e25f9e69-44c2-48d1-b4e3-cb8f1af79546@github.com> <_gHaFQTNq2bApeWAE88cWxcNULRDqndSSo3hrY31FgI=.132b7c24-7205-4877-9b95-3d9d13ac7ec8@github.com> Message-ID: On Fri, 11 Jul 2025 01:53:48 GMT, Hao Sun wrote: >> src/hotspot/share/opto/loopopts.cpp line 4715: >> >>> 4713: Node* last_accumulator = phi->in(2); >>> 4714: Node* post_loop_reduction = ReductionNode::make(sopc, nullptr, init, last_accumulator, bt, >>> 4715: /* requires_strict_order */ false); >> >> Why do you change this? Before it requires strict order, but now it is false. > > IIUC, it's a correction here. > > As noted by this function name `move_unordered_reduction_out_of_loop()` and the comment before this function, **unordered reduction** is expected to be generated. Hence, we should specify `/* requires_strict_order */ false` Precisely that, @shqking , thank you. I found this while evaluating the effect the patch has on auto-vectorization. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2199671921 From thartmann at openjdk.org Fri Jul 11 06:12:39 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 11 Jul 2025 06:12:39 GMT Subject: [jdk25] RFR: 8358183: [JVMCI] crash accessing nmethod::jvmci_name in CodeCache::aggregate In-Reply-To: <2-vD19USVejKcQe9AuVa1tq9RCGdspEJ1JVbt5BVI_4=.8e45bdfc-8142-4408-b960-6b95aa338e53@github.com> References: <2-vD19USVejKcQe9AuVa1tq9RCGdspEJ1JVbt5BVI_4=.8e45bdfc-8142-4408-b960-6b95aa338e53@github.com> Message-ID: <2HGFSgxusQvt-rWKn6X-vrLEuabgg25mxfUc4Lh84h4=.80949534-2eec-4906-a894-08689990c138@github.com> On Thu, 10 Jul 2025 17:40:20 GMT, Boris Ulasevich wrote: > This is the backport of the JVMCI metadata crash fix. > > Issue: > When flushing nmethods via CodeBlob::purge(), the JVMCI metadata was freed (mutable_data) but its size fields remained non-zero. As a result, invoking heap analytics via jcmd Compiler.CodeHeap_Analytics still walks the purged metadata and calls jvmci_name() on arbitrary memory, leading to intermittent crashes > > Fix: > Extend CodeBlob::purge() to zero out the _mutable_data_size, _relocation_size, and _metadata_size fields so that after a purge jvmci_data_size() returns 0 and CompileBroker::print_heapinfo() skips any JVMCI metadata Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26248#pullrequestreview-3008753772 From xgong at openjdk.org Fri Jul 11 06:19:49 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 11 Jul 2025 06:19:49 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: References: <6H9X-NXKOGd9BZVhTDiKNf7OO2KQTciRKGnXY-5C9yA=.e25f9e69-44c2-48d1-b4e3-cb8f1af79546@github.com> <_gHaFQTNq2bApeWAE88cWxcNULRDqndSSo3hrY31FgI=.132b7c24-7205-4877-9b95-3d9d13ac7ec8@github.com> Message-ID: On Fri, 11 Jul 2025 05:48:03 GMT, Mikhail Ablakatov wrote: >> IIUC, it's a correction here. >> >> As noted by this function name `move_unordered_reduction_out_of_loop()` and the comment before this function, **unordered reduction** is expected to be generated. Hence, we should specify `/* requires_strict_order */ false` > > Precisely that, @shqking , thank you. I found this while evaluating the effect the patch has on auto-vectorization. I see. Thanks! So what is the type of bt? Is it an integer type of floating-point one? If it's an integer type, I think changing or not does not have difference. But if it is floating-point type, we do not support the non-strict-ordered anyway and they are not enabled in SLP on AArch64. I'm just curious whether this change has any relationship with this PR. If not, I suggest not touching it now. Seems there is the same change in this PR https://github.com/openjdk/jdk/pull/23181. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2199721512 From thartmann at openjdk.org Fri Jul 11 06:21:41 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 11 Jul 2025 06:21:41 GMT Subject: RFR: 8350177: C2 SuperWord: Integer.numberOfLeadingZeros, numberOfTrailingZeros, reverse and bitCount have input types wrongly turncated for byte and short [v9] In-Reply-To: References: Message-ID: <2_r-6BS4ZWqGUWmywg2ZfSkv9k-ZDehixrwoyQY3vrs=.fa426c0a-bbb6-4030-b649-4012381599a1@github.com> On Mon, 30 Jun 2025 12:59:24 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This patch fixes cases in SuperWord when compiling subword types where vectorized code would be given a narrower type than expected, leading to miscompilation due to truncation. This fix is a generalization of the same fix applied for `Integer.reverseBytes` in [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). The patch introduces a check for nodes that are known to tolerate truncation, so that any future cases of subword truncation will avoid creating miscompiled code. >> >> The patch reuses the existing logic to set the type of the vectors to int, which currently disables vectorization for the affected patterns entirely. Once [JDK-8342095](https://bugs.openjdk.org/browse/JDK-8342095) is merged and automatic casting support is added the autovectorizer should automatically insert casts to and from int, maintaining correctness. >> >> I've added an IR test that checks for correctly compiled outputs. Thoughts and reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Explicit nullptr checks I had a look at this since Emanuel is busy and this would need to be integrated until RDP 2 on Thursday next week. I'm not an expert in this code but the fix looks good to me. There's a little typo in the title `turncated` -> `truncated`. I fixed it in JBS, please update the PR title. I've submitted some testing again and will report back once it passed. 
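To make the truncation issue concrete for readers of the archive, the affected shape is roughly the following (an illustrative sketch, not one of the IR tests added by the patch). Because both the load and the store are subword, SuperWord picks short lanes, and before the fix the int input of the bit-count style intrinsic could be narrowed back to 16 bits, changing the result.

```java
// Illustrative sketch of the affected pattern; the patch's IR tests may differ.
static void bitCountShorts(short[] in, short[] out) {
    for (int i = 0; i < in.length; i++) {
        // in[i] is sign-extended to int before Integer.bitCount. For in[i] == -1
        // the correct result is 32; computing bitCount on a truncated 16-bit lane
        // would give 16 instead.
        out[i] = (short) Integer.bitCount(in[i]);
    }
}
```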
------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25440#pullrequestreview-3008778719 From mablakatov at openjdk.org Fri Jul 11 06:22:44 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Fri, 11 Jul 2025 06:22:44 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 02:01:06 GMT, Hao Sun wrote: >> Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: >> >> remove the strictly-ordered FP implementation as unused > > src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1995: > >> 1993: // Vector reduction multiply for integral type with ASIMD instructions. >> 1994: // Note: temporary registers vtmp1 and vtmp2 are not used in some cases. >> 1995: // Note: vsrc and vtmp2 may match. > > I left a comment in this "resolved comment thread" several days ago. See https://github.com/openjdk/jdk/pull/23181/files#r2179185158. It might be overlooked since the whole conversation was marked as resolved already. > > I personally think we should not allow `vsrc` and `vtmp2` to match. Apologies for overlooking the comment. A suggestion that started the thread was marked as a nit so I felt okay about resolving it myself at the time. If `vsrc` and `vtmp2` match it implies that `vsrc` is allowed to be modified. This is used so that `reduce_mul_integral_le128b` may be invoked either independently or to process a tail after `reduce_mul_integral_gt128b` here: https://github.com/openjdk/jdk/pull/23181/files#diff-75bfb44278df267ce4978393b9b6b6030a7e23065ca15436fb1a5009debc6e81R2106 . In the latter case a temporary register holding intermediate result value is passed to both `vsrc` and `vtmp2` parameters. I can see it being somewhat confusing though. I could add an explicit boolean flag parameter, e.g. `is_tail_processing`, and do assertion checks based on its value. And extend the function comment with the described above. I'm happy to consider other suggestions as well if any :) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2199735442 From chagedorn at openjdk.org Fri Jul 11 06:34:38 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 11 Jul 2025 06:34:38 GMT Subject: RFR: 8361494: [IR Framework] Escape too much in replacement of placeholder In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 12:54:55 GMT, Marc Chevalier wrote: > In `RawIRNode::regex`, the call to `String::replaceAll` doesn't quote the replace string. > > Meaning that in the IR rule > > @IR(failOn = {IRNode.ALLOC_OF, "\\w+"}) > > The interpreted `\w` is interpreted as a group reference, and we get > > java.lang.IllegalArgumentException: Illegal group reference > > so we should write instead > > @IR(failOn = {IRNode.ALLOC_OF, "\\\\w+"}) > > To mean the interpreted string `\\w`, to mean an escaped single backslash. Same goes with `$` (used for nested classes). > > Since we don't want to refer to groups (and anyway, there are not in `IRNode.IS_REPLACED`), we just quote the replacement string with `java.util.regex.Matcher.quoteReplacement` to make it more usable. > > Note that you would still need to write `\$` since the `$` is the end of string regex, and needs to be escaped at the regex level (and not at the string, so it's not `$`, since `$` is not a special character). Before the fix, it should be `\\\$`. Phew! Regexes are bad enough, let's not escape them manually twice! 
> > In `test/hotspot/jtreg/compiler/c2/TestMergeStores.java`, that makes us save 1344 backslashes. > > Thanks, > Marc Good improvement! Thanks for fixing it. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26243#pullrequestreview-3008817657 From chagedorn at openjdk.org Fri Jul 11 06:34:39 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 11 Jul 2025 06:34:39 GMT Subject: RFR: 8361494: [IR Framework] Escape too much in replacement of placeholder In-Reply-To: <3BYjYX_c2LSt2wepKGpXhRC85bJfz2wAmJhCMY2CMb0=.ed132047-cf0a-40a6-8a3e-69973d0bbaf1@github.com> References: <3BYjYX_c2LSt2wepKGpXhRC85bJfz2wAmJhCMY2CMb0=.ed132047-cf0a-40a6-8a3e-69973d0bbaf1@github.com> Message-ID: On Thu, 10 Jul 2025 14:48:10 GMT, Manuel H?ssig wrote: >> In `RawIRNode::regex`, the call to `String::replaceAll` doesn't quote the replace string. >> >> Meaning that in the IR rule >> >> @IR(failOn = {IRNode.ALLOC_OF, "\\w+"}) >> >> The interpreted `\w` is interpreted as a group reference, and we get >> >> java.lang.IllegalArgumentException: Illegal group reference >> >> so we should write instead >> >> @IR(failOn = {IRNode.ALLOC_OF, "\\\\w+"}) >> >> To mean the interpreted string `\\w`, to mean an escaped single backslash. Same goes with `$` (used for nested classes). >> >> Since we don't want to refer to groups (and anyway, there are not in `IRNode.IS_REPLACED`), we just quote the replacement string with `java.util.regex.Matcher.quoteReplacement` to make it more usable. >> >> Note that you would still need to write `\$` since the `$` is the end of string regex, and needs to be escaped at the regex level (and not at the string, so it's not `$`, since `$` is not a special character). Before the fix, it should be `\\\$`. Phew! Regexes are bad enough, let's not escape them manually twice! >> >> In `test/hotspot/jtreg/compiler/c2/TestMergeStores.java`, that makes us save 1344 backslashes. >> >> Thanks, >> Marc > > test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/irrule/checkattribute/parsing/RawIRNode.java line 65: > >> 63: nodeRegex = regexForVectorIRNode(nodeRegex, vmInfo, bound); >> 64: } else if (userPostfix.isValid()) { >> 65: nodeRegex = nodeRegex.replaceAll(IRNode.IS_REPLACED, java.util.regex.Matcher.quoteReplacement(userPostfix.value())); > > Perhaps you might want to `import java.util.regex.Matcher` to make it a bit more concise? Yes, please import it instead. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26243#discussion_r2199760949 From yadongwang at openjdk.org Fri Jul 11 06:52:38 2025 From: yadongwang at openjdk.org (Yadong Wang) Date: Fri, 11 Jul 2025 06:52:38 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: <3J94H9P-F9-a8B96eqgDvCxYKywOB_7iSr_Ra4cwSk0=.f37e2ac6-f89b-4448-9acc-e0622712f786@github.com> On Fri, 11 Jul 2025 05:00:25 GMT, Hao Sun wrote: > Hi. Not a code review. Our internal CI reported a jtreg failure `test/hotspot/jtreg/sources/TestNoNULL.java` with this PR on Nvidia Grace machine. > > * How to reproduce the failure: build one fastdebug JDK and run `make test TEST="test/hotspot/jtreg/sources/TestNoNULL.java"` > * The snippet of the error log > > ``` > STDERR: > Error: 'NULL' found in /tmp/src/hotspot/cpu/aarch64/aarch64.ad at line 4563: > n->get_ptr_type()->isa_rawptr() != NULL && > java.lang.RuntimeException: Test found 1 usages of 'NULL' in source files. See errors above. 
> at TestNoNULL.main(TestNoNULL.java:84) > at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) > at java.base/java.lang.reflect.Method.invoke(Method.java:565) > at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335) > at java.base/java.lang.Thread.run(Thread.java:1474) > > JavaTest Message: Test threw exception: java.lang.RuntimeException > JavaTest Message: shutting down test > > > TEST RESULT: Failed. Execution failed: `main' threw exception: java.lang.RuntimeException: Test found 1 usages of 'NULL' in source files. See errors above. > ``` > > It seems that the newly added assertion by this PR is hit in this jtreg case. Could you help take a look at this issue? Thanks. Sure, I will follow Andrew's suggestion, removing immByteMapBase. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3060852029 From jbhateja at openjdk.org Fri Jul 11 06:57:38 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 11 Jul 2025 06:57:38 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: <7QVWVj5vpSB42THa2rx-oxMqhH76qMZ5MBJjindRiLo=.b825076a-aa9c-4b86-94b6-0a593f2240ac@github.com> References: <_RERljqu_FG7ZyneAk7Thd-9TwED18pQpEBz_i105fY=.b8948a23-273a-49f6-b9cb-6b611a5eedc6@github.com> <6SXA9ZrXBDhZLyXP3lXbkpl4dl3iocvDpzPrUpIQOl8=.9b025be2-848b-4b78-a5e4-929cb7e9f798@github.com> <7QVWVj5vpSB42THa2rx-oxMqhH76qMZ5MBJjindRiLo=.b825076a-aa9c-4b86-94b6-0a593f2240ac@github.com> Message-ID: On Thu, 10 Jul 2025 08:06:18 GMT, erifan wrote: > OK. But in order to cover various cases, the implementation may be a bit troublesome. The solution I thought of is to **check whether the architecture supports VectorLongToMask, MaskAll and Replicate in `LibraryCallKit::inline_vector_frombits_coerced`. If it does, generate VectorLongToMask, and then convert it to MaskAll or Replicate in IGVN**. This is similar to the current implementation of vector rotate. > > At the same time, this conversion may affect some other optimizations, such as `VectorMaskToLong(VectorLongToMask (x)) => x` and `VectorStoreMask(VectorLoadMask (x)) => x`. So we also need to fix these effects. For completeness, we should handle maskAll in the Identity transform of VectorMaskToLongNode. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2199807660 From mchevalier at openjdk.org Fri Jul 11 07:09:29 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 11 Jul 2025 07:09:29 GMT Subject: RFR: 8361494: [IR Framework] Escape too much in replacement of placeholder [v2] In-Reply-To: References: Message-ID: > In `RawIRNode::regex`, the call to `String::replaceAll` doesn't quote the replace string. > > Meaning that in the IR rule > > @IR(failOn = {IRNode.ALLOC_OF, "\\w+"}) > > The interpreted `\w` is interpreted as a group reference, and we get > > java.lang.IllegalArgumentException: Illegal group reference > > so we should write instead > > @IR(failOn = {IRNode.ALLOC_OF, "\\\\w+"}) > > To mean the interpreted string `\\w`, to mean an escaped single backslash. Same goes with `$` (used for nested classes). > > Since we don't want to refer to groups (and anyway, there are not in `IRNode.IS_REPLACED`), we just quote the replacement string with `java.util.regex.Matcher.quoteReplacement` to make it more usable. 
> > Note that you would still need to write `\$` since the `$` is the end of string regex, and needs to be escaped at the regex level (and not at the string, so it's not `$`, since `$` is not a special character). Before the fix, it should be `\\\$`. Phew! Regexes are bad enough, let's not escape them manually twice! > > In `test/hotspot/jtreg/compiler/c2/TestMergeStores.java`, that makes us save 1344 backslashes. > > Thanks, > Marc Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: Import ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26243/files - new: https://git.openjdk.org/jdk/pull/26243/files/0a9d1093..da6d7b0d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26243&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26243&range=00-01 Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26243.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26243/head:pull/26243 PR: https://git.openjdk.org/jdk/pull/26243 From mchevalier at openjdk.org Fri Jul 11 07:09:30 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 11 Jul 2025 07:09:30 GMT Subject: RFR: 8361494: [IR Framework] Escape too much in replacement of placeholder In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 12:54:55 GMT, Marc Chevalier wrote: > In `RawIRNode::regex`, the call to `String::replaceAll` doesn't quote the replace string. > > Meaning that in the IR rule > > @IR(failOn = {IRNode.ALLOC_OF, "\\w+"}) > > The interpreted `\w` is interpreted as a group reference, and we get > > java.lang.IllegalArgumentException: Illegal group reference > > so we should write instead > > @IR(failOn = {IRNode.ALLOC_OF, "\\\\w+"}) > > To mean the interpreted string `\\w`, to mean an escaped single backslash. Same goes with `$` (used for nested classes). > > Since we don't want to refer to groups (and anyway, there are not in `IRNode.IS_REPLACED`), we just quote the replacement string with `java.util.regex.Matcher.quoteReplacement` to make it more usable. > > Note that you would still need to write `\$` since the `$` is the end of string regex, and needs to be escaped at the regex level (and not at the string, so it's not `$`, since `$` is not a special character). Before the fix, it should be `\\\$`. Phew! Regexes are bad enough, let's not escape them manually twice! > > In `test/hotspot/jtreg/compiler/c2/TestMergeStores.java`, that makes us save 1344 backslashes. > > Thanks, > Marc Imported. Could you take a look again? 
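For readers who want to see the failure mode outside the IR framework, here is a small standalone sketch. The `#POSTFIX#` placeholder is just a stand-in for `IRNode.IS_REPLACED` and the strings are made up; the behaviour of `String::replaceAll` and `Matcher.quoteReplacement` is the point.

```java
import java.util.regex.Matcher;

public class QuoteReplacementDemo {
    public static void main(String[] args) {
        // Stand-in for a node regex containing the IRNode.IS_REPLACED placeholder.
        String template = "alloc_of_#POSTFIX#";

        // '$' in an unquoted replacement is parsed as a group reference and throws.
        try {
            template.replaceAll("#POSTFIX#", "Outer$Inner");
        } catch (IllegalArgumentException e) {
            System.out.println(e);  // Illegal group reference
        }

        // A backslash in an unquoted replacement is consumed as an escape, so the
        // user-supplied "\\w+" (the string \w+) silently becomes plain "w+".
        System.out.println(template.replaceAll("#POSTFIX#", "\\w+"));  // alloc_of_w+

        // Quoting the replacement keeps both '$' and '\' literal.
        System.out.println(template.replaceAll("#POSTFIX#",
                Matcher.quoteReplacement("Outer$Inner")));             // alloc_of_Outer$Inner
        System.out.println(template.replaceAll("#POSTFIX#",
                Matcher.quoteReplacement("\\w+")));                    // alloc_of_\w+
    }
}
```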
------------- PR Comment: https://git.openjdk.org/jdk/pull/26243#issuecomment-3060903679 From mchevalier at openjdk.org Fri Jul 11 07:09:54 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 11 Jul 2025 07:09:54 GMT Subject: RFR: 8359344: C2: Malformed control flow after intrinsic bailout [v6] In-Reply-To: <-rnlrm6PHRZeO1izbXh5nOrm368YKrsFft1u6SHXzWA=.9c8e6646-0b72-4705-895e-f795f74f3906@github.com> References: <-rnlrm6PHRZeO1izbXh5nOrm368YKrsFft1u6SHXzWA=.9c8e6646-0b72-4705-895e-f795f74f3906@github.com> Message-ID: On Thu, 10 Jul 2025 06:13:27 GMT, Marc Chevalier wrote: >> When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 >> >> This is enforced by restoring the old state, like in >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 >> >> That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: >> >> ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) >> >> >> Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. >> >> Another situation is somewhat worse, when happening during parsing. It can lead to such cases: >> >> ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) >> >> The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? >> >> This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: >> >> https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 >> >> And here there is the challenge: >> - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) >> ... > > Marc Chevalier has updated the pull request incrementally with two additional commits since the last revision: > > - Forgot to destruct_map_clone > - +'_' and ctor init Thanks @vnkozlov and @TobiHartmann for reviews! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/25936#issuecomment-3060908007 From mchevalier at openjdk.org Fri Jul 11 07:09:55 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 11 Jul 2025 07:09:55 GMT Subject: Integrated: 8359344: C2: Malformed control flow after intrinsic bailout In-Reply-To: References: Message-ID: On Mon, 23 Jun 2025 13:09:43 GMT, Marc Chevalier wrote: > When intrinsic bailout, we assume that the control in the `LibraryCallKit` did not change: > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L137 > > This is enforced by restoring the old state, like in > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L1722-L1732 > > That is good, but not sufficient. First, the most obvious, one could have already built some structure without moving the control. For instance, we can obtain something such as: > > ![1 after-intrinsic-bailout-during-late-inlining](https://github.com/user-attachments/assets/2fd255cc-0bfc-4841-8dd1-f64d502e0ee1) > > > Here, during late inlining, the call `323` is candidate to be inline, but that bails out. Yet, a call to `make_unsafe_address` was made, which built nodes `354 If` and everything under. This is needed as tests are made on the resulting nodes (especially `366 AddP`) to know whether we should bail out or not. At the end, we get 2 control successor to `346 IfFalse`: the call that is not removed and the leftover of the intrinsic that will be cleanup much later, but not by RemoveUseless. > > Another situation is somewhat worse, when happening during parsing. It can lead to such cases: > > ![2 after-intrinsic-bailout-during-parsing](https://github.com/user-attachments/assets/4524c615-6521-4f0d-8f61-c426f9179035) > > The nodes `31 OpaqueNotNull`, `31 If`, `36 IfTrue`, `33 IfFalse`, `35 Halt`, `44 If`, `45 IfTrue`, `46 IfFalse` are leftover from a bailing out intrinsic. The replacement call `49 CallStaticJava` should come just under `5 Parm`, but the control was updated and the call is actually built under `36 If`. Then, why does the previous assert doesn't complain? > > This is because there is more than one control, or one map. In intrinsics that need to restore their state, the initial `SafePoint` map is cloned, the clone is kept aside, and if needed (bailing out), we set the current map to this saved clone. But there is another map from which the one of the `LibraryCallKit` comes, and that survives longer, it's the one that is contained in the `JVMState`: > > https://github.com/openjdk/jdk/blob/c4fb00a7be51c7a05a29d3d57d787feb5c698ddf/src/hotspot/share/opto/library_call.cpp#L101-L102 > > And here there is the challenge: > - the `JVMState jvms` contains a `SafePoint` map, this map must have `jvms` as `jvms` (pointer comparison) > - we can't really change the pointer, just the content > -... This pull request has now been integrated. 
Changeset: 3ffc5b9e Author: Marc Chevalier URL: https://git.openjdk.org/jdk/commit/3ffc5b9ef720a07143ef5728d2597afdf2f2c251 Stats: 358 lines in 7 files changed: 240 ins; 75 del; 43 mod 8359344: C2: Malformed control flow after intrinsic bailout Reviewed-by: thartmann, kvn ------------- PR: https://git.openjdk.org/jdk/pull/25936 From chagedorn at openjdk.org Fri Jul 11 07:13:38 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 11 Jul 2025 07:13:38 GMT Subject: RFR: 8361494: [IR Framework] Escape too much in replacement of placeholder [v2] In-Reply-To: References: Message-ID: <6HRJooFsxbb_yBtzJ97j-DdyRhE4ACIKs6NGCRX4Xgk=.a48cfb73-57bd-4d1f-86b5-01529b4ca550@github.com> On Fri, 11 Jul 2025 07:09:29 GMT, Marc Chevalier wrote: >> In `RawIRNode::regex`, the call to `String::replaceAll` doesn't quote the replace string. >> >> Meaning that in the IR rule >> >> @IR(failOn = {IRNode.ALLOC_OF, "\\w+"}) >> >> The interpreted `\w` is interpreted as a group reference, and we get >> >> java.lang.IllegalArgumentException: Illegal group reference >> >> so we should write instead >> >> @IR(failOn = {IRNode.ALLOC_OF, "\\\\w+"}) >> >> To mean the interpreted string `\\w`, to mean an escaped single backslash. Same goes with `$` (used for nested classes). >> >> Since we don't want to refer to groups (and anyway, there are not in `IRNode.IS_REPLACED`), we just quote the replacement string with `java.util.regex.Matcher.quoteReplacement` to make it more usable. >> >> Note that you would still need to write `\$` since the `$` is the end of string regex, and needs to be escaped at the regex level (and not at the string, so it's not `$`, since `$` is not a special character). Before the fix, it should be `\\\$`. Phew! Regexes are bad enough, let's not escape them manually twice! >> >> In `test/hotspot/jtreg/compiler/c2/TestMergeStores.java`, that makes us save 1344 backslashes. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Import Marked as reviewed by chagedorn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26243#pullrequestreview-3008936485 From chagedorn at openjdk.org Fri Jul 11 07:20:41 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 11 Jul 2025 07:20:41 GMT Subject: RFR: 8342941: IGV: Add new graph dumps for post loop, empty loop removal, and one iteration removal [v3] In-Reply-To: <-qvrPep0_75olkxXj9BT74oMIHTfxwgshrHnqQC9BuU=.501e3840-2b5d-4c7c-b2fe-891a167c66d8@github.com> References: <_p5Jj77u1VyyW0eVneXqeNjmngTvSvFi94_FALv6swk=.d4e5aec1-dd73-48ed-8d7f-3080207be763@github.com> <-qvrPep0_75olkxXj9BT74oMIHTfxwgshrHnqQC9BuU=.501e3840-2b5d-4c7c-b2fe-891a167c66d8@github.com> Message-ID: On Thu, 10 Jul 2025 19:07:30 GMT, Daniel Lund?n wrote: >> Saranya Natarajan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: >> >> - fix 2 of review >> - Merge master >> - Addressing review comments >> - Initial Fix > > src/hotspot/share/opto/phasetype.hpp line 83: > >> 81: flags(AFTER_REMOVE_EMPTY_LOOP, "After Remove Empty Loop") \ >> 82: flags(BEFORE_ONE_ITERATION_LOOP, "Before Replacing One Iteration Loop") \ >> 83: flags(AFTER_ONE_ITERATION_LOOP, "After Replacing One Iteration Loop") \ > > Very much a nit, but I think this should be "One-Iteration Loop". Or, is it in fact one _iteration loop_ (as it reads now)? Looking at the code, I think it is the former. @chhagedorn can maybe clarify? 
> > This is not specific to your changeset, but also appears in existing source code comments. Maybe a good opportunity to clean this up everywhere? > > Also, maybe "Replacing" should be "Replace"? Seems to better fit the style used for other phase names. One-Iteration loop sounds better indeed. I also agree with the other suggestions. Something else I've noticed is that we could also benefit when we add dumps for `duplicate_loop_backedge()` which creates a new loop node (i.e. could be seen as "major modification"). I just looked into recently and found myself adding dumps there manually for debugging. I guess since this is a dump adding RFE, we could also add that one. What do you think? But then we would need to update the PR title to something like "add various new graph dumps during loop opts". ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25756#discussion_r2199865344 From chagedorn at openjdk.org Fri Jul 11 07:24:45 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 11 Jul 2025 07:24:45 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp [v4] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 00:37:10 GMT, Guanqiang Han wrote: >> When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. >> >> This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this flag is disabled. > > Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision: > > - Remove the unused variable > - Merge remote-tracking branch 'upstream/master' into 8361140 > - update regression test > - Merge remote-tracking branch 'upstream/master' into 8361140 > - update modification and add regression test > - Merge remote-tracking branch 'upstream/master' into 8361140 > - 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp > > When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. > > This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this support is disabled. Looks good, thanks! I'll give this a spinning in our testing and report back again. ------------- Marked as reviewed by chagedorn (Reviewer). 
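As a rough, hypothetical illustration of the Java shape involved (this is not the regression test added in the PR, and whether this exact code reaches the assert depends on escape-analysis heuristics), a pointer comparison against a phi that merges two non-escaping allocations is the pattern reduce_phi_on_cmp operates on; running such code with the optimization disabled is what exposed the missing guard:

// Hypothetical sketch; class, method and iteration count are invented for illustration.
// Run with: java -XX:-OptimizePtrCompare PtrComparePhiSketch
public class PtrComparePhiSketch {
    static class Box { int x; }

    static boolean compare(boolean flag) {
        Box a = new Box();       // two allocations that do not escape
        Box b = new Box();
        Box p = flag ? a : b;    // phi merging the two allocations
        return p == a;           // pointer compare fed by the phi
    }

    public static void main(String[] args) {
        boolean sink = false;
        for (int i = 0; i < 100_000; i++) {  // warm up so C2 compiles compare()
            sink ^= compare((i & 1) == 0);
        }
        System.out.println(sink);
    }
}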
PR Review: https://git.openjdk.org/jdk/pull/26125#pullrequestreview-3008979770 From mhaessig at openjdk.org Fri Jul 11 07:26:40 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Fri, 11 Jul 2025 07:26:40 GMT Subject: RFR: 8361494: [IR Framework] Escape too much in replacement of placeholder [v2] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 07:09:29 GMT, Marc Chevalier wrote: >> In `RawIRNode::regex`, the call to `String::replaceAll` doesn't quote the replace string. >> >> Meaning that in the IR rule >> >> @IR(failOn = {IRNode.ALLOC_OF, "\\w+"}) >> >> The interpreted `\w` is interpreted as a group reference, and we get >> >> java.lang.IllegalArgumentException: Illegal group reference >> >> so we should write instead >> >> @IR(failOn = {IRNode.ALLOC_OF, "\\\\w+"}) >> >> To mean the interpreted string `\\w`, to mean an escaped single backslash. Same goes with `$` (used for nested classes). >> >> Since we don't want to refer to groups (and anyway, there are not in `IRNode.IS_REPLACED`), we just quote the replacement string with `java.util.regex.Matcher.quoteReplacement` to make it more usable. >> >> Note that you would still need to write `\$` since the `$` is the end of string regex, and needs to be escaped at the regex level (and not at the string, so it's not `$`, since `$` is not a special character). Before the fix, it should be `\\\$`. Phew! Regexes are bad enough, let's not escape them manually twice! >> >> In `test/hotspot/jtreg/compiler/c2/TestMergeStores.java`, that makes us save 1344 backslashes. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Import The copyright in `RawIRNode.java` and `TestMergeStores.java` needs updating. ------------- Changes requested by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/26243#pullrequestreview-3008985841 From mchevalier at openjdk.org Fri Jul 11 07:30:55 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 11 Jul 2025 07:30:55 GMT Subject: RFR: 8361494: [IR Framework] Escape too much in replacement of placeholder [v3] In-Reply-To: References: Message-ID: > In `RawIRNode::regex`, the call to `String::replaceAll` doesn't quote the replace string. > > Meaning that in the IR rule > > @IR(failOn = {IRNode.ALLOC_OF, "\\w+"}) > > The interpreted `\w` is interpreted as a group reference, and we get > > java.lang.IllegalArgumentException: Illegal group reference > > so we should write instead > > @IR(failOn = {IRNode.ALLOC_OF, "\\\\w+"}) > > To mean the interpreted string `\\w`, to mean an escaped single backslash. Same goes with `$` (used for nested classes). > > Since we don't want to refer to groups (and anyway, there are not in `IRNode.IS_REPLACED`), we just quote the replacement string with `java.util.regex.Matcher.quoteReplacement` to make it more usable. > > Note that you would still need to write `\$` since the `$` is the end of string regex, and needs to be escaped at the regex level (and not at the string, so it's not `$`, since `$` is not a special character). Before the fix, it should be `\\\$`. Phew! Regexes are bad enough, let's not escape them manually twice! > > In `test/hotspot/jtreg/compiler/c2/TestMergeStores.java`, that makes us save 1344 backslashes. 
> > Thanks, > Marc Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: i ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26243/files - new: https://git.openjdk.org/jdk/pull/26243/files/da6d7b0d..606e2242 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26243&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26243&range=01-02 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26243.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26243/head:pull/26243 PR: https://git.openjdk.org/jdk/pull/26243 From duke at openjdk.org Fri Jul 11 07:39:41 2025 From: duke at openjdk.org (Guanqiang Han) Date: Fri, 11 Jul 2025 07:39:41 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp [v4] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 07:22:23 GMT, Christian Hagedorn wrote: >> Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision: >> >> - Remove the unused variable >> - Merge remote-tracking branch 'upstream/master' into 8361140 >> - update regression test >> - Merge remote-tracking branch 'upstream/master' into 8361140 >> - update modification and add regression test >> - Merge remote-tracking branch 'upstream/master' into 8361140 >> - 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp >> >> When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. >> >> This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this support is disabled. > > Looks good, thanks! I'll give this a spinning in our testing and report back again. @chhagedorn @JohnTortugo Thanks again for all your feedback. It?s been very helpful! If possible, could you please run /sponsor? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26125#issuecomment-3061047171 From haosun at openjdk.org Fri Jul 11 07:39:45 2025 From: haosun at openjdk.org (Hao Sun) Date: Fri, 11 Jul 2025 07:39:45 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 06:20:06 GMT, Mikhail Ablakatov wrote: >> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1995: >> >>> 1993: // Vector reduction multiply for integral type with ASIMD instructions. >>> 1994: // Note: temporary registers vtmp1 and vtmp2 are not used in some cases. >>> 1995: // Note: vsrc and vtmp2 may match. >> >> I left a comment in this "resolved comment thread" several days ago. See https://github.com/openjdk/jdk/pull/23181/files#r2179185158. It might be overlooked since the whole conversation was marked as resolved already. >> >> I personally think we should not allow `vsrc` and `vtmp2` to match. > > Apologies for overlooking the comment. A suggestion that started the thread was marked as a nit so I felt okay about resolving it myself at the time. 
> > If `vsrc` and `vtmp2` match it implies that `vsrc` is allowed to be modified. This is used so that `reduce_mul_integral_le128b` may be invoked either independently or to process a tail after `reduce_mul_integral_gt128b` here: https://github.com/openjdk/jdk/pull/23181/files#diff-75bfb44278df267ce4978393b9b6b6030a7e23065ca15436fb1a5009debc6e81R2106 . In the latter case a temporary register holding intermediate result value is passed to both `vsrc` and `vtmp2` parameters. > > I can see it being somewhat confusing though. I could add an explicit boolean flag parameter, e.g. `is_tail_processing`, and do assertion checks based on its value. And extend the function comment with the described above. > > I'm happy to consider other suggestions as well if any :) I see. Thanks for your explanation. Current version is okay to me. Perhaps we may want to add more comments here. Suggestion: // Note: vsrc and vtmp2 may match when this function is invoked by `reduce_mul_integral_gt128b()` // as a tail call and vsrc holds the intermediate results. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2199918736 From mhaessig at openjdk.org Fri Jul 11 07:34:44 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Fri, 11 Jul 2025 07:34:44 GMT Subject: RFR: 8361494: [IR Framework] Escape too much in replacement of placeholder [v3] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 07:30:55 GMT, Marc Chevalier wrote: >> In `RawIRNode::regex`, the call to `String::replaceAll` doesn't quote the replace string. >> >> Meaning that in the IR rule >> >> @IR(failOn = {IRNode.ALLOC_OF, "\\w+"}) >> >> The interpreted `\w` is interpreted as a group reference, and we get >> >> java.lang.IllegalArgumentException: Illegal group reference >> >> so we should write instead >> >> @IR(failOn = {IRNode.ALLOC_OF, "\\\\w+"}) >> >> To mean the interpreted string `\\w`, to mean an escaped single backslash. Same goes with `$` (used for nested classes). >> >> Since we don't want to refer to groups (and anyway, there are not in `IRNode.IS_REPLACED`), we just quote the replacement string with `java.util.regex.Matcher.quoteReplacement` to make it more usable. >> >> Note that you would still need to write `\$` since the `$` is the end of string regex, and needs to be escaped at the regex level (and not at the string, so it's not `$`, since `$` is not a special character). Before the fix, it should be `\\\$`. Phew! Regexes are bad enough, let's not escape them manually twice! >> >> In `test/hotspot/jtreg/compiler/c2/TestMergeStores.java`, that makes us save 1344 backslashes. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > i Thank you for addressing the comments. Looks good. ------------- Marked as reviewed by mhaessig (Committer). 
PR Review: https://git.openjdk.org/jdk/pull/26243#pullrequestreview-3009016845 From mablakatov at openjdk.org Fri Jul 11 07:35:47 2025 From: mablakatov at openjdk.org (Mikhail Ablakatov) Date: Fri, 11 Jul 2025 07:35:47 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: References: <6H9X-NXKOGd9BZVhTDiKNf7OO2KQTciRKGnXY-5C9yA=.e25f9e69-44c2-48d1-b4e3-cb8f1af79546@github.com> <_gHaFQTNq2bApeWAE88cWxcNULRDqndSSo3hrY31FgI=.132b7c24-7205-4877-9b95-3d9d13ac7ec8@github.com> Message-ID: <-SwJHROQB4jO9nlICIWSwNGXZDIQUy8O54baR-Xe80o=.f7c4fd43-330d-4870-ae4b-316ab7507b06@github.com> On Fri, 11 Jul 2025 06:15:31 GMT, Xiaohong Gong wrote: >> Precisely that, @shqking , thank you. I found this while evaluating the effect the patch has on auto-vectorization. > > I see. Thanks! So what is the type of bt? Is it an integer type of floating-point one? If it's an integer type, I think changing or not does not have difference. But if it is floating-point type, we do not support the non-strict-ordered anyway and they are not enabled in SLP on AArch64. I'm just curious whether this change has any relationship with this PR. If not, I suggest not touching it now. Seems there is the same change in this PR https://github.com/openjdk/jdk/pull/23181. @XiaohongGong , JIC, you've referenced the PR you left this comment in. Did you intend to post it somewhere else? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2199910206 From mchevalier at openjdk.org Fri Jul 11 07:38:40 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 11 Jul 2025 07:38:40 GMT Subject: RFR: 8361494: [IR Framework] Escape too much in replacement of placeholder [v3] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 07:32:13 GMT, Manuel H?ssig wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> i > > Thank you for addressing the comments. Looks good. Thanks @mhaessig! @chhagedorn, I still need a the almighty powers of a reviewer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26243#issuecomment-3061042788 From duke at openjdk.org Fri Jul 11 07:35:46 2025 From: duke at openjdk.org (duke) Date: Fri, 11 Jul 2025 07:35:46 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp [v4] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 00:37:10 GMT, Guanqiang Han wrote: >> When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. >> >> This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this flag is disabled. > > Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains seven additional commits since the last revision: > > - Remove the unused variable > - Merge remote-tracking branch 'upstream/master' into 8361140 > - update regression test > - Merge remote-tracking branch 'upstream/master' into 8361140 > - update modification and add regression test > - Merge remote-tracking branch 'upstream/master' into 8361140 > - 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp > > When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. > > This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this support is disabled. @hgqxjj Your change (at version 0e9aa956d7966d81ac81c799ad054016bf78cbba) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26125#issuecomment-3061036381 From bkilambi at openjdk.org Fri Jul 11 08:05:46 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Fri, 11 Jul 2025 08:05:46 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v13] In-Reply-To: References: Message-ID: <_4Car7g0KZ6-4OYHTt0B2ftw3a2amCHt9kFjVUXdA2M=.6428d19d-189b-4707-8350-4cc42ac30d47@github.com> On Fri, 11 Jul 2025 02:27:12 GMT, Hao Sun wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Change match rule names to lowercase > > src/hotspot/cpu/aarch64/aarch64_vector.ad line 7189: > >> 7187: effect(TEMP_DEF dst, TEMP tmp); >> 7188: match(Set dst (SelectFromTwoVector (Binary index src1) src2)); >> 7189: format %{ "vselect_from_two_vectors_Neon_10_11 $dst, $src1, $src2, $index\t# vector (8B/16B/4S/8S/2I/4I/2F/4F). KILL $tmp" %} > > nit: here and several other sites. We also need use lower cases in the `format` clause. > > Suggestion: > > format %{ "vselect_from_two_vectors_neon_10_11 $dst, $src1, $src2, $index\t# vector (8B/16B/4S/8S/2I/4I/2F/4F). KILL $tmp" %} Sorry my bad, I missed it but the new patch (after @XiaohongGong's suggestion) doesnt have the separate neon/sve match rules anymore and I have made sure the match rule name in the format matches the match rule. Thanks! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2199971472 From aph at openjdk.org Fri Jul 11 08:25:54 2025 From: aph at openjdk.org (Andrew Haley) Date: Fri, 11 Jul 2025 08:25:54 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 13:53:25 GMT, Mikhail Ablakatov wrote: >> Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used. >> >> Nothing changes for <= 128-bit long vectors as for those the existing ASIMD implementation is used directly still. 
>> >> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reducion micro-benchmarks. >> >> Benchmarks results: >> >> Neoverse-V1 (SVE 256-bit) >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms >> >> >> Fujitsu A64FX (SVE 512-bit): >> >> Benchmark (size) Mode master PR Units >> ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms >> ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms >> IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms >> LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms >> FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms >> DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms > > Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: > > remove the strictly-ordered FP implementation as unused src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 4073: > 4071: f(0b101111, 15, 10), rf(Zn, 5), rf(Zd, 0); > 4072: } > 4073: This pattern should be in a section _SVE Integer Reduction_, C4.1.37. I'm not sure if any other instructions in that group are defined yet, but if not please start the section. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2199991598 From jbhateja at openjdk.org Fri Jul 11 08:51:17 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 11 Jul 2025 08:51:17 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v10] In-Reply-To: References: Message-ID: <7UVWuEfttp-9smTO095TsfR5wZ3pEwuOSp0M3rughnM=.8a6f46c6-a406-44ec-ac04-e70736b94ade@github.com> On Thu, 10 Jul 2025 15:19:50 GMT, Bhavana Kilambi wrote: >> test/hotspot/jtreg/compiler/vectorapi/TestSelectFromTwoVectorOp.java line 234: >> >>> 232: >>> 233: @Test >>> 234: @IR(counts = {IRNode.SELECT_FROM_TWO_VECTOR_VS, IRNode.VECTOR_SIZE_8, ">0"}, >> >> Hi @Bhavana-Kilambi , >> Kindly also include x86-specific feature checks in IR rules for this test. >> >> You can directly integrate attached patch. >> >> [select_from_ir_feature.txt](https://github.com/user-attachments/files/21034639/select_from_ir_feature.txt) > > Hi @jatin-bhateja , have you tested `jdk/incubator/vector` tests with your patch on x86? Hi @Bhavana-Kilambi , x86 implementation favors vector sizes greater than 64-bit, since AVX512 has a direct two-vector permute instruction. For legacy targets, I lower the IR through **_LowerSelectFromTwoVectorOperation_** to blend a pair of rearranges, such that an exceptional index (above vector lane count) selects a lane from the second source vector, else from the first one. Accidentally, my rough patch included an experimental change. Please update with the latest version. 
Best Regards [select_from_x86_patch.txt](https://github.com/user-attachments/files/21179211/select_from_x86_patch.txt) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2200094770 From aph at openjdk.org Fri Jul 11 08:55:46 2025 From: aph at openjdk.org (Andrew Haley) Date: Fri, 11 Jul 2025 08:55:46 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 08:11:30 GMT, Andrew Haley wrote: >> Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision: >> >> remove the strictly-ordered FP implementation as unused > > src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 4073: > >> 4071: f(0b101111, 15, 10), rf(Zn, 5), rf(Zd, 0); >> 4072: } >> 4073: > > This pattern should be in a section _SVE Integer Reduction_, C4.1.37. I'm not sure if any other instructions in that group are defined yet, but if not please start the section. Sorry, the unpredicated version should be in the _SVE Integer Misc - Unpredicated_ section. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2200087303 From shade at openjdk.org Fri Jul 11 09:20:10 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 11 Jul 2025 09:20:10 GMT Subject: RFR: 8358580: Rethink how classes are kept alive in training data [v2] In-Reply-To: References: Message-ID: <6PnCrAx7F67U67Lqv1pvLvb4FBpJoUQ_OovolEUDMoA=.31a589ef-bc56-4764-bcf5-850c5de2c4e4@github.com> On Thu, 10 Jul 2025 17:10:22 GMT, Igor Veresov wrote: >> Use OopStorage directly instead of JNI handles. Note that we never destroy TrainingData objects, so we don't need to concern ourselves with freeing the OopStorage entries. Also, keeping the klasses alive is only necessary during the training run. During the replay the klasses TD objects refer to are always alive. > > Igor Veresov has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments I guess `klass->klass_holder()` is not needed to make sure we keep alive the class correctly, if we are running at init safepoint anyway. Given that code also works fine with storing Java mirror as root, and this change only moves that from JNI handles to OopStorage, this looks fine. ------------- Marked as reviewed by shade (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26233#pullrequestreview-3009424634 From xgong at openjdk.org Fri Jul 11 09:34:42 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 11 Jul 2025 09:34:42 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: <-SwJHROQB4jO9nlICIWSwNGXZDIQUy8O54baR-Xe80o=.f7c4fd43-330d-4870-ae4b-316ab7507b06@github.com> References: <6H9X-NXKOGd9BZVhTDiKNf7OO2KQTciRKGnXY-5C9yA=.e25f9e69-44c2-48d1-b4e3-cb8f1af79546@github.com> <_gHaFQTNq2bApeWAE88cWxcNULRDqndSSo3hrY31FgI=.132b7c24-7205-4877-9b95-3d9d13ac7ec8@github.com> <-SwJHROQB4jO9nlICIWSwNGXZDIQUy8O54baR-Xe80o=.f7c4fd43-330d-4870-ae4b-316ab7507b06@github.com> Message-ID: On Fri, 11 Jul 2025 07:33:19 GMT, Mikhail Ablakatov wrote: >> I see. Thanks! So what is the type of bt? Is it an integer type of floating-point one? If it's an integer type, I think changing or not does not have difference. But if it is floating-point type, we do not support the non-strict-ordered anyway and they are not enabled in SLP on AArch64. I'm just curious whether this change has any relationship with this PR. If not, I suggest not touching it now. 
Seems there is the same change in this PR https://github.com/openjdk/jdk/pull/23181. > > @XiaohongGong , JIC, you've referenced the PR you left this comment in. Did you intend to post it somewhere else? Oh, sorry, my bad. I intended to post this one: https://github.com/openjdk/jdk/pull/21895/files#diff-7b82624b78127158abbce6835eeba196bd062aee59512ec2d4e4c8c7d681573b ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2200220379 From thartmann at openjdk.org Fri Jul 11 10:08:41 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 11 Jul 2025 10:08:41 GMT Subject: RFR: 8350177: C2 SuperWord: Integer.numberOfLeadingZeros, numberOfTrailingZeros, reverse and bitCount have input types wrongly turncated for byte and short [v9] In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 12:59:24 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This patch fixes cases in SuperWord when compiling subword types where vectorized code would be given a narrower type than expected, leading to miscompilation due to truncation. This fix is a generalization of the same fix applied for `Integer.reverseBytes` in [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). The patch introduces a check for nodes that are known to tolerate truncation, so that any future cases of subword truncation will avoid creating miscompiled code. >> >> The patch reuses the existing logic to set the type of the vectors to int, which currently disables vectorization for the affected patterns entirely. Once [JDK-8342095](https://bugs.openjdk.org/browse/JDK-8342095) is merged and automatic casting support is added the autovectorizer should automatically insert casts to and from int, maintaining correctness. >> >> I've added an IR test that checks for correctly compiled outputs. Thoughts and reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Explicit nullptr checks All tests passed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25440#issuecomment-3061622691 From chagedorn at openjdk.org Fri Jul 11 10:21:44 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 11 Jul 2025 10:21:44 GMT Subject: RFR: 8361494: [IR Framework] Escape too much in replacement of placeholder [v3] In-Reply-To: References: Message-ID: <-bTfSUt8-boUIgH-411H7Di3_R3w8Ztj5FENzONdg24=.239c14ad-6e87-4682-9025-e5c54273bb6c@github.com> On Fri, 11 Jul 2025 07:30:55 GMT, Marc Chevalier wrote: >> In `RawIRNode::regex`, the call to `String::replaceAll` doesn't quote the replace string. >> >> Meaning that in the IR rule >> >> @IR(failOn = {IRNode.ALLOC_OF, "\\w+"}) >> >> The interpreted `\w` is interpreted as a group reference, and we get >> >> java.lang.IllegalArgumentException: Illegal group reference >> >> so we should write instead >> >> @IR(failOn = {IRNode.ALLOC_OF, "\\\\w+"}) >> >> To mean the interpreted string `\\w`, to mean an escaped single backslash. Same goes with `$` (used for nested classes). >> >> Since we don't want to refer to groups (and anyway, there are not in `IRNode.IS_REPLACED`), we just quote the replacement string with `java.util.regex.Matcher.quoteReplacement` to make it more usable. >> >> Note that you would still need to write `\$` since the `$` is the end of string regex, and needs to be escaped at the regex level (and not at the string, so it's not `$`, since `$` is not a special character). Before the fix, it should be `\\\$`. Phew! 
Regexes are bad enough, let's not escape them manually twice! >> >> In `test/hotspot/jtreg/compiler/c2/TestMergeStores.java`, that makes us save 1344 backslashes. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > i Marked as reviewed by chagedorn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26243#pullrequestreview-3009654293 From epeter at openjdk.org Fri Jul 11 10:23:42 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 11 Jul 2025 10:23:42 GMT Subject: RFR: 8350177: C2 SuperWord: Integer.numberOfLeadingZeros, numberOfTrailingZeros, reverse and bitCount have input types wrongly turncated for byte and short [v9] In-Reply-To: References: Message-ID: <_M-WU-cqWGnWSOxm8WOhkz7_EoHcx6YMrmtgPFGzG4U=.1dae8af4-6909-4ac5-bb7d-2cd9ca4b287c@github.com> On Mon, 30 Jun 2025 12:59:24 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This patch fixes cases in SuperWord when compiling subword types where vectorized code would be given a narrower type than expected, leading to miscompilation due to truncation. This fix is a generalization of the same fix applied for `Integer.reverseBytes` in [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). The patch introduces a check for nodes that are known to tolerate truncation, so that any future cases of subword truncation will avoid creating miscompiled code. >> >> The patch reuses the existing logic to set the type of the vectors to int, which currently disables vectorization for the affected patterns entirely. Once [JDK-8342095](https://bugs.openjdk.org/browse/JDK-8342095) is merged and automatic casting support is added the autovectorizer should automatically insert casts to and from int, maintaining correctness. >> >> I've added an IR test that checks for correctly compiled outputs. Thoughts and reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: > > Explicit nullptr checks @jaskarth LGMT thanks for your work ? ------------- Marked as reviewed by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25440#pullrequestreview-3009660376 From epeter at openjdk.org Fri Jul 11 10:23:43 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 11 Jul 2025 10:23:43 GMT Subject: RFR: 8350177: C2 SuperWord: Integer.numberOfLeadingZeros, numberOfTrailingZeros, reverse and bitCount have input types wrongly turncated for byte and short In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 03:52:14 GMT, Jasmine Karthikeyan wrote: >> @jaskarth the tests look better now. I still saw this failure: >> >> `jdk/incubator/vector/Byte64VectorTests.java` >> >> Flags: `-XX:UseAVX=3 -XX:+UnlockDiagnosticVMOptions -XX:+UseKNLSetting` or `-XX:UseAVX=0` or `-XX:UseAVX=2` ... probably no flags are actually required. >> >> `# assert(false) failed: Unexpected node in SuperWord truncation: ExtractB` > > @eme64 Thanks for running it again! I've pushed a fix for the `ExtractB` assert, and a proactive fix marking any nodes with `TypeVect` as their base type as non-truncating. 
@jaskarth There is a title mismatch though ------------- PR Comment: https://git.openjdk.org/jdk/pull/25440#issuecomment-3061682166 From mchevalier at openjdk.org Fri Jul 11 10:34:25 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 11 Jul 2025 10:34:25 GMT Subject: RFR: 8361494: [IR Framework] Escape too much in replacement of placeholder [v3] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 07:30:55 GMT, Marc Chevalier wrote: >> In `RawIRNode::regex`, the call to `String::replaceAll` doesn't quote the replace string. >> >> Meaning that in the IR rule >> >> @IR(failOn = {IRNode.ALLOC_OF, "\\w+"}) >> >> The interpreted `\w` is interpreted as a group reference, and we get >> >> java.lang.IllegalArgumentException: Illegal group reference >> >> so we should write instead >> >> @IR(failOn = {IRNode.ALLOC_OF, "\\\\w+"}) >> >> To mean the interpreted string `\\w`, to mean an escaped single backslash. Same goes with `$` (used for nested classes). >> >> Since we don't want to refer to groups (and anyway, there are not in `IRNode.IS_REPLACED`), we just quote the replacement string with `java.util.regex.Matcher.quoteReplacement` to make it more usable. >> >> Note that you would still need to write `\$` since the `$` is the end of string regex, and needs to be escaped at the regex level (and not at the string, so it's not `$`, since `$` is not a special character). Before the fix, it should be `\\\$`. Phew! Regexes are bad enough, let's not escape them manually twice! >> >> In `test/hotspot/jtreg/compiler/c2/TestMergeStores.java`, that makes us save 1344 backslashes. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > i Thanks @mhaessig and @chhagedorn! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26243#issuecomment-3061701499 From mchevalier at openjdk.org Fri Jul 11 10:48:48 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 11 Jul 2025 10:48:48 GMT Subject: Integrated: 8361494: [IR Framework] Escape too much in replacement of placeholder In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 12:54:55 GMT, Marc Chevalier wrote: > In `RawIRNode::regex`, the call to `String::replaceAll` doesn't quote the replace string. > > Meaning that in the IR rule > > @IR(failOn = {IRNode.ALLOC_OF, "\\w+"}) > > The interpreted `\w` is interpreted as a group reference, and we get > > java.lang.IllegalArgumentException: Illegal group reference > > so we should write instead > > @IR(failOn = {IRNode.ALLOC_OF, "\\\\w+"}) > > To mean the interpreted string `\\w`, to mean an escaped single backslash. Same goes with `$` (used for nested classes). > > Since we don't want to refer to groups (and anyway, there are not in `IRNode.IS_REPLACED`), we just quote the replacement string with `java.util.regex.Matcher.quoteReplacement` to make it more usable. > > Note that you would still need to write `\$` since the `$` is the end of string regex, and needs to be escaped at the regex level (and not at the string, so it's not `$`, since `$` is not a special character). Before the fix, it should be `\\\$`. Phew! Regexes are bad enough, let's not escape them manually twice! > > In `test/hotspot/jtreg/compiler/c2/TestMergeStores.java`, that makes us save 1344 backslashes. > > Thanks, > Marc This pull request has now been integrated. 
Changeset: 76442f39 Author: Marc Chevalier URL: https://git.openjdk.org/jdk/commit/76442f39b9dd583f09a7adebb0fc5f37b6ef88ef Stats: 242 lines in 3 files changed: 15 ins; 0 del; 227 mod 8361494: [IR Framework] Escape too much in replacement of placeholder Reviewed-by: mhaessig, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/26243 From fjiang at openjdk.org Fri Jul 11 11:50:33 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Fri, 11 Jul 2025 11:50:33 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v5] In-Reply-To: References: Message-ID: > Hi, please consider. > [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. > The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. > If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. > This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. > We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. > > This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. > The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. > > Test on linux-riscv64: > - [x] Tier1-3 > > JMH data on P550 SBC for reference (w/o and w/ the patch): > > Before: > > Without COH: > > Benchmark (size) Mode Cnt Score Error Units > ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op > ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op > ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op > ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op > ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op > ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op > ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op > ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op > ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op > ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op > ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op > ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op > ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op > ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op > ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op > ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op > > ------------------------------------------------------------------------- > With COH: > > Benchmark (size) Mode Cnt Score Error Units > ArrayClone.byteArraycopy 0 avgt 15 50.667 ? 0.622 ns/op > Arra... Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains eight additional commits since the last revision: - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - also keep overlapping flag - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - Revert RISCV Macro modification - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses - riscv: fix c1 primitive array clone intrinsic regression ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25976/files - new: https://git.openjdk.org/jdk/pull/25976/files/ca628e16..26148c2d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25976&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25976&range=03-04 Stats: 3723 lines in 102 files changed: 1932 ins; 1091 del; 700 mod Patch: https://git.openjdk.org/jdk/pull/25976.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25976/head:pull/25976 PR: https://git.openjdk.org/jdk/pull/25976 From fjiang at openjdk.org Fri Jul 11 11:50:35 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Fri, 11 Jul 2025 11:50:35 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v4] In-Reply-To: <30vvzTU6W2p0YpB8Z9bSfO9ajO_fHh79q9cX1G3gz3k=.521b26d7-b606-4fdc-bdcf-41fd6c4891cc@github.com> References: <30vvzTU6W2p0YpB8Z9bSfO9ajO_fHh79q9cX1G3gz3k=.521b26d7-b606-4fdc-bdcf-41fd6c4891cc@github.com> Message-ID: <2CXVwx9MOtsQZgcNNyeAcjFrv8dyzuhRV4XOeFeygzY=.bf505d47-37e9-4a0c-bae6-ff0b8f109faf@github.com> On Wed, 9 Jul 2025 10:07:31 GMT, Feilong Jiang wrote: >> Hi, please consider. >> [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. >> The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. >> If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. >> This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. >> We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. >> >> This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. >> The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. >> >> Test on linux-riscv64: >> - [x] Tier1-3 >> >> JMH data on P550 SBC for reference (w/o and w/ the patch): >> >> Before: >> >> Without COH: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op >> ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op >> ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op >> ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op >> ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op >> ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op >> ArrayClone.intArraycopy 10 avgt 15 76.302 ? 
1.388 ns/op >> ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op >> ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op >> ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op >> ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op >> ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op >> >> ------------------------------------------------------------------------- >> With COH: >> >> Benchmark (size) Mode Cnt Score Error Un... > > Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - Revert RISCV Macro modification > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses > - riscv: fix c1 primitive array clone intrinsic regression @dean-long, thanks for the review. I have added `LIR_OpArrayCopy::overlapping`, could you please take another look? ------------- PR Comment: https://git.openjdk.org/jdk/pull/25976#issuecomment-3061976583 From fjiang at openjdk.org Fri Jul 11 11:50:36 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Fri, 11 Jul 2025 11:50:36 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v4] In-Reply-To: References: <30vvzTU6W2p0YpB8Z9bSfO9ajO_fHh79q9cX1G3gz3k=.521b26d7-b606-4fdc-bdcf-41fd6c4891cc@github.com> Message-ID: On Thu, 10 Jul 2025 22:42:44 GMT, Dean Long wrote: >> Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: >> >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - Revert RISCV Macro modification >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses >> - riscv: fix c1 primitive array clone intrinsic regression > > src/hotspot/share/c1/c1_LIR.cpp line 353: > >> 351: , _expected_type(expected_type) >> 352: , _flags(flags) { >> 353: #if defined(X86) || defined(AARCH64) || defined(S390) || defined(RISCV64) || defined(PPC64) > > Do we still need this #if? It would be nice if we can eventually remove it, but I guess arm32 support is missing. You are right, ARM32 support is not available, so we have to keep these platform guards for now. > src/hotspot/share/c1/c1_LIR.cpp line 354: > >> 352: , _flags(flags) { >> 353: #if defined(X86) || defined(AARCH64) || defined(S390) || defined(RISCV64) || defined(PPC64) >> 354: if (expected_type != nullptr && ((flags & ~LIR_OpArrayCopy::unaligned) == 0)) { > > I was concerned that this is platform-specific, but I checked and all platforms can handle unaligned or overlapping w/o using the stub. So maybe this should be using LIR_OpArrayCopy::unaligned|LIR_OpArrayCopy::overlapping? 
Yes, LIR_OpArrayCopy::overlapping was also reset if `OmitChecksFlag` is true. Added `LIR_OpArrayCopy::overlapping` too. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2200508522 PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2200507132 From bulasevich at openjdk.org Fri Jul 11 12:02:43 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Fri, 11 Jul 2025 12:02:43 GMT Subject: [jdk25] RFR: 8358183: [JVMCI] crash accessing nmethod::jvmci_name in CodeCache::aggregate In-Reply-To: <2-vD19USVejKcQe9AuVa1tq9RCGdspEJ1JVbt5BVI_4=.8e45bdfc-8142-4408-b960-6b95aa338e53@github.com> References: <2-vD19USVejKcQe9AuVa1tq9RCGdspEJ1JVbt5BVI_4=.8e45bdfc-8142-4408-b960-6b95aa338e53@github.com> Message-ID: On Thu, 10 Jul 2025 17:40:20 GMT, Boris Ulasevich wrote: > This is the backport of the JVMCI metadata crash fix. > > Issue: > When flushing nmethods via CodeBlob::purge(), the JVMCI metadata was freed (mutable_data) but its size fields remained non-zero. As a result, invoking heap analytics via jcmd Compiler.CodeHeap_Analytics still walks the purged metadata and calls jvmci_name() on arbitrary memory, leading to intermittent crashes > > Fix: > Extend CodeBlob::purge() to zero out the _mutable_data_size, _relocation_size, and _metadata_size fields so that after a purge jvmci_data_size() returns 0 and CompileBroker::print_heapinfo() skips any JVMCI metadata Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26248#issuecomment-3062013056 From bulasevich at openjdk.org Fri Jul 11 12:02:44 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Fri, 11 Jul 2025 12:02:44 GMT Subject: [jdk25] Integrated: 8358183: [JVMCI] crash accessing nmethod::jvmci_name in CodeCache::aggregate In-Reply-To: <2-vD19USVejKcQe9AuVa1tq9RCGdspEJ1JVbt5BVI_4=.8e45bdfc-8142-4408-b960-6b95aa338e53@github.com> References: <2-vD19USVejKcQe9AuVa1tq9RCGdspEJ1JVbt5BVI_4=.8e45bdfc-8142-4408-b960-6b95aa338e53@github.com> Message-ID: On Thu, 10 Jul 2025 17:40:20 GMT, Boris Ulasevich wrote: > This is the backport of the JVMCI metadata crash fix. > > Issue: > When flushing nmethods via CodeBlob::purge(), the JVMCI metadata was freed (mutable_data) but its size fields remained non-zero. As a result, invoking heap analytics via jcmd Compiler.CodeHeap_Analytics still walks the purged metadata and calls jvmci_name() on arbitrary memory, leading to intermittent crashes > > Fix: > Extend CodeBlob::purge() to zero out the _mutable_data_size, _relocation_size, and _metadata_size fields so that after a purge jvmci_data_size() returns 0 and CompileBroker::print_heapinfo() skips any JVMCI metadata This pull request has now been integrated. Changeset: 44f5dfef Author: Boris Ulasevich URL: https://git.openjdk.org/jdk/commit/44f5dfef976bbe81c4b76b8b432f29ca2ea223d4 Stats: 3 lines in 2 files changed: 3 ins; 0 del; 0 mod 8358183: [JVMCI] crash accessing nmethod::jvmci_name in CodeCache::aggregate Reviewed-by: thartmann Backport-of: 74822ce12acaf9816aa49b75ab5817ced3710776 ------------- PR: https://git.openjdk.org/jdk/pull/26248 From tschatzl at openjdk.org Fri Jul 11 12:18:38 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Fri, 11 Jul 2025 12:18:38 GMT Subject: RFR: 8361952: Installation of MethodData::extra_data_lock() misses synchronization on reader side In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 10:43:21 GMT, Aleksey Shipilev wrote: > Looks reasonable. 
> > Two nits: > > * This is not technically double-checked locking, this is just an atomic installation. So the RFE synopsis is not that accurate. Fixed. > > * Maybe `Atomic::cmpxchg` should use a more relaxed memory ordering as well, to micro-optimize and offset the costs of now-proper acquire a bit. There is no need for default `memory_order_conservative` here. I think `memory_order_seq_cst` would do: gives acquire of `old` and release of new in the same package. I will leave that suggestion to another RFE. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26262#issuecomment-3062079334 From shade at openjdk.org Fri Jul 11 12:25:39 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 11 Jul 2025 12:25:39 GMT Subject: RFR: 8361952: Installation of MethodData::extra_data_lock() misses synchronization on reader side In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 09:47:05 GMT, Thomas Schatzl wrote: > Hi all, > > please review this change that fixes some recently introduced atomic installation of a mutex, where the memory barrier (`load_acquire`) on the reader side. Without it the reader might get a valid pointer to the `Mutex` created on the fly, without it being initialized properly. > > Found during code inspection for https://bugs.openjdk.org/browse/JDK-8361706 ; due to some suspicious hangs in the `MutexLocker` while cleaning klasses during class unloading in parallel (multiple threads hanging in `MethodData::clean_method_data`), executing the `vmTestbase/nsk/jvmti/RedefineClasses/StressRedefine/TestDescription.java` test. > > Testing: gha > > Thanks, > Thomas OK, sure. ------------- Marked as reviewed by shade (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26262#pullrequestreview-3010074399 From aph at openjdk.org Fri Jul 11 12:26:54 2025 From: aph at openjdk.org (Andrew Haley) Date: Fri, 11 Jul 2025 12:26:54 GMT Subject: RFR: 8343689: AArch64: Optimize MulReduction implementation [v8] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 07:36:20 GMT, Hao Sun wrote: > I see. Thanks for your explanation. Current version is okay to me. Perhaps we may want to add more comments here. The current code is just the sort of trap for the maintainer that leads to hard-to-find bugs. It'd be much better to remove the need for this comment by forcing everyone to provide two distinct scratch registers. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2200592617 From shade at openjdk.org Fri Jul 11 12:44:44 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 11 Jul 2025 12:44:44 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 18:29:14 GMT, Yadong Wang wrote: > The bug is that the predicate rule of immByteMapBase would cause a ConP Node for oop incorrect matching with byte_map_base when the placeholder jni handle address was just allocated to the address of byte_map_base. > > C2 uses JNI handles as placeholders to encoding constant oops, and one of some handle maybe locate at the address of byte_map_base, which is not memory reserved by CardTable. It's possible because JNIHandleBlocks are allocated by malloc. > > // The assembler store_check code will do an unsigned shift of the oop, > // then add it to _byte_map_base, i.e. 
> // > // _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift) > _byte_map = (CardValue*) rs.base(); > _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift); > > In aarch64 port, C2 will incorrectly match ConP for oop to ConP for byte_map_base by the immByteMapBase operand. > > // Card Table Byte Map Base > operand immByteMapBase() > %{ > // Get base of card map > predicate((jbyte*)n->get_ptr() == > ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); > match(ConP); > > op_cost(0); > format %{ %} > interface(CONST_INTER); > %} > > // Load Byte Map Base Constant > instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) > %{ > match(Set dst con); > > ins_cost(INSN_COST); > format %{ "adr $dst, $con\t# Byte Map Base" %} > > ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); > > ins_pipe(ialu_imm); > %} > > As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: > 0xffff25caf08c: ldaxr x8, [x11] > 0xffff25caf090: cmp x10, x8 > 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any > 0xffff25caf098: stlxr w8, x28, [x11] > 0xffff25caf09c: cbnz w8, 0xffff25caf08c > 0xffff25caf0a0: orr x11, xzr, #0x3 > 0xffff25caf0a4: str x11, [x13] > 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none > 0xffff25caf0ac: str x14, [sp] > 0xffff25caf0b0: add x2, sp, #0x20 > 0xffff25caf0b4: adrp x1, 0xffff21730000 > 0xffff25caf0b8: bl 0xffff256fffc0 > 0xffff25caf0bc: ldr x14, [sp] > 0xffff25caf0c0: b 0xffff25caef80 > 0xffff25caf0c4: add x13, sp, #0x20 > 0xffff25caf0c8: adrp x12, 0xffff21730000 > 0xffff25caf0cc: ldr x10, [x13] > 0xffff25caf0d0: cmp x10, xzr > 0xffff25caf0d4: b.eq 0xffff25caf130 // b.none > 0xffff25caf0d8: ldr x11, [x12] > 0xffff25caf0dc: tbnz w10, #1, 0xffff25caf0f... It is such a beautiful bug to read about on Friday. So the net effect of this mismatch is that we miss oop relocation/record when `ConP` accidentally mismatches to card table base, did I get that right? > Yes, it maybe a better solution for jdk main line, because immPollPage was remove in https://bugs.openjdk.org/browse/JDK-8220051. But how about jdk8u backport? I think we should do these things separately: 1. `immByteMapBase` rule removal in AArch64, this PR, then backport it to 25, 21, 17, maybe to 11, 8 2. `immByteMapBase` rule removal in RISC-V, separate PR, then backport it to 25, 21 3. `immPollPage` rule removal in AArch64, in 11u and 8u specific PRs The backports for (1) would not be clean, as Generational Shenandoah barrier checks would likely trigger technical conflicts in the code that is being removed. So there is doubly no point in going for clean backports, we should really slice them by the rule we are removing. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3062188862 From jbhateja at openjdk.org Fri Jul 11 13:01:32 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 11 Jul 2025 13:01:32 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v10] In-Reply-To: References: Message-ID: > Hi All, > > This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. > > Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. 
Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. > > New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with six additional commits since the last revision: - Update src/hotspot/share/opto/intrinsicnode.cpp Co-authored-by: Emanuel Peter - Update src/hotspot/share/opto/intrinsicnode.cpp Co-authored-by: Emanuel Peter - Update src/hotspot/share/opto/intrinsicnode.cpp Co-authored-by: Emanuel Peter - Update src/hotspot/share/opto/intrinsicnode.cpp Co-authored-by: Emanuel Peter - Update src/hotspot/share/opto/intrinsicnode.cpp Co-authored-by: Emanuel Peter - Update src/hotspot/share/opto/intrinsicnode.cpp Co-authored-by: Emanuel Peter ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23947/files - new: https://git.openjdk.org/jdk/pull/23947/files/96ecbac1..f00634a4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=08-09 Stats: 40 lines in 1 file changed: 1 ins; 14 del; 25 mod Patch: https://git.openjdk.org/jdk/pull/23947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23947/head:pull/23947 PR: https://git.openjdk.org/jdk/pull/23947 From aph at openjdk.org Fri Jul 11 13:22:40 2025 From: aph at openjdk.org (Andrew Haley) Date: Fri, 11 Jul 2025 13:22:40 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet In-Reply-To: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: On Thu, 26 Jun 2025 12:13:19 GMT, Samuel Chee wrote: > AtomicLong.CompareAndSet has the following assembly dump snippet which gets emitted from the intermediary LIRGenerator::atomic_cmpxchg: > > ;; cmpxchg { > 0x0000e708d144cf60: mov x8, x2 > 0x0000e708d144cf64: casal x8, x3, [x0] > 0x0000e708d144cf68: cmp x8, x2 > ;; 0x1F1F1F1F1F1F1F1F > 0x0000e708d144cf6c: mov x8, #0x1f1f1f1f1f1f1f1f > ;; } cmpxchg > 0x0000e708d144cf70: cset x8, ne // ne = any > 0x0000e708d144cf74: dmb ish > > > According to the Oracle Java Specification, AtomicLong.CompareAndSet [1] has the same memory effects as specified by VarHandle.compareAndSet which has the following effects: [2] > >> Atomically sets the value of a variable to the >> newValue with the memory semantics of setVolatile if >> the variable's current value, referred to as the witness >> value, == the expectedValue, as accessed with the memory >> semantics of getVolatile. > > > > Hence the release on the store due to setVolatile only occurs if the compare is successful. Since casal already satisfies these requirements, the dmb does not need to occur to ensure memory ordering in case the compare fails and a release does not happen. > > Hence we remove the dmb from both casl and casw (same logic applies to the non-long variant) > > This is also reflected by C2 not having a dmb for the same respective method. 
> > [1] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/util/concurrent/atomic/AtomicLong.html#compareAndSet(long,long) > [2] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/lang/invoke/VarHandle.html#compareAndSet(java.lang.Object...) On 04/07/2025 17:28, Samuel Chee wrote: > Hope this helps :) Thanks, this looks convincing. Please allow some time for me to do some more checking. This is a tricky area, and the the cost if we get it wrong is high. FYI, I'm still looking at this. It seems that the definition of barrier-ordered-before has been strengthened since this code was written. A test that I wrote a few years ago now passes on the online Herd7 simulator, where it used to fail. Back then I commented // At the time of writing we don't know of any AArch64 hardware that // reorders stores in this way, but the Reference Manual permits it. ... and confirmed my interpretation with the author of the Reference Manual. I'm guessing that older AArch64 implementations still in use never did such reorderings. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26000#issuecomment-3044020068 PR Comment: https://git.openjdk.org/jdk/pull/26000#issuecomment-3062325467 From jbhateja at openjdk.org Fri Jul 11 13:30:32 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 11 Jul 2025 13:30:32 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v11] In-Reply-To: References: Message-ID: <-TCK19ngXOwGp1EPss-clgnrwzy4-DWjjKFuuRJjB44=.68fe0cf1-fdd2-4c84-b5ab-dcffb3f705f0@github.com> > Hi All, > > This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. > > Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. > > New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Update test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23947/files - new: https://git.openjdk.org/jdk/pull/23947/files/f00634a4..c94779f5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=09-10 Stats: 364 lines in 1 file changed: 364 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23947/head:pull/23947 PR: https://git.openjdk.org/jdk/pull/23947 From jbhateja at openjdk.org Fri Jul 11 13:33:44 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 11 Jul 2025 13:33:44 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v8] In-Reply-To: References: Message-ID: On Tue, 3 Jun 2025 14:24:31 GMT, Emanuel Peter wrote: >> We can further constrain the value range bounds of bit compression and expansion once PR #17508 gets integrated. For now, I have developed the following draft demonstrates bound constraining with KnownBitLattice. 
>> >> >> // >> // Prototype of bit compress/expand value range computation >> // using KnownBits infrastructure. >> // >> >> #include >> #include >> #include >> #include >> >> template >> class KnownBitsLattice { >> private: >> U zeros; >> U ones; >> >> public: >> KnownBitsLattice(U lb, U ub); >> >> U getKnownZeros() { >> return zeros; >> } >> >> U getKnownOnes() { >> return ones; >> } >> >> long getKnownZerosCount() { >> uint64_t count = 0; >> asm volatile ("popcntq %1, %0 \n\t" : "=r"(count) : "r"(zeros) : "cc"); >> return count; >> } >> >> long getKnownOnesCount() { >> uint64_t count = 0; >> asm volatile ("popcntq %1, %0 \n\t" : "=r"(count) : "r"(ones) : "cc"); >> return count; >> } >> >> bool check_voilation() { >> // A given bit cannot be both zero or one. >> return (zeros & ones) != 0; >> } >> >> bool is_MSB_KnownOneBitsSet() { >> return (ones >> 63) == 1; >> } >> >> bool is_MSB_KnownZeroBitsSet() { >> return (zeros >> 63) == 1; >> } >> }; >> >> template >> KnownBitsLattice::KnownBitsLattice(U lb, U ub) { >> // To find KnownBitsLattice from a given value range >> // we first find the common prefix b/w upper and lower >> // bound, we then concertize known zeros and ones bit >> // based on common prefix. >> // e.g. >> // lb = 00110001 >> // ub = 00111111 >> // common prefix = 0011XXXX >> // knownbits.zeros = 11000000 >> // knownbits.ones = 00110000 >> // >> // conversely, for a give knownbits value we can find >> // lower and upper value ranges. >> // e.g. >> // knownbits.zeros = 0x00010001 >> // knownbits.ones = 0x10001100 >> // range.lo = knownbits.ones, this is because knownbits.ones are >> // guaranteed to be one. >> // range.hi = ~knownbits.zeros, this is an optimistic upper bound >> // which assumes all unset knownbits.zero >> // are ones. >> // Thus in above example, >> // range.lo = 0x8C >> // range.hi = 0xEE >> >> U lzcnt = 0; >> U common_prefix = lb ^ ub; >> asm volatile ("lzcntq %1, %0 \n\t" : "=r"(lzcnt) : "r"... > > @jatin-bhateja I think we are making progress, it seems to me now that the VM code is correct, at least as far as I can tell with visual inspection. > > We are still missing some additional tests, as I have asked for a few times already: > https://github.com/openjdk/jdk/pull/23947#issuecomment-2853896251 > > We should do something like this: > > public static test(int mask, int src) { > mask = Math.max(CON1, Math.min(CON2, mask)); > src = Math.max(CON2, Math.min(CON4, src)); > result = Integer.compress(src, mask); > int sum = 0; > if (sum > LIMIT_1) { sum += 1; } > if (sum > LIMIT_2) { sum += 2; } > if (sum > LIMIT_3) { sum += 4; } > if (sum > LIMIT_4) { sum += 8; } > if (sum > LIMIT_5) { sum += 16; } > if (sum > LIMIT_6) { sum += 32; } > if (sum > LIMIT_7) { sum += 64; } > if (sum > LIMIT_8) { sum += 128; } > return new int[] {sum, result}; > } > > > You could do the same pattern for `expand` too. > Then you pick random values using `Generators.java` for all the `CON` and `LIMIT`. > If we somehow produce a bad range, then the limit checks could constant fold wrongly, and then the `sum` would reflect this wrong result. Optimal would be to duplicate this pattern, and have one method that compiles, and one that runs in interpreter. That way, you can repeatedly call the methods with various `src` and `mask` values, and compare the output. Hi @eme64 , Updated the tests as per suggestion; however, for this bug fix patch, we are not doing aggressive value range optimization. 
I plan to extend value routines for compress/expand with the newly supported knownBits infrastructure in a subsequent RFE., Following is a prototype for the same. https://github.com/jatin-bhateja/external_staging/blob/main/Code/java/knownBits_DFA/bit_compress_expand_KnownBits.java Best Regards, Jatin ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3062355806 From epeter at openjdk.org Fri Jul 11 13:49:48 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 11 Jul 2025 13:49:48 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v11] In-Reply-To: <-TCK19ngXOwGp1EPss-clgnrwzy4-DWjjKFuuRJjB44=.68fe0cf1-fdd2-4c84-b5ab-dcffb3f705f0@github.com> References: <-TCK19ngXOwGp1EPss-clgnrwzy4-DWjjKFuuRJjB44=.68fe0cf1-fdd2-4c84-b5ab-dcffb3f705f0@github.com> Message-ID: <5XrDq3Z0bfinuBjxwRkt9vRknXPxSMXH8XH1ROz8YFQ=.13f50f28-01cc-4b05-98b5-c16c4544f3f0@github.com> On Fri, 11 Jul 2025 13:30:32 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. >> >> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. >> >> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Update test @jatin-bhateja Thanks for the updates! I'm going on vacation, so someone else will have to review this. I'll ping some people. test/hotspot/jtreg/compiler/c2/gvn/TestBitCompressValueTransform.java line 366: > 364: res += 800; > 365: } > 366: return res; Can you please use powers-of-2 instead? That way it cannot happen that one error masks out another. Imagine somehow we should only add `300` (I3), but instead add `100 + 200` (I1 and I2). ------------- PR Review: https://git.openjdk.org/jdk/pull/23947#pullrequestreview-3010398830 PR Review Comment: https://git.openjdk.org/jdk/pull/23947#discussion_r2200792513 From aph at openjdk.org Fri Jul 11 13:51:43 2025 From: aph at openjdk.org (Andrew Haley) Date: Fri, 11 Jul 2025 13:51:43 GMT Subject: RFR: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 [v10] In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 12:58:23 GMT, Evgeny Astigeevich wrote: >> Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. >> >> This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. >> >> Tested on Linux and MacOS with and without hsdis: >> - Fastdebug: test passed >> - Slowdebug: test passed. >> - Release: test passed. 
> > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Remove lambda Marked as reviewed by aph (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26072#pullrequestreview-3010407313 From jbhateja at openjdk.org Fri Jul 11 14:16:03 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 11 Jul 2025 14:16:03 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v12] In-Reply-To: References: Message-ID: > Hi All, > > This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. > > Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. > > New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: Update test ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23947/files - new: https://git.openjdk.org/jdk/pull/23947/files/c94779f5..c79efe09 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=10-11 Stats: 64 lines in 1 file changed: 0 ins; 0 del; 64 mod Patch: https://git.openjdk.org/jdk/pull/23947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23947/head:pull/23947 PR: https://git.openjdk.org/jdk/pull/23947 From jbhateja at openjdk.org Fri Jul 11 14:16:03 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 11 Jul 2025 14:16:03 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v11] In-Reply-To: <5XrDq3Z0bfinuBjxwRkt9vRknXPxSMXH8XH1ROz8YFQ=.13f50f28-01cc-4b05-98b5-c16c4544f3f0@github.com> References: <-TCK19ngXOwGp1EPss-clgnrwzy4-DWjjKFuuRJjB44=.68fe0cf1-fdd2-4c84-b5ab-dcffb3f705f0@github.com> <5XrDq3Z0bfinuBjxwRkt9vRknXPxSMXH8XH1ROz8YFQ=.13f50f28-01cc-4b05-98b5-c16c4544f3f0@github.com> Message-ID: On Fri, 11 Jul 2025 13:46:18 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Update test > > test/hotspot/jtreg/compiler/c2/gvn/TestBitCompressValueTransform.java line 366: > >> 364: res += 800; >> 365: } >> 366: return res; > > Can you please use powers-of-2 instead? That way it cannot happen that one error masks out another. > Imagine somehow we should only add `300` (I3), but instead add `100 + 200` (I1 and I2). Makes sense, thanks! 
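For reference, the shape being suggested looks roughly like this (a sketch only; the CON_* and LIMIT_* names stand for randomly generated constants, they are not the values used in the actual test):

    // Illustrative constants; in the real test these come from Generators.
    static final int CON_1 = -64, CON_2 = 1024, CON_3 = -128, CON_4 = 4096;
    static final int LIMIT_1 = 0, LIMIT_2 = 3, LIMIT_3 = 7, LIMIT_4 = 15;

    // Each branch adds a distinct power of two, so no combination of wrongly
    // taken branches can add up to the contribution of another branch
    // (e.g. 1 + 2 can never be mistaken for 4).
    public static int[] testCompress(int mask, int src) {
        mask = Math.max(CON_1, Math.min(CON_2, mask)); // clamp mask into [CON_1, CON_2]
        src  = Math.max(CON_3, Math.min(CON_4, src));  // clamp src into [CON_3, CON_4]
        int result = Integer.compress(src, mask);
        int flags = 0;
        if (result > LIMIT_1) { flags += 1; }
        if (result > LIMIT_2) { flags += 2; }
        if (result > LIMIT_3) { flags += 4; }
        if (result > LIMIT_4) { flags += 8; }
        return new int[] { flags, result };
    }

The same shape works for Integer/Long.expand, and the compiled result can be compared against an interpreter-only copy of the method for random src and mask inputs.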
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23947#discussion_r2200850492 From eastigeevich at openjdk.org Fri Jul 11 15:28:47 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Fri, 11 Jul 2025 15:28:47 GMT Subject: Integrated: 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 15:29:10 GMT, Evgeny Astigeevich wrote: > Test compiler/onSpinWait/TestOnSpinWaitAArch64.java needs debug info to identify a position of spin wait instructions in generated code. The test switched to use `XX:CompileCommand=print` instead of `XX:+PrintAssembly` to have assembly only for a tested Java method. In release builds `XX:+PrintAssembly` prints out debug info but `XX:CompileCommand=print` does not. > > This PR reimplements the test to parse instructions and to check them. The test does not rely on debug info anymore. > > Tested on Linux and MacOS with and without hsdis: > - Fastdebug: test passed > - Slowdebug: test passed. > - Release: test passed. This pull request has now been integrated. Changeset: a86dd56d Author: Evgeny Astigeevich URL: https://git.openjdk.org/jdk/commit/a86dd56de34f730b42593236f17118ef5ce4985a Stats: 128 lines in 3 files changed: 34 ins; 71 del; 23 mod 8360936: Test compiler/onSpinWait/TestOnSpinWaitAArch64.java fails after JDK-8359435 Reviewed-by: shade, aph ------------- PR: https://git.openjdk.org/jdk/pull/26072 From mchevalier at openjdk.org Fri Jul 11 17:01:53 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 11 Jul 2025 17:01:53 GMT Subject: RFR: 8361492: [IR Framework] Has too restrictive regex for load and store Message-ID: Improving store and load regexes + adding test. It's mostly an improve version of a fix I had to do in Valhalla where it was blocking (part of JDK-8361250, blocking JDK-8357785). The new regex takes into account that classes can implement interfaces, nested classes, and various labels after the `@`. It should be more robust. Thanks, Marc ------------- Commit messages: - Improve Load/Store regexes Changes: https://git.openjdk.org/jdk/pull/26269/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26269&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361492 Stats: 253 lines in 2 files changed: 249 ins; 2 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26269.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26269/head:pull/26269 PR: https://git.openjdk.org/jdk/pull/26269 From mchevalier at openjdk.org Fri Jul 11 17:01:53 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Fri, 11 Jul 2025 17:01:53 GMT Subject: RFR: 8361492: [IR Framework] Has too restrictive regex for load and store In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 16:56:16 GMT, Marc Chevalier wrote: > Improving store and load regexes + adding test. It's mostly an improve version of a fix I had to do in Valhalla where it was blocking (part of JDK-8361250, blocking JDK-8357785). > > The new regex takes into account that classes can implement interfaces, nested classes, and various labels after the `@`. It should be more robust. > > Thanks, > Marc test/hotspot/jtreg/compiler/lib/ir_framework/IRNode.java line 2980: > 2978: // @bla: bli:a/b/c$d$e (f/g,h/i/j):NotNull+24 * > 2979: private static final String LOAD_STORE_PREFIX = "@(\\w+: ?)*[\\w/\\$]*\\b"; > 2980: private static final String LOAD_STORE_SUFFIX = "( \\([^\\)]+\\))?(:|\\+)\\S* \\*"; I moved these definitions next to the only place they are used, and should be used. 
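As a quick standalone sanity check (not framework code), the concatenated prefix and suffix accept the example node dump from the comment above:

    import java.util.regex.Pattern;

    public class LoadStoreRegexCheck {
        static final String LOAD_STORE_PREFIX = "@(\\w+: ?)*[\\w/\\$]*\\b";
        static final String LOAD_STORE_SUFFIX = "( \\([^\\)]+\\))?(:|\\+)\\S* \\*";

        public static void main(String[] args) {
            String example = "@bla: bli:a/b/c$d$e (f/g,h/i/j):NotNull+24 *";
            Pattern p = Pattern.compile(LOAD_STORE_PREFIX + LOAD_STORE_SUFFIX);
            System.out.println(p.matcher(example).matches()); // expected: true
        }
    }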
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26269#discussion_r2201316286 From chagedorn at openjdk.org Fri Jul 11 17:17:39 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 11 Jul 2025 17:17:39 GMT Subject: RFR: 8361492: [IR Framework] Has too restrictive regex for load and store In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 16:56:16 GMT, Marc Chevalier wrote: > Improving store and load regexes + adding test. It's mostly an improve version of a fix I had to do in Valhalla where it was blocking (part of JDK-8361250, blocking JDK-8357785). > > The new regex takes into account that classes can implement interfaces, nested classes, and various labels after the `@`. It should be more robust. > > Thanks, > Marc Nice addition and good tests! Some code style nits in the test code but otherwise, it looks good to me. test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPhaseIRMatching.java line 562: > 560: int i; > 561: } > 562: interface I2{} Suggestion: interface I1 {} static class Base implements I1 { int i; } interface I2 {} test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPhaseIRMatching.java line 567: > 565: } > 566: Base Lb = new Base(); > 567: Derived Ld = new Derived(); Maybe give them a more descriptive name and make them lower case. Same below for `Ldn`. test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPhaseIRMatching.java line 569: > 567: Derived Ld = new Derived(); > 568: > 569: static class SingleNest{ Suggestion: static class SingleNest { ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26269#pullrequestreview-3011251296 PR Review Comment: https://git.openjdk.org/jdk/pull/26269#discussion_r2201357032 PR Review Comment: https://git.openjdk.org/jdk/pull/26269#discussion_r2201362629 PR Review Comment: https://git.openjdk.org/jdk/pull/26269#discussion_r2201357876 From coleenp at openjdk.org Fri Jul 11 17:53:38 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 11 Jul 2025 17:53:38 GMT Subject: RFR: 8361952: Installation of MethodData::extra_data_lock() misses synchronization on reader side In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 09:47:05 GMT, Thomas Schatzl wrote: > Hi all, > > please review this change that fixes some recently introduced atomic installation of a mutex, where the memory barrier (`load_acquire`) on the reader side. Without it the reader might get a valid pointer to the `Mutex` created on the fly, without it being initialized properly. > > Found during code inspection for https://bugs.openjdk.org/browse/JDK-8361706 ; due to some suspicious hangs in the `MutexLocker` while cleaning klasses during class unloading in parallel (multiple threads hanging in `MethodData::clean_method_data`), executing the `vmTestbase/nsk/jvmti/RedefineClasses/StressRedefine/TestDescription.java` test. > > Testing: gha > > Thanks, > Thomas This looks good. ------------- Marked as reviewed by coleenp (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26262#pullrequestreview-3011379256 From jrose at openjdk.org Fri Jul 11 18:00:44 2025 From: jrose at openjdk.org (John R Rose) Date: Fri, 11 Jul 2025 18:00:44 GMT Subject: RFR: 8327963: C2: fix construction of memory graph around Initialize node to prevent incorrect execution if allocation is removed [v11] In-Reply-To: References: <3jUFOPYDIqmzEywhzf58guwS0qZGBUCMZ3lXeltlS3c=.5c82601f-cf4d-4b2a-a525-1f8f4c7c4a3b@github.com> Message-ID: On Fri, 6 Jun 2025 09:01:46 GMT, Roland Westrelin wrote: >> An `Initialize` node for an `Allocate` node is created with a memory >> `Proj` of adr type raw memory. In order for stores to be captured, the >> memory state out of the allocation is a `MergeMem` with slices for the >> various object fields/array element set to the raw memory `Proj` of >> the `Initialize` node. If `Phi`s need to be created during later >> transformations from this memory state, The `Phi` for a particular >> slice gets its adr type from the type of the `Proj` which is raw >> memory. If during macro expansion, the `Allocate` is found to have no >> use and so can be removed, the `Proj` out of the `Initialize` is >> replaced by the memory state on input to the `Allocate`. A `Phi` for >> some slice for a field of an object will end up with the raw memory >> state on input to the `Allocate` node. As a result, memory state at >> the `Phi` is incorrect and incorrect execution can happen. >> >> The fix I propose is, rather than have a single `Proj` for the memory >> state out of the `Initialize` with adr type raw memory, to use one >> `Proj` per slice added to the memory state after the `Initalize`. Each >> of the `Proj` should return the right adr type for its slice. For that >> I propose having a new type of `Proj`: `NarrowMemProj` that captures >> the right adr type. >> >> Logic for the construction of the `Allocate`/`Initialize` subgraph is >> tweaked so the right adr type captured in is own `NarrowMemProj` is >> added to the memory sugraph. Code that removes an allocation or moves >> it also has to be changed so it correctly takes the multiple memory >> projections out of the `Initialize` node into account. >> >> One tricky issue is that when EA split types for a scalar replaceable >> `Allocate` node: >> >> 1- the adr type captured in the `NarrowMemProj` becomes out of sync >> with the type of the slices for the allocation >> >> 2- before EA, the memory state for one particular field out of the >> `Initialize` node can be used for a `Store` to the just allocated >> object or some other. So we can have a chain of `Store`s, some to >> the newly allocated object, some to some other objects, all of them >> using the state of `NarrowMemProj` out of the `Initialize`. After >> split unique types, the `NarrowMemProj` is for the slice of a >> particular allocation. So `Store`s to some other objects shouldn't >> use that memory state but the memory state before the `Allocate`. >> >> For that, I added logic to update the adr type of `NarrowMemProj` >> during split uni... > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > more > > I still think it would be good to include test cases to confirm that these are not only theoretical concerns, but that should not block the progress of this PR. > > Here is a test case ? where all the damage is done early on when EA runs. A pass of loop opts before EA fully unrolls the loop and creates memory `Phi`s with incorrect `adr_type` (raw memory). 
Then EA removes the allocation. All that keeps the `Store` to `field1` alive then is uncommon traps from template predicates. Once they are removed, the `Store` goes away (first round of loop opts after EA). > > I'll add that test case to the PR. I think the moral of this story is: Any single compiler optimization must never depend on "always going first". If unrelated IR transforms do not commute, something will eventually go wrong, when some unpredictable source code requires (or requests) the fragile, non-commutative optimizations to happen in the "wrong order". Roland, I deeply appreciate your previous comment about fixing root causes, and avoiding shiny workarounds that seem to make the bug go away. (Often the workarounds attempt to restrict the free commutation/reordering of optimizations, by adding contextual checks like extra pattern matching or phase sensistivity.) Such workarounds "seem to work" but usually in the end "fail to work" as well. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24570#issuecomment-3063227169 From iveresov at openjdk.org Fri Jul 11 18:09:59 2025 From: iveresov at openjdk.org (Igor Veresov) Date: Fri, 11 Jul 2025 18:09:59 GMT Subject: RFR: 8358580: Rethink how classes are kept alive in training data [v2] In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 17:10:22 GMT, Igor Veresov wrote: >> Use OopStorage directly instead of JNI handles. Note that we never destroy TrainingData objects, so we don't need to concern ourselves with freeing the OopStorage entries. Also, keeping the klasses alive is only necessary during the training run. During the replay the klasses TD objects refer to are always alive. > > Igor Veresov has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments Thanks for the reviews Coleen and Alexey! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26233#issuecomment-3063249658 From iveresov at openjdk.org Fri Jul 11 18:09:59 2025 From: iveresov at openjdk.org (Igor Veresov) Date: Fri, 11 Jul 2025 18:09:59 GMT Subject: Integrated: 8358580: Rethink how classes are kept alive in training data In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 02:37:05 GMT, Igor Veresov wrote: > Use OopStorage directly instead of JNI handles. Note that we never destroy TrainingData objects, so we don't need to concern ourselves with freeing the OopStorage entries. Also, keeping the klasses alive is only necessary during the training run. During the replay the klasses TD objects refer to are always alive. This pull request has now been integrated. 
Changeset: 59bec29c Author: Igor Veresov URL: https://git.openjdk.org/jdk/commit/59bec29c35361b7b256a2d435ced3458b0c5ea58 Stats: 7 lines in 2 files changed: 1 ins; 3 del; 3 mod 8358580: Rethink how classes are kept alive in training data Reviewed-by: coleenp, shade ------------- PR: https://git.openjdk.org/jdk/pull/26233 From jrose at openjdk.org Fri Jul 11 18:22:50 2025 From: jrose at openjdk.org (John R Rose) Date: Fri, 11 Jul 2025 18:22:50 GMT Subject: RFR: 8327963: C2: fix construction of memory graph around Initialize node to prevent incorrect execution if allocation is removed [v8] In-Reply-To: References: <3jUFOPYDIqmzEywhzf58guwS0qZGBUCMZ3lXeltlS3c=.5c82601f-cf4d-4b2a-a525-1f8f4c7c4a3b@github.com> <1gdeBnZ7YuIf9CgQW2bCXkDDBWPjUgRnickHts-fvzE=.e6e901ba-3e9f-41a2-9c68-167a879e9655@github.com> <2m1_XtiSsW_LaBRrkX4qv7AKtLOjNgnl4mUp3zisasE=.dda62164-7aa0-4c1a-b83f-fa40ba7902e5@github.com> <4374L3lkQK90wLxxOA7POBmIKNX2DFK-4pO4vj1bkuQ=.5b8d7825-a7f1-497f-ab66-02a85a266659@github.com> Message-ID: On Thu, 12 Jun 2025 15:39:35 GMT, Roland Westrelin wrote: > I think it would be good (although not necessarily in the context of this PR) to establish the "no duplicate memory projection" invariant in the back-end, for sanity and to make sure we do not break any logic that might be implicitly relying on it. If you agree, could you file a follow-up RFE, ideally with a reproducer where the current logic fails to remove `NarrowMemProj`s? I see this as a request for a better "normal form" for the graph. The trick here is that, if we are allowing temporary "abnormal" forms of the graph, in order to give various transforms some "working room" to rearrange things, we need to decide when are the moments when the graph must be settled back down into a normal form. We sometimes check for some kinds of IR normality, and/or enforce some normality, in the "final graph reshape" phase. The problem with loading up too many ad hoc operations at that point is, it may create a completely new kind of graph with new invariants. (Don't like the current standard? Create a new one, and see how that goes! Same for global IR contracts.) Having two kinds of IR with two sets of invariants (one set more restrictive) has an obvious objection: We fragment our ability to enforce the rules; we need to write enforcement logic which says "which phase are we in?" before checking the right set of rules. And if the editing sessions are rare, we don't get much benefit from the rules that are enforced by that editing session. By definition "final graph reshape" is rare. It's worth it since we are going to a lower IR, which really must have different rules, but it's not a light thing to add to the design. In any case, adding a normalization requirement seems to need a "wash pass" of some sort over the whole graph, to do necessary cleanups. We do this sometimes, I think, after loop opts or EA, maybe other places, and at "final graph reshape". This is going to be a runtime expense, I think, unless it can be piggybacked on some other pass we already do. Maybe a hallmark of these "post-operative" cleanups is that the operation itself required some side data structure, created just for the operation (loop nest or connection graph) and discarded later in order to unleash unconstrained downstream transforms. During the operation, transforms are specialized just to keep the side data structure relevant. Afterwards, the graph "opens up" to unconstrained changes. 
But in all cases, local updates should be as free as possible, even if their order varies randomly due to worklist artifacts, etc. (BTW, this is why stress tests on worklist order are valuable.) I'm not advocating firmly for or against new normalizations, but here's a final thought to throw in: Performing normalizations seems to distrust an important design decision, noted in my previous comment. That is, IR transformations should be "confluent" or "commutative"; it should not matter in what order you perform the transforms; you should still get to a better program, with identical user-visible semantics, whichever order you apply the transforms. (Worklist stress tests, again?) Obviously we globally schedule our tactics at the top level, but (in a deep sense) it should rarely matter what order we schedule things, at least as far as correctness goes. And down in the details, at a local level, it really should not matter what form of graph you are working with. Specifically, if we are using narrow memory projections sometimes, we should be prepared to respect them always. (Except, perhaps, at very well defined global cut-points, like final graph reshape, or a comprehensive cleanup after episodes of loop opts or EA.) It's been a while since I coded C2 stuff as my day job, but HTH. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24570#issuecomment-3063284364 From mhaessig at openjdk.org Fri Jul 11 20:09:48 2025 From: mhaessig at openjdk.org (Manuel Hässig) Date: Fri, 11 Jul 2025 20:09:48 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v3] In-Reply-To: References: Message-ID: <9DKaxH7MifcYDT7qC4COJK4ggAUqZBaNInE5sgHuc8U=.36c78071-df29-4461-a924-0fb4f34443b0@github.com> On Tue, 8 Jul 2025 07:40:02 GMT, Hannes Greule wrote: >> Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. >> >> Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. >> >> I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. >> >> Please review. Thanks. > > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > re-add package, add methods to Run annotation That exception is indeed a deeper problem. After some investigation, I filed [JDK-8362046](https://bugs.openjdk.org/browse/JDK-8362046). Essentially, the code generation for the byte reverse unsigned short intrinsic on aarch64 does not narrow the result. So your test already paid dividends by exposing this long-standing bug. Thank you! Since this is a fix for a bug in JDK 25 and RDP2 is close, I would suggest that we comment out the offending line and merge it. But a Reviewer should sign off on that since this is my first rampdown. For those wondering why the signed short case is not commented out: code generation on aarch64 will sign extend the result. So far, I have not been able to make that case crash. If I was not creative enough in my endeavours to produce the same behaviour and there is a bug, the RNG will expose it at some point.
test/hotspot/jtreg/compiler/c2/gvn/ReverseBytesConstantsTests.java line 93: > 91: Asserts.assertEQ(Short.reverseBytes((short) 0x8070), testS3()); > 92: Asserts.assertEQ(Short.reverseBytes(C_SHORT), testS4()); > 93: Asserts.assertEQ(ReverseBytesConstantsHelper.reverseBytesShort(C_INT), testS5()); Suggestion: // TODO: uncomment after integration of JDK-8362046 // Asserts.assertEQ(ReverseBytesConstantsHelper.reverseBytesShort(C_INT), testS5()); ------------- Changes requested by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/25988#pullrequestreview-3011799731 PR Review Comment: https://git.openjdk.org/jdk/pull/25988#discussion_r2201726167 From duke at openjdk.org Fri Jul 11 20:33:49 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Fri, 11 Jul 2025 20:33:49 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 00:07:01 GMT, Dean Long wrote: > Why not have a new fix_relocation_after_xxx() that is platform-specific? For most platforms it can just delegate to fix_relocation_after_move() I think adding a new `fix_relocation_after_xxx()` might be a bit overkill, since every case except one would just delegate to `fix_relocation_after_move()` anyway. The trampoline handling logic already lives in the shared code `trampoline_stub_Relocation::fix_relocation_after_move`, which calls `pd_fix_owner_after_move` and handles the platform specific scenarios. Since this only comes up during nmethod relocation, I think it makes more sense to keep that logic within the nmethod relocation code itself. This change also keeps all existing code untouched so there isn't a concern about affecting `CodeBuffer::relocate_code_to()` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2201790815 From snatarajan at openjdk.org Fri Jul 11 22:24:54 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Fri, 11 Jul 2025 22:24:54 GMT Subject: RFR: 8353276: C2: simplify PhaseMacroExpand::opt_bits_test Message-ID: **Issue** The last three parameters of `PhaseMacroExpand::opt_bits_test(Node* ctrl, Node* region, int edge, Node* word, int mask, int bits, bool return_fast_path)` are unnecessary after the fix introduced in [JDK-8256425](https://bugs.openjdk.org/browse/JDK-8256425) **Fix** The proposed fix removes the last three parameters and makes the necessary modification to the methods. **Testing** GitHub Actions tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. ------------- Commit messages: - initial fix Changes: https://git.openjdk.org/jdk/pull/26276/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26276&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8353276 Stats: 15 lines in 2 files changed: 0 ins; 10 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/26276.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26276/head:pull/26276 PR: https://git.openjdk.org/jdk/pull/26276 From yadongwang at openjdk.org Sat Jul 12 06:17:49 2025 From: yadongwang at openjdk.org (Yadong Wang) Date: Sat, 12 Jul 2025 06:17:49 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 12:40:40 GMT, Aleksey Shipilev wrote: > It is such a beautiful bug to read about on Friday. > > So the net effect of this mismatch is that we miss oop relocation/record when `ConP` accidentally mismatches to card table base, did I get that right? 
> > > Yes, it maybe a better solution for jdk main line, because immPollPage was remove in https://bugs.openjdk.org/browse/JDK-8220051. But how about jdk8u backport? > > I think we should do these things separately: > 1. `immByteMapBase` rule removal in AArch64, this PR, then backport it to 25, 21, 17, maybe to 11, 8 > 2. `immByteMapBase` rule removal in RISC-V, separate PR, then backport it to 25, 21 > 3. `immPollPage` rule removal in AArch64, in 11u and 8u specific PRs, _IF_ we think that is a problem, which I don't think it is. > > The backports for (1) would not be clean, as Generational Shenandoah barrier checks would likely trigger technical conflicts in the code that is being removed. So there is doubly no point in going for clean backports, we should really slice them by the rule we are removing. Agree, very clear strategy. l'll just remove the bytemapbase rule in this pr and do enough tests. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3064755489 From hgreule at openjdk.org Sat Jul 12 08:16:25 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Sat, 12 Jul 2025 08:16:25 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v4] In-Reply-To: References: Message-ID: > Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. > > Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. > > I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. > > Please review. Thanks. Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: comment out Character cases ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25988/files - new: https://git.openjdk.org/jdk/pull/25988/files/f352726e..f8cc3496 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25988&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25988&range=02-03 Stats: 3 lines in 1 file changed: 1 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/25988.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25988/head:pull/25988 PR: https://git.openjdk.org/jdk/pull/25988 From hgreule at openjdk.org Sat Jul 12 08:16:25 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Sat, 12 Jul 2025 08:16:25 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v3] In-Reply-To: References: Message-ID: <5-Jz3kCmy4CfZonu0ra013G2R2OtG5P4k7tnpSzXTOE=.b7fb7b29-a7f2-4a55-8d9a-8bd5d315a8bd@github.com> On Tue, 8 Jul 2025 07:40:02 GMT, Hannes Greule wrote: >> Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. >> >> Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. >> >> I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. >> >> Please review. Thanks. 
> > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > re-add package, add methods to Run annotation Oh I see, that explains why I didn't run into that problem testing on my x86 machine :) The `Short.valueOf(short)` case doesn't crash because the method checks the lower bound too, it's just that `Character.valueOf(char)` assumes that the given value is >= 0 and therefore accesses the array if the value actually is < 0. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25988#issuecomment-3064923635 From hgreule at openjdk.org Sat Jul 12 08:16:25 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Sat, 12 Jul 2025 08:16:25 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v3] In-Reply-To: <9DKaxH7MifcYDT7qC4COJK4ggAUqZBaNInE5sgHuc8U=.36c78071-df29-4461-a924-0fb4f34443b0@github.com> References: <9DKaxH7MifcYDT7qC4COJK4ggAUqZBaNInE5sgHuc8U=.36c78071-df29-4461-a924-0fb4f34443b0@github.com> Message-ID: On Fri, 11 Jul 2025 19:54:29 GMT, Manuel H?ssig wrote: >> Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: >> >> re-add package, add methods to Run annotation > > test/hotspot/jtreg/compiler/c2/gvn/ReverseBytesConstantsTests.java line 93: > >> 91: Asserts.assertEQ(Short.reverseBytes((short) 0x8070), testS3()); >> 92: Asserts.assertEQ(Short.reverseBytes(C_SHORT), testS4()); >> 93: Asserts.assertEQ(ReverseBytesConstantsHelper.reverseBytesShort(C_INT), testS5()); > > Suggestion: > > // TODO: uncomment after integration of JDK-8362046 > // Asserts.assertEQ(ReverseBytesConstantsHelper.reverseBytesShort(C_INT), testS5()); I assume you mean the Character case here instead. I commented it out for now, but I agree a reviewer should acknowledge that this is the right way to go. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25988#discussion_r2202431063 From fjiang at openjdk.org Sat Jul 12 08:33:39 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Sat, 12 Jul 2025 08:33:39 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v2] In-Reply-To: References: Message-ID: <0LblAxbmz6kZ08-0YuE91e4AnpxQe-XkPF43tI4Do8s=.3d69b111-e666-4aba-9128-e671fd013215@github.com> On Mon, 7 Jul 2025 03:05:25 GMT, Dingli Zhang wrote: >> Hi, please consider this code cleanup change for native call. >> >> This removes the address parameter for NativeCall::reloc_destination and NativeFarCall::reloc_destination. >> This also removes several unnecessary code blob related runtime checks turning them into assertions. >> >> ### Testing >> * [x] hs:tier1 - hs:tier3 tested with linux-riscv64 fastdebug build > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Remove outdated comments src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 104: > 102: > 103: CodeBlob *code = CodeCache::find_blob(call_addr); > 104: assert(code != nullptr && code->is_nmethod(), "nmethod expected"); The `if (code != nullptr && code->is_nmethod())` statement indicates that `code->is_nmethod()` may not always be true. Do you think we should use assert here? Looks like the new assert logic is inconsistent with the original code. 
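For comparison, keeping the runtime check rather than asserting would look roughly like this (a sketch only; the enclosing function and its failure value are assumed here, not taken from the patch):

    // Bail out when the blob at call_addr is absent or is not an nmethod,
    // instead of asserting that it always is one.
    CodeBlob* code = CodeCache::find_blob(call_addr);
    if (code == nullptr || !code->is_nmethod()) {
      return nullptr; // assumed failure value for this sketch
    }
    address stub_addr = trampoline_stub_Relocation::get_trampoline_for(call_addr, code->as_nmethod());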
src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 105: > 103: CodeBlob *code = CodeCache::find_blob(call_addr); > 104: assert(code != nullptr && code->is_nmethod(), "nmethod expected"); > 105: nmethod* nm = code->as_nmethod(); `nm` is only used for `get_trampoline_for(call_addr, nm)`, maybe we can just use `code->as_nmethod()` directly instead of making a new variable. src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 156: > 154: if (code->is_nmethod()) { > 155: nmethod* nm = code->as_nmethod(); > 156: stub_addr = trampoline_stub_Relocation::get_trampoline_for(call_addr, nm); Same here Suggestion: stub_addr = trampoline_stub_Relocation::get_trampoline_for(call_addr, code->as_nmethod()); src/hotspot/cpu/riscv/relocInfo_riscv.cpp line 76: > 74: address Relocation::pd_call_destination(address orig_addr) { > 75: assert(is_call(), "should be a call here"); > 76: if (orig_addr == nullptr) { IIUC, it is synchronized with [JDK-8321509](https://bugs.openjdk.org/browse/JDK-8321509)? Should we add `USE_TRAMPOLINE_STUB_FIX_OWNER` and the other stuff? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26150#discussion_r2202426556 PR Review Comment: https://git.openjdk.org/jdk/pull/26150#discussion_r2202428678 PR Review Comment: https://git.openjdk.org/jdk/pull/26150#discussion_r2202431146 PR Review Comment: https://git.openjdk.org/jdk/pull/26150#discussion_r2202440479 From fjiang at openjdk.org Sat Jul 12 08:44:40 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Sat, 12 Jul 2025 08:44:40 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v2] In-Reply-To: <0LblAxbmz6kZ08-0YuE91e4AnpxQe-XkPF43tI4Do8s=.3d69b111-e666-4aba-9128-e671fd013215@github.com> References: <0LblAxbmz6kZ08-0YuE91e4AnpxQe-XkPF43tI4Do8s=.3d69b111-e666-4aba-9128-e671fd013215@github.com> Message-ID: <0chby2OGAaOlnuBaNhQ1KapoDWyqm9dUgd7sQMJIWY4=.7f97c489-9053-4f01-a8ad-9bfce008d797@github.com> On Sat, 12 Jul 2025 08:24:45 GMT, Feilong Jiang wrote: >> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove outdated comments > > src/hotspot/cpu/riscv/relocInfo_riscv.cpp line 76: > >> 74: address Relocation::pd_call_destination(address orig_addr) { >> 75: assert(is_call(), "should be a call here"); >> 76: if (orig_addr == nullptr) { > > IIUC, it is synchronized with [JDK-8321509](https://bugs.openjdk.org/browse/JDK-8321509)? Should we add `USE_TRAMPOLINE_STUB_FIX_OWNER` and the other stuff? [JDK-8343430](https://bugs.openjdk.org/browse/JDK-8343430) already removed the trampoline call for RISCV, so we are good, right? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26150#discussion_r2202449018 From yadongwang at openjdk.org Sun Jul 13 08:40:45 2025 From: yadongwang at openjdk.org (Yadong Wang) Date: Sun, 13 Jul 2025 08:40:45 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: > The bug is that the predicate rule of immByteMapBase would cause a ConP Node for oop incorrect matching with byte_map_base when the placeholder jni handle address was just allocated to the address of byte_map_base. > > C2 uses JNI handles as placeholders to encoding constant oops, and one of some handle maybe locate at the address of byte_map_base, which is not memory reserved by CardTable. It's possible because JNIHandleBlocks are allocated by malloc. 
> > // The assembler store_check code will do an unsigned shift of the oop, > // then add it to _byte_map_base, i.e. > // > // _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift) > _byte_map = (CardValue*) rs.base(); > _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift); > > In aarch64 port, C2 will incorrectly match ConP for oop to ConP for byte_map_base by the immByteMapBase operand. > > // Card Table Byte Map Base > operand immByteMapBase() > %{ > // Get base of card map > predicate((jbyte*)n->get_ptr() == > ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); > match(ConP); > > op_cost(0); > format %{ %} > interface(CONST_INTER); > %} > > // Load Byte Map Base Constant > instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) > %{ > match(Set dst con); > > ins_cost(INSN_COST); > format %{ "adr $dst, $con\t# Byte Map Base" %} > > ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); > > ins_pipe(ialu_imm); > %} > > As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: > 0xffff25caf08c: ldaxr x8, [x11] > 0xffff25caf090: cmp x10, x8 > 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any > 0xffff25caf098: stlxr w8, x28, [x11] > 0xffff25caf09c: cbnz w8, 0xffff25caf08c > 0xffff25caf0a0: orr x11, xzr, #0x3 > 0xffff25caf0a4: str x11, [x13] > 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none > 0xffff25caf0ac: str x14, [sp] > 0xffff25caf0b0: add x2, sp, #0x20 > 0xffff25caf0b4: adrp x1, 0xffff21730000 > 0xffff25caf0b8: bl 0xffff256fffc0 > 0xffff25caf0bc: ldr x14, [sp] > 0xffff25caf0c0: b 0xffff25caef80 > 0xffff25caf0c4: add x13, sp, #0x20 > 0xffff25caf0c8: adrp x12, 0xffff21730000 > 0xffff25caf0cc: ldr x10, [x13] > 0xffff25caf0d0: cmp x10, xzr > 0xffff25caf0d4: b.eq 0xffff25caf130 // b.none > 0xffff25caf0d8: ldr x11, [x12] > 0xffff25caf0dc: tbnz w10, #1, 0xffff25caf0f... Yadong Wang has updated the pull request incrementally with one additional commit since the last revision: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26249/files - new: https://git.openjdk.org/jdk/pull/26249/files/5b6b5859..e22b34ad Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26249&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26249&range=00-01 Stats: 33 lines in 1 file changed: 0 ins; 33 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26249.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26249/head:pull/26249 PR: https://git.openjdk.org/jdk/pull/26249 From aph at openjdk.org Sun Jul 13 09:34:47 2025 From: aph at openjdk.org (Andrew Haley) Date: Sun, 13 Jul 2025 09:34:47 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: References: <73AnlXOv0T8K25DgsNdH1PkBjcBXz0f3bBYZx44LpAw=.439f5383-ffd1-44e8-9e11-4b5af9b6a278@github.com> <3f1UnDuYp2iYVcciKF-BqdChOOY2PJJG5R0QuyfblVM=.37a92dfd-5de4-4924-83c5-f9c2e5d7548c@github.com> Message-ID: <85Fw_Bg0OrMd_LYl4PG_VFqFX2QTdcUK-DFOAxzyjIM=.bdbb0c13-f7a3-458f-a61a-004c6eadc1cc@github.com> On Wed, 2 Jul 2025 20:45:59 GMT, Chad Rakoczy wrote: >> If fixing call sites fails (like in the event of a missing trampoline) an assert will fail and the JVM will crash. 
I suppose it could be updated to abandon the relocation if that happens but that would require `fix_relocation_after_move` to return if it succeeded and proper handling by the caller. > > This is only an issue because Hotspot reduces the branch range for debug builds on aarch64 and Graal doesn't. If we're going to handle this case I think we should fail fast but it does raise the question of what should actually be done in this situation > If fixing call sites fails (like in the event of a missing trampoline) an assert will fail and the JVM will crash. In what circumstances would a trampoline be missing? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2203261856 From aph at openjdk.org Sun Jul 13 09:39:56 2025 From: aph at openjdk.org (Andrew Haley) Date: Sun, 13 Jul 2025 09:39:56 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v35] In-Reply-To: References: Message-ID: On Tue, 8 Jul 2025 20:03:17 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 95 commits: > > - Typo > - Merge branch 'master' into JDK-8316694-Final > - Update justification for skipping CallRelocation > - Enclose ImmutableDataReferencesCounterSize in parentheses > - Let trampolines fix their owners > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Update how call sites are fixed > - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final > - Fix pointer printing > - Use set_destination_mt_safe > - ... and 85 more: https://git.openjdk.org/jdk/compare/117f0b40...66d73c16 src/hotspot/share/code/nmethod.cpp line 1392: > 1390: > 1391: > 1392: nmethod::nmethod(nmethod* nm) : CodeBlob(nm->_name, nm->_kind, nm->_size, nm->_header_size) Should this be a copy constructor? nmethod::nmethod(const nmethod &nm) : CodeBlob(nm._name, nm._kind, nm._size, nm._header_size) Even if we don't make it a copy constructor, it looks like its nmethod argument should be `const`, but I haven't checked very deeply. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2203266153 From jkarthikeyan at openjdk.org Sun Jul 13 21:30:52 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Sun, 13 Jul 2025 21:30:52 GMT Subject: RFR: 8350177: C2 SuperWord: Integer.numberOfLeadingZeros, numberOfTrailingZeros, reverse and bitCount have input types wrongly truncated for byte and short [v9] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 10:05:48 GMT, Tobias Hartmann wrote: >> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: >> >> Explicit nullptr checks > > All tests passed. @TobiHartmann @eme64 Thanks a lot for the testing and re-reviews! I've fixed the PR title and will integrate it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25440#issuecomment-3067310856 From jkarthikeyan at openjdk.org Sun Jul 13 21:30:53 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Sun, 13 Jul 2025 21:30:53 GMT Subject: Integrated: 8350177: C2 SuperWord: Integer.numberOfLeadingZeros, numberOfTrailingZeros, reverse and bitCount have input types wrongly truncated for byte and short In-Reply-To: References: Message-ID: On Mon, 26 May 2025 07:15:31 GMT, Jasmine Karthikeyan wrote: > Hi all, > This patch fixes cases in SuperWord when compiling subword types where vectorized code would be given a narrower type than expected, leading to miscompilation due to truncation. This fix is a generalization of the same fix applied for `Integer.reverseBytes` in [JDK-8305324](https://bugs.openjdk.org/browse/JDK-8305324). The patch introduces a check for nodes that are known to tolerate truncation, so that any future cases of subword truncation will avoid creating miscompiled code. > > The patch reuses the existing logic to set the type of the vectors to int, which currently disables vectorization for the affected patterns entirely. Once [JDK-8342095](https://bugs.openjdk.org/browse/JDK-8342095) is merged and automatic casting support is added the autovectorizer should automatically insert casts to and from int, maintaining correctness. > > I've added an IR test that checks for correctly compiled outputs. Thoughts and reviews would be appreciated! This pull request has now been integrated. Changeset: 77bd417c Author: Jasmine Karthikeyan URL: https://git.openjdk.org/jdk/commit/77bd417c9990f57525257d9df89b9df4d7991461 Stats: 464 lines in 2 files changed: 460 ins; 0 del; 4 mod 8350177: C2 SuperWord: Integer.numberOfLeadingZeros, numberOfTrailingZeros, reverse and bitCount have input types wrongly truncated for byte and short Reviewed-by: epeter, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/25440 From xgong at openjdk.org Mon Jul 14 02:12:41 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 14 Jul 2025 02:12:41 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 07:04:44 GMT, Xiaohong Gong wrote: > This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform. > > ### Background > Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). 
Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register. > > ### Implementation > > #### Challenges > Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints. > > For a 512-bit SVE machine, loading a `byte` vector with different vector species require different approaches: > - SPECIES_64: Single operation with mask (8 elements, 256-bit) > - SPECIES_128: Single operation, full register (16 elements, 512-bit) > - SPECIES_256: Two operations + merge (32 elements, 1024-bit) > - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit) > > Use `ByteVector.SPECIES_512` as an example: > - It contains 64 elements. So the index vector size should be `64 * 32` bits, which is 4 times of the SVE vector register size. > - It requires 4 times of vector gather-loads to finish the whole operation. > > > byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...] > int[] idx = [0, 1, 2, 3, ..., 63, ...] > > 4 gather-load: > idx_v1 = [15 14 13 ... 1 0] gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa] > idx_v2 = [31 30 29 ... 17 16] gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb] > idx_v3 = [47 46 45 ... 33 32] gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc] > idx_v4 = [63 62 61 ... 49 48] gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd] > merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa] > > > #### Solution > The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end. > > Here is the main changes: > - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher. > - Added `VectorSliceNode` for result merging. > - Added `VectorMaskWidenNode` for mask spliting and type conversion fo... Hi @Bhavana-Kilambi, @fg1417, could you please help take a look at this PR? BTW, since the vector register size of my SVE machine is 128-bit, could you please help test the correctness on a SVE machine with larger vector size (e.g. 512-bit vector size)? Thanks a lot in advance! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3067549711 From dholmes at openjdk.org Mon Jul 14 03:46:40 2025 From: dholmes at openjdk.org (David Holmes) Date: Mon, 14 Jul 2025 03:46:40 GMT Subject: RFR: 8361952: Installation of MethodData::extra_data_lock() misses synchronization on reader side In-Reply-To: References: Message-ID: <17de-tEa5r8YpcWhJukEJUvVDjIIXuANurWSfa7m3UQ=.f9745287-4454-4186-8382-0392e328c490@github.com> On Fri, 11 Jul 2025 09:47:05 GMT, Thomas Schatzl wrote: > Hi all, > > please review this change that fixes some recently introduced atomic installation of a mutex, where the memory barrier (`load_acquire`) on the reader side. Without it the reader might get a valid pointer to the `Mutex` created on the fly, without it being initialized properly. > > Found during code inspection for https://bugs.openjdk.org/browse/JDK-8361706 ; due to some suspicious hangs in the `MutexLocker` while cleaning klasses during class unloading in parallel (multiple threads hanging in `MethodData::clean_method_data`), executing the `vmTestbase/nsk/jvmti/RedefineClasses/StressRedefine/TestDescription.java` test. 
> > Testing: gha > > Thanks, > Thomas Marked as reviewed by dholmes (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26262#pullrequestreview-3014697109 From dzhang at openjdk.org Mon Jul 14 03:49:23 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 14 Jul 2025 03:49:23 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v3] In-Reply-To: References: Message-ID: > Hi, please consider this code cleanup change for native call. > > This removes the address parameter for NativeCall::reloc_destination and NativeFarCall::reloc_destination. > This also removes several unnecessary code blob related runtime checks turning them into assertions. > > ### Testing > * [x] hs:tier1 - hs:tier3 tested with linux-riscv64 fastdebug build Dingli Zhang has updated the pull request incrementally with two additional commits since the last revision: - Remove assert_lock for riscv - Use code->as_nmethod() directly in nativeInst_riscv.cpp ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26150/files - new: https://git.openjdk.org/jdk/pull/26150/files/d7ff8e53..da45cf52 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26150&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26150&range=01-02 Stats: 11 lines in 2 files changed: 0 ins; 3 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/26150.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26150/head:pull/26150 PR: https://git.openjdk.org/jdk/pull/26150 From dzhang at openjdk.org Mon Jul 14 03:52:39 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 14 Jul 2025 03:52:39 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v2] In-Reply-To: <0LblAxbmz6kZ08-0YuE91e4AnpxQe-XkPF43tI4Do8s=.3d69b111-e666-4aba-9128-e671fd013215@github.com> References: <0LblAxbmz6kZ08-0YuE91e4AnpxQe-XkPF43tI4Do8s=.3d69b111-e666-4aba-9128-e671fd013215@github.com> Message-ID: On Sat, 12 Jul 2025 08:13:00 GMT, Feilong Jiang wrote: >> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove outdated comments > > src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 156: > >> 154: if (code->is_nmethod()) { >> 155: nmethod* nm = code->as_nmethod(); >> 156: stub_addr = trampoline_stub_Relocation::get_trampoline_for(call_addr, nm); > > Same here > Suggestion: > > stub_addr = trampoline_stub_Relocation::get_trampoline_for(call_addr, code->as_nmethod()); Fixed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26150#discussion_r2203797452 From dzhang at openjdk.org Mon Jul 14 04:27:46 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 14 Jul 2025 04:27:46 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v2] In-Reply-To: <0chby2OGAaOlnuBaNhQ1KapoDWyqm9dUgd7sQMJIWY4=.7f97c489-9053-4f01-a8ad-9bfce008d797@github.com> References: <0LblAxbmz6kZ08-0YuE91e4AnpxQe-XkPF43tI4Do8s=.3d69b111-e666-4aba-9128-e671fd013215@github.com> <0chby2OGAaOlnuBaNhQ1KapoDWyqm9dUgd7sQMJIWY4=.7f97c489-9053-4f01-a8ad-9bfce008d797@github.com> Message-ID: On Sat, 12 Jul 2025 08:41:52 GMT, Feilong Jiang wrote: >> src/hotspot/cpu/riscv/relocInfo_riscv.cpp line 76: >> >>> 74: address Relocation::pd_call_destination(address orig_addr) { >>> 75: assert(is_call(), "should be a call here"); >>> 76: if (orig_addr == nullptr) { >> >> IIUC, it is synchronized with [JDK-8321509](https://bugs.openjdk.org/browse/JDK-8321509)? Should we add `USE_TRAMPOLINE_STUB_FIX_OWNER` and the other stuff? 
> > [JDK-8343430](https://bugs.openjdk.org/browse/JDK-8343430) already removed the trampoline call for RISCV, so we are good, right? Correct, it is partially synchronized with [JDK-8321509](https://bugs.openjdk.org/browse/JDK-8321509), mainly 4 and 6 [here](https://github.com/openjdk/jdk/pull/19796#issuecomment-2188094970). We do not really need `USE_TRAMPOLINE_STUB_FIX_OWNER` after [JDK-8343430](https://bugs.openjdk.org/browse/JDK-8343430). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26150#discussion_r2203821966 From dzhang at openjdk.org Mon Jul 14 04:49:40 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 14 Jul 2025 04:49:40 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v2] In-Reply-To: <0LblAxbmz6kZ08-0YuE91e4AnpxQe-XkPF43tI4Do8s=.3d69b111-e666-4aba-9128-e671fd013215@github.com> References: <0LblAxbmz6kZ08-0YuE91e4AnpxQe-XkPF43tI4Do8s=.3d69b111-e666-4aba-9128-e671fd013215@github.com> Message-ID: On Sat, 12 Jul 2025 08:04:55 GMT, Feilong Jiang wrote: >> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> Remove outdated comments > > src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 104: > >> 102: >> 103: CodeBlob *code = CodeCache::find_blob(call_addr); >> 104: assert(code != nullptr && code->is_nmethod(), "nmethod expected"); > > The `if (code != nullptr && code->is_nmethod())` statement indicates that `code->is_nmethod()` may not always be true. Do you think we should use assert here? Looks like the new assert logic is inconsistent with the original code. Following up on 6 [here](https://github.com/openjdk/jdk/pull/19796#issuecomment-2188094970), we also added a fast path in `Relocation::pd_call_destination` , which for ARM64 has the same assertion in `destination()`: https://github.com/openjdk/jdk/blob/73e3e0edeb20c6f701b213423476f92fb05dd262/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp#L53-L64 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26150#discussion_r2203839770 From thartmann at openjdk.org Mon Jul 14 05:31:53 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 14 Jul 2025 05:31:53 GMT Subject: [jdk25] RFR: 8350177: C2 SuperWord: Integer.numberOfLeadingZeros, numberOfTrailingZeros, reverse and bitCount have input types wrongly truncated for byte and short Message-ID: Hi all, This pull request contains a backport of commit [77bd417c](https://github.com/openjdk/jdk/commit/77bd417c9990f57525257d9df89b9df4d7991461) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. The commit being backported was authored by Jasmine Karthikeyan on 13 Jul 2025 and was reviewed by Emanuel Peter and Tobias Hartmann. Thanks! ------------- Commit messages: - Backport 77bd417c9990f57525257d9df89b9df4d7991461 Changes: https://git.openjdk.org/jdk/pull/26286/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26286&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8350177 Stats: 464 lines in 2 files changed: 460 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/26286.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26286/head:pull/26286 PR: https://git.openjdk.org/jdk/pull/26286 From mchevalier at openjdk.org Mon Jul 14 06:16:32 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Mon, 14 Jul 2025 06:16:32 GMT Subject: RFR: 8361492: [IR Framework] Has too restrictive regex for load and store [v2] In-Reply-To: References: Message-ID: > Improving store and load regexes + adding test. 
It's mostly an improve version of a fix I had to do in Valhalla where it was blocking (part of JDK-8361250, blocking JDK-8357785). > > The new regex takes into account that classes can implement interfaces, nested classes, and various labels after the `@`. It should be more robust. > > Thanks, > Marc Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: Mostly add spaces and rename, a bit ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26269/files - new: https://git.openjdk.org/jdk/pull/26269/files/70c8f867..3dc823cb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26269&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26269&range=00-01 Stats: 13 lines in 1 file changed: 0 ins; 0 del; 13 mod Patch: https://git.openjdk.org/jdk/pull/26269.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26269/head:pull/26269 PR: https://git.openjdk.org/jdk/pull/26269 From mchevalier at openjdk.org Mon Jul 14 06:16:33 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Mon, 14 Jul 2025 06:16:33 GMT Subject: RFR: 8361492: [IR Framework] Has too restrictive regex for load and store In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 16:56:16 GMT, Marc Chevalier wrote: > Improving store and load regexes + adding test. It's mostly an improve version of a fix I had to do in Valhalla where it was blocking (part of JDK-8361250, blocking JDK-8357785). > > The new regex takes into account that classes can implement interfaces, nested classes, and various labels after the `@`. It should be more robust. > > Thanks, > Marc I've added all the spaces and renamed in a non-very original, but hopefully more explicit way. As explicit as a made up class for a test allows. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26269#issuecomment-3067965905 From fjiang at openjdk.org Mon Jul 14 06:54:40 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Mon, 14 Jul 2025 06:54:40 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v2] In-Reply-To: References: <0LblAxbmz6kZ08-0YuE91e4AnpxQe-XkPF43tI4Do8s=.3d69b111-e666-4aba-9128-e671fd013215@github.com> <0chby2OGAaOlnuBaNhQ1KapoDWyqm9dUgd7sQMJIWY4=.7f97c489-9053-4f01-a8ad-9bfce008d797@github.com> Message-ID: On Mon, 14 Jul 2025 04:24:58 GMT, Dingli Zhang wrote: > it is partially synchronized with [JDK-8321509](https://bugs.openjdk.org/browse/JDK-8321509), mainly 4 and 6 [here](https://github.com/openjdk/jdk/pull/19796#issuecomment-2188094970). As you mentioned in PR, this is just cleanup code. Do you think we should add this code in a separate PR? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26150#discussion_r2203984671 From dfenacci at openjdk.org Mon Jul 14 07:03:40 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Mon, 14 Jul 2025 07:03:40 GMT Subject: RFR: 8361492: [IR Framework] Has too restrictive regex for load and store [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 06:16:32 GMT, Marc Chevalier wrote: >> Improving store and load regexes + adding test. It's mostly an improve version of a fix I had to do in Valhalla where it was blocking (part of JDK-8361250, blocking JDK-8357785). >> >> The new regex takes into account that classes can implement interfaces, nested classes, and various labels after the `@`. It should be more robust. 
>> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Mostly add spaces > > and rename, a bit Well spotted @marc-chevalier and thanks for fixing it! LGTM ------------- Marked as reviewed by dfenacci (Committer). PR Review: https://git.openjdk.org/jdk/pull/26269#pullrequestreview-3014997859 From chagedorn at openjdk.org Mon Jul 14 07:03:39 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 14 Jul 2025 07:03:39 GMT Subject: [jdk25] RFR: 8350177: C2 SuperWord: Integer.numberOfLeadingZeros, numberOfTrailingZeros, reverse and bitCount have input types wrongly truncated for byte and short In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 05:26:35 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [77bd417c](https://github.com/openjdk/jdk/commit/77bd417c9990f57525257d9df89b9df4d7991461) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Jasmine Karthikeyan on 13 Jul 2025 and was reviewed by Emanuel Peter and Tobias Hartmann. > > Thanks! Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26286#pullrequestreview-3014998561 From thartmann at openjdk.org Mon Jul 14 07:08:39 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 14 Jul 2025 07:08:39 GMT Subject: [jdk25] RFR: 8350177: C2 SuperWord: Integer.numberOfLeadingZeros, numberOfTrailingZeros, reverse and bitCount have input types wrongly truncated for byte and short In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 05:26:35 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [77bd417c](https://github.com/openjdk/jdk/commit/77bd417c9990f57525257d9df89b9df4d7991461) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Jasmine Karthikeyan on 13 Jul 2025 and was reviewed by Emanuel Peter and Tobias Hartmann. > > Thanks! Thanks Christian! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26286#issuecomment-3068084628 From fyang at openjdk.org Mon Jul 14 07:14:38 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 14 Jul 2025 07:14:38 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v2] In-Reply-To: References: <0LblAxbmz6kZ08-0YuE91e4AnpxQe-XkPF43tI4Do8s=.3d69b111-e666-4aba-9128-e671fd013215@github.com> <0chby2OGAaOlnuBaNhQ1KapoDWyqm9dUgd7sQMJIWY4=.7f97c489-9053-4f01-a8ad-9bfce008d797@github.com> Message-ID: On Mon, 14 Jul 2025 06:52:18 GMT, Feilong Jiang wrote: >> Correct, it is partially synchronized with [JDK-8321509](https://bugs.openjdk.org/browse/JDK-8321509), mainly 4 and 6 [here](https://github.com/openjdk/jdk/pull/19796#issuecomment-2188094970). >> We do not really need `USE_TRAMPOLINE_STUB_FIX_OWNER` after [JDK-8343430](https://bugs.openjdk.org/browse/JDK-8343430). > >> it is partially synchronized with [JDK-8321509](https://bugs.openjdk.org/browse/JDK-8321509), mainly 4 and 6 [here](https://github.com/openjdk/jdk/pull/19796#issuecomment-2188094970). > > As you mentioned in PR, this is just cleanup. Should we add this code in a separate PR? I agree with @feilongjiang that we should keep this a cleanup change which shouldn't modify the original logic. 
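For readers skimming the thread, the two shapes being weighed above look roughly like this (a simplified sketch, not the exact RISC-V sources; `call_addr` is assumed to already point at the call instruction):

    // Shape kept by the cleanup: tolerate blobs that are not nmethods at runtime.
    address stub_addr = nullptr;
    CodeBlob* code = CodeCache::find_blob(call_addr);
    if (code != nullptr && code->is_nmethod()) {
      stub_addr = trampoline_stub_Relocation::get_trampoline_for(call_addr, code->as_nmethod());
    }

    // Shape that was reverted: treat a missing or non-nmethod blob as a
    // programming error, checked only in debug builds.
    CodeBlob* blob = CodeCache::find_blob(call_addr);
    assert(blob != nullptr && blob->is_nmethod(), "nmethod expected");
    address stub = trampoline_stub_Relocation::get_trampoline_for(call_addr, blob->as_nmethod());

Keeping the first form preserves the original runtime behaviour, which is why the assertion variant was dropped from this cleanup-only change.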
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26150#discussion_r2204016013 From dzhang at openjdk.org Mon Jul 14 07:28:33 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 14 Jul 2025 07:28:33 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v4] In-Reply-To: References: Message-ID: > Hi, please consider this code cleanup change for native call. > > This removes the address parameter for NativeCall::reloc_destination and NativeFarCall::reloc_destination. > This also removes several unnecessary code blob related runtime checks turning them into assertions. > > ### Testing > * [x] hs:tier1 - hs:tier3 tested with linux-riscv64 fastdebug build Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Revert changes not related to cleanup ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26150/files - new: https://git.openjdk.org/jdk/pull/26150/files/da45cf52..a04d7101 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26150&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26150&range=02-03 Stats: 13 lines in 2 files changed: 6 ins; 1 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/26150.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26150/head:pull/26150 PR: https://git.openjdk.org/jdk/pull/26150 From fyang at openjdk.org Mon Jul 14 07:28:33 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 14 Jul 2025 07:28:33 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v4] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 07:24:35 GMT, Dingli Zhang wrote: >> Hi, please consider this code cleanup change for native call. >> >> This removes the address parameter for NativeCall::reloc_destination and NativeFarCall::reloc_destination. >> This also removes several unnecessary code blob related runtime checks turning them into assertions. >> >> ### Testing >> * [x] hs:tier1 - hs:tier3 tested with linux-riscv64 fastdebug build > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Revert changes not related to cleanup Looks good. Thanks for the cleanup. ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26150#pullrequestreview-3015050495 From dzhang at openjdk.org Mon Jul 14 07:28:35 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 14 Jul 2025 07:28:35 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v2] In-Reply-To: References: <0LblAxbmz6kZ08-0YuE91e4AnpxQe-XkPF43tI4Do8s=.3d69b111-e666-4aba-9128-e671fd013215@github.com> Message-ID: On Mon, 14 Jul 2025 04:47:01 GMT, Dingli Zhang wrote: >> src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 104: >> >>> 102: >>> 103: CodeBlob *code = CodeCache::find_blob(call_addr); >>> 104: assert(code != nullptr && code->is_nmethod(), "nmethod expected"); >> >> The `if (code != nullptr && code->is_nmethod())` statement indicates that `code->is_nmethod()` may not always be true. Do you think we should use assert here? Looks like the new assert logic is inconsistent with the original code. 
> > Following up on 6 [here](https://github.com/openjdk/jdk/pull/19796#issuecomment-2188094970), we also added a fast path in `Relocation::pd_call_destination` , which for ARM64 has the same assertion in `destination()`: > https://github.com/openjdk/jdk/blob/73e3e0edeb20c6f701b213423476f92fb05dd262/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp#L53-L64 I will only keep the clean up part and revert the rest of the changes. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26150#discussion_r2204031907 From dzhang at openjdk.org Mon Jul 14 07:28:36 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 14 Jul 2025 07:28:36 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v2] In-Reply-To: References: <0LblAxbmz6kZ08-0YuE91e4AnpxQe-XkPF43tI4Do8s=.3d69b111-e666-4aba-9128-e671fd013215@github.com> <0chby2OGAaOlnuBaNhQ1KapoDWyqm9dUgd7sQMJIWY4=.7f97c489-9053-4f01-a8ad-9bfce008d797@github.com> Message-ID: On Mon, 14 Jul 2025 07:11:34 GMT, Fei Yang wrote: >>> it is partially synchronized with [JDK-8321509](https://bugs.openjdk.org/browse/JDK-8321509), mainly 4 and 6 [here](https://github.com/openjdk/jdk/pull/19796#issuecomment-2188094970). >> >> As you mentioned in PR, this is just cleanup. Should we add this code in a separate PR? > > I agree with @feilongjiang that we should keep this a cleanup change which shouldn't modify the original logic. Thanks all for the review! I will only keep the clean up part and revert the rest of the changes. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26150#discussion_r2204025695 From thartmann at openjdk.org Mon Jul 14 07:30:29 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 14 Jul 2025 07:30:29 GMT Subject: Integrated: 8362122: Problem list TestStressBailout until JDK-8361752 is fixed Message-ID: <8--uIYOUgtsBdjOokcvKQz1fQX-5KHbmfGMlSFtt9Q8=.4e866498-7144-4465-bb13-858fb30b2935@github.com> Let's problem list the test until [JDK-8361752](https://bugs.openjdk.org/browse/JDK-8361752) is fixed. The failure seems to be triggered by [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473). Thanks, Tobias ------------- Commit messages: - 8362122: Problem list TestStressBailout until JDK-8361752 is fixed Changes: https://git.openjdk.org/jdk/pull/26288/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26288&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362122 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26288.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26288/head:pull/26288 PR: https://git.openjdk.org/jdk/pull/26288 From chagedorn at openjdk.org Mon Jul 14 07:30:29 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 14 Jul 2025 07:30:29 GMT Subject: Integrated: 8362122: Problem list TestStressBailout until JDK-8361752 is fixed In-Reply-To: <8--uIYOUgtsBdjOokcvKQz1fQX-5KHbmfGMlSFtt9Q8=.4e866498-7144-4465-bb13-858fb30b2935@github.com> References: <8--uIYOUgtsBdjOokcvKQz1fQX-5KHbmfGMlSFtt9Q8=.4e866498-7144-4465-bb13-858fb30b2935@github.com> Message-ID: On Mon, 14 Jul 2025 07:18:44 GMT, Tobias Hartmann wrote: > Let's problem list the test until [JDK-8361752](https://bugs.openjdk.org/browse/JDK-8361752) is fixed. The failure seems to be triggered by [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473). > > Thanks, > Tobias Looks good and trivial. ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26288#pullrequestreview-3015042255 From thartmann at openjdk.org Mon Jul 14 07:30:29 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 14 Jul 2025 07:30:29 GMT Subject: Integrated: 8362122: Problem list TestStressBailout until JDK-8361752 is fixed In-Reply-To: <8--uIYOUgtsBdjOokcvKQz1fQX-5KHbmfGMlSFtt9Q8=.4e866498-7144-4465-bb13-858fb30b2935@github.com> References: <8--uIYOUgtsBdjOokcvKQz1fQX-5KHbmfGMlSFtt9Q8=.4e866498-7144-4465-bb13-858fb30b2935@github.com> Message-ID: On Mon, 14 Jul 2025 07:18:44 GMT, Tobias Hartmann wrote: > Let's problem list the test until [JDK-8361752](https://bugs.openjdk.org/browse/JDK-8361752) is fixed. The failure seems to be triggered by [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473). > > Thanks, > Tobias Thanks Christian! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26288#issuecomment-3068135063 From thartmann at openjdk.org Mon Jul 14 07:30:29 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 14 Jul 2025 07:30:29 GMT Subject: Integrated: 8362122: Problem list TestStressBailout until JDK-8361752 is fixed In-Reply-To: <8--uIYOUgtsBdjOokcvKQz1fQX-5KHbmfGMlSFtt9Q8=.4e866498-7144-4465-bb13-858fb30b2935@github.com> References: <8--uIYOUgtsBdjOokcvKQz1fQX-5KHbmfGMlSFtt9Q8=.4e866498-7144-4465-bb13-858fb30b2935@github.com> Message-ID: On Mon, 14 Jul 2025 07:18:44 GMT, Tobias Hartmann wrote: > Let's problem list the test until [JDK-8361752](https://bugs.openjdk.org/browse/JDK-8361752) is fixed. The failure seems to be triggered by [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473). > > Thanks, > Tobias This pull request has now been integrated. Changeset: 7c34bdf7 Author: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/7c34bdf73c063c9c1e1ebdc8e3a02ca3480175e1 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod 8362122: Problem list TestStressBailout until JDK-8361752 is fixed Reviewed-by: chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/26288 From thartmann at openjdk.org Mon Jul 14 07:34:46 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 14 Jul 2025 07:34:46 GMT Subject: [jdk25] Integrated: 8350177: C2 SuperWord: Integer.numberOfLeadingZeros, numberOfTrailingZeros, reverse and bitCount have input types wrongly truncated for byte and short In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 05:26:35 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [77bd417c](https://github.com/openjdk/jdk/commit/77bd417c9990f57525257d9df89b9df4d7991461) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Jasmine Karthikeyan on 13 Jul 2025 and was reviewed by Emanuel Peter and Tobias Hartmann. > > Thanks! This pull request has now been integrated. 
Changeset: dd82a092 Author: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/dd82a0922bdf7e3e99edab3246a2a7b5b1cb7bda Stats: 464 lines in 2 files changed: 460 ins; 0 del; 4 mod 8350177: C2 SuperWord: Integer.numberOfLeadingZeros, numberOfTrailingZeros, reverse and bitCount have input types wrongly truncated for byte and short Reviewed-by: chagedorn Backport-of: 77bd417c9990f57525257d9df89b9df4d7991461 ------------- PR: https://git.openjdk.org/jdk/pull/26286 From chagedorn at openjdk.org Mon Jul 14 07:41:45 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 14 Jul 2025 07:41:45 GMT Subject: RFR: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp [v4] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 00:37:10 GMT, Guanqiang Han wrote: >> When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. >> >> This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this flag is disabled. > > Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision: > > - Remove the unused variable > - Merge remote-tracking branch 'upstream/master' into 8361140 > - update regression test > - Merge remote-tracking branch 'upstream/master' into 8361140 > - update modification and add regression test > - Merge remote-tracking branch 'upstream/master' into 8361140 > - 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp > > When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. > > This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this support is disabled. Testing was clean ? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26125#issuecomment-3068177336 From duke at openjdk.org Mon Jul 14 07:41:46 2025 From: duke at openjdk.org (Guanqiang Han) Date: Mon, 14 Jul 2025 07:41:46 GMT Subject: Integrated: 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp In-Reply-To: References: Message-ID: On Fri, 4 Jul 2025 02:49:27 GMT, Guanqiang Han wrote: > When running with `-XX:-OptimizePtrCompare` (which disables pointer comparison optimization), the compiler may hit an assertion failure in debug builds because `optimize_ptr_compare` is still being called. This violates the intended usage of the flag and leads to unexpected crashes. > > This patch adds an early return to `reduce_phi_on_cmp` when `OptimizePtrCompare` is false. 
Since the optimization relies on `optimize_ptr_compare` for static reasoning about comparisons, there's no benefit in proceeding with `reduce_phi_on_cmp` when this flag is disabled. This pull request has now been integrated. Changeset: 14c79be1 Author: han gq Committer: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/14c79be1613c9d737a9536087ac48914ee4ba8d9 Stats: 110 lines in 3 files changed: 107 ins; 1 del; 2 mod 8361140: Missing OptimizePtrCompare check in ConnectionGraph::reduce_phi_on_cmp Reviewed-by: chagedorn, cslucas ------------- PR: https://git.openjdk.org/jdk/pull/26125 From chagedorn at openjdk.org Mon Jul 14 07:48:39 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 14 Jul 2025 07:48:39 GMT Subject: RFR: 8353276: C2: simplify PhaseMacroExpand::opt_bits_test In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 21:53:35 GMT, Saranya Natarajan wrote: > **Issue** > The last three parameters of `PhaseMacroExpand::opt_bits_test(Node* ctrl, Node* region, int edge, Node* word, int mask, int bits, bool return_fast_path)` are unnecessary after the fix introduced in [JDK-8256425](https://bugs.openjdk.org/browse/JDK-8256425) > > **Fix** > The proposed fix removes the last three parameters and makes the necessary modification to the methods. > > **Testing** > GitHub Actions > tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. Otherwise, it looks good, thanks for cleaning it up! src/hotspot/share/opto/macro.cpp line 98: > 96: Node* PhaseMacroExpand::opt_bits_test(Node* ctrl, Node* region, int edge, Node* word) { > 97: Node* cmp; > 98: cmp = word; Could now be merged (I cannot make a direct suggestion due to deleted lines): Node* cmp = word; ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26276#pullrequestreview-3015116536 PR Review Comment: https://git.openjdk.org/jdk/pull/26276#discussion_r2204080988 From bmaillard at openjdk.org Mon Jul 14 08:05:34 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Mon, 14 Jul 2025 08:05:34 GMT Subject: RFR: 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal [v4] In-Reply-To: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> References: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> Message-ID: > This PR adds a node hash comparison after calling `Ideal` in `PhaseIterGVN::verify_Ideal_for` to introduce an additional layer of verification for missed optimizations. Previously, we relied on the return value of `Ideal`, which is expected to be `nullptr` if no transformation was done. > > By also checking the node's hash before and after `Ideal`, we could catch inconsistencies in the implementation or unintended modifications to the graph. Both of these may indicate missed or incomplete optimizations. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361144) > - [x] tier1-3, plus some internal testing > > Thank you for reviewing! Beno?t Maillard has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains six additional commits since the last revision: - Merge branch 'master' into JDK-8361144 - 8361144: add comment for consistency with node count - 8361144: update comment Co-authored-by: Damon Fenacci - 8361144: remove unintentional line break - 8361144: move hash check after return value check and use same format as unique counter check - 8361144: add check for node hash after verifying ideal ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26064/files - new: https://git.openjdk.org/jdk/pull/26064/files/75f81296..8660d6ae Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26064&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26064&range=02-03 Stats: 18711 lines in 635 files changed: 10936 ins; 3423 del; 4352 mod Patch: https://git.openjdk.org/jdk/pull/26064.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26064/head:pull/26064 PR: https://git.openjdk.org/jdk/pull/26064 From jbhateja at openjdk.org Mon Jul 14 08:17:41 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 14 Jul 2025 08:17:41 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> References: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> Message-ID: On Tue, 8 Jul 2025 22:44:55 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to enhance the existing x86 assembly stubs using PUSH and POP instructions with paired PUSHP/POPP instructions which are part of Intel APX technology. >> >> In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0?R31). Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs. >> >> Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures. 
> > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > rename to paired_push and paired_pop src/hotspot/cpu/x86/gc/z/zBarrierSetAssembler_x86.cpp line 114: > 112: __ paired_push(rax); > 113: } > 114: __ paired_push(rcx); Hi @vamsi-parasa , for consecutive push/pop we can use push2/pop2 and 16byte alignment can be guaranteed using following technique https://github.com/openjdk/jdk/pull/25351/files#diff-d5d721ebf93346ba66e81257e4f6c5e6268d59774313c61e97353c0dfbf686a5R94 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2204155927 From jbhateja at openjdk.org Mon Jul 14 08:23:46 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 14 Jul 2025 08:23:46 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: References: Message-ID: On Wed, 2 Jul 2025 23:28:42 GMT, Srinivas Vamsi Parasa wrote: >> For a cleaner interface, I think we can also maintain a RAII style APXPushPopPairTracker in the stub snippets using push/pop instruction sequence and wrap the actual assembler call underneath. The idea here is to catch the balancing error upfront as PPX is purely a performance hint. Instructions with this hint have the same functional semantics as those without. PPX hints set by the compiler that violate the balancing rule may turn off the PPX >> optimization, but they will not affect program semantics.. >> >> >> class APXPushPopPairTracker { >> private: >> int _counter; >> >> public: >> APXPushPopPairTracker() _counter(0) { >> } >> >> ~APXPushPopPairTracker() { >> assert(_counter == 0, "Push/pop pair mismatch"); >> } >> >> void push(Register reg, bool has_matching_pop) { >> if (has_matching_pop && VM_Version::supports_apx_f()) { >> Assembler::pushp(reg); >> incrementCounter(); >> } else { >> Assembler::push(reg); >> } >> } >> void pop(Register reg, bool has_matching_push) { >> if (has_matching_push && VM_Version::supports_apx_f()) { >> Assembler::popp(reg); >> decrementCounter(); >> } else { >> Assembler::pop(reg); >> } >> } >> void incrementCounter() { >> _counter++; >> } >> void decrementCounter() { >> _counter--; >> } >> } > > Hi Jatin (@jatin-bhateja) and Vlad (@vpaprotsk), > > There's one more issue to be considered. The C++ PushPopTracker code will be run during the stub generation time. There are code bocks which do a single push onto the stack but due to multiple exit paths, there will be multiple pops as illustrated below. Will this reference counting approach not fail in such a scenario as the stub code is generated all at once during the stub generation phase? > > > #begin stack frame > push(r21) > > #exit condition 1 > pop(r21) > > # exit condition 2 > pop(r21) There is no one size fits all soution, idea is to be smart whereever possible, by maintaining a fixed stack of registers populated during push operation we can delegate the responsibility of emitting pop instructions in reverse order to tracker, @vamsi-parasa for now I am ok with maintaining existing implimentation. 
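A rough sketch of that suggestion (names and exact semantics here are assumptions, not taken from the linked change): two consecutive single-register pushes collapse into one two-register PUSH2, and the matching POP2 restores them on the way out.

    // Sketch only, assuming push2(a, b) behaves like push(a); push(b) and
    // pop2(a, b) like pop(a); pop(b). Each PUSH2/POP2 moves the stack pointer
    // by 16 bytes, which helps keep it 16-byte aligned across the stub.
    __ push2(rax, rcx);    // instead of: paired_push(rax); paired_push(rcx);
    // ... stub body ...
    __ pop2(rcx, rax);     // restore in reverse order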
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2204183436 From tschatzl at openjdk.org Mon Jul 14 09:02:50 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Mon, 14 Jul 2025 09:02:50 GMT Subject: RFR: 8361952: Installation of MethodData::extra_data_lock() misses synchronization on reader side In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 12:23:18 GMT, Aleksey Shipilev wrote: >> Hi all, >> >> please review this change that fixes some recently introduced atomic installation of a mutex, where the memory barrier (`load_acquire`) on the reader side. Without it the reader might get a valid pointer to the `Mutex` created on the fly, without it being initialized properly. >> >> Found during code inspection for https://bugs.openjdk.org/browse/JDK-8361706 ; due to some suspicious hangs in the `MutexLocker` while cleaning klasses during class unloading in parallel (multiple threads hanging in `MethodData::clean_method_data`), executing the `vmTestbase/nsk/jvmti/RedefineClasses/StressRedefine/TestDescription.java` test. >> >> Testing: gha >> >> Thanks, >> Thomas > > OK, sure. Thanks @shipilev @coleenp @dholmes-ora for your reviews ------------- PR Comment: https://git.openjdk.org/jdk/pull/26262#issuecomment-3068518246 From tschatzl at openjdk.org Mon Jul 14 09:02:51 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Mon, 14 Jul 2025 09:02:51 GMT Subject: Integrated: 8361952: Installation of MethodData::extra_data_lock() misses synchronization on reader side In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 09:47:05 GMT, Thomas Schatzl wrote: > Hi all, > > please review this change that fixes some recently introduced atomic installation of a mutex, where the memory barrier (`load_acquire`) on the reader side. Without it the reader might get a valid pointer to the `Mutex` created on the fly, without it being initialized properly. > > Found during code inspection for https://bugs.openjdk.org/browse/JDK-8361706 ; due to some suspicious hangs in the `MutexLocker` while cleaning klasses during class unloading in parallel (multiple threads hanging in `MethodData::clean_method_data`), executing the `vmTestbase/nsk/jvmti/RedefineClasses/StressRedefine/TestDescription.java` test. > > Testing: gha > > Thanks, > Thomas This pull request has now been integrated. Changeset: 272e66d0 Author: Thomas Schatzl URL: https://git.openjdk.org/jdk/commit/272e66d017a3497d9af4df6f042c741ad8a59dd6 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8361952: Installation of MethodData::extra_data_lock() misses synchronization on reader side Reviewed-by: shade, coleenp, dholmes ------------- PR: https://git.openjdk.org/jdk/pull/26262 From adinn at openjdk.org Mon Jul 14 09:17:39 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 14 Jul 2025 09:17:39 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Sun, 13 Jul 2025 08:40:45 GMT, Yadong Wang wrote: >> The bug is that the predicate rule of immByteMapBase would cause a ConP Node for oop incorrect matching with byte_map_base when the placeholder jni handle address was just allocated to the address of byte_map_base. >> >> C2 uses JNI handles as placeholders to encoding constant oops, and one of some handle maybe locate at the address of byte_map_base, which is not memory reserved by CardTable. It's possible because JNIHandleBlocks are allocated by malloc. 
>> >> // The assembler store_check code will do an unsigned shift of the oop, >> // then add it to _byte_map_base, i.e. >> // >> // _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift) >> _byte_map = (CardValue*) rs.base(); >> _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift); >> >> In aarch64 port, C2 will incorrectly match ConP for oop to ConP for byte_map_base by the immByteMapBase operand. >> >> // Card Table Byte Map Base >> operand immByteMapBase() >> %{ >> // Get base of card map >> predicate((jbyte*)n->get_ptr() == >> ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); >> match(ConP); >> >> op_cost(0); >> format %{ %} >> interface(CONST_INTER); >> %} >> >> // Load Byte Map Base Constant >> instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) >> %{ >> match(Set dst con); >> >> ins_cost(INSN_COST); >> format %{ "adr $dst, $con\t# Byte Map Base" %} >> >> ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); >> >> ins_pipe(ialu_imm); >> %} >> >> As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: >> 0xffff25caf08c: ldaxr x8, [x11] >> 0xffff25caf090: cmp x10, x8 >> 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any >> 0xffff25caf098: stlxr w8, x28, [x11] >> 0xffff25caf09c: cbnz w8, 0xffff25caf08c >> 0xffff25caf0a0: orr x11, xzr, #0x3 >> 0xffff25caf0a4: str x11, [x13] >> 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none >> 0xffff25caf0ac: str x14, [sp] >> 0xffff25caf0b0: add x2, sp, #0x20 >> 0xffff25caf0b4: adrp x1, 0xffff21730000 >> 0xffff25caf0b8: bl 0xffff256fffc0 >> 0xffff25caf0bc: ldr x14, [sp] >> 0xffff25caf0c0: b 0xffff25caef80 >> 0xffff25caf0c4: add x13, sp, #0x20 >> 0xffff25caf0c8: adrp x12, 0xffff21730000 >> 0xffff25caf0cc: ldr x10, [x13] >> 0xffff25caf0d0: cmp x10, xzr >> 0xffff25c... > > Yadong Wang has updated the pull request incrementally with one additional commit since the last revision: > > 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding The proposed solution is not simply going to work when the Leyden project introduces code save/restore. In the assembly phase for an AOT cache (i.e when compiling code to store in the cache) we need to recognize that an incoming address is the byte map base address and generate an lea with an external address relocation. So, the current code in premain relies on matching against immByteMapBase. I believe we can retain the current approach if we make the immByteMapBase predicate test the operand type for a RawPtr rather than an OopPtr. That is actually the key distinction that separates the cases we are dealing with. I believe it would work for both immByteMapBase and immPollPage and is a much smaller change to the status quo. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3068574613 From mhaessig at openjdk.org Mon Jul 14 09:21:39 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 14 Jul 2025 09:21:39 GMT Subject: RFR: 8360701: Add bailout when the register allocator interference graph grows unreasonably large In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 18:13:13 GMT, Daniel Lund?n wrote: > The changeset for JDK-8325467 (https://git.openjdk.org/jdk/pull/20404) enables compilation of methods with many parameters, which C2 previously bailed out on. 
As a side effect, the tests `BigArityTest.java`, `TestCatchExceptionWithVarargs.java`, and `VarargsArrayTest.java` compile more methods than before, and additionally these methods are designed, for stress testing purposes, to have a large number of parameters (at or close to the maximum of 255 parameters allowed by the JVM spec). > > Compiling such methods takes a very long time and >99% of the time is spent in the C2 phase Coalesce 2 (part of register allocation). The problem is that the interference graph becomes huge after the initial round of spilling (just before Coalesce 2), and that we do not check for this and bail out if necessary. We do already bail out if the number of IR nodes grows too large, but the interference graph can become huge even if we have a small number of nodes. In fact, the interference graph may (in the worst case) hava a size that is quadratic in the number of nodes. In the problematic tests, we have interference graphs with approximately 100 000 nodes and over 55 000 000 (!) IFG edges. For comparison, the IFG edge count in worst-case realistic scenarios caps out at around 40 000 nodes and 800 000 edges. For example, see the scatter matrix below from running the DaCapo benchmark. It displays, for each time an IFG was built, the number of current IR nodes, the number of live ranges ( the actual nodes in the IFG), and the number of IFG edges. > > ![dacapo](https://github.com/user-attachments/assets/7a070768-50da-42e4-b5ed-9958e1362673) > > ### Changeset > > - Add a new diagnostic flag `IFGEdgesLimit` and bail out whenever we reach the number of edges specified by the flag during IFG construction. The default is a very generous 10 000 000 edges, that still filters out the most degenerate compilations we have seen. > - Add tracking of edges in `PhaseIFG` to permit the new flag. > > It is worth noting that it is perhaps preferable to use a lower default than 10 000 000 edges. For example, in standard benchmarks such as DaCapo (see the scatter matrix above), Renaissance, SPECjvm, and SPECjbb, we never go over 1 000 000 edges (I verified this). The reason I went with the generous 10 000 000 limit is that I saw a fair amount of bailouts in testing with the flag set at 1 000 000 edges. Such bailouts are likely motivated, but I do not want to take any chances. Even at 10 000 000 edges, a few tests s... Thank you for working on this @dlunde! Overall, this change looks good to me. I only have one nit and a question. If I understand correctly, this change is mostly about very long compilation times. Bailing out will lead to shorter compilation times but slower execution of the methods. You benchmarked compilation time, but can you elaborate more, why this won't cause regressions in the execution time? For instance, what is the difference in execution time of the tests that now hit the limit vs. before your change? src/hotspot/share/opto/c2_globals.hpp line 269: > 267: \ > 268: product(uint, IFGEdgesLimit, 10000000, DIAGNOSTIC, \ > 269: "Maximum allowed edges in interference graphs") \ Suggestion: "Maximum allowed edges in the interference graphs") \ Nit: usually, the flag descriptions use "the" ------------- Changes requested by mhaessig (Committer). 
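For context, the guard being reviewed boils down to roughly this shape (a sketch under assumptions: the edge counter name and the exact point in IFG construction where the check sits are not taken from the actual patch):

    // Sketch only: once the interference graph exceeds the configured edge
    // budget, give up on this C2 compilation instead of spending a very long
    // time in register allocation; the method then runs interpreted or via C1.
    if (_edge_count > IFGEdgesLimit) {
      C->record_method_not_compilable("interference graph too large");
      return;
    }

Whether that trade-off can regress end-to-end performance for methods that now hit the limit is exactly the question asked above.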
PR Review: https://git.openjdk.org/jdk/pull/26118#pullrequestreview-3015463061 PR Review Comment: https://git.openjdk.org/jdk/pull/26118#discussion_r2204309434 From shade at openjdk.org Mon Jul 14 09:43:38 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 14 Jul 2025 09:43:38 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 09:15:14 GMT, Andrew Dinn wrote: > So, the current code in premain relies on matching against immByteMapBase. Wait, but only AArch64/RISC-V have `immByteMapBase`, so I presume other platforms deal with the need to record relocation for card table base through some other means? With special `immByteMapBase` removed, would the same way work for AArch64? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3068660737 From bkilambi at openjdk.org Mon Jul 14 09:47:42 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 14 Jul 2025 09:47:42 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 02:10:07 GMT, Xiaohong Gong wrote: >> This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform. >> >> ### Background >> Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register. >> >> ### Implementation >> >> #### Challenges >> Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints. >> >> For a 512-bit SVE machine, loading a `byte` vector with different vector species require different approaches: >> - SPECIES_64: Single operation with mask (8 elements, 256-bit) >> - SPECIES_128: Single operation, full register (16 elements, 512-bit) >> - SPECIES_256: Two operations + merge (32 elements, 1024-bit) >> - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit) >> >> Use `ByteVector.SPECIES_512` as an example: >> - It contains 64 elements. So the index vector size should be `64 * 32` bits, which is 4 times of the SVE vector register size. >> - It requires 4 times of vector gather-loads to finish the whole operation. >> >> >> byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...] >> int[] idx = [0, 1, 2, 3, ..., 63, ...] >> >> 4 gather-load: >> idx_v1 = [15 14 13 ... 1 0] gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa] >> idx_v2 = [31 30 29 ... 17 16] gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb] >> idx_v3 = [47 46 45 ... 33 32] gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc] >> idx_v4 = [63 62 61 ... 49 48] gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd] >> merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa] >> >> >> #### Solution >> The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end. 
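Spelling that analogy out for the card table (purely hypothetical; the per-thread field and its offset accessor below are made up, not existing code): the byte map base could likewise be read from a thread-local slot, which would remove the need for the immByteMapBase operand and its special encoding altogether.

    // Hypothetical counterpart, not existing code: load the card table base
    // from a per-thread field instead of matching it as an immediate constant.
    void MacroAssembler::load_byte_map_base(Register dest) {
      ldr(dest, Address(rthread, JavaThread::byte_map_base_offset()));
    }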
>> >> Here is the main changes: >> - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher. >> - Added `VectorSliceNode` for result mer... > > Hi @Bhavana-Kilambi, @fg1417, could you please help take a look at this PR? BTW, since the vector register size of my SVE machine is 128-bit, could you please help test the correctness on a SVE machine with larger vector size (e.g. 512-bit vector size)? Thanks a lot in advance! Hi @XiaohongGong , thank you for doing this. As for testing, we can currently only test on 256-bit SVE machines (we no longer have any 512bit machines). We will get back to you with the results soon. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3068674585 From aph at openjdk.org Mon Jul 14 09:48:39 2025 From: aph at openjdk.org (Andrew Haley) Date: Mon, 14 Jul 2025 09:48:39 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: <3-lkyT9mKB-3kNYx4XG7rRcu-PDYuWlF1M8jBB1uBSY=.b7b98943-0404-4fbb-834b-a511bb6725da@github.com> On Mon, 14 Jul 2025 09:15:14 GMT, Andrew Dinn wrote: > I believe it would work for both immByteMapBase and immPollPage and is a much smaller change to the status quo. Poll page does this. No steenkin' reloc necessary... // Move the address of the polling page into dest. void MacroAssembler::get_polling_page(Register dest, relocInfo::relocType rtype) { ldr(dest, Address(rthread, JavaThread::polling_page_offset())); } ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3068681486 From xgong at openjdk.org Mon Jul 14 10:13:43 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 14 Jul 2025 10:13:43 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 02:10:07 GMT, Xiaohong Gong wrote: >> This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform. >> >> ### Background >> Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register. >> >> ### Implementation >> >> #### Challenges >> Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints. >> >> For a 512-bit SVE machine, loading a `byte` vector with different vector species require different approaches: >> - SPECIES_64: Single operation with mask (8 elements, 256-bit) >> - SPECIES_128: Single operation, full register (16 elements, 512-bit) >> - SPECIES_256: Two operations + merge (32 elements, 1024-bit) >> - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit) >> >> Use `ByteVector.SPECIES_512` as an example: >> - It contains 64 elements. So the index vector size should be `64 * 32` bits, which is 4 times of the SVE vector register size. >> - It requires 4 times of vector gather-loads to finish the whole operation. >> >> >> byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...] 
>> int[] idx = [0, 1, 2, 3, ..., 63, ...] >> >> 4 gather-load: >> idx_v1 = [15 14 13 ... 1 0] gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa] >> idx_v2 = [31 30 29 ... 17 16] gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb] >> idx_v3 = [47 46 45 ... 33 32] gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc] >> idx_v4 = [63 62 61 ... 49 48] gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd] >> merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa] >> >> >> #### Solution >> The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end. >> >> Here is the main changes: >> - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher. >> - Added `VectorSliceNode` for result mer... > > Hi @Bhavana-Kilambi, @fg1417, could you please help take a look at this PR? BTW, since the vector register size of my SVE machine is 128-bit, could you please help test the correctness on a SVE machine with larger vector size (e.g. 512-bit vector size)? Thanks a lot in advance! > Hi @XiaohongGong , thank you for doing this. As for testing, we can currently only test on 256-bit SVE machines (we no longer have any 512bit machines). We will get back to you with the results soon. Testing on 256-bit SVE machines are fine to me. Thanks so much for your help! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3068787386 From chagedorn at openjdk.org Mon Jul 14 10:27:41 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 14 Jul 2025 10:27:41 GMT Subject: RFR: 8361492: [IR Framework] Has too restrictive regex for load and store [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 06:16:32 GMT, Marc Chevalier wrote: >> Improving store and load regexes + adding test. It's mostly an improve version of a fix I had to do in Valhalla where it was blocking (part of JDK-8361250, blocking JDK-8357785). >> >> The new regex takes into account that classes can implement interfaces, nested classes, and various labels after the `@`. It should be more robust. 
>> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Mostly add spaces > > and rename, a bit test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPhaseIRMatching.java line 575: > 573: } > 574: > 575: SingleNest.DoubleNest double_nest = new SingleNest.DoubleNest(); Last nit: We should use camelCase for Java code: Suggestion: SingleNest.DoubleNest doubleNest = new SingleNest.DoubleNest(); test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPhaseIRMatching.java line 763: > 761: // @ir_framework/tests/LoadStore$SingleNest$DoubleNest+12 * > 762: public int loadDoubleNested() { > 763: return double_nest.i; Suggestion: return doubleNest.i; test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/TestPhaseIRMatching.java line 786: > 784: // @ir_framework/tests/LoadStore$SingleNest$DoubleNest+12 * > 785: public void storeDoubleNested() { > 786: double_nest.i = 1; Suggestion: doubleNest.i = 1; ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26269#discussion_r2204498393 PR Review Comment: https://git.openjdk.org/jdk/pull/26269#discussion_r2204498723 PR Review Comment: https://git.openjdk.org/jdk/pull/26269#discussion_r2204498990 From mchevalier at openjdk.org Mon Jul 14 10:30:55 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Mon, 14 Jul 2025 10:30:55 GMT Subject: RFR: 8361492: [IR Framework] Has too restrictive regex for load and store [v3] In-Reply-To: References: Message-ID: > Improving store and load regexes + adding test. It's mostly an improve version of a fix I had to do in Valhalla where it was blocking (part of JDK-8361250, blocking JDK-8357785). > > The new regex takes into account that classes can implement interfaces, nested classes, and various labels after the `@`. It should be more robust. > > Thanks, > Marc Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: ocamlCase ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26269/files - new: https://git.openjdk.org/jdk/pull/26269/files/3dc823cb..c281fbea Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26269&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26269&range=01-02 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/26269.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26269/head:pull/26269 PR: https://git.openjdk.org/jdk/pull/26269 From shade at openjdk.org Mon Jul 14 10:31:41 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 14 Jul 2025 10:31:41 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Sun, 13 Jul 2025 08:40:45 GMT, Yadong Wang wrote: >> The bug is that the predicate rule of immByteMapBase would cause a ConP Node for oop incorrect matching with byte_map_base when the placeholder jni handle address was just allocated to the address of byte_map_base. >> >> C2 uses JNI handles as placeholders to encoding constant oops, and one of some handle maybe locate at the address of byte_map_base, which is not memory reserved by CardTable. It's possible because JNIHandleBlocks are allocated by malloc. >> >> // The assembler store_check code will do an unsigned shift of the oop, >> // then add it to _byte_map_base, i.e. 
>> // >> // _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift) >> _byte_map = (CardValue*) rs.base(); >> _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift); >> >> In aarch64 port, C2 will incorrectly match ConP for oop to ConP for byte_map_base by the immByteMapBase operand. >> >> // Card Table Byte Map Base >> operand immByteMapBase() >> %{ >> // Get base of card map >> predicate((jbyte*)n->get_ptr() == >> ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); >> match(ConP); >> >> op_cost(0); >> format %{ %} >> interface(CONST_INTER); >> %} >> >> // Load Byte Map Base Constant >> instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) >> %{ >> match(Set dst con); >> >> ins_cost(INSN_COST); >> format %{ "adr $dst, $con\t# Byte Map Base" %} >> >> ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); >> >> ins_pipe(ialu_imm); >> %} >> >> As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: >> 0xffff25caf08c: ldaxr x8, [x11] >> 0xffff25caf090: cmp x10, x8 >> 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any >> 0xffff25caf098: stlxr w8, x28, [x11] >> 0xffff25caf09c: cbnz w8, 0xffff25caf08c >> 0xffff25caf0a0: orr x11, xzr, #0x3 >> 0xffff25caf0a4: str x11, [x13] >> 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none >> 0xffff25caf0ac: str x14, [sp] >> 0xffff25caf0b0: add x2, sp, #0x20 >> 0xffff25caf0b4: adrp x1, 0xffff21730000 >> 0xffff25caf0b8: bl 0xffff256fffc0 >> 0xffff25caf0bc: ldr x14, [sp] >> 0xffff25caf0c0: b 0xffff25caef80 >> 0xffff25caf0c4: add x13, sp, #0x20 >> 0xffff25caf0c8: adrp x12, 0xffff21730000 >> 0xffff25caf0cc: ldr x10, [x13] >> 0xffff25caf0d0: cmp x10, xzr >> 0xffff25c... > > Yadong Wang has updated the pull request incrementally with one additional commit since the last revision: > > 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding To be clear, I would very much prefer to remove the special-case handling for card table base in AArch64/RISC-V AD, rather than piling on more special cases into that rule. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3068863982 From chagedorn at openjdk.org Mon Jul 14 10:36:39 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 14 Jul 2025 10:36:39 GMT Subject: RFR: 8361492: [IR Framework] Has too restrictive regex for load and store [v3] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 10:30:55 GMT, Marc Chevalier wrote: >> Improving store and load regexes + adding test. It's mostly an improve version of a fix I had to do in Valhalla where it was blocking (part of JDK-8361250, blocking JDK-8357785). >> >> The new regex takes into account that classes can implement interfaces, nested classes, and various labels after the `@`. It should be more robust. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > ocamlCase Looks good, thanks for all the updates! ------------- Marked as reviewed by chagedorn (Reviewer). 
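For context on why the regex needed to change: the memory slice printed by C2 for a field access includes the declaring class, any '$'-separated nested classes, and the field offset after '+', e.g. `@ir_framework/tests/LoadStore$SingleNest$DoubleNest+12 *`. The short, standalone Java illustration below shows a pattern that accepts such strings; it is a simplified stand-in for demonstration, not the actual regex used by the IR framework, and the flat example string is made up.

import java.util.regex.Pattern;

// Simplified stand-in: the printed memory slice can contain a package path,
// '$'-separated nested class names and a field offset after '+', so a naive
// single-identifier pattern misses many legitimate cases.
public class LoadStoreRegexDemo {
    private static final Pattern MEM_SLICE =
        Pattern.compile("@\\S+/\\w+(\\$\\w+)*\\+\\d+\\s\\*");

    public static void main(String[] args) {
        String flat   = "@ir_framework/tests/LoadStore+16 *";
        String nested = "@ir_framework/tests/LoadStore$SingleNest$DoubleNest+12 *";
        System.out.println(MEM_SLICE.matcher(flat).matches());   // true
        System.out.println(MEM_SLICE.matcher(nested).matches()); // true
    }
}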
PR Review: https://git.openjdk.org/jdk/pull/26269#pullrequestreview-3015768745 From duke at openjdk.org Mon Jul 14 11:07:45 2025 From: duke at openjdk.org (duke) Date: Mon, 14 Jul 2025 11:07:45 GMT Subject: RFR: 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal [v4] In-Reply-To: References: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> Message-ID: On Mon, 14 Jul 2025 08:05:34 GMT, Benoît Maillard wrote: >> This PR adds a node hash comparison after calling `Ideal` in `PhaseIterGVN::verify_Ideal_for` to introduce an additional layer of verification for missed optimizations. Previously, we relied on the return value of `Ideal`, which is expected to be `nullptr` if no transformation was done. >> >> By also checking the node's hash before and after `Ideal`, we could catch inconsistencies in the implementation or unintended modifications to the graph. Both of these may indicate missed or incomplete optimizations. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361144) >> - [x] tier1-3, plus some internal testing >> >> Thank you for reviewing! > > Benoît Maillard has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - Merge branch 'master' into JDK-8361144 > - 8361144: add comment for consistency with node count > - 8361144: update comment > > Co-authored-by: Damon Fenacci > - 8361144: remove unintentional line break > - 8361144: move hash check after return value check and use same format as unique counter check > - 8361144: add check for node hash after verifying ideal @benoitmaillard Your change (at version 8660d6ae5903131b25fa02dd2c2eb59c80699cd0) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26064#issuecomment-3069009411 From adinn at openjdk.org Mon Jul 14 11:10:39 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 14 Jul 2025 11:10:39 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 09:40:37 GMT, Aleksey Shipilev wrote: > Wait, but only AArch64/RISC-V have immByteMapBase, so I presume other platforms deal with the need to record relocation for card table base through some other means? With special immByteMapBase removed, would the same way work for AArch64? I'm not sure how this works for the byte_map_base on x86. I just looked through the x86 premain code and I cannot find any special case handling for it. However, on both x86 and aarch64 we also use immediate ConP operand rules to inject relocs for constant addresses that refer to entries in the (global) AOT Runtime Constants area. So, this is another case similar to byte_map_base where AOT compilation needs to recognize a special address and handle it appropriately. Here are the x86 rules for AOT Constant addresses // AOT Runtime Constants Address operand immAOTRuntimeConstantsAddress() %{ // Check if the address is in the range of AOT Runtime Constants predicate(AOTRuntimeConstants::contains((address)(n->get_ptr()))); match(ConP); op_cost(0); format %{ %} interface(CONST_INTER); %} . . . 
instruct loadAOTRCAddress(rRegP dst, immAOTRuntimeConstantsAddress con) %{ match(Set dst con); format %{ "leaq $dst, $con\t# AOT Runtime Constants Address" %} ins_encode %{ __ load_aotrc_address($dst$$Register, (address)$con$$constant); %} ins_pipe(ialu_reg_fat); %} The referenced macro assembler method is defined as follows void MacroAssembler::load_aotrc_address(Register reg, address a) { #if INCLUDE_CDS assert(AOTRuntimeConstants::contains(a), "address out of range for data area"); if (AOTCodeCache::is_on_for_dump()) { // all aotrc field addresses should be registered in the AOTCodeCache address table lea(reg, ExternalAddress(a)); } else { mov64(reg, (uint64_t)a); } #else ShouldNotReachHere(); #endif } I'm not sure whether the latest premain code is just in flux and does not yet deal with byte_map_base on x86 or whether there is some other mechanism. Perhaps @ashu-mehra or @vnkozlov can clarify. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3069024341 From adinn at openjdk.org Mon Jul 14 11:10:40 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 14 Jul 2025 11:10:40 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Sun, 13 Jul 2025 08:40:45 GMT, Yadong Wang wrote: >> The bug is that the predicate rule of immByteMapBase would cause a ConP Node for oop incorrect matching with byte_map_base when the placeholder jni handle address was just allocated to the address of byte_map_base. >> >> C2 uses JNI handles as placeholders to encoding constant oops, and one of some handle maybe locate at the address of byte_map_base, which is not memory reserved by CardTable. It's possible because JNIHandleBlocks are allocated by malloc. >> >> // The assembler store_check code will do an unsigned shift of the oop, >> // then add it to _byte_map_base, i.e. >> // >> // _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift) >> _byte_map = (CardValue*) rs.base(); >> _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift); >> >> In aarch64 port, C2 will incorrectly match ConP for oop to ConP for byte_map_base by the immByteMapBase operand. 
>> >> // Card Table Byte Map Base >> operand immByteMapBase() >> %{ >> // Get base of card map >> predicate((jbyte*)n->get_ptr() == >> ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); >> match(ConP); >> >> op_cost(0); >> format %{ %} >> interface(CONST_INTER); >> %} >> >> // Load Byte Map Base Constant >> instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) >> %{ >> match(Set dst con); >> >> ins_cost(INSN_COST); >> format %{ "adr $dst, $con\t# Byte Map Base" %} >> >> ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); >> >> ins_pipe(ialu_imm); >> %} >> >> As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: >> 0xffff25caf08c: ldaxr x8, [x11] >> 0xffff25caf090: cmp x10, x8 >> 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any >> 0xffff25caf098: stlxr w8, x28, [x11] >> 0xffff25caf09c: cbnz w8, 0xffff25caf08c >> 0xffff25caf0a0: orr x11, xzr, #0x3 >> 0xffff25caf0a4: str x11, [x13] >> 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none >> 0xffff25caf0ac: str x14, [sp] >> 0xffff25caf0b0: add x2, sp, #0x20 >> 0xffff25caf0b4: adrp x1, 0xffff21730000 >> 0xffff25caf0b8: bl 0xffff256fffc0 >> 0xffff25caf0bc: ldr x14, [sp] >> 0xffff25caf0c0: b 0xffff25caef80 >> 0xffff25caf0c4: add x13, sp, #0x20 >> 0xffff25caf0c8: adrp x12, 0xffff21730000 >> 0xffff25caf0cc: ldr x10, [x13] >> 0xffff25caf0d0: cmp x10, xzr >> 0xffff25c... > > Yadong Wang has updated the pull request incrementally with one additional commit since the last revision: > > 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding To be clear: it is not just the card table base that needs special case handling in Leyden. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3069032031 From bkilambi at openjdk.org Mon Jul 14 11:12:33 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 14 Jul 2025 11:12:33 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v14] In-Reply-To: References: Message-ID: > This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. > > It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. > > For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. > > For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. > > This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. 
> > Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - > > > Benchmark (size) Mode Cnt Gain > SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 > SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 > SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 > SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 > SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 > SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 > SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 > SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 > SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 > SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 > > > Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: Updated x86 code. Patch contributed by @jatin-bhateja ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23570/files - new: https://git.openjdk.org/jdk/pull/23570/files/34566e7d..8fcab4f1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=13 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=12-13 Stats: 11 lines in 2 files changed: 3 ins; 6 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23570.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23570/head:pull/23570 PR: https://git.openjdk.org/jdk/pull/23570 From bkilambi at openjdk.org Mon Jul 14 11:17:41 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 14 Jul 2025 11:17:41 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: > This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. > > It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. > > For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. > > For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. > > This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. 
> > Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - > > > Benchmark (size) Mode Cnt Gain > SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 > SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 > SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 > SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 > SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 > SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 > SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 > SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 > SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 > SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 > > > Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: Addressed review comments to half the number of match rules ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23570/files - new: https://git.openjdk.org/jdk/pull/23570/files/8fcab4f1..6c7266d7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=14 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=13-14 Stats: 254 lines in 4 files changed: 61 ins; 143 del; 50 mod Patch: https://git.openjdk.org/jdk/pull/23570.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23570/head:pull/23570 PR: https://git.openjdk.org/jdk/pull/23570 From bkilambi at openjdk.org Mon Jul 14 11:17:42 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 14 Jul 2025 11:17:42 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v13] In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 03:15:24 GMT, Xiaohong Gong wrote: >> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2919: >> >>> 2917: ins(tmp, D, src2, 1, 0); >>> 2918: tbl(dst, size1, tmp, 1, dst); >>> 2919: } >> >> Is it better than we wrap this part as a help function, because the code is much the same with line2885-2898? > > These two functions can be refined more clearly. Following is my version: > > void C2_MacroAssembler::select_from_two_vectors_neon(FloatRegister dst, FloatRegister src1, > FloatRegister src2, FloatRegister index, > FloatRegister tmp, unsigned length_in_bytes) { > assert_different_registers(dst, src1, src2, tmp); > SIMD_Arrangement size = length_in_bytes == 16 ? T16B : T8B; > > if (length_in_bytes == 16) { > assert(UseSVE <= 1, "sve must be <= 1"); > // If the vector length is 16B, then use the Neon "tbl" instruction with two vector table > tbl(dst, size, src1, 2, index); > } else { // vector length == 8 > assert(UseSVE == 0, "must be Neon only"); > // We need to fit both the source vectors (src1, src2) in a 128-bit register because the > // Neon "tbl" instruction supports only looking up 16B vectors. 
We then use the Neon "tbl" > // instruction with one vector lookup > ins(tmp, D, src1, 0, 0); > ins(tmp, D, src2, 1, 0); > tbl(dst, size, tmp, 1, index); > } > } > > void C2_MacroAssembler::select_from_two_vectors_sve(FloatRegister dst, FloatRegister src1, > FloatRegister src2, FloatRegister index, > FloatRegister tmp, BasicType bt, > unsigned length_in_bytes) { > assert_different_registers(dst, src1, src2, index, tmp); > SIMD_RegVariant T = elemType_to_regVariant(bt); > if (length_in_bytes == 8) { > assert(UseSVE >= 1, "must be"); > ins(tmp, D, src1, 0, 0); > ins(tmp, D, src2, 1, 0); > sve_tbl(dst, T, tmp, index); > } else { > assert(UseSVE == 2 && length_in_bytes == MaxVectorSize, "must be"); > sve_tbl(dst, T, src1, src2, index); > } > } > > void C2_MacroAssembler::select_from_two_vectors(FloatRegister dst, FloatRegister src1, > FloatRegister src2, FloatRegister index, > FloatRegister tmp, BasicType bt, > unsigned length_in_bytes) { > > assert_different_registers(dst, src1, src2, index, tmp); > > if (UseSVE == 2 || (UseSVE == 1 && length_in_bytes == 8)) { > select_from_two_vectors_sve(dst, src1, src2, index, tmp, bt, length_in_bytes); > return; > } > > // The only BasicTypes that can reach here are T_SHORT, T_BYTE, T_INT and T_FLOAT > assert(bt != T_DOUBLE ... Hi @XiaohongGong , I have updated the code based on your suggestions. I did feel that the code could be slightly less readable but given that we could half the number of rules in the ad file, I felt this compromise should be ok. Thanks a lot for pointing this out. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2204633514 From adinn at openjdk.org Mon Jul 14 11:21:49 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 14 Jul 2025 11:21:49 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 10:28:47 GMT, Aleksey Shipilev wrote: >> Yadong Wang has updated the pull request incrementally with one additional commit since the last revision: >> >> 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding > > To be clear, I would very much prefer to remove the special-case handling for card table base in AArch64/RISC-V AD, rather than piling on more special cases into that rule. @shipilev One other thing. I believe that the only code that creates a RawPtr ConP for the card table base is in the shared C2 card table barrier set assembler. Now that we have moved to late barrier insertion for G1 (along with Shenandoah and Z) is there now any code -- in the dev tree or in jdk25 -- that will create one of these nodes? If not then perhaps the aarch64 immByteMapBase and loadByteMapBase rules are no longer needed and should be deleted? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3069103564 From aph at openjdk.org Mon Jul 14 11:23:45 2025 From: aph at openjdk.org (Andrew Haley) Date: Mon, 14 Jul 2025 11:23:45 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 11:17:41 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. 
>> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments to half the number of match rules src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2858: > 2856: > 2857: // Implement selecting from two vectors using Neon instructions > 2858: void C2_MacroAssembler::select_from_two_vectors_neon(FloatRegister dst, FloatRegister src1, Need a comment here saying what this does. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2204652557 From fjiang at openjdk.org Mon Jul 14 11:30:42 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Mon, 14 Jul 2025 11:30:42 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v4] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 07:28:33 GMT, Dingli Zhang wrote: >> Hi, please consider this code cleanup change for native call. >> >> This removes the address parameter for NativeCall::reloc_destination and NativeFarCall::reloc_destination. >> This also removes several unnecessary code blob related runtime checks turning them into assertions. >> >> ### Testing >> * [x] hs:tier1 - hs:tier3 tested with linux-riscv64 fastdebug build > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Revert changes not related to cleanup Looks good, thanks for the cleanup! ------------- Marked as reviewed by fjiang (Committer). 
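Returning to the SelectFromTwoVector change above: the two-register "tbl" lookup treats src1 and src2 as one concatenated table and uses each index lane to select an element from it, which is also what the lowered rearrange-plus-blend fallback computes. Below is a plain scalar Java sketch of that semantics, assuming all indices are in range; it is illustrative only and is neither the Vector API call nor the intrinsic itself.

import java.util.Arrays;

// Scalar reference for "select from two vectors": indices in [0, 2*LEN) pick
// elements from the concatenation of src1 and src2, which is what a Neon/SVE2
// two-table "tbl" lookup computes in a single instruction. Illustrative only;
// out-of-range index handling (zeroing on Neon) is not modelled here.
public class SelectFromTwoVectorsDemo {
    static int[] selectFromTwo(int[] src1, int[] src2, int[] index) {
        int len = src1.length;
        int[] dst = new int[len];
        for (int i = 0; i < len; i++) {
            int idx = index[i];
            dst[i] = (idx < len) ? src1[idx] : src2[idx - len];
        }
        return dst;
    }

    public static void main(String[] args) {
        int[] a = {10, 11, 12, 13};
        int[] b = {20, 21, 22, 23};
        int[] idx = {0, 5, 3, 6};
        System.out.println(Arrays.toString(selectFromTwo(a, b, idx))); // [10, 21, 13, 22]
    }
}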
PR Review: https://git.openjdk.org/jdk/pull/26150#pullrequestreview-3015981589 From bkilambi at openjdk.org Mon Jul 14 11:32:44 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 14 Jul 2025 11:32:44 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 11:20:57 GMT, Andrew Haley wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Addressed review comments to half the number of match rules > > src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2858: > >> 2856: >> 2857: // Implement selecting from two vectors using Neon instructions >> 2858: void C2_MacroAssembler::select_from_two_vectors_neon(FloatRegister dst, FloatRegister src1, > > Need a comment here saying what this does. Thanks for the comment. I did mention on line #2857 that this implements selecting from two vectors using Neon instructions. Do you think I should add more description here? I have also added the conditions on which this function gets called in `C2_MacroAssembler::select_from_two_vectors()`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2204669842 From bmaillard at openjdk.org Mon Jul 14 11:42:44 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Mon, 14 Jul 2025 11:42:44 GMT Subject: Integrated: 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal In-Reply-To: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> References: <9YpmCSNKHrTmq54eLusmkTHoEFFUTvm6OiqjdiGNFv0=.f8123888-bd26-42cc-938a-ec756a0da90d@github.com> Message-ID: On Tue, 1 Jul 2025 11:35:06 GMT, Beno?t Maillard wrote: > This PR adds a node hash comparison after calling `Ideal` in `PhaseIterGVN::verify_Ideal_for` to introduce an additional layer of verification for missed optimizations. Previously, we relied on the return value of `Ideal`, which is expected to be `nullptr` if no transformation was done. > > By also checking the node's hash before and after `Ideal`, we could catch inconsistencies in the implementation or unintended modifications to the graph. Both of these may indicate missed or incomplete optimizations. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361144) > - [x] tier1-3, plus some internal testing > > Thank you for reviewing! This pull request has now been integrated. Changeset: a531c9ae Author: Beno?t Maillard Committer: Damon Fenacci URL: https://git.openjdk.org/jdk/commit/a531c9aece200d27d7870595eee8e14e39e9bd00 Stats: 12 lines in 1 file changed: 11 ins; 0 del; 1 mod 8361144: Strenghten the Ideal Verification in PhaseIterGVN::verify_Ideal_for by comparing the hash of a node before and after Ideal Co-authored-by: Emanuel Peter Reviewed-by: galder, dfenacci, epeter ------------- PR: https://git.openjdk.org/jdk/pull/26064 From dzhang at openjdk.org Mon Jul 14 11:54:38 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 14 Jul 2025 11:54:38 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v4] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 07:28:33 GMT, Dingli Zhang wrote: >> Hi, please consider this code cleanup change for native call. >> >> This removes the address parameter for NativeCall::reloc_destination and NativeFarCall::reloc_destination. 
>> This also removes several unnecessary code blob related runtime checks turning them into assertions. >> >> ### Testing >> * [x] hs:tier1 - hs:tier3 tested with linux-riscv64 fastdebug build > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Revert changes not related to cleanup Thanks all for the review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26150#issuecomment-3069196216 From duke at openjdk.org Mon Jul 14 11:54:39 2025 From: duke at openjdk.org (duke) Date: Mon, 14 Jul 2025 11:54:39 GMT Subject: RFR: 8361449: RISC-V: Code cleanup for native call [v4] In-Reply-To: References: Message-ID: <8-zkg2vu7IXxEnmJ7OGuQX-xbBy6Nydzs_dMeYF-01c=.34425293-5c59-4849-a64c-a3f09e707e5f@github.com> On Mon, 14 Jul 2025 07:28:33 GMT, Dingli Zhang wrote: >> Hi, please consider this code cleanup change for native call. >> >> This removes the address parameter for NativeCall::reloc_destination and NativeFarCall::reloc_destination. >> This also removes several unnecessary code blob related runtime checks turning them into assertions. >> >> ### Testing >> * [x] hs:tier1 - hs:tier3 tested with linux-riscv64 fastdebug build > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Revert changes not related to cleanup @DingliZhang Your change (at version a04d7101d06bfa7df30e3e741a610f79d2e9d09b) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26150#issuecomment-3069199030 From dzhang at openjdk.org Mon Jul 14 11:58:49 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 14 Jul 2025 11:58:49 GMT Subject: Integrated: 8361449: RISC-V: Code cleanup for native call In-Reply-To: References: Message-ID: On Mon, 7 Jul 2025 02:30:48 GMT, Dingli Zhang wrote: > Hi, please consider this code cleanup change for native call. > > This removes the address parameter for NativeCall::reloc_destination and NativeFarCall::reloc_destination. > This also removes several unnecessary code blob related runtime checks turning them into assertions. > > ### Testing > * [x] hs:tier1 - hs:tier3 tested with linux-riscv64 fastdebug build This pull request has now been integrated. Changeset: 5edd5465 Author: Dingli Zhang Committer: Feilong Jiang URL: https://git.openjdk.org/jdk/commit/5edd546585d66f52c2e894ed212ee67945fe0785 Stats: 39 lines in 3 files changed: 4 ins; 10 del; 25 mod 8361449: RISC-V: Code cleanup for native call Reviewed-by: fyang, fjiang ------------- PR: https://git.openjdk.org/jdk/pull/26150 From jbhateja at openjdk.org Mon Jul 14 12:20:45 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 14 Jul 2025 12:20:45 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v13] In-Reply-To: References: Message-ID: <-olTdjMIhNFfAwGbtWC5xswpKbgM_6uPJBgqoL-joJg=.83566f34-801e-449d-b613-dc2f81f40e54@github.com> > Hi All, > > This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. > > Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. 
> > New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Review comments resolutions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23947/files - new: https://git.openjdk.org/jdk/pull/23947/files/c79efe09..7be678b5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=11-12 Stats: 131 lines in 2 files changed: 70 ins; 37 del; 24 mod Patch: https://git.openjdk.org/jdk/pull/23947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23947/head:pull/23947 PR: https://git.openjdk.org/jdk/pull/23947 From jbhateja at openjdk.org Mon Jul 14 12:20:46 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 14 Jul 2025 12:20:46 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v13] In-Reply-To: <-qsTG7NyclV8PbQ1CsbHobu0bCwIK-6JvsMhmzmpVtg=.51d21506-9da9-4bf2-93d8-6907a6b54c5b@github.com> References: <2YFuLETRIRASPPjocbdhIGklH-45xnIVuY6cYrAdIzU=.84c661ff-faf8-49e8-9c05-056bb9a0fcab@github.com> <-qsTG7NyclV8PbQ1CsbHobu0bCwIK-6JvsMhmzmpVtg=.51d21506-9da9-4bf2-93d8-6907a6b54c5b@github.com> Message-ID: On Tue, 3 Jun 2025 13:30:05 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/intrinsicnode.cpp line 241: >> >>> 239: jlong lo = bt == T_INT ? min_jint : min_jlong; >>> 240: >>> 241: if(mask_type->is_con() && mask_type->get_con_as_long(bt) != -1L) { >> >> Now you removed the condition `mask_type->get_con_as_long(bt) != -1L`. Do you know why it was there in the first place? >> >> It seems to me that if `mask_type->get_con_as_long(bt) == -1L`, then we can just return the type of `src`, right? > > This is a bug-fix for `CompressBitsNode::Value`, but this change also has an effect on `ExpandBitsNode::Value`, and that makes me a little nervous. For example: do we have enough test coverage for `expand`? It seems we did not have enough tests for `compress`, so probably also not for `expand`... > Now you removed the condition `mask_type->get_con_as_long(bt) != -1L`. Do you know why it was there in the first place? > > It seems to me that if `mask_type->get_con_as_long(bt) == -1L`, then we can just return the type of `src`, right? Correct, also for non-constant masks we can even find the maximum set bit count of entier value range and then estimate the bounds of the result. 
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <assert.h>

uint64_t popcnt(int64_t val) {
  uint64_t res = 0;
  asm volatile("popcntq %1, %0 " : "=r"(res) : "r"(val) : "cc");
  return res;
}

typedef struct _result {
  int64_t value;
  int64_t mask;
  int64_t count;
} result;

result compute_max_mask(int64_t hi, int64_t lo) {
  assert(hi >= lo);
  result res;
  int64_t max_true_bits = 0;
  int64_t max_true_bits_val = 0;
  for (int64_t iter = lo; iter < hi; iter++) {
    int setbitscnt = popcnt(iter);
    if (max_true_bits < setbitscnt) {
      max_true_bits = setbitscnt;
      max_true_bits_val = iter;
    }
  }
  res.value = max_true_bits_val;
  res.mask = (1L << max_true_bits) - 1L;
  res.count = max_true_bits;
  return res;
}

int main(int argc, char* argv[]) {
  if (argc != 3) {
    return printf("Invalid arguments, [lo] [hi]\n");
  }
  int64_t lo = atol(argv[1]);
  int64_t hi = atol(argv[2]);
  result mask = compute_max_mask(hi, lo);
  return printf("[lo] = %ld [hi] = %ld [set bits count] %ld [mask] = %ld [max bits count value] %ld\n", lo, hi, mask.count, mask.mask, mask.value);
}

For this bug fix patch I don't want to include this extension. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23947#discussion_r2204754331 From jbhateja at openjdk.org Mon Jul 14 12:20:48 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 14 Jul 2025 12:20:48 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v9] In-Reply-To: References: Message-ID: On Tue, 3 Jun 2025 09:14:36 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix aarch64 failure > > src/hotspot/share/opto/intrinsicnode.cpp line 267: > >> 265: // mask = 0xEFFFFFFF (constant mask) >> 266: // result.hi = 0x7FFFFFFF >> 267: // result.lo = 0 > > Should this not go inside the `CompressBits` scope? > `Hi` -> `lo` > > `Result.Hi = popcount(1 << mask_bits - 1)` > Does not look right. Is this not the wrong way around? > Just repeating code here also does not make sense. Either give a reason in English, or just drop the duplication if it is indeed trivial. > > I would also do the case distinction a bit clearer: > > If mask == -1 -> all ones -> just returns src: > result.lo = type_min (happens if src = type_min) > > Question: does that not mean we could just return the input type of `src`? > > If mask != -1 -> at least one zero in mask -> result cannot be negative: > result.lo = 0 > > > But if we are doing this with the comments, then why not just create an `if-else` block, and add the comments inside each block? > ``` > If mask == -1 -> all ones -> just returns src: > > Question: does that not mean we could just return the input type of `src`? This is incorrect, bit compression simply compacts the bits corresponding to set mask bits, thus for all true mask, if the mask of the source is 1, along with some other set bits, but if it includes at least one unset (zero) bit then the result will be a +ve value and not the same as src. > src/hotspot/share/opto/intrinsicnode.cpp line 292: > >> 290: // To compute minimum result value we assume all but last read source bit as zero, >> 291: // this is because sign bit of result will always be set to 1 while other bit >> 292: // corresponding to set mask bit should be zero. > > I don't understand, are you talking about `lo` if `mask < 0`? Don't we just keep `lo = type_min`, which is always ok? Correct, that is what the comment explains: all bits apart from the MSB bit are zero, i.e. type_min 0x80000000 (int), and similarly for long... 
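The bounds being debated here are easy to sanity-check against the public Integer.compress API (available since JDK 19): an all-ones mask leaves the source unchanged, so the result can stay negative, while any mask with at least one clear bit compresses into fewer than 32 significant bits and therefore yields a non-negative result. The specific inputs in the small Java check below are only illustrative.

// Quick scalar check of the value bounds discussed for CompressBitsNode::Value.
public class CompressBoundsCheck {
    public static void main(String[] args) {
        // All-ones mask: compress is the identity, so the result keeps src's sign.
        System.out.println(Integer.compress(Integer.MIN_VALUE, -1));  // -2147483648

        // Mask with one clear bit (0xEFFFFFFF has 31 set bits): the result fits in
        // 31 bits, so it is bounded above by 0x7FFFFFFF and can never be negative.
        System.out.println(Integer.compress(-1, 0xEFFFFFFF));         // 2147483647

        // A negative mask with only the sign bit set selects just bit 31 of src,
        // so compressing MIN_VALUE gives 1, again a non-negative result.
        System.out.println(Integer.compress(0x80000000, 0x80000000)); // 1
    }
}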
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23947#discussion_r2204754423 PR Review Comment: https://git.openjdk.org/jdk/pull/23947#discussion_r2204754550 From fjiang at openjdk.org Mon Jul 14 12:21:46 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Mon, 14 Jul 2025 12:21:46 GMT Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors [v2] In-Reply-To: <5PCmTwnensUBsUNqVlxUuK6L2nDHIOqek7KEH5r_h_M=.9a05eebc-f3ba-4b0e-b0e0-76e89661c89d@github.com> References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> <5PCmTwnensUBsUNqVlxUuK6L2nDHIOqek7KEH5r_h_M=.9a05eebc-f3ba-4b0e-b0e0-76e89661c89d@github.com> Message-ID: On Thu, 10 Jul 2025 09:26:53 GMT, Dingli Zhang wrote: >> Follow up [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419), RVV supports all vector type conversion APIs in the Vector API. >> So we only need to relax the length limit of the short type to achieve a significant improvement in JMH performance for converting between short and wider types. >> >> ### Test >> qemu-system UseRVV: >> * [x] Run jdk_vector (fastdebug) >> * [x] Run compiler/vectorapi (fastdebug) >> >> ### Performance >> Following shows the performance improvement of relative VectorAPI JMHs on k1 (256-bit RVV): >> >> >> Benchmark (SIZE) Mode Units Before After Gain >> VectorFPtoIntCastOperations.microDouble128ToShort128 512 thrpt ops/ms 52.280 840.112 16.07 >> VectorFPtoIntCastOperations.microDouble128ToShort128 1024 thrpt ops/ms 28.156 429.322 15.25 >> VectorFPtoIntCastOperations.microFloat64ToShort64 512 thrpt ops/ms 14.242 479.509 33.67 >> VectorFPtoIntCastOperations.microFloat64ToShort64 1024 thrpt ops/ms 6.906 242.690 35.14 >> >> PS: `VectorFPtoIntCastOperations.microFloat64ToShort64` is added by [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419). > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Adjust the position of comment src/hotspot/cpu/riscv/riscv.ad line 1999: > 1997: } else if (bt == T_SHORT) { > 1998: // To support vector type conversions between short and wider types. > 1999: size = 2; Should we add some `assert` or `guarantee` for uncovered types? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26239#discussion_r2204758147 From jbhateja at openjdk.org Mon Jul 14 12:22:46 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 14 Jul 2025 12:22:46 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v12] In-Reply-To: References: Message-ID: <7uuAHG1r4neGrCaP_VDnXWEfIrDeHH7iY5FBEH3hEjQ=.6dd6553c-efe1-4fa0-b722-5560468693b6@github.com> On Fri, 11 Jul 2025 14:16:03 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. >> >> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. >> >> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. >> >> Kindly review and share your feedback. 
>> >> Best Regards, >> Jatin > > Jatin Bhateja has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > Update test Hi @TobiHartmann , @eme64 , all your comments have been addressed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3069278272 From aph at openjdk.org Mon Jul 14 12:51:47 2025 From: aph at openjdk.org (Andrew Haley) Date: Mon, 14 Jul 2025 12:51:47 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 11:30:01 GMT, Bhavana Kilambi wrote: >> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2858: >> >>> 2856: >>> 2857: // Implement selecting from two vectors using Neon instructions >>> 2858: void C2_MacroAssembler::select_from_two_vectors_neon(FloatRegister dst, FloatRegister src1, >> >> Need a comment here saying what this does. > > Thanks for the comment. I did mention on line #2857 that this implements selecting from two vectors using Neon instructions. Do you think I should add more description here? I have also added the conditions on which this function gets called in `C2_MacroAssembler::select_from_two_vectors()`. Yes, you should. "Select items (on what basis?) from register(s) x and y and place them (in what order?) in dest. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2204857589 From thartmann at openjdk.org Mon Jul 14 13:24:45 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 14 Jul 2025 13:24:45 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v13] In-Reply-To: <-olTdjMIhNFfAwGbtWC5xswpKbgM_6uPJBgqoL-joJg=.83566f34-801e-449d-b613-dc2f81f40e54@github.com> References: <-olTdjMIhNFfAwGbtWC5xswpKbgM_6uPJBgqoL-joJg=.83566f34-801e-449d-b613-dc2f81f40e54@github.com> Message-ID: On Mon, 14 Jul 2025 12:20:45 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. >> >> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. >> >> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Review comments resolutions `compiler/intrinsics/TestBitShuffleOpers.java` fails with `-ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation`: # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/workspace/open/src/hotspot/share/opto/intrinsicnode.cpp:315), pid=3588220, tid=3588247 # Error: assert(lo == (T_INT ? 
min_jint : min_jlong)) failed # # JRE version: Java(TM) SE Runtime Environment (26.0) (fastdebug build 26-internal-2025-07-14-1229127.tobias.hartmann.jdk4) # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 26-internal-2025-07-14-1229127.tobias.hartmann.jdk4, mixed mode, sharing, compressed oops, compressed class ptrs, g1 gc, linux-amd64) # Problematic frame: # V [libjvm.so+0x1066720] bitshuffle_value(TypeInteger const*, TypeInteger const*, int, BasicType)+0x440 # urrent CompileTask: C2:2810 490 % b compiler.intrinsics.TestBitShuffleOpers::test17 @ 7 (1042 bytes) Stack: [0x00007f2910b4c000,0x00007f2910c4c000], sp=0x00007f2910c47e30, free space=1007k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x1066720] bitshuffle_value(TypeInteger const*, TypeInteger const*, int, BasicType)+0x440 (intrinsicnode.cpp:315) V [libjvm.so+0x182ddbf] PhaseGVN::transform(Node*)+0x1cf (phaseX.cpp:703) V [libjvm.so+0x14913a2] LibraryCallKit::inline_bitshuffle_methods(vmIntrinsicID)+0xb2 (library_call.cpp:2244) V [libjvm.so+0x14bcff8] LibraryCallKit::try_to_inline(int)+0x1b8 (library_call.cpp:556) V [libjvm.so+0x14bfea0] LibraryIntrinsic::generate(JVMState*)+0x230 (library_call.cpp:119) V [libjvm.so+0xd1aaa2] Parse::do_call()+0x712 (doCall.cpp:677) V [libjvm.so+0x17ff8b8] Parse::do_one_bytecode()+0x4b8 (parse2.cpp:2723) V [libjvm.so+0x17eca9c] Parse::do_one_block()+0x24c (parse1.cpp:1586) V [libjvm.so+0x17edea0] Parse::do_all_blocks()+0x130 (parse1.cpp:724) V [libjvm.so+0x17f1393] Parse::Parse(JVMState*, ciMethod*, float)+0xaa3 (parse1.cpp:628) V [libjvm.so+0x97cb6b] ParseGenerator::generate(JVMState*)+0x13b (callGenerator.cpp:97) V [libjvm.so+0xb54c49] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x18b9 (compile.cpp:804) V [libjvm.so+0x97a437] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x467 (c2compiler.cpp:141) V [libjvm.so+0xb64698] CompileBroker::invoke_compiler_on_method(CompileTask*)+0xb58 (compileBroker.cpp:2324) V [libjvm.so+0xb65868] CompileBroker::compiler_thread_loop()+0x578 (compileBroker.cpp:1968) V [libjvm.so+0x10bcacb] JavaThread::thread_main_inner()+0x13b (javaThread.cpp:773) V [libjvm.so+0x1b2ed66] Thread::call_run()+0xb6 (thread.cpp:243) V [libjvm.so+0x179e8d8] thread_native_entry(Thread*)+0x128 (os_linux.cpp:868) ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3069604021 From bkilambi at openjdk.org Mon Jul 14 13:26:39 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 14 Jul 2025 13:26:39 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 07:04:44 GMT, Xiaohong Gong wrote: > This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform. > > ### Background > Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register. 
> > ### Implementation > > #### Challenges > Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints.
> > For a 512-bit SVE machine, loading a `byte` vector with different vector species require different approaches: > - SPECIES_64: Single operation with mask (8 elements, 256-bit) > - SPECIES_128: Single operation, full register (16 elements, 512-bit) > - SPECIES_256: Two operations + merge (32 elements, 1024-bit) > - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit) > > Use `ByteVector.SPECIES_512` as an example: > - It contains 64 elements. So the index vector size should be `64 * 32` bits, which is 4 times of the SVE vector register size. > - It requires 4 times of vector gather-loads to finish the whole operation. > > > byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...] > int[] idx = [0, 1, 2, 3, ..., 63, ...] > > 4 gather-load: > idx_v1 = [15 14 13 ... 1 0] gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa] > idx_v2 = [31 30 29 ... 17 16] gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb] > idx_v3 = [47 46 45 ... 33 32] gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc] > idx_v4 = [63 62 61 ... 49 48] gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd] > merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa] > > > #### Solution > The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end. > > Here is the main changes: > - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher. > - Added `VectorSliceNode` for result merging. > - Added `VectorMaskWidenNode` for mask spliting and type conversion fo... src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 352: > 350: // SVE requires vector indices for gather-load/scatter-store operations > 351: // on all data types. > 352: bool Matcher::gather_scatter_needs_vector_index(BasicType bt) { There's already a function that tests for `UseSVE > 0` here - https://github.com/openjdk/jdk/blob/bcd86d575fe0682a234228c18b0c2e817d3816da/src/hotspot/cpu/aarch64/matcher_aarch64.hpp#L36 Can it be reused? src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 3430: > 3428: > 3429: instruct vslice_neon(vReg dst, vReg src1, vReg src2, immI index) %{ > 3430: predicate(VM_Version::use_neon_for_vector(Matcher::vector_length_in_bytes(n))); nit: indentation. I think there're 3 spaces here.. Same with the SVE version below. src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 3434: > 3432: format %{ "vslice_neon $dst, $src1, $src2, $index" %} > 3433: ins_encode %{ > 3434: uint length_in_bytes = Matcher::vector_length_in_bytes(this); nit: indentation. two spaces.. src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 3448: > 3446: format %{ "vslice_sve $dst_src1, $dst_src1, $src2, $index" %} > 3447: ins_encode %{ > 3448: assert(UseSVE > 0, "must be sve"); nit: indentation. two spaces.. 
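Stepping back from the review nits for a moment, here is a minimal usage sketch of the Java-level subword gather load that this backend work accelerates. It is illustrative only: the class, method and array names are made up, it assumes the incubating jdk.incubator.vector module is enabled (e.g. --add-modules jdk.incubator.vector), and the species choice is arbitrary.

import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

public class SubwordGatherSketch {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

    // Gather load: lane i of the result is src[0 + indexMap[0 + i]].
    static byte[] gather(byte[] src, int[] indexMap) {
        ByteVector v = ByteVector.fromArray(SPECIES, src, 0, indexMap, 0);
        byte[] dst = new byte[SPECIES.length()];
        v.intoArray(dst, 0);
        return dst;
    }
}

On SVE, each such gather would be lowered to one or more int-indexed gather-load instructions plus a merge, following the splitting scheme quoted above.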
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2204954269 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2204961131 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2204958060 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2204959807 From mchevalier at openjdk.org Mon Jul 14 13:39:44 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Mon, 14 Jul 2025 13:39:44 GMT Subject: RFR: 8361492: [IR Framework] Has too restrictive regex for load and store [v3] In-Reply-To: References: Message-ID: <0MMhqUGu3wmj0WV4V4r5IA1sUt-V_JXyMNntiMnpYrs=.f973e6b2-b0ad-43eb-b28c-4068402cbcc5@github.com> On Mon, 14 Jul 2025 10:30:55 GMT, Marc Chevalier wrote: >> Improving store and load regexes + adding test. It's mostly an improve version of a fix I had to do in Valhalla where it was blocking (part of JDK-8361250, blocking JDK-8357785). >> >> The new regex takes into account that classes can implement interfaces, nested classes, and various labels after the `@`. It should be more robust. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > ocamlCase Thanks @dafedafe and @chhagedorn! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26269#issuecomment-3069643740 From mchevalier at openjdk.org Mon Jul 14 13:39:45 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Mon, 14 Jul 2025 13:39:45 GMT Subject: Integrated: 8361492: [IR Framework] Has too restrictive regex for load and store In-Reply-To: References: Message-ID: <8kft4VsEijQQc8qZ1FMwJcRiSG5FzLh3YKrBM2rOWBU=.7ef7ab6c-6931-4de0-a773-a780b993f1fe@github.com> On Fri, 11 Jul 2025 16:56:16 GMT, Marc Chevalier wrote: > Improving store and load regexes + adding test. It's mostly an improve version of a fix I had to do in Valhalla where it was blocking (part of JDK-8361250, blocking JDK-8357785). > > The new regex takes into account that classes can implement interfaces, nested classes, and various labels after the `@`. It should be more robust. > > Thanks, > Marc This pull request has now been integrated. Changeset: ebb10958 Author: Marc Chevalier URL: https://git.openjdk.org/jdk/commit/ebb1095805579f8f32a81bb350198fa1b7add9eb Stats: 253 lines in 2 files changed: 249 ins; 2 del; 2 mod 8361492: [IR Framework] Has too restrictive regex for load and store Reviewed-by: chagedorn, dfenacci ------------- PR: https://git.openjdk.org/jdk/pull/26269 From jbhateja at openjdk.org Mon Jul 14 13:48:07 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 14 Jul 2025 13:48:07 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v14] In-Reply-To: References: Message-ID: > Hi All, > > This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. > > Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. > > New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. > > Kindly review and share your feedback. 
> > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Broken assertions fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23947/files - new: https://git.openjdk.org/jdk/pull/23947/files/7be678b5..06eafe77 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=13 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=12-13 Stats: 5 lines in 1 file changed: 0 ins; 0 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/23947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23947/head:pull/23947 PR: https://git.openjdk.org/jdk/pull/23947 From jbhateja at openjdk.org Mon Jul 14 13:48:08 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 14 Jul 2025 13:48:08 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v13] In-Reply-To: References: <-olTdjMIhNFfAwGbtWC5xswpKbgM_6uPJBgqoL-joJg=.83566f34-801e-449d-b613-dc2f81f40e54@github.com> Message-ID: <260VpdfrxR3vKnrlKQPuVwzJJ3lXM6liDFV4mi-7swg=.eff991a1-0b1f-4fe5-bc31-69896640e654@github.com> On Mon, 14 Jul 2025 13:22:29 GMT, Tobias Hartmann wrote: > -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation Thanks @TobiHartmann , kindly verify with the latest version. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3069675244 From thartmann at openjdk.org Mon Jul 14 14:59:56 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 14 Jul 2025 14:59:56 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v13] In-Reply-To: <260VpdfrxR3vKnrlKQPuVwzJJ3lXM6liDFV4mi-7swg=.eff991a1-0b1f-4fe5-bc31-69896640e654@github.com> References: <-olTdjMIhNFfAwGbtWC5xswpKbgM_6uPJBgqoL-joJg=.83566f34-801e-449d-b613-dc2f81f40e54@github.com> <260VpdfrxR3vKnrlKQPuVwzJJ3lXM6liDFV4mi-7swg=.eff991a1-0b1f-4fe5-bc31-69896640e654@github.com> Message-ID: <8NlUra4EtAcDl_kYEQoD72fVjzZhhIw20Vp1BTmmtdg=.aa1d2a9c-2c40-41b1-ba60-e0918058fe8b@github.com> On Mon, 14 Jul 2025 13:45:07 GMT, Jatin Bhateja wrote: >> `compiler/intrinsics/TestBitShuffleOpers.java` fails with `-ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation`: >> >> >> # A fatal error has been detected by the Java Runtime Environment: >> # >> # Internal Error (/workspace/open/src/hotspot/share/opto/intrinsicnode.cpp:315), pid=3588220, tid=3588247 >> # Error: assert(lo == (T_INT ? 
min_jint : min_jlong)) failed >> # >> # JRE version: Java(TM) SE Runtime Environment (26.0) (fastdebug build 26-internal-2025-07-14-1229127.tobias.hartmann.jdk4) >> # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 26-internal-2025-07-14-1229127.tobias.hartmann.jdk4, mixed mode, sharing, compressed oops, compressed class ptrs, g1 gc, linux-amd64) >> # Problematic frame: >> # V [libjvm.so+0x1066720] bitshuffle_value(TypeInteger const*, TypeInteger const*, int, BasicType)+0x440 >> # >> >> urrent CompileTask: >> C2:2810 490 % b compiler.intrinsics.TestBitShuffleOpers::test17 @ 7 (1042 bytes) >> >> Stack: [0x00007f2910b4c000,0x00007f2910c4c000], sp=0x00007f2910c47e30, free space=1007k >> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) >> V [libjvm.so+0x1066720] bitshuffle_value(TypeInteger const*, TypeInteger const*, int, BasicType)+0x440 (intrinsicnode.cpp:315) >> V [libjvm.so+0x182ddbf] PhaseGVN::transform(Node*)+0x1cf (phaseX.cpp:703) >> V [libjvm.so+0x14913a2] LibraryCallKit::inline_bitshuffle_methods(vmIntrinsicID)+0xb2 (library_call.cpp:2244) >> V [libjvm.so+0x14bcff8] LibraryCallKit::try_to_inline(int)+0x1b8 (library_call.cpp:556) >> V [libjvm.so+0x14bfea0] LibraryIntrinsic::generate(JVMState*)+0x230 (library_call.cpp:119) >> V [libjvm.so+0xd1aaa2] Parse::do_call()+0x712 (doCall.cpp:677) >> V [libjvm.so+0x17ff8b8] Parse::do_one_bytecode()+0x4b8 (parse2.cpp:2723) >> V [libjvm.so+0x17eca9c] Parse::do_one_block()+0x24c (parse1.cpp:1586) >> V [libjvm.so+0x17edea0] Parse::do_all_blocks()+0x130 (parse1.cpp:724) >> V [libjvm.so+0x17f1393] Parse::Parse(JVMState*, ciMethod*, float)+0xaa3 (parse1.cpp:628) >> V [libjvm.so+0x97cb6b] ParseGenerator::generate(JVMState*)+0x13b (callGenerator.cpp:97) >> V [libjvm.so+0xb54c49] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x18b9 (compile.cpp:804) >> V [libjvm.so+0x97a437] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x467 (c2compiler.cpp:141) >> V [libjvm.so+0xb64698] CompileBroker::invoke_compiler_on_method(CompileTask*)+0xb58 (co... > >> -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation > > Thanks @TobiHartmann , kindly verify with the latest version. @jatin-bhateja This is with the latest version (webrev 13). ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3069918597 From dlunden at openjdk.org Mon Jul 14 15:14:22 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Mon, 14 Jul 2025 15:14:22 GMT Subject: RFR: 8360701: Add bailout when the register allocator interference graph grows unreasonably large [v2] In-Reply-To: References: Message-ID: > The changeset for JDK-8325467 (https://git.openjdk.org/jdk/pull/20404) enables compilation of methods with many parameters, which C2 previously bailed out on. As a side effect, the tests `BigArityTest.java`, `TestCatchExceptionWithVarargs.java`, and `VarargsArrayTest.java` compile more methods than before, and additionally these methods are designed, for stress testing purposes, to have a large number of parameters (at or close to the maximum of 255 parameters allowed by the JVM spec). > > Compiling such methods takes a very long time and >99% of the time is spent in the C2 phase Coalesce 2 (part of register allocation). The problem is that the interference graph becomes huge after the initial round of spilling (just before Coalesce 2), and that we do not check for this and bail out if necessary. 
We do already bail out if the number of IR nodes grows too large, but the interference graph can become huge even if we have a small number of nodes. In fact, the interference graph may (in the worst case) hava a size that is quadratic in the number of nodes. In the problematic tests, we have interference graphs with approximately 100 000 nodes and over 55 000 000 (!) IFG edges. For comparison, the IFG edge count in worst-case realistic scenarios caps out at around 40 000 nodes and 800 000 edges. For example, see the scatter matrix below from running the DaCapo benchmark. It displays, for each time an IFG was built, the number of current IR nodes, the number of live ranges ( the actual nodes in the IFG), and the number of IFG edges. > > ![dacapo](https://github.com/user-attachments/assets/7a070768-50da-42e4-b5ed-9958e1362673) > > ### Changeset > > - Add a new diagnostic flag `IFGEdgesLimit` and bail out whenever we reach the number of edges specified by the flag during IFG construction. The default is a very generous 10 000 000 edges, that still filters out the most degenerate compilations we have seen. > - Add tracking of edges in `PhaseIFG` to permit the new flag. > > It is worth noting that it is perhaps preferable to use a lower default than 10 000 000 edges. For example, in standard benchmarks such as DaCapo (see the scatter matrix above), Renaissance, SPECjvm, and SPECjbb, we never go over 1 000 000 edges (I verified this). The reason I went with the generous 10 000 000 limit is that I saw a fair amount of bailouts in testing with the flag set at 1 000 000 edges. Such bailouts are likely motivated, but I do not want to take any chances. Even at 10 000 000 edges, a few tests s... Daniel Lund?n has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/share/opto/c2_globals.hpp Co-authored-by: Manuel H?ssig ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26118/files - new: https://git.openjdk.org/jdk/pull/26118/files/d71d9a55..4beaa8a1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26118&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26118&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26118.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26118/head:pull/26118 PR: https://git.openjdk.org/jdk/pull/26118 From dlunden at openjdk.org Mon Jul 14 15:14:23 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Mon, 14 Jul 2025 15:14:23 GMT Subject: RFR: 8360701: Add bailout when the register allocator interference graph grows unreasonably large [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 09:19:20 GMT, Manuel H?ssig wrote: > Thank you for working on this @dlunde! Overall, this change looks good to me. I only have one nit and a question. Thanks for the review @mhaessig! > You benchmarked compilation time, but can you elaborate more, why this won't cause regressions in the execution time? I did run standard benchmarks as well! See the below from the PR description (I should have made it more explicit). > For example, in standard benchmarks such as DaCapo (see the scatter matrix above), Renaissance, SPECjvm, and SPECjbb, we never go over 1 000 000 edges (I verified this). That is, the _maximum_ IFG edge count among all compilations in this (hopefully) diverse and representative set of benchmarks is just below 1 000 000 edges. The limit is 10 times that. 
Stated alternatively, we never bail out in practice with this new limit. So, combined with the fact that compilation time is unaffected, the total execution time is unaffected. > For instance, what is the difference in execution time of the tests that now hit the limit vs. before your change? Good question! I checked this now on all Oracle-supported platforms and there is no clear difference in total execution time for the three tests I mentioned in the PR description. I ran them all without any additional flags. > src/hotspot/share/opto/c2_globals.hpp line 269: > >> 267: \ >> 268: product(uint, IFGEdgesLimit, 10000000, DIAGNOSTIC, \ >> 269: "Maximum allowed edges in interference graphs") \ > > Suggestion: > > "Maximum allowed edges in the interference graphs") \ > > Nit: usually, the flag descriptions use "the" Sure, added! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26118#issuecomment-3069963462 PR Review Comment: https://git.openjdk.org/jdk/pull/26118#discussion_r2205183222 From jbhateja at openjdk.org Mon Jul 14 15:16:50 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 14 Jul 2025 15:16:50 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v13] In-Reply-To: <260VpdfrxR3vKnrlKQPuVwzJJ3lXM6liDFV4mi-7swg=.eff991a1-0b1f-4fe5-bc31-69896640e654@github.com> References: <-olTdjMIhNFfAwGbtWC5xswpKbgM_6uPJBgqoL-joJg=.83566f34-801e-449d-b613-dc2f81f40e54@github.com> <260VpdfrxR3vKnrlKQPuVwzJJ3lXM6liDFV4mi-7swg=.eff991a1-0b1f-4fe5-bc31-69896640e654@github.com> Message-ID: On Mon, 14 Jul 2025 13:45:07 GMT, Jatin Bhateja wrote: >> `compiler/intrinsics/TestBitShuffleOpers.java` fails with `-ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation`: >> >> >> # A fatal error has been detected by the Java Runtime Environment: >> # >> # Internal Error (/workspace/open/src/hotspot/share/opto/intrinsicnode.cpp:315), pid=3588220, tid=3588247 >> # Error: assert(lo == (T_INT ? 
min_jint : min_jlong)) failed >> # >> # JRE version: Java(TM) SE Runtime Environment (26.0) (fastdebug build 26-internal-2025-07-14-1229127.tobias.hartmann.jdk4) >> # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 26-internal-2025-07-14-1229127.tobias.hartmann.jdk4, mixed mode, sharing, compressed oops, compressed class ptrs, g1 gc, linux-amd64) >> # Problematic frame: >> # V [libjvm.so+0x1066720] bitshuffle_value(TypeInteger const*, TypeInteger const*, int, BasicType)+0x440 >> # >> >> urrent CompileTask: >> C2:2810 490 % b compiler.intrinsics.TestBitShuffleOpers::test17 @ 7 (1042 bytes) >> >> Stack: [0x00007f2910b4c000,0x00007f2910c4c000], sp=0x00007f2910c47e30, free space=1007k >> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) >> V [libjvm.so+0x1066720] bitshuffle_value(TypeInteger const*, TypeInteger const*, int, BasicType)+0x440 (intrinsicnode.cpp:315) >> V [libjvm.so+0x182ddbf] PhaseGVN::transform(Node*)+0x1cf (phaseX.cpp:703) >> V [libjvm.so+0x14913a2] LibraryCallKit::inline_bitshuffle_methods(vmIntrinsicID)+0xb2 (library_call.cpp:2244) >> V [libjvm.so+0x14bcff8] LibraryCallKit::try_to_inline(int)+0x1b8 (library_call.cpp:556) >> V [libjvm.so+0x14bfea0] LibraryIntrinsic::generate(JVMState*)+0x230 (library_call.cpp:119) >> V [libjvm.so+0xd1aaa2] Parse::do_call()+0x712 (doCall.cpp:677) >> V [libjvm.so+0x17ff8b8] Parse::do_one_bytecode()+0x4b8 (parse2.cpp:2723) >> V [libjvm.so+0x17eca9c] Parse::do_one_block()+0x24c (parse1.cpp:1586) >> V [libjvm.so+0x17edea0] Parse::do_all_blocks()+0x130 (parse1.cpp:724) >> V [libjvm.so+0x17f1393] Parse::Parse(JVMState*, ciMethod*, float)+0xaa3 (parse1.cpp:628) >> V [libjvm.so+0x97cb6b] ParseGenerator::generate(JVMState*)+0x13b (callGenerator.cpp:97) >> V [libjvm.so+0xb54c49] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x18b9 (compile.cpp:804) >> V [libjvm.so+0x97a437] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x467 (c2compiler.cpp:141) >> V [libjvm.so+0xb64698] CompileBroker::invoke_compiler_on_method(CompileTask*)+0xb58 (co... > >> -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation > > Thanks @TobiHartmann , kindly verify with the latest version. > @jatin-bhateja This is with the latest version (webrev 13). Hi @TobiHartmann I don't see any failure at https://github.com/openjdk/jdk/pull/23947/commits/06eafe7712833d830bbd60cdb729ad261eca59b8 ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3069973582 From shade at openjdk.org Mon Jul 14 15:27:51 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 14 Jul 2025 15:27:51 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 Message-ID: See the bug for more analysis. The short summary is that `CompileQueue::delete_all` walks the entire compile queue and deletes the tasks. It normally goes smoothly, unless there are blocking tasks. Then, the actual waiters have to delete the task, lest we delete the task under waiter's feet. Full deletion and blocking waits coordinate with `waiting_for_completion_count` counter. This mechanism -- added by [JDK-8343938](https://bugs.openjdk.org/browse/JDK-8343938) in JDK 25 to solve a similar problem -- almost works. _Almost_. But, there is a subtle race window, where blocking waiter could have already unparked, dropped `waiting_for_completion_count` to `0` and proceeded to delete the task, see `CompileBroker::wait_for_completion()`. 
Then the queue deletion code could assume there are _no actual waiters_ on the blocking task, and proceed to delete the task _again_. Before [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473) this race was fairly innocuous, as second attempt at insertion into the free list was benign. But now, `CompileTask`-s are `delete`-d, and the second attempt leads to double free. I suspect we can fix that by complicating the coordination protocol even further, e.g. by tracking the counters more thoroughly. But, recognizing `CompileQueue::delete_all()` is basically only called from the compiler shutdown code (things are already bad), and it looks completely opportunistic (it does not delete the whole compiler _threads_, so skipping synchronous deletes on a few compile tasks are not a big deal), we should strive to simplify it. This PR summarily delegates _all_ blocking task deletes to waiters. I think it stands to reason (and can be seen in `CompilerBroker` code) that if a blocking task is in queue, then there _is_ a waiter that would call `CompileBroker::wait_for_completion()` on it. Additional testing: - [x] Linux AArch64 server fastdebug, `tier1` - [ ] Linux AArch64 server fastdebug, `all` ------------- Commit messages: - More comment touchups - Comment touchup - Remove ProblemList entry - Sample fix Changes: https://git.openjdk.org/jdk/pull/26294/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26294&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361752 Stats: 56 lines in 4 files changed: 13 ins; 39 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/26294.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26294/head:pull/26294 PR: https://git.openjdk.org/jdk/pull/26294 From shade at openjdk.org Mon Jul 14 15:35:38 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 14 Jul 2025 15:35:38 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 13:57:09 GMT, Aleksey Shipilev wrote: > See the bug for more analysis. > > The short summary is that `CompileQueue::delete_all` walks the entire compile queue and deletes the tasks. It normally goes smoothly, unless there are blocking tasks. Then, the actual waiters have to delete the task, lest we delete the task under waiter's feet. Full deletion and blocking waits coordinate with `waiting_for_completion_count` counter. This mechanism -- added by [JDK-8343938](https://bugs.openjdk.org/browse/JDK-8343938) in JDK 25 to solve a similar problem -- almost works. _Almost_. > > There is a subtle race window, where blocking waiter could have already unparked, dropped `waiting_for_completion_count` to `0` and proceeded to delete the task, see `CompileBroker::wait_for_completion()`. Then the queue deletion code could assume there are _no actual waiters_ on the blocking task, and proceed to delete the task _again_. Before [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473) this race was fairly innocuous, as second attempt at insertion into the free list was benign. But now, `CompileTask`-s are `delete`-d, and the second attempt leads to double free. > > I suspect we can fix that by complicating the coordination protocol even further, e.g. by tracking the counters more thoroughly. 
But, recognizing `CompileQueue::delete_all()` is basically only called from the compiler shutdown code (things are already bad), and it looks completely opportunistic (it does not delete the whole compiler _threads_, so skipping synchronous deletes on a few compile tasks are not a big deal), we should strive to simplify it. > > This PR summarily delegates _all_ blocking task deletes to waiters. I think it stands to reason (and can be seen in `CompilerBroker` code) that if a blocking task is in queue, then there _is_ a waiter that would call `CompileBroker::wait_for_completion()` on it. > > Additional testing: > - [x] Linux AArch64 server fastdebug, `tier1` > - [ ] Linux AArch64 server fastdebug, `all` I am pretty convinced this is it. But I still struggle to reproduce the failure locally. So I would appreciate if @TobiHartmann or @dholmes-ora could give it a spin through the CI where this reproduces. Probably after [JDK-8360048](https://bugs.openjdk.org/browse/JDK-8360048) lands, if that one is not a test-only bug? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26294#issuecomment-3070037202 From kvn at openjdk.org Mon Jul 14 15:39:39 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 14 Jul 2025 15:39:39 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Sun, 13 Jul 2025 08:40:45 GMT, Yadong Wang wrote: >> The bug is that the predicate rule of immByteMapBase can cause a ConP node for an oop to be incorrectly matched as byte_map_base when the placeholder JNI handle happens to be allocated at the very address of byte_map_base. >> >> C2 uses JNI handles as placeholders when encoding constant oops, and one of those handles may happen to lie at the address of byte_map_base, which is not memory reserved by the CardTable. This is possible because JNIHandleBlocks are allocated by malloc. >> >> // The assembler store_check code will do an unsigned shift of the oop, >> // then add it to _byte_map_base, i.e. >> // >> // _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift) >> _byte_map = (CardValue*) rs.base(); >> _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift); >> >> In the aarch64 port, C2 will then incorrectly match the ConP for the oop as the ConP for byte_map_base via the immByteMapBase operand.
>> >> // Card Table Byte Map Base >> operand immByteMapBase() >> %{ >> // Get base of card map >> predicate((jbyte*)n->get_ptr() == >> ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); >> match(ConP); >> >> op_cost(0); >> format %{ %} >> interface(CONST_INTER); >> %} >> >> // Load Byte Map Base Constant >> instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) >> %{ >> match(Set dst con); >> >> ins_cost(INSN_COST); >> format %{ "adr $dst, $con\t# Byte Map Base" %} >> >> ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); >> >> ins_pipe(ialu_imm); >> %} >> >> As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: >> 0xffff25caf08c: ldaxr x8, [x11] >> 0xffff25caf090: cmp x10, x8 >> 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any >> 0xffff25caf098: stlxr w8, x28, [x11] >> 0xffff25caf09c: cbnz w8, 0xffff25caf08c >> 0xffff25caf0a0: orr x11, xzr, #0x3 >> 0xffff25caf0a4: str x11, [x13] >> 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none >> 0xffff25caf0ac: str x14, [sp] >> 0xffff25caf0b0: add x2, sp, #0x20 >> 0xffff25caf0b4: adrp x1, 0xffff21730000 >> 0xffff25caf0b8: bl 0xffff256fffc0 >> 0xffff25caf0bc: ldr x14, [sp] >> 0xffff25caf0c0: b 0xffff25caef80 >> 0xffff25caf0c4: add x13, sp, #0x20 >> 0xffff25caf0c8: adrp x12, 0xffff21730000 >> 0xffff25caf0cc: ldr x10, [x13] >> 0xffff25caf0d0: cmp x10, xzr >> 0xffff25c... > > Yadong Wang has updated the pull request incrementally with one additional commit since the last revision: > > 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding On x86 byte_map_base is handled in GC code: https://github.com/openjdk/leyden/blob/premain/src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp#L314 https://github.com/openjdk/leyden/blob/premain/src/hotspot/cpu/x86/gc/shared/cardTableBarrierSetAssembler_x86.cpp#L67 Using relocation for byte_map_base is not safe (see comment in `g1BarrierSetAssembler_x86.cpp`). We are "safe" because we bailout AOT code caching if byte_map_base is not relocatable: https://github.com/openjdk/leyden/blob/premain/src/hotspot/share/code/aotCodeCache.cpp#L338 I have long standing work to use AOTRuntimeConstants table instead for it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3070047198 From mhaessig at openjdk.org Mon Jul 14 15:43:40 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 14 Jul 2025 15:43:40 GMT Subject: RFR: 8360701: Add bailout when the register allocator interference graph grows unreasonably large [v2] In-Reply-To: References: Message-ID: <11yDpTAB7uCCrx5givvBReRWXU4v_VMTQzhJKYMwXR4=.26076fc5-d6a3-4e00-be27-121fb04bce8b@github.com> On Mon, 14 Jul 2025 15:14:22 GMT, Daniel Lund?n wrote: >> The changeset for JDK-8325467 (https://git.openjdk.org/jdk/pull/20404) enables compilation of methods with many parameters, which C2 previously bailed out on. As a side effect, the tests `BigArityTest.java`, `TestCatchExceptionWithVarargs.java`, and `VarargsArrayTest.java` compile more methods than before, and additionally these methods are designed, for stress testing purposes, to have a large number of parameters (at or close to the maximum of 255 parameters allowed by the JVM spec). 
>> >> Compiling such methods takes a very long time and >99% of the time is spent in the C2 phase Coalesce 2 (part of register allocation). The problem is that the interference graph becomes huge after the initial round of spilling (just before Coalesce 2), and that we do not check for this and bail out if necessary. We do already bail out if the number of IR nodes grows too large, but the interference graph can become huge even if we have a small number of nodes. In fact, the interference graph may (in the worst case) hava a size that is quadratic in the number of nodes. In the problematic tests, we have interference graphs with approximately 100 000 nodes and over 55 000 000 (!) IFG edges. For comparison, the IFG edge count in worst-case realistic scenarios caps out at around 40 000 nodes and 800 000 edges. For example, see the scatter matrix below from running the DaCapo benchmark. It displays, for each time an IFG was built, the number of current IR nodes, the number of live ranges (the actual nodes in the IFG), and the number of IFG edges. >> >> ![dacapo](https://github.com/user-attachments/assets/7a070768-50da-42e4-b5ed-9958e1362673) >> >> ### Changeset >> >> - Add a new diagnostic flag `IFGEdgesLimit` and bail out whenever we reach the number of edges specified by the flag during IFG construction. The default is a very generous 10 000 000 edges, that still filters out the most degenerate compilations we have seen. >> - Add tracking of edges in `PhaseIFG` to permit the new flag. >> >> It is worth noting that it is perhaps preferable to use a lower default than 10 000 000 edges. For example, in standard benchmarks such as DaCapo (see the scatter matrix above), Renaissance, SPECjvm, and SPECjbb, we never go over 1 000 000 edges (I verified this). The reason I went with the generous 10 000 000 limit is that I saw a fair amount of bailouts in testing with the flag set at 1 000 000 edges. Such bailouts are likely motivated, but I do not want to take any chances. Even at 10 000 ... > > Daniel Lund?n has updated the pull request incrementally with one additional commit since the last revision: > > Update src/hotspot/share/opto/c2_globals.hpp > > Co-authored-by: Manuel H?ssig Thank you for elaborating. That makes sense. ------------- Marked as reviewed by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/26118#pullrequestreview-3016850524 From hgreule at openjdk.org Mon Jul 14 16:22:29 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Mon, 14 Jul 2025 16:22:29 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v5] In-Reply-To: References: Message-ID: > Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. > > Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. > > I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. > > Please review. Thanks. 
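To make the truncation behaviour described above concrete, here is a small plain-Java illustration (the helper names are made up; this is not the C2 Value() code itself): when an int wider than 16 bits reaches the short/char reverseBytes constant-folding path, only the low 16 bits take part in the swap, which is what the static_cast achieves.

static short reverseBytesViaShort(int wideInput) {
    short truncated = (short) wideInput;    // analogous to the static_cast<jshort> in the fix
    return Short.reverseBytes(truncated);   // swaps the remaining two bytes
}

static char reverseBytesViaChar(int wideInput) {
    char truncated = (char) wideInput;      // upper 16 bits are simply ignored
    return Character.reverseBytes(truncated);
}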
Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: cleanup ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25988/files - new: https://git.openjdk.org/jdk/pull/25988/files/f8cc3496..271d162a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25988&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25988&range=03-04 Stats: 7 lines in 1 file changed: 0 ins; 5 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/25988.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25988/head:pull/25988 PR: https://git.openjdk.org/jdk/pull/25988 From mhaessig at openjdk.org Mon Jul 14 16:32:42 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 14 Jul 2025 16:32:42 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v5] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 16:22:29 GMT, Hannes Greule wrote: >> Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. >> >> Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. >> >> I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. >> >> Please review. Thanks. > > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > cleanup Thank you for addressing our comments. Looks good. ------------- Marked as reviewed by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/25988#pullrequestreview-3016999054 From shade at openjdk.org Mon Jul 14 16:51:43 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 14 Jul 2025 16:51:43 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 11:19:07 GMT, Andrew Dinn wrote: > One other thing. I believe that the only code that creates a RawPtr ConP for the card table base is in the shared C2 card table barrier set assembler. Now that we have moved to late barrier insertion for G1 (along with Shenandoah and Z) Shenandoah still does not do late barrier expansion. But it also does not emit card table bases as constants, it loads card table bases from TLS. G1 would do this with throughput GC barriers soon too, AFAICS. I think Shenandoah also encodes some "unallocated" `ConP` for the similar biased-base trick, e.g. for collection set bitmap, but none of those ever pretend to be oops. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3070267666 From mhaessig at openjdk.org Mon Jul 14 16:13:39 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 14 Jul 2025 16:13:39 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 13:57:09 GMT, Aleksey Shipilev wrote: > See the bug for more analysis. > > The short summary is that `CompileQueue::delete_all` walks the entire compile queue and deletes the tasks. It normally goes smoothly, unless there are blocking tasks. Then, the actual waiters have to delete the task, lest we delete the task under waiter's feet. 
Full deletion and blocking waits coordinate with `waiting_for_completion_count` counter. This mechanism -- added by [JDK-8343938](https://bugs.openjdk.org/browse/JDK-8343938) in JDK 25 to solve a similar problem -- almost works. _Almost_. > > There is a subtle race window, where blocking waiter could have already unparked, dropped `waiting_for_completion_count` to `0` and proceeded to delete the task, see `CompileBroker::wait_for_completion()`. Then the queue deletion code could assume there are _no actual waiters_ on the blocking task, and proceed to delete the task _again_. Before [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473) this race was fairly innocuous, as second attempt at insertion into the free list was benign. But now, `CompileTask`-s are `delete`-d, and the second attempt leads to double free. > > I suspect we can fix that by complicating the coordination protocol even further, e.g. by tracking the counters more thoroughly. But, recognizing `CompileQueue::delete_all()` is basically only called from the compiler shutdown code (things are already bad), and it looks completely opportunistic (it does not delete the whole compiler _threads_, so skipping synchronous deletes on a few compile tasks are not a big deal), we should strive to simplify it. > > This PR summarily delegates _all_ blocking task deletes to waiters. I think it stands to reason (and can be seen in `CompilerBroker` code) that if a blocking task is in queue, then there _is_ a waiter that would call `CompileBroker::wait_for_completion()` on it. > > Additional testing: > - [x] Linux AArch64 server fastdebug, `tier1` > - [ ] Linux AArch64 server fastdebug, `all` I kicked off a CI run. I'll keep you posted on the results. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26294#issuecomment-3070158657 From kvn at openjdk.org Mon Jul 14 17:18:43 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 14 Jul 2025 17:18:43 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 13:57:09 GMT, Aleksey Shipilev wrote: > See the bug for more analysis. > > The short summary is that `CompileQueue::delete_all` walks the entire compile queue and deletes the tasks. It normally goes smoothly, unless there are blocking tasks. Then, the actual waiters have to delete the task, lest we delete the task under waiter's feet. Full deletion and blocking waits coordinate with `waiting_for_completion_count` counter. This mechanism -- added by [JDK-8343938](https://bugs.openjdk.org/browse/JDK-8343938) in JDK 25 to solve a similar problem -- almost works. _Almost_. > > There is a subtle race window, where blocking waiter could have already unparked, dropped `waiting_for_completion_count` to `0` and proceeded to delete the task, see `CompileBroker::wait_for_completion()`. Then the queue deletion code could assume there are _no actual waiters_ on the blocking task, and proceed to delete the task _again_. Before [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473) this race was fairly innocuous, as second attempt at insertion into the free list was benign. But now, `CompileTask`-s are `delete`-d, and the second attempt leads to double free. > > I suspect we can fix that by complicating the coordination protocol even further, e.g. by tracking the counters more thoroughly. 
But, recognizing `CompileQueue::delete_all()` is basically only called from the compiler shutdown code (things are already bad), and it looks completely opportunistic (it does not delete the whole compiler _threads_, so skipping synchronous deletes on a few compile tasks are not a big deal), we should strive to simplify it. > > This PR summarily delegates _all_ blocking task deletes to waiters. I think it stands to reason (and can be seen in `CompilerBroker` code) that if a blocking task is in queue, then there _is_ a waiter that would call `CompileBroker::wait_for_completion()` on it. > > Additional testing: > - [x] Linux AArch64 server fastdebug, `tier1` > - [ ] Linux AArch64 server fastdebug, `all` src/hotspot/share/compiler/compileBroker.cpp line 394: > 392: ml.notify_all(); > 393: } > 394: What about other compiler threads which still in process of compiling for blocking tasks? They still need it CompileTask object. `delete_all()` is called by one compiler thread which finished compilation but other threads may not. I don't see any compiler thread checks `shut_down` state to stop compilation. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26294#discussion_r2205424761 From adinn at openjdk.org Mon Jul 14 16:10:41 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 14 Jul 2025 16:10:41 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 15:36:41 GMT, Vladimir Kozlov wrote: > On x86 byte_map_base is handled in GC code: Ok, but those two methods cover cases where mov/lea instructions are directly generated into the code stream. Why is there no need for a C2 rule for immByteMapBaee and loadByteMapBase in x86_64.ad. Is it because nothing inserts the card table base into a C2 graph as a ConP node now that we have late barrier generation? If that is the case then I don't think we need the immAOTRuntimeConstantsAddress and loadAOTRCAddress rules in x86_64.ad either. Likewise we can drop the equivalents from aarch64.ad. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3070146912 From kvn at openjdk.org Mon Jul 14 17:20:41 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 14 Jul 2025 17:20:41 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 16:08:15 GMT, Andrew Dinn wrote: > Ok, but those two methods cover cases where mov/lea instructions are directly generated into the code stream. Why is there no need for a C2 rule for immByteMapBaee and loadByteMapBase in x86_64.ad. Is it because nothing inserts the card table base into a C2 graph as a ConP node now that we have late barrier generation? Correct. 
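For context on why a biased byte_map_base can collide with an unrelated C-heap address in the first place, here is the addressing scheme in a hedged, Java-flavoured sketch (illustrative only; the authoritative code is the HotSpot C++ quoted earlier in this thread):

// card = byte_map_base + (field_address >>> card_shift)
// byte_map_base is pre-biased by -(low_bound >>> card_shift), so it need not
// point into the reserved card table at all; it can happen to equal an
// arbitrary malloc'ed address such as a JNIHandleBlock slot.
static long cardAddress(long byteMapBase, long fieldAddress, int cardShift) {
    return byteMapBase + (fieldAddress >>> cardShift);
}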
------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3070348566 From sparasa at openjdk.org Mon Jul 14 17:30:42 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Mon, 14 Jul 2025 17:30:42 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: References: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> Message-ID: On Mon, 14 Jul 2025 08:15:13 GMT, Jatin Bhateja wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> rename to paired_push and paired_pop > > src/hotspot/cpu/x86/gc/z/zBarrierSetAssembler_x86.cpp line 114: > >> 112: __ paired_push(rax); >> 113: } >> 114: __ paired_push(rcx); > > Hi @vamsi-parasa , for consecutive push/pop we can use push2/pop2 and 16byte alignment can be guaranteed using following technique > https://github.com/openjdk/jdk/pull/25351/files#diff-d5d721ebf93346ba66e81257e4f6c5e6268d59774313c61e97353c0dfbf686a5R94 Hi Jatin (@jatin-bhateja), for the first iteration, would it be ok to get the push_paired/pop_paired changes integrated and then make the push2p/pop2p related optimizations in a separate PR? Thanks, Vamsi ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2205447287 From shade at openjdk.org Mon Jul 14 17:31:40 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 14 Jul 2025 17:31:40 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 17:16:07 GMT, Vladimir Kozlov wrote: >> See the bug for more analysis. >> >> The short summary is that `CompileQueue::delete_all` walks the entire compile queue and deletes the tasks. It normally goes smoothly, unless there are blocking tasks. Then, the actual waiters have to delete the task, lest we delete the task under waiter's feet. Full deletion and blocking waits coordinate with `waiting_for_completion_count` counter. This mechanism -- added by [JDK-8343938](https://bugs.openjdk.org/browse/JDK-8343938) in JDK 25 to solve a similar problem -- almost works. _Almost_. >> >> There is a subtle race window, where blocking waiter could have already unparked, dropped `waiting_for_completion_count` to `0` and proceeded to delete the task, see `CompileBroker::wait_for_completion()`. Then the queue deletion code could assume there are _no actual waiters_ on the blocking task, and proceed to delete the task _again_. Before [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473) this race was fairly innocuous, as second attempt at insertion into the free list was benign. But now, `CompileTask`-s are `delete`-d, and the second attempt leads to double free. >> >> I suspect we can fix that by complicating the coordination protocol even further, e.g. by tracking the counters more thoroughly. But, recognizing `CompileQueue::delete_all()` is basically only called from the compiler shutdown code (things are already bad), and it looks completely opportunistic (it does not delete the whole compiler _threads_, so skipping synchronous deletes on a few compile tasks are not a big deal), we should strive to simplify it. >> >> This PR summarily delegates _all_ blocking task deletes to waiters. I think it stands to reason (and can be seen in `CompilerBroker` code) that if a blocking task is in queue, then there _is_ a waiter that would call `CompileBroker::wait_for_completion()` on it. 
>> >> Additional testing: >> - [x] Linux AArch64 server fastdebug, `tier1` >> - [ ] Linux AArch64 server fastdebug, `all` > > src/hotspot/share/compiler/compileBroker.cpp line 394: > >> 392: ml.notify_all(); >> 393: } >> 394: > > What about other compiler threads which still in process of compiling for blocking tasks? They still need it CompileTask object. > `delete_all()` is called by one compiler thread which finished compilation but other threads may not. > > I don't see any compiler thread checks `shut_down` state to stop compilation. AFAIU, that's the point of the existing protocol to force _waiters_ to delete the task: the blocking waiter would wait for compiler thread to complete the task one way or the other. This PR makes that protocol even stronger: _only_ blocking waiters are allowed to delete the blocking task. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26294#discussion_r2205449314 From kvn at openjdk.org Mon Jul 14 17:31:40 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 14 Jul 2025 17:31:40 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 17:18:10 GMT, Vladimir Kozlov wrote: > > Ok, but those two methods cover cases where mov/lea instructions are directly generated into the code stream. Why is there no need for a C2 rule for immByteMapBaee and loadByteMapBase in x86_64.ad. Is it because nothing inserts the card table base into a C2 graph as a ConP node now that we have late barrier generation? > > Correct. Correction. It is true for G1, Z, Shenandoah. For others we still have constant in C2 IR: https://github.com/openjdk/leyden/blob/premain/src/hotspot/share/gc/shared/c2/cardTableBarrierSetC2.cpp#L38 I think we never fully tested AOT with them. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3070382780 From shade at openjdk.org Mon Jul 14 17:37:38 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 14 Jul 2025 17:37:38 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 17:28:43 GMT, Aleksey Shipilev wrote: >> src/hotspot/share/compiler/compileBroker.cpp line 394: >> >>> 392: ml.notify_all(); >>> 393: } >>> 394: >> >> What about other compiler threads which still in process of compiling for blocking tasks? They still need it CompileTask object. >> `delete_all()` is called by one compiler thread which finished compilation but other threads may not. >> >> I don't see any compiler thread checks `shut_down` state to stop compilation. > > AFAIU, that's the point of the existing protocol to force _waiters_ to delete the task: the blocking waiter would wait for compiler thread to complete the task one way or the other. This PR makes that protocol even stronger: _only_ blocking waiters are allowed to delete the blocking task. Ah, your question is what happens if we notify here, and compilations are still running? Well, I think current protocol should nominally allow waiters to wait until compilation is over and then allow them to delete the task. But then I see `wait_for_compilation` can exit when compilation is shut down: while (!task->is_complete() && !is_compilation_disabled_forever()) { ml.wait(); } This will proceed to delete the task while compiler thread is running. Grrr. Looks to be another hole in this protocol. 
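For readers following the thread, the ownership rule under discussion can be condensed into a self-contained, Java-flavoured sketch (illustrative only, not the HotSpot sources): the waiter is the sole party allowed to release a blocking task, and the early exit on a permanently disabled compiler is exactly where that guarantee breaks down.

// Illustrative sketch of the "waiter owns the blocking task" protocol.
final class BlockingTask {
    private boolean complete;
    private boolean disabledForever;   // stands in for "compilation disabled forever"

    synchronized void markComplete()   { complete = true; notifyAll(); }
    synchronized void disableForever() { disabledForever = true; notifyAll(); }

    // Called by the submitting thread, which afterwards owns deletion of the task.
    synchronized void awaitCompletion() throws InterruptedException {
        // The early exit below mirrors the hole pointed out above: the waiter can
        // return (and then free the task) while a compiler thread still uses it.
        while (!complete && !disabledForever) {
            wait();
        }
    }
}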
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26294#discussion_r2205459401 From kvn at openjdk.org Mon Jul 14 17:43:38 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 14 Jul 2025 17:43:38 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 17:34:03 GMT, Aleksey Shipilev wrote: >> AFAIU, that's the point of the existing protocol to force _waiters_ to delete the task: the blocking waiter would wait for compiler thread to complete the task one way or the other. This PR makes that protocol even stronger: _only_ blocking waiters are allowed to delete the blocking task. > > Ah, your question is what happens if we notify here, and compilations are still running? Well, I think current protocol should nominally allow waiters to wait until compilation is over and then allow them to delete the task. But then I see `wait_for_compilation` can exit when compilation is shut down: > > > while (!task->is_complete() && !is_compilation_disabled_forever()) { > ml.wait(); > } > > > This will proceed to delete the task while compiler thread is running. Grrr. Looks to be another hole in this protocol. Can compiler thread delete its **own** blocking task when it finished. And let Java thread resume execution when compilation disabled as it do now but do nothing about task in such case? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26294#discussion_r2205472440 From vpaprotski at openjdk.org Mon Jul 14 17:46:40 2025 From: vpaprotski at openjdk.org (Volodymyr Paprotski) Date: Mon, 14 Jul 2025 17:46:40 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> References: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> Message-ID: On Tue, 8 Jul 2025 22:44:55 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to enhance the existing x86 assembly stubs using PUSH and POP instructions with paired PUSHP/POPP instructions which are part of Intel APX technology. >> >> In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0?R31). Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs. >> >> Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures. > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > rename to paired_push and paired_pop My concerns have been addressed; thanks Vamsi for changing the names! ------------- Marked as reviewed by vpaprotski (Author). 
PR Review: https://git.openjdk.org/jdk/pull/25889#pullrequestreview-3017232260 From vpaprotski at openjdk.org Mon Jul 14 17:56:42 2025 From: vpaprotski at openjdk.org (Volodymyr Paprotski) Date: Mon, 14 Jul 2025 17:56:42 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: References: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> Message-ID: <9OjuCgSkGnhRtf-nqXBbTu74RaaoySY7JMRaaJ0kaIY=.5575082f-2048-4d03-a8e1-3c23f916f8db@github.com> On Mon, 14 Jul 2025 17:27:35 GMT, Srinivas Vamsi Parasa wrote: >> src/hotspot/cpu/x86/gc/z/zBarrierSetAssembler_x86.cpp line 114: >> >>> 112: __ paired_push(rax); >>> 113: } >>> 114: __ paired_push(rcx); >> >> Hi @vamsi-parasa , for consecutive push/pop we can use push2/pop2 and 16byte alignment can be guaranteed using following technique >> https://github.com/openjdk/jdk/pull/25351/files#diff-d5d721ebf93346ba66e81257e4f6c5e6268d59774313c61e97353c0dfbf686a5R94 > > Hi Jatin (@jatin-bhateja), for the first iteration, would it be ok to get the push_paired/pop_paired changes integrated and then make the push2p/pop2p related optimizations in a separate PR? > > Thanks, > Vamsi I like the current approach, unless we can come up with a very 'visually-low-overhead' way of adding extra alignment (i.e. this current change is across quite a few files, rather not complicate it). At most, perhaps something like `MacroAssembler::push_align()/MacroAssembler::pop_align()`, but I really rather not add more to this PR; It touches quite a few places so I like it being simpler. As it stands, if nothing else, its clear from the `if` statement that existing path is left unmodified. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2205494234 From snatarajan at openjdk.org Mon Jul 14 18:28:57 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Mon, 14 Jul 2025 18:28:57 GMT Subject: RFR: 8342941: IGV: Add new graph dumps for post loop, empty loop removal, and one iteration removal [v4] In-Reply-To: References: Message-ID: > This changeset adds BEFORE/AFTER graph dumps for creating a post loop (`insert_post_loop()`), removing an empty loop (`do_remove_empty_loop()`), and removing a one iteration loop (`do_one_iteration_loop()`). > > Changes: > - Added `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` for dumping graphs before and after `insert_post_loop()`. > - Added `BEFORE_REMOVE_EMPTY_LOOP` and `AFTER_REMOVE_EMPTY_LOOP` for dumping graphs before and after `do_remove_empty_loop()`. > - Added `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` for dumping graphs before and after `do_one_iteration_loop()`. > > Below are sample screenshots (IGV print level 4 ) mainly showing the new phase . > 1. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` > ![image](https://github.com/user-attachments/assets/1661cede-5d70-4e0d-abec-3d091c7675c8) > 2. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` with SuperWordLoopUnrollAnalysis enabled > ![image](https://github.com/user-attachments/assets/6a22e6f0-4e6c-4e9d-8b6b-2bf75fac783d) > 3.` BEFORE_REMOVE_EMPTY_LOOP `and `AFTER_REMOVE_EMPTY_LOOP` > ![image](https://github.com/user-attachments/assets/3281f00b-575e-4604-83dd-831037d8dd47) > 4. `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` > ![image](https://github.com/user-attachments/assets/efddbc9a-64f7-403d-acfe-330d75a00911) > > Question to reviewers: > Are the new compiler phases OK, or should we change anything? 
> > Testing: > GitHub Actions > tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. > Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: addressing review comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25756/files - new: https://git.openjdk.org/jdk/pull/25756/files/8e4ca211..37aab41d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25756&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25756&range=02-03 Stats: 12 lines in 4 files changed: 7 ins; 0 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/25756.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25756/head:pull/25756 PR: https://git.openjdk.org/jdk/pull/25756 From snatarajan at openjdk.org Mon Jul 14 18:34:39 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Mon, 14 Jul 2025 18:34:39 GMT Subject: RFR: 8342941: IGV: Add various new graph dumps during loop opts [v3] In-Reply-To: References: <_p5Jj77u1VyyW0eVneXqeNjmngTvSvFi94_FALv6swk=.d4e5aec1-dd73-48ed-8d7f-3080207be763@github.com> <-qvrPep0_75olkxXj9BT74oMIHTfxwgshrHnqQC9BuU=.501e3840-2b5d-4c7c-b2fe-891a167c66d8@github.com> Message-ID: <-Kqb9qaVnwdGmWRs2cR7CGGBMEt-SltMljgv2kR6AAM=.d9cee3d7-ec38-4427-ad31-99ef950afdce@github.com> On Fri, 11 Jul 2025 07:18:17 GMT, Christian Hagedorn wrote: >> src/hotspot/share/opto/phasetype.hpp line 83: >> >>> 81: flags(AFTER_REMOVE_EMPTY_LOOP, "After Remove Empty Loop") \ >>> 82: flags(BEFORE_ONE_ITERATION_LOOP, "Before Replacing One Iteration Loop") \ >>> 83: flags(AFTER_ONE_ITERATION_LOOP, "After Replacing One Iteration Loop") \ >> >> Very much a nit, but I think this should be "One-Iteration Loop". Or, is it in fact one _iteration loop_ (as it reads now)? Looking at the code, I think it is the former. @chhagedorn can maybe clarify? >> >> This is not specific to your changeset, but also appears in existing source code comments. Maybe a good opportunity to clean this up everywhere? >> >> Also, maybe "Replacing" should be "Replace"? Seems to better fit the style used for other phase names. > > One-Iteration loop sounds better indeed. I also agree with the other suggestions. > > Something else I've noticed is that we could also benefit when we add dumps for `duplicate_loop_backedge()` which creates a new loop node (i.e. could be seen as "major modification"). I just looked into recently and found myself adding dumps there manually for debugging. I guess since this is a dump adding RFE, we could also add that one. What do you think? But then we would need to update the PR title to something like "add various new graph dumps during loop opts". Thank you for the comments. I have made the suggested changes to the source code. Attached is a screenshot for the graph dump for `duplicate_loop_backedge() `. I have added the suggested new title. 
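For context, new IGV phases like the ones discussed in this thread are typically wired up in two small steps. The sketch below is illustrative only; the dump level and the exact shape of do_one_iteration_loop() are assumptions rather than the actual patch:

```c++
// 1) Declare the phases in opto/phasetype.hpp; the flags() macro generates the
//    PHASE_* enum values and the names IGV shows (continuation backslashes omitted):
//      flags(BEFORE_ONE_ITERATION_LOOP, "Before Replacing One-Iteration Loop")
//      flags(AFTER_ONE_ITERATION_LOOP,  "After Replacing One-Iteration Loop")
// 2) Bracket the transformation with graph dumps (illustrative shape only):
void PhaseIdealLoop::do_one_iteration_loop(IdealLoopTree* loop) {
  C->print_method(PHASE_BEFORE_ONE_ITERATION_LOOP, 4, loop->_head);
  // ... existing code that replaces the single-trip counted loop with its body ...
  C->print_method(PHASE_AFTER_ONE_ITERATION_LOOP, 4, loop->_head);
}
```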
image ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25756#discussion_r2205559207 From snatarajan at openjdk.org Mon Jul 14 18:50:23 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Mon, 14 Jul 2025 18:50:23 GMT Subject: RFR: 8353276: C2: simplify PhaseMacroExpand::opt_bits_test [v2] In-Reply-To: References: Message-ID: > **Issue** > The last three parameters of `PhaseMacroExpand::opt_bits_test(Node* ctrl, Node* region, int edge, Node* word, int mask, int bits, bool return_fast_path)` are unnecessary after the fix introduced in [JDK-8256425](https://bugs.openjdk.org/browse/JDK-8256425) > > **Fix** > The proposed fix removes the last three parameters and makes the necessary modification to the methods. > > **Testing** > GitHub Actions > tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: addressing review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26276/files - new: https://git.openjdk.org/jdk/pull/26276/files/c8164502..1b6be049 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26276&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26276&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 1 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26276.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26276/head:pull/26276 PR: https://git.openjdk.org/jdk/pull/26276 From snatarajan at openjdk.org Mon Jul 14 18:50:23 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Mon, 14 Jul 2025 18:50:23 GMT Subject: RFR: 8353276: C2: simplify PhaseMacroExpand::opt_bits_test [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 07:43:41 GMT, Christian Hagedorn wrote: >> Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: >> >> addressing review comments > > src/hotspot/share/opto/macro.cpp line 98: > >> 96: Node* PhaseMacroExpand::opt_bits_test(Node* ctrl, Node* region, int edge, Node* word) { >> 97: Node* cmp; >> 98: cmp = word; > > Could now be merged (I cannot make a direct suggestion due to deleted lines): > > Node* cmp = word; Thank you. I have addressed this in the new commit ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26276#discussion_r2205586068 From duke at openjdk.org Mon Jul 14 20:25:21 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Mon, 14 Jul 2025 20:25:21 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v36] In-Reply-To: References: Message-ID: > This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). > > When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. > > This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality > > Additional Testing: > - [ ] Linux x64 fastdebug all > - [ ] Linux aarch64 fastdebug all > - [ ] ... 
Chad Rakoczy has updated the pull request incrementally with two additional commits since the last revision: - Add nmethod copy constructor - Remove aarch64 trampoline check ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23573/files - new: https://git.openjdk.org/jdk/pull/23573/files/66d73c16..371e1303 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=35 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=34-35 Stats: 77 lines in 2 files changed: 0 ins; 19 del; 58 mod Patch: https://git.openjdk.org/jdk/pull/23573.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23573/head:pull/23573 PR: https://git.openjdk.org/jdk/pull/23573 From duke at openjdk.org Mon Jul 14 20:34:42 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Mon, 14 Jul 2025 20:34:42 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v37] In-Reply-To: References: Message-ID: > This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). > > When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. > > This change only slightly modifies existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality > > Additional Testing: > - [ ] Linux x64 fastdebug all > - [ ] Linux aarch64 fastdebug all > - [ ] ... Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: Revert is_always_within_branch_range changes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23573/files - new: https://git.openjdk.org/jdk/pull/23573/files/371e1303..36834705 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=36 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=35-36 Stats: 4 lines in 2 files changed: 0 ins; 3 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23573.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23573/head:pull/23573 PR: https://git.openjdk.org/jdk/pull/23573 From duke at openjdk.org Mon Jul 14 20:51:51 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Mon, 14 Jul 2025 20:51:51 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: <85Fw_Bg0OrMd_LYl4PG_VFqFX2QTdcUK-DFOAxzyjIM=.bdbb0c13-f7a3-458f-a61a-004c6eadc1cc@github.com> References: <73AnlXOv0T8K25DgsNdH1PkBjcBXz0f3bBYZx44LpAw=.439f5383-ffd1-44e8-9e11-4b5af9b6a278@github.com> <3f1UnDuYp2iYVcciKF-BqdChOOY2PJJG5R0QuyfblVM=.37a92dfd-5de4-4924-83c5-f9c2e5d7548c@github.com> <85Fw_Bg0OrMd_LYl4PG_VFqFX2QTdcUK-DFOAxzyjIM=.bdbb0c13-f7a3-458f-a61a-004c6eadc1cc@github.com> Message-ID: On Sun, 13 Jul 2025 09:31:45 GMT, Andrew Haley wrote: > In what circumstances would a trampoline be missing? A trampoline could be missing if the nmethod is from JVMCI/Graal. Hotspot decreases the max branch size for debug builds on aarch64 ([source](https://github.com/openjdk/jdk/blob/a10ee46e6dd94a279e0821d431944bb096493664/src/hotspot/cpu/aarch64/assembler_aarch64.hpp#L928-L936)) for stress testing. 
Since Graal only ever uses the actual max range Hotspot may expect trampolines that Graal has determined aren't actually necessary. However after updating how call sites are fixed ([commit](https://github.com/openjdk/jdk/pull/23573/commits/a6302fdf5754b382702577e8e421c85a5fb9063c)) this code is no longer needed. Since the trampoline relocations are responsible for fixing their owners `CallRelocation::fix_relocation_after_move` no longer needs to perform range checks. So the situation I mentioned above about "missing trampolines" is no longer an issue for relocation ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2205791168 From adinn at openjdk.org Mon Jul 14 21:11:38 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Mon, 14 Jul 2025 21:11:38 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: <-QLF92QDigkY7M3h0GdnV-4J-OHpybRnOiolpdIwQIU=.8b0518d8-7c95-42d5-983a-adf7d0d464c9@github.com> On Mon, 14 Jul 2025 17:29:11 GMT, Vladimir Kozlov wrote: > Correction. It is true for G1, Z, Shenandoah. For others we still have constant in C2 IR Ok, so for Leyden premain we really need to detect a card table base ConP in an x86 C2 graph and generate an external reloc for it -- if, say, we use a generational serial/parallel GC. Likewise we will need to detect and generate external relocs for any ConP node that references an AOTRuntimeConstants field. We don't have to do it using a matching rule. We could instead implement the relevant logic in the encoding for a generic load(ConP) rule or in a macro assembler method called from that encoding. I still personally prefer the idea of distinguishing the relevant cases by detecting a RawPtr vs an OopPtr as being the least intrusive. However, as you say that may still leave us with a card base address that might cause other problems (e.g. it might be zero or a small negative offset). So, let's deal with the backports as Alexei proposed and, when we come to it, implement the premain variants in the encoding or macro assembler rather than via match rules. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3070995960 From dlong at openjdk.org Mon Jul 14 22:58:42 2025 From: dlong at openjdk.org (Dean Long) Date: Mon, 14 Jul 2025 22:58:42 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v5] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 16:22:29 GMT, Hannes Greule wrote: >> Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. >> >> Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. >> >> I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. >> >> Please review. Thanks. > > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > cleanup Marked as reviewed by dlong (Reviewer). I started a test run for the latest version. Please hold off on integrating until the results are in. 
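To make the fix described above concrete, here is a minimal sketch of the constant-folding path for the short flavour. The node and helper names (ReverseBytesSNode, byteswap) are assumptions for illustration; this is not the actual patch:

```c++
// Narrow the int carrier with static_cast before swapping, so constants with
// bits above the short range are folded instead of tripping an assert.
const Type* ReverseBytesSNode::Value(PhaseGVN* phase) const {
  const Type* t = phase->type(in(1));
  if (t == Type::TOP) {
    return Type::TOP;
  }
  if (t->singleton()) {
    jshort narrowed = static_cast<jshort>(t->is_int()->get_con()); // drop the upper bytes
    return TypeInt::make(byteswap(narrowed));                      // swap the low two bytes
  }
  return bottom_type();
}
```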
------------- PR Review: https://git.openjdk.org/jdk/pull/25988#pullrequestreview-3018022287 PR Comment: https://git.openjdk.org/jdk/pull/25988#issuecomment-3071272532 From dzhang at openjdk.org Tue Jul 15 01:30:46 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Tue, 15 Jul 2025 01:30:46 GMT Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors [v2] In-Reply-To: References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> <5PCmTwnensUBsUNqVlxUuK6L2nDHIOqek7KEH5r_h_M=.9a05eebc-f3ba-4b0e-b0e0-76e89661c89d@github.com> Message-ID: On Mon, 14 Jul 2025 12:19:01 GMT, Feilong Jiang wrote: >> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> Adjust the position of comment > > src/hotspot/cpu/riscv/riscv.ad line 1999: > >> 1997: } else if (bt == T_SHORT) { >> 1998: // To support vector type conversions between short and wider types. >> 1999: size = 2; > > Should we add some `assert` or `guarantee` for uncovered types? Thanks for the review! I think assert or guarantee are unnecessary for types not explicitly covered, as their behavior is safely constrained by general rules and global constraints (e.g., limit the min vector size to 8-byte and minimum size of 2). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26239#discussion_r2206096134 From xgong at openjdk.org Tue Jul 15 01:32:46 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 15 Jul 2025 01:32:46 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 13:23:33 GMT, Bhavana Kilambi wrote: >> This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform. >> >> ### Background >> Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register. >> >> ### Implementation >> >> #### Challenges >> Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints. >> >> For a 512-bit SVE machine, loading a `byte` vector with different vector species require different approaches: >> - SPECIES_64: Single operation with mask (8 elements, 256-bit) >> - SPECIES_128: Single operation, full register (16 elements, 512-bit) >> - SPECIES_256: Two operations + merge (32 elements, 1024-bit) >> - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit) >> >> Use `ByteVector.SPECIES_512` as an example: >> - It contains 64 elements. So the index vector size should be `64 * 32` bits, which is 4 times of the SVE vector register size. >> - It requires 4 times of vector gather-loads to finish the whole operation. >> >> >> byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...] >> int[] idx = [0, 1, 2, 3, ..., 63, ...] >> >> 4 gather-load: >> idx_v1 = [15 14 13 ... 1 0] gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa] >> idx_v2 = [31 30 29 ... 17 16] gather_v2 = [... 
0000 0000 0000 0000 bbbb bbbb bbbb bbbb] >> idx_v3 = [47 46 45 ... 33 32] gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc] >> idx_v4 = [63 62 61 ... 49 48] gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd] >> merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa] >> >> >> #### Solution >> The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end. >> >> Here is the main changes: >> - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher. >> - Added `VectorSliceNode` for result mer... > > src/hotspot/cpu/aarch64/aarch64_vector.ad line 5990: > >> 5988: %} >> 5989: >> 5990: instruct vmaskwiden_hi_sve(pReg dst, pReg src) %{ > > can both the hi and lo widen rules be combined into a single one as the arguments are the same? or would it make it less understandable? The main problem is that we cannot get the flag of `__is_lo` easily from the relative machnode as far as I know. > src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 352: > >> 350: // SVE requires vector indices for gather-load/scatter-store operations >> 351: // on all data types. >> 352: bool Matcher::gather_scatter_needs_vector_index(BasicType bt) { > > There's already a function that tests for `UseSVE > 0` here - https://github.com/openjdk/jdk/blob/bcd86d575fe0682a234228c18b0c2e817d3816da/src/hotspot/cpu/aarch64/matcher_aarch64.hpp#L36 > > Can it be reused? Do you mean directly using `supports_scalable_vector` instead of the new added method in mid-end? I'm afraid we cannot use it. Because on X86, the indexes for subword types are passed with address of the index array, while it's a vector for other types even on AVX-512. But yes, we can call `supports_scalable_vector()` in the new added method for AArch64. > src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 3430: > >> 3428: >> 3429: instruct vslice_neon(vReg dst, vReg src1, vReg src2, immI index) %{ >> 3430: predicate(VM_Version::use_neon_for_vector(Matcher::vector_length_in_bytes(n))); > > nit: indentation. I think there're 3 spaces here.. Same with the SVE version below. Good catch! I will update it. Thanks a lot! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2206092888 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2206096909 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2206097514 From xgong at openjdk.org Tue Jul 15 02:43:45 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 15 Jul 2025 02:43:45 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 11:17:41 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. 
The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments to half the number of match rules Thanks for your updating! Overall looks good to me, just with a minor assertion issue in macro assembler. Please see my comment below. src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2861: > 2859: FloatRegister src2, FloatRegister index, > 2860: FloatRegister tmp, unsigned vector_length_in_bytes) { > 2861: assert_different_registers(dst, src1, src2, tmp); It seems `dst` can be the same with either `src1`, `src2`, or `tmp` from following implementation instruction, right? Maybe we should assert more accurate for different cases, such as `src2 == src1 + 1` when `vector_length_in_bytes == 16`? ------------- PR Review: https://git.openjdk.org/jdk/pull/23570#pullrequestreview-3018320296 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2206159876 From fyang at openjdk.org Tue Jul 15 03:19:40 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 15 Jul 2025 03:19:40 GMT Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors [v2] In-Reply-To: References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> <5PCmTwnensUBsUNqVlxUuK6L2nDHIOqek7KEH5r_h_M=.9a05eebc-f3ba-4b0e-b0e0-76e89661c89d@github.com> Message-ID: On Tue, 15 Jul 2025 01:27:59 GMT, Dingli Zhang wrote: >> src/hotspot/cpu/riscv/riscv.ad line 1999: >> >>> 1997: } else if (bt == T_SHORT) { >>> 1998: // To support vector type conversions between short and wider types. >>> 1999: size = 2; >> >> Should we add some `assert` or `guarantee` for uncovered types? > > Thanks for the review! > I think assert or guarantee are unnecessary for types not explicitly covered, as their behavior is safely constrained by general rules and global constraints (e.g., limit the min vector size to 8-byte and minimum size of 2). I think it's more clear to make this a switch-case listing the other cases as the default like the aarch64 counterpart. 
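The switch-case shape suggested here could look roughly like the sketch below. It is illustrative only; the per-type minimums and comments are assumptions, not the final patch:

```c++
// Assumed shape of Matcher::min_vector_size in riscv.ad after the rework.
int Matcher::min_vector_size(const BasicType bt) {
  int max_size = max_vector_size(bt);
  int size;
  switch (bt) {
    case T_BYTE:
      size = 4;  // backend-specific minimum for byte vectors
      break;
    case T_SHORT:
      // 2 elements (32 bits) so casts between short and int/long/float/double
      // can still be vectorized for the smallest short species.
      size = 2;
      break;
    default:
      size = 8 / type2aelembytes(bt);  // limit the minimum vector size to 8 bytes
      break;
  }
  return MIN2(MAX2(size, 2), max_size);  // never fewer than 2 elements, never above the max
}
```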
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26239#discussion_r2206193272 From dzhang at openjdk.org Tue Jul 15 04:01:21 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Tue, 15 Jul 2025 04:01:21 GMT Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> Message-ID: > Follow up [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419), RVV supports all vector type conversion APIs in the Vector API. > So we only need to relax the length limit of the short type to achieve a significant improvement in JMH performance for converting between short and wider types. > > ### Test > qemu-system UseRVV: > * [x] Run jdk_vector (fastdebug) > * [x] Run compiler/vectorapi (fastdebug) > > ### Performance > Following shows the performance improvement of relative VectorAPI JMHs on k1 (256-bit RVV): > > > Benchmark (SIZE) Mode Units Before After Gain > VectorFPtoIntCastOperations.microDouble128ToShort128 512 thrpt ops/ms 52.280 840.112 16.07 > VectorFPtoIntCastOperations.microDouble128ToShort128 1024 thrpt ops/ms 28.156 429.322 15.25 > VectorFPtoIntCastOperations.microFloat64ToShort64 512 thrpt ops/ms 14.242 479.509 33.67 > VectorFPtoIntCastOperations.microFloat64ToShort64 1024 thrpt ops/ms 6.906 242.690 35.14 > > PS: `VectorFPtoIntCastOperations.microFloat64ToShort64` is added by [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419). Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Use switch-case in min_vector_size ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26239/files - new: https://git.openjdk.org/jdk/pull/26239/files/0773a366..84d6b25f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26239&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26239&range=01-02 Stats: 21 lines in 1 file changed: 8 ins; 0 del; 13 mod Patch: https://git.openjdk.org/jdk/pull/26239.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26239/head:pull/26239 PR: https://git.openjdk.org/jdk/pull/26239 From dzhang at openjdk.org Tue Jul 15 04:01:22 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Tue, 15 Jul 2025 04:01:22 GMT Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors [v2] In-Reply-To: References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> <5PCmTwnensUBsUNqVlxUuK6L2nDHIOqek7KEH5r_h_M=.9a05eebc-f3ba-4b0e-b0e0-76e89661c89d@github.com> Message-ID: On Tue, 15 Jul 2025 03:16:17 GMT, Fei Yang wrote: >> Thanks for the review! >> I think assert or guarantee are unnecessary for types not explicitly covered, as their behavior is safely constrained by general rules and global constraints (e.g., limit the min vector size to 8-byte and minimum size of 2). > > I think it's more clear to make this a switch-case listing the other cases as the default like the aarch64 counterpart. Fixed. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26239#discussion_r2206229388 From fjiang at openjdk.org Tue Jul 15 06:01:38 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Tue, 15 Jul 2025 06:01:38 GMT Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> Message-ID: On Tue, 15 Jul 2025 04:01:21 GMT, Dingli Zhang wrote: >> Follow up [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419), RVV supports all vector type conversion APIs in the Vector API. >> So we only need to relax the length limit of the short type to achieve a significant improvement in JMH performance for converting between short and wider types. >> >> ### Test >> qemu-system UseRVV: >> * [x] Run jdk_vector (fastdebug) >> * [x] Run compiler/vectorapi (fastdebug) >> >> ### Performance >> Following shows the performance improvement of relative VectorAPI JMHs on k1 (256-bit RVV): >> >> >> Benchmark (SIZE) Mode Units Before After Gain >> VectorFPtoIntCastOperations.microDouble128ToShort128 512 thrpt ops/ms 52.280 840.112 16.07 >> VectorFPtoIntCastOperations.microDouble128ToShort128 1024 thrpt ops/ms 28.156 429.322 15.25 >> VectorFPtoIntCastOperations.microFloat64ToShort64 512 thrpt ops/ms 14.242 479.509 33.67 >> VectorFPtoIntCastOperations.microFloat64ToShort64 1024 thrpt ops/ms 6.906 242.690 35.14 >> >> PS: `VectorFPtoIntCastOperations.microFloat64ToShort64` is added by [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419). > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Use switch-case in min_vector_size Marked as reviewed by fjiang (Committer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26239#pullrequestreview-3018777723 From thartmann at openjdk.org Tue Jul 15 06:25:47 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 15 Jul 2025 06:25:47 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v5] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 16:22:29 GMT, Hannes Greule wrote: >> Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. >> >> Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. >> >> I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. >> >> Please review. Thanks. > > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > cleanup Looks good to me. Also, @dean-long's testing is clean. Ship it! :slightly_smiling_face: ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/25988#pullrequestreview-3018843054 From hgreule at openjdk.org Tue Jul 15 06:30:50 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Tue, 15 Jul 2025 06:30:50 GMT Subject: RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() [v5] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 16:22:29 GMT, Hannes Greule wrote: >> Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. >> >> Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. >> >> I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. >> >> Please review. Thanks. > > Hannes Greule has updated the pull request incrementally with one additional commit since the last revision: > > cleanup Thank you all for your reviews and comments. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25988#issuecomment-3072174881 From hgreule at openjdk.org Tue Jul 15 06:30:52 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Tue, 15 Jul 2025 06:30:52 GMT Subject: Integrated: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() In-Reply-To: References: Message-ID: On Wed, 25 Jun 2025 19:34:40 GMT, Hannes Greule wrote: > Fixes an assertion when passing an int larger than short/char to the corresponding reverseBytes method in a constant-folding scenario. By just using static_cast, we can ignore the upper bytes and just swap the lower bytes. > > Using jasm, I added a test case that covers such inputs. It felt easier to test this way than the other scenarios mentioned in the bug report. > > I also removed the redundant checked_cast calls from the int/long case; we already have the correct type there. > > Please review. Thanks. This pull request has now been integrated. Changeset: e5ab2107 Author: Hannes Greule Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/e5ab210713f76c5307287bd97ce63f9e22d0ab8e Stats: 81 lines in 3 files changed: 71 ins; 0 del; 10 mod 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() Reviewed-by: mhaessig, dlong, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/25988 From thartmann at openjdk.org Tue Jul 15 06:35:22 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 15 Jul 2025 06:35:22 GMT Subject: [jdk25] RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() Message-ID: Hi all, This pull request contains a backport of commit [e5ab2107](https://github.com/openjdk/jdk/commit/e5ab210713f76c5307287bd97ce63f9e22d0ab8e) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. The commit being backported was authored by Hannes Greule on 15 Jul 2025 and was reviewed by Manuel H?ssig, Dean Long and Tobias Hartmann. Thanks! 
------------- Commit messages: - Backport e5ab210713f76c5307287bd97ce63f9e22d0ab8e Changes: https://git.openjdk.org/jdk/pull/26308/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26308&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8359678 Stats: 81 lines in 3 files changed: 71 ins; 0 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/26308.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26308/head:pull/26308 PR: https://git.openjdk.org/jdk/pull/26308 From snatarajan at openjdk.org Tue Jul 15 06:38:34 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Tue, 15 Jul 2025 06:38:34 GMT Subject: RFR: 8358641: C1 option -XX:+TimeEachLinearScan is broken [v2] In-Reply-To: References: Message-ID: > **Issue** > Using the command` java -Xcomp -XX:TieredStopAtLevel=1 -XX:+TimeEachLinearScan` results in an assert failure in line `assert(_cached_blocks.length() == ir()->linear_scan_order()->length()) failed: invalid cached block list`. > > **Suggestion** > Removal of flag as this is a very old issue > > **Fix** > Removed the flag by removing relevant methods and code while ensuring the removal does not affect other flags. Saranya Natarajan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - addressing review comments - merge master Merge branch 'master' of https://github.com/sarannat/jdk into JDK-8358641 - Initial Fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25933/files - new: https://git.openjdk.org/jdk/pull/25933/files/980f9a50..d66a4ce9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25933&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25933&range=00-01 Stats: 225058 lines in 4104 files changed: 131570 ins; 62711 del; 30777 mod Patch: https://git.openjdk.org/jdk/pull/25933.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25933/head:pull/25933 PR: https://git.openjdk.org/jdk/pull/25933 From xgong at openjdk.org Tue Jul 15 06:45:27 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 15 Jul 2025 06:45:27 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v2] In-Reply-To: References: Message-ID: > This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform. > > ### Background > Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register. > > ### Implementation > > #### Challenges > Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints. 
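As a rough way to picture the splitting, the number of gather-load operations needed for one subword vector follows directly from the 32-bit index element size. The helper below is purely hypothetical and only restates that arithmetic:

```c++
// Each SVE gather-load is driven by an int-index vector, so one operation can
// load at most (SVE vector width in bits) / 32 subword elements.
static int gather_loads_needed(int elem_num, int sve_width_in_bits) {
  int elems_per_gather = sve_width_in_bits / 32;
  int ops = elem_num / elems_per_gather;
  return ops < 1 ? 1 : ops;  // e.g. a 64-element byte vector on 512-bit SVE -> 4 operations
}
```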
> > For a 512-bit SVE machine, loading a `byte` vector with different vector species require different approaches: > - SPECIES_64: Single operation with mask (8 elements, 256-bit) > - SPECIES_128: Single operation, full register (16 elements, 512-bit) > - SPECIES_256: Two operations + merge (32 elements, 1024-bit) > - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit) > > Use `ByteVector.SPECIES_512` as an example: > - It contains 64 elements. So the index vector size should be `64 * 32` bits, which is 4 times of the SVE vector register size. > - It requires 4 times of vector gather-loads to finish the whole operation. > > > byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...] > int[] idx = [0, 1, 2, 3, ..., 63, ...] > > 4 gather-load: > idx_v1 = [15 14 13 ... 1 0] gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa] > idx_v2 = [31 30 29 ... 17 16] gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb] > idx_v3 = [47 46 45 ... 33 32] gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc] > idx_v4 = [63 62 61 ... 49 48] gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd] > merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa] > > > #### Solution > The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end. > > Here is the main changes: > - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher. > - Added `VectorSliceNode` for result merging. > - Added `VectorMaskWidenNode` for mask spliting and type conversion fo... Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - Fix indentation issue and move the helper matcher method to header files - Merge branch jdk:master into JDK-8351623-sve - 8351623: VectorAPI: Add SVE implementation of subword gather load operation ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26236/files - new: https://git.openjdk.org/jdk/pull/26236/files/a3db39c3..c39dade2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26236&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26236&range=00-01 Stats: 16304 lines in 537 files changed: 7374 ins; 5177 del; 3753 mod Patch: https://git.openjdk.org/jdk/pull/26236.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26236/head:pull/26236 PR: https://git.openjdk.org/jdk/pull/26236 From thartmann at openjdk.org Tue Jul 15 06:50:47 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 15 Jul 2025 06:50:47 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v13] In-Reply-To: References: <-olTdjMIhNFfAwGbtWC5xswpKbgM_6uPJBgqoL-joJg=.83566f34-801e-449d-b613-dc2f81f40e54@github.com> <260VpdfrxR3vKnrlKQPuVwzJJ3lXM6liDFV4mi-7swg=.eff991a1-0b1f-4fe5-bc31-69896640e654@github.com> Message-ID: On Mon, 14 Jul 2025 15:13:52 GMT, Jatin Bhateja wrote: >>> -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation >> >> Thanks @TobiHartmann , kindly verify with the latest version. > >> @jatin-bhateja This is with the latest version (webrev 13). 
> > Hi @TobiHartmann I don't see any failure at https://github.com/openjdk/jdk/pull/23947/commits/06eafe7712833d830bbd60cdb729ad261eca59b8 Hi @jatin-bhateja, you are right, my testing missed your latest commit. Re-running. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3072242632 From chagedorn at openjdk.org Tue Jul 15 07:02:47 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 15 Jul 2025 07:02:47 GMT Subject: [jdk25] RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() In-Reply-To: References: Message-ID: <5ee-QeU1KgK8FfKwb2ZCxjTwWR5jwAR4iA3xp5GBnqc=.aec9cf05-1524-498f-a4ab-717a30264a18@github.com> On Tue, 15 Jul 2025 06:29:18 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [e5ab2107](https://github.com/openjdk/jdk/commit/e5ab210713f76c5307287bd97ce63f9e22d0ab8e) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Hannes Greule on 15 Jul 2025 and was reviewed by Manuel H?ssig, Dean Long and Tobias Hartmann. > > Thanks! Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26308#pullrequestreview-3018958682 From thartmann at openjdk.org Tue Jul 15 07:15:40 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 15 Jul 2025 07:15:40 GMT Subject: [jdk25] RFR: 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 06:29:18 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [e5ab2107](https://github.com/openjdk/jdk/commit/e5ab210713f76c5307287bd97ce63f9e22d0ab8e) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Hannes Greule on 15 Jul 2025 and was reviewed by Manuel H?ssig, Dean Long and Tobias Hartmann. > > Thanks! Thanks Christian! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26308#issuecomment-3072340552 From fyang at openjdk.org Tue Jul 15 07:25:45 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 15 Jul 2025 07:25:45 GMT Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> Message-ID: On Tue, 15 Jul 2025 04:01:21 GMT, Dingli Zhang wrote: >> Follow up [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419), RVV supports all vector type conversion APIs in the Vector API. >> So we only need to relax the length limit of the short type to achieve a significant improvement in JMH performance for converting between short and wider types. 
>> >> ### Test >> qemu-system UseRVV: >> * [x] Run jdk_vector (fastdebug) >> * [x] Run compiler/vectorapi (fastdebug) >> >> ### Performance >> Following shows the performance improvement of relative VectorAPI JMHs on k1 (256-bit RVV): >> >> >> Benchmark (SIZE) Mode Units Before After Gain >> VectorFPtoIntCastOperations.microDouble128ToShort128 512 thrpt ops/ms 52.280 840.112 16.07 >> VectorFPtoIntCastOperations.microDouble128ToShort128 1024 thrpt ops/ms 28.156 429.322 15.25 >> VectorFPtoIntCastOperations.microFloat64ToShort64 512 thrpt ops/ms 14.242 479.509 33.67 >> VectorFPtoIntCastOperations.microFloat64ToShort64 1024 thrpt ops/ms 6.906 242.690 35.14 >> >> PS: `VectorFPtoIntCastOperations.microFloat64ToShort64` is added by [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419). > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Use switch-case in min_vector_size src/hotspot/cpu/riscv/riscv.ad line 1999: > 1997: break; > 1998: case T_SHORT: > 1999: // To support vector type conversions between short and wider types. The code comment doesn't seem to reflect the purpose of this change. Can you improve it adding more details? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26239#discussion_r2206635111 From duke at openjdk.org Tue Jul 15 08:11:28 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Tue, 15 Jul 2025 08:11:28 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v9] In-Reply-To: References: Message-ID: <7od62MdoD83EGfh9UTcLLE1pkkvaZclx2c9sIiLB58M=.066b94a9-e66e-475a-8434-2a6160c7642c@github.com> > The patch adds possibility to use RVV instructions for faster vectorizedHashCode calculations on RVV v1.0.0 capable hardware. > > Testing: hotspot/jtreg/compiler/ under QEMU-8.1 with RVV v1.0.0. Yuri Gaevsky has updated the pull request incrementally with one additional commit since the last revision: simplified arrays_hashcode_v() to be closer to VLA and use less general-purpose registers; minor cosmetic changes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17413/files - new: https://git.openjdk.org/jdk/pull/17413/files/4e9ad18f..6daaae6e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17413&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17413&range=07-08 Stats: 92 lines in 4 files changed: 13 ins; 34 del; 45 mod Patch: https://git.openjdk.org/jdk/pull/17413.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17413/head:pull/17413 PR: https://git.openjdk.org/jdk/pull/17413 From aph at openjdk.org Tue Jul 15 08:22:40 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 15 Jul 2025 08:22:40 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 15:36:41 GMT, Vladimir Kozlov wrote: > On x86 byte_map_base is handled in GC code: https://github.com/openjdk/leyden/blob/premain/src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp#L314 https://github.com/openjdk/leyden/blob/premain/src/hotspot/cpu/x86/gc/shared/cardTableBarrierSetAssembler_x86.cpp#L67 > > Using relocation for byte_map_base is not safe (see comment in `g1BarrierSetAssembler_x86.cpp`). 
We are "safe" because we bailout AOT code caching if byte_map_base is not relocatable: https://github.com/openjdk/leyden/blob/premain/src/hotspot/share/code/aotCodeCache.cpp#L338 Here: // Do not use ExternalAddress to load 'byte_map_base', since 'byte_map_base' is NOT // a valid address and therefore is not properly handled by the relocation code. if (AOTCodeCache::is_on_for_dump()) { // AOT code needs relocation info for this address __ lea(tmp2, ExternalAddress((address)ct->card_table()->byte_map_base())); // tmp2 := card table base address } else { It says "Do not use `ExternalAddress` to load 'byte_map_base'" but then uses `ExternalAddress`. I'm baffled... ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3072625120 From duke at openjdk.org Tue Jul 15 08:31:48 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Tue, 15 Jul 2025 08:31:48 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v9] In-Reply-To: <7od62MdoD83EGfh9UTcLLE1pkkvaZclx2c9sIiLB58M=.066b94a9-e66e-475a-8434-2a6160c7642c@github.com> References: <7od62MdoD83EGfh9UTcLLE1pkkvaZclx2c9sIiLB58M=.066b94a9-e66e-475a-8434-2a6160c7642c@github.com> Message-ID: On Tue, 15 Jul 2025 08:11:28 GMT, Yuri Gaevsky wrote: >> The patch adds possibility to use RVV instructions for faster vectorizedHashCode calculations on RVV v1.0.0 capable hardware. >> >> Testing: hotspot/jtreg/compiler/ under QEMU-8.1 with RVV v1.0.0. > > Yuri Gaevsky has updated the pull request incrementally with one additional commit since the last revision: > > simplified arrays_hashcode_v() to be closer to VLA and use less general-purpose registers; minor cosmetic changes bpif3-16g% ( for i in "-XX:DisableIntrinsic=_vectorizedHashCode" "-XX:-UseRVV" "-XX:+UseRVV" ; \ do ( echo "--- ${i} ---" && ${JAVA_HOME}/bin/java -jar benchmarks.jar \ --jvmArgs="-XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions ${i}" \ org.openjdk.bench.java.lang.ArraysHashCode.ints \ -p size=1,5,10,20,30,40,50,60,70,80,90,100,200,300 \ -f 3 -r 1 -w 1 -wi 10 -i 10 2>&1 | tail -15 ) done ) --- -XX:DisableIntrinsic=_vectorizedHashCode --- Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.ints 1 avgt 30 11.273 ? 0.003 ns/op ArraysHashCode.ints 5 avgt 30 28.817 ? 0.013 ns/op ArraysHashCode.ints 10 avgt 30 41.330 ? 0.280 ns/op ArraysHashCode.ints 20 avgt 30 68.236 ? 0.057 ns/op ArraysHashCode.ints 30 avgt 30 88.455 ? 0.142 ns/op ArraysHashCode.ints 40 avgt 30 115.251 ? 0.350 ns/op ArraysHashCode.ints 50 avgt 30 135.525 ? 0.685 ns/op ArraysHashCode.ints 60 avgt 30 161.547 ? 0.165 ns/op ArraysHashCode.ints 70 avgt 30 171.417 ? 0.402 ns/op ArraysHashCode.ints 80 avgt 30 193.232 ? 0.241 ns/op ArraysHashCode.ints 90 avgt 30 207.720 ? 0.304 ns/op ArraysHashCode.ints 100 avgt 30 232.256 ? 0.792 ns/op ArraysHashCode.ints 200 avgt 30 447.408 ? 0.308 ns/op ArraysHashCode.ints 300 avgt 30 656.444 ? 1.332 ns/op --- -XX:-UseRVV --- Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.ints 1 avgt 30 11.279 ? 0.013 ns/op ArraysHashCode.ints 5 avgt 30 24.427 ? 0.005 ns/op ArraysHashCode.ints 10 avgt 30 35.704 ? 0.011 ns/op ArraysHashCode.ints 20 avgt 30 58.894 ? 0.062 ns/op ArraysHashCode.ints 30 avgt 30 82.685 ? 0.015 ns/op ArraysHashCode.ints 40 avgt 30 105.861 ? 0.065 ns/op ArraysHashCode.ints 50 avgt 30 129.672 ? 0.038 ns/op ArraysHashCode.ints 60 avgt 30 152.865 ? 0.057 ns/op ArraysHashCode.ints 70 avgt 30 176.689 ? 0.063 ns/op ArraysHashCode.ints 80 avgt 30 199.823 ? 0.035 ns/op ArraysHashCode.ints 90 avgt 30 223.588 ? 
0.046 ns/op ArraysHashCode.ints 100 avgt 30 247.405 ? 0.661 ns/op ArraysHashCode.ints 200 avgt 30 481.698 ? 0.123 ns/op ArraysHashCode.ints 300 avgt 30 716.488 ? 0.104 ns/op --- -XX:+UseRVV --- Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.ints 1 avgt 30 11.276 ? 0.002 ns/op ArraysHashCode.ints 5 avgt 30 22.590 ? 0.039 ns/op ArraysHashCode.ints 10 avgt 30 35.075 ? 0.008 ns/op ArraysHashCode.ints 20 avgt 30 60.142 ? 0.015 ns/op ArraysHashCode.ints 30 avgt 30 85.185 ? 0.020 ns/op ArraysHashCode.ints 40 avgt 30 114.650 ? 1.260 ns/op ArraysHashCode.ints 50 avgt 30 115.520 ? 0.958 ns/op ArraysHashCode.ints 60 avgt 30 113.143 ? 0.416 ns/op ArraysHashCode.ints 70 avgt 30 139.685 ? 0.021 ns/op ArraysHashCode.ints 80 avgt 30 137.792 ? 0.644 ns/op ArraysHashCode.ints 90 avgt 30 139.445 ? 0.458 ns/op ArraysHashCode.ints 100 avgt 30 164.109 ? 0.036 ns/op ArraysHashCode.ints 200 avgt 30 237.400 ? 0.045 ns/op ArraysHashCode.ints 300 avgt 30 318.105 ? 0.562 ns/op ------------- PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-3072653412 From duke at openjdk.org Tue Jul 15 08:47:45 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Tue, 15 Jul 2025 08:47:45 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v9] In-Reply-To: References: <7od62MdoD83EGfh9UTcLLE1pkkvaZclx2c9sIiLB58M=.066b94a9-e66e-475a-8434-2a6160c7642c@github.com> Message-ID: <0ltHS3q6Eer8KH_h_TPorujtrPoJHlG6n3mnfv4ZBSY=.56f26d21-98e5-4319-9084-b38122834837@github.com> On Tue, 15 Jul 2025 08:28:51 GMT, Yuri Gaevsky wrote: >> Yuri Gaevsky has updated the pull request incrementally with one additional commit since the last revision: >> >> simplified arrays_hashcode_v() to be closer to VLA and use less general-purpose registers; minor cosmetic changes > > bpif3-16g% ( for i in "-XX:DisableIntrinsic=_vectorizedHashCode" "-XX:-UseRVV" "-XX:+UseRVV" ; \ > do ( echo "--- ${i} ---" && ${JAVA_HOME}/bin/java -jar benchmarks.jar \ > --jvmArgs="-XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions ${i}" \ > org.openjdk.bench.java.lang.ArraysHashCode.ints \ > -p size=1,5,10,20,30,40,50,60,70,80,90,100,200,300 \ > -f 3 -r 1 -w 1 -wi 10 -i 10 2>&1 | tail -15 ) done ) > --- -XX:DisableIntrinsic=_vectorizedHashCode --- > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.ints 1 avgt 30 11.273 ? 0.003 ns/op > ArraysHashCode.ints 5 avgt 30 28.817 ? 0.013 ns/op > ArraysHashCode.ints 10 avgt 30 41.330 ? 0.280 ns/op > ArraysHashCode.ints 20 avgt 30 68.236 ? 0.057 ns/op > ArraysHashCode.ints 30 avgt 30 88.455 ? 0.142 ns/op > ArraysHashCode.ints 40 avgt 30 115.251 ? 0.350 ns/op > ArraysHashCode.ints 50 avgt 30 135.525 ? 0.685 ns/op > ArraysHashCode.ints 60 avgt 30 161.547 ? 0.165 ns/op > ArraysHashCode.ints 70 avgt 30 171.417 ? 0.402 ns/op > ArraysHashCode.ints 80 avgt 30 193.232 ? 0.241 ns/op > ArraysHashCode.ints 90 avgt 30 207.720 ? 0.304 ns/op > ArraysHashCode.ints 100 avgt 30 232.256 ? 0.792 ns/op > ArraysHashCode.ints 200 avgt 30 447.408 ? 0.308 ns/op > ArraysHashCode.ints 300 avgt 30 656.444 ? 1.332 ns/op > --- -XX:-UseRVV --- > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.ints 1 avgt 30 11.279 ? 0.013 ns/op > ArraysHashCode.ints 5 avgt 30 24.427 ? 0.005 ns/op > ArraysHashCode.ints 10 avgt 30 35.704 ? 0.011 ns/op > ArraysHashCode.ints 20 avgt 30 58.894 ? 0.062 ns/op > ArraysHashCode.ints 30 avgt 30 82.685 ? 0.015 ns/op > ArraysHashCode.ints 40 avgt 30 105.861 ? 0.065 ns/op > ArraysHashCode.ints 50 avgt 30 129.672 ? 0.038 ns/op > ArraysHashCode.ints 60 avgt 30 152.865 ? 
0.057 ns/op > ArraysHashCode.ints 70 avgt 30 176.689 ? 0.063 ns/op > ArraysHashCode.ints 80 avgt 30 199.823 ? 0.035 ns/op > ArraysHashCode.ints 90 avgt 30 223.588 ? 0.046 ns/op > ArraysHashCode.ints 100 avgt 30 247.405 ? 0.661 ns/op > ArraysHashCode.ints 200 avgt 30 481.698 ? 0.123 ns/op > ArraysHashCode.ints 300 avgt 30 716.488 ? 0.104 ns/op > --- -XX:+UseRVV --- > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.ints 1 avgt 30 ... > > @ygaevsky @RealFYang how can we procced ? > > My apologies, just busy at the moment with other things, going to update the patch soon. Thinking more about suggestions to make the code VLA: I don't understand how to break the dependency on the `result` calculation, since it depends on the previous `result`: ``` result = 31^^M * result + 31^^(M-1) * val[i+0] + ... + 31^^0 * val[i+(M-1)]; ```
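One common way to limit that dependency, sketched below in plain scalar C++ (purely illustrative, not RVV code and not the proposed patch), is to fold a whole strip of M elements against precomputed powers of 31, so only one multiply-accumulate per strip stays serial:

```c++
// result' = 31^M * result + 31^(M-1)*val[i] + ... + 31^0*val[i+M-1]
// The strip sum does not depend on 'result', so it can be computed with vector
// multiplies against a coefficient vector; only the last line is serial.
int hash_strip(int result, const int* val, int i, int M, const int* pow31 /* pow31[k] == 31^k */) {
  int strip_sum = 0;
  for (int j = 0; j < M; j++) {
    strip_sum += pow31[M - 1 - j] * val[i + j];  // independent of 'result' -> vectorizable
  }
  return pow31[M] * result + strip_sum;          // single serial step per strip of M elements
}
```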
> > Additional testing: > - [x] Linux AArch64 server fastdebug, `tier1` > - [ ] Linux AArch64 server fastdebug, `all` Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: Also handle the corner case when compiler threads might be using the task ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26294/files - new: https://git.openjdk.org/jdk/pull/26294/files/13625998..76bfa8d1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26294&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26294&range=00-01 Stats: 18 lines in 2 files changed: 7 ins; 2 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/26294.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26294/head:pull/26294 PR: https://git.openjdk.org/jdk/pull/26294 From shade at openjdk.org Tue Jul 15 08:59:17 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 15 Jul 2025 08:59:17 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 17:40:56 GMT, Vladimir Kozlov wrote: >> Ah, your question is what happens if we notify here, and compilations are still running? Well, I think current protocol should nominally allow waiters to wait until compilation is over and then allow them to delete the task. But then I see `wait_for_compilation` can exit when compilation is shut down: >> >> >> while (!task->is_complete() && !is_compilation_disabled_forever()) { >> ml.wait(); >> } >> >> >> This will proceed to delete the task while compiler thread is running. Grrr. Looks to be another hole in this protocol. > > Can compiler thread delete its **own** blocking task when it finished. And let Java thread resume execution when compilation disabled as it do now but do nothing about task in such case? I don't think that works. There is no "own" blocking task, there are nearly always two threads involved: the compiler thread and the waiter (Java) thread. Waiter is checking the task status under the lock. Logically, the last _user_ should delete the task, that is waiter. But I think we can handle this hole by ignoring the blocking task deletion during compiler shutdown. For the same reason described in PR body: we already leave cruft behind in that case, and it costs us quite a bit of complexity to deal with every corner case during shutdown. So it seems simpler to just drop the tasks on the floor in that corner case. I did a variant of this in new commit, seems to still work well under stress testing. More testing is running now... ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26294#discussion_r2206893603 From dlunden at openjdk.org Tue Jul 15 09:11:45 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Tue, 15 Jul 2025 09:11:45 GMT Subject: RFR: 8342941: IGV: Add various new graph dumps during loop opts [v4] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 18:28:57 GMT, Saranya Natarajan wrote: >> This changeset adds BEFORE/AFTER graph dumps for creating a post loop (`insert_post_loop()`), removing an empty loop (`do_remove_empty_loop()`), and removing a one iteration loop (`do_one_iteration_loop()`). >> >> Changes: >> - Added `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` for dumping graphs before and after `insert_post_loop()`. >> - Added `BEFORE_REMOVE_EMPTY_LOOP` and `AFTER_REMOVE_EMPTY_LOOP` for dumping graphs before and after `do_remove_empty_loop()`. 
>> - Added `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` for dumping graphs before and after `do_one_iteration_loop()`. >> >> Below are sample screenshots (IGV print level 4 ) mainly showing the new phase . >> 1. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` >> ![image](https://github.com/user-attachments/assets/1661cede-5d70-4e0d-abec-3d091c7675c8) >> 2. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` with SuperWordLoopUnrollAnalysis enabled >> ![image](https://github.com/user-attachments/assets/6a22e6f0-4e6c-4e9d-8b6b-2bf75fac783d) >> 3.` BEFORE_REMOVE_EMPTY_LOOP `and `AFTER_REMOVE_EMPTY_LOOP` >> ![image](https://github.com/user-attachments/assets/3281f00b-575e-4604-83dd-831037d8dd47) >> 4. `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` >> ![image](https://github.com/user-attachments/assets/efddbc9a-64f7-403d-acfe-330d75a00911) >> >> Question to reviewers: >> Are the new compiler phases OK, or should we change anything? >> >> Testing: >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. >> Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > addressing review comment test/hotspot/jtreg/compiler/lib/ir_framework/CompilePhase.java line 94: > 92: AFTER_REMOVE_EMPTY_LOOP( "After Remove Empty Loop"), > 93: BEFORE_ONE_ITERATION_LOOP( "Before Replacing One Iteration Loop"), > 94: AFTER_ONE_ITERATION_LOOP( "After Replacing One Iteration Loop"), Suggestion: BEFORE_ONE_ITERATION_LOOP( "Before Replacing One-Iteration Loop"), AFTER_ONE_ITERATION_LOOP( "After Replacing One-Iteration Loop"), I see there are also a few more occurrences that I think needs updating (`grep` for "one iteration loop") ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25756#discussion_r2206928409 From bmaillard at openjdk.org Tue Jul 15 10:23:26 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Tue, 15 Jul 2025 10:23:26 GMT Subject: RFR: 8358573: CompileBroker fails with "expect failure reason" assert with -XX:-InstallMethods Message-ID: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> This PR prevents from hitting an assert when disabling method installation at the end of a successful compilation with the `-XX:-InstallMethods` flag. Previously `CompileBroker` failed to mark the `CompileTask` as complete when using this flag. ## Analysis We can see that the assert is triggered in `CompileBroker::invoke_compiler_on_method`: ```c++ if (!ci_env.failing() && !task->is_success()) { assert(ci_env.failure_reason() != nullptr, "expect failure reason"); assert(false, "compiler should always document failure: %s", ci_env.failure_reason()); // The compiler elected, without comment, not to register a result. // Do not attempt further compilations of this method. ci_env.record_method_not_compilable("compile failed"); } The `task->is_success()` call accesses the private `_is_success` field. This field is modified in `CompileTask::mark_success`. 
By setting a breakpoint there and executing the program without `-XX:-InstallMethods`, we get the following stack trace:

CompileTask::mark_success compileTask.hpp:185
nmethod::post_compiled_method nmethod.cpp:2212
ciEnv::register_method ciEnv.cpp:1127
Compilation::install_code c1_Compilation.cpp:425
Compilation::compile_method c1_Compilation.cpp:488
Compilation::Compilation c1_Compilation.cpp:609
Compiler::compile_method c1_Compiler.cpp:262
CompileBroker::invoke_compiler_on_method compileBroker.cpp:2324
CompileBroker::compiler_thread_loop compileBroker.cpp:1968
CompilerThread::thread_entry compilerThread.cpp:67
JavaThread::thread_main_inner javaThread.cpp:773
JavaThread::run javaThread.cpp:758
Thread::call_run thread.cpp:243
thread_native_entry os_linux.cpp:868

We go up the stack trace and see that in `Compilation::compile_method` we have:

```c++
if (should_install_code()) {
  // install code
  PhaseTraceTime timeit(_t_codeinstall);
  install_code(frame_size);
}
```

If we do not install methods after compilation, the code path that marks the task as successful is never executed, and we therefore hit the assert.

### Fix

We simply mark the task as complete when `should_install_code()` evaluates to `false` in the code block above.

### Testing

- [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8358573)
- [ ] tier1-3, plus some internal testing
- [x] Added a test that starts the VM with the `-XX:-InstallMethods` flag

Thank you for reviewing!

-------------

Commit messages:
 - 8358573: Add test for -XX:-InstallMethods
 - 8358573: Add missing task success notification

Changes: https://git.openjdk.org/jdk/pull/26310/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26310&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8358573
Stats: 50 lines in 2 files changed: 50 ins; 0 del; 0 mod
Patch: https://git.openjdk.org/jdk/pull/26310.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/26310/head:pull/26310

PR: https://git.openjdk.org/jdk/pull/26310

From dzhang at openjdk.org Tue Jul 15 10:28:56 2025
From: dzhang at openjdk.org (Dingli Zhang)
Date: Tue, 15 Jul 2025 10:28:56 GMT
Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors [v4]
In-Reply-To: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com>
References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com>
Message-ID: 

> Following up on [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419): RVV supports all vector type conversion APIs in the Vector API.
> So we only need to relax the length limit of the short type to achieve a significant improvement in JMH performance for converting between short and wider types.
>
> ### Test
> qemu-system UseRVV:
> * [x] Run jdk_vector (fastdebug)
> * [x] Run compiler/vectorapi (fastdebug)
>
> ### Performance
> The following shows the performance improvement of the relevant VectorAPI JMH benchmarks on k1 (256-bit RVV):
>
> Benchmark (SIZE) Mode Units Before After Gain
> VectorFPtoIntCastOperations.microDouble128ToShort128 512 thrpt ops/ms 52.280 840.112 16.07
> VectorFPtoIntCastOperations.microDouble128ToShort128 1024 thrpt ops/ms 28.156 429.322 15.25
> VectorFPtoIntCastOperations.microFloat64ToShort64 512 thrpt ops/ms 14.242 479.509 33.67
> VectorFPtoIntCastOperations.microFloat64ToShort64 1024 thrpt ops/ms 6.906 242.690 35.14
>
> PS: `VectorFPtoIntCastOperations.microFloat64ToShort64` is added by [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419).
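For context on the conversions benchmarked above, here is a hedged sketch of what a Float64-to-Short64 cast looks like at the Java level with the incubating Vector API; the class name, species choices, and loop-free shape are assumptions for illustration, not the benchmark source:

```java
// Sketch: lane-wise narrowing cast from a 64-bit float vector (2 lanes) to a
// 64-bit short vector, the kind of shape-changing conversion the short-vector
// length relaxation speeds up on RVV. Requires --add-modules jdk.incubator.vector.
import jdk.incubator.vector.*;

public class FloatToShortCast {
    static final VectorSpecies<Float> FSP = FloatVector.SPECIES_64;
    static final VectorSpecies<Short> SSP = ShortVector.SPECIES_64;

    static ShortVector castOne(float[] src, int i) {
        FloatVector fv = FloatVector.fromArray(FSP, src, i);
        // F2S narrows each lane; part 0 places the converted lanes in the low
        // lanes of the short species, with the remaining lanes zeroed.
        return (ShortVector) fv.convertShape(VectorOperators.F2S, SSP, 0);
    }
}
```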
Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Update comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26239/files - new: https://git.openjdk.org/jdk/pull/26239/files/84d6b25f..7120525b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26239&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26239&range=02-03 Stats: 6 lines in 1 file changed: 3 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/26239.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26239/head:pull/26239 PR: https://git.openjdk.org/jdk/pull/26239 From bulasevich at openjdk.org Tue Jul 15 10:29:03 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Tue, 15 Jul 2025 10:29:03 GMT Subject: RFR: 8362250: ARM32: forward_exception_entry missing return address Message-ID: <4O9aorKuQ5wpIGNVsjHd8K8lIQR-uRDxEl7HsFuUyXk=.e9c48f43-f406-4540-a231-cae9bdfc0f11@github.com> The ARM32 ForwardExceptionNode codegen needs to set the exception address to R5. And, since the https://github.com/openjdk/jdk/pull/20437 change, the TailCall generator does not need this because the StubRoutines::forward_exception_entry function is not called there. ------------- Commit messages: - 8362250: ARM32: forward_exception_entry missing return address Changes: https://git.openjdk.org/jdk/pull/26312/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26312&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362250 Stats: 5 lines in 1 file changed: 1 ins; 4 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26312.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26312/head:pull/26312 PR: https://git.openjdk.org/jdk/pull/26312 From aph at openjdk.org Tue Jul 15 11:02:47 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 15 Jul 2025 11:02:47 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet In-Reply-To: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: <02gzu7-f5QW8kKcXb2WROl3k1rdsNcM4q1sofUHZ6oE=.5d9898c4-0ea2-4dcb-9dbb-eabaffb93b91@github.com> On Thu, 26 Jun 2025 12:13:19 GMT, Samuel Chee wrote: > AtomicLong.CompareAndSet has the following assembly dump snippet which gets emitted from the intermediary LIRGenerator::atomic_cmpxchg: > > ;; cmpxchg { > 0x0000e708d144cf60: mov x8, x2 > 0x0000e708d144cf64: casal x8, x3, [x0] > 0x0000e708d144cf68: cmp x8, x2 > ;; 0x1F1F1F1F1F1F1F1F > 0x0000e708d144cf6c: mov x8, #0x1f1f1f1f1f1f1f1f > ;; } cmpxchg > 0x0000e708d144cf70: cset x8, ne // ne = any > 0x0000e708d144cf74: dmb ish > > > According to the Oracle Java Specification, AtomicLong.CompareAndSet [1] has the same memory effects as specified by VarHandle.compareAndSet which has the following effects: [2] > >> Atomically sets the value of a variable to the >> newValue with the memory semantics of setVolatile if >> the variable's current value, referred to as the witness >> value, == the expectedValue, as accessed with the memory >> semantics of getVolatile. > > > > Hence the release on the store due to setVolatile only occurs if the compare is successful. Since casal already satisfies these requirements, the dmb does not need to occur to ensure memory ordering in case the compare fails and a release does not happen. 
> > Hence we remove the dmb from both casl and casw (same logic applies to the non-long variant) > > This is also reflected by C2 not having a dmb for the same respective method. > > [1] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/util/concurrent/atomic/AtomicLong.html#compareAndSet(long,long) > [2] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/lang/invoke/VarHandle.html#compareAndSet(java.lang.Object...) With the help of Will Deacon, one of the authors of the memory model. I've now got to the bottom of this. It is indeed a change to the MM, dating from 2002. I agree that the DMB isn't needed here because the CASAL has both acquire and release semantics. However, I don't think that's related to the snippet of the architecture you have above but rather comes from: // DDI0487L_b // Barrier-ordered-before (B2-255) ... * All of the following apply: - E1 is an Explicit Memory Write Effect and is generated by an atomic instruction with both Acquire and Release semantics. - E1 appears in program order before E2. - One of the following applies: - E2 is an Explicit Memory Effect. - E2 is an Implicit Tag Memory Read Effect. - E2 is an MMU Fault Effect. Which says that the release store of the CASAL is ordered before the the subsequent store to y. Note that this _wouldn't_ work if you used CASL instead. The full details of the MM change are here: http://github.com/herd/herdtools7/commit/636b7163c0679c691b8cf9a04623cd3aa1cc0ec3 So, this change looks good, and we can remove trailing DMBs from most CASALs. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26000#issuecomment-3073154641 PR Comment: https://git.openjdk.org/jdk/pull/26000#issuecomment-3073156671 From aph-open at littlepinkcloud.com Tue Jul 15 11:04:10 2025 From: aph-open at littlepinkcloud.com (Andrew Haley) Date: Tue, 15 Jul 2025 12:04:10 +0100 Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet In-Reply-To: <02gzu7-f5QW8kKcXb2WROl3k1rdsNcM4q1sofUHZ6oE=.5d9898c4-0ea2-4dcb-9dbb-eabaffb93b91@github.com> References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> <02gzu7-f5QW8kKcXb2WROl3k1rdsNcM4q1sofUHZ6oE=.5d9898c4-0ea2-4dcb-9dbb-eabaffb93b91@github.com> Message-ID: On 15/07/2025 12:02, Andrew Haley wrote: > It is indeed a change to the MM, dating from 2002. 2022, obvs. From adinn at openjdk.org Tue Jul 15 11:23:39 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 15 Jul 2025 11:23:39 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: <8NL_uVAlbHrnK9t1Ec89Uk100mo0ADe-_ni9b7QXQss=.39f24034-b4d5-4b25-a2a7-d3930a50730f@github.com> On Tue, 15 Jul 2025 08:19:59 GMT, Andrew Haley wrote: > It says "Do not use ExternalAddress to load 'byte_map_base'" but then uses ExternalAddress. I'm baffled... You need to read it as a FIXME ;-) The current workaround that we have prototyped is to place the base 'address' in a field in the global AOTRuntimeConstants instance and load it via that field. That's not very attractive because we need to do an indirect load from an lea'd constant address (movz/movk/movk/ldr) at every occurrence of a GC barrier. We also need to relocate every mov sequence when we load the code from the archive. What we would like longer term is to store the address in a method/stub's constants section and use a pc-relative load to access it. 
We would need to provide the constant entry with a relocation so we can reinit it to the VM's current base at AOT code load, but we can use the same constant for every occurrence of the barrier. That also means we update fewer pages during reloc, which means we should eventually be able to rely on mmapping AOT code cache pages with limited copy-on-write for reloced insns.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3073218378

From bkilambi at openjdk.org Tue Jul 15 11:32:46 2025
From: bkilambi at openjdk.org (Bhavana Kilambi)
Date: Tue, 15 Jul 2025 11:32:46 GMT
Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15]
In-Reply-To: 
References: 
Message-ID: 

On Tue, 15 Jul 2025 02:36:45 GMT, Xiaohong Gong wrote:

>> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Addressed review comments to half the number of match rules
>
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2861:
>
>> 2859: FloatRegister src2, FloatRegister index,
>> 2860: FloatRegister tmp, unsigned vector_length_in_bytes) {
>> 2861: assert_different_registers(dst, src1, src2, tmp);
>
> It seems `dst` can be the same as either `src1`, `src2`, or `tmp` from the following implementation instructions, right? Maybe we should assert more accurately for the different cases, such as `src2 == src1 + 1` when `vector_length_in_bytes == 16`?

`dst, src1, src2` and `tmp` need to be different registers. Only `dst` and `index` can match, depending on the type of the input, which is the reason why I didn't add `index` to the assertion.

For the `src2 == src1 + 1` case, this is already checked in the definition of the `tbl` instruction for SVE in `src/hotspot/cpu/aarch64/assembler_aarch64.hpp` (but I realized it's not for Neon). Do you think it's enough if I make the Neon `tbl` instruction definition compatible with the SVE one? Or would it be better to add a separate assertion here as well?
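For readers unfamiliar with the operation this backend patch implements, a small usage sketch of the two-vector `selectFrom` from the incubating Vector API may help; the species, values, and class name are assumptions for illustration only:

```java
// Sketch: the receiver supplies per-lane indices into the concatenation of v1 and v2.
// Indices in [0, VLENGTH) pick from v1, indices in [VLENGTH, 2*VLENGTH) pick from v2.
import jdk.incubator.vector.*;

public class SelectFromTwoVectorExample {
    static final VectorSpecies<Integer> SP = IntVector.SPECIES_128; // 4 int lanes

    static int[] demo() {
        IntVector v1  = IntVector.fromArray(SP, new int[] {10, 11, 12, 13}, 0);
        IntVector v2  = IntVector.fromArray(SP, new int[] {20, 21, 22, 23}, 0);
        IntVector idx = IntVector.fromArray(SP, new int[] {0, 5, 2, 7}, 0);
        IntVector r   = (IntVector) idx.selectFrom(v1, v2); // lanes: 10, 21, 12, 23
        return r.toArray();
    }
}
```

This two-source rearrange is what the patch maps onto the two-table `tbl` lookup discussed in this thread.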
Changeset: 7aa3f317 Author: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/7aa3f31724844bf2f4e08111af8173b5d985f809 Stats: 81 lines in 3 files changed: 71 ins; 0 del; 10 mod 8359678: C2: assert(static_cast(result) == thing) caused by ReverseBytesNode::Value() Reviewed-by: chagedorn Backport-of: e5ab210713f76c5307287bd97ce63f9e22d0ab8e ------------- PR: https://git.openjdk.org/jdk/pull/26308 From shade at openjdk.org Tue Jul 15 11:38:41 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 15 Jul 2025 11:38:41 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: <2WKDMaaFc9Hg9wAKR2_tRwmCKD6zrt-BZQ-3UXUYEPs=.acacec4e-7ccd-43d7-986d-50c06d805f37@github.com> On Sun, 13 Jul 2025 08:40:45 GMT, Yadong Wang wrote: >> The bug is that the predicate rule of immByteMapBase would cause a ConP Node for oop incorrect matching with byte_map_base when the placeholder jni handle address was just allocated to the address of byte_map_base. >> >> C2 uses JNI handles as placeholders to encoding constant oops, and one of some handle maybe locate at the address of byte_map_base, which is not memory reserved by CardTable. It's possible because JNIHandleBlocks are allocated by malloc. >> >> // The assembler store_check code will do an unsigned shift of the oop, >> // then add it to _byte_map_base, i.e. >> // >> // _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift) >> _byte_map = (CardValue*) rs.base(); >> _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift); >> >> In aarch64 port, C2 will incorrectly match ConP for oop to ConP for byte_map_base by the immByteMapBase operand. >> >> // Card Table Byte Map Base >> operand immByteMapBase() >> %{ >> // Get base of card map >> predicate((jbyte*)n->get_ptr() == >> ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); >> match(ConP); >> >> op_cost(0); >> format %{ %} >> interface(CONST_INTER); >> %} >> >> // Load Byte Map Base Constant >> instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) >> %{ >> match(Set dst con); >> >> ins_cost(INSN_COST); >> format %{ "adr $dst, $con\t# Byte Map Base" %} >> >> ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); >> >> ins_pipe(ialu_imm); >> %} >> >> As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: >> 0xffff25caf08c: ldaxr x8, [x11] >> 0xffff25caf090: cmp x10, x8 >> 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any >> 0xffff25caf098: stlxr w8, x28, [x11] >> 0xffff25caf09c: cbnz w8, 0xffff25caf08c >> 0xffff25caf0a0: orr x11, xzr, #0x3 >> 0xffff25caf0a4: str x11, [x13] >> 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none >> 0xffff25caf0ac: str x14, [sp] >> 0xffff25caf0b0: add x2, sp, #0x20 >> 0xffff25caf0b4: adrp x1, 0xffff21730000 >> 0xffff25caf0b8: bl 0xffff256fffc0 >> 0xffff25caf0bc: ldr x14, [sp] >> 0xffff25caf0c0: b 0xffff25caef80 >> 0xffff25caf0c4: add x13, sp, #0x20 >> 0xffff25caf0c8: adrp x12, 0xffff21730000 >> 0xffff25caf0cc: ldr x10, [x13] >> 0xffff25caf0d0: cmp x10, xzr >> 0xffff25c... 
> > Yadong Wang has updated the pull request incrementally with one additional commit since the last revision: > > 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding I read this that we are in consensus that removing the bad rule is the way to go. I ran Linux AArch64 server fastdebug `make test TEST=all` on Graviton 3 and seen no new problems. Therefore, I am approving this to get the reviewer train going. ------------- Marked as reviewed by shade (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26249#pullrequestreview-3019924039 From mhaessig at openjdk.org Tue Jul 15 11:55:41 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Tue, 15 Jul 2025 11:55:41 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 16:11:15 GMT, Manuel H?ssig wrote: > I kicked off a CI run. FWIW, tier1-tier3, and 100 repeats of `TestStressBailout.java` on Linux x64 & aarch64, Windows x64, and Mac aarch64 all passed. Let me know when I should kick off another round. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26294#issuecomment-3073304815 From snatarajan at openjdk.org Tue Jul 15 12:09:31 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Tue, 15 Jul 2025 12:09:31 GMT Subject: RFR: 8342941: IGV: Add various new graph dumps during loop opts [v5] In-Reply-To: References: Message-ID: <2O1F6Lj8vy0qWs_qHqmXFPkwbuOqHx1NheZsroEYKbc=.bafb3eba-164b-4c13-8c27-346d44d43486@github.com> > This changeset adds BEFORE/AFTER graph dumps for creating a post loop (`insert_post_loop()`), removing an empty loop (`do_remove_empty_loop()`), and removing a one iteration loop (`do_one_iteration_loop()`). > > Changes: > - Added `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` for dumping graphs before and after `insert_post_loop()`. > - Added `BEFORE_REMOVE_EMPTY_LOOP` and `AFTER_REMOVE_EMPTY_LOOP` for dumping graphs before and after `do_remove_empty_loop()`. > - Added `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` for dumping graphs before and after `do_one_iteration_loop()`. > > Below are sample screenshots (IGV print level 4 ) mainly showing the new phase . > 1. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` > ![image](https://github.com/user-attachments/assets/1661cede-5d70-4e0d-abec-3d091c7675c8) > 2. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` with SuperWordLoopUnrollAnalysis enabled > ![image](https://github.com/user-attachments/assets/6a22e6f0-4e6c-4e9d-8b6b-2bf75fac783d) > 3.` BEFORE_REMOVE_EMPTY_LOOP `and `AFTER_REMOVE_EMPTY_LOOP` > ![image](https://github.com/user-attachments/assets/3281f00b-575e-4604-83dd-831037d8dd47) > 4. `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` > ![image](https://github.com/user-attachments/assets/efddbc9a-64f7-403d-acfe-330d75a00911) > > Question to reviewers: > Are the new compiler phases OK, or should we change anything? > > Testing: > GitHub Actions > tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. 
> Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: modifying one iteration loop to one-iteration ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25756/files - new: https://git.openjdk.org/jdk/pull/25756/files/37aab41d..4f531f1a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25756&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25756&range=03-04 Stats: 6 lines in 4 files changed: 0 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/25756.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25756/head:pull/25756 PR: https://git.openjdk.org/jdk/pull/25756 From snatarajan at openjdk.org Tue Jul 15 12:09:34 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Tue, 15 Jul 2025 12:09:34 GMT Subject: RFR: 8342941: IGV: Add various new graph dumps during loop opts [v4] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 09:08:45 GMT, Daniel Lund?n wrote: >> Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: >> >> addressing review comment > > test/hotspot/jtreg/compiler/lib/ir_framework/CompilePhase.java line 94: > >> 92: AFTER_REMOVE_EMPTY_LOOP( "After Remove Empty Loop"), >> 93: BEFORE_ONE_ITERATION_LOOP( "Before Replacing One Iteration Loop"), >> 94: AFTER_ONE_ITERATION_LOOP( "After Replacing One Iteration Loop"), > > Suggestion: > > BEFORE_ONE_ITERATION_LOOP( "Before Replacing One-Iteration Loop"), > AFTER_ONE_ITERATION_LOOP( "After Replacing One-Iteration Loop"), > > > I see there are also a few more occurrences that I think needs updating (`grep` for "one iteration loop") Sorry about this. I have made the changes. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25756#discussion_r2207302326 From aph at openjdk.org Tue Jul 15 12:11:44 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 15 Jul 2025 12:11:44 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: <8NL_uVAlbHrnK9t1Ec89Uk100mo0ADe-_ni9b7QXQss=.39f24034-b4d5-4b25-a2a7-d3930a50730f@github.com> References: <8NL_uVAlbHrnK9t1Ec89Uk100mo0ADe-_ni9b7QXQss=.39f24034-b4d5-4b25-a2a7-d3930a50730f@github.com> Message-ID: On Tue, 15 Jul 2025 11:21:29 GMT, Andrew Dinn wrote: > > It says "Do not use ExternalAddress to load 'byte_map_base'" but then uses ExternalAddress. I'm baffled... > > You need to read it as a FIXME ;-) Aha! > The current workaround that we have prototyped is to place the base 'address' in a field in the global AOTRuntimeConstants instance and load it via that field. That's not very attractive because we need to do an indirect load from an lea'd constant address (movz/movk/movk/ldr) at every occurrence of a GC barrier. We also need to relocate every mov sequence when we load the code from the archive. Eww. > What we would like longer term is to store the address in a method/stub's constants section and use a pc-relative load to access it. IMHO that doesn't help very much. You're still looking at ~4 cycles to load from L1 dcache, and you kill a dcache line for it, and you consume a full xword of code space for it. If we put the byte-map base on a 32-bit boundary we only need a single MOVZ, and that can be relocated easily enough when we load the code from the archive. Surely that's better. 
Or is even that small work too much? We would need to provide the constant entry with a relocation so we can reinit it to the VM's current base at AOT code load but we can use the same constant for every occurrence of the barrier. That also means we update less pages during reloc which means we should eventually be able to rely on mmaping AOT cocde cache pages with limited copy on write for reloced insns. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3073351728 From aph at openjdk.org Tue Jul 15 12:19:43 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 15 Jul 2025 12:19:43 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 11:28:17 GMT, Bhavana Kilambi wrote: >> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2861: >> >>> 2859: FloatRegister src2, FloatRegister index, >>> 2860: FloatRegister tmp, unsigned vector_length_in_bytes) { >>> 2861: assert_different_registers(dst, src1, src2, tmp); >> >> It seems `dst` can be the same with either `src1`, `src2`, or `tmp` from following implementation instruction, right? Maybe we should assert more accurate for different cases, such as `src2 == src1 + 1` when `vector_length_in_bytes == 16`? > > `dst, src1, src2` and `tmp` need to be different registers. Only `dst` and `index` can match depending on the type of the input. The reason why I didn't add `index` to the assertion. > for the `src2 == src1 + 1` case, this is being checked in the definition of the `tbl` instruction for SVE in `src/hotspot/cpu/aarch64/assembler_aarch64.hpp` (but I realized it's not for Neon). Do you think it's enough if I make the Neon `tbl` instruction definition compatible with the SVE one? Or it's better to add a separate assertion here as well? Just assert at the start of every function whatever that function needs. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2207321740 From aph at openjdk.org Tue Jul 15 12:19:45 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 15 Jul 2025 12:19:45 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 11:17:41 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. 
>> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments to half the number of match rules src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2912: > 2910: // If control reaches here, then the Neon instructions would be executed and > 2911: // one of these conditions must satisfy - > 2912: // UseSVE == 0 || (UseSVE == 1 && length_in_bytes == 16) Why? Can't you make this logic correct regardless of `UseSVE`? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2207319386 From duke at openjdk.org Tue Jul 15 12:23:39 2025 From: duke at openjdk.org (Samuel Chee) Date: Tue, 15 Jul 2025 12:23:39 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet In-Reply-To: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: On Thu, 26 Jun 2025 12:13:19 GMT, Samuel Chee wrote: > AtomicLong.CompareAndSet has the following assembly dump snippet which gets emitted from the intermediary LIRGenerator::atomic_cmpxchg: > > ;; cmpxchg { > 0x0000e708d144cf60: mov x8, x2 > 0x0000e708d144cf64: casal x8, x3, [x0] > 0x0000e708d144cf68: cmp x8, x2 > ;; 0x1F1F1F1F1F1F1F1F > 0x0000e708d144cf6c: mov x8, #0x1f1f1f1f1f1f1f1f > ;; } cmpxchg > 0x0000e708d144cf70: cset x8, ne // ne = any > 0x0000e708d144cf74: dmb ish > > > According to the Oracle Java Specification, AtomicLong.CompareAndSet [1] has the same memory effects as specified by VarHandle.compareAndSet which has the following effects: [2] > >> Atomically sets the value of a variable to the >> newValue with the memory semantics of setVolatile if >> the variable's current value, referred to as the witness >> value, == the expectedValue, as accessed with the memory >> semantics of getVolatile. > > > > Hence the release on the store due to setVolatile only occurs if the compare is successful. Since casal already satisfies these requirements, the dmb does not need to occur to ensure memory ordering in case the compare fails and a release does not happen. > > Hence we remove the dmb from both casl and casw (same logic applies to the non-long variant) > > This is also reflected by C2 not having a dmb for the same respective method. 
> > [1] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/util/concurrent/atomic/AtomicLong.html#compareAndSet(long,long) > [2] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/lang/invoke/VarHandle.html#compareAndSet(java.lang.Object...) Great! Glad you're convinced of its correctness. > we can remove trailing DMBs from most CASALs. Just do bear in mind that CASAL doesn't emit release semantics if the compare fails so I imagine there might be cases where a trailing dmb might still necessary. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26000#issuecomment-3073385919 From adinn at openjdk.org Tue Jul 15 12:32:43 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 15 Jul 2025 12:32:43 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Sun, 13 Jul 2025 08:40:45 GMT, Yadong Wang wrote: >> The bug is that the predicate rule of immByteMapBase would cause a ConP Node for oop incorrect matching with byte_map_base when the placeholder jni handle address was just allocated to the address of byte_map_base. >> >> C2 uses JNI handles as placeholders to encoding constant oops, and one of some handle maybe locate at the address of byte_map_base, which is not memory reserved by CardTable. It's possible because JNIHandleBlocks are allocated by malloc. >> >> // The assembler store_check code will do an unsigned shift of the oop, >> // then add it to _byte_map_base, i.e. >> // >> // _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift) >> _byte_map = (CardValue*) rs.base(); >> _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift); >> >> In aarch64 port, C2 will incorrectly match ConP for oop to ConP for byte_map_base by the immByteMapBase operand. >> >> // Card Table Byte Map Base >> operand immByteMapBase() >> %{ >> // Get base of card map >> predicate((jbyte*)n->get_ptr() == >> ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); >> match(ConP); >> >> op_cost(0); >> format %{ %} >> interface(CONST_INTER); >> %} >> >> // Load Byte Map Base Constant >> instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) >> %{ >> match(Set dst con); >> >> ins_cost(INSN_COST); >> format %{ "adr $dst, $con\t# Byte Map Base" %} >> >> ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); >> >> ins_pipe(ialu_imm); >> %} >> >> As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: >> 0xffff25caf08c: ldaxr x8, [x11] >> 0xffff25caf090: cmp x10, x8 >> 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any >> 0xffff25caf098: stlxr w8, x28, [x11] >> 0xffff25caf09c: cbnz w8, 0xffff25caf08c >> 0xffff25caf0a0: orr x11, xzr, #0x3 >> 0xffff25caf0a4: str x11, [x13] >> 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none >> 0xffff25caf0ac: str x14, [sp] >> 0xffff25caf0b0: add x2, sp, #0x20 >> 0xffff25caf0b4: adrp x1, 0xffff21730000 >> 0xffff25caf0b8: bl 0xffff256fffc0 >> 0xffff25caf0bc: ldr x14, [sp] >> 0xffff25caf0c0: b 0xffff25caef80 >> 0xffff25caf0c4: add x13, sp, #0x20 >> 0xffff25caf0c8: adrp x12, 0xffff21730000 >> 0xffff25caf0cc: ldr x10, [x13] >> 0xffff25caf0d0: cmp x10, xzr >> 0xffff25c... 
> > Yadong Wang has updated the pull request incrementally with one additional commit since the last revision: > > 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding Yes, let's start the ball rolling. ------------- Marked as reviewed by adinn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26249#pullrequestreview-3020078780 From adinn at openjdk.org Tue Jul 15 12:32:45 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 15 Jul 2025 12:32:45 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: <8NL_uVAlbHrnK9t1Ec89Uk100mo0ADe-_ni9b7QXQss=.39f24034-b4d5-4b25-a2a7-d3930a50730f@github.com> Message-ID: On Tue, 15 Jul 2025 12:08:41 GMT, Andrew Haley wrote: > IMHO that doesn't help very much. You're still looking at ~4 cycles to load from L1 dcache, and you kill a dcache line for it, and you consume a full xword of code space for it. If we put the byte-map base on a 32-bit boundary we only need a single MOVZ, and that can be relocated easily enough when we load the code from the archive. Surely that's better. Or is even that small work too much? Well, it helps a bit even though it suffers from the faults you identify. The downside of using even a single MOVZ is that every on requries a reloc during AOT code reloading. So, the number of relocs in any code blob that we handle during loading is no longer 1. Instead it equals the number of post-barriers in the blob. The bigger hit will come when we try to optimize code loading by mmapping AOT code blobs into the code cache -- at present we copy-relocate it from an mmapped region. Instead of just one constant to patch at the start of the blob we will have many places to patch scattered throughout the code blob. So, many more copy-on-write pages rather than vanilla mapped pages. That drags back in copy overheads that the mmap is intended to avoid and also means less opportunities for co-hosted JVMs to share pages. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3073410591 From bkilambi at openjdk.org Tue Jul 15 12:33:44 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 15 Jul 2025 12:33:44 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 12:16:42 GMT, Andrew Haley wrote: >> `dst, src1, src2` and `tmp` need to be different registers. Only `dst` and `index` can match depending on the type of the input. The reason why I didn't add `index` to the assertion. >> for the `src2 == src1 + 1` case, this is being checked in the definition of the `tbl` instruction for SVE in `src/hotspot/cpu/aarch64/assembler_aarch64.hpp` (but I realized it's not for Neon). Do you think it's enough if I make the Neon `tbl` instruction definition compatible with the SVE one? Or it's better to add a separate assertion here as well? > > Just assert at the start of every function whatever that function needs. I'll add the `src2 == src1 + 1` assertion in my next PS. Thanks! 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2207350457 From bkilambi at openjdk.org Tue Jul 15 12:33:45 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 15 Jul 2025 12:33:45 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 12:15:27 GMT, Andrew Haley wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Addressed review comments to half the number of match rules > > src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2912: > >> 2910: // If control reaches here, then the Neon instructions would be executed and >> 2911: // one of these conditions must satisfy - >> 2912: // UseSVE == 0 || (UseSVE == 1 && length_in_bytes == 16) > > Why? Can't you make this logic correct regardless of `UseSVE`? So the Neon implementation gets kicked in when SVE is not available (UseSVE == 0) whether the vector length is 8 or 16 but we emit Neon instructions for UseSVE ==1 and vector length == 16 only. I am not sure how I can eliminate `UseSVE` here. When the vector length == 8 with SVE1, we generate the SVE `tbl` instruction (with single input). This is done for `T_INT` and `T_FLOAT` types so that we avoid generating the `mulv`/`addv` instructions for the Neon `tbl` instruction. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2207349617 From dhanalla at openjdk.org Tue Jul 15 12:36:44 2025 From: dhanalla at openjdk.org (Dhamoder Nalla) Date: Tue, 15 Jul 2025 12:36:44 GMT Subject: RFR: 8341293: Split field loads through Nested Phis [v9] In-Reply-To: References: Message-ID: On Fri, 25 Apr 2025 15:04:06 GMT, Dhamoder Nalla wrote: >> Dhamoder Nalla has updated the pull request incrementally with one additional commit since the last revision: >> >> address CR comments > >> Looks like you deleted some of the tests you had there. Can you explain why? This was not by any chance the one that failed? [3c56f98](https://github.com/openjdk/jdk/commit/3c56f98d88433a4fada2c7e43147fc2e91df5e89) > > Thanks @eme64 for taking a look at it, > I haven't deleted any tests; the test cases list below remains unchanged. 
The assertion is addressed by the code change https://github.com/openjdk/jdk/commit/3c56f98d88433a4fada2c7e43147fc2e91df5e89#diff-03f7ae3cf79ff61be6e4f0590b7809a87825b073341fdbfcf36143b99c304474L4523 > > try { > Asserts.assertEQ(testRematerialize_SingleObj_Interp(cond1, x, y), testRematerialize_SingleObj_C2(cond1, x, y)); > } catch (Exception e) {} > Asserts.assertEQ(testRematerialize_TryCatch_Interp(cond1, l, x, y), testRematerialize_TryCatch_C2(cond1, l, x, y)); > Asserts.assertEQ(testMerge_TryCatchFinally_Interp(cond1, l, x, y), testMerge_TryCatchFinally_C2(cond1, l, x, y)); > Asserts.assertEQ(testRematerialize_MultiObj_Interp(cond1, cond2, x, y), testRematerialize_MultiObj_C2(cond1, cond2, x, y)); > Asserts.assertEQ(testGlobalEscapeInThread_Intrep(cond1, l, x, y), testGlobalEscapeInThread_C2(cond1, l, x, y)); > Asserts.assertEQ(testGlobalEscapeInThreadWithSync_Intrep(cond1, x, y), testGlobalEscapeInThreadWithSync_C2(cond1, x, y)); > Asserts.assertEQ(testFieldEscapeWithMerge_Intrep(cond1, x, y), testFieldEscapeWithMerge_C2(cond1, x, y)); > Asserts.assertEQ(testNestedPhi_FieldLoad_Interp(cond1, cond2, x, y), testNestedPhi_FieldLoad_C2(cond1, cond2, x, y)); > Asserts.assertEQ(testThreeLevelNestedPhi_Interp(cond1, cond2, x, y), testThreeLevelNestedPhi_C2(cond1, cond2, x, y)); > Asserts.assertEQ(testNestedPhiProcessOrder_Interp(cond1, cond2, x, y), testNestedPhiProcessOrder_C2(cond1, cond2, x, y)); > Asserts.assertEQ(testNestedPhi_TryCatch_Interp(cond1, cond2, x, y), testNestedPhi_TryCatch_C2(cond1, cond2, x, y)); > Asserts.assertEQ(testBailOut_Interp(cond1, cond2, x, y), testBailOut_C2(cond1, cond2, x, y)); > Asserts.assertEQ(testNestedPhiPolymorphic_Interp(cond1, cond2, x, y), testNestedPhiPolymorphic_C2(cond1, cond2, x, y)); > Asserts.assertEQ(testNestedPhiWithTrap_Interp(cond1, cond2, x, y), testNestedPhiWithTrap_C2(cond1, cond2, x, y)); > Asserts.assertEQ(testNestedPhiWithLambda_Interp(cond1, cond2, x, y), tes... > Thank you for persisting on this @dhanalla . I just did a quick look. I'll look again and run tests as soon as I get some time. Thank you @JohnTortugo for reviewing this PR. ------------- PR Comment: https://git.openjdk.org/jdk/pull/21270#issuecomment-3073421226 From dhanalla at openjdk.org Tue Jul 15 12:36:45 2025 From: dhanalla at openjdk.org (Dhamoder Nalla) Date: Tue, 15 Jul 2025 12:36:45 GMT Subject: RFR: 8341293: Split field loads through Nested Phis [v9] In-Reply-To: References: Message-ID: <3o_BE9OqAVt924PEQfnUY7rc4ukgHtDZneKHEbP0xdo=.46beab4b-455d-42fb-8f5c-0a12a0854a48@github.com> On Tue, 17 Jun 2025 19:13:04 GMT, Cesar Soares Lucas wrote: >> Dhamoder Nalla has updated the pull request incrementally with one additional commit since the last revision: >> >> address CR comments > > src/hotspot/share/opto/escape.cpp line 1310: > >> 1308: Node* use = ophi->fast_out(i); >> 1309: if (use->is_Phi()) { >> 1310: assert(use->_idx != ophi->_idx, "Unexpected selfloop Phi."); > > Should we bailout of the reduction process if we somehow end up in this situation? > I.e., in a debug build we'll assert, but in a product build you're just ignoring the problem. This assert is redundant; the self-loop nodes are already filtered out before reaching reduce_phi. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/21270#discussion_r2207356962 From aph at openjdk.org Tue Jul 15 12:39:39 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 15 Jul 2025 12:39:39 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet In-Reply-To: References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: On Tue, 15 Jul 2025 12:20:27 GMT, Samuel Chee wrote: > Great! Glad you're convinced of its correctness. > > > we can remove trailing DMBs from most CASALs. > > Just do bear in mind that CASAL doesn't emit release semantics if the compare fails so I imagine there might be cases where a trailing dmb might still necessary. Yes, that's what the change log in the link I posted says too. We don't care about such assumptions in Java code, but in some C++ code which assumes that CAS implies a full barrier. I don't think anyone really knows every bit of C++ code in HotSpot which does assume this, so we tend to assume the worst. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26000#issuecomment-3073435357 From aph at openjdk.org Tue Jul 15 12:44:39 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 15 Jul 2025 12:44:39 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: <8NL_uVAlbHrnK9t1Ec89Uk100mo0ADe-_ni9b7QXQss=.39f24034-b4d5-4b25-a2a7-d3930a50730f@github.com> Message-ID: On Tue, 15 Jul 2025 12:29:09 GMT, Andrew Dinn wrote: > > IMHO that doesn't help very much. You're still looking at ~4 cycles to load from L1 dcache, and you kill a dcache line for it, and you consume a full xword of code space for it. If we put the byte-map base on a 32-bit boundary we only need a single MOVZ, and that can be relocated easily enough when we load the code from the archive. Surely that's better. Or is even that small work too much? > > Well, it helps a bit even though it suffers from the faults you identify. > > The downside of using even a single MOVZ is that every on requries a reloc during AOT code reloading. So, the number of relocs in any code blob that we handle during loading is no longer 1. Instead it equals the number of post-barriers in the blob. > > The bigger hit will come when we try to optimize code loading by mmapping AOT code blobs into the code cache -- at present we copy-relocate it from an mmapped region. Instead of just one constant to patch at the start of the blob we will have many places to patch scattered throughout the code blob. So, many more copy-on-write pages rather than vanilla mapped pages. That drags back in copy overheads that the mmap is intended to avoid and also means less opportunities for co-hosted JVMs to share pages. Is anyone trying to load AOT blobs at a fixed address? If C2 can be persuaded to treat the BMB as a value to be propagated like any other value then all of this conversation effectively becomes a don't care. Except for legacy architectures with insufficient registers, of course... 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3073451278 From shade at openjdk.org Tue Jul 15 12:45:39 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 15 Jul 2025 12:45:39 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 11:52:04 GMT, Manuel H?ssig wrote: > FWIW, tier1-tier3, and 100 repeats of `TestStressBailout.java` on Linux x64 & aarch64, Windows x64, and Mac aarch64 all passed. > > Let me know when I should kick off another round. Thank you, that is good to know! New version handles even more obscure corner case, that I doubt would show up easily :) My Linux x86_64 server fastdebug `make test TEST=all` run just completed without problems, so we can test this version more broadly as well. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26294#issuecomment-3073452977 From aph at openjdk.org Tue Jul 15 12:48:47 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 15 Jul 2025 12:48:47 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: <8kuDtuUPOl5CsWzgmgN9V0X5hXmGUExY4rpOfAfn1ic=.b7023a27-6362-4a7f-ba77-05cc0b50e5e3@github.com> On Tue, 15 Jul 2025 12:30:33 GMT, Bhavana Kilambi wrote: >> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2912: >> >>> 2910: // If control reaches here, then the Neon instructions would be executed and >>> 2911: // one of these conditions must satisfy - >>> 2912: // UseSVE == 0 || (UseSVE == 1 && length_in_bytes == 16) >> >> Why? Can't you make this logic correct regardless of `UseSVE`? > > So the Neon implementation gets kicked in when SVE is not available (UseSVE == 0) whether the vector length is 8 or 16 but we emit Neon instructions for UseSVE ==1 and vector length == 16 only. I am not sure how I can eliminate `UseSVE` here. > > When the vector length == 8 with SVE1, we generate the SVE `tbl` instruction (with single input). This is done for `T_INT` and `T_FLOAT` types so that we avoid generating the `mulv`/`addv` instructions for the Neon `tbl` instruction. But why would the Neon implementation fail if UseSVE ==1? Surely it would still work, and if it still works this comment is wrong. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2207381415 From dlunden at openjdk.org Tue Jul 15 12:59:41 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Tue, 15 Jul 2025 12:59:41 GMT Subject: RFR: 8342941: IGV: Add various new graph dumps during loop opts [v5] In-Reply-To: <2O1F6Lj8vy0qWs_qHqmXFPkwbuOqHx1NheZsroEYKbc=.bafb3eba-164b-4c13-8c27-346d44d43486@github.com> References: <2O1F6Lj8vy0qWs_qHqmXFPkwbuOqHx1NheZsroEYKbc=.bafb3eba-164b-4c13-8c27-346d44d43486@github.com> Message-ID: On Tue, 15 Jul 2025 12:09:31 GMT, Saranya Natarajan wrote: >> This changeset adds BEFORE/AFTER graph dumps for creating a post loop (`insert_post_loop()`), removing an empty loop (`do_remove_empty_loop()`), and removing a one iteration loop (`do_one_iteration_loop()`). >> >> Changes: >> - Added `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` for dumping graphs before and after `insert_post_loop()`. >> - Added `BEFORE_REMOVE_EMPTY_LOOP` and `AFTER_REMOVE_EMPTY_LOOP` for dumping graphs before and after `do_remove_empty_loop()`. >> - Added `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` for dumping graphs before and after `do_one_iteration_loop()`. 
>> >> Below are sample screenshots (IGV print level 4 ) mainly showing the new phase . >> 1. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` >> ![image](https://github.com/user-attachments/assets/1661cede-5d70-4e0d-abec-3d091c7675c8) >> 2. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` with SuperWordLoopUnrollAnalysis enabled >> ![image](https://github.com/user-attachments/assets/6a22e6f0-4e6c-4e9d-8b6b-2bf75fac783d) >> 3.` BEFORE_REMOVE_EMPTY_LOOP `and `AFTER_REMOVE_EMPTY_LOOP` >> ![image](https://github.com/user-attachments/assets/3281f00b-575e-4604-83dd-831037d8dd47) >> 4. `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` >> ![image](https://github.com/user-attachments/assets/efddbc9a-64f7-403d-acfe-330d75a00911) >> >> Question to reviewers: >> Are the new compiler phases OK, or should we change anything? >> >> Testing: >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. >> Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > modifying one iteration loop to one-iteration Looks great now, thanks! ------------- Marked as reviewed by dlunden (Committer). PR Review: https://git.openjdk.org/jdk/pull/25756#pullrequestreview-3020172635 From duke at openjdk.org Tue Jul 15 14:05:25 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Tue, 15 Jul 2025 14:05:25 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v10] In-Reply-To: References: Message-ID: > The patch adds possibility to use RVV instructions for faster vectorizedHashCode calculations on RVV v1.0.0 capable hardware. > > Testing: hotspot/jtreg/compiler/ under QEMU-8.1 with RVV v1.0.0. Yuri Gaevsky has updated the pull request incrementally with one additional commit since the last revision: - removed tail processing with RVV instructions as simple scalar loop provides in general better results ------------- Changes: - all: https://git.openjdk.org/jdk/pull/17413/files - new: https://git.openjdk.org/jdk/pull/17413/files/6daaae6e..0c2fbee9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17413&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17413&range=08-09 Stats: 14 lines in 1 file changed: 0 ins; 14 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/17413.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17413/head:pull/17413 PR: https://git.openjdk.org/jdk/pull/17413 From duke at openjdk.org Tue Jul 15 14:05:25 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Tue, 15 Jul 2025 14:05:25 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v9] In-Reply-To: <7od62MdoD83EGfh9UTcLLE1pkkvaZclx2c9sIiLB58M=.066b94a9-e66e-475a-8434-2a6160c7642c@github.com> References: <7od62MdoD83EGfh9UTcLLE1pkkvaZclx2c9sIiLB58M=.066b94a9-e66e-475a-8434-2a6160c7642c@github.com> Message-ID: On Tue, 15 Jul 2025 08:11:28 GMT, Yuri Gaevsky wrote: >> The patch adds possibility to use RVV instructions for faster vectorizedHashCode calculations on RVV v1.0.0 capable hardware. >> >> Testing: hotspot/jtreg/compiler/ under QEMU-8.1 with RVV v1.0.0. 
> > Yuri Gaevsky has updated the pull request incrementally with one additional commit since the last revision: > > simplified arrays_hashcode_v() to be closer to VLA and use less general-purpose registers; minor cosmetic changes bpif3-16g% ( for i in "-XX:DisableIntrinsic=_vectorizedHashCode" "-XX:-UseRVV" "-XX:+UseRVV" ; \ do ( echo "--- ${i} ---" && ${JAVA_HOME}/bin/java -jar benchmarks.jar \ --jvmArgs="-XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions ${i}" \ org.openjdk.bench.java.lang.ArraysHashCode.ints \ -p size=1,5,10,20,30,40,50,60,70,80,90,100,200,300 \ -f 3 -r 1 -w 1 -wi 10 -i 10 2>&1 | tail -15 ) done ) --- -XX:DisableIntrinsic=_vectorizedHashCode --- Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.ints 1 avgt 30 11.297 ? 0.021 ns/op ArraysHashCode.ints 5 avgt 30 28.907 ? 0.117 ns/op ArraysHashCode.ints 10 avgt 30 41.196 ? 0.218 ns/op ArraysHashCode.ints 20 avgt 30 68.403 ? 0.118 ns/op ArraysHashCode.ints 30 avgt 30 88.732 ? 0.506 ns/op ArraysHashCode.ints 40 avgt 30 115.166 ? 0.103 ns/op ArraysHashCode.ints 50 avgt 30 136.047 ? 0.487 ns/op ArraysHashCode.ints 60 avgt 30 161.985 ? 0.193 ns/op ArraysHashCode.ints 70 avgt 30 170.613 ? 0.506 ns/op ArraysHashCode.ints 80 avgt 30 194.457 ? 0.547 ns/op ArraysHashCode.ints 90 avgt 30 207.872 ? 0.305 ns/op ArraysHashCode.ints 100 avgt 30 231.960 ? 0.338 ns/op ArraysHashCode.ints 200 avgt 30 448.387 ? 1.186 ns/op ArraysHashCode.ints 300 avgt 30 655.308 ? 0.146 ns/op --- -XX:-UseRVV --- Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.ints 1 avgt 30 11.295 ? 0.022 ns/op ArraysHashCode.ints 5 avgt 30 24.426 ? 0.005 ns/op ArraysHashCode.ints 10 avgt 30 35.734 ? 0.034 ns/op ArraysHashCode.ints 20 avgt 30 58.876 ? 0.015 ns/op ArraysHashCode.ints 30 avgt 30 82.964 ? 0.271 ns/op ArraysHashCode.ints 40 avgt 30 105.866 ? 0.027 ns/op ArraysHashCode.ints 50 avgt 30 129.875 ? 0.230 ns/op ArraysHashCode.ints 60 avgt 30 153.074 ? 0.331 ns/op ArraysHashCode.ints 70 avgt 30 176.633 ? 0.072 ns/op ArraysHashCode.ints 80 avgt 30 199.799 ? 0.049 ns/op ArraysHashCode.ints 90 avgt 30 223.666 ? 0.087 ns/op ArraysHashCode.ints 100 avgt 30 247.609 ? 0.447 ns/op ArraysHashCode.ints 200 avgt 30 481.884 ? 0.612 ns/op ArraysHashCode.ints 300 avgt 30 716.558 ? 0.197 ns/op --- -XX:+UseRVV --- Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.ints 1 avgt 30 11.284 ? 0.016 ns/op ArraysHashCode.ints 5 avgt 30 21.298 ? 0.009 ns/op ArraysHashCode.ints 10 avgt 30 33.820 ? 0.007 ns/op ArraysHashCode.ints 20 avgt 30 58.937 ? 0.061 ns/op ArraysHashCode.ints 30 avgt 30 84.086 ? 0.132 ns/op ArraysHashCode.ints 40 avgt 30 99.785 ? 1.721 ns/op ArraysHashCode.ints 50 avgt 30 125.043 ? 1.614 ns/op ArraysHashCode.ints 60 avgt 30 147.438 ? 0.266 ns/op ArraysHashCode.ints 70 avgt 30 120.624 ? 1.068 ns/op ArraysHashCode.ints 80 avgt 30 144.821 ? 1.065 ns/op ArraysHashCode.ints 90 avgt 30 171.626 ? 0.052 ns/op ArraysHashCode.ints 100 avgt 30 140.918 ? 0.031 ns/op ArraysHashCode.ints 200 avgt 30 223.500 ? 1.228 ns/op ArraysHashCode.ints 300 avgt 30 316.135 ? 
0.361 ns/op ------------- PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-3073732444 From thartmann at openjdk.org Tue Jul 15 14:52:48 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 15 Jul 2025 14:52:48 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v14] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 13:48:07 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. >> >> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. >> >> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Broken assertions fix With the latest version, I see this failure: java/lang/CompressExpandTest.java -Xcomp -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:+TieredCompilation -XX:+DeoptimizeALot 870 Phi === 820 227 872 [[ 339 339 120 282 334 224 338 338 119 119 337 337 336 336 340 120 335 335 432 275 283 224 340 334 ]] #int !orig=[98] !jvms: Assert::assertEquals @ bci:1 (line 797) Assert::assertEquals @ bci:3 (line 807) AbstractCompressExpandTest::assertContiguousMask @ bci:13 (line 356) AbstractCompressExpandTest::testContiguousMasksInt @ bci:48 (line 251) 917 LoadI === 916 218 219 [[ 336 336 331 119 119 339 339 432 275 337 337 338 338 224 224 340 340 120 120 342 330 118 117 116 115 341 343 114 113 112 111 344 121 110 333 332 335 335 282 439 109 329 328 334 334 222 327 122 254 369 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=7; #int (does not depend only on test, unknown control) !orig=[86] !jvms: AbstractCompressExpandTest::testContiguousMasksInt @ bci:39 (line 250) 282 ExpandBits === _ 917 870 [[ 120 120 340 340 224 224 338 338 736 338 339 339 380 ]] #int !jvms: CompressExpandTest::actualExpand @ bci:2 (line 40) AbstractCompressExpandTest::assertContiguousMask @ bci:17 (line 265) AbstractCompressExpandTest::testContiguousMasksInt @ bci:92 (line 256) told = bool tnew = int:1..2 # # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/workspace/open/src/hotspot/share/opto/phaseX.cpp:2731), pid=6038, tid=6057 # fatal error: Not monotonic # # JRE version: Java(TM) SE Runtime Environment (26.0) (fastdebug build 26-internal-2025-07-15-0649374.tobias.hartmann.jdk4) # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 26-internal-2025-07-15-0649374.tobias.hartmann.jdk4, compiled mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64) # Problematic frame: # V [libjvm.so+0x1828409] PhaseCCP::verify_type(Node*, Type const*, Type const*)+0x169 #26314 Also happens with a few other configurations/flags. 
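For context on what the nodes in the dump compute at the Java level, here is a small standalone snippet using the public `Integer.compress`/`Integer.expand` API that backs the `CompressBits`/`ExpandBits` nodes. It only illustrates the semantics and the `expand(compress(v, m), m) == (v & m)` identity that the contiguous-mask tests exercise; the mask and value below are arbitrary, and the "Not monotonic" assert itself concerns the C2 type lattice during CCP rather than these Java-level results.

```java
public class CompressExpandSemantics {
    public static void main(String[] args) {
        int mask = 0x0000_0FF0;               // a contiguous 8-bit mask starting at bit 4
        int v    = 0xCAFE_BABE;

        // compress gathers the bits of v selected by mask into the low-order bits:
        int c = Integer.compress(v, mask);    // == (v >> 4) & 0xFF == 0xAB here
        System.out.println(Integer.toHexString(c));

        // expand scatters the low-order bits of c back under mask, so
        // expand(compress(v, m), m) must always equal v & m:
        int e = Integer.expand(c, mask);
        System.out.println(e == (v & mask));  // true
    }
}
```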
------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3073939505 From thartmann at openjdk.org Tue Jul 15 15:05:44 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 15 Jul 2025 15:05:44 GMT Subject: RFR: 8360701: Add bailout when the register allocator interference graph grows unreasonably large [v2] In-Reply-To: References: Message-ID: <2vWKKv_ArH3Op0xU_h6aIUJSByiPh4AhKbGPfFSvkOk=.866186ee-265e-43f6-b2c8-4624b1538f44@github.com> On Mon, 14 Jul 2025 15:14:22 GMT, Daniel Lund?n wrote: >> The changeset for JDK-8325467 (https://git.openjdk.org/jdk/pull/20404) enables compilation of methods with many parameters, which C2 previously bailed out on. As a side effect, the tests `BigArityTest.java`, `TestCatchExceptionWithVarargs.java`, and `VarargsArrayTest.java` compile more methods than before, and additionally these methods are designed, for stress testing purposes, to have a large number of parameters (at or close to the maximum of 255 parameters allowed by the JVM spec). >> >> Compiling such methods takes a very long time and >99% of the time is spent in the C2 phase Coalesce 2 (part of register allocation). The problem is that the interference graph becomes huge after the initial round of spilling (just before Coalesce 2), and that we do not check for this and bail out if necessary. We do already bail out if the number of IR nodes grows too large, but the interference graph can become huge even if we have a small number of nodes. In fact, the interference graph may (in the worst case) hava a size that is quadratic in the number of nodes. In the problematic tests, we have interference graphs with approximately 100 000 nodes and over 55 000 000 (!) IFG edges. For comparison, the IFG edge count in worst-case realistic scenarios caps out at around 40 000 nodes and 800 000 edges. For example, see the scatter matrix below from running the DaCapo benchmark. It displays, for each time an IFG was built, the number of current IR nodes, the number of live ranges (the actual nodes in the IFG), and the number of IFG edges. >> >> ![dacapo](https://github.com/user-attachments/assets/7a070768-50da-42e4-b5ed-9958e1362673) >> >> ### Changeset >> >> - Add a new diagnostic flag `IFGEdgesLimit` and bail out whenever we reach the number of edges specified by the flag during IFG construction. The default is a very generous 10 000 000 edges, that still filters out the most degenerate compilations we have seen. >> - Add tracking of edges in `PhaseIFG` to permit the new flag. >> >> It is worth noting that it is perhaps preferable to use a lower default than 10 000 000 edges. For example, in standard benchmarks such as DaCapo (see the scatter matrix above), Renaissance, SPECjvm, and SPECjbb, we never go over 1 000 000 edges (I verified this). The reason I went with the generous 10 000 000 limit is that I saw a fair amount of bailouts in testing with the flag set at 1 000 000 edges. Such bailouts are likely motivated, but I do not want to take any chances. Even at 10 000 ... > > Daniel Lund?n has updated the pull request incrementally with one additional commit since the last revision: > > Update src/hotspot/share/opto/c2_globals.hpp > > Co-authored-by: Manuel H?ssig Thanks for the thorough analysis! The fix looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26118#pullrequestreview-3020843001 From fgao at openjdk.org Tue Jul 15 15:25:58 2025 From: fgao at openjdk.org (Fei Gao) Date: Tue, 15 Jul 2025 15:25:58 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 10:10:47 GMT, Xiaohong Gong wrote: >> Hi @Bhavana-Kilambi, @fg1417, could you please help take a look at this PR? BTW, since the vector register size of my SVE machine is 128-bit, could you please help test the correctness on a SVE machine with larger vector size (e.g. 512-bit vector size)? Thanks a lot in advance! > >> Hi @XiaohongGong , thank you for doing this. As for testing, we can currently only test on 256-bit SVE machines (we no longer have any 512bit machines). We will get back to you with the results soon. > > Testing on 256-bit SVE machines are fine to me. Thanks so much for your help! @XiaohongGong thanks for your work! Tier1 - tier3 passed on `256-bit sve` machine without new failures. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3074114305 From dlunden at openjdk.org Tue Jul 15 15:40:45 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Tue, 15 Jul 2025 15:40:45 GMT Subject: RFR: 8360701: Add bailout when the register allocator interference graph grows unreasonably large [v2] In-Reply-To: <11yDpTAB7uCCrx5givvBReRWXU4v_VMTQzhJKYMwXR4=.26076fc5-d6a3-4e00-be27-121fb04bce8b@github.com> References: <11yDpTAB7uCCrx5givvBReRWXU4v_VMTQzhJKYMwXR4=.26076fc5-d6a3-4e00-be27-121fb04bce8b@github.com> Message-ID: On Mon, 14 Jul 2025 15:41:05 GMT, Manuel H?ssig wrote: >> Daniel Lund?n has updated the pull request incrementally with one additional commit since the last revision: >> >> Update src/hotspot/share/opto/c2_globals.hpp >> >> Co-authored-by: Manuel H?ssig > > Thank you for elaborating. That makes sense. Thanks for the reviews @mhaessig and @TobiHartmann! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26118#issuecomment-3074177375 From dlunden at openjdk.org Tue Jul 15 15:40:46 2025 From: dlunden at openjdk.org (Daniel =?UTF-8?B?THVuZMOpbg==?=) Date: Tue, 15 Jul 2025 15:40:46 GMT Subject: Integrated: 8360701: Add bailout when the register allocator interference graph grows unreasonably large In-Reply-To: References: Message-ID: On Thu, 3 Jul 2025 18:13:13 GMT, Daniel Lund?n wrote: > The changeset for JDK-8325467 (https://git.openjdk.org/jdk/pull/20404) enables compilation of methods with many parameters, which C2 previously bailed out on. As a side effect, the tests `BigArityTest.java`, `TestCatchExceptionWithVarargs.java`, and `VarargsArrayTest.java` compile more methods than before, and additionally these methods are designed, for stress testing purposes, to have a large number of parameters (at or close to the maximum of 255 parameters allowed by the JVM spec). > > Compiling such methods takes a very long time and >99% of the time is spent in the C2 phase Coalesce 2 (part of register allocation). The problem is that the interference graph becomes huge after the initial round of spilling (just before Coalesce 2), and that we do not check for this and bail out if necessary. We do already bail out if the number of IR nodes grows too large, but the interference graph can become huge even if we have a small number of nodes. In fact, the interference graph may (in the worst case) hava a size that is quadratic in the number of nodes. 
In the problematic tests, we have interference graphs with approximately 100 000 nodes and over 55 000 000 (!) IFG edges. For comparison, the IFG edge count in worst-case realistic scenarios caps out at around 40 000 nodes and 800 000 edges. For example, see the scatter matrix below from running the DaCapo benchmark. It displays, for each time an IFG was built, the number of current IR nodes, the number of live ranges ( the actual nodes in the IFG), and the number of IFG edges. > > ![dacapo](https://github.com/user-attachments/assets/7a070768-50da-42e4-b5ed-9958e1362673) > > ### Changeset > > - Add a new diagnostic flag `IFGEdgesLimit` and bail out whenever we reach the number of edges specified by the flag during IFG construction. The default is a very generous 10 000 000 edges, that still filters out the most degenerate compilations we have seen. > - Add tracking of edges in `PhaseIFG` to permit the new flag. > > It is worth noting that it is perhaps preferable to use a lower default than 10 000 000 edges. For example, in standard benchmarks such as DaCapo (see the scatter matrix above), Renaissance, SPECjvm, and SPECjbb, we never go over 1 000 000 edges (I verified this). The reason I went with the generous 10 000 000 limit is that I saw a fair amount of bailouts in testing with the flag set at 1 000 000 edges. Such bailouts are likely motivated, but I do not want to take any chances. Even at 10 000 000 edges, a few tests s... This pull request has now been integrated. Changeset: 820263e4 Author: Daniel Lund?n URL: https://git.openjdk.org/jdk/commit/820263e48abf3ddce9506eb19872871aa3ea8b50 Stats: 38 lines in 4 files changed: 37 ins; 0 del; 1 mod 8360701: Add bailout when the register allocator interference graph grows unreasonably large Reviewed-by: mhaessig, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/26118 From mchevalier at openjdk.org Tue Jul 15 15:47:39 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 15 Jul 2025 15:47:39 GMT Subject: RFR: 8358573: CompileBroker fails with "expect failure reason" assert with -XX:-InstallMethods In-Reply-To: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> References: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> Message-ID: On Tue, 15 Jul 2025 09:21:53 GMT, Beno?t Maillard wrote: > This PR prevents from hitting an assert when disabling method installation at the end of a successful compilation with the `-XX:-InstallMethods` flag. Previously `CompileBroker` failed to mark the `CompileTask` as complete when using this flag. > > ## Analysis > > We can see that the assert is triggered in `CompileBroker::invoke_compiler_on_method`: > > ```c++ > if (!ci_env.failing() && !task->is_success()) { > assert(ci_env.failure_reason() != nullptr, "expect failure reason"); > assert(false, "compiler should always document failure: %s", ci_env.failure_reason()); > // The compiler elected, without comment, not to register a result. > // Do not attempt further compilations of this method. > ci_env.record_method_not_compilable("compile failed"); > } > > > The `task->is_success()` call accesses the private `_is_success` field. > This field is modified in `CompileTask::mark_success`. 
> > By setting a breakpoint there, and execute the program without `-XX:-InstallMethods`, we get the following stacktrace: > > CompileTask::mark_success compileTask.hpp:185 > nmethod::post_compiled_method nmethod.cpp:2212 > ciEnv::register_method ciEnv.cpp:1127 > Compilation::install_code c1_Compilation.cpp:425 > Compilation::compile_method c1_Compilation.cpp:488 > Compilation::Compilation c1_Compilation.cpp:609 > Compiler::compile_method c1_Compiler.cpp:262 > CompileBroker::invoke_compiler_on_method compileBroker.cpp:2324 > CompileBroker::compiler_thread_loop compileBroker.cpp:1968 > CompilerThread::thread_entry compilerThread.cpp:67 > JavaThread::thread_main_inner javaThread.cpp:773 > JavaThread::run javaThread.cpp:758 > Thread::call_run thread.cpp:243 > thread_native_entry os_linux.cpp:868 > > > We go up the stacktrace and see that in `Compilation::compile_method` we have: > > ```c++ > if (should_install_code()) { > // install code > PhaseTraceTime timeit(_t_codeinstall); > install_code(frame_size); > } > > > If we do not install methods after compilation, the code path that marks the success is never executed > and therefore results in hitting the assert. > > ### Fix > We simply mark the task as complete when `should_install_code()` evaluates to `false` in the block code above. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8358573) > - [ ] tier1-3, plus some internal testing > - [x] Added a test that starts the VM with the `-XX:-InstallMethods` flag > > Thank you for reviewing! Nice description! src/hotspot/share/c1/c1_Compilation.cpp line 490: > 488: install_code(frame_size); > 489: } else { > 490: // nothing else to do Not sure how helpful is this comment: if one should not install the code, we don't install the code, but it doesn't really tells me why we need to do something then. I wouldn't mind no comment at all, and I can git blame and find the JBS issue/PR to see the motivation, but most of the time, I wouldn't question this line in particular, so I'm fine not having distractions in the code giving very specific details. There are very specific details I'm not challenging absolutely everywhere... Otherwise, maybe writing something like "making sure the lack of installed code is not confused with a compilation bailout". But once again, I'm rather on the side of "I'll look at the blame and PR if I ever question it, but I'll probably never". The PR explaining well the situation, I'm fine with it! test/hotspot/jtreg/compiler/c1/TestDisableInstallMethods.java line 45: > 43: ProcessBuilder pb = ProcessTools.createLimitedTestJavaProcessBuilder("-XX:-InstallMethods", "-version"); > 44: OutputAnalyzer output = new OutputAnalyzer(pb.start()); > 45: output.shouldHaveExitValue(0); Is it needed to go through a subprocess (at least, this explicitly)? Would `@run driver/othervm -XX:-InstallMethods compiler.c1.TestDisableInstallMethods` and an empty main (or just enough to do something) do the job? It seems simpler to me (doesn't need any import for instance), but also would allow testing it in interaction with other flags, as tests works. Then, with IgnoreUnrecognizedVMOptions, can we get rid of the `@requires`? Even tho... it means we are then testing an empty program with no actual flags... Might not be very interesting. 
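To make that suggestion concrete, the simpler shape could look roughly like the sketch below. The jtreg tags are placeholders (in particular the `@requires` line, the tier-limiting flag, and whether `driver` or `main/othervm` is the right action depend on how the flag is defined), so treat it as a shape rather than the final test:

```java
/*
 * @test
 * @bug 8358573
 * @summary Compile something with -XX:-InstallMethods and make sure the VM stays healthy
 * @requires vm.debug == true & vm.compiler1.enabled
 * @run main/othervm -XX:TieredStopAtLevel=1 -XX:-InstallMethods compiler.c1.TestDisableInstallMethods
 */
package compiler.c1;

public class TestDisableInstallMethods {
    public static void main(String[] args) {
        // Enough work that some method actually gets C1-compiled (and then
        // deliberately not installed) while the flag is active.
        long sum = 0;
        for (int i = 0; i < 200_000; i++) {
            sum += work(i);
        }
        System.out.println(sum);
    }

    private static int work(int i) {
        return (i * 31) ^ (i >>> 3);
    }
}
```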
------------- PR Review: https://git.openjdk.org/jdk/pull/26310#pullrequestreview-3021010979 PR Review Comment: https://git.openjdk.org/jdk/pull/26310#discussion_r2207883258 PR Review Comment: https://git.openjdk.org/jdk/pull/26310#discussion_r2207867332 From fgao at openjdk.org Tue Jul 15 16:02:39 2025 From: fgao at openjdk.org (Fei Gao) Date: Tue, 15 Jul 2025 16:02:39 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: Message-ID: <1kurEbCQo0eiwZOGC93pjbxBCsZyqEaIurlzhV8v3wM=.70db96fc-db0d-40a4-a640-94c8130fdf2a@github.com> On Mon, 14 Jul 2025 10:10:47 GMT, Xiaohong Gong wrote: >> Hi @Bhavana-Kilambi, @fg1417, could you please help take a look at this PR? BTW, since the vector register size of my SVE machine is 128-bit, could you please help test the correctness on a SVE machine with larger vector size (e.g. 512-bit vector size)? Thanks a lot in advance! > >> Hi @XiaohongGong , thank you for doing this. As for testing, we can currently only test on 256-bit SVE machines (we no longer have any 512bit machines). We will get back to you with the results soon. > > Testing on 256-bit SVE machines are fine to me. Thanks so much for your help! @XiaohongGong Please correct me if I?m missing something or got anything wrong. Taking `short` on `512-bit` machine as an example, these instructions would be generated: // vgather sve_dup vtmp, 0 sve_load_0 => [0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a] sve_uzp1 with vtmp => [00 00 00 00 00 00 00 00 aa aa aa aa aa aa aa aa] // vgather1 sve_dup vtmp, 0 sve_load_1 => [0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b] sve_uzp1 with vtmp => [00 00 00 00 00 00 00 00 bb bb bb bb bb bb bb bb] // Slice vgather1, vgather1 ext => [bb bb bb bb bb bb bb bb 00 00 00 00 00 00 00 00] // Or vgather, vslice sve_orr => [bb bb bb bb bb bb bb bb aa aa aa aa aa aa aa aa] Actually, we can get the target result directly by `uzp1` the output from `sve_load_0` and `sve_load_1`, like [0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a] [0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b] uzp1 => [bb bb bb bb bb bb bb bb aa aa aa aa aa aa aa aa] If so, the current design of `LoadVectorGather` may not be sufficiently low-level to suit `AArch64`. WDYT? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3074255909 From jbhateja at openjdk.org Tue Jul 15 16:36:46 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 15 Jul 2025 16:36:46 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v14] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 14:49:53 GMT, Tobias Hartmann wrote: > With the latest version, I see this failure: > > ``` > java/lang/CompressExpandTest.java > -Xcomp -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:+TieredCompilation -XX:+DeoptimizeALot > > 870 Phi === 820 227 872 [[ 339 339 120 282 334 224 338 338 119 119 337 337 336 336 340 120 335 335 432 275 283 224 340 334 ]] #int !orig=[98] !jvms: Assert::assertEquals @ bci:1 (line 797) Assert::assertEquals @ bci:3 (line 807) AbstractCompressExpandTest::assertContiguousMask @ bci:13 (line 356) AbstractCompressExpandTest::testContiguousMasksInt @ bci:48 (line 251) > 917 LoadI === 916 218 219 [[ 336 336 331 119 119 339 339 432 275 337 337 338 338 224 224 340 340 120 120 342 330 118 117 116 115 341 343 114 113 112 111 344 121 110 333 332 335 335 282 439 109 329 328 334 334 222 327 122 254 369 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=7; #int (does not depend only on test, unknown control) !orig=[86] !jvms: AbstractCompressExpandTest::testContiguousMasksInt @ bci:39 (line 250) > 282 ExpandBits === _ 917 870 [[ 120 120 340 340 224 224 338 338 736 338 339 339 380 ]] #int !jvms: CompressExpandTest::actualExpand @ bci:2 (line 40) AbstractCompressExpandTest::assertContiguousMask @ bci:17 (line 265) AbstractCompressExpandTest::testContiguousMasksInt @ bci:92 (line 256) > told = bool > tnew = int:1..2 > # > # A fatal error has been detected by the Java Runtime Environment: > # > # Internal Error (/workspace/open/src/hotspot/share/opto/phaseX.cpp:2731), pid=6038, tid=6057 > # fatal error: Not monotonic > # > # JRE version: Java(TM) SE Runtime Environment (26.0) (fastdebug build 26-internal-2025-07-15-0649374.tobias.hartmann.jdk4) > # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 26-internal-2025-07-15-0649374.tobias.hartmann.jdk4, compiled mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64) > # Problematic frame: > # V [libjvm.so+0x1828409] PhaseCCP::verify_type(Node*, Type const*, Type const*)+0x169 > #26314 > ``` > > Also happens with a few other configurations/flags. Thanks @TobiHartmann , looking at this. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3074269216 From thartmann at openjdk.org Tue Jul 15 16:55:44 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 15 Jul 2025 16:55:44 GMT Subject: RFR: 8358573: CompileBroker fails with "expect failure reason" assert with -XX:-InstallMethods In-Reply-To: References: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> Message-ID: On Tue, 15 Jul 2025 15:36:30 GMT, Marc Chevalier wrote: >> This PR prevents from hitting an assert when disabling method installation at the end of a successful compilation with the `-XX:-InstallMethods` flag. Previously `CompileBroker` failed to mark the `CompileTask` as complete when using this flag. 
>> >> ## Analysis >> >> We can see that the assert is triggered in `CompileBroker::invoke_compiler_on_method`: >> >> ```c++ >> if (!ci_env.failing() && !task->is_success()) { >> assert(ci_env.failure_reason() != nullptr, "expect failure reason"); >> assert(false, "compiler should always document failure: %s", ci_env.failure_reason()); >> // The compiler elected, without comment, not to register a result. >> // Do not attempt further compilations of this method. >> ci_env.record_method_not_compilable("compile failed"); >> } >> >> >> The `task->is_success()` call accesses the private `_is_success` field. >> This field is modified in `CompileTask::mark_success`. >> >> By setting a breakpoint there, and execute the program without `-XX:-InstallMethods`, we get the following stacktrace: >> >> CompileTask::mark_success compileTask.hpp:185 >> nmethod::post_compiled_method nmethod.cpp:2212 >> ciEnv::register_method ciEnv.cpp:1127 >> Compilation::install_code c1_Compilation.cpp:425 >> Compilation::compile_method c1_Compilation.cpp:488 >> Compilation::Compilation c1_Compilation.cpp:609 >> Compiler::compile_method c1_Compiler.cpp:262 >> CompileBroker::invoke_compiler_on_method compileBroker.cpp:2324 >> CompileBroker::compiler_thread_loop compileBroker.cpp:1968 >> CompilerThread::thread_entry compilerThread.cpp:67 >> JavaThread::thread_main_inner javaThread.cpp:773 >> JavaThread::run javaThread.cpp:758 >> Thread::call_run thread.cpp:243 >> thread_native_entry os_linux.cpp:868 >> >> >> We go up the stacktrace and see that in `Compilation::compile_method` we have: >> >> ```c++ >> if (should_install_code()) { >> // install code >> PhaseTraceTime timeit(_t_codeinstall); >> install_code(frame_size); >> } >> >> >> If we do not install methods after compilation, the code path that marks the success is never executed >> and therefore results in hitting the assert. >> >> ### Fix >> We simply mark the task as complete when `should_install_code()` evaluates to `false` in the block code above. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8358573) >> - [ ] tier1-3, plus some internal testing >> - [x] Added a test that starts the VM with the `-XX:-InstallMethods` ... > > test/hotspot/jtreg/compiler/c1/TestDisableInstallMethods.java line 45: > >> 43: ProcessBuilder pb = ProcessTools.createLimitedTestJavaProcessBuilder("-XX:-InstallMethods", "-version"); >> 44: OutputAnalyzer output = new OutputAnalyzer(pb.start()); >> 45: output.shouldHaveExitValue(0); > > Is it needed to go through a subprocess (at least, this explicitly)? Would `@run driver/othervm -XX:-InstallMethods compiler.c1.TestDisableInstallMethods` and an empty main (or just enough to do something) do the job? > > It seems simpler to me (doesn't need any import for instance), but also would allow testing it in interaction with other flags, as tests works. Then, with IgnoreUnrecognizedVMOptions, can we get rid of the `@requires`? Even tho... it means we are then testing an empty program with no actual flags... Might not be very interesting. I agree, you could also just add a test case to `test/hotspot/jtreg/compiler/arguments/TestC1Globals.java`. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26310#discussion_r2208026912 From shade at openjdk.org Tue Jul 15 17:11:38 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 15 Jul 2025 17:11:38 GMT Subject: RFR: 8362250: ARM32: forward_exception_entry missing return address In-Reply-To: <4O9aorKuQ5wpIGNVsjHd8K8lIQR-uRDxEl7HsFuUyXk=.e9c48f43-f406-4540-a231-cae9bdfc0f11@github.com> References: <4O9aorKuQ5wpIGNVsjHd8K8lIQR-uRDxEl7HsFuUyXk=.e9c48f43-f406-4540-a231-cae9bdfc0f11@github.com> Message-ID: On Tue, 15 Jul 2025 10:23:23 GMT, Boris Ulasevich wrote: > The ARM32 ForwardExceptionNode codegen needs to set the exception address to R5. And, since the https://github.com/openjdk/jdk/pull/20437 change, the TailCall generator does not need this because the StubRoutines::forward_exception_entry function is not called there. Looks right, but I reckon you want to adjust `format %{ ...` in both match rules as well. ------------- PR Review: https://git.openjdk.org/jdk/pull/26312#pullrequestreview-3021349646 From chagedorn at openjdk.org Tue Jul 15 17:21:45 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 15 Jul 2025 17:21:45 GMT Subject: RFR: 8342941: IGV: Add various new graph dumps during loop opts [v5] In-Reply-To: <2O1F6Lj8vy0qWs_qHqmXFPkwbuOqHx1NheZsroEYKbc=.bafb3eba-164b-4c13-8c27-346d44d43486@github.com> References: <2O1F6Lj8vy0qWs_qHqmXFPkwbuOqHx1NheZsroEYKbc=.bafb3eba-164b-4c13-8c27-346d44d43486@github.com> Message-ID: On Tue, 15 Jul 2025 12:09:31 GMT, Saranya Natarajan wrote: >> This changeset adds BEFORE/AFTER graph dumps for creating a post loop (`insert_post_loop()`), removing an empty loop (`do_remove_empty_loop()`), and removing a one iteration loop (`do_one_iteration_loop()`). >> >> Changes: >> - Added `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` for dumping graphs before and after `insert_post_loop()`. >> - Added `BEFORE_REMOVE_EMPTY_LOOP` and `AFTER_REMOVE_EMPTY_LOOP` for dumping graphs before and after `do_remove_empty_loop()`. >> - Added `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` for dumping graphs before and after `do_one_iteration_loop()`. >> >> Below are sample screenshots (IGV print level 4 ) mainly showing the new phase . >> 1. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` >> ![image](https://github.com/user-attachments/assets/1661cede-5d70-4e0d-abec-3d091c7675c8) >> 2. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` with SuperWordLoopUnrollAnalysis enabled >> ![image](https://github.com/user-attachments/assets/6a22e6f0-4e6c-4e9d-8b6b-2bf75fac783d) >> 3.` BEFORE_REMOVE_EMPTY_LOOP `and `AFTER_REMOVE_EMPTY_LOOP` >> ![image](https://github.com/user-attachments/assets/3281f00b-575e-4604-83dd-831037d8dd47) >> 4. `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` >> ![image](https://github.com/user-attachments/assets/efddbc9a-64f7-403d-acfe-330d75a00911) >> >> Question to reviewers: >> Are the new compiler phases OK, or should we change anything? >> >> Testing: >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. >> Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > modifying one iteration loop to one-iteration Awesome, thanks for adding the duplicate loop backedge dumps! Looks good! 
------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25756#pullrequestreview-3021387423 From bulasevich at openjdk.org Tue Jul 15 17:42:27 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Tue, 15 Jul 2025 17:42:27 GMT Subject: RFR: 8362250: ARM32: forward_exception_entry missing return address [v2] In-Reply-To: <4O9aorKuQ5wpIGNVsjHd8K8lIQR-uRDxEl7HsFuUyXk=.e9c48f43-f406-4540-a231-cae9bdfc0f11@github.com> References: <4O9aorKuQ5wpIGNVsjHd8K8lIQR-uRDxEl7HsFuUyXk=.e9c48f43-f406-4540-a231-cae9bdfc0f11@github.com> Message-ID: > The ARM32 ForwardExceptionNode codegen needs to set the exception address to R5. And, since the https://github.com/openjdk/jdk/pull/20437 change, the TailCall generator does not need this because the StubRoutines::forward_exception_entry function is not called there. Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: adjust ad rules format ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26312/files - new: https://git.openjdk.org/jdk/pull/26312/files/023c2216..fd61008b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26312&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26312&range=00-01 Stats: 4 lines in 1 file changed: 1 ins; 1 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26312.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26312/head:pull/26312 PR: https://git.openjdk.org/jdk/pull/26312 From shade at openjdk.org Tue Jul 15 18:37:41 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 15 Jul 2025 18:37:41 GMT Subject: RFR: 8362250: ARM32: forward_exception_entry missing return address [v2] In-Reply-To: References: <4O9aorKuQ5wpIGNVsjHd8K8lIQR-uRDxEl7HsFuUyXk=.e9c48f43-f406-4540-a231-cae9bdfc0f11@github.com> Message-ID: <9QDgz1EU5v5rT4q4nuxnyyQvTklgi0NYsyT8NikCv30=.967d86fe-4a7d-49e4-a7b7-30fd31d4d937@github.com> On Tue, 15 Jul 2025 17:42:27 GMT, Boris Ulasevich wrote: >> The ARM32 ForwardExceptionNode codegen needs to set the exception address to R5. And, since the https://github.com/openjdk/jdk/pull/20437 change, the TailCall generator does not need this because the StubRoutines::forward_exception_entry function is not called there. > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > adjust ad rules format Marked as reviewed by shade (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26312#pullrequestreview-3021799623 From mli at openjdk.org Tue Jul 15 18:45:51 2025 From: mli at openjdk.org (Hamlin Li) Date: Tue, 15 Jul 2025 18:45:51 GMT Subject: RFR: 8362284: RISC-V: cleanup NativeMovRegMem Message-ID: <7QlEqzQUzoDK6NycLx0HECjospeft1MwbOQh7aHVq8U=.efa2d92e-a024-4447-9565-8f6ee7ee4774@github.com> Hi, Can you help to review this simple patch? NativeMovRegMem on riscv is actually dead code, but it is still needed for C1 to compile. So make the code as simple as possible to avoid any reading and maintenance effort. No tests, as `offset()` and `set_offset()` were Unimplemented, are used in C1, and have never been triggered before. Thanks!
------------- Commit messages: - initial commit Changes: https://git.openjdk.org/jdk/pull/26328/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26328&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362284 Stats: 40 lines in 2 files changed: 0 ins; 33 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/26328.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26328/head:pull/26328 PR: https://git.openjdk.org/jdk/pull/26328 From jkarthikeyan at openjdk.org Tue Jul 15 21:57:31 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Tue, 15 Jul 2025 21:57:31 GMT Subject: RFR: 8362171: C2 fails with unexpected node in SuperWord truncation: ModI Message-ID: Hi all, This is a small fix for an assert failure in SuperWord truncation with ModI nodes. The failure itself is harmless and shouldn't lead to any miscompilations in product mode. I've added `ModI` to the assert switch and adapted the test in the bug report. Let me know what you think! ------------- Commit messages: - Fix truncation assert with ModI nodes Changes: https://git.openjdk.org/jdk/pull/26334/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26334&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362171 Stats: 14 lines in 2 files changed: 13 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26334.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26334/head:pull/26334 PR: https://git.openjdk.org/jdk/pull/26334 From dlong at openjdk.org Tue Jul 15 22:23:39 2025 From: dlong at openjdk.org (Dean Long) Date: Tue, 15 Jul 2025 22:23:39 GMT Subject: RFR: 8358573: CompileBroker fails with "expect failure reason" assert with -XX:-InstallMethods In-Reply-To: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> References: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> Message-ID: On Tue, 15 Jul 2025 09:21:53 GMT, Beno?t Maillard wrote: > This PR prevents from hitting an assert when disabling method installation at the end of a successful compilation with the `-XX:-InstallMethods` flag. Previously `CompileBroker` failed to mark the `CompileTask` as complete when using this flag. > > ## Analysis > > We can see that the assert is triggered in `CompileBroker::invoke_compiler_on_method`: > > ```c++ > if (!ci_env.failing() && !task->is_success()) { > assert(ci_env.failure_reason() != nullptr, "expect failure reason"); > assert(false, "compiler should always document failure: %s", ci_env.failure_reason()); > // The compiler elected, without comment, not to register a result. > // Do not attempt further compilations of this method. > ci_env.record_method_not_compilable("compile failed"); > } > > > The `task->is_success()` call accesses the private `_is_success` field. > This field is modified in `CompileTask::mark_success`. 
> > By setting a breakpoint there, and execute the program without `-XX:-InstallMethods`, we get the following stacktrace: > > CompileTask::mark_success compileTask.hpp:185 > nmethod::post_compiled_method nmethod.cpp:2212 > ciEnv::register_method ciEnv.cpp:1127 > Compilation::install_code c1_Compilation.cpp:425 > Compilation::compile_method c1_Compilation.cpp:488 > Compilation::Compilation c1_Compilation.cpp:609 > Compiler::compile_method c1_Compiler.cpp:262 > CompileBroker::invoke_compiler_on_method compileBroker.cpp:2324 > CompileBroker::compiler_thread_loop compileBroker.cpp:1968 > CompilerThread::thread_entry compilerThread.cpp:67 > JavaThread::thread_main_inner javaThread.cpp:773 > JavaThread::run javaThread.cpp:758 > Thread::call_run thread.cpp:243 > thread_native_entry os_linux.cpp:868 > > > We go up the stacktrace and see that in `Compilation::compile_method` we have: > > ```c++ > if (should_install_code()) { > // install code > PhaseTraceTime timeit(_t_codeinstall); > install_code(frame_size); > } > > > If we do not install methods after compilation, the code path that marks the success is never executed > and therefore results in hitting the assert. > > ### Fix > We simply mark the task as complete when `should_install_code()` evaluates to `false` in the block code above. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8358573) > - [ ] tier1-3, plus some internal testing > - [x] Added a test that starts the VM with the `-XX:-InstallMethods` flag > > Thank you for reviewing! Is this flag really supported? I can't find any tests for it. I wonder if anyone would miss it if we removed it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26310#issuecomment-3075924647 From sviswanathan at openjdk.org Tue Jul 15 23:22:41 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Tue, 15 Jul 2025 23:22:41 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> References: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> Message-ID: <7ldMQV8jpynvbli7ioHwQCB-_y0LkwphDmxKqu5r9E0=.9cfc493b-64cd-48aa-82c6-bc3c4b680814@github.com> On Tue, 8 Jul 2025 22:44:55 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to enhance the existing x86 assembly stubs using PUSH and POP instructions with paired PUSHP/POPP instructions which are part of Intel APX technology. >> >> In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0?R31). Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs. >> >> Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures. 
> > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > rename to paired_push and paired_pop src/hotspot/cpu/x86/gc/z/zBarrierSetAssembler_x86.cpp line 112: > 110: } else { > 111: if (_result != rax) { > 112: __ paired_push(rax); No need to use paired_push on this else path as it is for non APX. src/hotspot/cpu/x86/gc/z/zBarrierSetAssembler_x86.cpp line 198: > 196: } > 197: } else { > 198: __ paired_pop(r11); No need to use paired pop on this else path as this is for non-APX. src/hotspot/cpu/x86/vm_version_x86.cpp line 156: > 154: // rcx and rdx are first and second argument registers on windows > 155: > 156: __ paired_push(rbp); We should not use paired push/pop in vm_version_x86.cpp. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2208843869 PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2208845470 PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2208815480 From sparasa at openjdk.org Tue Jul 15 23:56:33 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Tue, 15 Jul 2025 23:56:33 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v3] In-Reply-To: References: Message-ID: > The goal of this PR is to enhance the existing x86 assembly stubs using PUSH and POP instructions with paired PUSHP/POPP instructions which are part of Intel APX technology. > > In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0?R31). Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs. > > Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures. 
Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: remove pushp/popp from vm_version_x86 and also when APX is not being used ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25889/files - new: https://git.openjdk.org/jdk/pull/25889/files/24e6da2c..2cc7c4b1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25889&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25889&range=01-02 Stats: 42 lines in 2 files changed: 0 ins; 0 del; 42 mod Patch: https://git.openjdk.org/jdk/pull/25889.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25889/head:pull/25889 PR: https://git.openjdk.org/jdk/pull/25889 From sparasa at openjdk.org Tue Jul 15 23:56:34 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Tue, 15 Jul 2025 23:56:34 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: <7ldMQV8jpynvbli7ioHwQCB-_y0LkwphDmxKqu5r9E0=.9cfc493b-64cd-48aa-82c6-bc3c4b680814@github.com> References: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> <7ldMQV8jpynvbli7ioHwQCB-_y0LkwphDmxKqu5r9E0=.9cfc493b-64cd-48aa-82c6-bc3c4b680814@github.com> Message-ID: On Tue, 15 Jul 2025 22:41:17 GMT, Sandhya Viswanathan wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> rename to paired_push and paired_pop > > src/hotspot/cpu/x86/gc/z/zBarrierSetAssembler_x86.cpp line 198: > >> 196: } >> 197: } else { >> 198: __ paired_pop(r11); > > No need to use paired pop on this else path as this is for non-APX. Thanks for the catch! Removed the pushp/popp for the non-APX else path in the updated code. > src/hotspot/cpu/x86/vm_version_x86.cpp line 156: > >> 154: // rcx and rdx are first and second argument registers on windows >> 155: >> 156: __ paired_push(rbp); > > We should not use paired push/pop in vm_version_x86.cpp. Please see the updated code removing the pushp/popp from vm_version_x86.cpp. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2208915438 PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2208913711 From duke at openjdk.org Wed Jul 16 00:00:57 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Wed, 16 Jul 2025 00:00:57 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v7] In-Reply-To: <72OW9wHbET022fBnWx1Wdxb_J9pbH2sLiAqlC9fGb-c=.6930c0b1-33bb-4c49-af02-11e2c79dbaf2@github.com> References: <72OW9wHbET022fBnWx1Wdxb_J9pbH2sLiAqlC9fGb-c=.6930c0b1-33bb-4c49-af02-11e2c79dbaf2@github.com> Message-ID: On Thu, 8 May 2025 20:25:50 GMT, Chad Rakoczy wrote: > Okay. Speaking of which, seems like the NMethodState_lock is held for way too long - usually just held when setting the Method code and updating the nmethod state after the initial state is set. Keeping the lock across other things makes me worried of deadlocks. This is an old comment but looking at [JDK-8358821](https://bugs.openjdk.org/browse/JDK-8358821) made me wonder if `NMethodState_lock` actually should be held for the entirety of the relocation. Otherwise the nmethod could be marked not entrant after we perform the `is_in_use()` check. I don't think `CodeCache_lock` and `Compile_lock` are enough to prevent this. What do you think @dean-long @fisk ? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23573#issuecomment-3076155602 From duke at openjdk.org Wed Jul 16 00:03:50 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Wed, 16 Jul 2025 00:03:50 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v37] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 20:34:42 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This does not modify existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: > > Revert is_always_within_branch_range changes I also don't think the `CodeCache_lock` can be acquired in the `relocate` function. This should be the responsibility of the caller. There is nothing preventing `relocate` from blocking on `CodeCache_lock` while the nmethod it is relocating gets purged from the CodeCache. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23573#issuecomment-3076195035 From sparasa at openjdk.org Wed Jul 16 00:06:53 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Wed, 16 Jul 2025 00:06:53 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v4] In-Reply-To: References: Message-ID: <-V4hpHvXdaDjmEyYzHcEpDJ2bzPTqoz2Ao8FLobkmB8=.d9e3b962-ae8d-4e4b-8ddb-c3ab42a2a619@github.com> > The goal of this PR is to enhance the existing x86 assembly stubs using PUSH and POP instructions with paired PUSHP/POPP instructions which are part of Intel APX technology. > > In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0–R31). Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs. > > Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures. Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains four commits: - merge with master - remove pushp/popp from vm_version_x86 and also when APX is not being used - rename to paired_push and paired_pop - 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs ------------- Changes: https://git.openjdk.org/jdk/pull/25889/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=25889&range=03 Stats: 341 lines in 22 files changed: 19 ins; 0 del; 322 mod Patch: https://git.openjdk.org/jdk/pull/25889.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25889/head:pull/25889 PR: https://git.openjdk.org/jdk/pull/25889 From dhanalla at openjdk.org Wed Jul 16 01:45:47 2025 From: dhanalla at openjdk.org (Dhamoder Nalla) Date: Wed, 16 Jul 2025 01:45:47 GMT Subject: RFR: 8341293: Split field loads through Nested Phis [v10] In-Reply-To: References: Message-ID: > This enhances the changes introduced in [JDK PR 12897](https://github.com/openjdk/jdk/pull/12897) by handling nested Phi nodes (phi -> phi -> AddP -> Load*) during scalar replacement. The primary goal is to split field loads (AddP -> Load*) involving nested Phi parent nodes, thereby increasing opportunities for scalar replacement and reducing memory allocations. > > > **Here is an illustration of the sequence of Ideal Graph Transformations applied to split through nested `Phi` nodes.** > > **1. Initial State (Before Transformation)** > The graph contains a nested Phi structure where two Allocate nodes merge via a Phi node. > > ![image](https://github.com/user-attachments/assets/c18e5ca0-c554-475c-814a-7cb288d96569) > > **2. After Splitting Through Child Phi** > The transformation separates field loads by introducing additional AddP and Load nodes for each Allocate input. > > ![image](https://github.com/user-attachments/assets/b279b5f2-9ec6-4d9b-a627-506451f1cf81) > > **3. After Splitting Load Field Through Parent Phi** > The field load operation (Load) is pushed even further up in the graph. > > Instead of merging AddP pointers in a Phi node and then performing a Load, the transformation ensures that each path has its AddP -> Load sequence before merging. > > This further eliminates the need to perform field loads on a Phi node, making the graph more conducive to scalar replacement. > > ![image](https://github.com/user-attachments/assets/f506b918-2dd0-4dbe-a440-ff253afa3961) > > ### JMH Benchmark Results: > > #### With Disabled RAM > > | Benchmark | Mode | Count | Score | Error | Units | > |-----------|------|-------|-------|-------|-------| > | testBailOut_runner | avgt | 15 | 13.969 | ? 0.248 | ms/op | > | testFieldEscapeWithMerge_runner | avgt | 15 | 80.300 | ? 4.306 | ms/op | > | testMerge_TryCatchFinally_runner | avgt | 15 | 72.182 | ? 1.781 | ms/op | > | testMultiParentPhi_runner | avgt | 15 | 2.983 | ? 0.001 | ms/op | > | testNestedPhiPolymorphic_runner | avgt | 15 | 18.342 | ? 0.731 | ms/op | > | testNestedPhiProcessOrder_runner | avgt | 15 | 14.315 | ? 0.443 | ms/op | > | testNestedPhiWithLambda_runner | avgt | 15 | 18.511 | ? 1.212 | ms/op | > | testNestedPhiWithTrap_runner | avgt | 15 | 66.277 | ? 1.478 | ms/op | > | testNestedPhi_FieldLoad_runner | avgt | 15 | 17.968 | ? 0.306 | ms/op | > | testNestedPhi_TryCatch_runner | avgt | 15 | 14.186 | ? 0.247 | ms/op | > | testRematerialize_MultiObj_runner | avgt | 15 | 88.435 | ? 4.869 | ms/op | > | testRematerialize_SingleObj_runner | avgt | 15 | 29560.130 | ? 48.797 ... 
Dhamoder Nalla has updated the pull request incrementally with one additional commit since the last revision: address CR comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/21270/files - new: https://git.openjdk.org/jdk/pull/21270/files/7947053b..ec176f20 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=21270&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=21270&range=08-09 Stats: 5 lines in 2 files changed: 2 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/21270.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/21270/head:pull/21270 PR: https://git.openjdk.org/jdk/pull/21270 From dzhang at openjdk.org Wed Jul 16 02:04:41 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 16 Jul 2025 02:04:41 GMT Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors [v3] In-Reply-To: References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> Message-ID: On Tue, 15 Jul 2025 07:21:44 GMT, Fei Yang wrote: >> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> Use switch-case in min_vector_size > > src/hotspot/cpu/riscv/riscv.ad line 1999: > >> 1997: break; >> 1998: case T_SHORT: >> 1999: // To support vector type conversions between short and wider types. > > The code comment doesn't seem to reflect the purpose of this change. Can you improve it adding more details? Of course. Done. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26239#discussion_r2209048762 From xgong at openjdk.org Wed Jul 16 03:46:50 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 16 Jul 2025 03:46:50 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: <8kuDtuUPOl5CsWzgmgN9V0X5hXmGUExY4rpOfAfn1ic=.b7023a27-6362-4a7f-ba77-05cc0b50e5e3@github.com> References: <8kuDtuUPOl5CsWzgmgN9V0X5hXmGUExY4rpOfAfn1ic=.b7023a27-6362-4a7f-ba77-05cc0b50e5e3@github.com> Message-ID: <-PJGHyiAZjJWaqqCJo4pNsrDrZOr8RkGp-r43y6HGfY=.ef3c4a1c-5557-400c-83f8-71ed2e19fe21@github.com> On Tue, 15 Jul 2025 12:45:51 GMT, Andrew Haley wrote: >> So the Neon implementation gets kicked in when SVE is not available (UseSVE == 0) whether the vector length is 8 or 16 but we emit Neon instructions for UseSVE ==1 and vector length == 16 only. I am not sure how I can eliminate `UseSVE` here. >> >> When the vector length == 8 with SVE1, we generate the SVE `tbl` instruction (with single input). This is done for `T_INT` and `T_FLOAT` types so that we avoid generating the `mulv`/`addv` instructions for the Neon `tbl` instruction. > > But why would the Neon implementation fail if UseSVE ==1? Surely it would still work, and if it still works this comment is wrong. Yeah, I think this still work when UseSVE >=1. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2209144612 From xgong at openjdk.org Wed Jul 16 03:50:48 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 16 Jul 2025 03:50:48 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 01:23:43 GMT, Xiaohong Gong wrote: >> ### Background >> On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. 
`long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. >> >> For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. >> >> To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. >> >> ### Impact Analysis >> #### 1. Vector types >> Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. >> >> #### 2. Vector API >> No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. >> >> #### 3. Auto-vectorization >> Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. >> >> #### 4. Codegen of vector nodes >> NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. >> >> Details: >> - Lanewise vector operations are unaffected as explained above. >> - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). >> - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, addin... > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Disable auto-vectorization of double to short conversion for NEON and update tests Ping again! Hi, may I have another approval? Thanks in advance! Hi @theRealAph , would you mind taking another looking at the latest commit please? Thanks a lot! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3076631092 From bchristi at openjdk.org Wed Jul 16 03:55:40 2025 From: bchristi at openjdk.org (Brent Christian) Date: Wed, 16 Jul 2025 03:55:40 GMT Subject: [jdk25] RFR: Merge 121f5a72e4c23919b3a3b474cc3f1ac29ec611af Message-ID: This brings in cpu25_07 changes. ------------- Commit messages: - 8360147: Better Glyph drawing redux - 8355884: [macos] java/awt/Frame/I18NTitle.java fails on MacOS - 8350991: Improve HTTP client header handling - 8349594: Enhance TLS protocol support - 8349584: Improve compiler processing - 8349111: Enhance Swing supports - 8348989: Better Glyph drawing - 8349551: Failures in tests after JDK-8345625 - 8345625: Better HTTP connections The merge commit only contains trivial merges, so no merge-specific webrevs have been generated. 
Changes: https://git.openjdk.org/jdk/pull/26340/files Stats: 358 lines in 21 files changed: 275 ins; 24 del; 59 mod Patch: https://git.openjdk.org/jdk/pull/26340.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26340/head:pull/26340 PR: https://git.openjdk.org/jdk/pull/26340 From jpai at openjdk.org Wed Jul 16 03:55:41 2025 From: jpai at openjdk.org (Jaikiran Pai) Date: Wed, 16 Jul 2025 03:55:41 GMT Subject: [jdk25] RFR: Merge 121f5a72e4c23919b3a3b474cc3f1ac29ec611af In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 03:40:43 GMT, Brent Christian wrote: > This brings in cpu25_07 changes. Marked as reviewed by jpai (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26340#pullrequestreview-3023001849 From bchristi at openjdk.org Wed Jul 16 04:01:42 2025 From: bchristi at openjdk.org (Brent Christian) Date: Wed, 16 Jul 2025 04:01:42 GMT Subject: [jdk25] RFR: Merge 121f5a72e4c23919b3a3b474cc3f1ac29ec611af [v2] In-Reply-To: References: Message-ID: > This brings in cpu25_07 changes. Brent Christian has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26340/files - new: https://git.openjdk.org/jdk/pull/26340/files/121f5a72..121f5a72 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26340&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26340&range=00-01 Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26340.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26340/head:pull/26340 PR: https://git.openjdk.org/jdk/pull/26340 From bchristi at openjdk.org Wed Jul 16 04:01:43 2025 From: bchristi at openjdk.org (Brent Christian) Date: Wed, 16 Jul 2025 04:01:43 GMT Subject: [jdk25] Integrated: Merge 121f5a72e4c23919b3a3b474cc3f1ac29ec611af In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 03:40:43 GMT, Brent Christian wrote: > This brings in cpu25_07 changes. This pull request has now been integrated. Changeset: 0e6bf005 Author: Brent Christian URL: https://git.openjdk.org/jdk/commit/0e6bf0055057fae844748a300551549553f59f03 Stats: 358 lines in 21 files changed: 275 ins; 24 del; 59 mod Merge Reviewed-by: jpai ------------- PR: https://git.openjdk.org/jdk/pull/26340 From fyang at openjdk.org Wed Jul 16 04:15:44 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 16 Jul 2025 04:15:44 GMT Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> Message-ID: On Tue, 15 Jul 2025 10:28:56 GMT, Dingli Zhang wrote: >> Follow up [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419), RVV supports all vector type conversion APIs in the Vector API. >> So we only need to relax the length limit of the short type to achieve a significant improvement in JMH performance for converting between short and wider types. 
>> >> ### Test >> qemu-system UseRVV: >> * [x] Run jdk_vector (fastdebug) >> * [x] Run compiler/vectorapi (fastdebug) >> >> ### Performance >> Following shows the performance improvement of relative VectorAPI JMHs on k1 (256-bit RVV): >> >> >> Benchmark (SIZE) Mode Units Before After Gain >> VectorFPtoIntCastOperations.microDouble128ToShort128 512 thrpt ops/ms 52.280 840.112 16.07 >> VectorFPtoIntCastOperations.microDouble128ToShort128 1024 thrpt ops/ms 28.156 429.322 15.25 >> VectorFPtoIntCastOperations.microFloat64ToShort64 512 thrpt ops/ms 14.242 479.509 33.67 >> VectorFPtoIntCastOperations.microFloat64ToShort64 1024 thrpt ops/ms 6.906 242.690 35.14 >> >> PS: `VectorFPtoIntCastOperations.microFloat64ToShort64` is added by [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419). > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Update comments Marked as reviewed by fyang (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26239#pullrequestreview-3023024830 From dzhang at openjdk.org Wed Jul 16 04:27:40 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 16 Jul 2025 04:27:40 GMT Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> Message-ID: On Tue, 15 Jul 2025 10:28:56 GMT, Dingli Zhang wrote: >> Follow up [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419), RVV supports all vector type conversion APIs in the Vector API. >> So we only need to relax the length limit of the short type to achieve a significant improvement in JMH performance for converting between short and wider types. >> >> ### Test >> qemu-system UseRVV: >> * [x] Run jdk_vector (fastdebug) >> * [x] Run compiler/vectorapi (fastdebug) >> >> ### Performance >> Following shows the performance improvement of relative VectorAPI JMHs on k1 (256-bit RVV): >> >> >> Benchmark (SIZE) Mode Units Before After Gain >> VectorFPtoIntCastOperations.microDouble128ToShort128 512 thrpt ops/ms 52.280 840.112 16.07 >> VectorFPtoIntCastOperations.microDouble128ToShort128 1024 thrpt ops/ms 28.156 429.322 15.25 >> VectorFPtoIntCastOperations.microFloat64ToShort64 512 thrpt ops/ms 14.242 479.509 33.67 >> VectorFPtoIntCastOperations.microFloat64ToShort64 1024 thrpt ops/ms 6.906 242.690 35.14 >> >> PS: `VectorFPtoIntCastOperations.microFloat64ToShort64` is added by [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419). > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Update comments Thanks all for the review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26239#issuecomment-3076679007 From duke at openjdk.org Wed Jul 16 04:27:40 2025 From: duke at openjdk.org (duke) Date: Wed, 16 Jul 2025 04:27:40 GMT Subject: RFR: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> Message-ID: On Tue, 15 Jul 2025 10:28:56 GMT, Dingli Zhang wrote: >> Follow up [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419), RVV supports all vector type conversion APIs in the Vector API. >> So we only need to relax the length limit of the short type to achieve a significant improvement in JMH performance for converting between short and wider types. 
>> >> ### Test >> qemu-system UseRVV: >> * [x] Run jdk_vector (fastdebug) >> * [x] Run compiler/vectorapi (fastdebug) >> >> ### Performance >> Following shows the performance improvement of relative VectorAPI JMHs on k1 (256-bit RVV): >> >> >> Benchmark (SIZE) Mode Units Before After Gain >> VectorFPtoIntCastOperations.microDouble128ToShort128 512 thrpt ops/ms 52.280 840.112 16.07 >> VectorFPtoIntCastOperations.microDouble128ToShort128 1024 thrpt ops/ms 28.156 429.322 15.25 >> VectorFPtoIntCastOperations.microFloat64ToShort64 512 thrpt ops/ms 14.242 479.509 33.67 >> VectorFPtoIntCastOperations.microFloat64ToShort64 1024 thrpt ops/ms 6.906 242.690 35.14 >> >> PS: `VectorFPtoIntCastOperations.microFloat64ToShort64` is added by [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419). > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Update comments @DingliZhang Your change (at version 7120525bbbe39467fdf57eac5a2de7e8eb92d072) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26239#issuecomment-3076681413 From thartmann at openjdk.org Wed Jul 16 05:35:39 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 16 Jul 2025 05:35:39 GMT Subject: RFR: 8362171: C2 fails with unexpected node in SuperWord truncation: ModI In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 21:52:23 GMT, Jasmine Karthikeyan wrote: > Hi all, > This is a small fix for an assert failure in SuperWord truncation with ModI nodes. The failure itself is harmless and shouldn't lead to any miscompilations in product mode. I've added `ModI` to the assert switch and adapted the test in the bug report. Let me know what you think! Thanks for quickly jumping on this! Looks good to me. I submitted testing and will report back once it passed. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26334#pullrequestreview-3023168478 From chagedorn at openjdk.org Wed Jul 16 05:38:42 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 16 Jul 2025 05:38:42 GMT Subject: RFR: 8362171: C2 fails with unexpected node in SuperWord truncation: ModI In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 21:52:23 GMT, Jasmine Karthikeyan wrote: > Hi all, > This is a small fix for an assert failure in SuperWord truncation with ModI nodes. The failure itself is harmless and shouldn't lead to any miscompilations in product mode. I've added `ModI` to the assert switch and adapted the test in the bug report. Let me know what you think! Looks good to me, too. Thanks for prioritizing this to get it in quickly. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26334#pullrequestreview-3023177481 From dzhang at openjdk.org Wed Jul 16 05:38:46 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 16 Jul 2025 05:38:46 GMT Subject: Integrated: 8361836: RISC-V: Relax min vector length to 32-bit for short vectors In-Reply-To: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> References: <6lMSTx2KYyTBXKfcdzKwe9Q0NhY_oFze7kiTs62ouEs=.34e01dff-3e96-4f17-91ab-4a60451e7497@github.com> Message-ID: On Thu, 10 Jul 2025 09:17:20 GMT, Dingli Zhang wrote: > Follow up [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419), RVV supports all vector type conversion APIs in the Vector API. 
> So we only need to relax the length limit of the short type to achieve a significant improvement in JMH performance for converting between short and wider types. > > ### Test > qemu-system UseRVV: > * [x] Run jdk_vector (fastdebug) > * [x] Run compiler/vectorapi (fastdebug) > > ### Performance > Following shows the performance improvement of relative VectorAPI JMHs on k1 (256-bit RVV): > > > Benchmark (SIZE) Mode Units Before After Gain > VectorFPtoIntCastOperations.microDouble128ToShort128 512 thrpt ops/ms 52.280 840.112 16.07 > VectorFPtoIntCastOperations.microDouble128ToShort128 1024 thrpt ops/ms 28.156 429.322 15.25 > VectorFPtoIntCastOperations.microFloat64ToShort64 512 thrpt ops/ms 14.242 479.509 33.67 > VectorFPtoIntCastOperations.microFloat64ToShort64 1024 thrpt ops/ms 6.906 242.690 35.14 > > PS: `VectorFPtoIntCastOperations.microFloat64ToShort64` is added by [JDK-8359419](https://bugs.openjdk.org/browse/JDK-8359419). This pull request has now been integrated. Changeset: bdd37b0e Author: Dingli Zhang Committer: SendaoYan URL: https://git.openjdk.org/jdk/commit/bdd37b0e5eaa984e2ad2e9010af37dcd612cc05e Stats: 39 lines in 2 files changed: 29 ins; 0 del; 10 mod 8361836: RISC-V: Relax min vector length to 32-bit for short vectors Reviewed-by: fyang, fjiang ------------- PR: https://git.openjdk.org/jdk/pull/26239 From chagedorn at openjdk.org Wed Jul 16 05:42:41 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 16 Jul 2025 05:42:41 GMT Subject: RFR: 8358641: C1 option -XX:+TimeEachLinearScan is broken [v2] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 06:38:34 GMT, Saranya Natarajan wrote: >> **Issue** >> Using the command` java -Xcomp -XX:TieredStopAtLevel=1 -XX:+TimeEachLinearScan` results in an assert failure in line `assert(_cached_blocks.length() == ir()->linear_scan_order()->length()) failed: invalid cached block list`. >> >> **Suggestion** >> Removal of flag as this is a very old issue >> >> **Fix** >> Removed the flag by removing relevant methods and code while ensuring the removal does not affect other flags. > > Saranya Natarajan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - addressing review comments > - merge master > Merge branch 'master' of https://github.com/sarannat/jdk into JDK-8358641 > - Initial Fix Marked as reviewed by chagedorn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/25933#pullrequestreview-3023190834 From chagedorn at openjdk.org Wed Jul 16 05:43:40 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 16 Jul 2025 05:43:40 GMT Subject: RFR: 8353276: C2: simplify PhaseMacroExpand::opt_bits_test [v2] In-Reply-To: References: Message-ID: <2ZSJgkwLKqF48-n6ut1nmMANmQzNriWDgyp0WsDHxjQ=.29bcb649-8272-423c-a160-27c14fb9c0ed@github.com> On Mon, 14 Jul 2025 18:50:23 GMT, Saranya Natarajan wrote: >> **Issue** >> The last three parameters of `PhaseMacroExpand::opt_bits_test(Node* ctrl, Node* region, int edge, Node* word, int mask, int bits, bool return_fast_path)` are unnecessary after the fix introduced in [JDK-8256425](https://bugs.openjdk.org/browse/JDK-8256425) >> >> **Fix** >> The proposed fix removes the last three parameters and makes the necessary modification to the methods. 
>> >> **Testing** >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > addressing review comments Marked as reviewed by chagedorn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26276#pullrequestreview-3023193333 From chagedorn at openjdk.org Wed Jul 16 05:43:41 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 16 Jul 2025 05:43:41 GMT Subject: RFR: 8353276: C2: simplify PhaseMacroExpand::opt_bits_test [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 18:46:30 GMT, Saranya Natarajan wrote: >> src/hotspot/share/opto/macro.cpp line 98: >> >>> 96: Node* PhaseMacroExpand::opt_bits_test(Node* ctrl, Node* region, int edge, Node* word) { >>> 97: Node* cmp; >>> 98: cmp = word; >> >> Could now be merged (I cannot make a direct suggestion due to deleted lines): >> >> Node* cmp = word; > > Thank you. I have addressed this in the new commit Thanks for the update! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26276#discussion_r2209284555 From xgong at openjdk.org Wed Jul 16 05:56:40 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 16 Jul 2025 05:56:40 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: Message-ID: <38f2bvFqiVNQGGpMif0iflVFD8wXnyw4SwtKxwi_Dmo=.276fb2fb-b80c-4ea7-a32f-c326294f442a@github.com> On Mon, 14 Jul 2025 10:10:47 GMT, Xiaohong Gong wrote: >> Hi @Bhavana-Kilambi, @fg1417, could you please help take a look at this PR? BTW, since the vector register size of my SVE machine is 128-bit, could you please help test the correctness on a SVE machine with larger vector size (e.g. 512-bit vector size)? Thanks a lot in advance! > >> Hi @XiaohongGong , thank you for doing this. As for testing, we can currently only test on 256-bit SVE machines (we no longer have any 512bit machines). We will get back to you with the results soon. > > Testing on 256-bit SVE machines are fine to me. Thanks so much for your help! > @XiaohongGong thanks for your work! Tier1 - tier3 passed on `256-bit sve` machine without new failures. Good! Thanks so much for your help! 
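For context, a minimal, self-contained sketch of the Java-level operation this PR intrinsifies -- a gather load of subword (`short`) elements through an index map. The species, array names and index map below are assumptions made purely for illustration; they are not taken from the patch or its tests.

```java
// Requires: --add-modules jdk.incubator.vector (and -ea if the assert is kept)
import jdk.incubator.vector.ShortVector;
import jdk.incubator.vector.VectorSpecies;

public class ShortGatherSketch {
    // Assumed species: on a 128-bit NEON/SVE machine this holds 8 short lanes.
    static final VectorSpecies<Short> S = ShortVector.SPECIES_128;

    // Gathers src[map[mapBase + i]] into lane i, then stores the lanes contiguously.
    static void gather(short[] src, int[] map, int mapBase, short[] dst, int dstOff) {
        // fromArray with an index map is the gather-load form of the Vector API;
        // this is the shape of call the subword gather-load support is aimed at.
        ShortVector v = ShortVector.fromArray(S, src, 0, map, mapBase);
        v.intoArray(dst, dstOff);
    }

    public static void main(String[] args) {
        short[] src = new short[64];
        for (int i = 0; i < src.length; i++) src[i] = (short) i;
        int[] map = {3, 1, 4, 1, 5, 9, 2, 6};   // one index per lane
        short[] dst = new short[S.length()];
        gather(src, map, 0, dst, 0);
        assert dst[0] == 3 && dst[5] == 9;
    }
}
```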
------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3076927184 From jbhateja at openjdk.org Wed Jul 16 06:33:47 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 16 Jul 2025 06:33:47 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v14] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 16:04:47 GMT, Jatin Bhateja wrote: > > With the latest version, I see this failure: > > ``` > > java/lang/CompressExpandTest.java > > -Xcomp -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:+TieredCompilation -XX:+DeoptimizeALot > > > > 870 Phi === 820 227 872 [[ 339 339 120 282 334 224 338 338 119 119 337 337 336 336 340 120 335 335 432 275 283 224 340 334 ]] #int !orig=[98] !jvms: Assert::assertEquals @ bci:1 (line 797) Assert::assertEquals @ bci:3 (line 807) AbstractCompressExpandTest::assertContiguousMask @ bci:13 (line 356) AbstractCompressExpandTest::testContiguousMasksInt @ bci:48 (line 251) > > 917 LoadI === 916 218 219 [[ 336 336 331 119 119 339 339 432 275 337 337 338 338 224 224 340 340 120 120 342 330 118 117 116 115 341 343 114 113 112 111 344 121 110 333 332 335 335 282 439 109 329 328 334 334 222 327 122 254 369 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=7; #int (does not depend only on test, unknown control) !orig=[86] !jvms: AbstractCompressExpandTest::testContiguousMasksInt @ bci:39 (line 250) > > 282 ExpandBits === _ 917 870 [[ 120 120 340 340 224 224 338 338 736 338 339 339 380 ]] #int !jvms: CompressExpandTest::actualExpand @ bci:2 (line 40) AbstractCompressExpandTest::assertContiguousMask @ bci:17 (line 265) AbstractCompressExpandTest::testContiguousMasksInt @ bci:92 (line 256) > > told = bool > > tnew = int:1..2 > > # > > # A fatal error has been detected by the Java Runtime Environment: > > # > > # Internal Error (/workspace/open/src/hotspot/share/opto/phaseX.cpp:2731), pid=6038, tid=6057 > > # fatal error: Not monotonic > > # > > # JRE version: Java(TM) SE Runtime Environment (26.0) (fastdebug build 26-internal-2025-07-15-0649374.tobias.hartmann.jdk4) > > # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 26-internal-2025-07-15-0649374.tobias.hartmann.jdk4, compiled mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64) > > # Problematic frame: > > # V [libjvm.so+0x1828409] PhaseCCP::verify_type(Node*, Type const*, Type const*)+0x169 > > #26314 > > ``` > > > > > > > > > > > > > > > > > > > > > > > > Also happens with a few other configurations/flags. > > Thanks @TobiHartmann , looking at this. Hi @TobiHartmann , I have been able to reproduce this issue with a smaller test, in my last commit, incorrectly constrained the lower bound of the result for unknown mask, due to the OCA limitation will create a new PR with an updated patch. public class expand_bits { public static long micro(long src, long mask) { long res = 0; mask = Math.max(0, Math.min(1, mask)); // mask = {lo:0, hi:1} for (int i = 0; i < 5; i++) { if (i == 4) { mask = 3; } else if (i == 2) { mask = 7; } // meet(3, 7) = (lo:3, hi:7) res += Long.expand(src, mask); } return res; } public static void main(String [] args) { long res = 0; for (int i = 0; i < 100000; i++) { res += micro(i, i + 10); } System.out.println("[res] " + res); } } With java -Xcomp -Xbatch -XX:-TieredCompilation -XX:CompileOnly=expand_bits::micro -XX:+TracePhaseCCP -cp . 
expand_bits 158 Phi === 152 120 107 [[ 151 157 161 ]] #long:3..7 !orig=[117] !jvms: expand_bits::micro @ bci:38 (line 11) 10 Parm === 3 [[ 157 121 78 67 45 56 151 ]] Parm0: long !jvms: expand_bits::micro @ bci:-1 (line 5) 157 ExpandBits === _ 10 158 [[ 156 ]] #long:3..7:www !orig=121 !jvms: expand_bits::micro @ bci:49 (line 16) told = long:0..7:www tnew = long:3..7:www # # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/mnt/c/GitHub/jdk/src/hotspot/share/opto/phaseX.cpp:1806), pid=97823, tid=97837 # fatal error: Not monotonic ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3077054087 From xgong at openjdk.org Wed Jul 16 06:46:45 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 16 Jul 2025 06:46:45 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: <38f2bvFqiVNQGGpMif0iflVFD8wXnyw4SwtKxwi_Dmo=.276fb2fb-b80c-4ea7-a32f-c326294f442a@github.com> References: <38f2bvFqiVNQGGpMif0iflVFD8wXnyw4SwtKxwi_Dmo=.276fb2fb-b80c-4ea7-a32f-c326294f442a@github.com> Message-ID: On Wed, 16 Jul 2025 05:54:13 GMT, Xiaohong Gong wrote: >>> Hi @XiaohongGong , thank you for doing this. As for testing, we can currently only test on 256-bit SVE machines (we no longer have any 512bit machines). We will get back to you with the results soon. >> >> Testing on 256-bit SVE machines are fine to me. Thanks so much for your help! > >> @XiaohongGong thanks for your work! Tier1 - tier3 passed on `256-bit sve` machine without new failures. > > Good! Thanks so much for your help! > @XiaohongGong Please correct me if I?m missing something or got anything wrong. > > Taking `short` on `512-bit` machine as an example, these instructions would be generated: > > ``` > // vgather > sve_dup vtmp, 0 > sve_load_0 => [0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a] > sve_uzp1 with vtmp => [00 00 00 00 00 00 00 00 aa aa aa aa aa aa aa aa] > > // vgather1 > sve_dup vtmp, 0 > sve_load_1 => [0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b] > sve_uzp1 with vtmp => [00 00 00 00 00 00 00 00 bb bb bb bb bb bb bb bb] > > // Slice vgather1, vgather1 > ext => [bb bb bb bb bb bb bb bb 00 00 00 00 00 00 00 00] > > // Or vgather, vslice > sve_orr => [bb bb bb bb bb bb bb bb aa aa aa aa aa aa aa aa] > ``` > > Actually, we can get the target result directly by `uzp1` the output from `sve_load_0` and `sve_load_1`, like > > ``` > [0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a] > [0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b] > uzp1 => > [bb bb bb bb bb bb bb bb aa aa aa aa aa aa aa aa] > ``` > > If so, the current design of `LoadVectorGather` may not be sufficiently low-level to suit `AArch64`. WDYT? Yes, you are right! This can work for truncating and merging two gather load results. But we have to consider other scenarios together: 1) No merging 2) Need 4 times of gather-loads and merging. Additionally, we have to make `LoadVectorGatherNode` common sense for all scenarios and different architectures. To make the IR itself simple and unify the inputs for all types on kinds of architectures, I choose to pass one `index` to it now, and define that one `LoadVectorGatherNode` just finish one time of gather-load with the `index`. The element type of the result should be the subword type. So a followed type truncating is needed anyway. I think this makes sense for a single gather-load operation for subword types, right? 
For cases that need more than one gather, I choose to generate multiple `LoadVectorGatherNode`s and do the merging at the end. And I agree this may make the code less efficient than implementing all the different scenarios with a single `LoadVectorGatherNode`. Writing backend assemblers for all scenarios can be more efficient, but it makes the backend implementation more complex. In addition to the four normal gather cases, we have to consider the corresponding masked versions and the partial cases. BTW, the number of `index` vectors passed to `LoadVectorGatherNode` would differ (e.g. 1, 2, 4), which makes the IR itself harder to maintain. Regarding the refinement based on your suggestion: - case-1: no merging - It's not an issue (the current version is fine) - case-2: 2 gathers and a merge - Can be refined, but `LoadVectorGatherNode` would have to be changed to accept 2 `index` vectors. - case-3: 4 gathers and a merge (only for byte) - Can be refined. We can implement it like this: step-1: `v1 = gather1 + gather2 + 2 * uzp1` // merge the first and second gather-loads step-2: `v2 = gather3 + gather4 + 2 * uzp1` // merge the third and fourth gather-loads step-3: `v3 = slice(v2, v2)`, `v = or(v1, v3)` // do the final merging We would have to change `LoadVectorGatherNode` as well, at least making it accept 2 `index` vectors. In summary, `LoadVectorGatherNode` will be more complex than before, but the good thing is that giving it one more `index` input is OK. I'm not sure whether this is applicable to other architectures such as RVV, but I can try this change. Do you have a better idea? Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3077111123 From thartmann at openjdk.org Wed Jul 16 06:47:49 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 16 Jul 2025 06:47:49 GMT Subject: RFR: 8353276: C2: simplify PhaseMacroExpand::opt_bits_test [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 18:50:23 GMT, Saranya Natarajan wrote: >> **Issue** >> The last three parameters of `PhaseMacroExpand::opt_bits_test(Node* ctrl, Node* region, int edge, Node* word, int mask, int bits, bool return_fast_path)` are unnecessary after the fix introduced in [JDK-8256425](https://bugs.openjdk.org/browse/JDK-8256425) >> >> **Fix** >> The proposed fix removes the last three parameters and makes the necessary modification to the methods. >> >> **Testing** >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > addressing review comments Looks good to me too. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26276#pullrequestreview-3023394267 From thartmann at openjdk.org Wed Jul 16 06:48:48 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 16 Jul 2025 06:48:48 GMT Subject: RFR: 8358641: C1 option -XX:+TimeEachLinearScan is broken [v2] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 06:38:34 GMT, Saranya Natarajan wrote: >> **Issue** >> Using the command` java -Xcomp -XX:TieredStopAtLevel=1 -XX:+TimeEachLinearScan` results in an assert failure in line `assert(_cached_blocks.length() == ir()->linear_scan_order()->length()) failed: invalid cached block list`.
>> >> **Suggestion** >> Removal of flag as this is a very old issue >> >> **Fix** >> Removed the flag by removing relevant methods and code while ensuring the removal does not affect other flags. > > Saranya Natarajan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - addressing review comments > - merge master > Merge branch 'master' of https://github.com/sarannat/jdk into JDK-8358641 > - Initial Fix Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25933#pullrequestreview-3023397314 From thartmann at openjdk.org Wed Jul 16 06:51:45 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 16 Jul 2025 06:51:45 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v14] In-Reply-To: References: Message-ID: <10jWxhtjQENzTBjlNDFKhHQMN-ioETq3P6_qmVTq3bo=.0124e215-5c09-44c3-8dcb-cd692789907a@github.com> On Mon, 14 Jul 2025 13:48:07 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. >> >> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. >> >> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Broken assertions fix Thanks @jatin-bhateja. Isn't the OCA signature status verification independent of the PR? Let me ping a few people here to get it done. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3077126992 From dlong at openjdk.org Wed Jul 16 07:01:50 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 16 Jul 2025 07:01:50 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v7] In-Reply-To: References: <72OW9wHbET022fBnWx1Wdxb_J9pbH2sLiAqlC9fGb-c=.6930c0b1-33bb-4c49-af02-11e2c79dbaf2@github.com> Message-ID: On Tue, 15 Jul 2025 23:57:41 GMT, Chad Rakoczy wrote: > Speaking of which, seems like the NMethodState_lock is held for way too long I can't find who posted this and what lines it refers to. If it refers to nmethod::relocate, I don't think the lock is needed after 8358821, because nobody will be patching the relocations. > Otherwise the nmethod could be marked not entrant after we perform the is_in_use() check The source nmethod? I don't see how that would cause a problem for that small block of code. All it does to the source is call make_not_used(). 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23573#issuecomment-3077174999 From thartmann at openjdk.org Wed Jul 16 07:10:39 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 16 Jul 2025 07:10:39 GMT Subject: RFR: 8358573: CompileBroker fails with "expect failure reason" assert with -XX:-InstallMethods In-Reply-To: References: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> Message-ID: On Tue, 15 Jul 2025 22:20:48 GMT, Dean Long wrote: > Is this flag really supported? I can't find any tests for it. I wonder if anyone would miss it if we removed it. Right, I think in general it might have some values for testing when we don't want to pollute the code cache (similar to what we do for `RepeatCompilation`) but then again it should also be done for C2. Let's just remove it for now. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26310#issuecomment-3077237333 From snatarajan at openjdk.org Wed Jul 16 07:42:43 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Wed, 16 Jul 2025 07:42:43 GMT Subject: RFR: 8353276: C2: simplify PhaseMacroExpand::opt_bits_test [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 18:50:23 GMT, Saranya Natarajan wrote: >> **Issue** >> The last three parameters of `PhaseMacroExpand::opt_bits_test(Node* ctrl, Node* region, int edge, Node* word, int mask, int bits, bool return_fast_path)` are unnecessary after the fix introduced in [JDK-8256425](https://bugs.openjdk.org/browse/JDK-8256425) >> >> **Fix** >> The proposed fix removes the last three parameters and makes the necessary modification to the methods. >> >> **Testing** >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > addressing review comments Thank you for the reviews. Please sponsor. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26276#issuecomment-3077367957 From duke at openjdk.org Wed Jul 16 07:42:43 2025 From: duke at openjdk.org (duke) Date: Wed, 16 Jul 2025 07:42:43 GMT Subject: RFR: 8353276: C2: simplify PhaseMacroExpand::opt_bits_test [v2] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 18:50:23 GMT, Saranya Natarajan wrote: >> **Issue** >> The last three parameters of `PhaseMacroExpand::opt_bits_test(Node* ctrl, Node* region, int edge, Node* word, int mask, int bits, bool return_fast_path)` are unnecessary after the fix introduced in [JDK-8256425](https://bugs.openjdk.org/browse/JDK-8256425) >> >> **Fix** >> The proposed fix removes the last three parameters and makes the necessary modification to the methods. >> >> **Testing** >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > addressing review comments @sarannat Your change (at version 1b6be049d4455da3e9102cb13033f78ce1b35dd8) is now ready to be sponsored by a Committer. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26276#issuecomment-3077372857 From bkilambi at openjdk.org Wed Jul 16 07:43:41 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 16 Jul 2025 07:43:41 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v2] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 01:24:08 GMT, Xiaohong Gong wrote: >> src/hotspot/cpu/aarch64/aarch64_vector.ad line 5990: >> >>> 5988: %} >>> 5989: >>> 5990: instruct vmaskwiden_hi_sve(pReg dst, pReg src) %{ >> >> can both the hi and lo widen rules be combined into a single one as the arguments are the same? or would it make it less understandable? > > The main problem is that we cannot get the flag of `__is_lo` easily from the relative machnode as far as I know. Agreed. I remember I had the same problem with `requires_strict_order` field in ReductionNodes. Thanks. >> src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 352: >> >>> 350: // SVE requires vector indices for gather-load/scatter-store operations >>> 351: // on all data types. >>> 352: bool Matcher::gather_scatter_needs_vector_index(BasicType bt) { >> >> There's already a function that tests for `UseSVE > 0` here - https://github.com/openjdk/jdk/blob/bcd86d575fe0682a234228c18b0c2e817d3816da/src/hotspot/cpu/aarch64/matcher_aarch64.hpp#L36 >> >> Can it be reused? > > Do you mean directly using `supports_scalable_vector` instead of the new added method in mid-end? I'm afraid we cannot use it. Because on X86, the indexes for subword types are passed with address of the index array, while it's a vector for other types even on AVX-512. > > But yes, we can call `supports_scalable_vector()` in the new added method for AArch64. Got it, thanks! I missed the point that this was added in the mid-end. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2209554006 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2209551617 From snatarajan at openjdk.org Wed Jul 16 07:44:51 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Wed, 16 Jul 2025 07:44:51 GMT Subject: RFR: 8342941: IGV: Add various new graph dumps during loop opts [v5] In-Reply-To: <2O1F6Lj8vy0qWs_qHqmXFPkwbuOqHx1NheZsroEYKbc=.bafb3eba-164b-4c13-8c27-346d44d43486@github.com> References: <2O1F6Lj8vy0qWs_qHqmXFPkwbuOqHx1NheZsroEYKbc=.bafb3eba-164b-4c13-8c27-346d44d43486@github.com> Message-ID: On Tue, 15 Jul 2025 12:09:31 GMT, Saranya Natarajan wrote: >> This changeset adds BEFORE/AFTER graph dumps for creating a post loop (`insert_post_loop()`), removing an empty loop (`do_remove_empty_loop()`), and removing a one iteration loop (`do_one_iteration_loop()`). >> >> Changes: >> - Added `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` for dumping graphs before and after `insert_post_loop()`. >> - Added `BEFORE_REMOVE_EMPTY_LOOP` and `AFTER_REMOVE_EMPTY_LOOP` for dumping graphs before and after `do_remove_empty_loop()`. >> - Added `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` for dumping graphs before and after `do_one_iteration_loop()`. >> >> Below are sample screenshots (IGV print level 4 ) mainly showing the new phase . >> 1. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` >> ![image](https://github.com/user-attachments/assets/1661cede-5d70-4e0d-abec-3d091c7675c8) >> 2. 
`BEFORE_POST_LOOP` and `AFTER_POST_LOOP` with SuperWordLoopUnrollAnalysis enabled >> ![image](https://github.com/user-attachments/assets/6a22e6f0-4e6c-4e9d-8b6b-2bf75fac783d) >> 3.` BEFORE_REMOVE_EMPTY_LOOP `and `AFTER_REMOVE_EMPTY_LOOP` >> ![image](https://github.com/user-attachments/assets/3281f00b-575e-4604-83dd-831037d8dd47) >> 4. `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` >> ![image](https://github.com/user-attachments/assets/efddbc9a-64f7-403d-acfe-330d75a00911) >> >> Question to reviewers: >> Are the new compiler phases OK, or should we change anything? >> >> Testing: >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. >> Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > modifying one iteration loop to one-iteration Thank you for the reviews. Please sponsor. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25756#issuecomment-3077372639 From duke at openjdk.org Wed Jul 16 07:44:51 2025 From: duke at openjdk.org (duke) Date: Wed, 16 Jul 2025 07:44:51 GMT Subject: RFR: 8342941: IGV: Add various new graph dumps during loop opts [v5] In-Reply-To: <2O1F6Lj8vy0qWs_qHqmXFPkwbuOqHx1NheZsroEYKbc=.bafb3eba-164b-4c13-8c27-346d44d43486@github.com> References: <2O1F6Lj8vy0qWs_qHqmXFPkwbuOqHx1NheZsroEYKbc=.bafb3eba-164b-4c13-8c27-346d44d43486@github.com> Message-ID: On Tue, 15 Jul 2025 12:09:31 GMT, Saranya Natarajan wrote: >> This changeset adds BEFORE/AFTER graph dumps for creating a post loop (`insert_post_loop()`), removing an empty loop (`do_remove_empty_loop()`), and removing a one iteration loop (`do_one_iteration_loop()`). >> >> Changes: >> - Added `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` for dumping graphs before and after `insert_post_loop()`. >> - Added `BEFORE_REMOVE_EMPTY_LOOP` and `AFTER_REMOVE_EMPTY_LOOP` for dumping graphs before and after `do_remove_empty_loop()`. >> - Added `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` for dumping graphs before and after `do_one_iteration_loop()`. >> >> Below are sample screenshots (IGV print level 4 ) mainly showing the new phase . >> 1. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` >> ![image](https://github.com/user-attachments/assets/1661cede-5d70-4e0d-abec-3d091c7675c8) >> 2. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` with SuperWordLoopUnrollAnalysis enabled >> ![image](https://github.com/user-attachments/assets/6a22e6f0-4e6c-4e9d-8b6b-2bf75fac783d) >> 3.` BEFORE_REMOVE_EMPTY_LOOP `and `AFTER_REMOVE_EMPTY_LOOP` >> ![image](https://github.com/user-attachments/assets/3281f00b-575e-4604-83dd-831037d8dd47) >> 4. `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` >> ![image](https://github.com/user-attachments/assets/efddbc9a-64f7-403d-acfe-330d75a00911) >> >> Question to reviewers: >> Are the new compiler phases OK, or should we change anything? >> >> Testing: >> GitHub Actions >> tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. 
>> Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > modifying one iteration loop to one-iteration @sarannat Your change (at version 4f531f1a4f6581cfaeac0ad4ffc852cd47885b74) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25756#issuecomment-3077376975 From snatarajan at openjdk.org Wed Jul 16 07:47:47 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Wed, 16 Jul 2025 07:47:47 GMT Subject: Integrated: 8342941: IGV: Add various new graph dumps during loop opts In-Reply-To: References: Message-ID: On Wed, 11 Jun 2025 13:54:20 GMT, Saranya Natarajan wrote: > This changeset adds BEFORE/AFTER graph dumps for creating a post loop (`insert_post_loop()`), removing an empty loop (`do_remove_empty_loop()`), and removing a one iteration loop (`do_one_iteration_loop()`). > > Changes: > - Added `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` for dumping graphs before and after `insert_post_loop()`. > - Added `BEFORE_REMOVE_EMPTY_LOOP` and `AFTER_REMOVE_EMPTY_LOOP` for dumping graphs before and after `do_remove_empty_loop()`. > - Added `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` for dumping graphs before and after `do_one_iteration_loop()`. > > Below are sample screenshots (IGV print level 4 ) mainly showing the new phase . > 1. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` > ![image](https://github.com/user-attachments/assets/1661cede-5d70-4e0d-abec-3d091c7675c8) > 2. `BEFORE_POST_LOOP` and `AFTER_POST_LOOP` with SuperWordLoopUnrollAnalysis enabled > ![image](https://github.com/user-attachments/assets/6a22e6f0-4e6c-4e9d-8b6b-2bf75fac783d) > 3.` BEFORE_REMOVE_EMPTY_LOOP `and `AFTER_REMOVE_EMPTY_LOOP` > ![image](https://github.com/user-attachments/assets/3281f00b-575e-4604-83dd-831037d8dd47) > 4. `BEFORE_ONE_ITERATION_LOOP` and `AFTER_ONE_ITERATION_LOOP` > ![image](https://github.com/user-attachments/assets/efddbc9a-64f7-403d-acfe-330d75a00911) > > Question to reviewers: > Are the new compiler phases OK, or should we change anything? > > Testing: > GitHub Actions > tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. > Tested that thousands of graphs are correctly opened and visualized with IGV using the same test used in ([JDK-8317349](https://bugs.openjdk.org/browse/JDK-8317349)) This pull request has now been integrated. 
Changeset: 805f1dee Author: Saranya Natarajan Committer: Daniel Lund?n URL: https://git.openjdk.org/jdk/commit/805f1deebcf465ba10672a829f0a8c3e11716f9d Stats: 33 lines in 7 files changed: 26 ins; 0 del; 7 mod 8342941: IGV: Add various new graph dumps during loop opts Reviewed-by: chagedorn, dlunden ------------- PR: https://git.openjdk.org/jdk/pull/25756 From snatarajan at openjdk.org Wed Jul 16 07:51:47 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Wed, 16 Jul 2025 07:51:47 GMT Subject: Integrated: 8353276: C2: simplify PhaseMacroExpand::opt_bits_test In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 21:53:35 GMT, Saranya Natarajan wrote: > **Issue** > The last three parameters of `PhaseMacroExpand::opt_bits_test(Node* ctrl, Node* region, int edge, Node* word, int mask, int bits, bool return_fast_path)` are unnecessary after the fix introduced in [JDK-8256425](https://bugs.openjdk.org/browse/JDK-8256425) > > **Fix** > The proposed fix removes the last three parameters and makes the necessary modification to the methods. > > **Testing** > GitHub Actions > tier1 to tier5 on windows-x64, linux-x64, linux-aarch64, macosx-x64, and macosx-aarch64. This pull request has now been integrated. Changeset: 9f7dc19f Author: Saranya Natarajan Committer: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/9f7dc19ffded4608dd2c1ef1e4eacfa0d0a199ea Stats: 16 lines in 2 files changed: 0 ins; 11 del; 5 mod 8353276: C2: simplify PhaseMacroExpand::opt_bits_test Reviewed-by: chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/26276 From snatarajan at openjdk.org Wed Jul 16 08:00:50 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Wed, 16 Jul 2025 08:00:50 GMT Subject: RFR: 8358641: C1 option -XX:+TimeEachLinearScan is broken [v2] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 06:38:34 GMT, Saranya Natarajan wrote: >> **Issue** >> Using the command` java -Xcomp -XX:TieredStopAtLevel=1 -XX:+TimeEachLinearScan` results in an assert failure in line `assert(_cached_blocks.length() == ir()->linear_scan_order()->length()) failed: invalid cached block list`. >> >> **Suggestion** >> Removal of flag as this is a very old issue >> >> **Fix** >> Removed the flag by removing relevant methods and code while ensuring the removal does not affect other flags. > > Saranya Natarajan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - addressing review comments > - merge master > Merge branch 'master' of https://github.com/sarannat/jdk into JDK-8358641 > - Initial Fix Thank you for the reviews. Please sponsor. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25933#issuecomment-3077427714 From snatarajan at openjdk.org Wed Jul 16 08:00:51 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Wed, 16 Jul 2025 08:00:51 GMT Subject: Integrated: 8358641: C1 option -XX:+TimeEachLinearScan is broken In-Reply-To: References: Message-ID: On Mon, 23 Jun 2025 09:43:28 GMT, Saranya Natarajan wrote: > **Issue** > Using the command` java -Xcomp -XX:TieredStopAtLevel=1 -XX:+TimeEachLinearScan` results in an assert failure in line `assert(_cached_blocks.length() == ir()->linear_scan_order()->length()) failed: invalid cached block list`. 
> > **Suggestion** > Removal of flag as this is a very old issue > > **Fix** > Removed the flag by removing relevant methods and code while ensuring the removal does not affect other flags. This pull request has now been integrated. Changeset: 6b4a5ef1 Author: Saranya Natarajan Committer: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/6b4a5ef105ee548627a53e2b983eab7972e33669 Stats: 50 lines in 3 files changed: 0 ins; 49 del; 1 mod 8358641: C1 option -XX:+TimeEachLinearScan is broken Reviewed-by: chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/25933 From bkilambi at openjdk.org Wed Jul 16 08:38:49 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 16 Jul 2025 08:38:49 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: <8kuDtuUPOl5CsWzgmgN9V0X5hXmGUExY4rpOfAfn1ic=.b7023a27-6362-4a7f-ba77-05cc0b50e5e3@github.com> References: <8kuDtuUPOl5CsWzgmgN9V0X5hXmGUExY4rpOfAfn1ic=.b7023a27-6362-4a7f-ba77-05cc0b50e5e3@github.com> Message-ID: <7Az-yP8D4rH5uwzgkQMaFcK-t0cD8wMMlzjKIRXDvis=.9695b1bc-c9e0-490f-9371-41840cfc0a42@github.com> On Tue, 15 Jul 2025 12:45:51 GMT, Andrew Haley wrote: >> So the Neon implementation gets kicked in when SVE is not available (UseSVE == 0) whether the vector length is 8 or 16 but we emit Neon instructions for UseSVE ==1 and vector length == 16 only. I am not sure how I can eliminate `UseSVE` here. >> >> When the vector length == 8 with SVE1, we generate the SVE `tbl` instruction (with single input). This is done for `T_INT` and `T_FLOAT` types so that we avoid generating the `mulv`/`addv` instructions for the Neon `tbl` instruction. > > But why would the Neon implementation fail if UseSVE ==1? Surely it would still work, and if it still works this comment is wrong. @theRealAph Thanks for your comments. The Neon implementation would not fail if UseSVE == 1 (Does the comment imply something like this?). Only that we are making a choice of generating Neon instructions for UseSVE = 1 and vec_len = 16. The conditions that can reach this method are - UseSVE = 0, 1 with vec_len = 8 or 16 and UseSVE = 2 with any vec_len (based on the conditions in `Matcher::match_rule_supported_vector()`). We have already filtered out `UseSVE = 1` with `vec_len = 8` and `UseSVE = 2` at line #2904. So if the control reaches #2915 then it's either `UseSVE = 0` with any vec_len and `UseSVE = 1` with `vec_len = 16` and that's what the comment mentions. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2209676084 From mhaessig at openjdk.org Wed Jul 16 08:44:42 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Wed, 16 Jul 2025 08:44:42 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 12:42:34 GMT, Aleksey Shipilev wrote: > [...] we can test this version more broadly as well. tier1 - tier3 and 100 repeats of TestStressBailout.java on Linux x64 & aarch64, Windows x64, and Mac x64 & aarch64 all passed. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26294#issuecomment-3077573250 From fyang at openjdk.org Wed Jul 16 09:26:42 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 16 Jul 2025 09:26:42 GMT Subject: RFR: 8362284: RISC-V: cleanup NativeMovRegMem In-Reply-To: <7QlEqzQUzoDK6NycLx0HECjospeft1MwbOQh7aHVq8U=.efa2d92e-a024-4447-9565-8f6ee7ee4774@github.com> References: <7QlEqzQUzoDK6NycLx0HECjospeft1MwbOQh7aHVq8U=.efa2d92e-a024-4447-9565-8f6ee7ee4774@github.com> Message-ID: On Tue, 15 Jul 2025 18:41:56 GMT, Hamlin Li wrote: > Hi, > Can you help to review this simple patch? > > NativeMovRegMem on riscv is actually dead code, but still needed in case of compilation of C1. > So make the code as simple as possible to avoid any reading and maintainance effort. > > No tests, as `offset()` and `set_offset()` were Unimplemented and used in C1 and never triggered before. > > Thanks! Thanks for the cleanup. Looks fine modulo one minor comment. src/hotspot/cpu/riscv/nativeInst_riscv.hpp line 232: > 230: inline NativeMovRegMem* nativeMovRegMem_at(address addr) { > 231: Unimplemented(); > 232: return (NativeMovRegMem*)0; Maybe: `return (NativeMovRegMem*)nullptr;` ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26328#pullrequestreview-3023955944 PR Review Comment: https://git.openjdk.org/jdk/pull/26328#discussion_r2209781232 From adinn at openjdk.org Wed Jul 16 09:32:42 2025 From: adinn at openjdk.org (Andrew Dinn) Date: Wed, 16 Jul 2025 09:32:42 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: <8NL_uVAlbHrnK9t1Ec89Uk100mo0ADe-_ni9b7QXQss=.39f24034-b4d5-4b25-a2a7-d3930a50730f@github.com> Message-ID: On Tue, 15 Jul 2025 12:42:02 GMT, Andrew Haley wrote: > Is anyone trying to load AOT blobs at a fixed address? That's not really a good idea for security. However, even if we did we would still need AOT code to refer to external symbols that may be at different locations in the Assembly (AOT compile) VM and production VM. In particular the BMB varies depending on where the card table and heap base are located, both determined by runtime allocations. n.b. it is important to note that BMB is not the base of the table. It is a pre-offset variant on that base address, computed by adding (card table base - heap base) >> log(card table granularity). As you are no doubt aware, a pre-offset address constant avoids the need for the barrier to do anything but shift a heap pointer down and add it to BMB to compute the associated card address. That explains why ConP(BMB) is problematic. It is not -- as often stated -- because the value is not an address (offseting an address by a constant still gives an address). The problem is rather what value that address might have. Depending on where the two regions are placed it could actually be, say, ConP(0) or some other ConP that we want to treat specially (especially a ConP in the oop range). > If C2 can be persuaded to treat the BMB as a value to be propagated like any other value then all of this conversation effectively becomes a don't care. Except for legacy architectures with insufficient registers, of course... In principle that can already happen when the BMB is inserted into a C2 graph as a ConP -- although I have never seen it actually occurring. 
However, we have recently migrated barrier insertion to the back end for G1, Shenanadoah and ZGC, So, there is no opportunity for the compiler to perform constant elision in those three cases. The only GCs for which a ConP(BMB) is currently inserted into the graph are (generational) Serial/Parallel. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3077742345 From roland at openjdk.org Wed Jul 16 09:38:18 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 16 Jul 2025 09:38:18 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v36] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 10:35:21 GMT, Christian Hagedorn wrote: > I gave your latest patch another spin in our testing. It's still running but it already found some issues: Thanks! All failures should be fixed now. I added a test case for that one: > ``` > # > # A fatal error has been detected by the Java Runtime Environment: > # > # Internal Error (/opt/mach5/mesos/work_dir/slaves/d2398cde-9325-49c3-b030-8961a4f0a253-S650407/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/05605dc0-bf5e-434a-82b5-65af69c62ec6/runs/591d89b1-11c0-415e-b2ce-4c0a13ce80f8/workspace/open/src/hotspot/share/opto/vectorization.cpp:141), pid=704535, tid=704555 > # assert(_cl->is_multiversion_fast_loop() == (_multiversioning_fast_proj != nullptr)) failed: must find the multiversion selector IFF loop is a multiversion fast loop > > Current CompileTask: > C2:7789 1280 jdk.incubator.vector.ByteVector::ldLongOp (48 bytes) > > Stack: [0x00007f9ef7cfe000,0x00007f9ef7dfe000], sp=0x00007f9ef7df8560, free space=1001k > Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) > V [libjvm.so+0x1bcb7a4] VLoop::check_preconditions_helper() [clone .part.0]+0x824 (vectorization.cpp:141) > V [libjvm.so+0x1bcba31] VLoop::check_preconditions()+0x41 (vectorization.cpp:41) > V [libjvm.so+0x1573ea1] PhaseIdealLoop::auto_vectorize(IdealLoopTree*, VSharedData&)+0x241 (loopopts.cpp:4449) > V [libjvm.so+0x155274d] PhaseIdealLoop::build_and_optimize()+0xfdd (loopnode.cpp:5270) > [...] > ``` Would you mind re-running testing? ------------- PR Comment: https://git.openjdk.org/jdk/pull/21630#issuecomment-3077761301 From roland at openjdk.org Wed Jul 16 09:38:18 2025 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 16 Jul 2025 09:38:18 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v37] In-Reply-To: References: Message-ID: > To optimize a long counted loop and long range checks in a long or int > counted loop, the loop is turned into a loop nest. When the loop has > few iterations, the overhead of having an outer loop whose backedge is > never taken, has a measurable cost. Furthermore, creating the loop > nest usually causes one iteration of the loop to be peeled so > predicates can be set up. If the loop is short running, then it's an > extra iteration that's run with range checks (compared to an int > counted loop with int range checks). > > This change doesn't create a loop nest when: > > 1- it can be determined statically at loop nest creation time that the > loop runs for a short enough number of iterations > > 2- profiling reports that the loop runs for no more than ShortLoopIter > iterations (1000 by default). > > For 2-, a guard is added which is implemented as yet another predicate. 
> > While this change is in principle simple, I ran into a few > implementation issues: > > - while c2 has a way to compute the number of iterations of an int > counted loop, it doesn't have that for long counted loop. The > existing logic for int counted loops promotes values to long to > avoid overflows. I reworked it so it now works for both long and int > counted loops. > > - I added a new deoptimization reason (Reason_short_running_loop) for > the new predicate. Given the number of iterations is narrowed down > by the predicate, the limit of the loop after transformation is a > cast node that's control dependent on the short running loop > predicate. Because once the counted loop is transformed, it is > likely that range check predicates will be inserted and they will > depend on the limit, the short running loop predicate has to be the > one that's further away from the loop entry. Now it is also possible > that the limit before transformation depends on a predicate > (TestShortRunningLongCountedLoopPredicatesClone is an example), we > can have: new predicates inserted after the transformation that > depend on the casted limit that itself depend on old predicates > added before the transformation. To solve this cicular dependency, > parse and assert predicates are cloned between the old predicates > and the loop head. The cloned short running loop parse predicate is > the one that's used to insert the short running loop predicate. > > - In the case of a long counted loop, the loop is transformed into a > regular loop with a new limit and transformed range checks that's > later turned into an in counted loop. The int ... Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: test failures ------------- Changes: - all: https://git.openjdk.org/jdk/pull/21630/files - new: https://git.openjdk.org/jdk/pull/21630/files/bb69cc02..9cae7ead Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=21630&range=36 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=21630&range=35-36 Stats: 72 lines in 4 files changed: 69 ins; 2 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/21630.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/21630/head:pull/21630 PR: https://git.openjdk.org/jdk/pull/21630 From aph at openjdk.org Wed Jul 16 10:30:42 2025 From: aph at openjdk.org (Andrew Haley) Date: Wed, 16 Jul 2025 10:30:42 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: <8NL_uVAlbHrnK9t1Ec89Uk100mo0ADe-_ni9b7QXQss=.39f24034-b4d5-4b25-a2a7-d3930a50730f@github.com> Message-ID: On Wed, 16 Jul 2025 09:29:49 GMT, Andrew Dinn wrote: > That explains why ConP(BMB) is problematic. It is not -- as often stated -- because the value is not an address (offseting an address by a constant still gives an address). The problem is rather what value that address might have. We-ell, hmm. I'm not sure I agree with that. On AArch64, addresses are either 48, 52, 0r 56 bits in length. Anything outside that is, literally, not an address. I guess one could call such things "unmappable addresses", but then we'd be arguing about definitions, which is usually sterile and pointless. > > If C2 can be persuaded to treat the BMB as a value to be propagated like any other value then all of this conversation effectively becomes a don't care. Except for legacy architectures with insufficient registers, of course... 
> > In principle that can already happen when the BMB is inserted into a C2 graph as a ConP -- although I have never seen it actually occurring. However, we have recently migrated barrier insertion to the back end for G1, Shenanadoah and ZGC, So, there is no opportunity for the compiler to perform constant elision in those three cases. That's fixable. We'd need the late barrier expansion to be passed the location of the card table base, hopefully in a register. Surely we have to do _something_ sensible here. Adding a load latency to every oop store would be a Bad Thing. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3077937485 From duke at openjdk.org Wed Jul 16 10:43:34 2025 From: duke at openjdk.org (Jatin Bhateja) Date: Wed, 16 Jul 2025 10:43:34 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v15] In-Reply-To: References: Message-ID: > Hi All, > > This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. > > Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. > > New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Refine lower bound computation ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23947/files - new: https://git.openjdk.org/jdk/pull/23947/files/06eafe77..4f33d4b4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=14 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=13-14 Stats: 9 lines in 1 file changed: 4 ins; 1 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/23947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23947/head:pull/23947 PR: https://git.openjdk.org/jdk/pull/23947 From jbhateja at openjdk.org Wed Jul 16 10:43:35 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 16 Jul 2025 10:43:35 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v14] In-Reply-To: <10jWxhtjQENzTBjlNDFKhHQMN-ioETq3P6_qmVTq3bo=.0124e215-5c09-44c3-8dcb-cd692789907a@github.com> References: <10jWxhtjQENzTBjlNDFKhHQMN-ioETq3P6_qmVTq3bo=.0124e215-5c09-44c3-8dcb-cd692789907a@github.com> Message-ID: <5bnRh3lO2vbvZACelyqV-MuCQ8_1SMFi4PyGIhqWT7Q=.02cdd97d-4d16-40b4-8fa9-fd8a55f850c6@github.com> On Wed, 16 Jul 2025 06:49:06 GMT, Tobias Hartmann wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Broken assertions fix > > Thanks @jatin-bhateja. Isn't the OCA signature status verification independent of the PR? Let me ping a few people here to get it done. Hi @TobiHartmann . I have pushed a change, kindly verify with latest version. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3077983957 From mli at openjdk.org Wed Jul 16 10:48:23 2025 From: mli at openjdk.org (Hamlin Li) Date: Wed, 16 Jul 2025 10:48:23 GMT Subject: RFR: 8362284: RISC-V: cleanup NativeMovRegMem [v2] In-Reply-To: <7QlEqzQUzoDK6NycLx0HECjospeft1MwbOQh7aHVq8U=.efa2d92e-a024-4447-9565-8f6ee7ee4774@github.com> References: <7QlEqzQUzoDK6NycLx0HECjospeft1MwbOQh7aHVq8U=.efa2d92e-a024-4447-9565-8f6ee7ee4774@github.com> Message-ID: > Hi, > Can you help to review this simple patch? > > NativeMovRegMem on riscv is actually dead code, but still needed in case of compilation of C1. > So make the code as simple as possible to avoid any reading and maintainance effort. > > No tests, as `offset()` and `set_offset()` were Unimplemented and used in C1 and never triggered before. > > Thanks! Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: use nullptr ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26328/files - new: https://git.openjdk.org/jdk/pull/26328/files/48c55caa..de5fa377 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26328&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26328&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26328.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26328/head:pull/26328 PR: https://git.openjdk.org/jdk/pull/26328 From mli at openjdk.org Wed Jul 16 10:48:23 2025 From: mli at openjdk.org (Hamlin Li) Date: Wed, 16 Jul 2025 10:48:23 GMT Subject: RFR: 8362284: RISC-V: cleanup NativeMovRegMem [v2] In-Reply-To: References: <7QlEqzQUzoDK6NycLx0HECjospeft1MwbOQh7aHVq8U=.efa2d92e-a024-4447-9565-8f6ee7ee4774@github.com> Message-ID: On Wed, 16 Jul 2025 09:24:20 GMT, Fei Yang wrote: > Thanks for the cleanup. Looks fine modulo one minor comment. Thank you! > src/hotspot/cpu/riscv/nativeInst_riscv.hpp line 232: > >> 230: inline NativeMovRegMem* nativeMovRegMem_at(address addr) { >> 231: Unimplemented(); >> 232: return (NativeMovRegMem*)0; > > Maybe: `return (NativeMovRegMem*)nullptr;` fixed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26328#issuecomment-3077996498 PR Review Comment: https://git.openjdk.org/jdk/pull/26328#discussion_r2209970858 From jbhateja at openjdk.org Wed Jul 16 10:53:49 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 16 Jul 2025 10:53:49 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v15] In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 10:43:34 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. >> >> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. >> >> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. >> >> Kindly review and share your feedback. 
>> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Refine lower bound computation

Quick note on C2 Integral types
- Integral types now encapsulate 3 lattice structures per TypeInt/Long, namely signed, unsigned and knownbits.
- All three lattice values are in sync post-canonicalization.
- A lattice is a partial order relation, i.e., reflexive, transitive but anti-symmetric.
- An integral lattice contains two special values: a TOP (no value, no assumption can be drawn by the compiler) and BOTTOM (all possible values in the value range).
- Verification ensures that the lattice is symmetrical around the centerline, i.e., a semi-lattice.
- For a symmetrical lattice, only one operation, i.e., meet/join, is sufficient for value resolution; other operations can be computed by taking the dual of the first one using De Morgan's law: _join = dual(meet(dual(type1), dual(type2)))_
- In theory, meet b/w two lattice points takes us to the greatest lower bound in the Hasse diagram, while join b/w two lattice points takes us to the lowest upper bound. Also, TOP represents the entire value range of the lattice, while BOTTOM represents no value, but C2 follows an inverted lattice convention.

Inverted integral lattice Hasse diagram

          TOP (no value)
         /    |    |    \
      MIN     ...       MAX
         \    |    |    /
      BOTTOM (all possible values)

- Thus, a MEET of two lattice points takes us to the greatest upper bound of the lattice structure; in this case, it's the union of two lattice points, i.e., we pick the minimum of the lower bounds and the max of the upper bounds of the participating lattice points. JOIN takes us to the lowest upper-bound lattice points of the inverted lattice structure. In this case, it will be an intersection of lattice points, which constrains the value range, i.e., we pick the max of the lower bounds and the minimum of the upper bounds of the two participating integral lattice points.
- e.g., if TypeInt t1 = {lo:10, hi:100} and TypeInt t2 = {lo:1, hi:20}, then

  t1.meet(t2) = lowest upper bound
              = { lo = min(t1.lo, t2.lo), hi = max(t1.hi, t2.hi) }
              = { lo = min(10, 1), hi = max(100, 20) }
              = { lo = 1, hi = 100 }

  t1.join(t2) = dual(meet(dual(t1), dual(t2))), where dual swaps {lo : hi} => {hi : lo}
              = dual(meet(dual{lo:10, hi:100}, dual{lo:1, hi:20}))
              = dual(meet({lo:100, hi:10}, {lo:20, hi:1}))
              = dual({lo = min(100, 20), hi = max(10, 1)})
              = dual({lo:20, hi:10})
              = {lo:10, hi:20}

Additional identities:
- TOP meet VAL = VAL, since we cannot move to any other greatest lower bound when one of the inputs is TOP (unknown value); to move to the greatest lower bound both the inputs must be known values.
- BOTTOM meet VAL = BOTTOM

Now, some quick notes on CCP
- Optimistic data flow analysis using an RPOT walk on the ideal graph.
- Each lattice begins with a TOP value, and analysis progressively adds elements to the lattice. Analysis expects to expand the value range with each data flow iteration, thereby monotonically increasing the lattice set. After each value transformation, type verification checks that the new value is greater than the old value in the lattice; in other words, the new value should dominate the old value in the Hasse diagram of the lattice. Thus, tnew->meet(told) gives us the lowest upper bound of two lattice points, i.e., tnew should be a superset of told.
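A tiny stand-alone sketch of the meet/join/dual identities and the containment check described above (illustrative only; this is not C2's TypeInt/TypeLong code, and `Range`, `meet`, `dual`, `join` and `monotonic` are made-up names):

```c++
// Illustrative only -- interval meet/join with the dual trick from the notes above.
#include <algorithm>
#include <cstdio>

struct Range { long lo, hi; };

Range meet(Range a, Range b) {                 // union-like: widest bounds
  return { std::min(a.lo, b.lo), std::max(a.hi, b.hi) };
}

Range dual(Range a) { return { a.hi, a.lo }; } // swap the bounds

Range join(Range a, Range b) {                 // dual(meet(dual(a), dual(b)))
  return dual(meet(dual(a), dual(b)));         // == { max(a.lo,b.lo), min(a.hi,b.hi) }
}

// "tnew should be a superset of told": told is contained in tnew
// exactly when meet(tnew, told) == tnew.
bool monotonic(Range tnew, Range told) {
  Range m = meet(tnew, told);
  return m.lo == tnew.lo && m.hi == tnew.hi;
}

int main() {
  Range t1{10, 100}, t2{1, 20};
  Range m = meet(t1, t2), j = join(t1, t2);
  std::printf("meet = {%ld, %ld}, join = {%ld, %ld}, monotonic = %d\n",
              m.lo, m.hi, j.lo, j.hi, monotonic(m, t1));
  return 0;
}
```

On the ranges from the worked example it prints meet = {1, 100} and join = {10, 20}, matching the derivation above.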
CCP is an optimistic iterative data flow analysis which traverses the ideal graph in RPOT order and reaches a fixed point once value transforms as no side-effects on the graph. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3078010698 From aph at openjdk.org Wed Jul 16 11:34:45 2025 From: aph at openjdk.org (Andrew Haley) Date: Wed, 16 Jul 2025 11:34:45 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: <7Az-yP8D4rH5uwzgkQMaFcK-t0cD8wMMlzjKIRXDvis=.9695b1bc-c9e0-490f-9371-41840cfc0a42@github.com> References: <8kuDtuUPOl5CsWzgmgN9V0X5hXmGUExY4rpOfAfn1ic=.b7023a27-6362-4a7f-ba77-05cc0b50e5e3@github.com> <7Az-yP8D4rH5uwzgkQMaFcK-t0cD8wMMlzjKIRXDvis=.9695b1bc-c9e0-490f-9371-41840cfc0a42@github.com> Message-ID: On Wed, 16 Jul 2025 08:35:38 GMT, Bhavana Kilambi wrote: >> But why would the Neon implementation fail if UseSVE ==1? Surely it would still work, and if it still works this comment is wrong. > > @theRealAph Thanks for your comments. The Neon implementation would not fail if UseSVE == 1 (Does the comment imply something like this?). Only that we are making a choice of generating Neon instructions for UseSVE = 1 and vec_len = 16. > > The conditions that can reach this method are - UseSVE = 0, 1 with vec_len = 8 or 16 and UseSVE = 2 with any vec_len (based on the conditions in `Matcher::match_rule_supported_vector()`). We have already filtered out `UseSVE = 1` with `vec_len = 8` and `UseSVE = 2` at line #2904. So if the control reaches #2915 then it's either `UseSVE = 0` with any vec_len and `UseSVE = 1` with `vec_len = 16` and that's what the comment mentions. Then that's what your comment should say. It does not: it says "conditions must satisfy" which implies that this is something the following code needs for correct operation. The language in your reply here is fine. Say that instead of "one of these conditions must satisfy". ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2210089819 From bkilambi at openjdk.org Wed Jul 16 11:38:45 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Wed, 16 Jul 2025 11:38:45 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: <8kuDtuUPOl5CsWzgmgN9V0X5hXmGUExY4rpOfAfn1ic=.b7023a27-6362-4a7f-ba77-05cc0b50e5e3@github.com> <7Az-yP8D4rH5uwzgkQMaFcK-t0cD8wMMlzjKIRXDvis=.9695b1bc-c9e0-490f-9371-41840cfc0a42@github.com> Message-ID: <8EA4qZsKKGihKr2rRfVp1hC4wdVkpvkirEMmCd-xL6o=.9deaf8bc-0185-4a9e-8643-b31f0b6d311a@github.com> On Wed, 16 Jul 2025 11:31:53 GMT, Andrew Haley wrote: >> @theRealAph Thanks for your comments. The Neon implementation would not fail if UseSVE == 1 (Does the comment imply something like this?). Only that we are making a choice of generating Neon instructions for UseSVE = 1 and vec_len = 16. >> >> The conditions that can reach this method are - UseSVE = 0, 1 with vec_len = 8 or 16 and UseSVE = 2 with any vec_len (based on the conditions in `Matcher::match_rule_supported_vector()`). We have already filtered out `UseSVE = 1` with `vec_len = 8` and `UseSVE = 2` at line #2904. So if the control reaches #2915 then it's either `UseSVE = 0` with any vec_len and `UseSVE = 1` with `vec_len = 16` and that's what the comment mentions. > > Then that's what your comment should say. It does not: it says "conditions must satisfy" which implies that this is something the following code needs for correct operation. > > The language in your reply here is fine. 
Say that instead of "one of these conditions must satisfy". Thanks for the suggestion. Will do that in my next PS. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2210111107 From chagedorn at openjdk.org Wed Jul 16 11:56:49 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 16 Jul 2025 11:56:49 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v37] In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 09:38:18 GMT, Roland Westrelin wrote: >> To optimize a long counted loop and long range checks in a long or int >> counted loop, the loop is turned into a loop nest. When the loop has >> few iterations, the overhead of having an outer loop whose backedge is >> never taken, has a measurable cost. Furthermore, creating the loop >> nest usually causes one iteration of the loop to be peeled so >> predicates can be set up. If the loop is short running, then it's an >> extra iteration that's run with range checks (compared to an int >> counted loop with int range checks). >> >> This change doesn't create a loop nest when: >> >> 1- it can be determined statically at loop nest creation time that the >> loop runs for a short enough number of iterations >> >> 2- profiling reports that the loop runs for no more than ShortLoopIter >> iterations (1000 by default). >> >> For 2-, a guard is added which is implemented as yet another predicate. >> >> While this change is in principle simple, I ran into a few >> implementation issues: >> >> - while c2 has a way to compute the number of iterations of an int >> counted loop, it doesn't have that for long counted loop. The >> existing logic for int counted loops promotes values to long to >> avoid overflows. I reworked it so it now works for both long and int >> counted loops. >> >> - I added a new deoptimization reason (Reason_short_running_loop) for >> the new predicate. Given the number of iterations is narrowed down >> by the predicate, the limit of the loop after transformation is a >> cast node that's control dependent on the short running loop >> predicate. Because once the counted loop is transformed, it is >> likely that range check predicates will be inserted and they will >> depend on the limit, the short running loop predicate has to be the >> one that's further away from the loop entry. Now it is also possible >> that the limit before transformation depends on a predicate >> (TestShortRunningLongCountedLoopPredicatesClone is an example), we >> can have: new predicates inserted after the transformation that >> depend on the casted limit that itself depend on old predicates >> added before the transformation. To solve this cicular dependency, >> parse and assert predicates are cloned between the old predicates >> and the loop head. The cloned short running loop parse predicate is >> the one that's used to insert the short running loop predicate. >> >> - In the case of a long counted loop, the loop is transformed into a >> regular loop with a ... > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > test failures Great! Sure, I've submitted another round of testing. Will report back again. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/21630#issuecomment-3078236848 From bulasevich at openjdk.org Wed Jul 16 12:01:46 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Wed, 16 Jul 2025 12:01:46 GMT Subject: RFR: 8362250: ARM32: forward_exception_entry missing return address [v2] In-Reply-To: References: <4O9aorKuQ5wpIGNVsjHd8K8lIQR-uRDxEl7HsFuUyXk=.e9c48f43-f406-4540-a231-cae9bdfc0f11@github.com> Message-ID: On Tue, 15 Jul 2025 17:42:27 GMT, Boris Ulasevich wrote: >> The ARM32 ForwardExceptionNode codegen needs to set the exception address to R5. And, since the https://github.com/openjdk/jdk/pull/20437 change, the TailCall generator does not need this because the StubRoutines::forward_exception_entry function is not called there. > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > adjust ad rules format Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26312#issuecomment-3078253484 From bulasevich at openjdk.org Wed Jul 16 12:01:47 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Wed, 16 Jul 2025 12:01:47 GMT Subject: Integrated: 8362250: ARM32: forward_exception_entry missing return address In-Reply-To: <4O9aorKuQ5wpIGNVsjHd8K8lIQR-uRDxEl7HsFuUyXk=.e9c48f43-f406-4540-a231-cae9bdfc0f11@github.com> References: <4O9aorKuQ5wpIGNVsjHd8K8lIQR-uRDxEl7HsFuUyXk=.e9c48f43-f406-4540-a231-cae9bdfc0f11@github.com> Message-ID: On Tue, 15 Jul 2025 10:23:23 GMT, Boris Ulasevich wrote: > The ARM32 ForwardExceptionNode codegen needs to set the exception address to R5. And, since the https://github.com/openjdk/jdk/pull/20437 change, the TailCall generator does not need this because the StubRoutines::forward_exception_entry function is not called there. This pull request has now been integrated. Changeset: 6ed81641 Author: Boris Ulasevich URL: https://git.openjdk.org/jdk/commit/6ed81641b101658fbbd35445b6dd74ec17fc20f3 Stats: 9 lines in 1 file changed: 2 ins; 5 del; 2 mod 8362250: ARM32: forward_exception_entry missing return address Reviewed-by: shade ------------- PR: https://git.openjdk.org/jdk/pull/26312 From fyang at openjdk.org Wed Jul 16 12:33:44 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 16 Jul 2025 12:33:44 GMT Subject: RFR: 8362284: RISC-V: cleanup NativeMovRegMem [v2] In-Reply-To: References: <7QlEqzQUzoDK6NycLx0HECjospeft1MwbOQh7aHVq8U=.efa2d92e-a024-4447-9565-8f6ee7ee4774@github.com> Message-ID: On Wed, 16 Jul 2025 10:48:23 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review this simple patch? >> >> NativeMovRegMem on riscv is actually dead code, but still needed in case of compilation of C1. >> So make the code as simple as possible to avoid any reading and maintainance effort. >> >> No tests, as `offset()` and `set_offset()` were Unimplemented and used in C1 and never triggered before. >> >> Thanks! > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > use nullptr Marked as reviewed by fyang (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/26328#pullrequestreview-3024690862 From bmaillard at openjdk.org Wed Jul 16 12:53:00 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Wed, 16 Jul 2025 12:53:00 GMT Subject: RFR: 8361700: Missed optimization in PhaseIterGVN for mask and shift patterns due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist Message-ID: This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist`. The affected optimization is the transformation of `(x & mask) >> shift` into `(x >> shift) & (mask >> shift)`, where `mask` is a constant. This transformation is handled in `RShiftNode::IdealIL` for both `RShiftI` and `RShiftL` nodes. The dependency of this optimization extends beyond a direct input: from the viewpoint of a shift node, it relies on changes to the inputs of its inputs (i.e., an `AndI`/`AndL` input node to the shift). Therefore, when the `And` node changes, the corresponding shift node must be notified to allow the optimization to take place. Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. ### Testing - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361700) - [x] tier1-3, plus some internal testing - [x] Added test from the fuzzer Thank you for reviewing! ------------- Commit messages: - 8361700: Add comment for reference to the optimization - 8361700: Add test obtained from the fuzzer - 8361700: Add RShift nodes to worklist when candidates for mask and shift ideal optimization Changes: https://git.openjdk.org/jdk/pull/26347/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26347&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361700 Stats: 69 lines in 2 files changed: 69 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26347.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26347/head:pull/26347 PR: https://git.openjdk.org/jdk/pull/26347 From thartmann at openjdk.org Wed Jul 16 13:02:54 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 16 Jul 2025 13:02:54 GMT Subject: RFR: 8362171: C2 fails with unexpected node in SuperWord truncation: ModI In-Reply-To: References: Message-ID: <-x7-wP0qjhmk3PxdON7RftIv_oPWu3bBNmKjUCP-bCc=.e97704bb-df34-4f1c-9085-b47c7486c553@github.com> On Tue, 15 Jul 2025 21:52:23 GMT, Jasmine Karthikeyan wrote: > Hi all, > This is a small fix for an assert failure in SuperWord truncation with ModI nodes. The failure itself is harmless and shouldn't lead to any miscompilations in product mode. I've added `ModI` to the assert switch and adapted the test in the bug report. Let me know what you think! Testing is all clean. Ship it! 
:slightly_smiling_face: ------------- PR Comment: https://git.openjdk.org/jdk/pull/26334#issuecomment-3078413666 From jkarthikeyan at openjdk.org Wed Jul 16 13:02:55 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 16 Jul 2025 13:02:55 GMT Subject: RFR: 8362171: C2 fails with unexpected node in SuperWord truncation: ModI In-Reply-To: <-x7-wP0qjhmk3PxdON7RftIv_oPWu3bBNmKjUCP-bCc=.e97704bb-df34-4f1c-9085-b47c7486c553@github.com> References: <-x7-wP0qjhmk3PxdON7RftIv_oPWu3bBNmKjUCP-bCc=.e97704bb-df34-4f1c-9085-b47c7486c553@github.com> Message-ID: <_bmy_qwlDUmvAtArdSmfSnKlUyppVieqGe-HkfY9pNA=.6b4e2241-bdc2-43c6-89f6-0b65fe8b5f46@github.com> On Wed, 16 Jul 2025 12:43:09 GMT, Tobias Hartmann wrote: >> Hi all, >> This is a small fix for an assert failure in SuperWord truncation with ModI nodes. The failure itself is harmless and shouldn't lead to any miscompilations in product mode. I've added `ModI` to the assert switch and adapted the test in the bug report. Let me know what you think! > > Testing is all clean. Ship it! :slightly_smiling_face: Thanks for the reviews @TobiHartmann @chhagedorn! :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/26334#issuecomment-3078481539 From jkarthikeyan at openjdk.org Wed Jul 16 13:02:55 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 16 Jul 2025 13:02:55 GMT Subject: Integrated: 8362171: C2 fails with unexpected node in SuperWord truncation: ModI In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 21:52:23 GMT, Jasmine Karthikeyan wrote: > Hi all, > This is a small fix for an assert failure in SuperWord truncation with ModI nodes. The failure itself is harmless and shouldn't lead to any miscompilations in product mode. I've added `ModI` to the assert switch and adapted the test in the bug report. Let me know what you think! This pull request has now been integrated. Changeset: 70c1ff7e Author: Jasmine Karthikeyan URL: https://git.openjdk.org/jdk/commit/70c1ff7e1505eee11b2a9acd9e94a39cd2c9a932 Stats: 14 lines in 2 files changed: 13 ins; 0 del; 1 mod 8362171: C2 fails with unexpected node in SuperWord truncation: ModI Reviewed-by: thartmann, chagedorn ------------- PR: https://git.openjdk.org/jdk/pull/26334 From thartmann at openjdk.org Wed Jul 16 13:40:00 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 16 Jul 2025 13:40:00 GMT Subject: [jdk25] RFR: 8362171: C2 fails with unexpected node in SuperWord truncation: ModI Message-ID: Hi all, This pull request contains a backport of commit [70c1ff7e](https://github.com/openjdk/jdk/commit/70c1ff7e1505eee11b2a9acd9e94a39cd2c9a932) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. The commit being backported was authored by Jasmine Karthikeyan on 16 Jul 2025 and was reviewed by Tobias Hartmann and Christian Hagedorn. Thanks! 
------------- Commit messages: - Backport 70c1ff7e1505eee11b2a9acd9e94a39cd2c9a932 Changes: https://git.openjdk.org/jdk/pull/26350/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26350&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362171 Stats: 14 lines in 2 files changed: 13 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26350.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26350/head:pull/26350 PR: https://git.openjdk.org/jdk/pull/26350 From bmaillard at openjdk.org Wed Jul 16 13:49:56 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Wed, 16 Jul 2025 13:49:56 GMT Subject: RFR: 8358573: CompileBroker fails with "expect failure reason" assert with -XX:-InstallMethods [v2] In-Reply-To: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> References: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> Message-ID: <5wFwGsfzeYOy0uMPSbg9TfzEzZqhgPYLcY26FF7My9s=.1b474ce7-ea02-4adb-b938-31208cb31ec3@github.com> > This PR prevents from hitting an assert when disabling method installation at the end of a successful compilation with the `-XX:-InstallMethods` flag. Previously `CompileBroker` failed to mark the `CompileTask` as complete when using this flag. > > ## Analysis > > We can see that the assert is triggered in `CompileBroker::invoke_compiler_on_method`: > > ```c++ > if (!ci_env.failing() && !task->is_success()) { > assert(ci_env.failure_reason() != nullptr, "expect failure reason"); > assert(false, "compiler should always document failure: %s", ci_env.failure_reason()); > // The compiler elected, without comment, not to register a result. > // Do not attempt further compilations of this method. > ci_env.record_method_not_compilable("compile failed"); > } > > > The `task->is_success()` call accesses the private `_is_success` field. > This field is modified in `CompileTask::mark_success`. > > By setting a breakpoint there, and execute the program without `-XX:-InstallMethods`, we get the following stacktrace: > > CompileTask::mark_success compileTask.hpp:185 > nmethod::post_compiled_method nmethod.cpp:2212 > ciEnv::register_method ciEnv.cpp:1127 > Compilation::install_code c1_Compilation.cpp:425 > Compilation::compile_method c1_Compilation.cpp:488 > Compilation::Compilation c1_Compilation.cpp:609 > Compiler::compile_method c1_Compiler.cpp:262 > CompileBroker::invoke_compiler_on_method compileBroker.cpp:2324 > CompileBroker::compiler_thread_loop compileBroker.cpp:1968 > CompilerThread::thread_entry compilerThread.cpp:67 > JavaThread::thread_main_inner javaThread.cpp:773 > JavaThread::run javaThread.cpp:758 > Thread::call_run thread.cpp:243 > thread_native_entry os_linux.cpp:868 > > > We go up the stacktrace and see that in `Compilation::compile_method` we have: > > ```c++ > if (should_install_code()) { > // install code > PhaseTraceTime timeit(_t_codeinstall); > install_code(frame_size); > } > > > If we do not install methods after compilation, the code path that marks the success is never executed > and therefore results in hitting the assert. > > ### Fix > We simply mark the task as complete when `should_install_code()` evaluates to `false` in the block code above. > After careful consideration, it was decided to simply get rid of the `-XX:-InstallMethods` flag. 
> > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8358573) > - [x] tier1-3, plus some internal testing > - [x] Added a test that starts the VM with the `-XX:-InstallMethods` flag > > ... Beno?t Maillard has updated the pull request incrementally with three additional commits since the last revision: - 8358573: get rid of InstallMethods flags completely - Revert "8358573: Add missing task success notification" This reverts commit cd91c7c06ba05aba3500b95ba1317539363aa63c. - Revert "8358573: Add test for -XX:-InstallMethods" This reverts commit 6eab84718c3b60c2585bc2711c4bc8144472975b. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26310/files - new: https://git.openjdk.org/jdk/pull/26310/files/6eab8471..2da73a5d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26310&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26310&range=00-01 Stats: 54 lines in 4 files changed: 0 ins; 53 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26310.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26310/head:pull/26310 PR: https://git.openjdk.org/jdk/pull/26310 From bmaillard at openjdk.org Wed Jul 16 13:49:56 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Wed, 16 Jul 2025 13:49:56 GMT Subject: RFR: 8358573: CompileBroker fails with "expect failure reason" assert with -XX:-InstallMethods [v2] In-Reply-To: References: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> Message-ID: On Tue, 15 Jul 2025 15:45:10 GMT, Marc Chevalier wrote: >> Beno?t Maillard has updated the pull request incrementally with three additional commits since the last revision: >> >> - 8358573: get rid of InstallMethods flags completely >> - Revert "8358573: Add missing task success notification" >> >> This reverts commit cd91c7c06ba05aba3500b95ba1317539363aa63c. >> - Revert "8358573: Add test for -XX:-InstallMethods" >> >> This reverts commit 6eab84718c3b60c2585bc2711c4bc8144472975b. > > Nice description! I have reverted the changes and removed the `-XX:-InstallMethods` flag. Thank you @marc-chevalier, @dean-long and @TobiHartmann for your comments! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26310#issuecomment-3078713061 From chagedorn at openjdk.org Wed Jul 16 14:24:40 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 16 Jul 2025 14:24:40 GMT Subject: [jdk25] RFR: 8362171: C2 fails with unexpected node in SuperWord truncation: ModI In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 13:34:59 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [70c1ff7e](https://github.com/openjdk/jdk/commit/70c1ff7e1505eee11b2a9acd9e94a39cd2c9a932) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Jasmine Karthikeyan on 16 Jul 2025 and was reviewed by Tobias Hartmann and Christian Hagedorn. > > Thanks! Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26350#pullrequestreview-3025273815 From thartmann at openjdk.org Wed Jul 16 14:27:42 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 16 Jul 2025 14:27:42 GMT Subject: [jdk25] RFR: 8362171: C2 fails with unexpected node in SuperWord truncation: ModI In-Reply-To: References: Message-ID: <4BVmGbWHZUg2ZeJ0rKMmgi23fY0QTxPjLgAsMpzjAKo=.46ce4ef3-1b65-43c0-bcf6-8552b0fe2808@github.com> On Wed, 16 Jul 2025 13:34:59 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [70c1ff7e](https://github.com/openjdk/jdk/commit/70c1ff7e1505eee11b2a9acd9e94a39cd2c9a932) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Jasmine Karthikeyan on 16 Jul 2025 and was reviewed by Tobias Hartmann and Christian Hagedorn. > > Thanks! Thanks Christian! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26350#issuecomment-3078845879 From mchevalier at openjdk.org Wed Jul 16 14:31:49 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Wed, 16 Jul 2025 14:31:49 GMT Subject: RFR: 8358573: CompileBroker fails with "expect failure reason" assert with -XX:-InstallMethods [v2] In-Reply-To: <5wFwGsfzeYOy0uMPSbg9TfzEzZqhgPYLcY26FF7My9s=.1b474ce7-ea02-4adb-b938-31208cb31ec3@github.com> References: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> <5wFwGsfzeYOy0uMPSbg9TfzEzZqhgPYLcY26FF7My9s=.1b474ce7-ea02-4adb-b938-31208cb31ec3@github.com> Message-ID: On Wed, 16 Jul 2025 13:49:56 GMT, Beno?t Maillard wrote: >> This PR prevents from hitting an assert when disabling method installation at the end of a successful compilation with the `-XX:-InstallMethods` flag. Previously `CompileBroker` failed to mark the `CompileTask` as complete when using this flag. >> >> ## Analysis >> >> We can see that the assert is triggered in `CompileBroker::invoke_compiler_on_method`: >> >> ```c++ >> if (!ci_env.failing() && !task->is_success()) { >> assert(ci_env.failure_reason() != nullptr, "expect failure reason"); >> assert(false, "compiler should always document failure: %s", ci_env.failure_reason()); >> // The compiler elected, without comment, not to register a result. >> // Do not attempt further compilations of this method. >> ci_env.record_method_not_compilable("compile failed"); >> } >> >> >> The `task->is_success()` call accesses the private `_is_success` field. >> This field is modified in `CompileTask::mark_success`. 
>> >> By setting a breakpoint there, and execute the program without `-XX:-InstallMethods`, we get the following stacktrace: >> >> CompileTask::mark_success compileTask.hpp:185 >> nmethod::post_compiled_method nmethod.cpp:2212 >> ciEnv::register_method ciEnv.cpp:1127 >> Compilation::install_code c1_Compilation.cpp:425 >> Compilation::compile_method c1_Compilation.cpp:488 >> Compilation::Compilation c1_Compilation.cpp:609 >> Compiler::compile_method c1_Compiler.cpp:262 >> CompileBroker::invoke_compiler_on_method compileBroker.cpp:2324 >> CompileBroker::compiler_thread_loop compileBroker.cpp:1968 >> CompilerThread::thread_entry compilerThread.cpp:67 >> JavaThread::thread_main_inner javaThread.cpp:773 >> JavaThread::run javaThread.cpp:758 >> Thread::call_run thread.cpp:243 >> thread_native_entry os_linux.cpp:868 >> >> >> We go up the stacktrace and see that in `Compilation::compile_method` we have: >> >> ```c++ >> if (should_install_code()) { >> // install code >> PhaseTraceTime timeit(_t_codeinstall); >> install_code(frame_size); >> } >> >> >> If we do not install methods after compilation, the code path that marks the success is never executed >> and therefore results in hitting the assert. >> >> ### Fix >> We simply mark the task as complete when `should_install_code()` evaluates to `false` in the block code above. >> After careful consideration, it was decided to simply get rid of the `-XX:-InstallMethods` flag. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8358573) >> - [... > > Beno?t Maillard has updated the pull request incrementally with three additional commits since the last revision: > > - 8358573: get rid of InstallMethods flags completely > - Revert "8358573: Add missing task success notification" > > This reverts commit cd91c7c06ba05aba3500b95ba1317539363aa63c. > - Revert "8358573: Add test for -XX:-InstallMethods" > > This reverts commit 6eab84718c3b60c2585bc2711c4bc8144472975b. Now, `should_install_code` is `false` only for `RepeatCompilation`, fine with me. The fix looks good, but maybe it's worth changing the title of the issue/PR? Even if it still solves the problem mentioned in title, I'd say that's not the best description of what you're doing. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26310#issuecomment-3078862949 From mchevalier at openjdk.org Wed Jul 16 14:37:41 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Wed, 16 Jul 2025 14:37:41 GMT Subject: RFR: 8361700: Missed optimization in PhaseIterGVN for mask and shift patterns due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 12:42:32 GMT, Beno?t Maillard wrote: > This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist`. > > The affected optimization is the transformation of `(x & mask) >> shift` into `(x >> shift) & (mask >> shift)`, where `mask` is a constant. This transformation is handled in `RShiftNode::IdealIL` for both `RShiftI` and `RShiftL` nodes. > > The dependency of this optimization extends beyond a direct input: from the viewpoint of a shift node, it relies on changes to the inputs of its inputs (i.e., an `AndI`/`AndL` input node to the shift). Therefore, when the `And` node changes, the corresponding shift node must be notified to allow the optimization to take place. 
> > Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361700) > - [x] tier1-3, plus some internal testing > - [x] Added test from the fuzzer > > Thank you for reviewing! Yet another of this kind. Works for me! Would be good if we could declare what shape we are looking for, that would be used by the node idealization, and by the IGVN witchcraft to guess how deep nodes look, and update the relevant nodes automatically, without having to make it manually. I think I've read some similar vague idea before in a JBS issue, maybe from Roland? src/hotspot/share/opto/phaseX.cpp line 2552: > 2550: } > 2551: } > 2552: // If changed AndI/AndL inputs, check RShift users for "(x & mask) >> shift" reordering "reordering" sounds not quite right to me, but I don't have a much better idea. ------------- Marked as reviewed by mchevalier (Committer). PR Review: https://git.openjdk.org/jdk/pull/26347#pullrequestreview-3025313234 PR Review Comment: https://git.openjdk.org/jdk/pull/26347#discussion_r2210610118 From aph at openjdk.org Wed Jul 16 14:44:43 2025 From: aph at openjdk.org (Andrew Haley) Date: Wed, 16 Jul 2025 14:44:43 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet In-Reply-To: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: On Thu, 26 Jun 2025 12:13:19 GMT, Samuel Chee wrote: > AtomicLong.CompareAndSet has the following assembly dump snippet which gets emitted from the intermediary LIRGenerator::atomic_cmpxchg: > > ;; cmpxchg { > 0x0000e708d144cf60: mov x8, x2 > 0x0000e708d144cf64: casal x8, x3, [x0] > 0x0000e708d144cf68: cmp x8, x2 > ;; 0x1F1F1F1F1F1F1F1F > 0x0000e708d144cf6c: mov x8, #0x1f1f1f1f1f1f1f1f > ;; } cmpxchg > 0x0000e708d144cf70: cset x8, ne // ne = any > 0x0000e708d144cf74: dmb ish > > > According to the Oracle Java Specification, AtomicLong.CompareAndSet [1] has the same memory effects as specified by VarHandle.compareAndSet which has the following effects: [2] > >> Atomically sets the value of a variable to the >> newValue with the memory semantics of setVolatile if >> the variable's current value, referred to as the witness >> value, == the expectedValue, as accessed with the memory >> semantics of getVolatile. > > > > Hence the release on the store due to setVolatile only occurs if the compare is successful. Since casal already satisfies these requirements, the dmb does not need to occur to ensure memory ordering in case the compare fails and a release does not happen. > > Hence we remove the dmb from both casl and casw (same logic applies to the non-long variant) > > This is also reflected by C2 not having a dmb for the same respective method. > > [1] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/util/concurrent/atomic/AtomicLong.html#compareAndSet(long,long) > [2] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/lang/invoke/VarHandle.html#compareAndSet(java.lang.Object...) 
I think we still need a DMB after non-LSE CMPXCHG, which gets failures without this DMB: AArch64 MP { 0:X0=x; 0:X2=y; 1:X0=y; 1:X4=x; } P0 | P1 ; LDAR W1,[X0] | MOV W2,#1 ; | L0: ; LDR W3,[X2] | LDAXR W1,[X0] ; | STLXR W8,W2,[X0] ; | CBNZ W8,L0; | DMB ISH; | MOV W3,#1 ; | STR W3,[X4] ; exists (0:X1=1 /\ 0:X3=0 /\ 1:X1=0) ------------- PR Comment: https://git.openjdk.org/jdk/pull/26000#issuecomment-3078920694 From thartmann at openjdk.org Wed Jul 16 14:53:44 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 16 Jul 2025 14:53:44 GMT Subject: [jdk25] Integrated: 8362171: C2 fails with unexpected node in SuperWord truncation: ModI In-Reply-To: References: Message-ID: <8JcQuNToKOJz04zWf6x7wPV4j8gssC99g6VSeA1qRqM=.a7bb7e7f-693e-4eb2-a857-2bb85055e93f@github.com> On Wed, 16 Jul 2025 13:34:59 GMT, Tobias Hartmann wrote: > Hi all, > > This pull request contains a backport of commit [70c1ff7e](https://github.com/openjdk/jdk/commit/70c1ff7e1505eee11b2a9acd9e94a39cd2c9a932) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Jasmine Karthikeyan on 16 Jul 2025 and was reviewed by Tobias Hartmann and Christian Hagedorn. > > Thanks! This pull request has now been integrated. Changeset: b67fb82a Author: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/b67fb82a03cdb9634f71c0c39722611c852ade50 Stats: 14 lines in 2 files changed: 13 ins; 0 del; 1 mod 8362171: C2 fails with unexpected node in SuperWord truncation: ModI Reviewed-by: chagedorn Backport-of: 70c1ff7e1505eee11b2a9acd9e94a39cd2c9a932 ------------- PR: https://git.openjdk.org/jdk/pull/26350 From fgao at openjdk.org Wed Jul 16 14:53:50 2025 From: fgao at openjdk.org (Fei Gao) Date: Wed, 16 Jul 2025 14:53:50 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: <38f2bvFqiVNQGGpMif0iflVFD8wXnyw4SwtKxwi_Dmo=.276fb2fb-b80c-4ea7-a32f-c326294f442a@github.com> Message-ID: <1xAfD3mz5cbQpYtCYxoHqRQcOLadLKNHrvMUtFtFbGo=.34e5780a-e37a-427c-b745-1ed422c7a008@github.com> On Wed, 16 Jul 2025 06:44:13 GMT, Xiaohong Gong wrote: > * case-2: 2 times of gather and merge > > * Can be refined. But the `LoadVectorGatherNode` should be changed to accept 2 `index` vectors. > * case-3: 4 times of gather and merge (only for byte) > > * Can be refined. We can implement it just like: > step-1: `v1 = gather1 + gather2 + 2 * uzp1` // merging the first and second gather-loads > step-2: `v2 = gather3 + gather4 + 2 * uzp1` // merging the third and fourth gather-loads > step-3: `v3 = slice (v2, v2)`, `v = or(v1, v3)` // do the final merging > We have to change `LoadVectorGatherNode` as well. At least making it accept 2 `index` vectors. > > As a summary, `LoadVectorGatherNode` will be more complex than before. But the good thing is, giving it one more `index` input is ok. I'm not sure whether this is appliable for other architectures like maybe RVV. But I can try with this change. Do you have better idea? Thanks! @XiaohongGong thanks for your reply. This idea generally looks good to me. For case-2, we have gather1 + gather2 + uzp1: [0a 0a 0a 0a ... 0a 0a 0a 0a] [0b 0b 0b 0b ... 0b 0b 0b 0b] uzp1.H => [bb bb bb bb ... aa aa aa aa] Can we improve `case-3` by following the pattern of `case-2`? step-1: v1 = gather1 + gather2 + uzp1 [000a 000a 000a 000a ? 000a 000a 000a 000a] [000b 000b 000b 000b ? 000b 000b 000b 000b] uzp1.H => [0b0b 0b0b 0b0b 0b0b ? 0a0a 0a0a 0a0a 0a0a] step-2: v2 = gather3 + gather4 + uzp1 [000c 000c 000c 000c ? 
000c 000c 000c 000c] [000d 000d 000d 000d ? 000d 000d 000d 000d] uzp1.H => [0d0d 0d0d 0d0d 0d0d ? 0c0c 0c0c 0c0c 0c0c] step-3: v3 = uzp1 (v1, v2) [0b0b 0b0b 0b0b 0b0b ? 0a0a 0a0a 0a0a 0a0a] [0d0d 0d0d 0d0d 0d0d ? 0c0c 0c0c 0c0c 0c0c] uzp1.B => [dddd dddd cccc cccc ? bbbb bbbb aaaa aaaa] Then we can also consistently define the semantics of `LoadVectorGatherNode` as `gather1 + gather2 + uzp1.H `, which would make backend much cleaner. WDYT? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3078968856 From mhaessig at openjdk.org Wed Jul 16 14:58:45 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Wed, 16 Jul 2025 14:58:45 GMT Subject: RFR: 8361700: Missed optimization in PhaseIterGVN for mask and shift patterns due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 12:42:32 GMT, Beno?t Maillard wrote: > This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist`. > > The affected optimization is the transformation of `(x & mask) >> shift` into `(x >> shift) & (mask >> shift)`, where `mask` is a constant. This transformation is handled in `RShiftNode::IdealIL` for both `RShiftI` and `RShiftL` nodes. > > The dependency of this optimization extends beyond a direct input: from the viewpoint of a shift node, it relies on changes to the inputs of its inputs (i.e., an `AndI`/`AndL` input node to the shift). Therefore, when the `And` node changes, the corresponding shift node must be notified to allow the optimization to take place. > > Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361700) > - [x] tier1-3, plus some internal testing > - [x] Added test from the fuzzer > > Thank you for reviewing! Thank you for working on this, @benoitmaillard. I only have a few nits. Otherwise, this looks good to me. test/hotspot/jtreg/compiler/c2/TestMaskAndRShiftReorder.java line 28: > 26: * @bug 8361700 > 27: * @summary An expression of the form "(x & mask) >> shift", where the mask > 28: * is a constant, should be reordered to "(x >> shift) & (mask >> shift)" Suggestion: * is a constant, should be transformed to "(x >> shift) & (mask >> shift)" I agree with @marc-chevalier that "reordered" is not the right word. test/hotspot/jtreg/compiler/c2/TestMaskAndRShiftReorder.java line 60: > 58: return iArr.length; > 59: } > 60: } Suggestion: } ------------- Changes requested by mhaessig (Committer). 
PR Review: https://git.openjdk.org/jdk/pull/26347#pullrequestreview-3025384023 PR Review Comment: https://git.openjdk.org/jdk/pull/26347#discussion_r2210669477 PR Review Comment: https://git.openjdk.org/jdk/pull/26347#discussion_r2210671082 From mhaessig at openjdk.org Wed Jul 16 14:58:46 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Wed, 16 Jul 2025 14:58:46 GMT Subject: RFR: 8361700: Missed optimization in PhaseIterGVN for mask and shift patterns due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 14:31:19 GMT, Marc Chevalier wrote: >> This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist`. >> >> The affected optimization is the transformation of `(x & mask) >> shift` into `(x >> shift) & (mask >> shift)`, where `mask` is a constant. This transformation is handled in `RShiftNode::IdealIL` for both `RShiftI` and `RShiftL` nodes. >> >> The dependency of this optimization extends beyond a direct input: from the viewpoint of a shift node, it relies on changes to the inputs of its inputs (i.e., an `AndI`/`AndL` input node to the shift). Therefore, when the `And` node changes, the corresponding shift node must be notified to allow the optimization to take place. >> >> Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361700) >> - [x] tier1-3, plus some internal testing >> - [x] Added test from the fuzzer >> >> Thank you for reviewing! > > src/hotspot/share/opto/phaseX.cpp line 2552: > >> 2550: } >> 2551: } >> 2552: // If changed AndI/AndL inputs, check RShift users for "(x & mask) >> shift" reordering > > "reordering" sounds not quite right to me, but I don't have a much better idea. Suggestion: // If changed AndI/AndL inputs, check RShift users for "(x & mask) >> shift" optimization opportunity Those are my two cents ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26347#discussion_r2210649883 From bmaillard at openjdk.org Wed Jul 16 15:07:59 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Wed, 16 Jul 2025 15:07:59 GMT Subject: RFR: 8361700: Missed optimization in PhaseIterGVN for mask and shift patterns due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v2] In-Reply-To: References: Message-ID: > This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist`. > > The affected optimization is the transformation of `(x & mask) >> shift` into `(x >> shift) & (mask >> shift)`, where `mask` is a constant. This transformation is handled in `RShiftNode::IdealIL` for both `RShiftI` and `RShiftL` nodes. > > The dependency of this optimization extends beyond a direct input: from the viewpoint of a shift node, it relies on changes to the inputs of its inputs (i.e., an `AndI`/`AndL` input node to the shift). Therefore, when the `And` node changes, the corresponding shift node must be notified to allow the optimization to take place. 
> > Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361700) > - [x] tier1-3, plus some internal testing > - [x] Added test from the fuzzer > > Thank you for reviewing! Beno?t Maillard has updated the pull request incrementally with three additional commits since the last revision: - Update test/hotspot/jtreg/compiler/c2/TestMaskAndRShiftReorder.java Co-authored-by: Manuel H?ssig - Update test/hotspot/jtreg/compiler/c2/TestMaskAndRShiftReorder.java Co-authored-by: Manuel H?ssig - Update src/hotspot/share/opto/phaseX.cpp Co-authored-by: Manuel H?ssig ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26347/files - new: https://git.openjdk.org/jdk/pull/26347/files/ac389939..cc3ccc93 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26347&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26347&range=00-01 Stats: 3 lines in 2 files changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/26347.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26347/head:pull/26347 PR: https://git.openjdk.org/jdk/pull/26347 From bulasevich at openjdk.org Wed Jul 16 16:31:23 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Wed, 16 Jul 2025 16:31:23 GMT Subject: [jdk25] RFR: 8362250: ARM32: forward_exception_entry missing return address Message-ID: This pull request contains a backport of commit [6ed81641](https://github.com/openjdk/jdk/commit/6ed81641b101658fbbd35445b6dd74ec17fc20f3) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. The ARM32 ForwardExceptionNode code generation has been updated to set the exception address. This is a minimal, ARM32-specific change, it fixes a couple of failing hotspot jtreg tests. ------------- Commit messages: - Backport 6ed81641b101658fbbd35445b6dd74ec17fc20f3 Changes: https://git.openjdk.org/jdk/pull/26352/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26352&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362250 Stats: 9 lines in 1 file changed: 2 ins; 5 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26352.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26352/head:pull/26352 PR: https://git.openjdk.org/jdk/pull/26352 From kvn at openjdk.org Wed Jul 16 16:50:42 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 16 Jul 2025 16:50:42 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 [v2] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 08:59:17 GMT, Aleksey Shipilev wrote: >> See the bug for more analysis. >> >> The short summary is that `CompileQueue::delete_all` walks the entire compile queue and deletes the tasks. It normally goes smoothly, unless there are blocking tasks. Then, the actual waiters have to delete the task, lest we delete the task under waiter's feet. Full deletion and blocking waits coordinate with `waiting_for_completion_count` counter. This mechanism -- added by [JDK-8343938](https://bugs.openjdk.org/browse/JDK-8343938) in JDK 25 to solve a similar problem -- almost works. _Almost_. 
>> >> There is a subtle race window, where blocking waiter could have already unparked, dropped `waiting_for_completion_count` to `0` and proceeded to delete the task, see `CompileBroker::wait_for_completion()`. Then the queue deletion code could assume there are _no actual waiters_ on the blocking task, and proceed to delete the task _again_. Before [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473) this race was fairly innocuous, as second attempt at insertion into the free list was benign. But now, `CompileTask`-s are `delete`-d, and the second attempt leads to double free. >> >> I suspect we can fix that by complicating the coordination protocol even further, e.g. by tracking the counters more thoroughly. But, recognizing `CompileQueue::delete_all()` is basically only called from the compiler shutdown code (things are already bad), and it looks completely opportunistic (it does not delete the whole compiler _threads_, so skipping synchronous deletes on a few compile tasks are not a big deal), we should strive to simplify it. >> >> This PR summarily delegates _all_ blocking task deletes to waiters. I think it stands to reason (and can be seen in `CompilerBroker` code) that if a blocking task is in queue, then there _is_ a waiter that would call `CompileBroker::wait_for_completion()` on it. >> >> Additional testing: >> - [x] Linux AArch64 server fastdebug, `tier1` >> - [x] Linux x86_64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Also handle the corner case when compiler threads might be using the task Good. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26294#pullrequestreview-3025883834 From shade at openjdk.org Wed Jul 16 17:26:39 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 16 Jul 2025 17:26:39 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 [v2] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 08:59:17 GMT, Aleksey Shipilev wrote: >> See the bug for more analysis. >> >> The short summary is that `CompileQueue::delete_all` walks the entire compile queue and deletes the tasks. It normally goes smoothly, unless there are blocking tasks. Then, the actual waiters have to delete the task, lest we delete the task under waiter's feet. Full deletion and blocking waits coordinate with `waiting_for_completion_count` counter. This mechanism -- added by [JDK-8343938](https://bugs.openjdk.org/browse/JDK-8343938) in JDK 25 to solve a similar problem -- almost works. _Almost_. >> >> There is a subtle race window, where blocking waiter could have already unparked, dropped `waiting_for_completion_count` to `0` and proceeded to delete the task, see `CompileBroker::wait_for_completion()`. Then the queue deletion code could assume there are _no actual waiters_ on the blocking task, and proceed to delete the task _again_. Before [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473) this race was fairly innocuous, as second attempt at insertion into the free list was benign. But now, `CompileTask`-s are `delete`-d, and the second attempt leads to double free. >> >> I suspect we can fix that by complicating the coordination protocol even further, e.g. by tracking the counters more thoroughly. 
But, recognizing `CompileQueue::delete_all()` is basically only called from the compiler shutdown code (things are already bad), and it looks completely opportunistic (it does not delete the whole compiler _threads_, so skipping synchronous deletes on a few compile tasks are not a big deal), we should strive to simplify it. >> >> This PR summarily delegates _all_ blocking task deletes to waiters. I think it stands to reason (and can be seen in `CompilerBroker` code) that if a blocking task is in queue, then there _is_ a waiter that would call `CompileBroker::wait_for_completion()` on it. >> >> Additional testing: >> - [x] Linux AArch64 server fastdebug, `tier1` >> - [x] Linux x86_64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Also handle the corner case when compiler threads might be using the task Thanks! I think I need another Review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26294#issuecomment-3079552309 From shade at openjdk.org Wed Jul 16 17:27:39 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 16 Jul 2025 17:27:39 GMT Subject: [jdk25] RFR: 8362250: ARM32: forward_exception_entry missing return address In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 16:18:04 GMT, Boris Ulasevich wrote: > This pull request contains a backport of commit [6ed81641](https://github.com/openjdk/jdk/commit/6ed81641b101658fbbd35445b6dd74ec17fc20f3) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The ARM32 ForwardExceptionNode code generation has been updated to set the exception address. This is a minimal, ARM32-specific change, it fixes a couple of failing hotspot jtreg tests. Marked as reviewed by shade (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26352#pullrequestreview-3026033157 From duke at openjdk.org Wed Jul 16 17:37:54 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Wed, 16 Jul 2025 17:37:54 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v7] In-Reply-To: References: Message-ID: <6aQO8OMKcpAUSA9VzJICZXL0YY9zTQzVYDh0NuxueM8=.fc8ca8ce-6619-43dd-ac18-9178f1cb6007@github.com> On Wed, 16 Jul 2025 06:59:06 GMT, Dean Long wrote: > I can't find who posted this and what lines it refers to. If it refers to nmethod::relocate, I don't think the lock is needed after 8358821, because nobody will be patching the relocations. The [comment](https://github.com/openjdk/jdk/pull/23573#issuecomment-2831542576) was from @fisk on an old revision, but it was because `NMethodState_lock` was being held for the entirety of `nmethod::relocate` as opposed to just when the state is updated. However I'm curious about the case where the nmethod we are attempting to relocate gets updated. For example: 1. Call nmethod relocate 2. Check is_in_use() 3. Original nmethod marked not entrant from somewhere else in JVM 4. Perform relocation on stale nmethod I believe we need some way to guarantee the source does not change during relocation. > The source nmethod? I don't see how that would cause a problem for that small block of code. All it does to the source is call make_not_used(). I'm interested in the case where something else invalidates the source nmethod during relocation.
It can't be evicted from the code cache because the `CodeCache_lock` is held during relocation, but that doesn't stop another thread from marking it not entrant. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23573#issuecomment-3079589572 From sviswanathan at openjdk.org Wed Jul 16 20:57:47 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Wed, 16 Jul 2025 20:57:47 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v4] In-Reply-To: <-V4hpHvXdaDjmEyYzHcEpDJ2bzPTqoz2Ao8FLobkmB8=.d9e3b962-ae8d-4e4b-8ddb-c3ab42a2a619@github.com> References: <-V4hpHvXdaDjmEyYzHcEpDJ2bzPTqoz2Ao8FLobkmB8=.d9e3b962-ae8d-4e4b-8ddb-c3ab42a2a619@github.com> Message-ID: On Wed, 16 Jul 2025 00:06:53 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to enhance the existing x86 assembly stubs using PUSH and POP instructions with paired PUSHP/POPP instructions which are part of Intel APX technology. >> >> In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0-R31). Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs. >> >> Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures. > > Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: > > - merge with master > - remove pushp/popp from vm_version_x86 and also when APX is not being used > - rename to paired_push and paired_pop > - 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs src/hotspot/cpu/x86/macroAssembler_x86.cpp line 798: > 796: } > 797: > 798: void MacroAssembler::paired_push(Register src) { It would be better to call these push_ppx and pop_ppx. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2211592540 From dlong at openjdk.org Wed Jul 16 21:29:01 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 16 Jul 2025 21:29:01 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v37] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 20:34:42 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This does not modify existing code paths and therefore does not benefit much from existing tests.
New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: > > Revert is_always_within_branch_range changes In jdk25, if stale nmethod was marked as not entrant, we would patch the verified entry point, but we got rid of that in jdk26. So at least in jdk26, make_not_entrant() shouldn't be changing the nmethod much if at all. But let's say another thread is trying to mark the source nmethod as not entrant while nmethod::relocate is running, or soon after. What is the desired outcome for the newly relocated nmethod? It seems like any call to make_not_entrant() on the source would also want to do the same on the copy, right? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23573#issuecomment-3080964197 From dlong at openjdk.org Wed Jul 16 22:13:59 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 16 Jul 2025 22:13:59 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v37] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 20:34:42 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This does not modify existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: > > Revert is_always_within_branch_range changes BTW, I thought there was an earlier discussion that decided relocation would only happen at a safepoint, but now I can't find it. Did I remember wrong? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23573#issuecomment-3081358519 From dlong at openjdk.org Wed Jul 16 23:26:48 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 16 Jul 2025 23:26:48 GMT Subject: RFR: 8358573: CompileBroker fails with "expect failure reason" assert with -XX:-InstallMethods [v2] In-Reply-To: <5wFwGsfzeYOy0uMPSbg9TfzEzZqhgPYLcY26FF7My9s=.1b474ce7-ea02-4adb-b938-31208cb31ec3@github.com> References: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> <5wFwGsfzeYOy0uMPSbg9TfzEzZqhgPYLcY26FF7My9s=.1b474ce7-ea02-4adb-b938-31208cb31ec3@github.com> Message-ID: On Wed, 16 Jul 2025 13:49:56 GMT, Beno?t Maillard wrote: >> This PR prevents from hitting an assert when disabling method installation at the end of a successful compilation with the `-XX:-InstallMethods` flag. Previously `CompileBroker` failed to mark the `CompileTask` as complete when using this flag. 
>> >> ## Analysis >> >> We can see that the assert is triggered in `CompileBroker::invoke_compiler_on_method`: >> >> ```c++ >> if (!ci_env.failing() && !task->is_success()) { >> assert(ci_env.failure_reason() != nullptr, "expect failure reason"); >> assert(false, "compiler should always document failure: %s", ci_env.failure_reason()); >> // The compiler elected, without comment, not to register a result. >> // Do not attempt further compilations of this method. >> ci_env.record_method_not_compilable("compile failed"); >> } >> >> >> The `task->is_success()` call accesses the private `_is_success` field. >> This field is modified in `CompileTask::mark_success`. >> >> By setting a breakpoint there, and execute the program without `-XX:-InstallMethods`, we get the following stacktrace: >> >> CompileTask::mark_success compileTask.hpp:185 >> nmethod::post_compiled_method nmethod.cpp:2212 >> ciEnv::register_method ciEnv.cpp:1127 >> Compilation::install_code c1_Compilation.cpp:425 >> Compilation::compile_method c1_Compilation.cpp:488 >> Compilation::Compilation c1_Compilation.cpp:609 >> Compiler::compile_method c1_Compiler.cpp:262 >> CompileBroker::invoke_compiler_on_method compileBroker.cpp:2324 >> CompileBroker::compiler_thread_loop compileBroker.cpp:1968 >> CompilerThread::thread_entry compilerThread.cpp:67 >> JavaThread::thread_main_inner javaThread.cpp:773 >> JavaThread::run javaThread.cpp:758 >> Thread::call_run thread.cpp:243 >> thread_native_entry os_linux.cpp:868 >> >> >> We go up the stacktrace and see that in `Compilation::compile_method` we have: >> >> ```c++ >> if (should_install_code()) { >> // install code >> PhaseTraceTime timeit(_t_codeinstall); >> install_code(frame_size); >> } >> >> >> If we do not install methods after compilation, the code path that marks the success is never executed >> and therefore results in hitting the assert. >> >> ### Fix >> We simply mark the task as complete when `should_install_code()` evaluates to `false` in the block code above. >> After careful consideration, it was decided to simply get rid of the `-XX:-InstallMethods` flag. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8358573) >> - [... > > Beno?t Maillard has updated the pull request incrementally with three additional commits since the last revision: > > - 8358573: get rid of InstallMethods flags completely > - Revert "8358573: Add missing task success notification" > > This reverts commit cd91c7c06ba05aba3500b95ba1317539363aa63c. > - Revert "8358573: Add test for -XX:-InstallMethods" > > This reverts commit 6eab84718c3b60c2585bc2711c4bc8144472975b. It might be interesting for maybe a CompileTheWorld mode to be able to turn off should_install_code programmatically, but it doesn't need a global flag to do that. ------------- Marked as reviewed by dlong (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26310#pullrequestreview-3027224467 From duke at openjdk.org Thu Jul 17 00:02:57 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 17 Jul 2025 00:02:57 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v37] In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 21:26:05 GMT, Dean Long wrote: > But let's say another thread is trying to mark the source nmethod as not entrant while nmethod::relocate is running, or soon after. What is the desired outcome for the newly relocated nmethod? It seems like any call to make_not_entrant() on the source would also want to do the same on the copy, right? 
You're correct I think the code is fine as is. If the source gets marked not entrant we can just mark the copy as not entrant as well. However I do think it is important to require the caller of `nmethod::relocate()` to hold the `CodeCache_lock` instead of acquiring the lock inside of `relocate()`. Otherwise the nmethod that is blocked on the lock could be purged from the code cache ------------- PR Comment: https://git.openjdk.org/jdk/pull/23573#issuecomment-3081853751 From duke at openjdk.org Thu Jul 17 00:10:58 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 17 Jul 2025 00:10:58 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v37] In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 22:11:21 GMT, Dean Long wrote: > BTW, I thought there was an earlier discussion that decided relocation would only happen at a safepoint, but now I can't find it. Did I remember wrong? There was a discussion a while back on whether one was needed or not. Here is the argument against one from https://github.com/openjdk/jdk/pull/23573#issuecomment-2831542576 : > The safepoint is still causing more trouble than it solves. It was introduced due to oop phobia. What the oops really needed to stabilize is to run the entry barrier which you do now. The safepoint merely destabilizes the oops again while introducing latency problems and fun class redefinition interactions. It should be removed as I can't see it serves any purpose. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23573#issuecomment-3081871267 From xgong at openjdk.org Thu Jul 17 01:23:51 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 17 Jul 2025 01:23:51 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: <1xAfD3mz5cbQpYtCYxoHqRQcOLadLKNHrvMUtFtFbGo=.34e5780a-e37a-427c-b745-1ed422c7a008@github.com> References: <38f2bvFqiVNQGGpMif0iflVFD8wXnyw4SwtKxwi_Dmo=.276fb2fb-b80c-4ea7-a32f-c326294f442a@github.com> <1xAfD3mz5cbQpYtCYxoHqRQcOLadLKNHrvMUtFtFbGo=.34e5780a-e37a-427c-b745-1ed422c7a008@github.com> Message-ID: On Wed, 16 Jul 2025 14:49:19 GMT, Fei Gao wrote: > > * case-2: 2 times of gather and merge > > > > * Can be refined. But the `LoadVectorGatherNode` should be changed to accept 2 `index` vectors. > > * case-3: 4 times of gather and merge (only for byte) > > > > * Can be refined. We can implement it just like: > > step-1: `v1 = gather1 + gather2 + 2 * uzp1` // merging the first and second gather-loads > > step-2: `v2 = gather3 + gather4 + 2 * uzp1` // merging the third and fourth gather-loads > > step-3: `v3 = slice (v2, v2)`, `v = or(v1, v3)` // do the final merging > > We have to change `LoadVectorGatherNode` as well. At least making it accept 2 `index` vectors. > > > > As a summary, `LoadVectorGatherNode` will be more complex than before. But the good thing is, giving it one more `index` input is ok. I'm not sure whether this is appliable for other architectures like maybe RVV. But I can try with this change. Do you have better idea? Thanks! > > @XiaohongGong thanks for your reply. > > This idea generally looks good to me. > > For case-2, we have > > ``` > gather1 + gather2 + uzp1: > [0a 0a 0a 0a 0a 0a 0a 0a] > [0b 0b 0b 0b 0b 0b 0b 0b] > uzp1.H => > [bb bb bb bb aa aa aa aa] > ``` > > Can we improve `case-3` by following the pattern of `case-2`? 
> > ``` > step-1: v1 = gather1 + gather2 + uzp1 > [000a 000a 000a 000a 000a 000a 000a 000a] > [000b 000b 000b 000b 000b 000b 000b 000b] > uzp1.H => [0b0b 0b0b 0b0b 0b0b 0a0a 0a0a 0a0a 0a0a] > > step-2: v2 = gather3 + gather4 + uzp1 > [000c 000c 000c 000c 000c 000c 000c 000c] > [000d 000d 000d 000d 000d 000d 000d 000d] > uzp1.H => [0d0d 0d0d 0d0d 0d0d 0c0c 0c0c 0c0c 0c0c] > > step-3: v3 = uzp1 (v1, v2) > [0b0b 0b0b 0b0b 0b0b 0a0a 0a0a 0a0a 0a0a] > [0d0d 0d0d 0d0d 0d0d 0c0c 0c0c 0c0c 0c0c] > uzp1.B => [dddd dddd cccc cccc bbbb bbbb aaaa aaaa] > ``` > > Then we can also consistently define the semantics of `LoadVectorGatherNode` as `gather1 + gather2 + uzp1.H `, which would make backend much cleaner. WDYT? Thanks! We can write a macro-assembler helper for that. Regarding to the definitation of `LoadVectorGatherNode`, we'd better keep the vector type as it is for byte and short vectors. The SVE vector load gather instruction needs the type information. Additionally, the vector layout of the result should be matched with the vector type, right? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3082052940 From bulasevich at openjdk.org Thu Jul 17 01:32:54 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 17 Jul 2025 01:32:54 GMT Subject: [jdk25] Integrated: 8362250: ARM32: forward_exception_entry missing return address In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 16:18:04 GMT, Boris Ulasevich wrote: > This pull request contains a backport of commit [6ed81641](https://github.com/openjdk/jdk/commit/6ed81641b101658fbbd35445b6dd74ec17fc20f3) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The ARM32 ForwardExceptionNode code generation has been updated to set the exception address. This is a minimal, ARM32-specific change, it fixes a couple of failing hotspot jtreg tests. This pull request has now been integrated. Changeset: 5129887d Author: Boris Ulasevich URL: https://git.openjdk.org/jdk/commit/5129887dfead268672403265eb4f3795682ca699 Stats: 9 lines in 1 file changed: 2 ins; 5 del; 2 mod 8362250: ARM32: forward_exception_entry missing return address Reviewed-by: shade Backport-of: 6ed81641b101658fbbd35445b6dd74ec17fc20f3 ------------- PR: https://git.openjdk.org/jdk/pull/26352 From xgong at openjdk.org Thu Jul 17 02:43:48 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 17 Jul 2025 02:43:48 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: <38f2bvFqiVNQGGpMif0iflVFD8wXnyw4SwtKxwi_Dmo=.276fb2fb-b80c-4ea7-a32f-c326294f442a@github.com> <1xAfD3mz5cbQpYtCYxoHqRQcOLadLKNHrvMUtFtFbGo=.34e5780a-e37a-427c-b745-1ed422c7a008@github.com> Message-ID: On Thu, 17 Jul 2025 01:20:44 GMT, Xiaohong Gong wrote: > > > * case-2: 2 times of gather and merge > > > > > > * Can be refined. But the `LoadVectorGatherNode` should be changed to accept 2 `index` vectors. > > > * case-3: 4 times of gather and merge (only for byte) > > > > > > * Can be refined. We can implement it just like: > > > step-1: `v1 = gather1 + gather2 + 2 * uzp1` // merging the first and second gather-loads > > > step-2: `v2 = gather3 + gather4 + 2 * uzp1` // merging the third and fourth gather-loads > > > step-3: `v3 = slice (v2, v2)`, `v = or(v1, v3)` // do the final merging > > > We have to change `LoadVectorGatherNode` as well. At least making it accept 2 `index` vectors. > > > > > > As a summary, `LoadVectorGatherNode` will be more complex than before. 
But the good thing is, giving it one more `index` input is ok. I'm not sure whether this is appliable for other architectures like maybe RVV. But I can try with this change. Do you have better idea? Thanks! > > > > > > @XiaohongGong thanks for your reply. > > This idea generally looks good to me. > > For case-2, we have > > ``` > > gather1 + gather2 + uzp1: > > [0a 0a 0a 0a 0a 0a 0a 0a] > > [0b 0b 0b 0b 0b 0b 0b 0b] > > uzp1.H => > > [bb bb bb bb aa aa aa aa] > > ``` > > > > > > > > > > > > > > > > > > > > > > > > Can we improve `case-3` by following the pattern of `case-2`? > > ``` > > step-1: v1 = gather1 + gather2 + uzp1 > > [000a 000a 000a 000a 000a 000a 000a 000a] > > [000b 000b 000b 000b 000b 000b 000b 000b] > > uzp1.H => [0b0b 0b0b 0b0b 0b0b 0a0a 0a0a 0a0a 0a0a] > > > > step-2: v2 = gather3 + gather4 + uzp1 > > [000c 000c 000c 000c 000c 000c 000c 000c] > > [000d 000d 000d 000d 000d 000d 000d 000d] > > uzp1.H => [0d0d 0d0d 0d0d 0d0d 0c0c 0c0c 0c0c 0c0c] > > > > step-3: v3 = uzp1 (v1, v2) > > [0b0b 0b0b 0b0b 0b0b 0a0a 0a0a 0a0a 0a0a] > > [0d0d 0d0d 0d0d 0d0d 0c0c 0c0c 0c0c 0c0c] > > uzp1.B => [dddd dddd cccc cccc bbbb bbbb aaaa aaaa] > > ``` > > > > > > > > > > > > > > > > > > > > > > > > Then we can also consistently define the semantics of `LoadVectorGatherNode` as `gather1 + gather2 + uzp1.H `, which would make backend much cleaner. WDYT? > > Thanks! Regarding to the definitation of `LoadVectorGatherNode`, we'd better keep the vector type as it is for byte and short vectors. The SVE vector load gather instruction needs the type information. Additionally, the vector layout of the result should be matched with the vector type, right? We can handle this easily with pure backend implementation. But it seems not easy in mid-end IR level. BTW, `uzp1` is SVE specific instruction, we'd better define a common IR for that, which is also useful for other platforms that want to support subword gather API, right? I'm not sure whether this makes sense. I will take a considering for this suggestion. Maybe I can define the vector type of `LoadVectorGatherNode` as int vector type for subword types. An additional flag is necessary to denote whether it is a byte or short loading. It only finishes the gather operation (without any truncating). And define an IR like `VectorConcateNode` to merge all the gather results. It can merge either two gathers or four gathers. For cases that only one time of gather is needed, we can just return a type cast node like `VectorCastI2X`. Seems this will make the IR more common and code more clean. WDYT? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3082248439 From fyang at openjdk.org Thu Jul 17 05:59:58 2025 From: fyang at openjdk.org (Fei Yang) Date: Thu, 17 Jul 2025 05:59:58 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v10] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 14:05:25 GMT, Yuri Gaevsky wrote: >> The patch adds possibility to use RVV instructions for faster vectorizedHashCode calculations on RVV v1.0.0 capable hardware. >> >> Testing: hotspot/jtreg/compiler/ under QEMU-8.1 with RVV v1.0.0. > > Yuri Gaevsky has updated the pull request incrementally with one additional commit since the last revision: > > - removed tail processing with RVV instructions as simple scalar loop provides in general better results Looking at the JMH numbers, it's interesting to find that `-XX:DisableIntrinsic=_vectorizedHashCode` outperforms `-XX:-UseRVV`. 
If that is the case, then why would we want the scalar version (that is `C2_MacroAssembler::arrays_hashcode()`)? ------------- PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-3082664335 From chagedorn at openjdk.org Thu Jul 17 06:25:59 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 17 Jul 2025 06:25:59 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v37] In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 09:38:18 GMT, Roland Westrelin wrote: >> To optimize a long counted loop and long range checks in a long or int >> counted loop, the loop is turned into a loop nest. When the loop has >> few iterations, the overhead of having an outer loop whose backedge is >> never taken, has a measurable cost. Furthermore, creating the loop >> nest usually causes one iteration of the loop to be peeled so >> predicates can be set up. If the loop is short running, then it's an >> extra iteration that's run with range checks (compared to an int >> counted loop with int range checks). >> >> This change doesn't create a loop nest when: >> >> 1- it can be determined statically at loop nest creation time that the >> loop runs for a short enough number of iterations >> >> 2- profiling reports that the loop runs for no more than ShortLoopIter >> iterations (1000 by default). >> >> For 2-, a guard is added which is implemented as yet another predicate. >> >> While this change is in principle simple, I ran into a few >> implementation issues: >> >> - while c2 has a way to compute the number of iterations of an int >> counted loop, it doesn't have that for long counted loop. The >> existing logic for int counted loops promotes values to long to >> avoid overflows. I reworked it so it now works for both long and int >> counted loops. >> >> - I added a new deoptimization reason (Reason_short_running_loop) for >> the new predicate. Given the number of iterations is narrowed down >> by the predicate, the limit of the loop after transformation is a >> cast node that's control dependent on the short running loop >> predicate. Because once the counted loop is transformed, it is >> likely that range check predicates will be inserted and they will >> depend on the limit, the short running loop predicate has to be the >> one that's further away from the loop entry. Now it is also possible >> that the limit before transformation depends on a predicate >> (TestShortRunningLongCountedLoopPredicatesClone is an example), we >> can have: new predicates inserted after the transformation that >> depend on the casted limit that itself depend on old predicates >> added before the transformation. To solve this cicular dependency, >> parse and assert predicates are cloned between the old predicates >> and the loop head. The cloned short running loop parse predicate is >> the one that's used to insert the short running loop predicate. >> >> - In the case of a long counted loop, the loop is transformed into a >> regular loop with a ... > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > test failures `tier1-4,hs-precheckin-comp,hs-comp-stress` looked good! ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/21630#pullrequestreview-3028074186 From mchevalier at openjdk.org Thu Jul 17 07:51:55 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 17 Jul 2025 07:51:55 GMT Subject: RFR: 8361890: Aarch64: Removal of redundant dmb from C1 AtomicLong methods In-Reply-To: <60YMRP6cNslwEeVX2TWmnMYdO872xGaeShKMEj0dWGY=.2f4f504f-93d1-4bab-b721-e5c964f4c465@github.com> References: <60YMRP6cNslwEeVX2TWmnMYdO872xGaeShKMEj0dWGY=.2f4f504f-93d1-4bab-b721-e5c964f4c465@github.com> Message-ID: On Thu, 10 Jul 2025 15:49:40 GMT, Samuel Chee wrote: > The current C1 implementation of AtomicLong methods > which either adds or exchanges (such as getAndAdd) > emit one of a ldaddal and swpal respectively when using > LSE as well as an immediately proceeding dmb. Since > ldaddal/swpal have both acquire and release semantics, > this provides similar ordering guarantees to a dmb.full > so the dmb here is redundant and can be removed. > > This is due to both clause 7 and clause 11 of the > definition of Barrier-ordered-before in B2.3.7 of the > DDI0487 L.a Arm Architecture Reference Manual for A-profile > architecture being satisfied by the existence of a > ldaddal/swpal which ensures such memory ordering guarantees. Hi! Thanks for looking at this. I've started some testing, will keep you updated. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26245#issuecomment-3082996958 From thartmann at openjdk.org Thu Jul 17 08:31:57 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 17 Jul 2025 08:31:57 GMT Subject: RFR: 8358573: Remove the -XX:-InstallMethods debug flag [v2] In-Reply-To: <5wFwGsfzeYOy0uMPSbg9TfzEzZqhgPYLcY26FF7My9s=.1b474ce7-ea02-4adb-b938-31208cb31ec3@github.com> References: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> <5wFwGsfzeYOy0uMPSbg9TfzEzZqhgPYLcY26FF7My9s=.1b474ce7-ea02-4adb-b938-31208cb31ec3@github.com> Message-ID: On Wed, 16 Jul 2025 13:49:56 GMT, Beno?t Maillard wrote: >> This PR prevents from hitting an assert when disabling method installation at the end of a successful compilation with the `-XX:-InstallMethods` flag. Previously `CompileBroker` failed to mark the `CompileTask` as complete when using this flag. >> >> ## Analysis >> >> We can see that the assert is triggered in `CompileBroker::invoke_compiler_on_method`: >> >> ```c++ >> if (!ci_env.failing() && !task->is_success()) { >> assert(ci_env.failure_reason() != nullptr, "expect failure reason"); >> assert(false, "compiler should always document failure: %s", ci_env.failure_reason()); >> // The compiler elected, without comment, not to register a result. >> // Do not attempt further compilations of this method. >> ci_env.record_method_not_compilable("compile failed"); >> } >> >> >> The `task->is_success()` call accesses the private `_is_success` field. >> This field is modified in `CompileTask::mark_success`. 
>> >> By setting a breakpoint there, and execute the program without `-XX:-InstallMethods`, we get the following stacktrace: >> >> CompileTask::mark_success compileTask.hpp:185 >> nmethod::post_compiled_method nmethod.cpp:2212 >> ciEnv::register_method ciEnv.cpp:1127 >> Compilation::install_code c1_Compilation.cpp:425 >> Compilation::compile_method c1_Compilation.cpp:488 >> Compilation::Compilation c1_Compilation.cpp:609 >> Compiler::compile_method c1_Compiler.cpp:262 >> CompileBroker::invoke_compiler_on_method compileBroker.cpp:2324 >> CompileBroker::compiler_thread_loop compileBroker.cpp:1968 >> CompilerThread::thread_entry compilerThread.cpp:67 >> JavaThread::thread_main_inner javaThread.cpp:773 >> JavaThread::run javaThread.cpp:758 >> Thread::call_run thread.cpp:243 >> thread_native_entry os_linux.cpp:868 >> >> >> We go up the stacktrace and see that in `Compilation::compile_method` we have: >> >> ```c++ >> if (should_install_code()) { >> // install code >> PhaseTraceTime timeit(_t_codeinstall); >> install_code(frame_size); >> } >> >> >> If we do not install methods after compilation, the code path that marks the success is never executed >> and therefore results in hitting the assert. >> >> ### Fix >> We simply mark the task as complete when `should_install_code()` evaluates to `false` in the block code above. >> After careful consideration, it was decided to simply get rid of the `-XX:-InstallMethods` flag. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8358573) >> - [... > > Beno?t Maillard has updated the pull request incrementally with three additional commits since the last revision: > > - 8358573: get rid of InstallMethods flags completely > - Revert "8358573: Add missing task success notification" > > This reverts commit cd91c7c06ba05aba3500b95ba1317539363aa63c. > - Revert "8358573: Add test for -XX:-InstallMethods" > > This reverts commit 6eab84718c3b60c2585bc2711c4bc8144472975b. Looks good. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26310#pullrequestreview-3028488308 From mchevalier at openjdk.org Thu Jul 17 08:48:35 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 17 Jul 2025 08:48:35 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph Message-ID: Some crashes are consequences of earlier misshaped ideal graphs, which could be detected earlier, closer to the source, before the possibly many transformations that lead to the crash. Let's verify that the ideal graph is well-shaped earlier then! I propose here such a feature. This runs after IGVN, because at this point, the graph, should be cleaned up for any weirdness happening earlier or during IGVN. This feature is enabled with the develop flag `VerifyIdealStructuralInvariants`. Open to renaming. No problem with me! This feature is only available in debug builds, and most of the code is even not compiled in product, since it uses some debug-only functions, such as `Node::dump` or `Node::Name`. For now, only local checks are implemented: they are checks that only look at a node and its neighborhood, wherever it happens in the graph. Typically: under a `If` node, we have a `IfTrue` and a `IfFalse`. To ease development, each check is implemented in its own class, independently of the others. Nevertheless, one needs to do always the same kind of things: checking there is an output of such type, checking there is N inputs, that the k-th input has such type... 
To ease writing such checks, in a readable way, and in a less error-prone way than pile of copy-pasted code that manually traverse the graph, I propose a set of compositional helpers to write patterns that can be matched against the ideal graph. Since these patterns are... patterns, so not related to a specific graph, they can be allocated once and forever. When used, one provides the node (called center) around which one want to check if the pattern holds. On top of making the description of pattern easier, these helpers allows nice printing in case of error, by showing the path from the center to the violating node. For instance (made up for the purpose of showing the formatting), a violation with a path climbing only inputs: 1 failure for node 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 At node 209 CountedLoopEnd === 182 208 [[ 210 197 ]] [lt] P=0,948966, C=23799,000000 !orig=[196] !jvms: StringLatin1::equals @ bci:12 (line 100) From path: [center] 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 <-(0)- 215 SafePoint === 210 1 7 1 1 216 37 54 185 [[ 211 ]] SafePoint !orig=186 !jvms: StringLatin1::equals @ bci:29 (line 100) <-(0)- 210 IfFalse === 209 [[ 215 216 ]] #0 !orig=198 !jvms: StringLatin1::equals @ bci:12 (line 100) <-(0)- 209 CountedLoopEnd === 182 208 [[ 210 197 ]] [lt] P=0,948966, C=23799,000000 !orig=[196] !jvms: StringLatin1::equals @ bci:12 (line 100) # OuterStripMinedLoopInvariants: Unexpected type: CountedLoopEnd. or with outputs: 1 failure for node 413 OuterStripMinedLoopEnd === 417 41 [[ 414 399 ]] P=0,960468, C=22887,000000 At node 415 OuterStripMinedLoop === 415 180 414 [[ 415 416 ]] From path: [center] 413 OuterStripMinedLoopEnd === 417 41 [[ 414 399 ]] P=0,960468, C=22887,000000 --> 414 IfTrue === 413 [[ 415 ]] #1 --> 415 OuterStripMinedLoop === 415 180 414 [[ 415 416 ]] # OuterStripMinedLoopInvariants: Non-unique output of expected type. Found: 0. So far a small set of checks are implemented: - IfProjections: check that `If` nodes have a `IfTrue` and `IfFalse` - PhiArity: check that `Phi` nodes have a `Region` node of the same arity as 0th input - ControlSuccessor: check that control nodes have the right amount of successors (usually 1, but 2 for if-related nodes...) - RegionSelfLoop: check that regions are either copy, or have a self loop as 0th input - CountedLoopInvariants: check the structure around the backedge of a counted loop - OuterStripMinedLoopInvariants: check the structure around `OuterStripMinedLoopEnd` - MultiBranchNodeOut: check that for `MultiBranch`, `outcnt` is smaller than or equal to `required_outcnt` (it is legitimate to have a smaller number of output, especially after some optimizations). Some of these checks have an additional subtlety: it's ok to have some wrong shape in dead code, for instance `IfProjections`. After a lot of investigation, it seems that some dead loops are not always detected eagerly and can make some control path survive longer, until being removed before loop opts. This seems to be by design to avoid traversing the whole graph everytime a region lose an input. It seems such misshape is harmless because they are not reachable from the inputs, and the cost of removing them would be prohibitive. To deal with such cases, when such a check fails, we check whether it happened in dead code. The dead of unreachable control nodes is lazily computed to answer that, and it's shared across checkers. 
While computing unreachable nodes is somewhat expensive, it seems to happen rarely in practice. This verification has found [JDK-8359344](https://bugs.openjdk.org/browse/JDK-8359344) and [JDK-8359121](https://bugs.openjdk.org/browse/JDK-8359121). It has been run on tiers 1 to 3, plus some internal testing and, after fixing the above-mentioned, it seems all passing! Related future: add more checks, should be easy. Less related future: could we imagine using similar patterns (without the error reporting mechanism) to use for optimizations, instead of manual traversing? It could make the code clearer to understand. We could also imagine optionally using such things in idealization to declare which patterns nodes are looking for, and if they have depth greater than 1, automatically adapting the enqueuing strategy without having to pimp `PhaseIterGVN::add_users_of_use_to_worklist` everytime. Could at least cover some basic (but numerous) cases. ------------- Commit messages: - Fix declaration - More comments - Improve printing and memory footprint - Improve printing - Handle NeverBranch in ControlSuccessor - Verify structural invariants Changes: https://git.openjdk.org/jdk/pull/26362/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26362&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8350864 Stats: 728 lines in 5 files changed: 728 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26362.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26362/head:pull/26362 PR: https://git.openjdk.org/jdk/pull/26362 From fgao at openjdk.org Thu Jul 17 09:04:49 2025 From: fgao at openjdk.org (Fei Gao) Date: Thu, 17 Jul 2025 09:04:49 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: <38f2bvFqiVNQGGpMif0iflVFD8wXnyw4SwtKxwi_Dmo=.276fb2fb-b80c-4ea7-a32f-c326294f442a@github.com> <1xAfD3mz5cbQpYtCYxoHqRQcOLadLKNHrvMUtFtFbGo=.34e5780a-e37a-427c-b745-1ed422c7a008@github.com> Message-ID: On Thu, 17 Jul 2025 02:41:20 GMT, Xiaohong Gong wrote: > Thanks! Regarding to the definitation of `LoadVectorGatherNode`, we'd better keep the vector type as it is for byte and short vectors. The SVE vector load gather instruction needs the type information. Additionally, the vector layout of the result should be matched with the vector type, right? We can handle this easily with pure backend implementation. But it seems not easy in mid-end IR level. BTW, `uzp1` is SVE specific instruction, we'd better define a common IR for that, which is also useful for other platforms that want to support subword gather API, right? That makes sense to me. Thanks for your explanation! > Maybe I can define the vector type of `LoadVectorGatherNode` as int vector type for subword types. An additional flag is necessary to denote whether it is a byte or short loading. It only finishes the gather operation (without any truncating). And define an IR like `VectorConcateNode` to merge the gather results. For cases that only one time of gather is needed, we can just return a type cast node like `VectorCastI2X`. Seems this will make the IR more common and code more clean. 
> > The implementation would like: > > * case-1 one gather: > > * `gather (bt: int)` + `cast (bt: byte|short)` > * case-2 two gathers: > > * step-1: `gather1 (bt: int)` + `gather2 (bt: int)` + `concate(gather1, gather2) (bt: short)` > * step-2: `cast (bt: byte)` // just for byte vectors > * case-3 four gathers: > > * step-1: `gather1 (bt: int)` + `gather2 (bt: int)` + `concate(gather1, gather2) (bt: short)` > * step-2: `gather3 (bt: int)` + `gather4 (bt: int)` + `concate(gather3, gather3) (bt: short)` > * step-3: `concate (bt: byte)` > > Or more commonly: > > * case-1 one gather: > > * `gather (bt: int)` + `cast (bt: byte|short)` > * case-2 two gathers: > > * step-1: `gather1 (bt: int)` + `gather2 (bt: int)` + `concate(gather1, gather2) (bt: byte|short)` > * case-3 four gathers: > > * step-1: `gather1 (bt: int)` + `gather2 (bt: int)` + `gather3 (bt: int)` + `gather4 (bt: int)` > * step-2: `concate(gather1, gather2, gather3, gather4) (bt: byte|short)` > > From the IR level, which one do you think is better? I like this idea! The first one looks better, in which `concate` would provide lower-level and more fine-grained semantics, allowing us to define fewer IR node types while supporting more scenarios. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3083240544 From xgong at openjdk.org Thu Jul 17 09:04:50 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 17 Jul 2025 09:04:50 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: <38f2bvFqiVNQGGpMif0iflVFD8wXnyw4SwtKxwi_Dmo=.276fb2fb-b80c-4ea7-a32f-c326294f442a@github.com> <1xAfD3mz5cbQpYtCYxoHqRQcOLadLKNHrvMUtFtFbGo=.34e5780a-e37a-427c-b745-1ed422c7a008@github.com> Message-ID: <4tejg5hp-eHBmAEvKbpTg_mv_TUYU5kg0HIccmWyac8=.3638758e-5000-4d1f-924f-abb4a21952c6@github.com> On Thu, 17 Jul 2025 08:59:08 GMT, Fei Gao wrote: > I like this idea! The first one looks better, in which `concate` would provide lower-level and more fine-grained semantics, allowing us to define fewer IR node types while supporting more scenarios. Yes, I agree with you. I'm now working on refactoring the IR based on the first idea. I will update the patch as soon as possible. Thanks for your valuable suggestion! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3083249910 From duke at openjdk.org Thu Jul 17 09:09:14 2025 From: duke at openjdk.org (erifan) Date: Thu, 17 Jul 2025 09:09:14 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4] In-Reply-To: References: Message-ID: > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is > relative smaller than that of `fromLong`. So this patch does the conversion for these cases. > > The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. > > Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. > > This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. 
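To make the equivalence between `VectorMask.fromLong` with an all-lanes constant and `maskAll(true)` concrete, here is a small, self-contained Java sketch. It is illustrative only (not one of this PR's benchmarks), assumes a 256-bit int species, and needs `--add-modules jdk.incubator.vector` to run.

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

public class FromLongVsMaskAll {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256; // 8 int lanes

    public static void main(String[] args) {
        // A long constant whose low SPECIES.length() bits are all ones sets every lane.
        long allLanes = -1L >>> (64 - SPECIES.length());

        VectorMask<Integer> viaFromLong = VectorMask.fromLong(SPECIES, allLanes);
        VectorMask<Integer> viaMaskAll  = SPECIES.maskAll(true);

        // Both masks have every lane set, so they round-trip to the same bits.
        System.out.println(viaFromLong.allTrue());                       // true
        System.out.println(viaFromLong.toLong() == viaMaskAll.toLong()); // true
    }
}
```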
> > As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like > > VectorMaskToLong (VectorLongToMask x) => x > > > Hence, this patch also added the following optimizations: > > VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 > VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 > > VectorMaskCast (VectorMaskCast x) => x > > And we can see noticeable performance improvement with the above optimizations for floating-point types. > > Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: > > Benchmark Unit Before Error After Error Uplift > microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 > microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 > microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 > microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 > microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 > microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 > microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 > microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 > > > Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: > > Benchmark Unit Before Error After Error Uplift > microMaskFromLongToLong_Double... erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: - Refactor the implementation Do the convertion in C2's IGVN phase to cover more cases. - Merge branch 'master' into JDK-8356760 - Simplify the test code - Address some review comments Add support for the following patterns: toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) toLong(maskAll(false)) => 0 And add more test cases. - Merge branch 'master' into JDK-8356760 - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is relative smaller than that of `fromLong`. This patch does the conversion for these cases if `l` is a compile time constant. And this conversion also enables further optimizations that recognize maskAll patterns, see [1]. Some JTReg test cases are added to ensure the optimization is effective. I tried many different ways to write a JMH benchmark, but failed. Since the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific compile-time constant, the statement will be hoisted out of the loop. If we don't use a loop, the hotspot will become other instructions, and no obvious performance change was observed. However, combined with the optimization of [1], we can observe a performance improvement of about 7% on both aarch64 and x64. The patch was tested on both aarch64 and x64, all of tier1 tier2 and tier3 tests passed. 
[1] https://github.com/openjdk/jdk/pull/24674 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25793/files - new: https://git.openjdk.org/jdk/pull/25793/files/9f07d5c7..8ebe5e56 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=02-03 Stats: 21470 lines in 667 files changed: 10937 ins; 6238 del; 4295 mod Patch: https://git.openjdk.org/jdk/pull/25793.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25793/head:pull/25793 PR: https://git.openjdk.org/jdk/pull/25793 From duke at openjdk.org Thu Jul 17 09:09:52 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Thu, 17 Jul 2025 09:09:52 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v10] In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 05:57:33 GMT, Fei Yang wrote: > Looking at the JMH numbers, it's interesting to find that `-XX:DisableIntrinsic=_vectorizedHashCode` outperforms `-XX:-UseRVV`. If that is the case, then why would we want the scalar version (that is `C2_MacroAssembler::arrays_hashcode()`)? You are right: the non-RVV version of intrinsic performs worse on BPI-F3 hardware with size > 70, though originally it was better on StarFive JH7110 and T-Head RVB-ICE, please see https://github.com/openjdk/jdk/pull/16629. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-3083267477 From duke at openjdk.org Thu Jul 17 09:12:08 2025 From: duke at openjdk.org (erifan) Date: Thu, 17 Jul 2025 09:12:08 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: <7QVWVj5vpSB42THa2rx-oxMqhH76qMZ5MBJjindRiLo=.b825076a-aa9c-4b86-94b6-0a593f2240ac@github.com> References: <_RERljqu_FG7ZyneAk7Thd-9TwED18pQpEBz_i105fY=.b8948a23-273a-49f6-b9cb-6b611a5eedc6@github.com> <6SXA9ZrXBDhZLyXP3lXbkpl4dl3iocvDpzPrUpIQOl8=.9b025be2-848b-4b78-a5e4-929cb7e9f798@github.com> <7QVWVj5vpSB42THa2rx-oxMqhH76qMZ5MBJjindRiLo=.b825076a-aa9c-4b86-94b6-0a593f2240ac@github.com> Message-ID: On Thu, 10 Jul 2025 08:06:18 GMT, erifan wrote: >>> What if during iterative GVN a constant -1 seeps through IR graph and gets connected to the input of VectorLongToMaskNode, you won't be able to create maskAll true in that case? >> >> Yes, this PR doesn't support this case. Maybe we should do this optimization in `ideal`. If `VectorLongToMask` is not supported, then try to convert it to `maskAll` or `Replicate` in intrinsic. >> >>> Do you see any advantage of doing this at intrinsic layer over entirely handling it in Java implimentation by simply modifying the opcode of fromBitsCoerced to MODE_BROADCAST from existing MODE_BITS_COERCED_LONG_TO_MASK for 0 or -1 input. >> >> I had tried this method and gave it up, because it has up to 34% performance regression for specific cases on x64. > > OK. But in order to cover various cases, the implementation may be a bit troublesome. The solution I thought of is to **check whether the architecture supports VectorLongToMask, MaskAll and Replicate in `LibraryCallKit::inline_vector_frombits_coerced`. If it does, generate VectorLongToMask, and then convert it to MaskAll or Replicate in IGVN**. This is similar to the current implementation of vector rotate. > > At the same time, this conversion may affect some other optimizations, such as `VectorMaskToLong(VectorLongToMask (x)) => x` and `VectorStoreMask(VectorLoadMask (x)) => x`. So we also need to fix these effects. 
I have refactor the implementation, please help take a look, thanks~ ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2212787090 From duke at openjdk.org Thu Jul 17 09:16:50 2025 From: duke at openjdk.org (erifan) Date: Thu, 17 Jul 2025 09:16:50 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4] In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 09:09:14 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is >> relative smaller than that of `fromLong`. So this patch does the conversion for these cases. >> >> The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. >> >> Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. >> >> This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like >> >> VectorMaskToLong (VectorLongToMask x) => x >> >> >> Hence, this patch also added the following optimizations: >> >> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> >> VectorMaskCast (VectorMaskCast x) => x >> >> And we can see noticeable performance improvement with the above optimizations for floating-point types. >> >> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Error After Error Uplift >> microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 >> microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 >> microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 >> microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 >> microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 >> microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 >> microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 >> microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 >> >> >> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: >> >> Benchm... > > erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - Refactor the implementation > > Do the convertion in C2's IGVN phase to cover more cases. 
> - Merge branch 'master' into JDK-8356760 > - Simplify the test code > - Address some review comments > > Add support for the following patterns: > toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) > toLong(maskAll(false)) => 0 > > And add more test cases. > - Merge branch 'master' into JDK-8356760 > - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases > > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would > set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent > to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is > relative smaller than that of `fromLong`. This patch does the conversion > for these cases if `l` is a compile time constant. > > And this conversion also enables further optimizations that recognize > maskAll patterns, see [1]. > > Some JTReg test cases are added to ensure the optimization is effective. > > I tried many different ways to write a JMH benchmark, but failed. Since > the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific > compile-time constant, the statement will be hoisted out of the loop. > If we don't use a loop, the hotspot will become other instructions, and > no obvious performance change was observed. However, combined with the > optimization of [1], we can observe a performance improvement of about > 7% on both aarch64 and x64. > > The patch was tested on both aarch64 and x64, all of tier1 tier2 and > tier3 tests passed. > > [1] https://github.com/openjdk/jdk/pull/24674 As @jatin-bhateja suggested, I have refactored the implementation and updated the commit message, please help review this PR, thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3083287653 From shade at openjdk.org Thu Jul 17 09:38:59 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 17 Jul 2025 09:38:59 GMT Subject: RFR: 8358573: Remove the -XX:-InstallMethods debug flag [v2] In-Reply-To: <5wFwGsfzeYOy0uMPSbg9TfzEzZqhgPYLcY26FF7My9s=.1b474ce7-ea02-4adb-b938-31208cb31ec3@github.com> References: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> <5wFwGsfzeYOy0uMPSbg9TfzEzZqhgPYLcY26FF7My9s=.1b474ce7-ea02-4adb-b938-31208cb31ec3@github.com> Message-ID: On Wed, 16 Jul 2025 13:49:56 GMT, Beno?t Maillard wrote: >> This PR prevents from hitting an assert when disabling method installation at the end of a successful compilation with the `-XX:-InstallMethods` flag. Previously `CompileBroker` failed to mark the `CompileTask` as complete when using this flag. >> >> ## Analysis >> >> We can see that the assert is triggered in `CompileBroker::invoke_compiler_on_method`: >> >> ```c++ >> if (!ci_env.failing() && !task->is_success()) { >> assert(ci_env.failure_reason() != nullptr, "expect failure reason"); >> assert(false, "compiler should always document failure: %s", ci_env.failure_reason()); >> // The compiler elected, without comment, not to register a result. >> // Do not attempt further compilations of this method. >> ci_env.record_method_not_compilable("compile failed"); >> } >> >> >> The `task->is_success()` call accesses the private `_is_success` field. >> This field is modified in `CompileTask::mark_success`. 
>> >> By setting a breakpoint there, and execute the program without `-XX:-InstallMethods`, we get the following stacktrace: >> >> CompileTask::mark_success compileTask.hpp:185 >> nmethod::post_compiled_method nmethod.cpp:2212 >> ciEnv::register_method ciEnv.cpp:1127 >> Compilation::install_code c1_Compilation.cpp:425 >> Compilation::compile_method c1_Compilation.cpp:488 >> Compilation::Compilation c1_Compilation.cpp:609 >> Compiler::compile_method c1_Compiler.cpp:262 >> CompileBroker::invoke_compiler_on_method compileBroker.cpp:2324 >> CompileBroker::compiler_thread_loop compileBroker.cpp:1968 >> CompilerThread::thread_entry compilerThread.cpp:67 >> JavaThread::thread_main_inner javaThread.cpp:773 >> JavaThread::run javaThread.cpp:758 >> Thread::call_run thread.cpp:243 >> thread_native_entry os_linux.cpp:868 >> >> >> We go up the stacktrace and see that in `Compilation::compile_method` we have: >> >> ```c++ >> if (should_install_code()) { >> // install code >> PhaseTraceTime timeit(_t_codeinstall); >> install_code(frame_size); >> } >> >> >> If we do not install methods after compilation, the code path that marks the success is never executed >> and therefore results in hitting the assert. >> >> ### Fix >> We simply mark the task as complete when `should_install_code()` evaluates to `false` in the block code above. >> After careful consideration, it was decided to simply get rid of the `-XX:-InstallMethods` flag. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8358573) >> - [... > > Beno?t Maillard has updated the pull request incrementally with three additional commits since the last revision: > > - 8358573: get rid of InstallMethods flags completely > - Revert "8358573: Add missing task success notification" > > This reverts commit cd91c7c06ba05aba3500b95ba1317539363aa63c. > - Revert "8358573: Add test for -XX:-InstallMethods" > > This reverts commit 6eab84718c3b60c2585bc2711c4bc8144472975b. I agree CTW does not need this flag. We have (or will have) enough internal APIs to avoid installing the code, if we need it for any reason. ------------- Marked as reviewed by shade (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26310#pullrequestreview-3028738102 From vlivanov at openjdk.org Thu Jul 17 09:45:47 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Thu, 17 Jul 2025 09:45:47 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 07:25:10 GMT, Marc Chevalier wrote: > Some crashes are consequences of earlier misshaped ideal graphs, which could be detected earlier, closer to the source, before the possibly many transformations that lead to the crash. > > Let's verify that the ideal graph is well-shaped earlier then! I propose here such a feature. This runs after IGVN, because at this point, the graph, should be cleaned up for any weirdness happening earlier or during IGVN. > > This feature is enabled with the develop flag `VerifyIdealStructuralInvariants`. Open to renaming. No problem with me! This feature is only available in debug builds, and most of the code is even not compiled in product, since it uses some debug-only functions, such as `Node::dump` or `Node::Name`. > > For now, only local checks are implemented: they are checks that only look at a node and its neighborhood, wherever it happens in the graph. Typically: under a `If` node, we have a `IfTrue` and a `IfFalse`. 
To ease development, each check is implemented in its own class, independently of the others. Nevertheless, one needs to do always the same kind of things: checking there is an output of such type, checking there is N inputs, that the k-th input has such type... To ease writing such checks, in a readable way, and in a less error-prone way than pile of copy-pasted code that manually traverse the graph, I propose a set of compositional helpers to write patterns that can be matched against the ideal graph. Since these patterns are... patterns, so not related to a specific graph, they can be allocated once and forever. When used, one provides the node (called center) around which one want to check if the pattern holds. > > On top of making the description of pattern easier, these helpers allows nice printing in case of error, by showing the path from the center to the violating node. For instance (made up for the purpose of showing the formatting), a violation with a path climbing only inputs: > > 1 failure for node > 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 > At node > 209 CountedLoopEnd === 182 208 [[ 210 197 ]] [lt] P=0,948966, C=23799,000000 !orig=[196] !jvms: StringLatin1::equals @ bci:12 (line 100) > From path: > [center] 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 > <-(0)- 215 SafePoint === 210 1 7 1 1 216 37 54 185 [[ 211 ]] SafePoint !orig=186 !jvms: StringLatin1::equals @ bci:29 (line 100) > <-(0)- 210 IfFalse === 209 [[ 215 216 ]] #0 !orig=198 !jvms: StringL... Very nice! Some high-level comments: * IMO it's better to have node-specific invariant checks co-located with corresponding node (as `Node::verify()` maybe?); it would make it clearer what are the expectations when changing the implementation. * on naming: IMO `VerifyIdealGraph` would clearly describe what the logic does, fits existing conventions well, and easy to find ------------- PR Review: https://git.openjdk.org/jdk/pull/26362#pullrequestreview-3028766202 From luhenry at openjdk.org Thu Jul 17 10:18:51 2025 From: luhenry at openjdk.org (Ludovic Henry) Date: Thu, 17 Jul 2025 10:18:51 GMT Subject: RFR: 8362284: RISC-V: cleanup NativeMovRegMem [v2] In-Reply-To: References: <7QlEqzQUzoDK6NycLx0HECjospeft1MwbOQh7aHVq8U=.efa2d92e-a024-4447-9565-8f6ee7ee4774@github.com> Message-ID: <0GTMPYRNr_vi-NOJY3b5KYX68iXKPB3A4x_rv6K1J2c=.e824b546-3c1b-4945-ad0d-102aca58d26b@github.com> On Wed, 16 Jul 2025 10:48:23 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review this simple patch? >> >> NativeMovRegMem on riscv is actually dead code, but still needed in case of compilation of C1. >> So make the code as simple as possible to avoid any reading and maintainance effort. >> >> No tests, as `offset()` and `set_offset()` were Unimplemented and used in C1 and never triggered before. >> >> Thanks! > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > use nullptr Marked as reviewed by luhenry (Committer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/26328#pullrequestreview-3028884415 From mli at openjdk.org Thu Jul 17 10:48:55 2025 From: mli at openjdk.org (Hamlin Li) Date: Thu, 17 Jul 2025 10:48:55 GMT Subject: Integrated: 8362284: RISC-V: cleanup NativeMovRegMem In-Reply-To: <7QlEqzQUzoDK6NycLx0HECjospeft1MwbOQh7aHVq8U=.efa2d92e-a024-4447-9565-8f6ee7ee4774@github.com> References: <7QlEqzQUzoDK6NycLx0HECjospeft1MwbOQh7aHVq8U=.efa2d92e-a024-4447-9565-8f6ee7ee4774@github.com> Message-ID: On Tue, 15 Jul 2025 18:41:56 GMT, Hamlin Li wrote: > Hi, > Can you help to review this simple patch? > > NativeMovRegMem on riscv is actually dead code, but still needed in case of compilation of C1. > So make the code as simple as possible to avoid any reading and maintainance effort. > > No tests, as `offset()` and `set_offset()` were Unimplemented and used in C1 and never triggered before. > > Thanks! This pull request has now been integrated. Changeset: 3fd89be6 Author: Hamlin Li URL: https://git.openjdk.org/jdk/commit/3fd89be6d1a51b6fc99f4c0b5daba7a4bd64a08e Stats: 40 lines in 2 files changed: 0 ins; 33 del; 7 mod 8362284: RISC-V: cleanup NativeMovRegMem Reviewed-by: fyang, luhenry ------------- PR: https://git.openjdk.org/jdk/pull/26328 From mli at openjdk.org Thu Jul 17 11:14:00 2025 From: mli at openjdk.org (Hamlin Li) Date: Thu, 17 Jul 2025 11:14:00 GMT Subject: RFR: 8362493: Cleanup CodeBuffer::copy_relocations_to Message-ID: Hi, Can you help to review this simple patch? `CodeBuffer::copy_relocations_to(address buf, csize_t buf_limit, bool only_inst)` is only used in `copy_relocations_to(CodeBlob* dest)` which passes false to only_inst, so the former one should be able to be simplified. Thank you! ------------- Commit messages: - initial commit Changes: https://git.openjdk.org/jdk/pull/26366/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26366&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362493 Stats: 10 lines in 2 files changed: 1 ins; 6 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/26366.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26366/head:pull/26366 PR: https://git.openjdk.org/jdk/pull/26366 From mchevalier at openjdk.org Thu Jul 17 11:21:54 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 17 Jul 2025 11:21:54 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 09:42:47 GMT, Vladimir Ivanov wrote: > IMO it's better to have node-specific invariant checks co-located with corresponding node (as Node::verify() maybe?); it would make it clearer what are the expectations when changing the implementation. I understand the motivation, but I'm not sure what to do in every case. For instance, when the pattern is not so small (like strip mining), it's hard to associate the invariant with a single node: many are involved, and it's not really describable as the expectation of a single node. Of course, we could split the pattern into a lot of sub-patterns, centered each around each node type, but then, we lose the overview of the structure, and it becomes context-free (e.g. a IfFalse must have a CountedLoopEnd input only when it comes before a safepoint before a OuterStripMinedLoopEnd, but not in general). Another problematic case is the control successor check that has special handling for some kinds, that could be relocated to the said node types, but for the general case, it simply tests whether the node is a CFG node. 
One could still do that in `Node`, but it is then not tied to a specific node type, and I feel it bloats the `Node` class (which is already not so small). And then, I fear it makes the invariant harder to read since it will be distributed across many node types: I would need to find overrides of `Verify`. When working on a given node, it may be easier to see what I need to guarantee (or change the invariant), but when working on something else, it makes it harder to find the invariants I can actually rely on, because they could always be overridden in a derived class.

I also have a readability concern. Even if we sort them by node type, we then mix the implementation of all invariants of a given node in a single method, making it extra hard to understand the big picture, and when looking for overrides, I will find some, but maybe they won't be about the invariant I'm interested in.

And a code/maintenance concern. If I have a default implementation of the control successor check in `Node`, among other such general checks, I'm tempted to override it in `IfNode` to accept having more than one successor, but then, how do I perform the other general checks that still hold? I can't call `Node::Verify` since it will enforce the wrong number of successors. I could put these checks in another method and call it from `IfNode::Verify`, but that has other annoying consequences: if they are all in the same side method, I can't customize another of these checks in another node type; if each check is in its own method, all called from `Node::Verify`, I need to repeat the call in the overrides of `Verify` for all the checks one will add in the future... Overall, it seems risky to maintain. We could also just call `Node::Verify` and have some handling there to skip some steps for some node types, but I feel like that defeats the point of having invariants closer to the type.

Overall, it seems to me that it's beneficial to move checks to the node types if:
- the pattern is small and clearly has a privileged node, so we won't be surprised by having the invariant implemented in another node type
- the pattern doesn't have special cases for sub-types.

For instance, `PhiArity` would be a good candidate (about a special kind of node, no context needed, no exception). So, maybe a solution would be to split the checks into two sources: some that are like this, implemented directly in the node, and some that are less local (not about a single node, but about bigger shapes) or need more special cases, and that we keep standalone. I don't think having two sources of invariants is a problem at all.

> on naming: IMO VerifyIdealGraph would clearly describe what the logic does, fits existing conventions well, and easy to find

Sure, fine with me! I'd be curious to see if somebody has other ideas.
------------- PR Comment: https://git.openjdk.org/jdk/pull/26362#issuecomment-3083672692 PR Comment: https://git.openjdk.org/jdk/pull/26362#issuecomment-3083677141 From fgao at openjdk.org Thu Jul 17 11:30:48 2025 From: fgao at openjdk.org (Fei Gao) Date: Thu, 17 Jul 2025 11:30:48 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: <4tejg5hp-eHBmAEvKbpTg_mv_TUYU5kg0HIccmWyac8=.3638758e-5000-4d1f-924f-abb4a21952c6@github.com> References: <38f2bvFqiVNQGGpMif0iflVFD8wXnyw4SwtKxwi_Dmo=.276fb2fb-b80c-4ea7-a32f-c326294f442a@github.com> <1xAfD3mz5cbQpYtCYxoHqRQcOLadLKNHrvMUtFtFbGo=.34e5780a-e37a-427c-b745-1ed422c7a008@github.com> <4tejg5hp-eHBmAEvKbpTg_mv_TUYU5kg0HIccmWyac8=.3638758e-5000-4d1f-924f-abb4a21952c6@github.com> Message-ID: On Thu, 17 Jul 2025 09:02:00 GMT, Xiaohong Gong wrote: > > Yes, I agree with you. I'm now working on refactoring the IR based on the first idea. I will update the patch as soon as possible. Thanks for your valuable suggestion! Thanks! I?d suggest also highlighting `aarch64` in the JBS title, so others who are interested won?t miss it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3083699788 From mhaessig at openjdk.org Thu Jul 17 12:16:48 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Thu, 17 Jul 2025 12:16:48 GMT Subject: RFR: 8362493: Cleanup CodeBuffer::copy_relocations_to In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 11:09:09 GMT, Hamlin Li wrote: > Hi, > Can you help to review this simple patch? > > `CodeBuffer::copy_relocations_to(address buf, csize_t buf_limit, bool only_inst)` is only used in `copy_relocations_to(CodeBlob* dest)` which passes false to only_inst, so the former one should be able to be simplified. > > Thank you! Thank you for working on this cleanup, @Hamlin-Li! It looks good to me. I kicked off some testing on our side and will let you know what the results are. ------------- PR Review: https://git.openjdk.org/jdk/pull/26366#pullrequestreview-3029262121 From duke at openjdk.org Thu Jul 17 12:29:54 2025 From: duke at openjdk.org (duke) Date: Thu, 17 Jul 2025 12:29:54 GMT Subject: RFR: 8358573: Remove the -XX:-InstallMethods debug flag [v2] In-Reply-To: <5wFwGsfzeYOy0uMPSbg9TfzEzZqhgPYLcY26FF7My9s=.1b474ce7-ea02-4adb-b938-31208cb31ec3@github.com> References: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> <5wFwGsfzeYOy0uMPSbg9TfzEzZqhgPYLcY26FF7My9s=.1b474ce7-ea02-4adb-b938-31208cb31ec3@github.com> Message-ID: On Wed, 16 Jul 2025 13:49:56 GMT, Beno?t Maillard wrote: >> This PR prevents from hitting an assert when disabling method installation at the end of a successful compilation with the `-XX:-InstallMethods` flag. Previously `CompileBroker` failed to mark the `CompileTask` as complete when using this flag. >> >> ## Analysis >> >> We can see that the assert is triggered in `CompileBroker::invoke_compiler_on_method`: >> >> ```c++ >> if (!ci_env.failing() && !task->is_success()) { >> assert(ci_env.failure_reason() != nullptr, "expect failure reason"); >> assert(false, "compiler should always document failure: %s", ci_env.failure_reason()); >> // The compiler elected, without comment, not to register a result. >> // Do not attempt further compilations of this method. >> ci_env.record_method_not_compilable("compile failed"); >> } >> >> >> The `task->is_success()` call accesses the private `_is_success` field. >> This field is modified in `CompileTask::mark_success`. 
>> >> By setting a breakpoint there, and execute the program without `-XX:-InstallMethods`, we get the following stacktrace: >> >> CompileTask::mark_success compileTask.hpp:185 >> nmethod::post_compiled_method nmethod.cpp:2212 >> ciEnv::register_method ciEnv.cpp:1127 >> Compilation::install_code c1_Compilation.cpp:425 >> Compilation::compile_method c1_Compilation.cpp:488 >> Compilation::Compilation c1_Compilation.cpp:609 >> Compiler::compile_method c1_Compiler.cpp:262 >> CompileBroker::invoke_compiler_on_method compileBroker.cpp:2324 >> CompileBroker::compiler_thread_loop compileBroker.cpp:1968 >> CompilerThread::thread_entry compilerThread.cpp:67 >> JavaThread::thread_main_inner javaThread.cpp:773 >> JavaThread::run javaThread.cpp:758 >> Thread::call_run thread.cpp:243 >> thread_native_entry os_linux.cpp:868 >> >> >> We go up the stacktrace and see that in `Compilation::compile_method` we have: >> >> ```c++ >> if (should_install_code()) { >> // install code >> PhaseTraceTime timeit(_t_codeinstall); >> install_code(frame_size); >> } >> >> >> If we do not install methods after compilation, the code path that marks the success is never executed >> and therefore results in hitting the assert. >> >> ### Fix >> We simply mark the task as complete when `should_install_code()` evaluates to `false` in the block code above. >> After careful consideration, it was decided to simply get rid of the `-XX:-InstallMethods` flag. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8358573) >> - [... > > Beno?t Maillard has updated the pull request incrementally with three additional commits since the last revision: > > - 8358573: get rid of InstallMethods flags completely > - Revert "8358573: Add missing task success notification" > > This reverts commit cd91c7c06ba05aba3500b95ba1317539363aa63c. > - Revert "8358573: Add test for -XX:-InstallMethods" > > This reverts commit 6eab84718c3b60c2585bc2711c4bc8144472975b. @benoitmaillard Your change (at version 2da73a5df46f47addc9ae6a9d32c69be1a9fc2a2) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26310#issuecomment-3083877893 From bmaillard at openjdk.org Thu Jul 17 12:43:08 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Thu, 17 Jul 2025 12:43:08 GMT Subject: Integrated: 8358573: Remove the -XX:-InstallMethods debug flag In-Reply-To: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> References: <42w-ek2nmUZf45VvJRiQRpxv39jkRLTSYVEvH1uP0hk=.6330711d-3429-4ce8-b5dc-22bbb8aa7657@github.com> Message-ID: On Tue, 15 Jul 2025 09:21:53 GMT, Beno?t Maillard wrote: > This PR prevents from hitting an assert when disabling method installation at the end of a successful compilation with the `-XX:-InstallMethods` flag. Previously `CompileBroker` failed to mark the `CompileTask` as complete when using this flag. > > ## Analysis > > We can see that the assert is triggered in `CompileBroker::invoke_compiler_on_method`: > > ```c++ > if (!ci_env.failing() && !task->is_success()) { > assert(ci_env.failure_reason() != nullptr, "expect failure reason"); > assert(false, "compiler should always document failure: %s", ci_env.failure_reason()); > // The compiler elected, without comment, not to register a result. > // Do not attempt further compilations of this method. 
> ci_env.record_method_not_compilable("compile failed"); > } > > > The `task->is_success()` call accesses the private `_is_success` field. > This field is modified in `CompileTask::mark_success`. > > By setting a breakpoint there, and execute the program without `-XX:-InstallMethods`, we get the following stacktrace: > > CompileTask::mark_success compileTask.hpp:185 > nmethod::post_compiled_method nmethod.cpp:2212 > ciEnv::register_method ciEnv.cpp:1127 > Compilation::install_code c1_Compilation.cpp:425 > Compilation::compile_method c1_Compilation.cpp:488 > Compilation::Compilation c1_Compilation.cpp:609 > Compiler::compile_method c1_Compiler.cpp:262 > CompileBroker::invoke_compiler_on_method compileBroker.cpp:2324 > CompileBroker::compiler_thread_loop compileBroker.cpp:1968 > CompilerThread::thread_entry compilerThread.cpp:67 > JavaThread::thread_main_inner javaThread.cpp:773 > JavaThread::run javaThread.cpp:758 > Thread::call_run thread.cpp:243 > thread_native_entry os_linux.cpp:868 > > > We go up the stacktrace and see that in `Compilation::compile_method` we have: > > ```c++ > if (should_install_code()) { > // install code > PhaseTraceTime timeit(_t_codeinstall); > install_code(frame_size); > } > > > If we do not install methods after compilation, the code path that marks the success is never executed > and therefore results in hitting the assert. > > ### Fix > We simply mark the task as complete when `should_install_code()` evaluates to `false` in the block code above. > After careful consideration, it was decided to simply get rid of the `-XX:-InstallMethods` flag. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8358573) > - [x] tier1-3, plus some internal testing > - [x] Added a test that starts the VM with the `-XX:-InstallMethods` flag > > ... This pull request has now been integrated. Changeset: 1d73f884 Author: Beno?t Maillard Committer: Manuel H?ssig URL: https://git.openjdk.org/jdk/commit/1d73f8842a6aa0fae7c7960eb5720447a1224792 Stats: 4 lines in 2 files changed: 0 ins; 3 del; 1 mod 8358573: Remove the -XX:-InstallMethods debug flag Reviewed-by: dlong, thartmann, shade ------------- PR: https://git.openjdk.org/jdk/pull/26310 From duke at openjdk.org Thu Jul 17 12:43:39 2025 From: duke at openjdk.org (Samuel Chee) Date: Thu, 17 Jul 2025 12:43:39 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet [v2] In-Reply-To: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: > AtomicLong.CompareAndSet has the following assembly dump snippet which gets emitted from the intermediary LIRGenerator::atomic_cmpxchg: > > ;; cmpxchg { > 0x0000e708d144cf60: mov x8, x2 > 0x0000e708d144cf64: casal x8, x3, [x0] > 0x0000e708d144cf68: cmp x8, x2 > ;; 0x1F1F1F1F1F1F1F1F > 0x0000e708d144cf6c: mov x8, #0x1f1f1f1f1f1f1f1f > ;; } cmpxchg > 0x0000e708d144cf70: cset x8, ne // ne = any > 0x0000e708d144cf74: dmb ish > > > According to the Oracle Java Specification, AtomicLong.CompareAndSet [1] has the same memory effects as specified by VarHandle.compareAndSet which has the following effects: [2] > >> Atomically sets the value of a variable to the >> newValue with the memory semantics of setVolatile if >> the variable's current value, referred to as the witness >> value, == the expectedValue, as accessed with the memory >> semantics of getVolatile. 
> > > > Hence the release on the store due to setVolatile only occurs if the compare is successful. Since casal already satisfies these requirements, the dmb does not need to occur to ensure memory ordering in case the compare fails and a release does not happen. > > Hence we remove the dmb from both casl and casw (same logic applies to the non-long variant) > > This is also reflected by C2 not having a dmb for the same respective method. > > [1] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/util/concurrent/atomic/AtomicLong.html#compareAndSet(long,long) > [2] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/lang/invoke/VarHandle.html#compareAndSet(java.lang.Object...) Samuel Chee has updated the pull request incrementally with one additional commit since the last revision: Add back in dmb membar for non-LSE Change-Id: Ie64565420a1758d3191eaebed82c80584ce54ef6 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26000/files - new: https://git.openjdk.org/jdk/pull/26000/files/577e9a20..181ce0b7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26000&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26000&range=00-01 Stats: 6 lines in 1 file changed: 6 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26000.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26000/head:pull/26000 PR: https://git.openjdk.org/jdk/pull/26000 From duke at openjdk.org Thu Jul 17 12:45:03 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Thu, 17 Jul 2025 12:45:03 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v10] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 14:05:25 GMT, Yuri Gaevsky wrote: >> The patch adds possibility to use RVV instructions for faster vectorizedHashCode calculations on RVV v1.0.0 capable hardware. >> >> Testing: hotspot/jtreg/compiler/ under QEMU-8.1 with RVV v1.0.0. > > Yuri Gaevsky has updated the pull request incrementally with one additional commit since the last revision: > > - removed tail processing with RVV instructions as simple scalar loop provides in general better results > > Looking at the JMH numbers, it's interesting to find that `-XX:DisableIntrinsic=_vectorizedHashCode` outperforms `-XX:-UseRVV`. If that is the case, then why would we want the scalar version (that is `C2_MacroAssembler::arrays_hashcode()`)? > > You are right: the non-RVV version of intrinsic performs worse on BPI-F3 hardware with size > 70, though originally it was better on StarFive JH7110 and T-Head RVB-ICE, please see #16629. Hm, it is still good on Lichee Pi 4A: $ ( for i in "-XX:DisableIntrinsic=_vectorizedHashCode" " " ; do ( echo "--- ${i} ---" && ${JAVA_HOME}/bin/java -jar benchmarks.jar --jvmArgs="-XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions ${i}" org.openjdk.bench.java.lang.ArraysHashCode.ints -p size=1,5,10,20,30,40,50,60,70,80,90,100,200,300 -f 3 -r 1 -w 1 -wi 10 -i 10 2>&1 | tail -15 ) done ) --- -XX:DisableIntrinsic=_vectorizedHashCode --- Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.ints 1 avgt 30 51.709 ? 3.815 ns/op ArraysHashCode.ints 5 avgt 30 68.146 ? 1.833 ns/op ArraysHashCode.ints 10 avgt 30 89.217 ? 0.496 ns/op ArraysHashCode.ints 20 avgt 30 140.807 ? 9.335 ns/op ArraysHashCode.ints 30 avgt 30 172.030 ? 4.025 ns/op ArraysHashCode.ints 40 avgt 30 222.927 ? 10.342 ns/op ArraysHashCode.ints 50 avgt 30 251.719 ? 0.686 ns/op ArraysHashCode.ints 60 avgt 30 305.947 ? 10.532 ns/op ArraysHashCode.ints 70 avgt 30 347.602 ? 
7.024 ns/op ArraysHashCode.ints 80 avgt 30 382.057 ? 1.520 ns/op ArraysHashCode.ints 90 avgt 30 426.022 ? 31.800 ns/op ArraysHashCode.ints 100 avgt 30 457.737 ? 0.652 ns/op ArraysHashCode.ints 200 avgt 30 913.501 ? 3.258 ns/op ArraysHashCode.ints 300 avgt 30 1297.355 ? 2.383 ns/op --- --- Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.ints 1 avgt 30 50.141 ? 0.463 ns/op ArraysHashCode.ints 5 avgt 30 62.921 ? 2.538 ns/op ArraysHashCode.ints 10 avgt 30 77.686 ? 2.577 ns/op ArraysHashCode.ints 20 avgt 30 102.736 ? 0.136 ns/op ArraysHashCode.ints 30 avgt 30 137.592 ? 4.232 ns/op ArraysHashCode.ints 40 avgt 30 157.376 ? 0.302 ns/op ArraysHashCode.ints 50 avgt 30 196.068 ? 3.812 ns/op ArraysHashCode.ints 60 avgt 30 212.956 ? 2.075 ns/op ArraysHashCode.ints 70 avgt 30 251.260 ? 1.176 ns/op ArraysHashCode.ints 80 avgt 30 266.223 ? 0.655 ns/op ArraysHashCode.ints 90 avgt 30 313.465 ? 6.810 ns/op ArraysHashCode.ints 100 avgt 30 373.024 ? 1.005 ns/op ArraysHashCode.ints 200 avgt 30 620.723 ? 24.313 ns/op ArraysHashCode.ints 300 avgt 30 881.358 ? 1.320 ns/op ------------- PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-3083927127 From duke at openjdk.org Thu Jul 17 12:46:53 2025 From: duke at openjdk.org (Samuel Chee) Date: Thu, 17 Jul 2025 12:46:53 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet [v2] In-Reply-To: References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: On Thu, 17 Jul 2025 12:43:39 GMT, Samuel Chee wrote: >> AtomicLong.CompareAndSet has the following assembly dump snippet which gets emitted from the intermediary LIRGenerator::atomic_cmpxchg: >> >> ;; cmpxchg { >> 0x0000e708d144cf60: mov x8, x2 >> 0x0000e708d144cf64: casal x8, x3, [x0] >> 0x0000e708d144cf68: cmp x8, x2 >> ;; 0x1F1F1F1F1F1F1F1F >> 0x0000e708d144cf6c: mov x8, #0x1f1f1f1f1f1f1f1f >> ;; } cmpxchg >> 0x0000e708d144cf70: cset x8, ne // ne = any >> 0x0000e708d144cf74: dmb ish >> >> >> According to the Oracle Java Specification, AtomicLong.CompareAndSet [1] has the same memory effects as specified by VarHandle.compareAndSet which has the following effects: [2] >> >>> Atomically sets the value of a variable to the >>> newValue with the memory semantics of setVolatile if >>> the variable's current value, referred to as the witness >>> value, == the expectedValue, as accessed with the memory >>> semantics of getVolatile. >> >> >> >> Hence the release on the store due to setVolatile only occurs if the compare is successful. Since casal already satisfies these requirements, the dmb does not need to occur to ensure memory ordering in case the compare fails and a release does not happen. >> >> Hence we remove the dmb from both casl and casw (same logic applies to the non-long variant) >> >> This is also reflected by C2 not having a dmb for the same respective method. >> >> [1] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/util/concurrent/atomic/AtomicLong.html#compareAndSet(long,long) >> [2] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/lang/invoke/VarHandle.html#compareAndSet(java.lang.Object...) > > Samuel Chee has updated the pull request incrementally with one additional commit since the last revision: > > Add back in dmb membar for non-LSE > > Change-Id: Ie64565420a1758d3191eaebed82c80584ce54ef6 Have just updated with change to have it still emit a dmb when LSE is not enabled. 
Should be good to go now hopefully :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/26000#issuecomment-3083935873 From mhaessig at openjdk.org Thu Jul 17 12:52:50 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Thu, 17 Jul 2025 12:52:50 GMT Subject: RFR: 8361700: Missed optimization in PhaseIterGVN for mask and shift patterns due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v2] In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 15:07:59 GMT, Beno?t Maillard wrote: >> This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist`. >> >> The affected optimization is the transformation of `(x & mask) >> shift` into `(x >> shift) & (mask >> shift)`, where `mask` is a constant. This transformation is handled in `RShiftNode::IdealIL` for both `RShiftI` and `RShiftL` nodes. >> >> The dependency of this optimization extends beyond a direct input: from the viewpoint of a shift node, it relies on changes to the inputs of its inputs (i.e., an `AndI`/`AndL` input node to the shift). Therefore, when the `And` node changes, the corresponding shift node must be notified to allow the optimization to take place. >> >> Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361700) >> - [x] tier1-3, plus some internal testing >> - [x] Added test from the fuzzer >> >> Thank you for reviewing! > > Beno?t Maillard has updated the pull request incrementally with three additional commits since the last revision: > > - Update test/hotspot/jtreg/compiler/c2/TestMaskAndRShiftReorder.java > > Co-authored-by: Manuel H?ssig > - Update test/hotspot/jtreg/compiler/c2/TestMaskAndRShiftReorder.java > > Co-authored-by: Manuel H?ssig > - Update src/hotspot/share/opto/phaseX.cpp > > Co-authored-by: Manuel H?ssig Looks good to me. ------------- Marked as reviewed by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/26347#pullrequestreview-3029402311 From kvn at openjdk.org Thu Jul 17 13:18:49 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 17 Jul 2025 13:18:49 GMT Subject: RFR: 8362493: Cleanup CodeBuffer::copy_relocations_to In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 11:09:09 GMT, Hamlin Li wrote: > Hi, > Can you help to review this simple patch? > > `CodeBuffer::copy_relocations_to(address buf, csize_t buf_limit, bool only_inst)` is only used in `copy_relocations_to(CodeBlob* dest)` which passes false to only_inst, so the former one should be able to be simplified. > > Thank you! Good. It is left over from JEP 243: Java-Level JVM Compiler Interface. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26366#pullrequestreview-3029502102 From mli at openjdk.org Thu Jul 17 14:22:25 2025 From: mli at openjdk.org (Hamlin Li) Date: Thu, 17 Jul 2025 14:22:25 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall Message-ID: Hi, Can you help to review this patch? By https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp#L1270, there are far call, indirect call, reloc call. 
NativeFarCall is in fact a reloc call, the name is confusing, better to rename it to RelocCall. Also add some comments and do some other simple cleanup. Thanks! ------------- Commit messages: - initial commit Changes: https://git.openjdk.org/jdk/pull/26370/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26370&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362515 Stats: 59 lines in 1 file changed: 7 ins; 7 del; 45 mod Patch: https://git.openjdk.org/jdk/pull/26370.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26370/head:pull/26370 PR: https://git.openjdk.org/jdk/pull/26370 From aph at openjdk.org Thu Jul 17 14:29:53 2025 From: aph at openjdk.org (Andrew Haley) Date: Thu, 17 Jul 2025 14:29:53 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet [v2] In-Reply-To: References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: On Thu, 17 Jul 2025 12:43:39 GMT, Samuel Chee wrote: >> AtomicLong.CompareAndSet has the following assembly dump snippet which gets emitted from the intermediary LIRGenerator::atomic_cmpxchg: >> >> ;; cmpxchg { >> 0x0000e708d144cf60: mov x8, x2 >> 0x0000e708d144cf64: casal x8, x3, [x0] >> 0x0000e708d144cf68: cmp x8, x2 >> ;; 0x1F1F1F1F1F1F1F1F >> 0x0000e708d144cf6c: mov x8, #0x1f1f1f1f1f1f1f1f >> ;; } cmpxchg >> 0x0000e708d144cf70: cset x8, ne // ne = any >> 0x0000e708d144cf74: dmb ish >> >> >> According to the Oracle Java Specification, AtomicLong.CompareAndSet [1] has the same memory effects as specified by VarHandle.compareAndSet which has the following effects: [2] >> >>> Atomically sets the value of a variable to the >>> newValue with the memory semantics of setVolatile if >>> the variable's current value, referred to as the witness >>> value, == the expectedValue, as accessed with the memory >>> semantics of getVolatile. >> >> >> >> Hence the release on the store due to setVolatile only occurs if the compare is successful. Since casal already satisfies these requirements, the dmb does not need to occur to ensure memory ordering in case the compare fails and a release does not happen. >> >> Hence we remove the dmb from both casl and casw (same logic applies to the non-long variant) >> >> This is also reflected by C2 not having a dmb for the same respective method. >> >> [1] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/util/concurrent/atomic/AtomicLong.html#compareAndSet(long,long) >> [2] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/lang/invoke/VarHandle.html#compareAndSet(java.lang.Object...) > > Samuel Chee has updated the pull request incrementally with one additional commit since the last revision: > > Add back in dmb membar for non-LSE > > Change-Id: Ie64565420a1758d3191eaebed82c80584ce54ef6 src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 1487: > 1485: if(!UseLSE) { > 1486: __ membar(__ AnyAny); > 1487: } Suggestion: if(!UseLSE) { // Prevent a later volatile store from being reordered with the STLXR in cmpxchg. 
__ membar(__ StoreLoad); } ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26000#discussion_r2213512465 From aph at openjdk.org Thu Jul 17 14:33:57 2025 From: aph at openjdk.org (Andrew Haley) Date: Thu, 17 Jul 2025 14:33:57 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet [v2] In-Reply-To: References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: On Thu, 17 Jul 2025 14:27:23 GMT, Andrew Haley wrote: >> Samuel Chee has updated the pull request incrementally with one additional commit since the last revision: >> >> Add back in dmb membar for non-LSE >> >> Change-Id: Ie64565420a1758d3191eaebed82c80584ce54ef6 > > src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 1487: > >> 1485: if(!UseLSE) { >> 1486: __ membar(__ AnyAny); >> 1487: } > > Suggestion: > > if(!UseLSE) { > // Prevent a later volatile load from being reordered with the STLXR in cmpxchg. > __ membar(__ StoreLoad); > } I wonder if it might be a good idea to add a `trailingDMB` boolean argument to `cmpxchg` and `atomic_##NAME` instead. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26000#discussion_r2213522976 From jkarthikeyan at openjdk.org Thu Jul 17 15:01:48 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Thu, 17 Jul 2025 15:01:48 GMT Subject: RFR: 8361700: Missed optimization in PhaseIterGVN for mask and shift patterns due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v2] In-Reply-To: References: Message-ID: <6LXR3PdFz6_cBIQ8tkQCx-BvR5XcVHtcpB1oJv1PVAU=.8ab902f0-cb84-40f6-b4b7-b38ac591f3d6@github.com> On Wed, 16 Jul 2025 15:07:59 GMT, Beno?t Maillard wrote: >> This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist`. >> >> The affected optimization is the transformation of `(x & mask) >> shift` into `(x >> shift) & (mask >> shift)`, where `mask` is a constant. This transformation is handled in `RShiftNode::IdealIL` for both `RShiftI` and `RShiftL` nodes. >> >> The dependency of this optimization extends beyond a direct input: from the viewpoint of a shift node, it relies on changes to the inputs of its inputs (i.e., an `AndI`/`AndL` input node to the shift). Therefore, when the `And` node changes, the corresponding shift node must be notified to allow the optimization to take place. >> >> Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361700) >> - [x] tier1-3, plus some internal testing >> - [x] Added test from the fuzzer >> >> Thank you for reviewing! > > Beno?t Maillard has updated the pull request incrementally with three additional commits since the last revision: > > - Update test/hotspot/jtreg/compiler/c2/TestMaskAndRShiftReorder.java > > Co-authored-by: Manuel H?ssig > - Update test/hotspot/jtreg/compiler/c2/TestMaskAndRShiftReorder.java > > Co-authored-by: Manuel H?ssig > - Update src/hotspot/share/opto/phaseX.cpp > > Co-authored-by: Manuel H?ssig I think this is a nice fix for the notification issue. I just have one code style comment. 
src/hotspot/share/opto/phaseX.cpp line 2556: > 2554: for (DUIterator_Fast i2max, i2 = use->fast_outs(i2max); i2 < i2max; i2++) { > 2555: Node* u = use->fast_out(i2); > 2556: if (u->Opcode() == Op_RShiftI || u->Opcode() == Op_RShiftL ) { Suggestion: if (u->Opcode() == Op_RShiftI || u->Opcode() == Op_RShiftL) { ------------- Marked as reviewed by jkarthikeyan (Committer). PR Review: https://git.openjdk.org/jdk/pull/26347#pullrequestreview-3029700362 PR Review Comment: https://git.openjdk.org/jdk/pull/26347#discussion_r2213451788 From duke at openjdk.org Thu Jul 17 15:12:50 2025 From: duke at openjdk.org (Samuel Chee) Date: Thu, 17 Jul 2025 15:12:50 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet [v2] In-Reply-To: References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: On Thu, 17 Jul 2025 14:31:18 GMT, Andrew Haley wrote: >> src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 1487: >> >>> 1485: if(!UseLSE) { >>> 1486: __ membar(__ AnyAny); >>> 1487: } >> >> Suggestion: >> >> if(!UseLSE) { >> // Prevent a later volatile load from being reordered with the STLXR in cmpxchg. >> __ membar(__ StoreLoad); >> } > > I wonder if it might be a good idea to add a `trailingDMB` boolean argument to `cmpxchg` and `atomic_##NAME` instead. Having a trailingDMB option is potentially a decent idea. Someone would probably need to investigate where the trailingDMB option would have to be enabled; I am not familiar enough to know where exactly would be affected by this. So for now I'd say leave it be and that is something someone else can maybe do in a later pr. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26000#discussion_r2213625095 From mli at openjdk.org Thu Jul 17 15:30:49 2025 From: mli at openjdk.org (Hamlin Li) Date: Thu, 17 Jul 2025 15:30:49 GMT Subject: RFR: 8362493: Cleanup CodeBuffer::copy_relocations_to In-Reply-To: References: Message-ID: <2jnLW9l_xgs35OLuqUXrnG9xvl9Nv_6M_KJnQwKqZqs=.27ffc0f3-0735-4058-814b-a022560be010@github.com> On Thu, 17 Jul 2025 12:14:08 GMT, Manuel H?ssig wrote: > Thank you for working on this cleanup, @Hamlin-Li! It looks good to me. > > I kicked off some testing on our side and will let you know what the results are. Thank you @mhaessig , will wait for your test result. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26366#issuecomment-3084505202 From mli at openjdk.org Thu Jul 17 15:30:50 2025 From: mli at openjdk.org (Hamlin Li) Date: Thu, 17 Jul 2025 15:30:50 GMT Subject: RFR: 8362493: Cleanup CodeBuffer::copy_relocations_to In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 13:16:18 GMT, Vladimir Kozlov wrote: > Good. It is left over from JEP 243: Java-Level JVM Compiler Interface. Thank you @vnkozlov for reviewing! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26366#issuecomment-3084506734 From duke at openjdk.org Thu Jul 17 15:36:32 2025 From: duke at openjdk.org (Samuel Chee) Date: Thu, 17 Jul 2025 15:36:32 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet [v3] In-Reply-To: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: > AtomicLong.CompareAndSet has the following assembly dump snippet which gets emitted from the intermediary LIRGenerator::atomic_cmpxchg: > > ;; cmpxchg { > 0x0000e708d144cf60: mov x8, x2 > 0x0000e708d144cf64: casal x8, x3, [x0] > 0x0000e708d144cf68: cmp x8, x2 > ;; 0x1F1F1F1F1F1F1F1F > 0x0000e708d144cf6c: mov x8, #0x1f1f1f1f1f1f1f1f > ;; } cmpxchg > 0x0000e708d144cf70: cset x8, ne // ne = any > 0x0000e708d144cf74: dmb ish > > > According to the Oracle Java Specification, AtomicLong.CompareAndSet [1] has the same memory effects as specified by VarHandle.compareAndSet which has the following effects: [2] > >> Atomically sets the value of a variable to the >> newValue with the memory semantics of setVolatile if >> the variable's current value, referred to as the witness >> value, == the expectedValue, as accessed with the memory >> semantics of getVolatile. > > > > Hence the release on the store due to setVolatile only occurs if the compare is successful. Since casal already satisfies these requirements, the dmb does not need to occur to ensure memory ordering in case the compare fails and a release does not happen. > > Hence we remove the dmb from both casl and casw (same logic applies to the non-long variant) > > This is also reflected by C2 not having a dmb for the same respective method. > > [1] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/util/concurrent/atomic/AtomicLong.html#compareAndSet(long,long) > [2] https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/lang/invoke/VarHandle.html#compareAndSet(java.lang.Object...) Samuel Chee has updated the pull request incrementally with one additional commit since the last revision: Add comment Signed-off-by: Samuel Chee Change-Id: I9793ed6ffdff6c044552d069af23620d178f2284 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26000/files - new: https://git.openjdk.org/jdk/pull/26000/files/181ce0b7..8eb9096d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26000&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26000&range=01-02 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26000.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26000/head:pull/26000 PR: https://git.openjdk.org/jdk/pull/26000 From duke at openjdk.org Thu Jul 17 15:36:32 2025 From: duke at openjdk.org (Samuel Chee) Date: Thu, 17 Jul 2025 15:36:32 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet [v2] In-Reply-To: References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: On Thu, 17 Jul 2025 15:10:18 GMT, Samuel Chee wrote: >> I wonder if it might be a good idea to add a `trailingDMB` boolean argument to `cmpxchg` and `atomic_##NAME` instead. > > Having a trailingDMB option is potentially a decent idea. 
Someone would probably need to investigate where the trailingDMB option would have to be enabled; I am not familiar enough to know where exactly would be affected by this. > So for now I'd say leave it be and that is something someone else can maybe do in a later pr. Also have just added comment as you suggested thanks! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26000#discussion_r2213678810 From duke at openjdk.org Thu Jul 17 16:19:51 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 17 Jul 2025 16:19:51 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v38] In-Reply-To: References: Message-ID: > This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). > > When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. > > This does not modify existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality > > Additional Testing: > - [ ] Linux x64 fastdebug all > - [ ] Linux aarch64 fastdebug all > - [ ] ... Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: Require caller to hold locks ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23573/files - new: https://git.openjdk.org/jdk/pull/23573/files/36834705..1dcf47e4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=37 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=36-37 Stats: 20 lines in 2 files changed: 12 ins; 6 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23573.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23573/head:pull/23573 PR: https://git.openjdk.org/jdk/pull/23573 From jbhateja at openjdk.org Thu Jul 17 16:38:00 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 17 Jul 2025 16:38:00 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v15] In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 10:43:34 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. >> >> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. >> >> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Refine lower bound computation Quick note on C2 Integral types:- - Integral types now encapsulate 3 lattice structures perf TypeInt/Long, namely signed, unsigned and knownbits. - All three lattice values are in sync post-canonicalization. 
- A lattice is a partial order relation, i.e., reflexive, transitive and anti-symmetric.
- An integral lattice contains two special values: a TOP (no value; no assumption can be drawn by the compiler) and a BOTTOM (all possible values in the value range). Verification ensures that the lattice is symmetrical around the centerline, i.e., a semi-lattice.
- For a symmetrical lattice, only one operation, i.e., meet/join, is sufficient for value resolution; the other operation can be computed by taking the dual of the first one using de Morgan's law: join = dual(meet(dual(type1), dual(type2))).
- In theory, a meet of two lattice points takes us to the greatest lower bound in the Hasse diagram, while a join of two lattice points takes us to the lowest upper bound. Also, TOP represents the entire value range of the lattice, while BOTTOM represents no value, but C2 follows an inverted lattice convention.

Inverted integral lattice Hasse diagram:

            TOP (no value)
           /   |     |   \
        -MIN   |     |   MAX
           \   |     |   /
         BOTTOM (all possible values)

- Thus, a MEET of two lattice points takes us to the greatest upper bound of the lattice structure; in this case, it is the union of the two lattice points, i.e., we pick the minimum of the lower bounds and the maximum of the upper bounds of the participating lattice points. A JOIN takes us to the lowest upper-bound lattice point of the inverted lattice structure; in this case, it is an intersection of the lattice points, which constrains the value range, i.e., we pick the maximum of the lower bounds and the minimum of the upper bounds of the two participating integral lattice points.

  e.g., if TypeInt t1 = {lo:10, hi:100} and TypeInt t2 = {lo:1, hi:20}, then

    t1.meet(t2) = {lo = min(t1.lo, t2.lo), hi = max(t1.hi, t2.hi)}
                = {lo = min(10, 1), hi = max(100, 20)}
                = {lo:1, hi:100}

    t1.join(t2) = dual(meet(dual(t1), dual(t2))), where dual maps {lo, hi} => {hi, lo}
                = dual(meet(dual{lo:10, hi:100}, dual{lo:1, hi:20}))
                = dual(meet({lo:100, hi:10}, {lo:20, hi:1}))
                = dual({lo = min(100, 20), hi = max(10, 1)})
                = dual({lo:20, hi:10})
                = {lo:10, hi:20}

Additional identities:
- TOP meet VAL = VAL, since we cannot move to any other greatest lower bound when one of the inputs is TOP (unknown value); to move to the greatest lower bound, both inputs must be known values.
- BOTTOM meet VAL = BOTTOM

Now, some quick notes on CCP:
- It is an optimistic data flow analysis using an RPOT (reverse post-order traversal) walk over the ideal graph.
- Each lattice begins with a TOP value, and the analysis progressively adds elements to the lattice. The analysis expects to expand the value range with each data flow iteration, thereby monotonically increasing the lattice set.
- After each value transformation, type verification checks that the new value is greater than the old value in the lattice; in other words, the new value should dominate the old value in the Hasse diagram of the lattice. Thus, tnew->meet(told) gives us the lowest upper bound of the two lattice points, i.e., tnew should be a superset of told. CCP is an optimistic iterative data flow analysis which traverses the ideal graph in RPOT order and reaches a fixed point once value transforms have no further side effects on the graph.
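To make the meet/join arithmetic above easy to replay, here is a small self-contained Java sketch. It is purely illustrative and is not the C2 TypeInt implementation (which also deals with widening, duals of the full type, and the unsigned/knownbits lattices mentioned earlier); it only applies the interval rules to the example values.

```java
// Illustrative model of the interval part of an integral type lattice.
// Not HotSpot code: the type and method names are made up for this sketch.
record Interval(int lo, int hi) {
    // meet = union of the two ranges: min of the lows, max of the highs.
    Interval meet(Interval other) {
        return new Interval(Math.min(lo, other.lo), Math.max(hi, other.hi));
    }
    // dual swaps the bounds; join = dual(meet(dual(a), dual(b))).
    Interval dual() {
        return new Interval(hi, lo);
    }
    Interval join(Interval other) {
        return dual().meet(other.dual()).dual();
    }
}

class LatticeDemo {
    public static void main(String[] args) {
        Interval t1 = new Interval(10, 100);
        Interval t2 = new Interval(1, 20);
        System.out.println(t1.meet(t2)); // Interval[lo=1, hi=100]
        System.out.println(t1.join(t2)); // Interval[lo=10, hi=20]
    }
}
```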
------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3084708275 From jbhateja at openjdk.org Thu Jul 17 16:38:01 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 17 Jul 2025 16:38:01 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v8] In-Reply-To: <10jWxhtjQENzTBjlNDFKhHQMN-ioETq3P6_qmVTq3bo=.0124e215-5c09-44c3-8dcb-cd692789907a@github.com> References: <10jWxhtjQENzTBjlNDFKhHQMN-ioETq3P6_qmVTq3bo=.0124e215-5c09-44c3-8dcb-cd692789907a@github.com> Message-ID: On Wed, 16 Jul 2025 06:49:06 GMT, Tobias Hartmann wrote: >> Hi @eme64 , >> >> Updated the tests as per suggestion; however, for this bug fix patch, we are not doing aggressive value range optimization. >> I plan to extend value routines for compress/expand with the newly supported knownBits infrastructure in a subsequent RFE., Following is a prototype for the same. >> >> https://github.com/jatin-bhateja/external_staging/blob/main/Code/java/knownBits_DFA/bit_compress_expand_KnownBits.java >> >> Best Regards, >> Jatin > > Thanks @jatin-bhateja. Isn't the OCA signature status verification independent of the PR? Let me ping a few people here to get it done. Hi @TobiHartmann . Please let me know if it's good to land in. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3084710343 From sparasa at openjdk.org Thu Jul 17 17:17:07 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Thu, 17 Jul 2025 17:17:07 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v5] In-Reply-To: References: Message-ID: > The goal of this PR is to enhance the existing x86 assembly stubs using PUSH and POP instructions with paired PUSHP/POPP instructions which are part of Intel APX technology. > > In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0?R31). Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs. > > Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures. 
Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: change to push_ppx/pop_ppx ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25889/files - new: https://git.openjdk.org/jdk/pull/25889/files/8e6e96c2..78cbf243 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25889&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25889&range=03-04 Stats: 325 lines in 22 files changed: 0 ins; 0 del; 325 mod Patch: https://git.openjdk.org/jdk/pull/25889.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25889/head:pull/25889 PR: https://git.openjdk.org/jdk/pull/25889 From sparasa at openjdk.org Thu Jul 17 17:19:51 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Thu, 17 Jul 2025 17:19:51 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: References: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> Message-ID: <3LY0CQRtD6KU5wYpQSaCA9Cbey8yV7epET1OSUjngSw=.052051e9-138d-490f-80e8-226c0a4dd4a5@github.com> On Mon, 14 Jul 2025 17:44:15 GMT, Volodymyr Paprotski wrote: > My concerns have been addressed; thanks Vamsi for changing the names! Thank you Vlad for the reviewing the PR! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25889#issuecomment-3084838188 From sparasa at openjdk.org Thu Jul 17 17:19:54 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Thu, 17 Jul 2025 17:19:54 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v4] In-Reply-To: References: <-V4hpHvXdaDjmEyYzHcEpDJ2bzPTqoz2Ao8FLobkmB8=.d9e3b962-ae8d-4e4b-8ddb-c3ab42a2a619@github.com> Message-ID: <2Br9Fr3vnPluq7XaWc9PPrDvRgVF6UAOYdUFJ4IO23w=.f8ac1e39-dc16-429a-a8a0-c68a4f2a44ea@github.com> On Wed, 16 Jul 2025 20:55:35 GMT, Sandhya Viswanathan wrote: >> Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: >> >> - merge with master >> - remove pushp/popp from vm_version_x86 and also when APX is not being used >> - rename to paired_push and paired_pop >> - 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs > > src/hotspot/cpu/x86/macroAssembler_x86.cpp line 798: > >> 796: } >> 797: >> 798: void MacroAssembler::paired_push(Register src) { > > Would be better to call these as push_ppx and pop_ppx. Please see the updated code changed to push_ppx/pop_ppx. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2213890625 From jbhateja at openjdk.org Thu Jul 17 17:29:49 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 17 Jul 2025 17:29:49 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v5] In-Reply-To: References: Message-ID: <89ItZsQ_nWl3KWuRwdAqu3cMeostYVb1sO6qurvJ5qw=.2640ac03-ea33-4938-86c1-40033dea04a8@github.com> On Thu, 17 Jul 2025 17:17:07 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to enhance the existing x86 assembly stubs using PUSH and POP instructions with paired PUSHP/POPP instructions which are part of Intel APX technology. >> >> In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0?R31). 
Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs. >> >> Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures. > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > change to push_ppx/pop_ppx src/hotspot/cpu/x86/macroAssembler_x86.cpp line 806: > 804: } > 805: > 806: void MacroAssembler::pop_ppx(Register dst) { Hi @vamsi-parasa , If you rename pop_ppx to pop and push_ppx to push, it will cut down the changes in this patch significantly. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2213907529 From sparasa at openjdk.org Thu Jul 17 18:41:55 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Thu, 17 Jul 2025 18:41:55 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v5] In-Reply-To: <89ItZsQ_nWl3KWuRwdAqu3cMeostYVb1sO6qurvJ5qw=.2640ac03-ea33-4938-86c1-40033dea04a8@github.com> References: <89ItZsQ_nWl3KWuRwdAqu3cMeostYVb1sO6qurvJ5qw=.2640ac03-ea33-4938-86c1-40033dea04a8@github.com> Message-ID: On Thu, 17 Jul 2025 17:26:56 GMT, Jatin Bhateja wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> change to push_ppx/pop_ppx > > src/hotspot/cpu/x86/macroAssembler_x86.cpp line 806: > >> 804: } >> 805: >> 806: void MacroAssembler::pop_ppx(Register dst) { > > Hi @vamsi-parasa , If you rename pop_ppx to pop and push_ppx to push, it will cut down the changes in this patch significantly. Hi Jatin (@jatin-bhateja), the intent is to make the use of the `pushp/popp` instructions explicit to the user, as not all `push` or `pop` instructions require the PPX feature. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2214035483 From thartmann at openjdk.org Thu Jul 17 18:59:58 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 17 Jul 2025 18:59:58 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v15] In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 10:43:34 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. >> >> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. >> >> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. >> >> Kindly review and share your feedback. 
>> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Refine lower bound computation Thanks, testing looks good now! I'm out for the rest of the week and can review only next week. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3085103359 From jbhateja at openjdk.org Thu Jul 17 19:04:53 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 17 Jul 2025 19:04:53 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: References: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> Message-ID: On Mon, 14 Jul 2025 08:15:13 GMT, Jatin Bhateja wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> rename to paired_push and paired_pop > > src/hotspot/cpu/x86/gc/z/zBarrierSetAssembler_x86.cpp line 114: > >> 112: __ paired_push(rax); >> 113: } >> 114: __ paired_push(rcx); > > Hi @vamsi-parasa , for consecutive push/pop we can use push2/pop2 and 16byte alignment can be guaranteed using following technique > https://github.com/openjdk/jdk/pull/25351/files#diff-d5d721ebf93346ba66e81257e4f6c5e6268d59774313c61e97353c0dfbf686a5R94 > Hi Jatin (@jatin-bhateja), for the first iteration, would it be ok to get the push_paired/pop_paired changes integrated and then make the push2p/pop2p related optimizations in a separate PR? > > Thanks, Vamsi Please create a new RFE for this for tracking. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2214075909 From duke at openjdk.org Thu Jul 17 19:06:01 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 17 Jul 2025 19:06:01 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v24] In-Reply-To: References: Message-ID: On Tue, 3 Jun 2025 00:32:26 GMT, Vladimir Kozlov wrote: >> Thanks for pointing out the missing JVMTI event publication. I?m currently looking into what?s required to address that, along with JFR event publication that may also have been missed. I?d appreciate hearing others? thoughts on how critical this is: should we treat it as a blocker for integration, or would it be acceptable to follow up with a separate issue? >> >> We?re hoping to get this into JDK 25, as it would simplify both development and backporting of features related to hot code grouping. That said, if the consensus is that JVMTI/JFR support is essential upfront, this can be delayed until JDK 26. > >> We?re hoping to get this into JDK 25, as it would simplify both development and backporting of features related to hot code grouping. That said, if the consensus is that JVMTI/JFR support is essential upfront, this can be delayed until JDK 26. > > I don't think this can be put into JDK 25. Too late and changes are not simple. And yes, JVMTI/JFR support is essential - you have to support all public functionalities of VM. @vnkozlov When you get a chance, would you mind taking another look at this PR? 
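For readers following the 8350896 discussion above: `Integer.compress(i, mask)` (JDK 19+) gathers the bits of `i` selected by `mask` into the low-order bits of the result. The sketch below only illustrates the general shape being discussed (a constant input combined with a non-constant mask) and is not taken from the actual regression test; the class, method and constant are made up.

```
class CompressExample {
    // Integer.compress packs the bits of the first argument selected by the mask
    // into the low-order bits of the result, e.g.:
    //   Integer.compress(0b1010_1010, 0b0000_1111) == 0b1010   // == 10
    // With a constant input but a variable mask, the result still depends on the
    // mask, so the compiler has to keep a value range for it rather than folding
    // it to a single constant.
    static int compressConstInput(int mask) {
        return Integer.compress(42, mask);
    }
}
```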
------------- PR Comment: https://git.openjdk.org/jdk/pull/23573#issuecomment-3085128980 From jbhateja at openjdk.org Thu Jul 17 19:38:50 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 17 Jul 2025 19:38:50 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v5] In-Reply-To: References: <89ItZsQ_nWl3KWuRwdAqu3cMeostYVb1sO6qurvJ5qw=.2640ac03-ea33-4938-86c1-40033dea04a8@github.com> Message-ID: On Thu, 17 Jul 2025 18:39:04 GMT, Srinivas Vamsi Parasa wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> change to push_ppx/pop_ppx > > src/hotspot/cpu/x86/macroAssembler_x86.cpp line 806: > >> 804: } >> 805: >> 806: void MacroAssembler::pop_ppx(Register dst) { > > Hi Jatin (@jatin-bhateja), the intent is to make the use of the `pushp/popp` instructions explicit to the user, as not all `push` or `pop` instructions require the PPX feature. BTW, push/popp are meager optimization hints, if other constraints for balancing are not met the it will prevent value forwarding, so it's ok to keep the macro-assembly name same as the assembler name, fully qualified names Assembler::pop/push are sufficient to disambiguate in applicable scenarios or iff there is any such need. I would have preferred keeping the same name to limit the changes in this patch. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2214135683 From sviswanathan at openjdk.org Thu Jul 17 19:52:59 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Thu, 17 Jul 2025 19:52:59 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v5] In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 17:17:07 GMT, Srinivas Vamsi Parasa wrote: >> This PR adds support for the Push-Pop Acceleration (PPX) hint to legacy PUSH and POP instructions, enabling the PUSHP and POPP forms. The PPX hint improves performance by accelerating register value forwarding between matching push/pop pairs. >> >> **Purpose:** PPX is a performance hint that allows the processor to bypass memory and the training loop of Fast Store Forwarding Predictor (FSFP) by directly forwarding data between paired PUSHP and POPP instructions. >> >> **Requirements:** Both the PUSH and its matching POP must be marked with PPX. A "matching" pair accesses the same stack address (e.g., typical function prolog/epilog). Standalone PUSH instructions (e.g. for argument passing) must not be marked. >> >> **Encoding:** PUSHP/POPP is a replacement for legacy PUSH/POP (0x50+rd / 0x58+rd) and uses REX2.W = 1 (implies 64-bit operand size). PPX cannot be encoded with 16-bit operand size as REX2.W overrides the 0x66 prefix. > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > change to push_ppx/pop_ppx Looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/25889#pullrequestreview-3030827697 From sviswanathan at openjdk.org Thu Jul 17 19:58:48 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Thu, 17 Jul 2025 19:58:48 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v5] In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 17:17:07 GMT, Srinivas Vamsi Parasa wrote: >> This PR adds support for the Push-Pop Acceleration (PPX) hint to legacy PUSH and POP instructions, enabling the PUSHP and POPP forms. The PPX hint improves performance by accelerating register value forwarding between matching push/pop pairs. >> >> **Purpose:** PPX is a performance hint that allows the processor to bypass memory and the training loop of Fast Store Forwarding Predictor (FSFP) by directly forwarding data between paired PUSHP and POPP instructions. >> >> **Requirements:** Both the PUSH and its matching POP must be marked with PPX. A "matching" pair accesses the same stack address (e.g., typical function prolog/epilog). Standalone PUSH instructions (e.g. for argument passing) must not be marked. >> >> **Encoding:** PUSHP/POPP is a replacement for legacy PUSH/POP (0x50+rd / 0x58+rd) and uses REX2.W = 1 (implies 64-bit operand size). PPX cannot be encoded with 16-bit operand size as REX2.W overrides the 0x66 prefix. > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > change to push_ppx/pop_ppx @vnkozlov Would it be possible for you to run this PR through your testing before we integrate? ------------- PR Comment: https://git.openjdk.org/jdk/pull/25889#issuecomment-3085295891 From jbhateja at openjdk.org Fri Jul 18 03:17:54 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 18 Jul 2025 03:17:54 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: References: <3cr8Njt2flaQXy5sjOVOlhI9XDkEesagnYLwzCmgkoI=.089494aa-d622-47db-8d23-c9637519028c@github.com> Message-ID: <7zxjUTJq9ynYRau4UpWaFcARH8cp8Xka3cJovCwGVRY=.2bcd9dc6-a9df-47f2-8834-bc6c4a8469cf@github.com> On Mon, 7 Jul 2025 09:04:40 GMT, Jatin Bhateja wrote: >>> > > public static final VectorSpecies FSP = FloatVector.SPECIES_512; >>> > > public static long micro1(long a) { >>> > > long mask = Math.min(-1, Math.max(-1, a)); >>> > > return VectorMask.fromLong(FSP, mask).toLong(); >>> > > } >>> > > public static long micro2() { >>> > > return FSP.maskAll(true).toLong(); >>> > > } >>> > >>> > >>> > With this JMH method we can not see obvious performance improvement, because the hot spots are other instructions. Adding a loop is better. >>> >>> There is no hard and fast rule for the inclusion of a loop in a JMH micro in that case? >> >> You mean adding a loop is not a block, right ? > >> > > > public static final VectorSpecies FSP = FloatVector.SPECIES_512; >> > > > public static long micro1(long a) { >> > > > long mask = Math.min(-1, Math.max(-1, a)); >> > > > return VectorMask.fromLong(FSP, mask).toLong(); >> > > > } >> > > > public static long micro2() { >> > > > return FSP.maskAll(true).toLong(); >> > > > } >> > > >> > > >> > > With this JMH method we can not see obvious performance improvement, because the hot spots are other instructions. Adding a loop is better. >> > >> > >> > There is no hard and fast rule for the inclusion of a loop in a JMH micro in that case? >> >> You mean adding a loop is not a block, right ? > > Yes. If you see gains without loop go for it. 
> As @jatin-bhateja suggested, I have refactored the implementation and updated the commit message. please help review this PR, thanks! Thanks a lot @erifan , I am out for the rest of the week, will re-review early next week. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3086554704 From xgong at openjdk.org Fri Jul 18 06:05:51 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 18 Jul 2025 06:05:51 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: <38f2bvFqiVNQGGpMif0iflVFD8wXnyw4SwtKxwi_Dmo=.276fb2fb-b80c-4ea7-a32f-c326294f442a@github.com> <1xAfD3mz5cbQpYtCYxoHqRQcOLadLKNHrvMUtFtFbGo=.34e5780a-e37a-427c-b745-1ed422c7a008@github.com> <4tejg5hp-eHBmAEvKbpTg_mv_TUYU5kg0HIccmWyac8=.3638758e-5000-4d1f-924f-abb4a21952c6@github.com> Message-ID: On Thu, 17 Jul 2025 11:28:18 GMT, Fei Gao wrote: > > Yes, I agree with you. I'm now working on refactoring the IR based on the first idea. I will update the patch as soon as possible. Thanks for your valuable suggestion! > > Thanks! I?d suggest also highlighting `aarch64` in the JBS title, so others who are interested won?t miss it. Thanks for your point~ I'm not sure since this is not a pure AArch64 backend patch as I can see. Actually, the backend rules are so simple, and the mid-end IR change is relative more complex. Not sure whether this patch will be also missed by others that are not familiar with AArch64 if it is highlighted. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3087081866 From qxing at openjdk.org Fri Jul 18 06:19:52 2025 From: qxing at openjdk.org (Qizheng Xing) Date: Fri, 18 Jul 2025 06:19:52 GMT Subject: RFR: 8347499: C2: Make `PhaseIdealLoop` eliminate more redundant safepoints in loops [v2] In-Reply-To: References: Message-ID: On Wed, 2 Apr 2025 07:22:13 GMT, Emanuel Peter wrote: >> The second question: >> >>> If we now removed safepoints in places where we would actually have needed them: how would we find out? I suppose we would get longer time to safepoint - higher latency in some cases. How would we catch this with our tests? >> >> I tried running tier1 tests with `JAVA_OPTIONS=-XX:+UnlockDiagnosticVMOptions -XX:+SafepointTimeout -XX:+AbortVMOnSafepointTimeout -XX:SafepointTimeoutDelay=1000`, and there were no failures. >> >> Running with `-XX:SafepointTimeoutDelay=500` caused 1 random JDK test case to fail. But then I tried to build a JDK without this patch, and it still had the random failure with this option. > > @MaxXSoft Would you mind improving the documentation comments, so that they are easier to understand? Maybe you can even add more comments around your code change, to "prove" why it is ok to do what we would do with your change? Hi @eme64, this PR is now ready for further reviews. Could you please continue to review this PR? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23057#issuecomment-3087162955 From kwei at openjdk.org Fri Jul 18 08:41:03 2025 From: kwei at openjdk.org (Kuai Wei) Date: Fri, 18 Jul 2025 08:41:03 GMT Subject: RFR: 8345485: C2 MergeLoads: merge adjacent array/native memory loads into larger load [v17] In-Reply-To: References: Message-ID: On Tue, 3 Jun 2025 10:50:39 GMT, Kuai Wei wrote: >> In this patch, I extent the merge stores optimization to merge adjacents loads. Tier1 tests are passed in my machine. 
>> >> The benchmark result of MergeLoadBench.java >> AMD EPYC 9T24 96-Core Processor: >> >> |name | -MergeLoads | +MergeLoads |delta| >> |---|---|---|---| >> |MergeLoadBench.getCharB |4352.150 |4407.435 | 55.29 | >> |MergeLoadBench.getCharBU |4075.320 |4084.663 | 9.34 | >> |MergeLoadBench.getCharBV |3221.302 |3221.528 | 0.23 | >> |MergeLoadBench.getCharC |2235.433 |2238.796 | 3.36 | >> |MergeLoadBench.getCharL |4363.244 |4372.281 | 9.04 | >> |MergeLoadBench.getCharLU |4072.550 |4075.744 | 3.19 | >> |MergeLoadBench.getCharLV |2227.825 |2231.612 | 3.79 | >> |MergeLoadBench.getIntB |11199.935 |6869.030 | -4330.90 | >> |MergeLoadBench.getIntBU |6853.862 |2763.923 | -4089.94 | >> |MergeLoadBench.getIntBV |306.953 |309.911 | 2.96 | >> |MergeLoadBench.getIntL |10426.843 |6523.716 | -3903.13 | >> |MergeLoadBench.getIntLU |6740.847 |2602.701 | -4138.15 | >> |MergeLoadBench.getIntLV |2233.151 |2231.745 | -1.41 | >> |MergeLoadBench.getIntRB |11335.756 |8980.619 | -2355.14 | >> |MergeLoadBench.getIntRBU |7439.873 |3190.208 | -4249.66 | >> |MergeLoadBench.getIntRL |16323.040 |7786.842 | -8536.20 | >> |MergeLoadBench.getIntRLU |7457.745 |3364.140 | -4093.61 | >> |MergeLoadBench.getIntRU |2512.621 |2511.668 | -0.95 | >> |MergeLoadBench.getIntU |2501.064 |2500.629 | -0.43 | >> |MergeLoadBench.getLongB |21175.442 |21103.660 | -71.78 | >> |MergeLoadBench.getLongBU |14042.046 |2512.784 | -11529.26 | >> |MergeLoadBench.getLongBV |606.448 |606.171 | -0.28 | >> |MergeLoadBench.getLongL |23142.178 |23217.785 | 75.61 | >> |MergeLoadBench.getLongLU |14112.972 |2237.659 | -11875.31 | >> |MergeLoadBench.getLongLV |2230.416 |2231.224 | 0.81 | >> |MergeLoadBench.getLongRB |21152.558 |21140.583 | -11.98 | >> |MergeLoadBench.getLongRBU |14031.178 |2520.317 | -11510.86 | >> |MergeLoadBench.getLongRL |23248.506 |23136.410 | -112.10 | >> |MergeLoadBench.getLongRLU |14125.032 |2240.481 | -11884.55 | >> |Merg... > > Kuai Wei has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 24 commits: > > - Merge remote-tracking branch 'origin/master' into dev/merge_loads > - Move _merge_memops_checks into OrI/OrL > - Fix test error after merging > - Merge remote-tracking branch 'origin/master' into dev/merge_loads > - Fix for comments > - Fix build error on mac and windows > - Add check flag for combine operator > - Make MergeLoadInfoList an in-place growable array > - Fix for comments > - Merge remote-tracking branch 'origin/master' into dev/merge_loads > - ... and 14 more: https://git.openjdk.org/jdk/compare/8674f491...bdaae3ee It need rework to combine merge loads and merge stores in sperate optimize phase. Close it now. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24023#issuecomment-3088524013 From kwei at openjdk.org Fri Jul 18 08:41:04 2025 From: kwei at openjdk.org (Kuai Wei) Date: Fri, 18 Jul 2025 08:41:04 GMT Subject: Withdrawn: 8345485: C2 MergeLoads: merge adjacent array/native memory loads into larger load In-Reply-To: References: Message-ID: <9mJe3dk0nRHoQ8IJVvKKu5Zua7xE7Py6p0Cw5yUK4gM=.35ac2c7c-29f5-45e7-a7b7-45db33b143de@github.com> On Thu, 13 Mar 2025 02:39:16 GMT, Kuai Wei wrote: > In this patch, I extent the merge stores optimization to merge adjacents loads. Tier1 tests are passed in my machine. 
> > The benchmark result of MergeLoadBench.java > AMD EPYC 9T24 96-Core Processor: > > |name | -MergeLoads | +MergeLoads |delta| > |---|---|---|---| > |MergeLoadBench.getCharB |4352.150 |4407.435 | 55.29 | > |MergeLoadBench.getCharBU |4075.320 |4084.663 | 9.34 | > |MergeLoadBench.getCharBV |3221.302 |3221.528 | 0.23 | > |MergeLoadBench.getCharC |2235.433 |2238.796 | 3.36 | > |MergeLoadBench.getCharL |4363.244 |4372.281 | 9.04 | > |MergeLoadBench.getCharLU |4072.550 |4075.744 | 3.19 | > |MergeLoadBench.getCharLV |2227.825 |2231.612 | 3.79 | > |MergeLoadBench.getIntB |11199.935 |6869.030 | -4330.90 | > |MergeLoadBench.getIntBU |6853.862 |2763.923 | -4089.94 | > |MergeLoadBench.getIntBV |306.953 |309.911 | 2.96 | > |MergeLoadBench.getIntL |10426.843 |6523.716 | -3903.13 | > |MergeLoadBench.getIntLU |6740.847 |2602.701 | -4138.15 | > |MergeLoadBench.getIntLV |2233.151 |2231.745 | -1.41 | > |MergeLoadBench.getIntRB |11335.756 |8980.619 | -2355.14 | > |MergeLoadBench.getIntRBU |7439.873 |3190.208 | -4249.66 | > |MergeLoadBench.getIntRL |16323.040 |7786.842 | -8536.20 | > |MergeLoadBench.getIntRLU |7457.745 |3364.140 | -4093.61 | > |MergeLoadBench.getIntRU |2512.621 |2511.668 | -0.95 | > |MergeLoadBench.getIntU |2501.064 |2500.629 | -0.43 | > |MergeLoadBench.getLongB |21175.442 |21103.660 | -71.78 | > |MergeLoadBench.getLongBU |14042.046 |2512.784 | -11529.26 | > |MergeLoadBench.getLongBV |606.448 |606.171 | -0.28 | > |MergeLoadBench.getLongL |23142.178 |23217.785 | 75.61 | > |MergeLoadBench.getLongLU |14112.972 |2237.659 | -11875.31 | > |MergeLoadBench.getLongLV |2230.416 |2231.224 | 0.81 | > |MergeLoadBench.getLongRB |21152.558 |21140.583 | -11.98 | > |MergeLoadBench.getLongRBU |14031.178 |2520.317 | -11510.86 | > |MergeLoadBench.getLongRL |23248.506 |23136.410 | -112.10 | > |MergeLoadBench.getLongRLU |14125.032 |2240.481 | -11884.55 | > |MergeLoadBench.getLongRU |3071.881 |3066.606 | -5.27 | > |Merg... This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/24023 From bmaillard at openjdk.org Fri Jul 18 08:52:34 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Fri, 18 Jul 2025 08:52:34 GMT Subject: RFR: 8361700: Missed optimization in PhaseIterGVN for mask and shift patterns due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v3] In-Reply-To: References: Message-ID: > This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist`. > > The affected optimization is the transformation of `(x & mask) >> shift` into `(x >> shift) & (mask >> shift)`, where `mask` is a constant. This transformation is handled in `RShiftNode::IdealIL` for both `RShiftI` and `RShiftL` nodes. > > The dependency of this optimization extends beyond a direct input: from the viewpoint of a shift node, it relies on changes to the inputs of its inputs (i.e., an `AndI`/`AndL` input node to the shift). Therefore, when the `And` node changes, the corresponding shift node must be notified to allow the optimization to take place. > > Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. 
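As an illustration of the transformation described above, here is a hedged Java sketch. Only the `(x & mask) >> shift` shape comes from the description; the class, method names and the particular constants are made up.

```
class MaskShiftExample {
    // Before: the And node feeds the RShift node.
    static int maskThenShift(int x) {
        return (x & 0xFF00) >> 8;
    }

    // Shape after the rewrite: shift first, then mask with the pre-shifted constant
    // (0xFF00 >> 8 == 0xFF). Both methods compute the same value for any int x.
    static int shiftThenMask(int x) {
        return (x >> 8) & 0xFF;
    }
}
```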
> > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361700) > - [x] tier1-3, plus some internal testing > - [x] Added test from the fuzzer > > Thank you for reviewing! Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/share/opto/phaseX.cpp Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26347/files - new: https://git.openjdk.org/jdk/pull/26347/files/cc3ccc93..3d620328 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26347&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26347&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26347.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26347/head:pull/26347 PR: https://git.openjdk.org/jdk/pull/26347 From xgong at openjdk.org Fri Jul 18 09:01:53 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 18 Jul 2025 09:01:53 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 01:23:43 GMT, Xiaohong Gong wrote: >> ### Background >> On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. >> >> For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. >> >> To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. >> >> ### Impact Analysis >> #### 1. Vector types >> Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. >> >> #### 2. Vector API >> No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. >> >> #### 3. Auto-vectorization >> Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. >> >> #### 4. Codegen of vector nodes >> NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. >> >> Details: >> - Lanewise vector operations are unaffected as explained above. >> - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). >> - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. 
The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, addin... > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Disable auto-vectorization of double to short conversion for NEON and update tests Hi @jatin-bhateja, would you mind help taking a look at the IR test part especially the IR check on X86? Thanks a lot in advance! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3088643964 From fyang at openjdk.org Fri Jul 18 11:13:53 2025 From: fyang at openjdk.org (Fei Yang) Date: Fri, 18 Jul 2025 11:13:53 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v10] In-Reply-To: References: Message-ID: On Fri, 18 Jul 2025 09:07:54 GMT, Yuri Gaevsky wrote: >>> > Looking at the JMH numbers, it's interesting to find that `-XX:DisableIntrinsic=_vectorizedHashCode` outperforms `-XX:-UseRVV`. If that is the case, then why would we want the scalar version (that is `C2_MacroAssembler::arrays_hashcode()`)? >>> >>> You are right: the non-RVV version of intrinsic performs worse on BPI-F3 hardware with size > 70, though originally it was better on StarFive JH7110 and T-Head RVB-ICE, please see #16629. >> >> Hm, it is still good on Lichee Pi 4A: >> >> $ ( for i in "-XX:DisableIntrinsic=_vectorizedHashCode" " " ; do ( echo "--- ${i} ---" && ${JAVA_HOME}/bin/java -jar benchmarks.jar --jvmArgs="-XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions ${i}" org.openjdk.bench.java.lang.ArraysHashCode.ints -p size=1,5,10,20,30,40,50,60,70,80,90,100,200,300 -f 3 -r 1 -w 1 -wi 10 -i 10 2>&1 | tail -15 ) done ) >> --- -XX:DisableIntrinsic=_vectorizedHashCode --- >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.ints 1 avgt 30 51.709 ? 3.815 ns/op >> ArraysHashCode.ints 5 avgt 30 68.146 ? 1.833 ns/op >> ArraysHashCode.ints 10 avgt 30 89.217 ? 0.496 ns/op >> ArraysHashCode.ints 20 avgt 30 140.807 ? 9.335 ns/op >> ArraysHashCode.ints 30 avgt 30 172.030 ? 4.025 ns/op >> ArraysHashCode.ints 40 avgt 30 222.927 ? 10.342 ns/op >> ArraysHashCode.ints 50 avgt 30 251.719 ? 0.686 ns/op >> ArraysHashCode.ints 60 avgt 30 305.947 ? 10.532 ns/op >> ArraysHashCode.ints 70 avgt 30 347.602 ? 7.024 ns/op >> ArraysHashCode.ints 80 avgt 30 382.057 ? 1.520 ns/op >> ArraysHashCode.ints 90 avgt 30 426.022 ? 31.800 ns/op >> ArraysHashCode.ints 100 avgt 30 457.737 ? 0.652 ns/op >> ArraysHashCode.ints 200 avgt 30 913.501 ? 3.258 ns/op >> ArraysHashCode.ints 300 avgt 30 1297.355 ? 2.383 ns/op >> --- --- >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.ints 1 avgt 30 50.141 ? 0.463 ns/op >> ArraysHashCode.ints 5 avgt 30 62.921 ? 2.538 ns/op >> ArraysHashCode.ints 10 avgt 30 77.686 ? 2.577 ns/op >> ArraysHashCode.ints 20 avgt 30 102.736 ? 0.136 ns/op >> ArraysHashCode.ints 30 avgt 30 137.592 ? 4.232 ns/op >> ArraysHashCode.ints 40 avgt 30 157.376 ? 0.302 ns/op >> ArraysHashCode.ints 50 avgt 30 196.068 ? 3.812 ns/op >> ArraysHashCode.ints 60 avgt 30 212.... > >> Looking at the JMH numbers, it's interesting to find that `-XX:DisableIntrinsic=_vectorizedHashCode` outperforms `-XX:-UseRVV`. If that is the case, then why would we want the scalar version (that is `C2_MacroAssembler::arrays_hashcode()`)? 
> > I've just found that the following change: > > $ git diff > diff --git a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp > index c62997310b3..f98b48adccd 100644 > --- a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp > +++ b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp > @@ -1953,16 +1953,15 @@ void C2_MacroAssembler::arrays_hashcode(Register ary, Register cnt, Register res > mv(pow31_3, 29791); // [31^^3] > mv(pow31_2, 961); // [31^^2] > > - slli(chunks_end, chunks, chunks_end_shift); > - add(chunks_end, ary, chunks_end); > + shadd(chunks_end, chunks, ary, t0, chunks_end_shift); > andi(cnt, cnt, stride - 1); // don't forget about tail! > > bind(WIDE_LOOP); > - mulw(result, result, pow31_4); // 31^^4 * h > arrays_hashcode_elload(t0, Address(ary, 0 * elsize), eltype); > arrays_hashcode_elload(t1, Address(ary, 1 * elsize), eltype); > arrays_hashcode_elload(tmp5, Address(ary, 2 * elsize), eltype); > arrays_hashcode_elload(tmp6, Address(ary, 3 * elsize), eltype); > + mulw(result, result, pow31_4); // 31^^4 * h > mulw(t0, t0, pow31_3); // 31^^3 * ary[i+0] > addw(result, result, t0); > mulw(t1, t1, pow31_2); // 31^^2 * ary[i+1] > @@ -1977,8 +1976,7 @@ void C2_MacroAssembler::arrays_hashcode(Register ary, Register cnt, Register res > beqz(cnt, DONE); > > bind(TAIL); > - slli(chunks_end, cnt, chunks_end_shift); > - add(chunks_end, ary, chunks_end); > + shadd(chunks_end, cnt, ary, t0, chunks_end_shift); > > bind(TAIL_LOOP); > arrays_hashcode_elload(t0, Address(ary), eltype); > > makes the numbers good again at BPI-F3 as well (mostly due to move `mulw` down in the loop): > > --- -XX:DisableIntrinsic=_vectorizedHashCode --- > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.ints 1 avgt 10 11.271 ? 0.003 ns/op > ArraysHashCode.ints 5 avgt 10 28.910 ? 0.036 ns/op > ArraysHashCode.ints 10 avgt 10 41.176 ? 0.383 ns/op > ArraysHashCode.ints 20 avgt 10 68.236 ? 0.087 ns/op > ArraysHashCode.ints 30 avgt 10 88.215 ? 0.272 ns/op > ArraysHashCode.ints 40 avgt 10 115.218 ? 0.065 ns/op > ArraysHashCode.ints 50 avgt 10 135.834 ? 0.374 ns/op > ArraysHashCode.in... @ygaevsky : Thanks for finding that. Could you please propose another PR to fix that? It looks like a micro-optimization for K1. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-3089113887 From duke at openjdk.org Fri Jul 18 11:13:52 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Fri, 18 Jul 2025 11:13:52 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v10] In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 12:41:47 GMT, Yuri Gaevsky wrote: > Looking at the JMH numbers, it's interesting to find that `-XX:DisableIntrinsic=_vectorizedHashCode` outperforms `-XX:-UseRVV`. If that is the case, then why would we want the scalar version (that is `C2_MacroAssembler::arrays_hashcode()`)? 
I've just found that the following change: $ git diff diff --git a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp index c62997310b3..f98b48adccd 100644 --- a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp +++ b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp @@ -1953,16 +1953,15 @@ void C2_MacroAssembler::arrays_hashcode(Register ary, Register cnt, Register res mv(pow31_3, 29791); // [31^^3] mv(pow31_2, 961); // [31^^2] - slli(chunks_end, chunks, chunks_end_shift); - add(chunks_end, ary, chunks_end); + shadd(chunks_end, chunks, ary, t0, chunks_end_shift); andi(cnt, cnt, stride - 1); // don't forget about tail! bind(WIDE_LOOP); - mulw(result, result, pow31_4); // 31^^4 * h arrays_hashcode_elload(t0, Address(ary, 0 * elsize), eltype); arrays_hashcode_elload(t1, Address(ary, 1 * elsize), eltype); arrays_hashcode_elload(tmp5, Address(ary, 2 * elsize), eltype); arrays_hashcode_elload(tmp6, Address(ary, 3 * elsize), eltype); + mulw(result, result, pow31_4); // 31^^4 * h mulw(t0, t0, pow31_3); // 31^^3 * ary[i+0] addw(result, result, t0); mulw(t1, t1, pow31_2); // 31^^2 * ary[i+1] @@ -1977,8 +1976,7 @@ void C2_MacroAssembler::arrays_hashcode(Register ary, Register cnt, Register res beqz(cnt, DONE); bind(TAIL); - slli(chunks_end, cnt, chunks_end_shift); - add(chunks_end, ary, chunks_end); + shadd(chunks_end, cnt, ary, t0, chunks_end_shift); bind(TAIL_LOOP); arrays_hashcode_elload(t0, Address(ary), eltype); makes the numbers good again at BPI-F3 as well (mostly due to move `mulw` down in the loop): --- -XX:DisableIntrinsic=_vectorizedHashCode --- Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.ints 1 avgt 10 11.271 ? 0.003 ns/op ArraysHashCode.ints 5 avgt 10 28.910 ? 0.036 ns/op ArraysHashCode.ints 10 avgt 10 41.176 ? 0.383 ns/op ArraysHashCode.ints 20 avgt 10 68.236 ? 0.087 ns/op ArraysHashCode.ints 30 avgt 10 88.215 ? 0.272 ns/op ArraysHashCode.ints 40 avgt 10 115.218 ? 0.065 ns/op ArraysHashCode.ints 50 avgt 10 135.834 ? 0.374 ns/op ArraysHashCode.ints 60 avgt 10 162.042 ? 0.488 ns/op ArraysHashCode.ints 70 avgt 10 170.784 ? 0.538 ns/op ArraysHashCode.ints 80 avgt 10 194.294 ? 0.407 ns/op ArraysHashCode.ints 90 avgt 10 208.811 ? 0.289 ns/op ArraysHashCode.ints 100 avgt 10 231.826 ? 0.471 ns/op ArraysHashCode.ints 200 avgt 10 446.403 ? 0.491 ns/op ArraysHashCode.ints 300 avgt 10 655.815 ? 0.603 ns/op --- -XX:-UseRVV --- Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.ints 1 avgt 10 11.281 ? 0.004 ns/op ArraysHashCode.ints 5 avgt 10 23.178 ? 0.011 ns/op ArraysHashCode.ints 10 avgt 10 33.183 ? 0.018 ns/op ArraysHashCode.ints 20 avgt 10 50.778 ? 0.027 ns/op ArraysHashCode.ints 30 avgt 10 70.892 ? 0.153 ns/op ArraysHashCode.ints 40 avgt 10 88.292 ? 0.018 ns/op ArraysHashCode.ints 50 avgt 10 108.978 ? 0.269 ns/op ArraysHashCode.ints 60 avgt 10 126.010 ? 0.064 ns/op ArraysHashCode.ints 70 avgt 10 146.115 ? 0.252 ns/op ArraysHashCode.ints 80 avgt 10 163.453 ? 0.078 ns/op ArraysHashCode.ints 90 avgt 10 184.433 ? 0.256 ns/op ArraysHashCode.ints 100 avgt 10 201.002 ? 0.036 ns/op ArraysHashCode.ints 200 avgt 10 388.929 ? 0.254 ns/op ArraysHashCode.ints 300 avgt 10 577.083 ? 0.325 ns/op And it's still good on other hardware mentioned earlier. 
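For context on the scalar loop being tuned above: it evaluates the standard `Arrays.hashCode` polynomial four elements at a time. Below is a Java sketch of that recurrence; the class and method are made up for illustration, but the constants 29791 (31^3) and 961 (31^2) are the same ones loaded by the assembly shown above.

```
class HashCodeExample {
    // Reference semantics: h = 31 * h + a[i], starting from h = 1.
    // The 4-way unrolled form consumes four elements per iteration:
    //   h = 31^4*h + 31^3*a[i] + 31^2*a[i+1] + 31*a[i+2] + a[i+3]
    static int hashCodeUnrolled4(int[] a) {
        int h = 1;
        int i = 0;
        for (; i + 4 <= a.length; i += 4) {
            h = 923521 * h           // 31^4
              + 29791 * a[i]         // 31^3
              + 961 * a[i + 1]       // 31^2
              + 31 * a[i + 2]
              + a[i + 3];
        }
        for (; i < a.length; i++) {  // tail elements
            h = 31 * h + a[i];
        }
        return h;
    }
}
```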
------------- PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-3088680935 From dbriemann at openjdk.org Fri Jul 18 13:21:00 2025 From: dbriemann at openjdk.org (David Briemann) Date: Fri, 18 Jul 2025 13:21:00 GMT Subject: RFR: 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts Message-ID: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> Add the TimeoutFactor property to the CompileFramework to avoid timeouts on different systems. ------------- Commit messages: - 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts Changes: https://git.openjdk.org/jdk/pull/26388/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26388&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362602 Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26388.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26388/head:pull/26388 PR: https://git.openjdk.org/jdk/pull/26388 From syan at openjdk.org Fri Jul 18 14:58:49 2025 From: syan at openjdk.org (SendaoYan) Date: Fri, 18 Jul 2025 14:58:49 GMT Subject: RFR: 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts In-Reply-To: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> References: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> Message-ID: On Fri, 18 Jul 2025 13:16:41 GMT, David Briemann wrote: > Add the TimeoutFactor property to the CompileFramework to avoid timeouts on different systems. test/hotspot/jtreg/compiler/lib/compile_framework/Compile.java line 43: > 41: class Compile { > 42: private static final int COMPILE_TIMEOUT = 60; > 43: private static final float timeoutFactor = Float.parseFloat(System.getProperty("test.timeout.factor", "1.0")); Should we update the copyright year. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26388#discussion_r2216275658 From galder at openjdk.org Fri Jul 18 17:02:56 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Fri, 18 Jul 2025 17:02:56 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v2] In-Reply-To: References: Message-ID: On Thu, 10 Jul 2025 14:24:07 GMT, Feilong Jiang wrote: > I can't really review it since I'm not familiar with neither riscv, ~nor the flag~ nor the COH logic. Of course I do know the flag ?! Sorry, a lot going on, I will provide a review ------------- PR Comment: https://git.openjdk.org/jdk/pull/25976#issuecomment-3089957669 From kvn at openjdk.org Fri Jul 18 17:58:50 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 18 Jul 2025 17:58:50 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 [v2] In-Reply-To: References: Message-ID: <4IPo0l9irNFt1HsnbWaV35OSGaBLnZ_nvu_65u7oByA=.eaf6bcce-7c73-47e9-bb50-ed465349b57c@github.com> On Tue, 15 Jul 2025 08:59:17 GMT, Aleksey Shipilev wrote: >> See the bug for more analysis. >> >> The short summary is that `CompileQueue::delete_all` walks the entire compile queue and deletes the tasks. It normally goes smoothly, unless there are blocking tasks. Then, the actual waiters have to delete the task, lest we delete the task under waiter's feet. Full deletion and blocking waits coordinate with `waiting_for_completion_count` counter. 
This mechanism -- added by [JDK-8343938](https://bugs.openjdk.org/browse/JDK-8343938) in JDK 25 to solve a similar problem -- almost works. _Almost_. >> >> There is a subtle race window, where blocking waiter could have already unparked, dropped `waiting_for_completion_count` to `0` and proceeded to delete the task, see `CompileBroker::wait_for_completion()`. Then the queue deletion code could assume there are _no actual waiters_ on the blocking task, and proceed to delete the task _again_. Before [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473) this race was fairly innocuous, as second attempt at insertion into the free list was benign. But now, `CompileTask`-s are `delete`-d, and the second attempt leads to double free. >> >> I suspect we can fix that by complicating the coordination protocol even further, e.g. by tracking the counters more thoroughly. But, recognizing `CompileQueue::delete_all()` is basically only called from the compiler shutdown code (things are already bad), and it looks completely opportunistic (it does not delete the whole compiler _threads_, so skipping synchronous deletes on a few compile tasks are not a big deal), we should strive to simplify it. >> >> This PR summarily delegates _all_ blocking task deletes to waiters. I think it stands to reason (and can be seen in `CompilerBroker` code) that if a blocking task is in queue, then there _is_ a waiter that would call `CompileBroker::wait_for_completion()` on it. >> >> Additional testing: >> - [x] Linux AArch64 server fastdebug, `tier1` >> - [x] Linux x86_64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Also handle the corner case when compiler threads might be using the task @iwanowww please look? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26294#issuecomment-3090241741 From duke at openjdk.org Fri Jul 18 19:29:56 2025 From: duke at openjdk.org (duke) Date: Fri, 18 Jul 2025 19:29:56 GMT Subject: Withdrawn: 8352141: UBSAN: fix the left shift of negative value in relocInfo.cpp, internal_word_Relocation::pack_data_to() In-Reply-To: References: Message-ID: On Mon, 24 Mar 2025 13:18:25 GMT, Afshin Zafari wrote: > The `offset` variable used in left-shift op can be a large number with its sign-bit set. This makes a negative value which is UB for left-shift and is reported as > `runtime error: left shift of negative value -25 at relocInfo.cpp:...` > > Using `java_left_shif()` function is the workaround to avoid UB. This function uses reinterpret_cast to cast from signed to unsigned and back. > > Tests: > linux-x64-debug tier1 on a UBSAN enabled build. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/24196 From vlivanov at openjdk.org Fri Jul 18 21:02:41 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 18 Jul 2025 21:02:41 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 [v2] In-Reply-To: References: Message-ID: <3S7A3vvhpPdXNU8_-NMJA99cqsHwMrVByf-nT_jdiA8=.044554dc-61df-490f-b582-2c276bdab309@github.com> On Tue, 15 Jul 2025 08:59:17 GMT, Aleksey Shipilev wrote: >> See the bug for more analysis. >> >> The short summary is that `CompileQueue::delete_all` walks the entire compile queue and deletes the tasks. It normally goes smoothly, unless there are blocking tasks. Then, the actual waiters have to delete the task, lest we delete the task under waiter's feet. 
Full deletion and blocking waits coordinate with `waiting_for_completion_count` counter. This mechanism -- added by [JDK-8343938](https://bugs.openjdk.org/browse/JDK-8343938) in JDK 25 to solve a similar problem -- almost works. _Almost_. >> >> There is a subtle race window, where blocking waiter could have already unparked, dropped `waiting_for_completion_count` to `0` and proceeded to delete the task, see `CompileBroker::wait_for_completion()`. Then the queue deletion code could assume there are _no actual waiters_ on the blocking task, and proceed to delete the task _again_. Before [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473) this race was fairly innocuous, as second attempt at insertion into the free list was benign. But now, `CompileTask`-s are `delete`-d, and the second attempt leads to double free. >> >> I suspect we can fix that by complicating the coordination protocol even further, e.g. by tracking the counters more thoroughly. But, recognizing `CompileQueue::delete_all()` is basically only called from the compiler shutdown code (things are already bad), and it looks completely opportunistic (it does not delete the whole compiler _threads_, so skipping synchronous deletes on a few compile tasks are not a big deal), we should strive to simplify it. >> >> This PR summarily delegates _all_ blocking task deletes to waiters. I think it stands to reason (and can be seen in `CompilerBroker` code) that if a blocking task is in queue, then there _is_ a waiter that would call `CompileBroker::wait_for_completion()` on it. >> >> Additional testing: >> - [x] Linux AArch64 server fastdebug, `tier1` >> - [x] Linux x86_64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Also handle the corner case when compiler threads might be using the task Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26294#pullrequestreview-3034820180 From liach at openjdk.org Sat Jul 19 01:36:51 2025 From: liach at openjdk.org (Chen Liang) Date: Sat, 19 Jul 2025 01:36:51 GMT Subject: RFR: 8355223: Improve documentation on @IntrinsicCandidate [v7] In-Reply-To: References: Message-ID: <5_qPhVu2mYEDcDhm-xAgdB75_852NRgbkZBvYx2l50w=.930af007-8d62-4c9e-9f48-cbeaebc98cf3@github.com> On Wed, 21 May 2025 21:31:16 GMT, Chen Liang wrote: >> In offline discussion, we noted that the documentation on this annotation does not recommend minimizing the intrinsified section and moving whatever can be done in Java to Java; thus I prepared this documentation update, to shrink a "TLDR" essay to something concise for readers, such as pointing to that list at `vmIntrinsics.hpp` instead of "a list". > > Chen Liang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 10 additional commits since the last revision: > > - More review updates > - Merge branch 'master' of https://github.com/openjdk/jdk into doc/intrinsic-candidate > - Move intrinsic to be a subsection; just one most common function of the annotation > - Merge branch 'master' of https://github.com/openjdk/jdk into doc/intrinsic-candidate > - Merge branch 'master' of https://github.com/openjdk/jdk into doc/intrinsic-candidate > - Update src/java.base/share/classes/jdk/internal/vm/annotation/IntrinsicCandidate.java > > Co-authored-by: Raffaello Giulietti > - Shorter first sentence > - Updates, thanks to John > - Refine validation and defensive copying > - 8355223: Improve documentation on @IntrinsicCandidate I think I will move this to a separate design document as Roger suggested. ------------- PR Comment: https://git.openjdk.org/jdk/pull/24777#issuecomment-3091360211 From duke at openjdk.org Sat Jul 19 01:33:49 2025 From: duke at openjdk.org (duke) Date: Sat, 19 Jul 2025 01:33:49 GMT Subject: Withdrawn: 8355223: Improve documentation on @IntrinsicCandidate In-Reply-To: References: Message-ID: On Mon, 21 Apr 2025 19:29:44 GMT, Chen Liang wrote: > In offline discussion, we noted that the documentation on this annotation does not recommend minimizing the intrinsified section and moving whatever can be done in Java to Java; thus I prepared this documentation update, to shrink a "TLDR" essay to something concise for readers, such as pointing to that list at `vmIntrinsics.hpp` instead of "a list". This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/24777 From fjiang at openjdk.org Mon Jul 21 01:45:15 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Mon, 21 Jul 2025 01:45:15 GMT Subject: RFR: 8362838: RISC-V: Incorrect matching rule leading to improper oop instruction encoding Message-ID: Same as [JDK-8361892](https://bugs.openjdk.org/browse/JDK-8361892), but for riscv. Testing: - [x] Tier1-3 & hotspot:tier4 on linux-riscv64 ------------- Commit messages: - RISC-V: Incorrect matching rule leading to improper oop instruction encoding Changes: https://git.openjdk.org/jdk/pull/26318/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26318&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362838 Stats: 31 lines in 1 file changed: 0 ins; 31 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26318.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26318/head:pull/26318 PR: https://git.openjdk.org/jdk/pull/26318 From fyang at openjdk.org Mon Jul 21 02:25:38 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 21 Jul 2025 02:25:38 GMT Subject: RFR: 8362838: RISC-V: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 14:02:08 GMT, Feilong Jiang wrote: > Same as [JDK-8361892](https://bugs.openjdk.org/browse/JDK-8361892), but for riscv. > > Testing: > - [x] Tier1-3 & hotspot:tier4 on linux-riscv64 Look good to me. Thanks for fixing this. ------------- Marked as reviewed by fyang (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26318#pullrequestreview-3036344155 From fjiang at openjdk.org Mon Jul 21 02:30:41 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Mon, 21 Jul 2025 02:30:41 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v4] In-Reply-To: References: <30vvzTU6W2p0YpB8Z9bSfO9ajO_fHh79q9cX1G3gz3k=.521b26d7-b606-4fdc-bdcf-41fd6c4891cc@github.com> Message-ID: On Thu, 10 Jul 2025 22:43:16 GMT, Dean Long wrote: >> Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: >> >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - Revert RISCV Macro modification >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses >> - riscv: fix c1 primitive array clone intrinsic regression > > src/hotspot/cpu/riscv/c1_LIRGenerator_riscv.cpp line 775: > >> 773: arraycopy_helper(x, &flags, &expected_type); >> 774: if (x->check_flag(Instruction::OmitChecksFlag)) { >> 775: flags = (flags & LIR_OpArrayCopy::unaligned); > > Should be LIR_OpArrayCopy::unaligned|LIR_OpArrayCopy::overlapping? See below. fixed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2218083182 From dzhang at openjdk.org Mon Jul 21 02:35:25 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 21 Jul 2025 02:35:25 GMT Subject: RFR: 8357694: RISC-V: Several IR verification tests fail when vlen=128 Message-ID: Hi, Can you help to review this patch? Thanks! These tests failed because with a vlen of 128, these tests generate vectors containing only two elements. However, ??2-element reductions for INT/LONG are not profitable??, so the compiler won't generate the corresponding reductions IR. We can refer to here: https://github.com/openjdk/jdk/blob/441dbde2c3c915ffd916e39a5b4a91df5620d7f3/src/hotspot/share/opto/superword.cpp#L1606-L1633 According to the explanation above, when I use `-XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=2`, these cases passed with vlen=128. We can fix this problem by adding the restriction of `MaxVectorSize` greater than or equal to 32 (256 bits) to these test cases. 
## Test (fastdebug) ### Test on k1 and qemu (w/ RVV, vlen=128) - compiler/vectorization/runner/LoopReductionOpTest.java - compiler/c2/irTests/TestIfMinMax.java - compiler/loopopts/superword/RedTest_long.java - compiler/loopopts/superword/SumRed_Long.java - compiler/loopopts/superword/TestGeneralizedReductions.java - compiler/loopopts/superword/TestUnorderedReductionPartialVectorization.java ------------- Commit messages: - 8357694: RISC-V: Several IR verification tests fail when vlen=128 Changes: https://git.openjdk.org/jdk/pull/26408/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26408&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8357694 Stats: 19 lines in 6 files changed: 7 ins; 0 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/26408.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26408/head:pull/26408 PR: https://git.openjdk.org/jdk/pull/26408 From fyang at openjdk.org Mon Jul 21 04:17:44 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 21 Jul 2025 04:17:44 GMT Subject: RFR: 8357694: RISC-V: Several IR verification tests fail when vlen=128 In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 02:30:03 GMT, Dingli Zhang wrote: > Hi, > Can you help to review this patch? Thanks! > > These tests failed because with a vlen of 128, these tests generate vectors containing only two elements. However, ??2-element reductions for INT/LONG are not profitable??, so the compiler won't generate the corresponding reductions IR. > > We can refer to here: > https://github.com/openjdk/jdk/blob/441dbde2c3c915ffd916e39a5b4a91df5620d7f3/src/hotspot/share/opto/superword.cpp#L1606-L1633 > > According to the explanation above, when I use `-XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=2`, these cases passed with vlen=128. > > We can fix this problem by adding the restriction of `MaxVectorSize` greater than or equal to 32 (256 bits) to these test cases. > > ## Test (fastdebug) > ### Test on k1 and qemu (w/ RVV, vlen=128) > - compiler/vectorization/runner/LoopReductionOpTest.java > - compiler/c2/irTests/TestIfMinMax.java > - compiler/loopopts/superword/RedTest_long.java > - compiler/loopopts/superword/SumRed_Long.java > - compiler/loopopts/superword/TestGeneralizedReductions.java > - compiler/loopopts/superword/TestUnorderedReductionPartialVectorization.java test/hotspot/jtreg/compiler/c2/irTests/TestIfMinMax.java line 298: > 296: > 297: @Test > 298: @IR(applyIfAnd = { "SuperWordReductions", "true", "MaxVectorSize", ">=32" }, Maybe we should only add this new requirement for RVV? Seems that it is not needed for avx512. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26408#discussion_r2218149766 From dzhang at openjdk.org Mon Jul 21 05:01:46 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 21 Jul 2025 05:01:46 GMT Subject: RFR: 8357694: RISC-V: Several IR verification tests fail when vlen=128 [v2] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 04:15:19 GMT, Fei Yang wrote: >> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> Apply fix only on RISC-V > > test/hotspot/jtreg/compiler/c2/irTests/TestIfMinMax.java line 298: > >> 296: >> 297: @Test >> 298: @IR(applyIfAnd = { "SuperWordReductions", "true", "MaxVectorSize", ">=32" }, > > Maybe we should only add this new requirement for RVV? Seems that it is not needed for avx512. Thanks for the review! I will add restrictions only for RISC-V. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26408#discussion_r2218189115 From dzhang at openjdk.org Mon Jul 21 05:01:45 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 21 Jul 2025 05:01:45 GMT Subject: RFR: 8357694: RISC-V: Several IR verification tests fail when vlen=128 [v2] In-Reply-To: References: Message-ID: <__-vaE1gtG6vsSH8CWGTh4HPOy267edyiA3viq-YtWc=.163ece35-ae6a-4fec-9d7a-81dcbb72fa38@github.com> > Hi, > Can you help to review this patch? Thanks! > > These tests failed because with a vlen of 128, these tests generate vectors containing only two elements. However, ??2-element reductions for INT/LONG are not profitable??, so the compiler won't generate the corresponding reductions IR. > > We can refer to here: > https://github.com/openjdk/jdk/blob/441dbde2c3c915ffd916e39a5b4a91df5620d7f3/src/hotspot/share/opto/superword.cpp#L1606-L1633 > > According to the explanation above, when I use `-XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=2`, these cases passed with vlen=128. > > We can fix this problem by adding the restriction of `MaxVectorSize` greater than or equal to 32 (256 bits) to these test cases. > > ## Test (fastdebug) > ### Test on k1 and qemu (w/ RVV, vlen=128) > - compiler/vectorization/runner/LoopReductionOpTest.java > - compiler/c2/irTests/TestIfMinMax.java > - compiler/loopopts/superword/RedTest_long.java > - compiler/loopopts/superword/SumRed_Long.java > - compiler/loopopts/superword/TestGeneralizedReductions.java > - compiler/loopopts/superword/TestUnorderedReductionPartialVectorization.java Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: Apply fix only on RISC-V ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26408/files - new: https://git.openjdk.org/jdk/pull/26408/files/611a0d85..43815f2f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26408&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26408&range=00-01 Stats: 44 lines in 4 files changed: 29 ins; 0 del; 15 mod Patch: https://git.openjdk.org/jdk/pull/26408.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26408/head:pull/26408 PR: https://git.openjdk.org/jdk/pull/26408 From shade at openjdk.org Mon Jul 21 06:06:49 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 21 Jul 2025 06:06:49 GMT Subject: RFR: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 [v2] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 08:59:17 GMT, Aleksey Shipilev wrote: >> See the bug for more analysis. >> >> The short summary is that `CompileQueue::delete_all` walks the entire compile queue and deletes the tasks. It normally goes smoothly, unless there are blocking tasks. Then, the actual waiters have to delete the task, lest we delete the task under waiter's feet. Full deletion and blocking waits coordinate with `waiting_for_completion_count` counter. This mechanism -- added by [JDK-8343938](https://bugs.openjdk.org/browse/JDK-8343938) in JDK 25 to solve a similar problem -- almost works. _Almost_. >> >> There is a subtle race window, where blocking waiter could have already unparked, dropped `waiting_for_completion_count` to `0` and proceeded to delete the task, see `CompileBroker::wait_for_completion()`. Then the queue deletion code could assume there are _no actual waiters_ on the blocking task, and proceed to delete the task _again_. 
Before [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473) this race was fairly innocuous, as second attempt at insertion into the free list was benign. But now, `CompileTask`-s are `delete`-d, and the second attempt leads to double free. >> >> I suspect we can fix that by complicating the coordination protocol even further, e.g. by tracking the counters more thoroughly. But, recognizing `CompileQueue::delete_all()` is basically only called from the compiler shutdown code (things are already bad), and it looks completely opportunistic (it does not delete the whole compiler _threads_, so skipping synchronous deletes on a few compile tasks are not a big deal), we should strive to simplify it. >> >> This PR summarily delegates _all_ blocking task deletes to waiters. I think it stands to reason (and can be seen in `CompilerBroker` code) that if a blocking task is in queue, then there _is_ a waiter that would call `CompileBroker::wait_for_completion()` on it. >> >> Additional testing: >> - [x] Linux AArch64 server fastdebug, `tier1` >> - [x] Linux x86_64 server fastdebug, `all` > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Also handle the corner case when compiler threads might be using the task Thank you! I re-tested locally after local merge with current master, and it still works. Here goes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26294#issuecomment-3095320950 From shade at openjdk.org Mon Jul 21 06:06:49 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 21 Jul 2025 06:06:49 GMT Subject: Integrated: 8361752: Double free in CompileQueue::delete_all after JDK-8357473 In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 13:57:09 GMT, Aleksey Shipilev wrote: > See the bug for more analysis. > > The short summary is that `CompileQueue::delete_all` walks the entire compile queue and deletes the tasks. It normally goes smoothly, unless there are blocking tasks. Then, the actual waiters have to delete the task, lest we delete the task under waiter's feet. Full deletion and blocking waits coordinate with `waiting_for_completion_count` counter. This mechanism -- added by [JDK-8343938](https://bugs.openjdk.org/browse/JDK-8343938) in JDK 25 to solve a similar problem -- almost works. _Almost_. > > There is a subtle race window, where blocking waiter could have already unparked, dropped `waiting_for_completion_count` to `0` and proceeded to delete the task, see `CompileBroker::wait_for_completion()`. Then the queue deletion code could assume there are _no actual waiters_ on the blocking task, and proceed to delete the task _again_. Before [JDK-8357473](https://bugs.openjdk.org/browse/JDK-8357473) this race was fairly innocuous, as second attempt at insertion into the free list was benign. But now, `CompileTask`-s are `delete`-d, and the second attempt leads to double free. > > I suspect we can fix that by complicating the coordination protocol even further, e.g. by tracking the counters more thoroughly. But, recognizing `CompileQueue::delete_all()` is basically only called from the compiler shutdown code (things are already bad), and it looks completely opportunistic (it does not delete the whole compiler _threads_, so skipping synchronous deletes on a few compile tasks are not a big deal), we should strive to simplify it. > > This PR summarily delegates _all_ blocking task deletes to waiters. 
I think it stands to reason (and can be seen in `CompilerBroker` code) that if a blocking task is in queue, then there _is_ a waiter that would call `CompileBroker::wait_for_completion()` on it. > > Additional testing: > - [x] Linux AArch64 server fastdebug, `tier1` > - [x] Linux x86_64 server fastdebug, `all` This pull request has now been integrated. Changeset: 9609f57c Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/9609f57cef684d2f44d3e12a3522811a3c0776f4 Stats: 71 lines in 5 files changed: 19 ins; 40 del; 12 mod 8361752: Double free in CompileQueue::delete_all after JDK-8357473 Reviewed-by: kvn, vlivanov ------------- PR: https://git.openjdk.org/jdk/pull/26294 From yadongwang at openjdk.org Mon Jul 21 06:52:43 2025 From: yadongwang at openjdk.org (Yadong Wang) Date: Mon, 21 Jul 2025 06:52:43 GMT Subject: RFR: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding [v2] In-Reply-To: References: Message-ID: On Sun, 13 Jul 2025 08:40:45 GMT, Yadong Wang wrote: >> The bug is that the predicate rule of immByteMapBase would cause a ConP Node for oop incorrect matching with byte_map_base when the placeholder jni handle address was just allocated to the address of byte_map_base. >> >> C2 uses JNI handles as placeholders to encoding constant oops, and one of some handle maybe locate at the address of byte_map_base, which is not memory reserved by CardTable. It's possible because JNIHandleBlocks are allocated by malloc. >> >> // The assembler store_check code will do an unsigned shift of the oop, >> // then add it to _byte_map_base, i.e. >> // >> // _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift) >> _byte_map = (CardValue*) rs.base(); >> _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift); >> >> In aarch64 port, C2 will incorrectly match ConP for oop to ConP for byte_map_base by the immByteMapBase operand. >> >> // Card Table Byte Map Base >> operand immByteMapBase() >> %{ >> // Get base of card map >> predicate((jbyte*)n->get_ptr() == >> ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); >> match(ConP); >> >> op_cost(0); >> format %{ %} >> interface(CONST_INTER); >> %} >> >> // Load Byte Map Base Constant >> instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) >> %{ >> match(Set dst con); >> >> ins_cost(INSN_COST); >> format %{ "adr $dst, $con\t# Byte Map Base" %} >> >> ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); >> >> ins_pipe(ialu_imm); >> %} >> >> As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: >> 0xffff25caf08c: ldaxr x8, [x11] >> 0xffff25caf090: cmp x10, x8 >> 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any >> 0xffff25caf098: stlxr w8, x28, [x11] >> 0xffff25caf09c: cbnz w8, 0xffff25caf08c >> 0xffff25caf0a0: orr x11, xzr, #0x3 >> 0xffff25caf0a4: str x11, [x13] >> 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none >> 0xffff25caf0ac: str x14, [sp] >> 0xffff25caf0b0: add x2, sp, #0x20 >> 0xffff25caf0b4: adrp x1, 0xffff21730000 >> 0xffff25caf0b8: bl 0xffff256fffc0 >> 0xffff25caf0bc: ldr x14, [sp] >> 0xffff25caf0c0: b 0xffff25caef80 >> 0xffff25caf0c4: add x13, sp, #0x20 >> 0xffff25caf0c8: adrp x12, 0xffff21730000 >> 0xffff25caf0cc: ldr x10, [x13] >> 0xffff25caf0d0: cmp x10, xzr >> 0xffff25c... 
> > Yadong Wang has updated the pull request incrementally with one additional commit since the last revision: > > 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding Does anyone have any questions about this modification proposal? If not, it will be integrated tomorrow. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26249#issuecomment-3095455340 From thartmann at openjdk.org Mon Jul 21 07:22:54 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 21 Jul 2025 07:22:54 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v15] In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 10:43:34 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. >> >> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. >> >> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Refine lower bound computation This looks good to me and I think you addressed all the comments that Emanuel had. Let's wait for another day or two in case someone else wants to take a look as well. In the meantime, please request approval for integration into JDK 25 since we are know at RDP 2: https://openjdk.org/jeps/3#Fix-Request-Process ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23947#pullrequestreview-3036859676 From jbhateja at openjdk.org Mon Jul 21 07:26:55 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 21 Jul 2025 07:26:55 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4] In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 09:09:14 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is >> relative smaller than that of `fromLong`. So this patch does the conversion for these cases. >> >> The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. >> >> Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. >> >> This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. 
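For reference, a minimal Vector API example of the equivalence being exploited here. It is illustrative only and not taken from the patch: the species, the constant and the class name are arbitrary choices, and running it needs --add-modules jdk.incubator.vector.

    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorMask;
    import jdk.incubator.vector.VectorSpecies;

    public class FromLongVsMaskAll {
        static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256; // 8 int lanes

        public static void main(String[] args) {
            long allLanes = -1L >>> (64 - SPECIES.length()); // lowest 8 bits set
            VectorMask<Integer> fromLong = VectorMask.fromLong(SPECIES, allLanes);
            VectorMask<Integer> maskAll  = SPECIES.maskAll(true);
            // Both masks set exactly the same lanes, so fromLong with this constant can be
            // strength-reduced to the cheaper maskAll(true); 0L similarly maps to maskAll(false).
            System.out.println(fromLong.toLong() == maskAll.toLong()); // true
        }
    }
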
>> >> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like >> >> VectorMaskToLong (VectorLongToMask x) => x >> >> >> Hence, this patch also added the following optimizations: >> >> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> >> VectorMaskCast (VectorMaskCast x) => x >> >> And we can see noticeable performance improvement with the above optimizations for floating-point types. >> >> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Error After Error Uplift >> microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 >> microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 >> microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 >> microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 >> microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 >> microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 >> microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 >> microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 >> >> >> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: >> >> Benchm... > > erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - Refactor the implementation > > Do the convertion in C2's IGVN phase to cover more cases. > - Merge branch 'master' into JDK-8356760 > - Simplify the test code > - Address some review comments > > Add support for the following patterns: > toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) > toLong(maskAll(false)) => 0 > > And add more test cases. > - Merge branch 'master' into JDK-8356760 > - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases > > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would > set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent > to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is > relative smaller than that of `fromLong`. This patch does the conversion > for these cases if `l` is a compile time constant. > > And this conversion also enables further optimizations that recognize > maskAll patterns, see [1]. > > Some JTReg test cases are added to ensure the optimization is effective. > > I tried many different ways to write a JMH benchmark, but failed. Since > the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific > compile-time constant, the statement will be hoisted out of the loop. > If we don't use a loop, the hotspot will become other instructions, and > no obvious performance change was observed. However, combined with the > optimization of [1], we can observe a performance improvement of about > 7% on both aarch64 and x64. > > The patch was tested on both aarch64 and x64, all of tier1 tier2 and > tier3 tests passed. > > [1] https://github.com/openjdk/jdk/pull/24674 src/hotspot/share/opto/vectorIntrinsics.cpp line 692: > 690: // generate a MaskAll or Replicate instead. 
> 691: > 692: // The "maskAll" API uses the corresponding integer types for floating-point data. This is because mask all only accepts -1 and 0 values, since -1.0f in float in IEEE 754 format does not set all bits hence an floating point to integral conversion is mandatory here. src/hotspot/share/opto/vectornode.cpp line 1520: > 1518: uint vlen = vt->length(); > 1519: BasicType bt = vt->element_basic_type(); > 1520: int opc = is_mask ? Op_MaskAll : Op_Replicate; You can remove this check, since VectorNode::scalar2vector alreday has a match rule for Op_MaskAll src/hotspot/share/opto/vectornode.cpp line 1532: > 1530: } else { > 1531: con = phase->intcon(con_value); > 1532: } Suggestion: phase->makecon(TypeInteger::make(bits_type->get_con(), maskall_bt) src/hotspot/share/opto/vectornode.cpp line 1544: > 1542: > 1543: Node* VectorLoadMaskNode::Ideal(PhaseGVN* phase, bool can_reshape) { > 1544: // VectorLoadMask(VectorLongToMask(-1/0)) => Replicate(-1/0) FTR: This is only useful for non-predicated targets. Since on predicated target VectorLongToMask is not succeeded by VectorLoadMask https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectorIntrinsics.cpp#L703 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2218327515 PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2218349016 PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2218364699 PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2218386052 From thartmann at openjdk.org Mon Jul 21 07:30:41 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 21 Jul 2025 07:30:41 GMT Subject: RFR: 8361700: Missed optimization in PhaseIterGVN for mask and shift patterns due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v3] In-Reply-To: References: Message-ID: On Fri, 18 Jul 2025 08:52:34 GMT, Beno?t Maillard wrote: >> This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist`. >> >> The affected optimization is the transformation of `(x & mask) >> shift` into `(x >> shift) & (mask >> shift)`, where `mask` is a constant. This transformation is handled in `RShiftNode::IdealIL` for both `RShiftI` and `RShiftL` nodes. >> >> The dependency of this optimization extends beyond a direct input: from the viewpoint of a shift node, it relies on changes to the inputs of its inputs (i.e., an `AndI`/`AndL` input node to the shift). Therefore, when the `And` node changes, the corresponding shift node must be notified to allow the optimization to take place. >> >> Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361700) >> - [x] tier1-3, plus some internal testing >> - [x] Added test from the fuzzer >> >> Thank you for reviewing! > > Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: > > Update src/hotspot/share/opto/phaseX.cpp > > Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> Looks good to me too. ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26347#pullrequestreview-3036879159 From duke at openjdk.org Mon Jul 21 07:33:40 2025 From: duke at openjdk.org (duke) Date: Mon, 21 Jul 2025 07:33:40 GMT Subject: RFR: 8361700: Missed optimization in PhaseIterGVN for mask and shift patterns due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v3] In-Reply-To: References: Message-ID: On Fri, 18 Jul 2025 08:52:34 GMT, Beno?t Maillard wrote: >> This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist`. >> >> The affected optimization is the transformation of `(x & mask) >> shift` into `(x >> shift) & (mask >> shift)`, where `mask` is a constant. This transformation is handled in `RShiftNode::IdealIL` for both `RShiftI` and `RShiftL` nodes. >> >> The dependency of this optimization extends beyond a direct input: from the viewpoint of a shift node, it relies on changes to the inputs of its inputs (i.e., an `AndI`/`AndL` input node to the shift). Therefore, when the `And` node changes, the corresponding shift node must be notified to allow the optimization to take place. >> >> Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361700) >> - [x] tier1-3, plus some internal testing >> - [x] Added test from the fuzzer >> >> Thank you for reviewing! > > Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: > > Update src/hotspot/share/opto/phaseX.cpp > > Co-authored-by: Jasmine Karthikeyan <25208576+jaskarth at users.noreply.github.com> @benoitmaillard Your change (at version 3d620328615205749d2de6bd7705a2cdb4df506c) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26347#issuecomment-3095563954 From bmaillard at openjdk.org Mon Jul 21 07:40:52 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Mon, 21 Jul 2025 07:40:52 GMT Subject: Integrated: 8361700: Missed optimization in PhaseIterGVN for mask and shift patterns due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist In-Reply-To: References: Message-ID: <2050toeE95z9-ARCuAGTtgmxrTcZbcAOwR31CJ5NGg0=.7f52002f-3984-489e-8c92-dc8d167a418e@github.com> On Wed, 16 Jul 2025 12:42:32 GMT, Beno?t Maillard wrote: > This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist`. > > The affected optimization is the transformation of `(x & mask) >> shift` into `(x >> shift) & (mask >> shift)`, where `mask` is a constant. This transformation is handled in `RShiftNode::IdealIL` for both `RShiftI` and `RShiftL` nodes. > > The dependency of this optimization extends beyond a direct input: from the viewpoint of a shift node, it relies on changes to the inputs of its inputs (i.e., an `AndI`/`AndL` input node to the shift). Therefore, when the `And` node changes, the corresponding shift node must be notified to allow the optimization to take place. 
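To make the rewritten pattern concrete, a small self-contained Java check of the transformation described above; the mask, shift amount and sample values are arbitrary and chosen only for illustration.

    public class MaskShiftSketch {
        public static void main(String[] args) {
            int mask = 0xFF00, shift = 8;
            int[] samples = {0, 1, -1, 0x12345678, Integer.MIN_VALUE, Integer.MAX_VALUE};
            for (int x : samples) {
                int before = (x & mask) >> shift;            // (x & mask) >> shift
                int after  = (x >> shift) & (mask >> shift); // (x >> shift) & (mask >> shift)
                // C2 rewrites the first form into the second, so both must agree.
                System.out.println(before == after);         // always true
            }
        }
    }
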
> > Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8361700) > - [x] tier1-3, plus some internal testing > - [x] Added test from the fuzzer > > Thank you for reviewing! This pull request has now been integrated. Changeset: 62a58062 Author: Beno?t Maillard Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/62a58062e5f3d0a723608d98d2412ea779f73897 Stats: 69 lines in 2 files changed: 69 ins; 0 del; 0 mod 8361700: Missed optimization in PhaseIterGVN for mask and shift patterns due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist Reviewed-by: thartmann, mchevalier, mhaessig, jkarthikeyan ------------- PR: https://git.openjdk.org/jdk/pull/26347 From thartmann at openjdk.org Mon Jul 21 07:52:05 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 21 Jul 2025 07:52:05 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v37] In-Reply-To: References: Message-ID: <60gcSg6iIl4M7_jwgscpCL7GC7AQj4yulI1Jejo4W3E=.2ff2d416-d464-4d4e-a332-1205f56eafef@github.com> On Wed, 16 Jul 2025 09:38:18 GMT, Roland Westrelin wrote: >> To optimize a long counted loop and long range checks in a long or int >> counted loop, the loop is turned into a loop nest. When the loop has >> few iterations, the overhead of having an outer loop whose backedge is >> never taken, has a measurable cost. Furthermore, creating the loop >> nest usually causes one iteration of the loop to be peeled so >> predicates can be set up. If the loop is short running, then it's an >> extra iteration that's run with range checks (compared to an int >> counted loop with int range checks). >> >> This change doesn't create a loop nest when: >> >> 1- it can be determined statically at loop nest creation time that the >> loop runs for a short enough number of iterations >> >> 2- profiling reports that the loop runs for no more than ShortLoopIter >> iterations (1000 by default). >> >> For 2-, a guard is added which is implemented as yet another predicate. >> >> While this change is in principle simple, I ran into a few >> implementation issues: >> >> - while c2 has a way to compute the number of iterations of an int >> counted loop, it doesn't have that for long counted loop. The >> existing logic for int counted loops promotes values to long to >> avoid overflows. I reworked it so it now works for both long and int >> counted loops. >> >> - I added a new deoptimization reason (Reason_short_running_loop) for >> the new predicate. Given the number of iterations is narrowed down >> by the predicate, the limit of the loop after transformation is a >> cast node that's control dependent on the short running loop >> predicate. Because once the counted loop is transformed, it is >> likely that range check predicates will be inserted and they will >> depend on the limit, the short running loop predicate has to be the >> one that's further away from the loop entry. 
Now it is also possible >> that the limit before transformation depends on a predicate >> (TestShortRunningLongCountedLoopPredicatesClone is an example), we >> can have: new predicates inserted after the transformation that >> depend on the casted limit that itself depend on old predicates >> added before the transformation. To solve this cicular dependency, >> parse and assert predicates are cloned between the old predicates >> and the loop head. The cloned short running loop parse predicate is >> the one that's used to insert the short running loop predicate. >> >> - In the case of a long counted loop, the loop is transformed into a >> regular loop with a ... > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > test failures Looks reasonable to me. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/21630#pullrequestreview-3036936379 From fyang at openjdk.org Mon Jul 21 07:55:43 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 21 Jul 2025 07:55:43 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 14:17:45 GMT, Hamlin Li wrote: > Hi, > Can you help to review this patch? > > By https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp#L1270, there are far call, indirect call, reloc call. > NativeFarCall is in fact a reloc call, the name is confusing, better to rename it to RelocCall. > Also add some comments and do some other simple cleanup. > > Thanks! src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 131: > 129: } > 130: > 131: bool RelocCall::set_destination_mt_safe(address dest, bool assert_lock) { Seens you need to merge latest HEAD and rebase. The `assert_lock` param of `NativeFarCall::set_destination_mt_safe` has been removed recently. src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 190: > 188: assert(code != nullptr, "Could not find the containing code blob"); > 189: > 190: address dest = MacroAssembler::target_addr_for_insn(call_addr); Is this change safe? Seems it modifies the original logic. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2218446799 PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2218444040 From dbriemann at openjdk.org Mon Jul 21 08:00:29 2025 From: dbriemann at openjdk.org (David Briemann) Date: Mon, 21 Jul 2025 08:00:29 GMT Subject: RFR: 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts [v2] In-Reply-To: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> References: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> Message-ID: > Add the TimeoutFactor property to the CompileFramework to avoid timeouts on different systems. 
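For context, a minimal sketch of linear timeout scaling driven by jtreg's test.timeout.factor property. The property name follows jtreg's convention; the base timeout, class name and method are illustrative and are not the CompileFramework code.

    public class TimeoutScalingSketch {
        // Base timeout in milliseconds; illustrative value only.
        private static final long COMPILE_TIMEOUT_MS = 60_000;

        static long scaledTimeoutMs() {
            // jtreg exposes its -timeoutFactor option to tests as -Dtest.timeout.factor=...
            double factor = Double.parseDouble(System.getProperty("test.timeout.factor", "1.0"));
            return (long) (COMPILE_TIMEOUT_MS * factor);  // scale linearly
        }

        public static void main(String[] args) {
            System.out.println("timeout = " + scaledTimeoutMs() + " ms");
        }
    }
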
David Briemann has updated the pull request incrementally with one additional commit since the last revision: update copyright header ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26388/files - new: https://git.openjdk.org/jdk/pull/26388/files/7853b2a5..87fdf248 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26388&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26388&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26388.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26388/head:pull/26388 PR: https://git.openjdk.org/jdk/pull/26388 From duke at openjdk.org Mon Jul 21 08:10:45 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Mon, 21 Jul 2025 08:10:45 GMT Subject: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v10] In-Reply-To: References: Message-ID: On Fri, 18 Jul 2025 09:07:54 GMT, Yuri Gaevsky wrote: >>> > Looking at the JMH numbers, it's interesting to find that `-XX:DisableIntrinsic=_vectorizedHashCode` outperforms `-XX:-UseRVV`. If that is the case, then why would we want the scalar version (that is `C2_MacroAssembler::arrays_hashcode()`)? >>> >>> You are right: the non-RVV version of intrinsic performs worse on BPI-F3 hardware with size > 70, though originally it was better on StarFive JH7110 and T-Head RVB-ICE, please see #16629. >> >> Hm, it is still good on Lichee Pi 4A: >> >> $ ( for i in "-XX:DisableIntrinsic=_vectorizedHashCode" " " ; do ( echo "--- ${i} ---" && ${JAVA_HOME}/bin/java -jar benchmarks.jar --jvmArgs="-XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions ${i}" org.openjdk.bench.java.lang.ArraysHashCode.ints -p size=1,5,10,20,30,40,50,60,70,80,90,100,200,300 -f 3 -r 1 -w 1 -wi 10 -i 10 2>&1 | tail -15 ) done ) >> --- -XX:DisableIntrinsic=_vectorizedHashCode --- >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.ints 1 avgt 30 51.709 ? 3.815 ns/op >> ArraysHashCode.ints 5 avgt 30 68.146 ? 1.833 ns/op >> ArraysHashCode.ints 10 avgt 30 89.217 ? 0.496 ns/op >> ArraysHashCode.ints 20 avgt 30 140.807 ? 9.335 ns/op >> ArraysHashCode.ints 30 avgt 30 172.030 ? 4.025 ns/op >> ArraysHashCode.ints 40 avgt 30 222.927 ? 10.342 ns/op >> ArraysHashCode.ints 50 avgt 30 251.719 ? 0.686 ns/op >> ArraysHashCode.ints 60 avgt 30 305.947 ? 10.532 ns/op >> ArraysHashCode.ints 70 avgt 30 347.602 ? 7.024 ns/op >> ArraysHashCode.ints 80 avgt 30 382.057 ? 1.520 ns/op >> ArraysHashCode.ints 90 avgt 30 426.022 ? 31.800 ns/op >> ArraysHashCode.ints 100 avgt 30 457.737 ? 0.652 ns/op >> ArraysHashCode.ints 200 avgt 30 913.501 ? 3.258 ns/op >> ArraysHashCode.ints 300 avgt 30 1297.355 ? 2.383 ns/op >> --- --- >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.ints 1 avgt 30 50.141 ? 0.463 ns/op >> ArraysHashCode.ints 5 avgt 30 62.921 ? 2.538 ns/op >> ArraysHashCode.ints 10 avgt 30 77.686 ? 2.577 ns/op >> ArraysHashCode.ints 20 avgt 30 102.736 ? 0.136 ns/op >> ArraysHashCode.ints 30 avgt 30 137.592 ? 4.232 ns/op >> ArraysHashCode.ints 40 avgt 30 157.376 ? 0.302 ns/op >> ArraysHashCode.ints 50 avgt 30 196.068 ? 3.812 ns/op >> ArraysHashCode.ints 60 avgt 30 212.... > >> Looking at the JMH numbers, it's interesting to find that `-XX:DisableIntrinsic=_vectorizedHashCode` outperforms `-XX:-UseRVV`. If that is the case, then why would we want the scalar version (that is `C2_MacroAssembler::arrays_hashcode()`)? 
> > I've just found that the following change: > > $ git diff > diff --git a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp > index c62997310b3..f98b48adccd 100644 > --- a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp > +++ b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp > @@ -1953,16 +1953,15 @@ void C2_MacroAssembler::arrays_hashcode(Register ary, Register cnt, Register res > mv(pow31_3, 29791); // [31^^3] > mv(pow31_2, 961); // [31^^2] > > - slli(chunks_end, chunks, chunks_end_shift); > - add(chunks_end, ary, chunks_end); > + shadd(chunks_end, chunks, ary, t0, chunks_end_shift); > andi(cnt, cnt, stride - 1); // don't forget about tail! > > bind(WIDE_LOOP); > - mulw(result, result, pow31_4); // 31^^4 * h > arrays_hashcode_elload(t0, Address(ary, 0 * elsize), eltype); > arrays_hashcode_elload(t1, Address(ary, 1 * elsize), eltype); > arrays_hashcode_elload(tmp5, Address(ary, 2 * elsize), eltype); > arrays_hashcode_elload(tmp6, Address(ary, 3 * elsize), eltype); > + mulw(result, result, pow31_4); // 31^^4 * h > mulw(t0, t0, pow31_3); // 31^^3 * ary[i+0] > addw(result, result, t0); > mulw(t1, t1, pow31_2); // 31^^2 * ary[i+1] > @@ -1977,8 +1976,7 @@ void C2_MacroAssembler::arrays_hashcode(Register ary, Register cnt, Register res > beqz(cnt, DONE); > > bind(TAIL); > - slli(chunks_end, cnt, chunks_end_shift); > - add(chunks_end, ary, chunks_end); > + shadd(chunks_end, cnt, ary, t0, chunks_end_shift); > > bind(TAIL_LOOP); > arrays_hashcode_elload(t0, Address(ary), eltype); > > makes the numbers good again at BPI-F3 as well (mostly due to move `mulw` down in the loop): > > --- -XX:DisableIntrinsic=_vectorizedHashCode --- > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.ints 1 avgt 10 11.271 ? 0.003 ns/op > ArraysHashCode.ints 5 avgt 10 28.910 ? 0.036 ns/op > ArraysHashCode.ints 10 avgt 10 41.176 ? 0.383 ns/op > ArraysHashCode.ints 20 avgt 10 68.236 ? 0.087 ns/op > ArraysHashCode.ints 30 avgt 10 88.215 ? 0.272 ns/op > ArraysHashCode.ints 40 avgt 10 115.218 ? 0.065 ns/op > ArraysHashCode.ints 50 avgt 10 135.834 ? 0.374 ns/op > ArraysHashCode.in... > @ygaevsky : Thanks for finding that. Could you please propose another PR to fix that? It looks like a micro-optimization for K1. Sure, done: please see https://github.com/openjdk/jdk/pull/26409. ------------- PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-3095665554 From duke at openjdk.org Mon Jul 21 08:13:25 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Mon, 21 Jul 2025 08:13:25 GMT Subject: RFR: 8362596: RISC-V: Improve _vectorizedHashCode intrinsic Message-ID: This is a micro-optimization for RISC-V SpacemiT K1 CPU to fix [encountered performance regression](https://github.com/openjdk/jdk/pull/17413#issuecomment-3082664335). 
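As background for the unrolled loop touched by the diff above, a small Java check that one 4-way unrolled step (with the 31^4, 31^3 = 29791 and 31^2 = 961 constants seen in the intrinsic) matches the plain h = 31*h + a[i] recurrence; the array contents and class name are arbitrary.

    public class HashUnrollSketch {
        public static void main(String[] args) {
            int[] a = {7, -3, 42, 123456};
            int plain = 1;
            for (int v : a) {
                plain = 31 * plain + v;                 // scalar recurrence used by hashCode
            }
            // One iteration of the 4-way unrolled form, starting from the same seed h = 1:
            int unrolled = 923521 * 1 + 29791 * a[0] + 961 * a[1] + 31 * a[2] + a[3];
            System.out.println(plain == unrolled);      // true: both compute the same hash
        }
    }
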
------------- Commit messages: - 8362596: RISC-V: Improve _vectorizedHashCode intrinsic Changes: https://git.openjdk.org/jdk/pull/26409/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26409&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362596 Stats: 6 lines in 1 file changed: 1 ins; 3 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26409.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26409/head:pull/26409 PR: https://git.openjdk.org/jdk/pull/26409 From duke at openjdk.org Mon Jul 21 08:13:25 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Mon, 21 Jul 2025 08:13:25 GMT Subject: RFR: 8362596: RISC-V: Improve _vectorizedHashCode intrinsic In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 08:07:48 GMT, Yuri Gaevsky wrote: > This is a micro-optimization for RISC-V SpacemiT K1 CPU to fix [encountered performance regression](https://github.com/openjdk/jdk/pull/17413#issuecomment-3082664335). bpif3-16g% for i in "-XX:DisableIntrinsic=_vectorizedHashCode" "-XX:-UseRVV" "-XX:+UseRVV" ; \ do ( echo "--- ${i} ---" && ${JAVA_HOME}/bin/java -jar benchmarks.jar \ --jvmArgs="-XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions ${i}" \ org.openjdk.bench.java.lang.ArraysHashCode.ints \ -p size=1,5,10,20,30,40,50,60,70,80,90,100,200,300 \ -f 3 -r 1 -w 1 -wi 5 -i 10 2>&1 | tail -15 ) done --- -XX:DisableIntrinsic=_vectorizedHashCode --- Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.ints 1 avgt 30 11.280 ? 0.004 ns/op ArraysHashCode.ints 5 avgt 30 28.831 ? 0.032 ns/op ArraysHashCode.ints 10 avgt 30 41.179 ? 0.220 ns/op ArraysHashCode.ints 20 avgt 30 68.178 ? 0.142 ns/op ArraysHashCode.ints 30 avgt 30 88.911 ? 0.385 ns/op ArraysHashCode.ints 40 avgt 30 115.043 ? 0.186 ns/op ArraysHashCode.ints 50 avgt 30 135.755 ? 0.607 ns/op ArraysHashCode.ints 60 avgt 30 161.924 ? 0.187 ns/op ArraysHashCode.ints 70 avgt 30 170.367 ? 0.413 ns/op ArraysHashCode.ints 80 avgt 30 193.985 ? 0.707 ns/op ArraysHashCode.ints 90 avgt 30 207.633 ? 0.147 ns/op ArraysHashCode.ints 100 avgt 30 232.362 ? 0.259 ns/op ArraysHashCode.ints 200 avgt 30 447.390 ? 0.677 ns/op ArraysHashCode.ints 300 avgt 30 656.324 ? 1.100 ns/op --- -XX:-UseRVV --- Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.ints 1 avgt 30 11.291 ? 0.017 ns/op ArraysHashCode.ints 5 avgt 30 23.176 ? 0.011 ns/op ArraysHashCode.ints 10 avgt 30 33.264 ? 0.073 ns/op ArraysHashCode.ints 20 avgt 30 50.726 ? 0.006 ns/op ArraysHashCode.ints 30 avgt 30 70.990 ? 0.184 ns/op ArraysHashCode.ints 40 avgt 30 88.368 ? 0.050 ns/op ArraysHashCode.ints 50 avgt 30 108.582 ? 0.175 ns/op ArraysHashCode.ints 60 avgt 30 126.084 ? 0.202 ns/op ArraysHashCode.ints 70 avgt 30 146.201 ? 0.169 ns/op ArraysHashCode.ints 80 avgt 30 163.469 ? 0.033 ns/op ArraysHashCode.ints 90 avgt 30 183.653 ? 0.182 ns/op ArraysHashCode.ints 100 avgt 30 201.063 ? 0.156 ns/op ArraysHashCode.ints 200 avgt 30 389.657 ? 0.752 ns/op ArraysHashCode.ints 300 avgt 30 577.283 ? 0.434 ns/op --- -XX:+UseRVV --- Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.ints 1 avgt 30 11.273 ? 0.001 ns/op ArraysHashCode.ints 5 avgt 30 23.184 ? 0.010 ns/op ArraysHashCode.ints 10 avgt 30 33.262 ? 0.086 ns/op ArraysHashCode.ints 20 avgt 30 50.811 ? 0.078 ns/op ArraysHashCode.ints 30 avgt 30 71.019 ? 0.164 ns/op ArraysHashCode.ints 40 avgt 30 88.331 ? 0.096 ns/op ArraysHashCode.ints 50 avgt 30 108.852 ? 0.212 ns/op ArraysHashCode.ints 60 avgt 30 125.948 ? 0.039 ns/op ArraysHashCode.ints 70 avgt 30 146.518 ? 0.345 ns/op ArraysHashCode.ints 80 avgt 30 163.464 ? 
0.029 ns/op ArraysHashCode.ints 90 avgt 30 183.722 ? 0.237 ns/op ArraysHashCode.ints 100 avgt 30 201.307 ? 0.346 ns/op ArraysHashCode.ints 200 avgt 30 389.048 ? 0.322 ns/op ArraysHashCode.ints 300 avgt 30 576.821 ? 0.130 ns/op ------------- PR Comment: https://git.openjdk.org/jdk/pull/26409#issuecomment-3095669754 From mhaessig at openjdk.org Mon Jul 21 08:14:42 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 21 Jul 2025 08:14:42 GMT Subject: RFR: 8357694: RISC-V: Several IR verification tests fail when vlen=128 [v2] In-Reply-To: <__-vaE1gtG6vsSH8CWGTh4HPOy267edyiA3viq-YtWc=.163ece35-ae6a-4fec-9d7a-81dcbb72fa38@github.com> References: <__-vaE1gtG6vsSH8CWGTh4HPOy267edyiA3viq-YtWc=.163ece35-ae6a-4fec-9d7a-81dcbb72fa38@github.com> Message-ID: On Mon, 21 Jul 2025 05:01:45 GMT, Dingli Zhang wrote: >> Hi, >> Can you help to review this patch? Thanks! >> >> These tests failed because with a vlen of 128, these tests generate vectors containing only two elements. However, ??2-element reductions for INT/LONG are not profitable??, so the compiler won't generate the corresponding reductions IR. >> >> We can refer to here: >> https://github.com/openjdk/jdk/blob/441dbde2c3c915ffd916e39a5b4a91df5620d7f3/src/hotspot/share/opto/superword.cpp#L1606-L1633 >> >> According to the explanation above, when I use `-XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=2`, these cases passed with vlen=128. >> >> We can fix this problem by adding the restriction of `MaxVectorSize` greater than or equal to 32 (256 bits) to these test cases. >> >> ## Test (fastdebug) >> ### Test on k1 and qemu (w/ RVV, vlen=128) >> - compiler/vectorization/runner/LoopReductionOpTest.java >> - compiler/c2/irTests/TestIfMinMax.java >> - compiler/loopopts/superword/RedTest_long.java >> - compiler/loopopts/superword/SumRed_Long.java >> - compiler/loopopts/superword/TestGeneralizedReductions.java >> - compiler/loopopts/superword/TestUnorderedReductionPartialVectorization.java > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Apply fix only on RISC-V Thank you for this PR, @DingliZhang. It looks good to me. I kicked off testing on our side and will keep you posted on the results. ------------- PR Review: https://git.openjdk.org/jdk/pull/26408#pullrequestreview-3036997514 From fyang at openjdk.org Mon Jul 21 08:17:44 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 21 Jul 2025 08:17:44 GMT Subject: RFR: 8362596: RISC-V: Improve _vectorizedHashCode intrinsic In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 08:07:48 GMT, Yuri Gaevsky wrote: > This is a micro-optimization for RISC-V SpacemiT K1 CPU to fix [encountered performance regression](https://github.com/openjdk/jdk/pull/17413#issuecomment-3082664335). Thanks! ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26409#pullrequestreview-3037008439 From bmaillard at openjdk.org Mon Jul 21 08:17:59 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Mon, 21 Jul 2025 08:17:59 GMT Subject: RFR: 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist Message-ID: This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist` (again). 
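As a quick illustration of the first redundant sequence listed below (ConvD2L->ConvL2D->ConvD2L), a self-contained Java check that the long->double->long round trip is redundant once the value has already been truncated to a long; the sample values and class name are arbitrary.

    public class ConvRoundTripSketch {
        public static void main(String[] args) {
            double[] samples = {0.5, -1234.75, 1e18, 9.9e18, Double.NaN,
                                Double.POSITIVE_INFINITY, Double.MIN_VALUE};
            for (double d : samples) {
                long once        = (long) d;                    // ConvD2L
                long roundTripped = (long) (double) (long) d;   // ConvD2L -> ConvL2D -> ConvD2L
                // C2's identity rule replaces the triple conversion by the single ConvD2L.
                System.out.println(once == roundTripped);       // true for every sample
            }
        }
    }
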
This is similar to [JDK-8361700](https://bugs.openjdk.org/browse/JDK-8361700?filter=-1). The optimization in question is the removal of redundant `ConvX2Y->ConvY2X->ConvX2Y` sequences (where `X` and `Y` are primitive number types), which get replaced by a single `ConvX2Y` as an identity optimization. This missing optimization was originally reported only for `ConvD2LNode`, but it turns out that other conversion nodes have analog optimization patterns. After manual inspection of identity optimizations in `convertnode.cpp`, I was able to reproduce missing optimizations for the following conversion sequences: - `ConvD2L->ConvL2D->ConvD2L` - `ConvF2I->ConvI2F->ConvF2I` - `ConvF2L->ConvL2F->ConvF2L` - `ConvI2F->ConvF2I->ConvI2F` Similar optimization patterns exist for additional conversion nodes. However, it is not clear if these nodes are subject to the same missed optimization issue. Further investigation may be needed, as I was unable to reproduce such cases with simple tests. This is again a case where an optimization depends on the input of its input. Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. ### Testing - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8359603) - [x] tier1-3, plus some internal testing - [x] Added test from the fuzzer, and tests for other sequences (manually derived from the original one) Thank you for reviewing! ------------- Commit messages: - Merge branch 'master' of https://git.openjdk.org/jdk into JDK-8359603 - 8359603: Remove switch in tests - 8359603: Rename test and add cases for other number types with analog optimization patterns - 8359603: Add other similar conversion patterns for which a missing opt could be trigerred - 8359603: Add opcode check on current node - 8359603: Add comment explaining the notification - 8359603: Add test from fuzzer - 8359603: Add missing notification Changes: https://git.openjdk.org/jdk/pull/26368/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26368&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8359603 Stats: 123 lines in 2 files changed: 123 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26368.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26368/head:pull/26368 PR: https://git.openjdk.org/jdk/pull/26368 From galder at openjdk.org Mon Jul 21 08:30:41 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Mon, 21 Jul 2025 08:30:41 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v5] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 11:50:33 GMT, Feilong Jiang wrote: >> Hi, please consider. >> [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. >> The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. >> If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. 
>> This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. >> We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. >> >> This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. >> The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. >> >> Test on linux-riscv64: >> - [x] Tier1-3 >> >> JMH data on P550 SBC for reference (w/o and w/ the patch): >> >> Before: >> >> Without COH: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op >> ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op >> ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op >> ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op >> ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op >> ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op >> ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op >> ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op >> ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op >> ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op >> ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op >> ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op >> >> ------------------------------------------------------------------------- >> With COH: >> >> Benchmark (size) Mode Cnt Score Error Un... > > Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains eight additional commits since the last revision: > > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - also keep overlapping flag > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - Revert RISCV Macro modification > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses > - riscv: fix c1 primitive array clone intrinsic regression Changes requested by galder (Author). Also, some comments on why these flags are needed when doing array copies would be good for future reference. src/hotspot/cpu/riscv/c1_LIRGenerator_riscv.cpp line 775: > 773: arraycopy_helper(x, &flags, &expected_type); > 774: if (x->check_flag(Instruction::OmitChecksFlag)) { > 775: flags = (flags & (LIR_OpArrayCopy::unaligned | LIR_OpArrayCopy::overlapping)); The changes in the two files need to be in synch, so I wonder if `LIR_OpArrayCopy::unaligned | LIR_OpArrayCopy::overlapping` could be abstracted away within a function in `LIR_OpArrayCopy`. So something like this (apologies for any syntactic/semantic errors): flags = (flags & LIR_OpArrayGopy::get_array_copy_flags()); Then on the other method something like: ((flags & ~(LIR_OpArrayGopy::get_array_copy_flags())) == 0) Function name is just an example, feel free to suggest some other if you think it fits better. Thoughts? 
------------- PR Review: https://git.openjdk.org/jdk/pull/25976#pullrequestreview-3037040733 PR Comment: https://git.openjdk.org/jdk/pull/25976#issuecomment-3095714032 PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2218510943 From mli at openjdk.org Mon Jul 21 08:35:27 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 21 Jul 2025 08:35:27 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v2] In-Reply-To: References: Message-ID: > Hi, > Can you help to review this patch? > > By https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp#L1270, there are far call, indirect call, reloc call. > NativeFarCall is in fact a reloc call, the name is confusing, better to rename it to RelocCall. > Also add some comments and do some other simple cleanup. > > Thanks! Hamlin Li has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: - merge master - initial commit ------------- Changes: https://git.openjdk.org/jdk/pull/26370/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26370&range=01 Stats: 46 lines in 1 file changed: 8 ins; 3 del; 35 mod Patch: https://git.openjdk.org/jdk/pull/26370.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26370/head:pull/26370 PR: https://git.openjdk.org/jdk/pull/26370 From mhaessig at openjdk.org Mon Jul 21 08:39:39 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 21 Jul 2025 08:39:39 GMT Subject: RFR: 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts [v2] In-Reply-To: References: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> Message-ID: On Mon, 21 Jul 2025 08:00:29 GMT, David Briemann wrote: >> Add the TimeoutFactor property to the CompileFramework to avoid timeouts on different systems. > > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > update copyright header test/hotspot/jtreg/compiler/lib/compile_framework/Compile.java line 186: > 184: try { > 185: Process process = builder.start(); > 186: long timeout = COMPILE_TIMEOUT * (long)Math.pow(2, timeoutFactor-1); Is there a reason for scaling the timeout exponentially instead of linearly, as jtreg does it and most users would expect? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26388#discussion_r2218529003 From mli at openjdk.org Mon Jul 21 08:45:36 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 21 Jul 2025 08:45:36 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v3] In-Reply-To: References: Message-ID: > Hi, > Can you help to review this patch? > > By https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp#L1270, there are far call, indirect call, reloc call. > NativeFarCall is in fact a reloc call, the name is confusing, better to rename it to RelocCall. > Also add some comments and do some other simple cleanup. > > Thanks! 
Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26370/files - new: https://git.openjdk.org/jdk/pull/26370/files/1099e13a..f72db245 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26370&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26370&range=01-02 Stats: 13 lines in 1 file changed: 0 ins; 5 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/26370.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26370/head:pull/26370 PR: https://git.openjdk.org/jdk/pull/26370 From mli at openjdk.org Mon Jul 21 08:55:46 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 21 Jul 2025 08:55:46 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v3] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 07:52:53 GMT, Fei Yang wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> fix > > src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 131: > >> 129: } >> 130: >> 131: bool RelocCall::set_destination_mt_safe(address dest, bool assert_lock) { > > Seens you need to merge latest HEAD and rebase. The `assert_lock` param of `NativeFarCall::set_destination_mt_safe` has been removed recently. Thanks for reminding, it's merged. > src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 190: > >> 188: assert(code != nullptr, "Could not find the containing code blob"); >> 189: >> 190: address dest = MacroAssembler::target_addr_for_insn(call_addr); > > Is this change safe? Seems it modifies the original logic. Yes, `MacroAssembler::pd_call_destination` only call `MacroAssembler::target_addr_for_insn`. And `MacroAssembler::target_addr_for_insn` are used in other places in NativeFarCall, so it's better to use `target_addr_for_insn` only to improve readability. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2218561860 PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2218560427 From dbriemann at openjdk.org Mon Jul 21 09:04:38 2025 From: dbriemann at openjdk.org (David Briemann) Date: Mon, 21 Jul 2025 09:04:38 GMT Subject: RFR: 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts [v3] In-Reply-To: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> References: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> Message-ID: > Add the TimeoutFactor property to the CompileFramework to avoid timeouts on different systems. 
David Briemann has updated the pull request incrementally with one additional commit since the last revision: make timeout factor scale linearly ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26388/files - new: https://git.openjdk.org/jdk/pull/26388/files/87fdf248..074bbca3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26388&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26388&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26388.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26388/head:pull/26388 PR: https://git.openjdk.org/jdk/pull/26388 From dbriemann at openjdk.org Mon Jul 21 09:04:41 2025 From: dbriemann at openjdk.org (David Briemann) Date: Mon, 21 Jul 2025 09:04:41 GMT Subject: RFR: 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts [v2] In-Reply-To: References: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> Message-ID: On Mon, 21 Jul 2025 08:36:08 GMT, Manuel H?ssig wrote: >> David Briemann has updated the pull request incrementally with one additional commit since the last revision: >> >> update copyright header > > test/hotspot/jtreg/compiler/lib/compile_framework/Compile.java line 186: > >> 184: try { >> 185: Process process = builder.start(); >> 186: long timeout = COMPILE_TIMEOUT * (long)Math.pow(2, timeoutFactor-1); > > Is there a reason for scaling the timeout exponentially instead of linearly, as jtreg does it and most users would expect? I thought that I saw this documented somewhere but I cannot find it anymore and in other places it is used linearly, like you say. I adapted it. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26388#discussion_r2218578639 From mli at openjdk.org Mon Jul 21 09:05:41 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 21 Jul 2025 09:05:41 GMT Subject: RFR: 8362493: Cleanup CodeBuffer::copy_relocations_to In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 12:14:08 GMT, Manuel H?ssig wrote: >> Hi, >> Can you help to review this simple patch? >> >> `CodeBuffer::copy_relocations_to(address buf, csize_t buf_limit, bool only_inst)` is only used in `copy_relocations_to(CodeBlob* dest)` which passes false to only_inst, so the former one should be able to be simplified. >> >> Thank you! > > Thank you for working on this cleanup, @Hamlin-Li! It looks good to me. > > I kicked off some testing on our side and will let you know what the results are. Hi @mhaessig , how's your test result? I guess it should be fine. Thanks ------------- PR Comment: https://git.openjdk.org/jdk/pull/26366#issuecomment-3095807017 From aph at openjdk.org Mon Jul 21 09:07:41 2025 From: aph at openjdk.org (Andrew Haley) Date: Mon, 21 Jul 2025 09:07:41 GMT Subject: RFR: 8361890: Aarch64: Removal of redundant dmb from C1 AtomicLong methods In-Reply-To: <60YMRP6cNslwEeVX2TWmnMYdO872xGaeShKMEj0dWGY=.2f4f504f-93d1-4bab-b721-e5c964f4c465@github.com> References: <60YMRP6cNslwEeVX2TWmnMYdO872xGaeShKMEj0dWGY=.2f4f504f-93d1-4bab-b721-e5c964f4c465@github.com> Message-ID: <-qo6EhOT6c-G4PL4EVQxmEyuPthroTNc-7SjwzhXpzk=.82ccc4c3-ec11-47f6-954b-7aa0a9251b43@github.com> On Thu, 10 Jul 2025 15:49:40 GMT, Samuel Chee wrote: > The current C1 implementation of AtomicLong methods > which either adds or exchanges (such as getAndAdd) > emit one of a ldaddal and swpal respectively when using > LSE as well as an immediately proceeding dmb. 
Since > ldaddal/swpal have both acquire and release semantics, > this provides similar ordering guarantees to a dmb.full > so the dmb here is redundant and can be removed. > > This is due to both clause 7 and clause 11 of the > definition of Barrier-ordered-before in B2.3.7 of the > DDI0487 L.a Arm Architecture Reference Manual for A-profile > architecture being satisfied by the existence of a > ldaddal/swpal which ensures such memory ordering guarantees. Marked as reviewed by aph (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26245#pullrequestreview-3037151111 From mchevalier at openjdk.org Mon Jul 21 09:07:42 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Mon, 21 Jul 2025 09:07:42 GMT Subject: RFR: 8361890: Aarch64: Removal of redundant dmb from C1 AtomicLong methods In-Reply-To: <60YMRP6cNslwEeVX2TWmnMYdO872xGaeShKMEj0dWGY=.2f4f504f-93d1-4bab-b721-e5c964f4c465@github.com> References: <60YMRP6cNslwEeVX2TWmnMYdO872xGaeShKMEj0dWGY=.2f4f504f-93d1-4bab-b721-e5c964f4c465@github.com> Message-ID: <-pN_mmfICeaEHEJa4rrngIs2v2LaDGTMGE4OuuXVs08=.b7ab52d2-5f3a-4621-884f-2c9a207655cd@github.com> On Thu, 10 Jul 2025 15:49:40 GMT, Samuel Chee wrote: > The current C1 implementation of AtomicLong methods > which either adds or exchanges (such as getAndAdd) > emit one of a ldaddal and swpal respectively when using > LSE as well as an immediately proceeding dmb. Since > ldaddal/swpal have both acquire and release semantics, > this provides similar ordering guarantees to a dmb.full > so the dmb here is redundant and can be removed. > > This is due to both clause 7 and clause 11 of the > definition of Barrier-ordered-before in B2.3.7 of the > DDI0487 L.a Arm Architecture Reference Manual for A-profile > architecture being satisfied by the existence of a > ldaddal/swpal which ensures such memory ordering guarantees. It took a bit, but I'm back! I've run tier 1..3 and some internal testing, and it's passing everything. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26245#issuecomment-3095293982 From aph at openjdk.org Mon Jul 21 09:07:42 2025 From: aph at openjdk.org (Andrew Haley) Date: Mon, 21 Jul 2025 09:07:42 GMT Subject: RFR: 8361890: Aarch64: Removal of redundant dmb from C1 AtomicLong methods In-Reply-To: <-pN_mmfICeaEHEJa4rrngIs2v2LaDGTMGE4OuuXVs08=.b7ab52d2-5f3a-4621-884f-2c9a207655cd@github.com> References: <60YMRP6cNslwEeVX2TWmnMYdO872xGaeShKMEj0dWGY=.2f4f504f-93d1-4bab-b721-e5c964f4c465@github.com> <-pN_mmfICeaEHEJa4rrngIs2v2LaDGTMGE4OuuXVs08=.b7ab52d2-5f3a-4621-884f-2c9a207655cd@github.com> Message-ID: On Mon, 21 Jul 2025 05:48:25 GMT, Marc Chevalier wrote: > It took a bit, but I'm back! I've run tier 1..3 and some internal testing, and it's passing everything. The right test for this is jcstress, but even then results are only valid for a single implementation. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26245#issuecomment-3095815004 From mhaessig at openjdk.org Mon Jul 21 09:14:45 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 21 Jul 2025 09:14:45 GMT Subject: RFR: 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts [v3] In-Reply-To: References: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> Message-ID: On Mon, 21 Jul 2025 09:04:38 GMT, David Briemann wrote: >> Add the TimeoutFactor property to the CompileFramework to avoid timeouts on different systems. 
> > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > make timeout factor scale linearly Thank you, @dbriemann, for addressing my comment. I just kicked off testing on our side and will keep you posted on the results. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26388#issuecomment-3095838390 From mhaessig at openjdk.org Mon Jul 21 09:15:45 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 21 Jul 2025 09:15:45 GMT Subject: RFR: 8362493: Cleanup CodeBuffer::copy_relocations_to In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 11:09:09 GMT, Hamlin Li wrote: > Hi, > Can you help to review this simple patch? > > `CodeBuffer::copy_relocations_to(address buf, csize_t buf_limit, bool only_inst)` is only used in `copy_relocations_to(CodeBlob* dest)` which passes false to only_inst, so the former one should be able to be simplified. > > Thank you! I messed up my first test run. The latest run should be done in about an hour and is all green so far. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26366#issuecomment-3095843654 From fyang at openjdk.org Mon Jul 21 09:48:40 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 21 Jul 2025 09:48:40 GMT Subject: RFR: 8357694: RISC-V: Several IR verification tests fail when vlen=128 [v2] In-Reply-To: <__-vaE1gtG6vsSH8CWGTh4HPOy267edyiA3viq-YtWc=.163ece35-ae6a-4fec-9d7a-81dcbb72fa38@github.com> References: <__-vaE1gtG6vsSH8CWGTh4HPOy267edyiA3viq-YtWc=.163ece35-ae6a-4fec-9d7a-81dcbb72fa38@github.com> Message-ID: On Mon, 21 Jul 2025 05:01:45 GMT, Dingli Zhang wrote: >> Hi, >> Can you help to review this patch? Thanks! >> >> These tests failed because with a vlen of 128, these tests generate vectors containing only two elements. However, ??2-element reductions for INT/LONG are not profitable??, so the compiler won't generate the corresponding reductions IR. >> >> We can refer to here: >> https://github.com/openjdk/jdk/blob/441dbde2c3c915ffd916e39a5b4a91df5620d7f3/src/hotspot/share/opto/superword.cpp#L1606-L1633 >> >> According to the explanation above, when I use `-XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=2`, these cases passed with vlen=128. >> >> We can fix this problem by adding the restriction of `MaxVectorSize` greater than or equal to 32 (256 bits) to these test cases. >> >> ## Test (fastdebug) >> ### Test on k1 and qemu (w/ RVV, vlen=128) >> - compiler/vectorization/runner/LoopReductionOpTest.java >> - compiler/c2/irTests/TestIfMinMax.java >> - compiler/loopopts/superword/RedTest_long.java >> - compiler/loopopts/superword/SumRed_Long.java >> - compiler/loopopts/superword/TestGeneralizedReductions.java >> - compiler/loopopts/superword/TestUnorderedReductionPartialVectorization.java > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Apply fix only on RISC-V Updated change LGTM. Thanks! ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26408#pullrequestreview-3037296678 From mhaessig at openjdk.org Mon Jul 21 10:40:39 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 21 Jul 2025 10:40:39 GMT Subject: RFR: 8362493: Cleanup CodeBuffer::copy_relocations_to In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 11:09:09 GMT, Hamlin Li wrote: > Hi, > Can you help to review this simple patch? 
> > `CodeBuffer::copy_relocations_to(address buf, csize_t buf_limit, bool only_inst)` is only used in `copy_relocations_to(CodeBlob* dest)` which passes false to only_inst, so the former one should be able to be simplified. > > Thank you! Testing passed. ------------- Marked as reviewed by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/26366#pullrequestreview-3037485780 From bkilambi at openjdk.org Mon Jul 21 11:09:04 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Mon, 21 Jul 2025 11:09:04 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v16] In-Reply-To: References: Message-ID: > This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. > > It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. > > For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. > > For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. > > This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. > > Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - > > > Benchmark (size) Mode Cnt Gain > SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 > SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 > SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 > SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 > SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 > SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 > SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 > SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 > SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 > SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 > > > Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. 
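For readers unfamiliar with the operation being intrinsified, a minimal, self-contained Java sketch of the two-vector selectFrom call that the new tbl-based backend targets; the species choice and array setup are illustrative assumptions, not code from this patch, and it needs --add-modules=jdk.incubator.vector to run:

    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorSpecies;

    public class SelectFromTwoVectorExample {
        static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;

        // idx holds per-lane indices into the concatenation of a and b:
        // values in [0, VLENGTH) pick from a, values in [VLENGTH, 2*VLENGTH) pick from b.
        static void select(int[] a, int[] b, int[] idx, int[] out) {
            IntVector va = IntVector.fromArray(SPECIES, a, 0);
            IntVector vb = IntVector.fromArray(SPECIES, b, 0);
            IntVector vi = IntVector.fromArray(SPECIES, idx, 0);
            vi.selectFrom(va, vb).intoArray(out, 0);
        }
    }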
Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: Refine comments in c2_MacroAssembler_aarch64.cpp ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23570/files - new: https://git.openjdk.org/jdk/pull/23570/files/6c7266d7..1d553a94 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=15 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=14-15 Stats: 37 lines in 1 file changed: 31 ins; 4 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23570.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23570/head:pull/23570 PR: https://git.openjdk.org/jdk/pull/23570 From mli at openjdk.org Mon Jul 21 11:13:50 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 21 Jul 2025 11:13:50 GMT Subject: Integrated: 8362493: Cleanup CodeBuffer::copy_relocations_to In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 11:09:09 GMT, Hamlin Li wrote: > Hi, > Can you help to review this simple patch? > > `CodeBuffer::copy_relocations_to(address buf, csize_t buf_limit, bool only_inst)` is only used in `copy_relocations_to(CodeBlob* dest)` which passes false to only_inst, so the former one should be able to be simplified. > > Thank you! This pull request has now been integrated. Changeset: fd7f78a5 Author: Hamlin Li URL: https://git.openjdk.org/jdk/commit/fd7f78a5351a5b00bc9a3173e7671afe2d1e6fe4 Stats: 10 lines in 2 files changed: 1 ins; 6 del; 3 mod 8362493: Cleanup CodeBuffer::copy_relocations_to Reviewed-by: mhaessig, kvn ------------- PR: https://git.openjdk.org/jdk/pull/26366 From mli at openjdk.org Mon Jul 21 11:13:50 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 21 Jul 2025 11:13:50 GMT Subject: RFR: 8362493: Cleanup CodeBuffer::copy_relocations_to In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 10:38:30 GMT, Manuel H?ssig wrote: > Testing passed. Thank you @mhaessig for testing! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26366#issuecomment-3096237665 From mli at openjdk.org Mon Jul 21 11:16:53 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 21 Jul 2025 11:16:53 GMT Subject: RFR: 8357694: RISC-V: Several IR verification tests fail when vlen=128 [v2] In-Reply-To: <__-vaE1gtG6vsSH8CWGTh4HPOy267edyiA3viq-YtWc=.163ece35-ae6a-4fec-9d7a-81dcbb72fa38@github.com> References: <__-vaE1gtG6vsSH8CWGTh4HPOy267edyiA3viq-YtWc=.163ece35-ae6a-4fec-9d7a-81dcbb72fa38@github.com> Message-ID: On Mon, 21 Jul 2025 05:01:45 GMT, Dingli Zhang wrote: >> Hi, >> Can you help to review this patch? Thanks! >> >> These tests failed because with a vlen of 128, these tests generate vectors containing only two elements. However, ??2-element reductions for INT/LONG are not profitable??, so the compiler won't generate the corresponding reductions IR. >> >> We can refer to here: >> https://github.com/openjdk/jdk/blob/441dbde2c3c915ffd916e39a5b4a91df5620d7f3/src/hotspot/share/opto/superword.cpp#L1606-L1633 >> >> According to the explanation above, when I use `-XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=2`, these cases passed with vlen=128. >> >> We can fix this problem by adding the restriction of `MaxVectorSize` greater than or equal to 32 (256 bits) to these test cases. 
>> >> ## Test (fastdebug) >> ### Test on k1 and qemu (w/ RVV, vlen=128) >> - compiler/vectorization/runner/LoopReductionOpTest.java >> - compiler/c2/irTests/TestIfMinMax.java >> - compiler/loopopts/superword/RedTest_long.java >> - compiler/loopopts/superword/SumRed_Long.java >> - compiler/loopopts/superword/TestGeneralizedReductions.java >> - compiler/loopopts/superword/TestUnorderedReductionPartialVectorization.java > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Apply fix only on RISC-V Thanks for working on this, looks good. ------------- Marked as reviewed by mli (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26408#pullrequestreview-3037608933 From jbhateja at openjdk.org Mon Jul 21 11:31:47 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 21 Jul 2025 11:31:47 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4] In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 09:09:14 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is >> relative smaller than that of `fromLong`. So this patch does the conversion for these cases. >> >> The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. >> >> Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. >> >> This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like >> >> VectorMaskToLong (VectorLongToMask x) => x >> >> >> Hence, this patch also added the following optimizations: >> >> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> >> VectorMaskCast (VectorMaskCast x) => x >> >> And we can see noticeable performance improvement with the above optimizations for floating-point types. 
>> >> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Error After Error Uplift >> microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 >> microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 >> microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 >> microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 >> microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 >> microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 >> microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 >> microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 >> >> >> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: >> >> Benchm... > > erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - Refactor the implementation > > Do the convertion in C2's IGVN phase to cover more cases. > - Merge branch 'master' into JDK-8356760 > - Simplify the test code > - Address some review comments > > Add support for the following patterns: > toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) > toLong(maskAll(false)) => 0 > > And add more test cases. > - Merge branch 'master' into JDK-8356760 > - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases > > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would > set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent > to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is > relative smaller than that of `fromLong`. This patch does the conversion > for these cases if `l` is a compile time constant. > > And this conversion also enables further optimizations that recognize > maskAll patterns, see [1]. > > Some JTReg test cases are added to ensure the optimization is effective. > > I tried many different ways to write a JMH benchmark, but failed. Since > the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific > compile-time constant, the statement will be hoisted out of the loop. > If we don't use a loop, the hotspot will become other instructions, and > no obvious performance change was observed. However, combined with the > optimization of [1], we can observe a performance improvement of about > 7% on both aarch64 and x64. > > The patch was tested on both aarch64 and x64, all of tier1 tier2 and > tier3 tests passed. > > [1] https://github.com/openjdk/jdk/pull/24674 src/hotspot/share/opto/vectornode.cpp line 1977: > 1975: vect_type()->eq(in1->in(1)->bottom_type())) { > 1976: return in1->in(1); > 1977: } This is nice to have ideal transformation, but lane-compatible mask casts are anyway no-ops. 
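For illustration, a self-contained sketch of the VectorMaskCast (VectorMaskCast x) => x collapse discussed in this comment; the condition and returned node follow the quoted diff, while the choice of the Identity hook and the surrounding method shape are assumptions, not the exact JDK change:

    // Sketch only: a mask cast of a mask cast is a no-op when the outer cast
    // returns to the inner input's original vector type.
    Node* VectorMaskCastNode::Identity(PhaseGVN* phase) {
      Node* in1 = in(1);
      if (in1->Opcode() == Op_VectorMaskCast &&
          vect_type()->eq(in1->in(1)->bottom_type())) {
        return in1->in(1);  // reuse the original mask
      }
      return this;
    }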
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2218816668 From jbhateja at openjdk.org Mon Jul 21 12:10:47 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 21 Jul 2025 12:10:47 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4] In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 09:09:14 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is >> relative smaller than that of `fromLong`. So this patch does the conversion for these cases. >> >> The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. >> >> Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. >> >> This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like >> >> VectorMaskToLong (VectorLongToMask x) => x >> >> >> Hence, this patch also added the following optimizations: >> >> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> >> VectorMaskCast (VectorMaskCast x) => x >> >> And we can see noticeable performance improvement with the above optimizations for floating-point types. >> >> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Error After Error Uplift >> microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 >> microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 >> microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 >> microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 >> microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 >> microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 >> microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 >> microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 >> >> >> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: >> >> Benchm... > > erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - Refactor the implementation > > Do the convertion in C2's IGVN phase to cover more cases. > - Merge branch 'master' into JDK-8356760 > - Simplify the test code > - Address some review comments > > Add support for the following patterns: > toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) > toLong(maskAll(false)) => 0 > > And add more test cases. 
> - Merge branch 'master' into JDK-8356760 > - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases > > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would > set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent > to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is > relative smaller than that of `fromLong`. This patch does the conversion > for these cases if `l` is a compile time constant. > > And this conversion also enables further optimizations that recognize > maskAll patterns, see [1]. > > Some JTReg test cases are added to ensure the optimization is effective. > > I tried many different ways to write a JMH benchmark, but failed. Since > the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific > compile-time constant, the statement will be hoisted out of the loop. > If we don't use a loop, the hotspot will become other instructions, and > no obvious performance change was observed. However, combined with the > optimization of [1], we can observe a performance improvement of about > 7% on both aarch64 and x64. > > The patch was tested on both aarch64 and x64, all of tier1 tier2 and > tier3 tests passed. > > [1] https://github.com/openjdk/jdk/pull/24674 Rest of the patch looks good to me apart from minor changes proposed test/micro/org/openjdk/bench/jdk/incubator/vector/MaskFromLongToLongBenchmark.java line 34: > 32: @Fork(value = 1, jvmArgs = {"--add-modules=jdk.incubator.vector"}) > 33: public class MaskFromLongToLongBenchmark { > 34: private static final int ITERATION = 10000; It will be nice to add a synthetic micro for cast chain transform added along with this patch. following micro shows around 1.5x gains on AVX2 system. import jdk.incubator.vector.*; import java.util.stream.IntStream; public class mask_cast_chain { public static final VectorSpecies FSP = FloatVector.SPECIES_128; public static long micro(float [] src1, float [] src2, int ctr) { long res = 0; for (int i = 0; i < FSP.loopBound(src1.length); i += FSP.length()) { res += FloatVector.fromArray(FSP, src1, i) .compare(VectorOperators.GE, FloatVector.fromArray(FSP, src2, i)) .cast(DoubleVector.SPECIES_256) .cast(FloatVector.SPECIES_128) .toLong(); } return res * ctr; } public static void main(String [] args) { float [] src1 = new float[1024]; float [] src2 = new float[1024]; IntStream.range(0, src1.length).forEach(i -> {src1[i] = (float)i;}); IntStream.range(0, src2.length).forEach(i -> {src2[i] = (float)500;}); long res = 0; for (int i = 0; i < 100000; i++) { res += micro(src1, src2, i); } long t1 = System.currentTimeMillis(); for (int i = 0; i < 100000; i++) { res += micro(src1, src2, i); } long t2 = System.currentTimeMillis(); System.out.println("[time] " + (t2 - t1) + "ms" + " [res] " + res); } } ------------- PR Review: https://git.openjdk.org/jdk/pull/25793#pullrequestreview-3037791349 PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2218999865 From mhaessig at openjdk.org Mon Jul 21 12:14:40 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 21 Jul 2025 12:14:40 GMT Subject: RFR: 8357694: RISC-V: Several IR verification tests fail when vlen=128 [v2] In-Reply-To: <__-vaE1gtG6vsSH8CWGTh4HPOy267edyiA3viq-YtWc=.163ece35-ae6a-4fec-9d7a-81dcbb72fa38@github.com> References: <__-vaE1gtG6vsSH8CWGTh4HPOy267edyiA3viq-YtWc=.163ece35-ae6a-4fec-9d7a-81dcbb72fa38@github.com> Message-ID: On Mon, 21 Jul 2025 05:01:45 GMT, Dingli Zhang wrote: >> Hi, >> Can you help to review this patch? 
Thanks! >> >> These tests failed because with a vlen of 128, these tests generate vectors containing only two elements. However, ??2-element reductions for INT/LONG are not profitable??, so the compiler won't generate the corresponding reductions IR. >> >> We can refer to here: >> https://github.com/openjdk/jdk/blob/441dbde2c3c915ffd916e39a5b4a91df5620d7f3/src/hotspot/share/opto/superword.cpp#L1606-L1633 >> >> According to the explanation above, when I use `-XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=2`, these cases passed with vlen=128. >> >> We can fix this problem by adding the restriction of `MaxVectorSize` greater than or equal to 32 (256 bits) to these test cases. >> >> ## Test (fastdebug) >> ### Test on k1 and qemu (w/ RVV, vlen=128) >> - compiler/vectorization/runner/LoopReductionOpTest.java >> - compiler/c2/irTests/TestIfMinMax.java >> - compiler/loopopts/superword/RedTest_long.java >> - compiler/loopopts/superword/SumRed_Long.java >> - compiler/loopopts/superword/TestGeneralizedReductions.java >> - compiler/loopopts/superword/TestUnorderedReductionPartialVectorization.java > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Apply fix only on RISC-V Testing passed. ------------- Marked as reviewed by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/26408#pullrequestreview-3037810228 From mhaessig at openjdk.org Mon Jul 21 12:52:42 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 21 Jul 2025 12:52:42 GMT Subject: RFR: 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts [v3] In-Reply-To: References: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> Message-ID: <5W0bw4i-UXhewzb599af1F95TD3JIBUvmGwRGuMu6BI=.5d99ac61-d0e6-41b9-89b1-c479033e21e8@github.com> On Mon, 21 Jul 2025 09:04:38 GMT, David Briemann wrote: >> Add the TimeoutFactor property to the CompileFramework to avoid timeouts on different systems. > > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > make timeout factor scale linearly Testing passed. ------------- Marked as reviewed by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/26388#pullrequestreview-3037962096 From snatarajan at openjdk.org Mon Jul 21 13:12:56 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Mon, 21 Jul 2025 13:12:56 GMT Subject: RFR: 8358696: Assert with extreme values for -XX:BciProfileWidth Message-ID: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> **Issue** Extreme values for BciProfileWidth flag such as `java -XX:BciProfileWidth=-1 -version` and `java -XX:BciProfileWidth=100000 -version `results in assert failure `assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. This is observed in a x86 machine. **Analysis** On debugging the issue, I found that increasing the size of the interpreter using the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` prevented the above mentioned assert from failing for large values of BciProfileWidth. **Proposal** Considering the fact that larger BciProfileWidth results in slower profiling, I have proposed a range between 0 to 5000 to restrict the value for BciProfileWidth for x86 machines. 
This maximum value is based on modifying the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` using the smallest `InterpreterCodeSize` for all the architectures. As for the lower bound, a value of -1 would be the same as 0, as this simply means no return bci's will be recorded in ret profile. **Issue in AArch64** Additionally running the command `java -XX:BciProfileWidth= 10000 -version` (or larger values) results in a different failure `assert(offset_ok_for_immed(offset(), size)) failed: must be, was: 32768, 3` on an AArch64 machine.This is an issue of maximum offset for `ldr/str` in AArch64 which can be fixed using `form_address` as mentioned in [JDK-8342736](https://bugs.openjdk.org/browse/JDK-8342736). In my preliminary fix using `form_address` on AArch64 machine. I had to modify 3 `ldr` and 1 `str` instruction (in file `src/hotspot/cpu/aarch64/interp_masm_aarch64.cpp` at line number 926, 983, and 997). With this fix using `form_address`, `BciProfileWidth` works for maximum of 5000 after which it crashes with`assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. Without this fix `BciProfileWidth` works for a maximum value of 1300. Currently, I have suggested to restrict the upper bound on AArch64 to 1000 instead of fixing it with `form_address`. **Question to reviewers** Do you think this is a reasonable fix ? For AArch64 do you suggest fixing using `form_address` ? If yes, do I fix it under this PR or create another one ? ------------- Commit messages: - Fix for AArch64 - Modified the upper bound - initial commit Changes: https://git.openjdk.org/jdk/pull/26139/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26139&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8358696 Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26139.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26139/head:pull/26139 PR: https://git.openjdk.org/jdk/pull/26139 From dzhang at openjdk.org Mon Jul 21 13:33:50 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 21 Jul 2025 13:33:50 GMT Subject: RFR: 8357694: RISC-V: Several IR verification tests fail when vlen=128 [v2] In-Reply-To: <__-vaE1gtG6vsSH8CWGTh4HPOy267edyiA3viq-YtWc=.163ece35-ae6a-4fec-9d7a-81dcbb72fa38@github.com> References: <__-vaE1gtG6vsSH8CWGTh4HPOy267edyiA3viq-YtWc=.163ece35-ae6a-4fec-9d7a-81dcbb72fa38@github.com> Message-ID: On Mon, 21 Jul 2025 05:01:45 GMT, Dingli Zhang wrote: >> Hi, >> Can you help to review this patch? Thanks! >> >> These tests failed because with a vlen of 128, these tests generate vectors containing only two elements. However, ??2-element reductions for INT/LONG are not profitable??, so the compiler won't generate the corresponding reductions IR. >> >> We can refer to here: >> https://github.com/openjdk/jdk/blob/441dbde2c3c915ffd916e39a5b4a91df5620d7f3/src/hotspot/share/opto/superword.cpp#L1606-L1633 >> >> According to the explanation above, when I use `-XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=2`, these cases passed with vlen=128. >> >> We can fix this problem by adding the restriction of `MaxVectorSize` greater than or equal to 32 (256 bits) to these test cases. 
>> >> ## Test (fastdebug) >> ### Test on k1 and qemu (w/ RVV, vlen=128) >> - compiler/vectorization/runner/LoopReductionOpTest.java >> - compiler/c2/irTests/TestIfMinMax.java >> - compiler/loopopts/superword/RedTest_long.java >> - compiler/loopopts/superword/SumRed_Long.java >> - compiler/loopopts/superword/TestGeneralizedReductions.java >> - compiler/loopopts/superword/TestUnorderedReductionPartialVectorization.java > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Apply fix only on RISC-V Thanks all for the review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26408#issuecomment-3096802885 From duke at openjdk.org Mon Jul 21 13:33:50 2025 From: duke at openjdk.org (duke) Date: Mon, 21 Jul 2025 13:33:50 GMT Subject: RFR: 8357694: RISC-V: Several IR verification tests fail when vlen=128 [v2] In-Reply-To: <__-vaE1gtG6vsSH8CWGTh4HPOy267edyiA3viq-YtWc=.163ece35-ae6a-4fec-9d7a-81dcbb72fa38@github.com> References: <__-vaE1gtG6vsSH8CWGTh4HPOy267edyiA3viq-YtWc=.163ece35-ae6a-4fec-9d7a-81dcbb72fa38@github.com> Message-ID: On Mon, 21 Jul 2025 05:01:45 GMT, Dingli Zhang wrote: >> Hi, >> Can you help to review this patch? Thanks! >> >> These tests failed because with a vlen of 128, these tests generate vectors containing only two elements. However, ??2-element reductions for INT/LONG are not profitable??, so the compiler won't generate the corresponding reductions IR. >> >> We can refer to here: >> https://github.com/openjdk/jdk/blob/441dbde2c3c915ffd916e39a5b4a91df5620d7f3/src/hotspot/share/opto/superword.cpp#L1606-L1633 >> >> According to the explanation above, when I use `-XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=2`, these cases passed with vlen=128. >> >> We can fix this problem by adding the restriction of `MaxVectorSize` greater than or equal to 32 (256 bits) to these test cases. >> >> ## Test (fastdebug) >> ### Test on k1 and qemu (w/ RVV, vlen=128) >> - compiler/vectorization/runner/LoopReductionOpTest.java >> - compiler/c2/irTests/TestIfMinMax.java >> - compiler/loopopts/superword/RedTest_long.java >> - compiler/loopopts/superword/SumRed_Long.java >> - compiler/loopopts/superword/TestGeneralizedReductions.java >> - compiler/loopopts/superword/TestUnorderedReductionPartialVectorization.java > > Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision: > > Apply fix only on RISC-V @DingliZhang Your change (at version 43815f2f88de706cf3badb524147c8ce278182a2) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26408#issuecomment-3096804806 From dzhang at openjdk.org Mon Jul 21 13:38:49 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Mon, 21 Jul 2025 13:38:49 GMT Subject: Integrated: 8357694: RISC-V: Several IR verification tests fail when vlen=128 In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 02:30:03 GMT, Dingli Zhang wrote: > Hi, > Can you help to review this patch? Thanks! > > These tests failed because with a vlen of 128, these tests generate vectors containing only two elements. However, ??2-element reductions for INT/LONG are not profitable??, so the compiler won't generate the corresponding reductions IR. 
> > We can refer to here: > https://github.com/openjdk/jdk/blob/441dbde2c3c915ffd916e39a5b4a91df5620d7f3/src/hotspot/share/opto/superword.cpp#L1606-L1633 > > According to the explanation above, when I use `-XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=2`, these cases passed with vlen=128. > > We can fix this problem by adding the restriction of `MaxVectorSize` greater than or equal to 32 (256 bits) to these test cases. > > ## Test (fastdebug) > ### Test on k1 and qemu (w/ RVV, vlen=128) > - compiler/vectorization/runner/LoopReductionOpTest.java > - compiler/c2/irTests/TestIfMinMax.java > - compiler/loopopts/superword/RedTest_long.java > - compiler/loopopts/superword/SumRed_Long.java > - compiler/loopopts/superword/TestGeneralizedReductions.java > - compiler/loopopts/superword/TestUnorderedReductionPartialVectorization.java This pull request has now been integrated. Changeset: 15b5b54a Author: Dingli Zhang Committer: Hamlin Li URL: https://git.openjdk.org/jdk/commit/15b5b54ac707ba0d4e473fd6eb02c38a8efe705c Stats: 49 lines in 6 files changed: 36 ins; 0 del; 13 mod 8357694: RISC-V: Several IR verification tests fail when vlen=128 Reviewed-by: mhaessig, fyang, mli ------------- PR: https://git.openjdk.org/jdk/pull/26408 From fjiang at openjdk.org Mon Jul 21 14:41:43 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Mon, 21 Jul 2025 14:41:43 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v5] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 08:26:34 GMT, Galder Zamarre?o wrote: >> Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains eight additional commits since the last revision: >> >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - also keep overlapping flag >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - Revert RISCV Macro modification >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses >> - riscv: fix c1 primitive array clone intrinsic regression > > src/hotspot/cpu/riscv/c1_LIRGenerator_riscv.cpp line 775: > >> 773: arraycopy_helper(x, &flags, &expected_type); >> 774: if (x->check_flag(Instruction::OmitChecksFlag)) { >> 775: flags = (flags & (LIR_OpArrayCopy::unaligned | LIR_OpArrayCopy::overlapping)); > > The changes in the two files need to be in synch, so I wonder if `LIR_OpArrayCopy::unaligned | LIR_OpArrayCopy::overlapping` could be abstracted away within a function in `LIR_OpArrayCopy`. > > So something like this (apologies for any syntactic/semantic errors): > > > flags = (flags & LIR_OpArrayGopy::get_array_copy_flags()); > > > Then on the other method something like: > > > ((flags & ~(LIR_OpArrayGopy::get_array_copy_flags())) == 0) > > > Function name is just an example, feel free to suggest some other if you think it fits better. > > Thoughts? Adding new flag check routines seems like a good idea, but it's a bit challenging to choose a name, as there are too many flags for `LIR_OPArrayCopy`[1]. Perhaps something like `should_check_unaligned_or_overlapping` would be suitable? 1. 
https://github.com/openjdk/jdk/blob/15b5b54ac707ba0d4e473fd6eb02c38a8efe705c/src/hotspot/share/c1/c1_LIR.hpp#L1257-L1271 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2219422716 From mbaesken at openjdk.org Mon Jul 21 14:49:40 2025 From: mbaesken at openjdk.org (Matthias Baesken) Date: Mon, 21 Jul 2025 14:49:40 GMT Subject: RFR: 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts [v3] In-Reply-To: References: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> Message-ID: On Mon, 21 Jul 2025 09:04:38 GMT, David Briemann wrote: >> Add the TimeoutFactor property to the CompileFramework to avoid timeouts on different systems. > > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > make timeout factor scale linearly Marked as reviewed by mbaesken (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26388#pullrequestreview-3038485564 From eastigeevich at openjdk.org Mon Jul 21 15:02:26 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Mon, 21 Jul 2025 15:02:26 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v38] In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 16:19:51 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This does not modify existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: > > Require caller to hold locks lgtm ------------- Marked as reviewed by eastigeevich (Committer). PR Review: https://git.openjdk.org/jdk/pull/23573#pullrequestreview-3038528227 From dbriemann at openjdk.org Mon Jul 21 15:29:31 2025 From: dbriemann at openjdk.org (David Briemann) Date: Mon, 21 Jul 2025 15:29:31 GMT Subject: RFR: 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts [v3] In-Reply-To: References: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> Message-ID: On Mon, 21 Jul 2025 09:04:38 GMT, David Briemann wrote: >> Add the TimeoutFactor property to the CompileFramework to avoid timeouts on different systems. > > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > make timeout factor scale linearly Thanks for your reviews! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26388#issuecomment-3097243336 From duke at openjdk.org Mon Jul 21 15:35:32 2025 From: duke at openjdk.org (duke) Date: Mon, 21 Jul 2025 15:35:32 GMT Subject: RFR: 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts [v3] In-Reply-To: References: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> Message-ID: On Mon, 21 Jul 2025 09:04:38 GMT, David Briemann wrote: >> Add the TimeoutFactor property to the CompileFramework to avoid timeouts on different systems. > > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > make timeout factor scale linearly @dbriemann Your change (at version 074bbca3256e00230ed2829e909d8dffc3076f50) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26388#issuecomment-3097264481 From kvn at openjdk.org Mon Jul 21 15:45:50 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 21 Jul 2025 15:45:50 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v24] In-Reply-To: References: Message-ID: On Tue, 3 Jun 2025 00:32:26 GMT, Vladimir Kozlov wrote: >> Thanks for pointing out the missing JVMTI event publication. I?m currently looking into what?s required to address that, along with JFR event publication that may also have been missed. I?d appreciate hearing others? thoughts on how critical this is: should we treat it as a blocker for integration, or would it be acceptable to follow up with a separate issue? >> >> We?re hoping to get this into JDK 25, as it would simplify both development and backporting of features related to hot code grouping. That said, if the consensus is that JVMTI/JFR support is essential upfront, this can be delayed until JDK 26. > >> We?re hoping to get this into JDK 25, as it would simplify both development and backporting of features related to hot code grouping. That said, if the consensus is that JVMTI/JFR support is essential upfront, this can be delayed until JDK 26. > > I don't think this can be put into JDK 25. Too late and changes are not simple. And yes, JVMTI/JFR support is essential - you have to support all public functionalities of VM. > @vnkozlov When you get a chance, would you mind taking another look at this PR? @chadrako I promise to look soon but currently I am busy with Leyden before JVMLS. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23573#issuecomment-3097295698 From jbhateja at openjdk.org Mon Jul 21 15:48:28 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 21 Jul 2025 15:48:28 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: References: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> Message-ID: On Mon, 14 Jul 2025 17:27:35 GMT, Srinivas Vamsi Parasa wrote: >> src/hotspot/cpu/x86/gc/z/zBarrierSetAssembler_x86.cpp line 114: >> >>> 112: __ paired_push(rax); >>> 113: } >>> 114: __ paired_push(rcx); >> >> Hi @vamsi-parasa , for consecutive push/pop we can use push2/pop2 and 16byte alignment can be guaranteed using following technique >> https://github.com/openjdk/jdk/pull/25351/files#diff-d5d721ebf93346ba66e81257e4f6c5e6268d59774313c61e97353c0dfbf686a5R94 > > Hi Jatin (@jatin-bhateja), for the first iteration, would it be ok to get the push_paired/pop_paired changes integrated and then make the push2p/pop2p related optimizations in a separate PR? > > Thanks, > Vamsi Hi @vamsi-parasa , I think it's ok not to expose pop_ppx / push_ppx as separate interfaces, and let processor forward the values b/w push and matching pop if balancing constraints are satisfied. image ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2219586716 From clanger at openjdk.org Mon Jul 21 15:48:38 2025 From: clanger at openjdk.org (Christoph Langer) Date: Mon, 21 Jul 2025 15:48:38 GMT Subject: RFR: 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts [v3] In-Reply-To: References: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> Message-ID: On Mon, 21 Jul 2025 09:04:38 GMT, David Briemann wrote: >> Add the TimeoutFactor property to the CompileFramework to avoid timeouts on different systems. > > David Briemann has updated the pull request incrementally with one additional commit since the last revision: > > make timeout factor scale linearly Marked as reviewed by clanger (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26388#pullrequestreview-3038708343 From dbriemann at openjdk.org Mon Jul 21 15:51:42 2025 From: dbriemann at openjdk.org (David Briemann) Date: Mon, 21 Jul 2025 15:51:42 GMT Subject: Integrated: 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts In-Reply-To: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> References: <7TIxniSPYdjC2hKjxZDsj4c8SEgg3eNOg2i6XRvZYRc=.e075499f-0d11-4024-b60f-a2b2c0d6e706@github.com> Message-ID: On Fri, 18 Jul 2025 13:16:41 GMT, David Briemann wrote: > Add the TimeoutFactor property to the CompileFramework to avoid timeouts on different systems. This pull request has now been integrated. 
Changeset: f8c8bcf4 Author: David Briemann Committer: Christoph Langer URL: https://git.openjdk.org/jdk/commit/f8c8bcf4fd31509fdb40d32e8e16ba4fba1f987d Stats: 4 lines in 1 file changed: 2 ins; 0 del; 2 mod 8362602: Add test.timeout.factor to CompileFactory to avoid test timeouts Reviewed-by: mhaessig, mbaesken, clanger ------------- PR: https://git.openjdk.org/jdk/pull/26388 From kxu at openjdk.org Mon Jul 21 16:49:56 2025 From: kxu at openjdk.org (Kangcheng Xu) Date: Mon, 21 Jul 2025 16:49:56 GMT Subject: RFR: 8347555: [REDO] C2: implement optimization for series of Add of unique value [v17] In-Reply-To: References: Message-ID: > [JDK-8347555](https://bugs.openjdk.org/browse/JDK-8347555) is a redo of [JDK-8325495](https://bugs.openjdk.org/browse/JDK-8325495) was [first merged](https://git.openjdk.org/jdk/pull/20754) then backed out due to a regression. This patch redos the feature and fixes the bit shift overflow problem. For more information please refer to the previous PR. > > When constanlizing multiplications (possibly in forms on `lshifts`), the multiplier is upgraded to long and then later narrowed to int if needed. However, when a `lshift` operand is exactly `32`, overflowing an int, using long has an unexpected result. (i.e., `(1 << 32) = 1` and `(int) (1L << 32) = 0`) > > The following was implemented to address this issue. > > if (UseNewCode2) { > *multiplier = bt == T_INT > ? (jlong) (1 << con->get_int()) // loss of precision is expected for int as it overflows > : ((jlong) 1) << con->get_int(); > } else { > *multiplier = ((jlong) 1 << con->get_int()); > } > > > Two new bitshift overflow tests were added. Kangcheng Xu has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 66 commits: - Merge remote-tracking branch 'origin/master' into arithmetic-canonicalization - Allow swapping LHS/RHS in case not matched - Merge branch 'refs/heads/master' into arithmetic-canonicalization - improve comment readability and struct helper functions - remove asserts, add more documentation - fix typo: lhs->rhs - update comments - use java_add to avoid cpp overflow UB - add assertion for MulLNode too - include simple addition as a case of power of two additions - ... and 56 more: https://git.openjdk.org/jdk/compare/f8c8bcf4...1f6f2bc6 ------------- Changes: https://git.openjdk.org/jdk/pull/23506/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23506&range=16 Stats: 849 lines in 6 files changed: 848 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/23506.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23506/head:pull/23506 PR: https://git.openjdk.org/jdk/pull/23506 From kxu at openjdk.org Mon Jul 21 17:22:40 2025 From: kxu at openjdk.org (Kangcheng Xu) Date: Mon, 21 Jul 2025 17:22:40 GMT Subject: RFR: 8353290: C2: Refactor PhaseIdealLoop::is_counted_loop() [v7] In-Reply-To: References: Message-ID: > This PR refactors `PhaseIdealLoop::is_counted_loop()` into (mostly) `CountedLoopConverter::is_counted_loop()` and `CountedLoopConverter::convert()` to decouple the detection and conversion code. This enables us to try different loop configurations easily and finally convert once a counted loop is found. > > A nested `PhaseIdealLoop::CountedLoopConverter` class is created to handle the context, but I'm not if this is the best name or place for it. Please let me know what you think. > > Blocks [JDK-8336759](https://bugs.openjdk.org/browse/JDK-8336759). 
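A rough usage sketch of the decoupled detect/convert flow described above; only the class and method names come from the PR text, the constructor arguments and call site are assumptions for illustration:

    // Hypothetical call site inside PhaseIdealLoop:
    CountedLoopConverter converter(this, loop);
    if (converter.is_counted_loop()) {  // detection only, no graph mutation yet
      converter.convert();              // commit the conversion once a configuration matches
    }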
Kangcheng Xu has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 23 commits: - Merge remote-tracking branch 'origin/master' into counted-loop-refactor # Conflicts: # src/hotspot/share/opto/loopnode.cpp # src/hotspot/share/opto/loopnode.hpp - Merge branch 'master' into counted-loop-refactor # Conflicts: # src/hotspot/share/opto/loopnode.cpp # src/hotspot/share/opto/loopnode.hpp # src/hotspot/share/opto/loopopts.cpp - Merge remote-tracking branch 'origin/master' into counted-loop-refactor - further refactor is_counted_loop() by extracting functions - WIP: refactor is_counted_loop() - WIP: refactor is_counted_loop() - WIP: review followups - reviewer suggested changes - line break - remove TODOs - ... and 13 more: https://git.openjdk.org/jdk/compare/f8c8bcf4...345c6ccc ------------- Changes: https://git.openjdk.org/jdk/pull/24458/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24458&range=06 Stats: 926 lines in 3 files changed: 422 ins; 210 del; 294 mod Patch: https://git.openjdk.org/jdk/pull/24458.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24458/head:pull/24458 PR: https://git.openjdk.org/jdk/pull/24458 From sparasa at openjdk.org Mon Jul 21 17:25:26 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Mon, 21 Jul 2025 17:25:26 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: References: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> Message-ID: On Mon, 21 Jul 2025 15:44:47 GMT, Jatin Bhateja wrote: >> Hi Jatin (@jatin-bhateja), for the first iteration, would it be ok to get the push_paired/pop_paired changes integrated and then make the push2p/pop2p related optimizations in a separate PR? >> >> Thanks, >> Vamsi > > Hi @vamsi-parasa , I think it's ok not to expose pop_ppx / push_ppx as separate interfaces, and let processor forward the values b/w push and matching pop if balancing constraints are satisfied. > > image Hi Jatin (@jatin-bhateja), the reason to make the push_ppx/pop_ppx usage explicit is because an unbalanced push_ppx operation has a performance penalty. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2219810277 From sparasa at openjdk.org Mon Jul 21 17:34:37 2025 From: sparasa at openjdk.org (Srinivas Vamsi Parasa) Date: Mon, 21 Jul 2025 17:34:37 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: References: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> Message-ID: On Mon, 21 Jul 2025 17:23:15 GMT, Srinivas Vamsi Parasa wrote: >> Hi @vamsi-parasa , I think it's ok not to expose pop_ppx / push_ppx as separate interfaces, and let processor forward the values b/w push and matching pop if balancing constraints are satisfied. >> >> image > > Hi Jatin (@jatin-bhateja), the reason to make the push_ppx/pop_ppx usage explicit is because an unbalanced push_ppx operation has a performance penalty. > Please create a new RFE for its tracking. Hi Jatin(@jatin-bhateja) , please see the JBS issue (https://bugs.openjdk.org/browse/JDK-8362903) for push2/pop2 enabling in future. 
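To make the balancing constraint concrete, a hypothetical sketch of keeping PPX-hinted pushes and pops balanced and LIFO-ordered; the helper names are taken from this thread, but the signatures and the surrounding function are assumptions, not code from the patch:

    #define __ masm->
    // Every push_ppx must be matched by a pop_ppx of the same register in
    // reverse order; an unmatched PPX-hinted push defeats the hardware
    // push/pop pairing and costs performance, hence the explicit helpers.
    static void spill_across_call(MacroAssembler* masm) {
      __ push_ppx(rax);
      __ push_ppx(rcx);
      // ... code that may clobber rax/rcx, e.g. a runtime call ...
      __ pop_ppx(rcx);
      __ pop_ppx(rax);
    }
    #undef __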
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2219828160 From dlong at openjdk.org Mon Jul 21 20:03:01 2025 From: dlong at openjdk.org (Dean Long) Date: Mon, 21 Jul 2025 20:03:01 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v5] In-Reply-To: References: Message-ID: On Fri, 11 Jul 2025 11:50:33 GMT, Feilong Jiang wrote: >> Hi, please consider. >> [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. >> The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. >> If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. >> This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. >> We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. >> >> This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. >> The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. >> >> Test on linux-riscv64: >> - [x] Tier1-3 >> >> JMH data on P550 SBC for reference (w/o and w/ the patch): >> >> Before: >> >> Without COH: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op >> ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op >> ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op >> ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op >> ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op >> ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op >> ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op >> ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op >> ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op >> ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op >> ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op >> ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op >> >> ------------------------------------------------------------------------- >> With COH: >> >> Benchmark (size) Mode Cnt Score Error Un... > > Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains eight additional commits since the last revision: > > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - also keep overlapping flag > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - Revert RISCV Macro modification > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses > - riscv: fix c1 primitive array clone intrinsic regression Testing at Oracle passed. ------------- Marked as reviewed by dlong (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25976#pullrequestreview-3039585537 From kvn at openjdk.org Mon Jul 21 20:24:33 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 21 Jul 2025 20:24:33 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v15] In-Reply-To: References: Message-ID: <6XRbYO3L5zyEIH2MuKSntl2IdAG2zgNVNm_0PakEI9g=.40266744-d0d0-421f-b5ff-17fd0117b693@github.com> On Wed, 16 Jul 2025 10:43:34 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. >> >> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. >> >> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Refine lower bound computation I am not sure about JDK 25 approval for these changes. Can you do simple fix for JDK 25 as @merykitty suggested: "I suggest removing all the logic and simply returning the bottom type" ? Will it be the same complexity? Will it affect performance (and how much)? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3098578332 From vlivanov at openjdk.org Mon Jul 21 21:15:31 2025 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Mon, 21 Jul 2025 21:15:31 GMT Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v5] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 12:36:31 GMT, Marc Chevalier wrote: >> A first part toward a better support of pure functions, but this time, with guidance from @iwanowww. >> >> ## Pure Functions >> >> Pure functions (considered here) are functions that have no side effects, no effect on the control flow (no exception or such), cannot deopt etc.. It's really a function that you can execute anywhere, with whichever arguments without effect other than wasting time. Integer division is not pure as dividing by zero is throwing. But many floating point functions will just return `NaN` or `+/-infinity` in problematic cases. >> >> ## Scope >> >> We are not going all powerful for now! It's mostly about identifying some pure functions and being able to remove them if the result is unused. 
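For illustration (an editorial sketch, not part of the quoted description): the kind of Java code this targets is a call to a math routine whose result is dead. Whether a particular call is compiled as a removable pure runtime call depends on which intrinsics C2 models as pure; Math.cbrt is used here only as an assumed example.

    // Editorial sketch: 'unused' is computed but never read. If the runtime
    // call behind Math.cbrt is modeled as pure (no side effects, cannot
    // deoptimize), the compiler is free to drop the call entirely.
    public class UnusedPureCall {
        static double f(double x) {
            double unused = Math.cbrt(x); // dead result
            return x * 2.0;               // only this value escapes
        }

        public static void main(String[] args) {
            System.out.println(f(8.0));
        }
    }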
Some other things are not part of this PR, on purpose. Especially, this PR doesn't propose a way to move pure calls around. The reason is that pure calls are later expanded into regular calls, which require a control input. To be able to do the expansion, we just keep the control in the pure call as well. >> >> ## Implementation Overview >> >> We created here some new node kind for pure calls, inheriting leaf calls, that are expanded into regular leaf calls during final graph reshaping. The possibility to support pure call directly in AD file is left open. >> >> This PR also introduces `TupleNode` (largely based on an original idea/implem of @iwanowww), that just tie multiple input together and play well with `ProjNode`: the n-th projection of a `TupleNode` is the n-th input of the tuple. This is a convenient way to skip and remove nodes from the graph while delegating the difficulty of the surgery to the trusted IGVN's implementation. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Tentative to address Tobias' comments Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25760#pullrequestreview-3039873221 From kvn at openjdk.org Mon Jul 21 23:16:56 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 21 Jul 2025 23:16:56 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v38] In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 16:19:51 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This does not modify existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: > > Require caller to hold locks I have few comments. And I did not look on tests. Did you check `src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/NMethod.java` since you added Nmethod reference counter? src/hotspot/share/code/codeBehaviours.cpp line 46: > 44: bool DefaultICProtectionBehaviour::is_safe(nmethod* method) { > 45: return SafepointSynchronize::is_at_safepoint() || CompiledIC_lock->owned_by_self() || method->is_not_installed(); > 46: } Can you rename `method` to `nm` as we call it in similar code in GCs? src/hotspot/share/code/nmethod.cpp line 1164: > 1162: #endif > 1163: + align_up(debug_info->data_size() , oopSize) > 1164: + align_up(ImmutableDataReferencesCounterSize , oopSize); Why you need to realign this? There is no requirement to have spaces before `,` src/hotspot/share/code/nmethod.cpp line 1630: > 1628: if (!is_java_method()) { > 1629: return false; > 1630: } This should be first check. 
src/hotspot/share/code/nmethod.cpp line 2453: > 2451: // Free memory if this is the last nmethod referencing immutable data > 2452: if (get_immutable_data_references_counter() == 1) { > 2453: os::free(_immutable_data); You should add assert(get_immutable_data_references_counter() > 0) before `if (counter == 1)` and zero it when freed. ------------- PR Review: https://git.openjdk.org/jdk/pull/23573#pullrequestreview-3040057036 PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2220488851 PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2220501624 PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2220527842 PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2220532609 From kvn at openjdk.org Mon Jul 21 23:16:56 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 21 Jul 2025 23:16:56 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v38] In-Reply-To: References: Message-ID: On Wed, 12 Mar 2025 18:44:39 GMT, Vladimir Kozlov wrote: >> Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: >> >> Require caller to hold locks > > src/hotspot/share/code/nmethod.cpp line 1406: > >> 1404: _oop_maps = nm.oop_maps()->clone(); >> 1405: } >> 1406: _relocation_size = nm._relocation_size; > > Did you consider to use `memcpy()` and update only changed fields? You did not answer my question about using `memcpy()`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2220503348 From duke at openjdk.org Mon Jul 21 23:41:35 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Mon, 21 Jul 2025 23:41:35 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v38] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 22:39:27 GMT, Vladimir Kozlov wrote: >> Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: >> >> Require caller to hold locks > > src/hotspot/share/code/nmethod.cpp line 1164: > >> 1162: #endif >> 1163: + align_up(debug_info->data_size() , oopSize) >> 1164: + align_up(ImmutableDataReferencesCounterSize , oopSize); > > Why you need to realign this? There is no requirement to have spaces before `,` It was aligned this way before my change. The addition of `ImmutableDataReferencesCounterSize` requires more spaces to keep it the same. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2220580694 From duke at openjdk.org Mon Jul 21 23:51:36 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Mon, 21 Jul 2025 23:51:36 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v38] In-Reply-To: References: Message-ID: <8heatt0AXFDKj3NiHyA0RjjrNaOOMxGPFEwPuEuweSg=.44cef183-4e4f-4a1b-8add-046ce8e874c7@github.com> On Mon, 21 Jul 2025 22:40:32 GMT, Vladimir Kozlov wrote: >> src/hotspot/share/code/nmethod.cpp line 1406: >> >>> 1404: _oop_maps = nm.oop_maps()->clone(); >>> 1405: } >>> 1406: _relocation_size = nm._relocation_size; >> >> Did you consider to use `memcpy()` and update only changed fields? > > You did not answer my question about using `memcpy()`. I had changed the implementation to use `memcpy()` instead but as @fisk pointed out in https://github.com/openjdk/jdk/pull/23573#issuecomment-2797120660 it was easier to copy a value unintentionally so I reverted that change > >> > > I'm worried about copying the nmethod epoch counters >> > >> > We should clear them.
If not, it is a bug. > > I'd like to change copying from opt-out to opt-in instead; that would make me feel more comfortable. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2220591570 From kvn at openjdk.org Mon Jul 21 23:59:34 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 21 Jul 2025 23:59:34 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v38] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 23:38:27 GMT, Chad Rakoczy wrote: >> src/hotspot/share/code/nmethod.cpp line 1164: >> >>> 1162: #endif >>> 1163: + align_up(debug_info->data_size() , oopSize) >>> 1164: + align_up(ImmutableDataReferencesCounterSize , oopSize); >> >> Why you need to realign this? There is no requirement to have spaces before `,` > > It was aligned this way before my change. The addition of `ImmutableDataReferencesCounterSize` requires more spaces to keep it the same. No it does not - you have a lot of unneeded spaces after `ImmutableDataReferencesCounterSize` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2220600736 From kvn at openjdk.org Mon Jul 21 23:59:35 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 21 Jul 2025 23:59:35 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v38] In-Reply-To: <8heatt0AXFDKj3NiHyA0RjjrNaOOMxGPFEwPuEuweSg=.44cef183-4e4f-4a1b-8add-046ce8e874c7@github.com> References: <8heatt0AXFDKj3NiHyA0RjjrNaOOMxGPFEwPuEuweSg=.44cef183-4e4f-4a1b-8add-046ce8e874c7@github.com> Message-ID: <_gUMP-cUWsM_DfTzxl7j0sza2lrS4iEiwQ2_yoopffE=.fbdd0659-3da9-4d16-9d53-8dea0808ccba@github.com> On Mon, 21 Jul 2025 23:49:14 GMT, Chad Rakoczy wrote: >> You did not answered my question adout using `memcpy > > I had changed the implementation to use `memcpy()` instead but as @fisk pointed out in https://github.com/openjdk/jdk/pull/23573#issuecomment-2797120660 it was easier to accidentally copy a value unintentionally so I reverted that change > >> > > I'm worried about copying the nmethod epoch counters >> > >> > We should clear them. If not, it is a bug. >> >> I'd like to change copying from opt-out to opt-in instead; that would make me feel more comfortable. Okay. We may look later to optimize it. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2220602281 From duke at openjdk.org Tue Jul 22 00:05:38 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Tue, 22 Jul 2025 00:05:38 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v38] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 23:55:26 GMT, Vladimir Kozlov wrote: >> It was aligned this way before my change. The addition of `ImmutableDataReferencesCounterSize` requires more spaces to keep it the same. > > No it does not - you have a lot of unneeded spaces after `ImmutableDataReferencesCounterSize` Oh sorry I see what you mean. I was aligning it to the nearest tab. I'll delete the extra spaces ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2220608957 From duke at openjdk.org Tue Jul 22 00:56:36 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Tue, 22 Jul 2025 00:56:36 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v38] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 23:13:28 GMT, Vladimir Kozlov wrote: > Did you check `src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/code/NMethod.java` since you added Nmethod reference counter? 
No I didn't, good catch. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23573#issuecomment-3100183774 From duke at openjdk.org Tue Jul 22 01:05:53 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Tue, 22 Jul 2025 01:05:53 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v39] In-Reply-To: References: Message-ID: > This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). > > When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. > > This does not modify existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality. > > Additional Testing: > - [ ] Linux x64 fastdebug all > - [ ] Linux aarch64 fastdebug all > - [ ] ... Chad Rakoczy has updated the pull request incrementally with five additional commits since the last revision: - Fix spacing - Update NMethod.java with immutable data changes - Rename method to nm - Add assert before freeing immutable data - Reorder is_relocatable checks ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23573/files - new: https://git.openjdk.org/jdk/pull/23573/files/1dcf47e4..1b001df8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=38 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=37-38 Stats: 62 lines in 6 files changed: 10 ins; 4 del; 48 mod Patch: https://git.openjdk.org/jdk/pull/23573.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23573/head:pull/23573 PR: https://git.openjdk.org/jdk/pull/23573 From fjiang at openjdk.org Tue Jul 22 01:12:32 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Tue, 22 Jul 2025 01:12:32 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v5] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 19:59:34 GMT, Dean Long wrote: > Testing at Oracle passed. Thanks for the testing! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25976#issuecomment-3100212006 From yadongwang at openjdk.org Tue Jul 22 01:26:31 2025 From: yadongwang at openjdk.org (Yadong Wang) Date: Tue, 22 Jul 2025 01:26:31 GMT Subject: Integrated: 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: <9JVlb4b9VLY29SI_ZH2UBrbsLnEySIxVKbZ939o-7EE=.8690c011-8971-4c20-b8cc-a5d85e3efdd0@github.com> On Thu, 10 Jul 2025 18:29:14 GMT, Yadong Wang wrote: > The bug is that the predicate rule of immByteMapBase would cause a ConP Node for an oop to be incorrectly matched as byte_map_base when the placeholder JNI handle address happened to be allocated at the address of byte_map_base. > > C2 uses JNI handles as placeholders to encode constant oops, and one such handle may be located at the address of byte_map_base, which is not memory reserved by the CardTable. This is possible because JNIHandleBlocks are allocated by malloc. > > // The assembler store_check code will do an unsigned shift of the oop, > // then add it to _byte_map_base, i.e.
> // > // _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift) > _byte_map = (CardValue*) rs.base(); > _byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift); > > In aarch64 port, C2 will incorrectly match ConP for oop to ConP for byte_map_base by the immByteMapBase operand. > > // Card Table Byte Map Base > operand immByteMapBase() > %{ > // Get base of card map > predicate((jbyte*)n->get_ptr() == > ((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base); > match(ConP); > > op_cost(0); > format %{ %} > interface(CONST_INTER); > %} > > // Load Byte Map Base Constant > instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con) > %{ > match(Set dst con); > > ins_cost(INSN_COST); > format %{ "adr $dst, $con\t# Byte Map Base" %} > > ins_encode(aarch64_enc_mov_byte_map_base(dst, con)); > > ins_pipe(ialu_imm); > %} > > As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address: > 0xffff25caf08c: ldaxr x8, [x11] > 0xffff25caf090: cmp x10, x8 > 0xffff25caf094: b.ne 0xffff25caf0a0 // b.any > 0xffff25caf098: stlxr w8, x28, [x11] > 0xffff25caf09c: cbnz w8, 0xffff25caf08c > 0xffff25caf0a0: orr x11, xzr, #0x3 > 0xffff25caf0a4: str x11, [x13] > 0xffff25caf0a8: b.eq 0xffff25caef80 // b.none > 0xffff25caf0ac: str x14, [sp] > 0xffff25caf0b0: add x2, sp, #0x20 > 0xffff25caf0b4: adrp x1, 0xffff21730000 > 0xffff25caf0b8: bl 0xffff256fffc0 > 0xffff25caf0bc: ldr x14, [sp] > 0xffff25caf0c0: b 0xffff25caef80 > 0xffff25caf0c4: add x13, sp, #0x20 > 0xffff25caf0c8: adrp x12, 0xffff21730000 > 0xffff25caf0cc: ldr x10, [x13] > 0xffff25caf0d0: cmp x10, xzr > 0xffff25caf0d4: b.eq 0xffff25caf130 // b.none > 0xffff25caf0d8: ldr x11, [x12] > 0xffff25caf0dc: tbnz w10, #1, 0xffff25caf0f... This pull request has now been integrated. Changeset: dccb1782 Author: Yadong Wang URL: https://git.openjdk.org/jdk/commit/dccb1782ec35d1ee95220a237aef29ddfc292cbd Stats: 32 lines in 1 file changed: 0 ins; 32 del; 0 mod 8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding Reviewed-by: shade, adinn ------------- PR: https://git.openjdk.org/jdk/pull/26249 From duke at openjdk.org Tue Jul 22 03:22:26 2025 From: duke at openjdk.org (erifan) Date: Tue, 22 Jul 2025 03:22:26 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4] In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 09:09:14 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is >> relative smaller than that of `fromLong`. So this patch does the conversion for these cases. >> >> The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. >> >> Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. >> >> This conversion also enables further optimizations that recognize maskAll patterns, see [1]. 
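For illustration (an editorial sketch, not part of the quoted message): the equivalence described above shown at the Java level. The species and constant below are only example assumptions; the point is that when the constant bits cover all lanes (or none), fromLong reduces to maskAll.

    // Editorial sketch (requires --add-modules jdk.incubator.vector):
    // with every lane bit of the constant set, fromLong is equivalent to
    // maskAll(true); with none set, it is equivalent to maskAll(false).
    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorMask;
    import jdk.incubator.vector.VectorSpecies;

    public class FromLongVsMaskAll {
        static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256; // 8 lanes

        public static void main(String[] args) {
            long allLaneBits = (1L << SPECIES.length()) - 1;
            VectorMask<Integer> m1 = VectorMask.fromLong(SPECIES, allLaneBits);
            VectorMask<Integer> m2 = SPECIES.maskAll(true);
            System.out.println(m1.toLong() == m2.toLong()); // true
        }
    }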
And we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like >> >> VectorMaskToLong (VectorLongToMask x) => x >> >> >> Hence, this patch also added the following optimizations: >> >> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> >> VectorMaskCast (VectorMaskCast x) => x >> >> And we can see noticeable performance improvement with the above optimizations for floating-point types. >> >> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Error After Error Uplift >> microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 >> microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 >> microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 >> microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 >> microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 >> microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 >> microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 >> microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 >> >> >> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: >> >> Benchm... > > erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - Refactor the implementation > > Do the convertion in C2's IGVN phase to cover more cases. > - Merge branch 'master' into JDK-8356760 > - Simplify the test code > - Address some review comments > > Add support for the following patterns: > toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) > toLong(maskAll(false)) => 0 > > And add more test cases. > - Merge branch 'master' into JDK-8356760 > - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases > > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would > set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent > to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is > relative smaller than that of `fromLong`. This patch does the conversion > for these cases if `l` is a compile time constant. > > And this conversion also enables further optimizations that recognize > maskAll patterns, see [1]. > > Some JTReg test cases are added to ensure the optimization is effective. > > I tried many different ways to write a JMH benchmark, but failed. Since > the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific > compile-time constant, the statement will be hoisted out of the loop. > If we don't use a loop, the hotspot will become other instructions, and > no obvious performance change was observed. However, combined with the > optimization of [1], we can observe a performance improvement of about > 7% on both aarch64 and x64. > > The patch was tested on both aarch64 and x64, all of tier1 tier2 and > tier3 tests passed. 
> > [1] https://github.com/openjdk/jdk/pull/24674 Thanks for your review, I'll update the code soon. ------------- PR Review: https://git.openjdk.org/jdk/pull/25793#pullrequestreview-3040675294 From duke at openjdk.org Tue Jul 22 03:22:28 2025 From: duke at openjdk.org (erifan) Date: Tue, 22 Jul 2025 03:22:28 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4] In-Reply-To: References: Message-ID: <-ZNeXOcmEACkhP4QKXKnWWEiT6ucjPY7Zz1HqvMeAoI=.c8fae49e-fcb0-41fb-84d1-4aa52ee83790@github.com> On Mon, 21 Jul 2025 06:41:43 GMT, Jatin Bhateja wrote: >> erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: >> >> - Refactor the implementation >> >> Do the convertion in C2's IGVN phase to cover more cases. >> - Merge branch 'master' into JDK-8356760 >> - Simplify the test code >> - Address some review comments >> >> Add support for the following patterns: >> toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) >> toLong(maskAll(false)) => 0 >> >> And add more test cases. >> - Merge branch 'master' into JDK-8356760 >> - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases >> >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would >> set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent >> to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is >> relative smaller than that of `fromLong`. This patch does the conversion >> for these cases if `l` is a compile time constant. >> >> And this conversion also enables further optimizations that recognize >> maskAll patterns, see [1]. >> >> Some JTReg test cases are added to ensure the optimization is effective. >> >> I tried many different ways to write a JMH benchmark, but failed. Since >> the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific >> compile-time constant, the statement will be hoisted out of the loop. >> If we don't use a loop, the hotspot will become other instructions, and >> no obvious performance change was observed. However, combined with the >> optimization of [1], we can observe a performance improvement of about >> 7% on both aarch64 and x64. >> >> The patch was tested on both aarch64 and x64, all of tier1 tier2 and >> tier3 tests passed. >> >> [1] https://github.com/openjdk/jdk/pull/24674 > > src/hotspot/share/opto/vectorIntrinsics.cpp line 692: > >> 690: // generate a MaskAll or Replicate instead. >> 691: >> 692: // The "maskAll" API uses the corresponding integer types for floating-point data. > > This is because mask all only accepts -1 and 0 values, since -1.0f in float in IEEE 754 format does not set all bits hence an floating point to integral conversion is mandatory here. Good to know this, thanks! > src/hotspot/share/opto/vectornode.cpp line 1520: > >> 1518: uint vlen = vt->length(); >> 1519: BasicType bt = vt->element_basic_type(); >> 1520: int opc = is_mask ? Op_MaskAll : Op_Replicate; > > You can remove this check, since VectorNode::scalar2vector alreday has a match rule for Op_MaskAll Do you mean this check `Matcher::match_rule_supported_vector(opc, vlen, maskall_bt)` ? I think it's necessary ? Because in theory some platforms don't support both `MaskAll` and `Replicate`. Of course, this situation may not exist in reality. 
If `MaskAll` and `Replicate` are not supported, then `VectorLongToMask` should not be supported either, and this function will not be called. > src/hotspot/share/opto/vectornode.cpp line 1532: > >> 1530: } else { >> 1531: con = phase->intcon(con_value); >> 1532: } > > Suggestion: > > phase->makecon(TypeInteger::make(bits_type->get_con(), maskall_bt) This should be: `con = phase->makecon(TypeInteger::make(con_value, maskall_bt == T_LONG ? T_LONG : T_INT));` because `maskall_bt` can be `T_BYTE` or `T_SHORT`. Since we still need to check `maskall_bt`, I tend to the current approach because it has fewer function calls. > src/hotspot/share/opto/vectornode.cpp line 1544: > >> 1542: >> 1543: Node* VectorLoadMaskNode::Ideal(PhaseGVN* phase, bool can_reshape) { >> 1544: // VectorLoadMask(VectorLongToMask(-1/0)) => Replicate(-1/0) > > FTR: This is only useful for non-predicated targets. Since on predicated target VectorLongToMask is not succeeded by VectorLoadMask > https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectorIntrinsics.cpp#L703 Yes, thanks for your clarification. > test/micro/org/openjdk/bench/jdk/incubator/vector/MaskFromLongToLongBenchmark.java line 34: > >> 32: @Fork(value = 1, jvmArgs = {"--add-modules=jdk.incubator.vector"}) >> 33: public class MaskFromLongToLongBenchmark { >> 34: private static final int ITERATION = 10000; > > It will be nice to add a synthetic micro for cast chain transform added along with this patch. following micro shows around 1.5x gains on AVX2 system because of widening cast elision. > > > import jdk.incubator.vector.*; > import java.util.stream.IntStream; > > public class mask_cast_chain { > public static final VectorSpecies FSP = FloatVector.SPECIES_128; > > public static long micro(float [] src1, float [] src2, int ctr) { > long res = 0; > for (int i = 0; i < FSP.loopBound(src1.length); i += FSP.length()) { > res += FloatVector.fromArray(FSP, src1, i) > .compare(VectorOperators.GE, FloatVector.fromArray(FSP, src2, i)) > .cast(DoubleVector.SPECIES_256) > .cast(FloatVector.SPECIES_128) > .toLong(); > } > return res * ctr; > } > > public static void main(String [] args) { > float [] src1 = new float[1024]; > float [] src2 = new float[1024]; > > IntStream.range(0, src1.length).forEach(i -> {src1[i] = (float)i;}); > IntStream.range(0, src2.length).forEach(i -> {src2[i] = (float)500;}); > > long res = 0; > for (int i = 0; i < 100000; i++) { > res += micro(src1, src2, i); > } > long t1 = System.currentTimeMillis(); > for (int i = 0; i < 100000; i++) { > res += micro(src1, src2, i); > } > long t2 = System.currentTimeMillis(); > System.out.println("[time] " + (t2 - t1) + "ms" + " [res] " + res); > } > } Ok~ ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2220925155 PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2220905045 PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2220919188 PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2220924181 PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2220927997 From xgong at openjdk.org Tue Jul 22 06:20:34 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 22 Jul 2025 06:20:34 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 01:23:43 GMT, Xiaohong Gong wrote: >> ### Background >> On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` 
(32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. >> >> For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. >> >> To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. >> >> ### Impact Analysis >> #### 1. Vector types >> Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. >> >> #### 2. Vector API >> No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. >> >> #### 3. Auto-vectorization >> Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. >> >> #### 4. Codegen of vector nodes >> NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. >> >> Details: >> - Lanewise vector operations are unaffected as explained above. >> - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). >> - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, addin... > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Disable auto-vectorization of double to short conversion for NEON and update tests ping~ Hi @theRealAph, could you please help take a look at the latest commit? Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3101230913 From wanghaomin at openjdk.org Tue Jul 22 07:27:28 2025 From: wanghaomin at openjdk.org (Wang Haomin) Date: Tue, 22 Jul 2025 07:27:28 GMT Subject: RFR: 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD Message-ID: Same as https://bugs.openjdk.org/browse/JDK-8362171 , so I've added `IsFiniteF`, `IsFiniteD` to the assert switch. 
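For illustration (an editorial sketch; the actual reproducer is not included in the message above): the kind of loop that can produce IsFiniteD nodes when Double.isFinite is intrinsified and the loop is considered by SuperWord.

    // Editorial sketch of a loop over Double.isFinite; on platforms where
    // isFinite is intrinsified, C2 sees IsFiniteD nodes here, the node kind
    // the SuperWord truncation assert did not previously expect.
    public class IsFiniteLoop {
        static int countFinite(double[] a) {
            int n = 0;
            for (int i = 0; i < a.length; i++) {
                if (Double.isFinite(a[i])) {
                    n++;
                }
            }
            return n;
        }

        public static void main(String[] args) {
            double[] a = new double[1024];
            a[3] = Double.POSITIVE_INFINITY;
            System.out.println(countFinite(a)); // 1023
        }
    }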
------------- Commit messages: - 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD Changes: https://git.openjdk.org/jdk/pull/26423/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26423&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362972 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26423.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26423/head:pull/26423 PR: https://git.openjdk.org/jdk/pull/26423 From aph at openjdk.org Tue Jul 22 07:54:29 2025 From: aph at openjdk.org (Andrew Haley) Date: Tue, 22 Jul 2025 07:54:29 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 01:23:43 GMT, Xiaohong Gong wrote: >> ### Background >> On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. >> >> For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. >> >> To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. >> >> ### Impact Analysis >> #### 1. Vector types >> Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. >> >> #### 2. Vector API >> No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. >> >> #### 3. Auto-vectorization >> Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. >> >> #### 4. Codegen of vector nodes >> NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. >> >> Details: >> - Lanewise vector operations are unaffected as explained above. >> - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). >> - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, addin... > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Disable auto-vectorization of double to short conversion for NEON and update tests Looks good, thanks. 
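For illustration (an editorial sketch, not part of the review above): the Java-level conversion the background paragraph describes, between a 128-bit short vector (8 lanes) and a 128-bit long vector (2 lanes). Whether it is intrinsified depends on the platform; the species, input values and part number below are just example assumptions.

    // Editorial sketch (requires --add-modules jdk.incubator.vector):
    // part 0 widens the first two short lanes into a 2-lane long vector.
    import jdk.incubator.vector.LongVector;
    import jdk.incubator.vector.ShortVector;
    import jdk.incubator.vector.VectorOperators;
    import jdk.incubator.vector.VectorSpecies;

    public class ShortToLong128 {
        static final VectorSpecies<Short> S128 = ShortVector.SPECIES_128; // 8 lanes
        static final VectorSpecies<Long>  L128 = LongVector.SPECIES_128;  // 2 lanes

        public static void main(String[] args) {
            short[] src = {1, 2, 3, 4, 5, 6, 7, 8};
            ShortVector sv = ShortVector.fromArray(S128, src, 0);
            LongVector lv = (LongVector) sv.convertShape(VectorOperators.S2L, L128, 0);
            System.out.println(lv); // [1, 2]
        }
    }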
------------- Marked as reviewed by aph (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26057#pullrequestreview-3041615796 From xgong at openjdk.org Tue Jul 22 07:54:29 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 22 Jul 2025 07:54:29 GMT Subject: RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors [v4] In-Reply-To: References: Message-ID: On Tue, 22 Jul 2025 07:50:48 GMT, Andrew Haley wrote: > Looks good, thanks. Thanks so much for your review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26057#issuecomment-3101502718 From bkilambi at openjdk.org Tue Jul 22 08:09:31 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Tue, 22 Jul 2025 08:09:31 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector In-Reply-To: References: <03mNhjjP_PvR9nxPUCaIkN5NF--gH7-AMqiHJlAzJW0=.e0e1cd1e-f236-4a6d-b9da-1459eed6077d@github.com> Message-ID: On Fri, 6 Jun 2025 01:24:28 GMT, Xiaohong Gong wrote: >>> Good job @Bhavana-Kilambi ! Generally looks good to me. Just some minor issues that I have left the comments. Besides, could you please add some IR tests for this optimization? Thanks! >> >> Hi @XiaohongGong , there are tests already for this operation under `jdk/jdk/incubator/vector` for all the types and sizes to verify the results. Did you mean IR tests for verifying if the correct backend match rule is being generated ? > >> > Good job @Bhavana-Kilambi ! Generally looks good to me. Just some minor issues that I have left the comments. Besides, could you please add some IR tests for this optimization? Thanks! >> >> Hi @XiaohongGong , there are tests already for this operation under `jdk/jdk/incubator/vector` for all the types and sizes to verify the results. Did you mean IR tests for verifying if the correct backend match rule is being generated ? > > Yes, I think adding an IR check tests for this operation will be better. I think checking the mid-end IR is enough. Hi @XiaohongGong @theRealAph @shqking Can I please ask for another round of review for the latest patch? Thanks in advance! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23570#issuecomment-3101552923 From mchevalier at openjdk.org Tue Jul 22 08:16:33 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 22 Jul 2025 08:16:33 GMT Subject: RFR: 8358696: Assert with extreme values for -XX:BciProfileWidth In-Reply-To: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> References: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> Message-ID: On Fri, 4 Jul 2025 21:47:24 GMT, Saranya Natarajan wrote: > **Issue** > Extreme values for BciProfileWidth flag such as `java -XX:BciProfileWidth=-1 -version` and `java -XX:BciProfileWidth=100000 -version `results in assert failure `assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. This is observed in a x86 machine. > > **Analysis** > On debugging the issue, I found that increasing the size of the interpreter using the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` prevented the above mentioned assert from failing for large values of BciProfileWidth. > > **Proposal** > Considering the fact that larger BciProfileWidth results in slower profiling, I have proposed a range between 0 to 5000 to restrict the value for BciProfileWidth for x86 machines. 
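For illustration (an editorial sketch, not part of the message above): the Java-level operation behind the SelectFromTwoVector node. The two-vector selectFrom overload and the lane semantics in the comments are assumptions based on the API this PR intrinsifies, not something stated in the thread.

    // Editorial sketch (requires --add-modules jdk.incubator.vector):
    // each lane of 'idx' selects an element from the concatenation of a and b,
    // so indexes 0..3 pick from a and indexes 4..7 pick from b (4-lane species).
    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorSpecies;

    public class SelectFromTwo {
        static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128; // 4 lanes

        public static void main(String[] args) {
            IntVector a   = IntVector.fromArray(SPECIES, new int[]{10, 11, 12, 13}, 0);
            IntVector b   = IntVector.fromArray(SPECIES, new int[]{20, 21, 22, 23}, 0);
            IntVector idx = IntVector.fromArray(SPECIES, new int[]{0, 5, 2, 7}, 0);
            System.out.println(idx.selectFrom(a, b)); // expected: [10, 21, 12, 23]
        }
    }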
This maximum value is based on modifying the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` using the smallest `InterpreterCodeSize` for all the architectures. As for the lower bound, a value of -1 would be the same as 0, as this simply means no return bci's will be recorded in ret profile. > > **Issue in AArch64** > Additionally running the command `java -XX:BciProfileWidth= 10000 -version` (or larger values) results in a different failure `assert(offset_ok_for_immed(offset(), size)) failed: must be, was: 32768, 3` on an AArch64 machine.This is an issue of maximum offset for `ldr/str` in AArch64 which can be fixed using `form_address` as mentioned in [JDK-8342736](https://bugs.openjdk.org/browse/JDK-8342736). In my preliminary fix using `form_address` on AArch64 machine. I had to modify 3 `ldr` and 1 `str` instruction (in file `src/hotspot/cpu/aarch64/interp_masm_aarch64.cpp` at line number 926, 983, and 997). With this fix using `form_address`, `BciProfileWidth` works for maximum of 5000 after which it crashes with`assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. Without this fix `BciProfileWidth` works for a maximum value of 1300. Currently, I have suggested to restrict the upper bound on AArch64 to 1000 instead of fixing it with `form_address`. > > **Question to reviewers** > Do you think this is a reasonable fix ? For AArch64 do you suggest fixing using `form_address` ? If yes, do I fix it under this PR or create another one ? One nit, one open comment. Not of a lot of opinion on whether to use `form_address` instead. Since `BciProfileWidth` is a develop flag, I'm not too annoyed if we limit it to avoid some change that would affect product builds. Except of course if the offset issue is a deeper problem that deserves to be solved anyway. src/hotspot/share/runtime/globals.hpp line 1354: > 1352: range(0, 8) \ > 1353: \ > 1354: develop(intx, BciProfileWidth, 2, \ Recently, I've seen someone complaining about useless use of `intx`, saying that is brings less readability than a more fixed-width type when not needed. Here, [0, 5000] fits in 16 bits (even signed). One could change that into a simple `int` or something like that. src/hotspot/share/runtime/globals.hpp line 1357: > 1355: "Number of return bci's to record in ret profile") \ > 1356: range(0, AARCH64_ONLY(1000) NOT_AARCH64(5000)) \ > 1357: \ Maybe that's one empty line too much (cf. other spacing just around). ------------- PR Review: https://git.openjdk.org/jdk/pull/26139#pullrequestreview-3041683534 PR Review Comment: https://git.openjdk.org/jdk/pull/26139#discussion_r2221609770 PR Review Comment: https://git.openjdk.org/jdk/pull/26139#discussion_r2221600897 From xgong at openjdk.org Tue Jul 22 08:19:34 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 22 Jul 2025 08:19:34 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v16] In-Reply-To: References: Message-ID: <6hto0G_9vUo7YEPmaxArwIxndMaktX74csGZLApe5Nc=.fc5dab03-f357-4d08-8870-bdf070b5bf45@github.com> On Mon, 21 Jul 2025 11:09:04 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. 
>> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Refine comments in c2_MacroAssembler_aarch64.cpp LGTM! Thanks for your updating! ------------- Marked as reviewed by xgong (Committer). PR Review: https://git.openjdk.org/jdk/pull/23570#pullrequestreview-3041750591 From roland at openjdk.org Tue Jul 22 08:31:48 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 22 Jul 2025 08:31:48 GMT Subject: RFR: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? Message-ID: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> A node in a pre loop only has uses out of the loop dominated by the loop exit. `PhaseIdealLoop::try_sink_out_of_loop()` sets its control to the loop exit projection. A range check in the main loop has this node as input (through a chain of some other nodes). Range check elimination needs to update the exit condition of the pre loop with an expression that depends on the node pinned on its exit: that's impossible and the assert fires. This is a variant of 8314024 (this one was for a node with uses out of the pre loop on multiple paths). I propose the same fix: leave the node with control in the pre loop in this case. 
------------- Commit messages: - tests - fix Changes: https://git.openjdk.org/jdk/pull/26424/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26424&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361702 Stats: 178 lines in 4 files changed: 160 ins; 7 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/26424.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26424/head:pull/26424 PR: https://git.openjdk.org/jdk/pull/26424 From roland at openjdk.org Tue Jul 22 08:38:56 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 22 Jul 2025 08:38:56 GMT Subject: Integrated: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops In-Reply-To: References: Message-ID: On Tue, 22 Oct 2024 07:48:33 GMT, Roland Westrelin wrote: > To optimize a long counted loop and long range checks in a long or int > counted loop, the loop is turned into a loop nest. When the loop has > few iterations, the overhead of having an outer loop whose backedge is > never taken, has a measurable cost. Furthermore, creating the loop > nest usually causes one iteration of the loop to be peeled so > predicates can be set up. If the loop is short running, then it's an > extra iteration that's run with range checks (compared to an int > counted loop with int range checks). > > This change doesn't create a loop nest when: > > 1- it can be determined statically at loop nest creation time that the > loop runs for a short enough number of iterations > > 2- profiling reports that the loop runs for no more than ShortLoopIter > iterations (1000 by default). > > For 2-, a guard is added which is implemented as yet another predicate. > > While this change is in principle simple, I ran into a few > implementation issues: > > - while c2 has a way to compute the number of iterations of an int > counted loop, it doesn't have that for long counted loop. The > existing logic for int counted loops promotes values to long to > avoid overflows. I reworked it so it now works for both long and int > counted loops. > > - I added a new deoptimization reason (Reason_short_running_loop) for > the new predicate. Given the number of iterations is narrowed down > by the predicate, the limit of the loop after transformation is a > cast node that's control dependent on the short running loop > predicate. Because once the counted loop is transformed, it is > likely that range check predicates will be inserted and they will > depend on the limit, the short running loop predicate has to be the > one that's further away from the loop entry. Now it is also possible > that the limit before transformation depends on a predicate > (TestShortRunningLongCountedLoopPredicatesClone is an example), we > can have: new predicates inserted after the transformation that > depend on the casted limit that itself depend on old predicates > added before the transformation. To solve this cicular dependency, > parse and assert predicates are cloned between the old predicates > and the loop head. The cloned short running loop parse predicate is > the one that's used to insert the short running loop predicate. > > - In the case of a long counted loop, the loop is transformed into a > regular loop with a new limit and transformed range checks that's > later turned into an in counted loop. The int ... This pull request has now been integrated. 
Changeset: f1556611 Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/f155661151fc25cde3be17878aeb24056555961c Stats: 1688 lines in 27 files changed: 1609 ins; 23 del; 56 mod 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops Co-authored-by: Maurizio Cimadamore Co-authored-by: Christian Hagedorn Reviewed-by: chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/21630 From roland at openjdk.org Tue Jul 22 08:38:54 2025 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 22 Jul 2025 08:38:54 GMT Subject: RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v37] In-Reply-To: References: Message-ID: On Wed, 16 Jul 2025 11:53:41 GMT, Christian Hagedorn wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> test failures > > Great! Sure, I've submitted another round of testing. Will report back again. @chhagedorn thanks for the review and testing @TobiHartmann thanks for the review @eme64 thanks for the comments ------------- PR Comment: https://git.openjdk.org/jdk/pull/21630#issuecomment-3101656783 From thartmann at openjdk.org Tue Jul 22 08:40:25 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 22 Jul 2025 08:40:25 GMT Subject: RFR: 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD In-Reply-To: References: Message-ID: On Tue, 22 Jul 2025 07:21:15 GMT, Wang Haomin wrote: > Same as https://bugs.openjdk.org/browse/JDK-8362171 , so I've added `IsFiniteF`, `IsFiniteD` to the assert switch. Hi @haominw, could you please add a regression test for this issue? Thanks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26423#issuecomment-3101670533 From mchevalier at openjdk.org Tue Jul 22 08:51:35 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 22 Jul 2025 08:51:35 GMT Subject: RFR: 8347901: C2 should remove unused leaf / pure runtime calls [v5] In-Reply-To: References: Message-ID: On Wed, 9 Jul 2025 12:36:31 GMT, Marc Chevalier wrote: >> A first part toward a better support of pure functions, but this time, with guidance from @iwanowww. >> >> ## Pure Functions >> >> Pure functions (considered here) are functions that have no side effects, no effect on the control flow (no exception or such), cannot deopt etc.. It's really a function that you can execute anywhere, with whichever arguments without effect other than wasting time. Integer division is not pure as dividing by zero is throwing. But many floating point functions will just return `NaN` or `+/-infinity` in problematic cases. >> >> ## Scope >> >> We are not going all powerful for now! It's mostly about identifying some pure functions and being able to remove them if the result is unused. Some other things are not part of this PR, on purpose. Especially, this PR doesn't propose a way to move pure calls around. The reason is that pure calls are later expanded into regular calls, which require a control input. To be able to do the expansion, we just keep the control in the pure call as well. >> >> ## Implementation Overview >> >> We created here some new node kind for pure calls, inheriting leaf calls, that are expanded into regular leaf calls during final graph reshaping. The possibility to support pure call directly in AD file is left open. 
>> >> This PR also introduces `TupleNode` (largely based on an original idea/implem of @iwanowww), that just tie multiple input together and play well with `ProjNode`: the n-th projection of a `TupleNode` is the n-th input of the tuple. This is a convenient way to skip and remove nodes from the graph while delegating the difficulty of the surgery to the trusted IGVN's implementation. >> >> Thanks, >> Marc > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Tentative to address Tobias' comments Thanks all for your comments! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25760#issuecomment-3101702054 From mchevalier at openjdk.org Tue Jul 22 08:51:37 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Tue, 22 Jul 2025 08:51:37 GMT Subject: Integrated: 8347901: C2 should remove unused leaf / pure runtime calls In-Reply-To: References: Message-ID: On Wed, 11 Jun 2025 16:18:41 GMT, Marc Chevalier wrote: > A first part toward a better support of pure functions, but this time, with guidance from @iwanowww. > > ## Pure Functions > > Pure functions (considered here) are functions that have no side effects, no effect on the control flow (no exception or such), cannot deopt etc.. It's really a function that you can execute anywhere, with whichever arguments without effect other than wasting time. Integer division is not pure as dividing by zero is throwing. But many floating point functions will just return `NaN` or `+/-infinity` in problematic cases. > > ## Scope > > We are not going all powerful for now! It's mostly about identifying some pure functions and being able to remove them if the result is unused. Some other things are not part of this PR, on purpose. Especially, this PR doesn't propose a way to move pure calls around. The reason is that pure calls are later expanded into regular calls, which require a control input. To be able to do the expansion, we just keep the control in the pure call as well. > > ## Implementation Overview > > We created here some new node kind for pure calls, inheriting leaf calls, that are expanded into regular leaf calls during final graph reshaping. The possibility to support pure call directly in AD file is left open. > > This PR also introduces `TupleNode` (largely based on an original idea/implem of @iwanowww), that just tie multiple input together and play well with `ProjNode`: the n-th projection of a `TupleNode` is the n-th input of the tuple. This is a convenient way to skip and remove nodes from the graph while delegating the difficulty of the surgery to the trusted IGVN's implementation. > > Thanks, > Marc This pull request has now been integrated. 
Changeset: ed70910b Author: Marc Chevalier URL: https://git.openjdk.org/jdk/commit/ed70910b0f3e1b19d915ec13ac3434407d01bc5d Stats: 343 lines in 15 files changed: 198 ins; 61 del; 84 mod 8347901: C2 should remove unused leaf / pure runtime calls Reviewed-by: thartmann, vlivanov ------------- PR: https://git.openjdk.org/jdk/pull/25760 From jbhateja at openjdk.org Tue Jul 22 08:57:26 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 22 Jul 2025 08:57:26 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4] In-Reply-To: <-ZNeXOcmEACkhP4QKXKnWWEiT6ucjPY7Zz1HqvMeAoI=.c8fae49e-fcb0-41fb-84d1-4aa52ee83790@github.com> References: <-ZNeXOcmEACkhP4QKXKnWWEiT6ucjPY7Zz1HqvMeAoI=.c8fae49e-fcb0-41fb-84d1-4aa52ee83790@github.com> Message-ID: <1RlpmwLAF5ALeZQRS_DAqixgD6MUno5cUbguqHTlUU0=.6a594ecc-b8f7-489d-b801-a41e87d1deeb@github.com> On Tue, 22 Jul 2025 03:01:43 GMT, erifan wrote: > Do you mean this check `Matcher::match_rule_supported_vector(opc, vlen, maskall_bt)` ? I think it's necessary ? Because in theory some platforms don't support both `MaskAll` and `Replicate`. Of course, this situation may not exist in reality. If `MaskAll` and `Replicate` are not supported, then `VectorLongToMask` should not be supported either, and this function will not be called. My suggestion was to check for Op_Replicate here as Op_MaskAll is already checked underneath VectorNode::scalar2vector under an assumption that MaskAll is a special case for replicate applicable to masks ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2221783983 From duke at openjdk.org Tue Jul 22 09:07:26 2025 From: duke at openjdk.org (erifan) Date: Tue, 22 Jul 2025 09:07:26 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4] In-Reply-To: <1RlpmwLAF5ALeZQRS_DAqixgD6MUno5cUbguqHTlUU0=.6a594ecc-b8f7-489d-b801-a41e87d1deeb@github.com> References: <-ZNeXOcmEACkhP4QKXKnWWEiT6ucjPY7Zz1HqvMeAoI=.c8fae49e-fcb0-41fb-84d1-4aa52ee83790@github.com> <1RlpmwLAF5ALeZQRS_DAqixgD6MUno5cUbguqHTlUU0=.6a594ecc-b8f7-489d-b801-a41e87d1deeb@github.com> Message-ID: On Tue, 22 Jul 2025 08:54:23 GMT, Jatin Bhateja wrote: >> Do you mean this check `Matcher::match_rule_supported_vector(opc, vlen, maskall_bt)` ? I think it's necessary ? Because in theory some platforms don't support both `MaskAll` and `Replicate`. Of course, this situation may not exist in reality. If `MaskAll` and `Replicate` are not supported, then `VectorLongToMask` should not be supported either, and this function will not be called. > >> Do you mean this check `Matcher::match_rule_supported_vector(opc, vlen, maskall_bt)` ? I think it's necessary ? Because in theory some platforms don't support both `MaskAll` and `Replicate`. Of course, this situation may not exist in reality. If `MaskAll` and `Replicate` are not supported, then `VectorLongToMask` should not be supported either, and this function will not be called. > > My suggestion was to check for Op_Replicate here as Op_MaskAll is already checked underneath VectorNode::scalar2vector under an assumption that MaskAll is a special case for replicate applicable to masks Oh I misunderstood what you meant, now I understand, thank you! 
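As a Java-level illustration of the all-true/all-false cases discussed above (a sketch for context only, not code from the patch; it assumes the incubating jdk.incubator.vector module is enabled on the command line):

    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorMask;
    import jdk.incubator.vector.VectorSpecies;

    public class FromLongDemo {
        static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

        public static void main(String[] args) {
            // -1L sets every bit and 0L sets none, so these masks are the
            // all-true/all-false shapes that behave like maskAll(true)/maskAll(false).
            VectorMask<Integer> allTrue  = VectorMask.fromLong(SPECIES, -1L);
            VectorMask<Integer> allFalse = VectorMask.fromLong(SPECIES, 0L);
            System.out.println(allTrue.allTrue() + " " + allFalse.anyTrue()); // true false
        }
    }

It can be run with `java --add-modules jdk.incubator.vector FromLongDemo.java`.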
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2221829278 From xgong at openjdk.org Tue Jul 22 09:09:36 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 22 Jul 2025 09:09:36 GMT Subject: Integrated: 8359419: AArch64: Relax min vector length to 32-bit for short vectors In-Reply-To: References: Message-ID: On Tue, 1 Jul 2025 05:59:15 GMT, Xiaohong Gong wrote: > ### Background > On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions. > > For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size. > > To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors. > > ### Impact Analysis > #### 1. Vector types > Vectors only with `short` element types will be affected, as we just supported 32-bit `short` vectors in this change. > > #### 2. Vector API > No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length. > > #### 3. Auto-vectorization > Enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since we have supported 32-bit vectors for `byte` type for a long time, extending this to `short` did not introduce additional risks. > > #### 4. Codegen of vector nodes > NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the higher half bits can be ignored. > > Details: > - Lanewise vector operations are unaffected as explained above. > - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE). > - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, adding an explicit vector size check in `match_rule_s... This pull request has now been integrated. 
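As a Java-level sketch of the short-to-long conversions that this change allows to be intrinsified (illustrative only, not taken from the PR; the incubating Vector API is assumed):

    import jdk.incubator.vector.LongVector;
    import jdk.incubator.vector.ShortVector;
    import jdk.incubator.vector.VectorOperators;

    public class ShortToLongDemo {
        public static void main(String[] args) {
            short[] src = new short[ShortVector.SPECIES_128.length()];
            for (int i = 0; i < src.length; i++) src[i] = (short) (i + 1);
            ShortVector sv = ShortVector.fromArray(ShortVector.SPECIES_128, src, 0);
            // Widen the first two short lanes (part 0) into a 128-bit long vector,
            // the conversion shape that needs a 32-bit intermediate short vector.
            LongVector lv = (LongVector) sv.convertShape(VectorOperators.S2L, LongVector.SPECIES_128, 0);
            System.out.println(lv);
        }
    }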
Changeset: ac141c2f Author: Xiaohong Gong URL: https://git.openjdk.org/jdk/commit/ac141c2fa1d818858e7a12a50837bb282282ecac Stats: 359 lines in 10 files changed: 231 ins; 9 del; 119 mod 8359419: AArch64: Relax min vector length to 32-bit for short vectors Reviewed-by: aph, fgao, bkilambi, dlunden ------------- PR: https://git.openjdk.org/jdk/pull/26057 From wanghaomin at openjdk.org Tue Jul 22 09:14:24 2025 From: wanghaomin at openjdk.org (Wang Haomin) Date: Tue, 22 Jul 2025 09:14:24 GMT Subject: RFR: 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD In-Reply-To: References: Message-ID: On Tue, 22 Jul 2025 08:38:03 GMT, Tobias Hartmann wrote: > Hi @haominw, could you please add a regression test for this issue? Thanks. Hi, test/hotspot/jtreg/compiler/intrinsics/TestDoubleIsFinite.java and TestFloatIsFinite.java can trigger this issue. I encountered the issue while adding the matcher `match(Set dst (CMoveI (Binary cop (CmpI (IsFiniteF op) zero)) (Binary src1 src2)));` on riscv ad file. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26423#issuecomment-3101798232 From duke at openjdk.org Tue Jul 22 10:04:33 2025 From: duke at openjdk.org (erifan) Date: Tue, 22 Jul 2025 10:04:33 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: On Mon, 14 Jul 2025 11:17:41 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. 
> > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Addressed review comments to half the number of match rules Just some minor comments, not a block. src/hotspot/cpu/aarch64/aarch64.ad line 923: > 921: V24, V24_H, V24_J, V24_K > 922: ); > 923: Not a big matter, but it looks better to me if you can move this change `after line 810` of this file. src/hotspot/cpu/aarch64/aarch64.ad line 5091: > 5089: format %{ %} > 5090: interface(REG_INTER); > 5091: %} Ditto, I tend to moving this change `after line 5101` of this file. src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 5181: > 5179: %}')dnl > 5180: dnl > 5181: Remove this blank otherwise two blank lines will be generated. See `src/hotspot/cpu/aarch64/aarch64_vector.ad` line 7180 and line 7181 ------------- PR Review: https://git.openjdk.org/jdk/pull/23570#pullrequestreview-3042166970 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2221959277 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2221970019 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2221941918 From jbhateja at openjdk.org Tue Jul 22 10:28:11 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 22 Jul 2025 10:28:11 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v16] In-Reply-To: <93ZKsln0-lAlg3-KYPCFRIZfT5gm8I6ebltkWWRLzVY=.8af16ece-4c80-42ff-9ddb-70036ecd6290@github.com> References: <93ZKsln0-lAlg3-KYPCFRIZfT5gm8I6ebltkWWRLzVY=.8af16ece-4c80-42ff-9ddb-70036ecd6290@github.com> Message-ID: On Tue, 22 Jul 2025 10:24:43 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. >> >> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. >> >> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Update intrinsicnode.cpp Hi @kvn, Apart from couple of lines of fix, patch mainly re-structured existing handling and added additional comments and proofs along with an exhaustive test. So effective code changes are very minimal. Should I check-in this in jdk-mainline and then prepare minimal fix (if needed) for jdk25 ? ------------- PR Review: https://git.openjdk.org/jdk/pull/23947#pullrequestreview-3042257403 From jbhateja at openjdk.org Tue Jul 22 10:28:11 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 22 Jul 2025 10:28:11 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v16] In-Reply-To: References: Message-ID: <93ZKsln0-lAlg3-KYPCFRIZfT5gm8I6ebltkWWRLzVY=.8af16ece-4c80-42ff-9ddb-70036ecd6290@github.com> > Hi All, > > This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. 
>
> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value. In this case, an erroneous value range estimation results in a constant value. The existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of the result based on the upper and lower bounds of the mask type.
>
> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning.
>
> Kindly review and share your feedback.
>
> Best Regards,
> Jatin

Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:

  Update intrinsicnode.cpp

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/23947/files
  - new: https://git.openjdk.org/jdk/pull/23947/files/4f33d4b4..161487e6

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=15
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=14-15
  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/23947.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23947/head:pull/23947

PR: https://git.openjdk.org/jdk/pull/23947

From jbhateja at openjdk.org Tue Jul 22 10:28:12 2025
From: jbhateja at openjdk.org (Jatin Bhateja)
Date: Tue, 22 Jul 2025 10:28:12 GMT
Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v15]
In-Reply-To:
References:
Message-ID:

On Wed, 16 Jul 2025 10:43:34 GMT, Jatin Bhateja wrote:

>> Hi All,
>>
>> This bugfix patch fixes incorrect value computation for the Integer/Long.compress APIs.
>>
>> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value. In this case, an erroneous value range estimation results in a constant value. The existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of the result based on the upper and lower bounds of the mask type.
>>
>> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning.
>>
>> Kindly review and share your feedback.
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
>   Refine lower bound computation

src/hotspot/share/opto/intrinsicnode.cpp line 358:

> 356: assert(lo == (bt == T_INT ? min_jint : min_jlong) || lo == 0, "");
> 357:
> 358: if (src_type->hi_as_long() >= 0) {

In order to check for a non-negative non-constant src_type, the check should be against the lower bound, i.e. src_type->lo_as_long() >= 0, since C2's integral types (TypeInt/TypeLong) maintain the invariant that _lo < _hi for non-constant values; iff _lo == _hi then it's a singleton value, i.e. a constant.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23947#discussion_r2222010818

From thartmann at openjdk.org Tue Jul 22 10:48:23 2025
From: thartmann at openjdk.org (Tobias Hartmann)
Date: Tue, 22 Jul 2025 10:48:23 GMT
Subject: RFR: 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD
In-Reply-To:
References:
Message-ID:

On Tue, 22 Jul 2025 07:21:15 GMT, Wang Haomin wrote:

> Same as https://bugs.openjdk.org/browse/JDK-8362171 , so I've added `IsFiniteF`, `IsFiniteD` to the assert switch.
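For context, `IsFiniteF`/`IsFiniteD` nodes come from the `Float.isFinite`/`Double.isFinite` intrinsics, and a conditional select like the one below is the kind of Java shape that can end up as the `CMoveI`-over-`IsFiniteF` pattern discussed in this thread (a minimal illustration, not the jtreg test itself):

    public class IsFiniteSelect {
        // C2 may compile this ternary into a CMoveI whose condition tests IsFiniteF
        // when the Float.isFinite intrinsic is available on the target.
        static int select(float f, int a, int b) {
            return Float.isFinite(f) ? a : b;
        }

        public static void main(String[] args) {
            int sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                float f = (i % 3 == 0) ? Float.NaN : i * 0.5f;
                sum += select(f, 1, 2);
            }
            System.out.println(sum);
        }
    }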
We run these test regularly but we didn't observe the issue. Are you running with any non-default VM flags? Does it only reproduce on RISCV? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26423#issuecomment-3102193035 From galder at openjdk.org Tue Jul 22 10:49:33 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Tue, 22 Jul 2025 10:49:33 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v5] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 14:38:59 GMT, Feilong Jiang wrote: >> src/hotspot/cpu/riscv/c1_LIRGenerator_riscv.cpp line 775: >> >>> 773: arraycopy_helper(x, &flags, &expected_type); >>> 774: if (x->check_flag(Instruction::OmitChecksFlag)) { >>> 775: flags = (flags & (LIR_OpArrayCopy::unaligned | LIR_OpArrayCopy::overlapping)); >> >> The changes in the two files need to be in synch, so I wonder if `LIR_OpArrayCopy::unaligned | LIR_OpArrayCopy::overlapping` could be abstracted away within a function in `LIR_OpArrayCopy`. >> >> So something like this (apologies for any syntactic/semantic errors): >> >> >> flags = (flags & LIR_OpArrayGopy::get_array_copy_flags()); >> >> >> Then on the other method something like: >> >> >> ((flags & ~(LIR_OpArrayGopy::get_array_copy_flags())) == 0) >> >> >> Function name is just an example, feel free to suggest some other if you think it fits better. >> >> Thoughts? > > Adding new flag check routines seems like a good idea, but it's a bit challenging to choose a name, as there are too many flags for `LIR_OPArrayCopy`[1]. Perhaps something like `should_check_unaligned_or_overlapping` would be suitable? > > 1. https://github.com/openjdk/jdk/blob/15b5b54ac707ba0d4e473fd6eb02c38a8efe705c/src/hotspot/share/c1/c1_LIR.hpp#L1257-L1271 Hmmm, I don't think I like that name. It's too specific on the flags but does not convey what it's used for. The aim of `flag=0` was to avoid instantiation of array copy stubs, so maybe the name could be `init_flags_for_copy_stubs`? It could be prepended with a `get_` if needs be to avoid confusion. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2222146055 From jbhateja at openjdk.org Tue Jul 22 10:49:41 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 22 Jul 2025 10:49:41 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v15] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 07:19:42 GMT, Tobias Hartmann wrote: > This looks good to me and I think you addressed all the comments that Emanuel had. Let's wait for another day or two in case someone else wants to take a look as well. > > In the meantime, please request approval for integration into JDK 25 since we are know at RDP 2: https://openjdk.org/jeps/3#Fix-Request-Process Hi @TobiHartmann, Can you please re-verify the latest version of patch and approve if all tests are green. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3102195153 From fyang at openjdk.org Tue Jul 22 11:08:26 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 22 Jul 2025 11:08:26 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v3] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 08:52:38 GMT, Hamlin Li wrote: >> src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 190: >> >>> 188: assert(code != nullptr, "Could not find the containing code blob"); >>> 189: >>> 190: address dest = MacroAssembler::target_addr_for_insn(call_addr); >> >> Is this change safe? 
>> Seems it modifies the original logic.
>
> Yes, `MacroAssembler::pd_call_destination` only calls `MacroAssembler::target_addr_for_insn`.
> And `MacroAssembler::target_addr_for_insn` is used in other places in NativeFarCall, so it's better to use `target_addr_for_insn` only, to improve readability.

There seems to be a subtle difference here. I see `MacroAssembler::pd_call_destination` delegates work to `NativeFarCall::reloc_destination` which calls `MacroAssembler::target_addr_for_insn` under condition `if (stub_addr != nullptr)`. After this change, that condition is gone. I haven't looked into how this may make a difference.

I see this function was introduced by JDK-8332689, maybe @robehn could comment?

[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/nativeInst_riscv.cpp#L112

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2222185471

From wanghaomin at openjdk.org Tue Jul 22 11:20:29 2025
From: wanghaomin at openjdk.org (Wang Haomin)
Date: Tue, 22 Jul 2025 11:20:29 GMT
Subject: RFR: 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD
In-Reply-To:
References:
Message-ID:

On Tue, 22 Jul 2025 10:46:15 GMT, Tobias Hartmann wrote:

> We run these tests regularly but we didn't observe the issue. Are you running with any non-default VM flags? Does it only reproduce on RISCV?

Only adding a matcher like `match(Set dst (CMoveI (Binary cop (CmpI (IsFiniteF op) zero)) (Binary src1 src2)));` will trigger this issue. There are no issues with the default VM. I saw that `IsInfinite` has already been added to the non-truncating list, so I wanted to add `IsFinite` as well.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26423#issuecomment-3102285982

From mli at openjdk.org Tue Jul 22 11:55:27 2025
From: mli at openjdk.org (Hamlin Li)
Date: Tue, 22 Jul 2025 11:55:27 GMT
Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v3]
In-Reply-To:
References:
Message-ID:

On Tue, 22 Jul 2025 11:04:56 GMT, Fei Yang wrote:

> I see `MacroAssembler::pd_call_destination` delegates work to `NativeFarCall::reloc_destination` which calls `MacroAssembler::target_addr_for_insn` under condition `if (stub_addr != nullptr)`.

Can you clarify "`MacroAssembler::pd_call_destination` delegates work to `NativeFarCall::reloc_destination`"?
Robbin is on vacation for weeks, so I'm afraid he's not going to respond in time.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2222283203

From mli at openjdk.org Tue Jul 22 11:59:25 2025
From: mli at openjdk.org (Hamlin Li)
Date: Tue, 22 Jul 2025 11:59:25 GMT
Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v3]
In-Reply-To:
References:
Message-ID:

On Tue, 22 Jul 2025 11:52:44 GMT, Hamlin Li wrote:

>> There seems to be a subtle difference here. I see `MacroAssembler::pd_call_destination` delegates work to `NativeFarCall::reloc_destination` which calls `MacroAssembler::target_addr_for_insn` under condition `if (stub_addr != nullptr)`. After this change, that condition is gone. I haven't looked into how this may make a difference.
>>
>> I see this function was introduced by JDK-8332689, maybe @robehn could comment?
>>
>> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/nativeInst_riscv.cpp#L112
>
>> I see `MacroAssembler::pd_call_destination` delegates work to `NativeFarCall::reloc_destination` which calls `MacroAssembler::target_addr_for_insn` under condition `if (stub_addr != nullptr)`.
> > Can you clarify "`MacroAssembler::pd_call_destination` delegates work to `NativeFarCall::reloc_destination`"? > Robbin is on vacation for weeks, so I'm afraid he's not going to reponse in time. I don't think this pr changes the logic in `NativeFarCall::reloc_destination`, am I right? Or maybe you're misled by the name change from `stub_address` to `reloc_destination_without_check` and existing method `reloc_destination()`? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2222291033 From snatarajan at openjdk.org Tue Jul 22 13:03:10 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Tue, 22 Jul 2025 13:03:10 GMT Subject: RFR: 8358696: Assert with extreme values for -XX:BciProfileWidth [v2] In-Reply-To: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> References: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> Message-ID: > **Issue** > Extreme values for BciProfileWidth flag such as `java -XX:BciProfileWidth=-1 -version` and `java -XX:BciProfileWidth=100000 -version `results in assert failure `assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. This is observed in a x86 machine. > > **Analysis** > On debugging the issue, I found that increasing the size of the interpreter using the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` prevented the above mentioned assert from failing for large values of BciProfileWidth. > > **Proposal** > Considering the fact that larger BciProfileWidth results in slower profiling, I have proposed a range between 0 to 5000 to restrict the value for BciProfileWidth for x86 machines. This maximum value is based on modifying the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` using the smallest `InterpreterCodeSize` for all the architectures. As for the lower bound, a value of -1 would be the same as 0, as this simply means no return bci's will be recorded in ret profile. > > **Issue in AArch64** > Additionally running the command `java -XX:BciProfileWidth= 10000 -version` (or larger values) results in a different failure `assert(offset_ok_for_immed(offset(), size)) failed: must be, was: 32768, 3` on an AArch64 machine.This is an issue of maximum offset for `ldr/str` in AArch64 which can be fixed using `form_address` as mentioned in [JDK-8342736](https://bugs.openjdk.org/browse/JDK-8342736). In my preliminary fix using `form_address` on AArch64 machine. I had to modify 3 `ldr` and 1 `str` instruction (in file `src/hotspot/cpu/aarch64/interp_masm_aarch64.cpp` at line number 926, 983, and 997). With this fix using `form_address`, `BciProfileWidth` works for maximum of 5000 after which it crashes with`assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. Without this fix `BciProfileWidth` works for a maximum value of 1300. Currently, I have suggested to restrict the upper bound on AArch64 to 1000 instead of fixing it with `form_address`. > > **Question to reviewers** > Do you think this is a reasonable fix ? For AArch64 do you suggest fixing using `form_address` ? If yes, do I fix it under this PR or create another one ? 
Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: addressing review comment 1 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26139/files - new: https://git.openjdk.org/jdk/pull/26139/files/6a36457d..a32b6ead Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26139&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26139&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 1 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26139.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26139/head:pull/26139 PR: https://git.openjdk.org/jdk/pull/26139 From snatarajan at openjdk.org Tue Jul 22 13:03:11 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Tue, 22 Jul 2025 13:03:11 GMT Subject: RFR: 8358696: Assert with extreme values for -XX:BciProfileWidth [v2] In-Reply-To: References: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> Message-ID: <-UlWQi6Pf7UwQKUR8sL4_Rhoj9MEd8UMlH7naG_W7QM=.d8947291-df9e-40e8-8007-4438c6c490c3@github.com> On Tue, 22 Jul 2025 08:08:28 GMT, Marc Chevalier wrote: >> Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: >> >> addressing review comment 1 > > src/hotspot/share/runtime/globals.hpp line 1354: > >> 1352: range(0, 8) \ >> 1353: \ >> 1354: develop(intx, BciProfileWidth, 2, \ > > Recently, I've seen someone complaining about useless use of `intx`, saying that is brings less readability than a more fixed-width type when not needed. Here, [0, 5000] fits in 16 bits (even signed). One could change that into a simple `int` or something like that. Since `int` seems to fit the range. I have changed `intx` to `int ` > src/hotspot/share/runtime/globals.hpp line 1357: > >> 1355: "Number of return bci's to record in ret profile") \ >> 1356: range(0, AARCH64_ONLY(1000) NOT_AARCH64(5000)) \ >> 1357: \ > > Maybe that's one empty line too much (cf. other spacing just around). Thank you. I fixed this. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26139#discussion_r2222457832 PR Review Comment: https://git.openjdk.org/jdk/pull/26139#discussion_r2222456543 From thartmann at openjdk.org Tue Jul 22 14:27:27 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 22 Jul 2025 14:27:27 GMT Subject: RFR: 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD In-Reply-To: References: Message-ID: <15Vd5Kn403cC3nHdk8UWhTwAKoJZQ00eT-VKkZn0iL8=.ae0e3372-8f2d-40f3-96a9-1b251260386a@github.com> On Tue, 22 Jul 2025 07:21:15 GMT, Wang Haomin wrote: > Same as https://bugs.openjdk.org/browse/JDK-8362171 , so I've added `IsFiniteF`, `IsFiniteD` to the assert switch. Marked as reviewed by thartmann (Reviewer). Okay, thanks for the details. The fix looks good to me. ------------- PR Review: https://git.openjdk.org/jdk/pull/26423#pullrequestreview-3043269319 PR Comment: https://git.openjdk.org/jdk/pull/26423#issuecomment-3102992278 From fjiang at openjdk.org Tue Jul 22 14:35:26 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Tue, 22 Jul 2025 14:35:26 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v5] In-Reply-To: References: Message-ID: On Tue, 22 Jul 2025 10:46:47 GMT, Galder Zamarre?o wrote: >> Adding new flag check routines seems like a good idea, but it's a bit challenging to choose a name, as there are too many flags for `LIR_OPArrayCopy`[1]. 
Perhaps something like `should_check_unaligned_or_overlapping` would be suitable? >> >> 1. https://github.com/openjdk/jdk/blob/15b5b54ac707ba0d4e473fd6eb02c38a8efe705c/src/hotspot/share/c1/c1_LIR.hpp#L1257-L1271 > > Hmmm, I don't think I like that name. It's too specific on the flags but does not convey what it's used for. The aim of `flag=0` was to avoid instantiation of array copy stubs, so maybe the name could be `init_flags_for_copy_stubs`? It could be prepended with a `get_` if needs be to avoid confusion. `init_flags_for_copy_stubs` appears to be misleading, as it suggests we want to generate array copy stubs for those flags. How about `get_necessary_copy_flags`? We can add other flags if needed in the future. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2222749793 From thartmann at openjdk.org Tue Jul 22 14:35:36 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 22 Jul 2025 14:35:36 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v15] In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 18:56:46 GMT, Tobias Hartmann wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Refine lower bound computation > > Thanks, testing looks good now! I'm out for the rest of the week and can review only next week. > Hi @TobiHartmann, Can you please re-verify the latest version of patch and approve if all tests are green. Sure, I'll re-run testing. How did you find the issue that you fixed in the latest commit? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3103028817 From thartmann at openjdk.org Tue Jul 22 14:35:37 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 22 Jul 2025 14:35:37 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v15] In-Reply-To: <6XRbYO3L5zyEIH2MuKSntl2IdAG2zgNVNm_0PakEI9g=.40266744-d0d0-421f-b5ff-17fd0117b693@github.com> References: <6XRbYO3L5zyEIH2MuKSntl2IdAG2zgNVNm_0PakEI9g=.40266744-d0d0-421f-b5ff-17fd0117b693@github.com> Message-ID: <6LUCFCu5hkbtz24H3-KIU5OCbaOsutU0HUakZhZEdzY=.31d897c4-fe36-4c3b-a8f7-7f4a713a0cf0@github.com> On Mon, 21 Jul 2025 20:21:25 GMT, Vladimir Kozlov wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Refine lower bound computation > > I am not sure about JDK 25 approval for these changes. > > Can you do simple fix for JDK 25 as @merykitty suggested: "I suggest removing all the logic and simply returning the bottom type" ? Will it be the same complexity? Will it affect performance (and how much)? @vnkozlov Given that this is an old issue already affecting JDK 19, maybe we should just defer to JDK 26 for now and then backport to JDK 25u only once the fix is stable? ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3103037832 From galder at openjdk.org Tue Jul 22 15:37:09 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Tue, 22 Jul 2025 15:37:09 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v5] In-Reply-To: References: Message-ID: <-nkeoFoOTMWI6RfM_Jer0VWfjpFQI4Ky8_IUtlplDQE=.ee03ce55-ffae-4bf6-b237-9ad8d5a93d58@github.com> On Tue, 22 Jul 2025 14:32:24 GMT, Feilong Jiang wrote: >> Hmmm, I don't think I like that name. It's too specific on the flags but does not convey what it's used for. 
>> The aim of `flag=0` was to avoid instantiation of array copy stubs, so maybe the name could be `init_flags_for_copy_stubs`? It could be prepended with a `get_` if need be, to avoid confusion.
>
> `init_flags_for_copy_stubs` appears to be misleading, as it suggests we want to generate array copy stubs for those flags.
> How about `get_necessary_copy_flags`? We can add other flags if needed in the future.

I'm unsure about the use of the word `necessary`. What about `get_initial_copy_flags`?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2222930750

From fyang at openjdk.org Tue Jul 22 15:37:20 2025
From: fyang at openjdk.org (Fei Yang)
Date: Tue, 22 Jul 2025 15:37:20 GMT
Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v3]
In-Reply-To:
References:
Message-ID:

On Tue, 22 Jul 2025 11:56:35 GMT, Hamlin Li wrote:

>>> I see `MacroAssembler::pd_call_destination` delegates work to `NativeFarCall::reloc_destination` which calls `MacroAssembler::target_addr_for_insn` under condition `if (stub_addr != nullptr)`.
>>
>> Can you clarify "`MacroAssembler::pd_call_destination` delegates work to `NativeFarCall::reloc_destination`"?
>> Robbin is on vacation for weeks, so I'm afraid he's not going to respond in time.
>
> I don't think this pr changes the logic in `NativeFarCall::reloc_destination`, am I right?
> Or maybe you're misled by the name change from `stub_address` to `reloc_destination_without_check` and the existing method `reloc_destination()`?
>
> I see `MacroAssembler::pd_call_destination` delegates work to `NativeFarCall::reloc_destination` which calls `MacroAssembler::target_addr_for_insn` under condition `if (stub_addr != nullptr)`.
>
> Can you clarify "`MacroAssembler::pd_call_destination` delegates work to `NativeFarCall::reloc_destination`"? Robbin is on vacation for weeks, so I'm afraid he's not going to respond in time.

Sorry for not being clear. Let me clarify. I mean this code snippet at [1]:

 74 address Relocation::pd_call_destination(address orig_addr) {
 75   assert(is_call(), "should be a call here");
 76   if (NativeCall::is_at(addr())) {
 77     return nativeCall_at(addr())->reloc_destination(); <======================
 78   }
 79
 80   if (orig_addr != nullptr) {

Before this change, we call `MacroAssembler::pd_call_destination` here. And `NativeFarCall::reloc_destination` at L77 will only call `MacroAssembler::target_addr_for_insn` under condition `if (stub_addr != nullptr)` [2]. But we will always/unconditionally call `MacroAssembler::target_addr_for_insn` here after this change. That seems to make a difference? Did I miss anything?

[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/relocInfo_riscv.cpp#L77
[2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/nativeInst_riscv.cpp#L112

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2222880013

From kxu at openjdk.org Tue Jul 22 15:37:34 2025
From: kxu at openjdk.org (Kangcheng Xu)
Date: Tue, 22 Jul 2025 15:37:34 GMT
Subject: RFR: 8353290: C2: Refactor PhaseIdealLoop::is_counted_loop() [v3]
In-Reply-To:
References:
Message-ID: <2WHjT7pRiNmnEhn10X2stupsaA5R4uYI1SAgsgTb7Qo=.16d561b8-9afd-4e21-8518-c6b77382c9c3@github.com>

On Thu, 19 Jun 2025 16:30:51 GMT, Christian Hagedorn wrote:

>> Resolved conflict with [JDK-8357951](https://bugs.openjdk.org/browse/JDK-8357951). @chhagedorn I'd appreciate a re-review. Thank you so much!
>
> Thanks @tabjy for coming back with an update and pinging me again! Sorry, I completely missed it the first time.
I will be on vacation starting tomorrow for two weeks but I'm happy to take another look when I'm back :-) Resolved conflicts. @chhagedorn would kindly take a look? Thank you so much! ------------- PR Comment: https://git.openjdk.org/jdk/pull/24458#issuecomment-3103390781 From fjiang at openjdk.org Tue Jul 22 15:50:25 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Tue, 22 Jul 2025 15:50:25 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v6] In-Reply-To: References: Message-ID: > Hi, please consider. > [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. > The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. > If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. > This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. > We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. > > This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. > The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. > > Test on linux-riscv64: > - [x] Tier1-3 > > JMH data on P550 SBC for reference (w/o and w/ the patch): > > Before: > > Without COH: > > Benchmark (size) Mode Cnt Score Error Units > ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op > ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op > ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op > ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op > ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op > ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op > ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op > ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op > ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op > ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op > ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op > ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op > ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op > ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op > ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op > ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op > > ------------------------------------------------------------------------- > With COH: > > Benchmark (size) Mode Cnt Score Error Units > ArrayClone.byteArraycopy 0 avgt 15 50.667 ? 0.622 ns/op > Arra... Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 10 commits: - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - Add get_initial_copy_flags - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - also keep overlapping flag - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - Revert RISCV Macro modification - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses - riscv: fix c1 primitive array clone intrinsic regression ------------- Changes: https://git.openjdk.org/jdk/pull/25976/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=25976&range=05 Stats: 5 lines in 3 files changed: 3 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/25976.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25976/head:pull/25976 PR: https://git.openjdk.org/jdk/pull/25976 From kvn at openjdk.org Tue Jul 22 16:06:09 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 22 Jul 2025 16:06:09 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v15] In-Reply-To: <6XRbYO3L5zyEIH2MuKSntl2IdAG2zgNVNm_0PakEI9g=.40266744-d0d0-421f-b5ff-17fd0117b693@github.com> References: <6XRbYO3L5zyEIH2MuKSntl2IdAG2zgNVNm_0PakEI9g=.40266744-d0d0-421f-b5ff-17fd0117b693@github.com> Message-ID: On Mon, 21 Jul 2025 20:21:25 GMT, Vladimir Kozlov wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> Refine lower bound computation > > I am not sure about JDK 25 approval for these changes. > > Can you do simple fix for JDK 25 as @merykitty suggested: "I suggest removing all the logic and simply returning the bottom type" ? Will it be the same complexity? Will it affect performance (and how much)? > @vnkozlov Given that this is an old issue already affecting JDK 19, maybe we should just defer to JDK 26 for now and then backport to JDK 25u only once the fix is stable? Yes, I agree. Please replace fix request with defer request in JBS bug report. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3103585873 From dlong at openjdk.org Tue Jul 22 19:59:57 2025 From: dlong at openjdk.org (Dean Long) Date: Tue, 22 Jul 2025 19:59:57 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v6] In-Reply-To: References: Message-ID: On Tue, 22 Jul 2025 15:50:25 GMT, Feilong Jiang wrote: >> Hi, please consider. >> [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. >> The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. >> If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. >> This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. >> We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. 
>> >> This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. >> The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. >> >> Test on linux-riscv64: >> - [x] Tier1-3 >> >> JMH data on P550 SBC for reference (w/o and w/ the patch): >> >> Before: >> >> Without COH: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op >> ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op >> ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op >> ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op >> ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op >> ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op >> ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op >> ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op >> ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op >> ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op >> ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op >> ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op >> >> ------------------------------------------------------------------------- >> With COH: >> >> Benchmark (size) Mode Cnt Score Error Un... > > Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: > > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - Add get_initial_copy_flags > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - also keep overlapping flag > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - Revert RISCV Macro modification > - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone > - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses > - riscv: fix c1 primitive array clone intrinsic regression Marked as reviewed by dlong (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/25976#pullrequestreview-3044686749 From duke at openjdk.org Tue Jul 22 22:20:56 2025 From: duke at openjdk.org (duke) Date: Tue, 22 Jul 2025 22:20:56 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v5] In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 17:17:07 GMT, Srinivas Vamsi Parasa wrote: >> This PR adds support for the Push-Pop Acceleration (PPX) hint to legacy PUSH and POP instructions, enabling the PUSHP and POPP forms. The PPX hint improves performance by accelerating register value forwarding between matching push/pop pairs. >> >> **Purpose:** PPX is a performance hint that allows the processor to bypass memory and the training loop of Fast Store Forwarding Predictor (FSFP) by directly forwarding data between paired PUSHP and POPP instructions. >> >> **Requirements:** Both the PUSH and its matching POP must be marked with PPX. A "matching" pair accesses the same stack address (e.g., typical function prolog/epilog). Standalone PUSH instructions (e.g. for argument passing) must not be marked. 
>> >> **Encoding:** PUSHP/POPP is a replacement for legacy PUSH/POP (0x50+rd / 0x58+rd) and uses REX2.W = 1 (implies 64-bit operand size). PPX cannot be encoded with 16-bit operand size as REX2.W overrides the 0x66 prefix. > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > change to push_ppx/pop_ppx @vamsi-parasa Your change (at version 78cbf2430d2a8179d97201f799026747a38367a4) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25889#issuecomment-3104989703 From kvn at openjdk.org Tue Jul 22 22:40:07 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 22 Jul 2025 22:40:07 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v39] In-Reply-To: References: Message-ID: <1S6OFTVdnqqXUthfAtaa4PMg9AI6iA6a1xC2Un2yROk=.3c3ae976-1c3c-4a0d-8ef6-7963d105cff2@github.com> On Tue, 22 Jul 2025 01:05:53 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This does not modify existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [ ] Linux x64 fastdebug all >> - [ ] Linux aarch64 fastdebug all >> - [ ] ... > > Chad Rakoczy has updated the pull request incrementally with five additional commits since the last revision: > > - Fix spacing > - Update NMethod.java with immutable data changes > - Rename method to nm > - Add assert before freeing immutable data > - Reorder is_relocatable checks Looks good to me now. ------------- Marked as reviewed by kvn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23573#pullrequestreview-3045062588 From fjiang at openjdk.org Wed Jul 23 01:08:04 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Wed, 23 Jul 2025 01:08:04 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v5] In-Reply-To: <-nkeoFoOTMWI6RfM_Jer0VWfjpFQI4Ky8_IUtlplDQE=.ee03ce55-ffae-4bf6-b237-9ad8d5a93d58@github.com> References: <-nkeoFoOTMWI6RfM_Jer0VWfjpFQI4Ky8_IUtlplDQE=.ee03ce55-ffae-4bf6-b237-9ad8d5a93d58@github.com> Message-ID: On Tue, 22 Jul 2025 15:25:25 GMT, Galder Zamarre?o wrote: >> `init_flags_for_copy_stubs` appears to be misleading, as it suggests we want to generate array copy stubs for those flags. >> How about `get_necessary_copy_flags`? We can add other flags if needed in the future. > > I'm unsure about the use of the word `necessary`. What about `get_initial_copy_flags`? `get_initial_copy_flags` looks fine, I have added it to `LIR_OpArrayCopy`. 
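For reference, the Java-level pattern exercised by this C1 intrinsic (and by the ArrayClone micro-benchmarks quoted earlier in the thread) is a plain primitive array clone; a minimal standalone example, for illustration only:

    import java.util.Arrays;

    public class PrimitiveCloneDemo {
        public static void main(String[] args) {
            int[] src = new int[1000];
            Arrays.setAll(src, i -> i);
            // clone() of a primitive array is the shape that the C1 clone intrinsic
            // implements by reusing the arraycopy code discussed in this thread.
            int[] copy = src.clone();
            System.out.println(copy.length + " " + copy[999]);
        }
    }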
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25976#discussion_r2224099302 From wanghaomin at openjdk.org Wed Jul 23 01:24:53 2025 From: wanghaomin at openjdk.org (Wang Haomin) Date: Wed, 23 Jul 2025 01:24:53 GMT Subject: RFR: 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD In-Reply-To: <15Vd5Kn403cC3nHdk8UWhTwAKoJZQ00eT-VKkZn0iL8=.ae0e3372-8f2d-40f3-96a9-1b251260386a@github.com> References: <15Vd5Kn403cC3nHdk8UWhTwAKoJZQ00eT-VKkZn0iL8=.ae0e3372-8f2d-40f3-96a9-1b251260386a@github.com> Message-ID: <4sAK_7XL1elXYXPKPbqmZpwmrQ5DfXanPaTkSBtfEw4=.d49d3155-fc14-4597-b3ae-c38f32191098@github.com> On Tue, 22 Jul 2025 14:25:03 GMT, Tobias Hartmann wrote: >> Same as https://bugs.openjdk.org/browse/JDK-8362171 , so I've added `IsFiniteF`, `IsFiniteD` to the assert switch. > > Okay, thanks for the details. The fix looks good to me. @TobiHartmann Thanks. Could you push it? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26423#issuecomment-3105315236 From jbhateja at openjdk.org Wed Jul 23 01:52:54 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 23 Jul 2025 01:52:54 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v17] In-Reply-To: References: Message-ID: > Hi All, > > This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. > > Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. > > New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Update intrinsicnode.cpp ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23947/files - new: https://git.openjdk.org/jdk/pull/23947/files/161487e6..68e24cf8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=16 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23947&range=15-16 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/23947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23947/head:pull/23947 PR: https://git.openjdk.org/jdk/pull/23947 From jbhateja at openjdk.org Wed Jul 23 02:00:05 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 23 Jul 2025 02:00:05 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v15] In-Reply-To: <6LUCFCu5hkbtz24H3-KIU5OCbaOsutU0HUakZhZEdzY=.31d897c4-fe36-4c3b-a8f7-7f4a713a0cf0@github.com> References: <6XRbYO3L5zyEIH2MuKSntl2IdAG2zgNVNm_0PakEI9g=.40266744-d0d0-421f-b5ff-17fd0117b693@github.com> <6LUCFCu5hkbtz24H3-KIU5OCbaOsutU0HUakZhZEdzY=.31d897c4-fe36-4c3b-a8f7-7f4a713a0cf0@github.com> Message-ID: On Tue, 22 Jul 2025 14:32:57 GMT, Tobias Hartmann wrote: >> I am not sure about JDK 25 approval for these changes. >> >> Can you do simple fix for JDK 25 as @merykitty suggested: "I suggest removing all the logic and simply returning the bottom type" ? Will it be the same complexity? Will it affect performance (and how much)? 
> > @vnkozlov Given that this is an old issue already affecting JDK 19, maybe we should just defer to JDK 26 for now and then backport to JDK 25u only once the fix is stable? > > Hi @TobiHartmann, Can you please re-verify the latest version of patch and approve if all tests are green. > > Sure, I'll re-run testing. How did you find the issue that you fixed in the latest commit? Hi @TobiHartmann , On a re-review I found this incompatibility between code and comments, its not a correctness issue but trying to constrain the result value range further :-) Hi @vnkozlov , On performance front, I see 500x improvement with and without patch for following micro. {C9168262-BEFD-43FE-A2DF-E935FA312A41} {60D7988B-3DBC-4C84-AC13-68E858360045} public class test_compress { public static int micro(int src, int mask) { src = Math.max(0, Math.min(5, src)); int cond = Integer.compress(src, mask); if (cond < 0) { throw new AssertionError("Unexpected control path"); } return cond; } public static void main(String [] args) { int res = 0; for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) { res = micro(i, i); } long t1 = System.currentTimeMillis(); for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) { res = micro(i, i); } long t2 = System.currentTimeMillis(); System.out.println("[time] " + (t2-t1) + " ms [res] " + res); } } ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3105371975 From aoqi at openjdk.org Wed Jul 23 02:59:03 2025 From: aoqi at openjdk.org (Ao Qi) Date: Wed, 23 Jul 2025 02:59:03 GMT Subject: RFR: 8363895: Minimal build fails with slowdebug builds after JDK-8354887 Message-ID: Configure with `--with-jvm-variants=minimal --with-debug-level=slowdebug`. Error message: ... Compiling up to 136 files for BUILD_java.compiler.interim Creating support/modules_libs/java.base/minimal/libjvm.so from 628 file(s) /opt/rh/devtoolset-11/root/usr/libexec/gcc/x86_64-redhat-linux/11/ld: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/hotspot/variant-minimal/libjvm/objs/macroAssembler_x86.o: in function `AOTCodeCache::is_on_for_dump()': /home/aoqi/work/openjdk/jdk/src/hotspot/share/code/aotCodeCache.hpp:383: undefined reference to `AOTCodeCache::_cache' collect2: error: ld returned 1 exit status gmake[3]: *** [lib/CompileJvm.gmk:175: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/support/modules_libs/java.base/minimal/libjvm.so] Error 1 gmake[2]: *** [make/Main.gmk:242: hotspot-minimal-libs] Error 1 gmake[2]: *** Waiting for unfinished jobs.... `AOTCodeCache::is_on_for_dump()` is used in `macroAssembler_x86.cpp` but not defined when cds is disabled. 
------------- Commit messages: - 8363895: Minimal build fails with slowdebug builds after JDK-8354887 Changes: https://git.openjdk.org/jdk/pull/26436/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26436&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8363895 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26436.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26436/head:pull/26436 PR: https://git.openjdk.org/jdk/pull/26436 From yadongwang at openjdk.org Wed Jul 23 03:06:53 2025 From: yadongwang at openjdk.org (Yadong Wang) Date: Wed, 23 Jul 2025 03:06:53 GMT Subject: RFR: 8362838: RISC-V: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 14:02:08 GMT, Feilong Jiang wrote: > Same as [JDK-8361892](https://bugs.openjdk.org/browse/JDK-8361892), but for riscv. > > Testing: > - [x] Tier1-3 & hotspot:tier4 on linux-riscv64 LGTM. Can you confirm the generated code in the case of non-legal addresses for byte_map_base? ------------- PR Review: https://git.openjdk.org/jdk/pull/26318#pullrequestreview-3045416106 From dzhang at openjdk.org Wed Jul 23 03:38:55 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 23 Jul 2025 03:38:55 GMT Subject: RFR: 8363898: RISC-V: TestRangeCheckHoistingScaledIV.java fails after JDK-8355293 when running without RVV Message-ID: Hi all, Please take a look and review this PR, thanks! test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java fails without RVV after [JDK-8355293](https://bugs.openjdk.org/browse/JDK-8355293) in fastdebug mode. VectorAPI needs vector intrinsic in this case, so RVV needs to be enabled on RISC-V. ### Test (fastdebug) - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on k1 and sg2042 - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on qemu-system w/ and w/o RVV ------------- Commit messages: - 8363898: RISC-V: TestRangeCheckHoistingScaledIV.java fails after JDK-8355293 when running without RVV Changes: https://git.openjdk.org/jdk/pull/26437/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26437&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8363898 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26437.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26437/head:pull/26437 PR: https://git.openjdk.org/jdk/pull/26437 From fyang at openjdk.org Wed Jul 23 03:43:57 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 23 Jul 2025 03:43:57 GMT Subject: RFR: 8363898: RISC-V: TestRangeCheckHoistingScaledIV.java fails after JDK-8355293 when running without RVV In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 03:32:26 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java fails without RVV after [JDK-8355293](https://bugs.openjdk.org/browse/JDK-8355293) in fastdebug mode. > > VectorAPI needs vector intrinsic in this case, so RVV needs to be enabled on RISC-V. > > ### Test (fastdebug) > - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on k1 and sg2042 > - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on qemu-system w/ and w/o RVV Looks fine. Thanks for fixing this test. ------------- Marked as reviewed by fyang (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26437#pullrequestreview-3045475368 From dzhang at openjdk.org Wed Jul 23 03:51:58 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Wed, 23 Jul 2025 03:51:58 GMT Subject: RFR: 8363898: RISC-V: TestRangeCheckHoistingScaledIV.java fails after JDK-8355293 when running without RVV In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 03:32:26 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java fails without RVV after [JDK-8355293](https://bugs.openjdk.org/browse/JDK-8355293) in fastdebug mode. > > In [JDK-8291669](https://bugs.openjdk.org/browse/JDK-8291669), which introduced this case, it is mentioned: >>Previously attached jtreg case fails on ppc64 because VectorAPI has no >>vector intrinsics on ppc64 so there's no long range check to hoist. In >>this patch, we limit the test architecture to x64 and AArch64. > > So we need RVV to use vector intrinsics on RISC-V. > > ### Test (fastdebug) > - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on k1 and sg2042 > - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on qemu-system w/ and w/o RVV Hi @Hamlin-Li , could you help to review this patch? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26437#issuecomment-3105570798 From fjiang at openjdk.org Wed Jul 23 04:19:54 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Wed, 23 Jul 2025 04:19:54 GMT Subject: RFR: 8362596: RISC-V: Improve _vectorizedHashCode intrinsic In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 08:07:48 GMT, Yuri Gaevsky wrote: > This is a micro-optimization for RISC-V SpacemiT K1 CPU to fix [encountered performance regression](https://github.com/openjdk/jdk/pull/17413#issuecomment-3082664335). Thanks for finding this! ------------- Marked as reviewed by fjiang (Committer). PR Review: https://git.openjdk.org/jdk/pull/26409#pullrequestreview-3045529365 From haosun at openjdk.org Wed Jul 23 04:45:07 2025 From: haosun at openjdk.org (Hao Sun) Date: Wed, 23 Jul 2025 04:45:07 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v16] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 11:09:04 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. 
>> Performance numbers with this patch on a 128-bit, SVE2 supporting machine are shown below -
>>
>> Benchmark                                    (size)  Mode  Cnt   Gain
>> SelectFromBenchmark.selectFromByteVector       1024  thrpt    9   1.43
>> SelectFromBenchmark.selectFromByteVector       2048  thrpt    9   1.48
>> SelectFromBenchmark.selectFromDoubleVector     1024  thrpt    9  68.55
>> SelectFromBenchmark.selectFromDoubleVector     2048  thrpt    9  72.07
>> SelectFromBenchmark.selectFromFloatVector      1024  thrpt    9   1.69
>> SelectFromBenchmark.selectFromFloatVector      2048  thrpt    9   1.52
>> SelectFromBenchmark.selectFromIntVector        1024  thrpt    9   1.50
>> SelectFromBenchmark.selectFromIntVector        2048  thrpt    9   1.52
>> SelectFromBenchmark.selectFromLongVector       1024  thrpt    9  85.38
>> SelectFromBenchmark.selectFromLongVector       2048  thrpt    9  80.93
>> SelectFromBenchmark.selectFromShortVector      1024  thrpt    9   1.48
>> SelectFromBenchmark.selectFromShortVector      2048  thrpt    9   1.49
>>
>> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander.
>
> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision:
>
>   Refine comments in c2_MacroAssembler_aarch64.cpp

Marked as reviewed by haosun (Committer).

------------- PR Review: https://git.openjdk.org/jdk/pull/23570#pullrequestreview-3045577632

From kvn at openjdk.org Wed Jul 23 05:39:58 2025
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Wed, 23 Jul 2025 05:39:58 GMT
Subject: RFR: 8363895: Minimal build fails with slowdebug builds after JDK-8354887
In-Reply-To:
References:
Message-ID:

On Wed, 23 Jul 2025 02:16:46 GMT, Ao Qi wrote:

> Configure with `--with-jvm-variants=minimal --with-debug-level=slowdebug`.
>
> Error message:
>
> ...
> Compiling up to 136 files for BUILD_java.compiler.interim
> Creating support/modules_libs/java.base/minimal/libjvm.so from 628 file(s)
> /opt/rh/devtoolset-11/root/usr/libexec/gcc/x86_64-redhat-linux/11/ld: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/hotspot/variant-minimal/libjvm/objs/macroAssembler_x86.o: in function `AOTCodeCache::is_on_for_dump()':
> /home/aoqi/work/openjdk/jdk/src/hotspot/share/code/aotCodeCache.hpp:383: undefined reference to `AOTCodeCache::_cache'
> collect2: error: ld returned 1 exit status
> gmake[3]: *** [lib/CompileJvm.gmk:175: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/support/modules_libs/java.base/minimal/libjvm.so] Error 1
> gmake[2]: *** [make/Main.gmk:242: hotspot-minimal-libs] Error 1
> gmake[2]: *** Waiting for unfinished jobs....
>
> `AOTCodeCache::is_on_for_dump()` is used in `macroAssembler_x86.cpp` but not defined when cds is disabled.

src/hotspot/share/code/aotCodeCache.hpp line 382:

> 380: static void close() NOT_CDS_RETURN;
> 381: static bool is_on() CDS_ONLY({ return cache() != nullptr && !_cache->closing(); }) NOT_CDS_RETURN_(false);
> 382: static bool is_on_for_use() { return is_on() && _cache->for_use(); }

This one also should use macros.

------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26436#discussion_r2224410023

From fjiang at openjdk.org Wed Jul 23 05:59:36 2025
From: fjiang at openjdk.org (Feilong Jiang)
Date: Wed, 23 Jul 2025 05:59:36 GMT
Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v7]
In-Reply-To:
References:
Message-ID: <2-0RIz4ucTeGZ90La3716py33X4u8-Vj-4-WqjC_jck=.afe9207b-7093-4a72-82c8-924abcb1054a@github.com>

> Hi, please consider.
> [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. > The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. > If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. > This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. > We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. > > This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. > The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. > > Test on linux-riscv64: > - [x] Tier1-3 > > JMH data on P550 SBC for reference (w/o and w/ the patch): > > Before: > > Without COH: > > Benchmark (size) Mode Cnt Score Error Units > ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op > ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op > ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op > ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op > ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op > ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op > ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op > ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op > ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op > ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op > ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op > ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op > ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op > ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op > ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op > ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op > > ------------------------------------------------------------------------- > With COH: > > Benchmark (size) Mode Cnt Score Error Units > ArrayClone.byteArraycopy 0 avgt 15 50.667 ? 0.622 ns/op > Arra... Feilong Jiang has updated the pull request incrementally with one additional commit since the last revision: fix build ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25976/files - new: https://git.openjdk.org/jdk/pull/25976/files/cc2a329f..657f92f9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25976&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25976&range=05-06 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/25976.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25976/head:pull/25976 PR: https://git.openjdk.org/jdk/pull/25976 From fjiang at openjdk.org Wed Jul 23 06:06:54 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Wed, 23 Jul 2025 06:06:54 GMT Subject: RFR: 8362838: RISC-V: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 03:04:38 GMT, Yadong Wang wrote: >> Same as [JDK-8361892](https://bugs.openjdk.org/browse/JDK-8361892), but for riscv. >> >> Testing: >> - [x] Tier1-3 & hotspot:tier4 on linux-riscv64 > > LGTM. Can you confirm the generated code in the case of non-legal addresses for byte_map_base? @yadongw Thanks for the review! > LGTM. 
Can you confirm the generated code in the case of non-legal addresses for byte_map_base? Yes. I added some code to allocate the JNI handler around the byte_map_base (https://github.com/feilongjiang/jdk/commit/4c9b39d5657b5dc3fb52d78c73797da8348aeca2). We can reproduce the crash on RISC-V, and the crash is gone after applying the fix. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26318#issuecomment-3105877365 From fyang at openjdk.org Wed Jul 23 06:29:56 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 23 Jul 2025 06:29:56 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v7] In-Reply-To: <2-0RIz4ucTeGZ90La3716py33X4u8-Vj-4-WqjC_jck=.afe9207b-7093-4a72-82c8-924abcb1054a@github.com> References: <2-0RIz4ucTeGZ90La3716py33X4u8-Vj-4-WqjC_jck=.afe9207b-7093-4a72-82c8-924abcb1054a@github.com> Message-ID: On Wed, 23 Jul 2025 05:59:36 GMT, Feilong Jiang wrote: >> Hi, please consider. >> [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. >> The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. >> If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. >> This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. >> We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. >> >> This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. >> The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. >> >> Test on linux-riscv64: >> - [x] Tier1-3 >> >> JMH data on P550 SBC for reference (w/o and w/ the patch): >> >> Before: >> >> Without COH: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op >> ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op >> ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op >> ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op >> ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op >> ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op >> ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op >> ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op >> ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op >> ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op >> ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op >> ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op >> >> ------------------------------------------------------------------------- >> With COH: >> >> Benchmark (size) Mode Cnt Score Error Un... > > Feilong Jiang has updated the pull request incrementally with one additional commit since the last revision: > > fix build Marked as reviewed by fyang (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/25976#pullrequestreview-3045849105 From aoqi at openjdk.org Wed Jul 23 06:33:39 2025 From: aoqi at openjdk.org (Ao Qi) Date: Wed, 23 Jul 2025 06:33:39 GMT Subject: RFR: 8363895: Minimal build fails with slowdebug builds after JDK-8354887 [v2] In-Reply-To: References: Message-ID: > Configure with `--with-jvm-variants=minimal --with-debug-level=slowdebug`. > > Error message: > > ... > Compiling up to 136 files for BUILD_java.compiler.interim > Creating support/modules_libs/java.base/minimal/libjvm.so from 628 file(s) > /opt/rh/devtoolset-11/root/usr/libexec/gcc/x86_64-redhat-linux/11/ld: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/hotspot/variant-minimal/libjvm/objs/macroAssembler_x86.o: in function `AOTCodeCache::is_on_for_dump()': > /home/aoqi/work/openjdk/jdk/src/hotspot/share/code/aotCodeCache.hpp:383: undefined reference to `AOTCodeCache::_cache' > collect2: error: ld returned 1 exit status > gmake[3]: *** [lib/CompileJvm.gmk:175: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/support/modules_libs/java.base/minimal/libjvm.so] Error 1 > gmake[2]: *** [make/Main.gmk:242: hotspot-minimal-libs] Error 1 > gmake[2]: *** Waiting for unfinished jobs.... > > > `AOTCodeCache::is_on_for_dump()` is used in `macroAssembler_x86.cpp` but not defined when cds is disabled. Ao Qi has updated the pull request incrementally with one additional commit since the last revision: missing macros for is_on_for_use() ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26436/files - new: https://git.openjdk.org/jdk/pull/26436/files/03c4269b..dc329c36 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26436&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26436&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26436.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26436/head:pull/26436 PR: https://git.openjdk.org/jdk/pull/26436 From aoqi at openjdk.org Wed Jul 23 06:33:39 2025 From: aoqi at openjdk.org (Ao Qi) Date: Wed, 23 Jul 2025 06:33:39 GMT Subject: RFR: 8363895: Minimal build fails with slowdebug builds after JDK-8354887 [v2] In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 05:37:22 GMT, Vladimir Kozlov wrote: >> Ao Qi has updated the pull request incrementally with one additional commit since the last revision: >> >> missing macros for is_on_for_use() > > src/hotspot/share/code/aotCodeCache.hpp line 382: > >> 380: static void close() NOT_CDS_RETURN; >> 381: static bool is_on() CDS_ONLY({ return cache() != nullptr && !_cache->closing(); }) NOT_CDS_RETURN_(false); >> 382: static bool is_on_for_use() { return is_on() && _cache->for_use(); } > > This one also should use macros. Done. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26436#discussion_r2224540114 From thartmann at openjdk.org Wed Jul 23 06:39:59 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 23 Jul 2025 06:39:59 GMT Subject: RFR: 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD In-Reply-To: <4sAK_7XL1elXYXPKPbqmZpwmrQ5DfXanPaTkSBtfEw4=.d49d3155-fc14-4597-b3ae-c38f32191098@github.com> References: <15Vd5Kn403cC3nHdk8UWhTwAKoJZQ00eT-VKkZn0iL8=.ae0e3372-8f2d-40f3-96a9-1b251260386a@github.com> <4sAK_7XL1elXYXPKPbqmZpwmrQ5DfXanPaTkSBtfEw4=.d49d3155-fc14-4597-b3ae-c38f32191098@github.com> Message-ID: On Wed, 23 Jul 2025 01:22:23 GMT, Wang Haomin wrote: >> Okay, thanks for the details. 
The fix looks good to me. > > @TobiHartmann Thanks. Could you push it? @haominw You need a second review (maybe @jaskarth?) and then you first need to do `/integrate` before a committer can sponsor your change (see 11. of https://openjdk.org/guide/#life-of-a-pr). ------------- PR Comment: https://git.openjdk.org/jdk/pull/26423#issuecomment-3106039682 From thartmann at openjdk.org Wed Jul 23 06:53:03 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 23 Jul 2025 06:53:03 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v17] In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 01:52:54 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. >> >> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. >> >> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Update intrinsicnode.cpp Still looks good to me. I'll re-run testing with the latest changes. @jatin-bhateja could you please adjust your "Fix request" comment in JBS to be a "Defer request", see https://openjdk.org/jeps/3#Bug-Deferral-Process ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23947#pullrequestreview-3045910095 From wanghaomin at openjdk.org Wed Jul 23 06:56:59 2025 From: wanghaomin at openjdk.org (Wang Haomin) Date: Wed, 23 Jul 2025 06:56:59 GMT Subject: RFR: 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD In-Reply-To: References: Message-ID: On Tue, 22 Jul 2025 07:21:15 GMT, Wang Haomin wrote: > Same as https://bugs.openjdk.org/browse/JDK-8362171 , so I've added `IsFiniteF`, `IsFiniteD` to the assert switch. @jaskarth Could you review this change? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26423#issuecomment-3106096188 From mli at openjdk.org Wed Jul 23 07:33:53 2025 From: mli at openjdk.org (Hamlin Li) Date: Wed, 23 Jul 2025 07:33:53 GMT Subject: RFR: 8363898: RISC-V: TestRangeCheckHoistingScaledIV.java fails after JDK-8355293 when running without RVV In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 03:32:26 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java fails without RVV after [JDK-8355293](https://bugs.openjdk.org/browse/JDK-8355293) in fastdebug mode. > > In [JDK-8291669](https://bugs.openjdk.org/browse/JDK-8291669), which introduced this case, it is mentioned: >>Previously attached jtreg case fails on ppc64 because VectorAPI has no >>vector intrinsics on ppc64 so there's no long range check to hoist. In >>this patch, we limit the test architecture to x64 and AArch64. > > So we need RVV to use vector intrinsics on RISC-V. 
>
> ### Test (fastdebug)
> - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on k1 and sg2042
> - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on qemu-system w/ and w/o RVV

Thank you for fixing this! Looks good.

------------- Marked as reviewed by mli (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/26437#pullrequestreview-3046043263

From mli at openjdk.org Wed Jul 23 07:51:04 2025
From: mli at openjdk.org (Hamlin Li)
Date: Wed, 23 Jul 2025 07:51:04 GMT
Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v3]
In-Reply-To:
References:
Message-ID:

On Tue, 22 Jul 2025 15:14:22 GMT, Fei Yang wrote:

>> I don't think this pr changes the logic in `NativeFarCall::reloc_destination`, am I right?
>> Or maybe you're misled by the name change from `stub_address` to `reloc_destination_without_check` and existing method `reloc_destination()`?
>
>> > I see `MacroAssembler::pd_call_destination` delegates work to `NativeFarCall::reloc_destination` which calls `MacroAssembler::target_addr_for_insn` under condition `if (stub_addr != nullptr)`.
>>
>> Can you clarify "`MacroAssembler::pd_call_destination` delegates work to `NativeFarCall::reloc_destination`"? Robbin is on vacation for weeks, so I'm afraid he's not going to respond in time.
>
> Sorry for not being clear. Let me clarify. I mean this code snippet at [1]:
>
> 74 address Relocation::pd_call_destination(address orig_addr) {
> 75   assert(is_call(), "should be a call here");
> 76   if (NativeCall::is_at(addr())) {
> 77     return nativeCall_at(addr())->reloc_destination(); <======================
> 78   }
> 79
> 80   if (orig_addr != nullptr) {
>
> Before this change, we call `MacroAssembler::pd_call_destination` here.
> And `NativeFarCall::reloc_destination` at L77 will only call `MacroAssembler::target_addr_for_insn` under condition `if (stub_addr != nullptr)` [2]. But we will always/unconditionally call `MacroAssembler::target_addr_for_insn` here after this change. That seems to make a difference? Did I miss anything?
>
> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/relocInfo_riscv.cpp#L77
> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/nativeInst_riscv.cpp#L112

The call path was:
`Relocation::pd_call_destination(address orig_addr)` => `NativeCall::reloc_destination()` => `NativeFarCall::reloc_destination()`

After the change, it is:
`Relocation::pd_call_destination(address orig_addr)` => `NativeCall::reloc_destination()` => `RelocCall::reloc_destination()`

The code path does not change except that `NativeFarCall::reloc_destination` is changed to `RelocCall::reloc_destination` and `assert(NativeFarCall...` is changed to `assert(RelocCall`. It seems it does not change any logic in this code path, just the names.

> Before this change, we call MacroAssembler::pd_call_destination here. And NativeFarCall::reloc_destination at L77 will only call MacroAssembler::target_addr_for_insn under condition if (stub_addr != nullptr)

The code between `NativeFarCall::reloc_destination` and `RelocCall::reloc_destination` is almost the same, except for the names.

> But we will always/unconditionally call MacroAssembler::target_addr_for_insn here after this change.

No. I guess you might have mistaken `RelocCall::reloc_destination_without_check()` for `RelocCall::reloc_destination()`?
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2224710718 From syan at openjdk.org Wed Jul 23 07:54:53 2025 From: syan at openjdk.org (SendaoYan) Date: Wed, 23 Jul 2025 07:54:53 GMT Subject: RFR: 8363898: RISC-V: TestRangeCheckHoistingScaledIV.java fails after JDK-8355293 when running without RVV In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 03:32:26 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java fails without RVV after [JDK-8355293](https://bugs.openjdk.org/browse/JDK-8355293) in fastdebug mode. > > In [JDK-8291669](https://bugs.openjdk.org/browse/JDK-8291669), which introduced this case, it is mentioned: >>Previously attached jtreg case fails on ppc64 because VectorAPI has no >>vector intrinsics on ppc64 so there's no long range check to hoist. In >>this patch, we limit the test architecture to x64 and AArch64. > > So we need RVV to use vector intrinsics on RISC-V. > > ### Test (fastdebug) > - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on k1 and sg2042 > - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on qemu-system w/ and w/o RVV Marked as reviewed by syan (Committer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26437#pullrequestreview-3046116433 From duke at openjdk.org Wed Jul 23 08:03:05 2025 From: duke at openjdk.org (duke) Date: Wed, 23 Jul 2025 08:03:05 GMT Subject: RFR: 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD In-Reply-To: References: Message-ID: <8lElzEMSLMK6teT4daIRdLHm4Ljemg6effWnCABEL9E=.2e39871d-6f15-435e-a3b8-9c19e8be4aec@github.com> On Tue, 22 Jul 2025 07:21:15 GMT, Wang Haomin wrote: > Same as https://bugs.openjdk.org/browse/JDK-8362171 , so I've added `IsFiniteF`, `IsFiniteD` to the assert switch. @haominw Your change (at version 19bb3fd0443e651c7abde8042cb4f8e1ede441da) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26423#issuecomment-3106370977 From jkarthikeyan at openjdk.org Wed Jul 23 08:03:05 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Wed, 23 Jul 2025 08:03:05 GMT Subject: RFR: 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD In-Reply-To: References: Message-ID: On Tue, 22 Jul 2025 07:21:15 GMT, Wang Haomin wrote: > Same as https://bugs.openjdk.org/browse/JDK-8362171 , so I've added `IsFiniteF`, `IsFiniteD` to the assert switch. Thanks for the ping, this looks good to me! Thanks for adding the nodes to the list. ------------- Marked as reviewed by jkarthikeyan (Committer). PR Review: https://git.openjdk.org/jdk/pull/26423#pullrequestreview-3046142465 From wanghaomin at openjdk.org Wed Jul 23 08:05:54 2025 From: wanghaomin at openjdk.org (Wang Haomin) Date: Wed, 23 Jul 2025 08:05:54 GMT Subject: RFR: 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD In-Reply-To: References: Message-ID: <4e2jSf2RNwx3P0Tp8jMBucTKWfBQDV3dNOBsKrmFYD8=.fab5ec70-f0da-4a64-b47e-f7236cf5f71b@github.com> On Wed, 23 Jul 2025 07:58:04 GMT, Jasmine Karthikeyan wrote: >> Same as https://bugs.openjdk.org/browse/JDK-8362171 , so I've added `IsFiniteF`, `IsFiniteD` to the assert switch. > > Thanks for the ping, this looks good to me! Thanks for adding the nodes to the list. @jaskarth Thanks. Could you push it? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26423#issuecomment-3106383788 From wanghaomin at openjdk.org Wed Jul 23 08:11:00 2025 From: wanghaomin at openjdk.org (Wang Haomin) Date: Wed, 23 Jul 2025 08:11:00 GMT Subject: Integrated: 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD In-Reply-To: References: Message-ID: On Tue, 22 Jul 2025 07:21:15 GMT, Wang Haomin wrote: > Same as https://bugs.openjdk.org/browse/JDK-8362171 , so I've added `IsFiniteF`, `IsFiniteD` to the assert switch. This pull request has now been integrated. Changeset: 9f796da3 Author: Wang Haomin Committer: Jasmine Karthikeyan URL: https://git.openjdk.org/jdk/commit/9f796da3774b2e2f92dca178fdccd93989919256 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod 8362972: C2 fails with unexpected node in SuperWord truncation: IsFiniteF, IsFiniteD Reviewed-by: thartmann, jkarthikeyan ------------- PR: https://git.openjdk.org/jdk/pull/26423 From galder at openjdk.org Wed Jul 23 08:11:57 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Wed, 23 Jul 2025 08:11:57 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v7] In-Reply-To: <2-0RIz4ucTeGZ90La3716py33X4u8-Vj-4-WqjC_jck=.afe9207b-7093-4a72-82c8-924abcb1054a@github.com> References: <2-0RIz4ucTeGZ90La3716py33X4u8-Vj-4-WqjC_jck=.afe9207b-7093-4a72-82c8-924abcb1054a@github.com> Message-ID: On Wed, 23 Jul 2025 05:59:36 GMT, Feilong Jiang wrote: >> Hi, please consider. >> [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. >> The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. >> If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. >> This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. >> We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. >> >> This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. >> The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. >> >> Test on linux-riscv64: >> - [x] Tier1-3 >> >> JMH data on P550 SBC for reference (w/o and w/ the patch): >> >> Before: >> >> Without COH: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op >> ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op >> ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op >> ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op >> ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op >> ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op >> ArrayClone.byteClone 100 avgt 15 251.942 ? 0.122 ns/op >> ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op >> ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op >> ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op >> ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op >> ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op >> ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op >> ArrayClone.intClone 10 avgt 15 183.619 ? 
0.588 ns/op >> ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op >> ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op >> >> ------------------------------------------------------------------------- >> With COH: >> >> Benchmark (size) Mode Cnt Score Error Un... > > Feilong Jiang has updated the pull request incrementally with one additional commit since the last revision: > > fix build Looks good now, thanks for the fix @feilongjiang! ------------- Marked as reviewed by galder (Author). PR Review: https://git.openjdk.org/jdk/pull/25976#pullrequestreview-3046195250 From chagedorn at openjdk.org Wed Jul 23 08:47:55 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 23 Jul 2025 08:47:55 GMT Subject: RFR: 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 12:25:33 GMT, Beno?t Maillard wrote: > This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist` (again). This is similar to [JDK-8361700](https://bugs.openjdk.org/browse/JDK-8361700?filter=-1). > > The optimization in question is the removal of redundant `ConvX2Y->ConvY2X->ConvX2Y` sequences (where `X` and `Y` are primitive number types), which get replaced by a single `ConvX2Y` as an identity optimization. This missing optimization was originally reported only for `ConvD2LNode`, but it turns out that other conversion nodes have analog optimization patterns. After manual inspection of identity optimizations in `convertnode.cpp`, I was able to reproduce missing optimizations for the following conversion sequences: > - `ConvD2L->ConvL2D->ConvD2L` > - `ConvF2I->ConvI2F->ConvF2I` > - `ConvF2L->ConvL2F->ConvF2L` > - `ConvI2F->ConvF2I->ConvI2F` > > Similar optimization patterns exist for additional conversion nodes. However, it is not clear if these nodes are subject to the same missed optimization issue. Further investigation may be needed, as I was unable to reproduce such cases with simple tests. > > This is again a case where an optimization depends on the input of its input. Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8359603) > - [x] tier1-3, plus some internal testing > - [x] Added test from the fuzzer, and tests for other sequences (manually derived from the original one) > > Thank you for reviewing! Two minor comments, otherwise, it looks good to me! test/hotspot/jtreg/compiler/c2/TestEliminateRedundantConversionSequences.java line 30: > 28: * simplified to a single ConvX2Y operation when applicable > 29: * VerifyIterativeGVN checks that this optimization was applied > 30: * @run main/othervm -XX:+IgnoreUnrecognizedVMOptions -XX:CompileCommand=quiet I think you can remove `quiet`: Suggestion: * @run main/othervm -XX:+IgnoreUnrecognizedVMOptions test/hotspot/jtreg/compiler/c2/TestEliminateRedundantConversionSequences.java line 94: > 92: > 93: public static void main(String[] strArr) { > 94: for (int i = 0; i < 50_000; ++i) { Do you really need 50000 iterations each? 
Would less also trigger the bug? ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26368#pullrequestreview-3046338959 PR Review Comment: https://git.openjdk.org/jdk/pull/26368#discussion_r2224845814 PR Review Comment: https://git.openjdk.org/jdk/pull/26368#discussion_r2224849384 From chagedorn at openjdk.org Wed Jul 23 08:51:55 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 23 Jul 2025 08:51:55 GMT Subject: RFR: 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 12:25:33 GMT, Beno?t Maillard wrote: > This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist` (again). This is similar to [JDK-8361700](https://bugs.openjdk.org/browse/JDK-8361700?filter=-1). > > The optimization in question is the removal of redundant `ConvX2Y->ConvY2X->ConvX2Y` sequences (where `X` and `Y` are primitive number types), which get replaced by a single `ConvX2Y` as an identity optimization. This missing optimization was originally reported only for `ConvD2LNode`, but it turns out that other conversion nodes have analog optimization patterns. After manual inspection of identity optimizations in `convertnode.cpp`, I was able to reproduce missing optimizations for the following conversion sequences: > - `ConvD2L->ConvL2D->ConvD2L` > - `ConvF2I->ConvI2F->ConvF2I` > - `ConvF2L->ConvL2F->ConvF2L` > - `ConvI2F->ConvF2I->ConvI2F` > > Similar optimization patterns exist for additional conversion nodes. However, it is not clear if these nodes are subject to the same missed optimization issue. Further investigation may be needed, as I was unable to reproduce such cases with simple tests. > > This is again a case where an optimization depends on the input of its input. Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8359603) > - [x] tier1-3, plus some internal testing > - [x] Added test from the fuzzer, and tests for other sequences (manually derived from the original one) > > Thank you for reviewing! src/hotspot/share/opto/phaseX.cpp line 2565: > 2563: // ConvF2I->ConvI2F->ConvF2I > 2564: // ConvF2L->ConvL2F->ConvF2L > 2565: // ConvI2F->ConvF2I->ConvI2F Another thought: Since this is an incomplete list of variations (especially missing, for example, the I2D version while the I2F version is here), should we leave a comment about not being able to trigger issues with the other versions? Otherwise, it could suggest that it was just forgotten. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26368#discussion_r2224869135 From qxing at openjdk.org Wed Jul 23 09:31:18 2025 From: qxing at openjdk.org (Qizheng Xing) Date: Wed, 23 Jul 2025 09:31:18 GMT Subject: RFR: 8360192: C2: Make the type of count leading/trailing zero nodes more precise [v5] In-Reply-To: References: Message-ID: > The result of count leading/trailing zeros is always non-negative, and the maximum value is integer type's size in bits. 
In previous versions, when C2 can not know the operand value of a CLZ/CTZ node at compile time, it will generate a full-width integer type for its result. This can significantly affect the efficiency of code in some cases. > > This patch makes the type of CLZ/CTZ nodes more precise, to make C2 generate better code. For example, the following implementation runs ~115% faster on x86-64 with this patch: > > > public static int numberOfNibbles(int i) { > int mag = Integer.SIZE - Integer.numberOfLeadingZeros(i); > return Math.max((mag + 3) / 4, 1); > } > > > Testing: tier1, IR test Qizheng Xing has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: - Merge branch 'master' into enhance-clz-type - Move `TestCountBitsRange` to `compiler.c2.gvn` - Fix null checks - Narrow type bound - Use `BitsPerX` constant instead of `sizeof` - Make the type of count leading/trailing zero nodes more precise ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25928/files - new: https://git.openjdk.org/jdk/pull/25928/files/c965311b..2f9bca68 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25928&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25928&range=03-04 Stats: 51740 lines in 1709 files changed: 29830 ins; 11769 del; 10141 mod Patch: https://git.openjdk.org/jdk/pull/25928.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25928/head:pull/25928 PR: https://git.openjdk.org/jdk/pull/25928 From fjiang at openjdk.org Wed Jul 23 09:38:02 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Wed, 23 Jul 2025 09:38:02 GMT Subject: Integrated: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 In-Reply-To: References: Message-ID: On Wed, 25 Jun 2025 12:40:06 GMT, Feilong Jiang wrote: > Hi, please consider. > [JDK-8333154](https://bugs.openjdk.org/browse/JDK-8333154) Implemented C1 clone intrinsic that reuses arraycopy code for primitive arrays for RISC-V. > The new instruction flag `OmitChecksFlag` (introduced by [JDK-8302850](https://bugs.openjdk.org/browse/JDK-8302850)) is used to avoid instantiation of array copy stubs for primitive array clones. > If `OmitChecksFlag` is set, all flags (including the `unaligned` flag) will be cleared before generating the `LIR_OpArrayCopy` node. > This may lead to incorrect selection of the arraycopy function when `-XX:+UseCompactObjectHeaders` is enabled, causing the `unaligned` flag to be set for arraycopy. > We observed performance regression on P550 SBC through the corresponding JMH tests when COH is enabled. > > This pr keeps the `unaligned` flag on RISC-V to ensure the arraycopy function is selected correctly. > The other platforms are not affected as the flag is always `0` when `OmitChecksFlag` is true. > > Test on linux-riscv64: > - [x] Tier1-3 > > JMH data on P550 SBC for reference (w/o and w/ the patch): > > Before: > > Without COH: > > Benchmark (size) Mode Cnt Score Error Units > ArrayClone.byteArraycopy 0 avgt 15 50.854 ? 0.379 ns/op > ArrayClone.byteArraycopy 10 avgt 15 74.294 ? 0.449 ns/op > ArrayClone.byteArraycopy 100 avgt 15 81.847 ? 0.082 ns/op > ArrayClone.byteArraycopy 1000 avgt 15 480.106 ? 0.369 ns/op > ArrayClone.byteClone 0 avgt 15 90.146 ? 0.299 ns/op > ArrayClone.byteClone 10 avgt 15 130.525 ? 0.384 ns/op > ArrayClone.byteClone 100 avgt 15 251.942 ? 
0.122 ns/op > ArrayClone.byteClone 1000 avgt 15 407.580 ? 0.318 ns/op > ArrayClone.intArraycopy 0 avgt 15 49.984 ? 0.436 ns/op > ArrayClone.intArraycopy 10 avgt 15 76.302 ? 1.388 ns/op > ArrayClone.intArraycopy 100 avgt 15 267.487 ? 0.329 ns/op > ArrayClone.intArraycopy 1000 avgt 15 1157.444 ? 1.588 ns/op > ArrayClone.intClone 0 avgt 15 90.130 ? 0.257 ns/op > ArrayClone.intClone 10 avgt 15 183.619 ? 0.588 ns/op > ArrayClone.intClone 100 avgt 15 296.491 ? 0.246 ns/op > ArrayClone.intClone 1000 avgt 15 828.695 ? 1.501 ns/op > > ------------------------------------------------------------------------- > With COH: > > Benchmark (size) Mode Cnt Score Error Units > ArrayClone.byteArraycopy 0 avgt 15 50.667 ? 0.622 ns/op > Arra... This pull request has now been integrated. Changeset: e6ac956a Author: Feilong Jiang URL: https://git.openjdk.org/jdk/commit/e6ac956a7ac613b916c0dbfda7e57856c1b8a83c Stats: 5 lines in 3 files changed: 3 ins; 0 del; 2 mod 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 Reviewed-by: fyang, galder, dlong ------------- PR: https://git.openjdk.org/jdk/pull/25976 From fjiang at openjdk.org Wed Jul 23 09:38:01 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Wed, 23 Jul 2025 09:38:01 GMT Subject: RFR: 8360520: RISC-V: C1: Fix primitive array clone intrinsic regression after JDK-8333154 [v6] In-Reply-To: References: Message-ID: On Tue, 22 Jul 2025 19:57:01 GMT, Dean Long wrote: >> Feilong Jiang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: >> >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - Add get_initial_copy_flags >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - also keep overlapping flag >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - Revert RISCV Macro modification >> - Merge branch 'master' of https://github.com/openjdk/jdk into riscv-fix-c1-primitive-clone >> - check unaligned flag at LIR_OpArrayCopy to avoid using AvoidUnalignedAccesses >> - riscv: fix c1 primitive array clone intrinsic regression > > Marked as reviewed by dlong (Reviewer). @dean-long @RealFYang @galderz -- Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25976#issuecomment-3106722711 From mchevalier at openjdk.org Wed Jul 23 10:12:08 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Wed, 23 Jul 2025 10:12:08 GMT Subject: RFR: 8363357: Remove unused flag VerifyAdapterCalls Message-ID: It seems that the flag VerifyAdapterCalls is unused since [JDK-8350209](https://bugs.openjdk.org/browse/JDK-8350209), so pretty recently. Let's remove it, very direct, no trick. 
------------- Commit messages: - Remove VerifyAdapterCalls Changes: https://git.openjdk.org/jdk/pull/26440/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26440&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8363357 Stats: 3 lines in 1 file changed: 0 ins; 3 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26440.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26440/head:pull/26440 PR: https://git.openjdk.org/jdk/pull/26440 From snatarajan at openjdk.org Wed Jul 23 11:05:56 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Wed, 23 Jul 2025 11:05:56 GMT Subject: RFR: 8358696: Assert with extreme values for -XX:BciProfileWidth [v3] In-Reply-To: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> References: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> Message-ID: > **Issue** > Extreme values for BciProfileWidth flag such as `java -XX:BciProfileWidth=-1 -version` and `java -XX:BciProfileWidth=100000 -version `results in assert failure `assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. This is observed in a x86 machine. > > **Analysis** > On debugging the issue, I found that increasing the size of the interpreter using the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` prevented the above mentioned assert from failing for large values of BciProfileWidth. > > **Proposal** > Considering the fact that larger BciProfileWidth results in slower profiling, I have proposed a range between 0 to 5000 to restrict the value for BciProfileWidth for x86 machines. This maximum value is based on modifying the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` using the smallest `InterpreterCodeSize` for all the architectures. As for the lower bound, a value of -1 would be the same as 0, as this simply means no return bci's will be recorded in ret profile. > > **Issue in AArch64** > Additionally running the command `java -XX:BciProfileWidth= 10000 -version` (or larger values) results in a different failure `assert(offset_ok_for_immed(offset(), size)) failed: must be, was: 32768, 3` on an AArch64 machine.This is an issue of maximum offset for `ldr/str` in AArch64 which can be fixed using `form_address` as mentioned in [JDK-8342736](https://bugs.openjdk.org/browse/JDK-8342736). In my preliminary fix using `form_address` on AArch64 machine. I had to modify 3 `ldr` and 1 `str` instruction (in file `src/hotspot/cpu/aarch64/interp_masm_aarch64.cpp` at line number 926, 983, and 997). With this fix using `form_address`, `BciProfileWidth` works for maximum of 5000 after which it crashes with`assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. Without this fix `BciProfileWidth` works for a maximum value of 1300. Currently, I have suggested to restrict the upper bound on AArch64 to 1000 instead of fixing it with `form_address`. > > **Question to reviewers** > Do you think this is a reasonable fix ? For AArch64 do you suggest fixing using `form_address` ? If yes, do I fix it under this PR or create another one ? 
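For illustration, a range-constrained declaration along the lines of this proposal could look roughly like the sketch below. The range(0, 5000) values come from the proposal text itself, but the flag kind, default value, file, and description string are assumptions here rather than the actual patch; HotSpot's flag tables do support a range() clause that is validated at VM startup.

  // Sketch only: a range() constraint makes the VM reject out-of-range values at
  // startup instead of letting an extreme BciProfileWidth overflow the
  // interpreter's code buffer later.
  develop(int, BciProfileWidth, 2,                                          \
          "Number of return bci's to record in ret profile")                \
          range(0, 5000)                                                    \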
Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: fixing test failures due to intx -> int of BciProfileWidth ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26139/files - new: https://git.openjdk.org/jdk/pull/26139/files/a32b6ead..c3a85cd4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26139&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26139&range=01-02 Stats: 6 lines in 5 files changed: 0 ins; 2 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/26139.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26139/head:pull/26139 PR: https://git.openjdk.org/jdk/pull/26139 From yadongwang at openjdk.org Wed Jul 23 11:07:54 2025 From: yadongwang at openjdk.org (Yadong Wang) Date: Wed, 23 Jul 2025 11:07:54 GMT Subject: RFR: 8362838: RISC-V: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 14:02:08 GMT, Feilong Jiang wrote: > Same as [JDK-8361892](https://bugs.openjdk.org/browse/JDK-8361892), but for riscv. > > Testing: > - [x] Tier1-3 & hotspot:tier4 on linux-riscv64 Marked as reviewed by yadongwang (Committer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26318#pullrequestreview-3046818867 From chagedorn at openjdk.org Wed Jul 23 12:07:53 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 23 Jul 2025 12:07:53 GMT Subject: RFR: 8363357: Remove unused flag VerifyAdapterCalls In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 07:46:58 GMT, Marc Chevalier wrote: > It seems that the flag VerifyAdapterCalls is unused since [JDK-8350209](https://bugs.openjdk.org/browse/JDK-8350209), so pretty recently. > > Let's remove it, very direct, no trick. Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26440#pullrequestreview-3047071545 From myankelevich at openjdk.org Wed Jul 23 13:01:58 2025 From: myankelevich at openjdk.org (Mikhail Yankelevich) Date: Wed, 23 Jul 2025 13:01:58 GMT Subject: RFR: 8358696: Assert with extreme values for -XX:BciProfileWidth [v3] In-Reply-To: References: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> Message-ID: On Wed, 23 Jul 2025 11:05:56 GMT, Saranya Natarajan wrote: >> **Issue** >> Extreme values for BciProfileWidth flag such as `java -XX:BciProfileWidth=-1 -version` and `java -XX:BciProfileWidth=100000 -version `results in assert failure `assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. This is observed in a x86 machine. >> >> **Analysis** >> On debugging the issue, I found that increasing the size of the interpreter using the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` prevented the above mentioned assert from failing for large values of BciProfileWidth. >> >> **Proposal** >> Considering the fact that larger BciProfileWidth results in slower profiling, I have proposed a range between 0 to 5000 to restrict the value for BciProfileWidth for x86 machines. This maximum value is based on modifying the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` using the smallest `InterpreterCodeSize` for all the architectures. As for the lower bound, a value of -1 would be the same as 0, as this simply means no return bci's will be recorded in ret profile. 
>> >> **Issue in AArch64** >> Additionally running the command `java -XX:BciProfileWidth= 10000 -version` (or larger values) results in a different failure `assert(offset_ok_for_immed(offset(), size)) failed: must be, was: 32768, 3` on an AArch64 machine.This is an issue of maximum offset for `ldr/str` in AArch64 which can be fixed using `form_address` as mentioned in [JDK-8342736](https://bugs.openjdk.org/browse/JDK-8342736). In my preliminary fix using `form_address` on AArch64 machine. I had to modify 3 `ldr` and 1 `str` instruction (in file `src/hotspot/cpu/aarch64/interp_masm_aarch64.cpp` at line number 926, 983, and 997). With this fix using `form_address`, `BciProfileWidth` works for maximum of 5000 after which it crashes with`assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. Without this fix `BciProfileWidth` works for a maximum value of 1300. Currently, I have suggested to restrict the upper bound on AArch64 to 1000 instead of fixing it with `form_address`. >> >> **Question to reviewers** >> Do you think this is a reasonable fix ? For AArch64 do you suggest fixing using `form_address` ? If yes, do I fix it under this PR or create another one ? > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > fixing test failures due to intx -> int of BciProfileWidth test/lib-test/jdk/test/whitebox/vm_flags/IntxTest.java line 36: > 34: * @author igor.ignatyev at oracle.com > 35: */ > 36: import jdk.test.lib.Platform; NIt: copyright date ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26139#discussion_r2225524788 From thartmann at openjdk.org Wed Jul 23 13:29:05 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 23 Jul 2025 13:29:05 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v17] In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 01:52:54 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. >> >> Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. >> >> New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Update intrinsicnode.cpp Testing is all clean. I think this is good to go into JDK 26. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3108402632 From jbhateja at openjdk.org Wed Jul 23 13:34:12 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 23 Jul 2025 13:34:12 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v17] In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 13:26:19 GMT, Tobias Hartmann wrote: > Testing is all clean. I think this is good to go into JDK 26. Thanks @TobiHartmann , integrating it. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3108436810 From jbhateja at openjdk.org Wed Jul 23 13:34:13 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 23 Jul 2025 13:34:13 GMT Subject: Integrated: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value In-Reply-To: References: Message-ID: <3F6BoP7F_VfKrEzctUL2L6ueDl1hl6enz2vOG02j6Dk=.c43e5e04-8766-479b-aa3c-f39157f49aae@github.com> On Fri, 7 Mar 2025 17:37:36 GMT, Jatin Bhateja wrote: > Hi All, > > This bugfix patch fixes incorrect value computation for Integer/Long. compress APIs. > > Problems occur with a constant input and variable mask where the input's value is equal to the lower bound of the mask value., In this case, an erroneous value range estimation results in a constant value. Existing value routine first attempts to constant fold the compression operation if both input and compression mask are constant values; otherwise, it attempts to constrain the value range of result based on the upper and lower bounds of mask type. > > New IR test covers the issue reported in the bug report along with a case for value range based logic pruning. > > Kindly review and share your feedback. > > Best Regards, > Jatin This pull request has now been integrated. Changeset: b02c1256 Author: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/b02c1256768bc9983d4dba899cd19219e11a380a Stats: 849 lines in 3 files changed: 812 ins; 16 del; 21 mod 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value Co-authored-by: Emanuel Peter Reviewed-by: thartmann ------------- PR: https://git.openjdk.org/jdk/pull/23947 From kvn at openjdk.org Wed Jul 23 14:38:54 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 23 Jul 2025 14:38:54 GMT Subject: RFR: 8363895: Minimal build fails with slowdebug builds after JDK-8354887 [v2] In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 06:33:39 GMT, Ao Qi wrote: >> Configure with `--with-jvm-variants=minimal --with-debug-level=slowdebug`. >> >> Error message: >> >> ... >> Compiling up to 136 files for BUILD_java.compiler.interim >> Creating support/modules_libs/java.base/minimal/libjvm.so from 628 file(s) >> /opt/rh/devtoolset-11/root/usr/libexec/gcc/x86_64-redhat-linux/11/ld: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/hotspot/variant-minimal/libjvm/objs/macroAssembler_x86.o: in function `AOTCodeCache::is_on_for_dump()': >> /home/aoqi/work/openjdk/jdk/src/hotspot/share/code/aotCodeCache.hpp:383: undefined reference to `AOTCodeCache::_cache' >> collect2: error: ld returned 1 exit status >> gmake[3]: *** [lib/CompileJvm.gmk:175: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/support/modules_libs/java.base/minimal/libjvm.so] Error 1 >> gmake[2]: *** [make/Main.gmk:242: hotspot-minimal-libs] Error 1 >> gmake[2]: *** Waiting for unfinished jobs.... >> >> >> `AOTCodeCache::is_on_for_dump()` is used in `macroAssembler_x86.cpp` but not defined when cds is disabled. > > Ao Qi has updated the pull request incrementally with one additional commit since the last revision: > > missing macros for is_on_for_use() Good. ------------- Marked as reviewed by kvn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26436#pullrequestreview-3047751072 From snatarajan at openjdk.org Wed Jul 23 15:18:39 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Wed, 23 Jul 2025 15:18:39 GMT Subject: RFR: 8358696: Assert with extreme values for -XX:BciProfileWidth [v4] In-Reply-To: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> References: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> Message-ID: <_preMnRE0tqL476Pb8bPPfkixInRa-ZH5Qom7W70AW4=.a71e36da-d0e1-44e4-a3fe-9091460b813f@github.com> > **Issue** > Extreme values for BciProfileWidth flag such as `java -XX:BciProfileWidth=-1 -version` and `java -XX:BciProfileWidth=100000 -version `results in assert failure `assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. This is observed in a x86 machine. > > **Analysis** > On debugging the issue, I found that increasing the size of the interpreter using the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` prevented the above mentioned assert from failing for large values of BciProfileWidth. > > **Proposal** > Considering the fact that larger BciProfileWidth results in slower profiling, I have proposed a range between 0 to 5000 to restrict the value for BciProfileWidth for x86 machines. This maximum value is based on modifying the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` using the smallest `InterpreterCodeSize` for all the architectures. As for the lower bound, a value of -1 would be the same as 0, as this simply means no return bci's will be recorded in ret profile. > > **Issue in AArch64** > Additionally running the command `java -XX:BciProfileWidth= 10000 -version` (or larger values) results in a different failure `assert(offset_ok_for_immed(offset(), size)) failed: must be, was: 32768, 3` on an AArch64 machine.This is an issue of maximum offset for `ldr/str` in AArch64 which can be fixed using `form_address` as mentioned in [JDK-8342736](https://bugs.openjdk.org/browse/JDK-8342736). In my preliminary fix using `form_address` on AArch64 machine. I had to modify 3 `ldr` and 1 `str` instruction (in file `src/hotspot/cpu/aarch64/interp_masm_aarch64.cpp` at line number 926, 983, and 997). With this fix using `form_address`, `BciProfileWidth` works for maximum of 5000 after which it crashes with`assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. Without this fix `BciProfileWidth` works for a maximum value of 1300. Currently, I have suggested to restrict the upper bound on AArch64 to 1000 instead of fixing it with `form_address`. > > **Question to reviewers** > Do you think this is a reasonable fix ? For AArch64 do you suggest fixing using `form_address` ? If yes, do I fix it under this PR or create another one ? 
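To make the proposed bounds concrete, here is a small launcher sketch; it is not part of the patch, and it assumes a debug build since BciProfileWidth is a develop flag:

    // With a range attached to the flag, an extreme value should be rejected by
    // the flag parser ("outside the allowed range") instead of reaching the
    // interpreter generator and tripping the CodeBuffer assert.
    public class BciProfileWidthRangeSketch {
        public static void main(String[] args) throws Exception {
            Process p = new ProcessBuilder(
                    System.getProperty("java.home") + "/bin/java",
                    "-XX:BciProfileWidth=100000", "-version")
                    .inheritIO()
                    .start();
            System.out.println("exit code: " + p.waitFor()); // expected non-zero
        }
    }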
Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: fixing copyright ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26139/files - new: https://git.openjdk.org/jdk/pull/26139/files/c3a85cd4..2d0084ba Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26139&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26139&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26139.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26139/head:pull/26139 PR: https://git.openjdk.org/jdk/pull/26139 From snatarajan at openjdk.org Wed Jul 23 15:18:39 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Wed, 23 Jul 2025 15:18:39 GMT Subject: RFR: 8358696: Assert with extreme values for -XX:BciProfileWidth [v4] In-Reply-To: <-UlWQi6Pf7UwQKUR8sL4_Rhoj9MEd8UMlH7naG_W7QM=.d8947291-df9e-40e8-8007-4438c6c490c3@github.com> References: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> <-UlWQi6Pf7UwQKUR8sL4_Rhoj9MEd8UMlH7naG_W7QM=.d8947291-df9e-40e8-8007-4438c6c490c3@github.com> Message-ID: On Tue, 22 Jul 2025 12:59:35 GMT, Saranya Natarajan wrote: >> src/hotspot/share/runtime/globals.hpp line 1354: >> >>> 1352: range(0, 8) \ >>> 1353: \ >>> 1354: develop(intx, BciProfileWidth, 2, \ >> >> Recently, I've seen someone complaining about useless use of `intx`, saying that is brings less readability than a more fixed-width type when not needed. Here, [0, 5000] fits in 16 bits (even signed). One could change that into a simple `int` or something like that. > > Since `int` seems to fit the range. I have changed `intx` to `int `. Currently, fixing some test failures. I will address the failures in next commit. I have now resolved the failing test cases. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26139#discussion_r2225921554 From kvn at openjdk.org Wed Jul 23 15:21:21 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 23 Jul 2025 15:21:21 GMT Subject: RFR: 8350896: Integer/Long.compress gets wrong type from CompressBitsNode::Value [v17] In-Reply-To: References: Message-ID: <1_vllGAMI458pIw8ZxfzzObNY2S6vSotABXfSmNWzIY=.8b476018-fd56-420c-98dd-1a2cd44b3c08@github.com> On Wed, 23 Jul 2025 13:29:26 GMT, Jatin Bhateja wrote: >> Testing is all clean. I think this is good to go into JDK 26. > >> Testing is all clean. I think this is good to go into JDK 26. > > Thanks @TobiHartmann , integrating it. Thank you, @jatin-bhateja , for providing performance numbers. I approved deference request. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23947#issuecomment-3109080513 From duke at openjdk.org Wed Jul 23 15:34:02 2025 From: duke at openjdk.org (duke) Date: Wed, 23 Jul 2025 15:34:02 GMT Subject: Withdrawn: 8324720: Instruction selection does not respect -XX:-UseBMI2Instructions flag In-Reply-To: References: Message-ID: On Fri, 23 May 2025 13:50:09 GMT, Saranya Natarajan wrote: > While executing a function performing `a >> b` operation with `?XX:-UseBMI2Instructions` flag, the generated code contains BMI2 instruction `sarx eax,esi,edx`. The expected output should not contain any BMI2 instruction. 
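For context, the report boils down to a method of the following shape (an illustrative sketch; the flag from the report is -XX:-UseBMI2Instructions):

    // Compiled with -XX:-UseBMI2Instructions, this shift was reported to still
    // be emitted as the BMI2 form (sarx eax, esi, edx).
    static int shiftRight(int a, int b) {
        return a >> b;
    }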
> > ### Analysis and solution
> >
> > As suggested by @merykitty in [JDK-8324720](https://bugs.openjdk.org/browse/JDK-8324720), the initial idea was to make `VM_Version::supports_bmi2()` respect the `UseBMI2Instructions` flag by disabling the BMI2 feature when the `UseBMI2Instructions` runtime flag is explicitly set to false. This fix is similar to how other runtime flags, such as `UseAPX` and `UseAVX`, enable or disable specific code and register sets. However, some test failures were encountered while running tests on this fix.
> >
> > The first set of failures was caused by the assertion check on `VM_Version::supports_bmi2()` while generating some BMI2-specific instructions. This was caused by the stub generator generating AVX-512 specific code that uses these BMI2 instructions. It should be noted that the `UseAVX` flag is set by default to the highest supported version available on an x86 machine. This in turn allows AVX-512 specific code generation whenever possible. In order not to compromise the performance benefits of using AVX-512, the proposed fix only disables the BMI2 feature if AVX-512 features are also disabled (or not available in the machine) along with the `UseBMI2Instructions` flag.
> >
> > The second failure occurred in `compiler/intrinsics/sha/cli/TestUseSHA512IntrinsicsOptionOnSupportedCPU.java`, where a warning "_Intrinsics for SHA-384 and SHA-512 crypto hash functions not available on this CPU._" was returned on an AMD64 machine that had support for SHA512. Looking into `compiler/testlibrary/sha/predicate/IntrinsicPredicates.java`, it was found that the predicate for AMD64 was not in line with the changes introduced by [JDK-8341052](https://bugs.openjdk.org/browse/JDK-8341052) in commit [85c1aea](https://github.com/openjdk/jdk/pull/20633/commits/85c1aea90b10014aa34dfc902dff2bfd31bd70c0).

This pull request has been closed without being integrated.

-------------

PR: https://git.openjdk.org/jdk/pull/25415

From shade at openjdk.org Wed Jul 23 17:00:00 2025
From: shade at openjdk.org (Aleksey Shipilev)
Date: Wed, 23 Jul 2025 17:00:00 GMT
Subject: RFR: 8363895: Minimal build fails with slowdebug builds after JDK-8354887 [v2]
In-Reply-To: 
References: 
Message-ID: 

On Wed, 23 Jul 2025 06:33:39 GMT, Ao Qi wrote:

>> Configure with `--with-jvm-variants=minimal --with-debug-level=slowdebug`.
>>
>> Error message:
>>
>> ...
>> Compiling up to 136 files for BUILD_java.compiler.interim
>> Creating support/modules_libs/java.base/minimal/libjvm.so from 628 file(s)
>> /opt/rh/devtoolset-11/root/usr/libexec/gcc/x86_64-redhat-linux/11/ld: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/hotspot/variant-minimal/libjvm/objs/macroAssembler_x86.o: in function `AOTCodeCache::is_on_for_dump()':
>> /home/aoqi/work/openjdk/jdk/src/hotspot/share/code/aotCodeCache.hpp:383: undefined reference to `AOTCodeCache::_cache'
>> collect2: error: ld returned 1 exit status
>> gmake[3]: *** [lib/CompileJvm.gmk:175: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/support/modules_libs/java.base/minimal/libjvm.so] Error 1
>> gmake[2]: *** [make/Main.gmk:242: hotspot-minimal-libs] Error 1
>> gmake[2]: *** Waiting for unfinished jobs....
>>
>>
>> `AOTCodeCache::is_on_for_dump()` is used in `macroAssembler_x86.cpp` but not defined when cds is disabled.
>
> Ao Qi has updated the pull request incrementally with one additional commit since the last revision:
>
>   missing macros for is_on_for_use()

Looks fine, thanks.

-------------

Marked as reviewed by shade (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/26436#pullrequestreview-3048306246 From kvn at openjdk.org Wed Jul 23 17:40:01 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 23 Jul 2025 17:40:01 GMT Subject: RFR: 8363895: Minimal build fails with slowdebug builds after JDK-8354887 [v2] In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 06:33:39 GMT, Ao Qi wrote: >> Configure with `--with-jvm-variants=minimal --with-debug-level=slowdebug`. >> >> Error message: >> >> ... >> Compiling up to 136 files for BUILD_java.compiler.interim >> Creating support/modules_libs/java.base/minimal/libjvm.so from 628 file(s) >> /opt/rh/devtoolset-11/root/usr/libexec/gcc/x86_64-redhat-linux/11/ld: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/hotspot/variant-minimal/libjvm/objs/macroAssembler_x86.o: in function `AOTCodeCache::is_on_for_dump()': >> /home/aoqi/work/openjdk/jdk/src/hotspot/share/code/aotCodeCache.hpp:383: undefined reference to `AOTCodeCache::_cache' >> collect2: error: ld returned 1 exit status >> gmake[3]: *** [lib/CompileJvm.gmk:175: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/support/modules_libs/java.base/minimal/libjvm.so] Error 1 >> gmake[2]: *** [make/Main.gmk:242: hotspot-minimal-libs] Error 1 >> gmake[2]: *** Waiting for unfinished jobs.... >> >> >> `AOTCodeCache::is_on_for_dump()` is used in `macroAssembler_x86.cpp` but not defined when cds is disabled. > > Ao Qi has updated the pull request incrementally with one additional commit since the last revision: > > missing macros for is_on_for_use() In Leyden premain branch we already have this. I forgot to port it to mainline. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26436#issuecomment-3109531143 From dlong at openjdk.org Wed Jul 23 19:55:34 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 23 Jul 2025 19:55:34 GMT Subject: RFR: 8361376: Regressions 1-6% in several Renaissance in 26-b4 only MacOSX aarch64 Message-ID: This PR removes the recently added lock around set_guard_value, using instead Atomic::cmpxchg to atomically update bit-fields of the guard value. Further, it takes a fast-path that uses the previous direct store when at a safepoint. Combined, these changes should get us back to almost where we were before in terms of overhead. If necessary, we could go even further and allow make_not_entrant() do a direct byte store, leaving 24 bits for the guard value. ------------- Commit messages: - trailing white space - fix s390 copy-paste - use fast path for safepoints - ppc align check - ppc align - simplify x86 - In CAS loop, update old_value from result of CAS - fix ppc build - fix zero build - ppc typos - ... 
and 3 more: https://git.openjdk.org/jdk/compare/cf75f1f9...ecc6e68e Changes: https://git.openjdk.org/jdk/pull/26399/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26399&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361376 Stats: 214 lines in 13 files changed: 118 ins; 58 del; 38 mod Patch: https://git.openjdk.org/jdk/pull/26399.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26399/head:pull/26399 PR: https://git.openjdk.org/jdk/pull/26399 From dlong at openjdk.org Wed Jul 23 21:42:54 2025 From: dlong at openjdk.org (Dean Long) Date: Wed, 23 Jul 2025 21:42:54 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v2] In-Reply-To: References: Message-ID: <0hUerokqC_F8hnTVSlcSO7eRDfuq_JhFOMZJGtnVJLg=.85ed9e6e-9fec-4855-a4e0-53748a7b42fa@github.com> On Fri, 4 Jul 2025 02:33:33 GMT, Dean Long wrote: >> The VerifyStack logic in Deoptimization::unpack_frames() attempts to check the expression stack size of the interpreter frame against what GenerateOopMap computes. To do this, it needs to know if the state at the current bci represents the "before" state, meaning the bytecode will be reexecuted, or the "after" state, meaning we will advance to the next bytecode. The old code didn't know how to determine exactly what state we were in, so it checked both. This PR cleans that up, so we only have to compute the oopmap once. It also removes old SPARC support. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > fix optimized build @eme64 and @mhaessig, this is somewhat a followup to 8336906, so you may be the best candidates to volunteer to look at this :-) ------------- PR Comment: https://git.openjdk.org/jdk/pull/26121#issuecomment-3110272992 From duke at openjdk.org Wed Jul 23 23:34:10 2025 From: duke at openjdk.org (duke) Date: Wed, 23 Jul 2025 23:34:10 GMT Subject: Withdrawn: 8342095: Add autovectorizer support for subword vector casts In-Reply-To: References: Message-ID: On Mon, 3 Feb 2025 04:40:23 GMT, Jasmine Karthikeyan wrote: > Hi all, > This patch adds initial support for the autovectorizer to generate conversions between subword types. Currently, when superword sees two packs that have different basic types, it discards them and bails out of vectorization. This patch changes the behavior to ask the backend if a cast between the conflicting types is supported, and keeps the pack if it is. Later, when the `VTransform` graph is built, a synthetic cast is emitted when packs requiring casts are detected. Currently, only narrowing casts are supported as I wanted to re-use existing `VectorCastX2Y` logic for the initial version, but adding more conversions is simple and can be done with a subsequent RFE. I have attached a JMH benchmark and got these results on my Zen 3 machine: > > > Baseline Patch > Benchmark (SIZE) Mode Cnt Score Error Units Score Error Units Improvement > VectorSubword.intToByte 1024 avgt 12 200.049 ? 19.787 ns/op 56.228 ? 3.535 ns/op (3.56x) > VectorSubword.intToShort 1024 avgt 12 179.826 ? 1.539 ns/op 43.332 ? 1.166 ns/op (4.15x) > VectorSubword.shortToByte 1024 avgt 12 245.580 ? 6.150 ns/op 29.757 ? 1.055 ns/op (8.25x) > > > I've also added some IR tests and they pass on my linux x64 machine. Thoughts and reviews would be appreciated! This pull request has been closed without being integrated. 
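For readers skimming the withdrawn 8342095 change above, the kind of loop it targeted is an ordinary narrowing conversion like the sketch below, loosely mirroring the VectorSubword benchmark names; the method itself is illustrative and not taken from the patch:

    // A narrowing int -> byte loop. With subword cast support, superword can
    // keep the mixed-type packs and emit a vector cast instead of bailing out.
    static void intToByte(int[] src, byte[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = (byte) src[i];
        }
    }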
------------- PR: https://git.openjdk.org/jdk/pull/23413 From duke at openjdk.org Wed Jul 23 23:49:10 2025 From: duke at openjdk.org (duke) Date: Wed, 23 Jul 2025 23:49:10 GMT Subject: Withdrawn: 8341697: C2: Register allocation inefficiency in tight loop In-Reply-To: References: Message-ID: On Fri, 11 Oct 2024 15:50:20 GMT, Quan Anh Mai wrote: > Hi, > > This patch improves the spill placement in the presence of loops. Currently, when trying to spill a live range, we will create a `Phi` at the loop head, this `Phi` will then be spilt inside the loop body, and as the `Phi` is `UP` (lives in register) at the loop head, we need to emit an additional reload at the loop back-edge block. This introduces loop-carried dependencies, greatly reduces loop throughput. > > My proposal is to be aware of loop heads and try to eagerly spill or reload live ranges at the loop entries. In general, if a live range is spilt in the loop common path, then we should spill it in the loop entries and reload it at its use sites, this may increase the number of loads but will eliminate loop-carried dependencies, making the load latency-free. On the otherhand, if a live range is only spilt in the uncommon path but is used in the common path, then we should reload it eagerly. I think it is appropriate to bias towards spilling, i.e. if a live range is both spilt and reloaded in the common path, we spill it. This eliminates loop-carried dependencies. > > A downfall of this algorithm is that we may overspill, which means that after spilling some live ranges, the others do not need to be spilt anymore but are unnecessarily spilt. > > - A possible approach is to split the live ranges one-by-one and try to colour them afterwards. This seems prohibitively expensive. > - Another approach is to be aware of the number of registers that need spilling, sorting the live ones accordingly. > - Finally, we can eagerly split a live range at uncommon branches and do conservative coalescing afterwards. I think this is the most elegant and efficient solution for that. > > Please take a look and leave your reviews, thanks a lot. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/21472 From jbhateja at openjdk.org Thu Jul 24 00:41:02 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 24 Jul 2025 00:41:02 GMT Subject: RFR: 8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs [v2] In-Reply-To: References: <_G3VGE-OBobi6zHUwA3452t_6Z5O_ojTPI_t8Fdm__M=.097051f1-0478-447d-a46b-b6e6d6cd25e1@github.com> Message-ID: <1lFbFokLiW3pWxrvq8WtLoiXj-TkYFq2xk-cBtgHvhI=.ab2f363a-60a9-4d9b-aca4-82c770eb1cb2@github.com> On Mon, 21 Jul 2025 15:44:47 GMT, Jatin Bhateja wrote: >> Hi Jatin (@jatin-bhateja), for the first iteration, would it be ok to get the push_paired/pop_paired changes integrated and then make the push2p/pop2p related optimizations in a separate PR? >> >> Thanks, >> Vamsi > > Hi @vamsi-parasa , I think it's ok not to expose pop_ppx / push_ppx as separate interfaces, and let processor forward the values b/w push and matching pop if balancing constraints are satisfied. > > image > Hi Jatin (@jatin-bhateja), the reason to make the push_ppx/pop_ppx usage explicit is because an unbalanced push_ppx operation has a performance penalty. Thanks @vamsi-parasa , as per APX specification PPX is an optimization hint and should only improve performance if balancing contraintins are met. so I don't think it will have any performance penalty. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25889#discussion_r2227032048 From aoqi at openjdk.org Thu Jul 24 01:16:53 2025 From: aoqi at openjdk.org (Ao Qi) Date: Thu, 24 Jul 2025 01:16:53 GMT Subject: RFR: 8363895: Minimal build fails with slowdebug builds after JDK-8354887 [v2] In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 17:37:10 GMT, Vladimir Kozlov wrote: >> Ao Qi has updated the pull request incrementally with one additional commit since the last revision: >> >> missing macros for is_on_for_use() > > In Leyden premain branch we already have this. I forgot to port it to mainline. Thanks for the review, @vnkozlov and @shipilev . ------------- PR Comment: https://git.openjdk.org/jdk/pull/26436#issuecomment-3111628906 From duke at openjdk.org Thu Jul 24 01:16:54 2025 From: duke at openjdk.org (duke) Date: Thu, 24 Jul 2025 01:16:54 GMT Subject: RFR: 8363895: Minimal build fails with slowdebug builds after JDK-8354887 [v2] In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 06:33:39 GMT, Ao Qi wrote: >> Configure with `--with-jvm-variants=minimal --with-debug-level=slowdebug`. >> >> Error message: >> >> ... >> Compiling up to 136 files for BUILD_java.compiler.interim >> Creating support/modules_libs/java.base/minimal/libjvm.so from 628 file(s) >> /opt/rh/devtoolset-11/root/usr/libexec/gcc/x86_64-redhat-linux/11/ld: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/hotspot/variant-minimal/libjvm/objs/macroAssembler_x86.o: in function `AOTCodeCache::is_on_for_dump()': >> /home/aoqi/work/openjdk/jdk/src/hotspot/share/code/aotCodeCache.hpp:383: undefined reference to `AOTCodeCache::_cache' >> collect2: error: ld returned 1 exit status >> gmake[3]: *** [lib/CompileJvm.gmk:175: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/support/modules_libs/java.base/minimal/libjvm.so] Error 1 >> gmake[2]: *** [make/Main.gmk:242: hotspot-minimal-libs] Error 1 >> gmake[2]: *** Waiting for unfinished jobs.... >> >> >> `AOTCodeCache::is_on_for_dump()` is used in `macroAssembler_x86.cpp` but not defined when cds is disabled. > > Ao Qi has updated the pull request incrementally with one additional commit since the last revision: > > missing macros for is_on_for_use() @theaoqi Your change (at version dc329c366d9d76f5d123effe5530d490710df5e7) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26436#issuecomment-3111630759 From duke at openjdk.org Thu Jul 24 01:20:46 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 24 Jul 2025 01:20:46 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v40] In-Reply-To: References: Message-ID: > This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). > > When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. > > This does not modify existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality > > Additional Testing: > - [ ] Linux x64 fastdebug all > - [ ] Linux aarch64 fastdebug all > - [ ] ... 
Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: Use CompiledICLocker instead of CompiledIC_lock ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23573/files - new: https://git.openjdk.org/jdk/pull/23573/files/1b001df8..d4e3dd31 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=39 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23573&range=38-39 Stats: 19 lines in 3 files changed: 9 ins; 8 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/23573.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23573/head:pull/23573 PR: https://git.openjdk.org/jdk/pull/23573 From dzhang at openjdk.org Thu Jul 24 01:34:53 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 24 Jul 2025 01:34:53 GMT Subject: RFR: 8363898: RISC-V: TestRangeCheckHoistingScaledIV.java fails after JDK-8355293 when running without RVV In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 03:32:26 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java fails without RVV after [JDK-8355293](https://bugs.openjdk.org/browse/JDK-8355293) in fastdebug mode. > > In [JDK-8291669](https://bugs.openjdk.org/browse/JDK-8291669), which introduced this case, it is mentioned: >>Previously attached jtreg case fails on ppc64 because VectorAPI has no >>vector intrinsics on ppc64 so there's no long range check to hoist. In >>this patch, we limit the test architecture to x64 and AArch64. > > So we need RVV to use vector intrinsics on RISC-V. > > ### Test (fastdebug) > - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on k1 and sg2042 > - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on qemu-system w/ and w/o RVV Thanks all for the review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26437#issuecomment-3111652273 From duke at openjdk.org Thu Jul 24 01:34:53 2025 From: duke at openjdk.org (duke) Date: Thu, 24 Jul 2025 01:34:53 GMT Subject: RFR: 8363898: RISC-V: TestRangeCheckHoistingScaledIV.java fails after JDK-8355293 when running without RVV In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 03:32:26 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java fails without RVV after [JDK-8355293](https://bugs.openjdk.org/browse/JDK-8355293) in fastdebug mode. > > In [JDK-8291669](https://bugs.openjdk.org/browse/JDK-8291669), which introduced this case, it is mentioned: >>Previously attached jtreg case fails on ppc64 because VectorAPI has no >>vector intrinsics on ppc64 so there's no long range check to hoist. In >>this patch, we limit the test architecture to x64 and AArch64. > > So we need RVV to use vector intrinsics on RISC-V. > > ### Test (fastdebug) > - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on k1 and sg2042 > - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on qemu-system w/ and w/o RVV @DingliZhang Your change (at version f0f92003c0afd2da54e33ebbb1ca65a596fc056f) is now ready to be sponsored by a Committer. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26437#issuecomment-3111653889 From aoqi at openjdk.org Thu Jul 24 01:36:06 2025 From: aoqi at openjdk.org (Ao Qi) Date: Thu, 24 Jul 2025 01:36:06 GMT Subject: Integrated: 8363895: Minimal build fails with slowdebug builds after JDK-8354887 In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 02:16:46 GMT, Ao Qi wrote: > Configure with `--with-jvm-variants=minimal --with-debug-level=slowdebug`. > > Error message: > > ... > Compiling up to 136 files for BUILD_java.compiler.interim > Creating support/modules_libs/java.base/minimal/libjvm.so from 628 file(s) > /opt/rh/devtoolset-11/root/usr/libexec/gcc/x86_64-redhat-linux/11/ld: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/hotspot/variant-minimal/libjvm/objs/macroAssembler_x86.o: in function `AOTCodeCache::is_on_for_dump()': > /home/aoqi/work/openjdk/jdk/src/hotspot/share/code/aotCodeCache.hpp:383: undefined reference to `AOTCodeCache::_cache' > collect2: error: ld returned 1 exit status > gmake[3]: *** [lib/CompileJvm.gmk:175: /home/aoqi/work/openjdk/jdk/build/linux-x86_64-minimal-slowdebug/support/modules_libs/java.base/minimal/libjvm.so] Error 1 > gmake[2]: *** [make/Main.gmk:242: hotspot-minimal-libs] Error 1 > gmake[2]: *** Waiting for unfinished jobs.... > > > `AOTCodeCache::is_on_for_dump()` is used in `macroAssembler_x86.cpp` but not defined when cds is disabled. This pull request has now been integrated. Changeset: 2da0cdad Author: Ao Qi Committer: Vladimir Kozlov URL: https://git.openjdk.org/jdk/commit/2da0cdadb898efb9af827374368471102bfe0ccd Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8363895: Minimal build fails with slowdebug builds after JDK-8354887 Reviewed-by: kvn, shade ------------- PR: https://git.openjdk.org/jdk/pull/26436 From dzhang at openjdk.org Thu Jul 24 01:40:07 2025 From: dzhang at openjdk.org (Dingli Zhang) Date: Thu, 24 Jul 2025 01:40:07 GMT Subject: Integrated: 8363898: RISC-V: TestRangeCheckHoistingScaledIV.java fails after JDK-8355293 when running without RVV In-Reply-To: References: Message-ID: <3E1vKnICiR96AWrpvco-qR4t6nEdx4raClkTKrdZGG8=.2dd135d7-0886-43cc-902b-93508c47527a@github.com> On Wed, 23 Jul 2025 03:32:26 GMT, Dingli Zhang wrote: > Hi all, > Please take a look and review this PR, thanks! > > test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java fails without RVV after [JDK-8355293](https://bugs.openjdk.org/browse/JDK-8355293) in fastdebug mode. > > In [JDK-8291669](https://bugs.openjdk.org/browse/JDK-8291669), which introduced this case, it is mentioned: >>Previously attached jtreg case fails on ppc64 because VectorAPI has no >>vector intrinsics on ppc64 so there's no long range check to hoist. In >>this patch, we limit the test architecture to x64 and AArch64. > > So we need RVV to use vector intrinsics on RISC-V. > > ### Test (fastdebug) > - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on k1 and sg2042 > - [x] Run test/hotspot/jtreg/compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on qemu-system w/ and w/o RVV This pull request has now been integrated. 
Changeset: b746701e Author: Dingli Zhang Committer: Fei Yang URL: https://git.openjdk.org/jdk/commit/b746701e5769a7a5a1e7900ddfdd285706ac5fe1 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod 8363898: RISC-V: TestRangeCheckHoistingScaledIV.java fails after JDK-8355293 when running without RVV Reviewed-by: fyang, mli, syan ------------- PR: https://git.openjdk.org/jdk/pull/26437 From qxing at openjdk.org Thu Jul 24 02:13:02 2025 From: qxing at openjdk.org (Qizheng Xing) Date: Thu, 24 Jul 2025 02:13:02 GMT Subject: RFR: 8360192: C2: Make the type of count leading/trailing zero nodes more precise [v5] In-Reply-To: References: Message-ID: <2oRnOjBC-_tOqT53pt6ozG3ENpv7CLsA4HBt5RqP3PY=.ec27d42f-e41e-4a55-91c5-977c2211c666@github.com> On Wed, 23 Jul 2025 09:31:18 GMT, Qizheng Xing wrote: >> The result of count leading/trailing zeros is always non-negative, and the maximum value is integer type's size in bits. In previous versions, when C2 can not know the operand value of a CLZ/CTZ node at compile time, it will generate a full-width integer type for its result. This can significantly affect the efficiency of code in some cases. >> >> This patch makes the type of CLZ/CTZ nodes more precise, to make C2 generate better code. For example, the following implementation runs ~115% faster on x86-64 with this patch: >> >> >> public static int numberOfNibbles(int i) { >> int mag = Integer.SIZE - Integer.numberOfLeadingZeros(i); >> return Math.max((mag + 3) / 4, 1); >> } >> >> >> Testing: tier1, IR test > > Qizheng Xing has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - Merge branch 'master' into enhance-clz-type > - Move `TestCountBitsRange` to `compiler.c2.gvn` > - Fix null checks > - Narrow type bound > - Use `BitsPerX` constant instead of `sizeof` > - Make the type of count leading/trailing zero nodes more precise Hi all, This patch has now passed all GHA tests and is ready for further reviews. If there are any other suggestions for this PR, please let me know. Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25928#issuecomment-3111718442 From fjiang at openjdk.org Thu Jul 24 02:24:58 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Thu, 24 Jul 2025 02:24:58 GMT Subject: RFR: 8362838: RISC-V: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 02:22:49 GMT, Fei Yang wrote: >> Same as [JDK-8361892](https://bugs.openjdk.org/browse/JDK-8361892), but for riscv. >> >> Testing: >> - [x] Tier1-3 & hotspot:tier4 on linux-riscv64 > > Look good to me. Thanks for fixing this. @RealFYang @yadongw -- Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26318#issuecomment-3111737114 From fjiang at openjdk.org Thu Jul 24 02:24:59 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Thu, 24 Jul 2025 02:24:59 GMT Subject: Integrated: 8362838: RISC-V: Incorrect matching rule leading to improper oop instruction encoding In-Reply-To: References: Message-ID: <_HGmi-o7Hq1tbGCGLEODQbkx78no47FeMUI7GHEslOg=.489223b2-4211-4e0c-af7a-ef286056e8e4@github.com> On Tue, 15 Jul 2025 14:02:08 GMT, Feilong Jiang wrote: > Same as [JDK-8361892](https://bugs.openjdk.org/browse/JDK-8361892), but for riscv. > > Testing: > - [x] Tier1-3 & hotspot:tier4 on linux-riscv64 This pull request has now been integrated. 
Changeset: 0ba2942c Author: Feilong Jiang URL: https://git.openjdk.org/jdk/commit/0ba2942c6e7aadc3d091c40f6bd8d9f7502f5f76 Stats: 31 lines in 1 file changed: 0 ins; 31 del; 0 mod 8362838: RISC-V: Incorrect matching rule leading to improper oop instruction encoding Reviewed-by: fyang, yadongwang ------------- PR: https://git.openjdk.org/jdk/pull/26318 From duke at openjdk.org Thu Jul 24 02:56:43 2025 From: duke at openjdk.org (erifan) Date: Thu, 24 Jul 2025 02:56:43 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v5] In-Reply-To: References: Message-ID: > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is > relative smaller than that of `fromLong`. So this patch does the conversion for these cases. > > The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. > > Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. > > This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. > > As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like > > VectorMaskToLong (VectorLongToMask x) => x > > > Hence, this patch also added the following optimizations: > > VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 > VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 > > VectorMaskCast (VectorMaskCast x) => x > > And we can see noticeable performance improvement with the above optimizations for floating-point types. > > Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: > > Benchmark Unit Before Error After Error Uplift > microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 > microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 > microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 > microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 > microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 > microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 > microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 > microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 > > > Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: > > Benchmark Unit Before Error After Error Uplift > microMaskFromLongToLong_Double... erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains eight additional commits since the last revision: - Add JMH benchmarks for cast chain transformation - Merge branch 'master' into JDK-8356760 - Refactor the implementation Do the convertion in C2's IGVN phase to cover more cases. - Merge branch 'master' into JDK-8356760 - Simplify the test code - Address some review comments Add support for the following patterns: toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) toLong(maskAll(false)) => 0 And add more test cases. - Merge branch 'master' into JDK-8356760 - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is relative smaller than that of `fromLong`. This patch does the conversion for these cases if `l` is a compile time constant. And this conversion also enables further optimizations that recognize maskAll patterns, see [1]. Some JTReg test cases are added to ensure the optimization is effective. I tried many different ways to write a JMH benchmark, but failed. Since the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific compile-time constant, the statement will be hoisted out of the loop. If we don't use a loop, the hotspot will become other instructions, and no obvious performance change was observed. However, combined with the optimization of [1], we can observe a performance improvement of about 7% on both aarch64 and x64. The patch was tested on both aarch64 and x64, all of tier1 tier2 and tier3 tests passed. [1] https://github.com/openjdk/jdk/pull/24674 ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25793/files - new: https://git.openjdk.org/jdk/pull/25793/files/8ebe5e56..6ae43e17 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=03-04 Stats: 11783 lines in 334 files changed: 9323 ins; 800 del; 1660 mod Patch: https://git.openjdk.org/jdk/pull/25793.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25793/head:pull/25793 PR: https://git.openjdk.org/jdk/pull/25793 From duke at openjdk.org Thu Jul 24 03:41:54 2025 From: duke at openjdk.org (erifan) Date: Thu, 24 Jul 2025 03:41:54 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v3] In-Reply-To: <7zxjUTJq9ynYRau4UpWaFcARH8cp8Xka3cJovCwGVRY=.2bcd9dc6-a9df-47f2-8834-bc6c4a8469cf@github.com> References: <3cr8Njt2flaQXy5sjOVOlhI9XDkEesagnYLwzCmgkoI=.089494aa-d622-47db-8d23-c9637519028c@github.com> <7zxjUTJq9ynYRau4UpWaFcARH8cp8Xka3cJovCwGVRY=.2bcd9dc6-a9df-47f2-8834-bc6c4a8469cf@github.com> Message-ID: On Fri, 18 Jul 2025 03:14:58 GMT, Jatin Bhateja wrote: >>> > > > public static final VectorSpecies FSP = FloatVector.SPECIES_512; >>> > > > public static long micro1(long a) { >>> > > > long mask = Math.min(-1, Math.max(-1, a)); >>> > > > return VectorMask.fromLong(FSP, mask).toLong(); >>> > > > } >>> > > > public static long micro2() { >>> > > > return FSP.maskAll(true).toLong(); >>> > > > } >>> > > >>> > > >>> > > With this JMH method we can not see obvious performance improvement, because the hot spots are other instructions. Adding a loop is better. >>> > >>> > >>> > There is no hard and fast rule for the inclusion of a loop in a JMH micro in that case? >>> >>> You mean adding a loop is not a block, right ? >> >> Yes. If you see gains without loop go for it. 
> >> As @jatin-bhateja suggested, I have refactored the implementation and updated the commit message. please help review this PR, thanks! > > Thanks a lot @erifan , I am out for the rest of the week, will re-review early next week. @jatin-bhateja I have addressed your comments, would you mind take another look, thanks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3111848754 From duke at openjdk.org Thu Jul 24 03:41:55 2025 From: duke at openjdk.org (erifan) Date: Thu, 24 Jul 2025 03:41:55 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4] In-Reply-To: References: <-ZNeXOcmEACkhP4QKXKnWWEiT6ucjPY7Zz1HqvMeAoI=.c8fae49e-fcb0-41fb-84d1-4aa52ee83790@github.com> <1RlpmwLAF5ALeZQRS_DAqixgD6MUno5cUbguqHTlUU0=.6a594ecc-b8f7-489d-b801-a41e87d1deeb@github.com> Message-ID: On Tue, 22 Jul 2025 09:04:56 GMT, erifan wrote: >>> Do you mean this check `Matcher::match_rule_supported_vector(opc, vlen, maskall_bt)` ? I think it's necessary ? Because in theory some platforms don't support both `MaskAll` and `Replicate`. Of course, this situation may not exist in reality. If `MaskAll` and `Replicate` are not supported, then `VectorLongToMask` should not be supported either, and this function will not be called. >> >> My suggestion was to check for Op_Replicate here as Op_MaskAll is already checked underneath VectorNode::scalar2vector under an assumption that MaskAll is a special case for replicate applicable to masks > > Oh I misunderstood what you meant, now I understand, thank you! Done, thanks! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2227253339 From duke at openjdk.org Thu Jul 24 03:41:56 2025 From: duke at openjdk.org (erifan) Date: Thu, 24 Jul 2025 03:41:56 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4] In-Reply-To: <-ZNeXOcmEACkhP4QKXKnWWEiT6ucjPY7Zz1HqvMeAoI=.c8fae49e-fcb0-41fb-84d1-4aa52ee83790@github.com> References: <-ZNeXOcmEACkhP4QKXKnWWEiT6ucjPY7Zz1HqvMeAoI=.c8fae49e-fcb0-41fb-84d1-4aa52ee83790@github.com> Message-ID: On Tue, 22 Jul 2025 03:18:19 GMT, erifan wrote: >> test/micro/org/openjdk/bench/jdk/incubator/vector/MaskFromLongToLongBenchmark.java line 34: >> >>> 32: @Fork(value = 1, jvmArgs = {"--add-modules=jdk.incubator.vector"}) >>> 33: public class MaskFromLongToLongBenchmark { >>> 34: private static final int ITERATION = 10000; >> >> It will be nice to add a synthetic micro for cast chain transform added along with this patch. following micro shows around 1.5x gains on AVX2 system because of widening cast elision. 
>> >> >> import jdk.incubator.vector.*; >> import java.util.stream.IntStream; >> >> public class mask_cast_chain { >> public static final VectorSpecies FSP = FloatVector.SPECIES_128; >> >> public static long micro(float [] src1, float [] src2, int ctr) { >> long res = 0; >> for (int i = 0; i < FSP.loopBound(src1.length); i += FSP.length()) { >> res += FloatVector.fromArray(FSP, src1, i) >> .compare(VectorOperators.GE, FloatVector.fromArray(FSP, src2, i)) >> .cast(DoubleVector.SPECIES_256) >> .cast(FloatVector.SPECIES_128) >> .toLong(); >> } >> return res * ctr; >> } >> >> public static void main(String [] args) { >> float [] src1 = new float[1024]; >> float [] src2 = new float[1024]; >> >> IntStream.range(0, src1.length).forEach(i -> {src1[i] = (float)i;}); >> IntStream.range(0, src2.length).forEach(i -> {src2[i] = (float)500;}); >> >> long res = 0; >> for (int i = 0; i < 100000; i++) { >> res += micro(src1, src2, i); >> } >> long t1 = System.currentTimeMillis(); >> for (int i = 0; i < 100000; i++) { >> res += micro(src1, src2, i); >> } >> long t2 = System.currentTimeMillis(); >> System.out.println("[time] " + (t2 - t1) + "ms" + " [res] " + res); >> } >> } > > Ok~ Added some JMH benchmarks, the code is slightly different with your code. Test results show that on my avx2 system, there are ~17% performance improvement for applicable cases. No performance change on avx3 system because `cast` is lowered as empty. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2227250664 From duke at openjdk.org Thu Jul 24 03:45:00 2025 From: duke at openjdk.org (erifan) Date: Thu, 24 Jul 2025 03:45:00 GMT Subject: RFR: 8354242: VectorAPI: combine vector not operation with compare [v7] In-Reply-To: <15TW6hiffz65NhHevPefL_6swSC07UD-GwiJ4tPDtFs=.b83081df-8abd-4756-b4e0-1d969678a0d2@github.com> References: <15TW6hiffz65NhHevPefL_6swSC07UD-GwiJ4tPDtFs=.b83081df-8abd-4756-b4e0-1d969678a0d2@github.com> Message-ID: On Thu, 5 Jun 2025 11:05:48 GMT, Emanuel Peter wrote: >>> > FYI: `BoolTest::negate` already does what you want: `mask negate( ) const { return mask(_test^4); }` I think you should use that instead :) >>> >>> Indeed, I hadn't noticed that, thank you. >> >> Oh I think we still cannot use `BoolTest::negate`, because we cannot instantiate a `BoolTest` object with **unsigned** comparison. `BoolTest::negate` is a non-static function. > >> Oh I think we still cannot use `BoolTest::negate`, because we cannot instantiate a `BoolTest` object with **unsigned** comparison. `BoolTest::negate` is a non-static function. > > I see. Ok. Hmm. I still think that the logic should be in `BoolTest`, because that is where the exact implementation of the enum values is. In that context it is easier to see why `^4` does the negation. And imagine we were ever to change the enum values, then it would be harder to find your code and fix it. > > Maybe it could be called `BoolTest::negate_mask(mast btm)` and explain in a comment that both signed and unsigned is supported. Hi @eme64 @jatin-bhateja , could you please take a look at this patch? Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/24674#issuecomment-3111851722 From galder at openjdk.org Thu Jul 24 06:51:36 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Thu, 24 Jul 2025 06:51:36 GMT Subject: RFR: 8354244: Use random data in MinMaxRed_Long data arrays Message-ID: Simplified the data used in the tests added in [JDK-8307513](https://bugs.openjdk.org/browse/JDK-8307513). 
The data does not need to have a specific shape because this test focuses on verifying the IR when vectorization kicks in, and when it does, the data can just be random. Shaping the data to control branch taken/not-taken paths makes sense when CMov macro expansion kicks in instead of vectorization. When switching to random data I noticed that the test was randomly failing. This was due to potential overflows that result from takin the min/max and then multiplying it by 11, so I've adjusted that section of the test as well. I've run the test on both aarch64 and x64 platforms where this test would get vectorized. To verify that I made sure the test passed and verified that the jtr output to make sure the IR conditions were matched. ------------- Commit messages: - Simplify test Changes: https://git.openjdk.org/jdk/pull/26451/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26451&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8354244 Stats: 84 lines in 1 file changed: 11 ins; 62 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/26451.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26451/head:pull/26451 PR: https://git.openjdk.org/jdk/pull/26451 From dfenacci at openjdk.org Thu Jul 24 07:12:55 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Thu, 24 Jul 2025 07:12:55 GMT Subject: RFR: 8358696: Assert with extreme values for -XX:BciProfileWidth [v4] In-Reply-To: <_preMnRE0tqL476Pb8bPPfkixInRa-ZH5Qom7W70AW4=.a71e36da-d0e1-44e4-a3fe-9091460b813f@github.com> References: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> <_preMnRE0tqL476Pb8bPPfkixInRa-ZH5Qom7W70AW4=.a71e36da-d0e1-44e4-a3fe-9091460b813f@github.com> Message-ID: <51aYnCiXel-vz4Zu40K08E1lyBtX5JXD8PXoCr5wWUE=.15def8e4-f7c3-42ae-976e-f79ed7415bfa@github.com> On Wed, 23 Jul 2025 15:18:39 GMT, Saranya Natarajan wrote: >> **Issue** >> Extreme values for BciProfileWidth flag such as `java -XX:BciProfileWidth=-1 -version` and `java -XX:BciProfileWidth=100000 -version `results in assert failure `assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. This is observed in a x86 machine. >> >> **Analysis** >> On debugging the issue, I found that increasing the size of the interpreter using the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` prevented the above mentioned assert from failing for large values of BciProfileWidth. >> >> **Proposal** >> Considering the fact that larger BciProfileWidth results in slower profiling, I have proposed a range between 0 to 5000 to restrict the value for BciProfileWidth for x86 machines. This maximum value is based on modifying the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` using the smallest `InterpreterCodeSize` for all the architectures. As for the lower bound, a value of -1 would be the same as 0, as this simply means no return bci's will be recorded in ret profile. >> >> **Issue in AArch64** >> Additionally running the command `java -XX:BciProfileWidth= 10000 -version` (or larger values) results in a different failure `assert(offset_ok_for_immed(offset(), size)) failed: must be, was: 32768, 3` on an AArch64 machine.This is an issue of maximum offset for `ldr/str` in AArch64 which can be fixed using `form_address` as mentioned in [JDK-8342736](https://bugs.openjdk.org/browse/JDK-8342736). In my preliminary fix using `form_address` on AArch64 machine. 
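A tiny sketch of the overflow mentioned above; the helper name is illustrative and not taken from the test:

    // The reduction multiplies the running max by 11. For random 64-bit inputs
    // this wraps as soon as the max exceeds Long.MAX_VALUE / 11, which is why
    // that part of the test needed adjusting once the data became random.
    static long scaledMax(long a, long b) {
        return 11 * Math.max(a, b);
    }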
I had to modify 3 `ldr` and 1 `str` instruction (in file `src/hotspot/cpu/aarch64/interp_masm_aarch64.cpp` at line number 926, 983, and 997). With this fix using `form_address`, `BciProfileWidth` works for maximum of 5000 after which it crashes with`assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. Without this fix `BciProfileWidth` works for a maximum value of 1300. Currently, I have suggested to restrict the upper bound on AArch64 to 1000 instead of fixing it with `form_address`. >> >> **Question to reviewers** >> Do you think this is a reasonable fix ? For AArch64 do you suggest fixing using `form_address` ? If yes, do I fix it under this PR or create another one ? > > Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: > > fixing copyright Thanks for fixing this @sarannat. I left a couple of inline comments. src/hotspot/share/runtime/globals.hpp line 1356: > 1354: develop(int, BciProfileWidth, 2, \ > 1355: "Number of return bci's to record in ret profile") \ > 1356: range(0, AARCH64_ONLY(1000) NOT_AARCH64(5000)) \ I'm not too sure of the usual number of returns but even just 1000 sounds quite big as maximum. Do you think we could use that for all architectures? test/lib-test/jdk/test/whitebox/vm_flags/IntxTest.java line 39: > 37: public class IntxTest { > 38: private static final String FLAG_NAME = "OnStackReplacePercentage"; > 39: private static final String FLAG_DEBUG_NAME = "BciProfileWidth"; Maybe we might want use another `intx` flag instead of just removing this (just to keep testing the WhiteBox) ------------- PR Review: https://git.openjdk.org/jdk/pull/26139#pullrequestreview-3050359574 PR Review Comment: https://git.openjdk.org/jdk/pull/26139#discussion_r2227627961 PR Review Comment: https://git.openjdk.org/jdk/pull/26139#discussion_r2227611611 From bkilambi at openjdk.org Thu Jul 24 07:40:02 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Thu, 24 Jul 2025 07:40:02 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: On Tue, 22 Jul 2025 09:57:30 GMT, erifan wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Addressed review comments to half the number of match rules > > src/hotspot/cpu/aarch64/aarch64.ad line 923: > >> 921: V24, V24_H, V24_J, V24_K >> 922: ); >> 923: > > Not a big matter, but it looks better to me if you can move this change `after line 810` of this file. Thanks but I feel having all the vector classes (like for vecA, vecX etc) together would be better and keeping the reg_class definitions with other reg_class feels better to me. Hope that's ok? > src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 5181: > >> 5179: %}')dnl >> 5180: dnl >> 5181: > > Remove this blank otherwise two blank lines will be generated. See `src/hotspot/cpu/aarch64/aarch64_vector.ad` line 7180 and line 7181 Hi @erifan Thanks for the comment. This is a good catch. Will update patch soon. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2227688969 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2227686648 From bkilambi at openjdk.org Thu Jul 24 07:44:09 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Thu, 24 Jul 2025 07:44:09 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: <0kEjeM7HkOm_kbmaY13VChiaRlLLQH9KCh45Hw9B2us=.abf6e115-5ce7-4a72-bbdf-ec958aca7cc3@github.com> On Tue, 22 Jul 2025 10:01:02 GMT, erifan wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Addressed review comments to half the number of match rules > > src/hotspot/cpu/aarch64/aarch64.ad line 5091: > >> 5089: format %{ %} >> 5090: interface(REG_INTER); >> 5091: %} > > Ditto, I tend to moving this change `after line 5101` of this file. Same explanation here. the registers are matching `vReg` which can be either `vecA`, `vecD` or `vecX` and thus I placed the operand definitions right after the `vReg` definition. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2227696114 From duke at openjdk.org Thu Jul 24 08:18:02 2025 From: duke at openjdk.org (erifan) Date: Thu, 24 Jul 2025 08:18:02 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v15] In-Reply-To: References: Message-ID: <4WxJs9VADTHxqICEzhByWO1YAbt5AGsfIXsAI3BFZfU=.7c75ecea-d5aa-4a58-8862-5aa771469201@github.com> On Thu, 24 Jul 2025 07:37:46 GMT, Bhavana Kilambi wrote: >> src/hotspot/cpu/aarch64/aarch64.ad line 923: >> >>> 921: V24, V24_H, V24_J, V24_K >>> 922: ); >>> 923: >> >> Not a big matter, but it looks better to me if you can move this change `after line 810` of this file. > > Thanks but I feel having all the vector classes (like for vecA, vecX etc) together would be better and keeping the reg_class definitions with other reg_class feels better to me. Hope that's ok? ACK >> src/hotspot/cpu/aarch64/aarch64.ad line 5091: >> >>> 5089: format %{ %} >>> 5090: interface(REG_INTER); >>> 5091: %} >> >> Ditto, I tend to moving this change `after line 5101` of this file. > > Same explanation here. the registers are matching `vReg` which can be either `vecA`, `vecD` or `vecX` and thus I placed the operand definitions right after the `vReg` definition. ACK ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2227776401 PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2227777777 From chagedorn at openjdk.org Thu Jul 24 08:29:07 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 24 Jul 2025 08:29:07 GMT Subject: RFR: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? In-Reply-To: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> References: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> Message-ID: On Tue, 22 Jul 2025 08:25:08 GMT, Roland Westrelin wrote: > A node in a pre loop only has uses out of the loop dominated by the > loop exit. `PhaseIdealLoop::try_sink_out_of_loop()` sets its control > to the loop exit projection. A range check in the main loop has this > node as input (through a chain of some other nodes). 
Range check > elimination needs to update the exit condition of the pre loop with an > expression that depends on the node pinned on its exit: that's > impossible and the assert fires. This is a variant of 8314024 (this > one was for a node with uses out of the pre loop on multiple paths). I > propose the same fix: leave the node with control in the pre loop in > this case. Some small comments, otherwise, looks good! I launched some testing. Somehow forgot to submit my review from 2 days ago... src/hotspot/share/opto/loopopts.cpp line 1929: > 1927: // Sinking a node from a pre loop to its main loop pins the node between the pre and main loops. If that node is input > 1928: // to a check that's eliminated by range check elimination, it becomes input to an expression that feeds into the exit > 1929: // test of the pre loop above the point in the graph where it's pinned. I suggest to move it up as a method comment: Suggestion: // Sinking a node from a pre loop to its main loop pins the node between the pre and main loops. If that node is input // to a check that's eliminated by range check elimination, it becomes input to an expression that feeds into the exit // test of the pre loop above the point in the graph where it's pinned. bool PhaseIdealLoop::would_sink_below_pre_loop_exit(IdealLoopTree* n_loop, Node* ctrl) { test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java line 30: > 28: * > 29: * @run main/othervm -XX:CompileCommand=compileonly,*TestSunkRangeFromPreLoopRCE2*::* -Xbatch TestSunkRangeFromPreLoopRCE2 > 30: * Suggestion: test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java line 63: > 61: } > 62: } > 63: Suggestion: test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE3.java line 31: > 29: * > 30: * @run main/othervm -XX:-BackgroundCompilation -XX:LoopUnrollLimit=100 -XX:-UseLoopPredicate -XX:-UseProfiledLoopPredicate TestSunkRangeFromPreLoopRCE3 > 31: * Suggestion: ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26424#pullrequestreview-3042110108 PR Review Comment: https://git.openjdk.org/jdk/pull/26424#discussion_r2221904998 PR Review Comment: https://git.openjdk.org/jdk/pull/26424#discussion_r2221907804 PR Review Comment: https://git.openjdk.org/jdk/pull/26424#discussion_r2221901405 PR Review Comment: https://git.openjdk.org/jdk/pull/26424#discussion_r2221909155 From chagedorn at openjdk.org Thu Jul 24 08:39:01 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 24 Jul 2025 08:39:01 GMT Subject: RFR: 8354244: Use random data in MinMaxRed_Long data arrays In-Reply-To: References: Message-ID: On Thu, 24 Jul 2025 06:45:59 GMT, Galder Zamarre?o wrote: > Simplified the data used in the tests added in [JDK-8307513](https://bugs.openjdk.org/browse/JDK-8307513). The data does not need to have a specific shape because this test focuses on verifying the IR when vectorization kicks in, and when it does, the data can just be random. Shaping the data to control branch taken/not-taken paths makes sense when CMov macro expansion kicks in instead of vectorization. > > When switching to random data I noticed that the test was randomly failing. This was due to potential overflows that result from takin the min/max and then multiplying it by 11, so I've adjusted that section of the test as well. > > I've run the test on both aarch64 and x64 platforms where this test would get vectorized. 
To verify that I made sure the test passed and verified that the jtr output to make sure the IR conditions were matched. Nice clean-up by using the `Generators`. Looks good to me! Let me submit some testing with the updated test only. ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26451#pullrequestreview-3050667024 From roland at openjdk.org Thu Jul 24 08:41:52 2025 From: roland at openjdk.org (Roland Westrelin) Date: Thu, 24 Jul 2025 08:41:52 GMT Subject: RFR: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? [v2] In-Reply-To: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> References: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> Message-ID: > A node in a pre loop only has uses out of the loop dominated by the > loop exit. `PhaseIdealLoop::try_sink_out_of_loop()` sets its control > to the loop exit projection. A range check in the main loop has this > node as input (through a chain of some other nodes). Range check > elimination needs to update the exit condition of the pre loop with an > expression that depends on the node pinned on its exit: that's > impossible and the assert fires. This is a variant of 8314024 (this > one was for a node with uses out of the pre loop on multiple paths). I > propose the same fix: leave the node with control in the pre loop in > this case. Roland Westrelin has updated the pull request incrementally with four additional commits since the last revision: - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE3.java Co-authored-by: Christian Hagedorn - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java Co-authored-by: Christian Hagedorn - Update src/hotspot/share/opto/loopopts.cpp Co-authored-by: Christian Hagedorn - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26424/files - new: https://git.openjdk.org/jdk/pull/26424/files/8abae076..2140c98d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26424&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26424&range=00-01 Stats: 9 lines in 3 files changed: 3 ins; 6 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26424.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26424/head:pull/26424 PR: https://git.openjdk.org/jdk/pull/26424 From aph at openjdk.org Thu Jul 24 08:42:17 2025 From: aph at openjdk.org (Andrew Haley) Date: Thu, 24 Jul 2025 08:42:17 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v16] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 11:09:04 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. 
The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Refine comments in c2_MacroAssembler_aarch64.cpp src/hotspot/cpu/aarch64/aarch64_vector.ad line 261: > 259: > 260: // Because the SVE2 "tbl" instruction is unpredicated and partial operations cannot be generated > 261: // using masks, we currently disable this operation on machines where length_in_bytes < Suggestion: // using masks, we disable this operation on machines where length_in_bytes < ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2227850978 From aph at openjdk.org Thu Jul 24 08:45:09 2025 From: aph at openjdk.org (Andrew Haley) Date: Thu, 24 Jul 2025 08:45:09 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v16] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 11:09:04 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. 
>> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Refine comments in c2_MacroAssembler_aarch64.cpp src/hotspot/cpu/aarch64/aarch64_vector.ad line 256: > 254: // the default VectorRearrange + VectorBlend is generated as the performance of the default > 255: // implementation was slightly better/similar than the implementation for SelectFromTwoVector. > 256: if (UseSVE < 2 && (type2aelembytes(bt) == 8 || length_in_bytes > 16)) { Suggestion: // This operation is disabled for doubles and longs on machines with SVE < 2 and instead // the default VectorRearrange + VectorBlend is generated because the performance of the default // implementation was better than or equal to the implementation for SelectFromTwoVector. if (UseSVE < 2 && (type2aelembytes(bt) == 8 || length_in_bytes > 16)) { ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2227858836 From thartmann at openjdk.org Thu Jul 24 09:03:05 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 24 Jul 2025 09:03:05 GMT Subject: RFR: 8363357: Remove unused flag VerifyAdapterCalls In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 07:46:58 GMT, Marc Chevalier wrote: > It seems that the flag VerifyAdapterCalls is unused since [JDK-8350209](https://bugs.openjdk.org/browse/JDK-8350209), so pretty recently. > > Let's remove it, very direct, no trick. Looks good to me too. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26440#pullrequestreview-3050754843 From mhaessig at openjdk.org Thu Jul 24 09:16:07 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Thu, 24 Jul 2025 09:16:07 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v2] In-Reply-To: References: Message-ID: On Fri, 4 Jul 2025 02:33:33 GMT, Dean Long wrote: >> The VerifyStack logic in Deoptimization::unpack_frames() attempts to check the expression stack size of the interpreter frame against what GenerateOopMap computes. To do this, it needs to know if the state at the current bci represents the "before" state, meaning the bytecode will be reexecuted, or the "after" state, meaning we will advance to the next bytecode. The old code didn't know how to determine exactly what state we were in, so it checked both. This PR cleans that up, so we only have to compute the oopmap once. It also removes old SPARC support. 
> > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > fix optimized build Thank you for this enhancement, @dean-long! I made a first pass to try and understand the logic, but ended up only commenting on cosmetics. I'll do a second pass next week. src/hotspot/share/runtime/deoptimization.cpp line 847: > 845: > 846: #ifndef PRODUCT > 847: #ifdef ASSERT Why is both `NOT_PRODUCT` and `ASSERT` needed here? So far, I thought that `ASSERT` implies `NOT_PRODUCT`. src/hotspot/share/runtime/deoptimization.cpp line 939: > 937: bool is_top_frame = true; > 938: int callee_size_of_parameters = 0; > 939: for (int i = 0; i < cur_array->frames(); i++) { I would suggest renaming `i`to `frame_idx` because there is one usage 50 lines down that would be much more clear with a more verbose name. src/hotspot/share/runtime/deoptimization.cpp line 947: > 945: > 946: // Get the oop map for this bci > 947: InterpreterOopMap mask; Perhaps you could move that down to line 977. It would just be one less variable to keep track. src/hotspot/share/runtime/deoptimization.cpp line 950: > 948: int cur_invoke_parameter_size = 0; > 949: int top_frame_expression_stack_adjustment = 0; > 950: int bci = iframe->interpreter_frame_bci(); `bci` is only used in the `BytecodeStream` constructor below. I would suggest to just call `iframe->interpreter_frame_bci()` in the constructor and forego the variable definition. src/hotspot/share/runtime/deoptimization.cpp line 999: > 997: (iframe_expr_size == mask.expression_stack_size() + callee_size_of_parameters) > 998: )))) > 999: { Suggestion: int iframe_expr_size = iframe->interpreter_frame_expression_stack_size(); int expr_stack_size_before = iframe_expr_size + (is_top_frame ? top_frame_expression_stack_adjustment : 0); if (!((is_top_frame && exec_mode == Unpack_exception && iframe_expr_size == 0) || (reexecute ? (expr_stack_size_before == mask.expression_stack_size() + cur_invoke_parameter_size) : (iframe_expr_size == mask.expression_stack_size() + callee_size_of_parameters) ))) { These parentheses can be simplified a bit. src/hotspot/share/runtime/vframeArray.cpp line 195: > 193: Bytecodes::Code code = Bytecodes::code_at(method(), bcp); > 194: assert(!Interpreter::bytecode_should_reexecute(code), "should_reexecute mismatch"); > 195: } This might be a candidate for `#ifdef ASSERT`? Suggestion: #ifdef ASSERT if (!reexec) { address bcp = method()->bcp_from(bci()); Bytecodes::Code code = Bytecodes::code_at(method(), bcp); assert(!Interpreter::bytecode_should_reexecute(code), "should_reexecute mismatch"); } #endif src/hotspot/share/runtime/vframeArray.cpp line 239: > 237: pc = Interpreter::deopt_reexecute_entry(method(), bcp); > 238: } > 239: assert(reexecute, "must be"); This assert is a bit redundant with the condition on this branch and `reexecute` not being assigned to. Suggestion: src/hotspot/share/runtime/vframeArray.hpp line 1: > 1: /* Please update the copyright year in this file. ------------- Changes requested by mhaessig (Committer). 
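As background for the NOT_PRODUCT vs. ASSERT question above, a minimal sketch of how the two guards usually differ across build flavors (comments are illustrative only; the "optimized" flavor is where the two diverge):

    #ifndef PRODUCT
      // compiled into debug, fastdebug and "optimized" builds:
      // diagnostic and printing support stays available even when asserts are off
    #endif

    #ifdef ASSERT
      // compiled into debug and fastdebug builds only:
      // assert(...)-based verification is stripped from "optimized" builds
    #endif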
PR Review: https://git.openjdk.org/jdk/pull/26121#pullrequestreview-3050547887 PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2227775314 PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2227853547 PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2227821569 PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2227832230 PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2227790942 PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2227879608 PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2227895145 PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2227887220 From mhaessig at openjdk.org Thu Jul 24 09:16:08 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Thu, 24 Jul 2025 09:16:08 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v2] In-Reply-To: References: Message-ID: On Thu, 24 Jul 2025 08:13:40 GMT, Manuel H?ssig wrote: >> Dean Long has updated the pull request incrementally with one additional commit since the last revision: >> >> fix optimized build > > src/hotspot/share/runtime/deoptimization.cpp line 999: > >> 997: (iframe_expr_size == mask.expression_stack_size() + callee_size_of_parameters) >> 998: )))) >> 999: { > > Suggestion: > > int iframe_expr_size = iframe->interpreter_frame_expression_stack_size(); > int expr_stack_size_before = iframe_expr_size + (is_top_frame ? top_frame_expression_stack_adjustment : 0); > > if (!((is_top_frame && exec_mode == Unpack_exception && iframe_expr_size == 0) || > (reexecute ? > (expr_stack_size_before == mask.expression_stack_size() + cur_invoke_parameter_size) : > (iframe_expr_size == mask.expression_stack_size() + callee_size_of_parameters) > ))) { > > These parentheses can be simplified a bit. Suggestion: int iframe_expr_ssize = iframe->interpreter_frame_expression_stack_size(); int map_expr_invoke_ssize = mask.expression_stack_size() + cur_invoke_parameter_size; int expr_ssize_before = iframe_expr_ssize + (is_top_frame ? top_frame_expression_stack_adjustment : 0); int map_expr_callee_ssize = mask.expression_stack_size() + callee_size_of_parameters; if (!((is_top_frame && exec_mode == Unpack_exception && iframe_expr_ssize == 0) || (reexecute ? expr_ssize_before == map_expr_invoke_ssize : iframe_expr_ssize == map_expr_callee_ssize) )) { Personally, I would write something like this, but feel free to disregard. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2227794608 From mchevalier at openjdk.org Thu Jul 24 09:25:00 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 24 Jul 2025 09:25:00 GMT Subject: RFR: 8363357: Remove unused flag VerifyAdapterCalls In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 07:46:58 GMT, Marc Chevalier wrote: > It seems that the flag VerifyAdapterCalls is unused since [JDK-8350209](https://bugs.openjdk.org/browse/JDK-8350209), so pretty recently. > > Let's remove it, very direct, no trick. Thanks @chhagedorn & @TobiHartmann! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26440#issuecomment-3112717716 From mchevalier at openjdk.org Thu Jul 24 09:25:00 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 24 Jul 2025 09:25:00 GMT Subject: Integrated: 8363357: Remove unused flag VerifyAdapterCalls In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 07:46:58 GMT, Marc Chevalier wrote: > It seems that the flag VerifyAdapterCalls is unused since [JDK-8350209](https://bugs.openjdk.org/browse/JDK-8350209), so pretty recently. > > Let's remove it, very direct, no trick. This pull request has now been integrated. Changeset: 67e93281 Author: Marc Chevalier URL: https://git.openjdk.org/jdk/commit/67e93281a4f9e76419f1d6e05099ecf2214ebbfd Stats: 3 lines in 1 file changed: 0 ins; 3 del; 0 mod 8363357: Remove unused flag VerifyAdapterCalls Reviewed-by: chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/26440 From mhaessig at openjdk.org Thu Jul 24 09:29:55 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Thu, 24 Jul 2025 09:29:55 GMT Subject: RFR: 8354244: Use random data in MinMaxRed_Long data arrays In-Reply-To: References: Message-ID: <12f24F5wkFZHU2Y-lhvDKeyyBUy-DDzQVrYK5djx5AI=.fef6b9d3-d761-4429-9e57-c2829b24f59f@github.com> On Thu, 24 Jul 2025 06:45:59 GMT, Galder Zamarre?o wrote: > Simplified the data used in the tests added in [JDK-8307513](https://bugs.openjdk.org/browse/JDK-8307513). The data does not need to have a specific shape because this test focuses on verifying the IR when vectorization kicks in, and when it does, the data can just be random. Shaping the data to control branch taken/not-taken paths makes sense when CMov macro expansion kicks in instead of vectorization. > > When switching to random data I noticed that the test was randomly failing. This was due to potential overflows that result from takin the min/max and then multiplying it by 11, so I've adjusted that section of the test as well. > > I've run the test on both aarch64 and x64 platforms where this test would get vectorized. To verify that I made sure the test passed and verified that the jtr output to make sure the IR conditions were matched. Thank you for this nice simplification, @galderz! It looks good to me as well. ------------- Marked as reviewed by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/26451#pullrequestreview-3050855728 From bkilambi at openjdk.org Thu Jul 24 09:32:06 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Thu, 24 Jul 2025 09:32:06 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v16] In-Reply-To: References: Message-ID: On Thu, 24 Jul 2025 08:38:52 GMT, Andrew Haley wrote: >> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: >> >> Refine comments in c2_MacroAssembler_aarch64.cpp > > src/hotspot/cpu/aarch64/aarch64_vector.ad line 261: > >> 259: >> 260: // Because the SVE2 "tbl" instruction is unpredicated and partial operations cannot be generated >> 261: // using masks, we currently disable this operation on machines where length_in_bytes < > > Suggestion: > > // using masks, we disable this operation on machines where length_in_bytes < Thanks. 
I use "currently" here because even now we can add support for cases where `length_in_bytes < MaxVectorSize && length_in_bytes > 8` but we currently do not have machines with SVE2 enabled and `length_in_bytes < MaxVectorSize && length_in_bytes > 8` for example - we do not have an SVE2 machine with `MaxVectorSize = 256` to test this operation for `length_in_bytes = 128` but we can add that support in future if such machines become available for testing. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2227979550 From eastigeevich at openjdk.org Thu Jul 24 10:05:57 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Thu, 24 Jul 2025 10:05:57 GMT Subject: RFR: 8359963: compiler/c2/aarch64/TestStaticCallStub.java fails with for code cache > 250MB the static call stub is expected to be implemented using far branch In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 15:24:42 GMT, Mikhail Ablakatov wrote: > The test assumed that hsdis is always available which is not the case. Make the test accept and scan either real or pseudo disassembly. test/hotspot/jtreg/compiler/c2/aarch64/TestStaticCallStub.java line 319: > 317: ProcessBuilder pb = ProcessTools.createLimitedTestJavaProcessBuilder(procArgs); > 318: OutputAnalyzer output = new OutputAnalyzer(pb.start()); > 319: System.out.println(output.getOutput()); I think it is worth to have `output.shouldHaveExitValue(0);`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26047#discussion_r2228065539 From eastigeevich at openjdk.org Thu Jul 24 10:21:54 2025 From: eastigeevich at openjdk.org (Evgeny Astigeevich) Date: Thu, 24 Jul 2025 10:21:54 GMT Subject: RFR: 8359963: compiler/c2/aarch64/TestStaticCallStub.java fails with for code cache > 250MB the static call stub is expected to be implemented using far branch In-Reply-To: References: Message-ID: On Mon, 30 Jun 2025 15:24:42 GMT, Mikhail Ablakatov wrote: > The test assumed that hsdis is always available which is not the case. Make the test accept and scan either real or pseudo disassembly. test/hotspot/jtreg/compiler/c2/aarch64/TestStaticCallStub.java line 272: > 270: while (itr.hasNext() && extracted.size() < n) { > 271: int left = n - extracted.size(); > 272: extractOpcodeOrBytecodes(itr.next()).stream().limit(left).forEach(extracted::add); You can detect whether you have hex codes or disassembly. See https://github.com/openjdk/jdk/blob/a86dd56de34f730b42593236f17118ef5ce4985a/test/hotspot/jtreg/compiler/onSpinWait/TestOnSpinWaitAArch64.java#L122 In TestOnSpinWaitAArch64 you can see how the checking code is organized not to depend on the instruction representation. IMO you don't need the Instruction class hierarchy. You need `nearStaticCallOpcodeSeq` and `farStaticCallOpcodeSeq` to be filled either with opcodes or hex codes. Of course their names need to be changed to something like `nearStaticCallInstSeq`. You will need to change `extractOpcodesN` and `extractOpcode` to `extractInstructionsN` and `extractInstruction`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26047#discussion_r2228099896 From galder at openjdk.org Thu Jul 24 10:34:38 2025 From: galder at openjdk.org (Galder =?UTF-8?B?WmFtYXJyZcOxbw==?=) Date: Thu, 24 Jul 2025 10:34:38 GMT Subject: RFR: 8329077: C2 SuperWord: Add MoveD2L, MoveL2D, MoveF2I, MoveI2F Message-ID: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com> I've added support to vectorize `MoveD2L`, `MoveL2D`, `MoveF2I` and `MoveI2F` nodes. 
The implementation follows a similar pattern to what is done with conversion (`Conv*`) nodes. The tests in `TestCompatibleUseDefTypeSize` have been updated with the new expectations. Also added a JMH benchmark which measures throughput (the higher the number the better) for methods that exercise these nodes. On darwin/aarch64 it shows: Benchmark (seed) (size) Mode Cnt Base Patch Units Diff VectorBitConversion.doubleToLongBits 0 2048 thrpt 8 1168.782 1157.717 ops/ms -1% VectorBitConversion.doubleToRawLongBits 0 2048 thrpt 8 3999.387 7353.936 ops/ms +83% VectorBitConversion.floatToIntBits 0 2048 thrpt 8 1200.338 1188.206 ops/ms -1% VectorBitConversion.floatToRawIntBits 0 2048 thrpt 8 4058.248 14792.474 ops/ms +264% VectorBitConversion.intBitsToFloat 0 2048 thrpt 8 3050.313 14984.246 ops/ms +391% VectorBitConversion.longBitsToDouble 0 2048 thrpt 8 3022.691 7379.360 ops/ms +144% The improvements observed are a result of vectorization. The lack of vectorization in `doubleToLongBits` and `floatToIntBits` demonstrates that these changes do not affect their performance. These methods do not vectorize because of flow control. I've run the tier1-3 tests on linux/aarch64 and didn't observe any regressions. ------------- Commit messages: - Removed unnecessary assert methods - Adjust IR test after adding Move* vector support - Delete IR test because it's already covered by other test - Merge branch 'master' into topic.fp-bits-vector - Add longBitsToDouble and intBitsToFloat - Fix test for vectorized and add floatToRawIntBits - Add basic IR test - Add JMH benchmark for doubleTo*LongBits - Support doubleToRawLongBits - add floatToIntBits benchmark - ... and 4 more: https://git.openjdk.org/jdk/compare/c68697e1...b6ec784e Changes: https://git.openjdk.org/jdk/pull/26457/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26457&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8329077 Stats: 164 lines in 7 files changed: 153 ins; 4 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/26457.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26457/head:pull/26457 PR: https://git.openjdk.org/jdk/pull/26457 From duke at openjdk.org Thu Jul 24 10:48:55 2025 From: duke at openjdk.org (Samuel Chee) Date: Thu, 24 Jul 2025 10:48:55 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet [v2] In-Reply-To: References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: On Thu, 17 Jul 2025 14:31:18 GMT, Andrew Haley wrote: >> src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 1487: >> >>> 1485: if(!UseLSE) { >>> 1486: __ membar(__ AnyAny); >>> 1487: } >> >> Suggestion: >> >> if(!UseLSE) { >> // Prevent a later volatile load from being reordered with the STLXR in cmpxchg. >> __ membar(__ StoreLoad); >> } > > I wonder if it might be a good idea to add a `trailingDMB` boolean argument to `cmpxchg` and `atomic_##NAME` instead. @theRealAph coincidentally, I have been looking at `MacroAssembler::cmpxchgw` and `MacroAssembler::cmpxchgptr` recently, and it appears their trailing DMBs may also be unnecessary. I have been unable to find any particular use patterns which relies on the existence of these trailing dmbs, so it does not seem necessary to add the trailingDMB option. Although would like to hear your thoughts on the issue. 
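For concreteness, the shape of the option discussed (and argued against) above would be roughly the following; the helper name and its parameters are assumptions for illustration, not existing code in macroAssembler_aarch64:

    // Sketch only: let each caller decide whether the trailing barrier is emitted.
    void emit_cmpxchg_with_optional_barrier(MacroAssembler* masm, bool trailing_dmb) {
      // ... emit the existing LSE or LL/SC compare-and-swap sequence here ...
      if (trailing_dmb && !UseLSE) {
        // only callers that still rely on the trailing barrier pay for it
        masm->membar(Assembler::StoreLoad);
      }
    }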
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26000#discussion_r2228158946 From aph at openjdk.org Thu Jul 24 10:55:03 2025 From: aph at openjdk.org (Andrew Haley) Date: Thu, 24 Jul 2025 10:55:03 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v16] In-Reply-To: References: Message-ID: <8iU5uANNkE7vz5W5c47kFrh0mjZzvzbIRfI41QOKTrk=.3a657c40-c2c2-4e29-8364-352a240f06a5@github.com> On Thu, 24 Jul 2025 09:29:43 GMT, Bhavana Kilambi wrote: >> src/hotspot/cpu/aarch64/aarch64_vector.ad line 261: >> >>> 259: >>> 260: // Because the SVE2 "tbl" instruction is unpredicated and partial operations cannot be generated >>> 261: // using masks, we currently disable this operation on machines where length_in_bytes < >> >> Suggestion: >> >> // using masks, we disable this operation on machines where length_in_bytes < > > Thanks. I use "currently" here because even now we can add support for cases where `length_in_bytes < MaxVectorSize && length_in_bytes > 8` but we currently do not have machines with SVE2 enabled and `length_in_bytes < MaxVectorSize && length_in_bytes > 8` for example - we do not have an SVE2 machine with `MaxVectorSize = 256` to test this operation for `length_in_bytes = 128` but we can add that support in future if such machines become available for testing. Sure, but "currently" doesn't help the reader to understand that. If you want to say we don't support this because at the time of writing we don't have machines with SVE2 enabled and length_in_bytes >128 so we can't test it, then say so. Explicitly. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2228171020 From mli at openjdk.org Thu Jul 24 11:25:37 2025 From: mli at openjdk.org (Hamlin Li) Date: Thu, 24 Jul 2025 11:25:37 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v4] In-Reply-To: References: Message-ID: > Hi, > Can you help to review this patch? > > By https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp#L1270, there are far call, indirect call, reloc call. > NativeFarCall is in fact a reloc call, the name is confusing, better to rename it to RelocCall. > Also add some comments and do some other simple cleanup. > > Thanks! Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: remove NativeFarCall/RelocCall ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26370/files - new: https://git.openjdk.org/jdk/pull/26370/files/f72db245..37820220 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26370&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26370&range=02-03 Stats: 133 lines in 2 files changed: 11 ins; 96 del; 26 mod Patch: https://git.openjdk.org/jdk/pull/26370.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26370/head:pull/26370 PR: https://git.openjdk.org/jdk/pull/26370 From ghan at openjdk.org Thu Jul 24 15:16:31 2025 From: ghan at openjdk.org (Guanqiang Han) Date: Thu, 24 Jul 2025 15:16:31 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" Message-ID: I'm able to consistently reproduce the problem using the following command line and test program ? 
java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java import java.util.Arrays; public class Test{ public static void main(String[] args) { System.out.println("begin"); byte[] arr1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; byte[] arr2 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; System.out.println(Arrays.equals(arr1, arr2)); System.out.println("end"); } } >From my analysis, the root cause appears to be a mismatch in operand handling between T_ADDRESS and T_LONG in LIR_Assembler::stack2reg, especially when the source is marked as double stack (e.g., T_LONG) and the destination as single CPU register (e.g., T_ADDRESS), leading to assertion failures like assert(is_single_stack())(because T_LONG is double_size). In the test program above , the call chain is: Arrays.equals ? ArraysSupport.vectorizedMismatch ? LIRGenerator::do_vectorizedMismatch Within the do_vectorizedMismatch() method, a move instruction constructs an LIR_Op1. During LIR to machine code generation, LIR_Assembler::stack2reg was called. In this case, the src operand has type T_LONG and the dst operand has type T_ADDRESS. This combination triggers an assert in stack2reg, due to a mismatch between the stack slot type and register type handling. Importantly, this path ( LIR_Assembler::stack2reg was called ) is only taken when src is forced onto the stack. To reliably trigger this condition, the test is run with the -Xcomp option to force compilation and increase register pressure. A reference to the relevant code paths is provided below : image1 image2 On 64-bit platforms, although T_ADDRESS is classified as single_size, it is in fact 64 bits wide ,represent a single 64-bit general-purpose register and it can hold a T_LONG value, which is also 64 bits. However, T_LONG is defined as double_size, requiring two local variable slots or a pair of registers in the JVM's abstract model. This mismatch stems from the fact that T_ADDRESS is platform-dependent: it's 32 bits on 32-bit platforms, and 64 bits on 64-bit platforms ? yet its size classification remains single_size regardless. This classification leads to mismatches in internal logic, such as when T_LONG values are moved into a T_ADDRESS-typed destination on 64-bit platforms, despite both being 64-bit wide. I believe this discrepancy is largely historical, originating from the need to support both 32-bit and 64-bit architectures. Given that most modern platforms are 64-bit, I propose to simplify and clarify this handling by allowing T_ADDRESS to accept T_LONG data during move operations when targeting 64-bit platforms. So i suggest relaxing the type checks for platform-dependent types such as T_ADDRESS and T_METADATA. 
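To make the size mismatch concrete, here is a small illustrative sketch (assumed helper behavior, not the proposed patch) of the two ways a pointer-sized temporary can be allocated in C1's LIRGenerator:

    // Illustration only, not the actual fix.
    LIR_Opr t1 = new_pointer_register();   // on 64-bit platforms this yields a T_LONG operand:
                                           // double_size, so it spills to a pair of stack slots
    LIR_Opr t2 = new_register(T_ADDRESS);  // single_size: one 64-bit register, one stack slot

    // If t1 is spilled under register pressure (which -Xcomp makes likely) and is then moved
    // into a T_ADDRESS destination, LIR_Assembler::stack2reg sees a double-stack source paired
    // with a single-register destination, which is the combination rejected by
    // assert(is_single_stack() && !is_virtual(), "type check").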
------------- Commit messages: - 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" Changes: https://git.openjdk.org/jdk/pull/26462/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26462&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8359235 Stats: 2 lines in 2 files changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26462.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26462/head:pull/26462 PR: https://git.openjdk.org/jdk/pull/26462 From tschatzl at openjdk.org Thu Jul 24 15:29:56 2025 From: tschatzl at openjdk.org (Thomas Schatzl) Date: Thu, 24 Jul 2025 15:29:56 GMT Subject: RFR: 8361376: Regressions 1-6% in several Renaissance in 26-b4 only MacOSX aarch64 In-Reply-To: References: Message-ID: <9OGIoq7EaDQnBhnzMzX3sHGq99xQHthjkG4xbxvSDzc=.9d5ed8b8-62bf-4f4e-b0dc-3cfb3193afca@github.com> On Sat, 19 Jul 2025 01:39:12 GMT, Dean Long wrote: > This PR removes the recently added lock around set_guard_value, using instead Atomic::cmpxchg to atomically update bit-fields of the guard value. Further, it takes a fast-path that uses the previous direct store when at a safepoint. Combined, these changes should get us back to almost where we were before in terms of overhead. If necessary, we could go even further and allow make_not_entrant() to perform a direct byte store, leaving 24 bits for the guard value. Afaics the `NMethodEntryBarrier_lock` declaration/definition can also be removed. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26399#issuecomment-3113913189 From duke at openjdk.org Thu Jul 24 18:40:07 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 24 Jul 2025 18:40:07 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v40] In-Reply-To: References: Message-ID: On Wed, 12 Mar 2025 18:49:38 GMT, Vladimir Kozlov wrote: >> Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: >> >> Use CompiledICLocker instead of CompiledIC_lock > > src/hotspot/share/code/nmethod.cpp line 1514: > >> 1512: >> 1513: // Copy all nmethod data outside of header >> 1514: memcpy(content_begin(), nm.content_begin(), nm.size() - nm.header_size()); > > You would not need it if you `memcpy` whole nmethod. Decided not to use `memcpy` for the time being https://github.com/openjdk/jdk/pull/23573#discussion_r2220591570 > src/hotspot/share/code/nmethod.cpp line 1595: > >> 1593: } >> 1594: >> 1595: bool nmethod::is_relocatable() const { > > Native nmethods should be skipped too. May be also check `is_in_use()`. `is_relocatable()` was updated to check `is_java_method()` and `is_in_use()` ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229264958 PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229269758 From dlong at openjdk.org Thu Jul 24 18:43:53 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 24 Jul 2025 18:43:53 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" In-Reply-To: References: Message-ID: On Thu, 24 Jul 2025 15:10:37 GMT, Guanqiang Han wrote: > I'm able to consistently reproduce the problem using the following command line and test program ? 
> > java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java > > import java.util.Arrays; > public class Test{ > public static void main(String[] args) { > System.out.println("begin"); > byte[] arr1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > byte[] arr2 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > System.out.println(Arrays.equals(arr1, arr2)); > System.out.println("end"); > } > } > > From my analysis, the root cause appears to be a mismatch in operand handling between T_ADDRESS and T_LONG in LIR_Assembler::stack2reg, especially when the source is marked as double stack (e.g., T_LONG) and the destination as single CPU register (e.g., T_ADDRESS), leading to assertion failures like assert(is_single_stack())(because T_LONG is double_size). > > In the test program above , the call chain is: Arrays.equals ? ArraysSupport.vectorizedMismatch ? LIRGenerator::do_vectorizedMismatch > Within the do_vectorizedMismatch() method, a move instruction constructs an LIR_Op1. During LIR to machine code generation, LIR_Assembler::stack2reg was called. > > In this case, the src operand has type T_LONG and the dst operand has type T_ADDRESS. This combination triggers an assert in stack2reg, due to a mismatch between the stack slot type and register type handling. > > Importantly, this path ( LIR_Assembler::stack2reg was called ) is only taken when src is forced onto the stack. To reliably trigger this condition, the test is run with the -Xcomp option to force compilation and increase register pressure. > > A reference to the relevant code paths is provided below : > image1 > image2 > > On 64-bit platforms, although T_ADDRESS is classified as single_size, it is in fact 64 bits wide ,represent a single 64-bit general-purpose register and it can hold a T_LONG value, which is also 64 bits. > > However, T_LONG is defined as double_size, requiring two local variable slots or a pair of registers in the JVM's abstract model. This mismatch stems from the fact that T_ADDRESS is platform-dependent: it's 32 bits on 32-bit platforms, and 64 bits on 64-bit platforms ? yet its size classification remains single_size regardless. > > This classification... I think it is good to detect mismatches between T_LONG and T_ADDRESS, so I'd rather not relax the checks. Why not fix do_vectorizedMismatch() to use new_register(T_ADDRESS)? And maybe file a separate RFE to cleanup this confusion that new_pointer_register() causes. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26462#issuecomment-3114475833 From duke at openjdk.org Thu Jul 24 18:51:12 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 24 Jul 2025 18:51:12 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v4] In-Reply-To: References: <3C9M1LWF86Hjsu8s3SbJLBpP5HfI3BHOkhid2SHFqVw=.6195af54-50c3-480b-8994-df7c317ac3bc@github.com> Message-ID: On Mon, 17 Mar 2025 22:18:15 GMT, Vladimir Kozlov wrote: >> Yes, we need to update call sites. Should we replace all resolved calls with calls to `resolve_*_call` blobs? >> Actually `clean_if_nmethod_is_unloaded()` do that. May be we indeed need to call `nmethod::cleanup_inline_caches_impl()` but without VM operation. > > We need to do that only for new copy of nmethod and not for old. `clear_inline_caches()` is called for the new copy. 
A safe point is not required because the code is not installed and therefore not executing ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229298345 From dlong at openjdk.org Thu Jul 24 18:51:22 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 24 Jul 2025 18:51:22 GMT Subject: RFR: 8361376: Regressions 1-6% in several Renaissance in 26-b4 only MacOSX aarch64 [v2] In-Reply-To: References: Message-ID: > This PR removes the recently added lock around set_guard_value, using instead Atomic::cmpxchg to atomically update bit-fields of the guard value. Further, it takes a fast-path that uses the previous direct store when at a safepoint. Combined, these changes should get us back to almost where we were before in terms of overhead. If necessary, we could go even further and allow make_not_entrant() to perform a direct byte store, leaving 24 bits for the guard value. Dean Long has updated the pull request incrementally with one additional commit since the last revision: remove NMethodEntryBarrier_lock ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26399/files - new: https://git.openjdk.org/jdk/pull/26399/files/ecc6e68e..e05605eb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26399&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26399&range=00-01 Stats: 4 lines in 2 files changed: 0 ins; 4 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26399.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26399/head:pull/26399 PR: https://git.openjdk.org/jdk/pull/26399 From duke at openjdk.org Thu Jul 24 18:56:05 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 24 Jul 2025 18:56:05 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v13] In-Reply-To: References: <1_OG6abFZwl9AWbxsm5eCrL6RWq1wTnPngdDky6V3f8=.3a126cd0-22c0-45bb-9f85-3f096de116d6@github.com> Message-ID: On Thu, 8 May 2025 19:59:01 GMT, Chad Rakoczy wrote: >> Actually the issue is not during code buffer expansion. It's called when creating a new nmethod that I can only get to occur when using the Graal compiler. So it may not be true that calls always have trampolines in the case of Graal. This _fix_ may just make the bug harder to encounter > > For debug builds Hotspot uses the 2M range to determine if there should be a trampoline or not for a call. Graal uses 128M regardless of debug or release builds. This means that Graal compiled methods may not have trampolines but this check will expect them too. I reverted this change as it just means there is a difference on how Graal and Hotspot determine max branch range This change was reverted ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229309026 From duke at openjdk.org Thu Jul 24 18:56:06 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 24 Jul 2025 18:56:06 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v13] In-Reply-To: References: <-Y_yLYd2I_ZJk0kcEF8ZqyVKtI0OiXT0WFtvHLhUWJU=.be99c2c6-e27d-448d-910d-ee5dfc97e12d@github.com> Message-ID: On Fri, 25 Apr 2025 18:04:37 GMT, Chad Rakoczy wrote: >> @chadrako, what issue are you trying to fix with the code? 
> > After relocation it is possible that the call can no longer reach the destination without calling the trampoline The implemented solution for this issue is to allow trampoline relocations to fix their owners instead of modifying the call relocation logic ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229308367 From duke at openjdk.org Thu Jul 24 18:56:07 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 24 Jul 2025 18:56:07 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v13] In-Reply-To: References: Message-ID: On Mon, 28 Apr 2025 19:28:33 GMT, Chad Rakoczy wrote: >> src/hotspot/share/code/relocInfo.cpp line 379: >> >>> 377: } else { >>> 378: // Reassert the callee address, this time in the new copy of the code. >>> 379: pd_set_call_destination(callee); >> >> if (src->contains(callee)) { >> // ... >> int offset = pointer_delta_as_int(callee, orig_addr); >> callee = addr() + offset; >> } >> pd_set_call_destination(callee); > > I'll use this refactor to remove the else but can't use `pointer_delta_as_int` as it only works for positive offsets This change is no longer needed. Trampolines are responsible for fixing their owners ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229311317 From duke at openjdk.org Thu Jul 24 19:00:11 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 24 Jul 2025 19:00:11 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v15] In-Reply-To: References: Message-ID: On Wed, 14 May 2025 15:58:53 GMT, Tom Rodriguez wrote: >> I assume that's copied from `JVMCINMethodData::invalidate_nmethod_mirror` which was updated in https://github.com/openjdk/jdk/commit/f81c192da929d72be5134ccf195be2a985737504. The description for [JDK-8234359](https://bugs.openjdk.org/browse/JDK-8234359) implies that this somehow avoids enqueuing potentially dead objects to the SATB buffer. Is that what we want here @tkrodriguez ? > > It should be passing true here as we are not in the middle of a GC so it should be alive and valid. This change was removed as JVMCI nmethods with a mirror are currently excluded ([JDK-8357926](https://bugs.openjdk.org/browse/JDK-8357926)) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229319404 From duke at openjdk.org Thu Jul 24 19:00:12 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 24 Jul 2025 19:00:12 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v15] In-Reply-To: References: Message-ID: On Wed, 28 May 2025 17:37:54 GMT, Tom Rodriguez wrote: >> src/hotspot/share/jvmci/jvmciRuntime.cpp line 858: >> >>> 856: >>> 857: JVMCIEnv* jvmciEnv = nullptr; >>> 858: HotSpotJVMCI::InstalledCode::set_address(jvmciEnv, nmethod_mirror, (jlong)(nm)); >> >> What's the sync story here? Any lock protecting this? If not, I wonder if readers are okay with inconsistencies. I haven't checked. > > In the current implementation the fields of InstalledCode are initialized to valid values from the nmethod* during code installation. Those fields only ever transition to 0 as part of nmethod invalidation. Hosted methods may read `InstalledCode.entryPoint` and dispatch to it if it's non-null. So a transition of these values should be safe if they moved from a non-null value to another non-null value and the existing nmethod stayed alive until the next safepoint in the normal nmethod reclamation cycle. 
Currently writes to those fields by the VM are done in make_not_entrant or at a safepoint so we might want to perform more explicit locking to support transfer of these values. > > We might consider revisiting the design of InstalledCode itself now that Graal is aligned with the JDK. Backward compatibility precluded that in the past. That might simplify the whole thing. This change was removed as JVMCI nmethods with a mirror are currently excluded ([JDK-8357926](https://bugs.openjdk.org/browse/JDK-8357926)) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229321074 From duke at openjdk.org Thu Jul 24 19:05:10 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 24 Jul 2025 19:05:10 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v35] In-Reply-To: References: Message-ID: On Sun, 13 Jul 2025 09:36:48 GMT, Andrew Haley wrote: >> Chad Rakoczy has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 95 commits: >> >> - Typo >> - Merge branch 'master' into JDK-8316694-Final >> - Update justification for skipping CallRelocation >> - Enclose ImmutableDataReferencesCounterSize in parentheses >> - Let trampolines fix their owners >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Update how call sites are fixed >> - Merge remote-tracking branch 'origin/master' into JDK-8316694-Final >> - Fix pointer printing >> - Use set_destination_mt_safe >> - ... and 85 more: https://git.openjdk.org/jdk/compare/117f0b40...66d73c16 > > src/hotspot/share/code/nmethod.cpp line 1392: > >> 1390: >> 1391: >> 1392: nmethod::nmethod(nmethod* nm) : CodeBlob(nm->_name, nm->_kind, nm->_size, nm->_header_size) > > Should this be a copy constructor? > > nmethod::nmethod(const nmethod &nm) : CodeBlob(nm._name, nm._kind, nm._size, nm._header_size) > > Even if we don't make it a copy constructor, it looks like its nmethod argument should be `const`, but I haven't checked very deeply. The constructor was updated to be a copy constructor ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229328235 From duke at openjdk.org Thu Jul 24 19:05:11 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 24 Jul 2025 19:05:11 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v40] In-Reply-To: References: Message-ID: <1QJ2fDZMIMhn7vXB8_1Gpg_ZG2aFkuzWgmV9hmvI_E0=.e9f232f9-89e6-4c9b-a713-3cc016be0520@github.com> On Thu, 13 Mar 2025 13:54:43 GMT, Evgeny Astigeevich wrote: >> src/hotspot/share/code/nmethod.cpp line 1396: >> >>> 1394: } >>> 1395: >>> 1396: nmethod::nmethod(nmethod& nm) : CodeBlob(nm.name(), CodeBlobKind::Nmethod, nm.size(), nm.header_size()) >> >> Should this be `clone()` method instead of constructor. Then you will not need `new()`. 
> > +1 Decided not to use `memcpy` for the time being https://github.com/openjdk/jdk/pull/23573#discussion_r2220591570 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229331986 From dlong at openjdk.org Thu Jul 24 19:05:55 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 24 Jul 2025 19:05:55 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v2] In-Reply-To: References: Message-ID: On Thu, 24 Jul 2025 08:07:42 GMT, Manuel H?ssig wrote: >> Dean Long has updated the pull request incrementally with one additional commit since the last revision: >> >> fix optimized build > > src/hotspot/share/runtime/deoptimization.cpp line 847: > >> 845: >> 846: #ifndef PRODUCT >> 847: #ifdef ASSERT > > Why is both `NOT_PRODUCT` and `ASSERT` needed here? So far, I thought that `ASSERT` implies `NOT_PRODUCT`. Unfortunately, they are not the same, thanks to "optimized" builds. We can clean this up if optimizes builds get removed. See https://bugs.openjdk.org/browse/JDK-8183287. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2229334193 From duke at openjdk.org Thu Jul 24 19:11:11 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 24 Jul 2025 19:11:11 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v38] In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 22:29:22 GMT, Vladimir Kozlov wrote: >> Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: >> >> Require caller to hold locks > > src/hotspot/share/code/codeBehaviours.cpp line 46: > >> 44: bool DefaultICProtectionBehaviour::is_safe(nmethod* method) { >> 45: return SafepointSynchronize::is_at_safepoint() || CompiledIC_lock->owned_by_self() || method->is_not_installed(); >> 46: } > > Can you rename `method` to `nm` as we call it in similar code in GCs? This has been updated > src/hotspot/share/code/nmethod.cpp line 1630: > >> 1628: if (!is_java_method()) { >> 1629: return false; >> 1630: } > > This should be first check. This has been fixed > src/hotspot/share/code/nmethod.cpp line 2453: > >> 2451: // Free memory if this is the last nmethod referencing immutable data >> 2452: if (get_immutable_data_references_counter() == 1) { >> 2453: os::free(_immutable_data); > > You should add assert(get_immutable_data_references_counter() > 0 before `if (counter == 1)` > and zero it when freed. 
This has been fixed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229341201 PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229345858 PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229346439 From duke at openjdk.org Thu Jul 24 19:11:12 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 24 Jul 2025 19:11:12 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v32] In-Reply-To: <3f1UnDuYp2iYVcciKF-BqdChOOY2PJJG5R0QuyfblVM=.37a92dfd-5de4-4924-83c5-f9c2e5d7548c@github.com> References: <73AnlXOv0T8K25DgsNdH1PkBjcBXz0f3bBYZx44LpAw=.439f5383-ffd1-44e8-9e11-4b5af9b6a278@github.com> <3f1UnDuYp2iYVcciKF-BqdChOOY2PJJG5R0QuyfblVM=.37a92dfd-5de4-4924-83c5-f9c2e5d7548c@github.com> Message-ID: On Wed, 2 Jul 2025 20:47:44 GMT, Chad Rakoczy wrote: >> test/hotspot/jtreg/vmTestbase/nsk/jvmti/NMethodRelocation/nmethodrelocation.java line 37: >> >>> 35: import jdk.test.whitebox.code.BlobType; >>> 36: >>> 37: public class nmethodrelocation extends DebugeeClass { >> >> Why is the class name not following the Java code conventions? > > I was following the naming conventions of other JVMTI tests. > https://github.com/openjdk/jdk/tree/master/test/hotspot/jtreg/vmTestbase/nsk/jvmti Is the name of this test acceptable? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229340005 From dlong at openjdk.org Thu Jul 24 19:14:11 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 24 Jul 2025 19:14:11 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v3] In-Reply-To: References: Message-ID: > The VerifyStack logic in Deoptimization::unpack_frames() attempts to check the expression stack size of the interpreter frame against what GenerateOopMap computes. To do this, it needs to know if the state at the current bci represents the "before" state, meaning the bytecode will be reexecuted, or the "after" state, meaning we will advance to the next bytecode. The old code didn't know how to determine exactly what state we were in, so it checked both. This PR cleans that up, so we only have to compute the oopmap once. It also removes old SPARC support. Dean Long has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/share/runtime/deoptimization.cpp Co-authored-by: Manuel H?ssig ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26121/files - new: https://git.openjdk.org/jdk/pull/26121/files/5b7d4bca..6042a475 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26121&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26121&range=01-02 Stats: 5 lines in 1 file changed: 0 ins; 1 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/26121.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26121/head:pull/26121 PR: https://git.openjdk.org/jdk/pull/26121 From dlong at openjdk.org Thu Jul 24 19:25:42 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 24 Jul 2025 19:25:42 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v4] In-Reply-To: References: Message-ID: > The VerifyStack logic in Deoptimization::unpack_frames() attempts to check the expression stack size of the interpreter frame against what GenerateOopMap computes. To do this, it needs to know if the state at the current bci represents the "before" state, meaning the bytecode will be reexecuted, or the "after" state, meaning we will advance to the next bytecode. 
The old code didn't know how to determine exactly what state we were in, so it checked both. This PR cleans that up, so we only have to compute the oopmap once. It also removes old SPARC support. Dean Long has updated the pull request incrementally with one additional commit since the last revision: better name for frame index ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26121/files - new: https://git.openjdk.org/jdk/pull/26121/files/6042a475..c93ca9e1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26121&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26121&range=02-03 Stats: 5 lines in 1 file changed: 1 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/26121.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26121/head:pull/26121 PR: https://git.openjdk.org/jdk/pull/26121 From dlong at openjdk.org Thu Jul 24 19:34:15 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 24 Jul 2025 19:34:15 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v5] In-Reply-To: References: Message-ID: > The VerifyStack logic in Deoptimization::unpack_frames() attempts to check the expression stack size of the interpreter frame against what GenerateOopMap computes. To do this, it needs to know if the state at the current bci represents the "before" state, meaning the bytecode will be reexecuted, or the "after" state, meaning we will advance to the next bytecode. The old code didn't know how to determine exactly what state we were in, so it checked both. This PR cleans that up, so we only have to compute the oopmap once. It also removes old SPARC support. Dean Long has updated the pull request incrementally with two additional commits since the last revision: - Update src/hotspot/share/runtime/vframeArray.cpp Co-authored-by: Manuel H?ssig - Update src/hotspot/share/runtime/vframeArray.cpp Co-authored-by: Manuel H?ssig ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26121/files - new: https://git.openjdk.org/jdk/pull/26121/files/c93ca9e1..54919be0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26121&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26121&range=03-04 Stats: 3 lines in 1 file changed: 2 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26121.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26121/head:pull/26121 PR: https://git.openjdk.org/jdk/pull/26121 From dlong at openjdk.org Thu Jul 24 19:38:43 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 24 Jul 2025 19:38:43 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v6] In-Reply-To: References: Message-ID: > The VerifyStack logic in Deoptimization::unpack_frames() attempts to check the expression stack size of the interpreter frame against what GenerateOopMap computes. To do this, it needs to know if the state at the current bci represents the "before" state, meaning the bytecode will be reexecuted, or the "after" state, meaning we will advance to the next bytecode. The old code didn't know how to determine exactly what state we were in, so it checked both. This PR cleans that up, so we only have to compute the oopmap once. It also removes old SPARC support. 
Dean Long has updated the pull request incrementally with one additional commit since the last revision: reviewer suggestions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26121/files - new: https://git.openjdk.org/jdk/pull/26121/files/54919be0..1e46cce1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26121&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26121&range=04-05 Stats: 8 lines in 2 files changed: 3 ins; 3 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26121.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26121/head:pull/26121 PR: https://git.openjdk.org/jdk/pull/26121 From dlong at openjdk.org Thu Jul 24 20:03:33 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 24 Jul 2025 20:03:33 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v7] In-Reply-To: References: Message-ID: > The VerifyStack logic in Deoptimization::unpack_frames() attempts to check the expression stack size of the interpreter frame against what GenerateOopMap computes. To do this, it needs to know if the state at the current bci represents the "before" state, meaning the bytecode will be reexecuted, or the "after" state, meaning we will advance to the next bytecode. The old code didn't know how to determine exactly what state we were in, so it checked both. This PR cleans that up, so we only have to compute the oopmap once. It also removes old SPARC support. Dean Long has updated the pull request incrementally with one additional commit since the last revision: readability suggestion ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26121/files - new: https://git.openjdk.org/jdk/pull/26121/files/1e46cce1..535fbb05 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26121&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26121&range=05-06 Stats: 8 lines in 1 file changed: 0 ins; 1 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/26121.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26121/head:pull/26121 PR: https://git.openjdk.org/jdk/pull/26121 From duke at openjdk.org Thu Jul 24 20:50:06 2025 From: duke at openjdk.org (Chad Rakoczy) Date: Thu, 24 Jul 2025 20:50:06 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v24] In-Reply-To: References: <8eMagllT-Sxnvp6tnIkYNyUe7PetzaHXqhiqHnAiApU=.3b4c422a-e9ad-492f-a82d-98ee16f053dd@github.com> Message-ID: On Thu, 19 Jun 2025 23:54:20 GMT, Chad Rakoczy wrote: >>> We still need this check in the event that there is a direct call that no longer reaches. >> >> OK, I didn't realize that was what Relocation::pd_set_call_destination() was doing. I think it would be better for the CPU-specific code to take care of that, rather than the shared code. We already have functions like NativeCall::set_destination_mt_safe() that do the right thing regarding trampolines. I think this could be refactored into a commonm function that Relocation::pd_set_call_destination() could also use. Sorry for the churn, but hopefully we are converging on a solution. I thought I had done the refactoring for 8321509, but it looks like I went with the simply fix at the time of adding a parameter to set_destination_mt_safe() to make it lock-free. > > Thanks for the suggestion. I updated to use `set_destination_mt_safe()` instead ([reference](https://github.com/openjdk/jdk/pull/23573/commits/b02e8bdb63db8042418b92ade4a26647e4e2dd8b)) This change has been reverted. 
Trampolines are responsible for fixing their owners so this is no longer needed ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23573#discussion_r2229542873 From ghan at openjdk.org Fri Jul 25 03:25:53 2025 From: ghan at openjdk.org (Guanqiang Han) Date: Fri, 25 Jul 2025 03:25:53 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" In-Reply-To: References: Message-ID: On Thu, 24 Jul 2025 18:41:13 GMT, Dean Long wrote: >> I'm able to consistently reproduce the problem using the following command line and test program ? >> >> java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java >> >> import java.util.Arrays; >> public class Test{ >> public static void main(String[] args) { >> System.out.println("begin"); >> byte[] arr1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; >> byte[] arr2 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; >> System.out.println(Arrays.equals(arr1, arr2)); >> System.out.println("end"); >> } >> } >> >> From my analysis, the root cause appears to be a mismatch in operand handling between T_ADDRESS and T_LONG in LIR_Assembler::stack2reg, especially when the source is marked as double stack (e.g., T_LONG) and the destination as single CPU register (e.g., T_ADDRESS), leading to assertion failures like assert(is_single_stack())(because T_LONG is double_size). >> >> In the test program above , the call chain is: Arrays.equals ? ArraysSupport.vectorizedMismatch ? LIRGenerator::do_vectorizedMismatch >> Within the do_vectorizedMismatch() method, a move instruction constructs an LIR_Op1. During LIR to machine code generation, LIR_Assembler::stack2reg was called. >> >> In this case, the src operand has type T_LONG and the dst operand has type T_ADDRESS. This combination triggers an assert in stack2reg, due to a mismatch between the stack slot type and register type handling. >> >> Importantly, this path ( LIR_Assembler::stack2reg was called ) is only taken when src is forced onto the stack. To reliably trigger this condition, the test is run with the -Xcomp option to force compilation and increase register pressure. >> >> A reference to the relevant code paths is provided below : >> image1 >> image2 >> >> On 64-bit platforms, although T_ADDRESS is classified as single_size, it is in fact 64 bits wide ,represent a single 64-bit general-purpose register and it can hold a T_LONG value, which is also 64 bits. >> >> However, T_LONG is defined as double_size, requiring two local variable slots or a pair of registers in the JVM's abstract model. This mismatch stems from the fact that T_ADDRESS is platform-dependent: it's 32 bits on 32-bit platforms, and 64 bits on 64-bit platforms ? yet its size class... > > I think it is good to detect mismatches between T_LONG and T_ADDRESS, so I'd rather not relax the checks. Why not fix > do_vectorizedMismatch() to use new_register(T_ADDRESS)? And maybe file a separate RFE to cleanup this confusion that new_pointer_register() causes. @dean-long Thanks for the feedback! Initially, I also considered modifying do_vectorizedMismatch() to use new_register(T_ADDRESS), as you suggested. However, I found that this change would trigger a series of follow-up modifications. as shown below: image3 image4 That?s why I opted for a more localized fix . I believe this is still a reasonable compromise. On 64-bit platforms, both T_ADDRESS and T_LONG are 64-bit wide, and general-purpose registers are capable of holding either type. 
Moreover, the code already uses movptr for moving 64-bit wide data , as shown below: image5 So semantically, this modification in PR seems safe and practical in this context. That said, I fully agree that the current treatment of new_pointer_register() is a bit confusing, If you, or other experts familiar with this area, believe the RFE is reasonable and it gets opened, I?d be happy to take on the implementation. Thanks again for your insights, and I look forward to your feedback. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26462#issuecomment-3116255535 From xgong at openjdk.org Fri Jul 25 03:26:36 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 25 Jul 2025 03:26:36 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v3] In-Reply-To: References: Message-ID: > This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform. > > ### Background > Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register. > > ### Implementation > > #### Challenges > Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints. > > For a 512-bit SVE machine, loading a `byte` vector with different vector species require different approaches: > - SPECIES_64: Single operation with mask (8 elements, 256-bit) > - SPECIES_128: Single operation, full register (16 elements, 512-bit) > - SPECIES_256: Two operations + merge (32 elements, 1024-bit) > - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit) > > Use `ByteVector.SPECIES_512` as an example: > - It contains 64 elements. So the index vector size should be `64 * 32` bits, which is 4 times of the SVE vector register size. > - It requires 4 times of vector gather-loads to finish the whole operation. > > > byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...] > int[] idx = [0, 1, 2, 3, ..., 63, ...] > > 4 gather-load: > idx_v1 = [15 14 13 ... 1 0] gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa] > idx_v2 = [31 30 29 ... 17 16] gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb] > idx_v3 = [47 46 45 ... 33 32] gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc] > idx_v4 = [63 62 61 ... 49 48] gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd] > merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa] > > > #### Solution > The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end. > > Here is the main changes: > - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher. > - Added `VectorSliceNode` for result merging. > - Added `VectorMaskWidenNode` for mask spliting and type conversion fo... 
Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: Refine IR pattern and clean backend rules ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26236/files - new: https://git.openjdk.org/jdk/pull/26236/files/c39dade2..be63ade6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26236&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26236&range=01-02 Stats: 854 lines in 17 files changed: 308 ins; 275 del; 271 mod Patch: https://git.openjdk.org/jdk/pull/26236.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26236/head:pull/26236 PR: https://git.openjdk.org/jdk/pull/26236 From fyang at openjdk.org Fri Jul 25 03:37:02 2025 From: fyang at openjdk.org (Fei Yang) Date: Fri, 25 Jul 2025 03:37:02 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v4] In-Reply-To: References: Message-ID: On Thu, 24 Jul 2025 11:25:37 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review this patch? >> >> By https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp#L1270, there are far call, indirect call, reloc call. >> NativeFarCall is in fact a reloc call, the name is confusing, better to rename it to RelocCall. >> Also add some comments and do some other simple cleanup. >> >> Thanks! > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > remove NativeFarCall/RelocCall It's great to see that NativeFarCall class is factored out. Thanks for the update! Overall looks good. Would you mind two minor tweaks about code comment? src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 51: > 49: // NativeCall > 50: // > 51: // Implements direct far calling loading an address from the stub section version of reloc call. Suggestion: `// Implements indirect far call loading an address from the stub section of reloc call.` And I think this comment should be moved to immediately before definition of `MacroAssembler::reloc_call` [1]. [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp#L4982 src/hotspot/cpu/riscv/nativeInst_riscv.hpp line 114: > 112: // call instructions (used to manipulate inline caches, primitive & > 113: // DSO calls, etc.). > 114: // On riscv, NativeCall is a reloc call. Suggestion: `NativeCall is reloc call on RISC-V. See MacroAssembler::reloc_call` ------------- PR Review: https://git.openjdk.org/jdk/pull/26370#pullrequestreview-3053949128 PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2230047481 PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2230035010 From xgong at openjdk.org Fri Jul 25 03:43:54 2025 From: xgong at openjdk.org (Xiaohong Gong) Date: Fri, 25 Jul 2025 03:43:54 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation In-Reply-To: References: <38f2bvFqiVNQGGpMif0iflVFD8wXnyw4SwtKxwi_Dmo=.276fb2fb-b80c-4ea7-a32f-c326294f442a@github.com> <1xAfD3mz5cbQpYtCYxoHqRQcOLadLKNHrvMUtFtFbGo=.34e5780a-e37a-427c-b745-1ed422c7a008@github.com> <4tejg5hp-eHBmAEvKbpTg_mv_TUYU5kg0HIccmWyac8=.3638758e-5000-4d1f-924f-abb4a21952c6@github.com> Message-ID: On Thu, 17 Jul 2025 11:28:18 GMT, Fei Gao wrote: >>> I like this idea! The first one looks better, in which `concate` would provide lower-level and more fine-grained semantics, allowing us to define fewer IR node types while supporting more scenarios. >> >> Yes, I agree with you. I'm now working on refactoring the IR based on the first idea. 
I will update the patch as soon as possible. Thanks for your valuable suggestion! > >> >> Yes, I agree with you. I'm now working on refactoring the IR based on the first idea. I will update the patch as soon as possible. Thanks for your valuable suggestion! > > Thanks! I?d suggest also highlighting `aarch64` in the JBS title, so others who are interested won?t miss it. Hi @fg1417 , the latest commit refactored the whole IR patterns and `LoadVectorGather[Masked]` IR based on above discussions. Could you please help take another look? Thanks~ ### Main changes - Type of `LoadVectorGather[Masked]` are changed from original subword vector type to `int` vector type. Additionally, a `_mem_bt` member is added to denote the load type. - backend rules are clean - mask generation for partial cases are clean - Define `VectorConcatenateNode` and remove `VectorSliceNode`. - `VectorConcatenateNode` has the same function with SVE/NEON's `uzp1`. It is used to narrow the element size of input to half size and concatenate narrowed results from src1 and src2 to dst (src1 is in lower part and src2 is in higher part of dst). - The matcher helper function `vector_idea_reg_size()` is needless and removed. Originally it is used by `VectorSlice`. - More IR tests are added for kinds of different vector species. ### IR implementation - It needs one gather-load - `LoadVectorGather (bt: int)` + `VectorCastI2X (bt: byte|short)` - It needs two gather-loads and merge - step-1: `v1 = LoadVectorGather (bt: int)`, `v2 = LoadVectorGather (bt: int)` - step-2: `merge = VectorConcatenate(v1, v2) (bt: short)` - step-3: (only byte) `v = VectorCastS2X(merge) (bt: byte)` - It needs four gather-loads and merge - (only byte vector) - step-1: `v1 = LoadVectorGather (bt: int)`, `v2 = LoadVectorGather (bt: int)` - step-2: `merge1 = VectorConcatenate(v1, v2) (bt: short)` - step-3: `v3 = LoadVectorGather (bt: int)`, `v4 = LoadVectorGather (bt: int)` - step-4: `merge2 = VectorConcatenate(v3, v4) (bt: short)` - step-5: `v = VectorConcatenate(merge1, merge2) (bt: byte)` ### Performance change It can observe about 4% ~ 9% uplifts on some micro benchmarks. No significant regressions are observed. 
Following is the performance change on NVIDIA Grace with latest commit: Benchmark (SIZE) Mode Units Before After Gain microByteGather128 64 thrpt ops/ms 48405.283 48668.502 1.005 microByteGather128 256 thrpt ops/ms 12821.924 12662.342 0.987 microByteGather128 1024 thrpt ops/ms 3253.778 3198.608 0.983 microByteGather128 4096 thrpt ops/ms 817.604 801.250 0.979 microByteGather128_MASK 64 thrpt ops/ms 46124.722 48334.916 1.047 microByteGather128_MASK 256 thrpt ops/ms 12152.575 12652.821 1.041 microByteGather128_MASK 1024 thrpt ops/ms 3075.066 3193.787 1.038 microByteGather128_MASK 4096 thrpt ops/ms 812.738 803.017 0.988 microByteGather128_MASK_NZ_OFF 64 thrpt ops/ms 46130.244 48384.633 1.048 microByteGather128_MASK_NZ_OFF 256 thrpt ops/ms 12139.800 12624.298 1.039 microByteGather128_MASK_NZ_OFF 1024 thrpt ops/ms 3078.040 3203.049 1.040 microByteGather128_MASK_NZ_OFF 4096 thrpt ops/ms 812.716 802.712 0.987 microByteGather128_NZ_OFF 64 thrpt ops/ms 48369.524 48643.937 1.005 microByteGather128_NZ_OFF 256 thrpt ops/ms 12814.552 12672.757 0.988 microByteGather128_NZ_OFF 1024 thrpt ops/ms 3253.294 3202.016 0.984 microByteGather128_NZ_OFF 4096 thrpt ops/ms 818.389 805.488 0.984 microByteGather64 64 thrpt ops/ms 48491.633 50615.848 1.043 microByteGather64 256 thrpt ops/ms 12340.778 13156.762 1.066 microByteGather64 1024 thrpt ops/ms 3067.592 3322.777 1.083 microByteGather64 4096 thrpt ops/ms 767.111 832.409 1.085 microByteGather64_MASK 64 thrpt ops/ms 48526.894 50730.468 1.045 microByteGather64_MASK 256 thrpt ops/ms 12340.398 13159.723 1.066 microByteGather64_MASK 1024 thrpt ops/ms 3066.227 3327.964 1.085 microByteGather64_MASK 4096 thrpt ops/ms 767.390 833.327 1.085 microByteGather64_MASK_NZ_OFF 64 thrpt ops/ms 48472.912 51287.634 1.058 microByteGather64_MASK_NZ_OFF 256 thrpt ops/ms 12331.578 13258.954 1.075 microByteGather64_MASK_NZ_OFF 1024 thrpt ops/ms 3070.319 3345.911 1.089 microByteGather64_MASK_NZ_OFF 4096 thrpt ops/ms 767.097 838.008 1.092 microByteGather64_NZ_OFF 64 thrpt ops/ms 48492.984 51224.743 1.056 microByteGather64_NZ_OFF 256 thrpt ops/ms 12334.944 13240.494 1.073 microByteGather64_NZ_OFF 1024 thrpt ops/ms 3067.754 3343.387 1.089 microByteGather64_NZ_OFF 4096 thrpt ops/ms 767.123 837.642 1.091 microShortGather128 64 thrpt ops/ms 37717.835 37041.162 0.982 microShortGather128 256 thrpt ops/ms 9467.160 9890.109 1.044 microShortGather128 1024 thrpt ops/ms 2376.520 2481.753 1.044 microShortGather128 4096 thrpt ops/ms 595.030 621.274 1.044 microShortGather128_MASK 64 thrpt ops/ms 37655.017 37036.887 0.983 microShortGather128_MASK 256 thrpt ops/ms 9471.324 9859.461 1.040 microShortGather128_MASK 1024 thrpt ops/ms 2376.811 2477.106 1.042 microShortGather128_MASK 4096 thrpt ops/ms 595.049 620.082 1.042 microShortGather128_MASK_NZ_OFF 64 thrpt ops/ms 37636.229 37029.468 0.983 microShortGather128_MASK_NZ_OFF 256 thrpt ops/ms 9483.674 9867.427 1.040 microShortGather128_MASK_NZ_OFF 1024 thrpt ops/ms 2379.877 2478.608 1.041 microShortGather128_MASK_NZ_OFF 4096 thrpt ops/ms 594.710 620.455 1.043 microShortGather128_NZ_OFF 64 thrpt ops/ms 37706.896 37044.505 0.982 microShortGather128_NZ_OFF 256 thrpt ops/ms 9487.006 9882.079 1.041 microShortGather128_NZ_OFF 1024 thrpt ops/ms 2379.571 2482.341 1.043 microShortGather128_NZ_OFF 4096 thrpt ops/ms 595.099 621.392 1.044 microShortGather64 64 thrpt ops/ms 37773.485 37502.698 0.992 microShortGather64 256 thrpt ops/ms 9591.046 9640.225 1.005 microShortGather64 1024 thrpt ops/ms 2406.013 2420.376 1.005 microShortGather64 4096 thrpt ops/ms 603.270 606.541 1.005 
microShortGather64_MASK 64 thrpt ops/ms 37781.860 37479.295 0.991 microShortGather64_MASK 256 thrpt ops/ms 9608.015 9657.010 1.005 microShortGather64_MASK 1024 thrpt ops/ms 2406.828 2422.170 1.006 microShortGather64_MASK 4096 thrpt ops/ms 602.965 606.283 1.005 microShortGather64_MASK_NZ_OFF 64 thrpt ops/ms 37740.577 37487.740 0.993 microShortGather64_MASK_NZ_OFF 256 thrpt ops/ms 9593.611 9663.041 1.007 microShortGather64_MASK_NZ_OFF 1024 thrpt ops/ms 2404.846 2423.493 1.007 microShortGather64_MASK_NZ_OFF 4096 thrpt ops/ms 602.691 605.911 1.005 microShortGather64_NZ_OFF 64 thrpt ops/ms 37723.586 37507.899 0.994 microShortGather64_NZ_OFF 256 thrpt ops/ms 9589.985 9630.033 1.004 microShortGather64_NZ_OFF 1024 thrpt ops/ms 2405.774 2423.655 1.007 microShortGather64_NZ_OFF 4096 thrpt ops/ms 602.778 606.151 1.005 ------------- PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3116280179 From jbhateja at openjdk.org Fri Jul 25 03:44:00 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 25 Jul 2025 03:44:00 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v5] In-Reply-To: References: Message-ID: On Thu, 24 Jul 2025 02:56:43 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is >> relative smaller than that of `fromLong`. So this patch does the conversion for these cases. >> >> The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. >> >> Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. >> >> This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like >> >> VectorMaskToLong (VectorLongToMask x) => x >> >> >> Hence, this patch also added the following optimizations: >> >> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> >> VectorMaskCast (VectorMaskCast x) => x >> >> And we can see noticeable performance improvement with the above optimizations for floating-point types. 
>> >> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Error After Error Uplift >> microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 >> microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 >> microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 >> microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 >> microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 >> microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 >> microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 >> microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 >> >> >> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: >> >> Benchm... > > erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains eight additional commits since the last revision: > > - Add JMH benchmarks for cast chain transformation > - Merge branch 'master' into JDK-8356760 > - Refactor the implementation > > Do the convertion in C2's IGVN phase to cover more cases. > - Merge branch 'master' into JDK-8356760 > - Simplify the test code > - Address some review comments > > Add support for the following patterns: > toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) > toLong(maskAll(false)) => 0 > > And add more test cases. > - Merge branch 'master' into JDK-8356760 > - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases > > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would > set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent > to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is > relative smaller than that of `fromLong`. This patch does the conversion > for these cases if `l` is a compile time constant. > > And this conversion also enables further optimizations that recognize > maskAll patterns, see [1]. > > Some JTReg test cases are added to ensure the optimization is effective. > > I tried many different ways to write a JMH benchmark, but failed. Since > the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific > compile-time constant, the statement will be hoisted out of the loop. > If we don't use a loop, the hotspot will become other instructions, and > no obvious performance change was observed. However, combined with the > optimization of [1], we can observe a performance improvement of about > 7% on both aarch64 and x64. > > The patch was tested on both aarch64 and x64, all of tier1 tier2 and > tier3 tests passed. > > [1] https://github.com/openjdk/jdk/pull/24674 Your changes looks good to me. Thanks @erifan src/hotspot/share/opto/vectornode.cpp line 1986: > 1984: Node* VectorMaskToLongNode::Ideal_MaskAll(PhaseGVN* phase) { > 1985: Node* in1 = in(1); > 1986: // VectorMaskToLong follows a VectorStoreMask if predicate is not supported. It's always good to add an assertion check for coding assumptions. ------------- Marked as reviewed by jbhateja (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/25793#pullrequestreview-3053985885 PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2230064475 From duke at openjdk.org Fri Jul 25 07:28:42 2025 From: duke at openjdk.org (erifan) Date: Fri, 25 Jul 2025 07:28:42 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v6] In-Reply-To: References: Message-ID: > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is > relative smaller than that of `fromLong`. So this patch does the conversion for these cases. > > The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. > > Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. > > This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. > > As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like > > VectorMaskToLong (VectorLongToMask x) => x > > > Hence, this patch also added the following optimizations: > > VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 > VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 > > VectorMaskCast (VectorMaskCast x) => x > > And we can see noticeable performance improvement with the above optimizations for floating-point types. > > Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: > > Benchmark Unit Before Error After Error Uplift > microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 > microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 > microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 > microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 > microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 > microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 > microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 > microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 > > > Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: > > Benchmark Unit Before Error After Error Uplift > microMaskFromLongToLong_Double... 
erifan has updated the pull request incrementally with one additional commit since the last revision: Add an assertion ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25793/files - new: https://git.openjdk.org/jdk/pull/25793/files/6ae43e17..4ffc8d91 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=04-05 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/25793.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25793/head:pull/25793 PR: https://git.openjdk.org/jdk/pull/25793 From jbhateja at openjdk.org Fri Jul 25 07:28:42 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 25 Jul 2025 07:28:42 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v6] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 07:24:28 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is >> relative smaller than that of `fromLong`. So this patch does the conversion for these cases. >> >> The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. >> >> Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. >> >> This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like >> >> VectorMaskToLong (VectorLongToMask x) => x >> >> >> Hence, this patch also added the following optimizations: >> >> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> >> VectorMaskCast (VectorMaskCast x) => x >> >> And we can see noticeable performance improvement with the above optimizations for floating-point types. >> >> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Error After Error Uplift >> microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 >> microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 >> microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 >> microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 >> microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 >> microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 >> microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 >> microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 >> >> >> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: >> >> Benchm... 
> > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Add an assertion Still looks good. src/hotspot/share/opto/vectornode.cpp line 1989: > 1987: if (in1->Opcode() == Op_VectorStoreMask) { > 1988: in1 = in1->in(1); > 1989: assert(!in1->bottom_type()->isa_vectmask(), "sanity"); Assertion should precede before any other statement in the block :-) ------------- Marked as reviewed by jbhateja (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/25793#pullrequestreview-3054381237 PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2230359167 From duke at openjdk.org Fri Jul 25 07:28:43 2025 From: duke at openjdk.org (erifan) Date: Fri, 25 Jul 2025 07:28:43 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v5] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 03:35:11 GMT, Jatin Bhateja wrote: >> erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains eight additional commits since the last revision: >> >> - Add JMH benchmarks for cast chain transformation >> - Merge branch 'master' into JDK-8356760 >> - Refactor the implementation >> >> Do the convertion in C2's IGVN phase to cover more cases. >> - Merge branch 'master' into JDK-8356760 >> - Simplify the test code >> - Address some review comments >> >> Add support for the following patterns: >> toLong(maskAll(true)) => (-1ULL >> (64 -vlen)) >> toLong(maskAll(false)) => 0 >> >> And add more test cases. >> - Merge branch 'master' into JDK-8356760 >> - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases >> >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would >> set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent >> to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is >> relative smaller than that of `fromLong`. This patch does the conversion >> for these cases if `l` is a compile time constant. >> >> And this conversion also enables further optimizations that recognize >> maskAll patterns, see [1]. >> >> Some JTReg test cases are added to ensure the optimization is effective. >> >> I tried many different ways to write a JMH benchmark, but failed. Since >> the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific >> compile-time constant, the statement will be hoisted out of the loop. >> If we don't use a loop, the hotspot will become other instructions, and >> no obvious performance change was observed. However, combined with the >> optimization of [1], we can observe a performance improvement of about >> 7% on both aarch64 and x64. >> >> The patch was tested on both aarch64 and x64, all of tier1 tier2 and >> tier3 tests passed. >> >> [1] https://github.com/openjdk/jdk/pull/24674 > > src/hotspot/share/opto/vectornode.cpp line 1986: > >> 1984: Node* VectorMaskToLongNode::Ideal_MaskAll(PhaseGVN* phase) { >> 1985: Node* in1 = in(1); >> 1986: // VectorMaskToLong follows a VectorStoreMask if predicate is not supported. > > It's always good to add an assertion check for coding assumptions. Done, thanks! 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2230351518 From aph at openjdk.org Fri Jul 25 08:05:55 2025 From: aph at openjdk.org (Andrew Haley) Date: Fri, 25 Jul 2025 08:05:55 GMT Subject: RFR: 8360654: AArch64: Remove redundant dmb from C1 compareAndSet [v2] In-Reply-To: References: <3ZAHrkER2pU6L346Y40fDkndJhkFGjBrQQ4xX7cx80w=.527c87b5-f2ec-43eb-be68-d0b802c76940@github.com> Message-ID: On Thu, 24 Jul 2025 10:46:00 GMT, Samuel Chee wrote: > I have been unable to find any particular use patterns which relies on the existence of these trailing dmbs, so it does not seem necessary to add the trailingDMB option. Although would like to hear your thoughts on the issue. Maybe simply move the `dmb` after the non-LSE ldxr/stxr logic, then. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26000#discussion_r2230446292 From mli at openjdk.org Fri Jul 25 08:34:55 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 25 Jul 2025 08:34:55 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v4] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 03:01:08 GMT, Fei Yang wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> remove NativeFarCall/RelocCall > > src/hotspot/cpu/riscv/nativeInst_riscv.hpp line 114: > >> 112: // call instructions (used to manipulate inline caches, primitive & >> 113: // DSO calls, etc.). >> 114: // On riscv, NativeCall is a reloc call. > > Suggestion: `NativeCall is reloc call on RISC-V. See MacroAssembler::reloc_call` fixed. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2230508486 From mli at openjdk.org Fri Jul 25 08:39:39 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 25 Jul 2025 08:39:39 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v4] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 03:13:34 GMT, Fei Yang wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> remove NativeFarCall/RelocCall > > src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 51: > >> 49: // NativeCall >> 50: // >> 51: // Implements direct far calling loading an address from the stub section version of reloc call. > > Suggestion: `// Implements indirect far call loading an address from the stub section of reloc call.` > > And I think this comment should be moved to immediately before definition of `MacroAssembler::reloc_call` [1]. > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp#L4982 As it still has "far call" which is what we want to cleanup in this pr, and [1] explain the different types of call in more details, I'll just remove this comment to avoid misleading. [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp#L1313 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2230513589 From mli at openjdk.org Fri Jul 25 08:39:39 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 25 Jul 2025 08:39:39 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v5] In-Reply-To: References: Message-ID: > Hi, > Can you help to review this patch? > > By https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp#L1270, there are far call, indirect call, reloc call. > NativeFarCall is in fact a reloc call, the name is confusing, better to rename it to RelocCall. 
> Also add some comments and do some other simple cleanup. > > Thanks! Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26370/files - new: https://git.openjdk.org/jdk/pull/26370/files/37820220..724c44d2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26370&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26370&range=03-04 Stats: 3 lines in 2 files changed: 0 ins; 1 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26370.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26370/head:pull/26370 PR: https://git.openjdk.org/jdk/pull/26370 From bkilambi at openjdk.org Fri Jul 25 08:58:41 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Fri, 25 Jul 2025 08:58:41 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v17] In-Reply-To: References: Message-ID: > This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. > > It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. > > For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. > > For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. > > This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. > > Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - > > > Benchmark (size) Mode Cnt Gain > SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 > SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 > SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 > SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 > SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 > SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 > SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 > SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 > SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 > SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 > > > Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. Bhavana Kilambi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 17 commits: - Merge master - Refine comments in c2_MacroAssembler_aarch64.cpp - Addressed review comments to half the number of match rules - Updated x86 code. Patch contributed by @jatin-bhateja - Change match rule names to lowercase - Addressed review comments - x86_64: JTREG test update for x86. The patch is contributed by @jatin-bhateja - Addressed review comments - Merge master - code style issues fixed - ... 
and 7 more: https://git.openjdk.org/jdk/compare/518d5f4b...f79d2f00 ------------- Changes: https://git.openjdk.org/jdk/pull/23570/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=16 Stats: 969 lines in 13 files changed: 943 ins; 0 del; 26 mod Patch: https://git.openjdk.org/jdk/pull/23570.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23570/head:pull/23570 PR: https://git.openjdk.org/jdk/pull/23570 From snatarajan at openjdk.org Fri Jul 25 09:01:12 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Fri, 25 Jul 2025 09:01:12 GMT Subject: RFR: 8358696: Assert with extreme values for -XX:BciProfileWidth [v5] In-Reply-To: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> References: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> Message-ID: > **Issue** > Extreme values for BciProfileWidth flag such as `java -XX:BciProfileWidth=-1 -version` and `java -XX:BciProfileWidth=100000 -version `results in assert failure `assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. This is observed in a x86 machine. > > **Analysis** > On debugging the issue, I found that increasing the size of the interpreter using the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` prevented the above mentioned assert from failing for large values of BciProfileWidth. > > **Proposal** > Considering the fact that larger BciProfileWidth results in slower profiling, I have proposed a range between 0 to 5000 to restrict the value for BciProfileWidth for x86 machines. This maximum value is based on modifying the `InterpreterCodeSize` variable in `src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp` using the smallest `InterpreterCodeSize` for all the architectures. As for the lower bound, a value of -1 would be the same as 0, as this simply means no return bci's will be recorded in ret profile. > > **Issue in AArch64** > Additionally running the command `java -XX:BciProfileWidth= 10000 -version` (or larger values) results in a different failure `assert(offset_ok_for_immed(offset(), size)) failed: must be, was: 32768, 3` on an AArch64 machine.This is an issue of maximum offset for `ldr/str` in AArch64 which can be fixed using `form_address` as mentioned in [JDK-8342736](https://bugs.openjdk.org/browse/JDK-8342736). In my preliminary fix using `form_address` on AArch64 machine. I had to modify 3 `ldr` and 1 `str` instruction (in file `src/hotspot/cpu/aarch64/interp_masm_aarch64.cpp` at line number 926, 983, and 997). With this fix using `form_address`, `BciProfileWidth` works for maximum of 5000 after which it crashes with`assert(allocates2(pc)) failed: not in CodeBuffer memory: 0x0000772b63a7a3a0 <= 0x0000772b63b75159 <= 0x0000772b63b75158 `. Without this fix `BciProfileWidth` works for a maximum value of 1300. Currently, I have suggested to restrict the upper bound on AArch64 to 1000 instead of fixing it with `form_address`. > > **Question to reviewers** > Do you think this is a reasonable fix ? For AArch64 do you suggest fixing using `form_address` ? If yes, do I fix it under this PR or create another one ? 
Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: addressing review comment by adding intx flag ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26139/files - new: https://git.openjdk.org/jdk/pull/26139/files/2d0084ba..2fc4b0b7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26139&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26139&range=03-04 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26139.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26139/head:pull/26139 PR: https://git.openjdk.org/jdk/pull/26139 From fyang at openjdk.org Fri Jul 25 09:05:54 2025 From: fyang at openjdk.org (Fei Yang) Date: Fri, 25 Jul 2025 09:05:54 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v5] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 08:39:39 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review this patch? >> >> By https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp#L1270, there are far call, indirect call, reloc call. >> NativeFarCall is in fact a reloc call, the name is confusing, better to rename it to RelocCall. >> Also add some comments and do some other simple cleanup. >> >> Thanks! > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > comments Thanks! ------------- Marked as reviewed by fyang (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26370#pullrequestreview-3054670513 From mli at openjdk.org Fri Jul 25 09:05:54 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 25 Jul 2025 09:05:54 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v5] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 09:02:30 GMT, Fei Yang wrote: > Thanks! Thank you for reviewing! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26370#issuecomment-3116991764 From snatarajan at openjdk.org Fri Jul 25 09:15:57 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Fri, 25 Jul 2025 09:15:57 GMT Subject: RFR: 8358696: Assert with extreme values for -XX:BciProfileWidth [v4] In-Reply-To: <51aYnCiXel-vz4Zu40K08E1lyBtX5JXD8PXoCr5wWUE=.15def8e4-f7c3-42ae-976e-f79ed7415bfa@github.com> References: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> <_preMnRE0tqL476Pb8bPPfkixInRa-ZH5Qom7W70AW4=.a71e36da-d0e1-44e4-a3fe-9091460b813f@github.com> <51aYnCiXel-vz4Zu40K08E1lyBtX5JXD8PXoCr5wWUE=.15def8e4-f7c3-42ae-976e-f79ed7415bfa@github.com> Message-ID: On Thu, 24 Jul 2025 07:08:24 GMT, Damon Fenacci wrote: >> Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: >> >> fixing copyright > > src/hotspot/share/runtime/globals.hpp line 1356: > >> 1354: develop(int, BciProfileWidth, 2, \ >> 1355: "Number of return bci's to record in ret profile") \ >> 1356: range(0, AARCH64_ONLY(1000) NOT_AARCH64(5000)) \ > > I'm not too sure of the usual number of returns but even just 1000 sounds quite big as maximum. Do you think we could use that for all architectures? Thank you for the review. I have tested 1000 by reducing the `InterpreterCodeSize` to the smallest value in all the specified architecture in the source code on both AArch64 and x86. It works for 1000. Hence, I think it should work on all architectures. Do you propose I make it 1000 (or a lesser value) for all architecture ? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26139#discussion_r2230592823 From snatarajan at openjdk.org Fri Jul 25 09:15:59 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Fri, 25 Jul 2025 09:15:59 GMT Subject: RFR: 8358696: Assert with extreme values for -XX:BciProfileWidth [v5] In-Reply-To: <51aYnCiXel-vz4Zu40K08E1lyBtX5JXD8PXoCr5wWUE=.15def8e4-f7c3-42ae-976e-f79ed7415bfa@github.com> References: <5TRVeAXUQi6quM-nDWEij_jk6M5K2Vk31RA-Yjd8F2M=.5b63da45-93c3-4251-9e2e-3c64b7953919@github.com> <_preMnRE0tqL476Pb8bPPfkixInRa-ZH5Qom7W70AW4=.a71e36da-d0e1-44e4-a3fe-9091460b813f@github.com> <51aYnCiXel-vz4Zu40K08E1lyBtX5JXD8PXoCr5wWUE=.15def8e4-f7c3-42ae-976e-f79ed7415bfa@github.com> Message-ID: On Thu, 24 Jul 2025 06:59:52 GMT, Damon Fenacci wrote: >> Saranya Natarajan has updated the pull request incrementally with one additional commit since the last revision: >> >> addressing review comment by adding intx flag > > test/lib-test/jdk/test/whitebox/vm_flags/IntxTest.java line 39: > >> 37: public class IntxTest { >> 38: private static final String FLAG_NAME = "OnStackReplacePercentage"; >> 39: private static final String FLAG_DEBUG_NAME = "BciProfileWidth"; > > Maybe we might want use another `intx` flag instead of just removing this (just to keep testing the WhiteBox) I addressed this comment by adding `BinarySwitchThreshold` intx develop flag now. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26139#discussion_r2230595014 From bkilambi at openjdk.org Fri Jul 25 09:17:19 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Fri, 25 Jul 2025 09:17:19 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v18] In-Reply-To: References: Message-ID: > This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. > > It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. > > For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. > > For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. > > This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. 
> > Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - > > > Benchmark (size) Mode Cnt Gain > SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 > SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 > SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 > SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 > SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 > SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 > SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 > SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 > SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 > SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 > SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 > > > Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: Refine comments in the ad file ------------- Changes: - all: https://git.openjdk.org/jdk/pull/23570/files - new: https://git.openjdk.org/jdk/pull/23570/files/f79d2f00..3675bf34 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=17 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=16-17 Stats: 16 lines in 2 files changed: 6 ins; 2 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/23570.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23570/head:pull/23570 PR: https://git.openjdk.org/jdk/pull/23570 From bkilambi at openjdk.org Fri Jul 25 09:17:21 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Fri, 25 Jul 2025 09:17:21 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v17] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 08:58:41 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. 
>> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 17 commits: > > - Merge master > - Refine comments in c2_MacroAssembler_aarch64.cpp > - Addressed review comments to half the number of match rules > - Updated x86 code. Patch contributed by @jatin-bhateja > - Change match rule names to lowercase > - Addressed review comments > - x86_64: JTREG test update for x86. The patch is contributed by @jatin-bhateja > - Addressed review comments > - Merge master > - code style issues fixed > - ... and 7 more: https://git.openjdk.org/jdk/compare/518d5f4b...f79d2f00 Hi @theRealAph I have refined my comments. Could I please get another review? Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23570#issuecomment-3117022123 From duke at openjdk.org Fri Jul 25 09:27:19 2025 From: duke at openjdk.org (erifan) Date: Fri, 25 Jul 2025 09:27:19 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v7] In-Reply-To: References: Message-ID: > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is > relative smaller than that of `fromLong`. So this patch does the conversion for these cases. > > The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. > > Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. > > This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. 
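As a concrete illustration of the equivalence this transform relies on, here is a small self-contained sketch using the incubating jdk.incubator.vector API (class name and species choice are illustrative, not from the patch):

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

public class FromLongAllTrueSketch {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256; // 8 int lanes

    public static void main(String[] args) {
        long allBits = (1L << SPECIES.length()) - 1;   // sets every lane for this species
        VectorMask<Integer> fromLongTrue  = VectorMask.fromLong(SPECIES, allBits);
        VectorMask<Integer> maskAllTrue   = SPECIES.maskAll(true);
        VectorMask<Integer> fromLongFalse = VectorMask.fromLong(SPECIES, 0L);
        VectorMask<Integer> maskAllFalse  = SPECIES.maskAll(false);
        // Both pairs describe identical lane values, which is what allows the compiler
        // to rewrite the constant fromLong calls into the cheaper maskAll form.
        System.out.println(fromLongTrue.toLong()  == maskAllTrue.toLong());   // true
        System.out.println(fromLongFalse.toLong() == maskAllFalse.toLong());  // true
        // Run with: java --add-modules jdk.incubator.vector FromLongAllTrueSketch.java
    }
}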
> > As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like > > VectorMaskToLong (VectorLongToMask x) => x > > > Hence, this patch also added the following optimizations: > > VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 > VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 > > VectorMaskCast (VectorMaskCast x) => x > > And we can see noticeable performance improvement with the above optimizations for floating-point types. > > Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: > > Benchmark Unit Before Error After Error Uplift > microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 > microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 > microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 > microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 > microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 > microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 > microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 > microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 > > > Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: > > Benchmark Unit Before Error After Error Uplift > microMaskFromLongToLong_Double... erifan has updated the pull request incrementally with one additional commit since the last revision: Move the assertion to the beginning of the code block ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25793/files - new: https://git.openjdk.org/jdk/pull/25793/files/4ffc8d91..8418ebdd Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=05-06 Stats: 2 lines in 1 file changed: 1 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/25793.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25793/head:pull/25793 PR: https://git.openjdk.org/jdk/pull/25793 From duke at openjdk.org Fri Jul 25 09:27:20 2025 From: duke at openjdk.org (erifan) Date: Fri, 25 Jul 2025 09:27:20 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v6] In-Reply-To: References: Message-ID: <_LIu4LYcFUSkUTpCaLX4zA8f_xWGSV2lW917o6YEp40=.b4b7476a-ff51-4194-b691-94b6b35490c6@github.com> On Fri, 25 Jul 2025 07:21:38 GMT, Jatin Bhateja wrote: >> erifan has updated the pull request incrementally with one additional commit since the last revision: >> >> Add an assertion > > src/hotspot/share/opto/vectornode.cpp line 1989: > >> 1987: if (in1->Opcode() == Op_VectorStoreMask) { >> 1988: in1 = in1->in(1); >> 1989: assert(!in1->bottom_type()->isa_vectmask(), "sanity"); > > Assertion should precede before any other statement in the block :-) Done, thanks~ ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2230615716 From aph at openjdk.org Fri Jul 25 09:29:00 2025 From: aph at openjdk.org (Andrew Haley) Date: Fri, 25 Jul 2025 09:29:00 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v18] In-Reply-To: References: Message-ID: <7tKfqCZHB1fAcrN7hU2mVZBrAfE2XkMUa5M-fG2dERc=.9a40f17a-3860-4c7b-bc22-73480865276f@github.com> On Fri, 25 Jul 2025 09:17:19 
GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Refine comments in the ad file OK, that looks like a good job. You'll need another reviewer. ------------- Marked as reviewed by aph (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23570#pullrequestreview-3054740405 From bkilambi at openjdk.org Fri Jul 25 10:06:01 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Fri, 25 Jul 2025 10:06:01 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v18] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 09:17:19 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. 
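The fallback mentioned above ("two rearrange and one blend") corresponds roughly to the following Java-level decomposition. This is only an illustrative sketch of the lowering, assuming the incubating Vector API; it is not code from the patch.

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class SelectFromFallbackSketch {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128; // 4 int lanes

    // Pick from v1 where the index lane is < VLENGTH, from v2 where it is >= VLENGTH.
    static IntVector selectFromTwo(IntVector idx, IntVector v1, IntVector v2) {
        int vlen = SPECIES.length();
        IntVector wrapped = idx.and(vlen - 1);                    // index modulo VLENGTH (power of two)
        IntVector fromFirst  = v1.rearrange(wrapped.toShuffle()); // rearrange #1
        IntVector fromSecond = v2.rearrange(wrapped.toShuffle()); // rearrange #2
        VectorMask<Integer> pickSecond = idx.compare(VectorOperators.GE, vlen);
        return fromFirst.blend(fromSecond, pickSecond);           // blend
    }
}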
>> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Refine comments in the ad file Hi @jatin-bhateja could I ask for your review for the x86 part please? I also fixed a minor merge conflict in the latest merge which is related to x86. ------------- PR Comment: https://git.openjdk.org/jdk/pull/23570#issuecomment-3117161192 From jbhateja at openjdk.org Fri Jul 25 13:50:55 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 25 Jul 2025 13:50:55 GMT Subject: RFR: 8303762: Optimize vector slice operation with constant index using VPALIGNR instruction Message-ID: Patch optimizes Vector. slice operation with constant index using x86 ALIGNR instruction. It also adds a new hybrid call generator to facilitate lazy intrinsification or else perform procedural inlining to prevent call overhead and boxing penalties in case the fallback implementation expects to operate over vectors. The existing vector API-based slice implementation is now the fallback code that gets inlined in case intrinsification fails. Idea here is to add infrastructure support to enable intrinsification of fast path for selected vector APIs, else enable inlining of fall-back implementation if it's based on vector APIs. Existing call generators like PredictedCallGenerator, used to handle bi-morphic inlining, already make use of multiple call generators to handle hit/miss scenarios for a particular receiver type. The newly added hybrid call generator is lazy and called during incremental inlining optimization. It also relieves the inline expander to handle slow paths, which can easily be implemented library side (Java). Vector API jtreg tests pass at AVX level 2, remaining validation in progress. 
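For context, the constant-index form being optimized is the two-argument Vector.slice(origin, v2) call where the origin is a compile-time constant. A minimal loop of that shape, assuming the incubating jdk.incubator.vector API (class name and the chosen origin are illustrative only), looks like this:

import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

public class VectorSliceSketch {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_128; // 16 byte lanes

    // Result lane i is a[i + 3] while i + 3 < 16, then b[i + 3 - 16] afterwards;
    // with a constant origin this is the shape a single (v)palignr can cover.
    static void sliceConstantOrigin(byte[] a, byte[] b, byte[] r) {
        int upper = SPECIES.loopBound(Math.min(a.length, Math.min(b.length, r.length)));
        for (int i = 0; i < upper; i += SPECIES.length()) {
            ByteVector va = ByteVector.fromArray(SPECIES, a, i);
            ByteVector vb = ByteVector.fromArray(SPECIES, b, i);
            va.slice(3, vb).intoArray(r, i); // origin 3 is a compile-time constant
        }
    }
}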
Performance numbers: System : 13th Gen Intel(R) Core(TM) i3-1315U Baseline: Benchmark (size) Mode Cnt Score Error Units VectorSliceBenchmark.byteVectorSliceWithConstantIndex1 1024 thrpt 2 9444.444 ops/ms VectorSliceBenchmark.byteVectorSliceWithConstantIndex2 1024 thrpt 2 10009.319 ops/ms VectorSliceBenchmark.byteVectorSliceWithVariableIndex 1024 thrpt 2 9081.926 ops/ms VectorSliceBenchmark.intVectorSliceWithConstantIndex1 1024 thrpt 2 6085.825 ops/ms VectorSliceBenchmark.intVectorSliceWithConstantIndex2 1024 thrpt 2 6505.378 ops/ms VectorSliceBenchmark.intVectorSliceWithVariableIndex 1024 thrpt 2 6204.489 ops/ms VectorSliceBenchmark.longVectorSliceWithConstantIndex1 1024 thrpt 2 1651.334 ops/ms VectorSliceBenchmark.longVectorSliceWithConstantIndex2 1024 thrpt 2 1642.784 ops/ms VectorSliceBenchmark.longVectorSliceWithVariableIndex 1024 thrpt 2 1474.808 ops/ms VectorSliceBenchmark.shortVectorSliceWithConstantIndex1 1024 thrpt 2 10399.394 ops/ms VectorSliceBenchmark.shortVectorSliceWithConstantIndex2 1024 thrpt 2 10502.894 ops/ms VectorSliceBenchmark.shortVectorSliceWithVariableIndex 1024 thrpt 2 9756.573 ops/ms With opt: Benchmark (size) Mode Cnt Score Error Units VectorSliceBenchmark.byteVectorSliceWithConstantIndex1 1024 thrpt 2 34122.435 ops/ms VectorSliceBenchmark.byteVectorSliceWithConstantIndex2 1024 thrpt 2 33281.868 ops/ms VectorSliceBenchmark.byteVectorSliceWithVariableIndex 1024 thrpt 2 9345.154 ops/ms VectorSliceBenchmark.intVectorSliceWithConstantIndex1 1024 thrpt 2 8283.247 ops/ms VectorSliceBenchmark.intVectorSliceWithConstantIndex2 1024 thrpt 2 8510.695 ops/ms VectorSliceBenchmark.intVectorSliceWithVariableIndex 1024 thrpt 2 5626.367 ops/ms VectorSliceBenchmark.longVectorSliceWithConstantIndex1 1024 thrpt 2 960.958 ops/ms VectorSliceBenchmark.longVectorSliceWithConstantIndex2 1024 thrpt 2 4155.801 ops/ms VectorSliceBenchmark.longVectorSliceWithVariableIndex 1024 thrpt 2 1465.953 ops/ms VectorSliceBenchmark.shortVectorSliceWithConstantIndex1 1024 thrpt 2 32748.061 ops/ms VectorSliceBenchmark.shortVectorSliceWithConstantIndex2 1024 thrpt 2 33674.408 ops/ms VectorSliceBenchmark.shortVectorSliceWithVariableIndex 1024 thrpt 2 9346.148 ops/ms Please share your feedback. Best Regards, Jatin ------------- Commit messages: - Fixes for failing regressions - Optimizing AVX2 backend and some re-factoring - new benchmark - Merge branch 'master' of https://github.com/openjdk/jdk into JDK-8303762 - 8303762: Optimize vector slice operation with constant index using VPALIGNR instruction Changes: https://git.openjdk.org/jdk/pull/24104/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24104&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8303762 Stats: 747 lines in 32 files changed: 664 ins; 0 del; 83 mod Patch: https://git.openjdk.org/jdk/pull/24104.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24104/head:pull/24104 PR: https://git.openjdk.org/jdk/pull/24104 From jbhateja at openjdk.org Fri Jul 25 13:50:56 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 25 Jul 2025 13:50:56 GMT Subject: RFR: 8303762: Optimize vector slice operation with constant index using VPALIGNR instruction In-Reply-To: References: Message-ID: On Tue, 18 Mar 2025 20:51:46 GMT, Jatin Bhateja wrote: > Patch optimizes Vector. slice operation with constant index using x86 ALIGNR instruction. 
> It also adds a new hybrid call generator to facilitate lazy intrinsification or else perform procedural inlining to prevent call overhead and boxing penalties in case the fallback implementation expects to operate over vectors. The existing vector API-based slice implementation is now the fallback code that gets inlined in case intrinsification fails. > > Idea here is to add infrastructure support to enable intrinsification of fast path for selected vector APIs, else enable inlining of fall-back implementation if it's based on vector APIs. Existing call generators like PredictedCallGenerator, used to handle bi-morphic inlining, already make use of multiple call generators to handle hit/miss scenarios for a particular receiver type. The newly added hybrid call generator is lazy and called during incremental inlining optimization. It also relieves the inline expander to handle slow paths, which can easily be implemented library side (Java). > > Vector API jtreg tests pass at AVX level 2, remaining validation in progress. > > Performance numbers: > > > System : 13th Gen Intel(R) Core(TM) i3-1315U > > Baseline: > Benchmark (size) Mode Cnt Score Error Units > VectorSliceBenchmark.byteVectorSliceWithConstantIndex1 1024 thrpt 2 9444.444 ops/ms > VectorSliceBenchmark.byteVectorSliceWithConstantIndex2 1024 thrpt 2 10009.319 ops/ms > VectorSliceBenchmark.byteVectorSliceWithVariableIndex 1024 thrpt 2 9081.926 ops/ms > VectorSliceBenchmark.intVectorSliceWithConstantIndex1 1024 thrpt 2 6085.825 ops/ms > VectorSliceBenchmark.intVectorSliceWithConstantIndex2 1024 thrpt 2 6505.378 ops/ms > VectorSliceBenchmark.intVectorSliceWithVariableIndex 1024 thrpt 2 6204.489 ops/ms > VectorSliceBenchmark.longVectorSliceWithConstantIndex1 1024 thrpt 2 1651.334 ops/ms > VectorSliceBenchmark.longVectorSliceWithConstantIndex2 1024 thrpt 2 1642.784 ops/ms > VectorSliceBenchmark.longVectorSliceWithVariableIndex 1024 thrpt 2 1474.808 ops/ms > VectorSliceBenchmark.shortVectorSliceWithConstantIndex1 1024 thrpt 2 10399.394 ops/ms > VectorSliceBenchmark.shortVectorSliceWithConstantIndex2 1024 thrpt 2 10502.894 ops/ms > VectorSliceBenchmark.shortVectorSliceWithVariableIndex 1024 ... 
Performance after AVX2 backend modifications Benchmark (size) Mode Cnt Score Error Units VectorSliceBenchmark.byteVectorSliceWithConstantIndex1 1024 thrpt 2 51644.530 ops/ms VectorSliceBenchmark.byteVectorSliceWithConstantIndex2 1024 thrpt 2 48171.079 ops/ms VectorSliceBenchmark.byteVectorSliceWithVariableIndex 1024 thrpt 2 9662.306 ops/ms VectorSliceBenchmark.intVectorSliceWithConstantIndex1 1024 thrpt 2 14358.347 ops/ms VectorSliceBenchmark.intVectorSliceWithConstantIndex2 1024 thrpt 2 14619.920 ops/ms VectorSliceBenchmark.intVectorSliceWithVariableIndex 1024 thrpt 2 6675.824 ops/ms VectorSliceBenchmark.longVectorSliceWithConstantIndex1 1024 thrpt 2 818.911 ops/ms VectorSliceBenchmark.longVectorSliceWithConstantIndex2 1024 thrpt 2 4778.321 ops/ms VectorSliceBenchmark.longVectorSliceWithVariableIndex 1024 thrpt 2 1612.264 ops/ms VectorSliceBenchmark.shortVectorSliceWithConstantIndex1 1024 thrpt 2 35961.146 ops/ms VectorSliceBenchmark.shortVectorSliceWithConstantIndex2 1024 thrpt 2 39072.170 ops/ms VectorSliceBenchmark.shortVectorSliceWithVariableIndex 1024 thrpt 2 11209.685 ops/ms ------------- PR Comment: https://git.openjdk.org/jdk/pull/24104#issuecomment-3116214722 From chagedorn at openjdk.org Fri Jul 25 13:55:55 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Fri, 25 Jul 2025 13:55:55 GMT Subject: RFR: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? [v2] In-Reply-To: References: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> Message-ID: On Thu, 24 Jul 2025 08:41:52 GMT, Roland Westrelin wrote: >> A node in a pre loop only has uses out of the loop dominated by the >> loop exit. `PhaseIdealLoop::try_sink_out_of_loop()` sets its control >> to the loop exit projection. A range check in the main loop has this >> node as input (through a chain of some other nodes). Range check >> elimination needs to update the exit condition of the pre loop with an >> expression that depends on the node pinned on its exit: that's >> impossible and the assert fires. This is a variant of 8314024 (this >> one was for a node with uses out of the pre loop on multiple paths). I >> propose the same fix: leave the node with control in the pre loop in >> this case. > > Roland Westrelin has updated the pull request incrementally with four additional commits since the last revision: > > - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE3.java > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java > > Co-authored-by: Christian Hagedorn > - Update src/hotspot/share/opto/loopopts.cpp > > Co-authored-by: Christian Hagedorn > - Update test/hotspot/jtreg/compiler/rangechecks/TestSunkRangeFromPreLoopRCE2.java > > Co-authored-by: Christian Hagedorn Thanks for the update, testing looked good! src/hotspot/share/opto/loopopts.cpp line 1926: > 1924: } > 1925: > 1926: // Sinking a node from a pre loop to its main loop pins the node between the pre and main loops. If that node is input Suggestion: // Sinking a node from a pre loop to its main loop pins the node between the pre and main loops. If that node is input ------------- Marked as reviewed by chagedorn (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26424#pullrequestreview-3055490801 PR Review Comment: https://git.openjdk.org/jdk/pull/26424#discussion_r2231144292 From dfenacci at openjdk.org Fri Jul 25 14:25:54 2025 From: dfenacci at openjdk.org (Damon Fenacci) Date: Fri, 25 Jul 2025 14:25:54 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" In-Reply-To: References: Message-ID: <5JIvw451TX46AxSet-I-cjt0STj5riAlK4ajGDvIWrI=.2de4150e-a259-49c6-a628-3a0675e723fa@github.com> On Thu, 24 Jul 2025 15:10:37 GMT, Guanqiang Han wrote: > I'm able to consistently reproduce the problem using the following command line and test program ? > > java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java > > import java.util.Arrays; > public class Test{ > public static void main(String[] args) { > System.out.println("begin"); > byte[] arr1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > byte[] arr2 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > System.out.println(Arrays.equals(arr1, arr2)); > System.out.println("end"); > } > } > > From my analysis, the root cause appears to be a mismatch in operand handling between T_ADDRESS and T_LONG in LIR_Assembler::stack2reg, especially when the source is marked as double stack (e.g., T_LONG) and the destination as single CPU register (e.g., T_ADDRESS), leading to assertion failures like assert(is_single_stack())(because T_LONG is double_size). > > In the test program above , the call chain is: Arrays.equals ? ArraysSupport.vectorizedMismatch ? LIRGenerator::do_vectorizedMismatch > Within the do_vectorizedMismatch() method, a move instruction constructs an LIR_Op1. During LIR to machine code generation, LIR_Assembler::stack2reg was called. > > In this case, the src operand has type T_LONG and the dst operand has type T_ADDRESS. This combination triggers an assert in stack2reg, due to a mismatch between the stack slot type and register type handling. > > Importantly, this path ( LIR_Assembler::stack2reg was called ) is only taken when src is forced onto the stack. To reliably trigger this condition, the test is run with the -Xcomp option to force compilation and increase register pressure. > > A reference to the relevant code paths is provided below : > image1 > image2 > > On 64-bit platforms, although T_ADDRESS is classified as single_size, it is in fact 64 bits wide ,represent a single 64-bit general-purpose register and it can hold a T_LONG value, which is also 64 bits. > > However, T_LONG is defined as double_size, requiring two local variable slots or a pair of registers in the JVM's abstract model. This mismatch stems from the fact that T_ADDRESS is platform-dependent: it's 32 bits on 32-bit platforms, and 64 bits on 64-bit platforms ? yet its size classification remains single_size regardless. > > This classification... Thanks for looking into this @hgqxjj. Since we have a failing test, I think it would be nice to add a simple regression test. src/hotspot/share/c1/c1_LIR.hpp line 430: > 428: int single_stack_ix() const { assert(is_single_stack() && !is_virtual(), "type check"); return (int)data(); } > 429: int double_stack_ix() const { assert(is_double_stack() && !is_virtual(), "type check"); return (int)data(); } > 430: int stack_ix() const { assert((is_double_stack() || is_single_stack()) && !is_virtual(), "type check"); return (int)data(); } Minor thing, but I would follow the alignment of the other methods. 
------------- PR Review: https://git.openjdk.org/jdk/pull/26462#pullrequestreview-3055586948 PR Review Comment: https://git.openjdk.org/jdk/pull/26462#discussion_r2231205680 From thartmann at openjdk.org Fri Jul 25 14:42:53 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 25 Jul 2025 14:42:53 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" In-Reply-To: References: Message-ID: On Thu, 24 Jul 2025 15:10:37 GMT, Guanqiang Han wrote: > I'm able to consistently reproduce the problem using the following command line and test program: > > java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java > > import java.util.Arrays; > public class Test{ > public static void main(String[] args) { > System.out.println("begin"); > byte[] arr1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > byte[] arr2 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > System.out.println(Arrays.equals(arr1, arr2)); > System.out.println("end"); > } > } > > From my analysis, the root cause appears to be a mismatch in operand handling between T_ADDRESS and T_LONG in LIR_Assembler::stack2reg, especially when the source is marked as double stack (e.g., T_LONG) and the destination as a single CPU register (e.g., T_ADDRESS), leading to assertion failures like assert(is_single_stack()) (because T_LONG is double_size). > > In the test program above, the call chain is: Arrays.equals -> ArraysSupport.vectorizedMismatch -> LIRGenerator::do_vectorizedMismatch > Within the do_vectorizedMismatch() method, a move instruction constructs an LIR_Op1. During LIR-to-machine-code generation, LIR_Assembler::stack2reg is called. > > In this case, the src operand has type T_LONG and the dst operand has type T_ADDRESS. This combination triggers an assert in stack2reg, due to a mismatch between the stack slot type and register type handling. > > Importantly, this path (where LIR_Assembler::stack2reg is called) is only taken when src is forced onto the stack. To reliably trigger this condition, the test is run with the -Xcomp option to force compilation and increase register pressure. > > A reference to the relevant code paths is provided below: > image1 > image2 > > On 64-bit platforms, although T_ADDRESS is classified as single_size, it is in fact 64 bits wide, representing a single 64-bit general-purpose register, and it can hold a T_LONG value, which is also 64 bits. > > However, T_LONG is defined as double_size, requiring two local variable slots or a pair of registers in the JVM's abstract model. This mismatch stems from the fact that T_ADDRESS is platform-dependent: it's 32 bits on 32-bit platforms and 64 bits on 64-bit platforms, yet its size classification remains single_size regardless. > > This classification... Thanks for looking into this! When I run your test with `java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java` we still crash when compiling `java.lang.invoke.LambdaFormEditor::putInCache` and if I restrict compilation to your test method via `-XX:CompileCommand=compileonly,Test::*`, the issue does not reproduce anymore. Could you please add a targeted regression test for this issue?
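One possible shape for such a targeted test, based purely on the reproducer and flags quoted above (the test name, bug tag, and warm-up loop below are illustrative assumptions, not the test that was eventually added to the PR):

/*
 * @test
 * @bug 8359235
 * @summary C1 must not hit the stack2reg type check when spilling the
 *          T_ADDRESS operands of the vectorizedMismatch intrinsic
 * @run main/othervm -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200
 *      compiler.c1.TestVectorizedMismatchSpill
 */
package compiler.c1;

import java.util.Arrays;

public class TestVectorizedMismatchSpill {
    public static void main(String[] args) {
        byte[] a = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        byte[] b = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        boolean equal = true;
        // Repeatedly exercise the intrinsified Arrays.equals path so the enclosing
        // method is C1-compiled under enough register pressure to spill operands.
        for (int i = 0; i < 100_000; i++) {
            equal &= Arrays.equals(a, b);
        }
        if (!equal) {
            throw new RuntimeException("arrays should be equal");
        }
    }
}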
------------- PR Review: https://git.openjdk.org/jdk/pull/26462#pullrequestreview-3055704956 From thartmann at openjdk.org Fri Jul 25 14:47:53 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 25 Jul 2025 14:47:53 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" In-Reply-To: References: Message-ID: On Thu, 24 Jul 2025 15:10:37 GMT, Guanqiang Han wrote: > I'm able to consistently reproduce the problem using the following command line and test program ? > > java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java > > import java.util.Arrays; > public class Test{ > public static void main(String[] args) { > System.out.println("begin"); > byte[] arr1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > byte[] arr2 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > System.out.println(Arrays.equals(arr1, arr2)); > System.out.println("end"); > } > } > > From my analysis, the root cause appears to be a mismatch in operand handling between T_ADDRESS and T_LONG in LIR_Assembler::stack2reg, especially when the source is marked as double stack (e.g., T_LONG) and the destination as single CPU register (e.g., T_ADDRESS), leading to assertion failures like assert(is_single_stack())(because T_LONG is double_size). > > In the test program above , the call chain is: Arrays.equals ? ArraysSupport.vectorizedMismatch ? LIRGenerator::do_vectorizedMismatch > Within the do_vectorizedMismatch() method, a move instruction constructs an LIR_Op1. During LIR to machine code generation, LIR_Assembler::stack2reg was called. > > In this case, the src operand has type T_LONG and the dst operand has type T_ADDRESS. This combination triggers an assert in stack2reg, due to a mismatch between the stack slot type and register type handling. > > Importantly, this path ( LIR_Assembler::stack2reg was called ) is only taken when src is forced onto the stack. To reliably trigger this condition, the test is run with the -Xcomp option to force compilation and increase register pressure. > > A reference to the relevant code paths is provided below : > image1 > image2 > > On 64-bit platforms, although T_ADDRESS is classified as single_size, it is in fact 64 bits wide ,represent a single 64-bit general-purpose register and it can hold a T_LONG value, which is also 64 bits. > > However, T_LONG is defined as double_size, requiring two local variable slots or a pair of registers in the JVM's abstract model. This mismatch stems from the fact that T_ADDRESS is platform-dependent: it's 32 bits on 32-bit platforms, and 64 bits on 64-bit platforms ? yet its size classification remains single_size regardless. > > This classification... +1 to what Dean suggested. I think other intrinsics are affected by this as well though, for example: https://github.com/openjdk/jdk/blob/b1fa1ecc988fb07f191892a459625c2c8f2de3b5/src/hotspot/cpu/x86/c1_LIRGenerator_x86.cpp#L953-L962 Also, what about other platforms than x86? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26462#issuecomment-3118233262 From roland at openjdk.org Fri Jul 25 14:58:47 2025 From: roland at openjdk.org (Roland Westrelin) Date: Fri, 25 Jul 2025 14:58:47 GMT Subject: RFR: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? 
[v3] In-Reply-To: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> References: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> Message-ID: > A node in a pre loop only has uses out of the loop dominated by the > loop exit. `PhaseIdealLoop::try_sink_out_of_loop()` sets its control > to the loop exit projection. A range check in the main loop has this > node as input (through a chain of some other nodes). Range check > elimination needs to update the exit condition of the pre loop with an > expression that depends on the node pinned on its exit: that's > impossible and the assert fires. This is a variant of 8314024 (this > one was for a node with uses out of the pre loop on multiple paths). I > propose the same fix: leave the node with control in the pre loop in > this case. Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/share/opto/loopopts.cpp Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26424/files - new: https://git.openjdk.org/jdk/pull/26424/files/2140c98d..1b658c4b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26424&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26424&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26424.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26424/head:pull/26424 PR: https://git.openjdk.org/jdk/pull/26424 From fjiang at openjdk.org Fri Jul 25 15:38:59 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Fri, 25 Jul 2025 15:38:59 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v5] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 08:39:39 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review this patch? >> >> By https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp#L1270, there are far call, indirect call, reloc call. >> NativeFarCall is in fact a reloc call, the name is confusing, better to rename it to RelocCall. >> Finally, choose to remove the NativeFarCall and delegation from NativeCall to NativeFarCall, and move all the implementation to NativeCall itself. >> Also add some comments and do some other simple cleanup. >> >> Thanks! > > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > comments Thanks for the cleanup! I have one minor comment. src/hotspot/cpu/riscv/nativeInst_riscv.hpp line 118: > 116: // private: when common code is using byte_size() > 117: private: > 118: enum { I see the enum of NativeFarCall was named as `RISCV_specific_constants`, do we need this for NativeCall? ------------- PR Review: https://git.openjdk.org/jdk/pull/26370#pullrequestreview-3055942557 PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2231440772 From epeter at openjdk.org Fri Jul 25 17:46:58 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 25 Jul 2025 17:46:58 GMT Subject: RFR: 8324751: C2 SuperWord: Aliasing Analysis runtime check In-Reply-To: <2r_uZpbgYMypJFPIgI_t3NuTg1_S40mbeGrsvqi7IvE=.c4771662-0c55-4c02-96bb-99d5cfdb3697@github.com> References: <2r_uZpbgYMypJFPIgI_t3NuTg1_S40mbeGrsvqi7IvE=.c4771662-0c55-4c02-96bb-99d5cfdb3697@github.com> Message-ID: On Thu, 27 Mar 2025 13:00:20 GMT, Emanuel Peter wrote: > This is a big patch, but about 3.5k lines are tests. 
And a large part of the VM changes is comments / proofs. > > I am adding a dynamic (runtime) aliasing check to the auto-vectorizer (SuperWord). We use the infrastructure from https://github.com/openjdk/jdk/pull/22016: > - Use the auto-vectorization `predicate` when available: we speculate that there is no aliasing, else we trap and re-compile without the predicate. > - If the predicate is not available, we use `multiversioning`, i.e. we have a `fast_loop` where there is no aliasing, and hence vectorization. And a `slow_loop` if the check fails, with no vectorization. > > -------------------------- > > **Where to start reviewing** > > - `src/hotspot/share/opto/mempointer.hpp`: > - Read the class comment for `MemPointerRawSummand`. > - Familiarize yourself with the `MemPointer Linearity Corrolary`. We need it for the proofs of the aliasing runtime checks. > > - `src/hotspot/share/opto/vectorization.cpp`: > - Read the explanations and proofs above `VPointer::can_make_speculative_aliasing_check_with`. It explains how the aliasing runtime check works. > > - `src/hotspot/share/opto/vtransform.hpp`: > - Understand the difference between weak and strong edges. > > If you need to see some examples, then look at the tests: > - `test/hotspot/jtreg/compiler/loopopts/superword/TestAliasing.java`: simple array cases. IR rules that check for vectors and in somecases if we used multiversioning. > - `test/micro/org/openjdk/bench/vm/compiler/VectorAliasing.java`: the miro-benchmarks I show below. Simple array cases. > - `test/hotspot/jtreg/compiler/loopopts/superword/TestMemorySegmentAliasing.java`: a bit advanced, but similar cases. > - `test/hotspot/jtreg/compiler/loopopts/superword/TestAliasingFuzzer.java`: very large and rather compliex. Generates random loops, some with and some without aliasing at runtime. IR verification, but mostly currently only for array cases, MemorySegment cases have some issues (see comments). > -------------------------- > > **Details** > > Most fundamentally: > - I had to refactor / extend `MemPointer` so that we have access to `MemPointerRawSummand`s. > - These raw summands us to reconstruct the `VPointer` at any `iv` value with `VPointer::make_pointer_expression(Node* iv_value)`. > - With the raw summands, a pointer may look like this: `p = base + ConvI2L(x + 2) + ConvI2L(y + 2)` > - With "regular" summands, this gets simplified to `p = base + 4L +ConvI2L(x) + ConvI2L(y)` > - For aliasing analysis (adjacency and overlap), the "regu... Still hoping for reviewers :) ------------- PR Comment: https://git.openjdk.org/jdk/pull/24278#issuecomment-3119727592 From mli at openjdk.org Fri Jul 25 19:39:55 2025 From: mli at openjdk.org (Hamlin Li) Date: Fri, 25 Jul 2025 19:39:55 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v5] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 15:33:56 GMT, Feilong Jiang wrote: >> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: >> >> comments > > src/hotspot/cpu/riscv/nativeInst_riscv.hpp line 118: > >> 116: // private: when common code is using byte_size() >> 117: private: >> 118: enum { > > I see the enum of NativeFarCall was named as `RISCV_specific_constants`, do we need this for NativeCall? Thank you having a look. Seems not, there is no name for this enum in orignal code . And in this file some enums have names, some not, but seems either way is fine, although I think the names are all redundant here. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26370#discussion_r2231880820 From jbhateja at openjdk.org Fri Jul 25 20:09:40 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 25 Jul 2025 20:09:40 GMT Subject: RFR: 8303762: Optimize vector slice operation with constant index using VPALIGNR instruction [v2] In-Reply-To: References: Message-ID: > Patch optimizes Vector. slice operation with constant index using x86 ALIGNR instruction. > It also adds a new hybrid call generator to facilitate lazy intrinsification or else perform procedural inlining to prevent call overhead and boxing penalties in case the fallback implementation expects to operate over vectors. The existing vector API-based slice implementation is now the fallback code that gets inlined in case intrinsification fails. > > Idea here is to add infrastructure support to enable intrinsification of fast path for selected vector APIs, else enable inlining of fall-back implementation if it's based on vector APIs. Existing call generators like PredictedCallGenerator, used to handle bi-morphic inlining, already make use of multiple call generators to handle hit/miss scenarios for a particular receiver type. The newly added hybrid call generator is lazy and called during incremental inlining optimization. It also relieves the inline expander to handle slow paths, which can easily be implemented library side (Java). > > Vector API jtreg tests pass at AVX level 2, remaining validation in progress. > > Performance numbers: > > > System : 13th Gen Intel(R) Core(TM) i3-1315U > > Baseline: > Benchmark (size) Mode Cnt Score Error Units > VectorSliceBenchmark.byteVectorSliceWithConstantIndex1 1024 thrpt 2 9444.444 ops/ms > VectorSliceBenchmark.byteVectorSliceWithConstantIndex2 1024 thrpt 2 10009.319 ops/ms > VectorSliceBenchmark.byteVectorSliceWithVariableIndex 1024 thrpt 2 9081.926 ops/ms > VectorSliceBenchmark.intVectorSliceWithConstantIndex1 1024 thrpt 2 6085.825 ops/ms > VectorSliceBenchmark.intVectorSliceWithConstantIndex2 1024 thrpt 2 6505.378 ops/ms > VectorSliceBenchmark.intVectorSliceWithVariableIndex 1024 thrpt 2 6204.489 ops/ms > VectorSliceBenchmark.longVectorSliceWithConstantIndex1 1024 thrpt 2 1651.334 ops/ms > VectorSliceBenchmark.longVectorSliceWithConstantIndex2 1024 thrpt 2 1642.784 ops/ms > VectorSliceBenchmark.longVectorSliceWithVariableIndex 1024 thrpt 2 1474.808 ops/ms > VectorSliceBenchmark.shortVectorSliceWithConstantIndex1 1024 thrpt 2 10399.394 ops/ms > VectorSliceBenchmark.shortVectorSliceWithConstantIndex2 1024 thrpt 2 10502.894 ops/ms > VectorSliceBenchmark.shortVectorSliceWithVariableIndex 1024 ... 
Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Updating predicate checks ------------- Changes: - all: https://git.openjdk.org/jdk/pull/24104/files - new: https://git.openjdk.org/jdk/pull/24104/files/b2e93434..04be59a6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=24104&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=24104&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/24104.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/24104/head:pull/24104 PR: https://git.openjdk.org/jdk/pull/24104 From dlong at openjdk.org Fri Jul 25 21:13:58 2025 From: dlong at openjdk.org (Dean Long) Date: Fri, 25 Jul 2025 21:13:58 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 03:23:38 GMT, Guanqiang Han wrote: >> I think it is good to detect mismatches between T_LONG and T_ADDRESS, so I'd rather not relax the checks. Why not fix >> do_vectorizedMismatch() to use new_register(T_ADDRESS)? And maybe file a separate RFE to cleanup this confusion that new_pointer_register() causes. > > @dean-long Thanks for the feedback! > Initially, I also considered modifying do_vectorizedMismatch() to use new_register(T_ADDRESS), as you suggested. However, I found that this change would trigger a series of follow-up modifications. as shown below: > image3 > image4 > That?s why I opted for a more localized fix . I believe this is still a reasonable compromise. On 64-bit platforms, both T_ADDRESS and T_LONG are 64-bit wide, and general-purpose registers are capable of holding either type. Moreover, the code already uses movptr for moving 64-bit wide data , as shown below: > image5 > So semantically, this modification in PR seems safe and practical in this context. > That said, I fully agree that the current treatment of new_pointer_register() is a bit confusing, If you, or other experts familiar with this area, believe the RFE is reasonable and it gets opened, I?d be happy to take on the implementation. > Thanks again for your insights, and I look forward to your feedback. @hgqxjj , I wasn't suggesting changing the new_pointer_register() implementation to use T_ADDRESS at this time, but to change intrinsics that call signature.append(T_ADDRESS) to use new_register(T_ADDRESS) for the register instead of with new_pointer_register(). As @TobiHartmann pointed out, we should fix all the intrinsics that are using signature.append(T_ADDRESS). ------------- PR Comment: https://git.openjdk.org/jdk/pull/26462#issuecomment-3120372534 From fjiang at openjdk.org Sat Jul 26 00:25:54 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Sat, 26 Jul 2025 00:25:54 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v5] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 08:39:39 GMT, Hamlin Li wrote: >> Hi, >> Can you help to review this patch? >> >> By https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp#L1270, there are far call, indirect call, reloc call. >> NativeFarCall is in fact a reloc call, the name is confusing, better to rename it to RelocCall. >> Finally, choose to remove the NativeFarCall and delegation from NativeCall to NativeFarCall, and move all the implementation to NativeCall itself. >> Also add some comments and do some other simple cleanup. >> >> Thanks! 
> > Hamlin Li has updated the pull request incrementally with one additional commit since the last revision: > > comments Marked as reviewed by fjiang (Committer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26370#pullrequestreview-3057183153 From dlong at openjdk.org Sat Jul 26 01:40:08 2025 From: dlong at openjdk.org (Dean Long) Date: Sat, 26 Jul 2025 01:40:08 GMT Subject: RFR: 8361376: Regressions 1-6% in several Renaissance in 26-b4 only MacOSX aarch64 [v2] In-Reply-To: References: Message-ID: On Thu, 24 Jul 2025 18:51:22 GMT, Dean Long wrote: >> This PR removes the recently added lock around set_guard_value, using instead Atomic::cmpxchg to atomically update bit-fields of the guard value. Further, it takes a fast-path that uses the previous direct store when at a safepoint. Combined, these changes should get us back to almost where we were before in terms of overhead. If necessary, we could go even further and allow make_not_entrant() to perform a direct byte store, leaving 24 bits for the guard value. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > remove NMethodEntryBarrier_lock Unfortunately, I am still seeing a small 1% regression in Renaissance-NaiveBayes with ZGC. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26399#issuecomment-3120938875 From ghan at openjdk.org Sun Jul 27 10:06:53 2025 From: ghan at openjdk.org (Guanqiang Han) Date: Sun, 27 Jul 2025 10:06:53 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" In-Reply-To: <5JIvw451TX46AxSet-I-cjt0STj5riAlK4ajGDvIWrI=.2de4150e-a259-49c6-a628-3a0675e723fa@github.com> References: <5JIvw451TX46AxSet-I-cjt0STj5riAlK4ajGDvIWrI=.2de4150e-a259-49c6-a628-3a0675e723fa@github.com> Message-ID: On Fri, 25 Jul 2025 14:19:13 GMT, Damon Fenacci wrote: >> I'm able to consistently reproduce the problem using the following command line and test program ? >> >> java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java >> >> import java.util.Arrays; >> public class Test{ >> public static void main(String[] args) { >> System.out.println("begin"); >> byte[] arr1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; >> byte[] arr2 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; >> System.out.println(Arrays.equals(arr1, arr2)); >> System.out.println("end"); >> } >> } >> >> From my analysis, the root cause appears to be a mismatch in operand handling between T_ADDRESS and T_LONG in LIR_Assembler::stack2reg, especially when the source is marked as double stack (e.g., T_LONG) and the destination as single CPU register (e.g., T_ADDRESS), leading to assertion failures like assert(is_single_stack())(because T_LONG is double_size). >> >> In the test program above , the call chain is: Arrays.equals ? ArraysSupport.vectorizedMismatch ? LIRGenerator::do_vectorizedMismatch >> Within the do_vectorizedMismatch() method, a move instruction constructs an LIR_Op1. During LIR to machine code generation, LIR_Assembler::stack2reg was called. >> >> In this case, the src operand has type T_LONG and the dst operand has type T_ADDRESS. This combination triggers an assert in stack2reg, due to a mismatch between the stack slot type and register type handling. >> >> Importantly, this path ( LIR_Assembler::stack2reg was called ) is only taken when src is forced onto the stack. To reliably trigger this condition, the test is run with the -Xcomp option to force compilation and increase register pressure. 
>> >> A reference to the relevant code paths is provided below : >> image1 >> image2 >> >> On 64-bit platforms, although T_ADDRESS is classified as single_size, it is in fact 64 bits wide ,represent a single 64-bit general-purpose register and it can hold a T_LONG value, which is also 64 bits. >> >> However, T_LONG is defined as double_size, requiring two local variable slots or a pair of registers in the JVM's abstract model. This mismatch stems from the fact that T_ADDRESS is platform-dependent: it's 32 bits on 32-bit platforms, and 64 bits on 64-bit platforms ? yet its size class... > > src/hotspot/share/c1/c1_LIR.hpp line 430: > >> 428: int single_stack_ix() const { assert(is_single_stack() && !is_virtual(), "type check"); return (int)data(); } >> 429: int double_stack_ix() const { assert(is_double_stack() && !is_virtual(), "type check"); return (int)data(); } >> 430: int stack_ix() const { assert((is_double_stack() || is_single_stack()) && !is_virtual(), "type check"); return (int)data(); } > > Minor thing, but I would follow the alignment of the other methods. @dafedafe Thanks for the feedback. I'll be more careful with these details next time. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26462#discussion_r2233894328 From ghan at openjdk.org Sun Jul 27 10:20:35 2025 From: ghan at openjdk.org (Guanqiang Han) Date: Sun, 27 Jul 2025 10:20:35 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" [v2] In-Reply-To: References: Message-ID: > I'm able to consistently reproduce the problem using the following command line and test program ? > > java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java > > import java.util.Arrays; > public class Test{ > public static void main(String[] args) { > System.out.println("begin"); > byte[] arr1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > byte[] arr2 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > System.out.println(Arrays.equals(arr1, arr2)); > System.out.println("end"); > } > } > > From my analysis, the root cause appears to be a mismatch in operand handling between T_ADDRESS and T_LONG in LIR_Assembler::stack2reg, especially when the source is marked as double stack (e.g., T_LONG) and the destination as single CPU register (e.g., T_ADDRESS), leading to assertion failures like assert(is_single_stack())(because T_LONG is double_size). > > In the test program above , the call chain is: Arrays.equals ? ArraysSupport.vectorizedMismatch ? LIRGenerator::do_vectorizedMismatch > Within the do_vectorizedMismatch() method, a move instruction constructs an LIR_Op1. During LIR to machine code generation, LIR_Assembler::stack2reg was called. > > In this case, the src operand has type T_LONG and the dst operand has type T_ADDRESS. This combination triggers an assert in stack2reg, due to a mismatch between the stack slot type and register type handling. > > Importantly, this path ( LIR_Assembler::stack2reg was called ) is only taken when src is forced onto the stack. To reliably trigger this condition, the test is run with the -Xcomp option to force compilation and increase register pressure. > > A reference to the relevant code paths is provided below : > image1 > image2 > > On 64-bit platforms, although T_ADDRESS is classified as single_size, it is in fact 64 bits wide ,represent a single 64-bit general-purpose register and it can hold a T_LONG value, which is also 64 bits. 
> > However, T_LONG is defined as double_size, requiring two local variable slots or a pair of registers in the JVM's abstract model. This mismatch stems from the fact that T_ADDRESS is platform-dependent: it's 32 bits on 32-bit platforms, and 64 bits on 64-bit platforms ? yet its size classification remains single_size regardless. > > This classification... Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - add regression test - Merge remote-tracking branch 'upstream/master' into 8359235 - 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26462/files - new: https://git.openjdk.org/jdk/pull/26462/files/f4de477b..bc8b5c17 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26462&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26462&range=00-01 Stats: 611 lines in 64 files changed: 389 ins; 99 del; 123 mod Patch: https://git.openjdk.org/jdk/pull/26462.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26462/head:pull/26462 PR: https://git.openjdk.org/jdk/pull/26462 From ghan at openjdk.org Sun Jul 27 14:00:42 2025 From: ghan at openjdk.org (Guanqiang Han) Date: Sun, 27 Jul 2025 14:00:42 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" [v3] In-Reply-To: References: Message-ID: <609eLCJRQp0h2hjIo_CD_K3E2CJ3GA9F0HGnAr5Ufk0=.f2d82969-1566-4ffe-bfd5-74b20bd5d417@github.com> > I'm able to consistently reproduce the problem using the following command line and test program ? > > java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java > > import java.util.Arrays; > public class Test{ > public static void main(String[] args) { > System.out.println("begin"); > byte[] arr1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > byte[] arr2 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > System.out.println(Arrays.equals(arr1, arr2)); > System.out.println("end"); > } > } > > From my analysis, the root cause appears to be a mismatch in operand handling between T_ADDRESS and T_LONG in LIR_Assembler::stack2reg, especially when the source is marked as double stack (e.g., T_LONG) and the destination as single CPU register (e.g., T_ADDRESS), leading to assertion failures like assert(is_single_stack())(because T_LONG is double_size). > > In the test program above , the call chain is: Arrays.equals ? ArraysSupport.vectorizedMismatch ? LIRGenerator::do_vectorizedMismatch > Within the do_vectorizedMismatch() method, a move instruction constructs an LIR_Op1. During LIR to machine code generation, LIR_Assembler::stack2reg was called. > > In this case, the src operand has type T_LONG and the dst operand has type T_ADDRESS. This combination triggers an assert in stack2reg, due to a mismatch between the stack slot type and register type handling. > > Importantly, this path ( LIR_Assembler::stack2reg was called ) is only taken when src is forced onto the stack. To reliably trigger this condition, the test is run with the -Xcomp option to force compilation and increase register pressure. 
> > A reference to the relevant code paths is provided below : > image1 > image2 > > On 64-bit platforms, although T_ADDRESS is classified as single_size, it is in fact 64 bits wide ,represent a single 64-bit general-purpose register and it can hold a T_LONG value, which is also 64 bits. > > However, T_LONG is defined as double_size, requiring two local variable slots or a pair of registers in the JVM's abstract model. This mismatch stems from the fact that T_ADDRESS is platform-dependent: it's 32 bits on 32-bit platforms, and 64 bits on 64-bit platforms ? yet its size classification remains single_size regardless. > > This classification... Guanqiang Han has updated the pull request incrementally with one additional commit since the last revision: Increase sleep time to ensure the method gets compiled ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26462/files - new: https://git.openjdk.org/jdk/pull/26462/files/bc8b5c17..611d2fd1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26462&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26462&range=01-02 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26462.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26462/head:pull/26462 PR: https://git.openjdk.org/jdk/pull/26462 From duke at openjdk.org Sun Jul 27 14:25:58 2025 From: duke at openjdk.org (duke) Date: Sun, 27 Jul 2025 14:25:58 GMT Subject: RFR: 8362596: RISC-V: Improve _vectorizedHashCode intrinsic In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 08:07:48 GMT, Yuri Gaevsky wrote: > This is a micro-optimization for RISC-V SpacemiT K1 CPU to fix [encountered performance regression](https://github.com/openjdk/jdk/pull/17413#issuecomment-3082664335). @ygaevsky Your change (at version 1d5cb89486b935e2f30365f66e6bf5afd2058424) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26409#issuecomment-3124451738 From duke at openjdk.org Sun Jul 27 14:57:59 2025 From: duke at openjdk.org (Yuri Gaevsky) Date: Sun, 27 Jul 2025 14:57:59 GMT Subject: Integrated: 8362596: RISC-V: Improve _vectorizedHashCode intrinsic In-Reply-To: References: Message-ID: On Mon, 21 Jul 2025 08:07:48 GMT, Yuri Gaevsky wrote: > This is a micro-optimization for RISC-V SpacemiT K1 CPU to fix [encountered performance regression](https://github.com/openjdk/jdk/pull/17413#issuecomment-3082664335). This pull request has now been integrated. Changeset: 4189fcba Author: Yuri Gaevsky Committer: Feilong Jiang URL: https://git.openjdk.org/jdk/commit/4189fcbac40943f3b26c3a01938837b4e4762285 Stats: 6 lines in 1 file changed: 1 ins; 3 del; 2 mod 8362596: RISC-V: Improve _vectorizedHashCode intrinsic Reviewed-by: fyang, fjiang ------------- PR: https://git.openjdk.org/jdk/pull/26409 From ghan at openjdk.org Sun Jul 27 15:58:52 2025 From: ghan at openjdk.org (Guanqiang Han) Date: Sun, 27 Jul 2025 15:58:52 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" [v3] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 14:39:54 GMT, Tobias Hartmann wrote: > Thanks for looking into this! > > When I run your test with `java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java` we still crash when compiling `java.lang.invoke.LambdaFormEditor::putInCache` and if I restrict compilation to your test method via `-XX:CompileCommand=compileonly,Test::*`, the issue does not reproduce anymore. 
Could you please add a targeted regression test for this issue? @TobiHartmann Thanks for the feedback! I think there might be a bit of a misunderstanding. The original test program I provided is actually meant to reproduce the issue when run with: "java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java",it usually takes about 10 minutes to trigger the crash. Using -XX:CompileCommand=compileonly,Test::* won?t reproduce the issue because it compiles only a small subset of methods, which doesn?t put enough pressure on the register allocator to cause the spill (stack2reg can not be called ). Let?s forget about the previous test ? I?ve redesigned a new one and already committed it. Feel free to give it a try when you have time! > +1 to what Dean suggested. I think other intrinsics are affected by this as well though, for example: > > https://github.com/openjdk/jdk/blob/b1fa1ecc988fb07f191892a459625c2c8f2de3b5/src/hotspot/cpu/x86/c1_LIRGenerator_x86.cpp#L953-L962 > > Also, what about other platforms than x86? @TobiHartmann Other methods such as do_update_CRC32 may have similar issues, but they are harder to reproduce. Fortunately, other architectures have not implemented do_vectorizedMismatch. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26462#issuecomment-3124508002 From ghan at openjdk.org Sun Jul 27 16:12:55 2025 From: ghan at openjdk.org (Guanqiang Han) Date: Sun, 27 Jul 2025 16:12:55 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" [v3] In-Reply-To: References: Message-ID: On Sun, 27 Jul 2025 15:56:25 GMT, Guanqiang Han wrote: >> Thanks for looking into this! >> >> When I run your test with `java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java` we still crash when compiling `java.lang.invoke.LambdaFormEditor::putInCache` and if I restrict compilation to your test method via `-XX:CompileCommand=compileonly,Test::*`, the issue does not reproduce anymore. Could you please add a targeted regression test for this issue? > >> Thanks for looking into this! >> >> When I run your test with `java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java` we still crash when compiling `java.lang.invoke.LambdaFormEditor::putInCache` and if I restrict compilation to your test method via `-XX:CompileCommand=compileonly,Test::*`, the issue does not reproduce anymore. Could you please add a targeted regression test for this issue? > > @TobiHartmann Thanks for the feedback! I think there might be a bit of a misunderstanding. > The original test program I provided is actually meant to reproduce the issue when run with: "java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java",it usually takes about 10 minutes to trigger the crash. Using -XX:CompileCommand=compileonly,Test::* won?t reproduce the issue because it compiles only a small subset of methods, which doesn?t put enough pressure on the register allocator to cause the spill (stack2reg can not be called ). Let?s forget about the previous test ? I?ve redesigned a new one and already committed it. Feel free to give it a try when you have time! > > > >> +1 to what Dean suggested. I think other intrinsics are affected by this as well though, for example: >> >> https://github.com/openjdk/jdk/blob/b1fa1ecc988fb07f191892a459625c2c8f2de3b5/src/hotspot/cpu/x86/c1_LIRGenerator_x86.cpp#L953-L962 >> >> Also, what about other platforms than x86? 
> > @TobiHartmann Other methods such as do_update_CRC32 may have similar issues, but they are harder to reproduce. Fortunately, other architectures have not implemented do_vectorizedMismatch. > @hgqxjj , I wasn't suggesting changing the new_pointer_register() implementation to use T_ADDRESS at this time, but to change intrinsics that call signature.append(T_ADDRESS) to use new_register(T_ADDRESS) for the register instead of with new_pointer_register(). As @TobiHartmann pointed out, we should fix all the intrinsics that are using signature.append(T_ADDRESS). @dean-long Thanks for your suggestion! After reviewing the code again, I think your approach would work fine for the x86 architecture. However, for other architectures like aarch64, we would also need to modify the implementation of leal accordingly, since it checks the type of the target operand. The relevant code is as follows: https://github.com/openjdk/jdk/blob/4189fcbac40943f3b26c3a01938837b4e4762285/src/hotspot/cpu/aarch64/c1_LIRGenerator_aarch64.cpp#L985 https://github.com/openjdk/jdk/blob/4189fcbac40943f3b26c3a01938837b4e4762285/src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp#L2826 https://github.com/openjdk/jdk/blob/4189fcbac40943f3b26c3a01938837b4e4762285/src/hotspot/share/c1/c1_LIR.cpp#L40 https://github.com/openjdk/jdk/blob/4189fcbac40943f3b26c3a01938837b4e4762285/src/hotspot/share/c1/c1_LIR.hpp#L431 T_ADDRESS is not double cpu? so i need to modify the implementation of leal accordingly . @dean-long @TobiHartmann Do you think this approach is okay, or do you have any other suggestions? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26462#issuecomment-3124515635 From haosun at openjdk.org Mon Jul 28 01:01:06 2025 From: haosun at openjdk.org (Hao Sun) Date: Mon, 28 Jul 2025 01:01:06 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v18] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 09:17:19 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. 
>> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Refine comments in the ad file Marked as reviewed by haosun (Committer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23570#pullrequestreview-3059765593 From jkarthikeyan at openjdk.org Mon Jul 28 02:39:02 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Mon, 28 Jul 2025 02:39:02 GMT Subject: RFR: 8362979: C2 fails with unexpected node in SuperWord truncation: CmpLTMask, RoundF Message-ID: Hi all, This is a fix for a debug assert failure in SuperWord truncation for `CmpLTMask` and `RoundF` nodes, as discovered by CTW in the linked JBS report. I've added the nodes to the switch, and added reduced test cases. I've made a similar fix for `RoundD` nodes as well. Thanks! ------------- Commit messages: - Fix truncation assert for CmpLTMask and rounding Changes: https://git.openjdk.org/jdk/pull/26494/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26494&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8362979 Stats: 57 lines in 3 files changed: 57 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26494.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26494/head:pull/26494 PR: https://git.openjdk.org/jdk/pull/26494 From chagedorn at openjdk.org Mon Jul 28 05:55:54 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 28 Jul 2025 05:55:54 GMT Subject: RFR: 8362979: C2 fails with unexpected node in SuperWord truncation: CmpLTMask, RoundF In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 02:34:25 GMT, Jasmine Karthikeyan wrote: > Hi all, > This is a fix for a debug assert failure in SuperWord truncation for `CmpLTMask` and `RoundF` nodes, as discovered by CTW in the linked JBS report. I've added the nodes to the switch, and added reduced test cases. I've made a similar fix for `RoundD` nodes as well. Thanks! Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26494#pullrequestreview-3060502412 From jbhateja at openjdk.org Mon Jul 28 05:55:55 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 28 Jul 2025 05:55:55 GMT Subject: RFR: 8303762: Optimize vector slice operation with constant index using VPALIGNR instruction [v2] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 20:09:40 GMT, Jatin Bhateja wrote: >> Patch optimizes Vector. slice operation with constant index using x86 ALIGNR instruction. 
>> It also adds a new hybrid call generator to facilitate lazy intrinsification or else perform procedural inlining to prevent call overhead and boxing penalties in case the fallback implementation expects to operate over vectors. The existing vector API-based slice implementation is now the fallback code that gets inlined in case intrinsification fails. >> >> Idea here is to add infrastructure support to enable intrinsification of fast path for selected vector APIs, else enable inlining of fall-back implementation if it's based on vector APIs. Existing call generators like PredictedCallGenerator, used to handle bi-morphic inlining, already make use of multiple call generators to handle hit/miss scenarios for a particular receiver type. The newly added hybrid call generator is lazy and called during incremental inlining optimization. It also relieves the inline expander to handle slow paths, which can easily be implemented library side (Java). >> >> Vector API jtreg tests pass at AVX level 2, remaining validation in progress. >> >> Performance numbers: >> >> >> System : 13th Gen Intel(R) Core(TM) i3-1315U >> >> Baseline: >> Benchmark (size) Mode Cnt Score Error Units >> VectorSliceBenchmark.byteVectorSliceWithConstantIndex1 1024 thrpt 2 9444.444 ops/ms >> VectorSliceBenchmark.byteVectorSliceWithConstantIndex2 1024 thrpt 2 10009.319 ops/ms >> VectorSliceBenchmark.byteVectorSliceWithVariableIndex 1024 thrpt 2 9081.926 ops/ms >> VectorSliceBenchmark.intVectorSliceWithConstantIndex1 1024 thrpt 2 6085.825 ops/ms >> VectorSliceBenchmark.intVectorSliceWithConstantIndex2 1024 thrpt 2 6505.378 ops/ms >> VectorSliceBenchmark.intVectorSliceWithVariableIndex 1024 thrpt 2 6204.489 ops/ms >> VectorSliceBenchmark.longVectorSliceWithConstantIndex1 1024 thrpt 2 1651.334 ops/ms >> VectorSliceBenchmark.longVectorSliceWithConstantIndex2 1024 thrpt 2 1642.784 ops/ms >> VectorSliceBenchmark.longVectorSliceWithVariableIndex 1024 thrpt 2 1474.808 ops/ms >> VectorSliceBenchmark.shortVectorSliceWithConstantIndex1 1024 thrpt 2 10399.394 ops/ms >> VectorSliceBenchmark.shortVectorSliceWithConstantIndex2 1024 thrpt 2 10502.894 ops/ms >> VectorSliceB... > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > Updating predicate checks Performance on AVX512 machine Baseline: Benchmark (size) Mode Cnt Score Error Units VectorSliceBenchmark.byteVectorSliceWithConstantIndex1 1024 thrpt 4 35741.780 ? 1561.065 ops/ms VectorSliceBenchmark.byteVectorSliceWithConstantIndex2 1024 thrpt 4 35011.929 ? 5886.902 ops/ms VectorSliceBenchmark.byteVectorSliceWithVariableIndex 1024 thrpt 4 32366.844 ? 1489.449 ops/ms VectorSliceBenchmark.intVectorSliceWithConstantIndex1 1024 thrpt 4 10636.281 ? 608.705 ops/ms VectorSliceBenchmark.intVectorSliceWithConstantIndex2 1024 thrpt 4 10750.833 ? 328.997 ops/ms VectorSliceBenchmark.intVectorSliceWithVariableIndex 1024 thrpt 4 10257.338 ? 2027.422 ops/ms VectorSliceBenchmark.longVectorSliceWithConstantIndex1 1024 thrpt 4 5362.330 ? 4199.651 ops/ms VectorSliceBenchmark.longVectorSliceWithConstantIndex2 1024 thrpt 4 4992.399 ? 6053.641 ops/ms VectorSliceBenchmark.longVectorSliceWithVariableIndex 1024 thrpt 4 4941.258 ? 478.193 ops/ms VectorSliceBenchmark.shortVectorSliceWithConstantIndex1 1024 thrpt 4 40432.828 ? 26672.673 ops/ms VectorSliceBenchmark.shortVectorSliceWithConstantIndex2 1024 thrpt 4 41300.811 ? 34342.482 ops/ms VectorSliceBenchmark.shortVectorSliceWithVariableIndex 1024 thrpt 4 36958.309 ? 
1899.676 ops/ms Withopt: Benchmark (size) Mode Cnt Score Error Units VectorSliceBenchmark.byteVectorSliceWithConstantIndex1 1024 thrpt 10 67936.711 ? 389.783 ops/ms VectorSliceBenchmark.byteVectorSliceWithConstantIndex2 1024 thrpt 10 70086.731 ? 5972.968 ops/ms VectorSliceBenchmark.byteVectorSliceWithVariableIndex 1024 thrpt 10 31879.187 ? 148.213 ops/ms VectorSliceBenchmark.intVectorSliceWithConstantIndex1 1024 thrpt 10 17676.883 ? 217.238 ops/ms VectorSliceBenchmark.intVectorSliceWithConstantIndex2 1024 thrpt 10 16983.007 ? 3988.548 ops/ms VectorSliceBenchmark.intVectorSliceWithVariableIndex 1024 thrpt 10 9851.266 ? 31.773 ops/ms VectorSliceBenchmark.longVectorSliceWithConstantIndex1 1024 thrpt 10 9194.216 ? 42.772 ops/ms VectorSliceBenchmark.longVectorSliceWithConstantIndex2 1024 thrpt 10 8411.738 ? 33.209 ops/ms VectorSliceBenchmark.longVectorSliceWithVariableIndex 1024 thrpt 10 5244.850 ? 12.214 ops/ms VectorSliceBenchmark.shortVectorSliceWithConstantIndex1 1024 thrpt 10 61233.526 ? 20472.895 ops/ms VectorSliceBenchmark.shortVectorSliceWithConstantIndex2 1024 thrpt 10 61545.276 ? 20722.066 ops/ms VectorSliceBenchmark.shortVectorSliceWithVariableIndex 1024 thrpt 10 41208.718 ? 5374.829 ops/ms ------------- PR Comment: https://git.openjdk.org/jdk/pull/24104#issuecomment-3125629912 From chagedorn at openjdk.org Mon Jul 28 06:36:53 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 28 Jul 2025 06:36:53 GMT Subject: RFR: 8361702: C2: assert(is_dominator(compute_early_ctrl(limit, limit_ctrl), pre_end)) failed: node pinned on loop exit test? [v3] In-Reply-To: References: <1-3MDixhdwZEgDMpoAZckhK5_lFjygsKl4q1__tsCKs=.dffa9c0e-8ea1-4465-a1fc-6ad2dbcfe5db@github.com> Message-ID: On Fri, 25 Jul 2025 14:58:47 GMT, Roland Westrelin wrote: >> A node in a pre loop only has uses out of the loop dominated by the >> loop exit. `PhaseIdealLoop::try_sink_out_of_loop()` sets its control >> to the loop exit projection. A range check in the main loop has this >> node as input (through a chain of some other nodes). Range check >> elimination needs to update the exit condition of the pre loop with an >> expression that depends on the node pinned on its exit: that's >> impossible and the assert fires. This is a variant of 8314024 (this >> one was for a node with uses out of the pre loop on multiple paths). I >> propose the same fix: leave the node with control in the pre loop in >> this case. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > Update src/hotspot/share/opto/loopopts.cpp > > Co-authored-by: Christian Hagedorn Marked as reviewed by chagedorn (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/26424#pullrequestreview-3060627024 From thartmann at openjdk.org Mon Jul 28 06:44:53 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 28 Jul 2025 06:44:53 GMT Subject: RFR: 8362979: C2 fails with unexpected node in SuperWord truncation: CmpLTMask, RoundF In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 02:34:25 GMT, Jasmine Karthikeyan wrote: > Hi all, > This is a fix for a debug assert failure in SuperWord truncation for `CmpLTMask` and `RoundF` nodes, as discovered by CTW in the linked JBS report. I've added the nodes to the switch, and added reduced test cases. I've made a similar fix for `RoundD` nodes as well. Thanks! Looks good to me too. Thanks for quickly jumping on this! ------------- Marked as reviewed by thartmann (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26494#pullrequestreview-3060659436 From mhaessig at openjdk.org Mon Jul 28 08:04:56 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 28 Jul 2025 08:04:56 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v2] In-Reply-To: References: Message-ID: <5AHjMamXpJNedikYebcuP6DhN7NM5Vg5xJgxDnxcV-s=.34299053-9cc9-4035-bcb0-0d23e521162b@github.com> On Thu, 24 Jul 2025 19:03:24 GMT, Dean Long wrote: >> src/hotspot/share/runtime/deoptimization.cpp line 847: >> >>> 845: >>> 846: #ifndef PRODUCT >>> 847: #ifdef ASSERT >> >> Why is both `NOT_PRODUCT` and `ASSERT` needed here? So far, I thought that `ASSERT` implies `NOT_PRODUCT`. > > Unfortunately, they are not the same, thanks to "optimized" builds. We can clean this up if optimizes builds get removed. See https://bugs.openjdk.org/browse/JDK-8183287. Ah, "optimized" is with neither `ASSERT` nor `PRODUCT` defined. Makes sense now. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2235198526 From mhaessig at openjdk.org Mon Jul 28 08:36:55 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 28 Jul 2025 08:36:55 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v7] In-Reply-To: References: Message-ID: <_gJoTNnBpV2Y2ENO9s153NWZeq_ujs40-zoyuZstOqM=.69d1d039-5022-4beb-ae79-7fc4193f3a11@github.com> On Thu, 24 Jul 2025 20:03:33 GMT, Dean Long wrote: >> The VerifyStack logic in Deoptimization::unpack_frames() attempts to check the expression stack size of the interpreter frame against what GenerateOopMap computes. To do this, it needs to know if the state at the current bci represents the "before" state, meaning the bytecode will be reexecuted, or the "after" state, meaning we will advance to the next bytecode. The old code didn't know how to determine exactly what state we were in, so it checked both. This PR cleans that up, so we only have to compute the oopmap once. It also removes old SPARC support. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > readability suggestion Thank you for addressing my comments. I have done another pass and it looks good to me. ------------- Marked as reviewed by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/26121#pullrequestreview-3061222726 From jsjolen at openjdk.org Mon Jul 28 08:41:06 2025 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Mon, 28 Jul 2025 08:41:06 GMT Subject: RFR: 8352112: [ubsan] hotspot/share/code/relocInfo.cpp:130:37: runtime error: applying non-zero offset 18446744073709551614 to null pointer [v2] In-Reply-To: References: Message-ID: On Wed, 19 Mar 2025 17:52:46 GMT, Vladimir Kozlov wrote: >> Before [JDK-8343789](https://bugs.openjdk.org/browse/JDK-8343789) `relocation_begin()` was never null even when there was no relocations - it pointed to the beginning of constant or code section in such case. It was used by relocation code to simplify code and avoid null checks. >> With that fix `relocation_begin()` points to address in `CodeBlob::_mutable_data` field which could be `nullptr` if there is no relocation and metadata. >> >> There easy fix is to avoid `nullptr` in `CodeBlob::_mutable_data`. We could do that similar to what we do for `nmethod::_immutable_data`: [nmethod.cpp#L1514](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/nmethod.cpp#L1514). >> >> Tested tier1-4, stress, xcomp. Verified with failed tests listed in bug report. 
> > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Update field default setting I suspect that this change fixes the UBSAN issue but instead causes a runtime issue which NMT detects, see this bug: https://bugs.openjdk.org/browse/JDK-8361382 I added a caused-by link, but I'm not 100% sure that this is the case yet. src/hotspot/share/code/codeBlob.cpp line 156: > 154: } else { > 155: // We need unique and valid not null address > 156: assert(_mutable_data = blob_end(), "sanity"); Did this mean to assign the `_mutable_data`? I think it should be `==`. ------------- PR Review: https://git.openjdk.org/jdk/pull/24102#pullrequestreview-3061046981 PR Review Comment: https://git.openjdk.org/jdk/pull/24102#discussion_r2235188508 From mli at openjdk.org Mon Jul 28 08:44:02 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 28 Jul 2025 08:44:02 GMT Subject: RFR: 8362515: RISC-V: cleanup NativeFarCall [v5] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 15:36:36 GMT, Feilong Jiang wrote: > Thanks for the cleanup! I have one minor comment. Thank you! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26370#issuecomment-3126157777 From mli at openjdk.org Mon Jul 28 08:44:03 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 28 Jul 2025 08:44:03 GMT Subject: Integrated: 8362515: RISC-V: cleanup NativeFarCall In-Reply-To: References: Message-ID: <3esJOaI8GSNF45pD-JZCANQ5GsMDic8r_emp6rmkED8=.76b3804b-49ef-408e-81a1-a6c2cfeb288a@github.com> On Thu, 17 Jul 2025 14:17:45 GMT, Hamlin Li wrote: > Hi, > Can you help to review this patch? > > By https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp#L1270, there are far call, indirect call, reloc call. > NativeFarCall is in fact a reloc call, the name is confusing, better to rename it to RelocCall. > Finally, choose to remove the NativeFarCall and delegation from NativeCall to NativeFarCall, and move all the implementation to NativeCall itself. > Also add some comments and do some other simple cleanup. > > Thanks! This pull request has now been integrated. Changeset: 3e2d12d8 Author: Hamlin Li URL: https://git.openjdk.org/jdk/commit/3e2d12d85a35d9724c2ddf17a2dccf4b0866bc62 Stats: 139 lines in 2 files changed: 11 ins; 97 del; 31 mod 8362515: RISC-V: cleanup NativeFarCall Reviewed-by: fyang, fjiang ------------- PR: https://git.openjdk.org/jdk/pull/26370 From bmaillard at openjdk.org Mon Jul 28 11:52:41 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Mon, 28 Jul 2025 11:52:41 GMT Subject: RFR: 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v2] In-Reply-To: References: Message-ID: <6d5-5z-1q5VZ3bY9xGKsAbiLbz4e8IySfI7NYXZOdS0=.9574728a-5afe-496c-a89b-480562cd96db@github.com> > This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist` (again). This is similar to [JDK-8361700](https://bugs.openjdk.org/browse/JDK-8361700?filter=-1). > > The optimization in question is the removal of redundant `ConvX2Y->ConvY2X->ConvX2Y` sequences (where `X` and `Y` are primitive number types), which get replaced by a single `ConvX2Y` as an identity optimization. 
This missing optimization was originally reported only for `ConvD2LNode`, but it turns out that other conversion nodes have analog optimization patterns. After manual inspection of identity optimizations in `convertnode.cpp`, I was able to reproduce missing optimizations for the following conversion sequences: > - `ConvD2L->ConvL2D->ConvD2L` > - `ConvF2I->ConvI2F->ConvF2I` > - `ConvF2L->ConvL2F->ConvF2L` > - `ConvI2F->ConvF2I->ConvI2F` > > Similar optimization patterns exist for additional conversion nodes. However, it is not clear if these nodes are subject to the same missed optimization issue. Further investigation may be needed, as I was unable to reproduce such cases with simple tests. > > This is again a case where an optimization depends on the input of its input. Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8359603) > - [x] tier1-3, plus some internal testing > - [x] Added test from the fuzzer, and tests for other sequences (manually derived from the original one) > > Thank you for reviewing! Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: Update test/hotspot/jtreg/compiler/c2/TestEliminateRedundantConversionSequences.java Co-authored-by: Christian Hagedorn ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26368/files - new: https://git.openjdk.org/jdk/pull/26368/files/9c49b040..0b3244fd Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26368&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26368&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26368.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26368/head:pull/26368 PR: https://git.openjdk.org/jdk/pull/26368 From bmaillard at openjdk.org Mon Jul 28 11:58:12 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Mon, 28 Jul 2025 11:58:12 GMT Subject: RFR: 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v3] In-Reply-To: References: Message-ID: <1sVn4xoZ_PcWL36gmBVi_IBEaYO4AQzSXuJMFKygdvI=.5a54417c-ed5d-4e95-b997-bf7bacf673af@github.com> > This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist` (again). This is similar to [JDK-8361700](https://bugs.openjdk.org/browse/JDK-8361700?filter=-1). > > The optimization in question is the removal of redundant `ConvX2Y->ConvY2X->ConvX2Y` sequences (where `X` and `Y` are primitive number types), which get replaced by a single `ConvX2Y` as an identity optimization. This missing optimization was originally reported only for `ConvD2LNode`, but it turns out that other conversion nodes have analog optimization patterns. 
After manual inspection of identity optimizations in `convertnode.cpp`, I was able to reproduce missing optimizations for the following conversion sequences: > - `ConvD2L->ConvL2D->ConvD2L` > - `ConvF2I->ConvI2F->ConvF2I` > - `ConvF2L->ConvL2F->ConvF2L` > - `ConvI2F->ConvF2I->ConvI2F` > > Similar optimization patterns exist for additional conversion nodes. However, it is not clear if these nodes are subject to the same missed optimization issue. Further investigation may be needed, as I was unable to reproduce such cases with simple tests. > > This is again a case where an optimization depends on the input of its input. Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8359603) > - [x] tier1-3, plus some internal testing > - [x] Added test from the fuzzer, and tests for other sequences (manually derived from the original one) > > Thank you for reviewing! Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: 8359603: Reduce number of iterations in tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26368/files - new: https://git.openjdk.org/jdk/pull/26368/files/0b3244fd..10f79866 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26368&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26368&range=01-02 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/26368.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26368/head:pull/26368 PR: https://git.openjdk.org/jdk/pull/26368 From bmaillard at openjdk.org Mon Jul 28 11:58:13 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Mon, 28 Jul 2025 11:58:13 GMT Subject: RFR: 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v3] In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 08:44:36 GMT, Christian Hagedorn wrote: >> Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: >> >> 8359603: Reduce number of iterations in tests > > test/hotspot/jtreg/compiler/c2/TestEliminateRedundantConversionSequences.java line 94: > >> 92: >> 93: public static void main(String[] strArr) { >> 94: for (int i = 0; i < 50_000; ++i) { > > Do you really need 50000 iterations each? Would less also trigger the bug? I was able to reduce it to ~1550 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26368#discussion_r2236079880 From chagedorn at openjdk.org Mon Jul 28 11:58:56 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 28 Jul 2025 11:58:56 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v7] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 09:27:19 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is >> relative smaller than that of `fromLong`. So this patch does the conversion for these cases. 
>> >> The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. >> >> Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. >> >> This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like >> >> VectorMaskToLong (VectorLongToMask x) => x >> >> >> Hence, this patch also added the following optimizations: >> >> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> >> VectorMaskCast (VectorMaskCast x) => x >> >> And we can see noticeable performance improvement with the above optimizations for floating-point types. >> >> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Error After Error Uplift >> microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 >> microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 >> microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 >> microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 >> microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 >> microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 >> microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 >> microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 >> >> >> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: >> >> Benchm... > > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Move the assertion to the beginning of the code block I'll give this a spin in our testing - will report the results back later. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3126886620 From bmaillard at openjdk.org Mon Jul 28 12:25:56 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Mon, 28 Jul 2025 12:25:56 GMT Subject: RFR: 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v3] In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 08:49:38 GMT, Christian Hagedorn wrote: >> Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: >> >> 8359603: Reduce number of iterations in tests > > src/hotspot/share/opto/phaseX.cpp line 2565: > >> 2563: // ConvF2I->ConvI2F->ConvF2I >> 2564: // ConvF2L->ConvL2F->ConvF2L >> 2565: // ConvI2F->ConvF2I->ConvI2F > > Another thought: Since this is an incomplete list of variations (especially missing, for example, the I2D version while the I2F version is here), should we leave a comment about not being able to trigger issues with the other versions? 
Otherwise, it could suggest that it was just forgotten. The notification issue that is solved here only happens with optimizations that have this chain pattern with three nodes (checking the input of the input) and this is specific to conversions that have a loss of precision. ConvI2D is not here because the `ConvI2D->ConvD2I` gets optimized to a NOP already (`ConvD2INode::Identity`). But there are also chains for which there is a known optimization and for which I was not able to trigger a missed optimization, so it would make sense to have mention this in any case. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26368#discussion_r2236184320 From fyang at openjdk.org Mon Jul 28 12:30:06 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 28 Jul 2025 12:30:06 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call Message-ID: Hi, please consider this small change. JDK-8343430 removed the old trampoline call on RISC-V. And the new solution (reloc call) loads a target address from stub section and do an indirect call. This means the stub is always there for a NativeCall. So there's no need to check existence of the stub when doing `CallRelocation::fix_relocation_after_move` [1]. We can always return the stub address in `NativeCall::reloc_destination` and use that address in `NativeCall::reloc_set_destination`. This helps simplify the code and saves one `MacroAssembler::target_addr_for_insn` call and one `trampoline_stub_Relocation::get_trampoline_for` call in these two functions respectively. Testing on linux-riscv64: - [x] tier1-tier3 (release build) - [x] hs:tier1-hs:tier3 (fastdebug build) [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.cpp#L404-L406 ------------- Commit messages: - 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call Changes: https://git.openjdk.org/jdk/pull/26495/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26495&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8364150 Stats: 12 lines in 2 files changed: 1 ins; 7 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/26495.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26495/head:pull/26495 PR: https://git.openjdk.org/jdk/pull/26495 From bmaillard at openjdk.org Mon Jul 28 12:35:15 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Mon, 28 Jul 2025 12:35:15 GMT Subject: RFR: 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v4] In-Reply-To: References: Message-ID: > This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist` (again). This is similar to [JDK-8361700](https://bugs.openjdk.org/browse/JDK-8361700?filter=-1). > > The optimization in question is the removal of redundant `ConvX2Y->ConvY2X->ConvX2Y` sequences (where `X` and `Y` are primitive number types), which get replaced by a single `ConvX2Y` as an identity optimization. This missing optimization was originally reported only for `ConvD2LNode`, but it turns out that other conversion nodes have analog optimization patterns. 
After manual inspection of identity optimizations in `convertnode.cpp`, I was able to reproduce missing optimizations for the following conversion sequences: > - `ConvD2L->ConvL2D->ConvD2L` > - `ConvF2I->ConvI2F->ConvF2I` > - `ConvF2L->ConvL2F->ConvF2L` > - `ConvI2F->ConvF2I->ConvI2F` > > Similar optimization patterns exist for additional conversion nodes. However, it is not clear if these nodes are subject to the same missed optimization issue. Further investigation may be needed, as I was unable to reproduce such cases with simple tests. > > This is again a case where an optimization depends on the input of its input. Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8359603) > - [x] tier1-3, plus some internal testing > - [x] Added test from the fuzzer, and tests for other sequences (manually derived from the original one) > > Thank you for reviewing! Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: 8359603: Add note ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26368/files - new: https://git.openjdk.org/jdk/pull/26368/files/10f79866..2e5efdcc Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26368&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26368&range=02-03 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26368.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26368/head:pull/26368 PR: https://git.openjdk.org/jdk/pull/26368 From bmaillard at openjdk.org Mon Jul 28 12:35:17 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Mon, 28 Jul 2025 12:35:17 GMT Subject: RFR: 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v4] In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 12:23:27 GMT, Beno?t Maillard wrote: >> src/hotspot/share/opto/phaseX.cpp line 2565: >> >>> 2563: // ConvF2I->ConvI2F->ConvF2I >>> 2564: // ConvF2L->ConvL2F->ConvF2L >>> 2565: // ConvI2F->ConvF2I->ConvI2F >> >> Another thought: Since this is an incomplete list of variations (especially missing, for example, the I2D version while the I2F version is here), should we leave a comment about not being able to trigger issues with the other versions? Otherwise, it could suggest that it was just forgotten. > > The notification issue that is solved here only happens with optimizations that have this chain pattern with three nodes (checking the input of the input) and this is specific to conversions that have a loss of precision. ConvI2D is not here because the `ConvI2D->ConvD2I` gets optimized to a NOP already (`ConvD2INode::Identity`). But there are also chains for which there is a known optimization and for which I was not able to trigger a missed optimization, so it would make sense to have mention this in any case. I have added a short note, let me know what you think! 
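For readers following along, a minimal Java sketch of the kind of chain these identities collapse; this is only an illustration of the ConvD2L->ConvL2D->ConvD2L pattern discussed above, not code taken from the PR or its regression test:

    // Illustration only: javac emits d2l, l2d, d2l for the three casts, so C2 sees the
    // chain ConvD2L -> ConvL2D -> ConvD2L and the identity optimization can replace the
    // whole expression with the first ConvD2L (the value of 'a').
    static long roundTrip(double d) {
        long a = (long) d;       // ConvD2L
        double b = (double) a;   // ConvL2D
        return (long) b;         // ConvD2L, redundant once the chain is recognized
    }

(As described in the PR, the identity looks at the input of its input, which is why the extra notification in add_users_of_use_to_worklist is needed for it to be retried.)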
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26368#discussion_r2236219134 From bulasevich at openjdk.org Mon Jul 28 12:42:04 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Mon, 28 Jul 2025 12:42:04 GMT Subject: RFR: 8352112: [ubsan] hotspot/share/code/relocInfo.cpp:130:37: runtime error: applying non-zero offset 18446744073709551614 to null pointer [v2] In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 08:00:16 GMT, Johan Sj?len wrote: >> Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: >> >> Update field default setting > > src/hotspot/share/code/codeBlob.cpp line 156: > >> 154: } else { >> 155: // We need unique and valid not null address >> 156: assert(_mutable_data = blob_end(), "sanity"); > > Did this mean to assign the `_mutable_data`? I think it should be `==`. Right. This typo was fixed in https://github.com/openjdk/jdk/pull/26175 For now I do not see how this change is related with [JDK-8361382: NMT corruption](https://bugs.openjdk.org/browse/JDK-8361382) ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24102#discussion_r2236254069 From mli at openjdk.org Mon Jul 28 12:53:53 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 28 Jul 2025 12:53:53 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 04:05:20 GMT, Fei Yang wrote: > Hi, please consider this small change. > > JDK-8343430 removed the old trampoline call on RISC-V. And the new solution (reloc call) loads a target address from stub section and do an indirect call. This means the stub is always there for a NativeCall. So there's no need to check existence of the stub when doing `CallRelocation::fix_relocation_after_move` [1]. > > We can always return the stub address in `NativeCall::reloc_destination` and use that address in `NativeCall::reloc_set_destination`. This helps simplify the code and saves one `MacroAssembler::target_addr_for_insn` call > and one `trampoline_stub_Relocation::get_trampoline_for` call in these two functions respectively. > > Testing on linux-riscv64: > - [x] tier1-tier3 (release build) > - [x] hs:tier1-hs:tier3 (fastdebug build) > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.cpp#L404-L406 src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 77: > 75: stub_addr = trampoline_stub_Relocation::get_trampoline_for(call_addr, code->as_nmethod()); > 76: assert(stub_addr != nullptr, "Sanity"); > 77: return stub_addr; Seems this line is not necessary. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26495#discussion_r2236316512 From fyang at openjdk.org Mon Jul 28 13:01:42 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 28 Jul 2025 13:01:42 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v2] In-Reply-To: References: Message-ID: <_66H8SCUT2si17IPmfPJ7g3tj9QW6m3Q7YOGMOsBKI8=.b1a9d462-df1f-4bb3-b4d0-07deadf28749@github.com> > Hi, please consider this small change. > > JDK-8343430 removed the old trampoline call on RISC-V. And the new solution (reloc call) loads a target address from stub section and do an indirect call. This means the stub is always there for a NativeCall. So there's no need to check existence of the stub when doing `CallRelocation::fix_relocation_after_move` [1]. 
> > We can always return the stub address in `NativeCall::reloc_destination` and use that address in `NativeCall::reloc_set_destination`. This helps simplify the code and saves one `MacroAssembler::target_addr_for_insn` call > and one `trampoline_stub_Relocation::get_trampoline_for` call in these two functions respectively. > > Testing on linux-riscv64: > - [x] tier1-tier3 (release build) > - [x] hs:tier1-hs:tier3 (fastdebug build) > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.cpp#L404-L406 Fei Yang has updated the pull request incrementally with one additional commit since the last revision: Comment ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26495/files - new: https://git.openjdk.org/jdk/pull/26495/files/05b69ad4..473b9fe1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26495&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26495&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26495.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26495/head:pull/26495 PR: https://git.openjdk.org/jdk/pull/26495 From mli at openjdk.org Mon Jul 28 13:01:43 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 28 Jul 2025 13:01:43 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v2] In-Reply-To: <_66H8SCUT2si17IPmfPJ7g3tj9QW6m3Q7YOGMOsBKI8=.b1a9d462-df1f-4bb3-b4d0-07deadf28749@github.com> References: <_66H8SCUT2si17IPmfPJ7g3tj9QW6m3Q7YOGMOsBKI8=.b1a9d462-df1f-4bb3-b4d0-07deadf28749@github.com> Message-ID: <9jNyVzYYc0O9nHTJYWpgbqkcV-mMzyKUvdgzpGNf6qM=.bb7b8abb-fc13-4057-8559-cc9a92a56393@github.com> On Mon, 28 Jul 2025 12:58:21 GMT, Fei Yang wrote: >> Hi, please consider this small change. >> >> JDK-8343430 removed the old trampoline call on RISC-V. And the new solution (reloc call) loads a target address from stub section and do an indirect call. This means the stub is always there for a NativeCall. So there's no need to check existence of the stub when doing `CallRelocation::fix_relocation_after_move` [1]. >> >> We can always return the stub address in `NativeCall::reloc_destination` and use that address in `NativeCall::reloc_set_destination`. This helps simplify the code and saves one `MacroAssembler::target_addr_for_insn` call >> and one `trampoline_stub_Relocation::get_trampoline_for` call in these two functions respectively. >> >> Testing on linux-riscv64: >> - [x] tier1-tier3 (release build) >> - [x] hs:tier1-hs:tier3 (fastdebug build) >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.cpp#L404-L406 > > Fei Yang has updated the pull request incrementally with one additional commit since the last revision: > > Comment Thanks for working on this! There are several comments below. src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 116: > 114: if (code->is_nmethod()) { > 115: assert(dest != nullptr, "Sanity"); > 116: MacroAssembler::pd_patch_instruction_size(call_addr, dest); `dest` here is the reloc call desitnation, ie. the dest stored in the stub, and it should be able to reach anywhere in the address space. The patch here should patch the `auipc + jalr` to the address of this stub, rather than `dest`? 
------------- PR Review: https://git.openjdk.org/jdk/pull/26495#pullrequestreview-3062577961 PR Review Comment: https://git.openjdk.org/jdk/pull/26495#discussion_r2236343787 From fyang at openjdk.org Mon Jul 28 13:01:43 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 28 Jul 2025 13:01:43 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v2] In-Reply-To: References: Message-ID: <-fTfStxKEpyiYk2WJ-w6kHhekWJZJK-H4DA03UhoVi8=.53308b2f-7f7d-4f39-ba8e-b2965fc7ce01@github.com> On Mon, 28 Jul 2025 12:51:27 GMT, Hamlin Li wrote: >> Fei Yang has updated the pull request incrementally with one additional commit since the last revision: >> >> Comment > > src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 77: > >> 75: stub_addr = trampoline_stub_Relocation::get_trampoline_for(call_addr, code->as_nmethod()); >> 76: assert(stub_addr != nullptr, "Sanity"); >> 77: return stub_addr; > > Seems this line is not necessary. Yes! I have removed this redundant return statement. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26495#discussion_r2236346532 From fyang at openjdk.org Mon Jul 28 13:13:57 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 28 Jul 2025 13:13:57 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v2] In-Reply-To: <9jNyVzYYc0O9nHTJYWpgbqkcV-mMzyKUvdgzpGNf6qM=.bb7b8abb-fc13-4057-8559-cc9a92a56393@github.com> References: <_66H8SCUT2si17IPmfPJ7g3tj9QW6m3Q7YOGMOsBKI8=.b1a9d462-df1f-4bb3-b4d0-07deadf28749@github.com> <9jNyVzYYc0O9nHTJYWpgbqkcV-mMzyKUvdgzpGNf6qM=.bb7b8abb-fc13-4057-8559-cc9a92a56393@github.com> Message-ID: On Mon, 28 Jul 2025 12:57:41 GMT, Hamlin Li wrote: >> Fei Yang has updated the pull request incrementally with one additional commit since the last revision: >> >> Comment > > src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 116: > >> 114: if (code->is_nmethod()) { >> 115: assert(dest != nullptr, "Sanity"); >> 116: MacroAssembler::pd_patch_instruction_size(call_addr, dest); > > `dest` here is the reloc call desitnation, ie. the dest stored in the stub, and it should be able to reach anywhere in the address space. > The patch here should patch the `auipc + jalr` to the address of this stub, rather than `dest`? Yes, the `dest` param here holds the address of this stub. In `CallRelocation::fix_relocation_after_move`, we first get the `callee` address by calling `pd_call_destination` which delegates work to `NativeCall::reloc_destination` for a NativeCall. And we have modified `NativeCall::reloc_destination` to return the stub address in this PR at the same time. So `callee` will hold the stub address. Immediatedly after that, `callee` is passed to `pd_set_call_destination` which delegates work to `NativeCall::reloc_set_destination`. Make sense? void CallRelocation::fix_relocation_after_move(const CodeBuffer* src, CodeBuffer* dest) { // Usually a self-relative reference to an external routine. // On some platforms, the reference is absolute (not self-relative). // The enhanced use of pd_call_destination sorts this all out. address orig_addr = old_addr_for(addr(), src, dest); address callee = pd_call_destination(orig_addr); <=========== callee is stub address // Reassert the callee address, this time in the new copy of the code. 
pd_set_call_destination(callee); <=========== callee passed as param } ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26495#discussion_r2236392322 From fyang at openjdk.org Mon Jul 28 13:17:15 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 28 Jul 2025 13:17:15 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v3] In-Reply-To: References: Message-ID: > Hi, please consider this small change. > > JDK-8343430 removed the old trampoline call on RISC-V. And the new solution (reloc call) loads a target address from stub section and do an indirect call. This means the stub is always there for a NativeCall. So there's no need to check existence of the stub when doing `CallRelocation::fix_relocation_after_move` [1]. > > We can always return the stub address in `NativeCall::reloc_destination` and use that address in `NativeCall::reloc_set_destination`. This helps simplify the code and saves one `MacroAssembler::target_addr_for_insn` call > and one `trampoline_stub_Relocation::get_trampoline_for` call in these two functions respectively. > > Testing on linux-riscv64: > - [x] tier1-tier3 (release build) > - [x] hs:tier1-hs:tier3 (fastdebug build) > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.cpp#L404-L406 Fei Yang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - Merge remote-tracking branch 'upstream/master' into JDK-8364150 - Comment - 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26495/files - new: https://git.openjdk.org/jdk/pull/26495/files/473b9fe1..01949774 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26495&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26495&range=01-02 Stats: 590 lines in 4 files changed: 332 ins; 220 del; 38 mod Patch: https://git.openjdk.org/jdk/pull/26495.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26495/head:pull/26495 PR: https://git.openjdk.org/jdk/pull/26495 From thartmann at openjdk.org Mon Jul 28 13:33:57 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 28 Jul 2025 13:33:57 GMT Subject: RFR: 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v4] In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 12:35:15 GMT, Beno?t Maillard wrote: >> This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist` (again). This is similar to [JDK-8361700](https://bugs.openjdk.org/browse/JDK-8361700?filter=-1). >> >> The optimization in question is the removal of redundant `ConvX2Y->ConvY2X->ConvX2Y` sequences (where `X` and `Y` are primitive number types), which get replaced by a single `ConvX2Y` as an identity optimization. This missing optimization was originally reported only for `ConvD2LNode`, but it turns out that other conversion nodes have analog optimization patterns. 
After manual inspection of identity optimizations in `convertnode.cpp`, I was able to reproduce missing optimizations for the following conversion sequences: >> - `ConvD2L->ConvL2D->ConvD2L` >> - `ConvF2I->ConvI2F->ConvF2I` >> - `ConvF2L->ConvL2F->ConvF2L` >> - `ConvI2F->ConvF2I->ConvI2F` >> >> Similar optimization patterns exist for additional conversion nodes. However, it is not clear if these nodes are subject to the same missed optimization issue. Further investigation may be needed, as I was unable to reproduce such cases with simple tests. >> >> This is again a case where an optimization depends on the input of its input. Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8359603) >> - [x] tier1-3, plus some internal testing >> - [x] Added test from the fuzzer, and tests for other sequences (manually derived from the original one) >> >> Thank you for reviewing! > > Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: > > 8359603: Add note Looks good to me! ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26368#pullrequestreview-3062756127 From mli at openjdk.org Mon Jul 28 13:36:55 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 28 Jul 2025 13:36:55 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v3] In-Reply-To: References: <_66H8SCUT2si17IPmfPJ7g3tj9QW6m3Q7YOGMOsBKI8=.b1a9d462-df1f-4bb3-b4d0-07deadf28749@github.com> <9jNyVzYYc0O9nHTJYWpgbqkcV-mMzyKUvdgzpGNf6qM=.bb7b8abb-fc13-4057-8559-cc9a92a56393@github.com> Message-ID: On Mon, 28 Jul 2025 13:09:42 GMT, Fei Yang wrote: >> src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 116: >> >>> 114: if (code->is_nmethod()) { >>> 115: assert(dest != nullptr, "Sanity"); >>> 116: MacroAssembler::pd_patch_instruction_size(call_addr, dest); >> >> `dest` here is the reloc call desitnation, ie. the dest stored in the stub, and it should be able to reach anywhere in the address space. >> The patch here should patch the `auipc + jalr` to the address of this stub, rather than `dest`? > > Yes, the `dest` param here holds the address of this stub. > In `CallRelocation::fix_relocation_after_move`, we first get the `callee` address by calling `pd_call_destination` which delegates work to `NativeCall::reloc_destination` for a NativeCall. And we have modified `NativeCall::reloc_destination` to return the stub address in this PR at the same time. So `callee` will hold the stub address. Immediatedly after that, `callee` is passed to `pd_set_call_destination` which delegates work to `NativeCall::reloc_set_destination`. Make sense? > > > void CallRelocation::fix_relocation_after_move(const CodeBuffer* src, CodeBuffer* dest) { > // Usually a self-relative reference to an external routine. > // On some platforms, the reference is absolute (not self-relative). > // The enhanced use of pd_call_destination sorts this all out. > address orig_addr = old_addr_for(addr(), src, dest); > address callee = pd_call_destination(orig_addr); <=========== callee is stub address > // Reassert the callee address, this time in the new copy of the code. 
> pd_set_call_destination(callee); <=========== callee passed as param > } There is another call of `Relocation::pd_set_call_destination(address x)` from `CallRelocation::set_destination(address x)`, not sure if this `x` passed in from set_destination is also the stub addr? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26495#discussion_r2236501965 From mhaessig at openjdk.org Mon Jul 28 13:40:05 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Mon, 28 Jul 2025 13:40:05 GMT Subject: RFR: 8324751: C2 SuperWord: Aliasing Analysis runtime check In-Reply-To: <2r_uZpbgYMypJFPIgI_t3NuTg1_S40mbeGrsvqi7IvE=.c4771662-0c55-4c02-96bb-99d5cfdb3697@github.com> References: <2r_uZpbgYMypJFPIgI_t3NuTg1_S40mbeGrsvqi7IvE=.c4771662-0c55-4c02-96bb-99d5cfdb3697@github.com> Message-ID: <1jRz0k69pSoITg9V5DiMv7pYixyilnf68vOkwEm-34w=.b982d419-795e-445f-92f7-a3abfc76fa37@github.com> On Thu, 27 Mar 2025 13:00:20 GMT, Emanuel Peter wrote: > This is a big patch, but about 3.5k lines are tests. And a large part of the VM changes is comments / proofs. > > I am adding a dynamic (runtime) aliasing check to the auto-vectorizer (SuperWord). We use the infrastructure from https://github.com/openjdk/jdk/pull/22016: > - Use the auto-vectorization `predicate` when available: we speculate that there is no aliasing, else we trap and re-compile without the predicate. > - If the predicate is not available, we use `multiversioning`, i.e. we have a `fast_loop` where there is no aliasing, and hence vectorization. And a `slow_loop` if the check fails, with no vectorization. > > -------------------------- > > **Where to start reviewing** > > - `src/hotspot/share/opto/mempointer.hpp`: > - Read the class comment for `MemPointerRawSummand`. > - Familiarize yourself with the `MemPointer Linearity Corrolary`. We need it for the proofs of the aliasing runtime checks. > > - `src/hotspot/share/opto/vectorization.cpp`: > - Read the explanations and proofs above `VPointer::can_make_speculative_aliasing_check_with`. It explains how the aliasing runtime check works. > > - `src/hotspot/share/opto/vtransform.hpp`: > - Understand the difference between weak and strong edges. > > If you need to see some examples, then look at the tests: > - `test/hotspot/jtreg/compiler/loopopts/superword/TestAliasing.java`: simple array cases. IR rules that check for vectors and in somecases if we used multiversioning. > - `test/micro/org/openjdk/bench/vm/compiler/VectorAliasing.java`: the miro-benchmarks I show below. Simple array cases. > - `test/hotspot/jtreg/compiler/loopopts/superword/TestMemorySegmentAliasing.java`: a bit advanced, but similar cases. > - `test/hotspot/jtreg/compiler/loopopts/superword/TestAliasingFuzzer.java`: very large and rather compliex. Generates random loops, some with and some without aliasing at runtime. IR verification, but mostly currently only for array cases, MemorySegment cases have some issues (see comments). > -------------------------- > > **Details** > > Most fundamentally: > - I had to refactor / extend `MemPointer` so that we have access to `MemPointerRawSummand`s. > - These raw summands us to reconstruct the `VPointer` at any `iv` value with `VPointer::make_pointer_expression(Node* iv_value)`. > - With the raw summands, a pointer may look like this: `p = base + ConvI2L(x + 2) + ConvI2L(y + 2)` > - With "regular" summands, this gets simplified to `p = base + 4L +ConvI2L(x) + ConvI2L(y)` > - For aliasing analysis (adjacency and overlap), the "regu... 
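As an illustration of the aliasing situation this runtime check targets (a hypothetical loop, not one of the tests listed above): whether `a` and `b` refer to the same array is only known at runtime, so vectorization needs either the speculative predicate (assume no aliasing, trap and recompile otherwise) or multiversioning into a fast_loop and a slow_loop.

    class AliasingExample {
        // Hypothetical shape, not taken from the PR's tests: if a and b are the same
        // array and 0 < invar < the vector length, iteration i writes a[i + invar],
        // which a later iteration reads back as b[i + invar], i.e. a loop-carried
        // dependence. Whether that overlap exists is only known at runtime, which is
        // what the predicate or the fast_loop/slow_loop multiversioning decides.
        static void scaleInto(int[] a, int[] b, int invar) {
            for (int i = 0; i < b.length && i + invar < a.length; i++) {
                a[i + invar] = b[i] * 2;
            }
        }
    }

If the check passes at runtime the vectorized fast_loop executes; otherwise the scalar slow_loop (or a deoptimization via the predicate) preserves correctness, matching the scheme described in the quoted description.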
Thank you, @eme64, for this good work! I left some comments below. src/hotspot/share/opto/mempointer.cpp line 732: > 730: // -> Unknown if overlap at runtime -> return false > 731: bool MemPointer::always_overlaps_with(const MemPointer& other) const { > 732: const MemPointerAliasing aliasing = get_aliasing_with(other NOT_PRODUCT( COMMA _trace )); Suggestion: const MemPointerAliasing aliasing = get_aliasing_with(other NOT_PRODUCT(COMMA _trace)); Nit: You used this without spaces already above. src/hotspot/share/opto/mempointer.hpp line 411: > 409: // Both p and mp have a linear form for v in r: > 410: // p(v) = p(lo) - lo * scale_v + iv * scale_v (Corrolary P) > 411: // mp(v) = mp(lo) - lo * scale_v + iv * scale_v (Corrolary MP) Where does `iv`come from? Is `v==iv`? src/hotspot/share/opto/mempointer.hpp line 444: > 442: // = summand_rest + scale_v * (v0 + stride_v) + con > 443: // = summand_rest + scale_v * v0 + scale_v * stride_v * con > 444: // = summand_rest + scale_v * v0 + scale_v * stride_v * con Suggestion: // = summand_rest + scale_v * v0 + scale_v * stride_v + con // = summand_rest + scale_v * v0 + scale_v * stride_v + con These ought to be plusses. src/hotspot/share/opto/mempointer.hpp line 663: > 661: }; > 662: > 663: // The MemPointerSummand is designed to allow the simplification of Shouldn't this be `MemPointerRawSummand`? src/hotspot/share/opto/mempointer.hpp line 706: > 704: // Note: we also need to track constants as separate raw summands. For > 705: // this, we say that a raw summand tracks a constant iff _variable == null, > 706: // and we store the constant value in _scaleI. This contradicts the `con2` example above. src/hotspot/share/opto/mempointer.hpp line 731: > 729: } > 730: > 731: bool is_valid() const { return _int_group >= 0; } Why is _int_group not a `uint` if it is always positive or 0? src/hotspot/share/opto/superword.cpp line 836: > 834: > 835: // If we cannot speculate (aliasing analysis runtime checks), we need to respect all edges. > 836: bool with_weak_memory_edges = !_vloop.use_speculative_aliasing_checks(); Edges that always have to be respected are strong edges. So, if we cannot speculate, we only have strong edges. With this comment and understanding, I would write the expression as bool with_weak_memory_edges = _vloop.use_speculative_aliasing_checks(); or bool with_strong_memory_edges = !_vloop.use_speculative_aliasing_checks(); src/hotspot/share/opto/superword.cpp line 878: > 876: > 877: // If we cannot speculate (aliasing analysis runtime checks), we need to respect all edges. > 878: bool with_weak_memory_edges = !_vloop.use_speculative_aliasing_checks(); Same as above. src/hotspot/share/opto/vectorization.hpp line 240: > 238: } > 239: > 240: // But in some cases, we ctrl of n is between the pre and Suggestion: // But in some cases, the ctrl of n is between the pre and Nit: spelling src/hotspot/share/opto/vtransform.hpp line 286: > 284: // dependency chain. Instead, we model the memory edges between all memory nodes, which > 285: // could be quadratic in the worst case. For vectorization, we must essencially reorder the > 286: // instructions in the graph. For this we must model all memory dependencies. Suggestion: // The C2 IR Node memory edges essentially define a linear order of all memory operations // (only Loads with the same memory input can be executed in an arbitrary order). This is // efficient, because it means every Load and Store has exactly one input memory edge, // which keeps the memory edge count linear. 
This is approach is too restrictive for // vectorization, for example, we could never vectorize stores, since they are all in a // dependency chain. Instead, we model the memory edges between all memory nodes, which // could be quadratic in the worst case. For vectorization, we must essentially reorder the // instructions in the graph. For this we must model all memory dependencies. Spelling test/hotspot/jtreg/compiler/loopopts/superword/TestAliasingFuzzer.java line 176: > 174: long t0 = System.nanoTime(); > 175: // Add a java source file. > 176: comp.addJavaSourceCode("p.xyz.InnerTest", generate(comp)); Nit: perhaps a package related to the test might be nicer in the logs. Like `compiler.loopopts.superword.templated.AliasingFuzzer` test/hotspot/jtreg/compiler/loopopts/superword/TestAliasingFuzzer.java line 270: > 268: // > 269: // The idea is that invarRest is always close to zero, with some small range [-err .. err]. > 270: // The invar variables for invarRest must be in the range [-1, 1, 1], so that we can Suggestion: // The invar variables for invarRest must be in the range [-1, 0, 1], so that we can ------------- Changes requested by mhaessig (Committer). PR Review: https://git.openjdk.org/jdk/pull/24278#pullrequestreview-3061496983 PR Review Comment: https://git.openjdk.org/jdk/pull/24278#discussion_r2235976063 PR Review Comment: https://git.openjdk.org/jdk/pull/24278#discussion_r2235538590 PR Review Comment: https://git.openjdk.org/jdk/pull/24278#discussion_r2235767065 PR Review Comment: https://git.openjdk.org/jdk/pull/24278#discussion_r2235785157 PR Review Comment: https://git.openjdk.org/jdk/pull/24278#discussion_r2235881210 PR Review Comment: https://git.openjdk.org/jdk/pull/24278#discussion_r2235887862 PR Review Comment: https://git.openjdk.org/jdk/pull/24278#discussion_r2236162728 PR Review Comment: https://git.openjdk.org/jdk/pull/24278#discussion_r2236175460 PR Review Comment: https://git.openjdk.org/jdk/pull/24278#discussion_r2236187788 PR Review Comment: https://git.openjdk.org/jdk/pull/24278#discussion_r2236082499 PR Review Comment: https://git.openjdk.org/jdk/pull/24278#discussion_r2236409739 PR Review Comment: https://git.openjdk.org/jdk/pull/24278#discussion_r2236425351 From fyang at openjdk.org Mon Jul 28 13:47:56 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 28 Jul 2025 13:47:56 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v3] In-Reply-To: References: <_66H8SCUT2si17IPmfPJ7g3tj9QW6m3Q7YOGMOsBKI8=.b1a9d462-df1f-4bb3-b4d0-07deadf28749@github.com> <9jNyVzYYc0O9nHTJYWpgbqkcV-mMzyKUvdgzpGNf6qM=.bb7b8abb-fc13-4057-8559-cc9a92a56393@github.com> Message-ID: On Mon, 28 Jul 2025 13:34:07 GMT, Hamlin Li wrote: >> Yes, the `dest` param here holds the address of this stub. >> In `CallRelocation::fix_relocation_after_move`, we first get the `callee` address by calling `pd_call_destination` which delegates work to `NativeCall::reloc_destination` for a NativeCall. And we have modified `NativeCall::reloc_destination` to return the stub address in this PR at the same time. So `callee` will hold the stub address. Immediatedly after that, `callee` is passed to `pd_set_call_destination` which delegates work to `NativeCall::reloc_set_destination`. Make sense? >> >> >> void CallRelocation::fix_relocation_after_move(const CodeBuffer* src, CodeBuffer* dest) { >> // Usually a self-relative reference to an external routine. >> // On some platforms, the reference is absolute (not self-relative). 
>> // The enhanced use of pd_call_destination sorts this all out. >> address orig_addr = old_addr_for(addr(), src, dest); >> address callee = pd_call_destination(orig_addr); <=========== callee is stub address >> // Reassert the callee address, this time in the new copy of the code. >> pd_set_call_destination(callee); <=========== callee passed as param >> } > > There is another call of `Relocation::pd_set_call_destination(address x)` from `CallRelocation::set_destination(address x)`, not sure if this `x` passed in from set_destination is also the stub addr? Sure, I'll take a look. Thanks. BTW: I noticed one use of `set_destination` in file ./code/aotCodeCache.cpp. Seems AOT related. Maybe you can help confirm that while you are working on enabling AOT support on riscv64? ./code/aotCodeCache.cpp: ((CallRelocation*)iter.reloc())->set_destination(dest); ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26495#discussion_r2236539009 From mli at openjdk.org Mon Jul 28 13:59:54 2025 From: mli at openjdk.org (Hamlin Li) Date: Mon, 28 Jul 2025 13:59:54 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v3] In-Reply-To: References: <_66H8SCUT2si17IPmfPJ7g3tj9QW6m3Q7YOGMOsBKI8=.b1a9d462-df1f-4bb3-b4d0-07deadf28749@github.com> <9jNyVzYYc0O9nHTJYWpgbqkcV-mMzyKUvdgzpGNf6qM=.bb7b8abb-fc13-4057-8559-cc9a92a56393@github.com> Message-ID: <2_gymTx2AihIEDELcrUUe9jIOq6QlKxZtl7rUQaCTgg=.d3246389-7bf8-4206-b40d-0fbe47436f37@github.com> On Mon, 28 Jul 2025 13:42:57 GMT, Fei Yang wrote: >> There is another call of `Relocation::pd_set_call_destination(address x)` from `CallRelocation::set_destination(address x)`, not sure if this `x` passed in from set_destination is also the stub addr? > > Sure, I'll take a look. Thanks. > BTW: I noticed one use of `set_destination` in file ./code/aotCodeCache.cpp. Seems AOT related. > Maybe you can help confirm that while you are working on enabling AOT support on riscv64? > > > ./code/aotCodeCache.cpp: ((CallRelocation*)iter.reloc())->set_destination(dest); One suggestion: maybe we can add an additional parameter here for `NativeCall::reloc_set_destination(address dest)` like `is_stub_addr`, so on the path we're sure `dest` is the stub address rather than reloc call destination, we can pass true for `is_stub_addr`, and in other paths we can pass false. And we can also add an assert in the `is_stub_addr == true` path, like `assert(CodeCache::contains(x), "must");`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26495#discussion_r2236601333 From chagedorn at openjdk.org Mon Jul 28 14:06:57 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 28 Jul 2025 14:06:57 GMT Subject: RFR: 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v4] In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 12:35:15 GMT, Beno?t Maillard wrote: >> This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist` (again). This is similar to [JDK-8361700](https://bugs.openjdk.org/browse/JDK-8361700?filter=-1). >> >> The optimization in question is the removal of redundant `ConvX2Y->ConvY2X->ConvX2Y` sequences (where `X` and `Y` are primitive number types), which get replaced by a single `ConvX2Y` as an identity optimization. 
This missing optimization was originally reported only for `ConvD2LNode`, but it turns out that other conversion nodes have analog optimization patterns. After manual inspection of identity optimizations in `convertnode.cpp`, I was able to reproduce missing optimizations for the following conversion sequences: >> - `ConvD2L->ConvL2D->ConvD2L` >> - `ConvF2I->ConvI2F->ConvF2I` >> - `ConvF2L->ConvL2F->ConvF2L` >> - `ConvI2F->ConvF2I->ConvI2F` >> >> Similar optimization patterns exist for additional conversion nodes. However, it is not clear if these nodes are subject to the same missed optimization issue. Further investigation may be needed, as I was unable to reproduce such cases with simple tests. >> >> This is again a case where an optimization depends on the input of its input. Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8359603) >> - [x] tier1-3, plus some internal testing >> - [x] Added test from the fuzzer, and tests for other sequences (manually derived from the original one) >> >> Thank you for reviewing! > > Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: > > 8359603: Add note Update looks good, thanks! ------------- Marked as reviewed by chagedorn (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26368#pullrequestreview-3062939535 From fyang at openjdk.org Mon Jul 28 14:15:56 2025 From: fyang at openjdk.org (Fei Yang) Date: Mon, 28 Jul 2025 14:15:56 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v3] In-Reply-To: <2_gymTx2AihIEDELcrUUe9jIOq6QlKxZtl7rUQaCTgg=.d3246389-7bf8-4206-b40d-0fbe47436f37@github.com> References: <_66H8SCUT2si17IPmfPJ7g3tj9QW6m3Q7YOGMOsBKI8=.b1a9d462-df1f-4bb3-b4d0-07deadf28749@github.com> <9jNyVzYYc0O9nHTJYWpgbqkcV-mMzyKUvdgzpGNf6qM=.bb7b8abb-fc13-4057-8559-cc9a92a56393@github.com> <2_gymTx2AihIEDELcrUUe9jIOq6QlKxZtl7rUQaCTgg=.d3246389-7bf8-4206-b40d-0fbe47436f37@github.com> Message-ID: On Mon, 28 Jul 2025 13:57:08 GMT, Hamlin Li wrote: >> Sure, I'll take a look. Thanks. >> BTW: I noticed one use of `set_destination` in file ./code/aotCodeCache.cpp. Seems AOT related. >> Maybe you can help confirm that while you are working on enabling AOT support on riscv64? >> >> >> ./code/aotCodeCache.cpp: ((CallRelocation*)iter.reloc())->set_destination(dest); > > One suggestion: maybe we can add an additional parameter here for `NativeCall::reloc_set_destination(address dest)` like `is_stub_addr`, so on the path we're sure `dest` is the stub address rather than reloc call destination, we can pass true for `is_stub_addr`, and in other paths we can pass false. > And we can also add an assert in the `is_stub_addr == true` path, like `assert(CodeCache::contains(x), "must");`. I intend to think that by design we only want `NativeCall::reloc_destination` and `NativeCall::reloc_set_destination` to get and set stub address. There are other functions like `NativeCall::destination`, `NativeCall::set_destination` and `NativeCall::set_destination_mt_safe` which are supposed to deal with the real call target. So I am going to check the `CallRelocation::set_destination` cases in shared code as you mentioned. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26495#discussion_r2236663847 From kvn at openjdk.org Mon Jul 28 14:42:03 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 28 Jul 2025 14:42:03 GMT Subject: RFR: 8352112: [ubsan] hotspot/share/code/relocInfo.cpp:130:37: runtime error: applying non-zero offset 18446744073709551614 to null pointer [v2] In-Reply-To: References: Message-ID: <7ja3_KpFi1NPc4EPFpMk3af7RgGtQYu0zGmrv05lCj0=.a7fb616e-8923-47f1-b869-3bb064d27f58@github.com> On Mon, 28 Jul 2025 12:39:41 GMT, Boris Ulasevich wrote: >> src/hotspot/share/code/codeBlob.cpp line 156: >> >>> 154: } else { >>> 155: // We need unique and valid not null address >>> 156: assert(_mutable_data = blob_end(), "sanity"); >> >> Did this mean to assign the `_mutable_data`? I think it should be `==`. > > Right. This typo was fixed in https://github.com/openjdk/jdk/pull/26175 > For now I do not see how this change is related to [JDK-8361382: NMT corruption](https://bugs.openjdk.org/browse/JDK-8361382) Yes, it was fixed. And they were harmless. I think @jdksjolen linked it because of the call stack. But I also don't know how it could cause the NMT bug. @jdksjolen did you try to undo these changes and reproduce https://bugs.openjdk.org/browse/JDK-8361382 ? V [libjvm.dylib+0xbf1c8c] VMError::report(outputStream*, bool)+0xa9c (mallocHeader.inline.hpp:107) V [libjvm.dylib+0xbf5d25] VMError::report_and_die(int, char const*, char const*, __va_list_tag*, Thread*, unsigned char*, void const*, void const*, char const*, int, unsigned long)+0x575 V [libjvm.dylib+0x404e20] DebuggingContext::~DebuggingContext()+0x0 V [libjvm.dylib+0x8f770f] MallocHeader* MallocHeader::resolve_checked_impl(void*)+0x15f V [libjvm.dylib+0x8f720c] MallocTracker::record_free_block(void*)+0xc V [libjvm.dylib+0x9a719a] os::free(void*)+0xea V [libjvm.dylib+0x388fb4] CodeBlob::purge()+0x44 V [libjvm.dylib+0x978e98] nmethod::purge(bool)+0x308 V [libjvm.dylib+0x380439] ClassUnloadingContext::purge_nmethods()+0x69 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24102#discussion_r2236768883 From jkarthikeyan at openjdk.org Mon Jul 28 17:17:05 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Mon, 28 Jul 2025 17:17:05 GMT Subject: RFR: 8362979: C2 fails with unexpected node in SuperWord truncation: CmpLTMask, RoundF In-Reply-To: References: Message-ID: <3idG5MS8JXypvRF8lzh878fj5MDTg9u4qDVeW4QmZ0Q=.00e61365-e0f4-4232-a8a2-384dc5d1b452@github.com> On Mon, 28 Jul 2025 02:34:25 GMT, Jasmine Karthikeyan wrote: > Hi all, > This is a fix for a debug assert failure in SuperWord truncation for `CmpLTMask` and `RoundF` nodes, as discovered by CTW in the linked JBS report. I've added the nodes to the switch, and added reduced test cases. I've made a similar fix for `RoundD` nodes as well. Thanks! Thanks for the reviews! Hmm, looks like the bot didn't catch it. Retrying...
------------- PR Comment: https://git.openjdk.org/jdk/pull/26494#issuecomment-3127376221 PR Comment: https://git.openjdk.org/jdk/pull/26494#issuecomment-3128202904 From jkarthikeyan at openjdk.org Mon Jul 28 17:17:06 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Mon, 28 Jul 2025 17:17:06 GMT Subject: Integrated: 8362979: C2 fails with unexpected node in SuperWord truncation: CmpLTMask, RoundF In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 02:34:25 GMT, Jasmine Karthikeyan wrote: > Hi all, > This is a fix for a debug assert failure in SuperWord truncation for `CmpLTMask` and `RoundF` nodes, as discovered by CTW in the linked JBS report. I've added the nodes to the switch, and added reduced test cases. I've made a similar fix for `RoundD` nodes as well. Thanks! This pull request has now been integrated. Changeset: ea0b49c3 Author: Jasmine Karthikeyan URL: https://git.openjdk.org/jdk/commit/ea0b49c36db7dce508aec7e72e73c7274d65bc15 Stats: 57 lines in 3 files changed: 57 ins; 0 del; 0 mod 8362979: C2 fails with unexpected node in SuperWord truncation: CmpLTMask, RoundF Reviewed-by: chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/26494 From fyang at openjdk.org Tue Jul 29 01:55:10 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 29 Jul 2025 01:55:10 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v4] In-Reply-To: References: Message-ID: <_gBsBRuwEYg_z4Fy1eTSI0ATppAF85SFB9fNkXSwe8E=.7c88580a-bb52-4066-977e-29c84b8b8b56@github.com> > Hi, please consider this small change. > > JDK-8343430 removed the old trampoline call on RISC-V. And the new solution (reloc call) loads a target address from stub section and do an indirect call. This means the stub is always there for a NativeCall. So there's no need to check existence of the stub when doing `CallRelocation::fix_relocation_after_move` [1]. > > We can always return the stub address in `NativeCall::reloc_destination` and use that address in `NativeCall::reloc_set_destination`. This helps simplify the code and saves one `MacroAssembler::target_addr_for_insn` call > and one `trampoline_stub_Relocation::get_trampoline_for` call in these two functions respectively. 
> > Testing on linux-riscv64: > - [x] tier1-tier3 (release build) > - [x] hs:tier1-hs:tier3 (fastdebug build) > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.cpp#L404-L406 Fei Yang has updated the pull request incrementally with one additional commit since the last revision: Assert ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26495/files - new: https://git.openjdk.org/jdk/pull/26495/files/01949774..8fa3d037 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26495&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26495&range=02-03 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26495.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26495/head:pull/26495 PR: https://git.openjdk.org/jdk/pull/26495 From fjiang at openjdk.org Tue Jul 29 01:55:10 2025 From: fjiang at openjdk.org (Feilong Jiang) Date: Tue, 29 Jul 2025 01:55:10 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v4] In-Reply-To: <_gBsBRuwEYg_z4Fy1eTSI0ATppAF85SFB9fNkXSwe8E=.7c88580a-bb52-4066-977e-29c84b8b8b56@github.com> References: <_gBsBRuwEYg_z4Fy1eTSI0ATppAF85SFB9fNkXSwe8E=.7c88580a-bb52-4066-977e-29c84b8b8b56@github.com> Message-ID: On Tue, 29 Jul 2025 01:24:02 GMT, Fei Yang wrote: >> Hi, please consider this small change. >> >> JDK-8343430 removed the old trampoline call on RISC-V. And the new solution (reloc call) loads a target address from stub section and do an indirect call. This means the stub is always there for a NativeCall. So there's no need to check existence of the stub when doing `CallRelocation::fix_relocation_after_move` [1]. >> >> We can always return the stub address in `NativeCall::reloc_destination` and use that address in `NativeCall::reloc_set_destination`. This helps simplify the code and saves one `MacroAssembler::target_addr_for_insn` call >> and one `trampoline_stub_Relocation::get_trampoline_for` call in these two functions respectively. >> >> Testing on linux-riscv64: >> - [x] tier1-tier3 (release build) >> - [x] hs:tier1-hs:tier3 (fastdebug build) >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.cpp#L404-L406 > > Fei Yang has updated the pull request incrementally with one additional commit since the last revision: > > Assert Looks good, thanks! ------------- Marked as reviewed by fjiang (Committer). PR Review: https://git.openjdk.org/jdk/pull/26495#pullrequestreview-3064923750 From fyang at openjdk.org Tue Jul 29 01:55:11 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 29 Jul 2025 01:55:11 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v4] In-Reply-To: References: <_66H8SCUT2si17IPmfPJ7g3tj9QW6m3Q7YOGMOsBKI8=.b1a9d462-df1f-4bb3-b4d0-07deadf28749@github.com> <9jNyVzYYc0O9nHTJYWpgbqkcV-mMzyKUvdgzpGNf6qM=.bb7b8abb-fc13-4057-8559-cc9a92a56393@github.com> <2_gymTx2AihIEDELcrUUe9jIOq6QlKxZtl7rUQaCTgg=.d3246389-7bf8-4206-b40d-0fbe47436f37@github.com> Message-ID: <_6L3mwdTfWCFoSohvs3SejxeeRIxW-XZQquD-I_Nay8=.951d51f8-0e98-4047-a24d-156c6b9d2e18@github.com> On Mon, 28 Jul 2025 14:12:32 GMT, Fei Yang wrote: >> One suggestion: maybe we can add an additional parameter here for `NativeCall::reloc_set_destination(address dest)` like `is_stub_addr`, so on the path we're sure `dest` is the stub address rather than reloc call destination, we can pass true for `is_stub_addr`, and in other paths we can pass false. 
>> And we can also add an assert in the `is_stub_addr == true` path, like `assert(CodeCache::contains(x), "must");`. > > I intend to think that by design we only want `NativeCall::reloc_destination` and `NativeCall::reloc_set_destination` to get and set stub address. There are other functions like `NativeCall::destination`, `NativeCall::set_destination` and `NativeCall::set_destination_mt_safe` which are supposed to deal with the real call target. So I am going to check the `CallRelocation::set_destination` cases in shared code as you mentioned. Here is what I find. There are only two use cases of `CallRelocation::set_destination(address x)` in hotspot shared code. One is the AOT case [1] and the other is `CallRelocation::set_value(address x)` [2]. And `CallRelocation::set_value(address x)` is never used. Given that we don't have AOT support for riscv64 in JDK upstream yet, I think we are safe. And I have added assertion in `CallRelocation::set_value(address x)` about `dest` param to make sure it's the same as the stub address. I see `hs:tier1` still test good with fastdebug build. Also I think it's better to investigate and fix the AOT use case in Leyden premain as you are currently working on if it turns out to be an issue. Does that work for you? ./relocInfo.hpp: void set_value(address x) override { set_destination(x); } [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/aotCodeCache.cpp#L1111 [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.hpp#L955 ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26495#discussion_r2238085011 From duke at openjdk.org Tue Jul 29 07:11:55 2025 From: duke at openjdk.org (duke) Date: Tue, 29 Jul 2025 07:11:55 GMT Subject: RFR: 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist [v4] In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 12:35:15 GMT, Beno?t Maillard wrote: >> This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist` (again). This is similar to [JDK-8361700](https://bugs.openjdk.org/browse/JDK-8361700?filter=-1). >> >> The optimization in question is the removal of redundant `ConvX2Y->ConvY2X->ConvX2Y` sequences (where `X` and `Y` are primitive number types), which get replaced by a single `ConvX2Y` as an identity optimization. This missing optimization was originally reported only for `ConvD2LNode`, but it turns out that other conversion nodes have analog optimization patterns. After manual inspection of identity optimizations in `convertnode.cpp`, I was able to reproduce missing optimizations for the following conversion sequences: >> - `ConvD2L->ConvL2D->ConvD2L` >> - `ConvF2I->ConvI2F->ConvF2I` >> - `ConvF2L->ConvL2F->ConvF2L` >> - `ConvI2F->ConvF2I->ConvI2F` >> >> Similar optimization patterns exist for additional conversion nodes. However, it is not clear if these nodes are subject to the same missed optimization issue. Further investigation may be needed, as I was unable to reproduce such cases with simple tests. >> >> This is again a case where an optimization depends on the input of its input. Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. 
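For a concrete picture of one such sequence (a hypothetical Java shape, not the reduced test added by this PR): casting double to long, back to double, and to long again produces a ConvD2L feeding a ConvL2D feeding another ConvD2L, which is exactly the chain the identity optimization is meant to collapse once IGVN revisits the outer node.

    class ConvChainExample {
        // Hypothetical shape only: the three casts map to ConvD2L -> ConvL2D -> ConvD2L
        // in C2's ideal graph; by the identity described above, the outer two
        // conversions are redundant and the expression can fold to the inner (long) cast.
        static long roundTrip(double d) {
            long l = (long) d;        // ConvD2L
            double back = (double) l; // ConvL2D
            return (long) back;       // outer ConvD2L, candidate for the identity
        }
    }

When the input of the input changes and no notification reaches the outer conversion, IGVN never re-examines it and the fold is missed, which is the scenario addressed next.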
The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. >> >> ### Testing >> - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8359603) >> - [x] tier1-3, plus some internal testing >> - [x] Added test from the fuzzer, and tests for other sequences (manually derived from the original one) >> >> Thank you for reviewing! > > Beno?t Maillard has updated the pull request incrementally with one additional commit since the last revision: > > 8359603: Add note @benoitmaillard Your change (at version 2e5efdcc2ce20f8f311371388bcfe9614435816b) is now ready to be sponsored by a Committer. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26368#issuecomment-3131004577 From bmaillard at openjdk.org Tue Jul 29 07:36:05 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Tue, 29 Jul 2025 07:36:05 GMT Subject: Integrated: 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 12:25:33 GMT, Beno?t Maillard wrote: > This PR addresses a missed optimization in `PhaseIterGVN` due to the lack of change notification to indirect users within `PhaseIterGVN::add_users_of_use_to_worklist` (again). This is similar to [JDK-8361700](https://bugs.openjdk.org/browse/JDK-8361700?filter=-1). > > The optimization in question is the removal of redundant `ConvX2Y->ConvY2X->ConvX2Y` sequences (where `X` and `Y` are primitive number types), which get replaced by a single `ConvX2Y` as an identity optimization. This missing optimization was originally reported only for `ConvD2LNode`, but it turns out that other conversion nodes have analog optimization patterns. After manual inspection of identity optimizations in `convertnode.cpp`, I was able to reproduce missing optimizations for the following conversion sequences: > - `ConvD2L->ConvL2D->ConvD2L` > - `ConvF2I->ConvI2F->ConvF2I` > - `ConvF2L->ConvL2F->ConvF2L` > - `ConvI2F->ConvF2I->ConvI2F` > > Similar optimization patterns exist for additional conversion nodes. However, it is not clear if these nodes are subject to the same missed optimization issue. Further investigation may be needed, as I was unable to reproduce such cases with simple tests. > > This is again a case where an optimization depends on the input of its input. Currently, `PhaseIterGVN::add_users_of_use_to_worklist` contains specific logic to handle similar dependencies for other cases, but this specific scenario is not addressed. The proposed fix adds the necessary logic in `add_users_of_use_to_worklist` to ensure proper notification for this optimization pattern. > > ### Testing > - [x] [GitHub Actions](https://github.com/benoitmaillard/jdk/actions?query=branch%3AJDK-8359603) > - [x] tier1-3, plus some internal testing > - [x] Added test from the fuzzer, and tests for other sequences (manually derived from the original one) > > Thank you for reviewing! This pull request has now been integrated. 
Changeset: 28297411 Author: Benoît Maillard Committer: Christian Hagedorn URL: https://git.openjdk.org/jdk/commit/28297411b19551dd8585165200f5f8158f3d5bb3 Stats: 125 lines in 2 files changed: 125 ins; 0 del; 0 mod 8359603: Missed optimization in PhaseIterGVN for redundant ConvX2Y->ConvY2X->ConvX2Y sequences due to missing notification in PhaseIterGVN::add_users_of_use_to_worklist Reviewed-by: chagedorn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/26368 From duke at openjdk.org Tue Jul 29 07:42:57 2025 From: duke at openjdk.org (erifan) Date: Tue, 29 Jul 2025 07:42:57 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v7] In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 11:55:52 GMT, Christian Hagedorn wrote: > I'll give this a spin in our testing - will report the results back later. Ok, thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3131099095 From chagedorn at openjdk.org Tue Jul 29 08:24:58 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Tue, 29 Jul 2025 08:24:58 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v7] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 09:27:19 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is >> relatively smaller than that of `fromLong`. So this patch does the conversion for these cases. >> >> The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during the intrinsification process if `MaskAll` or `Replicate` is supported. >> >> Since this optimization requires the input long value of `VectorMask.fromLong` to be a specific compile-time constant, and such expressions are usually hoisted out of the loop, we can't see a noticeable performance change. >> >> This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing `VectorLongToMask` will be affected, like >> >> VectorMaskToLong (VectorLongToMask x) => x >> >> >> Hence, this patch also added the following optimizations: >> >> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> >> VectorMaskCast (VectorMaskCast x) => x >> >> And we can see a noticeable performance improvement with the above optimizations for floating-point types.
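To make the Java-level shape concrete (a hypothetical snippet, not one of the micro-benchmarks below; it assumes `--add-modules jdk.incubator.vector`): `IntVector.SPECIES_256` has 8 int lanes, so a constant whose low 8 bits are all set selects every lane, and the two calls below are expected to produce the same mask, with `maskAll` being the cheaper form that the conversion above rewrites to.

    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorMask;
    import jdk.incubator.vector.VectorSpecies;

    class FromLongAllTrueExample {
        static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256; // 8 int lanes

        static VectorMask<Integer> viaFromLong() {
            return VectorMask.fromLong(SPECIES, 0xFFL); // low 8 bits set -> all lanes true
        }

        static VectorMask<Integer> viaMaskAll() {
            return SPECIES.maskAll(true); // equivalent mask, cheaper to construct
        }
    }

Because the mask bits are a compile-time constant here, C2 can rewrite the first form into the second, which is the conversion whose effect the benchmarks below try to measure.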
>> >> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Error After Error Uplift >> microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 >> microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 >> microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 >> microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 >> microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 >> microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 >> microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 >> microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 >> >> >> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: >> >> Benchm... > > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Move the assertion to the beginning of the code block Testing is currently slow - still running but I report what I have so far. There is one test failure on `linux-aarch64` and `macosx-aarch64` with the new test `VectorMaskToLongTest.java`: Additional flags: `-ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation` (probably only related to `-XX:-TieredComilation`, maybe we have not enough profiling and need to increase the warm-up but that's just a wild guess without looking at the test)
Log Compilations (5) of Failed Methods (5) -------------------------------------- 1) Compilation of "public static void compiler.vectorapi.VectorMaskToLongTest.testFromLongToLongByte()": > Phase "PrintIdeal": AFTER: print_ideal 0 Root === 0 203 230 269 270 [[ 0 1 3 225 198 189 23 186 167 28 44 124 104 54 277 295 ]] inner 1 Con === 0 [[ ]] #top 3 Start === 3 0 [[ 3 5 6 7 8 9 ]] #{0:control, 1:abIO, 2:memory, 3:rawptr:BotPTR, 4:return_address} 5 Parm === 3 [[ 168 ]] Control !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:-1 (line 182) 6 Parm === 3 [[ 168 ]] I_O !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:-1 (line 182) 7 Parm === 3 [[ 168 279 286 ]] Memory Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:-1 (line 182) 8 Parm === 3 [[ 270 269 255 233 230 199 203 168 226 ]] FramePtr !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:-1 (line 182) 9 Parm === 3 [[ 270 269 199 226 ]] ReturnAdr !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:-1 (line 182) 23 ConP === 0 [[ 255 168 ]] #jdk/incubator/vector/ByteVector$ByteSpecies (jdk/incubator/vector/VectorSpecies):exact * Oop:jdk/incubator/vector/ByteVector$ByteSpecies (jdk/incubator/vector/VectorSpecies):exact * 28 ConI === 0 [[ 168 ]] #int:1 44 ConI === 0 [[ 168 ]] #int:16 54 ConL === 0 [[ 255 233 168 168 199 ]] #long:65534 104 ConP === 0 [[ 168 ]] #java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * Oop:java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * 124 ConP === 0 [[ 168 ]] #java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * Oop:java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * 167 ConP === 0 [[ 168 ]] #jdk/incubator/vector/VectorMask$$Lambda+0x0000060001060748 (jdk/internal/vm/vector/VectorSupport$FromBitsCoercedOperation):exact * Oop:jdk/incubator/vector/VectorMask$$Lambda+0x0000060001060748 (jdk/internal/vm/vector/VectorSupport$FromBitsCoercedOperation):exact * 168 CallStaticJava === 5 6 7 8 1 (104 124 44 54 1 28 23 167 54 1 1 1 1 1 1 1 ) [[ 169 181 182 173 ]] # Static jdk.internal.vm.vector.VectorSupport::fromBitsCoerced jdk/internal/vm/vector/VectorSupport$VectorPayload * ( java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact *, java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact *, int, long, half, int, jdk/internal/vm/vector/VectorSupport$VectorSpecies *, java/lang/Object * ) VectorMask::fromLong @ bci:39 (line 243) 
VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongTo LongByte @ bci:22 (line 183) 169 Proj === 168 [[ 175 ]] #0 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 173 Proj === 168 [[ 226 190 278 278 222 ]] #5 Oop:jdk/internal/vm/vector/VectorSupport$VectorPayload * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 175 Catch === 169 181 [[ 176 177 ]] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 176 CatchProj === 175 [[ 192 ]] #0 at bci -1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 177 CatchProj === 175 [[ 251 180 ]] #1 at bci -1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 180 CreateEx === 177 181 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 181 Proj === 168 [[ 233 199 226 252 175 180 ]] #1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 182 Proj === 168 [[ 233 226 199 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 186 ConP === 0 [[ 296 ]] #precise jdk/incubator/vector/VectorMask: 0x0000000148423080:Constant:exact * Klass:precise jdk/incubator/vector/VectorMask: 0x0000000148423080:Constant:exact * 189 ConP === 0 [[ 190 199 ]] #null 190 CmpP === _ 173 189 [[ 191 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 191 Bool === _ 190 [[ 192 ]] [ne] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 192 If === 176 191 [[ 193 194 ]] P=0.999999, C=-1.000000 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 193 IfFalse === 192 [[ 199 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 194 IfTrue === 192 [[ 299 279 ]] #1 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 198 ConI === 0 [[ 199 ]] #int:-12 199 CallStaticJava === 193 181 182 8 9 (198 54 1 1 1 1 1 1 1 189 ) [[ 200 ]] # Static uncommon_trap(reason='null_check' action='make_not_entrant' debug_id='0') void ( int ) C=0.000100 VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 200 Proj === 199 [[ 203 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 203 Halt === 200 1 1 8 1 [[ 0 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 222 CheckCastPP === 300 173 [[ 233 ]] #jdk/incubator/vector/VectorMask:NotNull * Oop:jdk/incubator/vector/VectorMask:NotNull * !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 
(line 183) 225 ConI === 0 [[ 226 ]] #int:-34 226 CallStaticJava === 301 181 182 8 9 (225 1 1 1 1 1 1 1 1 173 ) [[ 227 ]] # Static uncommon_trap(reason='class_check' action='maybe_recompile' debug_id='0') void ( int ) C=0.000100 VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 227 Proj === 226 [[ 230 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 230 Halt === 227 1 1 8 1 [[ 0 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 233 CallDynamicJava === 300 181 182 8 1 (222 54 1 1 1 ) [[ 234 246 247 238 ]] # Dynamic jdk.incubator.vector.VectorMask::toLong long/half ( jdk/incubator/vector/VectorMask:NotNull * ) VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) 234 Proj === 233 [[ 240 ]] #0 !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) 238 Proj === 233 [[ 255 ]] #5 !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) 240 Catch === 234 246 [[ 241 242 ]] !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) 241 CatchProj === 240 [[ 255 ]] #0 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) 242 CatchProj === 240 [[ 251 245 ]] #1 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) 245 CreateEx === 242 246 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) 246 Proj === 233 [[ 255 252 240 245 ]] #1 !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) 247 Proj === 233 [[ 255 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) 251 Region === 251 177 242 263 [[ 251 252 253 254 270 ]] !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) 252 Phi === 251 181 246 257 [[ 270 ]] #abIO !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) 253 Phi === 251 182 247 258 [[ 270 ]] #memory Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) 254 Phi === 251 180 245 266 [[ 270 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:25 (line 183) 255 CallStaticJava === 241 246 247 8 1 (23 54 1 238 1 1 1 1 1 ) [[ 256 257 258 ]] # Static compiler.vectorapi.VectorMaskToLongTest::verifyMaskToLong void ( java/lang/Object *, long, half, long, half ) VectorMaskToLongTest::testFromLongToLongByte @ bci:34 (line 184) !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:34 (line 184) 256 Proj === 255 [[ 261 ]] #0 !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:34 (line 184) 257 Proj === 255 [[ 269 261 252 266 ]] #1 !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:34 (line 184) 258 Proj === 255 [[ 269 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:34 (line 184) 261 Catch === 256 257 [[ 262 263 ]] !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:34 (line 184) 262 CatchProj === 261 [[ 269 ]] #0 at bci -1 !jvms: 
VectorMaskToLongTest::testFromLongToLongByte @ bci:34 (line 184) 263 CatchProj === 261 [[ 251 266 ]] #1 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:34 (line 184) 266 CreateEx === 263 257 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:34 (line 184) 269 Return === 262 257 258 8 9 [[ 0 ]] 270 Rethrow === 251 252 253 8 9 exception 254 [[ 0 ]] 277 ConL === 0 [[ 278 ]] #long:8 278 AddP === _ 173 173 277 [[ 279 ]] Oop:jdk/internal/vm/vector/VectorSupport$VectorPayload+8 * [narrowklass] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 279 LoadNKlass === 194 7 278 [[ 280 ]] @java/lang/Object+8 * [narrowklass], idx=5; #narrowklass: jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 280 DecodeNKlass === _ 279 [[ 285 285 ]] #jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * Klass:jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 285 AddP === _ 280 280 295 [[ 286 ]] Klass:jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8+80 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 286 LoadKlass === _ 7 285 [[ 296 ]] @java/lang/Object: 0x000000014800a050+any *, idx=6; # * Klass: * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 295 ConL === 0 [[ 285 ]] #long:80 296 CmpP === _ 286 186 [[ 298 ]] !orig=[289] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 298 Bool === _ 296 [[ 299 ]] [ne] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 299 If === 194 298 [[ 300 301 ]] P=0.170000, C=-1.000000 !orig=[291] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 300 IfFalse === 299 [[ 222 233 ]] #0 !orig=[292],[273],[220] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 301 IfTrue === 299 [[ 226 ]] #1 !orig=[293],[274],[221] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongByte @ bci:22 (line 183) 2) Compilation of "public static void compiler.vectorapi.VectorMaskToLongTest.testFromLongToLongDouble()": > Phase "PrintIdeal": AFTER: print_ideal 0 Root === 0 203 230 269 270 [[ 0 1 3 225 198 189 23 186 167 28 44 124 104 54 277 295 ]] inner 1 Con === 0 [[ ]] #top 3 Start === 3 0 [[ 3 5 6 7 8 9 ]] #{0:control, 1:abIO, 2:memory, 3:rawptr:BotPTR, 4:return_address} 5 Parm === 3 [[ 168 ]] Control !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:-1 (line 252) 6 Parm === 3 [[ 168 ]] I_O !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:-1 (line 252) 7 Parm === 3 [[ 168 279 286 ]] Memory Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:-1 (line 252) 8 Parm === 3 [[ 270 269 255 233 230 199 203 168 226 ]] FramePtr !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:-1 (line 252) 9 Parm === 3 [[ 270 269 199 226 ]] ReturnAdr 
!jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:-1 (line 252) 23 ConP === 0 [[ 255 168 ]] #jdk/incubator/vector/DoubleVector$DoubleSpecies (jdk/incubator/vector/VectorSpecies):exact * Oop:jdk/incubator/vector/DoubleVector$DoubleSpecies (jdk/incubator/vector/VectorSpecies):exact * 28 ConI === 0 [[ 168 ]] #int:1 44 ConI === 0 [[ 168 ]] #int:2 54 ConL === 0 [[ 255 233 168 168 199 ]] #long:2 104 ConP === 0 [[ 168 ]] #java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * Oop:java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * 124 ConP === 0 [[ 168 ]] #java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * Oop:java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * 167 ConP === 0 [[ 168 ]] #jdk/incubator/vector/VectorMask$$Lambda+0x0000060001060748 (jdk/internal/vm/vector/VectorSupport$FromBitsCoercedOperation):exact * Oop:jdk/incubator/vector/VectorMask$$Lambda+0x0000060001060748 (jdk/internal/vm/vector/VectorSupport$FromBitsCoercedOperation):exact * 168 CallStaticJava === 5 6 7 8 1 (104 124 44 54 1 28 23 167 54 1 1 1 1 1 1 1 ) [[ 169 181 182 173 ]] # Static jdk.internal.vm.vector.VectorSupport::fromBitsCoerced jdk/internal/vm/vector/VectorSupport$VectorPayload * ( java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact *, java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact *, int, long, half, int, jdk/internal/vm/vector/VectorSupport$VectorSpecies *, java/lang/Object * ) VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLong ToLongDouble @ bci:22 (line 253) 169 Proj === 168 [[ 175 ]] #0 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 173 Proj === 168 [[ 226 190 278 278 222 ]] #5 Oop:jdk/internal/vm/vector/VectorSupport$VectorPayload * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 175 Catch === 169 181 [[ 176 177 ]] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 176 CatchProj === 175 [[ 192 ]] #0 at bci -1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 177 CatchProj === 175 [[ 251 180 ]] #1 at bci -1 !jvms: VectorMask::fromLong @ bci:39 (line 243) 
VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 180 CreateEx === 177 181 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 181 Proj === 168 [[ 233 199 226 252 175 180 ]] #1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 182 Proj === 168 [[ 233 226 199 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 186 ConP === 0 [[ 296 ]] #precise jdk/incubator/vector/VectorMask: 0x00000001484db080:Constant:exact * Klass:precise jdk/incubator/vector/VectorMask: 0x00000001484db080:Constant:exact * 189 ConP === 0 [[ 190 199 ]] #null 190 CmpP === _ 173 189 [[ 191 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 191 Bool === _ 190 [[ 192 ]] [ne] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 192 If === 176 191 [[ 193 194 ]] P=0.999999, C=-1.000000 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 193 IfFalse === 192 [[ 199 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 194 IfTrue === 192 [[ 299 279 ]] #1 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 198 ConI === 0 [[ 199 ]] #int:-12 199 CallStaticJava === 193 181 182 8 9 (198 54 1 1 1 1 1 1 1 189 ) [[ 200 ]] # Static uncommon_trap(reason='null_check' action='make_not_entrant' debug_id='0') void ( int ) C=0.000100 VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 200 Proj === 199 [[ 203 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 203 Halt === 200 1 1 8 1 [[ 0 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 222 CheckCastPP === 300 173 [[ 233 ]] #jdk/incubator/vector/VectorMask:NotNull * Oop:jdk/incubator/vector/VectorMask:NotNull * !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 225 ConI === 0 [[ 226 ]] #int:-34 226 CallStaticJava === 301 181 182 8 9 (225 1 1 1 1 1 1 1 1 173 ) [[ 227 ]] # Static uncommon_trap(reason='class_check' action='maybe_recompile' debug_id='0') void ( int ) C=0.000100 VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 227 Proj === 226 [[ 230 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 230 Halt === 227 1 1 8 1 [[ 0 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 233 CallDynamicJava === 300 181 182 8 1 (222 54 1 1 1 ) [[ 234 246 247 238 ]] # Dynamic jdk.incubator.vector.VectorMask::toLong long/half ( 
jdk/incubator/vector/VectorMask:NotNull * ) VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) 234 Proj === 233 [[ 240 ]] #0 !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) 238 Proj === 233 [[ 255 ]] #5 !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) 240 Catch === 234 246 [[ 241 242 ]] !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) 241 CatchProj === 240 [[ 255 ]] #0 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) 242 CatchProj === 240 [[ 251 245 ]] #1 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) 245 CreateEx === 242 246 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) 246 Proj === 233 [[ 255 252 240 245 ]] #1 !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) 247 Proj === 233 [[ 255 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) 251 Region === 251 177 242 263 [[ 251 252 253 254 270 ]] !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) 252 Phi === 251 181 246 257 [[ 270 ]] #abIO !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) 253 Phi === 251 182 247 258 [[ 270 ]] #memory Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) 254 Phi === 251 180 245 266 [[ 270 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:25 (line 253) 255 CallStaticJava === 241 246 247 8 1 (23 54 1 238 1 1 1 1 1 ) [[ 256 257 258 ]] # Static compiler.vectorapi.VectorMaskToLongTest::verifyMaskToLong void ( java/lang/Object *, long, half, long, half ) VectorMaskToLongTest::testFromLongToLongDouble @ bci:34 (line 254) !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:34 (line 254) 256 Proj === 255 [[ 261 ]] #0 !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:34 (line 254) 257 Proj === 255 [[ 269 261 252 266 ]] #1 !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:34 (line 254) 258 Proj === 255 [[ 269 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:34 (line 254) 261 Catch === 256 257 [[ 262 263 ]] !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:34 (line 254) 262 CatchProj === 261 [[ 269 ]] #0 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:34 (line 254) 263 CatchProj === 261 [[ 251 266 ]] #1 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:34 (line 254) 266 CreateEx === 263 257 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongDouble @ bci:34 (line 254) 269 Return === 262 257 258 8 9 [[ 0 ]] 270 Rethrow === 251 252 253 8 9 exception 254 [[ 0 ]] 277 ConL === 0 [[ 278 ]] #long:8 278 AddP === _ 173 173 277 [[ 279 ]] Oop:jdk/internal/vm/vector/VectorSupport$VectorPayload+8 * [narrowklass] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 279 LoadNKlass === 194 7 278 [[ 280 ]] @java/lang/Object+8 * [narrowklass], idx=5; 
#narrowklass: jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 280 DecodeNKlass === _ 279 [[ 285 285 ]] #jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * Klass:jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 285 AddP === _ 280 280 295 [[ 286 ]] Klass:jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8+80 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 286 LoadKlass === _ 7 285 [[ 296 ]] @java/lang/Object: 0x000000014800a050+any *, idx=6; # * Klass: * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 295 ConL === 0 [[ 285 ]] #long:80 296 CmpP === _ 286 186 [[ 298 ]] !orig=[289] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 298 Bool === _ 296 [[ 299 ]] [ne] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 299 If === 194 298 [[ 300 301 ]] P=0.170000, C=-1.000000 !orig=[291] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 300 IfFalse === 299 [[ 222 233 ]] #0 !orig=[292],[273],[220] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 301 IfTrue === 299 [[ 226 ]] #1 !orig=[293],[274],[221] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongDouble @ bci:22 (line 253) 3) Compilation of "public static void compiler.vectorapi.VectorMaskToLongTest.testFromLongToLongInt()": > Phase "PrintIdeal": AFTER: print_ideal 0 Root === 0 203 230 269 270 [[ 0 1 3 225 198 189 23 186 167 28 44 124 104 54 277 295 ]] inner 1 Con === 0 [[ ]] #top 3 Start === 3 0 [[ 3 5 6 7 8 9 ]] #{0:control, 1:abIO, 2:memory, 3:rawptr:BotPTR, 4:return_address} 5 Parm === 3 [[ 168 ]] Control !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:-1 (line 210) 6 Parm === 3 [[ 168 ]] I_O !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:-1 (line 210) 7 Parm === 3 [[ 168 279 286 ]] Memory Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:-1 (line 210) 8 Parm === 3 [[ 270 269 255 233 230 199 203 168 226 ]] FramePtr !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:-1 (line 210) 9 Parm === 3 [[ 270 269 199 226 ]] ReturnAdr !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:-1 (line 210) 23 ConP === 0 [[ 255 168 ]] #jdk/incubator/vector/IntVector$IntSpecies (jdk/incubator/vector/VectorSpecies):exact * Oop:jdk/incubator/vector/IntVector$IntSpecies (jdk/incubator/vector/VectorSpecies):exact * 28 ConI === 0 [[ 168 ]] #int:1 44 ConI === 0 [[ 168 ]] #int:4 54 ConL === 0 [[ 255 233 168 168 199 ]] #long:14 104 ConP === 0 [[ 168 ]] #java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * Oop:java/lang/Class 
(java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * 124 ConP === 0 [[ 168 ]] #java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * Oop:java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * 167 ConP === 0 [[ 168 ]] #jdk/incubator/vector/VectorMask$$Lambda+0x0000060001060748 (jdk/internal/vm/vector/VectorSupport$FromBitsCoercedOperation):exact * Oop:jdk/incubator/vector/VectorMask$$Lambda+0x0000060001060748 (jdk/internal/vm/vector/VectorSupport$FromBitsCoercedOperation):exact * 168 CallStaticJava === 5 6 7 8 1 (104 124 44 54 1 28 23 167 54 1 1 1 1 1 1 1 ) [[ 169 181 182 173 ]] # Static jdk.internal.vm.vector.VectorSupport::fromBitsCoerced jdk/internal/vm/vector/VectorSupport$VectorPayload * ( java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact *, java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact *, int, long, half, int, jdk/internal/vm/vector/VectorSupport$VectorSpecies *, java/lang/Object * ) VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToL ongInt @ bci:22 (line 211) 169 Proj === 168 [[ 175 ]] #0 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 173 Proj === 168 [[ 226 190 278 278 222 ]] #5 Oop:jdk/internal/vm/vector/VectorSupport$VectorPayload * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 175 Catch === 169 181 [[ 176 177 ]] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 176 CatchProj === 175 [[ 192 ]] #0 at bci -1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 177 CatchProj === 175 [[ 251 180 ]] #1 at bci -1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 180 CreateEx === 177 181 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 181 Proj === 168 [[ 233 199 226 252 175 180 ]] #1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 182 Proj === 168 [[ 233 226 199 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 186 ConP === 0 [[ 296 ]] #precise jdk/incubator/vector/VectorMask: 
0x000000015829b080:Constant:exact * Klass:precise jdk/incubator/vector/VectorMask: 0x000000015829b080:Constant:exact * 189 ConP === 0 [[ 190 199 ]] #null 190 CmpP === _ 173 189 [[ 191 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 191 Bool === _ 190 [[ 192 ]] [ne] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 192 If === 176 191 [[ 193 194 ]] P=0.999999, C=-1.000000 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 193 IfFalse === 192 [[ 199 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 194 IfTrue === 192 [[ 299 279 ]] #1 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 198 ConI === 0 [[ 199 ]] #int:-12 199 CallStaticJava === 193 181 182 8 9 (198 54 1 1 1 1 1 1 1 189 ) [[ 200 ]] # Static uncommon_trap(reason='null_check' action='make_not_entrant' debug_id='0') void ( int ) C=0.000100 VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 200 Proj === 199 [[ 203 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 203 Halt === 200 1 1 8 1 [[ 0 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 222 CheckCastPP === 300 173 [[ 233 ]] #jdk/incubator/vector/VectorMask:NotNull * Oop:jdk/incubator/vector/VectorMask:NotNull * !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 225 ConI === 0 [[ 226 ]] #int:-34 226 CallStaticJava === 301 181 182 8 9 (225 1 1 1 1 1 1 1 1 173 ) [[ 227 ]] # Static uncommon_trap(reason='class_check' action='maybe_recompile' debug_id='0') void ( int ) C=0.000100 VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 227 Proj === 226 [[ 230 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 230 Halt === 227 1 1 8 1 [[ 0 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 233 CallDynamicJava === 300 181 182 8 1 (222 54 1 1 1 ) [[ 234 246 247 238 ]] # Dynamic jdk.incubator.vector.VectorMask::toLong long/half ( jdk/incubator/vector/VectorMask:NotNull * ) VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) 234 Proj === 233 [[ 240 ]] #0 !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) 238 Proj === 233 [[ 255 ]] #5 !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) 240 Catch === 234 246 [[ 241 242 ]] !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) 241 CatchProj === 240 [[ 255 ]] #0 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) 242 CatchProj === 240 [[ 251 245 ]] #1 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) 245 CreateEx === 242 246 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * 
Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) 246 Proj === 233 [[ 255 252 240 245 ]] #1 !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) 247 Proj === 233 [[ 255 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) 251 Region === 251 177 242 263 [[ 251 252 253 254 270 ]] !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) 252 Phi === 251 181 246 257 [[ 270 ]] #abIO !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) 253 Phi === 251 182 247 258 [[ 270 ]] #memory Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) 254 Phi === 251 180 245 266 [[ 270 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:25 (line 211) 255 CallStaticJava === 241 246 247 8 1 (23 54 1 238 1 1 1 1 1 ) [[ 256 257 258 ]] # Static compiler.vectorapi.VectorMaskToLongTest::verifyMaskToLong void ( java/lang/Object *, long, half, long, half ) VectorMaskToLongTest::testFromLongToLongInt @ bci:34 (line 212) !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:34 (line 212) 256 Proj === 255 [[ 261 ]] #0 !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:34 (line 212) 257 Proj === 255 [[ 269 261 252 266 ]] #1 !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:34 (line 212) 258 Proj === 255 [[ 269 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:34 (line 212) 261 Catch === 256 257 [[ 262 263 ]] !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:34 (line 212) 262 CatchProj === 261 [[ 269 ]] #0 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:34 (line 212) 263 CatchProj === 261 [[ 251 266 ]] #1 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:34 (line 212) 266 CreateEx === 263 257 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongInt @ bci:34 (line 212) 269 Return === 262 257 258 8 9 [[ 0 ]] 270 Rethrow === 251 252 253 8 9 exception 254 [[ 0 ]] 277 ConL === 0 [[ 278 ]] #long:8 278 AddP === _ 173 173 277 [[ 279 ]] Oop:jdk/internal/vm/vector/VectorSupport$VectorPayload+8 * [narrowklass] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 279 LoadNKlass === 194 7 278 [[ 280 ]] @java/lang/Object+8 * [narrowklass], idx=5; #narrowklass: jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 280 DecodeNKlass === _ 279 [[ 285 285 ]] #jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * Klass:jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 285 AddP === _ 280 280 295 [[ 286 ]] Klass:jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8+80 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 286 LoadKlass === _ 7 285 [[ 296 ]] @java/lang/Object: 0x000000014800a050+any *, idx=6; # * Klass: * !jvms: VectorMask::fromLong @ bci:39 (line 243) 
VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 295 ConL === 0 [[ 285 ]] #long:80 296 CmpP === _ 286 186 [[ 298 ]] !orig=[289] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 298 Bool === _ 296 [[ 299 ]] [ne] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 299 If === 194 298 [[ 300 301 ]] P=0.170000, C=-1.000000 !orig=[291] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 300 IfFalse === 299 [[ 222 233 ]] #0 !orig=[292],[273],[220] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 301 IfTrue === 299 [[ 226 ]] #1 !orig=[293],[274],[221] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongInt @ bci:22 (line 211) 4) Compilation of "public static void compiler.vectorapi.VectorMaskToLongTest.testFromLongToLongLong()": > Phase "PrintIdeal": AFTER: print_ideal 0 Root === 0 203 230 269 270 [[ 0 1 3 225 198 189 23 186 167 28 44 124 104 54 277 295 ]] inner 1 Con === 0 [[ ]] #top 3 Start === 3 0 [[ 3 5 6 7 8 9 ]] #{0:control, 1:abIO, 2:memory, 3:rawptr:BotPTR, 4:return_address} 5 Parm === 3 [[ 168 ]] Control !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:-1 (line 224) 6 Parm === 3 [[ 168 ]] I_O !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:-1 (line 224) 7 Parm === 3 [[ 168 279 286 ]] Memory Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:-1 (line 224) 8 Parm === 3 [[ 270 269 255 233 230 199 203 168 226 ]] FramePtr !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:-1 (line 224) 9 Parm === 3 [[ 270 269 199 226 ]] ReturnAdr !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:-1 (line 224) 23 ConP === 0 [[ 255 168 ]] #jdk/incubator/vector/LongVector$LongSpecies (jdk/incubator/vector/VectorSpecies):exact * Oop:jdk/incubator/vector/LongVector$LongSpecies (jdk/incubator/vector/VectorSpecies):exact * 28 ConI === 0 [[ 168 ]] #int:1 44 ConI === 0 [[ 168 ]] #int:2 54 ConL === 0 [[ 255 233 168 168 199 ]] #long:2 104 ConP === 0 [[ 168 ]] #java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * Oop:java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * 124 ConP === 0 [[ 168 ]] #java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * Oop:java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * 167 ConP === 0 [[ 168 ]] #jdk/incubator/vector/VectorMask$$Lambda+0x0000060001060748 (jdk/internal/vm/vector/VectorSupport$FromBitsCoercedOperation):exact * Oop:jdk/incubator/vector/VectorMask$$Lambda+0x0000060001060748 (jdk/internal/vm/vector/VectorSupport$FromBitsCoercedOperation):exact * 168 CallStaticJava === 5 
6 7 8 1 (104 124 44 54 1 28 23 167 54 1 1 1 1 1 1 1 ) [[ 169 181 182 173 ]] # Static jdk.internal.vm.vector.VectorSupport::fromBitsCoerced jdk/internal/vm/vector/VectorSupport$VectorPayload * ( java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact *, java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact *, int, long, half, int, jdk/internal/vm/vector/VectorSupport$VectorSpecies *, java/lang/Object * ) VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongTo LongLong @ bci:22 (line 225) 169 Proj === 168 [[ 175 ]] #0 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 173 Proj === 168 [[ 226 190 278 278 222 ]] #5 Oop:jdk/internal/vm/vector/VectorSupport$VectorPayload * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 175 Catch === 169 181 [[ 176 177 ]] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 176 CatchProj === 175 [[ 192 ]] #0 at bci -1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 177 CatchProj === 175 [[ 251 180 ]] #1 at bci -1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 180 CreateEx === 177 181 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 181 Proj === 168 [[ 233 199 226 252 175 180 ]] #1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 182 Proj === 168 [[ 233 226 199 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 186 ConP === 0 [[ 296 ]] #precise jdk/incubator/vector/VectorMask: 0x0000000158383080:Constant:exact * Klass:precise jdk/incubator/vector/VectorMask: 0x0000000158383080:Constant:exact * 189 ConP === 0 [[ 190 199 ]] #null 190 CmpP === _ 173 189 [[ 191 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 191 Bool === _ 190 [[ 192 ]] [ne] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 192 If === 176 191 [[ 193 194 ]] P=0.999999, C=-1.000000 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 193 IfFalse === 192 [[ 199 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 194 IfTrue === 192 [[ 299 279 ]] #1 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 198 ConI === 0 [[ 199 ]] #int:-12 199 CallStaticJava === 193 181 182 8 9 (198 54 1 1 1 1 1 1 1 189 ) [[ 200 ]] # Static 
uncommon_trap(reason='null_check' action='make_not_entrant' debug_id='0') void ( int ) C=0.000100 VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 200 Proj === 199 [[ 203 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 203 Halt === 200 1 1 8 1 [[ 0 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 222 CheckCastPP === 300 173 [[ 233 ]] #jdk/incubator/vector/VectorMask:NotNull * Oop:jdk/incubator/vector/VectorMask:NotNull * !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 225 ConI === 0 [[ 226 ]] #int:-34 226 CallStaticJava === 301 181 182 8 9 (225 1 1 1 1 1 1 1 1 173 ) [[ 227 ]] # Static uncommon_trap(reason='class_check' action='maybe_recompile' debug_id='0') void ( int ) C=0.000100 VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 227 Proj === 226 [[ 230 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 230 Halt === 227 1 1 8 1 [[ 0 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 233 CallDynamicJava === 300 181 182 8 1 (222 54 1 1 1 ) [[ 234 246 247 238 ]] # Dynamic jdk.incubator.vector.VectorMask::toLong long/half ( jdk/incubator/vector/VectorMask:NotNull * ) VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) 234 Proj === 233 [[ 240 ]] #0 !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) 238 Proj === 233 [[ 255 ]] #5 !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) 240 Catch === 234 246 [[ 241 242 ]] !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) 241 CatchProj === 240 [[ 255 ]] #0 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) 242 CatchProj === 240 [[ 251 245 ]] #1 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) 245 CreateEx === 242 246 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) 246 Proj === 233 [[ 255 252 240 245 ]] #1 !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) 247 Proj === 233 [[ 255 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) 251 Region === 251 177 242 263 [[ 251 252 253 254 270 ]] !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) 252 Phi === 251 181 246 257 [[ 270 ]] #abIO !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) 253 Phi === 251 182 247 258 [[ 270 ]] #memory Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) 254 Phi === 251 180 245 266 [[ 270 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:25 (line 225) 255 CallStaticJava === 241 
246 247 8 1 (23 54 1 238 1 1 1 1 1 ) [[ 256 257 258 ]] # Static compiler.vectorapi.VectorMaskToLongTest::verifyMaskToLong void ( java/lang/Object *, long, half, long, half ) VectorMaskToLongTest::testFromLongToLongLong @ bci:34 (line 226) !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:34 (line 226) 256 Proj === 255 [[ 261 ]] #0 !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:34 (line 226) 257 Proj === 255 [[ 269 261 252 266 ]] #1 !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:34 (line 226) 258 Proj === 255 [[ 269 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:34 (line 226) 261 Catch === 256 257 [[ 262 263 ]] !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:34 (line 226) 262 CatchProj === 261 [[ 269 ]] #0 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:34 (line 226) 263 CatchProj === 261 [[ 251 266 ]] #1 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:34 (line 226) 266 CreateEx === 263 257 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongLong @ bci:34 (line 226) 269 Return === 262 257 258 8 9 [[ 0 ]] 270 Rethrow === 251 252 253 8 9 exception 254 [[ 0 ]] 277 ConL === 0 [[ 278 ]] #long:8 278 AddP === _ 173 173 277 [[ 279 ]] Oop:jdk/internal/vm/vector/VectorSupport$VectorPayload+8 * [narrowklass] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 279 LoadNKlass === 194 7 278 [[ 280 ]] @java/lang/Object+8 * [narrowklass], idx=5; #narrowklass: jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 280 DecodeNKlass === _ 279 [[ 285 285 ]] #jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * Klass:jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 285 AddP === _ 280 280 295 [[ 286 ]] Klass:jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8+80 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 286 LoadKlass === _ 7 285 [[ 296 ]] @java/lang/Object: 0x000000014800a050+any *, idx=6; # * Klass: * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 295 ConL === 0 [[ 285 ]] #long:80 296 CmpP === _ 286 186 [[ 298 ]] !orig=[289] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 298 Bool === _ 296 [[ 299 ]] [ne] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 299 If === 194 298 [[ 300 301 ]] P=0.170000, C=-1.000000 !orig=[291] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 300 IfFalse === 299 [[ 222 233 ]] #0 !orig=[292],[273],[220] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 301 IfTrue === 299 [[ 226 ]] #1 !orig=[293],[274],[221] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongLong @ bci:22 (line 225) 5) Compilation of "public static void 
compiler.vectorapi.VectorMaskToLongTest.testFromLongToLongShort()": > Phase "PrintIdeal": AFTER: print_ideal 0 Root === 0 203 230 269 270 [[ 0 1 3 225 198 189 23 186 167 28 44 124 104 54 277 295 ]] inner 1 Con === 0 [[ ]] #top 3 Start === 3 0 [[ 3 5 6 7 8 9 ]] #{0:control, 1:abIO, 2:memory, 3:rawptr:BotPTR, 4:return_address} 5 Parm === 3 [[ 168 ]] Control !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:-1 (line 196) 6 Parm === 3 [[ 168 ]] I_O !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:-1 (line 196) 7 Parm === 3 [[ 168 279 286 ]] Memory Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:-1 (line 196) 8 Parm === 3 [[ 270 269 255 233 230 199 203 168 226 ]] FramePtr !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:-1 (line 196) 9 Parm === 3 [[ 270 269 199 226 ]] ReturnAdr !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:-1 (line 196) 23 ConP === 0 [[ 255 168 ]] #jdk/incubator/vector/ShortVector$ShortSpecies (jdk/incubator/vector/VectorSpecies):exact * Oop:jdk/incubator/vector/ShortVector$ShortSpecies (jdk/incubator/vector/VectorSpecies):exact * 28 ConI === 0 [[ 168 ]] #int:1 44 ConI === 0 [[ 168 ]] #int:8 54 ConL === 0 [[ 255 233 168 168 199 ]] #long:254 104 ConP === 0 [[ 168 ]] #java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * Oop:java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * 124 ConP === 0 [[ 168 ]] #java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * Oop:java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * 167 ConP === 0 [[ 168 ]] #jdk/incubator/vector/VectorMask$$Lambda+0x0000060001060748 (jdk/internal/vm/vector/VectorSupport$FromBitsCoercedOperation):exact * Oop:jdk/incubator/vector/VectorMask$$Lambda+0x0000060001060748 (jdk/internal/vm/vector/VectorSupport$FromBitsCoercedOperation):exact * 168 CallStaticJava === 5 6 7 8 1 (104 124 44 54 1 28 23 167 54 1 1 1 1 1 1 1 ) [[ 169 181 182 173 ]] # Static jdk.internal.vm.vector.VectorSupport::fromBitsCoerced jdk/internal/vm/vector/VectorSupport$VectorPayload * ( java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact *, java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact *, int, long, half, int, jdk/internal/vm/vector/VectorSupport$VectorSpecies *, java/lang/Object * ) VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) !jvms: VectorMask::fromLong @ bci:39 (line 243) 
VectorMaskToLongTest::testFromLongT oLongShort @ bci:22 (line 197) 169 Proj === 168 [[ 175 ]] #0 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 173 Proj === 168 [[ 226 190 278 278 222 ]] #5 Oop:jdk/internal/vm/vector/VectorSupport$VectorPayload * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 175 Catch === 169 181 [[ 176 177 ]] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 176 CatchProj === 175 [[ 192 ]] #0 at bci -1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 177 CatchProj === 175 [[ 251 180 ]] #1 at bci -1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 180 CreateEx === 177 181 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 181 Proj === 168 [[ 233 199 226 252 175 180 ]] #1 !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 182 Proj === 168 [[ 233 226 199 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 186 ConP === 0 [[ 296 ]] #precise jdk/incubator/vector/VectorMask: 0x00000001484bb080:Constant:exact * Klass:precise jdk/incubator/vector/VectorMask: 0x00000001484bb080:Constant:exact * 189 ConP === 0 [[ 190 199 ]] #null 190 CmpP === _ 173 189 [[ 191 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 191 Bool === _ 190 [[ 192 ]] [ne] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 192 If === 176 191 [[ 193 194 ]] P=0.999999, C=-1.000000 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 193 IfFalse === 192 [[ 199 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 194 IfTrue === 192 [[ 299 279 ]] #1 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 198 ConI === 0 [[ 199 ]] #int:-12 199 CallStaticJava === 193 181 182 8 9 (198 54 1 1 1 1 1 1 1 189 ) [[ 200 ]] # Static uncommon_trap(reason='null_check' action='make_not_entrant' debug_id='0') void ( int ) C=0.000100 VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 200 Proj === 199 [[ 203 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 203 Halt === 200 1 1 8 1 [[ 0 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 222 CheckCastPP === 300 173 [[ 233 ]] #jdk/incubator/vector/VectorMask:NotNull * Oop:jdk/incubator/vector/VectorMask:NotNull * !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 225 ConI === 0 [[ 226 ]] #int:-34 226 CallStaticJava === 301 181 182 8 9 (225 1 1 1 1 1 
1 1 1 173 ) [[ 227 ]] # Static uncommon_trap(reason='class_check' action='maybe_recompile' debug_id='0') void ( int ) C=0.000100 VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 227 Proj === 226 [[ 230 ]] #0 !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 230 Halt === 227 1 1 8 1 [[ 0 ]] !jvms: VectorMask::fromLong @ bci:42 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 233 CallDynamicJava === 300 181 182 8 1 (222 54 1 1 1 ) [[ 234 246 247 238 ]] # Dynamic jdk.incubator.vector.VectorMask::toLong long/half ( jdk/incubator/vector/VectorMask:NotNull * ) VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) 234 Proj === 233 [[ 240 ]] #0 !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) 238 Proj === 233 [[ 255 ]] #5 !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) 240 Catch === 234 246 [[ 241 242 ]] !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) 241 CatchProj === 240 [[ 255 ]] #0 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) 242 CatchProj === 240 [[ 251 245 ]] #1 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) 245 CreateEx === 242 246 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) 246 Proj === 233 [[ 255 252 240 245 ]] #1 !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) 247 Proj === 233 [[ 255 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) 251 Region === 251 177 242 263 [[ 251 252 253 254 270 ]] !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) 252 Phi === 251 181 246 257 [[ 270 ]] #abIO !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) 253 Phi === 251 182 247 258 [[ 270 ]] #memory Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) 254 Phi === 251 180 245 266 [[ 270 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:25 (line 197) 255 CallStaticJava === 241 246 247 8 1 (23 54 1 238 1 1 1 1 1 ) [[ 256 257 258 ]] # Static compiler.vectorapi.VectorMaskToLongTest::verifyMaskToLong void ( java/lang/Object *, long, half, long, half ) VectorMaskToLongTest::testFromLongToLongShort @ bci:34 (line 198) !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:34 (line 198) 256 Proj === 255 [[ 261 ]] #0 !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:34 (line 198) 257 Proj === 255 [[ 269 261 252 266 ]] #1 !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:34 (line 198) 258 Proj === 255 [[ 269 253 ]] #2 Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:34 (line 198) 261 Catch === 256 257 [[ 262 263 ]] !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:34 (line 198) 262 CatchProj === 261 [[ 269 ]] #0 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:34 (line 198) 263 CatchProj === 
261 [[ 251 266 ]] #1 at bci -1 !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:34 (line 198) 266 CreateEx === 263 257 [[ 254 ]] #java/lang/Throwable (java/io/Serializable):NotNull * Oop:java/lang/Throwable (java/io/Serializable):NotNull * !jvms: VectorMaskToLongTest::testFromLongToLongShort @ bci:34 (line 198) 269 Return === 262 257 258 8 9 [[ 0 ]] 270 Rethrow === 251 252 253 8 9 exception 254 [[ 0 ]] 277 ConL === 0 [[ 278 ]] #long:8 278 AddP === _ 173 173 277 [[ 279 ]] Oop:jdk/internal/vm/vector/VectorSupport$VectorPayload+8 * [narrowklass] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 279 LoadNKlass === 194 7 278 [[ 280 ]] @java/lang/Object+8 * [narrowklass], idx=5; #narrowklass: jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 280 DecodeNKlass === _ 279 [[ 285 285 ]] #jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * Klass:jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 285 AddP === _ 280 280 295 [[ 286 ]] Klass:jdk/internal/vm/vector/VectorSupport$VectorPayload: 0x000000014800e0e8+80 * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 286 LoadKlass === _ 7 285 [[ 296 ]] @java/lang/Object: 0x000000014800a050+any *, idx=6; # * Klass: * !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 295 ConL === 0 [[ 285 ]] #long:80 296 CmpP === _ 286 186 [[ 298 ]] !orig=[289] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 298 Bool === _ 296 [[ 299 ]] [ne] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 299 If === 194 298 [[ 300 301 ]] P=0.170000, C=-1.000000 !orig=[291] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 300 IfFalse === 299 [[ 222 233 ]] #0 !orig=[292],[273],[220] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) 301 IfTrue === 299 [[ 226 ]] #1 !orig=[293],[274],[221] !jvms: VectorMask::fromLong @ bci:39 (line 243) VectorMaskToLongTest::testFromLongToLongShort @ bci:22 (line 197) [...] One or more @IR rules failed: Failed IR Rules (5) of Methods (5) ---------------------------------- 1) Method "public static void compiler.vectorapi.VectorMaskToLongTest.testFromLongToLongByte()" - [Failed IR rules: 1]: * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#VECTOR_LONG_TO_MASK#_", "= 0", "_#VECTOR_MASK_TO_LONG#_", "= 1"}, failOn={}, applyIfPlatformOr={}, applyIfPlatform={}, applyIfOr={}, applyIfCPUFeatureAnd={"asimd", "true", "sve", "false"}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})" > Phase "PrintIdeal": - counts: Graph contains wrong number of nodes: * Constraint 2: "(\\d+(\\s){2}(VectorMaskToLong.*)+(\\s){2}===.*)" - Failed comparison: [found] 0 = 1 [given] - No nodes matched! 
2) Method "public static void compiler.vectorapi.VectorMaskToLongTest.testFromLongToLongDouble()" - [Failed IR rules: 1]: * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#VECTOR_LONG_TO_MASK#_", "= 0", "_#VECTOR_MASK_TO_LONG#_", "= 1"}, failOn={}, applyIfPlatformOr={}, applyIfPlatform={}, applyIfOr={}, applyIfCPUFeatureAnd={"asimd", "true", "sve", "false"}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})" > Phase "PrintIdeal": - counts: Graph contains wrong number of nodes: * Constraint 2: "(\\d+(\\s){2}(VectorMaskToLong.*)+(\\s){2}===.*)" - Failed comparison: [found] 0 = 1 [given] - No nodes matched! 3) Method "public static void compiler.vectorapi.VectorMaskToLongTest.testFromLongToLongInt()" - [Failed IR rules: 1]: * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#VECTOR_LONG_TO_MASK#_", "= 0", "_#VECTOR_MASK_TO_LONG#_", "= 1"}, failOn={}, applyIfPlatformOr={}, applyIfPlatform={}, applyIfOr={}, applyIfCPUFeatureAnd={"asimd", "true", "sve", "false"}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})" > Phase "PrintIdeal": - counts: Graph contains wrong number of nodes: * Constraint 2: "(\\d+(\\s){2}(VectorMaskToLong.*)+(\\s){2}===.*)" - Failed comparison: [found] 0 = 1 [given] - No nodes matched! 4) Method "public static void compiler.vectorapi.VectorMaskToLongTest.testFromLongToLongLong()" - [Failed IR rules: 1]: * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#VECTOR_LONG_TO_MASK#_", "= 0", "_#VECTOR_MASK_TO_LONG#_", "= 1"}, failOn={}, applyIfPlatformOr={}, applyIfPlatform={}, applyIfOr={}, applyIfCPUFeatureAnd={"asimd", "true", "sve", "false"}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})" > Phase "PrintIdeal": - counts: Graph contains wrong number of nodes: * Constraint 2: "(\\d+(\\s){2}(VectorMaskToLong.*)+(\\s){2}===.*)" - Failed comparison: [found] 0 = 1 [given] - No nodes matched! 5) Method "public static void compiler.vectorapi.VectorMaskToLongTest.testFromLongToLongShort()" - [Failed IR rules: 1]: * @IR rule 2: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#VECTOR_LONG_TO_MASK#_", "= 0", "_#VECTOR_MASK_TO_LONG#_", "= 1"}, failOn={}, applyIfPlatformOr={}, applyIfPlatform={}, applyIfOr={}, applyIfCPUFeatureAnd={"asimd", "true", "sve", "false"}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})" > Phase "PrintIdeal": - counts: Graph contains wrong number of nodes: * Constraint 2: "(\\d+(\\s){2}(VectorMaskToLong.*)+(\\s){2}===.*)" - Failed comparison: [found] 0 = 1 [given] - No nodes matched!
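For readers who don't have the test in front of them, the fromLong/toLong round trip that these failing methods exercise looks roughly like the standalone sketch below. It is illustrative only: the class name, species choice and input value are assumptions, not the actual VectorMaskToLongTest source (compile and run with --add-modules jdk.incubator.vector).

    import jdk.incubator.vector.DoubleVector;
    import jdk.incubator.vector.VectorMask;
    import jdk.incubator.vector.VectorSpecies;

    public class MaskFromLongRoundTrip {
        // 128-bit double species has 2 lanes
        static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_128;

        public static void main(String[] args) {
            long input = 2L;                                    // sets lane 1 only
            VectorMask<Double> m = VectorMask.fromLong(SPECIES, input);
            // toLong() must give back the input bits, truncated to the lane count
            long expected = input & (-1L >>> (64 - SPECIES.length()));
            System.out.println(m.toLong() == expected);         // true
            // an input that sets every lane is equivalent to maskAll(true)
            VectorMask<Double> all = VectorMask.fromLong(SPECIES, -1L);
            System.out.println(all.allTrue());                  // true
        }
    }

The @IR rules quoted above only constrain how C2 compiles this pattern on asimd-without-sve targets (no VectorLongToMask node and exactly one VectorMaskToLong node); the Java-level result is the same either way.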
------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3131236429 From fyang at openjdk.org Tue Jul 29 08:40:35 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 29 Jul 2025 08:40:35 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v5] In-Reply-To: References: Message-ID: <8N7cCiZeQ3dTGViz9_mj55YnfS7qh0T-02f4h0ZVUnM=.0aa9aea5-3db4-4251-a949-bc73f451ca8e@github.com> > Hi, please consider this small change. > > JDK-8343430 removed the old trampoline call on RISC-V. And the new solution (reloc call) loads a target address from stub section and do an indirect call. This means the stub is always there for a NativeCall. So there's no need to check existence of the stub when doing `CallRelocation::fix_relocation_after_move` [1]. > > We can always return the stub address in `NativeCall::reloc_destination` and use that address in `NativeCall::reloc_set_destination`. This helps simplify the code and saves one `MacroAssembler::target_addr_for_insn` call > and one `trampoline_stub_Relocation::get_trampoline_for` call in these two functions respectively. > > Testing on linux-riscv64: > - [x] tier1-tier3 (release build) > - [x] hs:tier1-hs:tier3 (fastdebug build) > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.cpp#L404-L406 Fei Yang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: - Merge remote-tracking branch 'upstream/master' into JDK-8364150 - Assert - Merge remote-tracking branch 'upstream/master' into JDK-8364150 - Comment - 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26495/files - new: https://git.openjdk.org/jdk/pull/26495/files/8fa3d037..32ee090f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26495&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26495&range=03-04 Stats: 341 lines in 19 files changed: 254 ins; 54 del; 33 mod Patch: https://git.openjdk.org/jdk/pull/26495.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26495/head:pull/26495 PR: https://git.openjdk.org/jdk/pull/26495 From duke at openjdk.org Tue Jul 29 10:16:57 2025 From: duke at openjdk.org (erifan) Date: Tue, 29 Jul 2025 10:16:57 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v7] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 09:27:19 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is >> relative smaller than that of `fromLong`. So this patch does the conversion for these cases. >> >> The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. >> >> Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. >> >> This conversion also enables further optimizations that recognize maskAll patterns, see [1]. 
And we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like >> >> VectorMaskToLong (VectorLongToMask x) => x >> >> >> Hence, this patch also added the following optimizations: >> >> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> >> VectorMaskCast (VectorMaskCast x) => x >> >> And we can see noticeable performance improvement with the above optimizations for floating-point types. >> >> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Error After Error Uplift >> microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 >> microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 >> microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 >> microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 >> microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 >> microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 >> microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 >> microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 >> >> >> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: >> >> Benchm... > > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Move the assertion to the beginning of the code block > Testing is currently slow - still running but I report what I have so far. There is one test failure on `linux-aarch64-debug` and `macosx-aarch64-debug` with the new test `VectorMaskToLongTest.java`: > > Additional flags: `-ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation` (probably only related to `-XX:-TieredComilation`, maybe we have not enough profiling and need to increase the warm-up but that's just a wild guess without looking at the test) > > Log Thanks @chhagedorn , and yes you are right. I can reproduce the failure with `-XX:-TieredComilation` on NEON system. And increasing the default warm up value fixes the issue. I'll update the code tomorrow. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3131740432 From mli at openjdk.org Tue Jul 29 10:42:54 2025 From: mli at openjdk.org (Hamlin Li) Date: Tue, 29 Jul 2025 10:42:54 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v5] In-Reply-To: <_6L3mwdTfWCFoSohvs3SejxeeRIxW-XZQquD-I_Nay8=.951d51f8-0e98-4047-a24d-156c6b9d2e18@github.com> References: <_66H8SCUT2si17IPmfPJ7g3tj9QW6m3Q7YOGMOsBKI8=.b1a9d462-df1f-4bb3-b4d0-07deadf28749@github.com> <9jNyVzYYc0O9nHTJYWpgbqkcV-mMzyKUvdgzpGNf6qM=.bb7b8abb-fc13-4057-8559-cc9a92a56393@github.com> <2_gymTx2AihIEDELcrUUe9jIOq6QlKxZtl7rUQaCTgg=.d3246389-7bf8-4206-b40d-0fbe47436f37@github.com> <_6L3mwdTfWCFoSohvs3SejxeeRIxW-XZQquD-I_Nay8=.951d51f8-0e98-4047-a24d-156c6b9d2e18@github.com> Message-ID: On Tue, 29 Jul 2025 00:45:00 GMT, Fei Yang wrote: >> I intend to think that by design we only want `NativeCall::reloc_destination` and `NativeCall::reloc_set_destination` to get and set stub address. 
There are other functions like `NativeCall::destination`, `NativeCall::set_destination` and `NativeCall::set_destination_mt_safe` which are supposed to deal with the real call target. So I am going to check the `CallRelocation::set_destination` cases in shared code as you mentioned. > > Here is what I find. There are only two use cases of `CallRelocation::set_destination(address x)` in hotspot shared code. One is the AOT case [1] and the other is `CallRelocation::set_value(address x)` [2]. And `CallRelocation::set_value(address x)` is never used. Given that we don't have AOT support for riscv64 in JDK upstream yet, I think we are safe. > > And I have added assertion in `CallRelocation::set_value(address x)` about `dest` param to make sure it's the same as the stub address. I see `hs:tier1` still test good with fastdebug build. Also I think it's better to investigate and fix the AOT use case in Leyden premain as you are currently working on if it turns out to be an issue. Does that work for you? > > > ./relocInfo.hpp: void set_value(address x) override { set_destination(x); } > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/aotCodeCache.cpp#L1111 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.hpp#L955 Thanks for investigating. The pr makes sense to me. Can you just help to add an TODO or FIXUP here to state that during the implementation of aot series, we might need to revisit here? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26495#discussion_r2239358922 From fyang at openjdk.org Tue Jul 29 12:08:32 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 29 Jul 2025 12:08:32 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v6] In-Reply-To: References: Message-ID: > Hi, please consider this small change. > > JDK-8343430 removed the old trampoline call on RISC-V. And the new solution (reloc call) loads a target address from stub section and do an indirect call. This means the stub is always there for a NativeCall. So there's no need to check existence of the stub when doing `CallRelocation::fix_relocation_after_move` [1]. > > We can always return the stub address in `NativeCall::reloc_destination` and use that address in `NativeCall::reloc_set_destination`. This helps simplify the code and saves one `MacroAssembler::target_addr_for_insn` call > and one `trampoline_stub_Relocation::get_trampoline_for` call in these two functions respectively. 
> > Testing on linux-riscv64: > - [x] tier1-tier3 (release build) > - [x] hs:tier1-hs:tier3 (fastdebug build) > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.cpp#L404-L406 Fei Yang has updated the pull request incrementally with one additional commit since the last revision: Review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26495/files - new: https://git.openjdk.org/jdk/pull/26495/files/32ee090f..2839d9d2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26495&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26495&range=04-05 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26495.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26495/head:pull/26495 PR: https://git.openjdk.org/jdk/pull/26495 From fyang at openjdk.org Tue Jul 29 12:08:32 2025 From: fyang at openjdk.org (Fei Yang) Date: Tue, 29 Jul 2025 12:08:32 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v6] In-Reply-To: References: <_66H8SCUT2si17IPmfPJ7g3tj9QW6m3Q7YOGMOsBKI8=.b1a9d462-df1f-4bb3-b4d0-07deadf28749@github.com> <9jNyVzYYc0O9nHTJYWpgbqkcV-mMzyKUvdgzpGNf6qM=.bb7b8abb-fc13-4057-8559-cc9a92a56393@github.com> <2_gymTx2AihIEDELcrUUe9jIOq6QlKxZtl7rUQaCTgg=.d3246389-7bf8-4206-b40d-0fbe47436f37@github.com> <_6L3mwdTfWCFoSohvs3SejxeeRIxW-XZQquD-I_Nay8=.951d51f8-0e98-4047-a24d-156c6b9d2e18@github.com> Message-ID: <25TWyckXRpAhJQR8cTHMUUgUA7zqJvHbWmHNoW7iY5E=.a603813b-4b51-4334-94ac-fb20e11ba47d@github.com> On Tue, 29 Jul 2025 10:40:45 GMT, Hamlin Li wrote: >> Here is what I find. There are only two use cases of `CallRelocation::set_destination(address x)` in hotspot shared code. One is the AOT case [1] and the other is `CallRelocation::set_value(address x)` [2]. And `CallRelocation::set_value(address x)` is never used. Given that we don't have AOT support for riscv64 in JDK upstream yet, I think we are safe. >> >> And I have added assertion in `CallRelocation::set_value(address x)` about `dest` param to make sure it's the same as the stub address. I see `hs:tier1` still test good with fastdebug build. Also I think it's better to investigate and fix the AOT use case in Leyden premain as you are currently working on if it turns out to be an issue. Does that work for you? >> >> >> ./relocInfo.hpp: void set_value(address x) override { set_destination(x); } >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/aotCodeCache.cpp#L1111 >> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.hpp#L955 > > Thanks for investigating. The pr makes sense to me. Can you just help to add an TODO or FIXUP here to state that during the implementation of aot series, we might need to revisit here? > My concern is existing test might or might not catch the issue, currently I'm not sure, so it's better to have a label here to catch the attention at that time, so I will review the related shared code in more details when implement the riscv part. Sure. I have added TODO comments for both functions. Let me know. Thanks. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26495#discussion_r2239584900 From mli at openjdk.org Tue Jul 29 12:31:55 2025 From: mli at openjdk.org (Hamlin Li) Date: Tue, 29 Jul 2025 12:31:55 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v6] In-Reply-To: References: Message-ID: On Tue, 29 Jul 2025 12:08:32 GMT, Fei Yang wrote: >> Hi, please consider this small change. >> >> JDK-8343430 removed the old trampoline call on RISC-V. And the new solution (reloc call) loads a target address from stub section and do an indirect call. This means the stub is always there for a NativeCall. So there's no need to check existence of the stub when doing `CallRelocation::fix_relocation_after_move` [1]. >> >> We can always return the stub address in `NativeCall::reloc_destination` and use that address in `NativeCall::reloc_set_destination`. This helps simplify the code and saves one `MacroAssembler::target_addr_for_insn` call >> and one `trampoline_stub_Relocation::get_trampoline_for` call in these two functions respectively. >> >> Testing on linux-riscv64: >> - [x] tier1-tier3 (release build) >> - [x] hs:tier1-hs:tier3 (fastdebug build) >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.cpp#L404-L406 > > Fei Yang has updated the pull request incrementally with one additional commit since the last revision: > > Review Thank you! Looks good. ------------- Marked as reviewed by mli (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26495#pullrequestreview-3067283595 From duke at openjdk.org Tue Jul 29 16:36:03 2025 From: duke at openjdk.org (duke) Date: Tue, 29 Jul 2025 16:36:03 GMT Subject: Withdrawn: 8355574: Fatal error in abort_verify_int_in_range due to Invalid CastII In-Reply-To: References: Message-ID: On Sat, 17 May 2025 14:54:43 GMT, Quan Anh Mai wrote: > Hi, > > The issue here is that the `CastLLNode` is created before the actual check that ensures the range of the input. This patch fixes it by moving the creation to the correct place, which is under `inline_block`. I also noticed that the code there seems incorrect and confusing. `ArrayCopyNode::get_partial_inline_vector_lane_count` takes the length of the array, not the size in bytes. If you look into the method it will multiply `const_len` with `type2aelementbytes(bt)` to get the size in bytes of the array. In the runtime test, we compare `length << log2(type2bytes(bt))` with `ArrayOperationPartialInlineSize`. This seems confusing, why don't we just compare `length` with `ArrayOperationPartialInlineSize / type2bytes(bt)`, it also unifies the test with the actual cast. > > Please take a look and leave your reviews, thanks a lot. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/25284 From kvn at openjdk.org Tue Jul 29 17:01:56 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 29 Jul 2025 17:01:56 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph In-Reply-To: References: Message-ID: On Thu, 17 Jul 2025 07:25:10 GMT, Marc Chevalier wrote: > Some crashes are consequences of earlier misshaped ideal graphs, which could be detected earlier, closer to the source, before the possibly many transformations that lead to the crash. > > Let's verify that the ideal graph is well-shaped earlier then! I propose here such a feature. This runs after IGVN, because at this point, the graph, should be cleaned up for any weirdness happening earlier or during IGVN. 
> > This feature is enabled with the develop flag `VerifyIdealStructuralInvariants`. Open to renaming. No problem with me! This feature is only available in debug builds, and most of the code is even not compiled in product, since it uses some debug-only functions, such as `Node::dump` or `Node::Name`. > > For now, only local checks are implemented: they are checks that only look at a node and its neighborhood, wherever it happens in the graph. Typically: under a `If` node, we have a `IfTrue` and a `IfFalse`. To ease development, each check is implemented in its own class, independently of the others. Nevertheless, one needs to do always the same kind of things: checking there is an output of such type, checking there is N inputs, that the k-th input has such type... To ease writing such checks, in a readable way, and in a less error-prone way than pile of copy-pasted code that manually traverse the graph, I propose a set of compositional helpers to write patterns that can be matched against the ideal graph. Since these patterns are... patterns, so not related to a specific graph, they can be allocated once and forever. When used, one provides the node (called center) around which one want to check if the pattern holds. > > On top of making the description of pattern easier, these helpers allows nice printing in case of error, by showing the path from the center to the violating node. For instance (made up for the purpose of showing the formatting), a violation with a path climbing only inputs: > > 1 failure for node > 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 > At node > 209 CountedLoopEnd === 182 208 [[ 210 197 ]] [lt] P=0,948966, C=23799,000000 !orig=[196] !jvms: StringLatin1::equals @ bci:12 (line 100) > From path: > [center] 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 > <-(0)- 215 SafePoint === 210 1 7 1 1 216 37 54 185 [[ 211 ]] SafePoint !orig=186 !jvms: StringLatin1::equals @ bci:29 (line 100) > <-(0)- 210 IfFalse === 209 [[ 215 216 ]] #0 !orig=198 !jvms: StringL... I am fine with `VerifyIdealGraph` flag. The main concern is we have tons of `Verify*` flags but I don't think we use them in CI testing. So we are forgetting about them, they will brake and few years later we are removing them like we did with `VerifyOpto`. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26362#issuecomment-3133341084 From duke at openjdk.org Tue Jul 29 17:42:13 2025 From: duke at openjdk.org (duke) Date: Tue, 29 Jul 2025 17:42:13 GMT Subject: Withdrawn: 8347555: [REDO] C2: implement optimization for series of Add of unique value In-Reply-To: References: Message-ID: On Thu, 6 Feb 2025 23:29:51 GMT, Kangcheng Xu wrote: > [JDK-8347555](https://bugs.openjdk.org/browse/JDK-8347555) is a redo of [JDK-8325495](https://bugs.openjdk.org/browse/JDK-8325495) was [first merged](https://git.openjdk.org/jdk/pull/20754) then backed out due to a regression. This patch redos the feature and fixes the bit shift overflow problem. For more information please refer to the previous PR. > > When constanlizing multiplications (possibly in forms on `lshifts`), the multiplier is upgraded to long and then later narrowed to int if needed. However, when a `lshift` operand is exactly `32`, overflowing an int, using long has an unexpected result. (i.e., `(1 << 32) = 1` and `(int) (1L << 32) = 0`) > > The following was implemented to address this issue. > > if (UseNewCode2) { > *multiplier = bt == T_INT > ? 
(jlong) (1 << con->get_int()) // loss of precision is expected for int as it overflows > : ((jlong) 1) << con->get_int(); > } else { > *multiplier = ((jlong) 1 << con->get_int()); > } > > > Two new bitshift overflow tests were added. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/23506 From bulasevich at openjdk.org Tue Jul 29 18:24:16 2025 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Tue, 29 Jul 2025 18:24:16 GMT Subject: RFR: 8316694: Implement relocation of nmethod within CodeCache [v40] In-Reply-To: References: Message-ID: On Thu, 24 Jul 2025 01:20:46 GMT, Chad Rakoczy wrote: >> This PR introduces a new function to replace nmethods, addressing [JDK-8316694](https://bugs.openjdk.org/browse/JDK-8316694). It enables the creation of new nmethods from existing ones, allowing method relocation in the code heap and supporting [JDK-8328186](https://bugs.openjdk.org/browse/JDK-8328186). >> >> When an nmethod is replaced, a deep copy is performed. The corresponding Java method is updated to reference the new nmethod, while the old one is marked as unused. The garbage collector handles final cleanup and deallocation. >> >> This does not modify existing code paths and therefore does not benefit much from existing tests. New tests were created to test the new functionality >> >> Additional Testing: >> - [x] Linux x64 fastdebug tier 1/2/3/4 >> - [x] Linux aarch64 fastdebug tier 1/2/3/4 > > Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision: > > Use CompiledICLocker instead of CompiledIC_lock Marked as reviewed by bulasevich (Committer). ------------- PR Review: https://git.openjdk.org/jdk/pull/23573#pullrequestreview-3068694877 From fyang at openjdk.org Wed Jul 30 01:05:13 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 30 Jul 2025 01:05:13 GMT Subject: RFR: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call [v5] In-Reply-To: References: <_gBsBRuwEYg_z4Fy1eTSI0ATppAF85SFB9fNkXSwe8E=.7c88580a-bb52-4066-977e-29c84b8b8b56@github.com> Message-ID: On Tue, 29 Jul 2025 01:22:52 GMT, Feilong Jiang wrote: >> Fei Yang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: >> >> - Merge remote-tracking branch 'upstream/master' into JDK-8364150 >> - Assert >> - Merge remote-tracking branch 'upstream/master' into JDK-8364150 >> - Comment >> - 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call > > Looks good, thanks! @feilongjiang @Hamlin-Li : Thanks for the reviews. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26495#issuecomment-3134543384 From fyang at openjdk.org Wed Jul 30 01:05:14 2025 From: fyang at openjdk.org (Fei Yang) Date: Wed, 30 Jul 2025 01:05:14 GMT Subject: Integrated: 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 04:05:20 GMT, Fei Yang wrote: > Hi, please consider this small change. > > JDK-8343430 removed the old trampoline call on RISC-V. And the new solution (reloc call) loads a target address from stub section and do an indirect call. This means the stub is always there for a NativeCall. So there's no need to check existence of the stub when doing `CallRelocation::fix_relocation_after_move` [1]. 
> > We can always return the stub address in `NativeCall::reloc_destination` and use that address in `NativeCall::reloc_set_destination`. This helps simplify the code and saves one `MacroAssembler::target_addr_for_insn` call > and one `trampoline_stub_Relocation::get_trampoline_for` call in these two functions respectively. > > Testing on linux-riscv64: > - [x] tier1-tier3 (release build) > - [x] hs:tier1-hs:tier3 (fastdebug build) > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.cpp#L404-L406 This pull request has now been integrated. Changeset: 3488f53d Author: Fei Yang URL: https://git.openjdk.org/jdk/commit/3488f53d2c3083bd886644684ec6885046ea7f8e Stats: 14 lines in 2 files changed: 3 ins; 6 del; 5 mod 8364150: RISC-V: Leftover for JDK-8343430 removing old trampoline call Reviewed-by: mli, fjiang ------------- PR: https://git.openjdk.org/jdk/pull/26495 From duke at openjdk.org Wed Jul 30 06:14:40 2025 From: duke at openjdk.org (erifan) Date: Wed, 30 Jul 2025 06:14:40 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v8] In-Reply-To: References: Message-ID: <-otlKVhe_xfmpET_cwn5CdvzDduOfFApGSH5VoZSwuk=.7eb8a0e3-4ad6-4ffb-97fd-11a2120a3eaf@github.com> > If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is > relative smaller than that of `fromLong`. So this patch does the conversion for these cases. > > The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. > > Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. > > This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. > > As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like > > VectorMaskToLong (VectorLongToMask x) => x > > > Hence, this patch also added the following optimizations: > > VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 > VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 > > VectorMaskCast (VectorMaskCast x) => x > > And we can see noticeable performance improvement with the above optimizations for floating-point types. 
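For illustration, the equivalence the rewrite relies on can be seen directly at the Java level. Below is a minimal standalone sketch (not part of this patch; the species choice and the printed checks are made up for the example, and it needs `--add-modules jdk.incubator.vector` to compile and run):

    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorMask;
    import jdk.incubator.vector.VectorSpecies;

    public class FromLongMaskAllSketch {
        static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256; // 8 int lanes

        public static void main(String[] args) {
            int vlen = SPECIES.length();
            long allSet = -1L >>> (64 - vlen); // low vlen bits set, here 0xFF

            // fromLong with all lane bits set selects every lane, like maskAll(true)
            VectorMask<Integer> m1 = VectorMask.fromLong(SPECIES, allSet);
            VectorMask<Integer> m2 = SPECIES.maskAll(true);
            System.out.println(m1.allTrue() && m2.allTrue()); // true

            // toLong on an all-true mask is the low-vlen-bits constant, matching
            // the (x & (-1ULL >> (64 - vlen))) rewrite listed above
            System.out.println(m1.toLong() == allSet && m2.toLong() == allSet); // true

            // fromLong(0) deselects every lane, like maskAll(false)
            System.out.println(VectorMask.fromLong(SPECIES, 0L).toLong() == 0L); // true
        }
    }

When the long argument is a compile-time constant of one of these two forms, replacing the fromLong call with the cheaper maskAll shape is exactly the conversion described above.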
> > Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: > > Benchmark Unit Before Error After Error Uplift > microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 > microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 > microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 > microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 > microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 > microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 > microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 > microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 > > > Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: > > Benchmark Unit Before Error After Error Uplift > microMaskFromLongToLong_Double... erifan has updated the pull request incrementally with one additional commit since the last revision: Set default warm up to 10000 for JTReg tests ------------- Changes: - all: https://git.openjdk.org/jdk/pull/25793/files - new: https://git.openjdk.org/jdk/pull/25793/files/8418ebdd..b1a768eb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=25793&range=06-07 Stats: 6 lines in 3 files changed: 3 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/25793.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/25793/head:pull/25793 PR: https://git.openjdk.org/jdk/pull/25793 From duke at openjdk.org Wed Jul 30 06:16:57 2025 From: duke at openjdk.org (erifan) Date: Wed, 30 Jul 2025 06:16:57 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v7] In-Reply-To: References: Message-ID: On Tue, 29 Jul 2025 08:21:57 GMT, Christian Hagedorn wrote: >> erifan has updated the pull request incrementally with one additional commit since the last revision: >> >> Move the assertion to the beginning of the code block > > Testing is currently slow - still running but I report what I have so far. There is one test failure on `linux-aarch64-debug` and `macosx-aarch64-debug` with the new test `VectorMaskToLongTest.java`: > > Additional flags: `-ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation` (probably only related to `-XX:-TieredComilation`, maybe we have not enough profiling and need to increase the warm-up but that's just a wild guess without looking at the test) > >
> Log > > Compilations (5) of Failed Methods (5) > -------------------------------------- > 1) Compilation of "public static void compiler.vectorapi.VectorMaskToLongTest.testFromLongToLongByte()": >> Phase "PrintIdeal": > AFTER: print_ideal > 0 Root === 0 203 230 269 270 [[ 0 1 3 225 198 189 23 186 167 28 44 124 104 54 277 295 ]] inner > 1 Con === 0 [[ ]] #top > 3 Start === 3 0 [[ 3 5 6 7 8 9 ]] #{0:control, 1:abIO, 2:memory, 3:rawptr:BotPTR, 4:return_address} > 5 Parm === 3 [[ 168 ]] Control !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:-1 (line 182) > 6 Parm === 3 [[ 168 ]] I_O !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:-1 (line 182) > 7 Parm === 3 [[ 168 279 286 ]] Memory Memory: @BotPTR *+bot, idx=Bot; !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:-1 (line 182) > 8 Parm === 3 [[ 270 269 255 233 230 199 203 168 226 ]] FramePtr !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:-1 (line 182) > 9 Parm === 3 [[ 270 269 199 226 ]] ReturnAdr !jvms: VectorMaskToLongTest::testFromLongToLongByte @ bci:-1 (line 182) > 23 ConP === 0 [[ 255 168 ]] #jdk/incubator/vector/ByteVector$ByteSpecies (jdk/incubator/vector/VectorSpecies):exact * Oop:jdk/incubator/vector/ByteVector$ByteSpecies (jdk/incubator/vector/VectorSpecies):exact * > 28 ConI === 0 [[ 168 ]] #int:1 > 44 ConI === 0 [[ 168 ]] #int:16 > 54 ConL === 0 [[ 255 233 168 168 199 ]] #long:65534 > 104 ConP === 0 [[ 168 ]] #java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * Oop:java/lang/Class (java/io/Serializable,java/lang/constant/Constable,java/lang/reflect/AnnotatedElement,java/lang/invoke/TypeDescriptor,java/lang/reflect/GenericDeclaration,java/lang/reflect/Type,java/lang/invoke/TypeDescriptor$OfField):exact * > 124 ... Hi @chhagedorn , I have increased the warm up times, could you help test the PR again ? Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3134981739 From hgreule at openjdk.org Wed Jul 30 07:09:57 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Wed, 30 Jul 2025 07:09:57 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v8] In-Reply-To: <-otlKVhe_xfmpET_cwn5CdvzDduOfFApGSH5VoZSwuk=.7eb8a0e3-4ad6-4ffb-97fd-11a2120a3eaf@github.com> References: <-otlKVhe_xfmpET_cwn5CdvzDduOfFApGSH5VoZSwuk=.7eb8a0e3-4ad6-4ffb-97fd-11a2120a3eaf@github.com> Message-ID: On Wed, 30 Jul 2025 06:14:40 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is >> relative smaller than that of `fromLong`. So this patch does the conversion for these cases. >> >> The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. >> >> Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. >> >> This conversion also enables further optimizations that recognize maskAll patterns, see [1]. 
And we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like >> >> VectorMaskToLong (VectorLongToMask x) => x >> >> >> Hence, this patch also added the following optimizations: >> >> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> >> VectorMaskCast (VectorMaskCast x) => x >> >> And we can see noticeable performance improvement with the above optimizations for floating-point types. >> >> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Error After Error Uplift >> microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 >> microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 >> microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 >> microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 >> microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 >> microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 >> microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 >> microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 >> >> >> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: >> >> Benchm... > > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Set default warm up to 10000 for JTReg tests I think there are a few (follow-up?) improvements that can be made: 1. Using KnownBits and checking against that rather than requiring a constant in `is_maskall_type`. This is probably a bit difficult to test for now. 2. If the range of an input is known to be [-1, 0], we can use that as an input for a MaskAllNode. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3135107549 From mchevalier at openjdk.org Wed Jul 30 08:27:37 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Wed, 30 Jul 2025 08:27:37 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v2] In-Reply-To: References: Message-ID: <1A8oR7hEgev2U_ys1H_AVJS5kjw6LWoPgrVPhJXSFqI=.34cbd04b-bf88-441f-9c3d-97f9aee7f3c3@github.com> > Some crashes are consequences of earlier misshaped ideal graphs, which could be detected earlier, closer to the source, before the possibly many transformations that lead to the crash. > > Let's verify that the ideal graph is well-shaped earlier then! I propose here such a feature. This runs after IGVN, because at this point, the graph, should be cleaned up for any weirdness happening earlier or during IGVN. > > This feature is enabled with the develop flag `VerifyIdealStructuralInvariants`. Open to renaming. No problem with me! This feature is only available in debug builds, and most of the code is even not compiled in product, since it uses some debug-only functions, such as `Node::dump` or `Node::Name`. > > For now, only local checks are implemented: they are checks that only look at a node and its neighborhood, wherever it happens in the graph. Typically: under a `If` node, we have a `IfTrue` and a `IfFalse`. 
To ease development, each check is implemented in its own class, independently of the others. Nevertheless, one needs to do always the same kind of things: checking there is an output of such type, checking there is N inputs, that the k-th input has such type... To ease writing such checks, in a readable way, and in a less error-prone way than pile of copy-pasted code that manually traverse the graph, I propose a set of compositional helpers to write patterns that can be matched against the ideal graph. Since these patterns are... patterns, so not related to a specific graph, they can be allocated once and forever. When used, one provides the node (called center) around which one want to check if the pattern holds. > > On top of making the description of pattern easier, these helpers allows nice printing in case of error, by showing the path from the center to the violating node. For instance (made up for the purpose of showing the formatting), a violation with a path climbing only inputs: > > 1 failure for node > 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 > At node > 209 CountedLoopEnd === 182 208 [[ 210 197 ]] [lt] P=0,948966, C=23799,000000 !orig=[196] !jvms: StringLatin1::equals @ bci:12 (line 100) > From path: > [center] 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 > <-(0)- 215 SafePoint === 210 1 7 1 1 216 37 54 185 [[ 211 ]] SafePoint !orig=186 !jvms: StringLatin1::equals @ bci:29 (line 100) > <-(0)- 210 IfFalse === 209 [[ 215 216 ]] #0 !orig=198 !jvms: StringL... Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: Rename flag as suggested ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26362/files - new: https://git.openjdk.org/jdk/pull/26362/files/944a8fe4..9117fde8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26362&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26362&range=00-01 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26362.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26362/head:pull/26362 PR: https://git.openjdk.org/jdk/pull/26362 From jsjolen at openjdk.org Wed Jul 30 09:06:04 2025 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Wed, 30 Jul 2025 09:06:04 GMT Subject: RFR: 8352112: [ubsan] hotspot/share/code/relocInfo.cpp:130:37: runtime error: applying non-zero offset 18446744073709551614 to null pointer [v2] In-Reply-To: <7ja3_KpFi1NPc4EPFpMk3af7RgGtQYu0zGmrv05lCj0=.a7fb616e-8923-47f1-b869-3bb064d27f58@github.com> References: <7ja3_KpFi1NPc4EPFpMk3af7RgGtQYu0zGmrv05lCj0=.a7fb616e-8923-47f1-b869-3bb064d27f58@github.com> Message-ID: On Mon, 28 Jul 2025 14:39:26 GMT, Vladimir Kozlov wrote: >> Right. This typo was fixed in https://github.com/openjdk/jdk/pull/26175 >> For now I do not see how this change is related with [JDK-8361382: NMT corruption](https://bugs.openjdk.org/browse/JDK-8361382) > > Yes, it was fixed. And they were harmless. > > I think @jdksjolen linked it because of call stack. But I also don't know how it could cause NMT bug. > @jdksjolen did you try to to undo these changes and reproduce https://bugs.openjdk.org/browse/JDK-8361382 ? 
> > > V [libjvm.dylib+0xbf1c8c] VMError::report(outputStream*, bool)+0xa9c (mallocHeader.inline.hpp:107) > V [libjvm.dylib+0xbf5d25] VMError::report_and_die(int, char const*, char const*, __va_list_tag*, Thread*, unsigned char*, void const*, void const*, char const*, int, unsigned long)+0x575 > V [libjvm.dylib+0x404e20] DebuggingContext::~DebuggingContext()+0x0 > V [libjvm.dylib+0x8f770f] MallocHeader* MallocHeader::resolve_checked_impl(void*)+0x15f > V [libjvm.dylib+0x8f720c] MallocTracker::record_free_block(void*)+0xc > V [libjvm.dylib+0x9a719a] os::free(void*)+0xea > V [libjvm.dylib+0x388fb4] CodeBlob::purge()+0x44 > V [libjvm.dylib+0x978e98] nmethod::purge(bool)+0x308 > V [libjvm.dylib+0x380439] ClassUnloadingContext::purge_nmethods()+0x69 My reasoning was based on the fact that what used to be set to a constant, `nullptr`, is no longer set to a constant. I'm saying that `blob_end()` isn't a constant, because it can be changed with `adjust_size()`, but I think that @vnkozlov said that that's unlikely, as it's only changed for interpreter stubs and not nmethods. The only other possibility is that `nmethod`s are copied, so their `this` pointer changes, this will make `blob_end()` change, and this may incur a double free. This double free is detected by NMT, which leads to the crash. I think it may still be best to 'fix' this by setting the `_mutable_data` to `nullptr` again and fixing the iterators, as it does simplify reasoning around this (and imho, understanding the code). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24102#discussion_r2241999516 From chagedorn at openjdk.org Wed Jul 30 09:07:58 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 30 Jul 2025 09:07:58 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v7] In-Reply-To: References: Message-ID: <3yPAlpt3kRdPI4dCUey57DXQvlF8QFkhJpNiv4742Og=.4889e854-5f03-4d01-87e0-14938779e396@github.com> On Tue, 29 Jul 2025 10:14:20 GMT, erifan wrote: > And increasing the default warm up value fixes the issue. Nice, that's good to hear! > Hi @chhagedorn , I have increased the warm up times, could you help test the PR again ? Thanks! Thanks for coming back with a fix! I'll resubmit testing and report back again. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3135443727 From mhaessig at openjdk.org Wed Jul 30 12:05:36 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Wed, 30 Jul 2025 12:05:36 GMT Subject: RFR: 8356176: C2 MemorySegment: missing RCE with byteSize() in Loop Exit Check inside the for Expression Message-ID: A loop of the form MemorySegment ms = {}; for (long i = 0; i < ms.byteSize() / 8L; i++) { // vectorizable work } does not vectorize, whereas MemorySegment ms = {}; long size = ms.byteSize(); for (long i = 0; i < size / 8L; i++) { // vectorizable work } vectorizes. The reason is that the loop with the loop limit lifted manually out of the loop exit check is immediately detected as a counted loop, whereas the other (more intuitive) loop has to be cleaned up a bit, before it is recognized as counted. Tragically, the `LShift` used in the loop exit check gets split through the phi preventing range check elimination, which is why the loop does not get vectorized. 
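For reference, a self-contained sketch of the two loop shapes being compared is below (the array-backed segment and the summing body are made up for the example and simply stand in for the vectorizable work):

    import java.lang.foreign.MemorySegment;
    import java.lang.foreign.ValueLayout;

    public class ByteSizeLoopShapes {
        // Limit recomputed from ms.byteSize() in the exit check: the shape that
        // misses range check elimination and vectorization before this change.
        static long sumLimitInCondition(MemorySegment ms) {
            long sum = 0;
            for (long i = 0; i < ms.byteSize() / 8L; i++) {
                sum += ms.getAtIndex(ValueLayout.JAVA_LONG, i); // vectorizable work
            }
            return sum;
        }

        // Limit hoisted by hand: recognized as a counted loop and vectorized.
        static long sumHoistedLimit(MemorySegment ms) {
            long sum = 0;
            long size = ms.byteSize();
            for (long i = 0; i < size / 8L; i++) {
                sum += ms.getAtIndex(ValueLayout.JAVA_LONG, i); // vectorizable work
            }
            return sum;
        }

        public static void main(String[] args) {
            MemorySegment ms = MemorySegment.ofArray(new long[1024]);
            System.out.println(sumLimitInCondition(ms) + " " + sumHoistedLimit(ms));
        }
    }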
Before splitting through the phi, there is a check to prevent splitting `LShift`s modifying the IV of a *counted loop*: https://github.com/openjdk/jdk/blob/e3f85c961b4c1e5e01aedf3a0f4e1b0e6ff457fd/src/hotspot/share/opto/loopopts.cpp#L1172-L1176 Hence, not detecting the counted loop earlier is the main culprit for the missing vectorization. So, why is the counted loop not detected? Because the call to `byteSize()` is inside the loop head, and `CiTypeFlow::clone_loop_heads()` duplicates it into the loop body. The loop limit in the cloned loop head is loop variant and thus cannot be detected as a counted loop. The first `ITER_GVN` in `PHASEIDEALLOOP1` will already remove the cloned loop head, enabling counted loop detection in the following iteration, which in turn enables vectorization. @merykitty also provides an alternative explanation. A node is only split through a phi if that splitting is profitable. While the split looks to be profitable in the example above, it only generates wins on the loop entry edge. This ends up destroying the canonical loop structure and prevents further optimization. Other issues like [JDK-8348096](https://bugs.openjdk.org/browse/JDK-8348096) suffer from the same problem ## Change Description Based on @merykitty's reasoning, this PR tracks if wins in `split_through_phi()` are on the loop entry edge or the loop backedge. If there are wins on a loop entry edge, we do not consider the split to be profitable unless there are a lot of wins on the backedge.
Explored Alternatives
1. Prevent splitting `LShift`s in uncounted loops that have the same shape as a counted loop would have. This fixes this specific issue, but causes potential regressions with uncounted loops.
2. Insert a "`PHASEIDEALLOOP0`" with `LoopOptsNone` that only performs loop tree building and then a round of IGVN where `Loop` nodes have been created. This cleans up the duplicated loop limit field access inside the loop, which enables the counted loop detection in `PHASEIDEALLOOP1`. This fixes this issue and a few others, but has loads of unforeseen consequences for loopopts down the line, including some regressions.
This solution also has an impact on some tests: - `compiler/loopopts/InvariantCodeMotionReassociateAddSub.java` observes fewer `AddI` nodes ([d9a59af](https://github.com/openjdk/jdk/pull/26429/commits/d9a59af977da70575a1e215c504958b1fb3db6a6)) - `compiler/vectorization/runner/ArrayIndexFillTest.java` only remains with the `fillLongArray` case attributed to [JDK-8332878](https://bugs.openjdk.org/browse/JDK-8332878) and the previously failing floating point cases fixed ([5839f15](https://github.com/openjdk/jdk/pull/26429/commits/5839f157cae57f80fb041251a0a28327a0970fae)) - `compiler/loopopts/superword/TestMemorySegment.java` shows that the failing test cases tracked by [JDK-8331659](https://bugs.openjdk.org/browse/JDK-8331659) pass now ([63689f8](https://github.com/openjdk/jdk/pull/26429/commits/63689f84b364828f7b50979acf1443498dddd1da)) - the reproducer from [JDK-8348096](https://bugs.openjdk.org/browse/JDK-8348096) is fixed with this PR. Added `TestMemorySegmentField.java` as regression test. ## Testing - [x] Github Actions - [x] tier1 - tier3 plus some internal testing on Oracle supported platforms - [x] tier4 - tier6 on Oracle supported platforms - [ ] SPECjbb2025, SPECjvm2008, Dacapo23 ## Acknowledgements Big thanks to @merykitty for coming up with the solution to this issue and providing feedback, as well as @eme64, @chhagedorn, @TobiHartmann, and @rwestrel for discussing this issue and providing feedback. ------------- Commit messages: - Address review comments - Add test from JDK-8348096 - Fix cases of JDK-8332878 not caused by push through add - Adjust previously failing tests tracked by JDK-8331659 - Adjust for eliminated nodes in InvariantCodeMotionReassociateAddSub.java - Split only profitable when not on entry edge - Add regression test Changes: https://git.openjdk.org/jdk/pull/26429/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26429&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8356176 Stats: 228 lines in 7 files changed: 187 ins; 18 del; 23 mod Patch: https://git.openjdk.org/jdk/pull/26429.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26429/head:pull/26429 PR: https://git.openjdk.org/jdk/pull/26429 From qamai at openjdk.org Wed Jul 30 12:05:37 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 30 Jul 2025 12:05:37 GMT Subject: RFR: 8356176: C2 MemorySegment: missing RCE with byteSize() in Loop Exit Check inside the for Expression In-Reply-To: References: Message-ID: <2BCEe5coWSwvmmoUBLWZlzBs81azC4xeekoxZgLv_7I=.61ff0c53-5671-471e-96fb-85875116b5ac@github.com> On Tue, 22 Jul 2025 15:05:29 GMT, Manuel H?ssig wrote: > A loop of the form > > MemorySegment ms = {}; > for (long i = 0; i < ms.byteSize() / 8L; i++) { > // vectorizable work > } > > does not vectorize, whereas > > MemorySegment ms = {}; > long size = ms.byteSize(); > for (long i = 0; i < size / 8L; i++) { > // vectorizable work > } > > vectorizes. The reason is that the loop with the loop limit lifted manually out of the loop exit check is immediately detected as a counted loop, whereas the other (more intuitive) loop has to be cleaned up a bit, before it is recognized as counted. Tragically, the `LShift` used in the loop exit check gets split through the phi preventing range check elimination, which is why the loop does not get vectorized. 
Before splitting through the phi, there is a check to prevent splitting `LShift`s modifying the IV of a *counted loop*: > > https://github.com/openjdk/jdk/blob/e3f85c961b4c1e5e01aedf3a0f4e1b0e6ff457fd/src/hotspot/share/opto/loopopts.cpp#L1172-L1176 > > Hence, not detecting the counted loop earlier is the main culprit for the missing vectorization. > > So, why is the counted loop not detected? Because the call to `byteSize()` is inside the loop head, and `CiTypeFlow::clone_loop_heads()` duplicates it into the loop body. The loop limit in the cloned loop head is loop variant and thus cannot be detected as a counted loop. The first `ITER_GVN` in `PHASEIDEALLOOP1` will already remove the cloned loop head, enabling counted loop detection in the following iteration, which in turn enables vectorization. > > @merykitty also provides an alternative explanation. A node is only split through a phi if that splitting is profitable. While the split looks to be profitable in the example above, it only generates wins on the loop entry edge. This ends up destroying the canonical loop structure and prevents further optimization. Other issues like [JDK-8348096](https://bugs.openjdk.org/browse/JDK-8348096) suffer from the same problem > > ## Change Description > > Based on @merykitty's reasoning, this PR tracks if wins in `split_through_phi()` are on the loop entry edge or the loop backedge. If there are wins on a loop entry edge, we do not consider the split to be profitable unless there are a lot of wins on the backedge. > >
Explored Alternatives > 1. Prevent splitting `LShift`s in uncounted loops that have the same shape as a counted loop would have. This fixes this specific issue, but causes potential regressions with uncounted loops. > 2. Insert a "`PHASEIDEALLOOP0`" with `LoopOptsNone` that only perfor... Thanks a lot for your work, I have some suggestions. LGTM, I have some small suggestions. I think one possible solution is to avoid splitting through `Phi` if there is no benefit. In this case, the only benefit is in the loop entry, while there is none in the loop backedge. If we take frequency into consideration, we may be able to determine that the splitting is not profitable. What do you think? >From the principle point of view, splitting a node through the loop `Phi` is only profitable if the profit is in the loop backedge. From the practical point of view, there are some issues when `split_through_phi` is applied recklessly such as [JDK-8348096](https://bugs.openjdk.org/browse/JDK-8348096). I believe taking loop head into consideration when splitting through `Phi`s can solve these issues. As a result, I think while you are at this issue, it is worth investigating this approach. src/hotspot/share/opto/loopnode.hpp line 1639: > 1637: class SplitWins { > 1638: private: > 1639: uint _total_wins; I think using `int` here is totally fine, it frees you from all the refactoring `int` to `uint`, too. src/hotspot/share/opto/loopnode.hpp line 1655: > 1653: } > 1654: if (region->is_Loop() && ctrl_index == LoopNode::LoopBackControl) { > 1655: _backedge_wins++; Since the corresponding thing is called `LoopBack` in `LoopNode`, calling this `loop_back_wins` would be a little bit more consistent. src/hotspot/share/opto/loopnode.hpp line 1660: > 1658: } > 1659: bool profitable(uint policy) const { > 1660: return _total_wins >= policy && !(_backedge_wins == 0 && _entry_wins > 0); I think this should be `(_entry_wins == 0 && _total_wins >= policy) || _backedge_wins >= policy`. src/hotspot/share/opto/loopnode.hpp line 1663: > 1661: // split if profitable. > 1662: bool profitable(int policy) const { > 1663: return policy < 0 || (_loop_entry_wins == 0 && _total_wins > policy) || _loop_back_wins > policy; `policy < 0` seems unnecessary, `wins` is initialized with 0 and is always incremented, so it cannot be negative. I assume you are guarding against a hypothetical arithmetic overflow, but signed overflow is UB in C++. So the program is ill-formed if that happens. Additionally, we will catch that with UBSAN. src/hotspot/share/opto/loopopts.cpp line 70: > 68: } > 69: > 70: SplitWins wins = SplitWins(); `SplitWins wins` will initialize a `SplitWins` variable using the default constructor, so `= SplitWins()` is unnecessary. src/hotspot/share/opto/loopopts.cpp line 1091: > 1089: } > 1090: > 1091: // Detect if split_through_phi would split an LShift that multiplies a I tried your patch and without this the test still vectorizes well. If this is necessary please provide another test demonstrating its necessity. src/hotspot/share/opto/loopopts.cpp line 1202: > 1200: // so 1 win is considered profitable. Big merges will require big > 1201: // cloning, so get a larger policy. > 1202: int policy = checked_cast(n_blk->req() >> 2); This change seems unnecessary. src/hotspot/share/opto/loopopts.cpp line 1508: > 1506: > 1507: // Now split the bool up thru the phi > 1508: Node *bolphi = split_thru_phi(bol, n_ctrl, 0); This change could be reverted if you keep `policy` being an `int`. 
------------- PR Review: https://git.openjdk.org/jdk/pull/26429#pullrequestreview-3048157767 Marked as reviewed by qamai (Committer). PR Review: https://git.openjdk.org/jdk/pull/26429#pullrequestreview-3070996121 PR Comment: https://git.openjdk.org/jdk/pull/26429#issuecomment-3106388733 PR Comment: https://git.openjdk.org/jdk/pull/26429#issuecomment-3107857642 PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2226074337 PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2226077544 PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2226075898 PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2242338007 PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2242339561 PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2226072833 PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2242341678 PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2226078521 From mhaessig at openjdk.org Wed Jul 30 12:05:38 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Wed, 30 Jul 2025 12:05:38 GMT Subject: RFR: 8356176: C2 MemorySegment: missing RCE with byteSize() in Loop Exit Check inside the for Expression In-Reply-To: References: Message-ID: On Tue, 22 Jul 2025 15:05:29 GMT, Manuel H?ssig wrote: > A loop of the form > > MemorySegment ms = {}; > for (long i = 0; i < ms.byteSize() / 8L; i++) { > // vectorizable work > } > > does not vectorize, whereas > > MemorySegment ms = {}; > long size = ms.byteSize(); > for (long i = 0; i < size / 8L; i++) { > // vectorizable work > } > > vectorizes. The reason is that the loop with the loop limit lifted manually out of the loop exit check is immediately detected as a counted loop, whereas the other (more intuitive) loop has to be cleaned up a bit, before it is recognized as counted. Tragically, the `LShift` used in the loop exit check gets split through the phi preventing range check elimination, which is why the loop does not get vectorized. Before splitting through the phi, there is a check to prevent splitting `LShift`s modifying the IV of a *counted loop*: > > https://github.com/openjdk/jdk/blob/e3f85c961b4c1e5e01aedf3a0f4e1b0e6ff457fd/src/hotspot/share/opto/loopopts.cpp#L1172-L1176 > > Hence, not detecting the counted loop earlier is the main culprit for the missing vectorization. > > So, why is the counted loop not detected? Because the call to `byteSize()` is inside the loop head, and `CiTypeFlow::clone_loop_heads()` duplicates it into the loop body. The loop limit in the cloned loop head is loop variant and thus cannot be detected as a counted loop. The first `ITER_GVN` in `PHASEIDEALLOOP1` will already remove the cloned loop head, enabling counted loop detection in the following iteration, which in turn enables vectorization. > > @merykitty also provides an alternative explanation. A node is only split through a phi if that splitting is profitable. While the split looks to be profitable in the example above, it only generates wins on the loop entry edge. This ends up destroying the canonical loop structure and prevents further optimization. Other issues like [JDK-8348096](https://bugs.openjdk.org/browse/JDK-8348096) suffer from the same problem > > ## Change Description > > Based on @merykitty's reasoning, this PR tracks if wins in `split_through_phi()` are on the loop entry edge or the loop backedge. 
If there are wins on a loop entry edge, we do not consider the split to be profitable unless there are a lot of wins on the backedge. > >
Explored Alternatives > 1. Prevent splitting `LShift`s in uncounted loops that have the same shape as a counted loop would have. This fixes this specific issue, but causes potential regressions with uncounted loops. > 2. Insert a "`PHASEIDEALLOOP0`" with `LoopOptsNone` that only perfor... I would not rely solely on profile information to solve this, but it might be a good additional piece of information for the first proposed solution. I see, but, if I understand correctly, any logic that relates to profit will have to go into `split_through_phi()`. Since we already have special logic for not splitting `LShifts` in `split_if_with_blocks_pre()`, I would prefer to only have it there. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26429#issuecomment-3106653136 PR Comment: https://git.openjdk.org/jdk/pull/26429#issuecomment-3107735579 From qamai at openjdk.org Wed Jul 30 12:05:38 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 30 Jul 2025 12:05:38 GMT Subject: RFR: 8356176: C2 MemorySegment: missing RCE with byteSize() in Loop Exit Check inside the for Expression In-Reply-To: References: Message-ID: On Wed, 23 Jul 2025 09:11:46 GMT, Manuel H?ssig wrote: >> A loop of the form >> >> MemorySegment ms = {}; >> for (long i = 0; i < ms.byteSize() / 8L; i++) { >> // vectorizable work >> } >> >> does not vectorize, whereas >> >> MemorySegment ms = {}; >> long size = ms.byteSize(); >> for (long i = 0; i < size / 8L; i++) { >> // vectorizable work >> } >> >> vectorizes. The reason is that the loop with the loop limit lifted manually out of the loop exit check is immediately detected as a counted loop, whereas the other (more intuitive) loop has to be cleaned up a bit, before it is recognized as counted. Tragically, the `LShift` used in the loop exit check gets split through the phi preventing range check elimination, which is why the loop does not get vectorized. Before splitting through the phi, there is a check to prevent splitting `LShift`s modifying the IV of a *counted loop*: >> >> https://github.com/openjdk/jdk/blob/e3f85c961b4c1e5e01aedf3a0f4e1b0e6ff457fd/src/hotspot/share/opto/loopopts.cpp#L1172-L1176 >> >> Hence, not detecting the counted loop earlier is the main culprit for the missing vectorization. >> >> So, why is the counted loop not detected? Because the call to `byteSize()` is inside the loop head, and `CiTypeFlow::clone_loop_heads()` duplicates it into the loop body. The loop limit in the cloned loop head is loop variant and thus cannot be detected as a counted loop. The first `ITER_GVN` in `PHASEIDEALLOOP1` will already remove the cloned loop head, enabling counted loop detection in the following iteration, which in turn enables vectorization. >> >> @merykitty also provides an alternative explanation. A node is only split through a phi if that splitting is profitable. While the split looks to be profitable in the example above, it only generates wins on the loop entry edge. This ends up destroying the canonical loop structure and prevents further optimization. Other issues like [JDK-8348096](https://bugs.openjdk.org/browse/JDK-8348096) suffer from the same problem >> >> ## Change Description >> >> Based on @merykitty's reasoning, this PR tracks if wins in `split_through_phi()` are on the loop entry edge or the loop backedge. If there are wins on a loop entry edge, we do not consider the split to be profitable unless there are a lot of wins on the backedge. >> >>
Explored Alternatives >> 1. Prevent splitting `LShift`s in uncounted loops that have the same shape as a counted loop would have. This fixes this specific issue, but causes potential regressions with uncounted loops. >> 2. I... > > I would not rely solely on profile information to solve this, but it might be a good additional piece of information for the first proposed solution. @mhaessig You don't really need profile information, only that the profit is on the loop entry path and there is no profit on the loop backedge. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26429#issuecomment-3106679172 From mhaessig at openjdk.org Wed Jul 30 12:05:38 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Wed, 30 Jul 2025 12:05:38 GMT Subject: RFR: 8356176: C2 MemorySegment: missing RCE with byteSize() in Loop Exit Check inside the for Expression In-Reply-To: <2BCEe5coWSwvmmoUBLWZlzBs81azC4xeekoxZgLv_7I=.61ff0c53-5671-471e-96fb-85875116b5ac@github.com> References: <2BCEe5coWSwvmmoUBLWZlzBs81azC4xeekoxZgLv_7I=.61ff0c53-5671-471e-96fb-85875116b5ac@github.com> Message-ID: On Wed, 23 Jul 2025 12:34:37 GMT, Quan Anh Mai wrote: >> A loop of the form >> >> MemorySegment ms = {}; >> for (long i = 0; i < ms.byteSize() / 8L; i++) { >> // vectorizable work >> } >> >> does not vectorize, whereas >> >> MemorySegment ms = {}; >> long size = ms.byteSize(); >> for (long i = 0; i < size / 8L; i++) { >> // vectorizable work >> } >> >> vectorizes. The reason is that the loop with the loop limit lifted manually out of the loop exit check is immediately detected as a counted loop, whereas the other (more intuitive) loop has to be cleaned up a bit, before it is recognized as counted. Tragically, the `LShift` used in the loop exit check gets split through the phi preventing range check elimination, which is why the loop does not get vectorized. Before splitting through the phi, there is a check to prevent splitting `LShift`s modifying the IV of a *counted loop*: >> >> https://github.com/openjdk/jdk/blob/e3f85c961b4c1e5e01aedf3a0f4e1b0e6ff457fd/src/hotspot/share/opto/loopopts.cpp#L1172-L1176 >> >> Hence, not detecting the counted loop earlier is the main culprit for the missing vectorization. >> >> So, why is the counted loop not detected? Because the call to `byteSize()` is inside the loop head, and `CiTypeFlow::clone_loop_heads()` duplicates it into the loop body. The loop limit in the cloned loop head is loop variant and thus cannot be detected as a counted loop. The first `ITER_GVN` in `PHASEIDEALLOOP1` will already remove the cloned loop head, enabling counted loop detection in the following iteration, which in turn enables vectorization. >> >> @merykitty also provides an alternative explanation. A node is only split through a phi if that splitting is profitable. While the split looks to be profitable in the example above, it only generates wins on the loop entry edge. This ends up destroying the canonical loop structure and prevents further optimization. Other issues like [JDK-8348096](https://bugs.openjdk.org/browse/JDK-8348096) suffer from the same problem >> >> ## Change Description >> >> Based on @merykitty's reasoning, this PR tracks if wins in `split_through_phi()` are on the loop entry edge or the loop backedge. If there are wins on a loop entry edge, we do not consider the split to be profitable unless there are a lot of wins on the backedge. >> >>
Explored Alternatives >> 1. Prevent splitting `LShift`s in uncounted loops that have the same shape as a counted loop would have. This fixes this specific issue, but causes potential regressions with uncounted loops. >> 2. I... > > From the principle point of view, splitting a node through the loop `Phi` is only profitable if the profit is in the loop backedge. From the practical point of view, there are some issues when `split_through_phi` is applied recklessly such as [JDK-8348096](https://bugs.openjdk.org/browse/JDK-8348096). I believe taking loop head into consideration when splitting through `Phi`s can solve these issues. As a result, I think while you are at this issue, it is worth investigating this approach. @merykitty, I took me a while to understand, but now I implemented your suggestion and it works at least the case of this issue (testing is ongoing). Thank you for pushing back. EDIT: It also fixes [JDK-8348096](https://bugs.openjdk.org/browse/JDK-8348096). Thank you for your review and your help with this PR, @merykitty! > src/hotspot/share/opto/loopnode.hpp line 1663: > >> 1661: // split if profitable. >> 1662: bool profitable(int policy) const { >> 1663: return policy < 0 || (_loop_entry_wins == 0 && _total_wins > policy) || _loop_back_wins > policy; > > `policy < 0` seems unnecessary, `wins` is initialized with 0 and is always incremented, so it cannot be negative. I assume you are guarding against a hypothetical arithmetic overflow, but signed overflow is UB in C++. So the program is ill-formed if that happens. Additionally, we will catch that with UBSAN. My intention was to clearly state that `policy = -1` means "always split". That confused me before. > src/hotspot/share/opto/loopopts.cpp line 70: > >> 68: } >> 69: >> 70: SplitWins wins = SplitWins(); > > `SplitWins wins` will initialize a `SplitWins` variable using the default constructor, so `= SplitWins()` is unnecessary. Thank you for pointing it out. Will remove. > src/hotspot/share/opto/loopopts.cpp line 1091: > >> 1089: } >> 1090: >> 1091: // Detect if split_through_phi would split an LShift that multiplies a > > I tried your patch and without this the test still vectorizes well. If this is necessary please provide another test demonstrating its necessity. Will do, when cleaning up for RFR. > src/hotspot/share/opto/loopopts.cpp line 1202: > >> 1200: // so 1 win is considered profitable. Big merges will require big >> 1201: // cloning, so get a larger policy. >> 1202: int policy = checked_cast(n_blk->req() >> 2); > > This change seems unnecessary. Now that you say it, yes, the shift right makes that check obsolete. Wll remove. > src/hotspot/share/opto/loopopts.cpp line 1508: > >> 1506: >> 1507: // Now split the bool up thru the phi >> 1508: Node *bolphi = split_thru_phi(bol, n_ctrl, 0); > > This change could be reverted if you keep `policy` being an `int`. Also, my change is wrong, because `policy == -1` means "split even if there are no wins". 
------------- PR Comment: https://git.openjdk.org/jdk/pull/26429#issuecomment-3108961271 PR Comment: https://git.openjdk.org/jdk/pull/26429#issuecomment-3135981220 PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2242362758 PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2242367711 PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2228078394 PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2242366956 PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2228089983 From qamai at openjdk.org Wed Jul 30 12:05:38 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Wed, 30 Jul 2025 12:05:38 GMT Subject: RFR: 8356176: C2 MemorySegment: missing RCE with byteSize() in Loop Exit Check inside the for Expression In-Reply-To: <2BCEe5coWSwvmmoUBLWZlzBs81azC4xeekoxZgLv_7I=.61ff0c53-5671-471e-96fb-85875116b5ac@github.com> References: <2BCEe5coWSwvmmoUBLWZlzBs81azC4xeekoxZgLv_7I=.61ff0c53-5671-471e-96fb-85875116b5ac@github.com> Message-ID: On Wed, 23 Jul 2025 16:13:47 GMT, Quan Anh Mai wrote: >> A loop of the form >> >> MemorySegment ms = {}; >> for (long i = 0; i < ms.byteSize() / 8L; i++) { >> // vectorizable work >> } >> >> does not vectorize, whereas >> >> MemorySegment ms = {}; >> long size = ms.byteSize(); >> for (long i = 0; i < size / 8L; i++) { >> // vectorizable work >> } >> >> vectorizes. The reason is that the loop with the loop limit lifted manually out of the loop exit check is immediately detected as a counted loop, whereas the other (more intuitive) loop has to be cleaned up a bit, before it is recognized as counted. Tragically, the `LShift` used in the loop exit check gets split through the phi preventing range check elimination, which is why the loop does not get vectorized. Before splitting through the phi, there is a check to prevent splitting `LShift`s modifying the IV of a *counted loop*: >> >> https://github.com/openjdk/jdk/blob/e3f85c961b4c1e5e01aedf3a0f4e1b0e6ff457fd/src/hotspot/share/opto/loopopts.cpp#L1172-L1176 >> >> Hence, not detecting the counted loop earlier is the main culprit for the missing vectorization. >> >> So, why is the counted loop not detected? Because the call to `byteSize()` is inside the loop head, and `CiTypeFlow::clone_loop_heads()` duplicates it into the loop body. The loop limit in the cloned loop head is loop variant and thus cannot be detected as a counted loop. The first `ITER_GVN` in `PHASEIDEALLOOP1` will already remove the cloned loop head, enabling counted loop detection in the following iteration, which in turn enables vectorization. >> >> @merykitty also provides an alternative explanation. A node is only split through a phi if that splitting is profitable. While the split looks to be profitable in the example above, it only generates wins on the loop entry edge. This ends up destroying the canonical loop structure and prevents further optimization. Other issues like [JDK-8348096](https://bugs.openjdk.org/browse/JDK-8348096) suffer from the same problem >> >> ## Change Description >> >> Based on @merykitty's reasoning, this PR tracks if wins in `split_through_phi()` are on the loop entry edge or the loop backedge. If there are wins on a loop entry edge, we do not consider the split to be profitable unless there are a lot of wins on the backedge. >> >>
Explored Alternatives >> 1. Prevent splitting `LShift`s in uncounted loops that have the same shape as a counted loop would have. This fixes this specific issue, but causes potential regressions with uncounted loops. >> 2. I... > > src/hotspot/share/opto/loopnode.hpp line 1660: > >> 1658: } >> 1659: bool profitable(uint policy) const { >> 1660: return _total_wins >= policy && !(_backedge_wins == 0 && _entry_wins > 0); > > I think this should be `(_entry_wins == 0 && _total_wins >= policy) || _backedge_wins >= policy`. My bad, since the original negative condition is `wins <= policy`, this should be `(_entry_wins == 0 && _total_wins > policy) || _backedge_wins > policy`. Fixing this solves all the issues in GHA. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2227776568 From mhaessig at openjdk.org Wed Jul 30 12:05:38 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Wed, 30 Jul 2025 12:05:38 GMT Subject: RFR: 8356176: C2 MemorySegment: missing RCE with byteSize() in Loop Exit Check inside the for Expression In-Reply-To: References: <2BCEe5coWSwvmmoUBLWZlzBs81azC4xeekoxZgLv_7I=.61ff0c53-5671-471e-96fb-85875116b5ac@github.com> Message-ID: On Thu, 24 Jul 2025 08:08:16 GMT, Quan Anh Mai wrote: >> src/hotspot/share/opto/loopnode.hpp line 1660: >> >>> 1658: } >>> 1659: bool profitable(uint policy) const { >>> 1660: return _total_wins >= policy && !(_backedge_wins == 0 && _entry_wins > 0); >> >> I think this should be `(_entry_wins == 0 && _total_wins >= policy) || _backedge_wins >= policy`. > > My bad, since the original negative condition is `wins <= policy`, this should be `(_entry_wins == 0 && _total_wins > policy) || _backedge_wins > policy`. Fixing this solves all the issues in GHA. Can confirm. This also seems to solve JDK-8331659 and JDK-8332878. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2228283016 From fgao at openjdk.org Wed Jul 30 15:01:02 2025 From: fgao at openjdk.org (Fei Gao) Date: Wed, 30 Jul 2025 15:01:02 GMT Subject: RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v3] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 03:26:36 GMT, Xiaohong Gong wrote: >> This is a follow-up patch of [1], which aims at implementing the subword gather load APIs for AArch64 SVE platform. >> >> ### Background >> Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register. >> >> ### Implementation >> >> #### Challenges >> Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints. >> >> For a 512-bit SVE machine, loading a `byte` vector with different vector species require different approaches: >> - SPECIES_64: Single operation with mask (8 elements, 256-bit) >> - SPECIES_128: Single operation, full register (16 elements, 512-bit) >> - SPECIES_256: Two operations + merge (32 elements, 1024-bit) >> - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit) >> >> Use `ByteVector.SPECIES_512` as an example: >> - It contains 64 elements. 
So the index vector size should be `64 * 32` bits, which is 4 times the SVE vector register size. >> - It requires 4 vector gather-loads to finish the whole operation. >> >> >> byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...] >> int[] idx = [0, 1, 2, 3, ..., 63, ...] >> >> 4 gather-load: >> idx_v1 = [15 14 13 ... 1 0] gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa] >> idx_v2 = [31 30 29 ... 17 16] gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb] >> idx_v3 = [47 46 45 ... 33 32] gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc] >> idx_v4 = [63 62 61 ... 49 48] gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd] >> merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa] >> >> >> #### Solution >> The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end. >> >> Here are the main changes: >> - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher. >> - Added `VectorSliceNode` for result mer... > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Refine IR pattern and clean backend rules Thanks for updating it! I've submitted a test on a 256-bit SVE machine. I'll get back to you once it's finished. src/hotspot/share/opto/vectorIntrinsics.cpp line 1176: > 1174: } > 1175: > 1176: // Generate a vector mask by casting the input mask from "byte|short" type to "int" type for vector It seems that we're not doing "casting" here. Suggestion: // Widen the input mask "in" from "byte|short" to "int" for use in vector gather loads. // The "part" parameter selects which segment of the original mask to extend. src/hotspot/share/opto/vectorIntrinsics.cpp line 1186: > 1184: assert(part < 4, "must be"); > 1185: const TypeVect* temp_vt = TypeVect::makemask(T_SHORT, vt->length() * 2); > 1186: // If part == 0, the elements of the lowest 1/4 part are extended. Suggestion: // If part == 0, extend elements from the lowest 1/4 of the input. // If part == 1, extend elements from the second 1/4. // If part == 2, extend elements from the third 1/4. // If part == 3, extend elements from the highest 1/4. ------------- PR Review: https://git.openjdk.org/jdk/pull/26236#pullrequestreview-3071808794 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2242854832 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2242873220 From fbredberg at openjdk.org Wed Jul 30 15:38:44 2025 From: fbredberg at openjdk.org (Fredrik Bredberg) Date: Wed, 30 Jul 2025 15:38:44 GMT Subject: RFR: 8364141: Remove LockingMode related code from x86 Message-ID: Since the integration of [JDK-8359437](https://bugs.openjdk.org/browse/JDK-8359437) the `LockingMode` flag can no longer be set by the user; instead it's declared as `const int LockingMode = LM_LIGHTWEIGHT;`. This means that we can now safely remove all `LockingMode` related code from all platforms. This PR removes `LockingMode` related code from the **x86** platform. When all the `LockingMode` code has been removed from all platforms, we can go on and remove it from shared (non-platform specific) files as well. And finally remove the `LockingMode` variable itself. Passes tier1-tier5 with no added problems.
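To make the cleanup concrete, here is a minimal standalone sketch of why such branches are now dead (only the `const int LockingMode = LM_LIGHTWEIGHT;` declaration is taken from the description above; the enum values and emit_* helpers are illustrative assumptions, not HotSpot code):

#include <cstdio>

enum { LM_MONITOR = 0, LM_LEGACY = 1, LM_LIGHTWEIGHT = 2 };  // values illustrative
const int LockingMode = LM_LIGHTWEIGHT;  // now a constant, no longer a user-settable flag

static void emit_lightweight_lock() { std::puts("lightweight locking path"); }
static void emit_legacy_lock()      { std::puts("legacy locking path"); }

static void emit_lock() {
  // With LockingMode a compile-time constant, this test folds and the legacy
  // branch is unreachable -- which is exactly the kind of code being deleted.
  if (LockingMode == LM_LIGHTWEIGHT) {
    emit_lightweight_lock();
  } else {
    emit_legacy_lock();
  }
}

int main() { emit_lock(); return 0; }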
------------- Commit messages: - 8364141: Remove LockingMode related code from x86 Changes: https://git.openjdk.org/jdk/pull/26552/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26552&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8364141 Stats: 637 lines in 10 files changed: 44 ins; 537 del; 56 mod Patch: https://git.openjdk.org/jdk/pull/26552.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26552/head:pull/26552 PR: https://git.openjdk.org/jdk/pull/26552 From kvn at openjdk.org Wed Jul 30 15:54:06 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 30 Jul 2025 15:54:06 GMT Subject: RFR: 8352112: [ubsan] hotspot/share/code/relocInfo.cpp:130:37: runtime error: applying non-zero offset 18446744073709551614 to null pointer [v2] In-Reply-To: References: <7ja3_KpFi1NPc4EPFpMk3af7RgGtQYu0zGmrv05lCj0=.a7fb616e-8923-47f1-b869-3bb064d27f58@github.com> Message-ID: On Wed, 30 Jul 2025 09:03:29 GMT, Johan Sj?len wrote: >> Yes, it was fixed. And they were harmless. >> >> I think @jdksjolen linked it because of call stack. But I also don't know how it could cause NMT bug. >> @jdksjolen did you try to to undo these changes and reproduce https://bugs.openjdk.org/browse/JDK-8361382 ? >> >> >> V [libjvm.dylib+0xbf1c8c] VMError::report(outputStream*, bool)+0xa9c (mallocHeader.inline.hpp:107) >> V [libjvm.dylib+0xbf5d25] VMError::report_and_die(int, char const*, char const*, __va_list_tag*, Thread*, unsigned char*, void const*, void const*, char const*, int, unsigned long)+0x575 >> V [libjvm.dylib+0x404e20] DebuggingContext::~DebuggingContext()+0x0 >> V [libjvm.dylib+0x8f770f] MallocHeader* MallocHeader::resolve_checked_impl(void*)+0x15f >> V [libjvm.dylib+0x8f720c] MallocTracker::record_free_block(void*)+0xc >> V [libjvm.dylib+0x9a719a] os::free(void*)+0xea >> V [libjvm.dylib+0x388fb4] CodeBlob::purge()+0x44 >> V [libjvm.dylib+0x978e98] nmethod::purge(bool)+0x308 >> V [libjvm.dylib+0x380439] ClassUnloadingContext::purge_nmethods()+0x69 > > My reasoning was based on the fact that what used to be set to a constant, `nullptr`, is no longer set to a constant. I'm saying that `blob_end()` isn't a constant, because it can be changed with `adjust_size()`, but I think that @vnkozlov said that that's unlikely, as it's only changed for interpreter stubs and not nmethods. The only other possibility is that `nmethod`s are copied, so their `this` pointer changes, this will make `blob_end()` change, and this may incur a double free. This double free is detected by NMT, which leads to the crash. > > I think it may still be best to 'fix' this by setting the `_mutable_data` to `nullptr` again and fixing the iterators, as it does simplify reasoning around this (and imho, understanding the code). We do not copy nmethods. At least until #23573 is integrated - and it will be under flag. `_mutable_data` field is initialized during final method installation into CodeCache - nothing modifies it for nmethods. I can add debug flag to CodeBlob to catch double free. But as I commented in [JDK-8361382](https://bugs.openjdk.org/browse/JDK-8361382) it is most likely the issue is a buffer overflow from preceding memory block which stomped over header. 
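A minimal sketch of the failure mode being debated here (purely illustrative; the names follow the discussion, not the actual CodeBlob/nmethod code): when the "nothing to free" sentinel is a computed address such as blob_end() rather than a constant nullptr, anything that changes that address between allocation and purge turns the guard into a bad free, which NMT then reports.

#include <cstddef>
#include <cstdlib>

struct Blob {
  char* _mutable_data;
  char  _body[16];

  char* blob_end() { return _body + sizeof(_body); }  // sentinel: "no separate allocation"

  void init(size_t mutable_data_size) {
    _mutable_data = (mutable_data_size > 0)
        ? static_cast<char*>(std::malloc(mutable_data_size))
        : blob_end();
  }

  void purge() {
    // If blob_end() moved (adjust_size(), or the blob was copied so 'this' changed),
    // the sentinel is no longer recognized and we free a pointer that was never malloc'ed.
    if (_mutable_data != blob_end()) {
      std::free(_mutable_data);
    }
    _mutable_data = blob_end();
  }
};

int main() {
  Blob b;
  b.init(0);   // no mutable data: sentinel in use
  b.purge();   // safe only as long as blob_end() is stable
  return 0;
}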
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24102#discussion_r2243145779 From kvn at openjdk.org Wed Jul 30 15:54:07 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 30 Jul 2025 15:54:07 GMT Subject: RFR: 8352112: [ubsan] hotspot/share/code/relocInfo.cpp:130:37: runtime error: applying non-zero offset 18446744073709551614 to null pointer [v2] In-Reply-To: References: <7ja3_KpFi1NPc4EPFpMk3af7RgGtQYu0zGmrv05lCj0=.a7fb616e-8923-47f1-b869-3bb064d27f58@github.com> Message-ID: <_yvfEt6gxtA0gaoAyeuaOzN8u7Og_QhZyKtWCp9_q2c=.864cd245-1681-4d42-80c7-cd9a00e45cef@github.com> On Wed, 30 Jul 2025 15:51:03 GMT, Vladimir Kozlov wrote: >> My reasoning was based on the fact that what used to be set to a constant, `nullptr`, is no longer set to a constant. I'm saying that `blob_end()` isn't a constant, because it can be changed with `adjust_size()`, but I think that @vnkozlov said that that's unlikely, as it's only changed for interpreter stubs and not nmethods. The only other possibility is that `nmethod`s are copied, so their `this` pointer changes, this will make `blob_end()` change, and this may incur a double free. This double free is detected by NMT, which leads to the crash. >> >> I think it may still be best to 'fix' this by setting the `_mutable_data` to `nullptr` again and fixing the iterators, as it does simplify reasoning around this (and imho, understanding the code). > > We do not copy nmethods. At least until #23573 is integrated - and it will be under flag. > > `_mutable_data` field is initialized during final method installation into CodeCache - nothing modifies it for nmethods. > > I can add debug flag to CodeBlob to catch double free. But as I commented in [JDK-8361382](https://bugs.openjdk.org/browse/JDK-8361382) it is most likely the issue is a buffer overflow from preceding memory block which stomped over header. I will do experiment with flag and let you know. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24102#discussion_r2243147688 From coleenp at openjdk.org Wed Jul 30 16:24:53 2025 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 30 Jul 2025 16:24:53 GMT Subject: RFR: 8364141: Remove LockingMode related code from x86 In-Reply-To: References: Message-ID: On Wed, 30 Jul 2025 13:17:37 GMT, Fredrik Bredberg wrote: > Since the integration of [JDK-8359437](https://bugs.openjdk.org/browse/JDK-8359437) the `LockingMode` flag can no longer be set by the user, instead it's declared as `const int LockingMode = LM_LIGHTWEIGHT;`. This means that we can now safely remove all `LockingMode` related code from all platforms. > > This PR removes `LockingMode` related code from the **x86** platform. > > When all the `LockingMode` code has been removed from all platforms, we can go on and remove it from shared (non-platform specific) files as well. And finally remove the `LockingMode` variable itself. > > Passes tier1-tier5 with no added problems. This looks really good. Thank you for the comment about balanced locking. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 389: > 387: // obj: object to lock > 388: // rax: tmp -- KILLED > 389: // t : tmp - cannot be obj nor rax -- KILLED This same comment is repeated just above so you probably don't need it here. ------------- Marked as reviewed by coleenp (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26552#pullrequestreview-3072397438 PR Review Comment: https://git.openjdk.org/jdk/pull/26552#discussion_r2243231761 From duke at openjdk.org Wed Jul 30 16:37:08 2025 From: duke at openjdk.org (duke) Date: Wed, 30 Jul 2025 16:37:08 GMT Subject: Withdrawn: 8252473: [TESTBUG] compiler tests fail with minimal VM: Unrecognized VM option In-Reply-To: References: Message-ID: On Wed, 26 Mar 2025 17:37:58 GMT, Zdenek Zambersky wrote: > This change adds ` -XX:-IgnoreUnrecognizedVMOptions` to problematic tests (or `@requires vm.compiler2.enabled` in one case), to prevent failures `Unrecognized VM option` on client VM. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/24262 From shade at openjdk.org Wed Jul 30 18:51:26 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 30 Jul 2025 18:51:26 GMT Subject: RFR: 8361211: C2: Final graph reshaping generates unencodeable klass constants Message-ID: See the bug for more investigation. I have tried to come up with an isolated test, but failed. So I am doing this change somewhat blindly, without a clear regression test. The investigation on the CTW points directly to this code, and I believe we should be more conservative in final graph reshaping. [JDK-8343206](https://bugs.openjdk.org/browse/JDK-8343206) added the assert for `ConNKlass`, which somehow does not trigger. I think it is safe to bail out of this transformation. Also, this only plugs this particular leak. I think we should really be disabling the abstract/interface encoding optimization until C2 does not expose itself to this issue on more paths. There is [JDK-8343218](https://bugs.openjdk.org/browse/JDK-8343218) that we can re-open. Additional testing: - [x] Linux x86_64 server fastdebug, a rare CTW failure does not reproduce anymore - [x] Linux x86_64 server fastdebug, `tier1` - [ ] Linux x86_64 server fastdebug, `all` - [ ] Linux AArch64 server fastdebug, `all` ------------- Commit messages: - Fix Changes: https://git.openjdk.org/jdk/pull/26559/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26559&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8361211 Stats: 8 lines in 1 file changed: 6 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/26559.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26559/head:pull/26559 PR: https://git.openjdk.org/jdk/pull/26559 From ghan at openjdk.org Wed Jul 30 22:58:50 2025 From: ghan at openjdk.org (Guanqiang Han) Date: Wed, 30 Jul 2025 22:58:50 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" [v4] In-Reply-To: References: Message-ID: > I'm able to consistently reproduce the problem using the following command line and test program ? 
> > java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java > > import java.util.Arrays; > public class Test{ > public static void main(String[] args) { > System.out.println("begin"); > byte[] arr1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > byte[] arr2 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; > System.out.println(Arrays.equals(arr1, arr2)); > System.out.println("end"); > } > } > > From my analysis, the root cause appears to be a mismatch in operand handling between T_ADDRESS and T_LONG in LIR_Assembler::stack2reg, especially when the source is marked as double stack (e.g., T_LONG) and the destination as single CPU register (e.g., T_ADDRESS), leading to assertion failures like assert(is_single_stack()) (because T_LONG is double_size). > > In the test program above, the call chain is: Arrays.equals -> ArraysSupport.vectorizedMismatch -> LIRGenerator::do_vectorizedMismatch > Within the do_vectorizedMismatch() method, a move instruction constructs an LIR_Op1. During LIR to machine code generation, LIR_Assembler::stack2reg was called. > > In this case, the src operand has type T_LONG and the dst operand has type T_ADDRESS. This combination triggers an assert in stack2reg, due to a mismatch between the stack slot type and register type handling. > > Importantly, this path (where LIR_Assembler::stack2reg is called) is only taken when src is forced onto the stack. To reliably trigger this condition, the test is run with the -Xcomp option to force compilation and increase register pressure. > > A reference to the relevant code paths is provided below: > image1 > image2 > > On 64-bit platforms, although T_ADDRESS is classified as single_size, it is in fact 64 bits wide, representing a single 64-bit general-purpose register, and it can hold a T_LONG value, which is also 64 bits. > > However, T_LONG is defined as double_size, requiring two local variable slots or a pair of registers in the JVM's abstract model. This mismatch stems from the fact that T_ADDRESS is platform-dependent: it's 32 bits on 32-bit platforms, and 64 bits on 64-bit platforms, yet its size classification remains single_size regardless. > > This classification... Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains six additional commits since the last revision: - change T_LONG to T_ADDRESS in some intrinsic functions - Merge remote-tracking branch 'upstream/master' into 8359235 - Increase sleep time to ensure the method gets compiled - add regression test - Merge remote-tracking branch 'upstream/master' into 8359235 - 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26462/files - new: https://git.openjdk.org/jdk/pull/26462/files/611d2fd1..c90be2b5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26462&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26462&range=02-03 Stats: 6945 lines in 261 files changed: 3887 ins; 2617 del; 441 mod Patch: https://git.openjdk.org/jdk/pull/26462.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26462/head:pull/26462 PR: https://git.openjdk.org/jdk/pull/26462 From kvn at openjdk.org Wed Jul 30 23:12:53 2025 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 30 Jul 2025 23:12:53 GMT Subject: Re: RFR: 8361211: C2: Final graph reshaping generates unencodeable klass constants In-Reply-To: References: Message-ID: <6sEYx73eZK7XQXOLg5u2qHUqXc2OoqqR_iAlc5C9QrU=.4474e9ad-0e3d-4fe0-9011-0dfc3f2e069a@github.com> On Wed, 30 Jul 2025 16:20:43 GMT, Aleksey Shipilev wrote: > See the bug for more investigation. I have tried to come up with an isolated test, but failed. So I am doing this change somewhat blindly, without a clear regression test. The investigation on the CTW points directly to this code, and I believe we should be more conservative in final graph reshaping. [JDK-8343206](https://bugs.openjdk.org/browse/JDK-8343206) added the assert for `ConNKlass`, which somehow does not trigger. I think it is safe to bail out of this transformation. > > Also, this only plugs this particular leak. I think we should really be disabling the abstract/interface encoding optimization until C2 does not expose itself to this issue on more paths. There is [JDK-8343218](https://bugs.openjdk.org/browse/JDK-8343218) that we can re-open. > > Additional testing: > - [x] Linux x86_64 server fastdebug, a rare CTW failure does not reproduce anymore > - [x] Linux x86_64 server fastdebug, `tier1` > - [ ] Linux x86_64 server fastdebug, `all` > - [ ] Linux AArch64 server fastdebug, `all` Do you know why the logic in `CmpPNode::Ideal()` did not work? That is what @TobiHartmann pointed out before. I have 2 assumptions which could be wrong: - We did not call IGVN transform when we do some constant folding and replaced its input with a more exact klass which is not encodable. - Node's inputs were swapped when `CmpPNode::Ideal()` is called - code assumes that Decode is in(1) and Klass is in(2). Maybe it is something else. It would be interesting to track it down because it may cause other issues.
as shown below: >> image3 >> image4 >> That's why I opted for a more localized fix. I believe this is still a reasonable compromise. On 64-bit platforms, both T_ADDRESS and T_LONG are 64-bit wide, and general-purpose registers are capable of holding either type. Moreover, the code already uses movptr for moving 64-bit wide data, as shown below: >> image5 >> So semantically, this modification in the PR seems safe and practical in this context. >> That said, I fully agree that the current treatment of new_pointer_register() is a bit confusing. If you, or other experts familiar with this area, believe the RFE is reasonable and it gets opened, I'd be happy to take on the implementation. >> Thanks again for your insights, and I look forward to your feedback. > > @hgqxjj , I wasn't suggesting changing the new_pointer_register() implementation to use T_ADDRESS at this time, but to change intrinsics that call signature.append(T_ADDRESS) to use new_register(T_ADDRESS) for the register instead of new_pointer_register(). As @TobiHartmann pointed out, we should fix all the intrinsics that are using signature.append(T_ADDRESS). Hi @dean-long @TobiHartmann, I've updated the patch based on your feedback. Please take another look when you have time. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26462#issuecomment-3138299913 From fyang at openjdk.org Thu Jul 31 02:02:00 2025 From: fyang at openjdk.org (Fei Yang) Date: Thu, 31 Jul 2025 02:02:00 GMT Subject: Re: RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v10] In-Reply-To: References: Message-ID: On Tue, 15 Jul 2025 14:05:25 GMT, Yuri Gaevsky wrote: >> The patch adds the possibility to use RVV instructions for faster vectorizedHashCode calculations on RVV v1.0.0 capable hardware. >> >> Testing: hotspot/jtreg/compiler/ under QEMU-8.1 with RVV v1.0.0. > > Yuri Gaevsky has updated the pull request incrementally with one additional commit since the last revision: > > - removed tail processing with RVV instructions as a simple scalar loop provides better results in general src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 2062: > 2060: vmv_s_x(v_powmax, pow31_highest); > 2061: > 2062: vsetvli(consumed, cnt, Assembler::e32, Assembler::m4); What does the performance look like with a smaller `lmul` (m1 or m2)? I am asking this because there is hardware out there (like SG2044) with a VLEN of 128 instead of 256 like on K1. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/17413#discussion_r2244179129 From duke at openjdk.org Thu Jul 31 02:40:07 2025 From: duke at openjdk.org (erifan) Date: Thu, 31 Jul 2025 02:40:07 GMT Subject: Re: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v8] In-Reply-To: References: <-otlKVhe_xfmpET_cwn5CdvzDduOfFApGSH5VoZSwuk=.7eb8a0e3-4ad6-4ffb-97fd-11a2120a3eaf@github.com> Message-ID: <7WsFhgNuJV99O_IxmrhPCWDMvRMsxY9ZRh_VGCYCL_M=.0612a34c-cb05-4ad6-b2dd-1ebc1fc03244@github.com> On Wed, 30 Jul 2025 07:06:52 GMT, Hannes Greule wrote: >> erifan has updated the pull request incrementally with one additional commit since the last revision: >> >> Set default warm up to 10000 for JTReg tests > > I think there are a few (follow-up?) improvements that can be made: > 1. Using KnownBits and checking against that rather than requiring a constant in `is_maskall_type`. This is probably a bit difficult to test for now. > 2. If the range of an input is known to be [-1, 0], we can use that as an input for a MaskAllNode. Hi @SirYwell thanks for your suggestions.
But I'm not quite understand what you meant, can you elaborate? ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3138376026 From jbhateja at openjdk.org Thu Jul 31 03:11:08 2025 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 31 Jul 2025 03:11:08 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v8] In-Reply-To: References: <-otlKVhe_xfmpET_cwn5CdvzDduOfFApGSH5VoZSwuk=.7eb8a0e3-4ad6-4ffb-97fd-11a2120a3eaf@github.com> Message-ID: On Wed, 30 Jul 2025 07:06:52 GMT, Hannes Greule wrote: > I think there are a few (follow-up?) improvements that can be made: > > 1. Using KnownBits and checking against that rather than requiring a constant in `is_maskall_type`. This is probably a bit difficult to test for now. > 2. If the range of an input is known to be [-1, 0], we can use that as an input for a MaskAllNode. Constants are the limiting case of KnownBits where all the bits are known, i.e., KnownBits.ZEROS | Known.Bits.ONES = -1, since the pattern check is especially over -1 / 0 constant values, hence what we have currently looks reasonable. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3138416237 From dlong at openjdk.org Thu Jul 31 03:34:55 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 31 Jul 2025 03:34:55 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v7] In-Reply-To: <_gJoTNnBpV2Y2ENO9s153NWZeq_ujs40-zoyuZstOqM=.69d1d039-5022-4beb-ae79-7fc4193f3a11@github.com> References: <_gJoTNnBpV2Y2ENO9s153NWZeq_ujs40-zoyuZstOqM=.69d1d039-5022-4beb-ae79-7fc4193f3a11@github.com> Message-ID: On Mon, 28 Jul 2025 08:33:50 GMT, Manuel H?ssig wrote: >> Dean Long has updated the pull request incrementally with one additional commit since the last revision: >> >> readability suggestion > > Thank you for addressing my comments. I have done another pass and it looks good to me. Thank you, @mhaessig , for looking at this complicated code! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26121#issuecomment-3138444068 From jkarthikeyan at openjdk.org Thu Jul 31 03:37:47 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Thu, 31 Jul 2025 03:37:47 GMT Subject: RFR: 8342095: Add autovectorizer support for subword vector casts [v14] In-Reply-To: References: Message-ID: > Hi all, > This patch adds initial support for the autovectorizer to generate conversions between subword types. Currently, when superword sees two packs that have different basic types, it discards them and bails out of vectorization. This patch changes the behavior to ask the backend if a cast between the conflicting types is supported, and keeps the pack if it is. Later, when the `VTransform` graph is built, a synthetic cast is emitted when packs requiring casts are detected. Currently, only narrowing casts are supported as I wanted to re-use existing `VectorCastX2Y` logic for the initial version, but adding more conversions is simple and can be done with a subsequent RFE. I have attached a JMH benchmark and got these results on my Zen 3 machine: > > > Baseline Patch > Benchmark (SIZE) Mode Cnt Score Error Units Score Error Units Improvement > VectorSubword.intToByte 1024 avgt 12 200.049 ? 19.787 ns/op 56.228 ? 3.535 ns/op (3.56x) > VectorSubword.intToShort 1024 avgt 12 179.826 ? 1.539 ns/op 43.332 ? 1.166 ns/op (4.15x) > VectorSubword.shortToByte 1024 avgt 12 245.580 ? 6.150 ns/op 29.757 ? 1.055 ns/op (8.25x) > > > I've also added some IR tests and they pass on my linux x64 machine. 
Thoughts and reviews would be appreciated! Jasmine Karthikeyan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 16 commits: - Update tests, cleanup logic - Merge branch 'master' into vectorize-subword - Check for AVX2 for byte/long conversions - Whitespace and benchmark tweak - Address more comments, make test and benchmark more exhaustive - Merge from master - Fix copyright after merge - Fix copyright - Merge - Implement patch with VectorCastNode::implemented - ... and 6 more: https://git.openjdk.org/jdk/compare/8fcbb110...aabaafba ------------- Changes: https://git.openjdk.org/jdk/pull/23413/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23413&range=13 Stats: 578 lines in 12 files changed: 519 ins; 11 del; 48 mod Patch: https://git.openjdk.org/jdk/pull/23413.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/23413/head:pull/23413 PR: https://git.openjdk.org/jdk/pull/23413 From jkarthikeyan at openjdk.org Thu Jul 31 03:37:49 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Thu, 31 Jul 2025 03:37:49 GMT Subject: RFR: 8342095: Add autovectorizer support for subword vector casts [v13] In-Reply-To: References: Message-ID: <_rkrkHcE6r68dUmpuYnzY3evs6Q_GksPTuopyxDD1JY=.0da7ba0c-1257-4f9e-a9d7-51af0679ef6d@github.com> On Wed, 28 May 2025 15:25:19 GMT, Vladimir Kozlov wrote: >> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision: >> >> Check for AVX2 for byte/long conversions > > src/hotspot/share/opto/superword.cpp line 2361: > >> 2359: >> 2360: // Subword cast: Element sizes differ, but the platform supports a cast to change the def shape to the use shape. >> 2361: if ((is_subword_type(def_bt) || is_subword_type(use_bt)) && VectorCastNode::implemented(-1, pack_size, def_bt, use_bt)) { > > I see you use this set of conditions 2 time. Can it be separate function? Also `-1` is strange argument for people who not familiar with code. May be add `/* comment */` to it. Or use some `#define` to have meaningful name for it. Thanks for the suggestion! I've moved the logic to a function to reduce code duplication and added a comment explaining the `-1`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r2244268512 From jkarthikeyan at openjdk.org Thu Jul 31 03:51:05 2025 From: jkarthikeyan at openjdk.org (Jasmine Karthikeyan) Date: Thu, 31 Jul 2025 03:51:05 GMT Subject: RFR: 8342095: Add autovectorizer support for subword vector casts [v14] In-Reply-To: References: Message-ID: On Thu, 31 Jul 2025 03:37:47 GMT, Jasmine Karthikeyan wrote: >> Hi all, >> This patch adds initial support for the autovectorizer to generate conversions between subword types. Currently, when superword sees two packs that have different basic types, it discards them and bails out of vectorization. This patch changes the behavior to ask the backend if a cast between the conflicting types is supported, and keeps the pack if it is. Later, when the `VTransform` graph is built, a synthetic cast is emitted when packs requiring casts are detected. Currently, only narrowing casts are supported as I wanted to re-use existing `VectorCastX2Y` logic for the initial version, but adding more conversions is simple and can be done with a subsequent RFE. 
I have attached a JMH benchmark and got these results on my Zen 3 machine: >> >> >> Baseline Patch >> Benchmark (SIZE) Mode Cnt Score Error Units Score Error Units Improvement >> VectorSubword.intToByte 1024 avgt 12 200.049 ? 19.787 ns/op 56.228 ? 3.535 ns/op (3.56x) >> VectorSubword.intToShort 1024 avgt 12 179.826 ? 1.539 ns/op 43.332 ? 1.166 ns/op (4.15x) >> VectorSubword.shortToByte 1024 avgt 12 245.580 ? 6.150 ns/op 29.757 ? 1.055 ns/op (8.25x) >> >> >> I've also added some IR tests and they pass on my linux x64 machine. Thoughts and reviews would be appreciated! > > Jasmine Karthikeyan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 16 commits: > > - Update tests, cleanup logic > - Merge branch 'master' into vectorize-subword > - Check for AVX2 for byte/long conversions > - Whitespace and benchmark tweak > - Address more comments, make test and benchmark more exhaustive > - Merge from master > - Fix copyright after merge > - Fix copyright > - Merge > - Implement patch with VectorCastNode::implemented > - ... and 6 more: https://git.openjdk.org/jdk/compare/8fcbb110...aabaafba I've merged from master and updated the tests to support the changes from [JDK-8350177](https://bugs.openjdk.org/browse/JDK-8350177). Now the non-truncating nodes can compile again :) I've also added the changes from the code review. Let me know what you all think! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23413#issuecomment-3138463834 From snatarajan at openjdk.org Thu Jul 31 07:22:21 2025 From: snatarajan at openjdk.org (Saranya Natarajan) Date: Thu, 31 Jul 2025 07:22:21 GMT Subject: RFR: 8325482: Test that distinct seeds produce distinct traces for compiler stress flags Message-ID: The existing test (`compiler/debug/TestStress.java`) verifies that compiler stress options produce consistent traces when using the same seed. However, there is currently no test to ensure that different seeds result in different traces. ### Solution Added a test case to assess the distinctness of traces generated from different seeds. This fix addresses the fragility concern highlighted in [JDK-8325482](https://bugs.openjdk.org/browse/JDK-8325482) by verifying that traces produced using N (in this case 10) distinct seeds are all not identical. ### Changes to `compiler/debug/TestStress.java` While investigating this issue, I observed that in `compiler/debug/TestStress.java`, the stress options for macro expansion and macro elimination were not being triggered because there were fewer than 2 macro nodes. Note that the `shuffle_macro_nodes()` in` compile.cpp` is only meaningful when there are more than two macro nodes. The generated traces for macro expansion and macro elimination in `TestStress.java` were empty. I have proposed changes to address this problem. 
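A condensed sketch of that approach follows (the `trace(seed)` helper is a stand-in for the test's per-phase trace helpers, which would fork a JVM with a given -XX:StressSeed and capture the compilation trace; this is an assumption about the shape of the test, not the actual TestStressDistinctSeed code):

import java.util.HashSet;
import java.util.Set;

public class DistinctSeedSketch {
    // Stand-in for a helper that would launch a JVM with -XX:StressSeed=<seed>
    // (plus the relevant -XX:+Stress* flag) and return the captured trace.
    static String trace(int seed) {
        return "trace-for-seed-" + seed;  // stubbed for illustration
    }

    public static void main(String[] args) {
        Set<String> traces = new HashSet<>();
        for (int seed = 0; seed < 10; seed++) {  // N = 10 distinct seeds
            traces.add(trace(seed));
        }
        // If every seed produced an identical trace, the stress seed had no effect.
        if (traces.size() <= 1) {
            throw new AssertionError("all seeds produced identical traces");
        }
    }
}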
------------- Commit messages: - Initial Fix Changes: https://git.openjdk.org/jdk/pull/26554/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26554&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8325482 Stats: 127 lines in 2 files changed: 126 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/26554.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26554/head:pull/26554 PR: https://git.openjdk.org/jdk/pull/26554 From hgreule at openjdk.org Thu Jul 31 07:29:59 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Thu, 31 Jul 2025 07:29:59 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v8] In-Reply-To: References: <-otlKVhe_xfmpET_cwn5CdvzDduOfFApGSH5VoZSwuk=.7eb8a0e3-4ad6-4ffb-97fd-11a2120a3eaf@github.com> Message-ID: On Wed, 30 Jul 2025 07:06:52 GMT, Hannes Greule wrote: >> erifan has updated the pull request incrementally with one additional commit since the last revision: >> >> Set default warm up to 10000 for JTReg tests > > I think there are a few (follow-up?) improvements that can be made: > 1. Using KnownBits and checking against that rather than requiring a constant in `is_maskall_type`. This is probably a bit difficult to test for now. > 2. If the range of an input is known to be [-1, 0], we can use that as an input for a MaskAllNode. > Hi @SirYwell thanks for your suggestions. But I'm not quite understand what you meant, can you elaborate? @erifan for my first point, knowing that the lower n bits are all 0 or all 1 is enough, i.e., whether `(type->_bits._ones & mask) == mask` (equivalent to `maskAll(true)`) or `(type->_bits._zeros & mask) == mask` (equivalent to `maskAll(false)`). I think we can't test that part well right now because other nodes are missing KnownBits specific Value() implementations. For the second one, if `type->_lo == -1 && type->_hi == 0`, then we know that the node with this type can be used to represent true or false respectively. I hacked something together to clarify what I mean: https://github.com/SirYwell/jdk/commit/02e13a479f5e627cc997939865cd1816942d8309 Please let me know if there's still something unclear. (That said I'm completely fine with the PR as-is, especially as the KnownBits part is hard to test right now.) ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3138849095 From chagedorn at openjdk.org Thu Jul 31 07:35:57 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 31 Jul 2025 07:35:57 GMT Subject: RFR: 8325482: Test that distinct seeds produce distinct traces for compiler stress flags In-Reply-To: References: Message-ID: On Wed, 30 Jul 2025 13:38:37 GMT, Saranya Natarajan wrote: > The existing test (`compiler/debug/TestStress.java`) verifies that compiler stress options produce consistent traces when using the same seed. However, there is currently no test to ensure that different seeds result in different traces. > > ### Solution > Added a test case to assess the distinctness of traces generated from different seeds. This fix addresses the fragility concern highlighted in [JDK-8325482](https://bugs.openjdk.org/browse/JDK-8325482) by verifying that traces produced using N (in this case 10) distinct seeds are all not identical. > > ### Changes to `compiler/debug/TestStress.java` > While investigating this issue, I observed that in `compiler/debug/TestStress.java`, the stress options for macro expansion and macro elimination were not being triggered because there were fewer than 2 macro nodes. 
Note that the `shuffle_macro_nodes()` in` compile.cpp` is only meaningful when there are more than two macro nodes. The generated traces for macro expansion and macro elimination in `TestStress.java` were empty. I have proposed changes to address this problem. Thanks for adding such a test. A few comments, otherwise, it looks good! test/hotspot/jtreg/compiler/debug/TestStressDistinctSeed.java line 36: > 34: * @key stress randomness > 35: * @requires vm.debug == true & vm.compiler2.enabled > 36: * @requires vm.flagless can be merged: Suggestion: * @requires vm.debug == true & vm.compiler2.enabled & vm.flagless test/hotspot/jtreg/compiler/debug/TestStressDistinctSeed.java line 37: > 35: * @requires vm.debug == true & vm.compiler2.enabled > 36: * @requires vm.flagless > 37: * @summary Tests that stress compilations with the N different seed yield different Suggestion: * @summary Tests that stress compilations with the N different seeds yield different test/hotspot/jtreg/compiler/debug/TestStressDistinctSeed.java line 102: > 100: ccpTraceSet.add(ccpTrace(s)); > 101: macroExpansionTraceSet.add(macroExpansionTrace(s)); > 102: macroEliminationTraceSet.add(macroEliminationTrace(s)); A suggestion, do you also want to check here that two runs with the same seed produce the same result to show that different seeds really produce different results due to the seed and not just some indeterminism with the test itself? How long does your test need now and afterwards with a fastdebug build? Maybe we can also lower the number of seeds if it takes too long or only do the equality-test for a single seed. ------------- PR Review: https://git.openjdk.org/jdk/pull/26554#pullrequestreview-3074273830 PR Review Comment: https://git.openjdk.org/jdk/pull/26554#discussion_r2244573969 PR Review Comment: https://git.openjdk.org/jdk/pull/26554#discussion_r2244574237 PR Review Comment: https://git.openjdk.org/jdk/pull/26554#discussion_r2244588666 From duke at openjdk.org Thu Jul 31 08:09:57 2025 From: duke at openjdk.org (erifan) Date: Thu, 31 Jul 2025 08:09:57 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v8] In-Reply-To: References: <-otlKVhe_xfmpET_cwn5CdvzDduOfFApGSH5VoZSwuk=.7eb8a0e3-4ad6-4ffb-97fd-11a2120a3eaf@github.com> Message-ID: On Thu, 31 Jul 2025 07:27:43 GMT, Hannes Greule wrote: >> I think there are a few (follow-up?) improvements that can be made: >> 1. Using KnownBits and checking against that rather than requiring a constant in `is_maskall_type`. This is probably a bit difficult to test for now. >> 2. If the range of an input is known to be [-1, 0], we can use that as an input for a MaskAllNode. > >> Hi @SirYwell thanks for your suggestions. But I'm not quite understand what you meant, can you elaborate? > > @erifan for my first point, knowing that the lower n bits are all 0 or all 1 is enough, i.e., whether `(type->_bits._ones & mask) == mask` (equivalent to `maskAll(true)`) or `(type->_bits._zeros & mask) == mask` (equivalent to `maskAll(false)`). I think we can't test that part well right now because other nodes are missing KnownBits specific Value() implementations. > > For the second one, if `type->_lo == -1 && type->_hi == 0`, then we know that the node with this type can be used to represent true or false respectively. > > I hacked something together to clarify what I mean: https://github.com/SirYwell/jdk/commit/02e13a479f5e627cc997939865cd1816942d8309 > > Please let me know if there's still something unclear. 
> > (That said I'm completely fine with the PR as-is, especially as the KnownBits part is hard to test right now.) @SirYwell, thanks for your explanation, now I got your points. It's a good idea, with your suggestions, this optimization may apply to more cases. As you said, the KnownBits part is hard to test right now, so that's it for now. ------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3138958939 From jsjolen at openjdk.org Thu Jul 31 08:34:03 2025 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Thu, 31 Jul 2025 08:34:03 GMT Subject: RFR: 8352112: [ubsan] hotspot/share/code/relocInfo.cpp:130:37: runtime error: applying non-zero offset 18446744073709551614 to null pointer [v2] In-Reply-To: <_yvfEt6gxtA0gaoAyeuaOzN8u7Og_QhZyKtWCp9_q2c=.864cd245-1681-4d42-80c7-cd9a00e45cef@github.com> References: <7ja3_KpFi1NPc4EPFpMk3af7RgGtQYu0zGmrv05lCj0=.a7fb616e-8923-47f1-b869-3bb064d27f58@github.com> <_yvfEt6gxtA0gaoAyeuaOzN8u7Og_QhZyKtWCp9_q2c=.864cd245-1681-4d42-80c7-cd9a00e45cef@github.com> Message-ID: On Wed, 30 Jul 2025 15:51:42 GMT, Vladimir Kozlov wrote: >> We do not copy nmethods. At least until #23573 is integrated - and it will be under flag. >> >> `_mutable_data` field is initialized during final method installation into CodeCache - nothing modifies it for nmethods. >> >> I can add debug flag to CodeBlob to catch double free. But as I commented in [JDK-8361382](https://bugs.openjdk.org/browse/JDK-8361382) it is most likely the issue is a buffer overflow from preceding memory block which stomped over header. > > I will do experiment with flag and let you know. Thank you ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/24102#discussion_r2244724403 From shade at openjdk.org Thu Jul 31 08:45:54 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 31 Jul 2025 08:45:54 GMT Subject: RFR: 8361211: C2: Final graph reshaping generates unencodeable klass constants In-Reply-To: <6sEYx73eZK7XQXOLg5u2qHUqXc2OoqqR_iAlc5C9QrU=.4474e9ad-0e3d-4fe0-9011-0dfc3f2e069a@github.com> References: <6sEYx73eZK7XQXOLg5u2qHUqXc2OoqqR_iAlc5C9QrU=.4474e9ad-0e3d-4fe0-9011-0dfc3f2e069a@github.com> Message-ID: On Wed, 30 Jul 2025 23:10:27 GMT, Vladimir Kozlov wrote: > Do you know why logic in `CmpPNode::Ideal()` did not work? That is what @TobiHartmann pointed before. I have not been able to track it down. The CTW failure I was chasing was highly intermittent and only reproducible in rare conditions. I suspect the same: the conditions where `CmpPNode::Ideal` run are limited and not guaranteed to fold out _all_ the unencodable constants. > Would be interesting to track it done because it may cause other issues. I agree this likely points to a more widespread problem. To be honest, I am pretty horrified that we emit the unencodeable `ConN`-s, and _then_ rely on various node idealization rules to knock them down. So now, if we hold `ConN`, we cannot be sure it would not break down the road! Speaking of nightmarish scenarios, I cannot see, for example, what prevents a particular arch-specific matching rule to assume that `ConN` is encodeable and start doing tricks based on that assumption. This PR only handles one limited case in final graph reshaping where I saw this definitely failing. But we also emit these constants as the matter of course during parsing, see the [JDK-8343218](https://bugs.openjdk.org/browse/JDK-8343218) comment for example stack trace. 
I think we must do a change that puts the abstract/interface encoding optimization behind the feature flag, and we should disable that flag by default, until we are 99.(9)% sure C2 is immune to these issues. A better place to discuss this would be [JDK-8343218](https://bugs.openjdk.org/browse/JDK-8343218). Meanwhile, I would like to plug the leak in final graph reshaping with this PR. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26559#issuecomment-3139062739 From thartmann at openjdk.org Thu Jul 31 09:43:30 2025 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 31 Jul 2025 09:43:30 GMT Subject: RFR: 8364409: [BACKOUT] 8350988: Consolidate Identity of self-inverse operations In-Reply-To: References: Message-ID: On Thu, 31 Jul 2025 09:36:20 GMT, Manuel H?ssig wrote: > This reverts commit 66b5dba (review #23851). Unfortunately, it does not back out cleanly due to a `ReverseBytesNode` base class introduced in #24382, which was easily resolved by using the new base class. > > Testing: > - [ ] Github Actions > - [ ] tier1 - tier3 plus some internal testing Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26570#pullrequestreview-3074680676 From mhaessig at openjdk.org Thu Jul 31 09:43:30 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Thu, 31 Jul 2025 09:43:30 GMT Subject: RFR: 8364409: [BACKOUT] 8350988: Consolidate Identity of self-inverse operations Message-ID: This reverts commit 66b5dba (review #23851). Unfortunately, it does not back out cleanly due to a `ReverseBytesNode` base class introduced in #24382, which was easily resolved by using the new base class. Testing: - [ ] Github Actions - [ ] tier1 - tier3 plus some internal testing ------------- Commit messages: - Revert "8350988: Consolidate Identity of self-inverse operations" Changes: https://git.openjdk.org/jdk/pull/26570/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26570&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8364409 Stats: 245 lines in 4 files changed: 8 ins; 225 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/26570.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26570/head:pull/26570 PR: https://git.openjdk.org/jdk/pull/26570 From bmaillard at openjdk.org Thu Jul 31 09:48:00 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Thu, 31 Jul 2025 09:48:00 GMT Subject: RFR: 8364409: [BACKOUT] 8350988: Consolidate Identity of self-inverse operations In-Reply-To: References: Message-ID: On Thu, 31 Jul 2025 09:36:20 GMT, Manuel H?ssig wrote: > This reverts commit 66b5dba (review #23851). Unfortunately, it does not back out cleanly due to a `ReverseBytesNode` base class introduced in #24382, which was easily resolved by using the new base class. > > Testing: > - [ ] Github Actions > - [ ] tier1 - tier3 plus some internal testing Looks good to me as well! ------------- Marked as reviewed by bmaillard (Author). PR Review: https://git.openjdk.org/jdk/pull/26570#pullrequestreview-3074701126 From hgreule at openjdk.org Thu Jul 31 10:30:09 2025 From: hgreule at openjdk.org (Hannes Greule) Date: Thu, 31 Jul 2025 10:30:09 GMT Subject: RFR: 8364409: [BACKOUT] Consolidate Identity of self-inverse operations In-Reply-To: References: Message-ID: On Thu, 31 Jul 2025 09:36:20 GMT, Manuel H?ssig wrote: > This reverts commit 66b5dba (review #23851). 
Unfortunately, it does not back out cleanly due to a `ReverseBytesNode` base class introduced in #24382, which was easily resolved by using the new base class. > > Testing: > - [ ] Github Actions > - [ ] tier1 - tier3 plus some internal testing Thanks! ------------- Marked as reviewed by hgreule (Author). PR Review: https://git.openjdk.org/jdk/pull/26570#pullrequestreview-3074837242 From shade at openjdk.org Thu Jul 31 11:13:56 2025 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 31 Jul 2025 11:13:56 GMT Subject: RFR: 8360557: CTW: Inline cold methods to reach more code [v2] In-Reply-To: <5znMFGgSuss2iAJ3cUBnmIKrfniGHx5W6CpY3TpNO_8=.0148fb6b-206a-4b57-8886-db80d606b18f@github.com> References: <5znMFGgSuss2iAJ3cUBnmIKrfniGHx5W6CpY3TpNO_8=.0148fb6b-206a-4b57-8886-db80d606b18f@github.com> Message-ID: On Wed, 2 Jul 2025 08:27:24 GMT, Aleksey Shipilev wrote: >> We use CTW testing for making sure compilers behave well. But we compile the code that is not executed at all, and since our inlining heuristics often look back at profiles, we end up not actually inlining all that much! This means CTW testing likely misses lots of bugs that normal code is exposed to, especially e.g. in loop optimizations. >> >> There is an intrinsic tradeoff with accepting more inlined methods in CTW: the compilation time gets significantly worse. With just accepting the cold methods we have reasonable CTW times, eating into the improvements we have committed in mainline recently. And it still finds bugs. See the RFE for sample data. >> >> After this lands and CTW starts to compile cold methods, one can greatly expand the scope of the CTW testing by overriding the static inlining limits. Doing e.g. `TEST_VM_OPTS="-XX:MaxInlineSize=70 -XX:C1MaxInlineSize=70"` finds even more bugs. Unfortunately, the compilation times suffer so much that they are impractical to run in standard configurations; see data in the RFE. We will enable some of that testing in special testing pipelines. >> >> Pre-empting the question: "Well, why not use -Xcomp then, and make sure it inlines well?" The answer is in the RFE as well: Xcomp causes _a lot_ of stray compilations for the JDK and CTW infra itself. For small JARs in a large corpus this eats precious testing time that we would instead like to spend on deeper inlining in the actual JAR code. This also does not force us to look into how CTW works in Xcomp at all; I expect some surprises there. Feather-touching the inlining heuristic paths to just accept methods without looking at profiles looks better. >> >> Tobias had an idea to implement stress randomized inlining that would expand the scope of inlining. This improvement stacks well with it. This improvement provides the base case of inlining most reasonable methods, and then allows the stress infra to inline some more on top of that. >> >> Additional testing: >> - [x] GHA >> - [x] Linux x86_64 server fastdebug, `applications/ctw/modules` >> - [x] Linux x86_64 server fastdebug, large CTW corpus (now failing in interesting ways) > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Update src/hotspot/share/compiler/compiler_globals.hpp > > Co-authored-by: Tobias Hartmann Still working out the bugs.
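For context, a CTW run with the expanded inlining limits described above would look something like this (a sketch: the exact make target is an assumption based on the `applications/ctw/modules` job listed under Additional testing, while the TEST_VM_OPTS flags are the ones quoted in the description):

make test TEST="applications/ctw/modules" TEST_VM_OPTS="-XX:MaxInlineSize=70 -XX:C1MaxInlineSize=70"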
------------- PR Comment: https://git.openjdk.org/jdk/pull/26068#issuecomment-3139510011 From mchevalier at openjdk.org Thu Jul 31 11:14:50 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 31 Jul 2025 11:14:50 GMT Subject: RFR: 8360561: PhaseIdealLoop::create_new_if_for_predicate hits "must be a uct if pattern" assert Message-ID: Did you know that ranges can be disjoint and yet not ordered?! Well, in modular arithmetic. Let's look at a simplistic example: int x; if (?) { x = -1; } else { x = 1; } if (x != 0) { return; } // Unreachable With signed ranges, before the second `if`, `x` is in `[-1, 1]`. That is enough to enter the second if, but not enough to prove you have to enter it: the code after the second `if` wrongly seems to be still reachable. Twaddle! With unsigned ranges, at this point `x` is in `[1, 2^32-1]`, and then it is clear that `x != 0`. This information is used to refine the value of `x` in the (missing) else-branch, and so, after the if. This is done with a simple lattice meet (Hotspot's join): in the else-branch, the possible values of `x` are the meet of what it was worth before and the interval in the guard, that is `[0, 0]`. Thanks to the unsigned range, this is known to be empty (that is bottom, or Hotspot's top). And with a little reduced product, the whole type of `x` is empty as well. Yet, this information is not used to kill control yet. Here is the center of the problem: we have a situation such as: [image 2: after-CastII] After node `110 CastII` is idealized, it is found to be Top, and then the uncommon trap at `129` is replaced by `238 Halt` by being value-dead. [image 1: before-CastII] Since the control is not killed, the node stays there, eventually making some predicate-related assert fail as a trap is expected under a `ParsePredicate`. And that's what this change proposes: when comparing integers with non-ordered ranges, let's see if the unsigned ranges overlap, by computing the meet. If the intersection is empty, then the values can't be equal, without being able to order them. This is new! Without unsigned information for signed integers, either they overlap, or we can order them. Adding modular arithmetic makes it possible to have non-overlapping ranges that are also not ordered. Let's also notice that 0 is special: it is important that the bounds are on each side of 0 (or 2^31, the other discontinuity). For instance, if `x` can be 1 or 5, both the signed and unsigned ranges will agree on `[1, 5]` and not be able to prove it's not, let's say, 3. What other ways would there be to treat this problem a bit more generally? The classic solution is not to use intervals all the time: allow a small set of values, up to a fixed cardinality (for instance 5 or 10), after which we switch to a range. This is quite easy and handles many cases: it is not that common that it is important for a variable to be equal to one of 10 distinct values, but not anything else in between. A modulo domain would also work along with intervals (with a reduced product), but only for two values, or specific cases. That is not very general. A donut domain can also be helpful, but it needs a smart heuristic. For 2 points, there are two donuts: in the previous example, `[1, 5]` and `[INT_MIN, INT_MAX] \ [2, 4]`, but only the second allows proving that 3 is not in the set. Having signed and unsigned ranges is somewhat like having both donuts in some cases, and having just one when they agree.
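To make the worked example concrete, a tiny self-contained check (this only illustrates the arithmetic above; it is not code from the patch):

#include <cstdint>
#include <cstdio>

int main() {
  // x is either -1 or 1. Signed bounds: [-1, 1], which still contains 0, so
  // "x != 0" cannot be proven from the signed range alone. The unsigned view of
  // the same two values is {1, 0xFFFFFFFF}, i.e. the range [1, 2^32-1], which
  // excludes 0 -- so the unsigned range does prove x != 0.
  int32_t  lo  = -1, hi = 1;            // signed bounds of x
  uint32_t ulo = 1,  uhi = UINT32_MAX;  // unsigned bounds of the same value set
  bool zero_in_signed   = (lo <= 0 && 0 <= hi);
  bool zero_in_unsigned = (ulo <= 0u && 0u <= uhi);
  std::printf("0 possible per signed range: %d, per unsigned range: %d\n",
              zero_in_signed, zero_in_unsigned);  // prints: 1, 0
  return 0;
}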
There is another underlying question: why do we need to have code both for meet (HS's join, to refine the value of `x`), and for guarding (to know whether a branch is taken). A typical abstract interpreter would actually do that with just one step, using a `Guard` function that refines a abstract state given a condition to satisfy. The resulting state is whatever enters the branch, already refined. If the branch is impossible, then the state has an empty concretization. This happens typically when one variable (in a non-relational domain) has an empty value (bottom), then the whole abstract state is empty. It can then be optimized into skipping the whole branch. In Hotspot, there are some major differences: - the evaluation of the condition is not monolithically done in the abstract domain, but instead we want abstract value of each node - at the end, we request the value of a comparison, without knowing which operator we are going to use, so the abstract value needs to specify all the operators that would allow entering the branch: instead of having a refined abstract state, we just know for which comparison operator, the abstract state is not empty. We could imagine another way of working, returning the refined value of each variable in a condition (using a side table or spamming Cast nodes), for a given `BoolNode`, without holding the abstract domain by the hand too much. But of course, asking first "for which operators is the comparison non-empty", and then "give me the refined value of this variable for this given operator" leads to duplication of work. Thanks, Marc ------------- Commit messages: - Tests - Trying for CmpU CmpL CmpUL - Fix EOF - + tests - cc2logical: CC_NE with != -> yes! - != if empty overlap Changes: https://git.openjdk.org/jdk/pull/26504/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26504&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8360561 Stats: 227 lines in 4 files changed: 227 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/26504.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26504/head:pull/26504 PR: https://git.openjdk.org/jdk/pull/26504 From qamai at openjdk.org Thu Jul 31 11:14:50 2025 From: qamai at openjdk.org (Quan Anh Mai) Date: Thu, 31 Jul 2025 11:14:50 GMT Subject: RFR: 8360561: PhaseIdealLoop::create_new_if_for_predicate hits "must be a uct if pattern" assert In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 12:31:49 GMT, Marc Chevalier wrote: > Did you know that ranges can be disjoints and yet not ordered?! Well, in modular arithmetic. > > Let's look at a simplistic example: > > int x; > if (?) { > x = -1; > } else { > x = 1; > } > > if (x != 0) { > return; > } > // Unreachable > > > With signed ranges, before the second `if`, `x` is in `[-1, 1]`. Which is enough to enter to second if, but not enough to prove you have to enter it: it wrongly seems that after the second `if` is still reachable. Twaddle! > > With unsigned ranges, at this point `x` is in `[1, 2^32-1]`, and then, it is clear that `x != 0`. This information is used to refine the value of `x` in the (missing) else-branch, and so, after the if. This is done with simple lattice meet (Hotspot's join): in the else-branch, the possible values of `x` are the meet of what is was worth before, and the interval in the guard, that is `[0, 0]`. Thanks to the unsigned range, this is known to be empty (that is bottom, or Hotspot's top). And with a little reduced product, the whole type of `x` is empty as well. 
Yet, this information is not used to kill control yet. > > This is here the center of the problem: we have a situation such as: > 2 after-CastII > After node `110 CastII` is idealized, it is found to be Top, and then the uncommon trap at `129` is replaced by `238 Halt` by being value-dead. > 1 before-CastII > Since the control is not killed, the node stay there, eventually making some predicate-related assert fail as a trap is expected under a `ParsePredicate`. > > And that's what this change proposes: when comparing integers with non-ordered ranges, let's see if the unsigned ranges overlap, by computing the meet. If the intersection is empty, then the values can't be equals, without being able to order them. This is new! Without unsigned information for signed integer, either they overlap, or we can order them. Adding modular arithmetic allows to have non-overlapping ranges that are also not ordered. > > Let's also notice that 0 is special: it is important bounds are on each side of 0 (or 2^31, the other discontinuity). For instance if `x` can be 1 or 5, for instance, both the signed and unsigned range will agree on `[1, 5]` and not be able to prove it's, let's say, 3. > > What would there be other ways to treat this problem a bit ... Should these be done for `CmpL`, `CmpU`, `CmpUL` as well? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26504#issuecomment-3128074724 From mchevalier at openjdk.org Thu Jul 31 11:14:50 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 31 Jul 2025 11:14:50 GMT Subject: RFR: 8360561: PhaseIdealLoop::create_new_if_for_predicate hits "must be a uct if pattern" assert In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 16:36:12 GMT, Quan Anh Mai wrote: >> Did you know that ranges can be disjoints and yet not ordered?! Well, in modular arithmetic. >> >> Let's look at a simplistic example: >> >> int x; >> if (?) { >> x = -1; >> } else { >> x = 1; >> } >> >> if (x != 0) { >> return; >> } >> // Unreachable >> >> >> With signed ranges, before the second `if`, `x` is in `[-1, 1]`. Which is enough to enter to second if, but not enough to prove you have to enter it: it wrongly seems that after the second `if` is still reachable. Twaddle! >> >> With unsigned ranges, at this point `x` is in `[1, 2^32-1]`, and then, it is clear that `x != 0`. This information is used to refine the value of `x` in the (missing) else-branch, and so, after the if. This is done with simple lattice meet (Hotspot's join): in the else-branch, the possible values of `x` are the meet of what is was worth before, and the interval in the guard, that is `[0, 0]`. Thanks to the unsigned range, this is known to be empty (that is bottom, or Hotspot's top). And with a little reduced product, the whole type of `x` is empty as well. Yet, this information is not used to kill control yet. >> >> This is here the center of the problem: we have a situation such as: >> 2 after-CastII >> After node `110 CastII` is idealized, it is found to be Top, and then the uncommon trap at `129` is replaced by `238 Halt` by being value-dead. >> 1 before-CastII >> Since the control is not killed, the node stay there, eventually making some predicate-related assert fail as a trap is expected under a `ParsePredicate`. >> >> And that's what this change proposes: when comparing integers with non-ordered ranges, let's see if the unsigned ranges overlap, by computing the meet. If the intersection is empty, then the values can't be equals, without being able to order them. This is new! 
Without unsigned information for signed integer, either they overlap, or we can order them. Adding modular arithmetic allows to have non-overlapping ranges that are also not ordered. >> >> Let's also notice that 0 is special: it is important bounds are on each side of 0 (or 2^31, the other discontinuity). For instance if `x` can be 1 or 5, for instance, both the signed and unsigned range will agree on `[1, 5]` and not be able to prove it's, let's say, 3. > ... > > Should these be done for `CmpL`, `CmpU`, `CmpUL` as well? @merykitty yes, probably, I was indeed looking into which flavors of `Cmp` would need something like that, and how hard it'd be to exhibit the problem. It's still a draft, it wasn't quite done ? ------------- PR Comment: https://git.openjdk.org/jdk/pull/26504#issuecomment-3130996399 From mchevalier at openjdk.org Thu Jul 31 11:14:51 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 31 Jul 2025 11:14:51 GMT Subject: RFR: 8360561: PhaseIdealLoop::create_new_if_for_predicate hits "must be a uct if pattern" assert In-Reply-To: References: Message-ID: On Mon, 28 Jul 2025 12:31:49 GMT, Marc Chevalier wrote: > Did you know that ranges can be disjoints and yet not ordered?! Well, in modular arithmetic. > > Let's look at a simplistic example: > > int x; > if (?) { > x = -1; > } else { > x = 1; > } > > if (x != 0) { > return; > } > // Unreachable > > > With signed ranges, before the second `if`, `x` is in `[-1, 1]`. Which is enough to enter to second if, but not enough to prove you have to enter it: it wrongly seems that after the second `if` is still reachable. Twaddle! > > With unsigned ranges, at this point `x` is in `[1, 2^32-1]`, and then, it is clear that `x != 0`. This information is used to refine the value of `x` in the (missing) else-branch, and so, after the if. This is done with simple lattice meet (Hotspot's join): in the else-branch, the possible values of `x` are the meet of what is was worth before, and the interval in the guard, that is `[0, 0]`. Thanks to the unsigned range, this is known to be empty (that is bottom, or Hotspot's top). And with a little reduced product, the whole type of `x` is empty as well. Yet, this information is not used to kill control yet. > > This is here the center of the problem: we have a situation such as: > 2 after-CastII > After node `110 CastII` is idealized, it is found to be Top, and then the uncommon trap at `129` is replaced by `238 Halt` by being value-dead. > 1 before-CastII > Since the control is not killed, the node stay there, eventually making some predicate-related assert fail as a trap is expected under a `ParsePredicate`. > > And that's what this change proposes: when comparing integers with non-ordered ranges, let's see if the unsigned ranges overlap, by computing the meet. If the intersection is empty, then the values can't be equals, without being able to order them. This is new! Without unsigned information for signed integer, either they overlap, or we can order them. Adding modular arithmetic allows to have non-overlapping ranges that are also not ordered. > > Let's also notice that 0 is special: it is important bounds are on each side of 0 (or 2^31, the other discontinuity). For instance if `x` can be 1 or 5, for instance, both the signed and unsigned range will agree on `[1, 5]` and not be able to prove it's, let's say, 3. > > What would there be other ways to treat this problem a bit ... 
After looking deeper, it seems that we don't have to do the same change: this crash doesn't seem possible in other situations because of the refinement (leading to Top integers) happening only in specific cases, and especially not for longs. Yet, we surely can do the change, seems correct to me. Also, even if it doesn't solve a crash, it can be beneficial (see the test with longs): it already worked, but now, we can get rid of a path, and the graph is simpler. Overall, we don't have to, but we can and should, so there it is! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26504#issuecomment-3139500654 From syan at openjdk.org Thu Jul 31 11:14:52 2025 From: syan at openjdk.org (SendaoYan) Date: Thu, 31 Jul 2025 11:14:52 GMT Subject: RFR: 8360561: PhaseIdealLoop::create_new_if_for_predicate hits "must be a uct if pattern" assert In-Reply-To: References: Message-ID: <1zdDyoDt1yJzlNEY8VziwbjNt7PWsQvjp513qSX4Gbs=.a3115e6e-af16-4b82-a03f-b0e766fecabc@github.com> On Mon, 28 Jul 2025 12:31:49 GMT, Marc Chevalier wrote: > Did you know that ranges can be disjoints and yet not ordered?! Well, in modular arithmetic. > > Let's look at a simplistic example: > > int x; > if (?) { > x = -1; > } else { > x = 1; > } > > if (x != 0) { > return; > } > // Unreachable > > > With signed ranges, before the second `if`, `x` is in `[-1, 1]`. Which is enough to enter to second if, but not enough to prove you have to enter it: it wrongly seems that after the second `if` is still reachable. Twaddle! > > With unsigned ranges, at this point `x` is in `[1, 2^32-1]`, and then, it is clear that `x != 0`. This information is used to refine the value of `x` in the (missing) else-branch, and so, after the if. This is done with simple lattice meet (Hotspot's join): in the else-branch, the possible values of `x` are the meet of what is was worth before, and the interval in the guard, that is `[0, 0]`. Thanks to the unsigned range, this is known to be empty (that is bottom, or Hotspot's top). And with a little reduced product, the whole type of `x` is empty as well. Yet, this information is not used to kill control yet. > > This is here the center of the problem: we have a situation such as: > 2 after-CastII > After node `110 CastII` is idealized, it is found to be Top, and then the uncommon trap at `129` is replaced by `238 Halt` by being value-dead. > 1 before-CastII > Since the control is not killed, the node stay there, eventually making some predicate-related assert fail as a trap is expected under a `ParsePredicate`. > > And that's what this change proposes: when comparing integers with non-ordered ranges, let's see if the unsigned ranges overlap, by computing the meet. If the intersection is empty, then the values can't be equals, without being able to order them. This is new! Without unsigned information for signed integer, either they overlap, or we can order them. Adding modular arithmetic allows to have non-overlapping ranges that are also not ordered. > > Let's also notice that 0 is special: it is important bounds are on each side of 0 (or 2^31, the other discontinuity). For instance if `x` can be 1 or 5, for instance, both the signed and unsigned range will agree on `[1, 5]` and not be able to prove it's, let's say, 3. > > What would there be other ways to treat this problem a bit ... test/hotspot/jtreg/compiler/igvn/CmpDisjointButNonOrderedRanges2.java line 30: > 28: * Comparing such values in such range with != should always be true. 
> 29: * @modules java.base/jdk.internal.util > 30: * @run main/othervm -Xbatch Since these two new tests use specific JVM options, do these tests needed '@requires vm.flagless' directive. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26504#discussion_r2236771174 From mchevalier at openjdk.org Thu Jul 31 11:14:53 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 31 Jul 2025 11:14:53 GMT Subject: RFR: 8360561: PhaseIdealLoop::create_new_if_for_predicate hits "must be a uct if pattern" assert In-Reply-To: References: <1zdDyoDt1yJzlNEY8VziwbjNt7PWsQvjp513qSX4Gbs=.a3115e6e-af16-4b82-a03f-b0e766fecabc@github.com> Message-ID: On Tue, 29 Jul 2025 06:55:42 GMT, Marc Chevalier wrote: >> test/hotspot/jtreg/compiler/igvn/CmpDisjointButNonOrderedRanges2.java line 30: >> >>> 28: * Comparing such values in such range with != should always be true. >>> 29: * @modules java.base/jdk.internal.util >>> 30: * @run main/othervm -Xbatch >> >> Since these two new tests use specific JVM options, do these tests needed '@requires vm.flagless' directive. > > I'm not very sure how that works, so I'm not sure... But I don't think so, also from looking at other examples with many flags, of a similar kind, without this directive. I think I could use more explanations of how that works in details. After gathering more info, I think we don't need flagless, and so we should avoid it. The reason is that despite needing some flags to reproduce the crash (at least, reliably), the test should not crash at all, even with more flags. It might become more or less interesting, but it should still not crash. In this meaning, the test is still correct even with more flags, so flagless is not needed. And who knows, maybe sprinkling additional flags on this test will eventually uncover another issue! ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26504#discussion_r2241853532 From mchevalier at openjdk.org Thu Jul 31 11:14:52 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 31 Jul 2025 11:14:52 GMT Subject: RFR: 8360561: PhaseIdealLoop::create_new_if_for_predicate hits "must be a uct if pattern" assert In-Reply-To: <1zdDyoDt1yJzlNEY8VziwbjNt7PWsQvjp513qSX4Gbs=.a3115e6e-af16-4b82-a03f-b0e766fecabc@github.com> References: <1zdDyoDt1yJzlNEY8VziwbjNt7PWsQvjp513qSX4Gbs=.a3115e6e-af16-4b82-a03f-b0e766fecabc@github.com> Message-ID: On Mon, 28 Jul 2025 14:40:03 GMT, SendaoYan wrote: >> Did you know that ranges can be disjoints and yet not ordered?! Well, in modular arithmetic. >> >> Let's look at a simplistic example: >> >> int x; >> if (?) { >> x = -1; >> } else { >> x = 1; >> } >> >> if (x != 0) { >> return; >> } >> // Unreachable >> >> >> With signed ranges, before the second `if`, `x` is in `[-1, 1]`. Which is enough to enter to second if, but not enough to prove you have to enter it: it wrongly seems that after the second `if` is still reachable. Twaddle! >> >> With unsigned ranges, at this point `x` is in `[1, 2^32-1]`, and then, it is clear that `x != 0`. This information is used to refine the value of `x` in the (missing) else-branch, and so, after the if. This is done with simple lattice meet (Hotspot's join): in the else-branch, the possible values of `x` are the meet of what is was worth before, and the interval in the guard, that is `[0, 0]`. Thanks to the unsigned range, this is known to be empty (that is bottom, or Hotspot's top). And with a little reduced product, the whole type of `x` is empty as well. Yet, this information is not used to kill control yet. 
>> >> This is here the center of the problem: we have a situation such as: >> 2 after-CastII >> After node `110 CastII` is idealized, it is found to be Top, and then the uncommon trap at `129` is replaced by `238 Halt` by being value-dead. >> 1 before-CastII >> Since the control is not killed, the node stay there, eventually making some predicate-related assert fail as a trap is expected under a `ParsePredicate`. >> >> And that's what this change proposes: when comparing integers with non-ordered ranges, let's see if the unsigned ranges overlap, by computing the meet. If the intersection is empty, then the values can't be equals, without being able to order them. This is new! Without unsigned information for signed integer, either they overlap, or we can order them. Adding modular arithmetic allows to have non-overlapping ranges that are also not ordered. >> >> Let's also notice that 0 is special: it is important bounds are on each side of 0 (or 2^31, the other discontinuity). For instance if `x` can be 1 or 5, for instance, both the signed and unsigned range will agree on `[1, 5]` and not be able to prove it's, let's say, 3. > ... > > test/hotspot/jtreg/compiler/igvn/CmpDisjointButNonOrderedRanges2.java line 30: > >> 28: * Comparing such values in such range with != should always be true. >> 29: * @modules java.base/jdk.internal.util >> 30: * @run main/othervm -Xbatch > > Since these two new tests use specific JVM options, do these tests needed '@requires vm.flagless' directive. I'm not very sure how that works, so I'm not sure... But I don't think so, also from looking at other examples with many flags, of a similar kind, without this directive. I think I could use more explanations of how that works in details. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26504#discussion_r2238724866 From mhaessig at openjdk.org Thu Jul 31 12:15:02 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Thu, 31 Jul 2025 12:15:02 GMT Subject: RFR: 8364409: [BACKOUT] Consolidate Identity of self-inverse operations In-Reply-To: References: Message-ID: On Thu, 31 Jul 2025 09:36:20 GMT, Manuel H?ssig wrote: > This reverts commit 66b5dba (review #23851). Unfortunately, it does not back out cleanly due to a `ReverseBytesNode` base class introduced in #24382, which was easily resolved by using the new base class. > > Testing: > - [ ] Github Actions > - [x] tier1 - tier3 plus some internal testing Thank you all for your reviews! ------------- PR Comment: https://git.openjdk.org/jdk/pull/26570#issuecomment-3139717341 From mhaessig at openjdk.org Thu Jul 31 12:15:03 2025 From: mhaessig at openjdk.org (Manuel =?UTF-8?B?SMOkc3NpZw==?=) Date: Thu, 31 Jul 2025 12:15:03 GMT Subject: Integrated: 8364409: [BACKOUT] Consolidate Identity of self-inverse operations In-Reply-To: References: Message-ID: On Thu, 31 Jul 2025 09:36:20 GMT, Manuel H?ssig wrote: > This reverts commit 66b5dba (review #23851). Unfortunately, it does not back out cleanly due to a `ReverseBytesNode` base class introduced in #24382, which was easily resolved by using the new base class. > > Testing: > - [ ] Github Actions > - [x] tier1 - tier3 plus some internal testing This pull request has now been integrated. 
Changeset: ddb64836 Author: Manuel H?ssig URL: https://git.openjdk.org/jdk/commit/ddb64836e5bafededb705329137e353f8c74dd5d Stats: 245 lines in 4 files changed: 8 ins; 225 del; 12 mod 8364409: [BACKOUT] Consolidate Identity of self-inverse operations Reviewed-by: thartmann, bmaillard, hgreule ------------- PR: https://git.openjdk.org/jdk/pull/26570 From epeter at openjdk.org Thu Jul 31 12:52:59 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 31 Jul 2025 12:52:59 GMT Subject: RFR: 8329077: C2 SuperWord: Add MoveD2L, MoveL2D, MoveF2I, MoveI2F In-Reply-To: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com> References: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com> Message-ID: <8Yc5bUUeWRBGV2XrdVi9uPVfXlMGaZ_fj-H6IjdWO_4=.c3bd36cf-86be-4926-a14d-9046d6bc862d@github.com> On Thu, 24 Jul 2025 10:29:15 GMT, Galder Zamarre?o wrote: > I've added support to vectorize `MoveD2L`, `MoveL2D`, `MoveF2I` and `MoveI2F` nodes. The implementation follows a similar pattern to what is done with conversion (`Conv*`) nodes. The tests in `TestCompatibleUseDefTypeSize` have been updated with the new expectations. > > Also added a JMH benchmark which measures throughput (the higher the number the better) for methods that exercise these nodes. On darwin/aarch64 it shows: > > > Benchmark (seed) (size) Mode Cnt Base Patch Units Diff > VectorBitConversion.doubleToLongBits 0 2048 thrpt 8 1168.782 1157.717 ops/ms -1% > VectorBitConversion.doubleToRawLongBits 0 2048 thrpt 8 3999.387 7353.936 ops/ms +83% > VectorBitConversion.floatToIntBits 0 2048 thrpt 8 1200.338 1188.206 ops/ms -1% > VectorBitConversion.floatToRawIntBits 0 2048 thrpt 8 4058.248 14792.474 ops/ms +264% > VectorBitConversion.intBitsToFloat 0 2048 thrpt 8 3050.313 14984.246 ops/ms +391% > VectorBitConversion.longBitsToDouble 0 2048 thrpt 8 3022.691 7379.360 ops/ms +144% > > > The improvements observed are a result of vectorization. The lack of vectorization in `doubleToLongBits` and `floatToIntBits` demonstrates that these changes do not affect their performance. These methods do not vectorize because of flow control. > > I've run the tier1-3 tests on linux/aarch64 and didn't observe any regressions. Thanks for working on this, it looks really good :) I'll need to do some testing later though. test/micro/org/openjdk/bench/java/lang/VectorBitConversion.java line 1: > 1: package org.openjdk.bench.java.lang; I think this benchmark belongs with the other vectorization benchmarks under `test/micro/org/openjdk/bench/vm/compiler/Vector*` ------------- Changes requested by epeter (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/26457#pullrequestreview-3075276901 PR Review Comment: https://git.openjdk.org/jdk/pull/26457#discussion_r2245295082 From epeter at openjdk.org Thu Jul 31 12:53:00 2025 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 31 Jul 2025 12:53:00 GMT Subject: RFR: 8329077: C2 SuperWord: Add MoveD2L, MoveL2D, MoveF2I, MoveI2F In-Reply-To: <8Yc5bUUeWRBGV2XrdVi9uPVfXlMGaZ_fj-H6IjdWO_4=.c3bd36cf-86be-4926-a14d-9046d6bc862d@github.com> References: <0bYYOS5AYvN4ZD1xAGBRqV_xasw-np3JWKXC7WcGhyc=.74d97456-f406-4dbe-be09-77ed3b9a66fd@github.com> <8Yc5bUUeWRBGV2XrdVi9uPVfXlMGaZ_fj-H6IjdWO_4=.c3bd36cf-86be-4926-a14d-9046d6bc862d@github.com> Message-ID: <0by7e3QZ36BPsVlRJMxdgHKX-IbsbumQrFg-iBaSZBY=.ddfa469e-8718-4c62-aaaf-766c5d7f6473@github.com> On Thu, 31 Jul 2025 12:49:16 GMT, Emanuel Peter wrote: >> I've added support to vectorize `MoveD2L`, `MoveL2D`, `MoveF2I` and `MoveI2F` nodes. The implementation follows a similar pattern to what is done with conversion (`Conv*`) nodes. The tests in `TestCompatibleUseDefTypeSize` have been updated with the new expectations. >> >> Also added a JMH benchmark which measures throughput (the higher the number the better) for methods that exercise these nodes. On darwin/aarch64 it shows: >> >> >> Benchmark (seed) (size) Mode Cnt Base Patch Units Diff >> VectorBitConversion.doubleToLongBits 0 2048 thrpt 8 1168.782 1157.717 ops/ms -1% >> VectorBitConversion.doubleToRawLongBits 0 2048 thrpt 8 3999.387 7353.936 ops/ms +83% >> VectorBitConversion.floatToIntBits 0 2048 thrpt 8 1200.338 1188.206 ops/ms -1% >> VectorBitConversion.floatToRawIntBits 0 2048 thrpt 8 4058.248 14792.474 ops/ms +264% >> VectorBitConversion.intBitsToFloat 0 2048 thrpt 8 3050.313 14984.246 ops/ms +391% >> VectorBitConversion.longBitsToDouble 0 2048 thrpt 8 3022.691 7379.360 ops/ms +144% >> >> >> The improvements observed are a result of vectorization. The lack of vectorization in `doubleToLongBits` and `floatToIntBits` demonstrates that these changes do not affect their performance. These methods do not vectorize because of flow control. >> >> I've run the tier1-3 tests on linux/aarch64 and didn't observe any regressions. > > test/micro/org/openjdk/bench/java/lang/VectorBitConversion.java line 1: > >> 1: package org.openjdk.bench.java.lang; > > I think this benchmark belongs with the other vectorization benchmarks under > `test/micro/org/openjdk/bench/vm/compiler/Vector*` It's not really a language feature, more for vm/compiler ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26457#discussion_r2245296534 From bmaillard at openjdk.org Thu Jul 31 13:10:57 2025 From: bmaillard at openjdk.org (=?UTF-8?B?QmVub8OudA==?= Maillard) Date: Thu, 31 Jul 2025 13:10:57 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v2] In-Reply-To: <1A8oR7hEgev2U_ys1H_AVJS5kjw6LWoPgrVPhJXSFqI=.34cbd04b-bf88-441f-9c3d-97f9aee7f3c3@github.com> References: <1A8oR7hEgev2U_ys1H_AVJS5kjw6LWoPgrVPhJXSFqI=.34cbd04b-bf88-441f-9c3d-97f9aee7f3c3@github.com> Message-ID: <7KtjKex3ik1mzDGUP0J7vI0bdzWz-OaqprBbXlbhbE0=.7299f51f-4316-4371-bca2-846ae5bc6671@github.com> On Wed, 30 Jul 2025 08:27:37 GMT, Marc Chevalier wrote: >> Some crashes are consequences of earlier misshaped ideal graphs, which could be detected earlier, closer to the source, before the possibly many transformations that lead to the crash. >> >> Let's verify that the ideal graph is well-shaped earlier then! I propose here such a feature. 
This runs after IGVN, because at this point, the graph, should be cleaned up for any weirdness happening earlier or during IGVN. >> >> This feature is enabled with the develop flag `VerifyIdealStructuralInvariants`. Open to renaming. No problem with me! This feature is only available in debug builds, and most of the code is even not compiled in product, since it uses some debug-only functions, such as `Node::dump` or `Node::Name`. >> >> For now, only local checks are implemented: they are checks that only look at a node and its neighborhood, wherever it happens in the graph. Typically: under a `If` node, we have a `IfTrue` and a `IfFalse`. To ease development, each check is implemented in its own class, independently of the others. Nevertheless, one needs to do always the same kind of things: checking there is an output of such type, checking there is N inputs, that the k-th input has such type... To ease writing such checks, in a readable way, and in a less error-prone way than pile of copy-pasted code that manually traverse the graph, I propose a set of compositional helpers to write patterns that can be matched against the ideal graph. Since these patterns are... patterns, so not related to a specific graph, they can be allocated once and forever. When used, one provides the node (called center) around which one want to check if the pattern holds. >> >> On top of making the description of pattern easier, these helpers allows nice printing in case of error, by showing the path from the center to the violating node. For instance (made up for the purpose of showing the formatting), a violation with a path climbing only inputs: >> >> 1 failure for node >> 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 >> At node >> 209 CountedLoopEnd === 182 208 [[ 210 197 ]] [lt] P=0,948966, C=23799,000000 !orig=[196] !jvms: StringLatin1::equals @ bci:12 (line 100) >> From path: >> [center] 211 OuterStripMinedLoopEnd === 215 39 [[ 212 198 ]] P=0,948966, C=23799,000000 >> <-(0)- 215 SafePoint === 210 1 7 1 1 216 37 54 185 [[ 211 ]] SafePoint !orig=186 !jvms: StringLatin1::equals @ bci:29 (line 100) >> <-(0)- 210 IfFalse === 209 [[ 21... > > Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: > > Rename flag as suggested Great work, and great explanation as well! The invariants that are already implemented seem quite useful already, and it seems there is a lot of potential. Having recently worked on a few missed optimizations related to `PhaseIterGVN::add_users_of_use_to_worklist`, I agree that it would be interesting to use such patterns for automatic notifications. The way I see it, we would need to somehow "reverse" the patterns, as they would be expressed from the point of view of the node on which the optimizations is applied, and would require notification when dependencies changes. Probably quite non-trivial, but interesting nonetheless. I only have a few basic remarks/questions. 
src/hotspot/share/opto/graphInvariants.cpp line 181: > 179: AtInput(uint which_input, const Pattern* pattern) : _which_input(which_input), _pattern(pattern) {} > 180: bool check(const Node* center, Node_List& steps, GrowableArray& path, stringStream& ss) const override { > 181: assert(_which_input < center->req(), "First check the input number"); This really a detail, but I would use something more explicit: Suggestion: assert(_which_input < center->req(), "Input number is out of range"); src/hotspot/share/opto/graphInvariants.cpp line 197: > 195: }; > 196: > 197: struct HasType : Pattern { Could we make it slightly more general and accept any predicate on the type? From a previous PR that I worked on I remember that for example for `ModINode` if it has no control input then its divisor input should never be `0`. Maybe this is the kind of properties we could check in the future. This is just a random idea, feel free to ignore. src/hotspot/share/opto/graphInvariants.hpp line 32: > 30: > 31: // An invariant that needs only a local view of the graph, around a given node. > 32: class LocalGraphInvariant : public ResourceObj { Can't we put the whole definition behind a `#ifndef PRODUCT` check? It seems there are other instances where it is done with classes, such as `VTrace`. Or is there a reason not to? ------------- PR Review: https://git.openjdk.org/jdk/pull/26362#pullrequestreview-3074918758 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2245262469 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2245282832 PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2245044927 From mchevalier at openjdk.org Thu Jul 31 13:30:55 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 31 Jul 2025 13:30:55 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v2] In-Reply-To: <7KtjKex3ik1mzDGUP0J7vI0bdzWz-OaqprBbXlbhbE0=.7299f51f-4316-4371-bca2-846ae5bc6671@github.com> References: <1A8oR7hEgev2U_ys1H_AVJS5kjw6LWoPgrVPhJXSFqI=.34cbd04b-bf88-441f-9c3d-97f9aee7f3c3@github.com> <7KtjKex3ik1mzDGUP0J7vI0bdzWz-OaqprBbXlbhbE0=.7299f51f-4316-4371-bca2-846ae5bc6671@github.com> Message-ID: On Thu, 31 Jul 2025 12:44:02 GMT, Beno?t Maillard wrote: >> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision: >> >> Rename flag as suggested > > src/hotspot/share/opto/graphInvariants.cpp line 197: > >> 195: }; >> 196: >> 197: struct HasType : Pattern { > > Could we make it slightly more general and accept any predicate on the type? From a previous PR that I worked on I remember that for example for `ModINode` if it has no control input then its divisor input should never be `0`. Maybe this is the kind of properties we could check in the future. This is just a random idea, feel free to ignore. I think there is a misunderstanding here. I'm talking about node type, as in which C++ class is it, not type as abstract values for nodes. I could rename this struct then. Maybe HasNodeType? `NodeClass` one could see `NodeClass(&Node::is_Region)` that reads almost as "node class is Region". Open to ideas... Also, in theory, it accepts any method of `Node` of type `bool()`. This could be used for something else. The idea was to make easy to say "I want a Node of type `IfNode` here". It's not that great to do with Opcode because of derived classes. I also considered something that would take any `Node -> bool` function, but that made the simple case harder. 
Instead of `HasType(&Node::is_If)`, I would have had to write something like `HasType([](const Node& n) { return n.is_If(); })`. Functional programming is possible in C++, but not quite syntactically elegant, and I think readability here is important. If such a need arises, I suggest adding a `UnaryPredicate` (or `NodePredicate` etc.) to do that. If the predicates are complicated enough, the few extra symbols needed to write a lambda don't matter so much. As for your case, yes, we can add that in the future. It could be done with the UnaryPredicate I describe above, or with a more specific pattern that would work on types, and take a method `bool (Type::*)()` or a function `bool(const Type&)`; the pattern would take care of finding the type and submitting it to the predicate. Not that it's a lot of work, but it communicates the intention more clearly, in my opinion. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26362#discussion_r2245405917 From mchevalier at openjdk.org Thu Jul 31 13:35:58 2025 From: mchevalier at openjdk.org (Marc Chevalier) Date: Thu, 31 Jul 2025 13:35:58 GMT Subject: RFR: 8350864: C2: verify structural invariants of the Ideal graph [v2] In-Reply-To: <7KtjKex3ik1mzDGUP0J7vI0bdzWz-OaqprBbXlbhbE0=.7299f51f-4316-4371-bca2-846ae5bc6671@github.com> References: <1A8oR7hEgev2U_ys1H_AVJS5kjw6LWoPgrVPhJXSFqI=.34cbd04b-bf88-441f-9c3d-97f9aee7f3c3@github.com> <7KtjKex3ik1mzDGUP0J7vI0bdzWz-OaqprBbXlbhbE0=.7299f51f-4316-4371-bca2-846ae5bc6671@github.com> Message-ID: <1ywMAXL2kqH3OohOP8GFEnqoUei-X2DNoYi4E2Z9m1E=.015d1d30-f63d-41b3-80d2-248bcd76890c@github.com>
>> >> ### Background >> Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather load instructions for `byte`/`short` types using `int` vectors for indices. The vector size for a gather-load instruction is determined by the index vector (i.e. `int` elements). Hence, the total size is `32 * elem_num` bits, where `elem_num` is the number of loaded elements in the vector register. >> >> ### Implementation >> >> #### Challenges >> Due to size differences between `int` indices (32-bit) and `byte`/`short` data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints. >> >> For a 512-bit SVE machine, loading a `byte` vector with different vector species require different approaches: >> - SPECIES_64: Single operation with mask (8 elements, 256-bit) >> - SPECIES_128: Single operation, full register (16 elements, 512-bit) >> - SPECIES_256: Two operations + merge (32 elements, 1024-bit) >> - SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit) >> >> Use `ByteVector.SPECIES_512` as an example: >> - It contains 64 elements. So the index vector size should be `64 * 32` bits, which is 4 times of the SVE vector register size. >> - It requires 4 times of vector gather-loads to finish the whole operation. >> >> >> byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...] >> int[] idx = [0, 1, 2, 3, ..., 63, ...] >> >> 4 gather-load: >> idx_v1 = [15 14 13 ... 1 0] gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa] >> idx_v2 = [31 30 29 ... 17 16] gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb] >> idx_v3 = [47 46 45 ... 33 32] gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc] >> idx_v4 = [63 62 61 ... 49 48] gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd] >> merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa] >> >> >> #### Solution >> The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end. >> >> Here is the main changes: >> - Enhanced IR generation with architecture-specific patterns based on `gather_scatter_needs_vector_index()` matcher. >> - Added `VectorSliceNode` for result mer... > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Refine IR pattern and clean backend rules > I've submitted a test on a 256-bit sve machine. I'll get back to you once it?s finished. The new commit passed tier1 - tier3 on 256-bit `sve` machine without new failures. Thanks! src/hotspot/cpu/arm/matcher_arm.hpp line 160: > 158: static const bool supports_encode_ascii_array = false; > 159: > 160: // Return true if vector gather-load/scatter-store needs vector index as input. If the function returns `false`, does it indicate one of the following cases? - Vector gather-load or scatter-store does not accept a vector index for the current use case on this platform. - The current platform does not support vector gather-load or scatter-store at all. 
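As a side note for readers of the thread, the splitting described in the PR summary can be modeled in plain Java (purely illustrative; the class name is made up, and the real implementation emits SVE gather-load instructions plus a vector merge rather than scalar loops):

class GatherSplitSketch {
    // 64 byte lanes, but every index is a 32-bit int, so on a 512-bit SVE
    // machine only 16 indices fit into one gather: four partial gathers, merged.
    static byte[] gather64(byte[] arr, int[] idx) {
        final int lanes = 64, perGather = 16;
        byte[] merged = new byte[lanes];
        for (int chunk = 0; chunk < lanes / perGather; chunk++) {   // 4 gather-loads
            for (int lane = 0; lane < perGather; lane++) {
                int i = chunk * perGather + lane;
                merged[i] = arr[idx[i]];                            // one gathered lane
            }
        }
        return merged;
    }
}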
------------- PR Review: https://git.openjdk.org/jdk/pull/26236#pullrequestreview-3075415497 PR Review Comment: https://git.openjdk.org/jdk/pull/26236#discussion_r2245383326 From chagedorn at openjdk.org Thu Jul 31 15:31:56 2025 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Thu, 31 Jul 2025 15:31:56 GMT Subject: RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v8] In-Reply-To: <-otlKVhe_xfmpET_cwn5CdvzDduOfFApGSH5VoZSwuk=.7eb8a0e3-4ad6-4ffb-97fd-11a2120a3eaf@github.com> References: <-otlKVhe_xfmpET_cwn5CdvzDduOfFApGSH5VoZSwuk=.7eb8a0e3-4ad6-4ffb-97fd-11a2120a3eaf@github.com> Message-ID: On Wed, 30 Jul 2025 06:14:40 GMT, erifan wrote: >> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. But the cost of the `maskAll` is >> relative smaller than that of `fromLong`. So this patch does the conversion for these cases. >> >> The conversion is done in C2's IGVN phase. And on platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsiication process if `MaskAll` or `Replicate` is supported. >> >> Since this optimization requires the input long value of `VectorMask.fromLong` to be specific compile-time constants, and such expressions are usually hoisted out of the loop. So we can't see noticeable performance change. >> >> This conversion also enables further optimizations that recognize maskAll patterns, see [1]. And we can observe a performance improvement of about 7% on both aarch64 and x64. >> >> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like >> >> VectorMaskToLong (VectorLongToMask x) => x >> >> >> Hence, this patch also added the following optimizations: >> >> VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen))) // x is -1 or 0 >> >> VectorMaskCast (VectorMaskCast x) => x >> >> And we can see noticeable performance improvement with the above optimizations for floating-point types. >> >> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`: >> >> Benchmark Unit Before Error After Error Uplift >> microMaskFromLongToLong_Double128 ops/s 1522384.986 1324881.46 2835774480 403575069.7 1862.71 >> microMaskFromLongToLong_Double256 ops/s 4275.415598 28.560622 4285.587451 27.633101 1 >> microMaskFromLongToLong_Double512 ops/s 3702.171936 9.528497 3692.747579 18.47744 0.99 >> microMaskFromLongToLong_Double64 ops/s 4624.452243 37.388427 4616.320519 23.455954 0.99 >> microMaskFromLongToLong_Float128 ops/s 1239661.887 1286803.852 2842927993 360468218.3 2293.3 >> microMaskFromLongToLong_Float256 ops/s 3681.64954 15.153633 3685.411771 21.737124 1 >> microMaskFromLongToLong_Float512 ops/s 3007.563025 10.189944 3022.002986 14.137287 1 >> microMaskFromLongToLong_Float64 ops/s 1646664.258 1375451.279 2948453900 397472562.4 1790.56 >> >> >> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`: >> >> Benchm... > > erifan has updated the pull request incrementally with one additional commit since the last revision: > > Set default warm up to 10000 for JTReg tests Testing looked good! 
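For context, the equivalence being folded here can be written down directly against the incubating Vector API (a minimal sketch; the class name is made up, and it assumes jdk.incubator.vector is available, e.g. via --add-modules jdk.incubator.vector):

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

class FromLongSketch {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256; // 8 int lanes

    public static void main(String[] args) {
        long allLanes = (1L << SPECIES.length()) - 1;   // low 8 bits set
        // A constant that sets every lane is equivalent to maskAll(true) ...
        System.out.println(VectorMask.fromLong(SPECIES, allLanes)
                                     .equals(SPECIES.maskAll(true)));   // true
        // ... and a constant that sets no lane is equivalent to maskAll(false).
        System.out.println(VectorMask.fromLong(SPECIES, 0L)
                                     .equals(SPECIES.maskAll(false)));  // true
    }
}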
------------- PR Comment: https://git.openjdk.org/jdk/pull/25793#issuecomment-3140411816 From bkilambi at openjdk.org Thu Jul 31 16:22:04 2025 From: bkilambi at openjdk.org (Bhavana Kilambi) Date: Thu, 31 Jul 2025 16:22:04 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v18] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 09:17:19 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. >> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Refine comments in the ad file Hi @sviswa7 there's some x86 code in this patch which I would like an x86 expert to review. Would you be able to take a look please? It's not a big change. Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/23570#issuecomment-3140565389 From sviswanathan at openjdk.org Thu Jul 31 18:52:02 2025 From: sviswanathan at openjdk.org (Sandhya Viswanathan) Date: Thu, 31 Jul 2025 18:52:02 GMT Subject: RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v18] In-Reply-To: References: Message-ID: On Fri, 25 Jul 2025 09:17:19 GMT, Bhavana Kilambi wrote: >> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI. >> >> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2. 
>> >> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2. >> >> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation. >> >> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor. >> >> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below - >> >> >> Benchmark (size) Mode Cnt Gain >> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43 >> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48 >> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55 >> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07 >> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69 >> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50 >> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52 >> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38 >> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93 >> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48 >> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49 >> >> >> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander. > > Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > Refine comments in the ad file x86 changes look good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/23570#pullrequestreview-3076555656 From dlong at openjdk.org Thu Jul 31 20:08:00 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 31 Jul 2025 20:08:00 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" [v4] In-Reply-To: References: Message-ID: On Wed, 30 Jul 2025 22:58:50 GMT, Guanqiang Han wrote: >> I'm able to consistently reproduce the problem using the following command line and test program ? >> >> java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java >> >> import java.util.Arrays; >> public class Test{ >> public static void main(String[] args) { >> System.out.println("begin"); >> byte[] arr1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; >> byte[] arr2 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; >> System.out.println(Arrays.equals(arr1, arr2)); >> System.out.println("end"); >> } >> } >> >> From my analysis, the root cause appears to be a mismatch in operand handling between T_ADDRESS and T_LONG in LIR_Assembler::stack2reg, especially when the source is marked as double stack (e.g., T_LONG) and the destination as single CPU register (e.g., T_ADDRESS), leading to assertion failures like assert(is_single_stack())(because T_LONG is double_size). >> >> In the test program above , the call chain is: Arrays.equals ? ArraysSupport.vectorizedMismatch ? LIRGenerator::do_vectorizedMismatch >> Within the do_vectorizedMismatch() method, a move instruction constructs an LIR_Op1. 
During LIR to machine code generation, LIR_Assembler::stack2reg was called. >> >> In this case, the src operand has type T_LONG and the dst operand has type T_ADDRESS. This combination triggers an assert in stack2reg, due to a mismatch between the stack slot type and register type handling. >> >> Importantly, this path ( LIR_Assembler::stack2reg was called ) is only taken when src is forced onto the stack. To reliably trigger this condition, the test is run with the -Xcomp option to force compilation and increase register pressure. >> >> A reference to the relevant code paths is provided below : >> image1 >> image2 >> >> On 64-bit platforms, although T_ADDRESS is classified as single_size, it is in fact 64 bits wide ,represent a single 64-bit general-purpose register and it can hold a T_LONG value, which is also 64 bits. >> >> However, T_LONG is defined as double_size, requiring two local variable slots or a pair of registers in the JVM's abstract model. This mismatch stems from the fact that T_ADDRESS is platform-dependent: it's 32 bits on 32-bit platforms, and 64 bits on 64-bit platforms ? yet its size class... > > Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - change T_LONG to T_ADDRESS in some intrinsic functions > - Merge remote-tracking branch 'upstream/master' into 8359235 > - Increase sleep time to ensure the method gets compiled > - add regression test > - Merge remote-tracking branch 'upstream/master' into 8359235 > - 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" Looks good now. Testing in progress... ------------- Marked as reviewed by dlong (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/26462#pullrequestreview-3076741459 From never at openjdk.org Thu Jul 31 20:51:59 2025 From: never at openjdk.org (Tom Rodriguez) Date: Thu, 31 Jul 2025 20:51:59 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v7] In-Reply-To: References: Message-ID: <2hi_TI_U0T4NwVPEG7LIMjDD3xvEYx-q-BBJGIGQikA=.82f2b333-7084-4e53-937f-2c014b37635d@github.com> On Thu, 24 Jul 2025 20:03:33 GMT, Dean Long wrote: >> The VerifyStack logic in Deoptimization::unpack_frames() attempts to check the expression stack size of the interpreter frame against what GenerateOopMap computes. To do this, it needs to know if the state at the current bci represents the "before" state, meaning the bytecode will be reexecuted, or the "after" state, meaning we will advance to the next bytecode. The old code didn't know how to determine exactly what state we were in, so it checked both. This PR cleans that up, so we only have to compute the oopmap once. It also removes old SPARC support. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > readability suggestion src/hotspot/share/runtime/deoptimization.cpp line 940: > 938: int callee_size_of_parameters = 0; > 939: for (int frame_idx = 0; frame_idx < cur_array->frames(); frame_idx++) { > 940: assert(is_top_frame == (frame_idx == 0), "must be"); Why not replace this with direct computation of the value: bool is_top_frame = (frame_idx == 0); then you don't even need the final reset of the value either. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2246351521 From never at openjdk.org Thu Jul 31 21:04:56 2025 From: never at openjdk.org (Tom Rodriguez) Date: Thu, 31 Jul 2025 21:04:56 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v7] In-Reply-To: References: Message-ID: On Thu, 24 Jul 2025 20:03:33 GMT, Dean Long wrote: >> The VerifyStack logic in Deoptimization::unpack_frames() attempts to check the expression stack size of the interpreter frame against what GenerateOopMap computes. To do this, it needs to know if the state at the current bci represents the "before" state, meaning the bytecode will be reexecuted, or the "after" state, meaning we will advance to the next bytecode. The old code didn't know how to determine exactly what state we were in, so it checked both. This PR cleans that up, so we only have to compute the oopmap once. It also removes old SPARC support. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > readability suggestion src/hotspot/share/runtime/deoptimization.cpp line 971: > 969: > 970: cur_code = str.next(); > 971: reexecute = true; This seems a little unsavory, particularly since there's a later step which will print that value as if it was the original one. Since there's only one later logic use of the variable maybe there's should be a new flag to mark special case? Like `rolled_forward`? It might be fine as is with comments explaining this and fixing the printing to reflect what occurred here. src/hotspot/share/runtime/deoptimization.cpp line 995: > 993: int map_expr_invoke_ssize = mask.expression_stack_size() + cur_invoke_parameter_size; > 994: int expr_ssize_before = iframe_expr_ssize + (is_top_frame ? top_frame_expression_stack_adjustment : 0); > 995: int map_expr_callee_ssize = mask.expression_stack_size() + callee_size_of_parameters; `map` in these names might be more clearly `oopmap`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2246372241 PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2246372513 From dlong at openjdk.org Thu Jul 31 22:19:55 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 31 Jul 2025 22:19:55 GMT Subject: RFR: 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" [v4] In-Reply-To: References: Message-ID: On Wed, 30 Jul 2025 22:58:50 GMT, Guanqiang Han wrote: >> I'm able to consistently reproduce the problem using the following command line and test program ? >> >> java -Xcomp -XX:TieredStopAtLevel=1 -XX:C1MaxInlineSize=200 Test.java >> >> import java.util.Arrays; >> public class Test{ >> public static void main(String[] args) { >> System.out.println("begin"); >> byte[] arr1 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; >> byte[] arr2 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; >> System.out.println(Arrays.equals(arr1, arr2)); >> System.out.println("end"); >> } >> } >> >> From my analysis, the root cause appears to be a mismatch in operand handling between T_ADDRESS and T_LONG in LIR_Assembler::stack2reg, especially when the source is marked as double stack (e.g., T_LONG) and the destination as single CPU register (e.g., T_ADDRESS), leading to assertion failures like assert(is_single_stack())(because T_LONG is double_size). >> >> In the test program above , the call chain is: Arrays.equals ? ArraysSupport.vectorizedMismatch ? 
LIRGenerator::do_vectorizedMismatch >> Within the do_vectorizedMismatch() method, a move instruction constructs an LIR_Op1. During LIR to machine code generation, LIR_Assembler::stack2reg was called. >> >> In this case, the src operand has type T_LONG and the dst operand has type T_ADDRESS. This combination triggers an assert in stack2reg, due to a mismatch between the stack slot type and register type handling. >> >> Importantly, this path ( LIR_Assembler::stack2reg was called ) is only taken when src is forced onto the stack. To reliably trigger this condition, the test is run with the -Xcomp option to force compilation and increase register pressure. >> >> A reference to the relevant code paths is provided below : >> image1 >> image2 >> >> On 64-bit platforms, although T_ADDRESS is classified as single_size, it is in fact 64 bits wide ,represent a single 64-bit general-purpose register and it can hold a T_LONG value, which is also 64 bits. >> >> However, T_LONG is defined as double_size, requiring two local variable slots or a pair of registers in the JVM's abstract model. This mismatch stems from the fact that T_ADDRESS is platform-dependent: it's 32 bits on 32-bit platforms, and 64 bits on 64-bit platforms ? yet its size class... > > Guanqiang Han has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: > > - change T_LONG to T_ADDRESS in some intrinsic functions > - Merge remote-tracking branch 'upstream/master' into 8359235 > - Increase sleep time to ensure the method gets compiled > - add regression test > - Merge remote-tracking branch 'upstream/master' into 8359235 > - 8359235: C1 compilation fails with "assert(is_single_stack() && !is_virtual()) failed: type check" Testing results are good. You need one more review. ------------- PR Comment: https://git.openjdk.org/jdk/pull/26462#issuecomment-3141506661 From dlong at openjdk.org Thu Jul 31 22:33:29 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 31 Jul 2025 22:33:29 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v8] In-Reply-To: References: Message-ID: > The VerifyStack logic in Deoptimization::unpack_frames() attempts to check the expression stack size of the interpreter frame against what GenerateOopMap computes. To do this, it needs to know if the state at the current bci represents the "before" state, meaning the bytecode will be reexecuted, or the "after" state, meaning we will advance to the next bytecode. The old code didn't know how to determine exactly what state we were in, so it checked both. This PR cleans that up, so we only have to compute the oopmap once. It also removes old SPARC support. 
Dean Long has updated the pull request incrementally with two additional commits since the last revision: - more cleanup - simplify is_top_frame ------------- Changes: - all: https://git.openjdk.org/jdk/pull/26121/files - new: https://git.openjdk.org/jdk/pull/26121/files/535fbb05..6257de6c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=26121&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26121&range=06-07 Stats: 15 lines in 1 file changed: 6 ins; 2 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/26121.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/26121/head:pull/26121 PR: https://git.openjdk.org/jdk/pull/26121 From dlong at openjdk.org Thu Jul 31 22:33:29 2025 From: dlong at openjdk.org (Dean Long) Date: Thu, 31 Jul 2025 22:33:29 GMT Subject: RFR: 8278874: tighten VerifyStack constraints [v7] In-Reply-To: References: Message-ID: On Thu, 31 Jul 2025 21:02:01 GMT, Tom Rodriguez wrote: >> Dean Long has updated the pull request incrementally with one additional commit since the last revision: >> >> readability suggestion > > src/hotspot/share/runtime/deoptimization.cpp line 971: > >> 969: >> 970: cur_code = str.next(); >> 971: reexecute = true; > > This seems a little unsavory, particularly since there's a later step which will print that value as if it was the original one. Since there's only one later logic use of the variable maybe there's should be a new flag to mark special case? Like `rolled_forward`? It might be fine as is with comments explaining this and fixing the printing to reflect what occurred here. OK, I cleaned this up a bit. I think this code could be cleaned up further and use fewer variables, but I'd like to save that for another day. > src/hotspot/share/runtime/deoptimization.cpp line 995: > >> 993: int map_expr_invoke_ssize = mask.expression_stack_size() + cur_invoke_parameter_size; >> 994: int expr_ssize_before = iframe_expr_ssize + (is_top_frame ? top_frame_expression_stack_adjustment : 0); >> 995: int map_expr_callee_ssize = mask.expression_stack_size() + callee_size_of_parameters; > > `map` in these names might be more clearly `oopmap`. Done. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2246507655 PR Review Comment: https://git.openjdk.org/jdk/pull/26121#discussion_r2246507871 From chen.l.liang at oracle.com Thu Jul 31 23:29:53 2025 From: chen.l.liang at oracle.com (Chen Liang) Date: Thu, 31 Jul 2025 23:29:53 +0000 Subject: =?gb2312?B?UmU6ILvYuLSjulJldXNlIHRoZSBTdHJpbmdVVEYxNjo6cHV0Q2hhcnNTQiBt?= =?gb2312?B?ZXRob2QgaW5zdGVhZCBvZiB0aGUgSW50cmluc2ljIGluIHRoZSBTdHJpbmdV?= =?gb2312?Q?TF16::toBytes?= In-Reply-To: References: <086fd3c9-30e0-4294-b674-ece0bd91051c.shaojin.wensj@alibaba-inc.com> , <9b1f2b83-2e0e-443f-a1af-307bcf871974@oracle.com> Message-ID: Hi all, I think the key takeaway here is that we should reduce the number of intrinsics for easier maintenance. With the same number of unsafe Java methods, it is still feasible to reduce the number of distinct intrinsics simply for the reduced maintenance cost. For example, the toBytes and getChars of StringUTF16 both have intrinsics. However, in essence, they are just two array copy functions - it would be more reasonable for hotspot to implement a generic array copy intrinsic that StringUTF16 can use instead. Such an intrinsic may take place on Unsafe.copyMemory itself, or may be somewhere else. 
Chen ________________________________ From: core-libs-dev on behalf of wenshao Sent: Wednesday, July 30, 2025 9:45 PM To: Roger Riggs ; core-libs-dev Subject: ???Reuse the StringUTF16::putCharsSB method instead of the Intrinsic in the StringUTF16::toBytes Thanks to Roger Riggs for suggesting that the code should not be called with Unsafe.uninitializedArray. After replacing it with `new byte[]` and running `StringConstructor.newStringFromCharsMixedBegin`, I verified that performance remained consistent on x64. On aarch64, performance improved by 8% for size = 7, but decreased by 7% for size = 64. For detailed performance data, see the Markdown data in the draft pull request I submitted.https://github.com/openjdk/jdk/pull/26553#issuecomment-3138357748 - Shaojin Wen ------------------------------------------------------------------ ????Roger Riggs ?????2025?7?31?(??) 03:17 ????"core-libs-dev" ????Re: Reuse the StringUTF16::putCharsSB method instead of the Intrinsic in the StringUTF16::toBytes Hi, Unsafe.uninitializedArray and StringConcatHelper.newArray was created for the exclusive use of StringConcatHelper and by HotSpot optimizations. Unsafe.uninitializedArray and StringConcatHelper.newArray area very sensitive APIs and should NOT be used anywhere except in StringConcatHelper and HotSpot. Regards, Roger On 7/30/25 11:40 AM, jaikiran.pai at oracle.com wrote: I'll let others knowledgeable in this area to comment and provide inputs to this proposal. I just want to say thank you for bringing up this discussion to the mailing list first, providing the necessary context and explanation and seeking feedback, before creating a JBS issue or a RFR PR. -Jaikiran On 30/07/25 7:48 pm, wenshao wrote: In the discussion of `8355177: Speed up StringBuilder::append(char[]) via Unsafe::copyMemory` (https://github.com/openjdk/jdk/pull/24773), @liach (Chen Liang) suggested reusing the StringUTF16::putCharsSB method introduced in PR #24773 instead of the Intrinsic implementation in the StringUTF16::toBytes method. Original: ```java @IntrinsicCandidate public static byte[] toBytes(char[] value, int off, int len) { byte[] val = newBytesFor(len); for (int i = 0; i < len; i++) { putChar(val, i, value[off]); off++; } return val; } ``` After: ```java public static byte[] toBytes(char[] value, int off, int len) { byte[] val = (byte[]) Unsafe.getUnsafe().allocateUninitializedArray(byte.class, newBytesLength(len)); putCharsSB(val, 0, value, off, off + len); return val; } ``` This replacement does not degrade performance. Running StringConstructor.newStringFromCharsMixedBegin verified that performance is consistent with the original on x64 and slightly improved on aarch64. The implementation after replacing the Intrinsic implementation removed 100 lines of C++ code, leaving only Java and Unsafe code, no Intrinsic or C++ code, which makes the code more maintainable. I've submitted a draft PR https://github.com/openjdk/jdk/pull/26553 , please give me some feedback. - Shaojin Wen -------------- next part -------------- An HTML attachment was scrubbed... URL: